]> xenbits.xensource.com Git - xen.git/log
xen.git
10 years agodt-uart: use ':' as separator between path and options
Ian Campbell [Thu, 8 Jan 2015 11:53:55 +0000 (11:53 +0000)]
dt-uart: use ':' as separator between path and options

',' is a valid character in a device-tree path (see ePAPR v1.1 Table
2-1), in fact ',' is actually pretty common in node names.

Using ',' as a separator breaks for example on fast models. If you use
the full path (/smb/motherboard/iofpga@3,00000000/uart@090000) rather
than the alias then earlyprintk gives:

(XEN) Looking for UART console /smb/motherboard/iofpga@3
(XEN) Unable to find device "/smb/motherboard/iofpga@3"
(XEN) Bad console= option 'dtuart'

I actually noticed this on Jetson where the uart is
"/serial@0,70006300" and there happened to be no alias defined.

Instead use ':' as the separator, it is defined to terminate the path
in the context of /chosen/stdout-path (Table 3-4) which is pretty
closely analogous to the dtuart= option and so makes a pretty good
choice (especially since the next patch adds support for stdout-path).

Since no DT aware driver current supports any options there is no
point in retaining support for ',' for backwards compatibility.

Additionally, expand the buffer for the dtuart option, a path can be
far longer than 30 characters (in fact the maximum size of a single
node name is 31, so it's not even necessarily enough for an alias).
128 is completely arbitrary and allows for paths at least 8 deep even
with worst case node names.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit f01af57300cb60ab0fd8487fb5bbbe97bee234f0)

Conflicts:
docs/misc/xen-command-line.markdown
        -- dropped since precursor patch not present.

10 years agolibxl: Don't ignore error when we fail to give access to ioport/irq/iomem
Julien Grall [Wed, 14 Jan 2015 13:06:48 +0000 (13:06 +0000)]
libxl: Don't ignore error when we fail to give access to ioport/irq/iomem

If we fail to give the access, the domain will unlikely work correctly.
So we should bail out at the first error.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
(Based on commit 7070eec417934360bf3aed434191246dfe4f8091)

10 years agotools: libxl: do not leak diskpath during local disk attach
Ian Campbell [Thu, 6 Nov 2014 13:00:31 +0000 (13:00 +0000)]
tools: libxl: do not leak diskpath during local disk attach

libxl__device_disk_local_initiate_attach is assigning dls->diskpath with a
strdup of the device path. This is then passed to the callback, e.g.
parse_bootloader_result but bootloader_cleanup will not free it.

Since the callback is within the scope of the (e)gc and therefore doesn't need
to be malloc'd, a gc'd alloc will do. All other assignments to this field use
the gc.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=767295

Reported-by: Gedalya <gedalya@gedalya.net>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 379b351889a8f02abe30a06e2ce9ba8b381b91ab)

10 years agotools: libxl: do not overrun input buffer in libxl__parse_mac
Ian Campbell [Thu, 6 Nov 2014 13:59:43 +0000 (13:59 +0000)]
tools: libxl: do not overrun input buffer in libxl__parse_mac

Valgrind reports:
==7971== Invalid read of size 1
==7971==    at 0x40877BE: libxl__parse_mac (libxl_internal.c:288)
==7971==    by 0x405C5F8: libxl__device_nic_from_xs_be (libxl.c:3405)
==7971==    by 0x4065542: libxl__append_nic_list_of_type (libxl.c:3484)
==7971==    by 0x4065542: libxl_device_nic_list (libxl.c:3504)
==7971==    by 0x406F561: libxl_retrieve_domain_configuration (libxl.c:6661)
==7971==    by 0x805671C: reload_domain_config (xl_cmdimpl.c:2037)
==7971==    by 0x8057F30: handle_domain_death (xl_cmdimpl.c:2116)
==7971==    by 0x8057F30: create_domain (xl_cmdimpl.c:2580)
==7971==    by 0x805B4B2: main_create (xl_cmdimpl.c:4652)
==7971==    by 0x804EAB2: main (xl.c:378)

This is because on the final iteration the tok += 3 skips over the terminating
NUL to the next byte, and then *tok reads it. Fix this by using endptr as the
iterator.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Don Slutz <dslutz@verizon.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 5a430eca0b27354456d1245ed3f637d5f2e17883)

10 years agolibxc: check return values on mmap() and madvise() on xc_alloc_hypercall_buffer()
Luis R. Rodriguez [Tue, 20 May 2014 12:37:35 +0000 (05:37 -0700)]
libxc: check return values on mmap() and madvise() on xc_alloc_hypercall_buffer()

On a Thinkpad T4440p with OpenSUSE tumbleweed with v3.15-rc4
and today's latest xen tip from the git tree strace -f reveals
we end up on a never ending wait shortly after

write(20, "backend/console/5\0", 18 <unfinished ...>

This is right before we just wait on the qemu process which we
had mmap'd for. Without this you'll end up getting stuck on a
loop if mmap() worked but madvise() did not. While at it I noticed
even the mmap() error fail was not being checked, fix that too.

Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit e86539a388314cd3dca88f5e69d7873343197cd8)

10 years agox86/HVM: prevent use-after-free when destroying a domain
Mihai Donțu [Tue, 6 Jan 2015 12:49:52 +0000 (12:49 +0000)]
x86/HVM: prevent use-after-free when destroying a domain

hvm_domain_relinquish_resources() can free certain domain resources
which can still be accessed, e.g. by HVMOP_set_param, while the domain
is being cleaned up.

This is CVE-2015-0361 / XSA-116.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Tested-by: Răzvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d3c151fd3a4365fc6107198bfc975807d40d157d)

10 years agoxen/arm: dump guest stack even if not the current VCPU
Frediano Ziglio [Thu, 23 Oct 2014 16:18:52 +0000 (17:18 +0100)]
xen/arm: dump guest stack even if not the current VCPU

If show_guest_stack was called from Xen context (for instance hitting
'0' key on Xen console) get_page_from_gva was not able to get the
page returning NULL.
Detecting different domain and changing VTTBR register make
get_page_from_gva works for different domains.

Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 281c892e3bdb0a56cfd8ab6516cd1a33c095c857)

10 years agoxen/arm: Handle platforms with edge-triggered virtual timer
Julien Grall [Fri, 28 Nov 2014 15:17:06 +0000 (15:17 +0000)]
xen/arm: Handle platforms with edge-triggered virtual timer

Some platforms (such as Xgene and ARMv8 models) use an edge-triggered interrupt
for the virtual timer. Even if the timer output signal is masked in the
context switch, the GIC will keep track that of any interrupts raised
while IRQs are disabled. As soon as IRQs are re-enabled, the virtual
interrupt timer will be injected to Xen.

If an idle vVCPU was scheduled next then the interrupt handler doesn't
expect to the receive the IRQ and will crash:

(XEN)    [<0000000000228388>] _spin_lock_irqsave+0x28/0x94 (PC)
(XEN)    [<0000000000228380>] _spin_lock_irqsave+0x20/0x94 (LR)
(XEN)    [<0000000000250510>] vgic_vcpu_inject_irq+0x40/0x1b0
(XEN)    [<000000000024bcd0>] vtimer_interrupt+0x4c/0x54
(XEN)    [<0000000000247010>] do_IRQ+0x1a4/0x220
(XEN)    [<0000000000244864>] gic_interrupt+0x50/0xec
(XEN)    [<000000000024fbac>] do_trap_irq+0x20/0x2c
(XEN)    [<0000000000255240>] hyp_irq+0x5c/0x60
(XEN)    [<0000000000241084>] context_switch+0xb8/0xc4
(XEN)    [<000000000022482c>] schedule+0x684/0x6d0
(XEN)    [<000000000022785c>] __do_softirq+0xcc/0xe8
(XEN)    [<00000000002278d4>] do_softirq+0x14/0x1c
(XEN)    [<0000000000240fac>] idle_loop+0x134/0x154
(XEN)    [<000000000024c160>] start_secondary+0x14c/0x15c
(XEN)    [<0000000000000001>] 0000000000000001

The proper solution is to context switch the virtual interrupt state at
the GIC level. This would also avoid masking the output signal which
requires specific handling in the guest OS and more complex code in Xen
to deal with EOIs, and so is desirable for that reason too.

Sadly, this solution requires some refactoring which would not be
suitable for a freeze exception for the Xen 4.5 release.

For now implement a temporary solution which ignores the virtual timer
interrupt when the idle VCPU is running.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- tweaked some wording in the comment ]
(cherry picked from commit f688aec47c452d6aef382739d0781735672ef995)

10 years agocall vgic_en/disable_irqs holding the rank_lock
Stefano Stabellini [Tue, 24 Jun 2014 18:30:00 +0000 (19:30 +0100)]
call vgic_en/disable_irqs holding the rank_lock

Modifying ienable and calling vgic_enable_irqs or vgic_disable_irqs need
to be atomic. Move the calls to vgic_enable_irqs and vgic_disable_irqs
before releasing the rank lock.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- partial backport of just the locking part of 5b3a817ea33b
         "xen/arm: observe itargets setting in vgic_enable_irqs and
         vgic_disable_irqs"
]

10 years agoxen/arm: domain_vgic_init: Avoid double free on shared_irqs
Julien Grall [Fri, 25 Jul 2014 14:17:26 +0000 (15:17 +0100)]
xen/arm: domain_vgic_init: Avoid double free on shared_irqs

When the function domain_vgic_init is failing to initialize pending_irqs,
it will free shared_irqs. Few call later, domain_vgic_free will be called
an try to free a second time the same variable. This will result to a double
free.

Remove the free in domain_vgic_init and rely on domain_vgic_free to correctly
release the memory.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 7b41618f5a08145b0198af4a8a2ce361d7e677e6)

10 years agox86/HVM: don't crash guest upon problems occurring in user mode
Jan Beulich [Wed, 10 Dec 2014 11:30:15 +0000 (12:30 +0100)]
x86/HVM: don't crash guest upon problems occurring in user mode

This extends commit 5283b310 ("x86/HVM: only kill guest when unknown VM
exit occurred in guest kernel mode") to a few more cases, including the
failed VM entry one that XSA-110 was needed to be issued for.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/HVM: prevent infinite VM entry retries

This reverts the VMX side of commit 28b4baac ("x86/HVM: don't crash
guest upon problems occurring in user mode") and gets SVM in line with
the resulting VMX behavior. This is because Andrew validly says

"A failed vmentry is overwhelmingly likely to be caused by corrupt
 VMC[SB] state.  As a result, injecting a fault and retrying the the
 vmentry is likely to fail in the same way."

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 28b4baacd599e8c10e6dac055f6a939bb730fb8a
master date: 2014-11-25 10:08:57 +0100
master commit: 04ae2f6837b35bcfb689baf15f493da626929fb5
master date: 2014-12-02 12:48:01 +0100

10 years agox86/cpuidle: don't count C1 multiple times
Jan Beulich [Wed, 10 Dec 2014 11:29:05 +0000 (12:29 +0100)]
x86/cpuidle: don't count C1 multiple times

Commit 4ca6f9f0 ("x86/cpuidle: publish new states only after fully
initializing them") resulted in the state counter to be incremented
for C1 despite that using a fixed table entry (and the statically
initialized counter value already accounting for it and C0).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
master commit: 0aabd10525326edfe5098c2ec5bfe05db7732c32
master date: 2014-11-25 10:05:29 +0100

10 years agoEFI: allow retry of ExitBootServices() call
Jan Beulich [Wed, 10 Dec 2014 11:26:56 +0000 (12:26 +0100)]
EFI: allow retry of ExitBootServices() call

The specification is kind of vague under what conditions
ExitBootServices() may legitimately fail, requiring the OS loader to
retry:

"If MapKey value is incorrect, ExitBootServices() returns
 EFI_INVALID_PARAMETER and GetMemoryMap() with ExitBootServices() must
 be called again. Firmware implementation may choose to do a partial
 shutdown of the boot services during the first call to
 ExitBootServices(). EFI OS loader should not make calls to any boot
 service function other then GetMemoryMap() after the first call to
 ExitBootServices()."

While our code guarantees the map key to be valid, there are systems
where a firmware internal notification sent while processing
ExitBootServices() reportedly results in changes to the memory map.
In that case, make a best effort second try: Avoid any boot service
calls other than the two named above, with the possible exception of
error paths. Those aren't a problem, since if we end up needing to
retry, we're hosed when something goes wrong as much as if we didn't
make the retry attempt.

For x86, a minimal adjustment to efi_arch_process_memory_map() is
needed for it to cope with potentially being called a second time.

For arm64, while efi_process_memory_map_bootinfo() is easy to verify
that it can safely be called more than once without violating spec
constraints, it's not so obvious for fdt_add_uefi_nodes(), hence a
step by step approach:
- deletion of memory nodes and memory reserve map entries: the 2nd pass
  shouldn't find any as the 1st one deleted them all,
- a "chosen" node should be found as it got added in the 1st pass,
- the various "linux,uefi-*" nodes all got added during the 1st pass
  and hence only their contents may get updated.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roy Franz <roy.franz@linaro.org>
master commit: 0540b854f6733759593e829bc3f13c9b45974e32
master date: 2014-11-17 15:07:03 +0100

10 years agox86: (allow to) override LIST_POISON*
Jan Beulich [Wed, 10 Dec 2014 11:22:00 +0000 (12:22 +0100)]
x86: (allow to) override LIST_POISON*

Having these point into space not controlled by the hypervisor provides
an unnecessary attack surface. Allow architectures to override them and
utilize that override to make them non-canonical addresses (thus
causing #GP rather than #PF when dereferenced).

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 404227138e1e49c8073e946649a8d4173b35625c
master date: 2014-11-17 15:05:53 +0100

10 years agoadjust number of domains in cpupools when destroying domain
Juergen Gross [Wed, 10 Dec 2014 11:20:40 +0000 (12:20 +0100)]
adjust number of domains in cpupools when destroying domain

Commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a (move domain to
cpupool0 before destroying it) introduced an error in the accounting
of cpupools regarding the number of domains. The number of domains
is nor adjusted when a domain is moved to cpupool0 in kill_domain().

Correct this by introducing a cpupool function doing the move
instead of open coding it by calling sched_move_domain().

Reported-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Reviewed-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
master commit: 934e7baa6c12d19cfaf24e8f8e27d6c6a8b8c5e4
master date: 2014-11-12 12:39:58 +0100

10 years agoswitch to write-biased r/w locks
Keir Fraser [Mon, 8 Dec 2014 14:26:57 +0000 (15:26 +0100)]
switch to write-biased r/w locks

This is to improve fairness: A permanent flow of read acquires can
otherwise lock out eventual writers indefinitely.

This is CVE-2014-9065 / XSA-114.

Signed-off-by: Keir Fraser <keir@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2a549b9c8aa48dc39d7c97e5a93978b781b3a1db
master date: 2014-12-08 14:45:46 +0100

10 years agox86/HVM: confine internally handled MMIO to solitary regions
Jan Beulich [Thu, 27 Nov 2014 13:11:57 +0000 (14:11 +0100)]
x86/HVM: confine internally handled MMIO to solitary regions

While it is generally wrong to cross region boundaries when dealing
with MMIO accesses of repeated string instructions (currently only
MOVS) as that would do things a guest doesn't expect (leaving aside
that none of these regions would normally be accessed with repeated
string instructions in the first place), this is even more of a problem
for all virtual MSI-X page accesses (both msixtbl_{read,write}() can be
made dereference NULL "entry" pointers this way) as well as undersized
(1- or 2-byte) LAPIC writes (causing vlapic_read_aligned() to access
space beyond the one memory page set up for holding LAPIC register
values).

Since those functions validly assume to be called only with addresses
their respective checking functions indicated to be okay, it is generic
code that needs to be fixed to clip the repetition count.

To be on the safe side (and consistent), also do the same for buffered
I/O intercepts, even if their only client (stdvga) doesn't put the
hypervisor at risk (i.e. "only" guest misbehavior would result).

This is CVE-2014-8867 / XSA-112.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: c5397354b998d030b021810b8202de93b9526818
master date: 2014-11-27 14:01:40 +0100

10 years agox86: limit checks in hypercall_xlat_continuation() to actual arguments
Jan Beulich [Thu, 27 Nov 2014 13:10:52 +0000 (14:10 +0100)]
x86: limit checks in hypercall_xlat_continuation() to actual arguments

HVM/PVH guests can otherwise trigger the final BUG_ON() in that
function by entering 64-bit mode, setting the high halves of affected
registers to non-zero values, leaving 64-bit mode, and issuing a
hypercall that might get preempted and hence become subject to
continuation argument translation (HYPERVISOR_memory_op being the only
one possible for HVM, PVH also having the option of using
HYPERVISOR_mmuext_op). This issue got introduced when HVM code was
switched to use compat_memory_op() - neither that nor
hypercall_xlat_continuation() were originally intended to be used by
other than PV guests (which can't enter 64-bit mode and hence have no
way to alter the high halves of 64-bit registers).

This is CVE-2014-8866 / XSA-111.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 0ad715304b04739fd2fc9517ce8671d3947c7621
master date: 2014-11-27 14:00:23 +0100

10 years agox86/mm: fix a reference counting error in MMU_MACHPHYS_UPDATE
Andrew Cooper [Thu, 20 Nov 2014 16:43:39 +0000 (17:43 +0100)]
x86/mm: fix a reference counting error in MMU_MACHPHYS_UPDATE

Any domain which can pass the XSM check against a translated guest can cause a
page reference to be leaked.

While shuffling the order of checks, drop the quite-pointless MEM_LOG().  This
brings the check in line with similar checks in the vicinity.

Discovered while reviewing the XSA-109/110 followup series.

This is XSA-113.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
10 years agox86emul: enforce privilege level restrictions when loading CS
Jan Beulich [Tue, 18 Nov 2014 13:28:45 +0000 (14:28 +0100)]
x86emul: enforce privilege level restrictions when loading CS

Privilege level checks were basically missing for the CS case, the
only check that was done (RPL == DPL for nonconforming segments)
was solely covering a single special case (return to non-conforming
segment).

Additionally in long mode the L bit set requires the D bit to be clear,
as was recently pointed out for KVM by Nadav Amit
<namit@cs.technion.ac.il>.

Finally we also need to force the loaded selector's RPL to CPL (at
least as long as lret/retf emulation doesn't support privilege level
changes).

This is CVE-2014-8595 / XSA-110.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 1d68c1a70e00ed95ef0889cfa005379dab27b37d
master date: 2014-11-18 14:16:23 +0100

10 years agox86: don't allow page table updates on non-PV page tables in do_mmu_update()
Jan Beulich [Tue, 18 Nov 2014 13:27:46 +0000 (14:27 +0100)]
x86: don't allow page table updates on non-PV page tables in do_mmu_update()

paging_write_guest_entry() and paging_cmpxchg_guest_entry() aren't
consistently supported for non-PV guests (they'd deref NULL for PVH or
non-HAP HVM ones). Don't allow respective MMU_* operations on the
page tables of such domains.

This is CVE-2014-8594 / XSA-109.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
master commit: e4292c5aac41b80f33d4877104348d5ee7c95aa4
master date: 2014-11-18 14:15:21 +0100

10 years agox86/PVH: replace bogus assertion with conditional
Jan Beulich [Thu, 13 Nov 2014 08:50:13 +0000 (09:50 +0100)]
x86/PVH: replace bogus assertion with conditional

While PVH guests currently have to start in 64-bit mode, nothing keeps
them from entering compatibility mode via a suitable ring-0 code
selector and making a hypercall from there. Fail such attempts rather
than asserting they won't happen.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 01c019c843df313f62ca5c106544026867da9e9e
master date: 2014-11-04 13:07:23 +0100

10 years agoprocess softirqs while dumping domains
Andrew Cooper [Thu, 13 Nov 2014 08:49:40 +0000 (09:49 +0100)]
process softirqs while dumping domains

Process softirqs once per domain, and once every 64 vcpus in a guest to avoid
being hit by the NMI watchdog.  Discovered against a VM which had accidentally
been assigned 8192 vcpus.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: 9cf71226edabd8b9bc81a5eb57823dacbe8b4bd8
master date: 2014-10-31 11:28:36 +0100

10 years agox86/HVM: only kill guest when unknown VM exit occurred in guest kernel mode
Jan Beulich [Thu, 13 Nov 2014 08:48:55 +0000 (09:48 +0100)]
x86/HVM: only kill guest when unknown VM exit occurred in guest kernel mode

A recent KVM change by Nadav Amit <namit@cs.technion.ac.il> pointed out
that unconditional VM exits (like VMX'es ones for the INVEPT, INVVPID,
and XSETBV instructions) may result from guest user mode activity (in
the example cases, e.g. prior to a privilege level check being done).
Consequently convert the unconditional domain_crash() to a conditional
one (when guest is in kernel mode) with the alternative of injecting
#UD (when in user mode).

This is meant to be a precaution against in-guest security issues
introduced when any such VM exit becomes possible (on newer hardware)
without the hypervisor immediately being aware of it. There are no such
unhandled VM exits currently (and hence this is not an active security
issue), but old (no longer security maintained) versions exhibit issues
in the cases given as examples above.

Suggested-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 5283b310e14884341f51be35253cdd59c4cb034c
master date: 2014-10-31 11:32:27 +0100

10 years agoVMX: values written to MSR_IA32_SYSENTER_E[IS]P should be canonical
Jan Beulich [Thu, 13 Nov 2014 08:48:13 +0000 (09:48 +0100)]
VMX: values written to MSR_IA32_SYSENTER_E[IS]P should be canonical

A recent KVM change by Nadav Amit <namit@cs.technion.ac.il> helped spot
that we have the same issue as they did.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 93cc5c6f1641e90eb120826d42f103b7726efb8e
master date: 2014-10-31 11:31:11 +0100

10 years agox86/HVM: sanity check xsave area when migrating or restoring from older Xen versions
Don Koch [Thu, 13 Nov 2014 08:47:30 +0000 (09:47 +0100)]
x86/HVM: sanity check xsave area when migrating or restoring from older Xen versions

Xen 4.3.0, 4.2.3 and older transferred a maximum sized xsave area (as
if all the available XCR0 bits were set); the new version only
transfers based on the actual XCR0 bits. This may result in a smaller
area if the last sections were missing (e.g., the LWP area from an AMD
machine). If the size doesn't match the XCR0 derived size, the size is
checked against the maximum size and the part of the xsave area
between the actual and maximum used size is checked for zero data. If
either the max size check or any part of the overflow area is
non-zero, we return with an error.

Signed-off-by: Don Koch <dkoch@verizon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d7bb8e88a087690feba63ef83c13ba067f041da0
master date: 2014-10-27 16:45:09 +0100

10 years agoEFI: allow to suppress the use of runtime services
Jan Beulich [Thu, 13 Nov 2014 08:46:46 +0000 (09:46 +0100)]
EFI: allow to suppress the use of runtime services

On certain systems some of the memory map entries designated for use by
runtime services cannot be mapped (frequently due to firmware bugs). On
others, some of the memory map entries aren't even marked for runtime
services use, yet are being used by them. For both cases give people a
way to suppress use of runtime services altogether.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5680244040d26234a43c7f884ecf98fa4928d0da
master date: 2014-10-27 16:43:52 +0100

10 years agox86: tolerate running on EFI runtime services page tables in map_domain_page()
Jan Beulich [Thu, 13 Nov 2014 08:45:42 +0000 (09:45 +0100)]
x86: tolerate running on EFI runtime services page tables in map_domain_page()

In the event of a #PF while in an EFI runtime service function we
otherwise can't dump the page tables, making the analysis of the
problem more cumbersome.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e65436ba36be8f1b735573d8fc9af7d8a053ba5f
master date: 2014-10-27 16:43:12 +0100

10 years agohvm/load: correct length checks for zeroextended records
Andrew Cooper [Thu, 13 Nov 2014 08:44:26 +0000 (09:44 +0100)]
hvm/load: correct length checks for zeroextended records

In the case that Xen is attempting to load a zeroextended HVM record where the
difference needing extending would overflow the data blob, _hvm_check_entry()
will incorrectly fail before working out that it would have been safe.

The "len + sizeof(*d)" check is wrong.  Consider zeroextending a 16 byte
record into a 32 byte structure.  "32 + hdr" will fail the overall context
length check even though the pre-extended record in the stream is 16 bytes.

The first condition is reduced to just a length check for hvm save header,
while the second condition is extended to include a check that the record in
the stream not exceeding the stream length.

The error messages are extended to include further useful information.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <Paul.Durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 66d0c0aa1f3e57e873fd64d1d370e11758d25442
master date: 2014-10-27 16:41:50 +0100

10 years agovmx: fix save/restore issue with apicv
Yang Zhang [Thu, 13 Nov 2014 08:43:33 +0000 (09:43 +0100)]
vmx: fix save/restore issue with apicv

This patch fixes two issues:

1. Interrupts on PIR are lost during save/restore. Syncing the PIR
into IRR during save will fix it.

2. EOI exit bitmap doesn't set up correctly after restore. Here we
will construct the eoi exit bitmap via (IRR | ISR). Though it may cause
unnecessary eoi exit of the interrupts that pending in IRR or ISR during
save/restore, each pending interrupt only causes one vmexit. The
subsequent interrupts will adjust the eoi exit bitmap correctly. So
the performance hurt can be ignored.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 607e8494c42397fb249191904066cace6ac9a880
master date: 2014-10-27 16:40:18 +0100

10 years agofix listing of vcpus when domains lacking any vcpus exist
Andrew Cooper [Thu, 13 Nov 2014 08:42:03 +0000 (09:42 +0100)]
fix listing of vcpus when domains lacking any vcpus exist

On a system which looks like this:

[root@st04 ~]# xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0   752     4     r-----   46699.3
(null)                                       1     0     0     --p---       0.0
(null)                                       2     0     0     --p---       0.0
(null)                                       3     0     0     --p---       0.0
badger                                      25     0     1     --p---       0.0

`xl vcpu-list` failes as so:

[root@st04 ~]# xl vcpu-list
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
Domain-0                             0     0    0   -b-   12171.0  all
Domain-0                             0     1    1   -b-   11779.6  all
Domain-0                             0     2    2   -b-   11599.0  all
Domain-0                             0     3    3   r--   11007.0  all
libxl: critical: libxl__calloc: libxl: FATAL ERROR: memory allocation failure (libxl__calloc, 4294935299 x 40)
: Cannot allocate memory
libxl: FATAL ERROR: memory allocation failure (libxl__calloc, 4294935299 x 40)

The root cause of this is in Xen.  getdomaininfo() has no way of expressing
"this domain has no vcpus".  Previously, info->max_vcpu_id would be returned
uninitialised in such a case.

Unfortunately, setting it to 0 as a default is not appropriate.  A max_vcpu_id
of 0 and nr_online_cpus of 0 is the valid state for a single vcpu domain which
is in the process of being destroyed.

As all components are required to add 1 to max_vcpu_id to get the number of
vcpus, an id of ~0U is not valid to be used.  Explicitly define this as an
invalid max vcpu value, and use it to express "no vcpus" in getdomaininfo()

In libxl, the issue is seen as libxl_list_vcpu() attempts to use the
uninitialised domaininfo.max_vcpu_id for memory allocation.

Check domaininfo.max_vcpu_id against the new sentinel value
XEN_INVALID_MAX_VCPU_ID, and return early.  This means that it is now valid
for libxl_list_vcpu() to return NULL for a domain which lacks any vcpus.

As part of this change, remove the pointless call to libxl_get_max_cpus(),
whose returned value is unconditionally clobbered in the for() loop.

Reported-by: Euan Harris <euan.harris@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Don Slutz <dslutz@verizon.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
tools/libxl: Fix libxl_list_vcpu() following c/s 93e52d52

My reasoning regarding nr_cpus_out was wrong, as I had confused nr_cpus_out
with nr_vcpus_out.

Dario pointed this out, but the patch (having gained appropriate acks) got
committed before I could post a correction.

Noticed-by: Dario Faggioli <dario.faggioli@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 93e52d5242804ff928131553599afa769f85481b
master date: 2014-10-23 10:19:53 +0200
master commit: c90a755f261b8ccb3dac7e1f695122cac95c6053
master date: 2014-10-23 12:29:00 +0100

10 years agox86/paging: make log-dirty operations preemptible
Jan Beulich [Fri, 17 Oct 2014 13:57:42 +0000 (15:57 +0200)]
x86/paging: make log-dirty operations preemptible

Both the freeing and the inspection of the bitmap get done in (nested)
loops which - besides having a rather high iteration count in general,
albeit that would be covered by XSA-77 - have the number of non-trivial
iterations they need to perform (indirectly) controllable by both the
guest they are for and any domain controlling the guest (including the
one running qemu for it).

Note that the tying of the continuations to the invoking domain (which
previously [wrongly] used the invoking vCPU instead) implies that the
tools requesting such operations have to make sure they don't issue
multiple similar operations in parallel.

Note further that this breaks supervisor-mode kernel assumptions in
hypercall_create_continuation() (where regs->eip gets rewound to the
current hypercall stub beginning), but otoh
hypercall_cancel_continuation() doesn't work in that mode either.
Perhaps time to rip out all the remains of that feature?

This is part of CVE-2014-5146 / XSA-97.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 070493dfd2788e061b53f074b7ba97507fbcbf65
master date: 2014-10-06 11:22:04 +0200

10 years agoAMD/guest_iommu: properly disable guest iommu support
Andrew Cooper [Fri, 17 Oct 2014 13:56:51 +0000 (15:56 +0200)]
AMD/guest_iommu: properly disable guest iommu support

AMD Guest IOMMU support was added to allow correct use of PASID and PRI
hardware support with an ATS-aware guest driver.

However, support cannot possibly function as guest_iommu_set_base() has no
callers.  This means that its MMIO region's P2M pages are not set to
p2m_mmio_dm, preventing any invocation of the MMIO read/write handlers.

c/s fd186384 "x86/HVM: extend LAPIC shortcuts around P2M lookups" introduces a
path (via hvm_mmio_internal()) where iommu_mmio_handler claims its MMIO range,
and causes __hvm_copy() to fail with HVMCOPY_bad_gfn_to_mfn.

iommu->mmio_base defaults to 0, with a range of 8 pages, and is unilaterally
enabled in any HVM guests when the host IOMMU(s) supports any extended
features.

Unfortunately, HVMLoader's AP boot trampoline executes an `lmsw` instruction
at linear address 0x100c which unconditionally requires emulation.  The
instruction fetch in turn fails as __hvm_copy() fails with
HVMCOPY_bad_gfn_to_mfn.

The result is that multi-vcpu HVM guests do not work on newer AMD hardware, if
IOMMU support is enabled in the BIOS.

Change the default mmio_base address to ~0ULL.  This prevents
guest_iommu_mmio_range() from actually claiming any physical range
whatsoever, which allows the emulation of `lmsw` to succeed.

Reported-by: Roberto Luongo <rluongo@ready.it>
Suggested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Roberto Luongo <rluongo@ready.it>
Acked-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
master commit: 02e4c89b2cc3c1f19ef42ed4fcb1931d8d704bb5
master date: 2014-10-06 11:20:12 +0200

10 years agodon't allow Dom0 access to IOMMUs' MMIO pages
Jan Beulich [Fri, 17 Oct 2014 13:56:07 +0000 (15:56 +0200)]
don't allow Dom0 access to IOMMUs' MMIO pages

Just like for LAPIC, IO-APIC, MSI, and HT we shouldn't be granting Dom0
access to these. This implicitly results in these pages also getting
marked reserved in the machine memory map Dom0 uses to determine the
ranges where PCI devices can have their MMIO ranges placed.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: fdf30377fbc4fa6798bfda7d69e5d448c2b8e834
master date: 2014-10-06 11:15:01 +0200

10 years agox86: restore reserving of IO-APIC pages in XENMEM_machine_memory_map output
Jan Beulich [Fri, 17 Oct 2014 13:55:27 +0000 (15:55 +0200)]
x86: restore reserving of IO-APIC pages in XENMEM_machine_memory_map output

Commit d1222afda4 ("x86: allow Dom0 read-only access to IO-APICs") had
an unintended side effect: By no longer adding IO-APIC pages to Dom0's
iomem_caps these also no longer get reported as reserved in the machine
memory map presented to it (which got added there intentionally by
commit b8a456caed ["x86: improve reporting through
XENMEM_machine_memory_map"] because many BIOSes fail to add these).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 9d8edc5a8b4a0937193f80da72abdb44c5aeaec6
master date: 2014-10-06 11:13:19 +0200

10 years agox86/MSI: fix MSI-X case of freeing IRQ
Jan Beulich [Fri, 17 Oct 2014 13:54:31 +0000 (15:54 +0200)]
x86/MSI: fix MSI-X case of freeing IRQ

Commit d1b6d0a024 ("x86: enable multi-vector MSI") went a little too
far with moving things around in msi_free_irqs() in order to streamline
the code: We shouldn't drop the MSI-X control page reference before
calling destroy_irq(), as the latter will call us back via
desc->handler->shutdown() (effectively invoking to msi_set_mask_bit()).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 44d20c69516f8d5e71fc14bf216e230a9910d729
master date: 2014-10-06 11:11:28 +0200

10 years agox86/EFI: fix freeing of uninitialized pointer
Roy Franz [Fri, 17 Oct 2014 13:53:46 +0000 (15:53 +0200)]
x86/EFI: fix freeing of uninitialized pointer

The only valid response from the LocateHandle() call is EFI_BUFFER_TOO_SMALL,
so exit if we get anything else.  We pass a 0 size/NULL pointer buffer, so the
only other returns we will get is an error.  Return right away as there is
nothing to do.  Also return if there is an error allocating the buffer, as the
previous code path also allowed for an undefined pointer to be freed.

Signed-off-by: Roy Franz <roy.franz@linaro.org>
Re-structure the change.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: c61690fb76f9a51a8c932d76929b67bd0940febe
master date: 2014-09-24 11:09:11 +0200

10 years agoVMX: don't unintentionally leave x2APIC MSR intercepts disabled
Jan Beulich [Fri, 17 Oct 2014 13:52:27 +0000 (15:52 +0200)]
VMX: don't unintentionally leave x2APIC MSR intercepts disabled

These should be re-enabled in particular when the virtualized APIC
transitions to HW-disabled state.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 72af6f455ac6afcd46d9a556f90349f2397507e8
master date: 2014-09-16 13:58:20 +0200

10 years agox86, idle: add barriers to CLFLUSH workaround
H. Peter Anvin [Fri, 17 Oct 2014 13:51:20 +0000 (15:51 +0200)]
x86, idle: add barriers to CLFLUSH workaround

... since the documentation is explicit that CLFLUSH is only ordered
with respect to MFENCE.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 48d32458bcd453e31b458bca868a079a6d0a38af
master date: 2014-09-09 18:09:08 +0200

10 years agoVT-d: suppress UR signaling for further desktop chipsets
Jan Beulich [Wed, 1 Oct 2014 13:06:39 +0000 (15:06 +0200)]
VT-d: suppress UR signaling for further desktop chipsets

This extends commit d6cb14b34f ("VT-d: suppress UR signaling for
desktop chipsets") as per the finally obtained list of affected
chipsets from Intel.

Also pad the IDs we had listed there before to full 4 hex digits.

This is CVE-2013-3495 / XSA-59.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: 3e2331d271cc0882e4013c8f20398c46c35f90a1
master date: 2014-09-18 15:03:22 +0200

10 years agox86/NMI: allow processing unknown NMIs when watchdog is enabled
Ross Lagerwall [Wed, 1 Oct 2014 13:05:40 +0000 (15:05 +0200)]
x86/NMI: allow processing unknown NMIs when watchdog is enabled

Change NMI processing so that if watchdog=force is passed on the
command-line and the NMI is not caused by a perf counter overflow (i.e.
likely not a watchdog "tick"), the NMI is handled by the unknown NMI
handler.

This allows injection of NMIs from IPMI controllers that don't set the
IOCK/SERR bits to trigger the unknown NMI handler rather than be
ignored.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Fix command line parsing (don't enable the watchdog on e.g.
"watchdog=xyz").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
x86/NMI: allow passing just "watchdog" again

This capability got inadvertently lost in commit 3ea2ba980a ("x86/NMI:
allow processing unknown NMIs when watchdog is enabled") due to an
oversight of mine.

Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3ea2ba980afe7356c613c8e1ba00d223d1c25412
master date: 2014-08-28 16:11:37 +0200
master commit: fd553ae5f0f57baa63d033bedee84f607de57d33
master date: 2014-09-03 15:09:59 +0200

10 years agox86/ats: Disable Address Translation Services by default
Andrew Cooper [Wed, 1 Oct 2014 13:03:52 +0000 (15:03 +0200)]
x86/ats: Disable Address Translation Services by default

Xen cannot safely use any ATS functionality until it gains asynchronous queued
invalidation support, because of the current synchronous wait for completion.

Do not turn ATS on by default.

While editing the default in the command line documentation, correct the
statement regarding PCI Passthrough.  ATS is purely a performance
optimisation, and is certainly not required for PCI Passthrough to function.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
master commit: ad6eddb742577d182e634785bcfaf92732a50024
master date: 2014-08-28 16:05:10 +0200

10 years agox86/irq: process softirqs in irq keyhandlers
Andrew Cooper [Wed, 1 Oct 2014 13:02:42 +0000 (15:02 +0200)]
x86/irq: process softirqs in irq keyhandlers

Large machines with lots of interrupts can trip over the Xen watchdog.

Suggested-by: Santosh Jodh <Santosh.Jodh@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Santosh Jodh <Santosh.Jodh@citrix.com>
x86/IO-APIC: don't process softirqs during early boot

Commit e13b320399 ("x86/irq: process softirqs in irq keyhandlers")
made this unconditional, but the boot time use of __print_IO_APIC()
(when "apic_verbosity=debug" was given) can't tolerate that.

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
master commit: e13b3203990706db1313ec2aadd9a30b249ee793
master date: 2014-08-22 14:32:45 +0200
master commit: bd083922f9e78ed19ef98e7de372e5f568402ed3
master date: 2014-08-26 17:56:52 +0200

10 years agoVMX: fix DebugCtl MSR clearing
Jan Beulich [Wed, 1 Oct 2014 13:01:57 +0000 (15:01 +0200)]
VMX: fix DebugCtl MSR clearing

The previous shortcut was wrong, as it bypassed the necessary vmwrite:
All we really want to avoid if the guest writes zero is to add the MSR
to the host-load list.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: dfa625e15f3d6c374637f2bb789e1f444c2781c3
master date: 2014-08-22 14:29:37 +0200

10 years agox86/HVM: properly bound x2APIC MSR range
Jan Beulich [Wed, 1 Oct 2014 12:59:00 +0000 (14:59 +0200)]
x86/HVM: properly bound x2APIC MSR range

While the write path change appears to be purely cosmetic (but still
gets done here for consistency), the read side mistake permitted
accesses beyond the virtual APIC page.

Note that while this isn't fully in line with the specification
(digesting MSRs 0x800-0xBFF for the x2APIC), this is the minimal
possible fix addressing the security issue and getting x2APIC related
code into a consistent shape (elsewhere a 256 rather than 1024 wide
window is being used too). This will be dealt with subsequently.

This is CVE-2014-7188 / XSA-108.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 61fdda7acf3de11f3d50d50e5b4f4ecfac7e0d04
master date: 2014-10-01 14:54:47 +0200

10 years agox86emul: only emulate software interrupt injection for real mode
Jan Beulich [Tue, 23 Sep 2014 12:40:51 +0000 (14:40 +0200)]
x86emul: only emulate software interrupt injection for real mode

Protected mode emulation currently lacks proper privilege checking of
the referenced IDT entry, and there's currently no legitimate way for
any of the respective instructions to reach the emulator when the guest
is in protected mode.

This is XSA-106.

Reported-by: Andrei LUTAS <vlutas@bitdefender.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
master commit: 346d4545569928b652c40c7815c1732676f8587c
master date: 2014-09-23 14:33:50 +0200

10 years agox86/emulate: check cpl for all privileged instructions
Andrew Cooper [Tue, 23 Sep 2014 12:40:12 +0000 (14:40 +0200)]
x86/emulate: check cpl for all privileged instructions

Without this, it is possible for userspace to load its own IDT or GDT.

This is XSA-105.

Reported-by: Andrei LUTAS <vlutas@bitdefender.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrei LUTAS <vlutas@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0e442727ceccfa32a7276cccd205b4722e68fdc1
master date: 2014-09-23 14:33:06 +0200

10 years agox86/shadow: fix race condition sampling the dirty vram state
Andrew Cooper [Tue, 23 Sep 2014 12:39:05 +0000 (14:39 +0200)]
x86/shadow: fix race condition sampling the dirty vram state

d->arch.hvm_domain.dirty_vram must be read with the domain's paging lock held.

If not, two concurrent hypercalls could both end up attempting to free
dirty_vram (the second of which will free a wild pointer), or both end up
allocating a new dirty_vram structure (the first of which will be leaked).

This is XSA-104.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 46a49b91f1026f64430b84dd83e845a33f06415e
master date: 2014-09-23 14:31:47 +0200

10 years agoevtchn: check control block exists when using FIFO-based events
David Vrabel [Tue, 9 Sep 2014 13:31:37 +0000 (15:31 +0200)]
evtchn: check control block exists when using FIFO-based events

When using the FIFO-based event channels, there are no checks for the
existance of a control block when binding an event or moving it to a
different VCPU.  This is because events may be bound when the ABI is
in 2-level mode (e.g., by the toolstack before the domain is started).

The guest may trigger a Xen crash in evtchn_fifo_set_pending() if:

  a) the event is bound to a VCPU without a control block; or
  b) VCPU 0 does not have a control block.

In case (a), Xen will crash when looking up the current queue.  In
(b), Xen will crash when looking up the old queue (which defaults to a
queue on VCPU 0).

By allocating all the per-VCPU structures when enabling the FIFO ABI,
we can be sure that v->evtchn_fifo is always valid.

EVTCHNOP_init_control_block for all the other CPUs need only map the
shared control block.

A single check in evtchn_fifo_set_pending() before accessing the
control block fixes all cases where the guest has not initialized some
control blocks.

This is XSA-107.

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a4e0cea6fced50e251453dfe52e1b9dde77a84f5
master date: 2014-09-09 15:25:58 +0200

10 years agoupdate Xen version to 4.4.2-pre
Jan Beulich [Tue, 9 Sep 2014 13:30:54 +0000 (15:30 +0200)]
update Xen version to 4.4.2-pre

10 years agoupdate Xen version to 4.4.1 RELEASE-4.4.1
Jan Beulich [Tue, 2 Sep 2014 06:20:19 +0000 (08:20 +0200)]
update Xen version to 4.4.1

10 years agox86_emulate: properly do IP updates and other side effects on success
Jan Beulich [Fri, 22 Aug 2014 12:07:36 +0000 (14:07 +0200)]
x86_emulate: properly do IP updates and other side effects on success

The two MMX/SSE/AVX code blocks failed to update IP properly, and these
as well as get_reg_refix(), which "manually" updated IP so far, failed
to do the TF and RF processing needed at the end of successfully
emulated instructions.

Fix the test utility at once to check IP is properly getting updated,
and while at it macroize the respective code quite a bit, hopefully
making it easier to add further tests when the need arises.

Reported-by: Andrei LUTAS <vlutas@bitdefender.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper@citrix.com>
master commit: 3af450fd2d9403f208d3ac6459716f027b8597ad
master date: 2014-08-08 09:34:03 +0200

10 years agolz4: check for underruns
Jan Beulich [Fri, 22 Aug 2014 12:06:20 +0000 (14:06 +0200)]
lz4: check for underruns

While overruns are already being taken care of, underruns (resulting
from overflows in the respective "op + length" (or similar) operations
weren't.

This is CVE-2014-4611.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 9143a6c55ef7e8f630857cb08c03844d372c2345
master date: 2014-08-04 13:43:03 +0200

10 years agoRevert "x86/paging: make log-dirty operations preemptible"
Ian Jackson [Tue, 19 Aug 2014 11:50:41 +0000 (12:50 +0100)]
Revert "x86/paging: make log-dirty operations preemptible"

This reverts commit 89f167e04ed89109c06092b8edf46c1519cab5f3.

This fix has been discovered to cause regressions.  Investigations are
ongoing.

Revert-reqeuested-by: Jan Beulich <JBeulich@suse.com>
10 years agox86/cpu: undo BIOS CPUID max_leaf limit before querying for features
Andrew Cooper [Tue, 12 Aug 2014 13:51:52 +0000 (15:51 +0200)]
x86/cpu: undo BIOS CPUID max_leaf limit before querying for features

If IA32_MISC_ENABLE[22] is set by the BIOS, CPUID.0.EAX will be limited to 3.
Lift this limit before considering whether to query CPUID.7[ECX=0].EBX for
features.

Without this change, dom0 is able to see this feature leaf (as the limit was
subsequently lifted), and will set features appropriately in HVM domain cpuid
policies.

The specific bug XenServer observed was the advertisement of the FSGSBASE
feature, but an inability to set CR4.FSGSBASE as Xen considered the bit to be
reserved as cpu_has_fsgsbase incorrectly evaluated as false.

This is a regression introduced by c/s 44e24f8567 "x86: don't call
generic_identify() redundantly" where the redundant call actually resampled
CPUID.7[ECX=0] properly to obtain the feature flags.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a1ac4cf52e38386bac7ac3440c7da0099662ca5c
master date: 2014-07-29 17:02:25 +0200

10 years agoxen: arm: Correctly handle do_sysreg exception injection from 64-bit userspace
Ian Campbell [Tue, 12 Aug 2014 13:50:13 +0000 (15:50 +0200)]
xen: arm: Correctly handle do_sysreg exception injection from 64-bit userspace

The do_sysreg case was missing a return, so it would increment PC and
inject the trap to the second instruction of the handler.

This is CVE-2014-5148 / XSA-103.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
master commit: f2ae8bfa498831ee6343d672066b898d3cd73892
master date: 2014-08-12 15:38:01 +0200

10 years agoxen: arm: Handle traps from 32-bit userspace on 64-bit kernel as undef
Ian Campbell [Tue, 12 Aug 2014 13:49:14 +0000 (15:49 +0200)]
xen: arm: Handle traps from 32-bit userspace on 64-bit kernel as undef

We are not setup to handle these properly. This turns a host crash
into a trap to the guest kernel which will likely result in killing
the offending process.

This is part of CVE-2014-5147 / XSA-102.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
master commit: c0020e0997024eb741d60de9a480bf2878f891af
master date: 2014-08-12 15:37:25 +0200

10 years agoxen: arm: Correctly handle exception injection from userspace on 64-bit.
Ian Campbell [Tue, 12 Aug 2014 13:48:28 +0000 (15:48 +0200)]
xen: arm: Correctly handle exception injection from userspace on 64-bit.

Firstly we must be prepared to propagate traps from 32-bit userspace even for
64-bit guests, so wrap the existing inject_undef??_exception into
inject_undef_exception and use that when injecting an undef exception. The
various other exception cases (aborts etc) already do this.

Secondly when injecting the trap we must pick the correct exception vector
depending on whether the source of the trap was 32-bit EL0, 64-bit EL0 or EL1.

This is part of CVE-2014-5147 / XSA-102.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
master commit: a98669781769e821413dfef4ef99b93171375610
master date: 2014-08-12 15:36:15 +0200

10 years agoxen: arm: handle AArch32 userspace when dumping 64-bit guest state.
Ian Campbell [Tue, 12 Aug 2014 13:45:49 +0000 (15:45 +0200)]
xen: arm: handle AArch32 userspace when dumping 64-bit guest state.

A 64-bit guest can still be in 32-bit mode when running userspace,
handle this case by dumping the correct 32-bit state.

Note that on ARM it is not possible to change mode without the help
of the next exception level, hence there is no way a 64-bit guest can
be running in 32-bit kernel modes.

This is part of CVE-2014-5147 / XSA-102.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
master commit: fc0cafeab30fe93963457fafbad7a01c7f55ea5f
master date: 2014-08-12 15:32:27 +0200

10 years agox86/paging: make log-dirty operations preemptible
Jan Beulich [Tue, 12 Aug 2014 13:44:26 +0000 (15:44 +0200)]
x86/paging: make log-dirty operations preemptible

Both the freeing and the inspection of the bitmap get done in (nested)
loops which - besides having a rather high iteration count in general,
albeit that would be covered by XSA-77 - have the number of non-trivial
iterations they need to perform (indirectly) controllable by both the
guest they are for and any domain controlling the guest (including the
one running qemu for it).

This is CVE-2014-5146 / XSA-97.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 95e6d82224689fdfd967a093a4d69efc24c17e91
master date: 2014-08-12 15:30:11 +0200

10 years agoupdate Xen version to 4.4.1-rc2 4.4.1-rc2
Jan Beulich [Tue, 5 Aug 2014 11:41:22 +0000 (13:41 +0200)]
update Xen version to 4.4.1-rc2

10 years agox86/mem_event: fix regression affecting CR0 memory events
Tamas K Lengyel [Mon, 28 Jul 2014 12:59:00 +0000 (14:59 +0200)]
x86/mem_event: fix regression affecting CR0 memory events

This is a patch repairing a regression in code previously functional in 4.1.x.
It appears that, during some refactoring work, call to hvm_memory_event_cr0 was lost.

This function was originally called in mov_to_cr() of vmx.c, but the commit
http://xenbits.xen.org/hg/xen-unstable.hg/rev/1276926e3795 abstracted the
original code into generic functions up a level in hvm.c, dropping the call
in the process.

The same issue affected the CR3 and CR4 events, which were fixed in patch
http://xenbits.xensource.com/hg/xen-unstable.hg/rev/7ab899e46347.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 5d570c1d0274cac3b333ef378af3325b3b69905e
master date: 2014-07-23 18:05:11 +0200

10 years agox86/mem_event: prevent underflow of vcpu pause counts
Andrew Cooper [Mon, 28 Jul 2014 12:57:47 +0000 (14:57 +0200)]
x86/mem_event: prevent underflow of vcpu pause counts

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Tested-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
master commit: 868d9b99b39c53dc1f6ae9bfd7b148c206fd7240
master date: 2014-07-23 18:08:04 +0200

10 years agoevtchn: eliminate 64k ports limitation
Jan Beulich [Mon, 28 Jul 2014 12:56:33 +0000 (14:56 +0200)]
evtchn: eliminate 64k ports limitation

The introduction of FIFO event channels claimed to support over 100k
ports, but failed to widen a number of 16-bit variables/operations.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
master commit: 8f7f6ab879a9ad9d2bf66b8c6b46a0653086b79f
master date: 2014-04-11 11:25:56 +0200

10 years agox86/mem_event: validate the response vcpu_id before acting on it
Andrew Cooper [Mon, 28 Jul 2014 12:54:15 +0000 (14:54 +0200)]
x86/mem_event: validate the response vcpu_id before acting on it

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
master commit: ee75480b3c8856db9ef1aa45418f35ec0d78989d
master date: 2014-07-23 18:07:11 +0200

10 years agoavoid crash when doing shutdown with active cpupools
Juergen Gross [Mon, 28 Jul 2014 12:53:22 +0000 (14:53 +0200)]
avoid crash when doing shutdown with active cpupools

When shutting down the machine while there are cpus in a cpupool other than
Pool-0 a crash is triggered due to cpupool handling rejecting offlining the
non-boot cpus in other cpupools.

It is easy to detect this case and allow offlining those cpus.

Reported-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Stefan Bader <stefan.bader@canonical.com>
master commit: 05377dede434c746e6708f055858378d20f619db
master date: 2014-07-23 18:03:19 +0200

10 years agoproperly reference count DOMCTL_{,un}pausedomain hypercalls
Andrew Cooper [Mon, 28 Jul 2014 12:52:10 +0000 (14:52 +0200)]
properly reference count DOMCTL_{,un}pausedomain hypercalls

For safety reasons, c/s 6ae2df93c27 "mem_access: Add helper API to setup
ring and enable mem_access" has to pause the domain while it performs a set of
operations.

However without properly reference counted hypercalls, xc_mem_event_enable()
now unconditionally unpauses a previously paused domain.

To prevent toolstack software running wild, there is an arbitrary limit of 255
on the toolstack pause count.  This is high enough for several components of
the toolstack to safely use, but prevents over/underflow of d->pause_count.

The previous domain_{,un}pause_by_systemcontroller() functions are updated to
return an error code.  domain_pause_by_systemcontroller() is modified to have
a common stub and take a pause_fn pointer, allowing for both sync and nosync
domain pauses.  domain_pause_for_debugger() has a hand-rolled nosync pause
replaced with the new domain_pause_by_systemcontroller_nosync(), and has its
variables shuffled slightly to avoid rereading current multiple times.

Suggested-by: Don Slutz <dslutz@verizon.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
With a couple of formatting adjustments:
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/gdbsx: invert preconditions for XEN_DOMCTL_gdbsx_{,un}pausevcpu hypercalls

c/s 3eb1c708ab "properly reference count DOMCTL_{,un}pausedomain hypercalls"
accidentally inverted the use of d->controller_pause_count.

Revert back to how it was originally, i.e. the XEN_DOMCTL_gdbsx_{,un}pausevcpu
hypercalls are only valid for a domain already paused by the system controller.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3eb1c708ab0fe1067a436498a684907afa14dacf
master date: 2014-07-03 16:51:13 +0200
master commit: 680d79f10bb70691a9ae3b4a6a8b669e0f2837f6
master date: 2014-07-25 11:53:31 +0200

10 years agoVT-d/ATS: correct and clean up dev_invalidate_iotlb()
Jan Beulich [Mon, 28 Jul 2014 12:50:45 +0000 (14:50 +0200)]
VT-d/ATS: correct and clean up dev_invalidate_iotlb()

While this was intended to only do cleanup (replace the two bogus
"ret |= " constructs, and a simple formatting correction), this now
also
- fixes the bit manipulations for size_order > 0
  a) correct an off-by-one in the use of size_order for shifting (till
     now double the requested size got invalidated)
  b) in fact setting bit 12 and up if necessary (without which too
     small a region might have got invalidated)
  c) making them capable of dealing with regions of 4Gb size and up
- corrects the return value handling, such that a later iteration's
  success won't clear an earlier iteration's error indication
- uses PCI_BDF2() instead of open coding it
- bail immediately on bad passed in invalidation type, rather than
  repeatedly printing the same message for each ATS-capable device, at
  once also no longer hiding that failure from the caller

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: fd33987ba27607c3cc7da258cf1d86d21beeb735
master date: 2014-06-30 15:57:40 +0200

10 years agoxen: arm: implement generic multiboot compatibility strings
Ian Campbell [Fri, 18 Jul 2014 13:08:11 +0000 (14:08 +0100)]
xen: arm: implement generic multiboot compatibility strings

This causes Xen to accept the more generic names specified in
http://wiki.xen.org/wiki/Xen_ARM_with_Virtualization_Extensions/Multiboot as of
2014-06-06.

These names are more generic than those proposed by Andre in
http://thread.gmane.org/gmane.linux.linaro.announce.boot/326 and those
used in earlier drafts of the /Multiboot wiki page.

This will allow bootloaders to not special case Xen (or at least to reduce
the amount which is required).

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit a860dfeec090fe46d856b5d3fc6da28ccf7d1ba5)

10 years agoxen: arm: flush TLB after overwriting 1:1 mapping in boot page tables
Ian Campbell [Mon, 14 Jul 2014 16:39:10 +0000 (17:39 +0100)]
xen: arm: flush TLB after overwriting 1:1 mapping in boot page tables

Otherwise a stale TLB entry can shadow the fixmap/UART or DTB mapping

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit f1870804e58565399cd770e93f62e7ce57cd5231)

10 years agoxen: arm: use physical processor ID (MPIDR) when calling psci CPU_ON
Ian Campbell [Mon, 14 Jul 2014 16:21:47 +0000 (17:21 +0100)]
xen: arm: use physical processor ID (MPIDR) when calling psci CPU_ON

Xen's logical CPU map can differ from the underlying layout.

Also add an emacs magic block to this file.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit d99504872178523b024bfb36736b158a42c5060e)

10 years agoxen: Install arch-arm directory headers
Julien Grall [Tue, 8 Jul 2014 17:04:48 +0000 (18:04 +0100)]
xen: Install arch-arm directory headers

Some headers for ARM are not installed on the host. This may make external
software relying on Xen headers failed to compile on ARM.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit f224b60791b539df66c1fe89d7866170653428b6)

10 years agoQEMU_TAG update - FIX!
Ian Jackson [Thu, 3 Jul 2014 12:58:23 +0000 (13:58 +0100)]
QEMU_TAG update - FIX!

My qemu push and tag update script had transposed a revision number
from qemu-xen-4.3-testing into xen-4.4-testing's Config.mk!

The script is now fixed, but we also need to fix the tree.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
10 years agoQEMU_TAG update
Ian Jackson [Wed, 2 Jul 2014 15:06:31 +0000 (16:06 +0100)]
QEMU_TAG update

10 years agotools: arm: report an error if the guest RAM is too large
Ian Campbell [Thu, 22 May 2014 09:46:37 +0000 (10:46 +0100)]
tools: arm: report an error if the guest RAM is too large

Due to the layout of the guest physical address space we cannot support more
than 768M of RAM before overrunning the area set aside for the grant table. Due
to the presence of the magic pages at the end of the RAM region guests are
actually limited to 767M.

Catch this case during domain build and fail gracefully instead of obscurely
later on.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 5a959f44ed03398870b6ec0dfebb59dcd5981f94)

10 years agotools/libxl: Fix free() of wild pointer in libxl__initiate_device_remove()
Andrew Cooper [Wed, 18 Jun 2014 18:04:14 +0000 (19:04 +0100)]
tools/libxl: Fix free() of wild pointer in libxl__initiate_device_remove()

libxl__initiate_device_remove() had a preexisting error path issue where
libxl_dominfo_dispose() could be called on a libxl_dominfo object before it
had been initialised with libxl_dominfo_init().

This was safe until c/s ab44401 added the pointer ssid_label, which point
libxl_dominfo_dispose() free()s.

Unconditionally initialise info in libxl__initiate_device_remove() before
taking an error path which will free it.

Coverity-ID: 1223212
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Ian Campbell <Ian.Campbell@citrix.com>
CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
(cherry picked from commit ddb4aa5dfa13781e8f31ba20923c14c1a083ce83)

10 years agoblktap2: Fix two 'maybe uninitialized' variables
Dario Faggioli [Fri, 20 Jun 2014 14:09:00 +0000 (16:09 +0200)]
blktap2: Fix two 'maybe uninitialized' variables

for which gcc 4.9.0 complains about, like this:

block-qcow.c: In function `get_cluster_offset':
block-qcow.c:431:3: error: `tmp_ptr' may be used uninitialized in this function
[-Werror=maybe-uninitialized]
   memcpy(tmp_ptr, l1_ptr, 4096);
   ^
block-qcow.c:606:7: error: `tmp_ptr2' may be used uninitialized in this
function [-Werror=maybe-uninitialized]
   if (write(s->fd, tmp_ptr2, 4096) != 4096) {
       ^
cc1: all warnings being treated as errors
/home/dario/Sources/xen/xen/xen.git/tools/blktap2/drivers/../../../tools/Rules.mk:89:
 recipe for target 'block-qcow.o' failed
make[5]: *** [block-qcow.o] Error 1

The proper behavior is to return upon allocation failure.
About what to return, 0 seems the best option, looking
at both the function and the call sites.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 345e44a85d71a1a910385f33c7f1ba3683026d18)

10 years agoxen: arm: make sure gcc doesn't use floating-point registers on arm64
Ian Campbell [Thu, 26 Jun 2014 16:30:14 +0000 (17:30 +0100)]
xen: arm: make sure gcc doesn't use floating-point registers on arm64

By using -mgeneral-regs-only which is the Aarch64 equivalent to
-msoft-float.

Otherwise gcc will corrupt the d* registers, which we don't save/restore when
trapping to/from the hypervisor.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit c0726c18e8135f87a5a5793d993d6bea1e3fa925)

10 years agoxen: arm: Implement OSDLR_EL1 trap as RAZ/WO.
Ian Campbell [Fri, 13 Jun 2014 12:15:04 +0000 (13:15 +0100)]
xen: arm: Implement OSDLR_EL1 trap as RAZ/WO.

I'm not sure why this wasn't added at the same time as the other
debug registers.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit 92b0b80f0d2d29d0e80bf35ea839ed6058b7f0fa)

10 years agoxen: arm: take FIQ exceptions to Xen not guest by setting HCR_EL2.FMO
Ian Campbell [Thu, 26 Jun 2014 08:53:42 +0000 (09:53 +0100)]
xen: arm: take FIQ exceptions to Xen not guest by setting HCR_EL2.FMO

As with HCR_EL2.{IMO,AMO} we want to route FIQs to Xen not the guest. See ARM
ARM DDI 0406C.b B1.8.4.

So far none of the platforms which we support use FIQ for anything, but when we
end up supporting one it would be far better to surprise Xen with them than
whatever guest happens to be running...

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit 4bb74e39987b428429c2aacad7f59356d4942e39)

Conflicts:
xen/arch/arm/traps.c

10 years agoxen/arm: Implement a dummy debug monitor for ARM32
Julien Grall [Thu, 24 Apr 2014 22:45:55 +0000 (23:45 +0100)]
xen/arm: Implement a dummy debug monitor for ARM32

XSA-93 (commit 0b18220 "xen/arm: Don't let guess access to Debug and Performance
Monitors registers") disable Debug Registers access.

When CONFIG_PERF_EVENTS is enabled in the Linux Kernel, it will try to
initialize the debug monitors. If an error occured Linux won't use this
feature.

The implementation made Xen expose a minimal set of registers which let think
the guest (i.e.) thinks HW debug won't work.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
[ ijc -- s/DBGCR/DBGBCR/ to use correct register name ]
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 68c69978352adb5ab7c06598056f9eb88d7d6031)
[ ijc -- s/is_32bit_domain/is_pv32_domain/ ]

10 years agoxen/arm: Implement a dummy Performance Monitor for ARM32
Julien Grall [Thu, 24 Apr 2014 22:45:54 +0000 (23:45 +0100)]
xen/arm: Implement a dummy Performance Monitor for ARM32

XSA-93 (commit 0b18220 "xen/arm: Don't let guess access to Debug and Performance
Monitor registers") disable Performance Monitor.

When CONFIG_PERF_EVENTS is enabled in the Linux Kernel, regardless the
ID_DFR0 (which tell if Perfomance Monitors Extension is implemented) the
kernel will try to access to PMCR.

Therefore we tell the guest we have 0 counters. Unfortunately we must always
support PMCCNTR (the cycle counter): we just RAZ/WI for all PM register,
which doesn't crash the kernel at least.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit aa0d443718372b46c432af7cb6274050cda32fc6)

10 years agoxen/arm: Panic when we receive an unexpected trap
Julien Grall [Tue, 17 Jun 2014 20:44:28 +0000 (21:44 +0100)]
xen/arm: Panic when we receive an unexpected trap

The current implementation of do_unexpected_trap make Xen spin forever
on the current physical CPU. This may lead to stall guests VCPU and print
unhelpful message (RCU stall...).

Usually when Xen receives an unexpected trap, it means that something goes
wrong either in the hypervisor or in the CPU. In this case we should
directly panic to also stop the other CPUs.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 4f5ab681d208993f94553203f4be323b3c929070)

10 years agoxen: arm: initialise the grant_table_gpfn array on allocation
Ian Campbell [Wed, 25 Jun 2014 12:58:59 +0000 (13:58 +0100)]
xen: arm: initialise the grant_table_gpfn array on allocation

Avoids leaking uninitialised memory via the grant table setup hypercall.

This is XSA-101.

Reported-by: Julien Grall <julien.grall@linaro.org>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
10 years agox86/nmi: be less verbose when testing the NMI watchdog
David Vrabel [Tue, 24 Jun 2014 12:22:54 +0000 (14:22 +0200)]
x86/nmi: be less verbose when testing the NMI watchdog

There's no need to print all the CPUs that are ok, only the ones that
got stuck.

The resulting output is either:

  Testing NMI watchdog on all CPUs: 1 4 6 stuck

or

  Testing NMI watchdog on all CPUs: ok

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: f64b1901564b6206dbbe946699619fcd22446de8
master date: 2014-05-15 15:32:36 +0200

10 years agox86: Intel CPU family update
Jan Beulich [Tue, 24 Jun 2014 12:22:09 +0000 (14:22 +0200)]
x86: Intel CPU family update

... according to revision 49 of the Intel SDM.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 3385bf3aad25082e3bc6ab0e1cbd639512983e4d
master date: 2014-03-18 11:51:43 +0100

10 years agox86/mwait_idle: fix trace output
Ross Lagerwall [Tue, 24 Jun 2014 07:44:23 +0000 (09:44 +0200)]
x86/mwait_idle: fix trace output

Use the C-state's type when tracing, not its index since the index is
not set by the mwait_idle driver.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: d17ac1d5433ba2c25d7fab11baba59173e339896
master date: 2014-06-20 10:37:21 +0200

10 years agoVT-d/qinval: make local variable used for communication with IOMMU "volatile"
Jan Beulich [Tue, 24 Jun 2014 07:43:31 +0000 (09:43 +0200)]
VT-d/qinval: make local variable used for communication with IOMMU "volatile"

Without that there is - afaict - nothing preventing the compiler from
putting the variable into a register for the duration of the wait loop.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: ceec46c02074e1b2ade0b13c3c4a2f3942ae698c
master date: 2014-06-20 10:25:33 +0200

10 years agox86/EFI: allow FPU/XMM use in runtime service functions
Jan Beulich [Tue, 24 Jun 2014 07:42:49 +0000 (09:42 +0200)]
x86/EFI: allow FPU/XMM use in runtime service functions

UEFI spec update 2.4B developed a requirement to enter runtime service
functions with CR0.TS (and CR0.EM) clear, thus making feasible the
already previously stated permission for these functions to use some of
the XMM registers. Enforce this requirement (along with the connected
ones on FPU control word and MXCSR) by going through a full FPU save
cycle (if the FPU was dirty) in efi_rs_enter() (along with loading  the
specified values into the other two registers).

Note that the UEFI spec mandates that extension registers other than
XMM ones (for our purposes all that get restored eagerly) are preserved
across runtime function calls, hence there's nothing we need to restore
in efi_rs_leave() (they do get saved, but just for simplicity's sake).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: e0fe297dabc96d8161d568f19a99722c4739b9f9
master date: 2014-06-18 15:53:27 +0200

10 years agoIOMMU: prevent VT-d device IOTLB operations on wrong IOMMU
Malcolm Crossley [Tue, 24 Jun 2014 07:41:45 +0000 (09:41 +0200)]
IOMMU: prevent VT-d device IOTLB operations on wrong IOMMU

PCIe ATS allows for devices to contain IOTLBs, the VT-d code was iterating
around all ATS capable devices and issuing IOTLB operations for all IOMMUs,
even though each ATS device is only accessible via one particular IOMMU.

Issuing an IOMMU operation to a device not accessible via that IOMMU results
in an IOMMU timeout because the device does not reply. VT-d IOMMU timeouts
result in a Xen panic.

Therefore this bug prevents any Intel system with 2 or more ATS enabled IOMMUs,
each with an ATS device connected to them, from booting Xen.

The patch adds a IOMMU pointer to the ATS device struct so the VT-d code can
ensure it does not issue IOMMU ATS operations on the wrong IOMMU. A void
pointer has to be used because AMD and Intel IOMMU implementations do not have
a common IOMMU structure or indexing mechanism.

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 84c340ba4c3eb99278b6ba885616bb183b88ad67
master date: 2014-06-18 15:50:02 +0200

10 years agox86/mce: don't spam the console with "CPUx: Temperature z"
Konrad Rzeszutek Wilk [Tue, 24 Jun 2014 07:40:56 +0000 (09:40 +0200)]
x86/mce: don't spam the console with "CPUx: Temperature z"

If the machine has been quite busy it ends up with these messages
printed on the hypervisor console:

(XEN) CPU3: Temperature/speed normal
(XEN) CPU1: Temperature/speed normal
(XEN) CPU0: Temperature/speed normal
(XEN) CPU1: Temperature/speed normal
(XEN) CPU0: Temperature/speed normal
(XEN) CPU2: Temperature/speed normal
(XEN) CPU3: Temperature/speed normal
(XEN) CPU0: Temperature/speed normal
(XEN) CPU2: Temperature/speed normal
(XEN) CPU3: Temperature/speed normal
(XEN) CPU1: Temperature/speed normal
(XEN) CPU0: Temperature above threshold
(XEN) CPU0: Running in modulated clock mode
(XEN) CPU1: Temperature/speed normal
(XEN) CPU2: Temperature/speed normal
(XEN) CPU3: Temperature/speed normal

While the state changes are important, the non-altered state
information is not needed. As such add a latch mechanism to only print
the information if it has changed since the last update (and the
hardware doesn't properly suppress redundant notifications).

This was observed on Intel DQ67SW,
BIOS SWQ6710H.86A.0066.2012.1105.1504 11/05/2012

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Christoph Egger <chegger@amazon.de>
master commit: 323338f86fb6cd6f6dba4f59a84eed71b3552d21
master date: 2014-06-16 11:59:32 +0200

10 years agox86/HVM: refine SMEP test in HVM_CR4_GUEST_RESERVED_BITS()
Jan Beulich [Tue, 24 Jun 2014 07:40:01 +0000 (09:40 +0200)]
x86/HVM: refine SMEP test in HVM_CR4_GUEST_RESERVED_BITS()

Andrew validly points out that the use of the macro on the restore path
can't rely on the CPUID bits for the guest already being in place (as
their setting by the tool stack in turn requires the other restore
operations already having taken place). And even worse, using
hvm_cpuid() is invalid here because that function assumes to be used in
the context of the vCPU in question.

Reverting to the behavior prior to the change from checking
cpu_has_sm?p to hvm_vcpu_has_sm?p() would break the other (non-restore)
use of the macro. So let's revert to the prior behavior only for the
restore path, by adding a respective second parameter to the macro.

Obviously the two cpu_has_* uses in the macro should really also be
converted to hvm_cpuid() based checks at least for the non-restore
path.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: David Vrabel <david.vrabel@citrix.com>
master commit: 584287380baf81e5acdd9dc7dfc7ffccd1e9a856
master date: 2014-06-10 13:12:05 +0200

10 years agoavoid crash on HVM domain destroy with PCI passthrough
Juergen Gross [Tue, 24 Jun 2014 07:38:48 +0000 (09:38 +0200)]
avoid crash on HVM domain destroy with PCI passthrough

c/s bac6334b5 "move domain to cpupool0 before destroying it" introduced a
problem when destroying a HVM domain with PCI passthrough enabled. The
moving of the domain to cpupool0 includes moving the pirqs to the cpupool0
cpus, but the event channel infrastructure already is unusable for the
domain. So just avoid moving pirqs for dying domains.

Signed-off-by: Juergen Gross <jgross@suse.com>
master commit: b9ae60907e6dbc686403e52a7e61a6f856401a1b
master date: 2014-06-10 12:04:08 +0200

10 years agox86: fix reboot/shutdown with running HVM guests
Roger Pau Monné [Tue, 24 Jun 2014 07:37:37 +0000 (09:37 +0200)]
x86: fix reboot/shutdown with running HVM guests

If there's a guest using VMX/SVM when the hypervisor shuts down, it
can lead to the following crash due to VMX/SVM functions being called
after hvm_cpu_down has been called. In order to prevent that, check in
{svm/vmx}_ctxt_switch_from that the cpu virtualization extensions are
still enabled.

(XEN) Domain 0 shutdown: rebooting machine.
(XEN) Assertion 'read_cr0() & X86_CR0_TS' failed at vmx.c:644
(XEN) ----[ Xen-4.5-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82d0801d90ce>] vmx_ctxt_switch_from+0x1e/0x14c
...
(XEN) Xen call trace:
(XEN)    [<ffff82d0801d90ce>] vmx_ctxt_switch_from+0x1e/0x14c
(XEN)    [<ffff82d08015d129>] __context_switch+0x127/0x462
(XEN)    [<ffff82d080160acf>] __sync_local_execstate+0x6a/0x8b
(XEN)    [<ffff82d080160af9>] sync_local_execstate+0x9/0xb
(XEN)    [<ffff82d080161728>] map_domain_page+0x88/0x4de
(XEN)    [<ffff82d08014e721>] map_vtd_domain_page+0xd/0xf
(XEN)    [<ffff82d08014cda2>] io_apic_read_remap_rte+0x158/0x29f
(XEN)    [<ffff82d0801448a8>] iommu_read_apic_from_ire+0x27/0x29
(XEN)    [<ffff82d080165625>] io_apic_read+0x17/0x65
(XEN)    [<ffff82d080166143>] __ioapic_read_entry+0x38/0x61
(XEN)    [<ffff82d080166aa8>] clear_IO_APIC_pin+0x1a/0xf3
(XEN)    [<ffff82d080166bae>] clear_IO_APIC+0x2d/0x60
(XEN)    [<ffff82d080166f63>] disable_IO_APIC+0xd/0x81
(XEN)    [<ffff82d08018228b>] smp_send_stop+0x58/0x68
(XEN)    [<ffff82d080181aa7>] machine_restart+0x80/0x20a
(XEN)    [<ffff82d080181c3c>] __machine_restart+0xb/0xf
(XEN)    [<ffff82d080128fb9>] smp_call_function_interrupt+0x99/0xc0
(XEN)    [<ffff82d080182330>] call_function_interrupt+0x33/0x43
(XEN)    [<ffff82d08016bd89>] do_IRQ+0x9e/0x63a
(XEN)    [<ffff82d08016406f>] common_interrupt+0x5f/0x70
(XEN)    [<ffff82d0801a8600>] mwait_idle+0x29c/0x2f7
(XEN)    [<ffff82d08015cf67>] idle_loop+0x58/0x76
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Assertion 'read_cr0() & X86_CR0_TS' failed at vmx.c:644
(XEN) ****************************************

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 39ede234d1fd683430ffb1784d6d35b096f16457
master date: 2014-06-05 17:53:35 +0200

10 years agox86/domctl: two functional fixes to XEN_DOMCTL_[gs]etvcpuextstate
Andrew Cooper [Tue, 24 Jun 2014 07:36:49 +0000 (09:36 +0200)]
x86/domctl: two functional fixes to XEN_DOMCTL_[gs]etvcpuextstate

Interacting with the vcpu itself should be protected by vcpu_pause().
Buggy/naive toolstacks might encounter adverse interaction with a vcpu context
switch, or increase of xcr0_accum.  There are no much problems with current
in-tree code.

Explicitly permit a NULL guest handle as being a request for size.  It is the
prevailing Xen style, and without it, valgrind's ioctl handler is unable to
determine whether evc->buffer actually got written to.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
# Commit 895661ae98f0249f50280b4acfb9dda70b76d7e9
# Date 2014-06-10 12:03:16 +0200
# Author Andrew Cooper <andrew.cooper3@citrix.com>
# Committer Jan Beulich <jbeulich@suse.com>
x86/domctl: further fix to XEN_DOMCTL_[gs]etvcpuextstate

Do not clobber errors from certain codepaths.  Clobbering of -EINVAL from
failing "evc->size <= PV_XSAVE_SIZE(_xcr0_accum)" was a pre-existing bug.

However, clobbering -EINVAL/-EFAULT from the get codepath was a bug
unintentionally introduced by 090ca8c1 "x86/domctl: two functional fixes to
XEN_DOMCTL_[gs]etvcpuextstate".

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 090ca8c155b7321404ea7713a28aaedb7ac4fffd
master date: 2014-06-05 17:52:57 +0200
master commit: 895661ae98f0249f50280b4acfb9dda70b76d7e9
master date: 2014-06-10 12:03:16 +0200

10 years agoVT-d: honor APEI firmware-first mode in XSA-59 workaround code
Jan Beulich [Tue, 24 Jun 2014 07:34:57 +0000 (09:34 +0200)]
VT-d: honor APEI firmware-first mode in XSA-59 workaround code

When firmware-first mode is being indicated by firmware, we shouldn't
be modifying AER registers - these are considered to be owned by
firmware in that case. Violating this is being reported to result in
SMI storms. While circumventing the workaround means re-exposing
affected hosts to the XSA-59 issues, this in any event seems better
than not booting at all. Respective messages are being issued to the
log, so the situation can be diagnosed.

The basic building blocks were taken from Linux 3.15-rc. Note that
this includes a block of code enclosed in #ifdef CONFIG_X86_MCE - we
don't define that symbol, and that code also wouldn't build without
suitable machine check side code added; that should happen eventually,
but isn't subject of this change.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: 1cc37ba8dbd89fb86dad3f6c78c3fba06019fe21
master date: 2014-06-05 17:49:14 +0200

10 years agox86/PVH: avoid call to handle_mmio
Mukesh Rathor [Tue, 24 Jun 2014 07:33:18 +0000 (09:33 +0200)]
x86/PVH: avoid call to handle_mmio

handle_mmio() is currently unsafe for pvh guests. A call to it would
result in call to vioapic_range that will crash xen since the vioapic
ptr in struct hvm_domain is not initialized for pvh guests.

However, one path exists for such a call. If a pvh guest, dom0 or domU,
unintentionally touches non-existing memory, an EPT violation would occur.
This would result in unconditional call to hvm_hap_nested_page_fault. In
that function, because get_gfn_type_access returns p2m_mmio_dm for non
existing mfns by default, handle_mmio() will get called. This would result
in xen crash instead of the guest crash. This patch addresses that.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
master commit: 7c4870915c2d50acbc66347a532e33b452f64f17
master date: 2014-06-04 11:27:50 +0200

10 years agoACPI: Prevent acpi_table_entries from falling into a infinite loop
Malcolm Crossley [Tue, 24 Jun 2014 07:31:57 +0000 (09:31 +0200)]
ACPI: Prevent acpi_table_entries from falling into a infinite loop

If a buggy BIOS programs an ACPI table with to small an entry length
then acpi_table_entries gets stuck in an infinite loop.

To aid debugging, report the error and exit the loop.

Based on Linux kernel commit 369d913b242cae2205471b11b6e33ac368ed33ec

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Use < instead of <= (which I wrongly suggested), return -ENODATA
instead of -EINVAL, and make description match code.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 9c1e8cae657bc13e8b1ddeede17603d77f3ad341
master date: 2014-06-04 11:26:15 +0200

10 years agox86, amd_ucode: flip revision numbers in printk
Aravind Gopalakrishnan [Tue, 24 Jun 2014 07:30:28 +0000 (09:30 +0200)]
x86, amd_ucode: flip revision numbers in printk

A failure would result in log message like so-
(XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 failed
                                           ^^^^^^^^^^^^^^^^^^^^^^
The above message has the revision numbers inverted. Fix this.

Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
master commit: 071a4c70a634f7d4f74cde4086ff3202968538c9
master date: 2014-06-02 10:19:27 +0200

10 years agopage-alloc: scrub pages used by hypervisor upon freeing
Jan Beulich [Tue, 17 Jun 2014 14:01:35 +0000 (16:01 +0200)]
page-alloc: scrub pages used by hypervisor upon freeing

... unless they're part of a fully separate pool (and hence can't ever
be used for guest allocations).

This is CVE-2014-4021 / XSA-100.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
master commit: 4bd78937ec324bcef4e29ef951e0ff9815770de1
master date: 2014-06-17 15:21:10 +0200