xenbits.xensource.com Git - xen.git/log

9 years ago  x86/time: fix gtime_to_gtsc for vtsc=1 PV guests
Jan Beulich [Mon, 9 May 2016 11:15:14 +0000 (13:15 +0200)]
x86/time: fix gtime_to_gtsc for vtsc=1 PV guests

For vtsc=1 PV guests, rdtsc is trapped and calculated from get_s_time()
using gtime_to_gtsc. Similarly the tsc_timestamp, part of struct
vcpu_time_info, is calculated from stime_local_stamp using
gtime_to_gtsc.

However, gtime_to_gtsc can return 0 if time < vtsc_offset, which can
actually happen when gtime_to_gtsc is called with stime_local_stamp
(the caller function being __update_vcpu_system_time).

In that case the pvclock protocol doesn't work properly and the guest is
unable to calculate the system time correctly. As a consequence, when the
guest tries to set a timer event (for example by calling the
VCPUOP_set_singleshot_timer hypercall), the event will be in the past,
causing Linux to hang.

The purpose of the pvclock protocol is to allow the guest to calculate
the system_time in nanoseconds correctly. The guest calculates it as follows:

  from_vtsc_scale(rdtsc - vcpu_time_info.tsc_timestamp) + vcpu_time_info.system_time

Given that with vtsc=1:
  rdtsc = to_vtsc_scale(NOW() - vtsc_offset)
  vcpu_time_info.tsc_timestamp = to_vtsc_scale(vcpu_time_info.system_time - vtsc_offset)

The expression evaluates to NOW(), which is what we want.  However, when
stime_local_stamp < vtsc_offset, vcpu_time_info.tsc_timestamp is
actually 0. As a consequence the calculated overall system_time is not
correct.

This patch fixes the issue by letting gtime_to_gtsc return a negative
integer in the form of a wrapped-around unsigned integer, so that when the
guest subtracts vcpu_time_info.tsc_timestamp from rdtsc it will calculate
the right value.
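
As an illustration only (not part of the patch): the sketch below shows why
a wrapped-around unsigned tsc_timestamp still yields the correct guest-side
delta, since unsigned subtraction is modulo 2^64.  The to_vtsc_scale /
from_vtsc_scale conversions are omitted (scale assumed to be 1) and all
values are made up.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t vtsc_offset = 1000;   /* hypothetical values, scale == 1 */
        uint64_t system_time = 600;    /* stime_local_stamp < vtsc_offset */
        uint64_t now = 1500;

        /* gtime_to_gtsc() wrapping around instead of clamping to 0 */
        uint64_t tsc_timestamp = system_time - vtsc_offset; /* "negative" */
        uint64_t rdtsc = now - vtsc_offset;

        /* guest-side pvclock calculation: correct despite the wrap */
        uint64_t delta = rdtsc - tsc_timestamp;
        printf("delta=%llu expected=%llu\n",
               (unsigned long long)delta,
               (unsigned long long)(now - system_time));
        return 0;
    }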

Signed-off-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d22c9bf7c3067b17cbd9cdfd8b81941dd6fb8d77
master date: 2016-04-28 15:06:56 +0200

9 years ago  unmodified_drivers: enable use of register_oldmem_pfn_is_ram() API
Mike Meyer [Mon, 4 Apr 2016 13:02:59 +0000 (15:02 +0200)]
unmodified_drivers: enable use of register_oldmem_pfn_is_ram() API

Git: a0f793d82d5ec2d0b67c57d7130bf01c91396c60

During the investigation of very slow dump times of guest images in an
Amazon EC2 instance, it was discovered that the
register_oldmem_pfn_is_ram() API implemented by the upstream kernel
commit 997c136f518c5debd63847e78e2a8694f56dcf90:

        fs/proc/vmcore.c: add hook to read_from_oldmem() to check
                           for non-ram pages

was not being called.  This was because the PV driver making the call
to the register_oldmem_pfn_is_ram() API was not including the
kernel header file that is used to communicate support for the API in the
kernel.  Fix the issue by including the required header file.

Signed-off-by: Mike Meyer <mike.meyer@teradata.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Olaf Hering <olaf@aepfle.de>

9 years ago  x86/HVM: fix forwarding of internally cached requests
Jan Beulich [Mon, 9 May 2016 11:05:42 +0000 (13:05 +0200)]
x86/HVM: fix forwarding of internally cached requests

Forwarding entire batches to the device model when an individual
iteration of them got rejected by internal device emulation handlers
with X86EMUL_UNHANDLEABLE is wrong: The device model would then handle
all iterations, without the internal handler getting to see any past
the one it returned failure for. This causes misbehavior in at least
the MSI-X and VGA code, which want to see all such requests for
internal tracking/caching purposes. But note that this does not apply
to buffered I/O requests.

This in turn means that the condition in hvm_process_io_intercept() of
when to crash the domain was wrong: Since X86EMUL_UNHANDLEABLE can
validly be returned by the individual device handlers, we mustn't
blindly crash the domain if such occurs on other than the initial
iteration. Instead we need to distinguish hvm_copy_*_guest_phys()
failures from device specific ones, and then the former need to always
be fatal to the domain (i.e. also on the first iteration), since
otherwise we again would end up forwarding a request to qemu which the
internal handler didn't get to see.

The adjustment should be okay even for stdvga's MMIO handling:
- if it is not caching then the accept function would have failed so we
  won't get into hvm_process_io_intercept(),
- if it issued the buffered ioreq then we only get to the p->count
  reduction if hvm_send_ioreq() actually encountered an error (in which
  case we don't care about the request getting split up).

Also commit 4faffc41d ("x86/hvm: limit reps to avoid the need to handle
retry") went too far in removing code from hvm_process_io_intercept():
When there were successfully handled iterations, the function should
continue to return success with a clipped repeat count.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

x86/HVM: fix forwarding of internally cached requests (part 2)

Commit 96ae556569 ("x86/HVM: fix forwarding of internally cached
requests") wasn't quite complete: hvmemul_do_io() also needs to
propagate up the clipped count. (I really should have re-tested the
forward port resulting in the earlier change, instead of relying on the
testing done on the older version of Xen which the fix was first needed
for.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 96ae556569b8eaedc0bb242932842c3277b515d8
master date: 2016-03-31 14:52:04 +0200
master commit: 670ee15ac1e3de7c15381fdaab0e531489b48939
master date: 2016-04-28 15:09:26 +0200

9 years ago  x86/fpu: improve check for XSAVE* not writing FIP/FDP fields
David Vrabel [Mon, 9 May 2016 11:05:13 +0000 (13:05 +0200)]
x86/fpu: improve check for XSAVE* not writing FIP/FDP fields

The hardware may not write the FIP/FDP fields with an XSAVE*
instruction, e.g., with XSAVEOPT/XSAVES if the state hasn't changed,
or on AMD CPUs when a floating point exception is not pending.  We
need to identify this case so we can correctly apply the check for
whether to save/restore FCS/FDS.

By poisoning FIP in the saved state we can check if the hardware
writes to this field.  The poison value is both: a) non-canonical; and
b) random with a vanishingly small probability of matching a value
written by the hardware (1 / (2^63) = 10^-19).
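
A sketch of the poison-and-check idea described above (hypothetical
constant and helper, not the Xen implementation); FIP sits at byte
offset 8 of the XSAVE legacy region in the 64-bit layout:

    #include <stdint.h>
    #include <string.h>

    /* made-up non-canonical sentinel value */
    static const uint64_t fip_poison = 0x8badf00d5aadc0deULL;

    static int hw_wrote_fip(uint8_t *xsave_area)
    {
        uint64_t fip;

        memcpy(xsave_area + 8, &fip_poison, sizeof(fip_poison)); /* poison */
        /* ... XSAVE/XSAVEOPT into xsave_area happens here ... */
        memcpy(&fip, xsave_area + 8, sizeof(fip));
        return fip != fip_poison; /* changed => hardware wrote FIP/FDP */
    }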

The poison value is fixed and thus knowable by a guest (or guest
userspace).  This could allow the guest to cause Xen to incorrectly
detect that the field has not been written.  But: a) this requires the
FIP register to be a full 64 bits internally which is not the case for
all current AMD and Intel CPUs; and b) this only allows the guest (or
a guest userspace process) to corrupt its own state (i.e., it cannot
affect the state of another guest or another user space process).

This results in smaller code with fewer branches and is more
understandable.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Intel confirmed that 64-bit {F,}XRSTOR sign-extend FIP from bit 47.
While leaving the description above intact, modify the code comment
accordingly.

Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e869abd77aa32fb0a5212d34ae954e4dbcb8f7a5
master date: 2016-03-18 09:49:01 +0100

9 years ago  x86/hvm: add HVM_PARAM_X87_FIP_WIDTH
David Vrabel [Mon, 9 May 2016 11:04:26 +0000 (13:04 +0200)]
x86/hvm: add HVM_PARAM_X87_FIP_WIDTH

Add the HVM parameter HVM_PARAM_X87_FIP_WIDTH to allow tools and the guest
to adjust the width of the FIP/FDP registers to be saved/restored by
the hypervisor.  This is in case the hypervisor heuristics do not do
the right thing.

Add this parameter to the set saved during domain save/migrate.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 5d768fb1f3f7b011e7b6e75909c7f4841730de60
master date: 2016-02-26 12:30:11 +0100

9 years ago  x86/fpu: add a per-domain field to set the width of FIP/FDP
David Vrabel [Mon, 9 May 2016 11:03:15 +0000 (13:03 +0200)]
x86/fpu: add a per-domain field to set the width of FIP/FDP

The x86 architecture allows either: a) the 64-bit FIP/FDP registers to
be restored (clearing FCS and FDS); or b) the 32-bit FIP/FDP and
FCS/FDS registers to be restored (clearing the upper 32-bits).

Add a per-domain field to indicate which of these options a guest
needs.  The options are 8, 4 or 0, where 0 indicates that the
hypervisor should automatically guess the FIP width by checking the
value of FIP/FDP when saving the state (this is the existing
behaviour).

The FIP width is initially automatic but is set explicitly in the
following cases:

- 32-bit PV guest: 4
- Newer CPUs that do not save FCS/FDS: 8

The x87_fip_width field is placed into an existing 1 byte hole in
struct arch_domain.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Fix build.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 879b44b041f26de35e4b527bf0f3c361eb52bd82
master date: 2016-02-26 12:29:21 +0100

9 years ago  x86: limit GFNs to 32 bits for shadowed superpages.
Tim Deegan [Wed, 16 Mar 2016 16:56:04 +0000 (16:56 +0000)]
x86: limit GFNs to 32 bits for shadowed superpages.

Superpage shadows store the shadowed GFN in the backpointer field,
which for non-BIGMEM builds is 32 bits wide.  Shadowing a superpage
mapping of a guest-physical address above 2^44 would lead to the GFN
being truncated there, and a crash when we come to remove the shadow
from the hash table.

Track the valid width of a GFN for each guest, including reporting it
through CPUID, and enforce it in the shadow pagetables.  Set the
maximum width to 32 for guests where this truncation could occur.
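
To illustrate the truncation with made-up numbers (this snippet is not
part of the patch):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t gpa = 1ULL << 45;            /* guest-physical address > 2^44 */
        uint64_t gfn = gpa >> 12;             /* == 2^33, needs more than 32 bits */
        uint32_t backpointer = (uint32_t)gfn; /* non-BIGMEM field: truncates to 0 */

        printf("gfn=%#llx backpointer=%#x\n",
               (unsigned long long)gfn, backpointer);
        return 0;
    }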

This is XSA-173.

Reported-by: Ling Liu <liuling-it@360.cn>
Signed-off-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

9 years ago  x86: fix information leak on AMD CPUs
Jan Beulich [Tue, 29 Mar 2016 13:19:51 +0000 (15:19 +0200)]
x86: fix information leak on AMD CPUs

The fix for XSA-52 was wrong, and so was the change synchronizing that
new behavior to the FXRSTOR logic: AMD's manuals explicitly state that
writes to the ES bit are ignored, and it instead gets calculated from
the exception and mask bits (it gets set whenever there is an unmasked
exception, and cleared otherwise). Hence we need to follow that model
in our workaround.
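
A minimal sketch of that model (illustrative helper, not the actual Xen
patch): ES (FSW bit 7) is derived from the six exception flags (FSW bits
0-5) and the corresponding mask bits (FCW bits 0-5).

    #include <stdint.h>

    static uint16_t recalc_fsw_es(uint16_t fsw, uint16_t fcw)
    {
        if ( fsw & ~fcw & 0x3f )     /* any unmasked exception pending? */
            fsw |= 0x0080;           /* ES gets set */
        else
            fsw &= ~0x0080;          /* ES gets cleared */
        return fsw;
    }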

This is CVE-2016-3158 / CVE-2016-3159 / XSA-172.
[xen/arch/x86/xstate.c:xrstor: CVE-2016-3158]
[xen/arch/x86/i387.c:fpu_fxrstor: CVE-2016-3159]

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7bd9dc3adfbb014c55f0928ebb3b20950ca9c019
master date: 2016-03-29 14:24:26 +0200

9 years ago  update Xen version to 4.5.4-pre
Jan Beulich [Tue, 29 Mar 2016 13:19:07 +0000 (15:19 +0200)]
update Xen version to 4.5.4-pre

9 years ago  update Xen version to 4.5.3 RELEASE-4.5.3
Jan Beulich [Wed, 23 Mar 2016 13:57:27 +0000 (14:57 +0100)]
update Xen version to 4.5.3

9 years ago  vmx: restore debug registers when injecting #DB traps
Ross Lagerwall [Fri, 18 Mar 2016 07:09:54 +0000 (08:09 +0100)]
vmx: restore debug registers when injecting #DB traps

Commit a929bee0e652 ("x86/vmx: Fix injection of #DB traps following
XSA-156") prevents an infinite loop in certain #DB traps. However, it
changed the behavior to not call hvm_hw_inject_trap() for #DB and #AC
traps, which means that the debug registers are not restored
correctly, and nullified commit b56ae5b48c38 ("VMX: fix/adjust trap
injection").

To fix this, restore the original code path through hvm_inject_trap(),
but ensure that the struct hvm_trap is populated with all the required
data.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: ba22f1f4732acb4d5aebd779122e91753a0e374d
master date: 2016-03-15 12:19:13 +0100

9 years ago  x86: don't flush the whole cache when changing cachability
David Vrabel [Fri, 18 Mar 2016 07:09:10 +0000 (08:09 +0100)]
x86: don't flush the whole cache when changing cachability

Introduce the FLUSH_VA_VALID flag to flush_area_mask() and friends to
say that it is safe to use CLFLUSH (i.e., the virtual address is still
valid).

Use this when changing the cachability of the Xen direct mappings (in
response to the guest changing the cachability of its mappings). This
significantly improves performance by avoiding an expensive WBINVD.

This fixes a performance regression introduced by
c61a6f74f80eb36ed83a82f713db3143159b9009 (x86: enforce consistent
cachability of MMIO mappings), the fix for XSA-154.

e.g., A set_memory_wc() call in Linux:

before: 4097 us
after:    47 us

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: dff593c7b6eb1cfd4591b662a880a0c9325cce40
master date: 2016-03-10 16:51:03 +0100

9 years ago  libvchan: Read prod/cons only once.
Konrad Rzeszutek Wilk [Wed, 9 Mar 2016 15:57:36 +0000 (16:57 +0100)]
libvchan: Read prod/cons only once.

We must ensure that the prod/cons indexes are only read once and that
the compiler won't try to optimize the reads, that is, split the read
of these into multiple instructions influencing later branch code. As
such, insert barriers when fetching the cons and prod index.
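
For illustration only (a C rendering of the idea with assumed names, not
the library's actual code), the pattern looks like this:

    #include <stdint.h>

    #define xen_mb() __sync_synchronize()   /* assumed full memory barrier */

    static uint32_t ring_bytes_to_read(const volatile uint32_t *prod,
                                       const volatile uint32_t *cons)
    {
        uint32_t p = *prod;   /* read the shared indexes exactly once */
        uint32_t c = *cons;

        xen_mb();             /* order the fetches before the branch code */
        return p - c;         /* free-running indexes: data in the ring */
    }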

This is part of XSA-155.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
master commit: 7d66a4ba695ab8d13b214fb816dd59e443ae1ec9
master date: 2015-12-18 09:50:02 -0500

9 years ago  x86emul: limit-check branch targets
Jan Beulich [Fri, 4 Mar 2016 12:16:07 +0000 (13:16 +0100)]
x86emul: limit-check branch targets

All branches need to #GP when their target violates the segment limit
(in 16- and 32-bit modes) or is non-canonical (in 64-bit mode). For
near branches facilitate this via a zero-byte instruction fetch from
the target address (resulting in address translation and validation
without an actual read from memory), while far branches get dealt with
by breaking up the segment register loading into a read-and-validate
part and a write one. The latter at once allows correcting some
ordering issues in how the individual emulation steps get carried out:
Before updating machine state, all exceptions unrelated to that state
updating should have got raised (i.e. the only ones possibly resulting
in partly updated state are faulting memory writes [pushes]).

Note that while not immediately needed here, write and distinct read
emulation routines get updated to deal with zero byte accesses too, for
overall consistency.

Reported-by: 刘令 <liuling-it@360.cn>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
master commit: 81d3a0b26c1672c60b2a54dd8780e6f6472d2328
master date: 2016-02-26 12:14:39 +0100

9 years ago  x86/hvm: print register state upon triple fault
Andrew Cooper [Fri, 4 Mar 2016 12:15:32 +0000 (13:15 +0100)]
x86/hvm: print register state upon triple fault

A sample looks like:

(XEN) d1v0 Triple fault - invoking HVM shutdown action 1
(XEN) *** Dumping Dom1 vcpu#0 state: ***
(XEN) ----[ Xen-4.7-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    0000:[<0000000000100005>]
(XEN) RFLAGS: 0000000000010002   CONTEXT: hvm guest (d1v0)
(XEN) rax: 0000000000000020   rbx: 0000000000000000   rcx: 0000000000000000
(XEN) rdx: 0000000000000000   rsi: 0000000000000000   rdi: 0000000000000000
(XEN) rbp: 0000000000000000   rsp: 0000000000000000   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 0000000000000011   cr4: 0000000000000000
(XEN) cr3: 0000000000000000   cr2: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: 0000

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 1329105943179b91da2466431cc972e223900ced
master date: 2016-02-25 13:02:29 +0100

9 years ago  x86emul: fix rIP handling
Jan Beulich [Fri, 4 Mar 2016 12:14:39 +0000 (13:14 +0100)]
x86emul: fix rIP handling

Deal with rIP just like with any other register: Truncate to designated
width upon entry, write back the zero-extended 32-bit value when
emulating 32-bit code, and leave the upper 48 bits unchanged for 16-bit
code.
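
A sketch of that write-back rule (hypothetical helper, not the emulator's
code), with op_bytes being the current operand size:

    #include <stdint.h>

    static uint64_t writeback_rip(uint64_t old_rip, uint64_t new_rip,
                                  unsigned int op_bytes)
    {
        switch ( op_bytes )
        {
        case 2:  /* 16-bit code: upper 48 bits left unchanged */
            return (old_rip & ~(uint64_t)0xffff) | (new_rip & 0xffff);
        case 4:  /* 32-bit code: zero-extended 32-bit value written back */
            return (uint32_t)new_rip;
        default: /* 64-bit code */
            return new_rip;
        }
    }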

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0640ffb67fb92e2561c63b9308c27b71281fdd72
master date: 2016-02-18 15:05:34 +0100

9 years ago  xen/arm: vgic-v2: Implement correctly ITARGETSR0 - ITARGETSR7 read-only
Julien Grall [Fri, 4 Mar 2016 12:13:22 +0000 (13:13 +0100)]
xen/arm: vgic-v2: Implement correctly ITARGETSR0 - ITARGETSR7 read-only

Each ITARGETSR register is 4 bytes wide and the offset is in bytes.

The current implementation computes the end of the range wrongly,
resulting in only ITARGETSR{0,1} being emulated read-only. The rest will
be treated as read-write.

As 8 registers should be read-only, the end of the range should be
ITARGETSR + (4 * 8) - 1.

For convenience introduce ITARGETSR7 and ITARGETSR8.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
(cherry picked from commit bc50de883847c1ebc7c8b4d73283d9be6c4df38e)

9 years ago  xen/arm: vgic-v2: Report the correct GICC size to the guest
Julien Grall [Fri, 4 Mar 2016 12:11:20 +0000 (13:11 +0100)]
xen/arm: vgic-v2: Report the correct GICC size to the guest

The GICv2 DT node is usually used by the guest to know the address/size
of the regions (GICD, GICC...) to map into its virtual memory.

While the GICv2 spec requires the size of the GICC to be 8KB, we
correctly do an 8KB stage-2 mapping but erroneously report 256 bytes in
the device tree (based on GUEST_GICC_SIZE).

I bet we didn't see any issue so far because all the registers except
GICC_DIR live in the first 256 bytes of the GICC region and all the
guests I have seen so far are driving the GIC with GICC_CTLR.EOImode =
0.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- fixed some typos in commit message ]

(cherry picked from commit 8ee6d574b7073b5c98fcf94d20a53197609b85e1)

9 years ago  tools: pygrub: if partition table is empty, try treating as a whole disk
Ian Campbell [Thu, 5 Nov 2015 14:46:12 +0000 (14:46 +0000)]
tools: pygrub: if partition table is empty, try treating as a whole disk

pygrub (in identify_disk_image()) detects a DOS style partition table
via the presence of the 0xaa55 signature at the end of the first
sector of the disk.

However this signature is also present in whole-disk configurations
when there is an MBR on the disk. Many filesystems (e.g. ext[234])
include leading padding in their on disk format specifically to enable
this.

So if we think we have a DOS partition table but do not find any
actual partition table entries we may as well try looking at it as a
whole disk image. Worst case is we probe and find there isn't anything
there.

This was reported by Sjors Gielen in Debian bug #745419. The fix was
inspired by a patch by Adi Kriegisch in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=745419#27

Tested by genext2fs'ing my /boot into a new raw image (works) and
then:
   dd if=/usr/lib/grub/i386-pc/g2ldr.mbr of=img conv=notrunc bs=512 count=1

to add an MBR (with 0xaa55 signature) to it, which after this patch
also works.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: 745419-forwarded@bugs.debian.org
(cherry picked from commit fb31b1475f1bf179f033b8de3f0e173006fd77e9)
(cherry picked from commit 6c9b1bcce4fcc872edddd44f88390a67d5954069)

9 years ago  x86: fix unintended fallthrough case from XSA-154
Andrew Cooper [Thu, 18 Feb 2016 14:26:16 +0000 (15:26 +0100)]
x86: fix unintended fallthrough case from XSA-154

... and annotate the other deliberate one: Coverity objects otherwise.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
One of the two instances was actually a bug.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 8dd6d1c099865ee5f5916616a0ca79cd943c46f9
master date: 2016-02-18 15:10:07 +0100

9 years ago  xen/arm64: Make sure we get all debug output
Dirk Behme [Thu, 18 Feb 2016 14:25:43 +0000 (15:25 +0100)]
xen/arm64: Make sure we get all debug output

Starting in the wrong ELx mode I get the following debug output:

...
- Current EL 00000004 -
- Xen must be entered in NS EL2 mode -
- Boot failed -

The output of "Please update the bootloader" is missing here, because
string concatenation in gas, unlike in C, keeps the \0 between each
individual string.

Make sure this is output, too. With this, we get

...
- Current EL 00000004 -
- Xen must be entered in NS EL2 mode -
- Please update the bootloader -
- Boot failed -

as intended.

Signed-off-by: Dirk Behme <dirk.behme@de.bosch.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- added same change to arm32 case ]
master commit: c31d34082555566eb27d0d1eb42962f72fa886d3
master date: 2016-02-18 10:13:42 +0000

9 years ago  hvmloader: fix scratch_alloc to avoid overlaps
Anthony PERARD [Wed, 17 Feb 2016 15:49:49 +0000 (16:49 +0100)]
hvmloader: fix scratch_alloc to avoid overlaps

scratch_alloc() sets scratch_start to the last byte of the current
allocation.  The value of scratch_start is then reused as is (if it is
already aligned) in the next allocation.  This results in a potential
reuse of the last byte of the previous allocation.
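
A simplified sketch of the off-by-one (hypothetical allocator state, not
hvmloader's actual code):

    #include <stdint.h>

    static uint32_t scratch_start;   /* end of the scratch area so far */

    static uint32_t scratch_alloc_buggy(uint32_t size, uint32_t align)
    {
        uint32_t start = (scratch_start + align - 1) & ~(align - 1);

        /* Bug: records the last byte of this allocation, so an already
         * aligned value lets the next call hand out an area overlapping
         * that byte.  The fix records start + size instead. */
        scratch_start = start + size - 1;
        return start;
    }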

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4ab3ac074cb1f101f42e02103fa263a1f4f422b5
master date: 2016-02-10 14:46:45 +0100

9 years ago  x86/nHVM: avoid NULL deref during INVLPG intercept handling
Jan Beulich [Wed, 17 Feb 2016 15:49:18 +0000 (16:49 +0100)]
x86/nHVM: avoid NULL deref during INVLPG intercept handling

When intercepting (or emulating) L1 guest INVLPG, the nested P2M
pointer may be (is?) NULL, and hence there's no point in calling
p2m_flush(). In fact doing so would cause a dereference of that NULL
pointer at least in the ASSERT() right at the beginning of the
function.

While so far nothing supports hap_invlpg() being reachable from the
INVLPG intercept paths (only INVLPG insn emulation would lead there),
and hence the code in question (added by dd6de3ab99 ["Implement
Nested-on-Nested"]) appears to be dead, this seems to be the change
which can be agreed on as an immediate fix. Ideally, however, the
problematic code would go away altogether. See thread at
lists.xenproject.org/archives/html/xen-devel/2016-01/msg03762.html.

Reported-by: 刘令 <liuling-it@360.cn>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 86c59615f4e7f38df24182f20d9dbdec3299c514
master date: 2016-02-09 13:22:13 +0100

9 years ago  credit: recalculate per-cpupool credits when updating timeslice
Juergen Gross [Wed, 17 Feb 2016 15:48:37 +0000 (16:48 +0100)]
credit: recalculate per-cpupool credits when updating timeslice

When modifying the timeslice of the credit scheduler in a cpupool the
cpupool global credit value (n_cpus * credits_per_tslice) isn't
recalculated. This will lead to wrong scheduling decisions later.

Do the recalculation when updating the timeslice.

Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Alan.Robinson <alan.robinson@ts.fujitsu.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: ffc342fbb060cd753fc3a5f6fb6f550dd29a2637
master date: 2016-02-02 14:03:40 +0100

9 years ago  credit: update timeslice under lock
Juergen Gross [Wed, 17 Feb 2016 15:48:16 +0000 (16:48 +0100)]
credit: update timeslice under lock

When updating the timeslice of the credit scheduler, protect the
scheduler's private data by its lock. Today a possible race could
result only in some weird scheduling decisions during one timeslice,
but further adjustments will need the lock anyway.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: f2c96ac4dedf4976e46de34c69c2cd8b289c4ef2
master date: 2016-02-02 14:03:06 +0100

9 years ago  x86/vmx: don't clobber exception_bitmap when entering/leaving emulated real mode
Andrew Cooper [Wed, 17 Feb 2016 15:47:52 +0000 (16:47 +0100)]
x86/vmx: don't clobber exception_bitmap when entering/leaving emulated real mode

Most updates to the exception bitmaps set or clear individual bits.

However, entering or exiting emulated real mode unilaterally clobbers it,
leaving the exit code to recalculate what it should have been.  This is error
prone, and indeed currently fails to recalculate the TRAP_no_device intercept
appropriately.

Instead of overwriting exception_bitmap when entering emulated real mode, move
the override into vmx_update_exception_bitmap() and leave exception_bitmap
unmodified.

This means that recalculation is unnecessary, and that the use of
vmx_fpu_leave() and vmx_update_debug_state() while in emulated real mode
doesn't result in TRAP_no_device and TRAP_int3 being un-intercepted.

This is only a functional change on hardware lacking unrestricted guest
support.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 78c93adf0a7f6a7abe249a63e7398ca1221a6d25
master date: 2016-02-02 14:00:52 +0100

9 years ago  x86/mce: fix misleading indentation in init_nonfatal_mce_checker()
Ian Campbell [Wed, 17 Feb 2016 15:47:21 +0000 (16:47 +0100)]
x86/mce: fix misleading indentation in init_nonfatal_mce_checker()

Debian bug 812166[0] reported this build failure due to
Wmisleading-indentation with gcc-6:

non-fatal.c: In function 'init_nonfatal_mce_checker':
non-fatal.c:103:2: error: statement is indented as if it were guarded by... [-Werror=misleading-indentation]
  switch (c->x86_vendor) {
  ^~~~~~

non-fatal.c:97:5: note: ...this 'if' clause, but it is not
     if ( __get_cpu_var(poll_bankmask) == NULL )
     ^~

I was unable to reproduce (xen builds cleanly for me with "6.0.0 20160117
(experimental) [trunk revision 232481]") but looking at the code the issue
above is clearly real.

Correctly reindent the if statement.

This file uses Linux coding style (in fact the use of Xen style for
this line is the root cause of the warning), so use tabs and, while
there, remove the whitespace inside the if as Linux does.

[0] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=812166

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2e46e3f2539d026594ec1618e7df2c2bc8785b42
master date: 2016-01-22 16:19:51 +0100

9 years ago  x86: fix (and simplify) MTRR overlap checking
Jan Beulich [Wed, 17 Feb 2016 15:46:52 +0000 (16:46 +0100)]
x86: fix (and simplify) MTRR overlap checking

Obtaining one individual range per variable range register (via
get_mtrr_range()) was bogus from the beginning, as these registers may
cover multiple disjoint ranges. Do away with that, in favor of simply
comparing masked addresses.
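
The idea, as an illustrative helper (not the patch itself): two variable
ranges can only overlap if their bases agree on every address bit
constrained by both masks.

    #include <stdbool.h>
    #include <stdint.h>

    static bool var_mtrr_overlap(uint64_t base1, uint64_t mask1,
                                 uint64_t base2, uint64_t mask2)
    {
        uint64_t m = mask1 & mask2;   /* bits both ranges constrain */

        return (base1 & m) == (base2 & m);
    }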

Also, for is_var_mtrr_overlapped()'s result to be correct when called
from mtrr_wrmsr(), generic_set_mtrr() must update saved state first.

As minor cleanup changes, constify is_var_mtrr_overlapped()'s parameter
and make mtrr_wrmsr() static.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3272230848f36eb5bbb660216898a90048a81d9f
master date: 2016-01-21 16:11:04 +0100

9 years ago  x86/mmuext: tighten TLB flush address checks
Jan Beulich [Wed, 17 Feb 2016 15:46:25 +0000 (16:46 +0100)]
x86/mmuext: tighten TLB flush address checks

Addresses passed by PV guests should be subjected to __addr_ok(),
avoiding undue TLB flushes.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 828e114f7cdd9910483783ab0563b178325e579a
master date: 2016-01-21 16:09:22 +0100

9 years ago  x86/VMX: sanitize rIP before re-entering guest
Jan Beulich [Wed, 17 Feb 2016 15:43:56 +0000 (16:43 +0100)]
x86/VMX: sanitize rIP before re-entering guest

... to prevent guest user mode arranging for a guest crash (due to
failed VM entry). (On the AMD system I checked, hardware is doing
exactly the canonicalization being added here.)
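
As an aside, canonicalizing a 48-bit virtual address means sign-extending
it from bit 47; a standalone sketch (not the patch itself, and assuming
the usual arithmetic right shift of signed values):

    #include <stdint.h>

    static uint64_t canonicalize_address(uint64_t addr)
    {
        /* sign-extend from bit 47 */
        return (uint64_t)((int64_t)(addr << 16) >> 16);
    }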

Note that fixing this in an architecturally correct way would be quite
a bit more involved: Making the x86 instruction emulator check all
branch targets for validity, plus dealing with invalid rIP resulting
from update_guest_eip() or incoming directly during a VM exit. The only
way to get the latter right would be by not having hardware do the
injection.

Note further that there are two early returns from
vmx_vmexit_handler(): One (through vmx_failed_vmentry()) leads to
domain_crash() anyway, and the other covers real mode only and can
neither occur with a non-canonical rIP nor result in an altered rIP,
so we don't need to force those paths through the checking logic.

This is CVE-2016-2271 / XSA-170.

Reported-by: 刘令 <liuling-it@360.cn>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ffbbfda37782a2408953af1a3e00ada80bb141bc
master date: 2016-02-17 16:18:08 +0100

9 years ago  x86: enforce consistent cachability of MMIO mappings
Jan Beulich [Wed, 17 Feb 2016 15:43:21 +0000 (16:43 +0100)]
x86: enforce consistent cachability of MMIO mappings

We've been told by Intel that inconsistent cachability between
multiple mappings of the same page can affect system stability only
when the affected page is an MMIO one. Since the stale data issue is
of no relevance to the hypervisor (since all guest memory accesses go
through proper accessors and validation), handling of RAM pages
remains unchanged here. Any MMIO mapped by domains however needs to be
done consistently (all cachable mappings or all uncachable ones), in
order to avoid Machine Check exceptions. Since converting existing
cachable mappings to uncachable (at the time an uncachable mapping
gets established) would in the PV case require tracking all mappings,
allow MMIO to only get mapped uncachable (UC, UC-, or WC).

This also implies that in the PV case we mustn't use the L1 PTE update
fast path when cachability flags get altered.

Since in the HVM case at least for now we want to continue honoring
pinned cachability attributes for pages not mapped by the hypervisor,
special case handling of r/o MMIO pages (forcing UC) gets added there.
Arguably the counterpart change to p2m-pt.c may not be necessary, since
UC- (which already gets enforced there) is probably strict enough.

Note that the shadow code changes include fixing the write protection
of r/o MMIO ranges: shadow_l1e_remove_flags() and its siblings, other
than l1e_remove_flags() and alike, return the new PTE (and hence
ignoring their return values makes them no-ops).

This is CVE-2016-2270 / XSA-154.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c61a6f74f80eb36ed83a82f713db3143159b9009
master date: 2016-02-17 16:16:53 +0100

9 years ago  docs: correct descriptions of gnttab_max_{, maptrack}_frames
Ian Campbell [Wed, 20 Jan 2016 13:06:22 +0000 (14:06 +0100)]
docs: correct descriptions of gnttab_max_{, maptrack}_frames

gnttab_max_frames incorrectly referred to numbers of grant tab
operations and gnttab_max_maptrack_frames was confusingly worded.

Add the default for gnttab_max_frames while here (it's currently the
same on all arches since no arch uses the available arch override) and
adjust the default for gnttab_max_maptrack_frames to match the normal
form.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ef17887d848dae0ca46231b47bf30d3c1d4aa87d
master date: 2016-01-19 16:24:44 +0000

9 years ago  x86/vmx: Fix injection of #DB traps following XSA-156
Andrew Cooper [Wed, 20 Jan 2016 13:05:48 +0000 (14:05 +0100)]
x86/vmx: Fix injection of #DB traps following XSA-156

Most #DB exceptions are traps rather than faults, meaning that the instruction
pointer in the exception frame points after the instruction rather than at it.

However, VMX intercepts all have fault semantics, even when intercepting a
trap.  Re-injecting an intercepted trap as a fault causes an infinite loop in
the guest, by re-executing the same trapping instruction repeatedly.  This
breaks debugging inside the guest.

Introduce a helper which copies VM_EXIT_INTR_INFO to VM_ENTRY_INTR_INFO, and
use it to mirror the intercepted interrupt back to the guest.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 0747bc8b4d85f3fc0ee1e58418418fa0229e8ff8
master date: 2016-01-05 11:28:56 +0000

9 years ago  x86/VMX: prevent INVVPID failure due to non-canonical guest address
Jan Beulich [Wed, 20 Jan 2016 13:03:02 +0000 (14:03 +0100)]
x86/VMX: prevent INVVPID failure due to non-canonical guest address

While INVLPG (and on SVM INVLPGA) don't fault on non-canonical
addresses, INVVPID fails (in the "individual address" case) when passed
such an address.

Since such intercepted INVLPG are effectively no-ops anyway, don't fix
this in vmx_invlpg_intercept(), but instead have paging_invlpg() never
return true in such a case.

This is CVE-2016-1571 / XSA-168.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: bf05e88ed7342a91cceba050b6c622accb809842
master date: 2016-01-20 13:50:10 +0100

9 years ago  x86/mm: PV superpage handling lacks sanity checks
Jan Beulich [Wed, 20 Jan 2016 13:02:17 +0000 (14:02 +0100)]
x86/mm: PV superpage handling lacks sanity checks

MMUEXT_{,UN}MARK_SUPER fail to check the input MFN for validity before
dereferencing pointers into the superpage frame table.

Reported-by: Qinghao Tang <luodalongde@gmail.com>
get_superpage() has a similar issue.

This is CVE-2016-1570 / XSA-167.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 47abf29a9255b2e7b94e56d66b455d0a584b68b8
master date: 2016-01-20 13:49:23 +0100

9 years ago  tools/ocaml/xb: Correct calculations of data/space in the ring
Andrew Cooper [Tue, 10 Nov 2015 10:46:44 +0000 (10:46 +0000)]
tools/ocaml/xb: Correct calculations of data/space in the ring

ml_interface_{read,write}() would miscalculate the quantity of
data/space in the ring if it crossed the ring boundary, and incorrectly
return a short read/write.

This causes a protocol stall, as either side of the ring ends up waiting
for what they believe to be the other side needing to take the next
action.

Correct the calculations to cope with crossing the ring boundary.

In addition, correct the error detection.  It is a hard error if the
producer index gets more than a ring size ahead of the consumer, or if
the consumer ever overtakes the producer.
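
In C terms (an illustrative rendering with free-running indexes and an
assumed power-of-two ring size, not the OCaml stub code itself):

    #include <stdint.h>

    #define XENSTORE_RING_SIZE 1024   /* assumed power-of-two size */

    static uint32_t bytes_to_read(uint32_t prod, uint32_t cons)
    {
        return prod - cons;                          /* works across the boundary */
    }

    static uint32_t bytes_to_write(uint32_t prod, uint32_t cons)
    {
        return XENSTORE_RING_SIZE - (prod - cons);
    }

    static int indexes_sane(uint32_t prod, uint32_t cons)
    {
        /* hard error if prod runs more than a ring size ahead of cons */
        return (prod - cons) <= XENSTORE_RING_SIZE;
    }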

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Reviewed-by: David Scott <dave@recoil.org>
(cherry picked from commit 8a2c11f876e6cf9c74f2bcaed5a997adc57da888)
(cherry picked from commit 6150df9f3f99ecbcbd9917002186d1d895b5602e)

9 years ago  oxenstored: Quota.merge: don't assume domain already exists
Jonathan Davies [Wed, 11 Nov 2015 11:21:53 +0000 (11:21 +0000)]
oxenstored: Quota.merge: don't assume domain already exists

In Quota.merge, we merge two quota hashtables, orig_quota and mod_quota, putting
the results into dest_quota. These hashtables map domids to the number of
entries currently owned by that domain.

When mod_quota contains an entry for a domid that was not present in orig_quota
(or dest_quota), the call to get_entry caused Quota.merge to raise a Not_found
exception. This propagates back to the client as an ENOENT error, which is not
an appropriate return value from some operations, such as transaction_end.

This situation can arise when a transaction that introduces a domain (hence
calling Quota.add_entry) needs to be coalesced due to concurrent xenstore
activity.

This patch handles the merge in the case where mod_quota contains an entry not
present in orig_quota (or in dest_quota) by treating that hashtable as having
existing value 0.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 82ff7cbed64e3cc094d6812b3ad672c660649969)
(cherry picked from commit ba391da2e7c68aad131159c90733e0d39b85ed37)

9 years ago  Config.mk: update OVMF changeset
Wei Liu [Wed, 14 Oct 2015 11:41:13 +0000 (12:41 +0100)]
Config.mk: update OVMF changeset

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commits 04c5efb0a141fa53e805e396970419436e74ce67
 and f046e501bbca1c8a46853b2e1f1b587e228c73de)

Apropos of discussion in
 "OVMF related osstest failures on multiple branches"
 http://lists.xenproject.org/archives/html/xen-devel/2016-01/msg00442.html

We believe the older ovmf.git does not work when built with the gcc in
Debian jessie.  We do not know where this bug lies but we are fixing
it by updating ovmf.

We have decided that we are not in a position to review the changes to
OVMF upstream, and ourselves decide what to cherry pick.  Instead we
will update the revision wholesale and use the xen.git stable
branches' push gate.

Conflicts:
Config.mk

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commits 6c3c6ff9ecaa5ee0be8b535d36fdcd12380564a1
 and 1d3cc6e62c4d2fc3dd9251d4921881425c9d27bd)

Conflicts:
Config.mk
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

9 years ago  blktap: Fix two 'maybe uninitialized' variables
Dario Faggioli [Fri, 20 Jun 2014 14:09:00 +0000 (16:09 +0200)]
blktap: Fix two 'maybe uninitialized' variables

[ Cross-ported to blktap1 from 345e44a85d71a
  "blktap2: Fix two 'maybe uninitialized' variables" -iwj;
  Remainder of commit message is from blktap2's version. ]

for which gcc 4.9.0 complains about, like this:

block-qcow.c: In function `get_cluster_offset':
block-qcow.c:431:3: error: `tmp_ptr' may be used uninitialized in this function
[-Werror=maybe-uninitialized]
   memcpy(tmp_ptr, l1_ptr, 4096);
   ^
block-qcow.c:606:7: error: `tmp_ptr2' may be used uninitialized in this
function [-Werror=maybe-uninitialized]
   if (write(s->fd, tmp_ptr2, 4096) != 4096) {
       ^
cc1: all warnings being treated as errors
/home/dario/Sources/xen/xen/xen.git/tools/blktap2/drivers/../../../tools/Rules.mk:89:
 recipe for target 'block-qcow.o' failed
make[5]: *** [block-qcow.o] Error 1

The proper behavior is to return upon allocation failure.
About what to return, 0 seems the best option, looking
at both the function and the call sites.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Backport-requested-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

9 years ago  QEMU_TAG update
Ian Jackson [Mon, 4 Jan 2016 15:36:12 +0000 (15:36 +0000)]
QEMU_TAG update

9 years ago  QEMU_TAG update
Ian Jackson [Fri, 18 Dec 2015 14:58:10 +0000 (14:58 +0000)]
QEMU_TAG update

9 years ago  x86/HVM: avoid reading ioreq state more than once
Jan Beulich [Thu, 17 Dec 2015 13:29:28 +0000 (14:29 +0100)]
x86/HVM: avoid reading ioreq state more than once

Otherwise, especially when the compiler chooses to translate the
switch() to a jump table, unpredictable behavior (and in the jump table
case arbitrary code execution) can result.
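
The pattern used to avoid this, shown here only as a sketch with
simplified types (not the actual Xen code), is to latch the shared state
field once into a local variable before switching on it:

    #include <stdint.h>

    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

    enum ioreq_state { STATE_IOREQ_NONE, STATE_IOREQ_READY, STATE_IORESP_READY };

    struct shared_ioreq { uint8_t state; /* ... */ };

    static void handle_ioreq(struct shared_ioreq *p)
    {
        uint8_t state = ACCESS_ONCE(p->state);   /* single read */

        switch ( state )
        {
        case STATE_IORESP_READY: /* ... */ break;
        case STATE_IOREQ_NONE:   /* ... */ break;
        default:                 /* unexpected state */ break;
        }
    }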

This is XSA-166.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: b452430a4cdfc801fa4bc391aed7522365e1deb6
master date: 2015-12-17 14:22:46 +0100

9 years ago  x86: don't leak ST(n)/XMMn values to domains first using them
Jan Beulich [Thu, 17 Dec 2015 13:28:58 +0000 (14:28 +0100)]
x86: don't leak ST(n)/XMMn values to domains first using them

FNINIT doesn't alter these registers, and hence using it is
insufficient to initialize a guest's initial state.

This is CVE-2015-8555 / XSA-165.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 81818b3f277544535974204f8d840da86fa8a44f
master date: 2015-12-17 14:22:13 +0100

9 years ago  x86/time: fix domain type check in tsc_set_info()
Haozhong Zhang [Tue, 15 Dec 2015 14:40:18 +0000 (15:40 +0100)]
x86/time: fix domain type check in tsc_set_info()

Replace is_hvm_domain() in tsc_set_info() by has_hvm_container_domain()
to keep consistent with other domain type checks in tsc_set_info().

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 3c80d6f3c61eb0f8072f70b0a9a8c8c7adf17572
master date: 2015-12-08 09:46:30 +0100

9 years ago  VT-d: drop unneeded Ivybridge quirk workaround
Jan Beulich [Tue, 15 Dec 2015 14:39:52 +0000 (15:39 +0100)]
VT-d: drop unneeded Ivybridge quirk workaround

We've been told by Intel that server chipsets don't need the workaround
anymore starting with Ivybridge (Xeon E5/E7 v2); the second half of the
workaround was missing anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: a10307b3912e65bbdd9184ba5fe849d252b75f92
master date: 2015-12-03 15:33:10 +0100

9 years ago  evtchn: don't reuse ports that are still "busy"
David Vrabel [Tue, 15 Dec 2015 14:39:20 +0000 (15:39 +0100)]
evtchn: don't reuse ports that are still "busy"

When using the FIFO ABI a guest may close an event channel that is
still LINKED.  If this port is reused, subsequent events may be lost
because they may become pending on the wrong queue.

This could be fixed by requiring guests to only close event channels
that are not linked.  This is difficult since: a) irq cleanup in the
guest may be done in a context that cannot wait for the event to be
unlinked; b) the guest may attempt to rebind a PIRQ whose previous
close is still pending; and c) existing guests already have the
problematic behaviour.

Instead, simply check a port is not "busy" (i.e., it's not linked)
before reusing it.

Guests should still drain any queues for VCPUs that are being
offlined, or the port will become unusable until the VCPU is onlined
and starts processing events again.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 78e24c269b0a4a8b864ece725e6d4209ed95dfa7
master date: 2015-12-02 15:21:46 +0100

9 years ago  x86/ept: remove unnecessary sync after resolving misconfigured entries
David Vrabel [Tue, 15 Dec 2015 14:38:38 +0000 (15:38 +0100)]
x86/ept: remove unnecessary sync after resolving misconfigured entries

When using EPT, type changes are done with the following steps:

1. Set entry as invalid (misconfigured) by settings a reserved memory
type.

2. Flush all EPT and combined translations (ept_sync_domain()).

3. Fixup misconfigured entries as required (on EPT_MISCONFIG vmexits or
when explicitly setting an entry).

Since resolve_misconfig() only updates entries that were misconfigured,
there is no need to invalidate any translations since the hardware
does not cache misconfigured translations (vol 3, section 28.3.2).

Remove the unnecessary (and very expensive) ept_sync_domain() calls.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: cea357ba4b3335ca5279ee9c00838f85575d5842
master date: 2015-12-02 15:19:53 +0100

9 years ago  x86/boot: check for not allowed sections before linking
Daniel Kiper [Tue, 15 Dec 2015 14:37:55 +0000 (15:37 +0100)]
x86/boot: check for not allowed sections before linking

Currently the check for not allowed sections is performed just after
compilation. However, if compilation succeeds and the check fails then a
second build will create xen.gz/xen.efi without any visible error.
This happens because the %.o: %.c recipe created the object file during
the first run and make does not execute this recipe during the second
run. So, look for not allowed sections before linking. This way the
check will be executed every time.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d380b3559734739ae009cd3c0e9aabb5602863e2
master date: 2015-11-25 17:24:36 +0100

9 years ago  x86/vPMU: document as unsupported
Jan Beulich [Tue, 15 Dec 2015 14:37:24 +0000 (15:37 +0100)]
x86/vPMU: document as unsupported

This is XSA-163.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: c03480cf5c4e96fb4afb2237ad0a3cac7162564a
master date: 2015-11-24 18:32:20 +0100

9 years ago  sched: fix locking for insert_vcpu() in credit1 and RTDS
Dario Faggioli [Tue, 15 Dec 2015 14:36:26 +0000 (15:36 +0100)]
sched: fix locking for insert_vcpu() in credit1 and RTDS

The insert_vcpu() hook is handled with inconsistent locking.
In fact, schedule_cpu_switch() calls the hook with runqueue
lock held, while sched_move_domain() relies on the hook
implementations to take the lock themselves (and, since that
is not done in Credit1 and RTDS, such operation is not safe
in those cases).

This is fixed as follows:
 - take the lock in the hook implementations, in specific
   schedulers' code;
 - avoid calling insert_vcpu(), for the idle vCPU, in
   schedule_cpu_switch(). In fact, idle vCPUs are set to run
   immediately, and the various schedulers won't insert them
   in their runqueues anyway, even when explicitly asked to.

While there, still in schedule_cpu_switch(), locking with
_irq() is enough (there's no need to do *_irqsave()).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: ae2f41e3d7e7798537b7ea6dbb9a0c6aeb1179e3
master date: 2015-11-24 14:48:34 +0100

9 years ago  VMX: fix/adjust trap injection
Jan Beulich [Tue, 15 Dec 2015 14:36:01 +0000 (15:36 +0100)]
VMX: fix/adjust trap injection

In the course of investigating the 4.1.6 backport issue of the XSA-156
patch I realized that #DB injection has always been broken, but with it
now getting always intercepted the problem has got worse: Documentation
clearly states that neither DR7.GD nor DebugCtl.LBR get cleared before
the intercept, so this is something we need to do before reflecting the
intercepted exception.

While adjusting this (and also with 4.1.6's strange use of
X86_EVENTTYPE_SW_EXCEPTION for #DB in mind) I further realized that
the special casing of individual vectors shouldn't be done for
software interrupts (resulting from INT $nn).

And then some code movement: Setting of CR2 for #PF can be done in the
same switch() statement (no need for a separate if()), and reading of
intr_info is better done close to the consumption of the variable
(allowing the compiler to generate better code / use fewer registers
for variables).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 81a28f14009f4d8577a81b28dd06f6828112054b
master date: 2015-11-24 12:30:31 +0100

9 years ago  x86/HVM: don't inject #DB with error code
Jan Beulich [Tue, 15 Dec 2015 14:35:28 +0000 (15:35 +0100)]
x86/HVM: don't inject #DB with error code

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper@citrix.com>
master commit: 057e0e72d2a5d598087c5f167ec6a13203a3cf65
master date: 2015-11-12 16:59:18 +0100

9 years ago  x86/vmx: improvements to vmentry failure handling
Andrew Cooper [Tue, 15 Dec 2015 14:34:29 +0000 (15:34 +0100)]
x86/vmx: improvements to vmentry failure handling

Combine the almost identical vm_launch_fail() and vm_resume_fail() into a
single vmx_vmentry_failure().

Re-save all GPRs so that domain_crash() prints the real register values,
rather than the stack frame of the vmx_vmentry_failure() call.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: bbcf0b218f64b1e3e2b66b0fbb623f51d9014e81
master date: 2015-11-03 18:14:02 +0100

9 years ago  x86/PoD: Make p2m_pod_empty_cache() restartable
Andrew Cooper [Tue, 15 Dec 2015 14:33:17 +0000 (15:33 +0100)]
x86/PoD: Make p2m_pod_empty_cache() restartable

This avoids a long running operation when destroying a domain with a
large PoD cache.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 59a5061723ba47c0028cf48487e5de551c42a378
master date: 2015-11-02 15:33:38 +0100

9 years ago  memory: fix XSA-158 fix
Jan Beulich [Wed, 9 Dec 2015 12:55:58 +0000 (13:55 +0100)]
memory: fix XSA-158 fix

For one, the uses of domu_max_order and ptdom_max_order were swapped.

And then gcc warns about an unused result of a __must_check function
in the control part of a conditional expression when both other
expressions can be determined by the compiler to produce the same value
(see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68039), which happens
when HAS_PASSTHROUGH is undefined (i.e. for ARM on 4.4 and older).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: ff841cead287d7913901ba5c4e7628a6958b5bea
master date: 2015-12-09 13:53:13 +0100

9 years ago  QEMU_TAG update
Ian Jackson [Wed, 9 Dec 2015 11:50:11 +0000 (11:50 +0000)]
QEMU_TAG update

9 years ago  libxl: Fix bootloader-related virtual memory leak on pv build failure
Ian Jackson [Wed, 18 Nov 2015 15:34:54 +0000 (15:34 +0000)]
libxl: Fix bootloader-related virtual memory leak on pv build failure

The bootloader may call libxl__file_reference_map(), which mmap's the
pv_kernel and pv_ramdisk into process memory.  This was only unmapped,
however, on the success path of libxl__build_pv().  If there were a
failure anywhere between libxl_bootloader.c:parse_bootloader_result()
and the end of libxl__build_pv(), the calls to
libxl__file_reference_unmap() would be skipped, leaking the mapped
virtual memory.

Ideally this would be fixed by adding the unmap calls to the
destruction path for libxl__domain_build_state.  Unfortunately the
lifetime of the libxl__domain_build_state is opaque, and it doesn't
have a proper destruction path.  But, the only thing in it that isn't
from the gc are these bootloader references, and they are only ever
set for one libxl__domain_build_state, the one which is
libxl__domain_create_state.build_state.

So we can clean up in the exit path from libxl__domain_create_*, which
always comes through domcreate_complete.

Remove the now-redundant unmaps in libxl__build_pv's success path.

This is XSA-160.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 59543a7cc218e9d466810409088f3015f259078c)

9 years ago  memory: fix XENMEM_exchange error handling
Jan Beulich [Tue, 8 Dec 2015 13:09:20 +0000 (14:09 +0100)]
memory: fix XENMEM_exchange error handling

assign_pages() can fail due to the domain getting killed in parallel,
which should not result in a hypervisor crash.

Reported-by: Julien Grall <julien.grall@citrix.com>
Also delete a redundant put_gfn() - all relevant paths leading to the
"fail" label already do this (and there are also paths where it was
plain wrong). All of the put_gfn()-s got introduced by 51032ca058
("Modify naming of queries into the p2m"), including the otherwise
unneeded initializer for k (with even a kind of misleading comment -
the compiler warning could actually have served as a hint that the use
is wrong).

This is CVE-2015-8339 + CVE-2015-8340 / XSA-159.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: eedecb3cf0b2ce1ffc2eb08f3c73f88d42c382c9
master date: 2015-12-08 14:01:43 +0100

9 years ago  memory: split and tighten maximum order permitted in memops
Jan Beulich [Tue, 8 Dec 2015 13:08:41 +0000 (14:08 +0100)]
memory: split and tighten maximum order permitted in memops

Introduce and enforce separate limits for ordinary DomU, DomU with
pass-through device(s), control domain, and hardware domain.

The DomU defaults were determined based on what so far was allowed by
multipage_allocation_permitted().

The x86 hwdom default was chosen based on linux-2.6.18-xen.hg c/s
1102:82782f1361a9 indicating 2Mb is not enough, plus some slack.

The ARM hwdom default was chosen to allow 2Mb (order-9) mappings, plus
a little bit of slack.

This is CVE-2015-8338 / XSA-158.

Reported-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 4a578b316eb98975374d88f28904acf13dbcfac2
master date: 2015-12-08 14:00:33 +0100

9 years ago  Config: Switch to unified qemu trees.
Ian Campbell [Thu, 10 Sep 2015 13:31:34 +0000 (14:31 +0100)]
Config: Switch to unified qemu trees.

Upstream qemu is now in qemu-xen.git and the trad fork is in
qemu-xen-traditional.git.

QEMU_UPSTREAM_REVISION is currently a tag and
QEMU_TRADITIONAL_REVISION is a specific revision, so no changes are
required to those.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Conflicts:
Config.mk
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 78833c04250416f1870c458309d3ac0e5cf915fd)

Conflicts:
Config.mk

9 years ago  update Xen version to 4.5.3-pre
Jan Beulich [Tue, 10 Nov 2015 11:20:59 +0000 (12:20 +0100)]
update Xen version to 4.5.3-pre

9 years ago  x86/HVM: always intercept #AC and #DB
Jan Beulich [Tue, 10 Nov 2015 11:18:37 +0000 (12:18 +0100)]
x86/HVM: always intercept #AC and #DB

Both being benign exceptions, and both being possible to get triggered
by exception delivery, this is required to prevent a guest from locking
up a CPU (resulting from no other VM exits occurring once getting into
such a loop).

The specific scenarios:

1) #AC may be raised during exception delivery if the handler is set to
be a ring-3 one by a 32-bit guest, and the stack is misaligned.

This is CVE-2015-5307 / XSA-156.

Reported-by: Benjamin Serebrin <serebrin@google.com>
2) #DB may be raised during exception delivery when a breakpoint got
placed on a data structure involved in delivering the exception. This
can result in an endless loop when a 64-bit guest uses a non-zero IST
for the vector 1 IDT entry, but even without use of IST the time it
takes until a contributory fault would get raised (results depending
on the handler) may be quite long.

This is CVE-2015-8104 / XSA-156.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: bd2239d9fa975a1ee5bcd27c218ae042cd0a57bc
master date: 2015-11-10 12:03:08 +0100

9 years ago  update Xen version to 4.5.2 RELEASE-4.5.2
Jan Beulich [Tue, 3 Nov 2015 09:11:18 +0000 (10:11 +0100)]
update Xen version to 4.5.2

9 years ago  libxl: adjust PoD target by memory fudge, too
Ian Jackson [Wed, 21 Oct 2015 15:18:30 +0000 (16:18 +0100)]
libxl: adjust PoD target by memory fudge, too

PoD guests need to balloon at least as far as required by PoD, or risk
crashing.  Currently they don't necessarily know what the right value
is, because our memory accounting is (at the very least) confusing.

Apply the memory limit fudge factor to the in-hypervisor PoD memory
target, too.  This will increase the size of the guest's PoD cache by
the fudge factor LIBXL_MAXMEM_CONSTANT (currently 1Mby).  This ensures
that even with a slightly-off balloon driver, the guest will be
stable even under memory pressure.

There are two call sites of xc_domain_set_pod_target that need fixing:

The one in libxl_set_memory_target is straightforward.

The one in xc_hvm_build_x86.c:setup_guest is more awkward.  Simply
setting the PoD target differently does not work because the various
amounts of memory during domain construction no longer match up.
Instead, we adjust the guest memory target in xenstore (but only for
PoD guests).

This introduces a 1Mby discrepancy between the balloon target of a PoD
guest at boot, and the target set by an apparently-equivalent `xl
mem-set' (or similar) later.  This approach is low-risk for a security
fix but we need to fix this up properly in xen.git#staging and
probably also in stable trees.

This is XSA-153.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
(cherry picked from commit 56fb5fd62320eb40a7517206f9706aa9188d6f7b)
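
A rough sketch of the accounting tweak described above (the constant's
value and units are assumptions; the real change touches
libxl_set_memory_target and xc_hvm_build_x86.c:setup_guest):

  #include <stdint.h>

  #define LIBXL_MAXMEM_CONSTANT_KB 1024u   /* assumed: the 1Mby fudge, in kB */

  /* Apply the same fudge to the in-hypervisor PoD target that is already
   * applied to maxmem, so a slightly-off balloon driver cannot push the
   * guest below its PoD cache. */
  static uint64_t pod_target_kb(uint64_t target_memkb)
  {
      return target_memkb + LIBXL_MAXMEM_CONSTANT_KB;
  }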

9 years agox86: rate-limit logging in do_xen{oprof,pmu}_op()
Jan Beulich [Thu, 29 Oct 2015 13:02:38 +0000 (14:02 +0100)]
x86: rate-limit logging in do_xen{oprof,pmu}_op()

Some of the sub-ops are accessible to all guests, and hence should be
rate-limited. In the xenoprof case, just like for XSA-146, include them
only in debug builds. Since the vPMU code is rather new, allow them to
be always present, but downgrade them to (rate limited) guest messages.

This is CVE-2015-7971 / XSA-152.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 95e7415843b94c346e5ba8682665f508f220e04b
master date: 2015-10-29 13:37:19 +0100

9 years agoxenoprof: free domain's vcpu array
Jan Beulich [Thu, 29 Oct 2015 13:02:14 +0000 (14:02 +0100)]
xenoprof: free domain's vcpu array

This was overlooked in fb442e2171 ("x86_64: allow more vCPU-s per
guest").

This is CVE-2015-7969 / XSA-151.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 6e97c4b37386c2d09e09e9b5d5d232e37728b960
master date: 2015-10-29 13:36:52 +0100

9 years agox86/PoD: Eager sweep for zeroed pages
Andrew Cooper [Thu, 29 Oct 2015 13:01:47 +0000 (14:01 +0100)]
x86/PoD: Eager sweep for zeroed pages

Based on the contents of a guest's physical address space,
p2m_pod_emergency_sweep() could degrade into a linear memcmp() from 0 to
max_gfn, which runs non-preemptibly.

As p2m_pod_emergency_sweep() runs behind the scenes in a number of contexts,
making it preemptible is not feasible.

Instead, a different approach is taken.  Recently-populated pages are eagerly
checked for reclamation, which amortises the p2m_pod_emergency_sweep()
operation across each p2m_pod_demand_populate() operation.

Note that in the case that a 2M superpage can't be reclaimed as a superpage,
it is shattered if 4K pages of zeros can be reclaimed.  This is unfortunate
but matches the previous behaviour, and is required to avoid regressions
(domain crash from PoD exhaustion) with VMs configured close to the limit.

This is CVE-2015-7970 / XSA-150.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 101ce53266866144e724ed593173bc4098b300b9
master date: 2015-10-29 13:36:25 +0100
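
The reclaim test itself is simple; here is a standalone sketch of the
per-page check that the eager approach amortises across demand-populate
operations (PAGE_SIZE and the helper name are illustrative):

  #include <stdint.h>
  #include <string.h>

  #define PAGE_SIZE 4096

  /* A page can only be handed back to the PoD cache if it still reads
   * as all zeroes; the eager sweep runs this check on recently
   * populated pages instead of scanning 0..max_gfn in one go. */
  static int page_is_zeroed(const uint8_t page[PAGE_SIZE])
  {
      static const uint8_t zero[PAGE_SIZE];

      return memcmp(page, zero, PAGE_SIZE) == 0;
  }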

9 years agofree domain's vcpu array
Jan Beulich [Thu, 29 Oct 2015 13:00:50 +0000 (14:00 +0100)]
free domain's vcpu array

This was overlooked in fb442e2171 ("x86_64: allow more vCPU-s per
guest").

This is CVE-2015-7969 / XSA-149.

Reported-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
master commit: d46896ebbb23f3a9fef2eb6066ae614fd1acfd96
master date: 2015-10-29 13:35:40 +0100

9 years agox86: guard against undue super page PTE creation
Jan Beulich [Thu, 29 Oct 2015 12:59:03 +0000 (13:59 +0100)]
x86: guard against undue super page PTE creation

When optional super page support got added (commit bd1cd81d64 "x86: PV
support for hugepages"), two adjustments were missed: mod_l2_entry()
needs to consider the PSE and RW bits when deciding whether to use the
fast path, and the PSE bit must not be removed from L2_DISALLOW_MASK
unconditionally.

This is CVE-2015-7835 / XSA-148.

Reported-by: "栾尚聪(好风)" <shangcong.lsc@alibaba-inc.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: fe360c90ea13f309ef78810f1a2b92f2ae3b30b8
master date: 2015-10-29 13:35:07 +0100
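
A standalone sketch of the fast-path condition being tightened (the flag
values are the architectural x86 page-table bits; the real check in
mod_l2_entry() covers more than shown here):

  #include <stdint.h>

  #define _PAGE_RW  0x002u   /* writable */
  #define _PAGE_PSE 0x080u   /* 2M superpage */

  /* Only skip full re-validation of an L2 entry update if neither the
   * superpage bit nor the writability changes between old and new. */
  static int l2e_fast_path_ok(uint64_t old_l2e, uint64_t new_l2e)
  {
      return ((old_l2e ^ new_l2e) & (_PAGE_PSE | _PAGE_RW)) == 0;
  }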

9 years agoarm: handle races between relinquish_memory and free_domheap_pages
Ian Campbell [Thu, 29 Oct 2015 12:58:38 +0000 (13:58 +0100)]
arm: handle races between relinquish_memory and free_domheap_pages

Primarily this means XENMEM_decrease_reservation from a toolstack
domain.

Unlike x86 we have no requirement right now to queue such pages onto
a separate list: if we hit this race then the other code has already
fully accepted responsibility for freeing this page, and therefore
there is nothing more for relinquish_memory to do.

This is CVE-2015-7814 / XSA-147.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1ef01396fdff88b1c3331a09ca5c69619b90f4ea
master date: 2015-10-29 13:34:17 +0100

9 years agoarm: rate-limit logging from unimplemented PHYSDEVOP and HVMOP.
Ian Campbell [Thu, 29 Oct 2015 12:58:15 +0000 (13:58 +0100)]
arm: rate-limit logging from unimplemented PHYSDEVOP and HVMOP.

These are guest accessible and should therefore be rate-limited.
Moreover, include them only in debug builds.

This is CVE-2015-7813 / XSA-146.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1c0e59ff15764e7b0c59282365974f5b8924ce83
master date: 2015-10-29 13:33:38 +0100

9 years agoarm: Support hypercall_create_continuation for multicall
Julien Grall [Thu, 29 Oct 2015 12:57:48 +0000 (13:57 +0100)]
arm: Support hypercall_create_continuation for multicall

Multicall for ARM has been supported since commit f0dbdc6 "xen: arm: fully
implement multicall interface.". However, if a hypercall within a multicall
requires preemption, it will crash the host:

(XEN) Xen BUG at domain.c:347
(XEN) ----[ Xen-4.7-unstable  arm64  debug=y  Tainted:    C ]----
[...]
(XEN) Xen call trace:
(XEN)    [<00000000002420cc>] hypercall_create_continuation+0x64/0x380 (PC)
(XEN)    [<0000000000217274>] do_memory_op+0x1b00/0x2334 (LR)
(XEN)    [<0000000000250d2c>] do_multicall_call+0x114/0x124
(XEN)    [<0000000000217ff0>] do_multicall+0x17c/0x23c
(XEN)    [<000000000024f97c>] do_trap_hypercall+0x90/0x12c
(XEN)    [<0000000000251ca8>] do_trap_hypervisor+0xd2c/0x1ba4
(XEN)    [<00000000002582cc>] guest_sync+0x88/0xb8
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 5:
(XEN) Xen BUG at domain.c:347
(XEN) ****************************************
(XEN)
(XEN) Manual reset required ('noreboot' specified)

Looking at the code, the multicall support looks valid to me, as we only
need to fill call.args[...]. So drop the BUG().

This is CVE-2015-7812 / XSA-145.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 29bcf64ce8bc0b1b7aacd00c8668f255c4f0686c
master date: 2015-10-29 13:31:10 +0100
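
A simplified illustration of what the continuation path has to do for a
multicall (the struct layout and names below are placeholders, not the
Xen ones):

  #include <stdint.h>

  /* Cut-down multicall entry: to continue a preempted hypercall issued
   * through a multicall, the continuation logic only has to rewrite the
   * saved arguments so the call can be restarted -- no reason to BUG(). */
  struct mc_entry {
      uint64_t op;
      uint64_t args[6];
  };

  static void set_multicall_continuation(struct mc_entry *call,
                                         const uint64_t new_args[6])
  {
      for (unsigned int i = 0; i < 6; i++)
          call->args[i] = new_args[i];
  }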

9 years agoRevert "libxl: use correct command line for arm guests."
Ian Jackson [Mon, 26 Oct 2015 11:16:06 +0000 (11:16 +0000)]
Revert "libxl: use correct command line for arm guests."

This reverts commit 9befcd335c21818caaf5c5ab764d31a4006d2800.

This commit breaks the build:
 libxl_arm.c: In function 'libxl__arch_domain_init_hw_description':
 libxl_arm.c:591:76: error: 'state' undeclared (first use in this function)
 libxl_arm.c:591:76: note: each undeclared identifier is reported only once for each function it appears in
 make[3]: *** [libxl_arm.o] Error 1

"state" was introduced in a7511905 "xen: Extend DOMCTL createdomain to
support arch configuration".

On Julien's recommendation: a7511905 ought not to be backported, so
revert this and wait for Ian Campbell to get back.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
CC: Julien Grall <julien.grall@citrix.com>
CC: Ian Campbell <Ian.Campbell@eu.citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
9 years agotools/libxc: arm: Check the index before accessing the bank
Julien Grall [Thu, 17 Sep 2015 17:36:36 +0000 (18:36 +0100)]
tools/libxc: arm: Check the index before accessing the bank

When creating a guest with more than 3GB of memory, the 2 banks will be
used and the loop will overrun. The code will fail later on because
Xen will refuse to populate the region:

domainbuilder: detail: xc_dom_devicetree_mem: called
domainbuilder: detail: xc_dom_mem_init: mem 3096 MB, pages 0xc1800 pages, 4k each
domainbuilder: detail: xc_dom_mem_init: 0xc1800 pages
domainbuilder: detail: xc_dom_boot_mem_init: called
domainbuilder: detail: set_mode: guest xen-3.0-aarch64, address size 64
domainbuilder: detail: xc_dom_malloc            : 14384 kB
domainbuilder: detail: populate_guest_memory: populating RAM @0000000040000000-0000000100000000 (3072MB)
domainbuilder: detail: populate_one_size: populated 0x3/0x3 entries with shift 18
domainbuilder: detail: populate_guest_memory: populating RAM @0000000200000000-0000000201800000 (24MB)
domainbuilder: detail: populate_one_size: populated 0xc/0xc entries with shift 9
domainbuilder: detail: populate_guest_memory: populating RAM @0000007fad41c000-0007fb39dd42c000 (2141954816MB)
domainbuilder: detail: populate_one_size: populated 0x100/0x1e4 entries with shift 0
domainbuilder: detail: populate_guest_memory: Not enough RAM

This is because we are currently accessing the bank before checking the
validity of the index. AFAICT, on Debian Jessie, the compiler (gcc 4.9.2)
assumes that it's not necessary to verify the index because it's used
before. This is a valid assumption because the operands of && are
evaluated from left to right.

Re-order the checks to verify the validity of the index before accessing
the bank.

The problem has been present since the introduction of the multi-bank
feature in commit 45d9867837f099e9eed4189dac5ed39d1fe2ed49 " tools: arm:
prepare domain builder for multiple banks of guest RAM".

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit ed5c2c05cfa557b2391aef9557864d8d958d8d84)
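
A standalone sketch of the reordering (the real loop condition in libxc
differs; this only shows why operand order matters for &&):

  #include <stddef.h>
  #include <stdint.h>

  struct bank { uint64_t base, size; };

  /* && evaluates left to right and short-circuits, so the bounds check
   * must come first or banks[i] is read before i is known to be valid. */
  static int addr_in_next_bank(const struct bank *banks, size_t nr_banks,
                               size_t i, uint64_t addr)
  {
      /* buggy order: return addr >= banks[i].base && i < nr_banks; */
      return i < nr_banks && addr >= banks[i].base;
  }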

9 years agolibxl: use correct command line for arm guests.
Ian Campbell [Thu, 6 Aug 2015 10:55:57 +0000 (11:55 +0100)]
libxl: use correct command line for arm guests.

We need to use libxl__domain_build_state.pv_cmdline in order to pick up
the correct args when using pygrub. libxl_domain_build_info.cmdline is
any args statically configured by the user.

This is consistent with the call to xc_domain_allocate, which takes
the cmdline too (in that case for x86/PV usage).

state->pv_cmdline is also set for non-pygrub guests, since
libxl__bootloader_run propagates info->cmdline if no bootloader is
configured.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 201eac83831d94ba2e9a63a7eed4c128633fafb1)

9 years agox86/NUMA: fix SRAT table processor entry parsing and consumption
Jan Beulich [Wed, 21 Oct 2015 15:33:32 +0000 (17:33 +0200)]
x86/NUMA: fix SRAT table processor entry parsing and consumption

- don't overrun apicid_to_node[] (possible in the x2APIC case)
- don't limit number of processor related SRAT entries we can consume
- make acpi_numa_{processor,x2apic}_affinity_init() as similar to one
  another as possible
- print APIC IDs in hex (to ease matching with other log messages), at
  once making legacy and x2APIC ones distinguishable (by width)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 83281fc9b31396e94c0bfb6550b75c165037a0ad
master date: 2015-10-14 12:46:27 +0200

9 years agox86: hide MWAITX from PV domains
Jan Beulich [Wed, 21 Oct 2015 15:31:02 +0000 (17:31 +0200)]
x86: hide MWAITX from PV domains

Hide it since MWAIT is hidden too. (Linux starting with 4.3 makes use of
that feature, and checks for it without looking at the MWAIT one.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 941cd44324db7eddc46cba4596fa13d505066ccf
master date: 2015-10-13 17:17:52 +0200

9 years agoVT-d: don't suppress invalidation address write when it is zero
Jan Beulich [Wed, 21 Oct 2015 15:30:12 +0000 (17:30 +0200)]
VT-d: don't suppress invalidation address write when it is zero

GFN zero is a valid address, and hence may need invalidation done for
it just like for any other GFN.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: 710942e57fb42ff8f344ca82f6b678f67e38ae63
master date: 2015-10-12 15:58:35 +0200

9 years agodocs: xl.cfg: permissive option is not PV only.
Ian Campbell [Tue, 6 Oct 2015 08:42:35 +0000 (09:42 +0100)]
docs: xl.cfg: permissive option is not PV only.

Since XSA-131 qemu-xen has defaulted to non-permissive mode and the
option was extended to cover that case in 015a373351e5 "tools: libxl:
allow permissive qemu-upstream pci passthrough".

Since I was rewrapping to adjust the text anyway I've split the safety
warning into a separate paragraph to make it more obvious.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Eric <epretorious@yahoo.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 7f25baba1c0942e50757f4ecb233202dbbc057b9)

9 years agotools: libxl: allow permissive qemu-upstream pci passthrough.
Ian Campbell [Mon, 1 Jun 2015 10:32:23 +0000 (11:32 +0100)]
tools: libxl: allow permissive qemu-upstream pci passthrough.

Since XSA-131, qemu-xen restricts access to PCI config space by default. In
order to allow local configuration, the existing libxl_device_pci
"permissive" flag needs to be plumbed through via the new QMP property
added by the XSA-131 patches.

Versions of QEMU prior to XSA-131 did not support this permissive
property, so we only pass it if it is true. Older versions only
supported permissive mode.

qemu-xen-traditional already supports the permissive mode setting via
xenstore.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 015a373351e5c3553f848324ac0f07a9d92883fa)

9 years agox86/p2m-pt: tighten conditions of IOMMU mapping updates
Jan Beulich [Thu, 8 Oct 2015 09:38:44 +0000 (11:38 +0200)]
x86/p2m-pt: tighten conditions of IOMMU mapping updates

Whether the MFN changes does not depend on the new entry being valid
(but solely on the old one), and the need to update or TLB-flush also
depends on permission changes.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 660fd65d5578a95ec5eac522128bba23325179eb
master date: 2015-10-02 13:40:36 +0200

9 years agocredit1: fix tickling when it happens from a remote pCPU
Dario Faggioli [Thu, 8 Oct 2015 09:38:05 +0000 (11:38 +0200)]
credit1: fix tickling when it happens from a remote pCPU

especially if that is also from a different cpupool than the
processor of the vCPU that triggered the tickling.

In fact, it is possible that we get as far as calling vcpu_unblock()-->
vcpu_wake()-->csched_vcpu_wake()-->__runq_tickle() for the vCPU 'vc',
but all while running on a pCPU that is different from 'vc->processor'.

For instance, this can happen when an HVM domain runs in a cpupool,
with a different scheduler than the default one, and issues IOREQs
to Dom0, running in Pool-0 with the default scheduler.
In fact, right in this case, the following crash can be observed:

(XEN) ----[ Xen-4.7-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    7
(XEN) RIP:    e008:[<ffff82d0801230de>] __runq_tickle+0x18f/0x430
(XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor (d1v0)
(XEN) rax: 0000000000000001   rbx: ffff8303184fee00   rcx: 0000000000000000
(XEN) ... ... ...
(XEN) Xen stack trace from rsp=ffff83031fa57a08:
(XEN)    ffff82d0801fe664 ffff82d08033c820 0000000100000002 0000000a00000001
(XEN)    0000000000006831 0000000000000000 0000000000000000 0000000000000000
(XEN) ... ... ...
(XEN) Xen call trace:
(XEN)    [<ffff82d0801230de>] __runq_tickle+0x18f/0x430
(XEN)    [<ffff82d08012348a>] csched_vcpu_wake+0x10b/0x110
(XEN)    [<ffff82d08012b421>] vcpu_wake+0x20a/0x3ce
(XEN)    [<ffff82d08012b91c>] vcpu_unblock+0x4b/0x4e
(XEN)    [<ffff82d080167bd0>] vcpu_kick+0x17/0x61
(XEN)    [<ffff82d080167c46>] vcpu_mark_events_pending+0x2c/0x2f
(XEN)    [<ffff82d08010ac35>] evtchn_fifo_set_pending+0x381/0x3f6
(XEN)    [<ffff82d08010a0f6>] notify_via_xen_event_channel+0xc9/0xd6
(XEN)    [<ffff82d0801c29ed>] hvm_send_ioreq+0x3e9/0x441
(XEN)    [<ffff82d0801bba7d>] hvmemul_do_io+0x23f/0x2d2
(XEN)    [<ffff82d0801bbb43>] hvmemul_do_io_buffer+0x33/0x64
(XEN)    [<ffff82d0801bc92b>] hvmemul_do_pio_buffer+0x35/0x37
(XEN)    [<ffff82d0801cc49f>] handle_pio+0x58/0x14c
(XEN)    [<ffff82d0801eabcb>] vmx_vmexit_handler+0x16b3/0x1bea
(XEN)    [<ffff82d0801efd21>] vmx_asm_vmexit_handler+0x41/0xc0

In this case, pCPU 7 is not in Pool-0, while the (Dom0's) vCPU being
woken is. pCPU 7's pool has a different scheduler than credit, but it
is, however, right from pCPU 7 that we are waking the Dom0's vCPUs.
Therefore, the current code tries to access csched_balance_mask for
pCPU 7, but that is not defined, and hence the Oops.

(Note that, in case the two pools run the same scheduler we see no
Oops, but things are still conceptually wrong.)

Cure things by making the csched_balance_mask macro accept a
parameter for fetching a specific pCPU's mask (instead of always
using smp_processor_id()).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: ea5637968a09a81a64fa5fd73ce49b4ea9789e12
master date: 2015-09-30 14:44:22 +0200
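
A simplified illustration of the interface change (per-CPU data is
modelled as a plain array here; the Xen macro operates on real per-CPU
variables):

  typedef unsigned long cpumask_t;

  #define NR_CPUS 64
  static cpumask_t balance_mask[NR_CPUS];

  /* before: always the mask of the pCPU we happen to be running on
   *   #define csched_balance_mask (balance_mask[smp_processor_id()])
   * after: the caller says which pCPU's mask it wants, e.g. the pCPU
   * of the vCPU being woken rather than the waking pCPU
   */
  #define csched_balance_mask(cpu) (balance_mask[cpu])

  static void clear_balance_mask(unsigned int cpu)
  {
      csched_balance_mask(cpu) = 0;
  }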

9 years agox86/p2m-pt: ignore pt-share flag for shadow mode guests
Jan Beulich [Thu, 8 Oct 2015 09:37:19 +0000 (11:37 +0200)]
x86/p2m-pt: ignore pt-share flag for shadow mode guests

There is no page table sharing in shadow mode.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: c0a85795d864dd64c116af661bf676d66ddfd5fc
master date: 2015-09-29 13:56:03 +0200

9 years agox86/p2m-pt: delay freeing of intermediate page tables
Jan Beulich [Thu, 8 Oct 2015 09:36:52 +0000 (11:36 +0200)]
x86/p2m-pt: delay freeing of intermediate page tables

Old intermediate page tables must be freed only after IOMMU side
updates/flushes have got carried out.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 960265fbd878cdc9841473b755e4ccc9eb1942d2
master date: 2015-09-29 13:55:34 +0200

9 years agox86/EPT: tighten conditions of IOMMU mapping updates
Jan Beulich [Thu, 8 Oct 2015 09:35:55 +0000 (11:35 +0200)]
x86/EPT: tighten conditions of IOMMU mapping updates

Permission changes should also result in updates or TLB flushes.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 6c0e4ad60850032c9bbd5d18b8446421c97e08e4
master date: 2015-09-29 10:25:29 +0200

9 years agovt-d: fix IM bit mask and unmask of Fault Event Control Register
Quan Xu [Thu, 8 Oct 2015 09:34:56 +0000 (11:34 +0200)]
vt-d: fix IM bit mask and unmask of Fault Event Control Register

Bit 0:29 in Fault Event Control Register are 'Reserved and Preserved',
software cannot write 0 to it unconditionally. Software must preserve
the value read for writes.

Signed-off-by: Quan Xu <quan.xu@intel.com>
Acked-by: Yang Zhang <yang.z.zhang@intel.com>
vt-d: fix IM bit unmask of Fault Event Control Register in init_vtd_hw()

Bit 0:29 in Fault Event Control Register are 'Reserved and Preserved',
software cannot write 0 to it unconditionally. Software must preserve
the value read for writes.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Quan Xu <quan.xu@intel.com>
master commit: 86f3ff9fc4cc3cb69b96c1de74bcc51f738fe2b9
master date: 2015-09-25 09:08:22 +0200
master commit: 26b300bd727ef00a8f60329212a83c3b027a48f7
master date: 2015-09-25 18:03:04 +0200
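
A standalone sketch of the read-modify-write pattern the fix switches to
(the IM bit position follows the VT-d spec; the MMIO accessors are
simplified to a plain pointer):

  #include <stdint.h>

  #define DMA_FECTL_IM (1u << 31)   /* interrupt mask bit of FECTL_REG */

  /* Bits 0:29 are "Reserved and Preserved": read the register, change
   * only the IM bit, and write everything else back unmodified instead
   * of writing 0 unconditionally. */
  static void fectl_set_mask(volatile uint32_t *fectl, int mask)
  {
      uint32_t v = *fectl;

      if (mask)
          v |= DMA_FECTL_IM;
      else
          v &= ~DMA_FECTL_IM;

      *fectl = v;
  }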

9 years agoxen/xsm: Make p->policyvers be a local variable (ver) to shut up GCC 5.1.1 warnings.
Konrad Rzeszutek Wilk [Thu, 8 Oct 2015 09:33:28 +0000 (11:33 +0200)]
xen/xsm: Make p->policyvers be a local variable (ver) to shut up GCC 5.1.1 warnings.

policydb.c: In function ‘user_read’:
policydb.c:1443:26: error: ‘buf[2]’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
         usrdatum->bounds = le32_to_cpu(buf[2]);
                          ^
cc1: all warnings being treated as errors

Which (as Andrew mentioned) is because GCC cannot assume
that 'p->policyvers' has the same value between checks.

We make it local, optimize the name to 'ver' and the warnings go away.
We also update another call site with this modification to
make it more in line with the rest of the functions.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
master commit: 6a2f81459e1455d65a9a6f78dd2a0d0278619680
master date: 2015-09-22 12:09:03 -0400
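
The shape of the change, reduced to a standalone sketch (the struct and
the version threshold are placeholders; the real code is the flask
policydb's user_read()):

  struct policydb_like { unsigned int policyvers; };

  /* Snapshot the version into a local so the compiler can prove the
   * value does not change between the bounds check and the later use of
   * the conditionally-read buffer, silencing the warning. */
  static int bounds_field_present(const struct policydb_like *p)
  {
      unsigned int ver = p->policyvers;   /* local copy, as in the patch */

      return ver >= 24;   /* placeholder version threshold */
  }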

9 years agoxen/arm: vgic-v2: Map the GIC virtual CPU interface with the correct size
Julien Grall [Fri, 25 Sep 2015 13:10:06 +0000 (14:10 +0100)]
xen/arm: vgic-v2: Map the GIC virtual CPU interface with the correct size

On GICv2, the GIC virtual CPU interface is at minimum 8KB. Due to a
necessary quirk for GICs using a 64KB stride, we are mapping the
region in two steps.
The first mapping is 4KB and the second one is 8KB, i.e. 12KB in total,
although the minimum supported (and widely used) size is 8KB. This means
that we are mapping 4KB more than necessary into any guest using GICv2.

While this looks scary at first glance, the GIC virtual CPU interface is
most frequently at the end of the GIC I/O region. So we will most likely
map an unused I/O region, or a mirrored version of GICV for platforms
using a 64KB stride.

Nonetheless, fix the second mapping to only map 4KB.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(Backported from 493a67ee4a3fd9420e94fa2cf108e2a27961202b)

9 years agoxen/arm: vgic: Correctly emulate write when byte is used
Julien Grall [Tue, 22 Sep 2015 20:18:48 +0000 (21:18 +0100)]
xen/arm: vgic: Correctly emulate write when byte is used

When a guest is writing a byte, the value will be located in bits[7:0]
of the register.

However, the current implementation expects the byte at the Nth
byte of the register, where N = address & 4.

When the address is not 4-byte aligned, the corresponding byte in the
internal state will therefore always be set to zero rather than updated.

Note that byte accesses are only used for GICD_IPRIORITYR and
GICD_ITARGETSR, so the worst that could happen is not setting the
priority correctly and ignoring the target vCPU written.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 3f214fea76acc6cbc1101fe1815cee795483a67d)
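
A standalone sketch of the corrected byte-lane handling (the register is
modelled as a plain uint32_t; the real emulation works on the vGIC's
internal state):

  #include <stdint.h>

  /* The guest's byte is always in bits[7:0] of the write; it must land
   * in the byte selected by the low two address bits ("addr & 3", not
   * "addr & 4"), leaving the other bytes of the register untouched. */
  static uint32_t emulate_byte_write(uint32_t reg, uint64_t addr, uint32_t val)
  {
      unsigned int shift = (unsigned int)(addr & 3) * 8;

      reg &= ~(0xffu << shift);
      reg |= (val & 0xffu) << shift;

      return reg;
  }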

9 years agoxen: arm: bootfdt: Avoid reading off the front of *_cells array
Ian Campbell [Thu, 16 Jul 2015 08:50:07 +0000 (09:50 +0100)]
xen: arm: bootfdt: Avoid reading off the front of *_cells array

In device_tree_for_each_node the call to the callback was using
{address,size}_cells[depth - 1], which at depth 0 could read off the
front of the array.

We already handled this correctly in the rest of the loop, so fix up
this instance as well.

Reported-by: Chris (Christopher) Brand <chris.brand@broadcom.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Chris (Christopher) Brand <chris.brand@broadcom.com>
Reviewed-by: Julien Grall <julien.grall@citrix.com>
(cherry picked from commit 989f3939bd16a0e1669c179b6c5c264812a25fc2)
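
A minimal sketch of the guard (the default value and helper shape are
assumptions; the actual fix mirrors what the rest of the loop already
did):

  #include <stdint.h>

  #define DEFAULT_CELLS 2u   /* assumed fallback, for illustration only */

  /* At depth 0 there is no parent entry, so reading cells[depth - 1]
   * would run off the front of the array; fall back to a default. */
  static uint32_t parent_cells(const uint32_t *cells, int depth)
  {
      return depth > 0 ? cells[depth - 1] : DEFAULT_CELLS;
  }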

9 years agoxen: arm: always omit guest user stack in vcpu_show_execution_state
Ian Campbell [Mon, 30 Mar 2015 11:12:35 +0000 (12:12 +0100)]
xen: arm: always omit guest user stack in vcpu_show_execution_state

Using !usr_mode(regs) only catches arm32 usr mode and not arm64 user
mode, switch to psr_mode_is_user instead.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit ceb7d6b66de21810a837389911eb2f40c1ca6222)

9 years agoxen: arm: handle accesses to CNTP_CVAL_EL0
Ian Campbell [Mon, 30 Mar 2015 11:12:24 +0000 (12:12 +0100)]
xen: arm: handle accesses to CNTP_CVAL_EL0

All OSes we have run on top of Xen use CNTP_TVAL_EL0 but for
completeness we really should handle CVAL too.

In vtimer_emulate_cp64 pull the propagation of the 64-bit result into
two 32-bit registers out of the switch to avoid duplicating for every
register. We also need to initialise x now since previously the only
implemented register was R/O.

While adding HSR_SYSREG_CNTP_CVAL_EL0 also move
HSR_SYSREG_CNTP_CTL_EL0 so it is sorted correctly.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit 9cfa8ebe2e82d38c2e0c32fa23ea920a43e414ca)

9 years agoxen: arm: correctly handle vtimer traps from userspace
Ian Campbell [Mon, 30 Mar 2015 11:12:23 +0000 (12:12 +0100)]
xen: arm: correctly handle vtimer traps from userspace

Previously 32-bit userspace on 32-bit kernel and 64-bit userspace on
64-bit kernel could access these registers irrespective of whether the
kernel had configured them to be allowed to. To fix this:

 - Userspace access to CNTP_CTL_EL0 and CNTP_TVAL_EL0 should be gated
   on CNTKCTL_EL1.EL0PTEN.
 - Userspace access to CNTPCT_EL0 should be gated on
   CNTKCTL_EL1.EL0PCTEN.

When we do not handle an access we now silently inject an undef even
in debug mode since the debugging messages are not helpful (we have
handled the access, by explicitly choosing not to).

The usermode accessibility check is rather repetitive, so a helper
macro is introduced.

Since HSR_EC_CP15_64 cannot be taken from a guest in AArch64 mode
except due to a hardware bug, switch the associated check to a BUG_ON
(which will be switched to something more appropriate in a subsequent
patch).

Fix a coding style issue in HSR_CPREG64(CNTPCT) while touching similar
code.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit d306211e2131eb2d160522f21f21fceaa9dd054c)

Conflicts:
xen/arch/arm/traps.c
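
A simplified sketch of the access-control decision described above (the
two flag values are placeholders; the real bits live in CNTKCTL_EL1 as
specified in the ARM ARM):

  #include <stdbool.h>
  #include <stdint.h>

  #define EL0PCTEN_FLAG 0x1u   /* placeholder: EL0 may read the counter */
  #define EL0PTEN_FLAG  0x2u   /* placeholder: EL0 may access the timer */

  /* Kernel-mode accesses are always emulated; user-mode accesses are
   * only emulated when the guest kernel enabled them, otherwise an
   * undefined-instruction exception is injected instead. */
  static bool vtimer_access_allowed(uint32_t cntkctl, bool from_user,
                                    bool counter_read)
  {
      if (!from_user)
          return true;

      return counter_read ? (cntkctl & EL0PCTEN_FLAG) != 0
                          : (cntkctl & EL0PTEN_FLAG) != 0;
  }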

9 years agox86/sysctl: don't clobber memory if NCAPINTS > ARRAY_SIZE(pi->hw_cap)
Andrew Cooper [Wed, 23 Sep 2015 07:08:49 +0000 (09:08 +0200)]
x86/sysctl: don't clobber memory if NCAPINTS > ARRAY_SIZE(pi->hw_cap)

There is no current problem, as both NCAPINTS and pi->hw_cap are 8 entries,
but the limit should be calculated appropriately so as to avoid hypervisor
stack corruption if the two do get out of sync.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c373b912e74659f0e0898ae93e89513694cfd94e
master date: 2015-09-16 11:22:00 +0200
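
A standalone sketch of the defensive copy (array and symbol names are
placeholders standing in for NCAPINTS and pi->hw_cap):

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

  /* Clamp to whichever side is smaller so a future mismatch between the
   * source and destination sizes cannot overrun the destination. */
  static void copy_hw_cap(uint32_t *dst, size_t dst_words,
                          const uint32_t *src, size_t src_words)
  {
      size_t n = dst_words < src_words ? dst_words : src_words;

      memcpy(dst, src, n * sizeof(*dst));
  }
  /* e.g.: copy_hw_cap(hw_cap, ARRAY_SIZE(hw_cap), caps, nr_cap_words); */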

9 years agox86/MSI: fail if no hardware support
Jan Beulich [Wed, 23 Sep 2015 07:08:22 +0000 (09:08 +0200)]
x86/MSI: fail if no hardware support

This is to guard against buggy callers (luckily Dom0 only) invoking
the respective hypercall for a device not being MSI-capable.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c7d5d5d8ea1ecbd6ef8b47dace4dec825f0f6e48
master date: 2015-09-16 11:20:27 +0200

9 years agox86/p2m: fix mismatched unlock
Jan Beulich [Wed, 23 Sep 2015 07:07:52 +0000 (09:07 +0200)]
x86/p2m: fix mismatched unlock

Luckily, due to gfn_unlock() currently mapping to p2m_unlock(), this is
only a cosmetic issue right now.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 1f180822ad3fe83fe293393ec175f14ded98f082
master date: 2015-09-14 13:39:19 +0200

9 years agox86/hvm: fix saved pmtimer and hpet values
Kouya Shimura [Wed, 23 Sep 2015 07:07:18 +0000 (09:07 +0200)]
x86/hvm: fix saved pmtimer and hpet values

The ACPI PM timer is sometimes broken on live migration.
This is because vcpu->arch.hvm_vcpu.guest_time is always zero in modes other
than "delay for missed ticks mode". Even in "delay for missed ticks mode",
the vcpu's guest_time field is not valid (i.e. zero) when
the state of the vcpu is "blocked" (see the pt_save_timer function).

The original author (Tim Deegan) of pmtimer_save() must have intended
it to save the last scheduled time of the vcpu. Unfortunately that
already implied this bug. FYI, there was no timer mode other than
"delay for missed ticks mode" back then.

For consistency with HPET, pmtimer_save() should refer to hvm_get_guest_time()
to update the counter, just as hpet_save() does.

Without this patch, the clock of windows server 2012R2 without HPET
might leap forward several minutes on live migration.

Signed-off-by: Kouya Shimura <kouya@jp.fujitsu.com>
Retain use of ->arch.hvm_vcpu.guest_time when non-zero. Do the inverse
adjustment for vHPET.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Kouya Shimura <kouya@jp.fujitsu.com>
master commit: 244582a01dcb49fa30083725964a066937cc94f2
master date: 2015-09-11 16:24:56 +0200
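
The adjustment retained in the backport boils down to a small fallback,
sketched here with illustrative names (the real code uses
hvm_get_guest_time() and ->arch.hvm_vcpu.guest_time):

  #include <stdint.h>

  /* Prefer the vCPU's stored guest_time when it is non-zero; otherwise
   * fall back to the current guest time so the saved counter is not
   * stuck at zero across save/restore. */
  static uint64_t time_for_save(uint64_t stored_guest_time,
                                uint64_t current_guest_time)
  {
      return stored_guest_time ? stored_guest_time : current_guest_time;
  }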

9 years agoefi: introduce efi_arch_flush_dcache_area
Stefano Stabellini [Wed, 23 Sep 2015 07:06:00 +0000 (09:06 +0200)]
efi: introduce efi_arch_flush_dcache_area

Objects loaded by FileHandle->Read need to be flushed from dcache,
otherwise copy_from_paddr will read stale data when copying the kernel,
causing a failure to boot.

Introduce efi_arch_flush_dcache_area and call it from read_file.

This commit introduces no functional changes on x86.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 0d6a3c755374f04f6dd25373da28291a8f35bede
master date: 2015-09-09 15:29:06 +0200

9 years agolibxl: handle read-only drives with qemu-xen
Stefano Stabellini [Tue, 22 Sep 2015 15:56:18 +0000 (16:56 +0100)]
libxl: handle read-only drives with qemu-xen

The current libxl code doesn't deal with read-only drives at all.

Upstream QEMU and qemu-xen only support read-only cdrom drives: make
sure to specify "readonly=on" for cdrom drives and return error in case
the user requested a non-cdrom read-only drive.

This is XSA-142, discovered by Lin Liu
(https://bugzilla.redhat.com/show_bug.cgi?id=1257893).

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Backport to Xen 4.5 and earlier, apropos of report and review from
Michael Young.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
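
A rough sketch of the resulting policy (the option syntax follows
upstream QEMU's -drive; the exact string libxl builds differs):

  #include <stddef.h>
  #include <stdio.h>

  /* cdroms get ",readonly=on"; a read-only request for anything else is
   * rejected because qemu-xen cannot honour it. */
  static int format_drive(char *buf, size_t len, const char *file,
                          int is_cdrom, int readwrite)
  {
      if (!readwrite && !is_cdrom)
          return -1;

      snprintf(buf, len, "file=%s,media=%s%s", file,
               is_cdrom ? "cdrom" : "disk",
               readwrite ? "" : ",readonly=on");
      return 0;
  }
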
9 years agolibxl: Increase device model startup timeout to 1min.
Anthony PERARD [Tue, 7 Jul 2015 15:09:13 +0000 (16:09 +0100)]
libxl: Increase device model startup timeout to 1min.

On a busy host, QEMU may take more than 10s to load and start.

This is likely due to a bug in Linux where the I/O subsystem sometimes
produces high latency under load, resulting in QEMU taking a long time to
load every single dynamic library.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 9acfbe14d7261b03e3b3f4dc3c850ba2a7093e1f)

Conflicts:
tools/libxl/libxl_internal.h