Tim Deegan [Mon, 30 Sep 2013 12:23:33 +0000 (14:23 +0200)]
x86/mm/shadow: Fix initialization of PV shadow L4 tables.
Shadowed PV L4 tables must have the same Xen mappings as their
unshadowed equivalent. This is done by copying the Xen entries
verbatim from the idle pagetable, and then using guest_l4_slot()
in the SHADOW_FOREACH_L4E() iterator to avoid touching those entries.
adc5afbf1c70ef55c260fb93e4b8ce5ccb918706 (x86: support up to 16Tb)
changed the definition of ROOT_PAGETABLE_XEN_SLOTS to extend right to
the top of the address space, which causes the shadow code to
copy Xen mappings into guest-kernel-address slots too.
In the common case, all those slots are zero in the idle pagetable,
and no harm is done. But if any slot above #271 is non-zero, Xen will
crash when that slot is later cleared (it attempts to drop
shadow-pagetable refcounts on its own L4 pagetables).
Fix by using the new ROOT_PAGETABLE_PV_XEN_SLOTS when appropriate.
Monitor pagetables need the full Xen mappings, so they keep using the
old name (with its new semantics).
This is CVE-2013-4356 / XSA-64.
Signed-off-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f46befdd825c8a459c5eb21adb7d5b0dc6e30ad5
master date: 2013-09-30 14:18:25 +0200
Ignoring them generally implies using uninitialized data and, in all
but two of the cases dealt with here, potentially leaking hypervisor
stack contents to guests.
This is CVE-2013-4355 / XSA-63.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6bb838e7375f5b031e9ac346b353775c90de45dc
master date: 2013-09-30 14:17:46 +0200
Jan Beulich [Fri, 27 Sep 2013 09:59:54 +0000 (11:59 +0200)]
x86/HVM: refuse doing string operations in certain situations
We shouldn't do any acceleration for
- "rep movs" when either side is passed through MMIO or when both sides
are handled by qemu
- "rep ins" and "rep outs" when the memory operand is any kind of MMIO
Jan Beulich [Fri, 27 Sep 2013 09:59:14 +0000 (11:59 +0200)]
x86/HVM: linear address must be canonical for the whole accessed range
... rather than just for the first byte.
While at it, also
- make the real mode case at least do a wrap around check
- drop the mis-named "gpf" label (we're not generating faults here)
and use in-place returns instead
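The whole-range check can be sketched roughly as below. This is a minimal illustration assuming 48-bit virtual addresses; range_is_canonical() is a hypothetical name and not the function the patch touches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not Xen's code. With 48-bit virtual addresses,
 * bits 63..47 of a canonical address are all zeros or all ones.
 */
static bool range_is_canonical(uint64_t addr, uint64_t bytes)
{
    uint64_t last, top_first, top_last;

    assert(bytes > 0);
    last = addr + bytes - 1;
    top_first = addr >> 47;
    top_last = last >> 47;

    /* last < addr means the range wrapped past the top of the address space. */
    return last >= addr && top_first == top_last &&
           (top_first == 0 || top_first == 0x1ffff);
}
```

Checking only the first byte would accept an access that starts just below the non-canonical hole and runs into it; comparing both ends rejects it.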
sched_credit: filter node-affinity mask against online cpus
in _csched_cpu_pick(), as not doing so may result in the domain's
node-affinity mask (as retrieved by csched_balance_cpumask() )
and online mask (as retrieved by cpupool_scheduler_cpumask() )
having an empty intersection.
Therefore, when attempting a node-affinity load balancing step
and running this:
    ...
    /* Pick an online CPU from the proper affinity mask */
    csched_balance_cpumask(vc, balance_step, &cpus);
    cpumask_and(&cpus, &cpus, online);
    ...
we end up with an empty cpumask (in cpus). At this point, in
the following code:
    ....
    /* If present, prefer vc's current processor */
    cpu = cpumask_test_cpu(vc->processor, &cpus)
            ? vc->processor
            : cpumask_cycle(vc->processor, &cpus);
    ....
an ASSERT (from inside cpumask_cycle() ) triggers like this:
It is for example sufficient to have a domain with node-affinity
to NUMA node 1 running, and issuing a `xl cpupool-numa-split'
would make the above happen. That is because, by default, all
the existing domains remain assigned to the first cpupool, and
it now (after the cpupool-numa-split) only includes NUMA node 0.
This change prevents that by generalizing the function used
for figuring out whether a node-affinity load balancing step
is legit or not. This way we can, in _csched_cpu_pick(),
figure out early enough that the mask would end up empty,
skip the step altogether and avoid the splat.
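The idea of skipping the node-affinity step when it cannot yield a CPU can be sketched with a toy 64-bit mask. Everything below (the cpumask_t typedef, node_affinity_step_usable(), pick_cpu()) is an illustrative assumption and not the scheduler's real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cpumask_t;   /* toy mask: bit n stands for CPU n */

/* Hypothetical check: can the node-affinity step find an online CPU? */
static bool node_affinity_step_usable(cpumask_t node_affinity, cpumask_t online)
{
    return (node_affinity & online) != 0;
}

/* Lowest set bit of a mask; the caller must pass a non-empty mask. */
static int first_cpu(cpumask_t m)
{
    int cpu = 0;

    assert(m != 0);
    while (!(m & 1)) {
        m >>= 1;
        cpu++;
    }
    return cpu;
}

/*
 * Try the node-affinity step first, but only if it cannot end up empty;
 * otherwise fall back to the plain vcpu-affinity step (assumed here to
 * always intersect the online mask).
 */
static int pick_cpu(cpumask_t node_affinity, cpumask_t vcpu_affinity,
                    cpumask_t online)
{
    if (node_affinity_step_usable(node_affinity, online))
        return first_cpu(node_affinity & online);
    return first_cpu(vcpu_affinity & online);
}
```

With node affinity on CPUs 4-7 and a pool that only contains CPUs 0-3, the first step is skipped instead of handing an empty mask to the picker.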
Andrew Cooper [Fri, 27 Sep 2013 09:54:42 +0000 (11:54 +0200)]
watchdog/crash: Always disable watchdog in console_force_unlock()
Depending on the state of the conring and serial_tx_buffer,
console_force_unlock() can be a long running operation, usually because of
serial_start_sync()
XenServer testing has found a reliable case where console_force_unlock() on
one PCPU takes long enough for another PCPU to timeout due to the watchdog
(such as waiting for a tlb flush callin).
The watchdog timeout causes the second PCPU to repeat the
console_force_unlock(), at which point the first PCPU typically fails an
assertion in spin_unlock_irqrestore(&port->tx_lock) (because the tx_lock has
been unlocked behind itself).
console_force_unlock() is only on emergency paths, so one way or another the
host is going down. Disable the watchdog before forcing the console lock to
help prevent having pcpus competing with each other to bring the host down.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 7b9fa702ca323164d6b49e8b639a57f880454a8c
master date: 2013-08-13 14:31:01 +0200
Andrew Cooper [Fri, 27 Sep 2013 09:53:26 +0000 (11:53 +0200)]
xen/conring: Write to console ring even if console lock is busted
console_lock_busted gets set when an NMI/MCE/Double Fault handler decides to
bring Xen down in an emergency. conring_puts() cannot block and does
not have problematic interactions with the console_lock.
Therefore, choosing to not put the string into the console ring simply means
that the kexec environment can't find any panic() message caused by an IST
interrupt, which is unhelpful for debugging purposes.
In the case that two pcpus fight with console_force_unlock(), having slightly
garbled strings in the console ring is far more useful than having nothing at
all.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Matt Wilson <msw@amazon.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 66450c1d1ab3c4480bbba949113b95d1ab6a943a
master date: 2013-08-06 17:45:00 +0200
Jan Beulich [Wed, 25 Sep 2013 08:54:30 +0000 (10:54 +0200)]
x86/xsave: initialize extended register state when guests enable it
Till now, when setting previously unset bits in XCR0 we wouldn't touch
the active register state, thus leaving in the newly enabled registers
whatever a prior user of it left there, i.e. potentially leaking
information between guests.
This is CVE-2013-1442 / XSA-62.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 63a75ba0de817d6f384f96d25427a05c313e2179
master date: 2013-09-25 10:41:25 +0200
Olaf Hering [Mon, 23 Sep 2013 14:28:52 +0000 (16:28 +0200)]
unmodified_drivers: enable unplug per default
Since xen-3.3 an official unplug protocol for emulated hardware is
available in the toolstack. The pvops kernel does the unplug per
default, so it is safe to do it also in the drivers for forward ported
xenlinux.
Currently it's required to load xen-platform-pci with the module
parameter dev_unplug=all, which is cumbersome.
Also recognize the dev_unplug=never parameter, which provides the
default before this patch.
Jan Beulich [Mon, 23 Sep 2013 14:28:21 +0000 (16:28 +0200)]
x86/HVM: properly handle MMIO reads and writes wider than a machine word
Just like real hardware we ought to split such accesses transparently
to the caller. With little extra effort we can at once even handle page
crossing accesses correctly.
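One way to picture the splitting is a helper that yields the size of the next sub-access. This is a sketch under assumptions: an 8-byte machine word, 4k pages, power-of-two chunk sizes, and hypothetical names (next_chunk(), count_chunks()); the emulator's actual code may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BYTES 8u      /* assumed machine word size */
#define PAGE_SIZE  4096u   /* assumed page size */

/* Size of the next sub-access: at most one word, never crossing a page. */
static size_t next_chunk(uint64_t addr, size_t remaining)
{
    size_t chunk = WORD_BYTES;
    size_t to_page_end = PAGE_SIZE - (size_t)(addr & (PAGE_SIZE - 1));

    assert(remaining > 0);
    if (chunk > remaining)
        chunk = remaining;
    if (chunk > to_page_end)
        chunk = to_page_end;
    /* Round down to a power of two by dropping low set bits. */
    while (chunk & (chunk - 1))
        chunk &= chunk - 1;
    return chunk;
}

/* Number of sub-accesses needed for a wide or page-crossing access. */
static unsigned int count_chunks(uint64_t addr, size_t bytes)
{
    unsigned int n = 0;

    while (bytes) {
        size_t c = next_chunk(addr, bytes);

        addr += c;
        bytes -= c;
        n++;
    }
    return n;
}
```

A 16-byte access at a page-aligned address becomes two word-sized accesses, and an 8-byte access starting 4 bytes before a page boundary becomes two 4-byte ones.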
George Dunlap [Mon, 23 Sep 2013 14:27:08 +0000 (16:27 +0200)]
x86/HVM: fix failure path in hvm_vcpu_initialise
It looks like one of the failure cases in hvm_vcpu_initialise jumps to
the wrong label; this could lead to slow leaks if something isn't
cleaned up properly.
I will probably change these labels in a future patch, but I figured
it was better to have this fix separately.
core2_vpmu_dump() was incorrectly setting VPMU_CONTEXT_LOADED when it
was intending to check for it.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
This would have been avoided if the dump function declared all its
pointers "const" - doing this now (also in SVM).
Also fixing some indentation issues at once.
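The class of bug described above (setting a flag where a test was intended) and the protection offered by "const" can be illustrated as follows. The struct layout, flag value and function name are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define VPMU_CONTEXT_LOADED 0x2u   /* illustrative value only */

struct vpmu {
    unsigned int flags;
};

/*
 * Because the pointer is const, an accidental
 *     v->flags |= VPMU_CONTEXT_LOADED;
 * in here would be rejected by the compiler, so a dump routine can
 * only test the flag, never set it.
 */
static bool vpmu_context_loaded(const struct vpmu *v)
{
    return (v->flags & VPMU_CONTEXT_LOADED) != 0;
}
```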
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
master commit: 42c5b1214071d363a52c6356dfe2ed820f500849
master date: 2013-09-16 12:22:20 +0200
Jan Beulich [Mon, 23 Sep 2013 14:23:52 +0000 (16:23 +0200)]
x86: machine_restart() must not call acpi_dmar_reinstate() twice
.. as that function is not idempotent (it always alters the table
checksum). The (generally) duplicate call was a result from it being
made before machine_restart() re-invoking itself on the boot CPU.
Considering that no problem arose so far from the table corruption I
doubt that we need to restore the correct table signature on the
reboot path in general. The only case I can see this as potentially
necessary is the tboot one, hence do the call just in that case.
Signed-off-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 803f9a6cdfeda64beee908576de0ad02d6b0c480
master date: 2013-09-12 17:47:08 +0100
Jan Beulich [Mon, 23 Sep 2013 14:22:47 +0000 (16:22 +0200)]
libxc/x86: fix page table creation for huge guests
The switch-over logic from one page directory to the next was wrong;
it needs to be deferred until we actually reach the last page within
a given region, instead of being done when the last entry of a page
directory gets started with.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 06d086832155fc7f5344e9d108b979de34674d11
master date: 2013-09-12 17:41:04 +0200
Jan Beulich [Mon, 23 Sep 2013 14:21:52 +0000 (16:21 +0200)]
x86: fix memory cut-off when using PFN compression
For one, setup_max_pdx(), when invoked a second time (after SRAT got
parsed), needs to start from the original max_page value again (using
the already adjusted one from the first invocation would not allow the
cut-off boundary to be moved up).
Second, _if_ we need to cut off some part of memory, we must not allow
this to also propagate into the NUMA accounting. Otherwise
cutoff_node() results in nodes_cover_memory() to find some parts of
memory apparently not having a PXM association, causing all SRAT info
to be ignored.
The only possibly problematic consumer of node_spanned_pages (the
meaning of which gets altered here in that it now also includes memory
Xen can't actively make use of) is XEN_SYSCTL_numainfo: At a first
glance the potentially larger reported memory size shouldn't confuse
tool stacks.
And finally we must not put our boot time modules at addresses which
(at that time) can't be guaranteed to be accessible later. This applies
to both the EFI boot loader and the module relocation code.
Bastian Blank [Sun, 11 Aug 2013 20:10:20 +0000 (22:10 +0200)]
tools: xen-mceinj: Add missing return value checks
The return value of vasprintf must be checked. This check is enforced
with the compiler options used in Debian by request and in Ubuntu by
default.
Check the return value and abort on error.
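A checked wrapper in the spirit of this fix might look like the sketch below. xasprintf() is a hypothetical name, not the helper used in xen-mceinj, and vasprintf() itself is a GNU/BSD extension, hence the _GNU_SOURCE define.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: format into a freshly allocated buffer, abort on error. */
static char *xasprintf(const char *fmt, ...)
{
    char *buf = NULL;
    va_list ap;
    int rc;

    va_start(ap, fmt);
    rc = vasprintf(&buf, fmt, ap);
    va_end(ap);

    /* On failure the contents of buf are undefined, so never use them. */
    if (rc < 0) {
        fputs("vasprintf failed\n", stderr);
        abort();
    }
    return buf;
}
```

The caller owns the returned string and releases it with free().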
Signed-off-by: Bastian Blank <waldi@debian.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 1efe90faa31be104a24fe75323429d227eae1d9f)
George Dunlap [Fri, 5 Jul 2013 11:13:54 +0000 (12:13 +0100)]
libxl: Allow network driver domains when run_hotplug_scripts is set
As of commit 05bfd984dfe7014f1f5ea1133608b9bab589c120, hotplug scripts
are not run if backend_domid != LIBXL_TOOLSTACK_DOMID; so there is no reason
to restrict this for network driver domains any more.
Yang Zhang [Thu, 12 Sep 2013 09:20:17 +0000 (11:20 +0200)]
Nested VMX: Clear bit 31 of IA32_VMX_BASIC MSR
Bit 31 of revision_id is set to 1 if VMCS shadowing is enabled. According
to the Intel SDM, bit 31 of the IA32_VMX_BASIC MSR is always 0. So we
cannot set the low 32 bits of IA32_VMX_BASIC to revision_id directly; bit 31
must be cleared first.
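The masking amounts to the following. This is a minimal sketch; the macro and function names are assumptions, not the identifiers used in the nested VMX code.

```c
#include <assert.h>
#include <stdint.h>

#define VMX_BASIC_REVISION_MASK 0x7fffffffu   /* bits 30:0 */

/* Hypothetical helper: low 32 bits of the emulated IA32_VMX_BASIC value. */
static uint32_t vmx_basic_low(uint32_t revision_id)
{
    /* Bit 31 may be set in revision_id but must read as 0 in the MSR. */
    return revision_id & VMX_BASIC_REVISION_MASK;
}
```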
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f3a4eb9253826d1e49e682314c8666b28fa0b717
master date: 2013-09-10 16:41:35 +0200
Jan Beulich [Thu, 12 Sep 2013 09:19:28 +0000 (11:19 +0200)]
x86/xsave: fix migration from xsave-capable to xsave-incapable host
With CPUID features suitably masked this is supposed to work, but was
completely broken (i.e. the case wasn't even considered when the
original xsave save/restore code was written).
First of all, xsave_enabled() wrongly returned the value of
cpu_has_xsave, i.e. not even taking into consideration attributes of
the vCPU in question. Instead this function ought to check whether the
guest ever enabled xsave support (by writing a [non-zero] value to
XCR0). As a result of this, a vCPU's xcr0 and xcr0_accum must no longer
be initialized to XSTATE_FP_SSE (since that's a valid value a guest
could write to XCR0), and the xsave/xrstor as well as the context
switch code need to suitably account for this (by always enforcing at
least this part of the state to be saved/loaded).
This involves undoing large parts of c/s 22945:13a7d1f7f62c ("x86: add
strictly sanity check for XSAVE/XRSTOR") - we need to cleanly
distinguish between hardware capabilities and vCPU used features.
Next both HVM and PV save code needed tweaking to not always save the
full state supported by the underlying hardware, but just the parts
that the guest actually used. Similarly the restore code should bail
not just on state being restored that the hardware cannot handle, but
also on inconsistent save state (inconsistent XCR0 settings or size of
saved state not in line with XCR0).
And finally the PV extended context get/set code needs to use slightly
different logic than the HVM one, as here we can't just key off of
xsave_enabled() (i.e. avoid doing anything if a guest doesn't use
xsave) because the tools use this function to determine host
capabilities as well as read/write vCPU state. The set operation in
particular needs to be capable of cleanly dealing with input that
consists of only the xcr0 and xcr0_accum values (if they're both zero
then no further data is required).
While for things to work correctly both sides (saving _and_ restoring
host) need to run with the fixed code, afaict no breakage should occur
if either side isn't up to date (other than the breakage that this
patch attempts to fix).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Yang Zhang <yang.z.zhang@intel.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 4cc1344447a0458df5d222960f2adf1b65084fa8
master date: 2013-09-09 14:36:54 +0200
Jan Beulich [Thu, 12 Sep 2013 09:18:00 +0000 (11:18 +0200)]
x86/xsave: initialization improvements
- properly validate available feature set on APs
- also validate xsaveopt availability on APs
- properly indicate whether the initialization is on the BSP (we
shouldn't be using "cpu == 0" checks for this)
Jan Beulich [Thu, 12 Sep 2013 09:15:24 +0000 (11:15 +0200)]
xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc()
Not doing this was found to cause problems with sequences of allocation
(multi-page), freeing, and then again allocation of the same page upon
boot when interrupts are still disabled (causing the owner field to be
non-zero, thus making the allocator attempt a TLB flush and, in its
processing, triggering an assertion).
Reported-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 0fbf3208d9c1a568aeeb61d9f4fbca03b1cfa1f8
master date: 2013-09-09 14:34:12 +0200
x86: allow guest to set/clear MSI-X mask bit (try 2)
Guest needs the ability to enable and disable MSI-X interrupts
by setting the MSI-X control bit, for a passed-through device.
Guest is allowed to write MSI-X mask bit only if Xen *thinks*
that mask is clear (interrupts enabled). If the mask is set by
Xen (interrupts disabled), writes to the mask bit by the guest are
ignored.
Currently, a write to MSI-X mask bit by the guest is silently
ignored.
A likely scenario is where we have an 82599 SR-IOV nic passed
through to a guest. From the guest if you do
ifconfig <ETH_DEV> down
ifconfig <ETH_DEV> up
the interrupts remain masked. On VF reset, the mask bit is set
by the controller. At this point, Xen is not aware that mask is set.
However, interrupts are enabled by VF driver by clearing the mask
bit by writing directly to BAR3 region containing the MSI-X table.
From dom0, we can verify that
interrupts are being masked using 'xl debug-keys M'.
Jan Beulich [Thu, 12 Sep 2013 09:14:01 +0000 (11:14 +0200)]
x86/EFI: properly handle run time memory regions outside the 1:1 map
Namely with PFN compression, MMIO ranges that the firmware may need
runtime access to can live in the holes that get shrunk/eliminated by
PFN compression, and hence no mappings would result from simply
copying Xen's direct mapping table's L3 page table entries. Build
mappings for this "manually" in the EFI runtime call 1:1 page tables.
Use the opportunity to also properly identify (via a forcibly undefined
manifest constant) all the disabled code regions associated with it not
being acceptable for us to call SetVirtualAddressMap().
Andrew Cooper [Thu, 12 Sep 2013 08:57:06 +0000 (10:57 +0200)]
x86: Special case __HYPERVISOR_iret rather more when writing hypercall pages
In all cases when a hypercall page is written, __HYPERVISOR_iret is first
written as a regular hypercall, then subsequently rewritten in its special
case.
For VMX and SVM, this means that following the ud2a instruction is 3 bytes of
an imm32 parameter. For a ring3 kernel, this means that following the syscall
instruction is the second half of 'pop %r11'.
For a ring1 kernel, the iret case ends up as the same number of bytes as the
rest of the hypercalls, but it is pointless writing it twice, and is changed
for consistency.
Therefore, skip the loop iteration which would write the incorrect
__HYPERVISOR_iret hypercall. This removes junk machine code from the tail and
makes disassemblers rather more happy when looking at the hypercall page.
Also, a miscellaneous whitespace fix in the comment for ring3 kernel.
Jan Beulich [Mon, 9 Sep 2013 09:51:20 +0000 (11:51 +0200)]
hvmloader: fix SeaBIOS interface
The SeaBIOS ROM image may validly exceed 128k in size, it's only our
interface code that so far assumed that it wouldn't. Remove that
restriction by setting the base address depending on image size.
Add a check to HVM loader so that too big images won't result in silent
guest failure anymore.
Uncomment the intended build-time size check for rombios, moving it
into a function so that it would actually compile.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 5f2875739beef3a75c7a7e8579b6cbcb464e61b3
master date: 2013-09-05 11:47:03 +0200
Xi Xiong [Mon, 9 Sep 2013 09:49:46 +0000 (11:49 +0200)]
xend: fix file descriptor leak in pci utilities
A file descriptor leak was detected after creating multiple domUs with
pass-through PCI devices. This patch fixes the issue.
Signed-off-by: Xi Xiong <xixiong@amazon.com> Reviewed-by: Matt Wilson <msw@amazon.com>
[msw: adjusted commit message] Signed-off-by: Matt Wilson <msw@amazon.com>
master commit: 749019afca4fd002d36856bad002cc11f7d0ddda
master date: 2013-09-03 16:36:52 +0100
Steven Noonan [Mon, 9 Sep 2013 09:49:15 +0000 (11:49 +0200)]
xend: handle extended PCI configuration space when saving state
Newer PCI standards (e.g., PCI-X 2.0 and PCIe) introduce extended
configuration space which is larger than 256 bytes. This patch uses
stat() to determine the amount of space used to correctly save all of
the PCI configuration space. Resets handled by the xen-pciback driver
don't have this problem, as that code correctly handles saving
extended configuration space.
Signed-off-by: Steven Noonan <snoonan@amazon.com> Reviewed-by: Matt Wilson <msw@amazon.com>
[msw: adjusted commit message] Signed-off-by: Matt Wilson <msw@amazon.com>
master commit: 1893cf77992cc0ce9d827a8d345437fa2494b540
master date: 2013-09-03 16:36:47 +0100
I realise that this patch causes a change to the public headers. However I
feel it is justified as:
* All toolstacks used to have to embed the magic string (and almost certainly
still do)
* If by some miracle a new toolstack has started using the new define, it
will continue to work.
* The only intree consumer of the define is hvmloader itself.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 0f4cb23c3ea5b987c49c9a9368e7a0d505ec064f
master date: 2013-08-30 10:40:48 +0200
Andrew Cooper [Mon, 9 Sep 2013 09:47:44 +0000 (11:47 +0200)]
hvmloader/smbios: Correctly count the number of tables written
Fixes regression indirectly introduced by c/s 4d23036e709627
That changeset added some smbios tables which were optional, based on the
toolstack providing appropriate xenstore keys. The do_struct() macro would
unconditionally increment nr_structs, even if a table was not actually
written.
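The counting rule can be sketched as "only count a table if the write pointer moved". The functions below are hypothetical stand-ins for illustration, not the do_struct() macro itself.

```c
#include <assert.h>
#include <stddef.h>

/* Toy table writer: returns the advanced pointer, or p unchanged if skipped. */
static char *write_optional_table(char *p, int present)
{
    if (!present)
        return p;
    *p = 1;          /* pretend a one-byte table was emitted */
    return p + 1;
}

/* Count a structure only when the writer actually produced output. */
static unsigned int count_written(const int *present, size_t n, char *buf)
{
    unsigned int nr_structs = 0;
    char *p = buf;
    size_t i;

    for (i = 0; i < n; i++) {
        char *q = write_optional_table(p, present[i]);

        if (q != p)
            nr_structs++;
        p = q;
    }
    return nr_structs;
}
```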
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 4aa19549e17650b9bfe2b31d7f52a95696d388f0
master date: 2013-08-30 10:40:29 +0200
Jan Beulich [Mon, 9 Sep 2013 09:46:26 +0000 (11:46 +0200)]
x86: AVX instruction emulation fixes
- we used the C4/C5 (first prefix) byte instead of the apparent ModR/M
one as the second prefix byte
- early decoding normalized vex.reg, thus corrupting it for the main
consumer (copy_REX_VEX()), resulting in #UD on the two-operand
instructions we emulate
Also add respective test cases to the testing utility plus
- fix get_fpu() (the fall-through order was inverted)
- add cpu_has_avx2, even if it's currently unused (as in the new test
cases I decided to refrain from using AVX2 instructions in order to
be able to actually run all the tests on the hardware I have)
- slightly tweak cpu_has_avx to more consistently express the outputs
we don't care about (sinking them all into the same variable)
Fix inactive timer list corruption on second S3 resume
init_timer cannot be safely called multiple times on same timer since it does memset(0)
on the structure, erasing the auxiliary member used by linked list code. This breaks
inactive timer list in common/timer.c.
Moved resume_timer initialisation to ns16550_init_postirq, so it's only done once.
Signed-off-by: Ian Campbell <ijc@hellion.org.uk> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 258d27a1d9fb33a490bef1381f52d522225c3dca)
John Liu [Mon, 22 Jul 2013 21:23:10 +0000 (22:23 +0100)]
oxenstored: Protect oxenstored from malicious domains.
Add check logic when reading from the IO ring; if an error happens,
mark the reading connection as "bad". Unless the VM reboots,
oxenstored will not handle messages from this connection any more.
xs_ring_stubs.c: add a more strict check on ring reading
connection.ml, domain.ml: add getter and setter for bad flag
process.ml: if exception raised when reading from domain's ring,
mark this domain as "bad"
xenstored.ml: if a domain is marked as "bad", do not handle it.
Signed-off-by: John Liu <john.liuqiming@huawei.com> Acked-by: David Scott <dave.scott@eu.citrix.com>
(cherry picked from commit 704302ce9404c73cfb687d31adcf67094ab5bb53)
Yang Zhang [Tue, 27 Aug 2013 13:30:20 +0000 (15:30 +0200)]
Nested VMX: Update APIC-v(RVI/SVI) when vmexit to L1
If APIC-v is enabled, all interrupts to L1 are delivered through APIC-v.
But when L2 is running, an external interrupt will cause an L1 vmexit with
reason "external interrupt". L1 will then pick up the interrupt through
vmcs12. When L1 acks the interrupt, since APIC-v is enabled while
L1 is running, the APIC-v hardware will still do vEOI updating. The problem
is that the interrupt was not delivered through the APIC-v hardware, which means
SVI/RVI/vPPR are not set, but the hardware requires them when doing vEOI
updating. The solution is that, when L1 tries to pick up the interrupt
from vmcs12, the hypervisor updates SVI/RVI/vPPR to make
sure the subsequent vEOI and vPPR updates are done correctly.
Also, since the interrupt is delivered through vmcs12, the APIC-v hardware will
not clear vIRR, and the hypervisor needs to clear it before L1 runs.
Yang Zhang [Tue, 27 Aug 2013 13:28:16 +0000 (15:28 +0200)]
Nested VMX: Force check ISR when L2 is running
An external interrupt is allowed to notify the CPU only when it has higher
priority than the interrupt currently in service. With APIC-v, the priority
comparison is done by hardware, and hardware will inject the interrupt into
the VCPU when it recognizes an interrupt. Currently, there is no virtual
APIC-v feature available for L1 to use, so when L2 is running, we still need
to compare interrupt priority with ISR in the hypervisor instead of via hardware.
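The software comparison can be sketched as a priority-class check, where the class is the upper four bits of the vector. This simplified illustration ignores the TPR and uses a hypothetical function name; it is not the vlapic code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper. pending_vec is the highest requested vector (IRR),
 * highest_isr_vec the highest in-service vector, or -1 if none is in
 * service. Only a strictly higher priority class may interrupt.
 */
static bool irq_can_preempt(int pending_vec, int highest_isr_vec)
{
    if (highest_isr_vec < 0)
        return true;
    return (pending_vec >> 4) > (highest_isr_vec >> 4);
}
```

Two vectors in the same class (for example 0x45 and 0x41) do not preempt each other, which is why the class and not the full vector is compared.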
Jan Beulich [Tue, 27 Aug 2013 13:24:31 +0000 (15:24 +0200)]
ACPI: fix acpi_os_map_memory()
Its use of map_domain_page() was entirely wrong. Use __acpi_map_table()
instead for the time being, with locking added as the mappings it
produces get replaced with subsequent invocations. Using locking in
this way is acceptable here since the only two runtime callers are
acpi_os_{read,write}_memory(), which don't leave mappings pending upon
returning to their callers.
Also fix __acpi_map_table()'s first parameter's type - while benign for
unstable, backports to pre-4.3 trees will need this.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
ACPI: use ioremap() in acpi_os_map_memory()
This drops the post-boot use of __acpi_map_table() here again (together
with the somewhat awkward locking), in favor of using ioremap().
Juergen Gross [Thu, 22 Aug 2013 09:28:28 +0000 (11:28 +0200)]
Correct X2-APIC HVM emulation
commit 6859874b61d5ddaf5289e72ed2b2157739b72ca5 ("x86/HVM: fix x2APIC
APIC_ID read emulation") introduced an error for the hvm emulation of
x2apic. Any attempt to write to APIC_ICR MSR will result in a GP fault.
Tim Deegan [Tue, 20 Aug 2013 13:02:57 +0000 (15:02 +0200)]
xen: Add stdbool.h workaround for BSD.
On *BSD, stdbool.h lives in /usr/include, but we don't want to have
that on the search path in case we pick up any headers from the build
host's C libraries.
Copy the equivalent hack already in place for stdarg.h: on all
supported compilers the contents of stdbool.h are trivial, so just
supply the things we need in a xen/stdbool.h header.
Signed-off-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Keir Fraser <keir@xen.org> Tested-by: Patrick Welche <prlw1@cam.ac.uk>
master commit: 7b9685ca4ed2fd723600ce66eb20a6d0c115b6cb
master date: 2013-08-15 22:00:45 +0100
Jan Beulich [Tue, 20 Aug 2013 13:00:13 +0000 (15:00 +0200)]
VT-d: protect against bogus information coming from BIOS
Add checks similar to those done by Linux: The DRHD address must not
be all zeros or all ones (Linux only checks for zero), and capabilities
as well as extended capabilities must not be all ones.
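The sanity checks described above boil down to rejecting the all-zeros and all-ones patterns. The helper names below are assumptions; the real checks live in the VT-d DMAR parsing and IOMMU setup code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: a DRHD base address of all zeros or all ones is bogus. */
static bool drhd_address_plausible(uint64_t addr)
{
    return addr != 0 && addr != UINT64_MAX;
}

/* Hypothetical helper: all-ones register reads suggest nothing is there. */
static bool iommu_caps_plausible(uint64_t cap, uint64_t ecap)
{
    return cap != UINT64_MAX && ecap != UINT64_MAX;
}
```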
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Ben Guthro <benjamin.guthro@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Ben Guthro <benjamin.guthro@citrix.com>
Acked-by: Yang Zhang <yang.z.zhang@intel.com> Acked-by: Xiantao Zhang <xiantao.zhang@intel.com>
master commit: e8e8b030ecf916fea19639f0b6a446c1c9dbe174
master date: 2013-08-14 11:18:24 +0200
x86/AMD: Inject #GP instead of #UD when unable to map vmcb
According to AMD Programmer's Manual vol2, vmrun, vmsave and vmload
should inject #GP instead of #UD when unable to access memory
location for vmcb. Also, the code should make sure that L1 guest
EFER.SVME is not zero. Otherwise, #UD should be injected.
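The resulting decision order can be sketched as below. The enum and function are hypothetical; only the EFER.SVME bit position (bit 12) is taken from the architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EFER_SVME (1u << 12)   /* EFER.SVME: secure virtual machine enable */

enum svm_fault { FAULT_NONE, FAULT_UD, FAULT_GP };

/* Hypothetical helper: which fault, if any, vmrun/vmsave/vmload should raise. */
static enum svm_fault vmcb_insn_fault(uint64_t l1_efer, bool vmcb_mappable)
{
    if (!(l1_efer & EFER_SVME))
        return FAULT_UD;       /* SVM not enabled by the L1 guest */
    if (!vmcb_mappable)
        return FAULT_GP;       /* vmcb memory cannot be accessed */
    return FAULT_NONE;
}
```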
x86/AMD: Fix nested svm crash due to assertion in __virt_to_maddr
Fix assertion in __virt_to_maddr when starting nested SVM guest
in debug mode. Investigation has shown that svm_vmsave/svm_vmload
make use of __pa() with invalid address.
Patrick Welche [Tue, 20 Aug 2013 12:43:32 +0000 (14:43 +0200)]
libelf: Fix typo in header guard macro
s/__LIBELF_PRIVATE_H_/__LIBELF_PRIVATE_H__/
Signed-off-by: Patrick Welche <prlw1@cam.ac.uk> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 0aec8823501f8ee058c1ba673d2ac3e0f3f2e8db
master date: 2013-08-08 12:47:38 +0100
Yang Zhang [Wed, 7 Aug 2013 14:55:37 +0000 (16:55 +0200)]
Nested VMX: Flush TLBs and Caches if paging mode changed
According to the SDM, if the paging mode is changed, then all TLBs and caches
are flushed. This was missed in the nested handling logic. This also fixes the
issue that 64-bit Windows cannot boot on top of L1 KVM.
Jan Beulich [Wed, 7 Aug 2013 14:55:05 +0000 (16:55 +0200)]
x86: refine FPU selector handling code for XSAVEOPT
Some extra tweaks are necessary to deal with the situation of XSAVEOPT
not writing the FPU portion of the save image (due to it detecting that
the register state did not get modified since the last XRSTOR).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Ben Guthro <ben.guthro@gmail.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: c58d9f2f4844c2ce8859a8d0f26a54cd058eb51f
master date: 2013-08-05 18:42:37 +0200
Andrew Cooper [Wed, 7 Aug 2013 14:53:32 +0000 (16:53 +0200)]
x86/time: Update wallclock in shared info when altering domain time offset
domain_set_time_offset() updates d->time_offset_seconds, but does not correct
the wallclock in the shared info, meaning that it is incorrect until the next
XENPF_settime hypercall from dom0 which resynchronises the wallclock for all
domains.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 915a59f25c5eddd86bc2cae6389d0ed2ab87e69e
master date: 2013-07-18 09:16:15 +0200
Jan Beulich [Wed, 7 Aug 2013 14:52:34 +0000 (16:52 +0200)]
x86: don't use destroy_xen_mappings() for vunmap()
Its attempt to tear down intermediate page table levels may race with
map_pages_to_xen() establishing them, and now that
map_domain_page_global() is backed by vmap() this teardown is also
wasteful (as it's very likely to need the same address space populated
again within foreseeable time).
Andrew Cooper [Wed, 7 Aug 2013 14:51:56 +0000 (16:51 +0200)]
x86/cpuidle: Change logging for unknown APIC IDs
Dom0 uses this hypercall to pass ACPI information to Xen. It is not
uncommon for more cpus to be listed in the ACPI tables than are present on the
system, particularly on systems with a common BIOS for 2 and 4 socket server
variants.
As Dom0 does not control the number of entries in the ACPI tables, and is
required to pass everything it finds to Xen, change the logging.
There is now a single unconditional warning for the first unknown ID, and
further warnings if "cpuinfo" is requested by the user on the command line.
Jan Beulich [Wed, 7 Aug 2013 14:49:39 +0000 (16:49 +0200)]
adjust x86 EFI build
While the rule to generate .init.o files from .o ones already correctly
included $(extra-y), the setting of the necessary compiler flag didn't
have the same. With some yet to be posted patch this resulted in build
breakage because of the compiler deciding not to inline a few functions
(which then results in .text not being empty as required for these
object files).
Andrew Cooper [Wed, 7 Aug 2013 14:48:56 +0000 (16:48 +0200)]
x86/mm: Ensure useful progress in alloc_l2_table()
While debugging the issue which turned out to be XSA-58, a printk in this loop
showed that it was quite easy to never make useful progress, because of
consistently failing the preemption check.
One single l2 entry is a reasonable amount of work to do, even if an action is
pending, and also assures forwards progress across repeat continuations.
Tweak the continuation criteria to fail on the first iteration of the loop.
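The tweak can be pictured as a loop that never honours the preemption check on its first iteration. This is a sketch with hypothetical names; the callback stands in for the hypervisor's preemption check, and the two stub callbacks exist only to exercise the loop.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Process entries [start, total). Returns the index to resume from, which
 * is always greater than start when start < total, so every continuation
 * makes progress by at least one entry.
 */
static unsigned int process_entries(unsigned int start, unsigned int total,
                                    bool (*preempt_pending)(void))
{
    unsigned int i;

    for (i = start; i < total; i++) {
        /* Never bail out before at least one entry has been handled. */
        if (i > start && preempt_pending())
            break;
        /* ... validate entry i here ... */
    }
    return i;
}

/* Stub callbacks used only to demonstrate the behaviour. */
static bool always_pending(void) { return true; }
static bool never_pending(void)  { return false; }
```

Even with preemption permanently pending, each invocation advances by one entry, so repeated continuations eventually finish the table.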
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: d3a55d7d9bb518efe08143d050deff9f4ee80ec1
master date: 2013-07-04 10:33:18 +0200
The IOMMU interrupt handling in the bottom half must clear the PPR log interrupt
and event log interrupt bits to re-enable the interrupt. This is done by
writing 1 to the memory mapped register to clear the bit. Due to a hardware bug,
if the driver tries to clear this bit while the IOMMU hardware is also setting
it, the conflict will result in the bit remaining set. If the interrupt
handling code does not make sure to clear this bit, subsequent changes in the
event/PPR logs will no longer generate interrupts, which would result in
buffer overflow. After clearing the bits, the driver must read back
the register to verify.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Adjust to apply on top of heavily modified patch 1. Adjust flow to get away
with a single readl() in each instance of the status register checks.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
master commit: 9eabb0735400e2b6059dfa3f0b47a426f61f570a
master date: 2013-07-02 08:50:41 +0200
iommu/amd: Fix logic for clearing the IOMMU interrupt bits
The IOMMU interrupt bits in the IOMMU status registers are
"read-only, and write-1-to-clear" (RW1C). Therefore, the existing
logic, which reads the register, sets the bit, and then writes back
the value, could accidentally clear certain other bits if they have been set.
The correct logic is to write a value in which only
the interrupt bits are set, leaving the rest as zeros.
This patch also cleans up #define masks as Jan has suggested.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
With iommu_interrupt_handler() properly having had its readl() switched
from status to control register, the subsequent writel() needed to be
switched too (and the RW1C comment there was bogus).
Some of the cleanup went too far - undone.
Further, with iommu_interrupt_handler() now actually disabling the
interrupt sources, they also need to get re-enabled by the tasklet once
it has finished processing the respective log. This also implies re-running
the tasklet so that log entries added between reading the log and
re-enabling the interrupt will get handled in a timely manner.
Finally, guest write emulation to the status register needs to be done
with the RW1C (and RO for all other bits) semantics in mind too.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
master commit: 2823a0c7dfc979db316787e1dd42a8845e5825c0
master date: 2013-07-02 08:49:43 +0200
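The RW1C pitfall described in the commit above can be modelled with a simulated register. This is a sketch under assumptions: the register is a plain variable rather than MMIO, the bit positions are illustrative, and none of the names are taken from the AMD IOMMU driver.

```c
#include <stdint.h>

/* Illustrative bit positions for the two interrupt status bits. */
#define STATUS_EVENT_LOG_INT (1u << 1)
#define STATUS_PPR_LOG_INT   (1u << 6)

/* Simulated RW1C status register: writing 1 clears a bit, writing 0 keeps it. */
uint32_t status_reg;

uint32_t status_read(void) { return status_reg; }
void status_write(uint32_t val) { status_reg &= ~val; }

/* Buggy: the read-modify-write also writes back 1s for other pending bits. */
void clear_bit_rmw(uint32_t bit)
{
    status_write(status_read() | bit);
}

/* Correct: write only the bit to be cleared; every other bit is written as 0. */
void clear_bit_rw1c(uint32_t bit)
{
    status_write(bit);
}
```

With both bits pending, clear_bit_rmw(STATUS_EVENT_LOG_INT) also wipes the PPR log bit, whereas clear_bit_rw1c() leaves it set for its own handler.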
Jan Beulich [Mon, 15 Jul 2013 11:07:08 +0000 (13:07 +0200)]
x86: don't pass negative time to gtime_to_gtsc() (try 2)
This mostly reverts commit eb60be3d ("x86: don't pass negative time to
gtime_to_gtsc()") and instead corrects __update_vcpu_system_time()'s
handling of this_cpu(cpu_time).stime_local_stamp dating back before the
start of a HVM guest (which would otherwise lead to a negative value
getting passed to gtime_to_gtsc(), causing scale_delta() to produce
meaningless output).
Flushing the value to zero was wrong, and printing a message for
something that can validly happen wasn't very useful either.
Andrew Cooper [Tue, 2 Jul 2013 20:02:33 +0000 (21:02 +0100)]
docs: Pull Xen version from canonical location
rather than hard coding it and being wrong every time we branch for a release.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit f487767ad0e58acb6c1ed3cc56daa0fb71b1f23a)
Ian Jackson [Mon, 1 Jul 2013 14:20:28 +0000 (15:20 +0100)]
libxl: suppress device assignment to HVM guest when there is no IOMMU
This in effect copies similar logic from xend: While there's no way to
check whether a device is assigned to a particular guest,
XEN_DOMCTL_test_assign_device at least allows checking whether an
IOMMU is there and whether a device has been assigned to _some_
guest.
For the time being, this should be enough to cover for the missing
error checking/recovery in other parts of libxl's device assignment
paths.
There remains a (functionality-, but not security-related) race in
that the iommu should be set up earlier, but this is too risky a
change for this stage of the 4.3 release.
This is a security issue, XSA-61.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Julien Grall [Thu, 27 Jun 2013 17:13:30 +0000 (18:13 +0100)]
xen/arm: Rework the way to compute dom0 DTB base address
If the DTB is loaded right after the kernel, on some setups Linux will
overwrite the DTB during the decompression step.
To be sure the DTB won't be overwritten by the decompression stage, load
the DTB near the end of the first memory bank and below 4GiB (if the memory
range extends beyond that).
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
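The placement rule from the DTB commit above might be computed along these lines. This is only a sketch: dtb_load_address() is a hypothetical helper, and the 2 MiB alignment is an assumption of the example rather than something stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define GiB(x) ((uint64_t)(x) << 30)

/*
 * Pick a load address for a DTB of dtb_size bytes inside the first memory
 * bank [bank_start, bank_start + bank_size). The 2 MiB alignment is an
 * assumption of this sketch.
 */
uint64_t dtb_load_address(uint64_t bank_start, uint64_t bank_size,
                          uint64_t dtb_size)
{
    const uint64_t align = (uint64_t)2 << 20;
    uint64_t end = bank_start + bank_size;
    uint64_t addr;

    /* Stay below 4 GiB when the bank extends past it. */
    if (end > GiB(4))
        end = GiB(4);

    assert(end > bank_start && dtb_size <= end - bank_start);

    /* Place the DTB at the top of the usable range, aligned down. */
    addr = (end - dtb_size) & ~(align - 1);
    assert(addr >= bank_start);

    return addr;
}
```

For a 1 GiB bank at 0x80000000 and a 64 KiB DTB this yields 0xBFE00000, far from a kernel loaded near the start of the bank.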
Julien Grall [Fri, 28 Jun 2013 11:25:57 +0000 (12:25 +0100)]
xen/arm: gic_shutdown_irq must only disable the right IRQ
When GICD_ICENABLERn is read, every bit set to 1 represents an enabled IRQ.
Currently gic_shutdown_irq:
- reads GICD_ICENABLER
- sets the corresponding bit to 1
- writes back the new value
That means Xen will disable more IRQs than necessary.
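The effect of the read-modify-write sequence listed above can be shown with a simulated register. This is a sketch, not the Xen GIC driver: the register is modelled as a plain variable and all function names are hypothetical.

```c
#include <stdint.h>

/* Simulated enable state behind one GICD_ICENABLERn register (32 IRQs). */
uint32_t enabled_irqs;

/* Reading GICD_ICENABLERn returns 1 for every IRQ that is currently enabled. */
uint32_t icenabler_read(void) { return enabled_irqs; }

/* Writing 1 to a bit disables that IRQ; writing 0 has no effect. */
void icenabler_write(uint32_t val) { enabled_irqs &= ~val; }

/* Buggy: the value read back has a 1 for every enabled IRQ, so writing it
 * back disables all of them, not just the requested one. */
void shutdown_irq_rmw(unsigned int irq)
{
    icenabler_write(icenabler_read() | (1u << (irq % 32)));
}

/* Correct: write only the bit for the IRQ being shut down. */
void shutdown_irq(unsigned int irq)
{
    icenabler_write(1u << (irq % 32));
}
```

Starting with IRQs 0, 1 and 3 enabled, shutdown_irq_rmw(1) leaves none enabled, while shutdown_irq(1) leaves IRQs 0 and 3 untouched.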
Dongxiao Xu [Thu, 27 Jun 2013 15:01:26 +0000 (17:01 +0200)]
nested vmx: Fix the booting of L2 PAE guest
When doing virtual VM entry and virtual VM exit, we need to
synchronize the PAE PDPTR related VMCS registers. With this fix,
we can boot 32-bit PAE L2 guests (Win7 & RHEL6.4) in a "Xen on Xen"
environment.
Andrew Cooper [Thu, 27 Jun 2013 12:01:18 +0000 (14:01 +0200)]
AMD/intremap: Prevent use of per-device vector maps until irq logic is fixed
XSA-36 changed the default vector map mode from global to per-device. This is
because a global vector map does not prevent one PCI device from impersonating
another and launching a DoS on the system.
However, the per-device vector map logic is broken for devices with multiple
MSI-X vectors, which can either result in a failed ASSERT() or misprogramming
of a guest's interrupt remapping tables. The core problem is not trivial to
fix.
In an effort to get AMD systems back to a non-regressed state, introduce a new
type of vector map called per-device-global. This uses per-device vector maps
in the IOMMU, but uses a single used_vector map for the core IRQ logic.
This patch is intended to be removed as soon as the per-device logic is fixed
correctly.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
This grub.cfg from a default Fedora 19 Beta install
caused pygrub failures. The previous pygrub commit
fixed that, so this example file is added for reference.
Signed-off-by: Marcel Mol <marcel@mesa.nl> Acked-by: Ian Campbell <ian.campbell@citrix.com>
pygrub/GrubConf: fix boot problem for fedora 19 grub.cfg (2nd attempt)
Booting a Fedora 19 domU failed because pygrub could not properly
parse the grub.cfg file. This was caused by
set default="${next_entry}"
This statement actually is within an 'if' statement, so maybe it would
be better to skip code within if/fi blocks...
But this patch seems to work fine.
Signed-off-by: Marcel Mol <marcel@mesa.nl> Acked-by: Ian Campbell <ian.campbell@citix.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Ian Murray [Sat, 22 Jun 2013 12:38:11 +0000 (13:38 +0100)]
Xendomains was not correctly suspending domains when a STOP was issued.
The regex was not selecting the { when parsing JSON output of xl list -l.
It was also not selecting (domain when parsing xl list -l when SXP is selected.
Prefixed { with 4 spaces, and removed an extra ( before domain in the regex
string.
Added quotes around the grep strings so the spaces inserted into the string
didn't break the grepping.
This has now been tested against 4.3RC5
Signed-off-by: Ian Murray <murrayie@yahoo.co.uk> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Anthony PERARD [Wed, 26 Jun 2013 15:54:31 +0000 (16:54 +0100)]
libxl: Use QMP cpu-add to hotplug CPU with qemu-xen.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Anthony PERARD [Wed, 26 Jun 2013 15:54:30 +0000 (16:54 +0100)]
libxl: Add "cpu-add" QMP command.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
[ ijc -- rename index parameter to avoid Wshadow due to index(3) in strings.h ]
Andrew Cooper [Mon, 24 Jun 2013 15:47:05 +0000 (16:47 +0100)]
tools/libxc: Fix memory leaks in xc_domain_save()
Introduces outbuf_free() to mirror the currently existing outbuf_init().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Julien Grall [Wed, 26 Jun 2013 13:23:35 +0000 (14:23 +0100)]
libxc: Fix guest boot on ARM after XSA-55
XSA-55 has exposed errors for guest creation on ARM:
- domain virt_base was not defined;
- xc_dom_alloc_segment allocates pfn from 0 instead of the RAM base address.
Jim Fehlig [Tue, 25 Jun 2013 22:02:15 +0000 (16:02 -0600)]
libxl: Fix assignment of devid value returned from libxl__device_nextid
Commit 5420f265 has a misplaced parenthesis that caused devid
to be assigned 1 or 0 based on checking return value of
libxl__device_nextid < 0, e.g.
devid = libxl__device_nextid(...) < 0
This works when only one instance of a given device type exists, but
subsequent devices of the same type will also have a devid = 1 if
libxl__device_nextid succeeds. Fix by checking the value assigned to
devid, e.g.
(devid = libxl__device_nextid(...)) < 0
Signed-off-by: Jim Fehlig <jfehlig@suse.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
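The precedence issue in the devid commit above can be reproduced in isolation. In this sketch next_id() is a hypothetical stand-in for libxl__device_nextid(); the real function takes different arguments.

```c
/* Hypothetical stand-in for libxl__device_nextid(): returns the next free
 * device id, or a negative value on error. */
int next_id(int last_used)
{
    return last_used + 1;
}

/* Buggy form: '<' binds tighter than '=', so devid receives the result of
 * the comparison instead of the id itself. */
int assign_buggy(int last_used)
{
    int devid;

    if ((devid = next_id(last_used) < 0))
        return -1;
    return devid;
}

/* Fixed form: assign first, then compare the assigned value. */
int assign_fixed(int last_used)
{
    int devid;

    if ((devid = next_id(last_used)) < 0)
        return -1;
    return devid;
}
```

assign_buggy() returns the same value for every successful call, so a second device of the same type collides with the first, while assign_fixed() returns distinct ids.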