Ian Jackson [Fri, 14 Jun 2013 15:43:17 +0000 (16:43 +0100)]
libelf: check nul-terminated strings properly
It is not safe to simply take pointers into the ELF and use them as C
pointers. They might not be properly nul-terminated (and the pointers
might be wild).
So we are going to introduce a new function elf_strval for safely
getting strings. This will check that the addresses are in range and
that there is a proper nul-terminated string. Of course it might
discover that there isn't. In that case, it will be made to fail.
This means that elf_note_name might fail, too.
For the benefit of call sites which are just going to pass the value
to a printf-like function, we provide elf_strfmt which returns
"(invalid)" on failure rather than NULL.
In this patch we introduce dummy definitions of these functions. We
introduce calls to elf_strval and elf_strfmt everywhere, and update
all the call sites with appropriate error checking.
There is not yet any semantic change, since before this patch all the
places where we introduce elf_strval dereferenced the value anyway, so
it cannot have been NULL.
In future patches, when elf_strval is made able to return NULL, when it
does so it will mark the elf "broken" so that an appropriate
diagnostic can be printed.
This is part of the fix to a security issue, XSA-55.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Ian Jackson [Fri, 14 Jun 2013 15:43:17 +0000 (16:43 +0100)]
libelf: introduce macros for memory access and pointer handling
We introduce a collection of macros which abstract away all the
pointer arithmetic and dereferences used for accessing the input ELF
and the output area(s). We use the new macros everywhere.
For now, these macros are semantically identical to the code they
replace, so this patch has no functional change.
elf_is_elfbinary is an exception: since it doesn't take an elf*, we
need to handle it differently. In a future patch we will change it to
take, and check, a length parameter. For now we just mark it with a
fixme.
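Purely as an illustration of the shape of such a wrapper (the macro name below is hypothetical; the real names are defined by the patch), a memory-access macro at this stage is just the old pointer arithmetic centralised in one place:

    /* Hypothetical sketch: semantically identical to the open-coded
     * arithmetic it replaces; later patches can add range checks inside
     * this single definition without touching the call sites. */
    #define ELF_IMAGE_PTR(elf, offset) \
        ((const void *)((const char *)(elf)->image + (offset)))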
That this patch has no functional change can be verified as follows:
0. Copy the scripts "comparison-generate" and "function-filter"
out of this commit message.
1. Check out the tree before this patch.
2. Run the script ../comparison-generate .... ../before
3. Check out the tree after this patch.
4. Run the script ../comparison-generate .... ../after
5. diff --exclude=\*.[soi] -ruN before/ after/ |less
Expect these differences:
* stubdom/zlib-x86_64/ztest*.s2
The filename of this test file apparently contains the pid.
* xen/common/version.s2
The xen build timestamp appears in two diff hunks.
Verification that this is all that's needed:
In a completely built xen.git,
find * -name .*.d -type f | xargs grep -l libelf\.h
Expect results in:
xen/arch/x86: Checked above.
tools/libxc: Checked above.
tools/xcutils/readnotes: Checked above.
tools/xenstore: Checked above.
xen/common/libelf:
This is the build for the hypervisor; checked in B above.
stubdom:
We have one stubdom which reads ELFs using our libelf,
pvgrub, which is checked above.
I have not done this verification for ARM.
This is part of the fix to a security issue, XSA-55.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-8<- comparison-generate -8<-
#!/bin/bash
# usage:
# cd xen.git
# .../comparison-generate OUR-CONFIG BUILD-RUNE-PREFIX ../before|../after
# eg:
# .../comparison-generate ~/work/.config 'schroot -pc64 --' ../before
set -ex
test $# = 3 || need-exactly-three-arguments
our_config=$1
build_rune_prefix=$2
result_dir=$3
git clean -x -d -f
cp "$our_config" .
cat <<END >>.config
debug_symbols=n
CFLAGS += -save-temps
END
if [ -f ./configure ]; then
$build_rune_prefix ./configure
fi
$build_rune_prefix make -C xen
$build_rune_prefix make -C tools/include
$build_rune_prefix make -C stubdom grub
$build_rune_prefix make -C tools/libxc
$build_rune_prefix make -C tools/xenstore
$build_rune_prefix make -C tools/xcutils
rm -rf "$result_dir"
mkdir "$result_dir"
set +x
for f in `find xen tools stubdom -name \*.[soi]`; do
mkdir -p "$result_dir"/`dirname $f`
cp $f "$result_dir"/${f}
case $f in
*.s)
../function-filter <$f >"$result_dir"/${f}2
;;
esac
done
echo ok.
-8<-
-8<- function-filter -8<-
#!/usr/bin/perl -w
# function-filter
# script for massaging gcc-generated labels to be consistent
use strict;
our @lines;
my $sedderybody = "sub seddery () {\n";
while (<>) {
push @lines, $_;
if (m/^(__FUNCTION__|__func__)\.(\d+)\:/) {
$sedderybody .= " s/\\b$1\\.$2\\b/__XSA55MANGLED__$1.$./g;\n";
}
}
$sedderybody .= "}\n1;\n";
eval $sedderybody or die $@;
foreach (@lines) {
seddery();
print or die $!;
}
-8<-
Ian Jackson [Fri, 14 Jun 2013 15:43:16 +0000 (16:43 +0100)]
libelf/xc_dom_load_elf_symtab: Do not use "syms" uninitialised
xc_dom_load_elf_symtab (with load==0) calls elf_round_up, but it
mistakenly used the uninitialised variable "syms" when calculating
dom->bsd_symtab_start. This should be a reference to "elf".
This change might have the effect of rounding the value differently.
Previously if the uninitialised value (a single byte on the stack) was
ELFCLASS64 (ie, 2), the alignment would be to 8 bytes, otherwise to 4.
However, the value is calculated from dom->kernel_seg.vend so this
could only make a difference if that value wasn't already aligned to 8
bytes.
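A minimal sketch of the change described above (the exact spelling of the pointer arguments follows the surrounding function):

    /* before: rounding keyed off the uninitialised local 'syms' */
    dom->bsd_symtab_start = elf_round_up(&syms, dom->kernel_seg.vend);

    /* after: rounding keyed off the ELF image actually being loaded */
    dom->bsd_symtab_start = elf_round_up(elf, dom->kernel_seg.vend);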
This is part of the fix to a security issue, XSA-55.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Fri, 14 Jun 2013 15:43:16 +0000 (16:43 +0100)]
libelf: move include of <asm/guest_access.h> to top of file
libelf-loader.c #includes <asm/guest_access.h>, when being compiled
for Xen. Currently it does this in the middle of the file.
Move this #include to the top of the file, before libelf-private.h.
This is necessary because in forthcoming patches we will introduce
private #defines of memcpy etc. which would interfere with definitions
in headers #included from guest_access.h.
No semantic or functional change in this patch.
This is part of the fix to a security issue, XSA-55.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Ian Jackson [Fri, 14 Jun 2013 15:43:16 +0000 (16:43 +0100)]
libelf: add `struct elf_binary*' parameter to elf_load_image
The meat of this function is going to need a copy of the elf pointer,
in forthcoming patches.
No functional change in this patch.
This is part of the fix to a security issue, XSA-55.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Ian Jackson [Fri, 14 Jun 2013 15:43:16 +0000 (16:43 +0100)]
libxc: Fix range checking in xc_dom_pfn_to_ptr etc.
* Ensure that xc_dom_pfn_to_ptr (when called with count==0) does not
return a previously-allocated block which is entirely before the
requested pfn (!)
* Provide a version of xc_dom_pfn_to_ptr, xc_dom_pfn_to_ptr_retcount,
which provides the length of the mapped region via an out parameter.
* Change xc_dom_vaddr_to_ptr to always provide the length of the
mapped region and change the call site in xc_dom_binloader.c to
check it. The call site in xc_dom_load_elf_symtab will be corrected
in a forthcoming patch, and for now ignores the returned length.
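For the first two items above, a hedged sketch of the new interface (parameter names illustrative):

    /* Illustrative: like xc_dom_pfn_to_ptr(), but additionally reports via
     * *count_out how many pages starting at 'pfn' the returned mapping
     * actually covers, so callers can range-check their accesses. */
    void *xc_dom_pfn_to_ptr_retcount(struct xc_dom_image *dom, xen_pfn_t pfn,
                                     xen_pfn_t count, xen_pfn_t *count_out);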
This is part of the fix to a security issue, XSA-55.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Fri, 14 Jun 2013 15:43:16 +0000 (16:43 +0100)]
libxc: introduce xc_dom_seg_to_ptr_pages
Provide a version of xc_dom_seg_to_ptr which returns the number of
guest pages it has actually mapped. This is useful for callers who
want to do range checking; we will use this later in this series.
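A hedged sketch of the new helper (name as described above, details illustrative):

    /* Illustrative: maps the segment like xc_dom_seg_to_ptr(), and reports
     * via *pages_out how many guest pages the returned pointer covers. */
    void *xc_dom_seg_to_ptr_pages(struct xc_dom_image *dom,
                                  struct xc_dom_seg *seg,
                                  xen_pfn_t *pages_out);

A caller can then check that whatever it intends to read or write fits within pages_out pages before touching the mapping.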
This is part of the fix to a security issue, XSA-55.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Ian Jackson [Fri, 14 Jun 2013 15:43:15 +0000 (16:43 +0100)]
libelf: abolish libelf-relocate.c
This file is not actually used. It's not built in Xen's instance of
libelf; in libxc's it's built but nothing in it is called. Do not
compile it in libxc, and delete it.
This reduces the amount of work we need to do in forthcoming patches
to libelf (particularly since as libelf-relocate.c is not used it is
probably full of bugs).
This is part of the fix to a security issue, XSA-55.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Roger Pau Monné [Thu, 13 Jun 2013 09:26:53 +0000 (11:26 +0200)]
x86/vtsc: update vcpu_time in hvm_set_guest_time
When using a vtsc, hvm_set_guest_time changes hvm_vcpu.stime_offset,
which is used in the vcpu time structure to calculate the
tsc_timestamp, so after updating stime_offset we need to propagate the
change to vcpu_time in order for the guest to get the right time if
using the PV clock.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
master commit: 32c864a35ece2c24a336d183869a546798a4b241
master date: 2013-06-05 10:03:08 +0200
Jan Beulich [Thu, 13 Jun 2013 09:23:56 +0000 (11:23 +0200)]
x86: fix XCR0 handling
- both VMX and SVM ignored the ECX input to XSETBV
- both SVM and VMX used the full 64-bit RAX when calculating the input
mask to XSETBV
- faults on XSETBV did not get recovered from
Also consolidate the handling for PV and HVM into a single function,
and make the per-CPU variable "xcr0" static to xstate.c.
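As a hedged sketch of the rule being enforced (not the literal patch; the register field spellings are generic and handle_xsetbv stands in for the consolidated handler mentioned above):

    uint32_t eax = (uint32_t)regs->rax;   /* ignore the upper half of RAX */
    uint32_t edx = (uint32_t)regs->rdx;
    uint64_t new_bv = ((uint64_t)edx << 32) | eax;

    /* Pass the real ECX through so a non-zero XCR index is rejected, and
     * turn any failure (including a fault on the actual XSETBV) into a
     * #GP for the guest instead of crashing the hypervisor. */
    if ( handle_xsetbv((uint32_t)regs->rcx, new_bv) != 0 )
        goto gp_fault;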
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
master commit: 10b2b21a241795394637167bd4b076f2de17741f
master date: 2013-06-04 17:25:41 +0200
Jan Beulich [Thu, 13 Jun 2013 09:20:40 +0000 (11:20 +0200)]
x86: preserve FPU selectors for 32-bit guest code
Doing {,F}X{SAVE,RSTOR} unconditionally with 64-bit operand size leads
to the selector values associated with the last instruction/data
pointers getting lost. This, besides being inconsistent and not
compatible with native hardware behavior especially for 32-bit guests,
leads to bug checks in 32-bit Windows when running with "Driver
verifier" (see e.g. http://support.microsoft.com/kb/244617).
In a first try I made the code figure out the current guest mode, but
that has the disadvantage of only taking care of the issue when the
guest executes in the mode for which the state currently is (i.e.
namely not in the case of a 64-bit guest running a 32-bit application,
but being in kernel [64-bit] mode).
The solution presented here is to conditionally execute an auxiliary
FNSTENV and use the selectors from there.
In either case the determined word size gets stored in the last byte
of the FPU/SSE save area, which is available for software use (and I
verified is being cleared to zero by all versions of Xen, i.e. will not
present a problem when migrating guests from older to newer hosts), and
evaluated for determining the operand size {,F}XRSTOR is to be issued
with.
Note that I did check whether using a second FXSAVE or a partial second
XSAVE would be faster than FNSTENV - on neither my Westmere (FXSAVE)
nor my Romley (XSAVE) is it.
I was really tempted to use branches into the middle of instructions
(i.e. past the REX64 prefixes) here, as that would have allowed us to
collapse the otherwise identical fault recovery blocks. I stayed away
from doing so just because I expect others to dislike such slightly
subtle/tricky code.
Reported-by: Ben Guthro <Benjamin.Guthro@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
master commit: 10f969150025498fe27d985f9021a68f8c7acc31
master date: 2013-06-04 17:23:11 +0200
When booting Xen on VMware ESX 5.1 and Workstation 9, you hit a GPF
during MCE initialization. The culprit is line 631 in
set_poll_bankmask():
bitmap_copy(mb->bank_map, mca_allbanks->bank_map, nr_mce_banks);
What is happening is that in mca_cap_init(), nr_mce_banks is being set
to 0. This causes the allocation of bank_map to be set to
ZERO_BLOCK_PTR which is the return value for zero-size allocation by
xzalloc_array()/_xmalloc(). This causes the bitmap_copy() to fail
disastrously. The following patch fixes this issue.
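A minimal sketch of the kind of guard that avoids this (illustrative; the actual fix may sit elsewhere in the MCE init path):

    /* Illustrative: with zero MCA banks there is nothing to poll, and
     * bank_map would otherwise be a zero-size allocation that
     * bitmap_copy() then scribbles over. */
    if ( nr_mce_banks == 0 )
        return;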
Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christoph Egger <chegger@amazon.de>
master commit: 5cffb77c4072fa5b46700a2dbb3e46c5a54eba6d
master date: 2013-06-03 15:42:46 +0200
Jan Beulich [Thu, 13 Jun 2013 09:17:46 +0000 (11:17 +0200)]
x86: fix ordering of operations in destroy_irq()
The fix for XSA-36, switching the default of vector map management to
be per-device, exposed more readily a problem with the cleanup of these
vector maps: dynamic_irq_cleanup() clearing desc->arch.used_vectors
keeps the subsequently invoked clear_irq_vector() from clearing the
bits for both the in-use and a possibly still outstanding old vector.
Fix this by folding dynamic_irq_cleanup() into destroy_irq(), which was
its only caller, deferring the clearing of the vector map pointer until
after clear_irq_vector().
Once at it, also defer resetting of desc->handler until after the loop
around smp_mb() checking for IRQ_INPROGRESS to be clear, fixing a
(mostly theoretical) issue with the interaction with do_IRQ(): If we
don't defer the pointer reset, do_IRQ() could, for non-guest IRQs, call
->ack() and ->end() with different ->handler pointers, potentially
leading to an IRQ remaining un-acked. The issue is mostly theoretical
because non-guest IRQs are subject to destroy_irq() only on (boot time)
error paths.
As to the changed locking: Invoking clear_irq_vector() with desc->lock
held is okay because vector_lock already nests inside desc->lock (proven
by set_desc_affinity(), which takes vector_lock and gets called from
various desc->handler->ack implementations, getting invoked with
desc->lock held).
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
master commit: 43427e65ccfea0c6dc0232f358287e6cc616b507
master date: 2013-05-31 08:35:22 +0200
Jan Beulich [Tue, 4 Jun 2013 07:36:32 +0000 (09:36 +0200)]
x86/xsave: properly check guest input to XSETBV
Other than the HVM emulation path, the PV case so far failed to check
that YMM state requires SSE state to be enabled, allowing for a #GP to
occur upon passing the inputs to XSETBV inside the hypervisor.
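The architectural dependency can be sketched as follows (illustrative, using Xen's XSTATE_* flag names):

    /* Illustrative: YMM (AVX) state may only be enabled together with
     * SSE state; reject the combination before it reaches XSETBV. */
    if ( (new_bv & XSTATE_YMM) && !(new_bv & XSTATE_SSE) )
        return -EINVAL;          /* caller injects #GP into the guest */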
Jan Beulich [Tue, 4 Jun 2013 07:35:16 +0000 (09:35 +0200)]
x86/xsave: recover from faults on XRSTOR
Just like FXRSTOR, XRSTOR can raise #GP if bad content is being passed
to it in the memory block (i.e. aspects not under the control of the
hypervisor, other than e.g. proper alignment of the block).
Also correct the comment explaining why FXRSTOR needs exception
recovery code to not wrongly state that this can only be a result of
the control tools passing a bad image.
Jan Beulich [Tue, 4 Jun 2013 07:33:29 +0000 (09:33 +0200)]
x86/xsave: fix information leak on AMD CPUs
Just like for FXSAVE/FXRSTOR, XSAVE/XRSTOR also don't save/restore the
last instruction and operand pointers as well as the last opcode if
there's no pending unmasked exception (see CVE-2006-1056 and commit
9747:4d667a139318).
While the FXSR solution sits in the save path, I prefer to have this in
the restore path because there the handling is simpler (namely in the
context of the pending changes to properly save the selector values for
32-bit guest code).
Also this is using FFREE instead of EMMS, as it doesn't seem unlikely
that in the future we may see CPUs with x87 and SSE/AVX but no MMX
support. The goal here anyway is just to avoid an FPU stack overflow.
I would have preferred to use FFREEP instead of FFREE (freeing two
stack slots at once), but AMD doesn't document that instruction.
Petr Matousek [Fri, 31 May 2013 10:24:18 +0000 (12:24 +0200)]
libxc: limit cpu values when setting vcpu affinity
When support for pinning more than 64 cpus was added, check for cpu
out-of-range values was removed. This can lead to subsequent
out-of-bounds cpumap array accesses in case the cpu number is higher
than the actual count.
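A hedged sketch of the restored check (variable names illustrative):

    /* Illustrative: reject cpu numbers the cpumap cannot represent,
     * instead of setting a bit past the end of the array. */
    if ( cpu < 0 || cpu >= nr_cpus )
    {
        errno = EINVAL;
        return -1;
    }
    cpumap[cpu / 8] |= 1 << (cpu & 7);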
Jan Beulich [Fri, 31 May 2013 10:23:30 +0000 (12:23 +0200)]
x86: fix boot time APIC mode detection
current_cpu_data becomes valid only relatively late in the boot
process, so looking there for a particular feature early in the game
would generally give the appearance of the feature being unavailable.
Getting this wrong means that at kexec time the system would get
returned to xAPIC mode, causing disconnect_bsp_APIC() to try to access
the APIC page, which on systems with x2APIC pre-enabled will never get
set up.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 234c4dde2fd4f1182fe1a6bea6bced83fe363007
master date: 2013-05-23 13:08:32 +0200
Andrew Cooper [Thu, 23 May 2013 08:16:28 +0000 (10:16 +0200)]
AMD/iommu: SR56x0 Erratum 64 - Reset all head & tail pointers
Reference at time of patch:
http://support.amd.com/us/ChipsetMotherboard_TechDocs/46303.pdf
Erratum 64 states that the head and tail pointers for the Command buffer and
Event log are only reset on a cold boot, not a warm boot.
While the erratum is limited to systems using SR56xx chipsets (such as Family
10h CPUs), resetting the pointers is a sensible action in all cases, including
the PPR log for consistency.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
master commit: 6d243308e1d75f866679db226159c797d6c83aad
master date: 2013-05-22 15:26:52 +0200
Jan Beulich [Thu, 23 May 2013 08:15:55 +0000 (10:15 +0200)]
tools: fix dependency file generation
There is a small set of places where files in subdirectories get
compiled from the parent directory. For the dependency files this is not a
problem as long as the files have names that are distinct irrespective of
the directories they sit in, and tools/console/ violates this (in having
two main.c files). Hence we need to avoid losing the directory name,
both to ensure the two compiler instances don't simultaneously write
to the same file (which is what triggered me to look
into this) and to guarantee dependencies for all files will be seen
by make on an incremental rebuild.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 4d788e164d6556d931bc3e0a69e36b8cf7280794
master date: 2013-05-21 10:16:30 +0200
Jan Beulich [Thu, 23 May 2013 08:13:18 +0000 (10:13 +0200)]
fix XSA-46 regression with xend/xm
The hypervisor side changes for XSA-46 require the tool stack to now
always map the guest pIRQ before granting access permission to the
underlying host IRQ (GSI). This in particular requires that pciif.py
no longer can skip this step (assuming qemu would do it) for HVM
guests.
This in turn exposes, however, an inconsistency between xend and qemu:
The former wants to always establish 1:1 mappings between pIRQ and host
IRQ (for non-MSI only of course), while the latter always wants to
allocate an arbitrary mapping. Since the whole tool stack obviously
should always agree on the mapping model, make libxc enforce the 1:1
mapping as the more natural one (as well as being the one that allows
for easier debugging, since there is no need to find out the extra
mapping). Users of libxc that want to establish a particular (rather
than an allocated) mapping are still free to do so, as well as tool
stacks not based on libxc wanting to implement an allocation based
model (which is why it's not the hypervisor that's being changed to
enforce either model).
Since libxl, like xend, already uses a 1:1 model, it's unaffected by
the libxc change (and it being unaffected by the original hypervisor
side changes is - afaict - simply due to qemu getting spawned at a
later point in time compared to the xend event flow).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andreas Falck <falck.andreas.lists@gmail.com> (on 4.1) Tested-by: Gordan Bobic <gordan@bobich.net> (on 4.2) Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 934a5253d932b6f67fe40fc48975a2b0117e4cce
master date: 2013-05-21 11:32:34 +0200
Tim Deegan [Thu, 23 May 2013 08:11:12 +0000 (10:11 +0200)]
x86/hvm: avoid p2m lookups for vlapic accesses.
The LAPIC base address is a known GFN, so we can skip looking up the
p2m: we know it should be handled as emulated MMIO. That helps
performance in older Windows OSes, which make a _lot_ of TPR accesses.
This will change the behaviour of any OS that maps other
memory/devices at its LAPIC address; the new behaviour (the LAPIC
mapping always wins) is closer to actual hardware behaviour.
Signed-off-by: Tim Deegan <tim@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 5d43891bf4002b754cd90d83e91d9190e8c8b9d0
master date: 2013-05-16 12:05:25 +0100
Jan Beulich [Thu, 23 May 2013 08:09:17 +0000 (10:09 +0200)]
x86/IO-APIC: fix guest RTE write corner cases
This fixes two regressions from c/s 20143:a7de5bd776ca ("x86: Make the
hypercall PHYSDEVOP_alloc_irq_vector hypercall dummy"):
For one, IRQs that had their vector set up by Xen internally without a
handler ever having got set (e.g. via "com<n>=..." without a matching
consumer option like "console=com<n>") would wrongly call
add_pin_to_irq() here, triggering the BUG_ON() in that function.
Second, when assign_irq_vector() fails this addition to irq_2_pin[]
needs to be undone.
In the context of this I'm also surprised that the irq_2_pin[]
manipulations here occur without any lock, i.e. rely on Dom0 to do
some sort of serialization.
Update return codes of wrmsr_hypervisor_regs, update callers to deal
with the new return codes:
0: not handled
1: handled
-EAGAIN: retry
Currently wrmsr_hypervisor_regs will not return the following error, it
will be added in a separate patch:
-EINVAL: error during handling
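A hedged sketch of a caller handling the new return values (labels and variable names illustrative):

    /* Illustrative: 0 = not a hypervisor MSR (fall through to the normal
     * WRMSR path), 1 = handled, -EAGAIN = ask the guest to retry. */
    switch ( wrmsr_hypervisor_regs(idx, msr_content) )
    {
    case -EAGAIN:
        goto retry;
    case 1:
        goto done;
    default:               /* 0: not handled here */
        break;
    }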
Also update the gdprintk to handle a page value of NULL, to avoid
printing a bogus MFN value, and fix the computation of the MSR value in
the gdprintk; the idx was always zero.
Ben Guthro [Thu, 23 May 2013 08:06:34 +0000 (10:06 +0200)]
ns16550: delay resume until dom0 ACPI has a chance to run
Check for ioport access, before fully resuming operation, to avoid
spinning in __ns16550_poll when reading the LSR register returns 0xFF
on failing ioport access.
On some systems (like Lenovo T410, and some HP machines of similar vintage)
there is a SuperIO card that provides this legacy ioport on the LPC bus.
In this case, we need to wait for dom0's ACPI processing to run the proper
AML to re-initialize the chip, before we can use the card again.
This may cause a small amount of garbage to be written to the serial log
while we wait patiently for that AML to be executed.
This implementation limits the number of retries, to avoid a situation
where we keep trying over and over again, in the case of some other failure
on the ioport.
The current code allows the PVHVM guest to make this hypercall.
But for a PVHVM guest it always returns -EINVAL (-22) on Xen 4.2
and above. Xen 4.1 and earlier worked.
The reason is that the check in map_vcpu_info would fail
at:
if ( v->arch.vcpu_info_mfn != INVALID_MFN )
The reason is that the vcpu_info_mfn for PVHVM guests ends up by
default with the value of zero (introduced by c/s 23143).
The code in vcpu_initialise which initialized vcpu_info_mfn to a
valid value (INVALID_MFN) would never be called for PVHVM.
Jan Beulich [Fri, 3 May 2013 07:42:15 +0000 (09:42 +0200)]
x86: cleanup after making various page table manipulation operations preemptible
This drops the "preemptible" parameters from various functions where
now they can't (or shouldn't, validated by assertions) be run in non-
preemptible mode anymore, to prove that manipulations of at least L3
and L4 page tables and page table entries are now always preemptible,
i.e. the earlier patches actually fulfill their purpose of fixing the
resulting security issue.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: b965b31a6bce8c37e67e525fae6da0e2f26d6b2e
master date: 2013-05-02 17:04:14 +0200
Jan Beulich [Thu, 2 May 2013 15:19:41 +0000 (17:19 +0200)]
VT-d: don't permit SVT_NO_VERIFY entries for known device types
Only in cases where we don't know what to do should we leave the IRTE
blank (suppressing all validation), but we should always log a warning
in those cases (as being insecure).
Jan Beulich [Thu, 2 May 2013 15:19:01 +0000 (17:19 +0200)]
x86: make page table handling error paths preemptible
... as they may take significant amounts of time.
This requires cloning the tweaked continuation logic from
do_mmuext_op() to do_mmu_update().
Note that in mod_l[34]_entry() a negative "preemptible" value gets
passed to put_page_from_l[34]e() now, telling the callee to store the
respective page in current->arch.old_guest_table (for a hypercall
continuation to pick up), rather than carrying out the put right away.
This is going to be made a little more explicit by a subsequent cleanup
patch.
This is part of CVE-2013-1918 / XSA-45.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: b8efae696c9a2d46e91fa0eda739427efc16c250
master date: 2013-05-02 16:39:37 +0200
Jan Beulich [Thu, 2 May 2013 15:18:28 +0000 (17:18 +0200)]
x86: make page table unpinning preemptible
... as it may take significant amounts of time.
Since we can't re-invoke the operation in a second attempt, the
continuation logic must be slightly tweaked so that we make sure
do_mmuext_op() gets run one more time even when the preempted unpin
operation was the last one in a batch.
This is part of CVE-2013-1918 / XSA-45.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: a3e049f8e86fe18e3b87f18dc0c7be665026fd9f
master date: 2013-05-02 16:39:06 +0200
Jan Beulich [Thu, 2 May 2013 15:14:06 +0000 (17:14 +0200)]
x86: make vcpu_destroy_pagetables() preemptible
... as it may take significant amounts of time.
The function, being moved to mm.c as the better home for it anyway, and
to avoid having to make a new helper function there non-static, is
given a "preemptible" parameter temporarily (until, in a subsequent
patch, its other caller is also being made capable of dealing with
preemption).
This is part of CVE-2013-1918 / XSA-45.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: 6cdc9be2a5f2a87b4504404fbf648d16d9503c19
master date: 2013-05-02 16:34:21 +0200
Ben Guthro [Mon, 29 Apr 2013 14:08:16 +0000 (16:08 +0200)]
x86/S3: Fix cpu pool scheduling after suspend/resume
This addresses another S3 scheduler problem with the system_state
variable introduced by the following changeset:
http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=269f543ea750ed567d18f2e819e5d5ce58eda5c5
Specifically, the cpu_callback function that takes the CPU down during
suspend, and back up during resume. We were seeing situations where,
after S3, only CPU0 was in cpupool0. Guest performance suffered
greatly, since all vcpus were only on a single pcpu. Guests under high
CPU load showed the problem much more quickly than an idle guest.
Removing this if condition forces the CPUs to go through the expected
online/offline state, and be properly scheduled after S3.
This also includes a necessary partial change proposed earlier by
Tomasz Wroblewski here:
http://lists.xen.org/archives/html/xen-devel/2013-01/msg02206.html
It should also resolve the issues discussed in this thread:
http://lists.xen.org/archives/html/xen-devel/2012-11/msg01801.html
Ian Jackson [Thu, 18 Apr 2013 16:42:04 +0000 (17:42 +0100)]
libxl: Fix SEGV in network-attach
When "device/vif" directory exists but is empty l!=NULL, but nb==0, so
l[nb-1] is invalid. Add missing check.
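A minimal sketch of the corrected logic (the surrounding code is simplified and variable names follow the description above):

    /* Illustrative: the directory listing can be non-NULL yet empty, so
     * guard the l[nb-1] access on both conditions. */
    l = libxl__xs_directory(gc, XBT_NULL, dompath, &nb);
    if (l == NULL || nb == 0)
        nextid = 0;                        /* no existing vifs */
    else
        nextid = atoi(l[nb - 1]) + 1;      /* one past the last one */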
Signed-off-by: Marek Marczykowski <marmarek@invisiblethingslab.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Jan Beulich [Thu, 18 Apr 2013 14:19:51 +0000 (16:19 +0200)]
x86: fix various issues with handling guest IRQs
- properly revoke IRQ access in map_domain_pirq() error path
- don't permit replacing an in use IRQ
- don't accept inputs in the GSI range for MAP_PIRQ_TYPE_MSI
- track IRQ access permission in host IRQ terms, not guest IRQ ones
(and with that, also disallow Dom0 access to IRQ0)
Jan Beulich [Thu, 18 Apr 2013 14:18:42 +0000 (16:18 +0200)]
x86: clear EFLAGS.NT in SYSENTER entry path
... as it causes problems if we happen to exit back via IRET: In the
course of trying to handle the fault, the hypervisor creates a stack
frame by hand, and uses PUSHFQ to set the respective EFLAGS field, but
expects to be able to IRET through that stack frame to the second
portion of the fixup code (which causes a #GP due to the stored EFLAGS
having NT set).
And even if this worked (e.g. if we cleared NT in that path), it would
then (through the fail safe callback) cause a #GP in the guest with the
SYSENTER handler's first instruction as the source, which in turn would
allow guest user mode code to crash the guest kernel.
Inject a #GP on the fake (NULL) address of the SYSENTER instruction
instead, just like in the case where the guest kernel didn't register
a corresponding entry point.
This is CVE-2013-1917 / XSA-44.
Reported-by: Andrew Cooper <andrew.cooper3@citirx.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fdac9515607b757c044e7ef0d61b1453ef999b08
master date: 2013-04-18 16:00:35 +0200
On the crash path in nmi_shootdown_cpus(), we shut down the IOMMU, then
disable the IOAPIC.
On systems which support interrupt remapping, the variable iommu_intremap
remains set, meaning that disable_IO_APIC() issues interrupt remapping
invalidate requests.
IOAPIC interrupt remapping used to be conditional on iommu_enabled, but is now
conditional on iommu_intremap, following the above changeset.
This behaviour can be fixed by also indicating that interrupt remapping is not
enabled after shutting down the IOMMU.
tools/blktap2: Handle read/write interrupts in blktap2 control plane.
The following patch:
tools: Retry blktap2 tapdisk message on interrupt.
addressed a long-standing regression with the blktap2 control
plane. An interruption of the select system call would
prematurely terminate the message sequence needed to properly
shutdown a blktap2 tapdisk instance.
Ian Jackson correctly noted that the read and write systems calls
responsible for receiving and sending the control messages could
also return EINTR, resulting in similar effects. While this
regression was not seen in field testing, this patch adds support
for re-starting those calls, to provide a technically complete
implementation of control plane management in the presence of
signals.
Signed-off-by: Dr. Greg Wettstein <xen@wind.enjellic.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit a5c800142cfc82159fcb85b47116cf296caebcc5)
Ian Jackson [Mon, 15 Apr 2013 17:04:35 +0000 (18:04 +0100)]
libxl: don't launch more than one tapdisk process for each disk
When adding a disk, don't launch multiple tapdisk instances for the
same disk: if the transaction fails in device_disk_add, reuse the same
tapdisk for further tries instead of creating a new instance each
time a transaction fails.
Reported-by: Darren Shepherd <darren.s.shepherd@gmail.com> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com> Tested-by: Darren Shepherd <darren.s.shepherd@gmail.com> Backport-requested-by: Pasi Karkkainen <pasik@iki.fi>
(cherry picked from commit ec398660e89ca18bb8d061d5047d682bd383778a)
Conflicts:
tools/libxl/libxl.c Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
tools: Retry blktap2 tapdisk message on interrupt.
Re-start blktap2 IPC select call on interrupt.
We hunted this miserable bug for a long time.
The teardown of a blktap2 tapdisk instance is being carried out
inconsistently up to and including the 4.2.1 release. The
problem appears to be a classic 'Heisenbug' which disappears if a
single function call is added to the tapdisk shutdown path. It
is likely this bug has been in existence for the life of the
blktap2 code.
Control messages to manipulate a tapdisk instance are sent over a
UNIX domain socket. A select call is used on both the read and
write paths to wait on I/O and to set a timeout for the
transmission and reception of the control plane messages.
The existing code fails receipt or transmission of the control message
on any type of error return from the select call. The xl control
process can receive a signal while waiting in the select call, which
in turn causes an error return with errno set to EINTR.
This prematurely terminates the teardown of the tapdisk instance
leaving it in various states of shutdown. Since multiple messages
are needed to implement a full teardown the tapdisk instance can be
left in various states ranging from fully connected to only the minor
being left allocated.
The fix is straightforward. Check the return code from the
select call and re-try the read or write of the control message if
errno is set to EINTR. The problem manifests itself in the read
path but there appears to be little reason to not add the fix to
the write path as well. Both paths appear to be cut-and-paste
copies of each other.
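The restart pattern, as a hedged sketch (not the literal blktap2 code; fd, buf and tv are illustrative):

    /* Illustrative: retry the interrupted system call instead of treating
     * EINTR as a control-plane failure; the same idea applies to select(),
     * read() and write(). */
    for (;;) {
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        ret = select(fd + 1, &readfds, NULL, NULL, &tv);
        if (ret == -1 && errno == EINTR)
            continue;            /* interrupted: retry instead of failing */
        break;
    }

    do {
        len = read(fd, buf, sizeof(buf));
    } while (len == -1 && errno == EINTR);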
Signed-off-by: Dr. Greg Wettstein <greg@enjellic.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 6cffb2b469a55032a2900ccb8776c0082f346758)
Ian Jackson [Tue, 9 Apr 2013 14:44:07 +0000 (15:44 +0100)]
libxl: run libxl__arch_domain_create() much earlier.
Among other things, arch_domain_create() sets the shadow(/hap/p2m)
memory allocation, which must happen after vcpus are assigned (or the
shadow op will fail) but before memory is allocated (or we might run
out of p2m memory).
libxl__build_pre(), which already sets similar things like maxmem,
seems like a reasonable spot for it. That needed a bit of plumbing to
get the right data structure from the caller.
As a side-effect, the return code from libxl__arch_domain_create() is
no longer ignored.
This bug was analysed in:
From: "Jan Beulich" <JBeulich@xxxxxxxx>
"Re: [Xen-devel] [xen-unstable test] 16788: regressions - FAIL"
Date: Mon, 04 Mar 2013 16:34:53 +0000
http://lists.xen.org/archives/html/xen-devel/2013-03/msg00191.html
Reported-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Tim Deegan <tim@xen.org> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com>
(Cherry-picked from 650354dbc2626b643c12873275ca67782f1382c8.)
Conflicts:
tools/libxl/libxl_dom.c Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Tim Deegan [Tue, 9 Apr 2013 14:07:23 +0000 (16:07 +0200)]
x86/mm/shadow: spurious warning when unmapping xenheap pages.
Xenheap pages will always have an extra typecount, taken in
share_xen_page_with_guest(), which doesn't come from a shadow PTE.
Adjust the warning in sh_remove_all_mappings() to account for it.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Tim Deegan <tim@xen.org> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cfc515dabe91e3d6c690c68c6a669d6d77fb7ac4
master date: 2013-04-04 10:14:30 +0100
disabled the SMEP bit if a guest VCPU was using HAP and was not
in paging mode. However I could observe VCPUs getting stuck in
the trampoline after the following patch in the Linux kernel
changed the way CR4 gets set up:
x86, realmode: read cr4 and EFER from kernel for 64-bit trampoline
The change will set CR4 from already-set flags, which include the
SMEP bit. On bare metal this does not matter as the CPU is in non-
paging mode at that time. But Xen seems to use the emulated non-
paging mode regardless of HAP (I verified that on the guests I was
seeing the issue, HAP was not used).
Therefore it seems right to unset the SMEP bit for a VCPU that is
not in paging-mode, regardless of its HAP usage.
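Expressed as a hedged sketch (the flag and predicate names are standard Xen ones; the placement in the real patch may differ):

    /* Illustrative: hide SMEP from a vCPU that is not paging, independent
     * of whether HAP is in use. */
    if ( !hvm_paging_enabled(v) )
        cr4 &= ~X86_CR4_SMEP;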
Ben Guthro [Tue, 9 Apr 2013 14:05:52 +0000 (16:05 +0200)]
x86/S3: Restore broken vcpu affinity on resume
When in SYS_STATE_suspend, and going through the cpu_disable_scheduler
path, save a copy of the current cpu affinity, and mark a flag to
restore it later.
Later, in the resume process, when enabling nonboot cpus restore these
affinities.
Signed-off-by: Ben Guthro <benjamin.guthro@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 41e71c2607e036f1ac00df898b8f4acb2d4df7ee
master date: 2013-04-02 09:52:32 +0200
Jan Beulich [Tue, 9 Apr 2013 14:04:25 +0000 (16:04 +0200)]
x86: irq_move_cleanup_interrupt() must ignore legacy vectors
Since the main loop in the function includes legacy vectors, and since
vector_irq[] gets set up for legacy vectors regardless of whether those
get handled through the IO-APIC, it must not do anything on this vector
range. In fact, we should never get past the move_cleanup_count check
for IRQs not handled through the IO-APIC. Adding a respective assertion
would make those iterations more expensive (due to the lock acquire).
For such an assertion to not have false positives we however ought to
suppress setting up IRQ2 as an 8259A interrupt (which wasn't correct
anyway), which is being done here despite the assertion not actually
getting added.
Furthermore, there's no point iterating over the vectors past
LAST_HIPRIORITY_VECTOR, so terminate the loop accordingly.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: af699220ad6d111ba76fc3040342184e423cc9a1
master date: 2013-04-02 08:30:03 +0200
Jan Beulich [Tue, 2 Apr 2013 09:23:05 +0000 (11:23 +0200)]
x86/EFI: permit setting variable with non-zero attributes
This must have been a copy-and-paste mistake - get_variable uses
op->misc as output only, and wants to make sure it's zero for future
extensibility. For set_variable, this is an input though, and hence
the check is wrong.
x86: reserve pages when SandyBridge integrated graphics
SNB graphics devices have a bug that prevents them from accessing certain
memory ranges, namely anything below 1M and in the pages listed in the
table.
Xen does not add memory below 1MB to the heap, i.e. pages below 1MB are
never allocated, so it is unnecessary to reserve memory below the 1MB
mark that has not already been reserved.
So reserve the pages listed in the table at Xen boot, if an SNB graphics
device is detected, to avoid GPU hangs.
To make the ERST table size checking code work on all systems, both
tests are treated as PASS.
The same situation applies to einj_tab->header_length, so the
corresponding table size check is changed in a similar way too.
Originally-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Huang Ying <ying.huang@intel.com>
- use switch() for better readability
- add comment explaining why a formally invalid size is also being
accepted
- check erst_tab->header.length before even looking at
erst_tab->header_length
- prefer sizeof(*erst_tab) over sizeof(struct acpi_table_erst)
Some actions in APEI ERST and EINJ tables are optional, for example,
ACPI_EINJ_BEGIN_OPERATION action is used to do some preparation for
error injection, and firmware may choose to do nothing here. While
some other actions are mandatory, for example, firmware must provide
an ACPI_EINJ_GET_ERROR_TYPE implementation.
The original implementation treats all actions as optional (that is,
they can have no instructions), which may cause issues if firmware does
not provide some mandatory actions. To fix this, this patch adds
apei_exec_run_optional, which should be used for optional actions.
The original apei_exec_run should be used for mandatory actions.
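A hedged sketch of the split, mirroring the Linux-side APEI change being imported (exact signatures may differ in the Xen copy):

    /* Illustrative: one implementation, two wrappers; only the mandatory
     * variant treats "no instructions for this action" as an error. */
    static int __apei_exec_run(struct apei_exec_context *ctx, u8 action,
                               bool optional);

    int apei_exec_run(struct apei_exec_context *ctx, u8 action)
    {
        return __apei_exec_run(ctx, action, false);
    }

    int apei_exec_run_optional(struct apei_exec_context *ctx, u8 action)
    {
        return __apei_exec_run(ctx, action, true);
    }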
Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master changeset: 72af01bf6f7489e54ad59270222a29d3e8c501d1
master date: 2013-03-22 12:46:25 +0100
Andrew Cooper [Tue, 2 Apr 2013 09:00:22 +0000 (11:00 +0200)]
ACPI/APEI: Unlock apei_iomaps_lock on error path
This causes deadlocks during early boot on hardware with broken/buggy
APEI implementations, such as a Dell Poweredge 2950 with the latest
currently available BIOS.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Don't use goto or another special error path, as handling the error
case in normal flow is quite simple.
ACPI/APEI: fix ERST MOVE_DATA instruction implementation
The src_base and dst_base fields in apei_exec_context are physical
addresses, so they should be ioremapped before being used by the ERST
MOVE_DATA instruction.
Reported-by: Javier Martinez Canillas <martinez.javier@gmail.com> Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Huang Ying <ying.huang@intel.com>
Replace use of ioremap() by __acpi_map_table()/set_fixmap(). Fix error
handling.
Jan Beulich [Tue, 2 Apr 2013 08:52:26 +0000 (10:52 +0200)]
AMD IOMMU: allow disabling only interrupt remapping when certain IVRS consistency checks fail
After some more thought on the XSA-36 and specifically the comments we
got regarding disabling the IOMMU in this situation altogether making
things worse instead of better, I came to the conclusion that we can
actually restrict the action in affected cases to just disabling
interrupt remapping. That doesn't make the situation worse than prior
to the XSA-36 fixes (where interrupt remapping didn't really protect
domains from one another), but allows at least DMA isolation to still
be utilized.
To do so, disabling of interrupt remapping must be explicitly requested
on the command line - respective checks will then be skipped.
Stepping B-3 has two errata (#47 and #53) related to Interrupt
remapping, to which the workaround is for the BIOS to completely disable
interrupt remapping. These errata are fixed in stepping C-2.
Unfortunately this chipset stepping is very common and many BIOSes are
not disabling interrupt remapping on this stepping. We can detect this in
Xen and prevent Xen from using the problematic interrupt remapping feature.
The Intel 5500/5520/X58 chipset does not support VT-d
Extended Interrupt Mode (EIM). This means the iommu_supports_eim() check
always fails and so x2apic mode cannot be enabled in Xen before this quirk
disables the interrupt remapping feature.
Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Gate the function call to check the quirk on interrupt remapping being
requested to get enabled, and upon failure disable the IOMMU to be in
line with what the changes for XSA-36 (plus follow-ups) did.
Jan Beulich [Tue, 2 Apr 2013 08:50:52 +0000 (10:50 +0200)]
IOMMU: properly check whether interrupt remapping is enabled
... rather than the IOMMU as a whole.
That in turn required making sure iommu_intremap gets properly
cleared when the respective initialization fails (or isn't being
done at all).
Along with making sure interrupt remapping doesn't get inconsistently
enabled on some IOMMUs and not on others in the VT-d code, this in turn
allowed quite a bit of cleanup on the VT-d side (removed from the
backport).
Andrew Cooper [Tue, 2 Apr 2013 08:48:36 +0000 (10:48 +0200)]
AMD/IOMMU: Process softirqs while building dom0 iommu mappings
Recent changes which have made their way into xen-4.2 stable have pushed the
runtime of construct_dom0() over 5 seconds, which has caused regressions in
XenServer testing because of our 5 second watchdog.
The root cause is that amd_iommu_dom0_init() does not process softirqs and in
particular the nmi_timer which causes the watchdog to decide that no useful
progress is being made.
This patch adds periodic calls to process_pending_softirqs() at the same
interval as the Intel variant of this function. The server which was failing
with the watchdog test now boots reliably with a timeout of 1 second.
Jan Beulich [Tue, 2 Apr 2013 08:47:26 +0000 (10:47 +0200)]
x86/MCA: suppress bank clearing for certain injected events
As the bits indicating validity of the ADDR and MISC bank MSRs may be
injected in a way that isn't consistent with what the underlying
hardware implements (while the bank must be valid for injection to
work, the auxiliary MSRs may not be implemented - and hence cause #GP
upon access - if the hardware never sets the corresponding valid bits),
we need to do the clearing writes only if no value was
interposed for the respective MSR (which also makes sense the other way
around: there's no point in clearing a hardware register when all data
read came from software). Of course this all requires the injection
tool to do things in a consistent way (but that had been a requirement
before already).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Ren Yongjie <yongjie.ren@intel.com> Acked-by: Liu Jinsong <jinsong.liu@intel.com>
master changeset: b0583c0e64cc8bb6229c95c3304fdac2051f79b3
master date: 2013-03-12 15:53:30 +0100
Try to fix the issue that "some AMD systems may round the
frequencies in ACPI tables to 100MHz boundaries. We can obtain the real
frequencies from MSRs, so add a quirk to fix these frequencies up
on AMD systems." (from f594065..)
In discussion (around 9855d8..) "it turned out that indeed real
HW/BIOSes may choose to not set the valid bit and thus mark the
P-state as invalid. So this could be considered a fix for broken
BIOSes." (from 9855d8..)
which is great for Linux. Unfortunately the Linux kernel, when
it tries to do the RDMSR under Xen it fails to get the right
value (it gets zero) as Xen traps it and returns zero. Hence
when dom0 uploads the P-states they will be unmodified and
we should take care of updating the frequencies with the right
values.
I've tested it under Dell Inc. PowerEdge T105 /0RR825, BIOS 1.3.2
08/20/2008 where this quirk can be observed (x86 == 0x10, model == 2).
Also on other AMD (x86 == 0x12, A8-3850; x86 = 0x14, AMD E-350) to
make sure the quirk is not applied there.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: stefan.bader@canonical.com
Do the MSR access here (and while at it, also the one reading
MSR_PSTATE_CUR_LIMIT) on the target CPU, and bound the loop over
amd_fixup_frequency() by max_hw_pstate (matching the one in
powernow_cpufreq_cpu_init()).
Jan Beulich [Tue, 12 Mar 2013 15:19:30 +0000 (16:19 +0100)]
x86/MSI: add mechanism to fully protect MSI-X table from PV guest accesses
This adds two new physdev operations for Dom0 to invoke when resource
allocation for devices is known to be complete, so that the hypervisor
can arrange for the respective MMIO ranges to be marked read-only
before an eventual guest getting such a device assigned even gets
started, such that it won't be able to set up writable mappings for
these MMIO ranges before Xen has a chance to protect them.
This also addresses another issue with the code being modified here,
in that so far write protection for the address ranges in question got
set up only once during the lifetime of a device (i.e. until either
system shutdown or device hot removal), while teardown happened when
the last interrupt was disposed of by the guest (which at least allowed
the tables to be writable when the device got assigned to a second
guest [instance] after the first terminated).
Matthew Daley [Tue, 12 Mar 2013 15:18:45 +0000 (16:18 +0100)]
fix domain unlocking in some xsm error paths
A couple of xsm error/access-denied code paths in hypercalls neglect to
unlock a previously locked domain. Fix by ensuring the domains are
unlocked correctly.
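The shape of the fix, as a hedged sketch (the XSM check shown is hypothetical):

    /* Illustrative: after the domain is locked, a denied XSM check must
     * leave via a path that drops the lock again. */
    domain_lock(d);
    ret = xsm_example_check(d);   /* hypothetical check */
    if ( ret )
        goto out;                 /* previously returned directly, leaking the lock */
    /* ... perform the operation ... */
 out:
    domain_unlock(d);
    return ret;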
Signed-off-by: Matthew Daley <mattjd@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
master changeset: 9581c4f9a55372a21e759cd449cb676d0e8feddb
master date: 2013-03-06 17:10:26 +0100
Olaf Hering [Tue, 12 Mar 2013 15:18:02 +0000 (16:18 +0100)]
xentrace: fix off-by-one in calculate_tbuf_size
Commit "xentrace: reduce trace buffer size to something mfn_offset can
reach" contains an off-by-one bug. max_mfn_offset needs to be reduced by
exactly the value of t_info_first_offset.
If the system has two cpus and the number of requested trace pages is
very large, the final number of trace pages + the offset will not fit
into a short. As a result the variable offset in alloc_trace_bufs() will
wrap while allocating buffers for the second cpu. Later
share_xen_page_with_privileged_guests() will be called with a wrong page
and the ASSERT in this function triggers. If the ASSERT is ignored by
running a non-dbg hypervisor the asserts in xentrace itself trigger
because "cons" is not aligned because the very last trace page for the
second cpu is a random mfn.
Thanks to Jan for the quick analysis.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
master changeset: d9fb28ae6d41c8201482948660e52889481830dd
master date: 2013-03-04 13:42:17 +0100
George Dunlap [Tue, 12 Mar 2013 15:17:23 +0000 (16:17 +0100)]
credit1: Use atomic bit operations for the flags structure
The flags structure is not protected by locks (or more precisely,
it is protected using an inconsistent set of locks); we therefore need
to make sure that all accesses are atomic-safe. This is particularly
important in the case of the PARKED flag, which if clobbered while
changing the YIELD bit will leave a vcpu wedged in an offline state.
Using the atomic bitops also requires us to change the size of the "flags"
element.
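A hedged sketch of the conversion (flag names illustrative):

    /* Illustrative: a plain read-modify-write on an unlocked flags word
     * can clobber a concurrent update of another bit ... */
    svc->flags |= CSCHED_FLAG_YIELD;              /* unsafe */

    /* ... whereas atomic bit operations preserve concurrent updates; this
     * also requires 'flags' to be wide enough for the bitop helpers
     * (e.g. an unsigned long). */
    set_bit(CSCHED_FLAG_YIELD_BIT, &svc->flags);
    clear_bit(CSCHED_FLAG_YIELD_BIT, &svc->flags);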
Spotted-by: Igor Pavlikevich <ipavlikevich@gmail.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
master changeset: be6507509454adf3bb5a50b9406c88504e996d5a
master date: 2013-03-04 13:37:39 +0100
Jan Beulich [Tue, 12 Mar 2013 15:16:41 +0000 (16:16 +0100)]
x86: defer processing events on the NMI exit path
Otherwise, we may end up in the scheduler, keeping NMIs masked for a
possibly unbounded period of time (until whenever the next IRET gets
executed). Enforce timely event processing by sending a self IPI.
Of course it's open for discussion whether to always use the straight
exit path from handle_ist_exception.
Jan Beulich [Tue, 12 Mar 2013 15:15:18 +0000 (16:15 +0100)]
SEDF: avoid gathering vCPU-s on pCPU0
The introduction of vcpu_force_reschedule() in 14320:215b799fa181 was
incompatible with the SEDF scheduler: Any vCPU using
VCPUOP_stop_periodic_timer (e.g. any vCPU of half way modern PV Linux
guests) ends up on pCPU0 after that call. Obviously, running all PV
guests' (and namely Dom0's) vCPU-s on pCPU0 causes problems for those
guests rather sooner than later.
So the main thing that was clearly wrong (and bogus from the beginning)
was the use of cpumask_first() in sedf_pick_cpu(). It is being replaced
by a construct that prefers to put back the vCPU on the pCPU that it
got launched on.
However, there's one more glitch: When reducing the affinity of a vCPU
temporarily, and then widening it again to a set that includes the pCPU
that the vCPU was last running on, the generic scheduler code would not
force a migration of that vCPU, and hence it would forever stay on the
pCPU it last ran on. Since that can again create a load imbalance, the
SEDF scheduler wants a migration to happen regardless of it being
apparently unnecessary.
Of course, an alternative to checking for SEDF explicitly in
vcpu_set_affinity() would be to introduce a flags field in struct
scheduler, and have SEDF set a "always-migrate-on-affinity-change"
flag.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Tue, 12 Mar 2013 15:13:54 +0000 (16:13 +0100)]
x86: make certain memory sub-ops return valid values
When a domain's shared info field "max_pfn" is zero,
domain_get_maximum_gpfn() so far returned ULONG_MAX, which
do_memory_op() in turn converted to -1 (i.e. -EPERM). Make the former
always return a sensible number (i.e. zero if the field was zero) and
have the latter no longer truncate return values.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master changeset: 7ffc9779aa5120c5098d938cb88f69a1dda9a0fe
master date: 2013-03-04 10:16:04 +0100
Juergen Gross [Tue, 12 Mar 2013 15:11:55 +0000 (16:11 +0100)]
Avoid stale pointer when moving domain to another cpupool
When a domain is moved to another cpupool the scheduler private data pointers
in vcpu and domain structures must never point to an already freed memory
area.
While at it, simplify sched_init_vcpu() by using DOM2OP instead of VCPU2OP.
Tim Deegan [Tue, 12 Mar 2013 15:10:23 +0000 (16:10 +0100)]
vmx: fix handling of NMI VMEXIT.
Call do_nmi() directly and explicitly re-enable NMIs rather than
raising an NMI through the APIC. Since NMIs are disabled after the
VMEXIT, the raised NMI would be blocked until the next IRET
instruction (i.e. the next real interrupt, or after scheduling a PV
guest) and in the meantime the guest will spin taking NMI VMEXITS.
Also, handle NMIs before re-enabling interrupts, since if we handle an
interrupt (and therefore IRET) before calling do_nmi(), we may end up
running the NMI handler with NMIs enabled.
Signed-off-by: Tim Deegan <tim@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master changeset: 7dd3b06ff031c9a8c727df16c5def2afb382101c
master date: 2013-02-28 14:00:18 +0000
Tim Deegan [Fri, 8 Mar 2013 09:56:24 +0000 (10:56 +0100)]
x86/setup: don't relocate the VGA hole.
Copying the contents of the VGA hole is at best pointless and at worst
dangerous. Booting Xen on Xen, it causes a very long delay as each
byte is referred to qemu.
Since we were already discarding the first 1MB of the relocated area,
just avoid copying it in the first place.
Reported-by: Jon Ludlam <jonathan.ludlam@eu.citrix.com> Signed-off-by: Tim Deegan <tim@xen.org>
master changeset: 0b76ce20de85ad7c23c47ee3275020859b91d46b
master date: 2013-02-14 12:20:58 +0000
Jan Beulich [Thu, 7 Mar 2013 17:01:13 +0000 (18:01 +0100)]
x86: fix CMCI injection
This fixes the wrong use of literal vector 0xF7 with an "int"
instruction (invalidated by 25113:14609be41f36) and the fact that doing
the injection via a software interrupt was never valid anyway (because
cmci_interrupt() acks the LAPIC, which does the wrong thing if the
interrupt didn't get delivered through it).
In order to do the latter, the patch introduces send_IPI_self(), at once
removing two open-coded uses of "genapic" in the IRQ handling code.
The IOMMU may stop processing page translations due to a perceived lack
of credits for writing upstream peripheral page service request (PPR)
or event logs. If the L2B miscellaneous clock gating feature is enabled
the IOMMU does not properly register credits after the log request has
completed, leading to a potential system hang.
BIOSes are supposed to disable L2B miscellaneous clock gating by setting
L2_L2B_CK_GATE_CONTROL[CKGateL2BMiscDisable](D0F2xF4_x90[2]) = 1b. This
patch corrects that for those which do not enable this workaround.
Jan Beulich [Thu, 7 Mar 2013 16:58:19 +0000 (17:58 +0100)]
honor ACPI v4 FADT flags
- force use of physical APIC mode if indicated so (as we don't support
xAPIC cluster mode, the respective flag is taken to force physical
mode too)
- don't use MSI if indicated so (implies no IOMMU)
Both can be overridden on the command line; for the MSI case this at
once adds a new command line option allowing PCI MSI to be turned off (IOMMU
and HPET are unaffected by this).
Jan Beulich [Thu, 7 Mar 2013 16:57:02 +0000 (17:57 +0100)]
x86/nhvm: properly clean up after failure to set up all vCPU-s
Otherwise we may leak memory when setting up nHVM fails half way.
This implies that the individual destroy functions will have to remain
capable (in the VMX case they first need to be made so, following
26486:7648ef657fe7 and 26489:83a3fa9c8434) of being called for a vCPU
that the corresponding init function was never run on.
Once at it, also remove a redundant check from the corresponding
parameter validation code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> Tested-by: Olaf Hering <olaf@aepfle.de>
master changeset: 17281aea1a9a10f1ee165c6e6a2921a67b7b1df2
master date: 2013-02-22 11:21:38 +0100
Tim Deegan [Thu, 7 Mar 2013 16:55:16 +0000 (17:55 +0100)]
x86/mm: Take the p2m lock even in shadow mode.
The reworking of p2m lookups to use get_gfn()/put_gfn() left the
shadow code not taking the p2m lock, even in cases where the p2m would
be updated (i.e. PoD).
In many cases, shadow code doesn't need the exclusion that
get_gfn()/put_gfn() provides, as it has its own interlocks against p2m
updates, but this is taking things too far, and can lead to crashes in
the PoD code.
Now that most shadow-code p2m lookups are done with explicitly
unlocked accessors, or with the get_page_from_gfn() accessor, which is
often lock-free, we can just turn this locking on.
The remaining locked lookups are in sh_page_fault() (in a path that's
almost always already serializing on the paging lock), and in
emulate_map_dest() (which can probably be updated to use
get_page_from_gfn()). They're not addressed here but may be in a
follow-up patch.
Signed-off-by: Tim Deegan <tim@xen.org> Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
master changeset: a15d87475ed95840dba693ab0a56d0b48a215cbc
master date: 2013-02-21 15:16:20 +0000
Fabio Fantoni [Fri, 1 Mar 2013 13:22:21 +0000 (13:22 +0000)]
tools/libxl: Improve videoram setting
- If the videoram setting is less than 8 MB, show an error and exit.
- Added videoram setting for qemu upstream with cirrus (added in qemu 1.3).
- Updated the xl.cfg man page.
- Default and minimal videoram changed to 16 MB if stdvga is set and upstream
qemu is being used. This is required by qemu 1.4 to avoid a Xen memory error
(qemu 1.3 doesn't complain about it, probably buggy).
Signed-off-by: Fabio Fantoni <fabio.fantoni@heliman.it> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Cherry picked from xen-unstable 2e814a017155b885e4d4b5a88dc05e7367a9722a,
backport as follows:
Signed-off-by: Fabio Fantoni <fabio.fantoni@heliman.it> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Michael Young [Wed, 13 Feb 2013 17:00:15 +0000 (17:00 +0000)]
tools: Fix memset(&p,0,sizeof(p)) idiom in several places.
gcc 4.8 identifies several places where code of the form memset(x, 0,
sizeof(x)); is used incorrectly, meaning that less memory is set to
zero than required.
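The class of bug gcc is warning about can be illustrated with a hedged, self-contained example (not one of the actual call sites):

    #include <stdlib.h>
    #include <string.h>

    struct foo { int a, b, c[16]; };

    int main(void)
    {
        struct foo *p = malloc(sizeof(*p));
        if (!p)
            return 1;
        memset(p, 0, sizeof(p));   /* wrong: zeroes only 4 or 8 bytes (the pointer's size) */
        memset(p, 0, sizeof(*p));  /* right: zeroes the whole structure */
        free(p);
        return 0;
    }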
Signed-off-by: Michael Young <m.a.young@durham.ac.uk> Committed-by: Keir Fraser <keir@xen.org>
(cherry picked from commit d119301b5816b39b5ba722a2f8b301b37e8e34bd)
Ian Jackson [Wed, 23 Jan 2013 16:53:11 +0000 (16:53 +0000)]
libxl: fix stale timeout event callback race
Because there is not necessarily any lock held at the point the
application (eg, libvirt) calls libxl_osevent_occurred_timeout, in a
multithreaded program those calls may be arbitrarily delayed in
relation to other activities within the program.
Specifically this means when ->timeout_deregister returns, libxl does
not know whether it can safely dispose of the for_libxl value or
whether it needs to retain it in case of an in-progress call to
_occurred_timeout.
The interface could be fixed by requiring the application to make a
new call into libxl to say that the deregistration was complete.
However that new call would have to be threaded through the
application's event loop; this is complicated and some application
authors are likely not to implement it properly. Furthermore the
easiest way to implement this facility in most event loops is to queue
up a time event for "now".
Shortcut all of this by having libxl always call timeout_modify
setting abs={0,0} (ie, ASAP) instead of timeout_deregister. This will
cause the application to call _occurred_timeout. When processing this
calldown we see that we were no longer actually interested and simply
throw it away.
Additionally, there is a race between _occurred_timeout and
->timeout_modify. If libxl ever adjusts the deadline for a timeout
the application may already be in the process of calling _occurred, in
which case the situation with for_app's lifetime becomes very
complicated. Therefore abolish libxl__ev_time_modify_{abs,rel} (which
have no callers) and promise to the application only ever to call
->timeout_modify with abs=={0,0}. The application still needs to cope
with ->timeout_modify racing with its internal function which calls
_occurred_timeout. Document this.
This is a forwards-compatible change for applications using the libxl
API, and will hopefully eliminate these races in callback-supplying
applications (such as libvirt) without the need for corresponding
changes to the application. (It is possible that this might expose
bugs in applications, though, as previously libxl would never call
libxl_osevent_hooks->timeout_modify and now it never calls
->timeout_deregister).
For clarity, fold the body of time_register_finite into its one
remaining call site. This makes the semantics of ev->infinite
slightly clearer.
Cc: Bamvor Jian Zhang <bjzhang@suse.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Tested-by: Jim Fehlig <jfehlig@suse.com> Acked-by: Jim Fehlig <jfehlig@suse.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Wed, 23 Jan 2013 16:53:11 +0000 (16:53 +0000)]
libxl: fix stale fd event callback race
Because there is not necessarily any lock held at the point the
application (eg, libvirt) calls libxl_osevent_occurred_timeout and
..._fd, in a multithreaded program those calls may be arbitrarily
delayed in relation to other activities within the program.
libxl therefore needs to be prepared to receive very old event
callbacks. Arrange for this to be the case for fd callbacks.
This requires a new layer of indirection through a "hook nexus" struct
which can outlive the libxl__ev_foo. Allocation and deallocation of
these nexi is mostly handled in the OSEVENT macros which wrap up
the application's callbacks.
Document the problem and the solution in a comment in libxl_event.c
just before the definition of struct libxl__osevent_hook_nexus.
There is still a race relating to libxl__osevent_occurred_timeout;
this will be addressed in the following patch.
Reported-by: Bamvor Jian Zhang <bjzhang@suse.com> Cc: Bamvor Jian Zhang <bjzhang@suse.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Tested-by: Jim Fehlig <jfehlig@suse.com> Acked-by: Jim Fehlig <jfehlig@suse.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Backport-requested-by: Stefano Stabellini <Stefano.Stabellini@eu.citrix.com> Backported-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Wed, 23 Jan 2013 16:53:10 +0000 (16:53 +0000)]
libxl: rename "abs" variables to "absolute"
No functional change.
The purpose is to make it easier to backport patches from Xen 4.3's
libxl, as Xen 4.3's libxl has had this done:
libxl: Enable -Wshadow.
It was convenient to invent $(CFLAGS_LIBXL) to do this.
Various renamings to avoid shadowing standard functions:
- index(3)
- listen(2)
- link(2)
- abort(3)
- abs(3)
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
In this patch we do not change the others, and we do not enable
-Wshadow. We're just trying to bring 4.2's libxl textually closer to
4.3's.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>