Ian Campbell [Fri, 21 Dec 2012 17:05:38 +0000 (17:05 +0000)]
tools/tests: Restrict some tests to x86 only
MCE injection and x86_emulator are clearly x86 specific.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Wed, 19 Dec 2012 16:04:49 +0000 (16:04 +0000)]
xen: remove nr_irqs_gsi from generic code
The concept is X86 specific.
AFAICT the generic concept here is the number of static physical IRQs
which the current hardware has, so call this nr_static_irqs.
Also using "defined NR_IRQS" as a standin for x86 might have made
sense at one point but its just cleaner to push the necessary
definitions into asm/irq.h.
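As a rough, illustrative sketch (not the literal patch) of what pushing
the definitions into asm/irq.h implies, with the x86 mapping being an
assumption based on the description above:

    /* xen/include/asm-x86/irq.h -- sketch only */
    #define nr_static_irqs nr_irqs_gsi   /* x86: the static IRQs are the GSIs */

    /* xen/include/asm-arm/irq.h -- sketch only */
    #define nr_static_irqs NR_IRQS       /* arm: a fixed count of physical lines */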
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Wed, 19 Dec 2012 14:33:24 +0000 (14:33 +0000)]
libxl: move definition of libxl_domain_config into the IDL
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Wed, 19 Dec 2012 14:16:30 +0000 (14:16 +0000)]
xen: arm: introduce arm32 as a subarch of arm.
- move 32-bit specific files into subarch specific arm32 subdirectory.
- move gic.h to xen/include/asm-arm (it is needed from both subarch
and generic code).
- make the appropriate build and config file changes to support
XEN_TARGET_ARCH=arm32.
This prepares us for an eventual 64-bit subarch.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Wed, 19 Dec 2012 14:16:29 +0000 (14:16 +0000)]
xen: arm: reorder registers in struct cpu_user_regs.
Primarily this is so that they are ordered in the same way as the
mapping from arm64 x0..x31 registers to the arm32 registers, which is
just less confusing for everyone going forward.
It also makes the implementation of select_user_regs in the next patch
slightly simpler.
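For illustration only, a heavily abridged sketch of the idea; the real
struct cpu_user_regs has more fields (banked SPs/LRs, SPSRs, ...) and
this is not the literal layout from the patch:

    #include <stdint.h>

    /* Sketch: arm32 registers laid out in the order in which they alias
     * the arm64 x-register view, so one layout serves both subarches. */
    struct cpu_user_regs_sketch {
        uint32_t r0, r1, r2, r3, r4, r5, r6, r7;
        uint32_t r8, r9, r10, r11, r12;
        uint32_t sp;     /* r13 */
        uint32_t lr;     /* r14 */
        uint32_t pc;     /* r15 */
        uint32_t cpsr;
    };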
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Wed, 19 Dec 2012 14:16:23 +0000 (14:16 +0000)]
xen: arm: stub page_is_ram_type.
Callers are VT-d (so x86 specific) and various bits of page offlining
support which, although it looks generic (and is in xen/common), does
things like dive into page_info->count_info, which is not generic.
In any case this is only reachable via XEN_SYSCTL_page_offline_op,
which clearly shouldn't be called on ARM just yet.
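A minimal sketch of what such a stub can look like; the prototype
mirrors the x86 one and should be treated as an assumption rather than
the exact code added:

    /* Sketch of an ARM stub: nothing should be able to reach this yet,
     * since the only path in is XEN_SYSCTL_page_offline_op. */
    int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
    {
        ASSERT(0);
        return 0;
    }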
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Wed, 19 Dec 2012 14:16:22 +0000 (14:16 +0000)]
xen: arm: stub out wallclock time.
We don't currently have much concept of wallclock time on ARM (for
either the hypervisor, dom0 or guests). For now just stub everything
out. Specifically domain_set_time_offset, update_vcpu_system_time and
wallclock_time.
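A hedged sketch of the kind of stubs this implies; the signatures are
assumed from the common-code callers and may not match the patch
exactly:

    /* No wallclock on ARM yet: accept and ignore. */
    void domain_set_time_offset(struct domain *d, int32_t time_offset_seconds)
    {
        /* nothing to do */
    }

    void update_vcpu_system_time(struct vcpu *v)
    {
        /* nothing to do */
    }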
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Wed, 19 Dec 2012 14:16:18 +0000 (14:16 +0000)]
xen: arm: Call init_xen_time earlier
If we panic before calling init_xen_time then the "Rebooting in 5
seconds" delay ends up calling udelay which uses cntfrq before it has
been initialised resulting in a divide by zero.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Andre Przywara [Wed, 19 Dec 2012 10:42:09 +0000 (11:42 +0100)]
x86, amd: Disable way access filter on Piledriver CPUs
The Way Access Filter in recent AMD CPUs may hurt the performance of
some workloads, caused by aliasing issues in the L1 cache.
This patch disables it on the affected CPUs.
The issue is similar to the one from last year:
http://lkml.indiana.edu/hypermail/linux/kernel/1107.3/00041.html
This new patch does not replace the old one; we just need another
quirk for newer CPUs.
The performance penalty without the patch depends on the
circumstances, but is a bit less than the last year's 3%.
The workloads affected would be those that access code from the same
physical page under different virtual addresses, so different
processes using the same libraries with ASLR or multiple instances of
PIE-binaries. The code needs to be accessed simultaneously from both
cores of the same compute unit.
More details can be found here:
http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf
CPUs affected are anything with the core known as Piledriver.
That includes the new parts of the AMD A-Series (aka Trinity) and the
just released new CPUs of the FX-Series (aka Vishera).
The model numbering is a bit odd here: FX CPUs have model 2,
A-Series has model 10h, with possible extensions to 1Fh. Hence the
range of model ids.
Signed-off-by: Andre Przywara <osp@andrep.de>
Add and use MSR_AMD64_IC_CFG. Update the value whenever it is found to
not have all bits set, rather than just when it's zero.
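A hedged sketch of the quirk's shape; MSR_AMD64_IC_CFG is the MSR named
in the note above, but the DIS_WAF_MASK value and the helper name here
are illustrative assumptions, not taken from the patch:

    #define MSR_AMD64_IC_CFG  0xc0011021
    #define DIS_WAF_MASK      (3ULL << 1)   /* placeholder: the real bits differ */

    static void amd_check_way_access_filter(const struct cpuinfo_x86 *c)
    {
        uint64_t val;

        /* Piledriver: family 15h, models 02h and 10h..1Fh per the text above. */
        if ( c->x86 != 0x15 || c->x86_model < 0x02 || c->x86_model > 0x1f )
            return;

        rdmsrl(MSR_AMD64_IC_CFG, val);
        if ( (val & DIS_WAF_MASK) != DIS_WAF_MASK )
            wrmsrl(MSR_AMD64_IC_CFG, val | DIS_WAF_MASK);
    }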
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Committed-by: Jan Beulich <jbeulich@suse.com>
Daniel De Graaf [Tue, 18 Dec 2012 18:16:52 +0000 (18:16 +0000)]
xen/arch/*: add struct domain parameter to arch_do_domctl
Since the arch-independent do_domctl function now RCU locks the domain
specified by op->domain, pass the struct domain to the arch-specific
domctl function and remove the duplicate per-subfunction locking.
This also removes two get_domain/put_domain call pairs (in
XEN_DOMCTL_assign_device and XEN_DOMCTL_deassign_device), replacing
them with RCU locking.
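A hedged sketch of the resulting interface; the handle type is assumed
from the surrounding code of that era rather than quoted from the
patch:

    /* Common code RCU-locks op->domain and hands the struct domain down,
     * so arch handlers no longer re-look-up or re-reference the target. */
    long arch_do_domctl(struct xen_domctl *domctl, struct domain *d,
                        XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl);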
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com> Committed-by: Keir Fraser <keir@xen.org>
Daniel De Graaf [Tue, 18 Dec 2012 18:16:13 +0000 (18:16 +0000)]
xen: lock target domain in do_domctl common code
Because almost all domctls need to lock the target domain, do this by
default instead of repeating it in each domctl. This is not currently
extended to the arch-specific domctls, but RCU locks are safe to take
recursively so this only causes duplicate but correct locking.
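A hedged sketch of the common-code pattern being described; the
dispatch helper name is hypothetical and the real function does
considerably more:

    long do_domctl_sketch(struct xen_domctl *op)
    {
        struct domain *d = rcu_lock_domain_by_id(op->domain);
        long ret;

        if ( d == NULL )
            return -ESRCH;

        /* RCU locks nest, so handlers that still lock are merely redundant. */
        ret = domctl_dispatch_sketch(op, d);   /* hypothetical dispatcher */

        rcu_unlock_domain(d);
        return ret;
    }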
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Jan Beulich <jbeulich@suse.com> Committed-by: Keir Fraser <keir@xen.org>
Dongxiao Xu [Tue, 18 Dec 2012 18:14:45 +0000 (18:14 +0000)]
nested vmx: nested TPR shadow/threshold emulation
The TPR shadow/threshold feature is important to speed up boot time
for Windows guests. Besides, it is a must-have feature for certain VMMs.
We map the virtual APIC page address and TPR threshold from the L1
VMCS, and sync them into the shadow VMCS at virtual vmentry.
If a TPR_BELOW_THRESHOLD VM exit is triggered by the L2 guest, we
inject it into the L1 VMM for handling.
This commit also fixes an issue with the APIC access page: if the L1
VMM didn't enable this feature, we need to fill zero into the
shadow VMCS.
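A hedged, illustrative sketch of the virtual-vmentry sync described
above; VIRTUAL_APIC_PAGE_ADDR, TPR_THRESHOLD and APIC_ACCESS_ADDR are
the usual VMCS field names in Xen, while the helper functions are
hypothetical stand-ins:

    static void sync_tpr_shadow_sketch(struct vcpu *v)
    {
        /* Map L1's virtual-APIC page and copy its TPR threshold. */
        __vmwrite(VIRTUAL_APIC_PAGE_ADDR, l1_vapic_maddr(v));   /* hypothetical */
        __vmwrite(TPR_THRESHOLD, l1_tpr_threshold(v));          /* hypothetical */

        /* If L1 left "virtualize APIC accesses" off, the shadow VMCS
         * must carry zero for the APIC-access address. */
        if ( !l1_has_apic_access_page(v) )                      /* hypothetical */
            __vmwrite(APIC_ACCESS_ADDR, 0);
    }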
So that it becomes possible to create scheduler specific trace
records, within each scheduler, without worrying about overlapping,
and also without giving up the ability to recognise them
unambiguously. The latter is deemed useful, since we can have more
than one scheduler running at the same time, thanks to cpupools.
The event ID is 12 bits, and this change uses the upper 3 of them for
the 'scheduler ID'. This means we're limited to 8 schedulers and to
512 scheduler specific tracing events. Both seem reasonable
limitations as of now.
This also converts the existing credit2 tracing (the only scheduler
generating tracing events up to now) to the new system.
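A hedged sketch of the encoding just described (12-bit event IDs, the
top 3 bits carrying a scheduler identifier); the macro names are
illustrative, not necessarily those introduced by the patch:

    #define TRC_SCHED_ID_BITS   3
    #define TRC_SCHED_ID_SHIFT  (12 - TRC_SCHED_ID_BITS)              /* 9 */
    #define TRC_SCHED_EVT_MASK  ((1u << TRC_SCHED_ID_SHIFT) - 1)      /* 0x1ff */

    /* 8 possible schedulers, 512 events each. */
    #define SCHED_EVT(sched_id, evt) \
        (((sched_id) << TRC_SCHED_ID_SHIFT) | ((evt) & TRC_SCHED_EVT_MASK))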
Dario Faggioli [Tue, 18 Dec 2012 18:10:57 +0000 (18:10 +0000)]
xen: sched_credit: improve tickling of idle CPUs
Right now, when a VCPU wakes up, we check whether it should preempt
what is running on the PCPU, and whether or not the waking VCPU can
be migrated (by tickling some idlers). However, this can result in
suboptimal or even wrong behaviour, as explained here:
This change, instead, when deciding which PCPU(s) to tickle upon
VCPU wake-up, considers both what is likely to happen on the PCPU
where the wakeup occurs, and whether or not there are idlers where
the woken-up VCPU can run. In fact, if there are, we can avoid
interrupting the running VCPU. Only if there are no such PCPUs
are preemption and migration the way to go.
This has been tested (on top of the previous change) by running
the following benchmarks inside 2, 6 and 10 VMs, concurrently, on
a shared host, each with 2 VCPUs and 960 MB of memory (host had 16
ways and 12 GB RAM).
Numbers show how the change has either no or very limited impact
(specjbb2005 case) or, when it does have some impact, that it is a
real improvement in performance (sysbench-memory case).
Dario Faggioli [Tue, 18 Dec 2012 18:10:18 +0000 (18:10 +0000)]
xen: sched_credit: improve picking up the idle CPU for a VCPU
In _csched_cpu_pick() we try to select the best possible CPU for
running a VCPU, considering the characteristics of the underlying
hardware (i.e., how many threads, core, sockets, and how busy they
are). What we want is "the idle execution vehicle with the most
idling neighbours in its grouping".
In order to achieve it, we select a CPU from the VCPU's affinity,
giving preference to its current processor if possible, as the basis
for the comparison with all the other CPUs. Problem is, to discount
the VCPU itself when computing this "idleness" (in an attempt to be
fair wrt its current processor), we arbitrarily and unconditionally
consider that selected CPU as idle, even when it is not the case,
for instance:
1. If the CPU is not the one where the VCPU is running (perhaps due
to the affinity being changed);
2. The CPU is where the VCPU is running, but it has other VCPUs in
its runq, so it won't go idle even if the VCPU in question goes.
The 22005(...) line (the first line) means _csched_cpu_pick() was called
on VCPU 1 of domain 10 while it was running on CPU 0, and it chose
CPU 8, which is busy ('|'), even though there are plenty of idle
CPUs. That is because, as a consequence of changing the VCPU affinity,
CPU 8 was chosen as the basis for the comparison, and therefore
considered idle (its bit gets unconditionally set in the bitmask
representing the idle CPUs). The 28004(...) line means the VCPU is woken
up and queued on CPU 8's runq, where it waits for a context switch or
a migration in order to be able to execute.
This change fixes things by only considering the "guessed" CPU idle if
the VCPU in question is both running there and is its only runnable
VCPU.
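A hedged sketch of the corrected seeding of the idler mask; the helper
name is hypothetical and this is not the literal patch:

    static void seed_idlers_sketch(const struct vcpu *vc, unsigned int cpu,
                                   cpumask_t *idlers)
    {
        /* Only pretend 'cpu' is idle if the VCPU really runs there and is
         * the sole runnable VCPU on that runqueue. */
        if ( vc->processor == cpu && runq_is_idle_sketch(cpu) )   /* hypothetical */
            cpumask_set_cpu(cpu, idlers);
    }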
Julien Grall [Mon, 17 Dec 2012 18:04:54 +0000 (18:04 +0000)]
libxenstore: filter watch events in libxenstore when we unwatch
libxenstore queues watch events via a thread and notifies the user.
Sometimes xs_unwatch is called before all related messages have been
read. The use case is non-threaded libevent, where we have two events,
A and B:
- Event A will destroy something and call xs_unwatch;
- Event B is used to notify that a node has changed in XenStore.
As the events are handled one by one, event A can be handled before event B.
So on the next xs_read_watch the user could retrieve a token for an
already-unwatched node, and a segfault occurs if the token stores a
pointer to the structure (ie: "backend:0xcafe").
To avoid problems for existing applications using libxenstore, this behaviour
will only be enabled if XS_UNWATCH_FILTER is given to xs_open.
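A minimal usage sketch; xs_open() and XS_UNWATCH_FILTER are the real
library API (older trees expose the same interface via <xs.h>), and the
wrapper function is just for illustration:

    #include <xenstore.h>

    static struct xs_handle *open_filtered(void)
    {
        /* With XS_UNWATCH_FILTER, events still queued for a token are
         * dropped by the library once xs_unwatch() has been called. */
        return xs_open(XS_UNWATCH_FILTER);
    }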
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Thu, 13 Dec 2012 14:39:31 +0000 (14:39 +0000)]
x86/kexec: Change NMI and MCE handling on kexec path
Experimentally, certain crash kernels will triple fault very early
after starting if started with NMIs disabled. This was discovered
when experimenting with a debug keyhandler which deliberately created
a reentrant NMI, causing stack corruption.
Because of this discovered bug, and that the future changes to the NMI
handling will make the kexec path more fragile, take the time now to
bullet-proof the kexec behaviour to be safer in more circumstances.
This patch adds three new low level routines:
* nmi_crash
This is a special NMI handler for use during a kexec crash.
* enable_nmis
This function enables NMIs by executing an iret-to-self, to
disengage the hardware NMI latch.
* trap_nop
This is a no op handler which irets immediately. It is not declared
with ENTRY() to avoid the extra alignment overhead.
And adds three new IDT entry helper routines:
* _write_gate_lower
This is a substitute for using cmpxchg16b to update a 128bit
structure at once. It assumes that the top 64 bits are unchanged
(and ASSERT()s the fact) and performs a regular write on the lower
64 bits.
* _set_gate_lower
This is functionally equivalent to the already present
_set_gate(), except it uses _write_gate_lower rather than updating
both 64bit values.
* _update_gate_addr_lower
This is designed to update an IDT entry handler only, without
altering any other settings in the entry. It also uses
_write_gate_lower.
The IDT entry helpers are required because:
* It is unsafe to attempt a disable/update/re-enable cycle on the
NMI or MCE IDT entries.
* We need to be able to update NMI handlers without changing the IST
entry.
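A hedged sketch of the _write_gate_lower idea described above; the
field names assume the usual two-quadword idt_entry_t layout and this
is not the literal patch:

    static inline void write_gate_lower_sketch(volatile idt_entry_t *gate,
                                               const idt_entry_t *new)
    {
        /* The caller promises the high quadword is unchanged... */
        ASSERT(gate->b == new->b);
        /* ...so a plain store of the low quadword retargets the handler
         * without a disable/re-enable window on the NMI/MCE gates. */
        gate->a = new->a;
    }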
As a result, the new behaviour of the kexec_crash path is:
nmi_shootdown_cpus() will:
* Disable the crashing cpu's NMI/MCE interrupt stack tables.
Disabling the stack tables removes race conditions which would lead
to corrupt exception frames and infinite loops. As this pcpu is
never planning to execute a sysret back to a pv vcpu, the update is
safe from a security point of view.
* Swap the NMI trap handlers.
The crashing pcpu gets the nop handler, to prevent it getting stuck
in an NMI context, causing a hang instead of a crash. The
non-crashing pcpus all get the nmi_crash handler, which is designed
never to return.
do_nmi_crash() will:
* Save the crash notes and shut the pcpu down.
There is now an extra per-cpu variable to prevent us from
executing this multiple times. In the case where we reenter
midway through, attempt the whole operation again in preference to
not completing it in the first place.
* Set up another NMI at the LAPIC.
Even when the LAPIC has been disabled, the ID and command
registers are still usable. As a result, we can deliberately
queue up a new NMI to re-interrupt us later if NMIs get unlatched.
Because of the call to __stop_this_cpu(), we have to hand craft
self_nmi() to be safe from General Protection Faults.
* Fall into infinite loop.
machine_kexec() will:
* Swap the MCE handlers to be a nop.
We cannot prevent MCEs from being delivered when we pass off to
the crash kernel, and the less Xen context is being touched the
better.
* Explicitly enable NMIs.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Minor style changes.
Robert Phillips [Thu, 13 Dec 2012 12:10:14 +0000 (12:10 +0000)]
x86/mm/hap: Adjust vram tracking to play nicely with log-dirty.
The previous code assumed the guest would be in one of three mutually exclusive
modes for bookkeeping dirty pages: (1) shadow, (2) hap utilizing the log dirty
bitmap to support functionality such as live migrate, (3) hap utilizing the
log dirty bitmap to track dirty vram pages.
Races arose when a guest attempted to track dirty vram while performing live
migrate. (The dispatch table managed by paging_log_dirty_init() might change
in the middle of a log dirty or a vram tracking function.)
This change allows hap log dirty and hap vram tracking to be concurrent.
Vram tracking no longer uses the log dirty bitmap. Instead it detects
dirty vram pages by examining their p2m type. The log dirty bitmap is only
used by the log dirty code. Because the two operations use different
mechanisms, they are no longer mutually exclusive.
Signed-Off-By: Robert Phillips <robert.phillips@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
Minor whitespace changes to conform with coding style. Signed-off-by: Tim Deegan <tim@xen.org> Committed-by: Tim Deegan <tim@xen.org>
Daniel De Graaf [Thu, 13 Dec 2012 11:44:02 +0000 (11:44 +0000)]
libxl: introduce XSM relabel on build
Allow a domain to be built under one security label and run using a
different label. This can be used to prevent the domain builder or
control domain from having the ability to access a guest domain's memory
via map_foreign_range except during the build process where this is
required.
Example domain configuration snippet:
seclabel='customer_1:vm_r:nomigrate_t'
init_seclabel='customer_1:vm_r:nomigrate_t_building'
Note: this does not provide complete protection from a malicious dom0;
mappings created during the build process may persist after the relabel,
and could be used to indirectly access the guest's memory. However, if
dom0 correctly unmaps the domain upon building, the domU is protected
against dom0 becoming malicious in the future.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
acked-by: Ian Campbell <ian.campbell@citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Thu, 13 Dec 2012 11:44:01 +0000 (11:44 +0000)]
libxl: qemu trad logdirty: Tolerate ENOENT on ret path
It can happen in error conditions that lds->ret_path doesn't exist,
and libxl__xs_read_checked signals this by setting got_ret=NULL. If
this happens, fail without crashing.
Reported-by: Alex Bligh <alex@alex.org.uk>, Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Get the addresses of the GIC distributor, CPU, virtual and virtual CPU
interface registers from the device tree.
Note: I couldn't completely get rid of GIC_BASE_ADDRESS, GIC_DR_OFFSET
and friends because we use them from mode_switch.S, which is executed
before the device tree has been parsed. But at least mode_switch.S is
known to contain vexpress-specific code anyway.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Jan Beulich [Thu, 13 Dec 2012 10:22:54 +0000 (11:22 +0100)]
vscsiif: allow larger segments-per-request values
At least certain tape devices require fixed size blocks to be operated
upon, i.e. breaking up of I/O requests is not permitted. Consequently
we need an interface extension that (leaving aside implementation
limitations) doesn't impose a limit on the number of segments that can
be associated with an individual request.
This, in turn, excludes the blkif extension FreeBSD folks implemented,
as that still imposes an upper limit (the actual I/O request still
specifies the full number of segments - as an 8-bit quantity -, and
subsequent ring slots get used to carry the excess segment
descriptors).
The alternative therefore is to allow the frontend to pre-set segment
descriptors _before_ actually issuing the I/O request. I/O will then
be done by the backend for the accumulated set of segments.
To properly associate segment preset operations with the main request,
the rqid-s between them should match (originally I had hoped to use
this to avoid producing individual responses for the pre-set
operations, but that turned out to violate the underlying shared ring
implementation).
Negotiation of the maximum number of segments a particular backend
implementation supports happens through a new "segs-per-req" xenstore
node.
Charles Arnold [Tue, 11 Dec 2012 12:49:39 +0000 (13:49 +0100)]
x86/EFI: work around CFLAGS being passed in through environment
Short of a solution to the problem described in
http://lists.xen.org/archives/html/xen-devel/2012-12/msg00648.html,
deal with the bad effect this together with c/s 25751:02b4d5fedb7b has
on the EFI build by filtering out the problematic command line items.
Signed-off-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Committed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 11 Dec 2012 12:47:53 +0000 (13:47 +0100)]
x86: frame table related improvements
- fix super page frame table setup for memory hotplug case (should
create full table, or else the hotplug code would need to do the
necessary table population)
- simplify super page frame table setup (can re-use frame table setup
code)
- slightly streamline frame table setup code
- fix (tighten) a BUG_ON() and an ASSERT() condition
- fix spage <-> pdx conversion macros (they had no users so far, and
hence no-one noticed how broken they were)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Dan Magenheimer [Mon, 10 Dec 2012 11:15:53 +0000 (11:15 +0000)]
xen: centralize accounting for domain tot_pages
Provide and use a common function for all adjustments to a
domain's tot_pages counter in anticipation of future and/or
out-of-tree patches that must adjust related counters
atomically.
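A hedged sketch of the shape of such a helper; domain_adjust_tot_pages()
is how this function is commonly known in the tree, but treat the body
below as illustrative rather than the literal patch:

    static inline unsigned long domain_adjust_tot_pages(struct domain *d,
                                                        long pages)
    {
        ASSERT(spin_is_locked(&d->page_alloc_lock));
        return d->tot_pages += pages;
    }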
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Committed-by: Keir Fraser <keir@xen.org>
Jan Beulich [Mon, 10 Dec 2012 10:18:25 +0000 (11:18 +0100)]
streamline guest copy operations
- use the variants not validating the VA range when writing back
structures/fields to the same space that they were previously read
from
- when only a single field of a structure actually changed, copy back
just that field where possible
- consolidate copying back results in a few places
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Mon, 10 Dec 2012 10:16:37 +0000 (11:16 +0100)]
x86/oprofile: adjust CPU specific initialization
Drop support for 32-bit only CPU models as well as those that can be
dealt with by the arch_perfmon bits. Models 14 and 15 remain
questionable (I'm not 100% positive that these don't support 64-bit
mode).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Mon, 10 Dec 2012 10:14:27 +0000 (11:14 +0100)]
scheduler: fix rate limit range checking
For one, neither of the two checks permitted the documented value
of zero (which disables the functionality altogether).
Second, the range checking of the command line parameter was done by
the credit scheduler's initialization code, despite it being a generic
scheduler option.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Dongxiao Xu [Thu, 6 Dec 2012 16:59:27 +0000 (16:59 +0000)]
nested vmx: check host ability when intercept MSR read
When the guest hypervisor tries to read an MSR value, we intercept this
behavior and return certain emulated values. Besides that, we also
need to ensure that those emulated values are compatible with the
host's capabilities.
Dongxiao Xu [Thu, 6 Dec 2012 16:58:59 +0000 (16:58 +0000)]
nested vmx: fix interrupt delivery to L2 guest
While delivering an interrupt to the L2 guest, the L0 hypervisor needs
to check whether the L1 hypervisor wants to own the interrupt; if not,
it directly injects the interrupt into the L2 guest.
Dongxiao Xu [Thu, 6 Dec 2012 16:57:55 +0000 (16:57 +0000)]
nested vmx: enable "Virtualize APIC accesses" feature for L1 VMM
If the "Virtualize APIC accesses" feature is enabled, we need to sync
the APIC-access address from virtual vvmcs into shadow vmcs when doing
virtual_vmentry.
Dongxiao Xu [Thu, 6 Dec 2012 16:56:49 +0000 (16:56 +0000)]
nested vmx: fix DR access VM exit
For the DR registers we use a lazy restore mechanism when accessing
them. Therefore, when receiving such a VM exit, L0 should be responsible
for switching to the right DR values and then injecting the exit into
the L1 hypervisor.
Dongxiao Xu [Thu, 6 Dec 2012 16:54:26 +0000 (16:54 +0000)]
nested vmx: fix rflags status in virtual vmexit
As stated in the SDM, all bits in rflags (except for those that are
reserved as 1) are set to 0 on VM exit. Therefore we need to follow
this logic in virtual_vmexit.
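A hedged one-step sketch of that logic; X86_EFLAGS_MBS (bit 1, the
always-set bit) is assumed to be the constant used:

    static void virtual_vmexit_rflags_sketch(struct cpu_user_regs *regs)
    {
        /* Everything else in rflags is architecturally cleared on VM exit. */
        regs->eflags = X86_EFLAGS_MBS;   /* 0x2 */
    }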
Dongxiao Xu [Thu, 6 Dec 2012 16:52:50 +0000 (16:52 +0000)]
nested vmx: emulate MSR bitmaps
To emulate MSR bitmaps in nested vmx, the L0 hypervisor disables the
MSR_BITMAP feature so that every MSR access by the L2 guest causes a
VM exit to L0. When handling such a VM exit, L0 checks whether the L1
hypervisor uses the MSR_BITMAP feature and whether the corresponding
bit is set to 1. If so, L0 injects the VM exit into the L1 hypervisor;
otherwise, L0 is responsible for handling the VM exit itself.
Jan Beulich [Thu, 6 Dec 2012 13:19:15 +0000 (14:19 +0100)]
memop: adjust error checking in populate_physmap()
Checking that multi-page allocations are permitted is unnecessary for
PoD population operations. Instead, the (loop invariant) check added
for addressing XSA-31 can be moved here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Liu Jinsong [Thu, 6 Dec 2012 10:47:22 +0000 (10:47 +0000)]
X86/vMCE: handle broken page with regard to migration
At the sender
xc_domain_save has a key point: querying the types of all the pages
with xc_get_pfn_type_batch.
1) If a broken page occurs before this point, migration will be fine,
since the proper pfn_type and pfn number will be transferred to the
target, which can then take appropriate action;
2) If a broken page occurs after this point, the whole system will
crash and migration no longer matters.
At the target
The target populates pages for the guest. As for a broken page, we
prefer to keep its type for the sake of seamless migration.
The target will set the p2m type to p2m_ram_broken for the broken page.
If the guest accesses the broken page again it will kill itself, as
expected.
Suggested-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Liu Jinsong <jinsong.liu@intel.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Committed-by: Ian Campbell <ian.campbell@citrix.com>
George Dunlap [Thu, 6 Dec 2012 10:19:08 +0000 (10:19 +0000)]
libxl: Make an internal function explicitly check existence of expected paths
libxl__device_disk_from_xs_be() was failing without error for some
missing xenstore nodes in a backend, while assuming (without checking)
that other nodes were valid, causing a crash when another internal
error wrote these nodes in the wrong place.
Make this function consistent by:
* Checking the existence of all nodes before using them
* Choosing a default only when the node is not written in device_disk_add()
* Failing with a log message if any node written by device_disk_add() is not present
* Returning an error on failure
* Disposing of the structure before returning, using libxl_device_disk_dispose()
Also make the callers of the function pay attention to the error and
behave appropriately. In the case of libxl__append_disk_list_of_type(),
this means only incrementing *ndisks as the disk structures are
successfully initialized.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
xen/arm: disable interrupts on return_to_hypervisor
At the moment it is possible to reach return_to_hypervisor with
interrupts enabled (it happens every time we are actually going
back to hypervisor mode, i.e. when we don't take the return_to_guest path).
If that happens we risk losing the content of ELR_hyp: if we receive an
interrupt right after restoring ELR_hyp, once we come back we'll have a
different value in ELR_hyp and the original is lost.
In order to make the return_to_hypervisor path safe, we disable
interrupts before restoring any registers.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Samuel Thibault [Thu, 6 Dec 2012 09:22:31 +0000 (09:22 +0000)]
mini-os: drop shutdown variables when CONFIG_XENBUS=n
Shutdown variables are meaningless when CONFIG_XENBUS=n, since no
shutdown event will ever happen. Better to make sure that no code tries
to use them and waits forever for a shutdown event that will never arrive.
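A hedged sketch of the shape of the change; the identifiers are
hypothetical placeholders, the point being simply that the declarations
disappear when CONFIG_XENBUS is off so stray users fail at build time:

    #ifdef CONFIG_XENBUS
    /* Only meaningful when a xenbus connection can deliver the event. */
    extern volatile int shutdown_requested;   /* hypothetical name */
    extern void fini_shutdown(void);          /* hypothetical name */
    #endif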
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Committed-by: Keir Fraser <keir@xen.org>
Jan Beulich [Wed, 5 Dec 2012 08:52:14 +0000 (09:52 +0100)]
IOMMU/ATS: fix maximum queue depth calculation
The capabilities register field is a 5-bit value, and the 5 bits all
being zero actually means 32 entries.
Under the assumption that amd_iommu_flush_iotlb() really just tried
to correct for the miscalculation above when adding 32 to the value,
that adjustment is also being removed.
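A hedged sketch of the corrected calculation; the shift macro is an
illustrative placeholder:

    static unsigned int ats_queue_depth_sketch(uint32_t ats_cap)
    {
        unsigned int depth = (ats_cap >> ATS_QUEUE_DEPTH_SHIFT) & 0x1f; /* hypothetical shift */

        /* A 5-bit field, where the all-zero encoding means 32 entries. */
        return depth ? depth : 32;
    }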
Jan Beulich [Tue, 4 Dec 2012 18:38:26 +0000 (18:38 +0000)]
memop: limit guest specified extent order
Allowing unbounded order values here causes almost unbounded loops
and/or partially incomplete requests, particularly in PoD code.
The added range checks in populate_physmap(), decrease_reservation(),
and the "in" one in memory_exchange() architecturally all could use
PADDR_BITS - PAGE_SHIFT, and are being artificially constrained to
MAX_ORDER.
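A hedged sketch of the kind of bound being enforced (not the literal
check added by the patch):

    static bool_t extent_order_ok_sketch(unsigned int extent_order)
    {
        /* PADDR_BITS - PAGE_SHIFT would be the architectural limit, but
         * MAX_ORDER is the practical cap applied here. */
        return extent_order <= MAX_ORDER;
    }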
This is XSA-31 / CVE-2012-5515.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson.citrix.com>
Jan Beulich [Tue, 4 Dec 2012 18:38:20 +0000 (18:38 +0000)]
xen: fix error handling of guest_physmap_mark_populate_on_demand()
The only user of the "out" label bypasses a necessary unlock, thus
enabling the caller to lock up Xen.
Also, the function was never meant to be called by a guest for itself,
so rather than inspecting the code paths in depth for potential other
problems this might cause, and adjusting e.g. the non-guest printk()
in the above error path, just disallow the guest access to it.
Finally, the printk() (considering its potential of spamming the log,
the more that it's not using XENLOG_GUEST), is being converted to
P2M_DEBUG(), as debugging is what it apparently was added for in the
first place.
This is XSA-30 / CVE-2012-5514.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson.citrix.com>
Jan Beulich [Tue, 4 Dec 2012 18:38:14 +0000 (18:38 +0000)]
xen: add missing guest address range checks to XENMEM_exchange handlers
Ever since its existence (3.0.3 iirc) the handler for this has been
using non address range checking guest memory accessors (i.e.
the ones prefixed with two underscores) without first range
checking the accessed space (via guest_handle_okay()), allowing
a guest to access and overwrite hypervisor memory.
This is XSA-29 / CVE-2012-5513.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson.citrix.com>
Tim Deegan [Tue, 4 Dec 2012 18:38:05 +0000 (18:38 +0000)]
hvm: Limit the size of large HVM op batches
Doing large p2m updates for HVMOP_track_dirty_vram without preemption
ties up the physical processor. Integrating preemption into the p2m
updates is hard, so simply limit the batch to 1GB, which is sufficient
for a 15000 * 15000 * 32bpp framebuffer.
For HVMOP_modified_memory and HVMOP_set_mem_type, add the necessary
machinery to handle preemption.
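A hedged, illustrative sketch of the preemption pattern for the
HVMOP_modified_memory / HVMOP_set_mem_type case; the structure fields
follow the public hvm_op interface, while the chunking details are
assumptions rather than the literal patch:

    static int modified_memory_chunk_sketch(struct xen_hvm_modified_memory *a)
    {
        while ( a->nr > 0 )
        {
            /* ... mark a->first_pfn dirty in the log-dirty machinery ... */
            a->first_pfn++;
            a->nr--;

            if ( a->nr > 0 && hypercall_preempt_check() )
                return -EAGAIN;   /* caller builds a hypercall continuation */
        }
        return 0;
    }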
This is CVE-2012-5511 / XSA-27.
Signed-off-by: Tim Deegan <tim@xen.org> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson.citrix.com>
Jan Beulich [Tue, 4 Dec 2012 18:38:00 +0000 (18:38 +0000)]
gnttab: fix releasing of memory upon switches between versions
gnttab_unpopulate_status_frames() incompletely freed the pages
previously used as status frames, in that they did not get removed from
the domain's xenpage_list, thus causing subsequent list corruption
when those pages did get allocated again for the same or another purpose.
Similarly, grant_table_create() and gnttab_grow_table() both improperly
clean up in the event of an error - pages already shared with the guest
can't be freed by just passing them to free_xenheap_page(). Fix this by
sharing the pages only after all allocations succeeded.
This is CVE-2012-5510 / XSA-26.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Committed-by: Ian Jackson <ian.jackson.citrix.com>
George Dunlap [Tue, 4 Dec 2012 15:50:20 +0000 (15:50 +0000)]
xl: Check for duplicate vncdisplay options, and return an error
If the user has set a vnc display number both in vnclisten (with
"xxxx:yy"), and with vncdisplay, throw an error.
Update man pages to match.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Tue, 4 Dec 2012 15:50:19 +0000 (15:50 +0000)]
xen: arm: Use $(OBJCOPY) not bare objcopy
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reported-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Committed-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Fri, 30 Nov 2012 12:20:23 +0000 (12:20 +0000)]
arm: const-correctness in virt_to_maddr
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Committed-by: Ian Campbell <ian.campbell@citrix.com>