Andrew Cooper [Wed, 11 Feb 2015 16:18:27 +0000 (17:18 +0100)]
x86/nmi: fix shootdown of pcpus running in VMX non-root mode
c/s 7dd3b06ff "vmx: fix handling of NMI VMEXIT" fixed one issue but
inadvertently introduced a regression when it came to the NMI shootdown. The
shootdown code worked by patching vector 2 in each IDT, but the introduced
direct call to do_nmi() bypassed this.
Instead of patching each IDT, take a different approach by updating the
existing dispatch table. This allows for the removal of the remote IDT
patching and the removal of the nmi_crash() entry point.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
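To illustrate why a shared dispatch slot covers both NMI entry paths where IDT patching did not, here is a minimal sketch. It is not Xen's actual code: nmi_handler_t, set_nmi_handler(), nmi_from_idt() and nmi_from_vmexit() are hypothetical names, and the handler signature is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type; the real handler signature may differ. */
typedef int (*nmi_handler_t)(int cpu);

static int normal_nmi(int cpu) { (void)cpu; return 0; } /* ordinary NMI work */
static int crash_nmi(int cpu)  { (void)cpu; return 1; } /* shootdown path */

/* One shared slot consulted by every NMI entry path. */
static nmi_handler_t nmi_dispatch = normal_nmi;

/* Install a new handler and return the previous one. */
nmi_handler_t set_nmi_handler(nmi_handler_t h)
{
    nmi_handler_t old = nmi_dispatch;

    assert(h != NULL);
    nmi_dispatch = h;
    return old;
}

/* Path A: NMI delivered through the IDT while in root mode. */
int nmi_from_idt(int cpu) { return nmi_dispatch(cpu); }

/* Path B: NMI observed as a VM exit; the handler is called directly. */
int nmi_from_vmexit(int cpu) { return nmi_dispatch(cpu); }
```

Because both paths read the same slot, swapping the handler once redirects NMIs regardless of how they arrived, which is the property the direct do_nmi() call had broken for the IDT-patching scheme.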
Paul Durrant [Tue, 10 Feb 2015 12:29:51 +0000 (13:29 +0100)]
x86/hvm: explicitly mark ioreq server pages dirty
...when they are added back into the guest physmap, when an ioreq
server is disabled. If this is not done then the pages are missed
during migration, causing ioreq server creation to fail on the remote end.
This problem only manifests if the ioreq server is non-default because in
the default case the pages are never removed from the guest physmap.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Paul Durrant [Tue, 10 Feb 2015 12:28:40 +0000 (13:28 +0100)]
x86/hvm: wait for at least one ioreq server to be enabled
In the case where a stub domain is providing emulation for an HVM
guest, there is no interlock in the toolstack to make sure that
the stub domain is up and running before the guest is unpaused.
Prior to the introduction of ioreq servers this was not a problem,
since there was only ever one emulator so ioreqs were simply
created anyway and the vcpu remained blocked until the stub domain
started and picked up the ioreq.
Since ioreq servers allow for multiple emulators for a single guest
it's not possible to know a priori which emulator will handle a
particular ioreq, so emulators must attach to a guest before the
guest runs.
This patch works around the lack of interlock in the toolstack for
stub domains by keeping the domain paused until at least one ioreq
server is created and enabled, which in practice means the stub
domain is indeed up and running.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Ian Jackson [Mon, 9 Feb 2015 15:34:40 +0000 (15:34 +0000)]
libxl: More probably detect reentry of destroyed ctx
In libxl_ctx_free:
1. Move the GC_INIT earlier, so that we can:
2. Take the lock around most of the work. This is technically
unnecessary since calling any other libxl entrypoint on a ctx being
passed to libxl_ctx_free risks undefined behaviour. But, taking
the lock makes it much more likely that we spot this.
3. Increment osevent_in_hook by 1000. If we are reentered after
destroy, this will trip some of our entrypoints' asserts. It also
means that if we crash for some other reason relating to reentering
a destroyed ctx, the cause will be more obviously visible by
examining ctx->osevent_in_hook (assuming that the memory previously
used for the ctx hasn't been reused and overwritten).
4. Free the lock again. (pthread_mutex_destroy requires that the
mutex be unlocked.)
With this patch, I find that an occasional race previously reported
as:
libvirtd: libxl_internal.h:3265: libxl__ctx_unlock: Assertion `!r' failed.
is now reported as:
libvirtd: libxl_event.c:1236: libxl_osevent_occurred_fd: Assertion `!libxl__gc_owner(gc)->osevent_in_hook' failed.
Examining the call trace with gdb shows this:
(gdb) bt
#0 0xb773f424 in __kernel_vsyscall ()
#1 0xb7022871 in raise () from /lib/i386-linux-gnu/i686/nosegneg/libc.so.6
#2 0xb7025d62 in abort () from /lib/i386-linux-gnu/i686/nosegneg/libc.so.6
#3 0xb701b958 in __assert_fail () from /lib/i386-linux-gnu/i686/nosegneg/libc.so.6
#4 0xb6f00390 in libxl_osevent_occurred_fd (ctx=0xb84813a8, for_libxl=0xb84d6848, fd=31, events_ign=0, revents_ign=1) at libxl_event.c:1236
#5 0xb1b70464 in libxlDomainObjFDEventCallback () from /usr/local/lib/libvirt/connection-driver/libvirt_driver_libxl.so
#6 0xb72163b1 in virEventPollDispatchHandles () from /usr/local/lib/libvirt.so.0
#7 0xb7216be5 in virEventPollRunOnce () from /usr/local/lib/libvirt.so.0
#8 0xb7214a7e in virEventRunDefaultImpl () from /usr/local/lib/libvirt.so.0
#9 0xb77c7b98 in virNetServerRun ()
#10 0xb7771c63 in main ()
(gdb) print ctx->osevent_in_hook
$2 = 1000
(gdb)
which IMO demonstrates that libxl_osevent_occurred_fd is being called
on a destroyed ctx.
This is probably a symptom of the bug in libvirt fixed by these
patches:
https://www.redhat.com/archives/libvir-list/2015-February/msg00024.html
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Ian Campbell <ian.campbell@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Jim Fehlig <jfehlig@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Mon, 9 Feb 2015 15:20:32 +0000 (15:20 +0000)]
libxl: event handling: ao_inprogress does waits while reports outstanding
libxl__ao_inprogress needs to check (like
libxl__ao_complete_check_progress_reports) that there are no
outstanding progress callbacks.
Otherwise it might happen that we would destroy the ao while another
thread has an outstanding callback in its egc report queue. The other
thread would then, in its egc_run_callbacks, touch the destroyed ao.
Instead, when this happens in libxl__ao_inprogress, simply run round
the event loop again. The thread which eventually makes the callback
will spot our poller in the ao, and notify the poller, waking us up.
This fixes an assertion failure race seen with libvirt:
libvirtd: libxl_event.c:1792: libxl__ao_complete_check_progress_reports: Assertion `ao->in_initiator' failed.
or (after "Add an assert to egc_run_callbacks")
libvirtd: libxl_event.c:1338: egc_run_callbacks: Assertion `aop->ao->magic == 0xA0FACE00ul' failed.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Ian Campbell <ian.campbell@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Jim Fehlig <jfehlig@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Mon, 9 Feb 2015 15:18:30 +0000 (15:18 +0000)]
libxl: event handling: Break out ao_work_outstanding
Break out the test in libxl__ao_complete_check_progress_reports, into
ao_work_outstanding, which reports false if either (i) the ao is still
ongoing or (ii) there is a progress report (perhaps on a different
thread's callback queue) which has yet to be reported to the
application.
No functional change.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Ian Campbell <ian.campbell@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Jim Fehlig <jfehlig@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Mon, 9 Feb 2015 15:10:11 +0000 (15:10 +0000)]
libxl: event handling: Add an assert to egc_run_callbacks
Check that the ao is still live when we are about to run some of
its callbacks.
This reveals an existing bug in libxl which is exercised by libvirt,
converting
libvirtd: libxl_event.c:1792: libxl__ao_complete_check_progress_reports: Assertion `ao->in_initiator' failed.
into
libvirtd: libxl_event.c:1338: egc_run_callbacks: Assertion `aop->ao->magic == 0xA0FACE00ul' failed.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Ian Campbell <ian.campbell@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Jim Fehlig <jfehlig@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Thu, 5 Feb 2015 16:28:56 +0000 (16:28 +0000)]
tools: work around collision of -O0 and -D_FORTIFY_SOURCE
Some systems have python-config include -D_FORTIFY_SOURCE in the
CFLAGS. But -D_FORTIFY_SOURCE does not (currently) work with -O0, and
-O0 is enabled in debug builds (since 1166ecf781). As a result, on
those systems, debug builds fail.
Work around this problem as follows:
* In configure, detect -D_FORTIFY_SOURCE in $(python-config --cflags)
* If detected, set the new autoconf substitution and make variable
PY_NOOPT_CFLAGS to -O1.
* In tools/Rules.mk, where we add -O0, also add PY_NOOPT_CFLAGS
(which will override the -O0 with -O1 if required).
Overriding the -O0 is better than disabling Fortify because the
latter might have an adverse security impact. A user who wants to
disable optimisation completely even for Python and also disable
Fortify can set the environment variable
EXTRA_CFLAGS_XEN_TOOLS='-U_FORTIFY_SOURCE -O0'
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reported-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> CC: Jan Beulich <JBeulich@suse.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Euan Harris <euan.harris@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Tested-by: Don Slutz <dslutz@verizon.com>
Lasse Collin [Thu, 5 Feb 2015 13:01:09 +0000 (14:01 +0100)]
common/xz: add comments for the intentionally missing break statements
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
[Linux commit 84d517f3e56f7d0d305c14a701cee8f7372ebe1e] Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Liang Li [Thu, 5 Feb 2015 12:59:48 +0000 (13:59 +0100)]
x86: avoid needless EPT table ajustment and cache flush
When a guest changes its MTRR MSRs, adjusting the EPT table and flushing
the cache are needed only when the guest has an IOMMU device. Checking
need_iommu(d) limits the impact to guests with a device assigned.
Since a guest may be hot plugged with a device, there may be dirty cache
lines before need_iommu(d) becomes true. To address this, force
p2m_memory_type_changed and flush_all when the first device is assigned
to the guest.
Signed-off-by: Liang Li <liang.z.li@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Wed, 4 Feb 2015 16:07:48 +0000 (16:07 +0000)]
libvchan: address compiler warnings
Both vchan_wr() and stdout_wr() should be defined with a non-empty
argument list (i.e. void). Additionally both of them as well as usage()
should be static to make clear that no other code is referencing them.
Further, statements should follow declarations.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Mike Latimer [Fri, 30 Jan 2015 21:01:00 +0000 (14:01 -0700)]
libxl: Wait for ballooning if free memory is increasing
During domain startup, all required memory ballooning must complete
within a maximum window of 33 seconds (3 retries, 11 seconds of delay).
If not, domain creation is aborted with a 'failed to free memory' error.
In order to accommodate large domains or slower hardware (which require
substantially longer to balloon memory) the free memory process should
continue retrying if the amount of free memory is increasing on each
iteration of the loop.
Signed-off-by: Mike Latimer <mlatimer@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
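The retry policy described in the commit above can be sketched as a loop whose retry budget is only consumed by iterations that make no progress. This is a hedged illustration, not the libxl implementation: wait_for_free_memory(), free_mem_fn, struct sim_balloon and sim_poll() are hypothetical, and a real caller would sleep between polls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: returns the current amount of free memory in KiB. */
typedef uint64_t (*free_mem_fn)(void *opaque);

/*
 * Wait until at least `needed` KiB are free.  An iteration in which free
 * memory grew resets the retry budget; an iteration without growth uses one
 * retry.  Returns 0 on success, -1 if ballooning stalled.
 */
int wait_for_free_memory(free_mem_fn get_free, void *opaque,
                         uint64_t needed, int max_retries)
{
    int retries = max_retries;
    uint64_t prev, cur;

    assert(get_free != NULL && max_retries > 0);

    prev = get_free(opaque);
    if (prev >= needed)
        return 0;

    while (retries > 0) {
        /* A real caller would delay here before polling again. */
        cur = get_free(opaque);
        if (cur >= needed)
            return 0;
        if (cur > prev)
            retries = max_retries;   /* progress: keep waiting */
        else
            retries--;               /* no progress: spend one retry */
        prev = cur;
    }
    return -1;
}

/* Simulated balloon for demonstration: frees `step` KiB per poll up to `limit`. */
struct sim_balloon { uint64_t free_kib, step, limit; };

uint64_t sim_poll(void *opaque)
{
    struct sim_balloon *b = opaque;
    uint64_t cur = b->free_kib;

    if (b->free_kib + b->step <= b->limit)
        b->free_kib += b->step;
    return cur;
}
```

With this policy a slow but steadily progressing balloon is never cut off, while a stalled one still fails after max_retries consecutive polls without growth.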
Wei Liu [Tue, 3 Feb 2015 13:47:08 +0000 (13:47 +0000)]
rump kernels: use new platform macro
Starting from rump kernel changeset 91d5623 ("Renaming platform macros,
app-tools and autoconf target string"), __RUMPUSER_XEN__ and __RUMPAPP__
are deleted. We are supposed to use __RUMPRUN__ instead.
We still keep __RUMPUSER_XEN__ for now in order to make xen-unstable
pass osstest push gate. I will remove __RUMPUSER_XEN__ later.
Related discussion:
http://thread.gmane.org/gmane.comp.rumpkernel.user/739
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Tue, 3 Feb 2015 10:39:17 +0000 (11:39 +0100)]
x86: provide build time option to support up to 123Tb of memory
As this requires growing struct page_info from 32 to 48 bytes as well
as shrinking the always accessible direct mapped memory range from 5Tb
to 3.5Tb, this isn't being introduced as a general or default enabled
feature.
A side effect of the change to x86's mm.h is that asm/mm.h may no
longer be included directly. Hence in the few places where this was done,
xen/mm.h is being substituted (indirectly in the hvm/mtrr.h case).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 3 Feb 2015 10:36:39 +0000 (11:36 +0100)]
x86/mm: allow for building without shadow mode support
Considering the complexity of the code, it seems to be a reasonable
thing to allow people to disable that code entirely even outside the
immediate need for this by the next patch.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tim Deegan [Tue, 3 Feb 2015 10:34:12 +0000 (11:34 +0100)]
x86/shadow: tidy up fragmentary page lists in multi-page shadows
Multi-page shadows are linked together using the 'list' field. When
those shadows are in the pinned list, the list fragments are spliced
into the pinned list; otherwise they have no associated list head.
Rework the code that handles these fragments to use the page_list
interface rather than manipulating the fields directly. This makes
the code cleaner, and allows the 'list' field to be either the
compact pdx form or a normal list_entry.
Signed-off-by: Tim Deegan <tim@xen.org> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Introduce sh_terminate_list() and make it use LIST_POISON*.
Move helper array of shadow_size() into common.c.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
Boris Ostrovsky [Tue, 3 Feb 2015 10:30:40 +0000 (11:30 +0100)]
x86/VPMU: handle APIC_LVTPC accesses
Don't have the hypervisor update APIC_LVTPC when _it_ thinks the vector
should be updated. Instead, handle guest's APIC_LVTPC accesses and write what
the guest explicitly wanted (but only when VPMU is enabled).
This is an updated version of commit 8097616fbdda that was reverted by cc3404093c85. Unlike the previous version, we don't update APIC_LVTPC
when VPMU is disabled to avoid interfering with NMI watchdog (which
runs only when VPMU is off).
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Boris Ostrovsky [Tue, 3 Feb 2015 10:30:09 +0000 (11:30 +0100)]
x86/VPMU: disable when NMI watchdog is on
NMI watchdog sets APIC_LVTPC register to generate an NMI when PMU counter
overflow occurs. This may be overwritten by VPMU code later, effectively
turning off the watchdog.
We should disable VPMU when NMI watchdog is running.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 3 Feb 2015 10:25:47 +0000 (11:25 +0100)]
time: widen wallclock seconds to 64 bits
Linux is in the process of converting their seconds representation to
64 bits, so in order to support it consistently we should follow suit
(which at some point in quite a few years we'd have to do anyway). To
represent this in struct shared_info we leverage a 32-bit hole in
x86-64's and arm's variant of the structure; for x86-32 guests the only
(reasonable) choice we have is to put the extension in struct
arch_shared_info.
A note on the conditional suppressing the xen_wc_sec_hi helper macro
definition in the ix86 case for hypervisor and tools: Neither of the
two actually need this, and its presence causes the tools to fail to
build (due to the inclusion of both the x86-64 and x86-32 variants of
the header).
As a secondary change, x86's do_platform_op() gets a pointless
initializer as well as a pointless assignment of that same variable
dropped.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
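As a sketch of how a consumer might combine the legacy 32-bit seconds field with the new upper half described in the commit above: struct demo_wallclock is a reduced, hypothetical layout (the real fields live in struct shared_info or struct arch_shared_info depending on the architecture), and the field name wc_sec_hi is assumed from the xen_wc_sec_hi helper macro mentioned in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced view of the two halves of the wallclock seconds. */
struct demo_wallclock {
    uint32_t wc_sec;      /* legacy low 32 bits */
    uint32_t wc_sec_hi;   /* upper 32 bits; field name is an assumption */
};

/* Reassemble the full 64-bit seconds value. */
uint64_t demo_wc_seconds(const struct demo_wallclock *wc)
{
    assert(wc != NULL);
    return ((uint64_t)wc->wc_sec_hi << 32) | wc->wc_sec;
}

/* Split a 64-bit seconds value into the two 32-bit fields. */
void demo_wc_set(struct demo_wallclock *wc, uint64_t secs)
{
    assert(wc != NULL);
    wc->wc_sec = (uint32_t)secs;
    wc->wc_sec_hi = (uint32_t)(secs >> 32);
}
```

Keeping the low half in the existing field means consumers that only know about the 32-bit value continue to read correct times for as long as the high half is zero.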
Andrew Cooper [Fri, 30 Jan 2015 14:11:14 +0000 (14:11 +0000)]
ocaml/xenctrl: Fix stub_xc_readconsolering()
The Ocaml stub to retrieve the hypervisor console ring had a few problems.
* A single 32k buffer would truncate a large console ring.
* The buffer was static and not under the protection of the Ocaml GC lock so
could be clobbered by concurrent accesses.
* Embedded NUL characters would cause caml_copy_string() (which is strlen()
based) to truncate the buffer.
The function is rewritten from scratch, using the same algorithm as the python
stubs, but uses the protection of the Ocaml GC lock to maintain a static
running total of the ring size, to avoid redundant realloc()ing in future
calls.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Dave Scott <dave.scott@eu.citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Acked-by: David Scott <dave.scott@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
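The embedded-NUL problem listed in the commit above is easy to reproduce in plain C. The sketch below contrasts a strlen()-based copy with a copy that carries an explicit length. copy_cstring() and copy_counted() are hypothetical helpers for illustration and are not part of the OCaml runtime or of the stubs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy as a strlen()-based helper would: stops at the first NUL byte. */
size_t copy_cstring(char *dst, const char *src)
{
    size_t n = strlen(src);

    memcpy(dst, src, n);
    return n;
}

/* Copy with an explicitly tracked length: embedded NULs survive. */
size_t copy_counted(char *dst, const char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
    return len;
}
```

For a buffer holding the five bytes 'a', 'b', NUL, 'c', 'd', the first helper copies two bytes and the second copies all five, which is why a console ring containing NUL characters needs a length-carrying copy.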
Andrew Cooper [Wed, 28 Jan 2015 17:55:32 +0000 (17:55 +0000)]
ocaml/xenctrl: Make failwith_xc() thread safe
The static error_str[] buffer is not thread-safe, and 1024 bytes is
unreasonably large. Reduce to 256 bytes (which is still much larger than any
current use), and move it to being a stack variable.
Also, propagate the Noreturn attribute from caml_raise_with_string().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Dave Scott <Dave.Scott@eu.citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Acked-by: David Scott <dave.scott@citrix.com>
Andrew Cooper [Wed, 28 Jan 2015 15:52:35 +0000 (15:52 +0000)]
tools/libxc: Don't leave scratch_pfn uninitialised if the domain has no memory
c/s 5b5c40c0d1 "libxc: introduce a per architecture scratch pfn for temporary
grant mapping" accidentally introduced an issue whereby there were two paths out of
xc_core_arch_get_scratch_gpfn() which returned 0, but only one of which
assigned a value to the gpfn parameter.
xc_domain_maximum_gpfn() can validly return 0, at which point gpfn 1 is a
valid scratch page to use.
In addition, widen rc before adding 1 and possibly overflowing.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Julien Grall <julien.grall@linaro.org> CC: Jan Beulich <JBeulich@suse.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
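Both fixes in the commit above (always assigning the output on success, and widening before adding 1) can be shown in a small sketch. demo_scratch_gpfn() is a hypothetical helper, and treating the "maximum gpfn" result as an int with negative values meaning errors is an assumption made for the illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: `max_gpfn` stands in for the int result of a
 * "maximum gpfn" query, where negative values are errors.
 * Every path that returns 0 also writes *gpfn.
 */
int demo_scratch_gpfn(int max_gpfn, uint64_t *gpfn)
{
    assert(gpfn != NULL);

    if (max_gpfn < 0)
        return max_gpfn;            /* error path: *gpfn is left untouched */

    /* Widen first, then add: INT_MAX + 1 would overflow as an int. */
    *gpfn = (uint64_t)max_gpfn + 1;
    return 0;
}
```

A maximum gpfn of 0 is handled like any other valid value here and yields scratch gpfn 1, matching the behaviour the message describes.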
Julien Grall [Thu, 29 Jan 2015 14:50:00 +0000 (15:50 +0100)]
random: add missing include xen/cache.h
The commit f6c9698 "x86: allow reading MSR_IA32_TSC with XENPF_resource_op"
introduced a build regression on ARM platforms.
random.c:8:28: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘boot_random’
unsigned int __read_mostly boot_random;
^
The define __read_mostly is defined in asm/cache.h which is included by
other headers on x86 but not on ARM. Include xen/cache.h to fix the
build.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Jan Beulich [Thu, 29 Jan 2015 13:42:20 +0000 (13:42 +0000)]
x86/shadow: adjust mask shadow_audit_tables() passes to hash_foreach()
It so far having been ~1 made most of the code preceding the call
pointless, but I assume this wasn't meant to be that way. Also replace
the remaining hard coded ~1 with an expression documenting the
intention a little better.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Adjust again to use SHF_page_type_mask, at Jan's suggestion.
Jan Beulich [Thu, 29 Jan 2015 13:40:40 +0000 (13:40 +0000)]
x86/shadow: convert non-const statics
To make obvious that such statics are safe to use, they should be
const. In some of the cases, they wouldn't even need to be static, but
keep them so upon the maintainer's request.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
Jan Beulich [Thu, 29 Jan 2015 13:24:04 +0000 (14:24 +0100)]
x86: support SMBIOS v3
While presumably of primary use to ARM64 (once the code gets
generalized), we should still support this more modern variant,
allowing for the actual DMI data to reside in memory above 4Gb.
While based on draft version 3.0.0d, it is assumed that the final
version of the specification will not render this implementation
invalid (not the least because Linux 3.19 already makes the same
assumption).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
David Vrabel [Thu, 29 Jan 2015 13:22:22 +0000 (14:22 +0100)]
grant-table: defer releasing pages acquired in a grant copy
Acquiring a page for the source or destination of a grant copy is an
expensive operation. A common use case is for two adjacent grant copy
ops to operate on either the same source or the same destination page.
Instead of always acquiring and releasing destination and source pages
for each operation, release the page once it is no longer valid for
the next op.
If either the source or destination domain changes, both pages are
released as it is unlikely that either will still be valid.
XenServer's performance benchmarks show modest improvements in network
receive throughput (netback uses grant copy in the guest Rx path) and
no regressions in disk performance (using tapdisk3 which grant copies
as the backend).
                            Baseline    Deferred Release
Interhost receive to VM     7.2 Gb/s    ~9 Gb/s
Interhost aggregate         24 Gb/s     28 Gb/s
Intrahost single stream     14 Gb/s     14 Gb/s
Intrahost aggregate         34 Gb/s     36 Gb/s
Aggregate disk write        900 MB/s    900 MB/s
Aggregate disk read         890 MB/s    890 MB/s
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
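The deferred-release idea in the commit above amounts to a one-entry cache per side of the copy. The following sketch shows the pattern with counters in place of the real acquire and release work. struct held_page, use_page() and release_page() are hypothetical names and not the hypervisor's grant-table code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-entry cache for one side (source or destination). */
struct held_page {
    bool          have;
    unsigned      domid;
    unsigned long ref;     /* stand-in for a grant reference or frame */
};

static unsigned acquires, releases;   /* counters for demonstration only */

/* Drop the cached page, if any. */
void release_page(struct held_page *p)
{
    assert(p != NULL);
    if (p->have) {
        p->have = false;
        releases++;
    }
}

/* Acquire (domid, ref) unless the cached page already matches. */
void use_page(struct held_page *p, unsigned domid, unsigned long ref)
{
    assert(p != NULL);
    if (p->have && p->domid == domid && p->ref == ref)
        return;                 /* still valid for this op: reuse it */

    release_page(p);            /* no longer valid: release, then acquire */
    p->have = true;
    p->domid = domid;
    p->ref = ref;
    acquires++;
}
```

Three consecutive operations on the same page cost one acquire instead of three, which is the saving that adjacent grant copy ops on a shared source or destination page would benefit from.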
Dan Carpenter [Wed, 28 Jan 2015 15:50:08 +0000 (16:50 +0100)]
bunzip2: off by one in get_next_block()
"origPtr" is used as an offset into the bd->dbuf[] array. That array is
allocated in start_bunzip() and has "bd->dbufSize" number of elements so
the test here should be >= instead of >.
Later we check "origPtr" again before using it as an offset so I don't
know if this bug can be triggered in real life.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Trivial adjustments to make the respective Linux commit b5c8afe5be51078a979d86ae5ae78c4ac948063d apply to Xen.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
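The off-by-one in the commit above comes down to which comparison rejects an index equal to the array size. A minimal sketch follows. index_out_of_range() is a hypothetical helper, and the real code compares origPtr against bd->dbufSize inline rather than through a function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Valid indices into an array of `size` elements are 0 .. size - 1. */
bool index_out_of_range(size_t idx, size_t size)
{
    return idx >= size;   /* `idx > size` would wrongly accept idx == size */
}
```

With four elements, index 3 is the last valid one and index 4 must be rejected. A test written with > lets index 4 through.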
Jan Beulich [Wed, 28 Jan 2015 15:38:20 +0000 (16:38 +0100)]
x86: skip further initialization for idle domains
While in the end not really found necessary, early versions of the
patches to follow pointed out that we needlessly set up paging for idle
domains. Arranging for that to be skipped made me notice that we can at
once skip vMCE setup for them. Leverage the adjustment to further
re-arrange the way FPU setup gets skipped.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Chao Peng [Wed, 28 Jan 2015 15:33:01 +0000 (16:33 +0100)]
x86: allow reading MSR_IA32_TSC with XENPF_resource_op
Memory bandwidth monitoring requires system time information returned
along with the monitoring counter to verify the correctness of the
counter value and to calculate the time elapsed between two samplings.
Add MSR_IA32_TSC to the read path. It returns scaled system time (ns)
instead of the raw timestamp, eliminating the need to convert. The returned
time is obfuscated with a boot-time random value to prevent potential abuse
of it. RESOURCE_ACCESS_MAX_ENTRIES is also increased to 3 so MSR_IA32_TSC
can be used together with an MSR write/read operation pair.
Jan Beulich [Wed, 28 Jan 2015 15:29:46 +0000 (16:29 +0100)]
x86: also use tzcnt instead of bsf in __scanbit()
... when available, i.e. by runtime patching. This saves the
conditional move, having a back-to-back dependency on BSF's (EFLAGS)
result.
The need to include asm/cpufeatures.h from asm/bitops.h requires a
workaround for an otherwise resulting circular header file dependency:
Provide a mode by which the including site of the former header can
request to only get the X86_FEATURE_* defines (and very little more)
from it, allowing it to nevertheless be included in its entirety later
on.
While doing this I also noticed that the function's "max" parameter was
pointlessly "unsigned long" - the function only returning
"unsigned int", this can't be of any use, and hence gets converted at
once, along with the necessary adjustments to CMOVZ's output operands.
Note that while only alternative_io() is needed by this change (and
hence gets pulled over from Linux), for completeness its input-only
counterpart alternative_input() gets added as well.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Wed, 28 Jan 2015 13:26:21 +0000 (13:26 +0000)]
libxl: correct function name
spaw_ -> spawn_
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Andrew Cooper [Tue, 27 Jan 2015 16:58:06 +0000 (16:58 +0000)]
tools/libxc: Disable CONFIG_MIGRATE in stubdom environments
The legacy save/restore infrastructure requires several function pointers from
the toolstack (libxl or Xend in the past) in order to work, and for HVM guests
also needs to be able to play around in dom0's filesystem to move the device
model save record.
Migration v2 changes some of this, but is similarly dependent on
toolstack-provided function pointers.
Someone who wishes to re-architect the interaction of moving parts for running
a domain might be in a position to re-enable this, but for now, explicitly
fail with ENOSYS (from xc_nomigrate.c) rather than failing with an error about
a missing function pointer (or indeed falling over a NULL pointer on certain
paths).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Andrew Cooper [Tue, 27 Jan 2015 20:38:11 +0000 (20:38 +0000)]
ocaml/xenctrl: Check return values from hypercalls
rather than blindly continuing and possibly using negative values.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Dave Scott <dave.scott@eu.citrix.com> Acked-by: David Scott <dave.scott@citrix.com>
Create preinit_xen_time() and move into it the minimum required
subset of operations needed to properly initialize the
cpu_khz and boot_count variables. This allows us to use udelay()
immediately after the call.
Quan Xu [Thu, 15 Jan 2015 09:21:50 +0000 (04:21 -0500)]
vTPM/TPM2: Support TPM 2.0 bind and unbind data
Bind data with TPM2_RSA_Encrypt, which performs RSA encryption using
the indicated padding scheme according to PKCS#1v2.1(PKCS#1). If the
scheme of keyHandle is TPM_ALG_NULL, then the caller may use inScheme
to specify the padding scheme.
Unbind data with TPM2_RSA_Decrypt, which performs RSA decryption using
the indicated padding scheme according to PKCS#1v2.1(PKCS#1).
Signed-off-by: Quan Xu <quan.xu@intel.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Quan Xu [Thu, 15 Jan 2015 09:21:48 +0000 (04:21 -0500)]
vTPM/TPM2: Support 'tpm2' extra command line.
Make vtpm-stubdom domain compatible to launch on TPM 1.x / TPM 2.0.
Add:
..
extra="tpm2=1"
..
to launch vtpm-stubdom domain on TPM 2.0, ignore it on TPM 1.x. for
example,
vtpm-stubdom domain configuration on TPM 2.0:
Quan Xu [Thu, 15 Jan 2015 09:21:47 +0000 (04:21 -0500)]
vTPM/TPM2: Add main entrance vtpmmgr2_init()
Accept commands from the vtpm-stubdom domains via the mini-os TPM
backend driver. The vTPM manager communicates directly with hardware
TPM 2.0 using the mini-os tpm2_tis driver.
Signed-off-by: Quan Xu <quan.xu@intel.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Quan Xu [Thu, 15 Jan 2015 09:21:46 +0000 (04:21 -0500)]
vTPM/TPM2: TPM2.0 TIS initialization and self test.
Access the various TPM 2.0 registers that allow communication between
the TPM 2.0 and platform hardware and software. TPM2_SelfTest causes
the TPM 2.0 to perform a test of its capabilities.
Signed-off-by: Quan Xu <quan.xu@intel.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Quan Xu [Thu, 15 Jan 2015 09:21:45 +0000 (04:21 -0500)]
vTPM/TPM2: Create and load SK on TPM 2.0
TPM2_Create is used to create an object that can be loaded into a
TPM using TPM2_Load(). If the command completes successfully, the
TPM will create the new object and return the object’s creation
data (creationData), its public area (outPublic), and its encrypted
sensitive area (outPrivate). Preservation of the returned data is
the responsibility of the caller. The object will need to be loaded
(TPM2_Load()).
TPM2_Load is used to load objects into the TPM. This command is used
when both a TPM2B_PUBLIC and TPM2B_PRIVATE are to be loaded. If only
a TPM2B_PUBLIC is to be loaded, the TPM2_LoadExternal command is used.
Signed-off-by: Quan Xu <quan.xu@intel.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Quan Xu [Thu, 15 Jan 2015 09:21:44 +0000 (04:21 -0500)]
vTPM/TPM2: TPM 2.0 takes ownership and create SRK
TPM2_CreatePrimary is used to create a Primary Object under one of
the Primary Seeds or a Temporary Object under TPM_RH_NULL. The command
uses a TPM2B_PUBLIC as a template for the object to be created. The
command will create and load a Primary Object. The sensitive area is
not returned. Any type of object and attributes combination that is
allowed by TPM2_Create() may be created by this command. The constraints
on templates and parameters are the same as TPM2_Create() except that a
Primary Storage Key and a Temporary Storage Key are not constrained to
use the algorithms of their parents.
Signed-off-by: Quan Xu <quan.xu@intel.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Andrew Cooper [Tue, 27 Jan 2015 20:34:02 +0000 (20:34 +0000)]
tools/libxl: Correct static pattern rule for pkgconfig files
Attempting to build libxl causes Make to emit the following warnings
andrewcoop@andrewcoop:xen.git$ make -C tools/libxl all
...
Makefile:253: target `xenlight.pc' doesn't match the target pattern
Makefile:253: target `xlutil.pc' doesn't match the target pattern
...
because the static pattern rule is malformed. 'Makefile' as the only
prereq-pattern does not contain a pattern.
The rule ends up working because of the use of $@.in where $< should have been
used, but lacked any dependency between a $FOO.pc and its .in source file.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
libxl: Prevent qemu closing QMP socket on shutdown before libxl is done with it.
At present on shutdown when using pci-passthrough with qemu-xen, qemu
closes the QMP socket before libxl is done with it causing these
errors to be logged by libxl:
Waiting for domain test (domid 1) to die [pid 11568]
Domain 1 has shut down, reason code 0 0x0
Action for shutdown reason code 0 is destroy
Domain 1 needs to be cleaned up: destroying the domain
libxl: error: libxl_qmp.c:443:qmp_next: Socket read error: Connection reset by peer
libxl: error: libxl_qmp.c:701:libxl__qmp_initialize: Failed to connect to QMP
libxl: error: libxl_qmp.c:686:libxl__qmp_initialize: Connection error: Connection refused
libxl: error: libxl_dm.c:1588:kill_device_model: Device Model already exited
Done. Exiting now
Prevent this by using the qemu '-no-shutdown' parameter which is
described as doing:
"Don’t exit QEMU on guest shutdown, but instead only stop the emulation.
This allows for instance switching to monitor to commit changes to the disk image."
So Qemu will stop emulating, but keeps the QMP socket open and waits
for libxl to kill the qemu process when it is done, preventing the
race and resulting in this being logged by libxl:
Waiting for domain test (domid 1) to die [pid 10859]
Domain 1 has shut down, reason code 0 0x0
Action for shutdown reason code 0 is destroy
Domain 1 needs to be cleaned up: destroying the domain
Done. Exiting now
Wei Liu [Sun, 25 Jan 2015 15:38:59 +0000 (15:38 +0000)]
tools/Makefile: fix qemu-xen-traditional build
In d9740237a ("tools: unhook blktap1 from the build and remove all
references to it"), one spot was left unchanged, which leads to failure
in building qemu-xen-traditional.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Wei Liu [Tue, 20 Jan 2015 11:47:46 +0000 (11:47 +0000)]
tools: generate systemd service files only when systemd is available
Generating them without systemd is not in any way harmful, but it is
not very useful either.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
[ ijc -- rerun autogen.sh ] Acked-by: Ian Campbell <ian.campbell@citrix.com>
Wei Liu [Tue, 20 Jan 2015 13:31:12 +0000 (13:31 +0000)]
tools: fix "make distclean"
The original rule to target "distclean" in tools/Rules.mk was in effect
"make clean". It should be "make distclean".
However not all Makefiles in subdirectories have distclean target
defined. So this patch also adds a bunch of distclean targets to various
Makefiles. They only depend on clean target and don't have any actions
in most cases.
With the patch applied, the following command outputs 0 results:
Ocaml and libfsimage are known to have distclean defined in a dedicated
rules file.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Wei Liu [Tue, 20 Jan 2015 12:22:50 +0000 (12:22 +0000)]
libxl: provide xlutil.pc
Please rerun autogen.sh after applying this patch.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Wei Liu [Tue, 20 Jan 2015 12:22:49 +0000 (12:22 +0000)]
libxl: delete xenlight.pc.in in distclean
That file is generated by configure. Deleting it in "make clean" leads
to rerun configure. Move it under distclean target.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Julien Grall [Wed, 21 Jan 2015 13:25:44 +0000 (13:25 +0000)]
libxc: introduce a per architecture scratch pfn for temporary grant mapping
The code to initialize the grant table in libxc uses
xc_domain_maximum_gpfn() + 1 to get a guest pfn for mapping the grant
frame and to initialize it.
This solution has two major issues:
- The check of the return value of xc_domain_maximum_gpfn() is buggy because
xen_pfn_t is unsigned and -ERRNO is returned in case of an error, which
is never caught by ( pfn <= 0 ).
- The guest memory layout may be filled up to the end, i.e.
xc_domain_maximum_gpfn() + 1 gives either 0 or an invalid PFN due to
hardware limitations.
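The first issue above can be illustrated with a minimal sketch. It is not the libxc code: pfn_sketch_t stands in for xen_pfn_t, and max_gpfn_stub() is a hypothetical stub that always fails with -ENOSYS. It only shows why a ( pfn <= 0 ) test on an unsigned value misses a negative error code, and one possible way to check the signed value first.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Stand-in for xen_pfn_t, which is an unsigned type (assumption: 64-bit). */
typedef uint64_t pfn_sketch_t;

/* Hypothetical stub modelling a call that fails and returns -ERRNO. */
int max_gpfn_stub(void)
{
    return -ENOSYS;
}

/* The buggy pattern: the error is converted to unsigned before the check. */
int buggy_check_detects_error(void)
{
    pfn_sketch_t pfn = (pfn_sketch_t)max_gpfn_stub(); /* becomes a huge value */

    return pfn <= 0;                                  /* true only for pfn == 0 */
}

/* One possible fix: test the signed return value before converting it. */
int signed_check_detects_error(void)
{
    int rc = max_gpfn_stub();

    return rc < 0;
}
```

In this sketch the buggy check reports no error although the stub failed, while the signed check reports it.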
Furthermore, on ARM, xc_domain_maximum_gpfn() is not implemented and
returns -ENOSYS. This makes libxc always use the same PFN, which
may collide with an already mapped region (see xen/include/public/arch-arm.h
for the layout).
This patch only addresses the problem for ARM; the x86 version keeps the same
behavior (i.e. xc_domain_maximum_gpfn() + 1), as I'm not familiar with Xen x86.
A new function xc_core_arch_get_scratch_gpfn is introduced to be able to
choose the gpfn per architecture.
For the ARM version, we use GUEST_GNTTAB_GUEST, which is the base of
the region used by the guest to map the grant table. At build time,
nothing is mapped there.
At the same time correctly check the return of xc_domain_maximum_gpfn
for x86.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
x86: vcpu_destroy_pagetables() must not return -EINTR
.. otherwise it has the side effect that domain_relinquish_resources
will stop and return to user-space with -EINTR, an error code that
user-space is not equipped to deal with; or that vcpu_reset will
ignore it and convert the error to -ENOMEM.
The preemption mechanism we have for domain destruction is to return
-EAGAIN (and then user-space calls the hypercall again) and as such we need
to catch the case of:
we need to return -ERESTART otherwise we end up returning -ENOMEM.
There are also other callers of vcpu_destroy_pagetables via arch_vcpu_reset
(vcpu_reset):
- hvm_s3_suspend (asserts on any return code),
- vlapic_init_sipi_one (asserts on any return code),
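The conversion described above might look roughly like the following sketch. All names and numeric values here (SK_EINTR, SK_ERESTART, inner_op_stub, destroy_pagetables_sketch) are hypothetical and are not taken from the Xen sources; the sketch only shows a wrapper that never lets -EINTR escape to its callers.

```c
#include <assert.h>

/* Hypothetical error values for this sketch only. */
enum { SK_EINTR = 4, SK_ERESTART = 85 };

/* Hypothetical stand-in for a preemptible inner operation. */
int inner_op_stub(int interrupted)
{
    return interrupted ? -SK_EINTR : 0;
}

/* Wrapper that translates "interrupted" into "please call again". */
int destroy_pagetables_sketch(int interrupted)
{
    int rc = inner_op_stub(interrupted);

    if ( rc == -SK_EINTR )
        rc = -SK_ERESTART;   /* callers never see -EINTR */

    return rc;
}
```

Translating at the lowest common point means each caller only has to understand one "retry" code.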
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 26 Jan 2015 11:50:21 +0000 (12:50 +0100)]
x86: use tzcnt instead of bsf
Following a compiler change done in 2012, make use of the fact that for
non-zero input BSF and TZCNT produce the same numeric result (EFLAGS
setting differs), and that CPUs not knowing of TZCNT will treat the
instruction as BSF (i.e. ignore what looks like a REP prefix to them).
The assumption here is that TZCNT would never have worse performance
than BSF.
Also extend the asm() input in find_first_set_bit() to allow memory
operands.
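As a rough illustration of the encoding trick, the sketch below emits "rep; bsf", which TZCNT-capable CPUs decode as TZCNT and older CPUs execute as plain BSF. The function name first_set_bit_sketch is hypothetical, the "rm" constraint mirrors the memory-operand extension mentioned above, and the non-x86 fallback through a GCC/Clang builtin is an assumption added so the sketch builds elsewhere. The input must be non-zero.

```c
#include <assert.h>

/*
 * Index of the lowest set bit. The caller must pass a non-zero word; the
 * result for zero is not meaningful in this sketch.
 */
unsigned int first_set_bit_sketch(unsigned long word)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned long index;

    /* The REP prefix turns BSF into TZCNT where the CPU knows TZCNT. */
    __asm__ ( "rep; bsf %1,%0" : "=r" (index) : "rm" (word) );
    return (unsigned int)index;
#else
    /* Fallback for other architectures (GCC/Clang builtin). */
    return (unsigned int)__builtin_ctzl(word);
#endif
}
```

For non-zero inputs both instructions yield the same index, so callers do not need to know which one actually ran.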
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Mon, 26 Jan 2015 11:48:38 +0000 (12:48 +0100)]
x86/HVM: improve EFER validation error messages
The previous error message was of very little use in identifying the actual
problem after the fact. Now, hvm_efer_valid() will indicate the issue which
it objects to, which is far more useful for diagnosing issues from logs.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Boris Ostrovsky [Fri, 23 Jan 2015 16:53:49 +0000 (17:53 +0100)]
x86/VPMU: handle APIC_LVTPC accesses
Don't have the hypervisor update APIC_LVTPC when _it_ thinks the vector should
be updated. Instead, handle guest's APIC_LVTPC accesses and write what the guest
explicitly wanted.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Boris Ostrovsky [Fri, 23 Jan 2015 16:50:53 +0000 (17:50 +0100)]
x86/VPMU: manage VPMU_CONTEXT_SAVE flag in vpmu_save_force()
There is a possibility that we set VPMU_CONTEXT_SAVE on VPMU context in
vpmu_load() and never clear it (because vpmu_save_force() will see
VPMU_CONTEXT_LOADED bit clear, which is possible on AMD processors).
The problem is that amd_vpmu_save() assumes that if VPMU_CONTEXT_SAVE is set
then (1) we need to save counters and (2) we don't need to "stop" control
registers since they must have been stopped earlier. The latter may cause all
sorts of problems (like counters still running in a wrong guest and hypervisor
sending to that guest unexpected PMU interrupts).
Since setting this flag is currently always done prior to calling
vpmu_save_force() let's both set and clear it there.
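A simplified model of keeping the flag's lifetime inside one function is sketched below. The structure, flag values and function names (vpmu_sketch, SK_CONTEXT_*, arch_save_stub, save_force_sketch) are hypothetical and do not reproduce the Xen code; handling of the LOADED flag after a save is deliberately not modelled.

```c
#include <assert.h>

/* Hypothetical flag values for this sketch. */
#define SK_CONTEXT_LOADED 0x1u
#define SK_CONTEXT_SAVE   0x2u

struct vpmu_sketch {
    unsigned int flags;
    unsigned int saves;   /* counts how often the arch save hook ran */
};

/* Stand-in for the architecture-specific save hook. */
void arch_save_stub(struct vpmu_sketch *vpmu)
{
    vpmu->saves++;
}

/* Set the SAVE flag only around the save itself, so it cannot leak out. */
void save_force_sketch(struct vpmu_sketch *vpmu)
{
    if ( !(vpmu->flags & SK_CONTEXT_LOADED) )
        return;                       /* nothing loaded: flag is never set */

    vpmu->flags |= SK_CONTEXT_SAVE;
    arch_save_stub(vpmu);
    vpmu->flags &= ~SK_CONTEXT_SAVE;  /* always cleared before returning */
}
```

Because the flag is set after the early return and cleared before the function exits, no path through this sketch leaves it set.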
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Roger Pau Monné [Fri, 23 Jan 2015 14:16:18 +0000 (15:16 +0100)]
x86: prevent access to HPET from Dom0
Prevent Dom0 from accessing the HPET MMIO region by adding the HPET mfn to the
list of forbidden memory regions (if the ACPI_HPET_PAGE_PROTECT4 or
ACPI_HPET_PAGE_PROTECT64 flag is set) or to the list of read-only regions.
Also provide an option that prevents adding the HPET to the read-only memory
regions called ro-hpet, in case there are systems that put other stuff in
the HPET page.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Don't loop over iomem_deny_access() for consecutive MFNs.
Put new command line option's doc entry in right spot.