Ian Campbell [Mon, 30 Mar 2015 11:12:32 +0000 (12:12 +0100)]
xen: arm: handle remaining traps from userspace
CP14 dbg and general CP register access are both handled with
unconditional injection of #undef from their respective handlers, so
allow these even from 32-bit userspace on a 64-bit kernel.
SMC32 and HVC32 should only come from a guest in AArch32 mode and
SMC64 and HVC64 should only come from a guest in AArch64 mode. Add
appropriate BUG_ONs to all cases.
After this bad_trap is no longer used.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
Ian Campbell [Mon, 30 Mar 2015 11:12:30 +0000 (12:12 +0100)]
xen: arm: Handle CP14 32-bit register accesses from userspace
Accesses to these from 32-bit userspace would cause a hypervisor
exception (host crash) when running a 64-bit kernel, which is worked
around by the fix to XSA-102. On 32-bit kernels they would be
implemented as RAZ/WI which is incorrect but harmless.
Update as follows:
- DBGDSCRINT should be R/O.
- DBGDSCREXT should be EL1 only.
- DBGOSLAR is WO and EL1 only.
- DBGVCR, DBGB[VC]R*, DBGW[VC]R*, and DBGOSDLR are EL1 only.
DBGDIDR and DBGDSCRINT are accessible from EL0 if DBGDSCRext.UDCCdis is clear.
Since we emulate that as RAZ/WI we allow access.
When we do not allow an access we now silently inject an undef even in
debug mode since the debugging messages are not helpful (we have
handled the access, by explicitly choosing not to).
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
Ian Campbell [Mon, 30 Mar 2015 11:12:29 +0000 (12:12 +0100)]
xen: arm: Handle CP15 register traps from userspace
Previously userspace access to PM* would have been incorrectly (but
benignly) implemented as RAZ/WI when running on a 32-bit kernel and
would cause a hypervisor exception (host crash) when running a 64-bit
kernel (this was already solved via the fix to XSA-102).
PMINTENSET, PMINTENCLR are EL1 only, but it is not clear whether
attempts to access from EL0 will trap to EL1 or EL2, be conservative
and handle EL0 access with an undef injection.
ACTLR is EL1 only and the ARM ARM states that HCR_EL2.TACR causes
accesses from EL1 to trap. However remain conservative even here and
handle accesses from EL0 by injecting an undef.
PMUSERENR is R/O at EL0 and we implement as RAZ/WI at EL1 as before.
The remaining PM* registers are accessible to EL0 only if
PMUSERENR_EL0.EN is set, since we emulate this as RAZ/WI the bit is
never set so we inject a trap on attempted access. We weren't
previously handling PMCCNTR.
HSR_EC_CP15_32 should never be seen from a 64-bit guest, so BUG_ON if
that occurs.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
Ian Campbell [Mon, 30 Mar 2015 11:12:28 +0000 (12:12 +0100)]
xen: arm: drop cache maintenance by set/way trap handling
We do not set HCR_EL2.TSW so we will never see these.
This is undoubtedly wrong, but for now remove the dead code.
However, retain the HSR_SYSREG_* added by the precursor to this patch,
although they aren't used they are factually accurate and may as well
be kept for future use.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
Ian Campbell [Mon, 30 Mar 2015 11:12:24 +0000 (12:12 +0100)]
xen: arm: handle accesses to CNTP_CVAL_EL0
All OSes we have run on top of Xen use CNTP_TVAL_EL0 but for
completeness we really should handle CVAL too.
In vtimer_emulate_cp64 pull the propagation of the 64-bit result into
two 32-bit registers out of the switch to avoid duplicating for every
register. We also need to initialise x now since previously the only
implemented register was R/O.
While adding HSR_SYSREG_CNTP_CVAL_EL0 also move
HSR_SYSREG_CNTP_CTL_EL0 so it is sorted correctly.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
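A minimal sketch of the propagation described in the CNTP_CVAL_EL0 commit above, where one 64-bit result is spread over two 32-bit registers. The helper names (split_u64, join_u64) are hypothetical and are not taken from vtimer_emulate_cp64:

```c
#include <stdint.h>

/* Hypothetical helper, not Xen's own: a 64-bit coprocessor read hands
 * the value back in two 32-bit registers, low word first. */
static void split_u64(uint64_t x, uint32_t *lo, uint32_t *hi)
{
    *lo = (uint32_t)(x & 0xffffffffu);
    *hi = (uint32_t)(x >> 32);
}

/* The reverse direction, as needed once a register is writable. */
static uint64_t join_u64(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

Doing the split once after the switch, rather than in every case, is what makes initialising x necessary: a write path reads x before any case has assigned it.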
Ian Campbell [Mon, 30 Mar 2015 11:12:23 +0000 (12:12 +0100)]
xen: arm: correctly handle vtimer traps from userspace
Previously 32-bit userspace on 32-bit kernel and 64-bit userspace on
64-bit kernel could access these registers irrespective of whether the
kernel had configured them to be allowed to. To fix this:
- Userspace access to CNTP_CTL_EL0 and CNTP_TVAL_EL0 should be gated
on CNTKCTL_EL1.EL0PTEN.
- Userspace access to CNTPCT_EL0 should be gated on
CNTKCTL_EL1.EL0PCTEN.
When we do not handle an access we now silently inject an undef even
in debug mode since the debugging messages are not helpful (we have
handled the access, by explicitly choosing not to).
The usermode accessibility check is rather repetitive, so a helper
macro is introduced.
Since HSR_EC_CP15_64 cannot be taken from a guest in AArch64 mode
except due to a hardware bug, switch the associated check to a BUG_ON
(which will be switched to something more appropriate in a subsequent
patch).
Fix a coding style issue in HSR_CPREG64(CNTPCT) while touching similar
code.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
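The gating described in the vtimer commit above could be sketched as below. The bit positions are the architectural ones for CNTKCTL_EL1; the macro and function names are assumptions made for illustration and are not the helper macro the patch introduces:

```c
#include <stdbool.h>
#include <stdint.h>

/* CNTKCTL_EL1 enable bits (architectural positions). */
#define CNTKCTL_EL0PCTEN (1u << 0) /* EL0 access to the physical counter */
#define CNTKCTL_EL0PTEN  (1u << 9) /* EL0 access to the physical timer   */

/* Hypothetical helper: kernel-mode accesses are always emulated,
 * user-mode accesses only if the kernel has set the enable bit.
 * A false result means the caller should inject an undef. */
static bool vtimer_access_allowed(bool from_user, uint32_t cntkctl,
                                  uint32_t enable_bit)
{
    return !from_user || (cntkctl & enable_bit) != 0;
}
```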
Ian Campbell [Mon, 30 Mar 2015 11:12:22 +0000 (12:12 +0100)]
xen: arm: Factor out psr_mode_is_user
This embodies the logic on arm64 that userspace can be either 32-bit
or 64-bit. It will be used in other places shortly.
Note that the logic differs slightly because the original (in
inject_abt64_exception) knew that the kernel was 64-bit and could
therefore assume that any 32-bit mode was userspace. Instead the
refactored code explicitly checks for usr mode.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
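A sketch of the arm64 case of the check factored out above, assuming the saved PSR is available as a plain 32-bit value. The mode encodings are architectural; the names are illustrative and not those used in Xen:

```c
#include <stdbool.h>
#include <stdint.h>

#define MODE_MASK 0x1fu /* M[4:0] field of the saved PSR */
#define MODE_EL0t 0x00u /* AArch64 EL0 */
#define MODE_EL1h 0x05u /* AArch64 EL1, using SP_EL1 */
#define MODE_USR  0x10u /* AArch32 User */
#define MODE_SVC  0x13u /* AArch32 Supervisor */

/* Userspace is either 64-bit EL0 or 32-bit usr mode. Checking for usr
 * explicitly, instead of "any 32-bit mode", keeps the test correct when
 * the kernel itself is 32-bit. */
static bool mode_is_user(uint32_t psr)
{
    uint32_t mode = psr & MODE_MASK;

    return mode == MODE_EL0t || mode == MODE_USR;
}
```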
Boris Ostrovsky [Mon, 30 Mar 2015 20:17:59 +0000 (16:17 -0400)]
flask: Update XEN_SYSCTL_cputopoinfo name
Commit 2090f14c5cbd ("sysctl: make XEN_SYSCTL_topologyinfo sysctl a
little more efficient") renamed XEN_SYSCTL_topologyinfo to
XEN_SYSCTL_cputopoinfo.
It, however, neglected to update this macro for flask-related files.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reported-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Ian Campbell <ian.campbell@citrix.com>
libxc: Introduce xc_domain_nr_gpfns as a cousin of xc_domain_maximum_gpfn.
The commit a8f8a590e02d2d2b717257c0bd9a8b396103bdf4
"libxc: Check xc_domain_maximum_gpfn for negative return values"
introduced a regression in tools outside libxc (migrate v2)
which wanted the unfiltered GPFN value. Said commit added
a wrapper which added +1 if there were no errors.
To make it work pre-commit a8f8a59 we add an xc_domain_nr_gpfns
which will add +1 if there are no errors (and change all in-tree
callers to use it). The xc_domain_maximum_gpfn will return the
unfiltered GPFN value.
Suggested-by: Ian Campbell <ian.campbell@citrix.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
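The relationship between the two calls in the commit above can be sketched as follows. Both functions here are hypothetical stand-ins with made-up signatures, not the libxc API:

```c
#include <stdint.h>

/* Hypothetical stand-in for the unfiltered query: returns 0 and stores
 * the highest GPFN, or returns -1 and leaves *gpfn untouched. */
static int max_gpfn_stub(int fail, uint64_t *gpfn)
{
    if (fail)
        return -1;
    *gpfn = 0xfffff;
    return 0;
}

/* The "number of GPFNs" cousin: highest GPFN plus one, but only when
 * the underlying query succeeded. */
static int nr_gpfns_sketch(int fail, uint64_t *count)
{
    int rc = max_gpfn_stub(fail, count);

    if (rc >= 0)
        *count += 1;
    return rc;
}
```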
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Coverity-IDs: 1291939 (stray semicolon), 1291941 (structurally dead code) CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Xen Coverity Team <coverity@xen.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Thu, 26 Mar 2015 10:54:04 +0000 (10:54 +0000)]
xen: arm: correctly handle continuations for 64-bit guests
The 64-bit ABI is different to 32-bit:
- uses x16 as the op register rather than r12.
- arguments in x0..x5 and not r0..r5. Using rN here potentially
truncates.
- return value goes in x0, not r0.
Hypercalls can only be made directly from kernel space, so checking
the domain's size is sufficient.
Spotted due to spurious -EFAULT when destroying a domain, due to the
hypercall's pointer argument being truncated. I'm unclear why I am
only seeing this now.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
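To illustrate the truncation described in the continuation commit above: a sketch assuming a simplified register file in which the AArch32 registers r0..r12 occupy the low halves of x0..x12. The struct and function names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register file, not Xen's cpu_user_regs. */
struct guest_regs { uint64_t x[31]; };

/* The hypercall op number lives in x16 for 64-bit guests, r12 otherwise. */
static unsigned int op_reg_index(bool is_64bit_domain)
{
    return is_64bit_domain ? 16 : 12;
}

/* Writing through the rN view keeps only the low 32 bits, which is
 * what corrupts a 64-bit pointer argument. */
static void set_hypercall_arg(struct guest_regs *regs, bool is_64bit_domain,
                              unsigned int n, uint64_t val)
{
    regs->x[n] = is_64bit_domain ? val : (uint32_t)val;
}
```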
Ian Jackson [Tue, 10 Feb 2015 20:09:49 +0000 (20:09 +0000)]
libxl: Comment cleanups
* Add two comments in libxl_remus_disk_drbd documenting buggy handling
of the hotplug script exit status.
* Add a section heading for async exec in libxl_aoutils.c
* Mention the right function name (libxl__ev_child_fork, not
libxl__ev_fork) in libxl_internal.h
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Yang Hongyang <yanghy@cn.fujitsu.com> CC: Wen Congyang <wency@cn.fujitsu.com> CC: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Yang Hongyang <yanghy@cn.fujitsu.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Tue, 10 Feb 2015 20:09:48 +0000 (20:09 +0000)]
libxl: Further fix exit paths from libxl_device_events_handler
On the success path, do not call GC_FREE explicitly. Instead, call
AO_INPROGRESS.
GC_FREE will free the gc underlying the long-term ao, which is then
subsequently referenced in backend_watch_callback's call to
libxl__nested_ao_create. It is a miracle that this ever works at all.
Also, add an `if (rc) goto out;' after the xswatch registration.
After this, libxl_device_events_handler has the conventional and
correct ao initiation pattern.
Olaf Hering [Tue, 24 Mar 2015 14:37:42 +0000 (14:37 +0000)]
tools/mkrpm: improve version.release handling
An increasing version and/or release number helps to update existing
packages without --force as in "rpm -Uvh --force xen.rpm". Instead it's
possible to do "rpm -Fvh *.rpm" to update only already installed
packages.
The usage of --force disables essential checks such as file conflict
detection. As a result the new xen.rpm may overwrite files owned by
other packages.
With the current way of calculating version-release it is difficult to
get an increasing release number into the spec file. The release is
always zero unless "make make XEN_VENDORVERSION=`date +.%s`" is used,
which has the bad side effect that xen.gz always gets a different
filename every time.
Update mkrpm to recognize PKG_RELEASE=. Its value will be appended to
the Release string. It can be filled with a time stamp, like:
make rpmball PKG_RELEASE="`date +%Y%m%d%H%M%S`"
Signed-off-by: Olaf Hering <olaf@aepfle.de> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> Tested-by: George Dunlap <george.dunlap@eu.citrix.com>
Olaf Hering [Fri, 27 Mar 2015 10:29:24 +0000 (10:29 +0000)]
hotplug/Linux: add missing backslash in dom0_ip
Without it the actual error message is not written to xenstore.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Boris Ostrovsky [Thu, 26 Mar 2015 18:08:44 +0000 (14:08 -0400)]
libxc: Make conversion from page count to bytes 32-bit safe
Commit ba59e2ce935d ("libxc: allocate memory with vNUMA information for
PV guest") creates default vNUMA layout with a single range containing
all memory. The end of the range is calculated by shifting
dom->total_pages by 12 to the left.
On 32-bit dom0 this may result in losing upper bits since total_pages is
a 32-bit type.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
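The overflow in the commit above can be reproduced in isolation. This sketch assumes a 32-bit page count and a shift of 12 as stated; the function names are illustrative only:

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Shift performed in 32 bits: the upper bits of the result are lost. */
static uint32_t range_end_32bit(uint32_t total_pages)
{
    return (uint32_t)(total_pages << PAGE_SHIFT);
}

/* Widen first, then shift: safe for any 32-bit page count. */
static uint64_t range_end_64bit(uint32_t total_pages)
{
    return (uint64_t)total_pages << PAGE_SHIFT;
}
```

With 0x100000 pages (4 GiB) the 32-bit variant yields 0, while the widened variant yields 0x100000000.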
Jan Beulich [Fri, 27 Mar 2015 14:23:25 +0000 (15:23 +0100)]
VT-d: improve fault info logging
I got repeatedly annoyed by nothing getting logged by
default on VT-d faults (and hence having to tell people to add extra
command line options), and hence I think it is time to redo this code:
Log basic fault information at guest-warning level (rate limited by
default), and show the page walk in verbose rather than only in debug
mode. Break up multi-line message so that each gets a proper log level
attached, at once splitting out the common part. Also don't log
"unknown" faults as interrupt-remapping ones.
As a minor cleanup fix the type of the involved "fault_type" variables.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
Jan Beulich [Thu, 26 Mar 2015 10:23:33 +0000 (11:23 +0100)]
x86: simplify non-atomic bitops
- being non-atomic, their pointer arguments shouldn't be volatile-
qualified
- their (half fake) memory operands can be a single "+m" instead of
being both an output and an input
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 26 Mar 2015 10:19:57 +0000 (11:19 +0100)]
x86/MSI: fix error handling
__setup_msi_irq() needs to undo what it did before calling
write_msi_msg() in case that returned an error.
map_domain_pirq() needs to get rid of the MSI descriptor it
(implicitly) allocated. The case of a setup_msi_irq() failure on a
non-initial multi-vector-MSI interrupt needs special handling: While
the initial IRQ will get freed by the caller (who also passed it to
us), we need to undo the effect setup_msi_irq() had on it. (As a
benefit from the added call to msi_free_irq() we no longer need to
explicitly call destroy_irq() on the non-initial slots.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
JeHyeon Yeon [Thu, 26 Mar 2015 10:19:10 +0000 (11:19 +0100)]
LZ4 : fix the data abort issue
If part of the compressed data is corrupted, or the compressed
data is totally fake, memory access beyond the limit is possible.
This is the log from my system using lz4 decompression.
[6502]data abort, halting
[6503]r0 0x00000000 r1 0x00000000 r2 0xdcea0ffc r3 0xdcea0ffc
[6509]r4 0xb9ab0bfd r5 0xdcea0ffc r6 0xdcea0ff8 r7 0xdce80000
[6515]r8 0x00000000 r9 0x00000000 r10 0x00000000 r11 0xb9a98000
[6522]r12 0xdcea1000 usp 0x00000000 ulr 0x00000000 pc 0x820149bc
[6528]spsr 0x400001f3
and the memory addresses of some variables at the moment are
ref:0xdcea0ffc, op:0xdcea0ffc, oend:0xdcea1000
As you can see, COPYLENGTH is 8 bytes, so @ref and @op can access the memory
over @oend.
Signed-off-by: JeHyeon Yeon <tom.yeon@windriver.com> Reviewed-by: David Sterba <dsterba@suse.cz>
[Linux commit d5e7cafd69da24e6d6cc988fab6ea313a2577efc] Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
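The overrun reported above follows from copying in fixed 8-byte steps. The bounds check below is a sketch of the general idea only, with a hypothetical function name; it is not the check added by the Linux commit:

```c
#include <stdbool.h>
#include <stddef.h>

#define COPY_STRIDE 8 /* bytes written per step, as in the report */

/* True if copying `len` bytes in COPY_STRIDE-sized steps stays inside a
 * buffer that has `remaining` bytes left (oend - op in the report). */
static bool stride_copy_fits(size_t remaining, size_t len)
{
    /* Number of whole steps needed, rounded up without overflow. */
    size_t steps = len / COPY_STRIDE + (len % COPY_STRIDE != 0);

    return steps <= remaining / COPY_STRIDE;
}
```

With the logged values, oend - op is 0xdcea1000 - 0xdcea0ffc = 4, so even a one-byte copy performed as a single 8-byte step writes past @oend.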
Jan Beulich [Thu, 26 Mar 2015 10:18:28 +0000 (11:18 +0100)]
x86: don't change affinity with interrupt unmasked
With ->startup unmasking the IRQ, setting the affinity afterwards
without masking the IRQ again is invalid namely for MSI (address and
data can't be updated atomically and may - at least for MSI-X - be
cached while unmasked).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 26 Mar 2015 10:17:51 +0000 (11:17 +0100)]
hvmloader: don't treat ROM BAR like other BARs
Its low 11 bits have different meaning.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Boris Ostrovsky [Thu, 26 Mar 2015 10:13:01 +0000 (11:13 +0100)]
sysctl: don't overwrite array size variable when it is set on error earlier
When querying CPU topology, if caller-provided array size is smaller than
number of online CPUs then, in addition to returning -ENOBUFS, sysctl is
expected to provide back this number. However, this value, stored in 'i',
is overwritten in the subsequent loop's control statement.
Make sure we don't do this by converting the loop to 'while'.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
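A reduced illustration of the loop problem fixed above. The function and its parameters are hypothetical; only the shape of the control flow is meant to match the description:

```c
#include <errno.h>

/* Fills out[0..online-1] when the caller's array is large enough.
 * Either way *reported receives the number of online CPUs. */
static int query_cpus(unsigned int capacity, unsigned int online,
                      unsigned int *out, unsigned int *reported)
{
    unsigned int i = 0;
    int ret = 0;

    if (capacity < online) {
        ret = -ENOBUFS;
        i = online; /* value to hand back to the caller */
    }

    /* A `for (i = 0; ...)` here would reset i and lose the value set
     * on the error path; `while` leaves it alone. */
    while (i < online) {
        out[i] = i;
        i++;
    }

    *reported = i;
    return ret;
}
```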
Ian Campbell [Thu, 26 Mar 2015 10:09:31 +0000 (11:09 +0100)]
arm: use gprintk as appropriate
gdprintk is now only included with debug=y builds. Therefore:
- switch some uses to gprintk
- remove some now redundant #ifndef NDEBUG surrounding existing
gdprintk uses.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Thu, 26 Mar 2015 10:08:28 +0000 (11:08 +0100)]
introduce gprintk()
... and convert several gdprintk()-s to it, as the next patch will make
them no-ops in non-debug builds.
Note that as a non-debug facility this does not print file name and
line number of the origin, so people are expected to use meaningful and
easily distinguishable messages (i.e. just like with plain printk()).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Thu, 26 Mar 2015 10:05:01 +0000 (11:05 +0100)]
add flag to start info regarding virtual mapped p2m list
Xen pv domains are using a domain private p2m list to convert guest pfns
to mfns. This p2m list has to be updated by the Xen tools during domain
restore and migration, as the mfns will most likely change. In order to
locate the p2m list the Xen tools need an interface provided by the
guest. Up to now this interface has been the shared info page where the
guest would store the mfn of the top level page of a 3-level p2m tree.
This p2m tree is fixed in its layout and due to the limitation of
entries it can hold at each level it is limiting the maximum size of the
p2m list which can be reported to the Xen tools. The maximum memory the
p2m tree can support for 64 bit domains is 512 GB (32 bit domains don't
have a problem, as the p2m tree limit is much higher than the supported
domain size of 64 GB).
In order to be able to support pv domains with more than 512 GB an
additional way to specify the p2m list for the Xen tools has been added:
instead of a tree structure linked via mfns, the virtual address of a
linear p2m list, the cr3 value of the related address space and the size
of the p2m list can be specified by the guest (added by commit 50bd1f0825339dfacde471df7664729216fc46e3).
Guests implementing this new interface need to know, of course, whether
the Xen tools are capable of using the new interface instead of the old
p2m tree interface. Otherwise a guest using only the new interface with
the virtual mapped linear p2m list on a machine with old Xen tools not
supporting this interface could not be restored or migrated.
The added flag in the start info indicates the Xen tools' capability to
use the new interface enabling the guest to omit the p2m tree and thus
to support more than 512 GB of RAM.
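The 512 GB figure quoted above can be checked with a short calculation. This sketch assumes 4 KiB pages and one machine word per entry at every level of the tree (8 bytes for 64-bit guests, 4 bytes for 32-bit guests); the function name is illustrative:

```c
#include <stdint.h>

#define P2M_PAGE_SIZE 4096u

/* Memory covered by a 3-level tree: entries-per-page cubed leaf
 * entries, each describing one page of guest memory. */
static uint64_t p2m_tree_max_bytes(unsigned int entry_size)
{
    uint64_t per_page = P2M_PAGE_SIZE / entry_size;

    return per_page * per_page * per_page * P2M_PAGE_SIZE;
}
```

For 8-byte entries this gives 512^3 * 4 KiB = 2^39 bytes, i.e. 512 GiB; for 4-byte entries it gives 1024^3 * 4 KiB = 4 TiB, well above the 64 GB domain size.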
Vijaya Kumar K [Tue, 24 Mar 2015 11:44:47 +0000 (17:14 +0530)]
xen: Add populate_pt_range interface to reserve non-leaf level table entries
On x86, for pages mapped with the PAGE_HYPERVISOR attribute,
non-leaf page tables are allocated with valid pte entries,
and with the MAP_SMALL_PAGES attribute only non-leaf page tables are
allocated, with invalid (valid bit set to 0) pte entries.
However on arm this is not the case. On arm for the pages
mapped with PAGE_HYPERVISOR and MAP_SMALL_PAGES both
non-leaf and leaf level page table are allocated with valid bit
in pte entries.
This behaviour in arm makes common vmap code fail to
allocate memory beyond 128MB as described below.
In vm_init, map_pages_to_xen() is called for mapping
vm_bitmap. Initially one page of vm_bitmap is allocated
and mapped using PAGE_HYPERVISOR attribute.
For the rest of vm_bitmap pages, MAP_SMALL_PAGES attribute
is used to map.
In ARM for both PAGE_HYPERVISOR and MAP_SMALL_PAGES, valid bit
is set to 1 in pte entry for these mapping.
In vm_alloc(), map_pages_to_xen() is failing for >128MB because
for this next vm_bitmap page the mapping is already set in vm_init()
with valid bit set in pte entry. So map_pages_to_xen() in
ARM returns error.
With this patch, MAP_SMALL_PAGES is dropped and arch specific
populate_pt_range() api is introduced to populate non-leaf
page table entries for the requested pages. Added RESERVE option
to map non-leaf page table entries.
Signed-off-by: Vijaya Kumar K<Vijaya.Kumar@caviumnetworks.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- rewrote subject line ]
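The 128MB threshold in the commit above is consistent with one bitmap page tracking one bit per page of address space. A sketch of that arithmetic, assuming 4 KiB pages and a one-bit-per-page bitmap (both assumptions, as is the function name):

```c
#include <stdint.h>

/* Address space covered by a single page of bitmap: every bit in the
 * page stands for one page of the mapped range. */
static uint64_t bitmap_page_coverage(uint64_t page_size)
{
    uint64_t bits_per_page = page_size * 8;

    return bits_per_page * page_size;
}
```

For 4096-byte pages this is 32768 * 4096 bytes = 128 MiB, so the first allocation beyond that point is the first one to need a second bitmap page.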
Wei Liu [Fri, 20 Mar 2015 16:19:12 +0000 (16:19 +0000)]
libxl: use new QEMU xenstore protocol
Originally both QEMU traditional and QEMU upstream used hardcoded
/local/domain/0 paths. This patch changes the protocol to use
/local/domain/$dm_domid path.
For QEMU traditional and upstream without stubdom, $dm_domid is 0 so
the path is in fact still /local/domain/0.
For QEMU traditional stubdom, this is an incompatible protocol change.
However QEMU traditional is shipped with Xen so we are allowed to make
such a change. This change needs to work with the corresponding QEMU
traditional changeset.
There is no compatibility issue with QEMU upstream stubdom, because QEMU
upstream stubdom doesn't exist yet.
Watch /local/domain/$dm_domid/device-model/$domid/state, wait until
state turns "running" then unpause guest.
LIBXL_STUBDOM_START_TIMEOUT is the timeout used to wait for stubdom to be
ready. My test on a very old machine (Core 2 6400) showed that it might
need more than 20s before the stubdom is ready to serve DomU.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Boris Ostrovsky [Tue, 24 Mar 2015 08:27:00 +0000 (09:27 +0100)]
x86: don't use BAD_APICID for non-APICID fields
BAD_APICID is used by cpuinfo_x86's phys_proc_id, cpu_core_id
and compute_unit_id even though these fields don't hold an APIC ID
itself but rather its derivative.
Provide appropriate macros for each of those three (and make them
unsigned).
This also fixes a regression introduced by commit 2090f14c5cbd ("sysctl:
make XEN_SYSCTL_topologyinfo sysctl a little more efficient") which
leaked BAD_APICID into common code, breaking ARM.
Reported-by: Julien Grall <julien.grall@linaro.org> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Ditch INVALID_{CORE,SOCKET}ID in favor of always using
XEN_INVALID_{CORE,SOCKET}_ID.
Boris Ostrovsky [Tue, 24 Mar 2015 08:23:54 +0000 (09:23 +0100)]
pci: include asm/numa.h in pci.h
Commit 4fa6b0bacf9c ("pci: stash device's PXM information in struct
pci_dev") added node field to xen/include/xen/pci.h. Its type,
nodeid_t, is defined in asm/numa.h and we should include this file
explicitly in pci.h
Reported-by: Julien Grall <julien.grall@linaro.org> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Jan Beulich [Tue, 24 Mar 2015 08:23:00 +0000 (09:23 +0100)]
x86: support newer Intel CPU models
This just follows what the January 2015 edition of the SDM documents,
with additional clarification from Intel:
- Broadwell models 0x4f and 0x56 don't cross-reference other tables,
but should be treated like other Broadwell (0x3d),
- Xeon Phi model 0x57 lists LASTBRANCH_TOS but not where the actual
stack is. Being told it's Silvermont based, attach it there.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
acpi_disabled needs to be moved out of .init.data.
Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com>
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Don Slutz [Mon, 23 Mar 2015 15:53:44 +0000 (16:53 +0100)]
x86/hvm: prevent gcc uninitialised var warning
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 reports:
----------------------------------------------------------------------
hvm.c: In function `hvm_create_ioreq_server':
hvm.c:487:18: error: `bufioreq_pfn' may be used uninitialised in this function
[-Werror=uninitialized]
hvm.c:718:30: note: `bufioreq_pfn' was declared here
----------------------------------------------------------------------
My code analysis says that gcc is wrong, but initialize the variable
to prevent the gcc warning.
Reported-by: Ian Murray <murrayie@yahoo.co.uk> Signed-off-by: Don Slutz <dslutz@verizon.com>
Jan Beulich [Mon, 23 Mar 2015 15:51:14 +0000 (16:51 +0100)]
x86: allow 64-bit PV guest kernels to suppress user mode exposure of M2P
Xen L4 entries being uniformly installed into any L4 table and 64-bit
PV kernels running in ring 3 means that user mode was able to see the
read-only M2P presented by Xen to the guests. While apparently not
really representing an exploitable information leak, this still very
certainly was never meant to be that way.
Building on the fact that these guests already have separate kernel and
user mode page tables we can allow guest kernels to tell Xen that they
don't want user mode to see this table. We can't, however, do this by
default: There is no ABI requirement that kernel and user mode page
tables be separate. Therefore introduce a new VM-assist flag allowing
the guest to control respective hypervisor behavior:
- when not set, L4 tables get created with the respective slot blank,
and whenever the L4 table gets used as a kernel one the missing
mapping gets inserted,
- when set, L4 tables get created with the respective slot initialized
as before, and whenever the L4 table gets used as a user one the
mapping gets zapped.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
Jan Beulich [Mon, 23 Mar 2015 15:49:42 +0000 (16:49 +0100)]
vm-assist: prepare for discontiguous used bit numbers
Since a new flag will get assigned a value discontiguous to the
existing ones (in order to preserve the low bits, as only those are
currently accessible to 32-bit guests), this requires a little bit of
rework of the VM assist code in general: An architecture specific
VM_ASSIST_VALID definition gets introduced (with an optional compat
mode counterpart), and compilation of the respective code becomes
conditional upon this being defined (ARM doesn't wire these up and
hence doesn't need that code).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
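With discontiguous flag numbers, a simple "type < max" range check no longer works and a validity mask is needed. A sketch of such a check, using made-up bit assignments (three low flags plus one at bit 32) and hypothetical names rather than the real VM_ASSIST_VALID definition:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical assignments, for illustration only. */
#define ASSIST_VALID_MASK ((UINT64_C(1) << 0) | (UINT64_C(1) << 1) | \
                           (UINT64_C(1) << 2) | (UINT64_C(1) << 32))

/* A flag number is acceptable only if its bit is set in the mask. */
static bool assist_type_valid(unsigned int type)
{
    return type < 64 && ((ASSIST_VALID_MASK >> type) & 1) != 0;
}
```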
Boris Ostrovsky [Fri, 20 Mar 2015 16:35:39 +0000 (17:35 +0100)]
sysctl: make XEN_SYSCTL_topologyinfo sysctl a little more efficient
A number of changes to XEN_SYSCTL_topologyinfo interface:
* Instead of copying data for each field in xen_sysctl_topologyinfo separately
put cpu/socket/node into a single structure and do a single copy for each
processor.
* A NULL cputopo handle passed is a request for maximum number of CPUs
(num_cpus). If cputopo is valid and num_cpus is smaller than the number of
CPUs in the system then -ENOBUFS is returned (and correct num_cpus is provided)
* Do not use max_cpu_index, which is almost always used for calculating the number
of CPUs (thus requiring adding or subtracting one), replace it with num_cpus.
* There is no need to copy whole op in sysctl to user at the end, we only need
num_cpus.
* Rename xen_sysctl_topologyinfo and XEN_SYSCTL_topologyinfo to reflect the fact
that these are used for CPU topology. Subsequent patch will add support for
PCI topology sysctl.
* Replace INVALID_TOPOLOGY_ID with "XEN_"-prefixed macros for each invalid type
(core, socket, node).
Update sysctl version to 0x0000000C
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Boris Ostrovsky [Fri, 20 Mar 2015 16:34:13 +0000 (17:34 +0100)]
pci: stash device's PXM information in struct pci_dev
If ACPI provides PXM data for IO devices then dom0 will pass it to
hypervisor during PHYSDEVOP_pci_device_add call. This information,
however, is currently ignored.
We will store this information (in the form of nodeID) in pci_dev
structure so that we can provide it, for example, to the toolstack
when it adds support (in the following patches) for querying the
hypervisor about device topology
We will also print it when user requests device information dump.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Boris Ostrovsky [Fri, 20 Mar 2015 16:32:25 +0000 (17:32 +0100)]
numa: __node_distance() should return u8
SLIT values are byte-sized and some of them (0-9 and 255) have
special meaning. Adjust __node_distance() to reflect this and
modify scrub_heap_pages() and do_sysctl() to deal with
__node_distance() returning an invalid SLIT entry.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
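A sketch of how a caller might tell a usable SLIT distance from the special values mentioned above. It relies on the ACPI definition that 10 denotes local distance and 255 denotes unreachable; the names are illustrative and not those used in Xen:

```c
#include <stdbool.h>
#include <stdint.h>

#define SLIT_LOCAL       10   /* distance of a node to itself */
#define SLIT_UNREACHABLE 0xff /* no connection between the nodes */

/* Values 0-9 are reserved and 255 means unreachable, so neither can
 * be used as a relative distance. */
static bool slit_distance_usable(uint8_t distance)
{
    return distance >= SLIT_LOCAL && distance != SLIT_UNREACHABLE;
}
```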
Ross Lagerwall [Fri, 20 Mar 2015 16:04:14 +0000 (16:04 +0000)]
tools/libxl: Restore errnoval behavior for datacopier callback
6d896e1357ff ("tools/libxl: Extend datacopier to support reading into a
buffer") changed the semantics of the errnoval parameter for the datacopier
callback without updating all callers. Restore the original behavior for now.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
libxc: Fix do_memory_op to return negative value on errors
instead of the -Exx values (which should go in errno).
This patch has HUGE implications. There are a lot of APIs
that are using do_memory_op. Fortunately most of them
check for 'if (do_memory_op(..) < 0)' so will function
properly. However there were some which printed the return
value to the user. They have been fixed in:
libxc: Don't assign return value to errno for E820 get/set xc_ calls.
libxc: Check xc_sharing_* for proper return values.
libxc: If xc_domain_add_to_physmap fails, include errno value
libxc: Check xc_maximum_ram_page for negative return values.
libxc: Check xc_domain_maximum_gpfn for negative return values
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
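The convention adopted by the do_memory_op commit above (return a plain negative value, put the error code in errno) can be sketched like this. The helper is hypothetical and only shows the conversion from a -Exx style result:

```c
#include <errno.h>

/* Convert a "-Exx on failure" result into the libc-style convention:
 * -1 with errno set. Non-negative results pass through unchanged. */
static int to_errno_convention(int ret)
{
    if (ret < 0) {
        errno = -ret;
        return -1;
    }
    return ret;
}
```

Callers that only test for a negative result keep working with either convention, which is why most existing users needed no change.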
libxc: Check xc_maximum_ram_page for negative return values.
Instead of assuming everything is always OK. As such
we return now the return value (or zero for success).
The max_mfn is now passed in as the parameter.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
libxc: Check xc_domain_maximum_gpfn for negative return values
Instead of assuming everything is always OK. We stash
the gpfns value as a parameter. Since we use it in three
places we might as well update xc_domain_maximum_gpfn
to do the right thing.
Suggested-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
libxc: Return negative value and propagate errno for xc_offline_page API
Instead of returning -Exx we now return -1 for error.
We could stash the -Exx values in errno values but why - the
underlying functions we call all stash the proper errno
value. Hence we just propagate it up wherever it is needed.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
libxc: xc_physdev_map return -1 and populate errno.
The users of these (qemu) check for a negative value
so we are safe in regards to that. However they
also use the return value to inform the user of the
error.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
libxc: Fix xc_domain_get_tsc_info to return -1 instead of -Exx.
We don't need to fill errno because xc_hypercall_buffer_alloc
fills the errno with the appropriate errno values and we just
need to pass them up the stack.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
libxc: Propagate errno from hypercall instead of anything else.
After we have done the hypercall - the errno has the failure
code. However our usage of pthread and munmap can trigger them
to manipulate the errno with their failure values. That would
be bad as what we care about is just the hypercall error value.
Another solution to this would be to save the 'errno' from
pthread/munmap/madvise as an extra parameter to be analyzed
later. However the call-sites above us do not care about it.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
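The pattern can be sketched as saving errno right after the call that matters and restoring it once cleanup is done. This is a minimal illustration with hypothetical helpers (demo_hypercall, demo_cleanup), not the libxc code:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for a hypercall that fails with the given errno. */
static int demo_hypercall(int fail_with)
{
    if (fail_with) {
        errno = fail_with;
        return -1;
    }
    return 0;
}

/* Cleanup that may itself fail and overwrite errno. On Linux, munmap()
 * with a zero length fails with EINVAL; other systems may differ. */
static void demo_cleanup(void)
{
    (void)munmap(NULL, 0);
}

/* Preserve the hypercall's errno across the cleanup. */
static int demo_call(int fail_with)
{
    int rc = demo_hypercall(fail_with);
    int saved_errno = errno;

    demo_cleanup();
    errno = saved_errno;
    return rc;
}
```

Whatever the cleanup does to errno, the caller sees only the value left by the hypercall.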
Ian Jackson [Thu, 5 Mar 2015 16:28:04 +0000 (16:28 +0000)]
libxl: Domain destroy: fork
Call xc_domain_destroy in a subprocess. That allows us to do so
asynchronously, rather than blocking the whole process calling libxl.
The changes in detail:
* Provide a libxl__ev_child in libxl__domain_destroy_state, and
initialise it in libxl__domain_destroy. There is no possibility
to `clean up' a libxl__ev_child, but there is no need to clean it up, as
the control flow ensures that we only continue after the child has
exited.
* Call libxl__ev_child_fork at the right point and put the call to
xc_domain_destroy and associated logging in the child. (The child
opens a new xenctrl handle because we mustn't use the parent's.)
* Consequently, the success return path from domain_destroy_domid_cb
no longer calls dis->callback. Instead it simply returns.
* We plumb the errno value through the child's exit status, if it
fits. This means we normally do the logging only in the parent.
* Incidentally, we fix the bug that we were treating the return value
from xc_domain_destroy as an errno value when in fact it is a
return value from do_domctl (in this case, 0 or -1 setting errno).
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jim Fehlig <jfehlig@suse.com> Tested-by: Jim Fehlig <jfehlig@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
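A rough sketch of plumbing errno through a child's exit status, under simplifying assumptions: it uses plain fork() and a blocking waitpid() instead of libxl's event machinery, and demo_blocking_op stands in for the real destroy call. The sentinel value 255 for "errno did not fit" is an assumption of this sketch:

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the slow operation run in the child. */
static int demo_blocking_op(int fail_with)
{
    if (fail_with) {
        errno = fail_with;
        return -1;
    }
    return 0;
}

/* Returns 0 on success, the child's errno if it fitted in the exit
 * status, or -1 for any other failure. */
static int demo_destroy_in_child(int fail_with)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (demo_blocking_op(fail_with) == 0)
            _exit(0);
        /* Exit statuses are 8 bits wide; 255 marks "did not fit". */
        _exit(errno > 0 && errno < 255 ? errno : 255);
    }
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    if (!WIFEXITED(status) || WEXITSTATUS(status) == 255)
        return -1;
    return WEXITSTATUS(status);
}
```

With the errno value recovered in the parent, the logging can happen there, as the commit message describes.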
Ian Campbell [Fri, 13 Mar 2015 12:22:25 +0000 (12:22 +0000)]
xen: common: Use unbounded array for symbols_offset.
Using a singleton array causes gcc5 to report:
symbols.c: In function 'symbols_lookup':
symbols.c:128:359: error: array subscript is above array bounds [-Werror=array-bounds]
symbols.c:136:176: error: array subscript is above array bounds [-Werror=array-bounds]
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
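For illustration, an array whose real length is only known at link time can be declared with an incomplete type, so the compiler has no bound to check subscripts against. The names below (demo_offsets, demo_num_offsets) are hypothetical; in a real build the definition would live in a separate, generated object:

```c
#include <stddef.h>

/* Declaration as the consumer sees it: no bound, so indexing past
 * element 0 does not trip -Warray-bounds the way "[1]" would. */
extern const unsigned long demo_offsets[];
extern const size_t demo_num_offsets;

static unsigned long demo_lookup(size_t idx)
{
    return idx < demo_num_offsets ? demo_offsets[idx] : 0;
}

/* Definition, placed in the same file only to keep the sketch
 * self-contained. */
const unsigned long demo_offsets[] = { 0x0, 0x40, 0x90 };
const size_t demo_num_offsets = sizeof(demo_offsets) / sizeof(demo_offsets[0]);
```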
Ian Jackson [Tue, 17 Mar 2015 15:30:58 +0000 (09:30 -0600)]
libxl: Domain destroy: unlock userdata earlier
Unlock the userdata before we actually call xc_domain_destroy. This
leaves open the possibility that other libxl callers will see the
half-destroyed domain (with no devices, paused), but this is fine.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jim Fehlig <jfehlig@suse.com> Tested-by: Jim Fehlig <jfehlig@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Tue, 17 Mar 2015 15:30:57 +0000 (09:30 -0600)]
libxl: In domain death search, start search at first domid we want
From: Ian Jackson <Ian.Jackson@eu.citrix.com>
When domain_death_xswatch_callback needed a further call to
xc_domain_getinfolist it would restart it with the last domain it
found rather than the first one it wants.
If it only wants one it will also only ask for one domain. The result
would then be that it gets the previous domain again (ie, the previous
one to the one it wants), which still doesn't reveal the answer to the
question, and it would therefore loop again.
It's completely unclear to me why I thought it was a good idea to
start the xc_domain_getinfolist with the last domain previously found
rather than the first one left un-confirmed. The code has been that
way since it was introduced.
Instead, start each xc_domain_getinfolist at the next domain whose
status we need to check.
We also need to move the test for !evg into the loop, as we now need evg
to compute the arguments to getinfolist.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reported-by: Jim Fehlig <jfehlig@suse.com> Reviewed-by: Jim Fehlig <jfehlig@suse.com> Tested-by: Jim Fehlig <jfehlig@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
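The corrected search can be sketched as follows. This is a simplified model, not the libxl code: demo_getinfolist is a hypothetical stand-in for xc_domain_getinfolist operating on a sorted array of domids, and DEMO_BATCH is an arbitrary batch size:

```c
#include <stddef.h>

#define DEMO_BATCH 2

/* Hypothetical stand-in: copy up to max ids >= first from the sorted
 * array all[] into out[] and return how many were copied. */
static int demo_getinfolist(const int *all, size_t nall, int first,
                            int max, int *out)
{
    int n = 0;

    for (size_t i = 0; i < nall && n < max; i++)
        if (all[i] >= first)
            out[n++] = all[i];
    return n;
}

/* For each wanted domid (sorted ascending), record whether it exists. */
static void demo_check_deaths(const int *all, size_t nall,
                              const int *wanted, size_t nwanted, int *alive)
{
    int got[DEMO_BATCH];
    size_t w = 0;

    while (w < nwanted) {
        /* Ask for one domain only if only one is still wanted. */
        int max = (nwanted - w == 1) ? 1 : DEMO_BATCH;
        /* Start at the first domid still unconfirmed, never at the
         * last one found: the first answer then always settles it. */
        int n = demo_getinfolist(all, nall, wanted[w], max, got);
        int g = 0;

        while (w < nwanted) {
            while (g < n && got[g] < wanted[w])
                g++;
            if (g == n && n == max)
                break;  /* ran off a full batch: query again from wanted[w] */
            alive[w] = (g < n && got[g] == wanted[w]);
            w++;
        }
    }
}
```

Because each query begins at wanted[w], at least one wanted domid is resolved per query, so the outer loop cannot spin on the same answer.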
tools/libxl: close the logfile_w and null file descriptors in libxl__spawn_qdisk_backend() error path
Signed-off-by: Koushik Chakravarty <koushik.chakravarty@citrix.com> CC: Ian Jackson <ian.jackson@eu.citrix.com> CC: Stefano Stabellini <stefano.stabellini@eu.citrix.com> CC: Ian Campbell <ian.campbell@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Pramod Devendra [Thu, 19 Mar 2015 12:55:12 +0000 (12:55 +0000)]
libxl/libxl_qmp.c: fix error handling in qmp_open
1. Make sure sun_path does not overflow
2. Close qmp_fd on error
3. Use goto out for error handling
Signed-off-by: Pramod Devendra <pramod.devendra@citrix.com> CC: Ian Jackson <ian.jackson@eu.citrix.com> CC: Stefano Stabellini <stefano.stabellini@eu.citrix.com> CC: Ian Campbell <ian.campbell@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
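The three points above can be sketched in one small function. This is an illustrative Unix-socket connect helper under the name demo_unix_connect (hypothetical), not the qmp_open code itself:

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns a connected fd, or -1 with errno set. */
static int demo_unix_connect(const char *path)
{
    struct sockaddr_un addr;
    int fd = -1, rc = -1;

    /* 1. Make sure sun_path cannot overflow. */
    if (strlen(path) >= sizeof(addr.sun_path)) {
        errno = ENAMETOOLONG;
        goto out;
    }

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);    /* safe: length checked above */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    rc = fd;    /* success: hand the fd to the caller */
    fd = -1;

out:
    /* 2. and 3. A single exit path closes the fd on any error. */
    if (fd >= 0) {
        int saved_errno = errno;
        close(fd);
        errno = saved_errno;
    }
    return rc;
}
```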
George Dunlap [Thu, 19 Mar 2015 17:09:19 +0000 (17:09 +0000)]
docs: Mention a common pitfall in ballooning
Several users have reported that their available free memory in the
guest when they used maxmem >> memory was much smaller than when
maxmem == memory. This is the unavoidable consequence of how
ballooning works, but it's not something users expect.
Warn them of this effect in the place we tell them how to make it
happen, so they aren't surprised.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
dpci: put the dpci back on the list if scheduled from another CPU
There is a race when we clear the STATE_SCHED in the softirq
- which allows the 'raise_softirq_for' (on another CPU or
on the one running the softirq) to schedule the dpci.
Specifically this can happen when the other CPU receives
an interrupt, calls 'raise_softirq_for', and puts the dpci
on its per-cpu list (same dpci structure). Note that
this could also happen on the same physical CPU, however
the explanation for simplicity will assume two CPU actors.
There would be two 'dpci_softirq' running at the same time
(on different CPUs) where on one CPU it would be executing
hvm_dirq_assist (so had cleared STATE_SCHED and set STATE_RUN)
and on the other CPU it is trying to call:
if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
BUG();
Since STATE_RUN is already set it would end badly.
The reason we can get hit with this is when an interrupt
affinity is set over multiple CPUs.
Potential solutions:
a) Instead of the BUG() we can put the dpci back on the per-cpu
list to deal with later (when the softirq are activated again).
Putting the 'dpci' back on the per-cpu list is effectively a spin
until the bad condition clears.
b) We could also expand the test-and-set(STATE_SCHED) in raise_softirq_for
to detect for 'STATE_RUN' bit being set and schedule the dpci.
The BUG() check in dpci_softirq would be replaced with a spin until
'STATE_RUN' has been cleared. The dpci would still not
be scheduled when STATE_SCHED bit was set.
c) Only schedule the dpci when the state is cleared
(no STATE_SCHED and no STATE_RUN). It would spin if STATE_RUN is set
(as it is in progress and will finish). If the STATE_SCHED is set
(so hasn't run yet) we won't try to spin and just exit.
Down-sides of the solutions:
a). Live-lock of the CPU. We could be finishing a dpci, then adding
the dpci, exiting, and then processing the dpci once more. And so on.
We would eventually stop as the TIMER_SOFTIRQ would be set, which will
cause SCHEDULER_SOFTIRQ to be set as well and we would exit this loop.
Interestingly the old ('tasklet') code used this mechanism.
If the function assigned to the tasklet was running - the softirq
that ran said function (hvm_dirq_assist) would be responsible for
putting the tasklet back on the per-cpu list. This would allow
having a running tasklet and a 'to-be-scheduled' tasklet
at the same time.
b). is similar to a) - instead of re-entering the dpci_softirq
we are looping in the softirq waiting for the correct condition to
arrive. As it does not allow unwedging ourselves because the other
softirqs are not called - it is less preferable.
c) can cause a dead-lock if the interrupt comes in when we are
processing the dpci in the softirq - iff this happens on the same CPU.
We would be looping in raise_softirq waiting for STATE_RUN
to be cleared, while the softirq that was to clear it - is preempted
by our interrupt handler.
As such, this patch - which implements a) is the best candidate
for this quagmire.
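Solution a) can be modelled roughly as below. This is a user-space sketch using C11 atomics in place of Xen's test_and_set_bit/test_and_clear_bit, with hypothetical names (demo_dpci, demo_softirq_one); the per-cpu list handling and the softirq re-raise are left to the caller:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { DEMO_STATE_SCHED = 1 << 0, DEMO_STATE_RUN = 1 << 1 };

struct demo_dpci {
    atomic_uint state;
    int handled;        /* counts runs of the (elided) real work */
};

/* Process one entry taken off the per-cpu list. Returns false when the
 * entry is still running elsewhere; the caller then puts it back on
 * the list and retries on a later softirq pass instead of BUG()ing. */
static bool demo_softirq_one(struct demo_dpci *d)
{
    /* Equivalent of test_and_set_bit(STATE_RUN, ...). */
    if (atomic_fetch_or(&d->state, DEMO_STATE_RUN) & DEMO_STATE_RUN)
        return false;

    /* Equivalent of test_and_clear_bit(STATE_SCHED, ...). */
    if (atomic_fetch_and(&d->state, ~(unsigned)DEMO_STATE_SCHED)
        & DEMO_STATE_SCHED)
        d->handled++;

    atomic_fetch_and(&d->state, ~(unsigned)DEMO_STATE_RUN);
    return true;
}
```

The false return is the only behavioural change relative to the BUG() variant: the conflict becomes a deferred retry rather than a crash.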
Wei Liu [Wed, 18 Mar 2015 21:33:58 +0000 (21:33 +0000)]
libxlu: avoid having two definitions of XLU_ConfigList
There is already a typedef in libxlutil.h. Remove the one in
libxlu_internal.h.
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Dario Faggioli [Wed, 18 Mar 2015 13:20:01 +0000 (14:20 +0100)]
sched_rt: avoid ASSERT()ing on runq dump if there are no domains
being serviced by the RTDS scheduler, as that is a
legit situation to be in: think, for instance, of a
newly created RTDS cpupool, with no domains migrated
to it yet.
While there:
- move the spinlock acquisition up, to effectively
protect the domain list and avoid races;
- the mask of online pCPUs was being retrieved
but then not used anywhere in the function: get
rid of that.
Dario Faggioli [Fri, 13 Mar 2015 11:09:41 +0000 (12:09 +0100)]
xl: enable using ranges of pCPUs when creating cpupools
instead of just a list of single pCPUs or NUMA node IDs, as
it happens right now.
On the other hand, after this change, strings containing
pCPUs and NUMA node ranges are supported. The syntax is the
same one supported by the "cpus" and "cpus_soft" config
switch, i.e., "4-8" or "node:1,12-18,^14".
This makes things more flexible, more consistent, and also
improves error handling, as the pCPU range parsing routine
already present in xl is more reliable than just a call
to atoi().
While there, remove a redundant error check in the legacy syntax
handling (libxl_bitmap_test() already checks the index being
within the size of the bitmap).
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Juergen Gross <JGross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
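To illustrate why a dedicated range parser is more reliable than atoi(), here is a minimal sketch that handles "4-8" and "12-18,^14" style strings. It is not xl's parser: the names are hypothetical, "node:" prefixes are not handled (they need topology information), the 64-pCPU limit is arbitrary, and applying exclusions after all inclusions is an assumption of this sketch:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_CPUS 64

/* Parse one pCPU number. Unlike atoi(), this rejects empty input,
 * non-digits, overflow, and out-of-range values. */
static int demo_parse_num(const char *s, const char **end,
                          unsigned long *out)
{
    char *e;

    if (*s < '0' || *s > '9')
        return -1;
    errno = 0;
    *out = strtoul(s, &e, 10);
    if (errno || *out >= DEMO_MAX_CPUS)
        return -1;
    *end = e;
    return 0;
}

/* Parse a comma-separated list of N, N-M and ^N / ^N-M items into a
 * bitmask. Returns 0 on success, -1 on any syntax error. */
static int demo_parse_cpus(const char *spec, uint64_t *mask)
{
    uint64_t set = 0, clear = 0;
    const char *p = spec;

    if (!*p)
        return -1;
    for (;;) {
        unsigned long lo, hi;
        int negate = 0;

        if (*p == '^') {
            negate = 1;
            p++;
        }
        if (demo_parse_num(p, &p, &lo))
            return -1;
        hi = lo;
        if (*p == '-' && (demo_parse_num(p + 1, &p, &hi) || hi < lo))
            return -1;
        for (unsigned long c = lo; c <= hi; c++) {
            if (negate)
                clear |= UINT64_C(1) << c;
            else
                set |= UINT64_C(1) << c;
        }
        if (*p == '\0')
            break;
        if (*p != ',')
            return -1;
        p++;
    }
    *mask = set & ~clear;
    return 0;
}
```

An input such as "4x" is rejected here, whereas atoi() would silently return 4.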
Dario Faggioli [Fri, 13 Mar 2015 11:09:32 +0000 (12:09 +0100)]
xl: enable using ranges of pCPUs when manipulating cpupools
in fact, right now, xl sub-commands 'cpupool-cpu-add' and
'cpupool-cpu-remove' only accept the specification of one
pCPU to be added or removed to/from a cpupool.
With this change, they can deal with ranges, like "4-8",
or "node:1,12-18,^14". The syntax is exactly the same one
that is supported by the 'vcpu-pin' subcommand, and
specifying just one pCPU still works, of course.
This makes things more flexible, more consistent, and also
improves error handling, as the pCPU range parsing routine
already present in xl is more reliable than just a call
to atoi().
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Juergen Gross <JGross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
To add (remove) to (from) a cpupool all the pCPUs corresponding
to the bits that are set in the passed bitmap.
This is convenient and useful in order to implement, in xl,
the possibility of specifying ranges of pCPUs to be added
(removed) to (from) a cpupool, in the relevant commands.
While there, convert libxl_cpupool_cpu{add,remove} to use the
appropriate logging macro (LOGE()).
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Juergen Gross <JGross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Dario Faggioli [Fri, 13 Mar 2015 11:09:03 +0000 (12:09 +0100)]
xl: turn some int local variables into bool
more specifically, the ones used as argument presence
flags in `xl list'.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
tools/libxl: avoid comparing an unsigned int to -1
Signed-off-by: Koushik Chakravarty <koushik.chakravarty@citrix.com> CC: Ian Jackson <ian.jackson@eu.citrix.com> CC: Stefano Stabellini <stefano.stabellini@eu.citrix.com> CC: Ian Campbell <ian.campbell@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
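In C, comparing an unsigned int with -1 converts the -1 to UINT_MAX, which hides the intent and trips sign-comparison warnings. The commit carries no description of how it was fixed, so the following is only a sketch of two common alternatives, with hypothetical names: an explicit sentinel of the right type, or a signed status with the unsigned value in an out-parameter:

```c
#include <limits.h>
#include <stdbool.h>

/* Option 1: an explicit sentinel of the right type instead of a bare -1. */
#define DEMO_INVALID_ID UINT_MAX

static bool demo_id_is_valid(unsigned int id)
{
    return id != DEMO_INVALID_ID;
}

/* Option 2: a signed status, with the unsigned payload returned
 * through an out-parameter only on success. */
static int demo_lookup_id(int key, unsigned int *id)
{
    if (key < 0)
        return -1;
    *id = (unsigned int)key;
    return 0;
}
```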