Roger Pau Monne [Mon, 7 Nov 2016 15:32:00 +0000 (16:32 +0100)]
libxc: properly account for the page offset when copying ACPI data
Otherwise, ACPI data is always copied to the start of the page pointed to by
guest_addr_out, ignoring the page offset.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-and-Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
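Illustrative sketch (not part of the commit): the fix described above amounts to adding the in-page offset of the guest address to the mapped page base before copying. The names copy_to_guest_page and PAGE_SIZE are hypothetical, and a 4 KiB page size is assumed.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def copy_to_guest_page(page, guest_addr, data):
    """Copy data into a mapped page at the offset implied by guest_addr."""
    # Position of guest_addr inside its page; ignoring this is the bug.
    offset = guest_addr & (PAGE_SIZE - 1)
    if offset + len(data) > len(page):
        raise ValueError("data would cross the end of the mapped page")
    page[offset:offset + len(data)] = data


# Hypothetical guest address with a non-zero page offset (0x10).
page = bytearray(PAGE_SIZE)
copy_to_guest_page(page, 0xFC000010, b"RSD PTR ")
```

Without the offset, the same call would overwrite bytes 0..7 of the page instead of bytes 0x10..0x17.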
Jan Beulich [Mon, 7 Nov 2016 15:29:15 +0000 (08:29 -0700)]
IOMMU: release lock on new exit path
This was overlooked in 7b2842a414 ("IOMMU: replace ASSERT()s checking
for NULL").
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Daniel De Graaf [Fri, 4 Nov 2016 15:35:20 +0000 (11:35 -0400)]
xsm: add missing permissions discovered in testing
Add two missing allow rules:
1. Device model domain construction uses getvcpucontext, discovered by
Andrew Cooper while chasing an unrelated issue.
2. When a domain is destroyed with a device passthrough active, the
calls to remove_{irq,ioport,iomem} can be made by the hypervisor itself
(which results in an XSM check with the source xen_t). It does not make
sense to deny these permissions; no domain should be using xen_t, and
forbidding the hypervisor from performing cleanup is not useful.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Thu, 3 Nov 2016 16:41:57 +0000 (16:41 +0000)]
libxl: disallow enabling PoD and ALTP2M at the same time
That combination would cause Xen to crash.
Note that although this is a security issue, it is not XSA-worthy because
ALTP2M is experimental.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Thu, 3 Nov 2016 17:57:57 +0000 (17:57 +0000)]
git: Add metadata to the result of `git archive`
When building Xen from a source tarball, commit information is usually lost,
especially if the tarball was generated from a tag.
Have `git archive` automatically fill in metadata at the point of creating the
archive, which is especially useful when using web snapshot links such as:
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Fri, 28 Oct 2016 15:17:17 +0000 (16:17 +0100)]
flask: build policy in different locations
The flask policy can be built twice -- once for the hypervisor and once for
the tools.
Before this patch, everything is built inside tools/flask/policy
directory. It is possible to have a race to write to the same output
file when running parallel builds.
Prepend output file names with FLASK_BUILD_DIR. Hypervisor and tools
build will set that variable to different directories, so that we can
be safe from races.
Adjust other bits of the build system as needed.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Luwei Kang [Fri, 4 Nov 2016 08:29:18 +0000 (16:29 +0800)]
tools/libxc: Add xstate cpuid leaf of avx512
Enable guests to get the xstate cpuid leaf information regarding avx512.
Signed-off-by: Luwei Kang <luwei.kang@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Thu, 3 Nov 2016 16:48:56 +0000 (17:48 +0100)]
docs: replace hint with pointer in PVHv2 ACPI documentation
Use pointer instead of hint, since this is the only way to get the address
of the RSDP.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-by: Jan Beulich <jbeulich@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Thu, 3 Nov 2016 12:19:03 +0000 (13:19 +0100)]
docs: document ACPI usage in PVHv2 guests
It is possible for PVHv2 guests to get the hardware description from ACPI
tables; add this to the documentation as well.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Dario Faggioli [Wed, 2 Nov 2016 15:05:03 +0000 (16:05 +0100)]
features: declare the Credit2 scheduler as Supported
Credit2 has been available in tree as an "Experimental" scheduler for
a few years. Recently, effort started on making it production ready
and, eventually, Xen's new default scheduler. As a consequence of
that, it has undergone a great deal of development, testing and
benchmarking.
In fact, Credit2's much more modern (wrt Credit1) design and cleaner
code makes it a lot easier to understand what the scheduler is doing,
fix scheduling issues that may come up, and implement new and more
advanced features, in future.
In more detail:
- key features that were missing (pinning and context switching
rate-limiting) have now been implemented, and more (soft affinity,
caps and reservations) are about to come. The gap wrt Credit1 is
therefore closing. In particular, with pinning and rate-limiting
available, the scheduler can be considered usable.
- Credit2 has been tested by OSSTest for a long time. Furthermore, as a
part of recent efforts, stress tests and benchmarks have been run
and shown no bugs or stability issues.
- A number of different benchmarks have been run, most of them
comparing Credit2 with Credit1. Some of the results were posted on
xen-devel, some others have been illustrated during a talk at the 2016
edition of the Xen-Project Developer Summit. In general, performance
looks promising, if not already better than Credit1 in some of
the cases.
It therefore appears that we are ready to mark the Credit2 scheduler
as a 'Supported' feature, and ask users to look at it and try it, if
they think it suits their needs.
Of course, declaring something 'Supported' has security implications.
So here is how the situation looks from a security standpoint:
1) Is guest->host privilege escalation possible?
The only interfaces exposed to unprivileged guests are the SCHEDOP
hypercalls, and timers. None of those hypercalls contain any pointers,
and they don't look to contain any privilege escalation path. Also,
they're not specific to Credit2, as they're "used" by all schedulers
(including the current default, Credit1), so anything about these
interfaces would be a security concern already.
2) Is guest user->guest kernel escalation possible?
The guest kernel is not really relying on anything from the scheduler
to protect itself or any data in any way.
3) Is there any information leakage?
The only information which the scheduler exposes to unprivileged
guests is the timing information. This may be able to be used for
side-channel attacks to probabilistically infer things about other
vcpus running on the same system; but this has not traditionally
been considered within the security boundary. And, again, this is
possible with all schedulers.
The control domain can issue DOMCTL_SCHEDOP and SYSCTL_SCHEDOP
hypercalls, but the involved data structures are handled in a
way that does not leak information (which would be leaked "only"
to Dom0 anyway).
4) Can a Denial-of-Service be triggered?
This is a risk, with schedulers, and one that's hard to foresee.
For instance, it _did_ happen on Credit1, in the past (a vcpu
could "game the system" by sleeping at particular times to gain
BOOST priority and monopolize 95% of the cpu). In that case, it
was possible because of the probabilistic nature of accounting
in Credit1 (which was then fixed). Well, Credit2:
- already does accurate, rather than probabilistic, accounting;
- does not have any BOOST or, in general, any way for a vcpu to
become 'more important' than the others: they're all subjected
to the same crediting algorithm.
Also note that the accounting and the crediting algorithm are a lot
simpler than in Credit1, and hence a lot easier to understand, debug
and audit.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Mon, 31 Oct 2016 17:03:04 +0000 (17:03 +0000)]
Config.mk: fix comment for debug option
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Mon, 31 Oct 2016 17:42:25 +0000 (17:42 +0000)]
build: make debug option affect tools only
The debug option in Config.mk affects hypervisor, tools and stubdom by
appending different flags to CFLAGS. Mini-os under extra is not
affected because it already has its own build system when it is
separated from xen.git.
This is undesirable because the hypervisor build is now affected by both
Kconfig and debug.
Disentangle the semantics of debug by pushing relevant options to
individual sub-systems.
For hypervisor, the flags previously added by the debug option are now
controlled by CONFIG_DEBUG.
For tools, flags are moved from config/*.mk into tools/Rules.mk.
For stubdom, because it unilaterally sets debug=y before including
top-level Config.mk, we only need to move the debug build set of flags
into stubdom Makefile.
Specifically there are some considerations on what flags are picked:
1. we don't need -fno-optimize-sibling-calls anymore because gcc doc
indicates that it is not enabled for -O1, which we already set in the
debug build.
2. for tools we use -O0 -g3 in Rules.mk because they already take
precedence over the flags set in config/*.mk.
3. for hypervisor we don't add -fno-omit-frame-pointer to debug build
because that's controlled by CONFIG_FRAME_POINTER.
This patch doesn't intend to tune those flags, but to provide identical
set of effective flags as before. The debug option in Config.mk will
only affect tools components after this patch is applied.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Mon, 31 Oct 2016 17:01:12 +0000 (17:01 +0000)]
xen: disable debug build
Xen debug build is controlled by Kconfig.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
which in C is 'proposed_id >= INT_MIN', or in other words, tautologically
true. As a result, 32bit builds of oxenstored always try to allocate the
transaction id 1, and fall into an infinite loop of trying the next id if
transaction 1 is already in use.
Restrict the range down to 1 billion, to sit in the positive half of a 31 bit
ocaml integer. The compiled code is now:
which (other than non-optimal code generation because of the unnecessary use
of %ecx), isn't unconditionally true.
In principle, the check could be changed to 'proposed_id == 0x7fffffff' which
would still allow for 2 billion transactions in 32bit builds. However, in
64bit builds, this reintroduces a risk that if proposed_id is initially
greater than 0x7fffffff, it will not be clipped suitably into range.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: David Scott <dave@recoil.org> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
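Illustrative sketch (not part of the commit): the range argument above can be checked with plain arithmetic. A 31 bit signed integer tops out at 0x3fffffff, so a limit of 1 billion fits in its positive half while 0x7ffffffe does not. The helper names are hypothetical.

```python
def signed_range(width_bits):
    """Return (min, max) for a two's complement integer of the given width."""
    return -(1 << (width_bits - 1)), (1 << (width_bits - 1)) - 1


MIN31, MAX31 = signed_range(31)   # int width in a 32bit ocaml build
LIMIT = 1000000000                # the "1 billion" limit from the message


def fits_positive_31bit(value):
    """True if value is a strictly positive 31 bit signed integer."""
    return 0 < value <= MAX31
```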
Roger Pau Monne [Mon, 31 Oct 2016 10:05:20 +0000 (11:05 +0100)]
tools/libacpi: fix sed usage
The current usage of sed in the libacpi Makefile makes use of non-POSIX options
that are not available on all the OSes supported by the Xen tools.
The '-i' option has slightly different semantics between GNU and BSD sed
implementations: on the GNU version the suffix is optional, while on the BSD
one it is not. Also BSD sed seems to have problems parsing the script
itself, reporting "extra characters at the end of d command".
Fix those issues by using a temporary intermediate file, and replace the
script with a simpler version that achieves the same purpose (removing the
initial license header comment).
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Mon, 31 Oct 2016 09:04:18 +0000 (10:04 +0100)]
stubdom: fix stubdom-vtpm build
stubdom-vtpm needs gmp and expects it under
stubdom/cross-root-x86_64/x86_64-xen-elf/lib while gmp seems to install
it under stubdom/cross-root-x86_64/x86_64-xen-elf/lib64 at least in an
openSUSE environment.
Modify gmp configure parameters to explicitly specify --libdir.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Sat, 29 Oct 2016 17:22:38 +0000 (18:22 +0100)]
stubdom: make GMP aware that it's being cross-compiled
Append --build and --host flags to GMP's configure script so that it
knows it is being cross-compiled.
This should fix the issue that GMP doesn't compile with gcc 6, because
the configure script won't try to test the host environment anymore.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Thu, 27 Oct 2016 09:55:52 +0000 (11:55 +0200)]
xenstore: fix add_change_node()
add_change_node() in xenstored is used to add a modified node to the
list of changed nodes of one transaction. It is being called with the
recurse parameter set to true when removing a node in order to get
watches for children of the node fired at transaction end, too.
If, however, the node to be deleted had been modified in the same
transaction the recurse parameter of add_change_node() is lost as
an entry already in the list of the changed nodes won't be entered
again.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
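Illustrative sketch (not part of the commit): the message describes the problem but not the change itself, so the following only shows one plausible way to avoid losing the recurse request, namely upgrading the flag on an entry that is already in the list. The dict-based list of changed nodes is an assumption made for brevity.

```python
def add_change_node(changes, node, recurse):
    """Record node as changed; never downgrade an earlier recurse request."""
    # An existing entry keeps recurse=True once any caller has asked for it.
    changes[node] = changes.get(node, False) or recurse


# Node modified first (no recursion), then removed in the same transaction.
changes = {}
add_change_node(changes, "/local/domain/1/data", False)
add_change_node(changes, "/local/domain/1/data", True)
```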
Andrew Cooper [Thu, 4 Aug 2016 18:01:15 +0000 (18:01 +0000)]
x86/hvm: Don't truncate the hvm hypercall index before range checking it
c/s 5eeca68f introduced the 64bit ABI for HVM guests, and chose to explicitly
truncate the index, despite the fact that the `mov $imm32, %eax` in the
hypercall page already provides the expected truncation.
The truncation isn't very obvious, and is counterintuitive, seeing as all
other 64bit parameters are passed without truncation. It is also different to
the PV ABI, which is otherwise identical.
As the hypercall page has always been present for HVM guests (and indeed, is
basically mandatory to abstract away vendor differences), it is exceedingly
unlikely that any code exists which enters hvm_do_hypercall() with upper bits
set in %rax.
Therefore, take the opportunity to fix the ABI before it becomes impossible to
fix.
While tweaking this area, fix one piece of trailing whitespace.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
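Illustrative sketch (not part of the commit): the difference between truncating the index before the range check and checking the full 64bit value. NR_HYPERCALLS is a hypothetical table size, and the functions only model the range check, not hvm_do_hypercall() itself.

```python
NR_HYPERCALLS = 64  # hypothetical size of the hypercall table


def index_ok_truncating(rax):
    """Old behaviour: truncate to 32 bits first, then range check."""
    return (rax & 0xFFFFFFFF) < NR_HYPERCALLS


def index_ok_full(rax):
    """New behaviour: range check the full 64bit value."""
    return rax < NR_HYPERCALLS


# Upper bits set in %rax: accepted after truncation, rejected otherwise.
rax_with_upper_bits = (1 << 32) | 5
```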
Meng Xu [Wed, 26 Oct 2016 19:06:06 +0000 (15:06 -0400)]
xen:rtds: Fix bug in budget accounting
Bug scenario:
repl_timer_handler() may be called before rt_schedule() for a VCPU.
This situation may happen in two scenarios:
(1) The VCPU misses its deadline because the system is oversubscribed. For
example, the sum of VCPU utilizations on a core is larger than one.
(2) The VCPU has budget = period, which causes the timers for
rt_schedule() and repl_timer_handler() to fire at the same time.
When the situation happens, it causes the following incorrect behavior:
repl_timer_handler() will update the VCPU period and deadline.
If the VCPU is still the highest priority one, even with the new deadline,
it will continue to run, but with new period and deadline.
Since the budget enforcement timer for the previous period is still armed,
rt_schedule() will still be called in the new period and enforce the budget
for the previous period.
The current burn_budget() will deduct the time spent in previous period from
the budget in current period, which is incorrect.
Fix:
We keep last_start always within the current period for a VCPU, so that
we only deduct the time spent in the current period from the VCPU budget.
We always update last_start whenever we update cur_deadline for a VCPU.
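Illustrative sketch (not part of the commit): a toy model of the accounting described above. The field names follow the message (last_start, cur_deadline); the Vcpu class, update_deadline(), the keep_last_start_current switch and the time units are assumptions made to contrast the two behaviours.

```python
class Vcpu:
    def __init__(self, period, budget, now=0):
        self.period = period
        self.budget = budget
        self.cur_budget = budget
        self.cur_deadline = now + period
        self.last_start = now


def update_deadline(vcpu, now, keep_last_start_current=True):
    """Replenish: move the deadline past now and refill the budget."""
    while vcpu.cur_deadline <= now:
        vcpu.cur_deadline += vcpu.period
    vcpu.cur_budget = vcpu.budget
    if keep_last_start_current:
        # The fix: last_start moves together with cur_deadline.
        vcpu.last_start = now


def burn_budget(vcpu, now):
    """Deduct the time run since last_start from the current budget."""
    delta = now - vcpu.last_start
    vcpu.cur_budget = max(0, vcpu.cur_budget - delta)
    vcpu.last_start = now


def run_scenario(keep_last_start_current):
    # VCPU started running at t=8, replenished at t=10, accounted at t=12.
    vcpu = Vcpu(period=10, budget=4)
    vcpu.last_start = 8
    update_deadline(vcpu, 10, keep_last_start_current)
    burn_budget(vcpu, 12)
    return vcpu.cur_budget
```

With last_start kept current, only the 2 time units spent in the new period are deducted; without it, the 2 units from the previous period are charged as well.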
Andrew Cooper [Wed, 26 Oct 2016 11:06:44 +0000 (12:06 +0100)]
x86/emul: Move CPUID Faulting fault generation into the emulator
In hindsight, this is a better position for it, as it avoids opencoding
hvmemul_inject_hw_exception() in hvmemul_cpuid(), and reduces the requirements
on other ops->cpuid() hooks wanting to implement cpuid faulting in the future.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 23 Sep 2016 13:48:27 +0000 (14:48 +0100)]
x86/emul: Correct the decoding of SReg3 operands
REX.R is ignored when considering segment register operands, and needs masking
out first.
While fixing this, reorder the user segments in x86_segment to match SReg3
encoding. This avoids needing a translation table between hardware ordering
and Xen's ordering.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
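Illustrative sketch (not part of the commit): decoding a segment register operand with REX.R masked out. The tuple follows the architectural SReg3 encoding (ES, CS, SS, DS, FS, GS); decode_sreg3 and the 4 bit modrm_reg input (REX.R as bit 3) are hypothetical conventions for this example.

```python
# Architectural SReg3 encoding order; indices 6 and 7 are reserved.
SREG3_ORDER = ("es", "cs", "ss", "ds", "fs", "gs")


def decode_sreg3(modrm_reg):
    """Map a ModRM.reg value (possibly extended by REX.R) to a segment."""
    index = modrm_reg & 7  # REX.R is ignored for segment register operands
    if index >= len(SREG3_ORDER):
        raise ValueError("reserved segment register encoding")
    return SREG3_ORDER[index]
```

Because the tuple is already in hardware order, no translation table is needed between the decoded index and the segment.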
Meng Xu [Sat, 22 Oct 2016 02:12:02 +0000 (22:12 -0400)]
xen: rtds: always clear the flag when replenishing a depleted vcpu
We should clear the __RTDS_depleted bit once a VCPU budget is replenished.
Because repl_timer_handler may be called after rt_schedule
but before rt_context_saved, the VCPU may be neither on a CPU nor on a queue
while it is in the middle of a context switch.
The transaction id of 0 is reserved, meaning "not in a transaction". It is up
to the xenstored server to allocate transaction ids. oxenstored starts
its ids at 1, but insufficient care is taken with truncation cases.
A 32bit oxenstored has an int with 31 bits of width, meaning that the
transaction id will wrap around to 0 after 2 billion transactions.
A 64bit oxenstored has an int with 63 bits of width, meaning that once 4
billion transactions are used, the allocated id will be truncated when written
into the uint32_t field in the ring. This causes the client to reply with the
truncated id, breaking any further attempt to use any transactions.
Limit all transaction ids to the range between 1 and 0x7ffffffe. This is the
best which can be done without making oxenstored depend on Stdint or Cstruct,
yet still work for 32bit builds.
Also check that the proposed new transaction id isn't currently in use. For
the first 2 billion transactions there is no chance of a collision, and after
that, the chance is at most 20 (the default open transaction quota) in 2
billion.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: David Scott <dave@recoil.org> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
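Illustrative sketch (not part of the commit, and in Python rather than OCaml): allocating ids in the range 1 to 0x7ffffffe, wrapping around and skipping ids that are still in use. next_transaction_id and the in_use set are hypothetical names.

```python
MAX_ID = 0x7FFFFFFE  # upper bound from the message; 0 stays reserved


def next_transaction_id(last_id, in_use):
    """Return the next free id after last_id, wrapping from MAX_ID to 1."""
    if len(in_use) >= MAX_ID:
        raise RuntimeError("no free transaction id")
    candidate = last_id
    while True:
        candidate = candidate + 1 if candidate < MAX_ID else 1
        if candidate not in in_use:
            return candidate
```

With an open transaction quota as small as the default mentioned above, the loop skips at most a handful of ids.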
Roger Pau Monne [Tue, 25 Oct 2016 09:53:28 +0000 (11:53 +0200)]
tools/configure: fix pkg-config install path for FreeBSD
pkg-config from FreeBSD ports doesn't have ${prefix}/share/pkgconfig in the
default search path, fix this by having a PKG_INSTALLDIR variable that can
be changed on a per-OS basis.
It would be best to use PKG_INSTALLDIR as defined by the pkg.m4 macro, but
sadly this also reports a wrong value on FreeBSD (${libdir}/pkgconfig, which
expands to /usr/local/lib/pkgconfig by default, and is also _not_ part of
the default pkg-config search path).
This patch should not change the behavior for Linux installs.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-by: Alexander Nusov <alexander.nusov@nfvexpress.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Fri, 14 Oct 2016 17:02:31 +0000 (18:02 +0100)]
libacpi: require ACPI_BUILD_DIR to be set
It's better to have an explicit error than a build failure returned by
gcc.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Mon, 24 Oct 2016 15:34:17 +0000 (17:34 +0200)]
x86: MISALIGNSSE feature depends on SSE
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Mon, 24 Oct 2016 15:33:30 +0000 (17:33 +0200)]
x86emul: fix XOP decode
Commit f09902c456 ("x86emul: add XOP decoding") ended up overwriting b
prior to the last use of its previously stored value. Slightly defer
fetching the main opcode byte.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Clang complains nr_dom_vcpus may be used uninitialised after 4a6070ea9.
The real issue is vinfo can be NULL and nr_dom_vcpus remains
uninitialised if previous call fails.
Initialise nr_dom_vcpus to 0 at the beginning of the loop to fix the
issue.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Wed, 19 Oct 2016 16:30:36 +0000 (17:30 +0100)]
x86/emul: Correctly annotate all push/pop %sreg instructions
c/s 373923ed9c2 "x86emul: fix pushing of selector registers" redirected
all push %sreg instructions into the general push path. However, this
ends up hitting the assertion at the head of the push path.
Annotate all push and pop %sreg instructions as Mov, indicating that
they do not read the destination operand.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Boris Ostrovsky [Sun, 23 Oct 2016 23:09:19 +0000 (19:09 -0400)]
tools: Handle existing link to acpi directory
The link to acpi include directory is not removed by Makefile's 'clean'
target. This can lead to make failure when making xen/.dir target if
we try to create the link again.
We can prevent this failure by (1) removing acpi link when cleaning up
and (2) adding '-f' option to 'ln' (just like we do for other targets).
We should also add tools/include/acpi link to .gitignore.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Dario Faggioli [Fri, 21 Oct 2016 13:49:30 +0000 (15:49 +0200)]
libxl: avoid considering pCPUs outside of the cpupool during NUMA placement
During NUMA automatic placement, the information
of how many vCPUs can run on what NUMA nodes is used,
in order to spread the load as evenly as possible.
Such information is derived from vCPU hard and soft
affinity, but that is not enough. In fact, affinity
can be set to be a superset of the pCPUs that belong
to the cpupool the domain is in but, of course,
the domain will never run on pCPUs outside of its
cpupool.
Take this into account in the placement algorithm.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reported-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
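Illustrative sketch (not part of the commit): counting how many vCPUs can run on each NUMA node after restricting every affinity to the pCPUs of the cpupool. The set-based cpumaps and the node_of_cpu mapping are simplifications assumed for this example, not libxl data structures.

```python
def vcpus_per_node(vcpu_affinities, cpupool_cpus, node_of_cpu):
    """Count, per NUMA node, the vCPUs that can actually run there."""
    counts = {}
    for affinity in vcpu_affinities:
        # A vCPU never runs outside its cpupool, whatever its affinity says.
        usable = affinity & cpupool_cpus
        for node in {node_of_cpu[cpu] for cpu in usable}:
            counts[node] = counts.get(node, 0) + 1
    return counts


# Two nodes with two pCPUs each; the cpupool only owns node 0's pCPUs.
node_of_cpu = {0: 0, 1: 0, 2: 1, 3: 1}
cpupool_cpus = {0, 1}
affinities = [{0, 1, 2, 3}, {2, 3}]
result = vcpus_per_node(affinities, cpupool_cpus, node_of_cpu)
```

Here the first vCPU counts only towards node 0, and the second one, whose affinity lies entirely outside the cpupool, counts towards no node.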
vscsiif.h: replace PAGE_SIZE with VSCSIIF_PAGE_SIZE
Do not reference PAGE_SIZE directly: it could be undefined, or it could
have different values in the frontend or in the backend.
Define VSCSIIF_PAGE_SIZE as 4096, assuming all users of vscsiif.h have
4K page granularity. Replace PAGE_SIZE with VSCSIIF_PAGE_SIZE.
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Do not reference PAGE_SIZE directly: it could be undefined, or it could
have different values in the frontend or in the backend.
Define USBIF_RING_SIZE as 4096, assuming all users of usbif.h have 4K
page granularity. Replace PAGE_SIZE with USBIF_RING_SIZE.
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Tamas K Lengyel [Fri, 14 Oct 2016 00:00:47 +0000 (18:00 -0600)]
altp2m: don't attempt to unshare pages during change_altp2m_gfn op
Attempting to change gfn mappings with altp2m on a memory shared page results
in a lock-order violation (mm locking order violation: 282 > 254), which
crashes the hypervisor. Don't attempt to automatically unshare such pages and
just fall back to failing the op if the page type is not correct.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Kyle Huey [Thu, 20 Oct 2016 13:44:28 +0000 (06:44 -0700)]
x86/Intel: virtualize support for cpuid faulting
On HVM guests, the cpuid triggers a vm exit, so we can check the emulated
faulting state in vmx_do_cpuid and hvmemul_cpuid. A new function,
hvm_check_cpuid_fault will check if cpuid faulting is enabled and the CPL > 0.
When it returns true, the cpuid handling functions will inject a GP(0). Notably
explicit hardware support for faulting on cpuid is not necessary to emulate
support for an HVM guest.
On PV guests, hardware support is required so that userspace cpuid will trap
to Xen. Xen already enables cpuid faulting on supported CPUs for pv guests (that
aren't the control domain, see the comment in intel_ctxt_switch_levelling).
Every PV guest cpuid will trap via a GP(0) to emulate_privileged_op (via
do_general_protection). Once there we simply decline to emulate cpuid if the
CPL > 0 and faulting is enabled, leaving the GP(0) for the guest kernel to
handle.
Signed-off-by: Kyle Huey <khuey@kylehuey.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
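Illustrative sketch (not part of the commit): the decision described above reduces to one predicate on the emulated faulting state and the current privilege level. handle_cpuid and its string results are hypothetical stand-ins for injecting a fault or carrying on with emulation.

```python
def handle_cpuid(faulting_enabled, cpl):
    """Decide what a cpuid intercept should do for the guest."""
    if faulting_enabled and cpl > 0:
        # Userspace cpuid with faulting on: the guest kernel sees #GP(0).
        return "#GP(0)"
    return "emulate"
```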
Kyle Huey [Thu, 20 Oct 2016 13:44:27 +0000 (06:44 -0700)]
x86/Intel: Expose cpuid_faulting_enabled so it can be used elsewhere
While we're here, use bool instead of bool_t.
Signed-off-by: Kyle Huey <khuey@kylehuey.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Thu, 1 Sep 2016 09:38:27 +0000 (10:38 +0100)]
x86/svm: Drop adjustment of X86_FEATURE_APIC
The common hvm_cpuid() code already does this.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
He Chen [Wed, 19 Oct 2016 08:03:24 +0000 (16:03 +0800)]
xen/sm{e, a}p: allow disabling sm{e, a}p for Xen itself
SMEP/SMAP are security features that prevent the kernel from inadvertently
executing/accessing user addresses; any such access leads to a page fault.
In earlier code, SMEP/SMAP is enabled (in CR4) for both Xen and HVM guests.
With the SMEP/SMAP bits set in Xen's CR4, the checks are also enforced for
32-bit PV guests, which suffer unexpected SMEP/SMAP page faults when the
guest kernel attempts to access user addresses, even though SMEP/SMAP is
disabled for PV guests.
This patch introduces a new boot option value "hvm" for "sm{e,a}p", which
disables SMEP/SMAP for the Xen hypervisor while enabling them for HVM
guests. This way, 32-bit PV guests no longer suffer these SMEP/SMAP page
faults. Users can choose whether to enable SMEP/SMAP for Xen itself,
especially when they are going to run 32-bit PV guests.
Signed-off-by: He Chen <he.chen@linux.intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
[Fixed up command line docs] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Thu, 13 Oct 2016 11:12:20 +0000 (12:12 +0100)]
x86/vmx: Reduce the verbosity of the vmentry failure error reporting
Identify the affected vcpu at the start of the message. While tweaking this
area, add extra newlines between cases.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <JBeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Thu, 13 Oct 2016 10:46:58 +0000 (11:46 +0100)]
x86/vmx: Print the problematic MSR if a vmentry fails
Sample error looks like:
(XEN) Failed vm entry (exit reason 0x80000022) caused by MSR loading (entry 13).
(XEN) msr 0000068a val 1fff800000102af0 (mbz 0)
(XEN) ************* VMCS Area **************
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <JBeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Tue, 18 Oct 2016 12:43:07 +0000 (13:43 +0100)]
libxl: remove explicit rule for libxl_arm_acpi.o
After 9c635883 ("ARM64: fix libxl build, do not include
../../xen/include") there is nothing special needed to build
libxl_arm_acpi.o. Remove the explicit rule, use predefined one.
Build tested on ARM64.
Suggested-by: Steve Capper <steve.capper@linaro.org> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
ARM64: fix libxl build, do not include ../../xen/include
Do not include ../../xen/include/ to build libxl_arm_acpi.c: header
files clashing against default headers under /usr/include are present in
that directory.
Link only $(XEN_ROOT)/xen/include/acpi under tools/include instead.
Build tested on ARM64 and x86_64.
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org> Tested-by: Steve Capper <steve.capper@linaro.org> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ronald Rojas [Mon, 17 Oct 2016 00:16:32 +0000 (20:16 -0400)]
tools/xl: Use %u for uint32_t domids
domid is normally represented by uint32_t, but many format
strings in xl_cmdimpl.c use %d when printing, which is signed.
Use %u instead to print the unsigned integer domid.
Signed-off-by: Ronald Rojas <ronladred@gmail.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
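Illustrative sketch (not part of the commit): how the same 32 bit value reads under a signed and an unsigned conversion. The two helpers model C's %d and %u for a 32 bit argument; they are hypothetical and the value is only chosen to make the difference visible.

```python
def c_percent_d(value):
    """Model of printing a 32 bit value with %d (signed)."""
    value &= 0xFFFFFFFF
    signed = value - (1 << 32) if value & 0x80000000 else value
    return "%d" % signed


def c_percent_u(value):
    """Model of printing a 32 bit value with %u (unsigned)."""
    return "%d" % (value & 0xFFFFFFFF)
```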
Wei Liu [Fri, 14 Oct 2016 17:02:30 +0000 (18:02 +0100)]
libacpi: fix arm64 build
The arm64 build for libacpi was broken for two reasons:
1. ACPI_BUILD_DIR was appended twice to dsdt_anycpu_arm.c.
2. The inclusion of firmware/Rules.mk overrode XEN_TARGET_ARCH, which
made CONFIG_ARM disappear.
Fix those by:
1. Correctly generate full path for dsdt_anycpu_arm.c.
2. Include tools/Rules.mk instead, because libacpi/Makefile doesn't rely
on settings in firmware/Rules.mk.
While at it, use CONFIG_ARM_64 instead of CONFIG_ARM as it is more
accurate.
Reported-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 14 Oct 2016 12:09:42 +0000 (14:09 +0200)]
x86/Viridian: don't depend on undefined register state
The high halves of all GPRs are undefined in 32-bit and compat modes,
and the dependency is being obfuscated by our structure field names not
matching architectural register names (it was actually while putting
together a patch to correct this when I noticed the issue here).
For consistency also use the architecturally correct names on the
output side.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
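A minimal sketch of the idea, assuming a hypothetical helper defined_gpr_bits() that is not taken from the Viridian code: outside 64-bit mode only the low 32 bits of a general purpose register are used, so the undefined high half can never influence the result.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return only the architecturally defined part of a
 * general purpose register value.
 */
static uint64_t defined_gpr_bits(uint64_t raw, bool is_64bit_mode)
{
    /* In 32-bit and compat modes the high half is undefined: drop it. */
    return is_64bit_mode ? raw : (uint32_t)raw;
}
```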
Jan Beulich [Fri, 14 Oct 2016 12:09:16 +0000 (14:09 +0200)]
x86emul: fix pushing of selector registers
Both explicit PUSH and far CALL currently push unrelated data (the
segment attributes word) in the high half (attributes and limit in the
64-bit case in the high 48 bits) instead of zero. To avoid having to
apply this and further changes in multiple places, also fold the two
(respectively) far call/jmp instances into one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
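A rough sketch of the 32-bit case described above. Both function names are hypothetical and the layout of the buggy value (attributes word in the high half) is taken from the description, not from the emulator source.

```c
#include <stdint.h>

/* Behaviour described as wrong: attributes word leaks into the high half. */
static uint32_t buggy_push_value(uint16_t sel, uint16_t attr)
{
    return (uint32_t)sel | ((uint32_t)attr << 16);
}

/* Behaviour described as intended: the selector is zero-extended. */
static uint32_t fixed_push_value(uint16_t sel)
{
    return sel;
}
```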
Jan Beulich [Fri, 14 Oct 2016 12:08:29 +0000 (14:08 +0200)]
x86emul: honor MXCSR.MM
Commit 6dc9ac9f52 ("x86emul: check alignment of SSE and AVX memory
operands") didn't consider a specific AMD mode: Mis-alignment #GP
faults can be masked on some of their hardware.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
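A minimal sketch of the check, assuming MXCSR.MM is bit 17 (my reading of AMD's documentation) and using a hypothetical helper name. It is not the emulator's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position of AMD's misaligned exception mask in MXCSR. */
#define MXCSR_MM (1u << 17)

/*
 * Hypothetical helper: does a memory operand of 'bytes' bytes (a power of
 * two) at 'addr' raise the mis-alignment #GP?
 */
static bool misaligned_access_faults(uint64_t addr, unsigned int bytes,
                                     uint32_t mxcsr)
{
    bool misaligned = (addr & (bytes - 1)) != 0;

    /* With MXCSR.MM set the fault is masked. */
    return misaligned && !(mxcsr & MXCSR_MM);
}
```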
Andrew Cooper [Thu, 13 Oct 2016 10:27:28 +0000 (11:27 +0100)]
x86/hvm: Correct the position of the %cs L/D checks
Contrary to the description in the software manuals, in Long Mode, attempts to
load %cs check that D is not set in combination with L before the present flag
is checked.
This can be observed because the L/D check fails with #GP before the presence
check fails with #NP.
This change partially reverts c/s 78ff18c90 "x86: defer not-present segment
checks", taking it back to how it was in the v1 submission.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
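The ordering described above can be sketched as follows. The enum and function names are hypothetical, and all other descriptor checks are left out.

```c
#include <stdbool.h>

enum cs_fault { CS_OK, CS_FAULT_GP, CS_FAULT_NP };

/* Hypothetical model of the two checks discussed for a %cs load. */
static enum cs_fault check_cs_load(bool long_mode, bool l, bool d,
                                   bool present)
{
    /* L and D both set is rejected first, with #GP ... */
    if ( long_mode && l && d )
        return CS_FAULT_GP;

    /* ... and only afterwards is the present flag examined. */
    if ( !present )
        return CS_FAULT_NP;

    return CS_OK;
}
```

A not-present descriptor with both L and D set therefore yields #GP, not #NP.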
Lan Tianyu [Thu, 13 Oct 2016 11:06:28 +0000 (13:06 +0200)]
keyhandler: rework process of nonirq keyhandler
A keyhandler may run for a long time in the serial port driver's
timer handler on a large machine with many physical CPUs (e.g.
dump_timerq()) when the serial port driver works in poll mode (via
the exception mechanism).
If a timer handler runs for a long time, it blocks nmi_timer_fn() from
feeding the NMI watchdog and causes a Xen hypervisor panic. Inserting
process_pending_softirqs() in the timer handler will not help: when a
timer interrupt arrives, the timer subsystem calls all expired timer
handlers before programming the next timer interrupt, so no timer
interrupt arrives to trigger the timer softirq while a timer handler
is running.
This patch fixes the issue by making nonirq keyhandlers run in a
tasklet when a debug key is received from the serial port.
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Mon, 10 Oct 2016 12:50:58 +0000 (13:50 +0100)]
ipxe: update to newer commit
The current commit in tree is rather old. It has come to the point that
cherry-picking commits from upstream isn't trivial anymore.
There is a long-term plan to track ipxe upstream, but for the 4.8 release
we should just update ipxe to a newer commit (they are using a rolling
release model now).
Forward-port the one boot prompt patch that is still relevant and retire
the rest which are already in upstream.
Reported-by: Juergen Schinker <ba1020@homie.homelinux.net> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Disable the Cortex-a53-edac. Xen does not yet
handle reads/writes to the implementation defined CPUMERRSR
register.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Acked-by: Alistair Francis <alistair.francis@xilinx.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Changeset fbf96e6, "xentrace: correct formula to calculate
t_info_pages", broke the trace metadata page count calculation, by
mistaking t_info_first_offset as denominated in bytes, when in fact it
is denominated in words (uint32_t).
Effectively revert that change, and put a comment there to reduce the
chance that someone will make that mistake in the future.
Reviewed-by: Igor Druzhinin <igor.druzhinin@citrix.com> Tested-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com>
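A sketch of why the unit matters, under stated assumptions: a 4096-byte page, one uint32_t entry per per-CPU trace page, and hypothetical function names. The real formula in the hypervisor may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PFN_UP(bytes) \
    (((bytes) + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE)

/* Offset treated as a count of uint32_t words, as the message says it is. */
static size_t t_info_pages_words(size_t first_offset_words, size_t cpus,
                                 size_t pages_per_cpu)
{
    size_t words = first_offset_words + cpus * pages_per_cpu;

    return SKETCH_PFN_UP(words * sizeof(uint32_t));
}

/* The mistake: the same offset added as if it were already in bytes. */
static size_t t_info_pages_bytes_mistake(size_t first_offset_words,
                                         size_t cpus, size_t pages_per_cpu)
{
    size_t bytes = first_offset_words +
                   cpus * pages_per_cpu * sizeof(uint32_t);

    return SKETCH_PFN_UP(bytes);
}
```

With an offset of 10 words, 4 CPUs and 255 pages per CPU, the word-based calculation needs 4120 bytes (two pages), while the byte-based one arrives at 4090 bytes and undercounts by a page.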
However, we still have an issue: the file being installed (xen.efi.map)
does not exist in an ARM64 build (the xen.efi is linked against xen).
The fix can be done two ways:
a) See if xen.efi.map exists and then copy it
b) Or link xen.efi.map to xen-syms.map (similar to how xen.efi is linked
against xen).
The patch chooses the former.
Reported-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Wei Liu [Mon, 10 Oct 2016 09:40:30 +0000 (10:40 +0100)]
Kconfig: use tab instead of space
Previously in d6be2cfc ("xen: make clear gcov support limitation in
Kconfig") and db6c2264 ("xen: add a gcov Kconfig option"), space was
used to indent Kconfig text. Change that to use tab instead.
No functional change.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 10 Oct 2016 10:16:49 +0000 (12:16 +0200)]
x86: defer not-present segment checks
Following on from commits 5602e74c60 ("x86emul: correct loading of
%ss") and bdb860d01c ("x86/HVM: correct segment register loading during
task switch") the point of the non-.present checks needs to be refined:
#NP (and its #SS companion), other than suggested by the various
instruction pages in Intel's SDM, gets checked for only after all type
and permission checks. The only checks getting done even later are the
long mode specific ones for system descriptors (which we don't support
yet) and 64-bit code segments (i.e. anything touching other than the
attribute byte).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Razvan Cojocaru [Fri, 7 Oct 2016 09:35:58 +0000 (11:35 +0200)]
x86/hvm: remove emulation context setting from hvmemul_cmpxchg()
hvmemul_cmpxchg() sets the read emulation context in p_new instead
of p_old, which is inconsistent (and wrong). Since p_old is
unused in any case and cmpxchg() semantics would be altered even
if it wasn't, remove the emulation context setting code.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Lan Tianyu [Fri, 7 Oct 2016 09:35:26 +0000 (11:35 +0200)]
timer: process softirq during dumping timer info
Dumping timer info may run for a long time on a huge machine with
many physical CPUs. To avoid triggering the NMI watchdog, add
process_pending_softirqs() to the loop that dumps timer info.
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Jan Beulich [Wed, 5 Oct 2016 12:19:43 +0000 (14:19 +0200)]
x86emul: deliver correct math exceptions
#MF only applies to x87 instructions. SSE and AVX ones need #XM to be
raised instead, unless CR4.OSXMMEXCPT is clear, in which case #UD needs
to result. (But note that this is only a latent issue - we don't
emulate any instructions so far which could result in #XM.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
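The selection rule above can be sketched as follows, assuming CR4.OSXMMEXCPT is bit 10 and using hypothetical names. The emulator's real structure is different.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_OSXMMEXCPT (1u << 10)   /* assumed bit position */

enum math_fault { FAULT_MF, FAULT_XM, FAULT_UD };

/* Hypothetical helper: which exception reports a math fault? */
static enum math_fault math_exception(bool is_x87, uint64_t cr4)
{
    if ( is_x87 )
        return FAULT_MF;            /* #MF applies to x87 only */

    /* SSE/AVX: #XM if the OS enabled it, otherwise #UD. */
    return (cr4 & CR4_OSXMMEXCPT) ? FAULT_XM : FAULT_UD;
}
```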
casionwoo [Tue, 4 Oct 2016 11:04:08 +0000 (20:04 +0900)]
Fix to be error handled when 10ms delayed for cpu_on
The comment in the original code says "wait max 10 ms until cpu is on".
The original code is expected to print "CPU%d power enable failed" if the CPU is not on after 10ms.
But the actual code never reaches the print even after waiting 10ms (it actually waits 11ms, not 10ms),
because the comparison is written as below:
"if ( timeout-- == 0 )"
So I modified the code to wait 10ms and print the error statement.
Let me simulate the original code and the modified code.
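A minimal sketch of the post-decrement pattern, assuming (the text does not confirm this) that the error check after the loop tests timeout == 0. cpu_is_on() is a hypothetical stand-in for the hardware poll, and the pre-decrement variant is only one possible correction, not necessarily the one in the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the hardware poll: the CPU never comes up. */
static bool cpu_is_on(void)
{
    return false;
}

/* Pattern quoted in the message: post-decrement inside the comparison. */
static bool original_reports_failure(void)
{
    int timeout = 10;

    while ( !cpu_is_on() )
    {
        if ( timeout-- == 0 )
            break;              /* timeout has already moved past 0 here */
        /* mdelay(1) would go here */
    }

    return timeout == 0;        /* assumed error check: never true */
}

/* One possible correction: decrement first, then compare. */
static bool fixed_reports_failure(void)
{
    int timeout = 10;

    while ( !cpu_is_on() )
    {
        if ( --timeout == 0 )
            break;              /* timeout is exactly 0 here */
        /* mdelay(1) would go here */
    }

    return timeout == 0;
}
```

In the first variant the decrement happens after the comparison, so the variable no longer holds 0 once the loop exits and the failure message is skipped.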
Jan Beulich [Tue, 4 Oct 2016 10:26:14 +0000 (04:26 -0600)]
arm: fix build with gcc6
Commit e170622f95 ("xen/arm: p2m: Re-implement p2m_set_mem_access using
p2m_{set,get}_entry") eliminated the only user of level_sizes[],
causing gcc6 to warn about the unused variable (as it's a const one,
older gcc versions apparently don't care to emit a warning).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Jan Beulich [Tue, 4 Oct 2016 13:04:46 +0000 (14:04 +0100)]
x86emul: honor guest CR0.TS and CR0.EM
We must not emulate any instructions accessing respective registers
when either of these flags is set in the guest view of the register, or
else we may do so on data not belonging to the guest's current task.
Being architecturally required behavior, the logic gets placed in the
instruction emulator instead of hvmemul_get_fpu(). It should be noted,
though, that hvmemul_get_fpu() being the only current handler for the
get_fpu() callback, we don't have an active problem with CR4: Both
CR4.OSFXSR and CR4.OSXSAVE get handled as necessary by that function.
This is XSA-190.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
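A sketch of the architectural rule as I understand it, with CR0.EM assumed to be bit 2 and CR0.TS bit 3. The names are hypothetical, INSN_SIMD stands for legacy-encoded MMX/SSE instructions only, and this is not the emulator's code.

```c
#include <stdint.h>

#define CR0_EM (1u << 2)    /* assumed bit position */
#define CR0_TS (1u << 3)    /* assumed bit position */

enum insn_class { INSN_X87, INSN_SIMD };
enum fpu_fault { FPU_OK, FPU_FAULT_NM, FPU_FAULT_UD };

/* Hypothetical check made before touching any FPU/SIMD register state. */
static enum fpu_fault fpu_access_check(enum insn_class cls, uint64_t cr0)
{
    if ( cls == INSN_X87 )
        return (cr0 & (CR0_EM | CR0_TS)) ? FPU_FAULT_NM : FPU_OK;

    if ( cr0 & CR0_EM )
        return FPU_FAULT_UD;    /* MMX/SSE with EM set */

    if ( cr0 & CR0_TS )
        return FPU_FAULT_NM;    /* state may belong to another task */

    return FPU_OK;
}
```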
Ian Jackson [Tue, 4 Oct 2016 09:19:36 +0000 (10:19 +0100)]
libxl: Mark libxl_retrieve_domain_configuration as for external callers only
This function takes the userdata lock. Incautious use inside libxl
can result in nested acquisition of that lock, and deadlock.
There is no good reason to use this function inside libxl, but it is a
superficially attractive option. Make future regressions easier to
spot by marking the function for external use only.
Similar arguments apply for the application-facing userdata accessors,
so do those too.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
A few issues were introduced in 38cd0664 ("libxl/arm: Add the size of
ACPI tables to maxmem"):
1. d_config was not properly initialised and disposed of.
2. using libxl_retrieve_domain_configuration caused the thread to
deadlock itself.
Fix those issues by:
1. properly initialise and dispose of d_config.
2. switch to use libxl__get_domain_configuration.
Note that in theory we can refactor libxl_retrieve_domain_configuration
a bit to get a function without locking, but so far the calculation of
extra memory only relies on static configuration, hence we use the
stored configuration only.
Reported-by: Anthony PERARD <anthony.perard@citrix.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
tmem: Batch and squash XEN_SYSCTL_TMEM_OP_SAVE_GET_POOL_[FLAGS,NPAGES,UUID]
in one sub-call: XEN_SYSCTL_TMEM_OP_GET_POOLS.
These operations are used during the save process of migration.
Instead of doing 64 hypercalls let's do just one. We modify
the 'struct xen_tmem_client' structure (used in
XEN_SYSCTL_TMEM_OP_[GET|SET]_CLIENT_INFO) to have an extra field
'nr_pools'. Armed with that the code slurping up pages from the
hypervisor can allocate a big enough structure (struct tmem_pool_info)
to contain all the active pools. And then just iterate over each
one and save it in the stream.
We are also re-using one of the subcommands numbers for this,
as such the XEN_SYSCTL_INTERFACE_VERSION should be incremented
and that was done in the patch titled:
"tmem/libxc: Squash XEN_SYSCTL_TMEM_OP_[SET|SAVE].."
In xc_tmem_[save|restore] we also added proper memory handling
of 'buf' and 'pools'. Because of the loops, and to make it as
easy as possible to review, we add a goto label and jump to it
for almost all error conditions.
The include for inttypes is required for the PRId64 macro to
work (which is needed to compile this code under 32-bit).
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
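A minimal sketch of the goto-label cleanup and the PRId64 formatting mentioned above. struct pool_info, save_pools() and the faked pool data are all hypothetical; none of this is the libxc code.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tmem_pool_info. */
struct pool_info {
    int64_t n_pages;
};

/* Writes the total page count of all pools into 'out'; 0 on success. */
static int save_pools(unsigned int nr_pools, char *out, size_t out_len)
{
    struct pool_info *pools = NULL;
    char *buf = NULL;
    int64_t total = 0;
    unsigned int i;
    int rc = -1;

    if ( nr_pools == 0 )
        goto out;

    /* Sized from nr_pools, so one batched request can fill every entry. */
    pools = calloc(nr_pools, sizeof(*pools));
    if ( pools == NULL )
        goto out;

    /* Placeholder for the scratch buffer page data would be read into. */
    buf = malloc(4096);
    if ( buf == NULL )
        goto out;

    /* Faked result of the single batched call. */
    for ( i = 0; i < nr_pools; i++ )
        pools[i].n_pages = (int64_t)i + 1;

    for ( i = 0; i < nr_pools; i++ )
        total += pools[i].n_pages;

    /* PRId64 keeps the format correct on both 32-bit and 64-bit builds. */
    if ( snprintf(out, out_len, "%" PRId64, total) >= (int)out_len )
        goto out;

    rc = 0;

 out:
    /* Single exit path: free(NULL) is a no-op, so no special cases. */
    free(buf);
    free(pools);
    return rc;
}
```

Routing every error through one label keeps both allocations released on all paths without repeating the cleanup at each failure point.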
tmem/xc_tmem_control: Rename 'arg1' to 'len' and 'arg2' to arg.
That is what they are used for. Let's make it clearer.
Of all the various sub-commands, the only one that needed
semantic change is XEN_SYSCTL_TMEM_OP_SAVE_BEGIN. That in the
past used 'arg1', and now we are moving it to use 'arg'.
Since that code is only used during migration which is tied
to the toolstack it is OK to change it.
We should increment the XEN_SYSCTL_INTERFACE_VERSION because
of this, and that was fortunately done in the patch titled:
"tmem/libxc: Squash XEN_SYSCTL_TMEM_OP_[SET|SAVE].."
While at it, also fix xc_tmem_control_oid to properly handle
the 'buf' and bounce it as appropriate.
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>