Jan Beulich [Fri, 12 Aug 2016 14:57:07 +0000 (16:57 +0200)]
x86emul: don't special case fetching unsigned 8-bit immediates
These can be made to work using SrcImmByte, making sure the low 8 bits of
src.val get suitably zero extended upon consumption. SHLD and SHRD
require a little more adjustment: Their source operands get changed
away from SrcReg, handling the register access "manually" instead of
the insn byte fetching.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
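As a rough illustration of the idea above: assuming the generic byte-immediate fetch produces a sign-extended value, a consumer that wants an unsigned 8-bit immediate only needs to truncate to the low 8 bits. The helper names below are hypothetical and are not the emulator's own.

```c
#include <stdint.h>

/* Generic fetch (assumed behaviour): the imm8 is sign-extended. */
static unsigned long fetch_imm8_sketch(const uint8_t *insn_bytes)
{
    return (unsigned long)(long)(int8_t)insn_bytes[0];
}

/* Consumer wanting an unsigned imm8: keep only the low 8 bits. */
static unsigned int imm8_as_unsigned_sketch(unsigned long src_val)
{
    return (uint8_t)src_val;
}
```

With an immediate byte of 0xF0 the fetched value has all upper bits set, while the consumer still sees 0xF0.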
Jan Beulich [Fri, 12 Aug 2016 14:55:48 +0000 (16:55 +0200)]
x86emul: all push flavors are data moves
Make all paths leading to the "push" label have the Mov flag set, and
ASSERT() that to be the case. For the opcode FF group the adjustment is
benign for the paths not leading to "push", as they all set dst.type to
OP_NONE.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 12 Aug 2016 14:54:24 +0000 (16:54 +0200)]
x86emul: don't special case fetching the immediate of PUSH
These immediates follow the standard patterns in all modes, so they're
better fetched by the generic source operand handling code.
To facilitate testing, instead of adding yet another of these pretty
convoluted individual test cases, simply introduce another blowfish run
with -mno-accumulate-outgoing-args (the additional -Dstatic is to
keep the compiler from converting the calling convention to
"regparm(3)", which I did observe it does).
To make this introduction of a new blowfish pass (and potential further
ones later on) have less impact on the readability of the final code,
abstract all such "binary blob" executions via a table to iterate
through.
The resulting native code execution adjustment also uncovered a lack of
clobbers on the asm() in the 64-bit case, which is being fixed at once.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Razvan Cojocaru [Fri, 12 Aug 2016 14:51:36 +0000 (16:51 +0200)]
vm_event: synchronize vCPU state in vm_event_resume()
vm_event_vcpu_pause() needs to use vcpu_pause_nosync() in order
for the current vCPU to not get stuck. A consequence of this is
that the custom vm_event response handlers will not always see
the real vCPU state in v->arch.user_regs. This patch makes sure
that the state is always synchronized in vm_event_resume, before
any handlers have been called. This problem especially affects
vm_event_set_registers().
Simply checking vm_event_pause_count to make sure the vCPU is
paused suffices since there's only one ring / consumer at a
time, and events are being processed one-by-one, so the
toolstack won't unpause the vCPU behind our backs.
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Andrew Cooper [Thu, 11 Aug 2016 17:21:14 +0000 (17:21 +0000)]
x86/cpufreq: Avoid using processor_pminfo[cpu] when it is NULL
The undefined behaviour sanitiser shows that it really is NULL via the
pre_initcall path.
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in cpufreq.c:158:66
(XEN) member access within null pointer of type 'struct processor_pminfo'
(XEN) ----[ Xen-4.8-unstable x86_64 debug=y Not tainted ]----
<snip>
(XEN) [<ffff82d0801c4231>] cpufreq_add_cpu+0x161/0xdc0
(XEN) [<ffff82d0801c6610>] cpufreq.c#cpu_callback+0x20/0x30
(XEN) [<ffff82d0804eefad>] cpufreq.c#cpufreq_presmp_init+0x2d/0x50
(XEN) [<ffff82d0804c5942>] do_presmp_initcalls+0x22/0x30
(XEN) [<ffff82d08051852d>] __start_xen+0x378d/0x42f0
(XEN) [<ffff82d080100073>] __high_start+0x53/0x60
Fix two other occurrences of the same buggy logic.
The processor_pminfo[] objects are only allocated as a result of
XENPF_set_processor_pminfo hypercalls, which means that this early cpu
callback will always hit the early NULL check, and is therefore pointless.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 11 Aug 2016 11:36:42 +0000 (13:36 +0200)]
x86/NUMA: cleanup
- drop the only left CONFIG_NUMA conditional (this is always true)
- drop struct node_data's node_id field (being always equal to the
node_data[] array index used)
- don't open code node_{start,end}_pfn() nor node_spanned_pages()
except when used as lvalues (those could be converted too, but this
seems a little awkward)
- no longer open code pfn_to_paddr() in an expression being modified
anyway
- make dump less verbose by logging actual vs intended node IDs only
when they don't match
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 11 Aug 2016 11:35:50 +0000 (13:35 +0200)]
page-alloc/x86: don't restrict DMA heap to node 0
When node zero has no memory, the DMA bit width will end up getting set
to 9, which is obviously not helpful to hold back a reasonable amount
of low enough memory for Dom0 to use for DMA purposes. Find the lowest
node with memory below 4Gb instead.
Introduce arch_get_dma_bitsize() to keep this arch-specific logic out
of common code.
Also adjust the original calculation: I think the subtraction of 1
should have been part of the flsl() argument rather than getting
applied to its result. And while previously the division by 4 was valid
to be done on the flsl() result, this now also needs to be converted,
as it should only be applied to the spanned pages value.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
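The arithmetic change can be sketched as follows. This is a reconstruction from the description only: the "old" shape (subtracting 1 and dividing by 4 on the flsl() result), the page shift of 12, and all names are assumptions for illustration.

```c
#define PAGE_SHIFT_SKETCH 12

/* 1-based index of the highest set bit; 0 when no bit is set. */
static unsigned int flsl_sketch(unsigned long x)
{
    unsigned int bit = 0;

    while ( x )
    {
        bit++;
        x >>= 1;
    }
    return bit;
}

/* Assumed earlier shape: "- 1" and the divide-by-4 ("- 2") act on the result. */
static unsigned int dma_bitsize_old_sketch(unsigned long spanned_pages)
{
    unsigned int bits = flsl_sketch(spanned_pages) - 1 + PAGE_SHIFT_SKETCH - 2;

    return bits < 32 ? bits : 32;
}

/* Adjusted shape: both adjustments are applied to the page count itself. */
static unsigned int dma_bitsize_new_sketch(unsigned long start_pfn,
                                           unsigned long spanned_pages)
{
    unsigned int bits =
        flsl_sketch(start_pfn + spanned_pages / 4 - 1) + PAGE_SHIFT_SKETCH;

    return bits < 32 ? bits : 32;
}
```

Under these assumptions an empty node makes the old form evaluate to 9, which matches the value quoted in the description.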
Trammell Hudson [Thu, 11 Aug 2016 11:34:59 +0000 (13:34 +0200)]
allow reproducible builds of xen.gz
The mkelf32 executable was using an uninitialized stack buffer for
padding after the ehdr and phdr are written to the xen file, which
leads to non-deterministic bytes in the binary and prevented Xen
hypervisors from being reproducibly built.
Additionally, the file was then compressed with gzip -9 without the
-n | --no-name flag, which led to the xen.gz file having
non-deterministic bytes (the timestamp) in the compressed file.
Signed-off-by: Trammell Hudson <trammell.hudson@twosigma.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 5 Aug 2016 13:26:21 +0000 (14:26 +0100)]
x86/microcode: Avoid undefined behaviour from signed integer overflow
The checksums should be calculated using unsigned 32bit integers, as they are
intended to overflow and end at 0. Replace some other signed integers with
unsigned ones, to avoid mixed-sign comparisons.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
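A minimal sketch of a wrap-around checksum of this kind, assuming the data is summed as 32-bit words and a valid image sums to zero (function name hypothetical). Unsigned arithmetic wraps modulo 2^32 by definition, so no undefined behaviour is involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum 32-bit words with well-defined unsigned wrap-around. */
static uint32_t checksum32_sketch(const uint32_t *words, size_t count)
{
    uint32_t sum = 0;

    while ( count-- )
        sum += *words++;

    return sum; /* 0 for a valid image, under the stated assumption */
}
```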
Andrew Cooper [Fri, 5 Aug 2016 13:22:48 +0000 (14:22 +0100)]
xen/common: Avoid undefined behaviour by shifting into a sign bit
For d->shutdown_code, change the field to being unsigned and using an unsigned
sentinel. The sentinel needs to be distinguishable from any value
representable in a u8.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
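One way to picture the sentinel requirement, as a sketch with hypothetical names: the field is wider than a u8, so an all-ones unsigned value can never collide with a real 8-bit shutdown reason.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel: outside the 0..255 range of any u8 reason. */
#define SHUTDOWN_CODE_INVALID_SKETCH (~0u)

struct domain_sketch {
    unsigned int shutdown_code;
};

static bool has_shutdown_code_sketch(const struct domain_sketch *d)
{
    return d->shutdown_code != SHUTDOWN_CODE_INVALID_SKETCH;
}

static void set_shutdown_code_sketch(struct domain_sketch *d, uint8_t reason)
{
    d->shutdown_code = reason;
}
```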
Andrew Cooper [Wed, 10 Aug 2016 09:41:28 +0000 (10:41 +0100)]
x86/traps: Fix failed ASSERT() in do_guest_trap()
c/s 2e426d6 "x86/traps: Drop use_error_code parameter from do_{,guest_}trap()"
introduced an assertion which covered the correctness of shifting 1u by an
input parameter.
While all other inputs provide a constant vector, the `int $N` handling path
from do_general_protection() passes any vector.
This path is triggered by XTF, which uses `int 0x20` to facilitate returning
to kernel mode after running specific tests in user mode.
No vectors above 32 have an error code, so adjust the logic to cope.
Reported-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
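A sketch of a range-checked lookup in an error-code bitmap. The macro and function names are hypothetical; the set of vectors (#DF, #TS, #NP, #SS, #GP, #PF, #AC) is the architectural list of exceptions that push an error code. Checking the range first keeps the shift count below the width of the type.

```c
#include <stdbool.h>

/* Vectors 8, 10, 11, 12, 13, 14 and 17 push an error code. */
#define TRAP_HAVE_EC_SKETCH                                      \
    ((1u << 8) | (1u << 10) | (1u << 11) | (1u << 12) |          \
     (1u << 13) | (1u << 14) | (1u << 17))

static bool trap_has_error_code_sketch(unsigned int trapnr)
{
    /* Only test the bitmap for vectors that fit in it. */
    return trapnr < 32 && (TRAP_HAVE_EC_SKETCH & (1u << trapnr));
}
```

A software interrupt such as `int 0x20` then simply reports "no error code" instead of shifting by an out-of-range amount.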
Boris Ostrovsky [Wed, 10 Aug 2016 09:58:34 +0000 (11:58 +0200)]
hvmloader: acpi_build_tables() can't take acpi_config as const
We'd need to update other routines' definitions. However, acpi_config
is not really a const since new_vm_gid() wants to update
acpi_config.vm_gid_addr.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
By doing this we can move hvmloader-private interfaces (such as
uart_exists(), lpt_exists() etc.) out of the ACPI builder. This will
help us with allowing to call the builder from places other than
hvmloader.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Mon, 8 Aug 2016 09:42:50 +0000 (10:42 +0100)]
tools/xenalyze: Allow automatic resizing of sample buffers
Rather than have large fixed-size buffers, start with smaller buffers
and allow them to grow as needed (doubling each time), with a fairly
large maximum. Allow this maximum to be set by a command-line
parameter.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
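The growth policy described above might look roughly like this. The structure, the names, and the behaviour once the maximum is reached (the new sample is rejected) are assumptions for illustration, not xenalyze's actual code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct sample_buf_sketch {
    long long *samples;
    size_t count;
    size_t capacity;
};

/* Append a sample, doubling the buffer up to max_capacity when full. */
static bool sample_buf_add_sketch(struct sample_buf_sketch *b, long long value,
                                  size_t initial_capacity, size_t max_capacity)
{
    if ( b->count == b->capacity )
    {
        size_t new_cap = b->capacity ? b->capacity * 2 : initial_capacity;
        long long *p;

        if ( new_cap > max_capacity )
            new_cap = max_capacity;
        if ( new_cap <= b->capacity )
            return false; /* already at the maximum size */

        p = realloc(b->samples, new_cap * sizeof(*p));
        if ( !p )
            return false; /* allocation failed; buffer left untouched */

        b->samples = p;
        b->capacity = new_cap;
    }

    b->samples[b->count++] = value;
    return true;
}
```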
George Dunlap [Mon, 8 Aug 2016 09:42:49 +0000 (10:42 +0100)]
tools/xenalyze: Get rid of extraneous data structure
The only difference between event_cycle_summary and cycle_summary was
that the former has a separate counter for "events" which had
zero-cycle events. But a lot of the code dealing with them had to be
duplicated with slightly different fields.
Remove event_cycle_summary, add an "event_count" field to
cycle_summary, and use cycle_summary for everything.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
George Dunlap [Mon, 8 Aug 2016 09:42:48 +0000 (10:42 +0100)]
tools/xenalyze: Remove weighted cpi summaries
At the moment these structures are not used, and half of the code for
collecting it is commented out. To be used they require further
support for collecting hardware instruction counter data inside of
Xen.
Remove the code entirely; when they're wanted again they will be here
in the git log.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Wed, 3 Aug 2016 16:56:56 +0000 (16:56 +0000)]
x86/traps: Drop use_error_code parameter from do_{,guest_}trap()
Whether or not an error code is needed can be determined entirely from the
trapnr parameter, as error codes are architecturally specified.
Introduce TRAP_HAVE_EC as a bitmap of reserved vectors which have error codes,
and drop the use_error_code from all callsites.
As a result, the DO_ERROR{,_NOCODE}() macros become entirely superfluous and
can be dropped. Update the exception_table to point straight at do_trap().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Mon, 1 Aug 2016 09:55:59 +0000 (10:55 +0100)]
xl: use xenconsole startup protocol
If user asks xl to automatically connect to console when creating a
guest, use the new startup protocol before trying to unpause domain so
that we don't lose any console output.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Mon, 8 Aug 2016 10:07:46 +0000 (11:07 +0100)]
CODING_STYLE: Allow single-sentence comments without full stops
One of the common ways in which contributors trip up over the
CODING_STYLE guides is by not putting a full stop at the end of a
comment when there is only a single sentence. Calling these out is a
waste of everybody's time: The full stop at the end of a comment with
a single sentence (or a single phrase) adds absolutely nothing to the
legibility of the code.
Modify CODING_STYLE to allow comments with a single sentence or
sentence fragment to either have a full stop or not, while making it
clear that comments with multiple sentences must have a full stop at
the end of each sentence.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@citrix.com>
Dario Faggioli [Thu, 4 Aug 2016 08:59:03 +0000 (10:59 +0200)]
tools: xenalyze: kill spurious sched_switch output in non dump mode.
In fact, 52cf096df7 ("xenalyze: handle scheduling event"),
when dealing with TRC_SCHED_SWITCH, forgot to check whether
we actually are in dump mode, causing the printf() in
dump_sched_switch() to always produce its output, which
is not what we want.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Bob Liu [Thu, 4 Aug 2016 01:07:56 +0000 (09:07 +0800)]
libxl: return any serial tty path in libxl_console_get_tty
When specifying a serial list in domain config, users of
libxl_console_get_tty cannot get the tty path of a second specified pty serial,
since right now it always returns the tty path of serial 0.
Signed-off-by: Bob Liu <bob.liu@oracle.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Tue, 2 Aug 2016 16:10:47 +0000 (18:10 +0200)]
tools: make xenstore domain easily configurable
Add configuration entries to sysconfig.xencommons for selection of the
xenstore type (domain or daemon) and start the selected xenstore
service via a script called from sysvinit or systemd.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Tue, 2 Aug 2016 16:10:46 +0000 (18:10 +0200)]
tools: use pidfile for test if xenstored is running
Instead of trying to read xenstore via xenstore-read use the pidfile
of xenstored for the test whether xenstored is running. This prepares
support of xenstore domain, as trying to read xenstore will block
for ever in case xenstore domain is started after trying to read.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Tue, 2 Aug 2016 16:10:45 +0000 (18:10 +0200)]
tools: split out xenstored starting from xencommons
In order to prepare starting a xenstore domain split out the starting
of the xenstore daemon from the xencommons script into a dedicated
launch-xenstore script.
A rerun of autogen.sh is required.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Tue, 2 Aug 2016 16:10:44 +0000 (18:10 +0200)]
tools: remove systemd xenstore socket definitions
On a system with systemd the xenstore sockets are created via systemd.
Remove the related configuration files in order to be able to decide
at runtime whether the sockets should be created or not. This will
enable Xen to start xenstore either via a daemon or via a stub domain.
As the xenstore domain start program will exit after it has done its
job prepare the same behaviour to be tolerated by systemd for the
xenstore daemon by specifying the appropriate flags in the service
file.
A rerun of autogen.sh is required.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: David Scott <dave@recoil.org>
This is because a data abort from a guest was received due to a
permission fault but memaccess thought there was no permission fault.
On ARM, memaccess permissions are stored in a radix tree because there
are not enough available bits in the p2m entry to store the access
restriction. When memaccess is restricting the access (i.e any other
access than p2m_access_rwx), the access will be added in the radix tree
using the GFN as a key. This will be done for all 4KB pages.
This means that memaccess has to shatter all the superpages in a given
region to set the permission on a 4KB granularity. Currently, when a
superpage is shattered, the new entries are using the value
p2m->default_access which will restrict permission (because memaccess
has been enabled). However the radix tree does not yet contain
an entry for this GFN.
If a guest VCPU is running at the same time and trying to access the
modified region, it will result in a stage-2 permission fault. As
the radix tree does not yet contain an entry for the GFN, memaccess will
deduce that the fault was not valid and a data abort will be injected
into the guest (and crash it).
Furthermore, the permission may be restricted outside of the requested
region if it is only a subset of a 1GB/2MB superpage.
The two issues can be fixed by re-using the permission of the superpage
entry and override the necessary fields. This is not a problem because
memaccess cannot work on superpage.
Lastly, document the code which calls mfn_to_p2m_entry when creating
the p2m entry for a table, to explain that permissions are ignored by
the hardware (see D4.3.1 in ARM DDI 0487A.j), so the value of the
parameter 'access' of mfn_to_p2m_entry does not matter.
The ARM erratum applies to certain revisions of Cortex-A57. The
processor may report a Stage 2 translation fault as the result of
Stage 1 fault for load crossing a page boundary when there is a
permission fault or device memory fault at stage 1 and a translation
fault at Stage 2.
So Xen needs to check that Stage 1 translation does not generate a fault
before handling the Stage 2 fault. If it is a Stage 1 translation fault,
return to the guest to let the processor inject the correct fault.
Julien Grall [Thu, 4 Aug 2016 17:50:06 +0000 (18:50 +0100)]
xen/arm: traps: Avoid unnecessary VA -> IPA translation in abort handlers
Translating a VA to an IPA is expensive. Currently, Xen is assuming that
HPFAR_EL2 is only valid when the stage-2 data/instruction abort happened
during a translation table walk of a first stage translation (i.e S1PTW
is set).
However, based on the ARM ARM (D7.2.34 in DDI 0487A.j), the register is
also valid when the data/instruction abort occurred for a translation
fault.
With this change, the VA -> IPA translation will only happen for
permission faults that are not related to a translation table of a
first stage translation.
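The resulting decision can be sketched as below. The names are hypothetical; the fault status values used are the architectural encodings 0b0001LL for translation faults and 0b0011LL for permission faults, with LL being the lookup level.

```c
#include <stdbool.h>

#define FSC_LL_MASK_SKETCH   0x03u /* lookup level bits */
#define FSC_FLT_TRANS_SKETCH 0x04u /* translation fault, level masked off */

/* HPFAR_EL2 holds a usable IPA for S1PTW aborts and translation faults. */
static bool hpfar_is_valid_sketch(bool s1ptw, unsigned int fsc)
{
    return s1ptw || (fsc & ~FSC_LL_MASK_SKETCH) == FSC_FLT_TRANS_SKETCH;
}

/* Only fall back to the expensive VA -> IPA walk when HPFAR is not usable. */
static bool needs_va_to_ipa_sketch(bool s1ptw, unsigned int fsc)
{
    return !hpfar_is_valid_sketch(s1ptw, fsc);
}
```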
Julien Grall [Thu, 4 Aug 2016 17:50:05 +0000 (18:50 +0100)]
xen/arm: traps: MMIO should only be emulated for fault translation
The function do_trap_data_abort_guest assumes that a stage-2 data abort
can only be taken for a translation fault or permission fault today.
Whilst this is true today, it might not be in the future. Rather than
emulating the MMIO for any fault other than the permission one, print
a warning message when the fault is not handled by Xen.
Julien Grall [Thu, 4 Aug 2016 17:50:04 +0000 (18:50 +0100)]
xen/arm: Use check_workaround to handle the erratum 766422
Currently, Xen is accessing the stored MIDR every time it has to check
whether the processor is affected by the erratum 766422.
This could take advantage of the new capability bitfields to detect
whether the processor is affected at boot time.
With this patch, the number of instructions to check the erratum is
going down from ~13 (including 2 loads and a co-processor access) to
~6 instructions (including 1 load).
Julien Grall [Thu, 4 Aug 2016 17:50:03 +0000 (18:50 +0100)]
xen/arm: Provide macros to help creating workaround helpers
Workarounds may require executing a different path when the platform
is affected by the associated erratum. Furthermore, this may need to
be called in the common code.
To avoid too much intrusion/overhead, the workaround helpers need to
be a nop on architectures which will never have the workaround and have
to be quick to check whether the platform requires it.
The alternative framework is used to transform the check into a single
instruction. When the framework is not available, the helper will have
~6 instructions including 1 instruction load.
The macro will create a handler called check_workaround_xxxxxx, with
xxxxxx being the erratum number.
For instance, the line below will create a workaround helper for
erratum #424242 which is enabled when the capability
ARM64_WORKAROUND_424242 is set and only available for ARM64:
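As a rough sketch of what such a helper-generating macro could look like when the alternative framework is not used: every name below (the macro, the capability bitmap, the capability number and the architecture switch) is hypothetical and stands in for whatever the patch actually defines.

```c
#include <stdbool.h>

/* Hypothetical capability number and bitmap, for illustration only. */
#define ARM64_WORKAROUND_424242 3
static unsigned long cpu_hwcaps_sketch;

static inline bool cpus_have_cap_sketch(unsigned int cap)
{
    return (cpu_hwcaps_sketch >> cap) & 1UL;
}

/* Generates check_workaround_<erratum>(); compiles to "false" when the
 * architecture switch is 0, otherwise tests the capability bit. */
#define CHECK_WORKAROUND_HELPER_SKETCH(erratum, feature, arch_enabled)  \
static inline bool check_workaround_##erratum(void)                     \
{                                                                       \
    if ( !(arch_enabled) )                                              \
        return false;                                                   \
    return cpus_have_cap_sketch(feature);                               \
}

#define SKETCH_ARM64 1 /* stand-in for an "only on ARM64" condition */
CHECK_WORKAROUND_HELPER_SKETCH(424242, ARM64_WORKAROUND_424242, SKETCH_ARM64)
```

Callers in common code would then write `if ( check_workaround_424242() )` without caring how the check is implemented.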
Julien Grall [Thu, 4 Aug 2016 17:50:02 +0000 (18:50 +0100)]
xen/arm: traps: Simplify the switch in do_trap_*_abort_guest
The fault statuses we care about are in the form BBBBxx where xx is the lookup
level that gave the fault. We can simplify the code by masking the 2 least
significant bits.
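A small sketch of the simplified switch, with hypothetical names: once the two lookup-level bits are masked off, one case label covers all four levels of a given fault type.

```c
enum fault_kind_sketch {
    FAULT_TRANSLATION,
    FAULT_ACCESS_FLAG,
    FAULT_PERMISSION,
    FAULT_OTHER
};

static enum fault_kind_sketch classify_fault_sketch(unsigned int fsc)
{
    switch ( fsc & ~0x3u ) /* drop the lookup level (xx) */
    {
    case 0x04: return FAULT_TRANSLATION; /* 0b0001xx */
    case 0x08: return FAULT_ACCESS_FLAG; /* 0b0010xx */
    case 0x0c: return FAULT_PERMISSION;  /* 0b0011xx */
    default:   return FAULT_OTHER;
    }
}
```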
Andrew Cooper [Thu, 4 Aug 2016 11:38:05 +0000 (12:38 +0100)]
x86/debug: Make debugger_trap_entry() safe during early boot
debugger_trap_entry() is reachable during early boot where its unconditional
use of current is unsafe. Add a warning to the function to this effect.
Perform the vector check first, as this allows the compiler to elide the other
content from most of its callsites. Check guest_mode(regs) before using
current, which makes the path safe on early boot.
While editing this area, drop DEBUGGER_trap_{entry,fatal}, as hiding a return
statement in a function-like macro is very antisocial programming; show the
real control flow at each of the callsites. Finally, switch
debugger_trap_{entry,fatal} to having boolean return types, to match their
semantics.
No behavioural change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 4 Aug 2016 08:08:48 +0000 (10:08 +0200)]
hvmloader: don't hard-code IO-APIC parameters
The IO-APIC address has variable bits determined by the PCI-to-ISA
bridge (albeit for now we refrain from actually evaluating them, as
there's still implicit rather than explicit agreement on the IO-APIC
base address between qemu and the hypervisor), and the IO-APIC version
should be read from the IO-APIC.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Aug 2016 08:08:00 +0000 (10:08 +0200)]
x86/time: relax barriers
On x86 there's no need for full barriers in loops waiting for some
memory location to change. Nor do we need full barriers between two
reads and two writes - SMP ones fully suffice.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Aug 2016 08:07:02 +0000 (10:07 +0200)]
x86/time: group time stamps into a structure
If that had been done from the beginning, mistakes like the one
corrected in commit b64438c7c1 ("x86/time: use correct (local) time
stamp in constant-TSC calibration fast path") would likely never have
happened.
Also add a few "const" to make more obvious when things aren't expected
to change.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Dario Faggioli <dario.faggioli@citrix.com> Tested-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Aug 2016 08:04:29 +0000 (10:04 +0200)]
x86/time: fold recurring code
Common code between time_calibration_{std,tsc}_rendezvous() can better
live in a single place, eliminating the risk of adjusting one without
the other.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Dario Faggioli <dario.faggioli@citrix.com> Tested-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Aug 2016 08:02:52 +0000 (10:02 +0200)]
x86/time: calibrate TSC against platform timer
... instead of unconditionally against the PIT. This allows for local
and master system times to remain in better sync (which matters even
when, on any modern system, the master time is really used only during
secondary CPU bringup, as the error between the two is in fact
noticeable in cross-CPU NOW() invocation monotonicity).
This involves moving the init_platform_timer() invocation into
early_time_init(), splitting out the few things which really need to be
done in init_xen_time(). That in turn allows dropping the open coded
PIT initialization from init_IRQ() (it was needed for APIC clock
calibration, which runs between early_time_init() and init_xen_time()).
In the course of this re-ordering also set the timer channel 2 gate low
after having finished calibration. This should be benign to overall
system operation, but appears to be the more clean state.
Also do away with open coded 8254 register manipulation from 8259 code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Thu, 4 Aug 2016 08:01:57 +0000 (10:01 +0200)]
x86/HVM: add new functions to get/set memory types
For clarity this patch breaks the code to set/get memory types out
of do_hvm_op() into dedicated functions: hvmop_set/get_mem_type().
Also, for clarity, checks for whether a memory type change is allowed
are broken out into a separate function called by hvmop_set_mem_type().
There is no intentional functional change in this patch.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Thu, 4 Aug 2016 08:01:17 +0000 (10:01 +0200)]
x86: rename p2m_mmio_write_dm to p2m_ioreq_server
Previously p2m type p2m_mmio_write_dm was introduced for write-
protected memory pages whose write operations are supposed to be
forwarded to and emulated by an ioreq server. Yet limitations of
rangeset restrict the number of guest pages to be write-protected.
This patch replaces the p2m type p2m_mmio_write_dm with a new name:
p2m_ioreq_server, which means this p2m type can be claimed by one
ioreq server, instead of being tracked inside the rangeset of ioreq
server. And a new memory type, HVMMEM_ioreq_server, is now used in
the HVMOP_set/get_mem_type interface to set/get this p2m type.
Patches following up will add the related HVMOP handling code which
map/unmap type p2m_ioreq_server to/from an ioreq server. Without
following patches, memory type changes to HVMMEM_ioreq_server can
still be allowed, and in such cases, p2m_ioreq_server pages will be
treated the same as ones with previous type p2m_mmio_write_dm, and
are tracked inside the ioreq server's rangeset.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
xen/arm: traps: Don't inject a fault if the translation VA -> IPA fails
Based on ARM ARM (D4.5.3 in ARM DDI 0486A and B3.12.7 in ARM DDI 0406C.c),
a Stage 1 translation error has priority over a Stage 2 translation error.
Therefore gva_to_ipa can only fail if another vCPU is playing with the
page table.
Rather than injecting a custom fault, replay the instruction and let the
processor inject the correct fault.
This is fine as Xen is handling all the pending softirqs
(see leave_hypervisor_tail) before returning to the guest. One of them
is the scheduler which could reschedule the vCPU.
The ARM erratum 832075 applies to certain revisions of Cortex-A57, one
of the workarounds is to change device loads into using load-acquire
semantics.
Use the alternative framework to enable the workaround only on affected
cores.
Whilst a guest could trigger the deadlock, it can be broken when the
processor is receiving an interrupt. As the Xen scheduler will always set up
a timer (firing every 1ms to 300ms depending on the running time
slice) on each processor, the deadlock would last only a few milliseconds
and only affects the guest time slice.
Therefore a malicious guest could only hurt itself. Note that all the
guests should implement/enable the workaround for the affected cores.
The ARM errata 819472, 827319 and 824069 define the same workaround for
these hardware issues in certain Cortex-A53 parts.
The cache instructions "dc cvac" and "dc cvau" need to be upgraded to
"dc civac".
Use the alternative framework to replace those instructions only on
affected cores.
Whilst the errata affect cache instructions issued at any exception
level, it is not necessary to trap EL1/EL0 data cache instructions
access in order to upgrade them. Indeed the data cache corruption would
always be at the address used by the data cache instructions. Note that
this address could point to memory shared between guests and the
hypervisor; however, all the information present in it is validated
before any use.
Therefore a malicious guest could only hurt itself. Note that all the
guests should implement/enable the workaround for the affected cores.
xen/arm: Detect silicon revision and set cap bits accordingly
After each CPU has been started, we iterate through a list of CPU
errata to detect CPUs which need hypervisor code patches.
For each bug there is a function which checks whether a particular CPU is
affected. This needs to be done on every CPU to cover heterogeneous
systems properly.
If a certain erratum has been detected, the capability bit will be set.
In the case the erratum requires code patching, this will be triggered
by the call to apply_alternatives.
The code is based on the file arch/arm64/kernel/cpu_errata.c in Linux
v4.6-rc3.
xen/arm: cpufeature: Provide an helper to check if a capability is supported
The CPU capabilities will be set depending on the value found in the CPU
registers. This patch provides a generic function to go through a set of capabilities
and find which one should be enabled.
The parameter "info" is used to display the kind of capability updated (e.g
workaround, feature...).
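A simplified sketch of such a generic walk over a capability table. The structure layout, the bitmap and all names are assumptions loosely modelled on the description; the MIDR decoding (implementer in bits 31:24, part number in bits 15:4) and the Cortex-A57 identifiers (0x41, 0xd07) are architectural facts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct cpu_cap_sketch {
    const char *desc;
    unsigned int capability;        /* bit number in the bitmap */
    bool (*matches)(uint32_t midr); /* per-CPU detection function */
};

static unsigned long cpu_hwcaps_sketch;

/* Example matcher: any revision of Cortex-A57. */
static bool is_cortex_a57_sketch(uint32_t midr)
{
    return ((midr >> 24) & 0xff) == 0x41 && ((midr >> 4) & 0xfff) == 0xd07;
}

/* Run on every CPU: set the bit for each entry whose matcher fires. */
static void update_cpu_capabilities_sketch(const struct cpu_cap_sketch *caps,
                                           size_t nr, uint32_t midr,
                                           const char *info)
{
    for ( size_t i = 0; i < nr; i++ )
    {
        if ( !caps[i].matches(midr) )
            continue;

        /* Log only the first time a capability is enabled. */
        if ( !((cpu_hwcaps_sketch >> caps[i].capability) & 1UL) )
            printf("%s: %s\n", info, caps[i].desc);

        cpu_hwcaps_sketch |= 1UL << caps[i].capability;
    }
}
```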
Some of the processor errata will require modifying code sequences.
As those modifications may impact the performance, they should only
be enabled on affected cores. Furthermore, Xen may also want to take
advantage of new hardware features coming up with v8.1 and v8.2.
This patch adds an infrastructure to patch Xen during boot time
depending on the "features" available on the platform.
This code is based on the file arch/arm64/kernel/alternative.c in
Linux 4.6-rc3. Any references to arm64 have been dropped to make the
code as generic as possible.
When Xen is creating the page tables, all the executable sections
(.text and .init.text) will be marked read-only and then enforced by
setting SCTLR.WNX.
Whilst it might be possible to mark those entries read-only after
Xen has been patched, we would need extra care to avoid possible
TLB conflicts (see D4-1732 in ARM DDI 0487A.i) as all
physical CPUs will be running.
All the physical CPUs have to be brought up before patching Xen because
each core may have different errata/features which require code
patching. The only way to find them is to probe system registers on
each CPU.
To avoid extra complexity, it is possible to create a temporary
writeable mapping with vmap. This mapping will be used to write the
new instructions.
Lastly, runtime patching is currently not necessary for ARM32. So the
code is only enabled for ARM64.
Note that the header asm-arm/alternative.h is a verbatim copy of the
Linux one (arch/arm64/include/asm/alternative.h). It may contain
inaccurate comments, but I did not touch them for now.
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Julien Grall <julien.grall@arm.com>
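The patching step can be pictured with a heavily simplified sketch. Here the "text" is just a writable array standing in for the temporary writable mapping, and the table layout, names and capability bitmap are all assumptions; the real code also has to deal with cache maintenance and instruction synchronization, which this omits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct alt_entry_sketch {
    size_t orig_offset;          /* index of the first instruction to patch */
    const uint32_t *replacement; /* replacement instruction words */
    size_t nr_insns;             /* number of instructions to replace */
    unsigned int cap;            /* capability bit that enables the patch */
};

/* Copy each replacement into the writable view when its capability is set.
 * Returns the number of entries applied. */
static size_t apply_alternatives_sketch(uint32_t *text,
                                        const struct alt_entry_sketch *alts,
                                        size_t nr_alts, unsigned long caps)
{
    size_t patched = 0;

    for ( size_t i = 0; i < nr_alts; i++ )
    {
        if ( !((caps >> alts[i].cap) & 1UL) )
            continue;

        memcpy(&text[alts[i].orig_offset], alts[i].replacement,
               alts[i].nr_insns * sizeof(uint32_t));
        patched++;
    }

    return patched;
}
```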
Anshul Makkar [Wed, 3 Aug 2016 12:35:22 +0000 (13:35 +0100)]
ratelimit: Implement rate limit for credit2 scheduler
Rate limit assures that a vcpu will execute for a minimum amount of
time before being put at the back of a queue or being preempted by
a higher priority thread.
It introduces context-switch rate-limiting. The patch enables the VM
to batch its work and prevents the system from spending most of its
time in context switches because of a VM that is waking/sleeping at
a high rate.
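The rate-limit rule can be sketched as a simple time comparison. The names, the microsecond units and the convention that a limit of 0 disables the feature are assumptions for illustration, not the credit2 code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* A running vcpu may be preempted only after it has run for ratelimit_us. */
static bool may_preempt_sketch(int64_t now_us, int64_t run_start_us,
                               int64_t ratelimit_us)
{
    return ratelimit_us == 0 || now_us - run_start_us >= ratelimit_us;
}

/* Time the current vcpu is still guaranteed to run, 0 if none. */
static int64_t ratelimit_remaining_sketch(int64_t now_us, int64_t run_start_us,
                                          int64_t ratelimit_us)
{
    int64_t ran = now_us - run_start_us;

    return ran >= ratelimit_us ? 0 : ratelimit_us - ran;
}
```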
Dario Faggioli [Wed, 3 Aug 2016 12:31:49 +0000 (13:31 +0100)]
xen: fix a (latent) cpupool-related race during domain destroy
So, during domain destruction, we do:
cpupool_rm_domain() [ in domain_destroy() ]
sched_destroy_domain() [ in complete_domain_destroy() ]
Therefore, there's a window during which, from the
scheduler's point of view, a domain still exists outside
of any cpupool.
In fact, cpupool_rm_domain() does d->cpupool=NULL,
and we don't allow that to hold true, for anything
but the idle domain (and there are, in fact, ASSERT()s
and BUG_ON()s to that effect).
Currently, we never really check d->cpupool during the
window, but that does not mean the race is not there.
For instance, Credit2 at some point (during load balancing)
iterates on the list of domains, and if we add logic that
needs checking d->cpupool, and any one of them had
cpupool_rm_domain() called on itself already... Boom!
(In fact, calling __vcpu_has_soft_affinity() from inside
balance_load() makes `xl shutdown <domid>' reliably
crash, and this is how I discovered this.)
On the other hand, cpupool_rm_domain() "only" does
cpupool related bookkeeping, and there's no harm
postponing it a little bit.
Also, considering that, during domain initialization,
we do:
cpupool_add_domain()
sched_init_domain()
It makes sense for the destruction path to look like
the opposite of it, i.e.:
sched_destroy_domain()
cpupool_rm_domain()
And hence that's what this patch does.
Actually, for better robustness, what we really do is
move both cpupool_add_domain() and cpupool_rm_domain()
inside sched_init_domain() and sched_destroy_domain(),
respectively (and also add a couple of ASSERT()-s).
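As a rough illustration of the resulting ordering, here is a toy model. It
is not the Xen code: the struct layouts, the sched_ready field and the
simplified function signatures are assumptions made for this sketch, and
only the nesting of the calls mirrors the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real Xen structures look nothing like this. */
struct cpupool { int n_dom; };
struct domain  { struct cpupool *cpupool; int sched_ready; };

static void cpupool_add_domain(struct domain *d, struct cpupool *c)
{
    d->cpupool = c;
    c->n_dom++;
}

static void cpupool_rm_domain(struct domain *d)
{
    if (d->cpupool != NULL) {
        d->cpupool->n_dom--;
        d->cpupool = NULL;
    }
}

/* The cpupool is assigned before the scheduler sees the domain... */
static void sched_init_domain(struct domain *d, struct cpupool *c)
{
    cpupool_add_domain(d, c);
    assert(d->cpupool != NULL);
    d->sched_ready = 1;
}

/* ...and dropped only after the scheduler is done with it. */
static void sched_destroy_domain(struct domain *d)
{
    assert(d->cpupool != NULL);
    d->sched_ready = 0;
    cpupool_rm_domain(d);
}
```

In this model there is no point at which the scheduler side is still
active while d->cpupool is NULL.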
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Juergen Gross <jgross@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
xen: credit2: issues in csched2_cpu_pick(), when tracing is enabled.
In fact, when not finding a suitable runqueue where to
place a vCPU, and hence using a fallback, we either:
- don't issue any trace record (while we should, at
least, output the chosen pcpu),
- risk underrunning when accessing the runqueues
array, while preparing the trace record.
Fix both issues and, while there, also a couple of style
problems found nearby.
Spotted by Coverity.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jacob Pan [Wed, 3 Aug 2016 12:41:13 +0000 (14:41 +0200)]
mwait-idle: add Denverton
Denverton is an Intel Atom based micro server which shares the same
Goldmont architecture as Broxton. The available C-states on
Denverton are a subset of Broxton's, with only C1, C1e, and C6.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[Linux commit: 0080d65b7719fc58e60b5595fc61acded330004f] Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 3 Aug 2016 12:40:44 +0000 (14:40 +0200)]
x86/time: introduce and use rdtsc_ordered()
Matching Linux commit 03b9730b76 ("x86/asm/tsc: Add rdtsc_ordered() and
use it in trivial call sites") and earlier ones it builds upon, let's
make sure timing loops don't have their rdtsc()-s re-ordered, as that
would harm precision of the result (values were observed to be several
hundred clocks off without this adjustment).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Joao Martins <joao.m.martins@oracle.com>
Jan Beulich [Wed, 3 Aug 2016 12:39:31 +0000 (14:39 +0200)]
x86/time: adjust local system time initialization
Using the bare return value from read_platform_stime() is not suitable
when local_time_calibration() is going to use its fast path: Divergence
of several dozen microseconds between NOW() return values on different
CPUs results when platform and local time don't stay in close sync.
Latch local and platform time on the CPU initiating AP bringup, such
that the AP can use these values to seed its stime_local_stamp with as
little of an error as possible. The boot CPU, otoh, can simply
calculate the correct initial value (other CPUs could do so too with
even greater accuracy than the approach being introduced, but that can
work only if all CPUs' TSCs start ticking at the same time, which
generally can't be assumed to be the case on multi-socket systems).
This slightly defers init_percpu_time() (moved ahead by commit
dd2658f966 ["x86/time: initialise time earlier during
start_secondary()"]) in order to reduce as much as possible the gap
between populating the stamps and consuming them.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Dario Faggioli <dario.faggioli@citrix.com> Tested-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Tue, 2 Aug 2016 17:25:42 +0000 (19:25 +0200)]
libxl: use llabs() instead of abs() for int64_t argument
Commit 57f8b13c724023c78fa15a80452d1de3e51a1418 ("libxl: memory size
in kb requires 64 bit variable") introduced a bug: abs() shouldn't
be called with an int64_t argument. llabs() is to be used here.
Caught by clang build with error message:
libxl.c:4198:33: error: absolute value function 'abs' given an argument
of type 'int64_t' (aka 'long') but has parameter of type 'int' which
may cause truncation of value [-Werror,-Wabsolute-value]
if (target_memkb < 0 && abs(target_memkb) > current_target_memkb)
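A minimal sketch of the corrected comparison follows. The helper name
shrink_exceeds_current() is hypothetical and exists only to isolate the
expression; abs() takes an int, so a 64-bit kB value would be converted
to int before the call, whereas llabs() takes a long long and keeps the
full value.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical wrapper around the comparison quoted above.
 * llabs() operates on long long, so no truncation of the int64_t
 * argument takes place.
 */
static bool shrink_exceeds_current(int64_t target_memkb,
                                   int64_t current_target_memkb)
{
    return target_memkb < 0 && llabs(target_memkb) > current_target_memkb;
}
```

A relative target of -(1 << 32) kB is a case where the magnitude no
longer fits in an int, which is exactly what the clang warning is about.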
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Wed, 27 Jul 2016 17:34:39 +0000 (18:34 +0100)]
x86/mm: Annotate gfn_get_* helpers as requiring non-NULL parameters
Introduce and use the nonnull attribute to help the compiler catch NULL
parameters being passed to functions which require their parameters not to be
NULL. Experimentally, GCC 4.9 on Debian Jessie only warns of non-NULL-ness
from immediate callers, so propagate the attributes out to all helpers.
A sample error looks like:
mem_sharing.c: In function ‘mem_sharing_nominate_page’:
mem_sharing.c:884:13: error: null argument where non-null required (argument 3) [-Werror=nonnull]
amfn = get_gfn_type_access(ap2m, gfn, NULL, &ap2ma, 0, NULL);
^
As part of this, replace the get_gfn_type_access() macro with an equivalent
static inline function for extra type safety, and the ability to be annotated.
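A small sketch of how such an annotation is used, assuming a GCC or clang
compatible compiler. The macro name sketch_nonnull() and the function
lookup_type() are hypothetical and are not the helpers touched by this
patch.

```c
#include <stddef.h>

/* Hypothetical wrapper macro around the GCC/clang nonnull attribute. */
#define sketch_nonnull(...) __attribute__((__nonnull__(__VA_ARGS__)))

/*
 * Argument 2 must not be NULL. Passing a literal NULL from an immediate
 * caller is diagnosed by -Wnonnull at compile time.
 */
static int lookup_type(unsigned long gfn, int *type) sketch_nonnull(2);

static int lookup_type(unsigned long gfn, int *type)
{
    *type = (int)(gfn & 1);   /* toy computation, for illustration only */
    return 0;
}
```

Because the warning is only raised for immediate callers, any wrapper
around lookup_type() would need the same annotation to keep the check
effective, which is why the patch propagates the attribute to all helpers.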
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
libxl_set_memory_target() and several other interface functions of
libxl use a 32 bit sized parameter for a memory size value in kBytes.
This limits the maximum size to be passed in such a parameter
depending on signedness of the parameter to 2TB or 4TB.
Tamas K Lengyel [Mon, 1 Aug 2016 17:14:27 +0000 (11:14 -0600)]
x86/mem-sharing: mem-sharing a range of memory
Currently mem-sharing can be performed on a page-by-page basis from the control
domain. However, this process is quite wasteful when a range of pages has to
be deduplicated.
This patch introduces a new mem_sharing memop for range sharing where
the user doesn't have to separately nominate each page in both the source and
destination domain, and the looping over all pages happens in the hypervisor.
This significantly reduces the overhead of sharing a range of memory.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Mon, 1 Aug 2016 08:57:10 +0000 (09:57 +0100)]
libxl: create xenstore nodes for control/feature-XXX flags
The xenstore-paths documentation specifies various control/feature-XXX
flags to allow a guest to tell a toolstack about its abilities to
respond to values written to control/shutdown. However, because the
parent control xenstore key is created read-only to the guest, unless
empty nodes for the feature flags are also created read/write by the
toolstack, the guest will not be able to set any flags.
This patch adds code to create all specified feature flag nodes at
domain creation time.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Tue, 2 Aug 2016 10:49:51 +0000 (12:49 +0200)]
libxl: fix printing hotplug arguments/environment
An OS could decide to not pass any environment variables to hotplug scripts,
and this will trigger a bug in device_hotplug logic, since it expects the
environment array to exist. Allow env to be NULL.
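The NULL-tolerant handling can be sketched like this. It is an
illustration under assumptions, not the libxl code: dump_string_array()
is a hypothetical helper, and the array is assumed to be a
NULL-terminated list of strings.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: print each entry of a NULL-terminated string
 * array and return the number of entries. A NULL array is treated as
 * "nothing to print" instead of being dereferenced.
 */
static size_t dump_string_array(FILE *out, const char *label,
                                const char *const *arr)
{
    size_t n = 0;

    if (arr == NULL) {
        fprintf(out, "%s: (none)\n", label);
        return 0;
    }

    for (; arr[n] != NULL; n++)
        fprintf(out, "%s[%zu]: %s\n", label, n, arr[n]);

    return n;
}
```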
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jim Fehlig [Fri, 29 Jul 2016 22:56:22 +0000 (16:56 -0600)]
docs: define semantics of vncpasswd in xl.cfg
A recent discussion around LSN-2016-0001 [1] included defining
the semantics of an empty string for a VNC password. It was stated
that "libxl interprets an empty password in the caller's
configuration to mean that passwordless access should be permitted".
The same applies to the vncpasswd setting in xl.cfg. This patch
extends the xl.cfg documentation to define the semantics of setting
vncpasswd to an empty string.
Signed-off-by: Jim Fehlig <jfehlig@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 1 Aug 2016 12:36:44 +0000 (13:36 +0100)]
xen/types: Correct the definition of uintptr_t
uintptr_t is specified as unsigned int in 32bit, not unsigned long. This is
why, when copying inttypes.h from GCC, the use of PRIxPTR and similar is
broken for 32bit builds.
Use __attribute__((__mode__(__pointer__))) to get the compiler's default
pointer type, which matches the pre-existing inttypes.h.
Fix the identified breakage with ELF_PRPTRVAL.
Compile tested on all architectures, with a manual printk() to trigger any
potential -Wformat issues.
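A minimal sketch of the technique, assuming a GCC or clang compatible
compiler. The type name sketch_uintptr_t and the helper ptr_to_int() are
hypothetical, chosen so as not to clash with the real uintptr_t.

```c
/*
 * Let the compiler pick an unsigned integer type that is exactly as
 * wide as a pointer on the target, rather than hard-coding
 * "unsigned long".
 */
typedef unsigned int sketch_uintptr_t __attribute__((__mode__(__pointer__)));

_Static_assert(sizeof(sketch_uintptr_t) == sizeof(void *),
               "pointer-mode integer must match the pointer width");

/* Hypothetical helper: round-trip a pointer through the integer type. */
static sketch_uintptr_t ptr_to_int(const void *p)
{
    return (sketch_uintptr_t)p;
}
```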
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Chao Gao [Mon, 1 Aug 2016 16:22:54 +0000 (18:22 +0200)]
x86/vMSI-x: check whether msixtbl_list in msixtbl_pt_register()
MSI-x table initialization was deferred in commit 74c6dc2d0ac4dcab0c6243cdf6ed550c1532b798. If an assigned device does not support
MSI-x, the msixtbl_list won't be initialized. However, the following paths
XEN_DOMCTL_bind_pt_irq
pt_irq_create_bind
msixtbl_pt_register
do not check this case. Some errors (malware, etc.) may lead to calling
XEN_DOMCTL_bind_pt_irq without a clear gtable and will cause a Xen panic.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 1 Aug 2016 16:21:37 +0000 (18:21 +0200)]
mwait-idle: correct/improve BXT support
Linux commit 5dcef69486 ("intel_idle: add BXT support") added an
8-element lookup array with just a 2-bit value used for lookups. As per
the SDM that bit field is really 3 bits wide. Since the top two array
entries are zero, deal with the resulting invalid (zero) values by
moving the zero-MSR-value check into irtl_2_usec() and having that
function's caller check its result instead.
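The adjusted helper can be sketched roughly as below. This is written
from the description above rather than copied from the patch: the unit
table values and the bit layout (bits 9:0 holding the time value, bits
12:10 the unit selector) are assumptions recalled from the Linux
intel_idle driver and should be checked against the SDM.

```c
#include <stdint.h>

/* Assumed unit table: nanoseconds per count; the top two entries are 0. */
static const unsigned int irtl_ns_units[8] = {
    1, 32, 1024, 32768, 1048576, 33554432, 0, 0
};

/*
 * Convert an IRTL MSR value to microseconds. Returns 0 both for a zero
 * MSR value and for the invalid unit selectors, so the caller only has
 * to check the result.
 */
static uint64_t irtl_2_usec(uint64_t irtl)
{
    uint64_t ns;

    if (!irtl)
        return 0;

    ns = irtl_ns_units[(irtl >> 10) & 0x7];   /* 3-bit unit selector */

    return (irtl & 0x3FF) * ns / 1000;
}
```

A caller would then skip the update whenever irtl_2_usec() returns 0.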
Chris Patterson [Wed, 27 Jul 2016 20:01:26 +0000 (16:01 -0400)]
libxl: compilation warning fix for arm & aarch64
GCC 6 will warn on unused static const variables in C modules:
https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00847.html
When compiling with LIBXL_HAVE_NO_SUSPEND_RESUME set (arm & aarch64),
the compiler emits the following errors:
xl_cmdimpl.c:101:19: error: 'migrate_report'
defined but not used [-Werror=unused-const-variable=]
xl_cmdimpl.c:99:19: error: 'migrate_permission_to_go'
defined but not used [-Werror=unused-const-variable=]
xl_cmdimpl.c:97:19: error: 'migrate_receiver_ready'
defined but not used [-Werror=unused-const-variable=]
xl_cmdimpl.c:95:19: error: 'migrate_receiver_banner'
defined but not used [-Werror=unused-const-variable=]
These const variables are only used in functions which exist inside
the ifndef block:
#ifndef LIBXL_HAVE_NO_SUSPEND_RESUME
...
#endif
Wrap the same ifndef around these variables.
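The pattern looks roughly like the sketch below. The identifiers
sketch_banner, sketch_banner_len() and sketch_have_migration(), as well
as the string contents, are hypothetical and only stand in for the real
xl_cmdimpl.c variables and functions.

```c
#include <stddef.h>
#include <string.h>

#ifndef LIBXL_HAVE_NO_SUSPEND_RESUME
/*
 * The constant lives inside the same guard as its only user, so a build
 * with LIBXL_HAVE_NO_SUSPEND_RESUME defined never sees an unused
 * static const variable.
 */
static const char sketch_banner[] = "receiver ready";

static size_t sketch_banner_len(void)
{
    return strlen(sketch_banner);
}
#endif

/* Available in both configurations; reports whether migration is built in. */
static int sketch_have_migration(void)
{
#ifndef LIBXL_HAVE_NO_SUSPEND_RESUME
    return sketch_banner_len() > 0;
#else
    return 0;
#endif
}
```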
Signed-off-by: Chris Patterson <pattersonc@ainfosec.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Mon, 25 Jul 2016 15:13:13 +0000 (16:13 +0100)]
xsm: don't require configuring tools to build xen xsm blob
Starting from 08cffe66 ("xsm: add a default policy to .init.data") we
can attach an xsm policy blob to the hypervisor. To build that policy
blob, the hypervisor build system now needs to enter the tools directory.
The expectations for the hypervisor and tools build systems are different.
We don't want the xen build system to depend on configure, but we do want
the tools build system to. That commit broke this expectation because it
required users to run configure before building the hypervisor. This broke
the ARM build because ARM developers normally build the hypervisor and
tools separately (and possibly on different platforms). It can also break
x86 if developers don't run configure before building the hypervisor with
XSM on.
To fix it, move the major part of tools/flask/policy/Makefile into
Makefile.common and create a tools-only Makefile that includes that common
Makefile. The hypervisor Makefile will use Makefile.common to build the xsm
policy.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Tested-by: Julien Grall <julien.grall@arm.com>