Andrew Cooper [Wed, 3 Aug 2016 16:56:56 +0000 (16:56 +0000)]
x86/traps: Drop use_error_code parameter from do_{,guest_}trap()
Whether or not an error code is needed can be determined entirely from the
trapnr parameter, as error codes are architecturally specified.
Introduce TRAP_HAVE_EC as a bitmap of reserved vectors which have error codes,
and drop the use_error_code from all callsites.
As a result, the DO_ERROR{,_NOCODE}() macros become entirely superfluous and
can be dropped. Update the exception_table to point straight at do_trap().
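A minimal sketch of the bitmap idea, assuming the set of error-code vectors is 8, 10-14 and 17 (newer vectors are left out). The TRAP_* constant names and the trap_has_error_code() helper are illustrative and are not copied from the Xen tree:

```c
#include <assert.h>
#include <stdbool.h>

/* x86 exception vectors that architecturally push an error code.
 * Names are illustrative; only the vector numbers are architectural. */
enum {
    TRAP_double_fault    = 8,
    TRAP_invalid_tss     = 10,
    TRAP_no_segment      = 11,
    TRAP_stack_error     = 12,
    TRAP_gp_fault        = 13,
    TRAP_page_fault      = 14,
    TRAP_alignment_check = 17,
};

/* One bit per reserved vector (0-31) that delivers an error code. */
#define TRAP_HAVE_EC                                         \
    ((1u << TRAP_double_fault) | (1u << TRAP_invalid_tss) |  \
     (1u << TRAP_no_segment)   | (1u << TRAP_stack_error) |  \
     (1u << TRAP_gp_fault)     | (1u << TRAP_page_fault)  |  \
     (1u << TRAP_alignment_check))

/* Hypothetical helper: derive the old "use_error_code" flag from the
 * vector number alone. */
static bool trap_has_error_code(unsigned int trapnr)
{
    assert(trapnr < 32);  /* the bitmap only covers reserved vectors */
    return (TRAP_HAVE_EC >> trapnr) & 1u;
}
```

With a test like this, a caller no longer has to pass a separate flag: the vector number is enough to decide whether an error code accompanies the trap.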
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Mon, 1 Aug 2016 09:55:59 +0000 (10:55 +0100)]
xl: use xenconsole startup protocol
If the user asks xl to automatically connect to the console when creating a
guest, use the new startup protocol before trying to unpause the domain so
that we don't lose any console output.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Mon, 8 Aug 2016 10:07:46 +0000 (11:07 +0100)]
CODING_STYLE: Allow single-sentence comments without full stops
One of the common ways in which contributors trip up over the
CODING_STYLE guides is by not putting a full stop at the end of a
comment when there is only a single sentence. Calling these out is a
waste of everybody's time: The full stop at the end of a comment with
a single sentence (or a single phrase) adds absolutely nothing to the
legibility of the code.
Modify CODING_STYLE to allow comments with a single sentence or
sentence fragment to either have a full stop or not, while making it
clear that comments with multiple sentences must have a full stop at
the end of each sentence.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@citrix.com>
Dario Faggioli [Thu, 4 Aug 2016 08:59:03 +0000 (10:59 +0200)]
tools: xenalyze: kill spurious sched_switch output in non dump mode.
In fact, 52cf096df7 ("xenalyze: handle scheduling event"),
when dealing with TRC_SCHED_SWITCH, forgot to check whether
we actually are in dump mode, causing the printf() in
dump_sched_switch() to always produce its output, which
is not what we want.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Bob Liu [Thu, 4 Aug 2016 01:07:56 +0000 (09:07 +0800)]
libxl: return any serial tty path in libxl_console_get_tty
When specifying a serial list in domain config, users of
libxl_console_get_tty cannot get the tty path of a second specified pty serial,
since right now it always returns the tty path of serial 0.
Signed-off-by: Bob Liu <bob.liu@oracle.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Tue, 2 Aug 2016 16:10:47 +0000 (18:10 +0200)]
tools: make xenstore domain easy configurable
Add configuration entries to sysconfig.xencommons for selection of the
xenstore type (domain or daemon) and start the selected xenstore
service via a script called from sysvinit or systemd.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Tue, 2 Aug 2016 16:10:46 +0000 (18:10 +0200)]
tools: use pidfile for test if xenstored is running
Instead of trying to read xenstore via xenstore-read use the pidfile
of xenstored for the test whether xenstored is running. This prepares
support of xenstore domain, as trying to read xenstore will block
for ever in case xenstore domain is started after trying to read.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Tue, 2 Aug 2016 16:10:45 +0000 (18:10 +0200)]
tools: split out xenstored starting from xencommons
In order to prepare starting a xenstore domain split out the starting
of the xenstore daemon from the xencommons script into a dedicated
launch-xenstore script.
A rerun of autogen.sh is required.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Tue, 2 Aug 2016 16:10:44 +0000 (18:10 +0200)]
tools: remove systemd xenstore socket definitions
On a system with systemd the xenstore sockets are created via systemd.
Remove the related configuration files in order to be able to decide
at runtime whether the sockets should be created or not. This will
enable Xen to start xenstore either via a daemon or via a stub domain.
As the xenstore domain start program will exit after it has done its
job prepare the same behaviour to be tolerated by systemd for the
xenstore daemon by specifying the appropriate flags in the service
file.
A rerun of autogen.sh is required.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: David Scott <dave@recoil.org>
This is because a data abort from a guest was received due to a
permission fault but memaccess thought there was no permission fault.
On ARM, memaccess permissions are stored in a radix tree because there
are not enough available bits in the p2m entry to store the access
restriction. When memaccess is restricting the access (i.e any other
access than p2m_access_rwx), the access will be added in the radix tree
using the GFN as a key. This will be done for all 4KB pages.
This means that memaccess has to shatter all the superpages in a given
region to set the permission on a 4KB granularity. Currently, when a
superpage is shattered, the new entries are using the value
p2m->default_access which will restrict permission (because memaccess
has been enabled). However the radix tree does not yet contain
an entry for this GFN.
If a guest VCPU is running at the same time and trying to access the
modified region, it will result in a stage-2 permission fault. As
the radix tree does not yet contain an entry for the GFN, memaccess will
deduce that the fault was not valid and a data abort will be injected
into the guest (and crash it).
Furthermore, the permission may be restricted outside of the requested
region if it is only a subset of a 1GB/2MB superpage.
The two issues can be fixed by re-using the permission of the superpage
entry and overriding the necessary fields. This is not a problem because
memaccess cannot work on superpages.
Lastly, document the code which calls mfn_to_p2m_entry when creating the
p2m entry for a table, to explain that permissions are ignored by the
hardware (see D4.3.1 in ARM DDI 0487A.j), so the value of the parameter
'access' of mfn_to_p2m_entry does not matter.
The ARM erratum applies to certain revisions of Cortex-A57. The
processor may report a Stage 2 translation fault as the result of
a Stage 1 fault for a load crossing a page boundary when there is a
permission fault or device memory fault at stage 1 and a translation
fault at Stage 2.
So Xen needs to check that Stage 1 translation does not generate a fault
before handling the Stage 2 fault. If it is a Stage 1 translation fault,
return to the guest to let the processor inject the correct fault.
Julien Grall [Thu, 4 Aug 2016 17:50:06 +0000 (18:50 +0100)]
xen/arm: traps: Avoid unnecessary VA -> IPA translation in abort handlers
Translating a VA to an IPA is expensive. Currently, Xen is assuming that
HPFAR_EL2 is only valid when the stage-2 data/instruction abort happened
during a translation table walk of a first stage translation (i.e S1PTW
is set).
However, based on the ARM ARM (D7.2.34 in DDI 0487A.j), the register is
also valid when the data/instruction abort occurred for a translation
fault.
With this change, the VA -> IPA translation will only happen for
permission faults that are not related to a translation table of a
first stage translation.
Julien Grall [Thu, 4 Aug 2016 17:50:05 +0000 (18:50 +0100)]
xen/arm: traps: MMIO should only be emulated for fault translation
The function do_trap_data_abort_guest assumes that a stage-2 data abort
can only be taken for a translation fault or permission fault today.
Whilst this is true today, it might not be in the future. Rather than
emulating the MMIO for any fault other than the permission one, print
a warning message when the fault is not handled by Xen.
Julien Grall [Thu, 4 Aug 2016 17:50:04 +0000 (18:50 +0100)]
xen/arm: Use check_workaround to handle the erratum 766422
Currently, Xen is accessing the stored MIDR every time it has to check
whether the processor is affected by the erratum 766422.
This could take advantage of the new capability bitfields to detect
whether the processor is affected at boot time.
With this patch, the number of instructions to check the erratum is
going down from ~13 (including 2 loads and a co-processor access) to
~6 instructions (including 1 load).
Julien Grall [Thu, 4 Aug 2016 17:50:03 +0000 (18:50 +0100)]
xen/arm: Provide macros to help creating workaround helpers
Workarounds may require executing a different path when the platform
is affected by the associated erratum. Furthermore, this may need to
be called in the common code.
To avoid too much intrusion/overhead, the workaround helpers need to
be a nop on architectures which will never have the workaround and have
to be quick to check whether the platform requires it.
The alternative framework is used to transform the check into a single
instruction. When the framework is not available, the helper will have
~6 instructions including 1 instruction load.
The macro will create a handler called check_workaround_xxxxx with
xxxxx the erratum number.
For instance, the line below will create a workaround helper for
erratum #424242 which is enabled when the capability
ARM64_WORKAROUND_424242 is set and only available for ARM64:
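The invocation itself is not shown here. As a rough illustration of the idea only, a helper-generating macro could look like the sketch below. The macro name DEFINE_WORKAROUND_CHECK, the cpu_hwcaps bitmap, cpus_have_cap() and SKETCH_ARCH_IS_ARM64 are all hypothetical, and the sketch uses a plain bit test instead of the alternative framework:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability number and bitmap; the real definitions differ. */
#define ARM64_WORKAROUND_424242 0
static unsigned long cpu_hwcaps;

static bool cpus_have_cap(unsigned int cap)
{
    assert(cap < 8 * sizeof(cpu_hwcaps));
    return (cpu_hwcaps >> cap) & 1ul;
}

/* Generate check_workaround_<erratum>(): constant false when the
 * architecture can never need the workaround, otherwise a bit test. */
#define DEFINE_WORKAROUND_CHECK(erratum, feature, arch_enabled)  \
static inline bool check_workaround_##erratum(void)              \
{                                                                 \
    if (!(arch_enabled))                                          \
        return false;                                             \
    return cpus_have_cap(feature);                                \
}

#define SKETCH_ARCH_IS_ARM64 1  /* stand-in for a build-time arch test */
DEFINE_WORKAROUND_CHECK(424242, ARM64_WORKAROUND_424242, SKETCH_ARCH_IS_ARM64)
```

Because arch_enabled is a compile-time constant, the compiler can fold the helper to a constant false on architectures that never need the workaround, which is the "nop" property described above.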
Julien Grall [Thu, 4 Aug 2016 17:50:02 +0000 (18:50 +0100)]
xen/arm: traps: Simplify the switch in do_trap_*_abort_guest
The fault status codes we care about are of the form BBBBxx where xx is the lookup
level that gave the fault. We can simplify the code by masking the 2 least
significant bits.
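A small sketch of the masking, using the ARMv8 fault status encodings for translation (0b0001xx), access flag (0b0010xx) and permission (0b0011xx) faults. The FSC_* names, the enum and classify_fsc() are illustrative and may not match the identifiers used in the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Fault status codes with the lookup level (low 2 bits) cleared. */
#define FSC_TYPE_MASK   (~0x3u)
#define FSC_FLT_TRANS   0x04u  /* 0b0001xx: translation fault */
#define FSC_FLT_ACCESS  0x08u  /* 0b0010xx: access flag fault */
#define FSC_FLT_PERM    0x0cu  /* 0b0011xx: permission fault  */

enum fault_kind { FAULT_TRANS, FAULT_ACCESS, FAULT_PERM, FAULT_OTHER };

/* Classify a 6-bit fault status code while ignoring the lookup level. */
static enum fault_kind classify_fsc(uint8_t fsc)
{
    assert(fsc < 0x40);  /* the status code field is 6 bits wide */

    switch (fsc & FSC_TYPE_MASK) {
    case FSC_FLT_TRANS:
        return FAULT_TRANS;
    case FSC_FLT_ACCESS:
        return FAULT_ACCESS;
    case FSC_FLT_PERM:
        return FAULT_PERM;
    default:
        return FAULT_OTHER;
    }
}
```

One case label per fault type replaces four labels (one per lookup level), which is where the simplification comes from.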
Andrew Cooper [Thu, 4 Aug 2016 11:38:05 +0000 (12:38 +0100)]
x86/debug: Make debugger_trap_entry() safe during early boot
debugger_trap_entry() is reachable during early boot where its unconditional
use of current is unsafe. Add a warning to the function to this effect.
Perform the vector check first, as this allows the compiler to elide the other
content from most of its callsites. Check guest_mode(regs) before using
current, which makes the path safe on early boot.
While editing this area, drop DEBUGGER_trap_{entry,fatal}, as hiding a return
statement in a function-like macro is very antisocial programming; show the
real control flow at each of the callsites. Finally, switch
debugger_trap_{entry,fatal} to having boolean return types, to match their
semantics.
No behavioural change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 4 Aug 2016 08:08:48 +0000 (10:08 +0200)]
hvmloader: don't hard-code IO-APIC parameters
The IO-APIC address has variable bits determined by the PCI-to-ISA
bridge (albeit for now we refrain from actually evaluating them, as
there's still implicit rather than explicit agreement on the IO-APIC
base address between qemu and the hypervisor), and the IO-APIC version
should be read from the IO-APIC.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Aug 2016 08:08:00 +0000 (10:08 +0200)]
x86/time: relax barriers
On x86 there's no need for full barriers in loops waiting for some
memory location to change. Nor do we need full barriers between two
reads and two writes - SMP ones fully suffice.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Aug 2016 08:07:02 +0000 (10:07 +0200)]
x86/time: group time stamps into a structure
If that had been done from the beginning, mistakes like the one
corrected in commit b64438c7c1 ("x86/time: use correct (local) time
stamp in constant-TSC calibration fast path") would likely never have
happened.
Also add a few "const" to make more obvious when things aren't expected
to change.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Dario Faggioli <dario.faggioli@citrix.com> Tested-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Aug 2016 08:04:29 +0000 (10:04 +0200)]
x86/time: fold recurring code
Common code between time_calibration_{std,tsc}_rendezvous() can better
live in a single place, eliminating the risk of adjusting one without
the other.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Dario Faggioli <dario.faggioli@citrix.com> Tested-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Aug 2016 08:02:52 +0000 (10:02 +0200)]
x86/time: calibrate TSC against platform timer
... instead of unconditionally against the PIT. This allows for local
and master system times to remain in better sync (which matters even
when, on any modern system, the master time is really used only during
secondary CPU bringup, as the error between the two is in fact
noticeable in cross-CPU NOW() invocation monotonicity).
This involves moving the init_platform_timer() invocation into
early_time_init(), splitting out the few things which really need to be
done in init_xen_time(). That in turn allows dropping the open coded
PIT initialization from init_IRQ() (it was needed for APIC clock
calibration, which runs between early_time_init() and init_xen_time()).
In the course of this re-ordering also set the timer channel 2 gate low
after having finished calibration. This should be benign to overall
system operation, but appears to be the cleaner state.
Also do away with open coded 8254 register manipulation from 8259 code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Thu, 4 Aug 2016 08:01:57 +0000 (10:01 +0200)]
x86/HVM: add new functions to get/set memory types
For clarity this patch breaks the code to set/get memory types out
of do_hvm_op() into dedicated functions: hvmop_set/get_mem_type().
Also, for clarity, checks for whether a memory type change is allowed
are broken out into a separate function called by hvmop_set_mem_type().
There is no intentional functional change in this patch.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Thu, 4 Aug 2016 08:01:17 +0000 (10:01 +0200)]
x86: rename p2m_mmio_write_dm to p2m_ioreq_server
Previously p2m type p2m_mmio_write_dm was introduced for write-
protected memory pages whose write operations are supposed to be
forwarded to and emulated by an ioreq server. Yet limitations of
rangeset restrict the number of guest pages to be write-protected.
This patch replaces the p2m type p2m_mmio_write_dm with a new name:
p2m_ioreq_server, which means this p2m type can be claimed by one
ioreq server, instead of being tracked inside the rangeset of ioreq
server. And a new memory type, HVMMEM_ioreq_server, is now used in
the HVMOP_set/get_mem_type interface to set/get this p2m type.
Patches following up will add the related HVMOP handling code which
map/unmap type p2m_ioreq_server to/from an ioreq server. Without
following patches, memory type changes to HVMMEM_ioreq_server can
still be allowed, and in such cases, p2m_ioreq_server pages will be
treated the same as ones with previous type p2m_mmio_write_dm, and
are tracked inside the ioreq server's rangeset.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
xen/arm: traps: Don't inject a fault if the translation VA -> IPA fails
Based on ARM ARM (D4.5.3 in ARM DDI 0486A and B3.12.7 in ARM DDI 0406C.c),
a Stage 1 translation error has priority over a Stage 2 translation error.
Therefore gva_to_ipa can only fail if another vCPU is playing with the
page table.
Rather than injecting a custom fault, replay the instruction and let the
processor inject the correct fault.
This is fine as Xen is handling all the pending softirqs
(see leave_hypervisor_tail) before returning to the guest. One of them
is the scheduler which could reschedule the vCPU.
The ARM erratum 832075 applies to certain revisions of Cortex-A57, one
of the workarounds is to change device loads into using load-acquire
semantics.
Use the alternative framework to enable the workaround only on affected
cores.
Whilst a guest could trigger the deadlock, it can be broken when the
processor is receiving an interrupt. As the Xen scheduler will always set up
a timer (firing every 1ms to 300ms depending on the running time
slice) on each processor, the deadlock would last only a few milliseconds
and only affects the guest time slice.
Therefore a malicious guest could only hurt itself. Note that all the
guests should implement/enable the workaround for the affected cores.
The ARM errata 819472, 827319 and 824069 define the same workaround for
these hardware issues in certain Cortex-A53 parts.
The cache instructions "dc cvac" and "dc cvau" need to be upgraded to
"dc civac".
Use the alternative framework to replace those instructions only on
affected cores.
Whilst the errata affect cache instructions issued at any exception
level, it is not necessary to trap EL1/EL0 data cache instructions
access in order to upgrade them. Indeed the data cache corruption would
always be at the address used by the data cache instructions. Note that
this address could point to memory shared between guests and the
hypervisor; however, all the information present in it is validated
before any use.
Therefore a malicious guest could only hurt itself. Note that all the
guests should implement/enable the workaround for the affected cores.
xen/arm: Detect silicon revision and set cap bits accordingly
After each CPU has been started, we iterate through a list of CPU
errata to detect CPUs which need hypervisor code patches.
For each bug there is a function which checks whether a particular CPU is
affected. This needs to be done on every CPU to cover heterogeneous
systems properly.
If a certain erratum has been detected, the capability bit will be set.
In the case the erratum requires code patching, this will be triggered
by the call to apply_alternatives.
The code is based on the file arch/arm64/kernel/cpu_errata.c in Linux
v4.6-rc3.
xen/arm: cpufeature: Provide an helper to check if a capability is supported
The CPU capabilities will be set depending on the value found in the CPU
registers. This patch provides a generic helper to go through a set of capabilities
and find which one should be enabled.
The parameter "info" is used to display the kind of capability updated (e.g
workaround, feature...).
Some of the processor errata will require modifying code sequences.
As those modifications may impact the performance, they should only
be enabled on affected cores. Furthermore, Xen may also want to take
advantage of new hardware features coming up with v8.1 and v8.2.
This patch adds an infrastructure to patch Xen during boot time
depending on the "features" available on the platform.
This code is based on the file arch/arm64/kernel/alternative.c in
Linux 4.6-rc3. Any references to arm64 have been dropped to make the
code as generic as possible.
When Xen is creating the page tables, all the executable sections
(.text and .init.text) will be marked read-only and then enforced by
setting SCTLR.WNX.
Whilst it might be possible to mark those entries read-only after
Xen has been patched, we would need extra care to avoid possible
TLB conflicts (see D4-1732 in ARM DDI 0487A.i) as all
physical CPUs will be running.
All the physical CPUs have to be brought up before patching Xen because
each core may have different errata/features which require code
patching. The only way to find them is to probe system registers on
each CPU.
To avoid extra complexity, it is possible to create a temporary
writeable mapping with vmap. This mapping will be used to write the
new instructions.
Lastly, runtime patching is currently not necessary for ARM32. So the
code is only enabled for ARM64.
Note that the header asm-arm/alternative.h is a verbatim copy of the
Linux one (arch/arm64/include/asm/alternative.h). It may contain
inaccurate comments, but I did not touch them for now.
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Julien Grall <julien.grall@arm.com>
Anshul Makkar [Wed, 3 Aug 2016 12:35:22 +0000 (13:35 +0100)]
ratelimit: Implement rate limit for credit2 scheduler
Rate limit assures that a vcpu will execute for a minimum amount of
time before being put at the back of a queue or being preempted by
a higher priority thread.
It introduces context-switch rate-limiting. The patch enables the VM
to batch its work and prevents the system from spending most of its
time in context switches because of a VM that is waking/sleeping at
a high rate.
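The core of context-switch rate limiting can be sketched as a single check made before preempting the running vCPU. The structure, field and function names below are hypothetical and are not taken from the credit2 code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU bookkeeping; all times are in microseconds. */
struct vcpu_sketch {
    int64_t run_start_us;  /* when the vCPU was last scheduled in */
};

/* Returns true if the running vCPU may be preempted at time now_us.
 * A ratelimit_us of 0 disables rate limiting altogether. */
static bool may_preempt(const struct vcpu_sketch *cur, int64_t now_us,
                        int64_t ratelimit_us)
{
    int64_t ran_us;

    assert(now_us >= cur->run_start_us);
    ran_us = now_us - cur->run_start_us;

    if (ratelimit_us > 0 && ran_us < ratelimit_us)
        return false;  /* has not yet had its minimum run time */
    return true;
}
```

A scheduler using such a check would let the current vCPU keep running (typically by arming a timer for the remaining time) instead of switching to a freshly woken vCPU right away.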
Dario Faggioli [Wed, 3 Aug 2016 12:31:49 +0000 (13:31 +0100)]
xen: fix a (latent) cpupool-related race during domain destroy
So, during domain destruction, we do:
cpupool_rm_domain() [ in domain_destroy() ]
sched_destroy_domain() [ in complete_domain_destroy() ]
Therefore, there's a window during which, from the
scheduler's point of view, a domain still exists outside
of any cpupool.
In fact, cpupool_rm_domain() does d->cpupool=NULL,
and we don't allow that to hold true, for anything
but the idle domain (and there are, in fact, ASSERT()s
and BUG_ON()s to that effect).
Currently, we never really check d->cpupool during the
window, but that does not mean the race is not there.
For instance, Credit2 at some point (during load balancing)
iterates on the list of domains, and if we add logic that
needs checking d->cpupool, and any one of them had
cpupool_rm_domain() called on itself already... Boom!
(In fact, calling __vcpu_has_soft_affinity() from inside
balance_load() makes `xl shutdown <domid>' reliably
crash, and this is how I discovered this.)
On the other hand, cpupool_rm_domain() "only" does
cpupool related bookkeeping, and there's no harm
postponing it a little bit.
Also, considering that, during domain initialization,
we do:
cpupool_add_domain()
sched_init_domain()
It makes sense for the destruction path to look like
the opposite of it, i.e.:
sched_destroy_domain()
cpupool_rm_domain()
And hence that's what this patch does.
Actually, for better robustness, what we really do is
moving both cpupool_add_domain() and cpupool_rm_domain()
inside sched_init_domain() and sched_destroy_domain(),
respectively (and also add a couple of ASSERT()-s).
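The resulting symmetry can be sketched as follows. The function names come from the description above, but the *_sketch structures, the simplified signatures and the bodies are assumptions made purely for illustration:

```c
#include <assert.h>
#include <stddef.h>

struct cpupool_sketch { int n_dom; };
struct domain_sketch  { struct cpupool_sketch *cpupool; int sched_ready; };

static void cpupool_add_domain(struct domain_sketch *d,
                               struct cpupool_sketch *c)
{
    d->cpupool = c;
    c->n_dom++;
}

static void cpupool_rm_domain(struct domain_sketch *d)
{
    d->cpupool->n_dom--;
    d->cpupool = NULL;
}

/* Init: join the cpupool first, then set up scheduler state. */
static void sched_init_domain(struct domain_sketch *d,
                              struct cpupool_sketch *c)
{
    cpupool_add_domain(d, c);
    assert(d->cpupool != NULL);
    d->sched_ready = 1;
}

/* Destroy: tear down scheduler state first, leave the cpupool last, so
 * the scheduler never sees a live domain with a NULL cpupool. */
static void sched_destroy_domain(struct domain_sketch *d)
{
    assert(d->cpupool != NULL);
    d->sched_ready = 0;
    cpupool_rm_domain(d);
}
```

Keeping the cpupool calls inside the scheduler's init and destroy functions means no caller can get the ordering wrong.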
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Juergen Gross <jgross@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
xen: credit2: issues in csched2_cpu_pick(), when tracing is enabled.
In fact, when not finding a suitable runqueue where to
place a vCPU, and hence using a fallback, we either:
- don't issue any trace record (while we should, at
least, output the chosen pcpu),
- risk underrunning when accessing the runqueues
array, while preparing the trace record.
Fix both issues and, while there, also a couple of style
problems found nearby.
Spotted by Coverity.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jacob Pan [Wed, 3 Aug 2016 12:41:13 +0000 (14:41 +0200)]
mwait-idle: add Denverton
Denverton is an Intel Atom based micro server which shares the same
Goldmont architecture as Broxton. The available C-states on
Denverton are a subset of Broxton with only C1, C1e, and C6.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[Linux commit: 0080d65b7719fc58e60b5595fc61acded330004f] Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 3 Aug 2016 12:40:44 +0000 (14:40 +0200)]
x86/time: introduce and use rdtsc_ordered()
Matching Linux commit 03b9730b76 ("x86/asm/tsc: Add rdtsc_ordered() and
use it in trivial call sites") and earlier ones it builds upon, let's
make sure timing loops don't have their rdtsc()-s re-ordered, as that
would harm precision of the result (values were observed to be several
hundred clocks off without this adjustment).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Joao Martins <joao.m.martins@oracle.com>
Jan Beulich [Wed, 3 Aug 2016 12:39:31 +0000 (14:39 +0200)]
x86/time: adjust local system time initialization
Using the bare return value from read_platform_stime() is not suitable
when local_time_calibration() is going to use its fast path: Divergence
of several dozen microseconds between NOW() return values on different
CPUs results when platform and local time don't stay in close sync.
Latch local and platform time on the CPU initiating AP bringup, such
that the AP can use these values to seed its stime_local_stamp with as
little of an error as possible. The boot CPU, otoh, can simply
calculate the correct initial value (other CPUs could do so too with
even greater accuracy than the approach being introduced, but that can
work only if all CPUs' TSCs start ticking at the same time, which
generally can't be assumed to be the case on multi-socket systems).
This slightly defers init_percpu_time() (moved ahead by commit dd2658f966 ["x86/time: initialise time earlier during
start_secondary()"]) in order to reduce as much as possible the gap
between populating the stamps and consuming them.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Dario Faggioli <dario.faggioli@citrix.com> Tested-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Tue, 2 Aug 2016 17:25:42 +0000 (19:25 +0200)]
libxl: use llabs() instead of abs() for int64_t argument
Commit 57f8b13c724023c78fa15a80452d1de3e51a1418 ("libxl: memory size
in kb requires 64 bit variable") introduced a bug: abs() shouldn't
be called with an int64_t argument. llabs() is to be used here.
Caught by clang build with error message:
libxl.c:4198:33: error: absolute value function 'abs' given an argument
of type
'int64_t' (aka 'long') but has parameter of type 'int' which may cause
truncation of value [-Werror,-Wabsolute-value]
if (target_memkb < 0 && abs(target_memkb) > current_target_memkb)
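A minimal sketch of the corrected comparison. The function name and the int64_t type of the second parameter are assumptions; only the expression shape follows the line quoted in the error message:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the quoted check: is a negative relative
 * target larger in magnitude than the current target? */
static bool shrink_exceeds_current(int64_t target_memkb,
                                   int64_t current_target_memkb)
{
    /* llabs() is undefined for the most negative value. */
    assert(target_memkb != INT64_MIN);

    /* llabs() takes long long, so the full 64-bit magnitude is kept;
     * abs() takes int and may truncate the argument. */
    return target_memkb < 0 && llabs(target_memkb) > current_target_memkb;
}
```

The difference only shows up for magnitudes that do not fit in an int, which is exactly the range the 64-bit memory-size change was meant to support.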
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Wed, 27 Jul 2016 17:34:39 +0000 (18:34 +0100)]
x86/mm: Annotate gfn_get_* helpers as requiring non-NULL parameters
Introduce and use the nonnull attribute to help the compiler catch NULL
parameters being passed to functions which require their parameters not to be
NULL. Experimentally, GCC 4.9 on Debian Jessie only warns of non-NULL-ness
from immediate callers, so propagate the attributes out to all helpers.
A sample error looks like:
mem_sharing.c: In function ‘mem_sharing_nominate_page’:
mem_sharing.c:884:13: error: null argument where non-null required (argument 3) [-Werror=nonnull]
amfn = get_gfn_type_access(ap2m, gfn, NULL, &ap2ma, 0, NULL);
^
As part of this, replace the get_gfn_type_access() macro with an equivalent
static inline function for extra type safety, and the ability to be annotated.
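A sketch of the annotation on a static function, using the GCC/Clang nonnull attribute directly. The table_sketch type and lookup_entry() are hypothetical and unrelated to the gfn helpers themselves:

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX 4

struct table_sketch {
    unsigned int count;
    unsigned long entries[SKETCH_MAX];
};

/* Parameters 1 and 3 must not be NULL; with -Wnonnull the compiler can
 * diagnose a literal NULL passed for either of them at the call site. */
static __attribute__((__nonnull__(1, 3)))
bool lookup_entry(const struct table_sketch *t, unsigned int idx,
                  unsigned long *out)
{
    assert(t->count <= SKETCH_MAX);  /* structural invariant of the sketch */

    if (idx >= t->count)
        return false;

    *out = t->entries[idx];
    return true;
}
```

A function can carry the attribute where a macro cannot, which is one reason for converting the macro into a static inline function. Note that the compiler may also assume annotated parameters are non-NULL inside the function and drop explicit NULL checks there.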
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
libxl_set_memory_target() and several other interface functions of
libxl use a 32 bit sized parameter for a memory size value in kBytes.
This limits the maximum size to be passed in such a parameter
depending on signedness of the parameter to 2TB or 4TB.
Tamas K Lengyel [Mon, 1 Aug 2016 17:14:27 +0000 (11:14 -0600)]
x86/mem-sharing: mem-sharing a range of memory
Currently mem-sharing can be performed on a page-by-page basis from the control
domain. However, this process is quite wasteful when a range of pages has to
be deduplicated.
This patch introduces a new mem_sharing memop for range sharing where
the user doesn't have to separately nominate each page in both the source and
destination domain, and the looping over all pages happens in the hypervisor.
This significantly reduces the overhead of sharing a range of memory.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Mon, 1 Aug 2016 08:57:10 +0000 (09:57 +0100)]
libxl: create xenstore nodes for control/feature-XXX flags
The xenstore-paths documentation specifies various control/feature-XXX
flags to allow a guest to tell a toolstack about its abilities to
respond to values written to control/shutdown. However, because the
parent control xenstore key is created read-only to the guest, unless
empty nodes for the feature flags are also created read/write by the
toolstack, the guest will not be able to set any flags.
This patch adds code to create all specified feature flag nodes at
domain creation time.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Tue, 2 Aug 2016 10:49:51 +0000 (12:49 +0200)]
libxl: fix printing hotplug arguments/environment
An OS could decide to not pass any environment variables to hotplug scripts,
and this will trigger a bug in device_hotplug logic, since it expects the
environment array to exist. Allow env to be NULL.
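The tolerant behaviour can be sketched like this. The function name, its signature and the one-string-per-entry layout of the arrays are assumptions for illustration and do not mirror the libxl code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical logger: prints the NULL-terminated args and env arrays
 * and returns the number of strings printed. env may be NULL. */
static int log_hotplug_invocation(FILE *out, const char *const *args,
                                  const char *const *env)
{
    int printed = 0;

    assert(out != NULL && args != NULL);

    for (size_t i = 0; args[i] != NULL; i++, printed++)
        fprintf(out, "arg[%zu]: %s\n", i, args[i]);

    /* env is optional: skip the loop when no environment was provided. */
    if (env != NULL) {
        for (size_t i = 0; env[i] != NULL; i++, printed++)
            fprintf(out, "env[%zu]: %s\n", i, env[i]);
    }

    return printed;
}
```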
Jim Fehlig [Fri, 29 Jul 2016 22:56:22 +0000 (16:56 -0600)]
docs: define semantics of vncpasswd in xl.cfg
A recent discussion around LSN-2016-0001 [1] included defining
the semantics of an empty string for a VNC password. It was stated
that "libxl interprets an empty password in the caller's
configuration to mean that passwordless access should be permitted".
The same applies for vncpasswd setting in xl.cfg. This patch
extends the xl.cfg documentation to define the semantics of setting
vncpasswd to an empty string.
Signed-off-by: Jim Fehlig <jfehlig@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 1 Aug 2016 12:36:44 +0000 (13:36 +0100)]
xen/types: Correct the definition of uintptr_t
uintptr_t is specified as unsigned int in 32bit, not unsigned long. This is
why, when copying inttypes.h from GCC, the use of PRIxPTR and similar is
broken for 32bit builds.
Use __attribute__((__mode__(__pointer__))) to get the compiler's default
pointer type, which matches the pre-existing inttypes.h.
Fix the identified breakage with ELF_PRPTRVAL.
Compile tested on all architectures, with a manual printk() to trigger any
potential -Wformat issues.
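A sketch of the technique with GCC or Clang. The type name sketch_uintptr_t and the ptr_to_int() helper are made up so as not to clash with the real uintptr_t:

```c
#include <assert.h>

/* GCC/Clang extension: let the compiler choose the unsigned integer
 * type whose width matches a pointer on the target. */
typedef unsigned long sketch_uintptr_t __attribute__((__mode__(__pointer__)));

/* Checked at compile time: the type really is pointer-sized. */
static_assert(sizeof(sketch_uintptr_t) == sizeof(void *),
              "sketch_uintptr_t must be pointer-sized");

/* Round-trip helper used only to exercise the typedef. */
static sketch_uintptr_t ptr_to_int(const void *p)
{
    return (sketch_uintptr_t)p;
}
```

Because the compiler picks the underlying type, format macros that were written against the compiler's own notion of the pointer-sized integer stay consistent with the typedef on both 32-bit and 64-bit builds.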
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Chao Gao [Mon, 1 Aug 2016 16:22:54 +0000 (18:22 +0200)]
x86/vMSI-x: check whether msixtbl_list in msixtbl_pt_register()
MSI-x tables' initialization had been deferred in the commit 74c6dc2d0ac4dcab0c6243cdf6ed550c1532b798. If an assigned device does not support
MSI-x, the msixtbl_list won't be initialized. However, the following paths
XEN_DOMCTL_bind_pt_irq
pt_irq_create_bind
msixtbl_pt_register
do not check this case. Some errors (malware, etc.) may lead to calling
XEN_DOMCTL_bind_pt_irq without a clear gtable and will cause Xen to panic.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 1 Aug 2016 16:21:37 +0000 (18:21 +0200)]
mwait-idle: correct/improve BXT support
Linux commit 5dcef69486 ("intel_idle: add BXT support") added an
8-element lookup array with just a 2-bit value used for lookups. As per
the SDM that bit field is really 3 bits wide. Since the top two array
entries are zero, deal with the resulting invalid (zero) values by
moving the zero-MSR-value check into irtl_2_usec() and having that
function's caller check its result instead.
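A sketch of the resulting shape, loosely modelled on the Linux intel_idle
helper. The demo_* names are hypothetical, and the field layout (10-bit
value in bits 9:0, 3-bit unit selector in bits 12:10) and the unit table are
assumptions to be checked against the SDM:

```c
#include <stdint.h>

/*
 * Time unit in ns for each value of the 3-bit unit field. The top two
 * entries are zero, so those encodings yield an invalid (zero) result.
 */
static const unsigned int demo_irtl_ns_units[8] = {
    1, 32, 1024, 32768, 1048576, 33554432, 0, 0
};

/* Convert a raw IRTL MSR value to microseconds; 0 means "not usable". */
static inline uint64_t demo_irtl_2_usec(uint64_t irtl)
{
    uint64_t ns;

    if ( !irtl )
        return 0;

    /* Mask with 0x7 (3 bits), not 0x3 (2 bits). */
    ns = demo_irtl_ns_units[(irtl >> 10) & 0x7];

    return (irtl & 0x3ff) * ns / 1000;
}

/* The caller checks the result rather than the raw MSR value. */
static inline void demo_update_exit_latency(uint64_t irtl,
                                            unsigned int *exit_latency)
{
    uint64_t usec = demo_irtl_2_usec(irtl);

    if ( usec )
        *exit_latency = (unsigned int)usec;
}
```

With a 2-bit mask, a unit selector of 6 would have aliased to entry 2 and
produced a bogus non-zero latency; with the 3-bit mask it hits a zero entry
and the caller leaves the existing latency untouched.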
Chris Patterson [Wed, 27 Jul 2016 20:01:26 +0000 (16:01 -0400)]
libxl: compilation warning fix for arm & aarch64
GCC 6 will warn on unused static const variables in C modules:
https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00847.html
When compiling with LIBXL_HAVE_NO_SUSPEND_RESUME set (arm & aarch64),
the compiler emits the following errors:
xl_cmdimpl.c:101:19: error: 'migrate_report'
defined but not used [-Werror=unused-const-variable=]
xl_cmdimpl.c:99:19: error: 'migrate_permission_to_go'
defined but not used [-Werror=unused-const-variable=]
xl_cmdimpl.c:97:19: error: 'migrate_receiver_ready'
defined but not used [-Werror=unused-const-variable=]
xl_cmdimpl.c:95:19: error: 'migrate_receiver_banner'
defined but not used [-Werror=unused-const-variable=]
These const variables are only used in functions which exist inside
the ifndef block:
#ifndef LIBXL_HAVE_NO_SUSPEND_RESUME
...
#endif
Wrap the same ifndef around these variables.
Signed-off-by: Chris Patterson <pattersonc@ainfosec.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Mon, 25 Jul 2016 15:13:13 +0000 (16:13 +0100)]
xsm: don't require configuring tools to build xen xsm blob
Starting from 08cffe66 ("xsm: add a default policy to .init.data") we
can attach an XSM policy blob to the hypervisor. To build that policy blob,
the hypervisor build system now needs to enter the tools directory.
The expectations for the hypervisor and tools build systems are different. We
don't want the xen build system to depend on configure, but we want the tools
build system to. That commit broke this expectation because it required
users to run configure before building the hypervisor. This broke the ARM build
because ARM developers normally build the hypervisor and tools separately
(and possibly on different platforms). It can also break x86 if
developers don't run configure before building the hypervisor with XSM on.
To fix it, move the major part of tools/flask/policy/Makefile into
Makefile.common and create a tools-only Makefile to include that common
Makefile. The hypervisor Makefile will use Makefile.common to build the xsm
policy.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Tested-by: Julien Grall <julien.grall@arm.com>
xen/arm: p2m: Inline p2m_load_VTTBR into p2m_restore_state
p2m_restore_state is the last caller of p2m_load_VTTBR and already checks
that the vCPU does not belong to the idle domain.
Note that it is likely possible to remove some isb in the function
p2m_restore_state, however this is not the purpose of this patch. So the
numerous isb have been left in place.
xen/arm: p2m: Rework the context switch to another VTTBR in flush_tlb_domain
The current implementation of flush_tlb_domain relies on the domain
having a single p2m. With the upcoming altp2m feature, a single domain
may have several p2ms, so we need to switch to the correct p2m in
order to flush the TLBs.
Rather than checking whether the domain is not the current domain, check
whether the VTTBR is different. The resulting assembly code is much
smaller: from 38 instructions (+ 2 function calls) to 22 instructions.
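The control flow can be modelled roughly as below. This is a sketch, not
the Xen implementation: the register is simulated by a plain variable, the
barriers and interrupt masking are omitted, and all demo_* names are
hypothetical.

```c
#include <stdint.h>

static uint64_t demo_hw_vttbr;          /* stand-in for the VTTBR_EL2 register */
static unsigned int demo_vttbr_writes;  /* counts simulated register writes */
static unsigned int demo_tlb_flushes;   /* counts simulated TLB flushes */

static inline void demo_write_vttbr(uint64_t v)
{
    demo_hw_vttbr = v;
    demo_vttbr_writes++;
}

/* Flush the TLBs for the p2m identified by p2m_vttbr. */
static inline void demo_flush_tlb_domain(uint64_t p2m_vttbr)
{
    uint64_t ovttbr = demo_hw_vttbr;

    /* Only switch if a different VTTBR is currently loaded. */
    if ( ovttbr != p2m_vttbr )
        demo_write_vttbr(p2m_vttbr);

    demo_tlb_flushes++;                 /* the flush itself */

    /* Restore the previous VTTBR if we switched away from it. */
    if ( ovttbr != demo_hw_vttbr )
        demo_write_vttbr(ovttbr);
}
```

Comparing register values instead of domain pointers means the same check
keeps working once one domain can own more than one p2m.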
xen/arm: p2m: Don't need to restore the state for an idle vCPU.
The function p2m_restore_state could be called with an idle vCPU as
argument (when called by construct_dom0). However, we will never return
to EL0/EL1 in this case, so it is not necessary to restore the p2m
registers.
xen/arm: p2m: Move the vttbr field from arch_domain to p2m_domain
The field vttbr holds the base address of the translation table for
the guest. Its value depends on how the p2m has been initialized and
will only be used by the P2M code.
So move the field from arch_domain to p2m_domain. This will also ease
the implementation of altp2m.
xen/arm: p2m: Switch the p2m lock from spinlock to rwlock
P2M reads do not need to be serialized. Serializing them adds contention
when PV drivers are using multi-queue because parallel grant
map/unmaps/copies will happen on DomU's p2m.
The p2m is not yet in use when p2m_init and p2m_allocate_table are
called. Furthermore the p2m is not used anymore when p2m_teardown is
called. So taking the p2m lock is not necessary.
xen/arm: p2m: Find the memory attributes based on the p2m type
Currently, mfn_to_p2m_entry relies on the caller to provide the
correct memory attribute and deduces the shareability from it.
Some of the callers, such as p2m_create_table, use the same memory
attribute regardless of the underlying p2m type. For instance, this
changes the memory attribute from MATTR_DEV to MATTR_MEM when
an MMIO superpage is shattered.
Furthermore, it makes it more difficult to support different shareability
with the same memory attribute.
All the memory attributes could be deduced via the p2m type. This will
simplify the code by dropping one parameter.
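A minimal sketch of deducing the attribute from the type. The enum values
and the particular type-to-attribute choices are hypothetical and only
mirror the idea, not the real mfn_to_p2m_entry:

```c
/* Hypothetical subset of p2m types. */
enum demo_p2m_type {
    DEMO_P2M_INVALID,
    DEMO_P2M_RAM_RW,
    DEMO_P2M_RAM_RO,
    DEMO_P2M_MMIO_DIRECT_NC,    /* non-cacheable MMIO */
    DEMO_P2M_MMIO_DIRECT_C,     /* cacheable MMIO */
};

/* Hypothetical stand-ins for MATTR_DEV and MATTR_MEM. */
enum demo_mattr {
    DEMO_MATTR_DEV,
    DEMO_MATTR_MEM,
};

/*
 * The caller no longer passes a memory attribute; it is derived from the
 * p2m type, so shattering a superpage cannot silently change it.
 */
static inline enum demo_mattr demo_mattr_for_type(enum demo_p2m_type t)
{
    switch ( t )
    {
    case DEMO_P2M_MMIO_DIRECT_NC:
        return DEMO_MATTR_DEV;

    default:
        return DEMO_MATTR_MEM;
    }
}
```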
xen/arm: p2m: Differentiate cacheable vs non-cacheable MMIO
Currently, the p2m type p2m_mmio_direct is used to map in stage-2
cacheable MMIO (via map_regions_rw_cache) and non-cacheable one (via
map_mmio_regions). The p2m code is relying on the caller to give the
correct memory attribute.
In a follow-up patch, the p2m code will rely on the p2m type to find the
correct memory attribute. In preparation for this, introduce
p2m_mmio_direct_nc and p2m_mmio_direct_c to differentiate the
cacheability of the MMIO.
xen/arm: p2m: Use a whitelist rather than blacklist in get_page_from_gfn
Currently, the check in get_page_from_gfn is using a blacklist. This is
very fragile because we may forget to update the check when a new p2m
type is added.
To avoid any possible issue, use a whitelist. All types backed by a RAM
page could potentially be valid. The check is borrowed from x86.
Note that with this change, it is no longer possible to retrieve a page when
the p2m type is p2m_iommu_map_*. This is fine because they are special
mappings for the direct mapping workaround and the associated GFN should not be
used at all by callers of get_page_from_gfn.
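The whitelist idea can be sketched with a bitmask of accepted types. The
demo_* names and the list of types are hypothetical, not the actual Xen
definitions:

```c
#include <stdbool.h>

/* Hypothetical p2m types, including one added "later". */
enum demo_pt {
    DEMO_PT_INVALID,
    DEMO_PT_RAM_RW,
    DEMO_PT_RAM_RO,
    DEMO_PT_MMIO_DIRECT,
    DEMO_PT_IOMMU_MAP_RW,
    DEMO_PT_IOMMU_MAP_RO,
    DEMO_PT_FUTURE,             /* a new type nobody updated the check for */
};

#define DEMO_PT_BIT(t)      (1u << (t))

/* Whitelist: only types backed by a RAM page. */
#define DEMO_PT_RAM_TYPES   (DEMO_PT_BIT(DEMO_PT_RAM_RW) | \
                             DEMO_PT_BIT(DEMO_PT_RAM_RO))

/* A page may only be taken for whitelisted types. */
static inline bool demo_may_get_page(enum demo_pt t)
{
    return (DEMO_PT_BIT(t) & DEMO_PT_RAM_TYPES) != 0;
}
```

A type that is added later is rejected by default, which is the safe
failure mode a blacklist does not give.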
Commit d2412fd63b14c6c21d0a3d4367afa448425dfb8a ("libxl: move common
nic stuff into one source") introduced a double free error in libxl
which occurred during "xl save".
Correct this error.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
It is not possible to know which IRQs will be used by DOM0 when ACPI is
in use. The approach implemented by this patch will route all unused
IRQs to DOM0 before it has booted.
The number of IRQs routed is based on the maximum SPIs supported by the
hardware (up to ~1000). However, some of them might not be wired, so we
would allocate resources for nothing.
For each IRQ routed, Xen is allocating memory for irqaction (40 bytes)
and irq_guest (16 bytes). So in the worst case scenario ~54KB of memory
will be allocated. Given that ACPI will mostly be used by server, I
think it is a small drawback.
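The worst-case figure follows from the sizes quoted above. As a small
worked example (the demo_* names are hypothetical, and the byte sizes are
the ones stated in this message, not measured here):

```c
#include <stddef.h>

#define DEMO_IRQACTION_BYTES 40u    /* per-IRQ irqaction, as quoted above */
#define DEMO_IRQ_GUEST_BYTES 16u    /* per-IRQ irq_guest, as quoted above */

/* Memory allocated when nr_irqs IRQs are routed. */
static inline size_t demo_routing_bytes(size_t nr_irqs)
{
    return nr_irqs * (DEMO_IRQACTION_BYTES + DEMO_IRQ_GUEST_BYTES);
}
```

For 1000 routed IRQs this gives 56 * 1000 = 56000 bytes, i.e. roughly 54KB.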
map_irq_to_domain is slightly reworked to remove the dependency on
device-tree. So the function can also be used for ACPI and will
avoid code duplication.
The function route_irq_to_guest mandates the IRQ type, stored in
desc->arch.type, to be valid. However, in case of ACPI, this
information is not part of the static tables. Therefore Xen needs to
rely on DOM0 to provide a valid type based on the firmware tables.
A new helper, irq_type_set_by_domain is provided to check whether a
domain is allowed to set the IRQ type. For now, only DOM0 is allowed to
configure.
When the helper returns 1, the routing function will not check whether
the IRQ type is correctly set and will not configure the GIC. Instead, this
will be done when the domain enables the interrupt.
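A rough model of how the helper gates the type check during routing. All
demo_* names, the invalid-type marker and the error value are hypothetical,
and the GIC configuration itself is left out:

```c
#include <errno.h>
#include <stdbool.h>

#define DEMO_IRQ_TYPE_INVALID 0x10u     /* hypothetical "type not set" marker */

struct demo_dom {
    bool is_hardware_domain;            /* true for DOM0 in this model */
};

struct demo_irq_desc {
    unsigned int type;
};

/* For now, only DOM0 is allowed to set the IRQ type itself. */
static inline bool demo_irq_type_set_by_domain(const struct demo_dom *d)
{
    return d->is_hardware_domain;
}

/*
 * Routing only insists on a valid type when the domain cannot provide one
 * later; otherwise the check is deferred until the interrupt is enabled.
 */
static inline int demo_route_irq_to_guest(const struct demo_dom *d,
                                          const struct demo_irq_desc *desc)
{
    if ( !demo_irq_type_set_by_domain(d) &&
         desc->type == DEMO_IRQ_TYPE_INVALID )
        return -EINVAL;

    return 0;
}
```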
Note that irq_set_spi_type is not called because it validates the type
and does not allow the domain to change the type after the first
write. It means that desc->arch.type may never be set, which is fine
because the field is only used to configure the type during the routing.
Based on 4.3.13 in ARM IHI 0048B.b, changing the value of Int_config is
UNPREDICTABLE when the corresponding interrupt is not disabled.
Therefore, setting the IRQ type when the guest is writing into ICFGR
would require more work to make sure the IRQ has been disabled before
writing into the host ICFGR. As the behavior is UNPREDICTABLE, the type
will be set before enabling the physical IRQ associated to the virtual IRQ.
The callback set_irq_properties will configure the GIC for a specific
IRQ with the type and the priority.
In a follow-up patch, Xen will configure the type and the priority at
different stages of the routing. So split it into 2 separate callbacks.
At the same time, move the ASSERTs checking the validity of the type and
that desc->lock is held into the common code (gic.c). This is because
the constraints are the same between GICv2 and GICv3, however the driver
of the latter did not contain any sanity check.
xen/arm: gic: Do not configure affinity during routing
The affinity of a guest IRQ is set every time the guest enables it (see
vgic_enable_irqs).
It is not necessary to set the affinity when the IRQ is routed to the
guest because Xen will never receive the IRQ until it has been enabled
by the guest.
To keep gic_route_irq_to_{xen,guest} behaving the same way (i.e just
setting up the routing), the affinity of IRQ routed to Xen is moved into
__setup_irq.
Andrew Cooper [Fri, 15 Jul 2016 15:43:48 +0000 (16:43 +0100)]
xen/domctl: Add DOMINFO_hap to xen_domctl_getdomaininfo
This allows a toolstack to identify whether a running domain is using hardware
assisted paging or not.
The appropriate tests differ by architecture, so introduce
arch_get_domain_info(). ARM unconditionally sets the new flag, while x86
checks with the paging subsystem first.
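The per-architecture split can be sketched as below. The demo_* names, the
flag's bit position and the way HAP state is passed in are hypothetical and
only illustrate the described behaviour:

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_DOMINF_HAP (1u << 0)       /* hypothetical bit position */

struct demo_getdomaininfo {
    uint32_t flags;
};

/* ARM: stage-2 translation is always hardware assisted. */
static inline void demo_arch_get_domain_info_arm(struct demo_getdomaininfo *info)
{
    info->flags |= DEMO_DOMINF_HAP;
}

/* x86: ask the paging subsystem (modelled here as a boolean) first. */
static inline void demo_arch_get_domain_info_x86(bool hap_enabled,
                                                 struct demo_getdomaininfo *info)
{
    if ( hap_enabled )
        info->flags |= DEMO_DOMINF_HAP;
}
```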
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Julien Grall <julien.grall@arm.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>