Andre Przywara [Wed, 7 Sep 2016 00:49:37 +0000 (01:49 +0100)]
ARM: vITS: handle MAPTI/MAPI command
The MAPTI command associates a DeviceID/EventID pair with an LPI/CPU
pair and actually instantiates LPI interrupts. MAPI is just a variant
of this command, where the LPI ID is the same as the event ID.
We connect the already allocated host LPI to this virtual LPI, so that
any triggering LPI on the host can be quickly forwarded to a guest.
Beside entering the domain and the virtual LPI number in the respective
host LPI entry, we also initialize and add the already allocated
struct pending_irq to our radix tree, so that we can now easily find it
by its virtual LPI number.
We also read the property table to update the enabled bit and the
priority for our new LPI, as we might have missed this during an earlier
INVALL call (which only checks mapped LPIs). But we make sure that the
property table is actually valid, as all redistributors might still
be disabled at this point.
Since write_itte() now sees its first usage, we change the declaration
to static.
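As a rough illustration of that flow, here is a hedged sketch in C; the
helper names (its_get_event_pirq(), gicv3_lpi_update_host_entry()) and the
pend_lpi_tree field are assumptions for illustration, not necessarily the
actual Xen identifiers:
    /* Hypothetical sketch of the MAPTI flow described above. */
    static int handle_mapti(struct domain *d, struct virt_its *its,
                            uint32_t devid, uint32_t eventid,
                            uint32_t vlpi, uint32_t host_lpi)
    {
        /* One struct pending_irq per event was allocated at MAPD time. */
        struct pending_irq *pirq = its_get_event_pirq(its, devid, eventid);
        int ret;

        if ( !pirq )
            return -ENODEV;

        /* (Re-)initialise the struct for its new virtual LPI number. */
        vgic_init_pending_irq(pirq, vlpi);

        /* Make the LPI findable by its virtual number. */
        ret = radix_tree_insert(&d->arch.vgic.pend_lpi_tree, vlpi, pirq);
        if ( ret )
            return ret;

        /* Connect the already allocated host LPI to this domain/vLPI. */
        return gicv3_lpi_update_host_entry(host_lpi, d->domain_id, vlpi);
    }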
Andre Przywara [Wed, 12 Apr 2017 00:14:42 +0000 (01:14 +0100)]
ARM: GICv3: handle unmapped LPIs
When LPIs get unmapped by a guest, they might still be in some LR of
some VCPU. Nevertheless we remove the corresponding pending_irq
(possibly freeing it), and detect this case (irq_to_pending() returns
NULL) when the LR gets cleaned up later.
However a *new* LPI may get mapped with the same number while the old
LPI is *still* in some LR. To avoid getting the wrong state, we mark
every newly mapped LPI as PRISTINE, which means: has never been in an
LR before. If we detect the LPI in an LR anyway, it must have been an
older one, which we can simply retire.
Before inserting such a PRISTINE LPI into an LR, we must make sure that
it's not already in another LR, as the architecture forbids two
interrupts with the same virtual IRQ number on one CPU.
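Sketched in C, the LR clean-up check might look like this; the
GIC_IRQ_GUEST_PRISTINE_LPI bit name follows the description above, while
the surrounding helpers are placeholders:
    /* Inside the LR clean-up path; p is the pending_irq for this vIRQ. */
    if ( test_bit(GIC_IRQ_GUEST_PRISTINE_LPI, &p->status) )
    {
        /*
         * A PRISTINE LPI has never been in an LR before, so whatever we
         * found in this LR belongs to a previous, meanwhile unmapped LPI
         * with the same number: simply retire it.
         */
        gic_hw_ops->clear_lr(lr);
        return;
    }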
Andre Przywara [Wed, 7 Sep 2016 00:48:40 +0000 (01:48 +0100)]
ARM: vITS: handle MAPD command
The MAPD command maps a device by associating a memory region for
storing ITEs with a certain device ID. Since it features a valid bit,
MAPD also covers the "unmap" functionality, which we handle here as well.
We store the given guest physical address in the device table, and, if
this command comes from Dom0, tell the host ITS driver about this new
mapping, so it can issue the corresponding host MAPD command and create
the required tables. We take care of rolling back actions should one
step fail.
Upon unmapping a device we make sure we clean up all associated
resources and release the memory again.
We use our existing guest memory access function to find the right ITT
entry and store the mapping there (in guest memory).
Andre Przywara [Wed, 7 Sep 2016 00:45:48 +0000 (01:45 +0100)]
ARM: vITS: handle CLEAR command
This introduces the ITS command handler for the CLEAR command, which
clears the pending state of an LPI.
This removes a not-yet injected, but already queued IRQ from a VCPU.
Andre Przywara [Wed, 7 Sep 2016 00:47:49 +0000 (01:47 +0100)]
ARM: vITS: handle MAPC command
The MAPC command associates a given collection ID with a given
redistributor, thus mapping collections to VCPUs.
We just store the vcpu_id in the collection table for that.
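In its simplest form this is just an indexed store; a hedged sketch
follows (table layout and names are illustrative, and the real table may
live in guest memory, accessed via the guest memory helpers):
    static int its_map_collection(struct virt_its *its, uint16_t collid,
                                  uint16_t vcpu_id)
    {
        if ( collid >= its->max_collections )
            return -ENOENT;

        its->coll_table[collid] = vcpu_id;   /* collection -> VCPU */
        return 0;
    }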
Andre Przywara [Wed, 7 Sep 2016 00:47:06 +0000 (01:47 +0100)]
ARM: vITS: handle INT command
The INT command sets a given LPI identified by a DeviceID/EventID pair
as pending and thus triggers it to be injected.
Since read_itte() now sees its first use, we add the static keyword.
Andre Przywara [Wed, 10 May 2017 15:36:01 +0000 (16:36 +0100)]
ARM: vITS: provide access to struct pending_irq
For each device we allocate one struct pending_irq for each virtual
event (MSI).
Provide a helper function which returns the pointer to the appropriate
struct, to be able to find the right struct when given a virtual
deviceID/eventID pair.
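A minimal sketch of such a helper, assuming the per-device pending_irq
array is allocated contiguously at mapping time (all names here are
illustrative):
    struct pending_irq *its_get_event_pirq(struct virt_its *its,
                                           uint32_t devid, uint32_t eventid)
    {
        struct its_device *dev = its_get_device(its, devid);

        if ( !dev || eventid >= dev->nr_events )
            return NULL;

        /* One struct pending_irq per virtual event (MSI). */
        return &dev->pend_irqs[eventid];
    }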
Andre Przywara [Thu, 26 Jan 2017 15:34:19 +0000 (15:34 +0000)]
ARM: vITS: introduce translation table walks
The ITS stores the target (v)CPU and the (virtual) LPI number in tables.
Introduce functions to walk those tables and translate a device ID/event
ID pair into a pair of virtual LPI and vCPU.
We map those tables on demand - which is cheap on arm64 - and copy the
respective entries before using them, to avoid the guest tampering with
them meanwhile.
To allow compiling without warnings, we declare two functions as
non-static for the moment, which two later patches will fix.
Andre Przywara [Wed, 5 Apr 2017 08:23:32 +0000 (09:23 +0100)]
ARM: vITS: add command handling stub and MMIO emulation
Emulate the memory mapped ITS registers and provide a stub to introduce
the ITS command handling framework (but without actually emulating any
commands at this time).
This fixes a misnomer in our virtual ITS structure, where the spec is
confusingly using ID_bits in GITS_TYPER to denote the number of event IDs
(in contrast to GICD_TYPER, where it means number of LPIs).
Andre Przywara [Mon, 5 Sep 2016 12:57:20 +0000 (13:57 +0100)]
ARM: vGIC: advertise LPI support
To let a guest know about the availability of virtual LPIs, set the
respective bits in the virtual GIC registers and let a guest control
the LPI enable bit.
Only report the LPI capability if there is at least one ITS emulated
for that guest (which depends on the host having an ITS at the moment).
For Dom0 we report the same number of interrupt identifiers as the
host, whereas DomUs get a number fixed at 10 bits for the moment, which
covers all SPIs. Also we fix a slight inaccuracy here, since the
number of interrupt identifiers specified in GICD_TYPER depends on the
stream interface and is independent of the number of actually wired
SPIs.
This also removes a "TBD" comment, as we now populate the processor
number in the GICR_TYPER register, which will be used by the ITS
emulation later on.
Andre Przywara [Thu, 6 Apr 2017 19:56:31 +0000 (20:56 +0100)]
ARM: vGICv3: re-use vgic_reg64_check_access
vgic_reg64_check_access() checks for a valid access width of a 64-bit
MMIO register, which is useful beyond the current GICv3 emulation only.
Move this function to the vgic-emul.h to be easily reusable.
This function allows copying a chunk of data from and to guest physical
memory. It looks up the associated page from the guest's p2m tree
and maps this page temporarily for the time of the access.
This function was originally written by Vijaya as part of an earlier series:
https://patchwork.kernel.org/patch/8177251
Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Julien Grall <julien.grall@arm.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
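The core idea described above is a page-at-a-time loop over the guest's
p2m; this is a hedged reconstruction of such a helper, not the verbatim
Xen code:
    int access_guest_memory_by_ipa(struct domain *d, paddr_t gpa,
                                   void *buf, uint32_t size, bool is_write)
    {
        uint8_t *ptr = buf;

        while ( size > 0 )
        {
            unsigned int offset = gpa & ~PAGE_MASK;
            uint32_t len = size;
            struct page_info *page;
            void *va;

            if ( len > PAGE_SIZE - offset )
                len = PAGE_SIZE - offset;

            /* Look up (and take a reference on) the page in the p2m. */
            page = get_page_from_gfn(d, paddr_to_pfn(gpa), NULL, P2M_ALLOC);
            if ( !page )
                return -EINVAL;

            /* Map the page just for the duration of this access. */
            va = __map_domain_page(page);
            if ( is_write )
                memcpy((uint8_t *)va + offset, ptr, len);
            else
                memcpy(ptr, (uint8_t *)va + offset, len);
            unmap_domain_page(va);
            put_page(page);

            gpa += len;
            ptr += len;
            size -= len;
        }

        return 0;
    }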
Andre Przywara [Mon, 22 Aug 2016 16:35:44 +0000 (17:35 +0100)]
ARM: vGICv3: handle virtual LPI pending and property tables
Allow a guest to provide the address and size for the memory regions
it has reserved for the GICv3 pending and property tables.
We sanitise the various fields of the respective redistributor
registers.
The MMIO read and write accesses are protected by locks, to avoid any
changing of the property or pending table address while a redistributor
is live and also to protect the non-atomic vgic_reg64_extract() function
on the MMIO read side.
Andre Przywara [Mon, 5 Sep 2016 12:46:37 +0000 (13:46 +0100)]
ARM: GICv3: forward pending LPIs to guests
Upon receiving an LPI on the host, we need to find the right VCPU and
virtual IRQ number to get this IRQ injected.
Iterate our two-level LPI table to find the domain ID and the virtual
LPI number quickly when the host takes an LPI. We then look up the
right VCPU in the struct pending_irq.
We use the existing injection function to let the GIC emulation deal
with this interrupt.
This introduces do_LPI() as a hardware gic_ops hook.
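Sketched below (with assumed field and helper names), the forwarding path
boils down to:
    void gicv3_do_LPI(unsigned int lpi)
    {
        /* Walk the two-level host LPI table for this interrupt. */
        union host_lpi *hlpi = gic_get_host_lpi(lpi);
        struct domain *d;

        if ( !hlpi || hlpi->virt_lpi == INVALID_LPI )
            return;                          /* LPI not mapped: ignore */

        d = rcu_lock_domain_by_id(hlpi->dom_id);
        if ( !d )
            return;

        /* The target VCPU is looked up via the struct pending_irq. */
        vgic_vcpu_inject_lpi(d, hlpi->virt_lpi);

        rcu_unlock_domain(d);
    }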
Andre Przywara [Wed, 7 Jun 2017 15:28:37 +0000 (16:28 +0100)]
ARM: GIC: ITS: remove no longer needed VCPU ID in host LPI entry
So far we stored the VCPU ID in the host LPI entry, to get easy access
to the VCPU that a forwarded LPI should be injected to.
However this creates a redundancy, since we keep the target VCPU in
the struct pending_irq already, which we can easily look up given the
domain and the virtual LPI number.
Apart from removing the redundancy this avoids having to update this
information later and keeping it in sync in a race-free fashion.
Since this information has never actually been used, this patch does
not change any behaviour; it just removes the declaration and initialization.
Andre Przywara [Fri, 7 Apr 2017 10:23:53 +0000 (11:23 +0100)]
ARM: vGIC: add LPI VCPU ID to struct pending_irq
The target CPU for an LPI is encoded in the interrupt translation table
entry, so can't be easily derived from just an LPI number (short of
walking *all* tables and finding the matching LPI).
To avoid this in case we need to know the VCPU (for the INVALL command,
for instance), put the VCPU ID in the struct pending_irq, so that it is
easily accessible.
We use the remaining 8 bits of padding space for that to avoid enlarging
the size of struct pending_irq. The number of VCPUs is limited to 127
at the moment anyway, which we also confirm with a BUILD_BUG_ON.
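Conceptually (the exact field layout is an assumption here), this amounts
to:
    struct pending_irq
    {
        unsigned long status;
        struct irq_desc *desc;
        unsigned int irq;
        uint8_t lpi_vcpu_id;   /* target VCPU of an LPI, fits the padding */
        uint8_t priority;
        /* ... list heads and the rest of the struct ... */
    };

    /* In some init function: the VCPU ID must fit the 8-bit field. */
    BUILD_BUG_ON(MAX_VIRT_CPUS > 127);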
Andre Przywara [Mon, 5 Sep 2016 12:46:37 +0000 (13:46 +0100)]
ARM: vGIC: cache virtual LPI priority in struct pending_irq
We enhance struct pending_irq to cache the priority information
for LPIs. Reading the information from there is faster than accessing
the property table from guest memory. Also it uses some padding area in
the struct, so it does not require more memory.
This introduces the function to retrieve the LPI priority as a vgic_ops.
Also this moves the vgic_get_virq_priority() call in
vgic_vcpu_inject_irq() to happen after the NULL check of the pending_irq
pointer, so we can rely on the pointer in the new function.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Fri, 7 Apr 2017 10:23:53 +0000 (11:23 +0100)]
ARM: GIC: export and extend vgic_init_pending_irq()
For LPIs we later want to dynamically allocate struct pending_irqs.
So besides needing to initialize the struct from there, we also need
to clean it up and re-initialize it later on.
Export vgic_init_pending_irq() and extend it to be reusable.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Mon, 5 Sep 2016 13:13:22 +0000 (14:13 +0100)]
ARM: GICv3: introduce separate pending_irq structs for LPIs
For the same reason that allocating a struct irq_desc for each
possible LPI is not an option, having a struct pending_irq for each LPI
is also not feasible. We only care about mapped LPIs, so we can get away
with having struct pending_irq's only for them.
Maintain a radix tree per domain where we drop the pointer to the
respective pending_irq. The index used is the virtual LPI number.
The memory for the actual structures has been allocated already per
device at device mapping time.
Teach the existing VGIC functions to find the right pointer when being
given a virtual LPI number.
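A hedged sketch of the extended lookup; LPI_OFFSET marks the start of the
LPI number space and the field names follow the text above:
    struct pending_irq *irq_to_pending(struct vcpu *v, unsigned int irq)
    {
        /* LPIs are dynamic, so their pending_irqs live in the radix tree. */
        if ( irq >= LPI_OFFSET )
            return radix_tree_lookup(&v->domain->arch.vgic.pend_lpi_tree,
                                     irq);

        if ( irq < 32 )                               /* SGIs and PPIs */
            return &v->arch.vgic.pending_irqs[irq];

        return &v->domain->arch.vgic.pending_irqs[irq - 32];   /* SPIs */
    }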
Andre Przywara [Wed, 5 Apr 2017 18:36:52 +0000 (19:36 +0100)]
ARM: GIC: Add checks for NULL pointer pending_irq's
For LPIs the struct pending_irq's are dynamically allocated and the
pointers will be stored in a radix tree. Since an LPI can be "unmapped"
at any time, teach the VGIC how to deal with irq_to_pending() returning
a NULL pointer.
We just do nothing in this case or clean up the LR if the virtual LPI
number was still in an LR.
Those are all call sites for irq_to_pending(), as per:
"git grep irq_to_pending", and their evaluations:
(PROTECTED means: added a NULL check and bailing out; see the sketch after this list)
xen/arch/arm/gic.c:
gic_route_irq_to_guest(): only called for SPIs, added ASSERT()
gic_remove_irq_from_guest(): only called for SPIs, added ASSERT()
gic_remove_from_lr_pending(): PROTECTED, called within VCPU VGIC lock
gic_raise_inflight_irq(): PROTECTED, called under VCPU VGIC lock
gic_raise_guest_irq(): PROTECTED, called under VCPU VGIC lock
gic_update_one_lr(): PROTECTED, called under VCPU VGIC lock
xen/arch/arm/vgic.c:
vgic_migrate_irq(): not called for LPIs (virtual IRQs), added ASSERT()
arch_move_irqs(): not iterating over LPIs, LPI ASSERT already in place
vgic_disable_irqs(): not called for LPIs, added ASSERT()
vgic_enable_irqs(): not called for LPIs, added ASSERT()
vgic_vcpu_inject_irq(): PROTECTED, moved under VCPU VGIC lock
xen/include/asm-arm/event.h:
local_events_need_delivery_nomask(): only called for a PPI, added ASSERT()
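The PROTECTED pattern above, as a hedged sketch (the lock member name is
assumed):
    void some_vgic_operation(struct vcpu *v, unsigned int virq)
    {
        struct pending_irq *p;
        unsigned long flags;

        spin_lock_irqsave(&v->arch.vgic.lock, flags);

        p = irq_to_pending(v, virq);
        if ( p == NULL )         /* the LPI got unmapped behind our back */
            goto out;

        /* ... operate on p, only while still holding the lock ... */

     out:
        spin_unlock_irqrestore(&v->arch.vgic.lock, flags);
    }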
Andre Przywara [Thu, 25 May 2017 18:07:29 +0000 (19:07 +0100)]
ARM: vGIC: introduce gic_remove_irq_from_queues()
To avoid code duplication in a later patch, introduce a generic function
to remove a virtual IRQ from the VGIC.
Call that function instead of the open-coded version in vgic_migrate_irq().
Andre Przywara [Mon, 10 Apr 2017 18:05:16 +0000 (19:05 +0100)]
ARM: vGIC: move irq_to_pending() calls under the VGIC VCPU lock
So far irq_to_pending() is just a convenience function to lookup
statically allocated arrays. This will change with LPIs, which are
more dynamic, so the memory for their struct pending_irq might go away.
The proper answer to the issue of preventing stale pointers is
ref-counting, which requires more extensive changes and will be
introduced later on.
For now move the irq_to_pending() calls that are used with LPIs under the
VGIC VCPU lock, and only use the returned pointer while holding the lock.
This prevents the memory from being freed while we use it.
For the sake of completeness we take care of all irq_to_pending()
users, even those which later will never deal with LPIs.
Document the limits of vgic_num_irqs().
Andre Przywara [Thu, 25 May 2017 18:06:41 +0000 (19:06 +0100)]
ARM: vGIC: rework gic_remove_from_queues()
The function name gic_remove_from_queues() was a bit of a misnomer,
since it just removes an IRQ from the pending queue, not both queues.
Rename the function to make this more clear, also give it a pointer to
a struct pending_irq directly and rely on the VGIC VCPU lock to be
already taken, so this can be used in more places. This results in the
lock now being taken in the caller instead.
Replace the list removal in gic_clear_pending_irqs() with a call to
this function.
Andre Przywara [Mon, 10 Apr 2017 16:05:39 +0000 (17:05 +0100)]
ARM: GICv3: setup number of LPI bits for a GICv3 guest
The host supports a certain number of LPI identifiers, as stored in
the GICD_TYPER register.
Store this number from the hardware register in vgic_v3_hw to allow
injecting the very same number into a guest (Dom0).
DomUs get the legacy number of 10 bits here, since for now they only see
SPIs, so they do not need more. This should be revisited once we get
proper DomU ITS support.
Andre Przywara [Wed, 14 Sep 2016 13:47:19 +0000 (14:47 +0100)]
ARM: GICv3: enable LPIs on the host
Now that the host part of the ITS code is in place, we can enable the
LPIs on each redistributor to get the show rolling.
At this point there would be no LPIs mapped, as guests don't know about
the ITS yet.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Andre Przywara [Wed, 19 Apr 2017 16:30:49 +0000 (17:30 +0100)]
ARM: GICv3: enable ITS on the host
Even though the ITS emulation is not yet in place, the host ITS already
gets initialized and Xen tries to map the host collections.
However for commands to be processed we need to *enable* the ITS, which
will be done in a later patch not yet merged.
So those MAPC commands are not processed and run into a timeout, leading
to a panic on machines which advertise an ITS in their DT.
This patch just enables the ITS (but not the LPIs on each redistributor),
to get those MAPC commands executed.
This fixes booting Xen on ARM64 machines with an ITS and the
(EXPERT) ITS Kconfig option enabled.
Andre Przywara [Wed, 24 May 2017 00:07:00 +0000 (01:07 +0100)]
ARM: vGIC: avoid rank lock when reading priority
When reading the priority value of a virtual interrupt, we were taking
the respective rank lock so far.
However for forwarded interrupts (Dom0 only so far) this may lead to a
deadlock with the following call chain:
- MMIO access to change the IRQ affinity, calling the ITARGETSR handler
- this handler takes the appropriate rank lock and calls vgic_store_itargetsr()
- vgic_store_itargetsr() will eventually call vgic_migrate_irq()
- if this IRQ is already in-flight, it will remove it from the old
VCPU and inject it into the new one, by calling vgic_vcpu_inject_irq()
- vgic_vcpu_inject_irq will call vgic_get_virq_priority()
- vgic_get_virq_priority() tries to take the rank lock - again!
It seems like this code path has never been exercised before.
Fix this by avoiding taking the lock in vgic_get_virq_priority() (like we
do in vgic_get_target_vcpu()).
Actually we are just reading one byte, and priority changes while
interrupts are handled are a benign race that can happen on real hardware
too. So it is safe to just prevent the compiler from reading from the
struct more than once.
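In code this amounts to reading the byte through a single-access wrapper
instead of taking the rank lock; a hedged sketch (field names assumed):
    static uint8_t vgic_get_virq_priority(struct vcpu *v, unsigned int virq)
    {
        struct vgic_irq_rank *rank = vgic_rank_irq(v, virq);

        /*
         * No rank lock: racing against a priority update is benign, but
         * make sure the compiler reads the byte exactly once.
         */
        return ACCESS_ONCE(rank->priority[virq % 32]);
    }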
Ian Jackson [Wed, 7 Jun 2017 14:05:44 +0000 (15:05 +0100)]
Makefile: Provide way to ship livepatch test files
In the toplevel Makefile, provide build-tests and install-tests
targets which descend into xen/test. (dist-tests is provided
automatically by the pattern rule, as is the convention here.)
We have to set BASEDIR ourselves, and use these curious runes, because
the convention in Makefiles under xen/ is to "make -f Rules.mk" with
BASEDIR set and to expect Rules.mk to reinvoke the per-directory
Makefile. (This is really very strange.) Normally this invocation
pattern is organised by the machinery in xen/Makefile (which sets
BASEDIR) and Rules.mk, but we need to invoke it from outside that
context.
In theory it would be nice to have a pattern rule %-tests. But this
is not the style in the rest of the toplevel Makefile; and doing that
might interfere with the dist-% pattern rule.
None of this is invoked by default. If install-tests or dist-tests is
requested, the livepatches (the only current output from xen/test)
are shipped in DESTDIR/usr/lib/debug/xen-livepatch/.
This allows CI systems such as osstest which are trying to consume
this to arrange for the files to be built, and output, without them
having to have special knowledge of the details of Xen's build system.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Ian Jackson [Wed, 7 Jun 2017 14:09:57 +0000 (15:09 +0100)]
xen/test/livepatch: Add xen_nop.livepatch to .gitignore
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Ian Jackson [Wed, 7 Jun 2017 13:44:51 +0000 (14:44 +0100)]
xen/test/livepatch: Regularise Makefiles
In xen/test/livepatch/Makefile:
Provide a `build' target, as most of the
subdir-invoking Makefiles elsewhere expect.
In xen/test/Makefile:
Replace the two open-coded targets with a generalised pattern rule
which descends into each of SUBDIRS. This allows `install' to work
too (it is already supported by xen/test/livepatch/Makefile).
Provide an explicit default target of `tests', and an `all' target
(which is conventional).
Suppress entry into the xen/test/livepatch subdir when we are
building for i386, since the 32-bit hypervisor is not supported any
more and we can't build livepatches for it either.
After this, the xen/test subdirectory is somewhere where make can be
invoked in the way which is conventional for xen.git/xen/ subdirs.
None of this is yet invoked from the top-level Makefile.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Ian Jackson [Wed, 7 Jun 2017 14:00:17 +0000 (15:00 +0100)]
xen/test/livepatch/Makefile: Install in DESTDIR/usr/lib/debug/xen-livepatch
Dumping these patch files in /usr/lib/debug/xen-*.livepatch is a bit
ugly.
Also, refactor the Makefile to have a LIVEPATCHES variable, to reduce
repetition.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Have the caller indicate its buffer size, provide a means to query the
needed size, don't ignore the upper halves of type code and instance,
and don't copy partial data.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 13 Jun 2017 10:37:39 +0000 (11:37 +0100)]
MAINTAINERS: Move rombios and vgabios under x86 maintainership
alongside hvmloader.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Andrew Cooper [Fri, 2 Jun 2017 10:22:17 +0000 (11:22 +0100)]
x86/boot: Fix the boot time relocation calculations
c/s b28044226e1 "x86: make Xen early boot code relocatable" introduces
mov $sym_offs(__image_base__),%esi
to the legacy boot path. However, this is by definition 0, which means the
boot code only functions correctly when Xen is loaded at its preferred
physical address (2M at the time of writing).
Xen does cope if loaded at an alternative physical address, if the
MULTIBOOT2_TAG_TYPE_LOAD_BASE_ADDR tag is filled in properly. While recent
versions of Grub do fill this in appropriately, tboot does not. (In fact,
tboot loads Xen at the preferred address, but claims a load address of 8M.)
Both Multiboot 1 and 2 specify the execution environment as being flat. As a
result, Xen needs no help calculating the proper load address.
However, Multiboot specifies %esp as undefined. Experimentally, using the
entry %esp is fine, but this is certainly no guarantee. Use a temporary stack
in the first page of RAM, which is one of the safest areas to clobber.
Calculate the load address from %eip alone, and ignore
MULTIBOOT2_TAG_TYPE_LOAD_BASE_ADDR entirely. This fixes legacy boot under
various versions of tboot.
Finally, set up the stack as soon as possible, which means the BIOS path has a
usable stack for the entirety of its duration. Use the full available stack
size, rather than limiting to an arbitrary 1k. One side effect is that the
MB2/EFI path continues to use the EFI stack until the trampoline is entered.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Armando Vega [Thu, 8 Jun 2017 18:39:14 +0000 (20:39 +0200)]
xl.cfg man page cleanup and fixes
- fixed some minor numbering and syntax issues in the CPU allocation
examples for the 'cpus' option
- semantic fixes to make explanations more clear throughout
- fixed all the typos I could see
- general styling and makeup fixes to make everything look more consistent
Signed-off-by: Armando Vega <armando@greenhost.nl> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Tue, 13 Jun 2017 08:41:10 +0000 (10:41 +0200)]
x86/boot: re-arrange how/when we do disk I/O
We place the trampoline no lower than at 256k, so we have ample space
to read the MBRs of BIOS disks into an aligned buffer right below the
trampoline (not doing so has been found to be a problem on a buggy BIOS
coming with a Skull Canyon NUC). To facilitate that move MBR reading
past EDD info retrieval.
Also add a wrap check to the EDD info retrieval loop, to match that in
the MBR reading one.
Reported-by: Paul Durrant <Paul.Durrant@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Paul Durrant <Paul.Durrant@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 13 Jun 2017 08:39:52 +0000 (10:39 +0200)]
domctl: improve device assignment structure layout and use
Avoid needless gaps. Make flags field mandatory for all three
operations (and rename it to fit the intended future purpose of
possibly holding more than just one flag).
Also correct a typo in a related domctl.h comment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Tue, 13 Jun 2017 08:38:51 +0000 (10:38 +0200)]
x86: limit page type width
There's no reason to burn 4 bits on page type when we only have 7 types
(plus "none") at present. This requires changing one use of
PGT_shared_page, which so far assumed that the type is both a power of
2 and the only type with the high bit set.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 13 Jun 2017 08:38:02 +0000 (10:38 +0200)]
x86/HAP: avoid using bogus/misleading locking
hap_teardown() unconditionally releases the paging lock and is always
being called without the lock held: Lock acquire should then be
unconditional too.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
livepatch: Wrong usage of spinlock on debug console.
If we have a large amount of livepatches and want to print them
on the console using 'xl debug-keys x' we eventually hit
the preemption check:
    if ( i && !(i % 64) )
    {
        spin_unlock(&payload_lock);
        process_pending_softirqs();
        if ( spin_trylock(&payload_lock) )
            return
<facepalm> The effect is that we have just effectively
taken the lock and returned without unlocking!
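The fix inverts the condition, so the function only bails out when the
lock was NOT re-acquired; a sketch of the corrected check:
    if ( i && !(i % 64) )
    {
        spin_unlock(&payload_lock);
        process_pending_softirqs();
        if ( !spin_trylock(&payload_lock) )
            return;     /* lock not held, so returning is now safe */
    }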
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> CC: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Jan Beulich [Mon, 12 Jun 2017 07:32:14 +0000 (09:32 +0200)]
SVM: clean up svm_vmcb_isvalid()
- correct CR3, CR4, and EFER checks
- delete bogus nested paging check
- add vcpu parameter (to include in log messages) and constify vmcb one
- use bool/true/false
- use accessors (and local variables to improve code readability)
- adjust formatting
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Jan Beulich [Mon, 12 Jun 2017 07:30:53 +0000 (09:30 +0200)]
x86/mm: drop further relics of translated PV domains
For PV domains paging_mode_{refcounts,translate}() are always false as
of commits 4045953527 ("x86/paging: Enforce PG_external == PG_translate
== PG_refcounts") and 92942fd3d4 ("x86/mm: drop
guest_{map,get_eff}_l1e() hooks").
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 12 Jun 2017 07:29:45 +0000 (09:29 +0200)]
x86: get_page_from_gfn() should not return misleading type
It is not impossible that the page owner is dom_io. While no current
caller cares about this case, let's nevertheless return an appropriate
type even in that case.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 9 Jun 2017 12:13:24 +0000 (14:13 +0200)]
SVM: use VMCB accessors
This is particularly relevant for the SET form, to ensure proper clean
bits tracking (albeit in the case here it's benign as CPL and other
segment register attributes share a clean bit).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Wei Liu [Fri, 7 Apr 2017 14:55:43 +0000 (15:55 +0100)]
x86/domain: factor out pv_domain_initialise
Lump everything PV related in arch_domain_create into
pv_domain_initialise.
Though domcr_flags and config are not necessary, the new function is
given those to match its HVM counterpart.
Since it cleans up after itself there is no need to clean up in
arch_domain_create in case of failure. Remove the initialiser of rc in
arch_domain_create.
No functional change.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Fri, 7 Apr 2017 14:49:42 +0000 (15:49 +0100)]
x86/domain: factor out pv_domain_destroy
Now this function also frees the perdomain mapping. It is safe to do so
because destroy_perdomain_mapping is idempotent.
Move free_perdomain_mappings after pv_domain_destroy. It is safe to do
so because both destroy_perdomain_mapping and free_perdomain_mappings
are idempotent.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 1 Jun 2017 14:27:38 +0000 (15:27 +0100)]
x86/apic: Drop workarounds for Pentium/82489DX erratum
CONFIG_X86_GOOD_APIC is unconditionally selected for 64bit builds. Drop the
related infrastructure including apic_{read,write}_around(), the former of
which had no effect, and the latter which was an alias of apic_write().
No functional change, as confirmed by diffing the before/after disassembly.
(Three __LINE__ numbers are different, but they are `mov $imm, %reg` as part
of a dprintk() call.)
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <JBeulich@suse.com>
Ross Lagerwall [Tue, 30 May 2017 14:05:04 +0000 (15:05 +0100)]
x86/vmx: Fix vmentry failure because of invalid LER on Broadwell
Occasionally, on certain Broadwell CPUs MSR_IA32_LASTINTTOIP has been
observed to have the top three bits corrupted as though the MSR is using
the LBR_FORMAT_EIP_FLAGS_TSX format. This is incorrect and causes a
vmentry failure -- the MSR should contain an offset into the current
code segment. This is assumed to be erratum BDF14. Work around the issue
by sign-extending into bits 48:63 for MSR_IA32_LASTINT{FROM,TO}IP.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
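The sign extension itself is a simple shift pair; a hedged sketch of the
fixup described above (function name illustrative):
    /* Sign-extend bit 47 into bits 48:63 to make the value canonical. */
    static uint64_t canonicalise_lastint(uint64_t msr_content)
    {
        return (uint64_t)(((int64_t)msr_content << 16) >> 16);
    }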
Punit Agrawal [Wed, 7 Jun 2017 10:34:20 +0000 (12:34 +0200)]
x86: ensure invalidate_icache() definition is visible only when !__ASSEMBLY__
Commit edff605421 introduces an empty invalidate_icache() function in
page.h for x86 but mistakenly places it outside the !__ASSEMBLY__
block. This causes build failure on x86.
Address this by moving the function definition to within the existing
!__ASSEMBLY__ block.
Fixes: edff605421 ("Avoid excess icache flushes in populate_physmap() before domain has been created") Signed-off-by: Punit Agrawal <punit.agrawal@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Punit Agrawal [Fri, 26 May 2017 11:14:07 +0000 (12:14 +0100)]
Avoid excess icache flushes in populate_physmap() before domain has been created
populate_physmap() calls alloc_heap_pages() per requested
extent. alloc_heap_pages() invalidates the entire icache per
extent. During domain creation, the icache invalidations can be deferred
until all the extents have been allocated as there is no risk of
executing stale instructions from the icache.
Introduce a new flag "MEMF_no_icache_flush" to be used to prevent
alloc_heap_pages() from performing icache maintenance operations. Use
the flag in populate_physmap() before the domain has been unpaused and
perform the required icache maintenance at the end of the
allocation.
One concern is the lack of synchronisation around testing for
"creation_finished". But in practice the window where it is out of sync
should be small enough not to matter.
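A hedged sketch of the resulting logic in populate_physmap() (simplified;
the memflags plumbing is assumed):
    /* Defer per-extent icache maintenance while the domain is being built. */
    if ( !d->creation_finished )
        a->memflags |= MEMF_no_icache_flush;

    /* ... allocate all requested extents via alloc_heap_pages() ... */

    if ( !d->creation_finished )
        /* One invalidation now covers everything allocated above. */
        invalidate_icache();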
Punit Agrawal [Fri, 26 May 2017 11:14:06 +0000 (12:14 +0100)]
arm: p2m: Prevent redundant icache flushes
When the toolstack requests flushing the caches, flush_page_to_ram() is
called for each page of the requested domain. This leads to unnecessary
icache invalidation operations.
Let's take the responsibility of performing icache operations and use
the recently introduced flag to prevent redundant icache operations by
flush_page_to_ram().
Punit Agrawal [Fri, 26 May 2017 11:14:05 +0000 (12:14 +0100)]
Allow control of icache invalidations when calling flush_page_to_ram()
flush_page_to_ram() unconditionally drops the icache. In certain
situations this leads to excessive icache flushes when
flush_page_to_ram() ends up being repeatedly called in a loop.
Introduce a parameter to allow callers of flush_page_to_ram() to take
responsibility of synchronising the icache. This is in preparation for
adding logic to make the callers perform the necessary icache
maintenance operations.
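Sketched, the interface change and its intended use look like this (the
parameter name is an assumption):
    /* Callers now decide whether the icache needs syncing, too. */
    void flush_page_to_ram(unsigned long mfn, bool sync_icache);

    /* A loop flushing a whole domain skips the per-page icache work ... */
    for ( mfn = start; mfn < end; mfn++ )
        flush_page_to_ram(mfn, false);
    /* ... and invalidates the icache once at the end. */
    invalidate_icache();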
George Dunlap [Mon, 5 Jun 2017 10:02:30 +0000 (11:02 +0100)]
vif-common.sh: Have iptables wait for the xtables lock
iptables has a system-wide lock on the xtables. Strangely though, in
the case of two concurrent invocations, the default is for the
instance not grabbing the lock to exit out rather than waiting for it.
This means that when starting a large number of guests in parallel,
many will fail out with messages like this:
2017-05-10 11:45:40 UTC libxl: error: libxl_exec.c:118: libxl_report_child_exitstatus: /etc/xen/scripts/vif-bridge remove [18767] exited with error status 4
2017-05-10 11:50:52 UTC libxl: error: libxl_exec.c:118: libxl_report_child_exitstatus: /etc/xen/scripts/vif-bridge offline [1554] exited with error status 4
In order to instruct iptables to wait for the lock, you have to
specify '-w'. Unfortunately, not all versions of iptables have the
'-w' option, so on first invocation check to see if it accepts the '-w'
option.
Reported-by: Antony Saba <awsaba@gmail.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Tue, 6 Jun 2017 12:37:12 +0000 (14:37 +0200)]
x86/HAP: don't open code clear_domain_page()
Also drop a stray initializer.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Tue, 6 Jun 2017 12:36:41 +0000 (14:36 +0200)]
x86/HVM: correct notion of new CPL in task switch emulation
Commit aac1df3d03 ("x86/HVM: introduce hvm_get_cpl() and respective
hook") went too far in one aspect: When emulating a task switch we
really shouldn't be looking at what hvm_get_cpl() returns, as we're
switching all segment registers.
The issue manifests as a vmentry failure for 32bit VMs which use task
gates to service interrupts/exceptions, in situations where delivering
the event interrupts user code, and a privilege increase is required.
However, instead of reverting the relevant parts of that commit, have
the caller tell the segment loading function what the new CPL is. This
at once fixes the checks for ES (which is loaded before CS) having so
far been done against the old CPL.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Chao Gao [Tue, 6 Jun 2017 12:34:30 +0000 (14:34 +0200)]
x86/vlapic: fix two flaws in emulating MSR_IA32_APICBASE
According to SDM Chapter ADVANCED PROGRAMMABLE INTERRUPT CONTROLLER (APIC)
-> Extended XAPIC (x2APIC) -> x2APIC State Transitions, the existing code to
handle a guest's writing of MSR_IA32_APICBASE has two flaws:
1. Transition from x2APIC Mode to Disabled Mode is allowed but wrongly
disabled currently. Fix it by removing the related check.
2. Transition from x2APIC Mode to xAPIC Mode is illegal but wrongly allowed
currently. Considering that changing the ENABLE bit of the MSR is already
handled, it can be fixed by only allowing the transition from xAPIC Mode to
x2APIC Mode (the other two transitions, from x2APIC Mode to xAPIC Mode and
from Disabled Mode to the invalid state (EN=0, EXTD=1), are rejected).
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 6 Jun 2017 12:32:54 +0000 (14:32 +0200)]
x86/NPT: deal with fallout from 2Mb/1Gb unmapping change
Commit efa9596e9d ("x86/mm: fix incorrect unmapping of 2MB and 1GB
pages") left the NPT code untouched, as there is no explicit alignment
check matching the one in EPT code. However, the now more widespread
storing of INVALID_MFN into PTEs requires adjustments:
- calculations when shattering large pages may spill into the p2m type
field (converting p2m_populate_on_demand to p2m_grant_map_rw) - use
OR instead of PLUS,
- the use of plain l{2,3}e_from_pfn() in p2m_pt_set_entry() results in
all upper (flag) bits being clobbered - introduce and use
p2m_l{2,3}e_from_pfn(), paralleling the existing L1 variant.
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Tue, 23 May 2017 16:32:30 +0000 (16:32 +0000)]
x86/pagewalk: Fix pagewalk's handling of instruction fetches
Despite the claim in the comment (which was based partly on the code already
being like that, and mistaken reasoning because of Xen leaking NX into guest
context), reality differs.
Use of the SMAP feature without NX, or in a 2-level guest, demonstrate an
observable difference between reads and instruction fetches, despite
PFEC_insn_fetch not being reported in the #PF error code. This demonstrates
that instruction fetches are distinguished from data reads even without
PFEC_insn_fetch being reported.
Alter the pagewalk logic to keep the pagewalk insn_fetch input intact, but
only conditionally report insn_fetch in the error code. This logic is more
in line with the Intel SDM text:
* I/D flag (bit 4).
This flag is 1 if (1) the access causing the page-fault exception was an
instruction fetch; and (2) either (a) CR4.SMEP = 1; or (b) both (i) CR4.PAE
= 1 (either PAE paging or 4-level paging is in use); and (ii) IA32_EFER.NXE
= 1. Otherwise, the flag is 0. This flag describes the access causing the
page-fault exception, not the access rights specified by paging.
and the AMD SDM text:
* I/D - Bit 4. If this bit is set to 1, it indicates that the access that
caused the page fault was an instruction fetch. Otherwise, this bit is
cleared to 0. This bit is only defined if no-execute feature is enabled
(EFER.NXE=1 && CR4.PAE=1).
Curiously, the AMD manual doesn't mention SMEP despite some Fam16h processors
and all Fam17h processors supporting it. Experimentally, it behaves as
described by Intel.
In addition, add some extra clarification and sanity checking around the use
of NX for the access checks, where it might be reserved.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
When determining Access Rights, Protection Keys only take effect when CR4.PKE
is set, and 4-level paging is active. All other circumstances (notably, 32bit
PAE paging) skip the Protection Key control mechanism.
Therefore, we do not need to clear CR4.PKE behind the back of a guest which is
not using paging, as such a guest is necessarily running with EFER.LMA
disabled.
The {RD,WR}PKRU instructions are specified as being legal for use in any
operating mode, but only if CR4.PKE is set. By clearing CR4.PKE behind the
back of an unpaged guest, these instructions yield #UD despite the guest
correctly seeing PKE set if it reads CR4, and OSPKE being visible in CPUID.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Huaitong Han <huaitong.han@intel.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Gregory Herrero [Thu, 1 Jun 2017 08:53:04 +0000 (10:53 +0200)]
stop_machine: fill fn_result only in case of error
When stop_machine_run() is called with NR_CPUS as last argument,
fn_result member must be filled only if an error happens since it is
shared across all cpus.
Assume CPU1 detects an error and sets fn_result to -1, then CPU2 doesn't
detect an error and sets fn_result to 0. The error detected by CPU1 will
be ignored.
Note that in case multiple failures occur on different CPUs, only the
last error will be reported.
Signed-off-by: Gregory Herrero <gregory.herrero@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
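The fix described above amounts to making the write conditional; a minimal
sketch (the struct and function names are illustrative):
    static void stopmachine_set_fn_result(int ret)
    {
        /*
         * Record only failures: a CPU that succeeded must not overwrite
         * an error already reported by another CPU.
         */
        if ( ret )
            stopmachine_data.fn_result = ret;
    }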
Jan Beulich [Thu, 1 Jun 2017 08:50:25 +0000 (10:50 +0200)]
x86: partially undo "fix build with gcc 7"
While f32400e90c ("x86: fix build with gcc 7")'s change to
compat_array_access_ok() is necessary, I had blindly and needlessly
also added it to array_access_ok(). There's no conditional expression
involved there, so undo it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 1 Jun 2017 08:49:53 +0000 (10:49 +0200)]
smp: assert that all affected CPUs are online in on_selected_cpus()
Suggested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 19 May 2017 10:01:42 +0000 (11:01 +0100)]
xen/x86: Drop sync_core()
As identified in Linux c/s c198b121b1a1d "x86/asm: Rewrite sync_core() to use
IRET-to-self", sync_core() is only appropriate for two very specific usecases.
Xen doesn't have need of either of these usecases, so drop sync_core() to
avoid any misuse.
In the unlikely event that we do gain a legitimate use for sync_core(), it
should be reintroduced as a mov to %cr2 rather than cpuid, which has a lower
overhead.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Borislav Petkov [Sat, 3 Dec 2016 15:02:58 +0000 (16:02 +0100)]
xen/x86/alternatives: Do not use sync_core() to serialize I$
We use sync_core() in the alternatives code to stop speculative
execution of prefetched instructions because we are potentially changing
them and don't want to execute stale bytes.
What it does on most machines is call CPUID which is a serializing
instruction. And that's expensive.
However, the instruction cache is serialized when we're on the local CPU
and are changing the data through the same virtual address. So then, we
don't need the serializing CPUID but a simple control flow change, the
latter being accomplished with a CALL/RET which the noinline causes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 12 May 2017 14:07:16 +0000 (15:07 +0100)]
x86/string: Clean up x86/string.h
* None of the GCC docs mention memmove() in its list of builtins even today,
but 4.1 does have the builtin, meaning that all currently supported
compilers have it.
* Consistently use Xen style, matching the common code, and introduce symbol
definitions for function pointer use.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Tue, 2 Aug 2016 19:55:12 +0000 (19:55 +0000)]
xen/string: Use compiler __builtin_*() where possible
The use of -fno-builtin inhibits these automatic transformations. Mapping
the string functions to their __builtin_*() forms reinstates them, causing
constructs such as strlen("literal") to be evaluated at compile time, and
certain simple operations to be replaced with repeated string operations.
To avoid the macro altering the function names, use the method recommended by
the C specification by enclosing the function name in brackets to avoid the
macro being expanded. This means that optimisation opportunities continue to
work in the rest of the translation unit.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
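The pattern described above looks roughly like this (a sketch, with a
naive fallback body for illustration):
    /* The macro forwards all ordinary calls to the compiler builtin. */
    #define strlen(s) __builtin_strlen(s)

    /*
     * Parenthesising the name suppresses macro expansion, as the C
     * standard allows, so the out-of-line fallback can still be defined.
     */
    size_t (strlen)(const char *s)
    {
        const char *sc;

        for ( sc = s; *sc != '\0'; sc++ )
            /* nothing */;

        return sc - s;
    }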
Andrew Cooper [Fri, 12 May 2017 16:15:36 +0000 (17:15 +0100)]
xen/string: Clean up {xen,arm}/string.h
* Drop __kernel_size_t entirely. It isn't a useful distinction, especially
as it means the prototypes don't appear to match their common
definitions.
* Introduce __HAVE_ARCH_* guards for strpbrk(), strsep() and strspn(), which
match their implementation in common/string.c
* Apply consistent Xen style throughout.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com>