xenbits.xensource.com Git - people/royger/xen.git/log

5 years ago x86/tlb: use Xen L0 assisted TLB flush when available (l0_flush_v2.2, gitlab/l0_flush_v2.2)
Roger Pau Monne [Thu, 19 Dec 2019 13:16:16 +0000 (14:16 +0100)]
x86/tlb: use Xen L0 assisted TLB flush when available

Use Xen's L0 HVMOP_flush_tlbs hypercall in order to perform flushes.
This greatly increases the performance of TLB flushes when running
with a large number of vCPUs as a Xen guest, and is especially important
when running in shim mode.

The following figures are from a PV guest running `make -j32 xen` in
shim mode with 32 vCPUs and HAP.

Using x2APIC and ALLBUT shorthand:
real    4m35.973s
user    4m35.110s
sys     36m24.117s

Using L0 assisted flush:
real    1m2.596s
user    4m34.818s
sys     5m16.374s

The implementation adds a new hook to hypervisor_ops so other
enlightenments can also implement such assisted flush just by filling
the hook. Note that the Xen implementation completely ignores the
dirty CPU mask and the linear address passed in, and always performs a
global TLB flush on all vCPUs.
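
As a rough illustration of the hook approach (a minimal sketch, not the code in this patch): the struct layout, the flush_tlb field name and the hypervisor_flush_tlb() wrapper below are assumptions made for the example, and the hypercall is replaced by a counter.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for Xen's cpumask_t (64 CPUs max). */
typedef unsigned long cpumask_t;

/* Hypothetical shape of the ops table: one optional hook per feature. */
struct hypervisor_ops {
    const char *name;
    /* Optional L0 assisted flush; returns 0 or a negative errno value. */
    int (*flush_tlb)(const cpumask_t *mask, const void *va,
                     unsigned int order);
};

static unsigned int l0_flush_calls;

/*
 * Mirrors the behaviour described above: mask and va are ignored and a
 * global flush is requested. The counter stands in for the hypercall.
 */
static int xen_flush_tlb(const cpumask_t *mask, const void *va,
                         unsigned int order)
{
    (void)mask;
    (void)va;
    (void)order;
    l0_flush_calls++;
    return 0;
}

static const struct hypervisor_ops xen_ops = {
    .name = "Xen",
    .flush_tlb = xen_flush_tlb,
};

/* An enlightenment that does not fill the hook. */
static const struct hypervisor_ops bare_ops = {
    .name = "none",
};

/* Generic entry point: use the hook if present, else tell the caller. */
int hypervisor_flush_tlb(const struct hypervisor_ops *ops,
                         const cpumask_t *mask, const void *va,
                         unsigned int order)
{
    if ( ops && ops->flush_tlb )
        return ops->flush_tlb(mask, va, order);

    return -ENOSYS;
}
```

A caller would presumably fall back to the IPI based flush whenever the wrapper returns an error, so enlightenments without the hook keep working unchanged.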

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Add a L0 assisted hook to hypervisor ops.

5 years ago x86/tlb: allow disabling the TLB clock
Roger Pau Monne [Mon, 27 Jan 2020 09:41:24 +0000 (10:41 +0100)]
x86/tlb: allow disabling the TLB clock

The TLB clock is helpful when running Xen on bare metal because when
doing a TLB flush each CPU is IPI'ed and can keep a timestamp of the
last flush.

This is not the case however when Xen is running virtualized, and the
underlying hypervisor provides a mechanism to assist in performing TLB
flushes: Xen itself for example offers a HVMOP_flush_tlbs hypercall in
order to perform a TLB flush without having to IPI each CPU. When
using such mechanisms it's no longer possible to keep a timestamp of
the flushes on each CPU, as they are performed by the underlying
hypervisor.

Offer a boolean in order to signal Xen that the timestamped TLB
shouldn't be used. This avoids keeping the timestamps of the flushes,
and also forces NEED_FLUSH to always return true.
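
A simplified sketch of the intended behaviour follows. The variable names and the exact shape of the timestamp comparison are assumptions for the example; only the "always return true when disabled" part is taken from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical globals: the boolean knob and the global flush clock. */
static bool tlb_clk_enabled = true;
static uint32_t tlbflush_clock = 1;

/*
 * cpu_stamp:     clock value at the CPU's last TLB flush.
 * lastuse_stamp: clock value when the page was last in use.
 * Returns true if the CPU may still hold stale TLB entries for the page.
 */
bool need_flush(uint32_t cpu_stamp, uint32_t lastuse_stamp)
{
    uint32_t curr_time = tlbflush_clock;

    /* Without timestamps nothing can be proven, so always flush. */
    if ( !tlb_clk_enabled )
        return true;

    /* A clock value of 0 is treated as "wrapped": be conservative. */
    return curr_time == 0 ||
           (cpu_stamp <= lastuse_stamp && lastuse_stamp <= curr_time);
}
```

With the clock enabled a CPU that flushed after the page was last used is skipped; with it disabled every candidate CPU gets flushed, which is the price paid for not tracking timestamps.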

No functional change intended, as this change doesn't introduce any
user that disables the timestamped TLB.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

5 years ago x86/tlb: introduce a flush guests TLB flag
Roger Pau Monne [Mon, 27 Jan 2020 10:23:08 +0000 (11:23 +0100)]
x86/tlb: introduce a flush guests TLB flag

Introduce a specific flag to request a guest TLB flush, which is an
ASID/VPID tickle that forces a linear TLB flush for all HVM guests.

This was previously unconditionally done in each pre_flush call, but
that's not required: HVM guests not using shadow don't require linear
TLB flushes as Xen doesn't modify the guest page tables in that case
(ie: when using HAP).

Modify all shadow code TLB flushes to also flush the guest TLB, in
order to keep the previous behavior. I haven't looked at each specific
shadow code TLB flush in order to figure out whether it actually
requires a guest TLB flush or not, so there might be room for
improvement in that regard.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

5 years ago x86/hap: improve hypervisor assisted guest TLB flush
Roger Pau Monne [Wed, 22 Jan 2020 10:54:58 +0000 (10:54 +0000)]
x86/hap: improve hypervisor assisted guest TLB flush

The current implementation of the hypervisor assisted flush for HAP is
extremely inefficient.

First of all there's no need to call paging_update_cr3, as the only
relevant part of that function when doing a flush is the ASID vCPU
flush, so just call that function directly.

Since hvm_asid_flush_vcpu is protected against concurrent callers by a
spinlock there's no need anymore to pause the affected vCPUs.

Finally the global TLB flush performed by flush_tlb_mask is also not
necessary: since we only want to flush the guest TLB state, it's enough
to trigger a vmexit on the pCPUs currently holding any vCPU state, as
such a vmexit will already perform an ASID/VPID update, and thus clear
the guest TLB.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

5 years ago x86/paging: add TLB flush hooks
Roger Pau Monne [Tue, 14 Jan 2020 09:38:44 +0000 (10:38 +0100)]
x86/paging: add TLB flush hooks

Add shadow and hap implementation specific helpers to perform guest
TLB flushes. Note that the code for both is exactly the same at the
moment, and is copied from hvm_flush_vcpu_tlb. This will be changed by
further patches that will add implementation specific optimizations to
them.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

5 years ago x86/hvm: allow ASID flush when v != current
Roger Pau Monne [Tue, 21 Jan 2020 17:23:46 +0000 (17:23 +0000)]
x86/hvm: allow ASID flush when v != current

The current implementation of hvm_asid_flush_vcpu is not safe to use
unless the target vCPU is either paused or the currently running one,
as it modifies the generation without any locking.

Fix this by using atomic operations when accessing the generation
field, both in hvm_asid_flush_vcpu_asid and other ASID functions. This
allows safely flushing the current ASID generation. Note that if the
vCPU is currently running, a vmexit is required for the flush to take
effect.
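
The idea can be sketched with C11 atomics as below. This is only an illustration: the struct, the function names and the ASID allocation are simplified assumptions, and Xen uses its own atomic accessors rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU ASID state; generation 0 means "invalid". */
struct vcpu_asid {
    _Atomic uint64_t generation;
    uint32_t asid;
};

/*
 * Safe to call from a remote CPU: invalidation is a single atomic
 * store, so a concurrent reader sees either the old or the new value.
 */
void asid_flush_vcpu(struct vcpu_asid *a)
{
    atomic_store(&a->generation, 0);
}

/*
 * Runs on the vmentry path of the CPU owning the vCPU. Returns true if
 * a fresh ASID had to be assigned, ie: the guest TLB is flushed.
 * core_generation is assumed to never be 0.
 */
bool asid_handle_vmenter(struct vcpu_asid *a, uint64_t core_generation,
                         uint32_t *next_asid)
{
    if ( atomic_load(&a->generation) == core_generation )
        return false;

    a->asid = (*next_asid)++;
    atomic_store(&a->generation, core_generation);

    return true;
}
```

The remote flush only marks the generation as stale; the new ASID is picked up on the next vmentry, which is why a running vCPU needs a vmexit before the flush becomes visible.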

Note the same could be achieved by introducing an extra field to
hvm_vcpu_asid that signals hvm_asid_handle_vmenter the need to call
hvm_asid_flush_vcpu on the given vCPU before vmentry, this however
seems unnecessary as hvm_asid_flush_vcpu itself only sets two vCPU
fields to 0, so there's no need to delay this to the vmentry ASID
helper.

This is not a bugfix as no callers that would violate the assumptions
listed in the first paragraph have been found, but a preparatory
change in order to allow remote flushing of HVM vCPUs.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

5 years ago x86/tlb: fix NEED_FLUSH return type
Roger Pau Monne [Fri, 24 Jan 2020 15:22:04 +0000 (16:22 +0100)]
x86/tlb: fix NEED_FLUSH return type

The returned type wants to be bool instead of int.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

5 years ago x86/apic: simplify disconnect_bsp_APIC setup of LVT{0/1}
Roger Pau Monne [Thu, 23 Jan 2020 17:37:47 +0000 (18:37 +0100)]
x86/apic: simplify disconnect_bsp_APIC setup of LVT{0/1}

There's no need to read the current values of LVT{0/1} for the
purposes of the function, which seem to be to save the currently
selected vector: in the delivery modes used (ExtINT and NMI) the
vector field is ignored and hence can be set to 0.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

5 years ago x86/apic: fix disabling LVT0 in disconnect_bsp_APIC
Roger Pau Monne [Fri, 17 Jan 2020 14:52:32 +0000 (15:52 +0100)]
x86/apic: fix disabling LVT0 in disconnect_bsp_APIC

The Intel SDM states:

"When an illegal vector value (0 to 15) is written to a LVT entry and
the delivery mode is Fixed (bits 8-11 equal 0), the APIC may signal an
illegal vector error, without regard to whether the mask bit is set or
whether an interrupt is actually seen on the input."

And that's exactly what's currently done in disconnect_bsp_APIC when
virt_wire_setup is true and LVT LINT0 is being masked. By writing only
APIC_LVT_MASKED Xen is actually setting the vector to 0 and the
delivery mode to Fixed (0), and hence it triggers an APIC error even
when the LVT entry is masked.
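
To make the failure condition concrete, here is a small sketch that classifies a value about to be written to an LVT entry. The macro names are made up for the example, and the field positions (vector in bits 0-7, delivery mode in bits 8-10, mask in bit 16) are my reading of the register layout, so treat them as assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed LVT field layout, names are illustrative only. */
#define LVT_VECTOR_MASK    0x000000ffu
#define LVT_DELIVERY_MASK  0x00000700u
#define LVT_MODE_FIXED     0x00000000u
#define LVT_MODE_NMI       0x00000400u
#define LVT_MODE_EXTINT    0x00000700u
#define LVT_MASKED         0x00010000u

/*
 * True if writing val could signal an illegal vector error: Fixed
 * delivery mode with a vector in the 0-15 range. The mask bit is
 * deliberately not consulted, matching the SDM quote above.
 */
bool lvt_write_may_raise_illegal_vector(uint32_t val)
{
    return (val & LVT_DELIVERY_MASK) == LVT_MODE_FIXED &&
           (val & LVT_VECTOR_MASK) < 16;
}
```

Writing only the mask bit is flagged by this check, while a masked ExtINT or NMI entry with vector 0 is not, which is the distinction the fix relies on.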

This would usually manifest when Xen is being shut down, as that's
where disconnect_bsp_APIC is called:

(XEN) APIC error on CPU0: 40(00)

Fix this by calling clear_local_APIC prior to setting the LVT LINT
registers. clear_local_APIC already clears LVT LINT0, and hence the
troublesome write can be avoided as the register is already cleared.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Use clear_local_APIC in order to clear LINT0.

5 years ago x86/hvmloader: round up memory BAR size to 4K
Roger Pau Monne [Tue, 14 Jan 2020 18:06:26 +0000 (19:06 +0100)]
x86/hvmloader: round up memory BAR size to 4K

When placing memory BARs with sizes smaller than 4K multiple memory
BARs can end up mapped to the same guest physical address, and thus
won't work correctly.

Round up all memory BAR sizes to be at least 4K, so that they are
naturally aligned to a page size and thus don't end up sharing a page.
Also add a couple of asserts to the current code to make sure the MMIO
hole is properly sized and aligned.
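
A minimal sketch of the sizing rule, assuming BAR sizes are powers of two and that each BAR is placed at an address aligned to its (rounded) size. The helper names are hypothetical and not taken from hvmloader.

```c
#include <stdint.h>

#define BAR_MIN_SIZE 4096ull /* one 4K page */

/* Memory BARs smaller than a page are sized as if they were one page. */
uint64_t bar_size_round_up(uint64_t sz)
{
    return sz < BAR_MIN_SIZE ? BAR_MIN_SIZE : sz;
}

/* Align addr up to a power-of-two boundary. */
uint64_t align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}
```

With this rule two 256 byte BARs placed back to back land on different 4K pages instead of sharing one, at the cost of a slightly larger MMIO hole.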

Note that the guest can still move the BARs around and create these
collisions, and that BARs not filling up a physical page might leak
access to other MMIO regions placed in the same host physical page.

This is however no worse than what's currently done, and hence should
be considered an improvement over the current state.

Reported-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jason Andryuk <jandryuk@gmail.com>
---
Changes since v1:
 - Do the round up when sizing the BARs, so that the MMIO hole is
   correctly sized.
 - Add some asserts that the hole is properly sized and size-aligned.
 - Dropped Jason's Tested-by since the code has changed.
---
Jason, can you give this a spin? Thanks.

5 years ago nvmx: implement support for MSR bitmaps
Roger Pau Monne [Tue, 7 Jan 2020 11:32:39 +0000 (12:32 +0100)]
nvmx: implement support for MSR bitmaps

The current implementation of nested VMX has half-baked handling of MSR
bitmaps for the L1 VMM: it maps the L1 VMM provided MSR bitmap, but
doesn't actually load it into the nested vmcs, and thus the nested
guest vmcs ends up using the same MSR bitmap as the L1 VMM.

This is wrong as there's no assurance that the set of features enabled
for the L1 vmcs is the same that L1 itself is going to use in the
nested vmcs, and thus can lead to misconfigurations.

For example L1 vmcs can use x2APIC virtualization and virtual
interrupt delivery, and thus some x2APIC MSRs won't be trapped so that
they can be handled directly by the hardware using virtualization
extensions. On the other hand, the nested vmcs created by L1 VMM might
not use any of such features, so using an MSR bitmap that doesn't trap
accesses to the x2APIC MSRs will be leaking them to the underlying
hardware.

Fix this by crafting a merged MSR bitmap between the one used by L1
and the nested guest, and make sure a nested vmcs MSR bitmap always
traps accesses to the x2APIC MSR range, since hardware assisted x2APIC
virtualization or virtual interrupt delivery are never available to
L1 VMM.
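
The merge can be pictured roughly as follows. This is a sketch under assumptions: a set bit means "intercept", the read bitmap for low MSRs starts at offset 0x000 and the write bitmap for low MSRs at offset 0x800 of the 4K page, and all function and parameter names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MSR_BITMAP_BYTES 4096u
#define READ_LOW_OFF     0x000u /* reads of MSRs 0x0000-0x1fff */
#define WRITE_LOW_OFF    0x800u /* writes of MSRs 0x0000-0x1fff */
#define X2APIC_FIRST     0x800u
#define X2APIC_LAST      0x8ffu

/* Mark a low-range MSR as intercepted in the given sub-bitmap. */
static void set_intercept(uint8_t *bitmap, unsigned int base, uint32_t msr)
{
    bitmap[base + msr / 8] |= (uint8_t)(1u << (msr % 8));
}

/* Query a low-range MSR in the given sub-bitmap. */
int msr_intercepted(const uint8_t *bitmap, unsigned int base, uint32_t msr)
{
    return (bitmap[base + msr / 8] >> (msr % 8)) & 1;
}

/*
 * out traps an access if either input bitmap traps it, and always
 * traps the whole x2APIC MSR range for both reads and writes.
 */
void merge_msr_bitmaps(uint8_t *out, const uint8_t *used_for_l1,
                       const uint8_t *provided_by_l1)
{
    for ( size_t i = 0; i < MSR_BITMAP_BYTES; i++ )
        out[i] = used_for_l1[i] | provided_by_l1[i];

    for ( uint32_t msr = X2APIC_FIRST; msr <= X2APIC_LAST; msr++ )
    {
        set_intercept(out, READ_LOW_OFF, msr);
        set_intercept(out, WRITE_LOW_OFF, msr);
    }
}
```

ORing the two bitmaps is the conservative choice: an access is only passed through to hardware when both levels agree that it doesn't need to be trapped.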

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
This seems better than what's done currently, but TBH there's a lot of
work to be done in nvmx in order to make it functional and secure that
I'm not sure whether building on top of the current implementation is
something sane to do, or it would be better to start from scratch and
re-implement nvmx to just support the minimum required set of VTx
features in a sane and safe way.

5 years ago Revert "tools/libxc: disable x2APIC when using nested virtualization"
Roger Pau Monne [Wed, 8 Jan 2020 10:34:38 +0000 (11:34 +0100)]
Revert "tools/libxc: disable x2APIC when using nested virtualization"

This reverts commit 7b3c5b70a32303b46d0d051e695f18d72cce5ed0 and
re-enables the usage of x2APIC with nested virtualization.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wl@xen.org>

5 years ago x86/vvmx: fix VM_EXIT_ACK_INTR_ON_EXIT handling
Roger Pau Monne [Thu, 23 Jan 2020 12:33:49 +0000 (13:33 +0100)]
x86/vvmx: fix VM_EXIT_ACK_INTR_ON_EXIT handling

When VM_EXIT_ACK_INTR_ON_EXIT is set in the vmexit control vmcs
register, bit 31 of VM_EXIT_INTR_INFO must be 0, in order to denote
that the field doesn't contain any interrupt information. This is not
currently acknowledged as the field always gets filled with valid
interrupt information, regardless of whether VM_EXIT_ACK_INTR_ON_EXIT
is set.

Fix this and only fill VM_EXIT_INTR_INFO when VM_EXIT_ACK_INTR_ON_EXIT
is not set. Note that this requires one minor change in
nvmx_update_apicv in order to obtain the interrupt information from
the internal state rather than the nested vmcs register.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - New in this version.

5 years ago nvmx: fix handling of interrupts
Roger Pau Monne [Wed, 8 Jan 2020 10:12:17 +0000 (11:12 +0100)]
nvmx: fix handling of interrupts

When doing a virtual vmexit (ie: a vmexit handled by the L1 VMM)
interrupts shouldn't be injected using the virtual interrupt delivery
mechanism unless the Ack on exit vmexit control bit isn't set in the
nested vmcs.

Gate the call to nvmx_update_apicv helper on whether the nested vmcs
has the Ack on exit bit set in the vmexit control field.

Note that this fixes the usage of x2APIC by the L1 VMM, at least when
the L1 VMM is Xen.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Call nvmx_update_apicv if the "Ack on exit" vmexit control bit
   isn't set.

5 years ago (no commit message)
Roger Pau Monne [Thu, 31 Oct 2019 09:38:27 +0000 (10:38 +0100)]

5 years ago libxl: fix stubdomain creation after aacc143006429de
Roger Pau Monne [Tue, 21 Jan 2020 10:14:09 +0000 (10:14 +0000)]
libxl: fix stubdomain creation after aacc143006429de

aacc143006429de broke stubdomain creation by passing the guest
domain_create_state to libxl__domain_build in libxl__spawn_stub_dm,
when it should instead be crafting a new domain_create_state for the
stubdomain.

Fixes: aacc143006429de ('tools/libxl: Plumb domain_create_state down into libxl__build_pre()')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

5 years ago x86/mm: Make use of the default access param from xc_altp2m_create_view
Alexandru Stefan ISAILA [Fri, 17 Jan 2020 13:31:33 +0000 (13:31 +0000)]
x86/mm: Make use of the default access param from xc_altp2m_create_view

At this moment the default_access param from xc_altp2m_create_view is
not used.

This patch assigns default_access to p2m->default_access at the time of
initializing a new altp2m view.

Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

5 years ago x86/mm: Pull vendor-independent altp2m code out of p2m-ept.c and into p2m.c
Alexandru Stefan ISAILA [Fri, 17 Jan 2020 13:31:31 +0000 (13:31 +0000)]
x86/mm: Pull vendor-independent altp2m code out of p2m-ept.c and into p2m.c

No functional changes.

Requested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

5 years ago x86/altp2m: Add hypercall to set a range of sve bits
Alexandru Stefan ISAILA [Fri, 17 Jan 2020 13:31:30 +0000 (13:31 +0000)]
x86/altp2m: Add hypercall to set a range of sve bits

By default the sve bits are not set.
This patch adds a new hypercall, xc_altp2m_set_supress_ve_multi(),
to set a range of sve bits.
The core function, p2m_set_suppress_ve_multi(), does not stop in case
of an error and makes a best effort at setting the bits in the
given range. A check for continuation is made in order to allow
preemption on large ranges.
The gfn of the first error is stored in
xen_hvm_altp2m_suppress_ve_multi.first_error_gfn and the error code is
stored in xen_hvm_altp2m_suppress_ve_multi.first_error.
If no error occurred the values will be 0.
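
The best-effort semantics can be sketched as below. The struct is a cut-down, hypothetical version of the hypercall argument, the per-gfn operation is passed in as a callback, demo_set_one() exists only to exercise the loop, and the preemption/continuation check is left out.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, reduced version of the hypercall argument. */
struct sve_multi {
    uint64_t first_gfn;
    uint64_t last_gfn;       /* inclusive */
    uint64_t first_error_gfn;
    int32_t first_error;
};

typedef int (*set_one_fn)(uint64_t gfn);

/*
 * Apply set_one to every gfn in the range. Errors do not stop the
 * loop; only the first failing gfn and its error code are recorded.
 */
void set_range_best_effort(struct sve_multi *op, set_one_fn set_one)
{
    op->first_error = 0;
    op->first_error_gfn = 0;

    if ( op->first_gfn > op->last_gfn )
        return;

    for ( uint64_t gfn = op->first_gfn; ; gfn++ )
    {
        int rc = set_one(gfn);

        if ( rc && !op->first_error )
        {
            op->first_error = rc;
            op->first_error_gfn = gfn;
        }

        /* Compare before incrementing so last_gfn == UINT64_MAX works. */
        if ( gfn == op->last_gfn )
            break;
    }
}

/* Demo callback: fails on gfn 8 and gfn 9 with different errors. */
static unsigned int demo_calls;

static int demo_set_one(uint64_t gfn)
{
    demo_calls++;

    if ( gfn == 8 )
        return -EINVAL;
    if ( gfn == 9 )
        return -ENOENT;

    return 0;
}
```

Running the demo callback over gfns 5 to 10 visits all six entries and reports gfn 8 with -EINVAL, since later failures don't overwrite the first one.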

Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

5 years ago x86/mm: Add array_index_nospec to guest provided index values
Alexandru Stefan ISAILA [Fri, 17 Jan 2020 13:31:26 +0000 (13:31 +0000)]
x86/mm: Add array_index_nospec to guest provided index values

This patch aims to sanitize indexes, potentially guest provided
values, for altp2m_eptp[] and altp2m_p2m[] arrays.
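
For readers unfamiliar with this kind of helper, the generic branch-free masking idea is sketched below. This is not Xen's array_index_nospec implementation (which may use architecture specific code); the function names are invented, and the sketch assumes size is non-zero and no larger than LONG_MAX, and that right-shifting a negative long is an arithmetic shift, as on mainstream compilers.

```c
#include <limits.h>

/*
 * All ones if index < size, zero otherwise, computed without a branch
 * so that a mispredicted bounds check can't leave a wild index in use.
 */
static inline unsigned long index_mask_nospec(unsigned long index,
                                              unsigned long size)
{
    return ~(long)(index | (size - 1 - index)) >>
           (sizeof(long) * CHAR_BIT - 1);
}

/* Clamp an untrusted index: in-range values pass, others become 0. */
static inline unsigned long clamp_index_nospec(unsigned long index,
                                               unsigned long size)
{
    return index & index_mask_nospec(index, size);
}
```

The clamp is applied after the usual bounds check, not instead of it: architecturally the check still rejects bad indexes, while the mask keeps a speculative access inside the array.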

Requested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>

5 years ago x86/boot: Drop sym_fs()
Andrew Cooper [Thu, 9 Jan 2020 14:06:38 +0000 (14:06 +0000)]
x86/boot: Drop sym_fs()

All remaining users of sym_fs() can trivially be switched to using sym_esi()
instead.  This is shorter to encode and faster to execute.

This removes the final uses of %fs during boot, which allows us to drop
BOOT_FS from the trampoline GDT, which drops an arbitrary 16M limit on Xen's
compiled size.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

5 years ago x86/boot: Simplify pagetable manipulation loops
Andrew Cooper [Fri, 10 Jan 2020 01:04:28 +0000 (01:04 +0000)]
x86/boot: Simplify pagetable manipulation loops

For __page_tables_{start,end} and L3 bootmap initialisation, the logic is
unnecessarily complicated owing to its attempt to use the LOOP instruction,
which results in an off-by-8 memory address owing to LOOP's termination
condition.

Rewrite both loops for improved clarity and speed.

Misc notes:
 * TEST $IMM, MEM can't macrofuse.  The loop has 0x1200 iterations, so pull
   the $_PAGE_PRESENT constant out into a spare register to turn the TEST into
   its %REG, MEM form, which can macrofuse.
 * Avoid the use of %fs-relative references.  %esi-relative is the more common
   form in the code, and doesn't suffer an address generation overhead.
 * Avoid LOOP.  CMP/JB isn't microcoded and faster to execute in all cases.
 * For a trivial 4-iteration loop, even compilers unroll these.  The
   generated code size is a fraction larger, but this is init and the asm is
   far easier to follow.
 * Reposition the l2=>l1 bootmap construction so the asm reads in pagetable
   level order.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

5 years ago x86/boot: Drop explicit %fs uses
Andrew Cooper [Thu, 9 Jan 2020 14:06:08 +0000 (14:06 +0000)]
x86/boot: Drop explicit %fs uses

The trampoline relocation code uses %fs for accessing Xen, and this comes with
an arbitrary 16M limitation.  We could adjust the limit, but the boot code is
a confusing mix of %ds/%esi-based and %fs-based accesses, and the use of %fs
is longer to encode, and incurs an address generation overhead.

Rewrite the logic to use %ds, for better consistency with the surrounding
code, and a marginal performance improvement.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

5 years ago x86/boot: Size the boot/directmap mappings dynamically
Andrew Cooper [Fri, 10 Jan 2020 14:05:29 +0000 (14:05 +0000)]
x86/boot: Size the boot/directmap mappings dynamically

... rather than presuming that 16M will do.  On the EFI side, use
l2e_add_flags() to reduce the code-generation overhead of using
l2e_from_paddr() twice.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

5 years ago x86/boot: Create the l2_xenmap[] mappings dynamically
Andrew Cooper [Fri, 10 Jan 2020 16:35:14 +0000 (16:35 +0000)]
x86/boot: Create the l2_xenmap[] mappings dynamically

The build-time construction of l2_xenmap[] imposes an arbitrary limit of 16M
total, which is a limit looking to be lifted.

Adjust both the BIOS and EFI paths to fill it in dynamically, based on the
final linked size of Xen.  l2_xenmap[] stays between __page_tables_{start,end}
(rather than move into .bss.page_aligned) as it is expected to gain a
different pagetable reference shortly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

5 years ago xen/sched: add const qualifier where appropriate
Juergen Gross [Fri, 8 Nov 2019 16:15:35 +0000 (17:15 +0100)]
xen/sched: add const qualifier where appropriate

Make use of the const qualifier more often in scheduling code.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: Meng Xu <mengxu@cis.upenn.edu>

5 years ago xen/sched: eliminate sched_tick_suspend() and sched_tick_resume()
Juergen Gross [Fri, 8 Nov 2019 15:33:32 +0000 (16:33 +0100)]
xen/sched: eliminate sched_tick_suspend() and sched_tick_resume()

sched_tick_suspend() and sched_tick_resume() only call rcu related
functions, so eliminate them and do the rcu_idle_timer*() calling in
rcu_idle_[enter|exit]().

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: Julien Grall <julien@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

5 years ago xen/sched: switch scheduling to bool where appropriate
Juergen Gross [Fri, 8 Nov 2019 11:50:58 +0000 (12:50 +0100)]
xen/sched: switch scheduling to bool where appropriate

Scheduling code has several places using int or bool_t instead of bool.
Switch those.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

5 years ago xen/sched: replace null scheduler percpu-variable with pdata hook
Juergen Gross [Fri, 8 Nov 2019 11:16:10 +0000 (12:16 +0100)]
xen/sched: replace null scheduler percpu-variable with pdata hook

Instead of having its own percpu-variable for per-cpu private data, the
null scheduler should use the generic scheduler interface for that purpose.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

5 years ago xen/sched: use scratch cpumask instead of allocating it on the stack
Juergen Gross [Fri, 8 Nov 2019 08:15:04 +0000 (09:15 +0100)]
xen/sched: use scratch cpumask instead of allocating it on the stack

In rt scheduler there are three instances of cpumasks allocated on the
stack. Replace them by using cpumask_scratch.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>

5 years ago xen/sched: remove special cases for free cpus in schedulers
Juergen Gross [Fri, 8 Nov 2019 07:02:53 +0000 (08:02 +0100)]
xen/sched: remove special cases for free cpus in schedulers

With the idle scheduler now taking care of all cpus not in any cpupool
the special cases in the other schedulers for no cpupool associated
can be removed.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

5 years ago xen/sched: cleanup sched.h
Juergen Gross [Fri, 8 Nov 2019 09:56:42 +0000 (10:56 +0100)]
xen/sched: cleanup sched.h

There are some items in include/xen/sched.h which can be moved to
private.h as they are scheduler private.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

5 years ago xen/sched: make sched-if.h really scheduler private
Juergen Gross [Thu, 7 Nov 2019 14:34:37 +0000 (15:34 +0100)]
xen/sched: make sched-if.h really scheduler private

include/xen/sched-if.h should be private to scheduler code, so move it
to common/sched/private.h and move the remaining use cases to
cpupool.c and core.c.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

5 years ago xen/sched: move schedulers and cpupool coding to dedicated directory
Juergen Gross [Wed, 22 Jan 2020 14:06:43 +0000 (15:06 +0100)]
xen/sched: move schedulers and cpupool coding to dedicated directory

Move sched*c and cpupool.c to a new directory common/sched.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

5 years ago VT-d: don't pass bridge devices to domain_context_mapping_one()
Jan Beulich [Wed, 22 Jan 2020 15:39:58 +0000 (16:39 +0100)]
VT-d: don't pass bridge devices to domain_context_mapping_one()

When passed a non-NULL pdev, the function does an owner check when it
finds an already existing context mapping. Bridges, however, don't get
passed through to guests, and hence their owner is always going to be
Dom0, leading to the assignment of all but one of the functions of
multi-function PCI devices behind bridges to fail.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

5 years ago x86/smp: use APIC ALLBUT destination shorthand when possible
Roger Pau Monné [Wed, 22 Jan 2020 15:38:39 +0000 (16:38 +0100)]
x86/smp: use APIC ALLBUT destination shorthand when possible

If the IPI destination mask matches the mask of online CPUs use the
APIC ALLBUT destination shorthand in order to send an IPI to all CPUs
on the system except the current one. This can only be safely used
when no CPU hotplug or unplug operations are taking place, there are
no offline CPUs (or any such CPUs have been onlined and parked), all
CPUs in the system have been accounted for (ie: the number of CPUs
doesn't exceed NR_CPUS and APIC IDs are below MAX_APICS) and there's
no possibility of CPU hotplug (ie: no disabled CPUs have been reported
by the firmware tables).
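
The conditions above boil down to a predicate along these lines. This is a sketch only: the struct and its boolean fields are hypothetical summaries of state that Xen tracks in different places, and the mask is limited to 64 CPUs for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cpumask_t; /* one bit per CPU, 64 CPUs max here */

/* Hypothetical summary of the state the shorthand depends on. */
struct cpu_topology {
    cpumask_t online;           /* currently online CPUs */
    bool maps_locked;           /* CPU maps held: no hotplug in flight */
    bool all_cpus_accounted;    /* nothing beyond NR_CPUS / MAX_APICS */
    bool hotplug_possible;      /* firmware reported disabled CPUs */
    bool unparked_offline_cpus; /* offline CPUs never onlined and parked */
};

/*
 * True if an IPI to dest, sent from CPU self, can use the ALLBUT
 * shorthand: the destination plus the sender must cover exactly the
 * online CPUs, and no CPU outside Xen's control may receive the IPI.
 */
bool can_use_allbut(const struct cpu_topology *t, cpumask_t dest,
                    unsigned int self)
{
    if ( self >= 64 )
        return false;

    if ( !t->maps_locked || !t->all_cpus_accounted ||
         t->hotplug_possible || t->unparked_offline_cpus )
        return false;

    return (dest | ((cpumask_t)1 << self)) == t->online;
}
```

If the sender itself is part of the destination mask it would still need a separate self IPI, since the shorthand by definition excludes the current CPU.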

This is especially beneficial when using the PV shim, since using the
shorthand avoids performing an APIC register write (or multiple ones
if using xAPIC mode) for each destination when doing a global TLB
flush.

The lock time of flush_lock on a 32 vCPU guest using the shim in
x2APIC mode without the shorthand is:

Global lock flush_lock: addr=ffff82d0804b21c0, lockval=f602f602, not locked
  lock:228455938(79406065573135), block:205908580(556416605761539)

Average lock time: 347577ns

While the same guest using the shorthand:

Global lock flush_lock: addr=ffff82d0804b41c0, lockval=d9c4d9bc, cpu=12
  lock:1890775(416719148054), block:1663958(2500161282949)

Average lock time: 220395ns

Approximately a 1/3 improvement in the lock time.
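
The averages can be reproduced from the lock profiling output, assuming the first number after "lock:" is the acquisition count and the number in parentheses is the accumulated hold time in ns. That reading is an assumption, but it matches the quoted averages. The helper names are made up for the example.

```c
#include <stdint.h>

/* Average hold time, rounded down; 0 if the lock was never taken. */
uint64_t avg_lock_ns(uint64_t total_ns, uint64_t count)
{
    return count ? total_ns / count : 0;
}

/* Reduction from before to after in percent, rounded down. */
unsigned int reduction_percent(uint64_t before, uint64_t after)
{
    if ( !before || after > before )
        return 0;

    return (unsigned int)(((before - after) * 100) / before);
}
```

Feeding in the two samples gives 79406065573135 / 228455938 = 347577ns and 416719148054 / 1890775 = 220395ns, a reduction of about 36%.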

Note that this requires locking the CPU maps (get_cpu_maps) which uses
a trylock. This is currently safe as all users of cpu_add_remove_lock
do a trylock, but will need reevaluating if non-trylock users appear.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

5 years ago xen/arm: gic: Remove pointless assertion against enum gic_sgi
Julien Grall [Sat, 18 Jan 2020 15:39:24 +0000 (15:39 +0000)]
xen/arm: gic: Remove pointless assertion against enum gic_sgi

The Arm Compiler will complain that the assertions ASSERT(sgi < 16) are
always true. This is because sgi is an item of the enum gic_sgi and
should always contain fewer than 16 SGIs.

Rather than using ASSERTs, introduce a new item in the enum that could
be checked against at build time.

Take the opportunity to remove the specific assigned values for each
item. This is fine because an enum always starts at zero and values are
assigned in increments of one. None of our code relies on hardcoded
values either.
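
The pattern looks roughly like this. The enumerator names are illustrative and may not match the ones in the tree, and C11 _Static_assert is used here in place of whatever build time check Xen actually relies on.

```c
/* No explicit values: the first item is 0, each next one adds 1. */
enum gic_sgi {
    GIC_SGI_EVENT_CHECK,
    GIC_SGI_DUMP_STATE,
    GIC_SGI_CALL_FUNCTION,
    GIC_SGI_MAX, /* sentinel: number of SGIs in use, keep it last */
};

/* Checked once at build time instead of asserting at every use. */
_Static_assert(GIC_SGI_MAX <= 16, "too many SGIs defined");

/* Runtime helper for values that don't come from the enum itself. */
int sgi_is_valid(unsigned int sgi)
{
    return sgi < GIC_SGI_MAX;
}
```

Adding a 17th item before the sentinel would then break the build rather than trip an assertion the compiler considers always true.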

[stefano: grammar fixes in commit message]

Signed-off-by: Julien Grall <julien@xen.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
CC: Andrii Anisov <andrii_anisov@epam.com>

5 years ago Revert "xen/arm32: setup: Give a xenheap page to the boot allocator"
Julien Grall [Thu, 16 Jan 2020 21:51:36 +0000 (21:51 +0000)]
Revert "xen/arm32: setup: Give a xenheap page to the boot allocator"

Since commit c61c1b4943 "xen/page_alloc: statically allocate
bootmem_region_list", the boot allocator does not use the first page of
the first region passed for its own purpose.

This reverts commit ae84f55353475f569daddb9a81ac0a6bc7772c90.

Signed-off-by: Julien Grall <julien@xen.org>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

5 years ago golang/xenlight: Don't leak memory on context open failure
George Dunlap [Fri, 17 Jan 2020 14:01:05 +0000 (14:01 +0000)]
golang/xenlight: Don't leak memory on context open failure

If libxl_ctx_alloc() returns an error, we need to destroy the logger
that we made.

Restructure the Close() method such that it checks for each resource
to be freed and then frees it.  This allows Close() to become
idempotent, as well as to be a useful clean-up to a partially-created
context.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Nick Rosbrook <rosbrookn@ainfosec.com>

5 years ago golang/xenlight: Errors are negative
George Dunlap [Thu, 26 Dec 2019 17:18:14 +0000 (17:18 +0000)]
golang/xenlight: Errors are negative

Commit 871e51d2d4 changed the sign on the xenlight error types (making
the values negative, same as the C-generated constants), but failed to
flip the sign in the Error() string function.  The result is that
ErrorNonspecific.String() prints "libxl error: 1" rather than the
human-readable error message.

Get rid of the whole issue by making libxlErrors a map, and mapping
actual error values to string, falling back to printing the actual
value of the Error type if it's not present.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Nick Rosbrook <rosbrookn@ainfosec.com>

5 years ago go/xenlight: More informative error messages
George Dunlap [Thu, 26 Dec 2019 14:45:08 +0000 (14:45 +0000)]
go/xenlight: More informative error messages

If an error is encountered deep in a complicated data structure, it's
often difficult to tell where the error actually is.  Make the error
message from the generated toC() and fromC() structures more
informative by tagging which field being converted encountered the
error.  This will have the effect of giving a "stack trace" of the
failure inside a nested data structure.

NB that my version of python insists on reordering a couple of switch
statements for some reason; In other patches I've reverted those
changes, but in this case it's more difficult because they interact
with actual code changes.  I'll leave this here for now, as we're
going to remove helpers.gen.go from being tracked by git at some point
in the near future anyway.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Nick Rosbrook <rosbrookn@ainfosec.com>

5 years ago go/xenlight: Fix CpuidPolicyList conversion
George Dunlap [Thu, 26 Dec 2019 17:43:17 +0000 (17:43 +0000)]
go/xenlight: Fix CpuidPolicyList conversion

Empty Go strings should be converted to `nil` libxl_cpuid_policy_list;
otherwise libxl_cpuid_parse_config gets confused.

Also, libxl_cpuid_policy_list returns a weird error, not a "normal"
libxl error; if it returns one of these non-standard errors, convert
it to ErrorInval.

Finally, make the fromC() method take a pointer, and set the value of
CpuidPolicyList such that it will generate a valid CpuidPolicyList in
response.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Nick Rosbrook <rosbrookn@ainfosec.com>

5 years ago golang/xenlight: Do proper nil / NULL conversions for builtin Bitmap type
George Dunlap [Thu, 26 Dec 2019 17:40:33 +0000 (17:40 +0000)]
golang/xenlight: Do proper nil / NULL conversions for builtin Bitmap type

Similar to the autogenerated types, but for `builtin` Bitmap type.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Nick Rosbrook <rosbrookn@ainfosec.com>

5 years ago Introduce CHANGELOG.md
Paul Durrant [Mon, 13 Jan 2020 15:32:17 +0000 (15:32 +0000)]
Introduce CHANGELOG.md

As agreed during the 2020-01 community call [1] this patch introduces a
changelog, based on the principles explained at keepachangelog.com [2].
A new MAINTAINERS entry is also added, with myself as (currently sole)
maintainer.

[1] See C.2 at https://cryptpad.fr/pad/#/2/pad/edit/ERZtMYD5j6k0sv-NG6Htl-AJ/
[2] https://keepachangelog.com/en/1.0.0/

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Lars Kurth <lars.kurth@citrix.com>
Acked-by: Wei Liu <wl@xen.org>

5 years ago linkfarm: Exclude .*.tmp
Anthony PERARD [Wed, 15 Jan 2020 16:44:54 +0000 (16:44 +0000)]
linkfarm: Exclude .*.tmp

Exclude intermediate files .*.tmp from the linkfarm; those are
generated by %.o:%.c rules in xen/Rules.mk when
CONFIG_ENFORCE_UNIQUE_SYMBOLS=y.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years agolibxl: event: Document lifetime API for libxl_childproc_setmode
Ian Jackson [Fri, 17 Jan 2020 18:12:07 +0000 (18:12 +0000)]
libxl: event: Document lifetime API for libxl_childproc_setmode

There is already an identical comment for
libxl_osevent_register_hooks.

libxl_childproc_setmode's hooks parameter has the same property and
this should be documented.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wl@xen.org>
5 years agox86/smp: move and clean APIC helpers
Roger Pau Monné [Mon, 20 Jan 2020 11:48:05 +0000 (12:48 +0100)]
x86/smp: move and clean APIC helpers

Move __prepare_ICR{2}, apic_wait_icr_idle and
__default_send_IPI_shortcut to the top of the file, since they will be
used by send_IPI_mask in future changes.

While there, take the opportunity to remove the leading underscores,
drop the inline attribute, drop the default prefix from the shorthand
helper, change the return type of the prepare helpers to unsigned and
do some minor style cleanups.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agoVT-d: dma_pte_clear_one() can't fail anymore
Jan Beulich [Mon, 20 Jan 2020 11:47:31 +0000 (12:47 +0100)]
VT-d: dma_pte_clear_one() can't fail anymore

Hence it's pointless for it to return an error indicator, and it's even
less useful for it to be __must_check. This is a result of commit
e8afe1124cc1 ("iommu: elide flushing for higher order map/unmap
operations") moving the TLB flushing out of the function.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
5 years agoVT-d: adjust log messages in domain_context_mapping_one()
Jan Beulich [Mon, 20 Jan 2020 11:46:13 +0000 (12:46 +0100)]
VT-d: adjust log messages in domain_context_mapping_one()

Add missing newlines, use %pd, and drop exclamation marks.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
5 years agoxen/char: scif-uart: Remove useless ASSERT condition
Artem Mygaiev [Wed, 9 Oct 2019 14:20:16 +0000 (17:20 +0300)]
xen/char: scif-uart: Remove useless ASSERT condition

cnt is unsigned, so always >=0

Coverity-ID: 1381848
Signed-off-by: Artem Mygaiev <artem_mygaiev@epam.com>
[julien: Update commit title]
Acked-by: Julien Grall <julien@xen.org>
5 years agobuild: fix dependency file generation with ENFORCE_UNIQUE_SYMBOLS=y
Jan Beulich [Fri, 17 Jan 2020 16:38:19 +0000 (17:38 +0100)]
build: fix dependency file generation with ENFORCE_UNIQUE_SYMBOLS=y

The recorded file, unless overridden by -MQ (or -MT) is that specified
by -o, which doesn't produce correct dependencies and hence will cause
failure to re-build when included files change.

Fixes: 81ecb38b83b0 ("build: provide option to disambiguate symbol names")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86/shadow: use single (atomic) MOV for emulated writes
Jason Andryuk [Fri, 17 Jan 2020 15:19:16 +0000 (16:19 +0100)]
x86/shadow: use single (atomic) MOV for emulated writes

This is the corresponding change to the shadow code as made by
bf08a8a08a2e "x86/HVM: use single (atomic) MOV for aligned emulated
writes" to the non-shadow HVM code.

The bf08a8a08a2e commit message:
Using memcpy() may result in multiple individual byte accesses
(depending how memcpy() is implemented and how the resulting insns,
e.g. REP MOVSB, get carried out in hardware), which isn't what we
want/need for carrying out guest insns as correctly as possible. Fall
back to memcpy() only for accesses not 2, 4, or 8 bytes in size.
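
The idea in the quoted message can be sketched in plain C as follows. This is an illustration only, not the shadow code: the name emulated_write() and its signature are hypothetical, and the destination is assumed to be naturally aligned for the access size. A volatile store of the matching width is what a compiler will normally turn into one MOV on x86.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: write "bytes" bytes from src to dst.
 * 2-, 4- and 8-byte accesses use one store of the matching width;
 * every other size falls back to memcpy(), as the message describes.
 * dst is assumed to be naturally aligned for the 2/4/8 byte cases.
 */
void emulated_write(void *dst, const void *src, size_t bytes)
{
    switch ( bytes )
    {
    case 2:
    {
        uint16_t val;

        memcpy(&val, src, sizeof(val));   /* load the value to store */
        *(volatile uint16_t *)dst = val;  /* one 16-bit store */
        break;
    }

    case 4:
    {
        uint32_t val;

        memcpy(&val, src, sizeof(val));
        *(volatile uint32_t *)dst = val;  /* one 32-bit store */
        break;
    }

    case 8:
    {
        uint64_t val;

        memcpy(&val, src, sizeof(val));
        *(volatile uint64_t *)dst = val;  /* one 64-bit store */
        break;
    }

    default:
        memcpy(dst, src, bytes);          /* no atomicity expected here */
        break;
    }
}
```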

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Tim Deegan <tim@xen.org>
5 years agox86/sm{e, a}p: do not enable SMEP/SMAP in PV shim by default on AMD
Igor Druzhinin [Fri, 17 Jan 2020 15:18:20 +0000 (16:18 +0100)]
x86/sm{e, a}p: do not enable SMEP/SMAP in PV shim by default on AMD

Because AMD and Hygon are unable to selectively trap CR4 bit modifications,
running a 32-bit PV guest inside PV shim comes with a significant performance
hit. Moreover, for SMEP in particular, every time CR4.SMEP changes on context
switch to/from a 32-bit PV guest, it gets trapped by L0 Xen, which then
tries to perform a global TLB invalidation for the PV shim domain. This usually
results in an eventual hang of a PV shim with at least several vCPUs.

Since the overall security risk is generally lower for shim Xen, as it is
there more as a defense-in-depth mechanism, choose to disable SMEP/SMAP in
it by default on AMD and Hygon unless a user chooses otherwise.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86: adjust EFI-related build message
Jan Beulich [Fri, 17 Jan 2020 15:17:23 +0000 (16:17 +0100)]
x86: adjust EFI-related build message

As of commit 93249f7fc17c ("x86/efi: split compiler vs linker support"),
EFI support in xen.gz may be available even if no xen.efi gets
generated. Distinguish the cases when emitting the message.

Also drop the pointlessly (afaict) left use of $(filter ...) (needed
only when used in $(if ...)), from the ifeq() introduced by 7059afb202ff
("x86/Makefile: remove $(guard) use from $(TARGET).efi target").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86: refine link time stub area related assertion
Jan Beulich [Fri, 17 Jan 2020 15:15:28 +0000 (16:15 +0100)]
x86: refine link time stub area related assertion

While it was me who introduced this, the use of | there has become
(and perhaps was from the very beginning) misleading. Rather than
avoiding the right side of it when linking the xen.efi intermediate file
at a different base address, make the expression cope with that case,
thus verifying placement on every step.

Furthermore the original check was too strict: We don't use one page per
CPU, so account for this as well. This involves moving the
STUBS_PER_PAGE definition and making DIV_ROUND_UP() accessible from
assembly (and hence the linker script); move a few other potentially
generally useful definitions along with it.
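
As a rough illustration of the relaxed check: with several stubs sharing a page, the stub area needs DIV_ROUND_UP(nr_cpus, STUBS_PER_PAGE) pages rather than one page per CPU. The sketch below assumes a STUBS_PER_PAGE of 32 purely for the example, and stub_pages_needed() is a hypothetical helper. The macro uses only preprocessor-friendly arithmetic, which is what makes it usable from assembly and linker scripts.

```c
/* Round-up integer division, written so it also works in assembly. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Assumed value for illustration; the real one depends on the stub size. */
#define STUBS_PER_PAGE 32u

/* Hypothetical helper: pages needed to hold one stub per CPU. */
unsigned int stub_pages_needed(unsigned int nr_cpus)
{
    return DIV_ROUND_UP(nr_cpus, STUBS_PER_PAGE);
}
```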

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86/time: update TSC stamp on restore from deep C-state
Igor Druzhinin [Fri, 17 Jan 2020 15:11:20 +0000 (16:11 +0100)]
x86/time: update TSC stamp on restore from deep C-state

If ITSC is not available on the CPU (e.g. if running nested as PV shim),
then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. on
all AMD and some old Intel processors. In that case the TSC needs to be
restored on the CPU from platform time by Xen upon exiting C-states.

As platform time might be behind the last TSC stamp recorded for the
current CPU, invariant of TSC stamp being always behind local TSC counter
is violated. This has an effect of get_s_time() going negative resulting
in eventual system hang or crash.
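
The failure mode can be modelled with a small sketch. Everything here is hypothetical (the struct, the function names, and the assumption that one TSC tick equals one nanosecond); it only shows how a TSC value behind the recorded stamp yields a negative time, and how refreshing the stamp together with the TSC write restores the invariant.

```c
#include <stdint.h>

/* Hypothetical per-CPU stamp: TSC value and system time recorded together. */
struct time_stamp_sketch {
    uint64_t local_tsc;
    int64_t local_stime;
};

/*
 * Hypothetical model of get_s_time(), assuming 1 tick == 1 ns.
 * If tsc_now is behind local_tsc the unsigned difference wraps, and the
 * result ends up before the stamp (possibly negative).
 */
int64_t sketch_get_s_time(const struct time_stamp_sketch *t, uint64_t tsc_now)
{
    return t->local_stime + (int64_t)(tsc_now - t->local_tsc);
}

/* Model of the fix: update the stamp along with the TSC counter write. */
void sketch_restore_tsc(struct time_stamp_sketch *t, uint64_t new_tsc)
{
    /* (the real code would also write new_tsc to the TSC here) */
    t->local_tsc = new_tsc;
}
```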

Fix this issue by updating local TSC stamp along with TSC counter write.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agoget-maintainer.pl: Don't fall over when L: contains a display name
Lars Kurth [Fri, 17 Jan 2020 15:10:57 +0000 (16:10 +0100)]
get-maintainer.pl: Don't fall over when L: contains a display name

Prior to this change e-mail addresses of the form "display name
<email>" would result in empty output. Also see
https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00753.html

Signed-off-by: Lars Kurth <lars.kurth@citrix.com>
Reviewed-by: Julien Grall <julien@xen.org>
5 years agox86/page: Remove bifurcated PAGE_HYPERVISOR constant
Andrew Cooper [Mon, 13 Jan 2020 12:42:09 +0000 (12:42 +0000)]
x86/page: Remove bifurcated PAGE_HYPERVISOR constant

Despite being vaguely aware, the difference between PAGE_HYPERVISOR in ASM and
C code has nevertheless caused several bugs I should have known better about,
and contributed to review confusion.

There are exactly 4 uses of these constants in asm code (and one is shortly
going to disappear).

Instead of creating the constants which behave differently between ASM and C
code, expose all the constants and use non-ambiguous non-NX ones in ASM.
Adjust the hiding to just _PAGE_NX, which contains a C ternary expression.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agotools/libxc: Construct 32bit PV guests with L3 A/D bits set
Andrew Cooper [Tue, 14 Jan 2020 12:17:45 +0000 (12:17 +0000)]
tools/libxc: Construct 32bit PV guests with L3 A/D bits set

With the 32bit PAE build of Xen gone, 32bit PV guests' top level pagetables no
longer behave exactly like PAE in hardware.

They should have A/D bits set, for the same performance reasons as apply to
other levels.  This brings the domain builder in line with how Xen constructs
a 32bit dom0.

As a pure code improvement, make use of range notation to initialise
identical values in adjacent array elements.
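
For readers unfamiliar with the range notation: it is a GNU C extension (accepted by gcc and clang) that lets one designator cover several adjacent elements. A minimal sketch, with a made-up array name and flag value rather than the ones libxc uses:

```c
#include <stdint.h>

#define NR_LEVELS_SKETCH 4      /* hypothetical number of entries */
#define PROT_SKETCH      0x63u  /* hypothetical flag value */

/* GNU C range designator: all four elements get the same initialiser. */
static const uint64_t prot_sketch[NR_LEVELS_SKETCH] = {
    [0 ... NR_LEVELS_SKETCH - 1] = PROT_SKETCH,
};

/* Accessor so the table can be inspected from outside this file. */
uint64_t prot_sketch_at(unsigned int idx)
{
    return prot_sketch[idx];
}
```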

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wl@xen.org>
5 years agotools/libxl: Plumb domain_create_state down into libxl__build_pre()
Andrew Cooper [Thu, 2 Jan 2020 21:37:36 +0000 (21:37 +0000)]
tools/libxl: Plumb domain_create_state down into libxl__build_pre()

To fix CPUID handling, libxl__build_pre() is going to have to distinguish
between a brand new VM vs one which is being migrated-in/resumed.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years agogolang/xenlight: implement array Go to C marshaling
Nick Rosbrook [Sat, 4 Jan 2020 21:00:53 +0000 (16:00 -0500)]
golang/xenlight: implement array Go to C marshaling

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years agogolang/xenlight: implement keyed union Go to C marshaling
Nick Rosbrook [Sat, 4 Jan 2020 21:00:52 +0000 (16:00 -0500)]
golang/xenlight: implement keyed union Go to C marshaling

Since the C union cannot be directly populated, populate the fields of the
corresponding C struct defined in the cgo preamble, and then copy that
struct as bytes into the byte slice that Go uses as the union.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years agogolang/xenlight: begin Go to C type marshaling
Nick Rosbrook [Sat, 4 Jan 2020 21:00:51 +0000 (16:00 -0500)]
golang/xenlight: begin Go to C type marshaling

Implement conversions for basic types such as strings and integer
types in toC functions.

Modify function signatures of toC implementations for builtin
types to be consistent with the signature of the generated toC
functions.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years agox86/hvm: always expose x2APIC feature in max HVM cpuid policy
Roger Pau Monne [Tue, 24 Dec 2019 10:18:10 +0000 (11:18 +0100)]
x86/hvm: always expose x2APIC feature in max HVM cpuid policy

On hardware without x2APIC support Xen emulated local APIC will
provide such mode, and hence the feature should be set in the maximum
HVM cpuid policy.

Not exposing it in the maximum policy results in HVM domains not
getting such feature exposed unless it's also supported by the
underlying hardware.

This was regressed by c/s 3e0c8272f20 which caused x2APIC not to be enabled
unilaterally for HVM guests.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agolibxc/migration: Adjust layout of struct xc_sr_context
Andrew Cooper [Thu, 19 Dec 2019 21:19:35 +0000 (21:19 +0000)]
libxc/migration: Adjust layout of struct xc_sr_context

We are shortly going to want to introduce some common x86 fields, so having
x86_pv and x86_hvm as the top level objects is a problem.  Insert a
surrounding struct x86 and drop the x86 prefix from the pv/hvm objects.
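
The resulting shape is roughly the following. This is a sketch only: the struct name and the placeholder fields are invented for the example, and the real structure carries many more members.

```c
/*
 * Hypothetical, heavily reduced layout: a surrounding "x86" struct with
 * room for common fields, and the pv/hvm objects nested inside it.
 */
struct sr_context_sketch {
    struct {
        unsigned int common_placeholder;    /* future common x86 fields */

        union {
            struct {
                unsigned int width;         /* placeholder PV field */
            } pv;

            struct {
                unsigned long context_size; /* placeholder HVM field */
            } hvm;
        };
    } x86;
};
```

With this layout an access that used to read ctx->x86_pv.width becomes ctx->x86.pv.width.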

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years agotools/migration: Formatting and style cleanup
Andrew Cooper [Thu, 5 Dec 2019 15:57:13 +0000 (15:57 +0000)]
tools/migration: Formatting and style cleanup

The code has deviated from the prevailing style in many ways.  Adjust spacing,
indentation, position of operators, layout of multiline comments, removal of
superfluous comments, constness, trailing commas, and use of unqualified
'unsigned'.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years agoxen/vcpu: Improve sanity checks in vcpu_create()
Andrew Cooper [Wed, 15 Jan 2020 18:44:18 +0000 (18:44 +0000)]
xen/vcpu: Improve sanity checks in vcpu_create()

The BUG_ON() is confusing to follow.  The (!is_idle_domain(d) || vcpu_id) part
is a vestigial remnant of architectures poisoning idle_vcpu[0] with non-NULL
pointers.

Now that idle_vcpu[0] is NULL on all architectures, and d->max_vcpus specified
before vcpu_create() is called, we can properly range check the requested
vcpu_id.
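
A range check of this kind could look like the sketch below. The struct and function names are hypothetical, and the additional test that the slot is still empty is an assumption made for the example, not something stated in this message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant parts of struct domain. */
struct domain_sketch {
    unsigned int max_vcpus;  /* set before any vCPU is created */
    void **vcpu;             /* array of max_vcpus slots */
};

/*
 * Hypothetical helper: a requested vcpu_id is acceptable only if it is
 * within range and (assumed here) its slot has not been filled yet.
 */
bool vcpu_id_is_valid(const struct domain_sketch *d, unsigned int vcpu_id)
{
    return vcpu_id < d->max_vcpus && d->vcpu[vcpu_id] == NULL;
}
```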

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien@xen.org>
5 years agoARM/boot: Don't poison 'current' during early boot
Andrew Cooper [Wed, 15 Jan 2020 18:43:58 +0000 (18:43 +0000)]
ARM/boot: Don't poison 'current' during early boot

This logic was inherited from x86 (which was updated several times since).
Unlike x86 (at the time) however, while NULL isn't mapped in ARM, 0xfffff000
is, making this actively dangerous.

Drop the logic entirely, and leave 'current' as NULL during early boot.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien@xen.org>
5 years agoxen/arm: during efi boot, improve the check for usable memory
Stefano Stabellini [Tue, 14 Jan 2020 23:31:55 +0000 (15:31 -0800)]
xen/arm: during efi boot, improve the check for usable memory

When booting via EFI, the EFI memory map has information about memory
regions and their type. Improve the check for the type and attribute of
each memory region to figure out whether it is usable memory or not.
This patch brings the check on par with Linux v5.5-rc6 and makes more
memory reusable as normal memory by Xen (except that Linux also reuses
EFI_PERSISTENT_MEMORY, which we do not).

Specifically, this patch also reuses memory marked as
EfiLoaderCode/Data, and it uses both Attribute and Type for the check
(Attribute needs to be EFI_MEMORY_WB).
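
A simplified sketch of a check that looks at both Type and Attribute follows. The numeric values are those of the UEFI specification; the function name is hypothetical, and the exact set of accepted types here (conventional memory plus loader code/data) is an assumption based on this message, not a copy of the Xen logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Memory types and attribute bit as numbered in the UEFI specification. */
enum {
    EfiLoaderCode         = 1,
    EfiLoaderData         = 2,
    EfiConventionalMemory = 7,
};
#define EFI_MEMORY_WB 0x8ULL

/* Hypothetical helper: is this region usable as normal RAM? */
bool efi_region_is_usable(uint32_t type, uint64_t attribute)
{
    /* The region must support write-back caching ... */
    if ( !(attribute & EFI_MEMORY_WB) )
        return false;

    /* ... and be of a type that can be reused as normal memory. */
    return type == EfiConventionalMemory ||
           type == EfiLoaderCode ||
           type == EfiLoaderData;
}
```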

Reported-by: Roman Shaposhnik <roman@zededa.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Acked-by: Julien Grall <julien@xen.org>
5 years agoremove unmodified_drivers directory
Juergen Gross [Tue, 14 Jan 2020 12:34:45 +0000 (13:34 +0100)]
remove unmodified_drivers directory

Having Linux kernel drivers for 2.6 based kernels in the Xen tree is
not really needed any longer. So remove them from the tree.

In case anyone wants to look at them they are still available in
older branches.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agoRemove undocumented and unmaintained tools/memshr library
Tamas K Lengyel [Fri, 10 Jan 2020 02:30:52 +0000 (19:30 -0700)]
Remove undocumented and unmaintained tools/memshr library

The library has been largely untouched for over a decade at this point, it is
undocumented and it's unclear what it was originally used for. Remove it from
tree; if anyone needs it in the future it can be carved out from git history.

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Wei Liu <wl@xen.org>
5 years agoMAINTAINERS: adjust path of actually maintained memshr code in tools
Tamas K Lengyel [Fri, 10 Jan 2020 02:30:51 +0000 (19:30 -0700)]
MAINTAINERS: adjust path of actually maintained memshr code in tools

tools/tests/mem-sharing is also maintained under the tools folder.

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Wei Liu <wl@xen.org>
5 years agox86/boot: Rename l?_identmap to l?_directmap
Andrew Cooper [Fri, 10 Jan 2020 16:06:08 +0000 (16:06 +0000)]
x86/boot: Rename l?_identmap to l?_directmap

Since c/s faa85d4fb3 "x86/boot: Don't map 0 during boot", l1_identmap no
longer has an alias mapped at 0, meaning that none of the l?_identmap[]
pagetables are actually an identity map.

Rename them to l?_directmap, which avoids any kind of implication that they
might be mapped at 0.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agolibxc/migration: Rationalise the 'checkpointed' field to 'stream_type'
Andrew Cooper [Fri, 20 Dec 2019 16:34:16 +0000 (16:34 +0000)]
libxc/migration: Rationalise the 'checkpointed' field to 'stream_type'

Originally, 'checkpointed' was a boolean signalling the difference between a
plain and a Remus stream.  COLO was added later, but several bits of code
retained boolean-style logic.  While correct, it is confusing to follow.

Additionally, XC_MIG_STREAM_NONE means "no checkpoints" but reads as "no
stream".

Consolidate all the logic on the term 'stream_type', and rename STREAM_NONE
to STREAM_PLAIN.  Re-position the stream_type variable so it isn't
duplicated in both the save and restore unions.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years agolibxc/restore: Introduce functionality to simplify blob handling
Andrew Cooper [Wed, 18 Dec 2019 19:01:57 +0000 (19:01 +0000)]
libxc/restore: Introduce functionality to simplify blob handling

During migration, we buffer several blobs of data which ultimately need
handing back to Xen at an appropriate time.

Currently, this is all handled in an ad-hoc manner, but more blobs are soon
going to be added.  Introduce xc_sr_blob to encapsulate a ptr/size pair, and
update_blob() to handle the memory management aspects.
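
To make the ptr/size pairing concrete, here is one way such a helper might behave. Only the names xc_sr_blob and update_blob() come from this message; the body, the return convention and the rejection of empty input are assumptions made for the sketch.

```c
#include <stdlib.h>
#include <string.h>

/* A buffered blob: pointer and size kept together. */
struct xc_sr_blob {
    void *ptr;
    size_t size;
};

/*
 * Assumed behaviour: replace the blob's contents with a copy of src.
 * Returns 0 on success, -1 on bad input or allocation failure; on
 * failure the previous contents are left untouched.
 */
int update_blob(struct xc_sr_blob *blob, const void *src, size_t size)
{
    void *ptr;

    if ( !src || !size )
        return -1;

    ptr = malloc(size);
    if ( !ptr )
        return -1;

    memcpy(ptr, src, size);

    free(blob->ptr);      /* drop any previously buffered data */
    blob->ptr = ptr;
    blob->size = size;

    return 0;
}
```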

Switch the HVM_CONTEXT and the four PV_VCPU_* blobs over to this new
infrastructure.  No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years agoArm: fix build after 892b9dcebdb7
Jan Beulich [Tue, 14 Jan 2020 15:06:27 +0000 (16:06 +0100)]
Arm: fix build after 892b9dcebdb7

"IRQ: u16 is too narrow for an event channel number" introduced a use of
evtchn_port_t, but its typedef apparently surfaces indirectly here only
on x86.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agoxen/arm: Place a speculation barrier sequence following an eret instruction
Julien Grall [Thu, 19 Dec 2019 08:12:21 +0000 (08:12 +0000)]
xen/arm: Place a speculation barrier sequence following an eret instruction

Some CPUs can speculate past an ERET instruction and potentially perform
speculative accesses to memory before processing the exception return.
Since the register state is often controlled by lower privilege level
at the point of an ERET, this could potentially be used as part of a
side-channel attack.

Newer CPUs may implement a new SB barrier instruction which acts
as an architected speculation barrier. For current CPUs, the sequence
DSB; ISB is known to prevent speculation.

The latter sequence is heavier than SB but it would never be executed
(this is speculation after all!).

Introduce a new macro 'sb' that could be used when a speculation barrier
is required. For now it is using dsb; isb but this could easily be
updated to cater for SB in the future.

This is XSA-312.

Signed-off-by: Julien Grall <julien@xen.org>
5 years agodocs/misc: livepatch: Escape backslash
Julien Grall [Mon, 13 Jan 2020 22:05:31 +0000 (22:05 +0000)]
docs/misc: livepatch: Escape backslash

pandoc is currently failing to generate the pdf with the following
error:
! Undefined control sequence.
l.1048   metadata string format is: key=value\0

In this case, we want to print \0 so we need to backslash-escape the
first character.

Interestingly pandoc will not complain when creating html and will just
ignore \0 completely.

Fixes: 5083e0ff93 ("livepatch: Add metadata runtime retrieval mechanism")
Signed-off-by: Julien Grall <julien@xen.org>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
5 years agodocs/misc: pvcalls: Verbatim block should be indented with 4 spaces
Julien Grall [Sat, 11 Jan 2020 00:03:44 +0000 (00:03 +0000)]
docs/misc: pvcalls: Verbatim block should be indented with 4 spaces

At the moment, the diagram is only indented with 2 spaces. So pandoc
will try to badly interpret it and not display it correctly.

Fix it by indenting all the block by 4 spaces (i.e. an extra 2 spaces).

Fixes: d661611d08 ("docs/markdown: Switch to using pandoc, and fix underscore escaping")
Signed-off-by: Julien Grall <julien@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
5 years agotools/Rules.mk: fix distclean
Paul Durrant [Thu, 9 Jan 2020 11:15:05 +0000 (11:15 +0000)]
tools/Rules.mk: fix distclean

Running 'make distclean' under tools will currently result in:

tools/Rules.mk:245: *** You have to run ./configure before building or installing the tools.  Stop.

This patch adds 'distclean', 'subdir-distclean%' and 'subdir-clean%' to
no-configure-targets, which allows 'make distclean' to run to completion.

Fixes: 00691c6c (tools: Allow to make *-dir-force-update without ./configure)
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wl@xen.org>
5 years agoIRQ: u16 is too narrow for an event channel number
Jan Beulich [Tue, 14 Jan 2020 11:03:47 +0000 (12:03 +0100)]
IRQ: u16 is too narrow for an event channel number

FIFO event channels allow ports up to 2^17, so we need to use a wider
field in struct pirq. Move "masked" such that it may share the 8-byte
slot with struct arch_pirq on 64-bit arches, rather than leaving a
7-byte hole in all cases.
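
The truncation problem is easy to demonstrate in isolation. The helper below is hypothetical and merely shows what happens when a port number above 65535 is squeezed into a 16-bit field.

```c
#include <stdint.h>

/* Hypothetical helper: store a port number in a u16, as the old field did. */
uint16_t store_port_u16(uint32_t port)
{
    return (uint16_t)port;  /* silently drops bits 16 and up */
}
```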

Take the opportunity and also add a comment regarding "arch" placement
within the structure.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agoCODING_STYLE: Document how to handle unexpected conditions
George Dunlap [Mon, 9 Dec 2019 11:12:07 +0000 (11:12 +0000)]
CODING_STYLE: Document how to handle unexpected conditions

It's not always clear what the best way is to handle unexpected
conditions: whether with ASSERT(), domain_crash(), BUG_ON(), or some
other method.  All methods have a risk of introducing security
vulnerabilities and unnecessary instabilities to production systems.

Provide guidelines for different options and when to use them.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien@xen.org>
5 years agoxen/sched: rt: Fix typo in a comment
Julien Grall [Fri, 10 Jan 2020 11:28:07 +0000 (11:28 +0000)]
xen/sched: rt: Fix typo in a comment

Signed-off-by: Julien Grall <julien@xen.org>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
5 years agoMAINTAINERS: Update my email address
Paul Durrant [Fri, 10 Jan 2020 08:54:37 +0000 (08:54 +0000)]
MAINTAINERS: Update my email address

It is now more convenient for me to use my Amazon address.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
5 years agox86/boot: Drop INVALID_VCPU
Andrew Cooper [Sat, 28 Dec 2019 15:01:00 +0000 (15:01 +0000)]
x86/boot: Drop INVALID_VCPU

Now that NULL will fault at boot, there is no need for a special constant to
signify "current not set up yet".

Since c/s fae249d23413 "x86/boot: Rationalise stack handling during early
boot", the BSP cpu_info block is now consistently zero, so drop the adjacent
re-zeroing.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/boot: Don't map 0 during boot
Andrew Cooper [Mon, 6 Jan 2020 13:37:41 +0000 (13:37 +0000)]
x86/boot: Don't map 0 during boot

In particular, it causes accidental NULL pointer dereferences to go unnoticed.

The majority of the early operation takes place either in Real mode, or
Protected Unpaged mode.  The only bit which requires pagetable mappings is the
trampoline transition into Long mode and jump to the higher mappings, so there
is no need for the whole bottom 2M to be mapped.

Introduce a new l1_bootmap in .init.data, and use it instead of l1_identmap.
The EFI boot path doesn't pass through the trampoline, so doesn't need any
adjustment.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/boot: Clean up l?_bootmap[] construction
Andrew Cooper [Mon, 6 Jan 2020 13:37:54 +0000 (13:37 +0000)]
x86/boot: Clean up l?_bootmap[] construction

The need for Xen to be identity mapped into the bootmap is not obvious, and
differs between the MB and EFI boot paths.

The EFI side is further complicated by an attempt to cope with l2_bootmap
only being 4k long.  This is undocumented, confusing, and only works if Xen is
the single object wanting mapping.

The pagetables are common to both the MB and EFI builds, so simplify the EFI
bootmap construction code to make exactly one identity-map of Xen, which now
makes the two paths consistent.  Comment both pieces of logic, explaining what
the mappings are needed for.

Finally, leave a linker assert covering the fact that plenty of code blindly
assumes that Xen is less than 16M.  This wants fixing in due course.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/boot: Remove the preconstructed low 16M superpage mappings
Andrew Cooper [Sat, 28 Dec 2019 14:20:59 +0000 (14:20 +0000)]
x86/boot: Remove the preconstructed low 16M superpage mappings

These are left over from c/s b2804422 "x86: make Xen early boot code
relocatable", which made it possible for Xen not to be in the bottom 16M.

Nothing uses the mappings any more.  Build them in the directmap when walking
the E820 table along with everything else.

Furthermore, it is undefined to have superpages and MTRRs disagree on
cacheability boundaries, and nothing actually checks.  While we don't fix this
explicitly, we do at least honour the E820 now if it says there are boundaries
in this range.

As a consequence, there are now no _PAGE_PRESENT entries between
__page_tables_{start,end} which need to skip relocation.  This simplifies the
MB1/2 entry path logic to remove the l2_identmap[] special case.

The low 2M (using 4k pages) is retained for now.  Amongst other things, it
matters for console logging while the legacy VGA hole is in use.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/boot: Rationalise stack handling during early boot
Andrew Cooper [Wed, 8 Jan 2020 13:36:42 +0000 (13:36 +0000)]
x86/boot: Rationalise stack handling during early boot

The top (numerically higher addresses) of cpu0_stack[] contains the BSP's
cpu_info block.  Logic in Xen expects this to be initialised to 0, but this
area of stack is also used during early boot.

Update the head.S code to avoid using the cpu_info block.  Additionally,
update the stack_start variable to match, which avoids __high_start() and
efi_arch_post_exit_boot() needing to make the adjustment manually.

Finally, leave a big warning by the BIOS BSS initialisation, because it is by
no means obvious that the stack doesn't survive the REP STOS.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/MCE: correct struct mcinfo_extended for compat guests
Jan Beulich [Thu, 9 Jan 2020 10:09:02 +0000 (11:09 +0100)]
x86/MCE: correct struct mcinfo_extended for compat guests

The use of any kind of pointers in the public interface is wrong,
including dimensioning arrays based on the size of pointers. The least
bad option of addressing the issue looks to be to pin down the number
that the (64-bit) hypervisor has used anyway (even when passing
information to compat but privileged guests). There aren't actual
instantiations of the structure apart from ones allocated dynamically
out of struct mc_info's mi_data[], which is entirely controlled by the
hypervisor.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86/MCE: avoid leaking stack data
Jan Beulich [Thu, 9 Jan 2020 10:08:29 +0000 (11:08 +0100)]
x86/MCE: avoid leaking stack data

While HYPERVISOR_mca is a privileged operation, we still shouldn't leak
stack contents (the tail of every array entry's mc_msrvalues[] of
XEN_MC_physcpuinfo output). Simply use a zeroing allocation here.
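
The general pattern, sketched in userspace C with calloc() standing in for a zeroing hypervisor allocator: when only the head of each array entry gets filled in, a zeroing allocation guarantees the unused tail holds zeroes rather than stale data. The struct, its size and the helper names are hypothetical.

```c
#include <stdlib.h>

#define NR_MSR_SLOTS 8  /* hypothetical array size */

/* Hypothetical per-CPU record with a partially used value array. */
struct cpu_log_sketch {
    unsigned int nr_msrs;
    unsigned long msr_values[NR_MSR_SLOTS];
};

/* Zeroing allocation: unused tails never expose old memory contents. */
struct cpu_log_sketch *alloc_cpu_logs(size_t nr)
{
    return calloc(nr, sizeof(struct cpu_log_sketch));
}

/* Fill only the first "nr" slots, leaving the tail as allocated. */
void fill_cpu_log(struct cpu_log_sketch *log, unsigned int nr)
{
    unsigned int i;

    if ( nr > NR_MSR_SLOTS )
        nr = NR_MSR_SLOTS;

    for ( i = 0; i < nr; i++ )
        log->msr_values[i] = 100 + i;  /* made-up sample values */

    log->nr_msrs = nr;
}
```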

Take the occasion and also restrict the involved local variable's scope.

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86: clear per cpu stub page information in cpu_smpboot_free()
Juergen Gross [Thu, 9 Jan 2020 10:07:38 +0000 (11:07 +0100)]
x86: clear per cpu stub page information in cpu_smpboot_free()

cpu_smpboot_free() removes the stubs for the cpu going offline, but it
isn't clearing the related percpu variables. This will result in
crashes when a stub page is released due to all related cpus gone
offline and one of those cpus going online later.

Fix that by clearing stubs.addr and stubs.mfn in order to allocate a
new stub page when needed, irrespective of whether the CPU gets parked
or removed.

Fixes: 2e6c8f182c9c50 ("x86: distinguish CPU offlining from CPU removal")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Tao Xu <tao3.xu@intel.com>
5 years agox86/boot: Simplify BSS zeroing
Andrew Cooper [Wed, 8 Jan 2020 13:11:13 +0000 (13:11 +0000)]
x86/boot: Simplify BSS zeroing

There is no need to load a non-flat %es to zero the BSS.  Use sym_esi()
instead, which is easier to follow, faster (avoids two segment loads) and
doesn't require use of the stack.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/boot: Map the trampoline as read-only
Andrew Cooper [Mon, 6 Jan 2020 13:36:30 +0000 (13:36 +0000)]
x86/boot: Map the trampoline as read-only

c/s ec92fcd1d08, which caused the trampoline GDT Access bits to be set,
removed the final writes which occurred between enabling paging and switching
to the high mappings.  There don't plausibly need to be any memory writes in
the few instructions it takes to perform this transition.

As a consequence, we can remove the RWX mapping of the trampoline.  It is RX
via its identity mapping below 1M, and RW via the directmap.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/boot: Check for E820_RAM earlier when searching the E820
Andrew Cooper [Sat, 28 Dec 2019 14:41:11 +0000 (14:41 +0000)]
x86/boot: Check for E820_RAM earlier when searching the E820

There is no point performing the masking calculations if we are going to
throw the result away.
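
The reordering amounts to testing the cheap condition first. A minimal sketch with hypothetical names (the real search computes something different from the aligned end address used here; E820 type 1 denotes usable RAM):

```c
#include <stdint.h>

#define E820_RAM         1        /* usable RAM in the E820 map */
#define PAGE_SIZE_SKETCH 4096ULL  /* assumed page size for the masking */

/* Hypothetical reduced E820 entry. */
struct e820_entry_sketch {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Highest page-aligned end address among RAM entries, or 0 if none. */
uint64_t highest_ram_end(const struct e820_entry_sketch *map, unsigned int nr)
{
    uint64_t best = 0;

    for ( unsigned int i = 0; i < nr; i++ )
    {
        uint64_t end;

        /* Check the type first: no masking work for non-RAM entries. */
        if ( map[i].type != E820_RAM )
            continue;

        end = (map[i].addr + map[i].size) & ~(PAGE_SIZE_SKETCH - 1);
        if ( end > best )
            best = end;
    }

    return best;
}
```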

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agoMAINTAINERS: fix malformed entry
Juergen Gross [Wed, 8 Jan 2020 16:57:16 +0000 (17:57 +0100)]
MAINTAINERS: fix malformed entry

MAINTAINERS entries tagged with "L:" should have a pure mail address
as the second word. Fix a malformed entry. Otherwise add_maintainers.pl
will produce an empty "Cc:" line.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago xen/spinlock: disable spinlock debugging in console_force_unlock()
Juergen Gross [Wed, 8 Jan 2020 10:43:24 +0000 (11:43 +0100)]
xen/spinlock: disable spinlock debugging in console_force_unlock()

console_force_unlock() might result in a subsequent ASSERT() triggering
when CONFIG_DEBUG_LOCKS is active. Avoid that by calling
spin_debug_disable() in console_force_unlock() and make the spinlock
debug assertions trigger only if spin_debug is active.
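
As a rough illustration of the idea, the sketch below gates a lock
consistency assertion on a debug counter and drops that counter when the lock
is forcibly released. All names carrying a _sketch suffix, the plain int
counter and the function bodies are assumptions for illustration; they are
not the Xen implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the spin_debug state: > 0 means active. */
static int spin_debug_sketch = 1;

void spin_debug_disable_sketch(void)
{
    spin_debug_sketch--;
}

bool spin_debug_active_sketch(void)
{
    return spin_debug_sketch > 0;
}

/* The debug assertion only fires while spinlock debugging is active. */
void check_lock_sketch(bool lock_state_consistent)
{
    assert(!spin_debug_active_sketch() || lock_state_consistent);
}

/* Hypothetical forced unlock: disable debugging before breaking the lock. */
void console_force_unlock_sketch(void)
{
    spin_debug_disable_sketch();
    /* ... the console lock would be forcibly reinitialised here ... */
}
```

After console_force_unlock_sketch() has run, check_lock_sketch(false) no
longer aborts, which mirrors the behaviour the commit message describes.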

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/boot: boot_vid_mode doesn't need to be global
Andrew Cooper [Tue, 7 Jan 2020 12:12:51 +0000 (12:12 +0000)]
x86/boot: boot_vid_mode doesn't need to be global

AFAICT, it has never had an external user since its introduction.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/mem_sharing: Fix RANDCONFIG build
Andrew Cooper [Tue, 7 Jan 2020 13:41:40 +0000 (13:41 +0000)]
x86/mem_sharing: Fix RANDCONFIG build

Travis reports: https://travis-ci.org/andyhhp/xen/jobs/633751811

  mem_sharing.c:361:13: error: 'rmap_has_entries' defined but not used [-Werror=unused-function]
   static bool rmap_has_entries(const struct page_info *page)
               ^
  cc1: all warnings being treated as errors

This happens in a release build (disables MEM_SHARING_AUDIT) when
CONFIG_MEM_SHARING is enabled.

Expand both trivial helpers into their single callsite.
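
The failure mode and the fix can be sketched as follows. The struct, the
MEM_SHARING_AUDIT_SKETCH macro and audit_page_sketch() are hypothetical
stand-ins, not the real mem_sharing.c code. A static helper whose only
caller is compiled out becomes unused, and -Werror=unused-function turns that
into a build failure; folding the expression into the one caller removes the
problem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the page structure; only a count is modelled. */
struct page_sketch {
    unsigned int rmap_count;
};

/*
 * Before (shown as a comment): the helper's only caller lives in the
 * #ifdef'd audit code, so with auditing compiled out it is unused:
 *
 *   static bool rmap_has_entries(const struct page_sketch *page)
 *   {
 *       return page->rmap_count != 0;
 *   }
 */

/* After: the trivial expression is expanded at its single callsite. */
bool audit_page_sketch(const struct page_sketch *page)
{
    assert(page != NULL);

#ifdef MEM_SHARING_AUDIT_SKETCH
    /* Audit-only check, present only when auditing is compiled in. */
    if (page->rmap_count == 0)
        return false;
#endif

    return true;
}
```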

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tamas K Lengyel <tamas@tklengyel.com>

5 years ago tools: Allow to make *-dir-force-update without ./configure
Anthony PERARD [Thu, 19 Dec 2019 14:42:16 +0000 (14:42 +0000)]
tools: Allow to make *-dir-force-update without ./configure

This also allows running `make src-tarball` without first having to run
`./configure`.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Wei Liu <wl@xen.org>