xenbits.xensource.com Git - xen.git/log

8 years agox86emul: correct stub invocation constraints
Jan Beulich [Wed, 26 Apr 2017 07:49:24 +0000 (09:49 +0200)]
x86emul: correct stub invocation constraints

Stub invocations need to have the space the stub occupies as an input,
to prevent the compiler from re-ordering (or omitting) writes to it.
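
As a hedged, minimal illustration of the point (the function, buffer and sizes below are made up, not the emulator's code), listing the stub bytes as an "m" input ties the preceding writes to the call:

    #include <string.h>

    #define STUB_BUF_SIZE 16

    /* Sketch only: assumes 'stub' points at executable memory; register
     * clobbers of the called code are elided for brevity. */
    static void invoke_stub(unsigned char *stub)
    {
        static const unsigned char ret_insn[] = { 0xc3 }; /* ret */

        memcpy(stub, ret_insn, sizeof(ret_insn)); /* write the stub body */

        /*
         * The "m" input covering the whole stub buffer prevents the compiler
         * from reordering or omitting the memcpy() above relative to the call.
         */
        asm volatile ( "call *%[entry]"
                       :: [entry] "r" (stub),
                          "m" (*(const unsigned char (*)[STUB_BUF_SIZE])stub) );
    }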

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86/32on64: properly honor add-to-physmap-batch's size
Jan Beulich [Wed, 26 Apr 2017 07:48:45 +0000 (09:48 +0200)]
x86/32on64: properly honor add-to-physmap-batch's size

Commit 407a3c00ff ("compat/memory: fix build with old gcc") "fixed" a
build issue by switching to the use of uninitialized data. Due to
- the bounding of the uninitialized data item
- the accessed area being outside of Xen space
- arguments being properly verified by the native hypercall function
this is not a security issue.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agodocs: fix iommu_inclusive_mapping documentation
Roger Pau Monné [Wed, 26 Apr 2017 07:48:07 +0000 (09:48 +0200)]
docs: fix iommu_inclusive_mapping documentation

iommu_inclusive_mapping is enabled by default (and has been for a long time).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Also "correct" -> "incorrect" in the text.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
8 years agodmop: add xendevicemodel_modified_memory_bulk()
Jennifer Herbert [Wed, 26 Apr 2017 07:47:30 +0000 (09:47 +0200)]
dmop: add xendevicemodel_modified_memory_bulk()

This new devicemodel library call allows multiple extents of pages to be
marked as modified in a single call.  This is something needed for a
use case I'm working on.

The Xen side of the modified_memory call has been changed to accept
an array of extents.  The devicemodel library either provides an array
of length 1, to support the original library function, or exposes a second
function that allows callers to provide an array.
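
A hedged usage sketch of the new call (the extent structure name and fields are assumptions derived from the description above; dmod, domid and rc are presumed to exist in the caller):

    struct xen_dm_op_modified_memory_extent extents[] = {
        { .first_pfn = 0x100, .nr = 16 },   /* first dirty range  */
        { .first_pfn = 0x800, .nr =  4 },   /* second dirty range */
    };

    /* Mark both extents as modified in a single hypercall. */
    rc = xendevicemodel_modified_memory_bulk(dmod, domid, extents, 2);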

Signed-off-by: Jennifer Herbert <Jennifer.Herbert@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agohvm/dmop: implement COPY_{TO,FROM}_GUEST_BUF_OFFSET() helpers
Andrew Cooper [Wed, 26 Apr 2017 07:46:57 +0000 (09:46 +0200)]
hvm/dmop: implement COPY_{TO,FROM}_GUEST_BUF_OFFSET() helpers

copy_{to,from}_guest_buf() are now implemented using an offset of 0.
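
A hedged sketch of the layering being described (macro bodies and the helper name are illustrative, not the actual implementation): the plain form becomes the offset-0 case of the offset variant, with the copy size taken from the destination object:

    /* Illustrative only; real argument lists differ. */
    #define COPY_FROM_GUEST_BUF_OFFSET(dst, args, buf_idx, offset) \
        copy_buf_from_guest(args, buf_idx, offset, &(dst), sizeof(dst))

    #define COPY_FROM_GUEST_BUF(dst, args, buf_idx) \
        COPY_FROM_GUEST_BUF_OFFSET(dst, args, buf_idx, 0)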

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jennifer Herbert <Jennifer.Herbert@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
This review extends only to the functionality here, specifically not to
the use of all-upper-case names for the macros:

Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agohvm/dmop: implement COPY_{TO,FROM}_GUEST_BUF() in terms of raw accessors
Andrew Cooper [Wed, 26 Apr 2017 07:40:22 +0000 (09:40 +0200)]
hvm/dmop: implement COPY_{TO,FROM}_GUEST_BUF() in terms of raw accessors

This also allows the usual cases to be simplified, by omitting an unnecessary
buf parameter, and because the macros can appropriately size the object.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jennifer Herbert <Jennifer.Herbert@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
This review extends only to the functionality here, specifically not to
the use of all-upper-case names for the macros:

Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agohvm/dmop: make copy_buf_{from, to}_guest for a buffer not big enough an error
Jennifer Herbert [Wed, 26 Apr 2017 07:40:00 +0000 (09:40 +0200)]
hvm/dmop: make copy_buf_{from, to}_guest for a buffer not big enough an error

This makes copying to or from a buffer that isn't big enough an error.
If the buffer isn't big enough, trying to carry on regardless
can only cause trouble later on.

Signed-off-by: Jennifer Herbert <Jennifer.Herbert@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agohvm/dmop: box dmop_args rather than passing multiple parameters around
Jennifer Herbert [Wed, 26 Apr 2017 07:39:14 +0000 (09:39 +0200)]
hvm/dmop: box dmop_args rather than passing multiple parameters around

No functional change.

Signed-off-by: Jennifer Herbert <Jennifer.Herbert@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago8250-uart: Fix typo in the header
Julien Grall [Tue, 25 Apr 2017 12:36:51 +0000 (13:36 +0100)]
8250-uart: Fix typo in the header

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
8 years agox86/mm: Add missing newline to a printk() in get_page_from_l1e()
Andrew Cooper [Mon, 24 Apr 2017 17:07:20 +0000 (18:07 +0100)]
x86/mm: Add missing newline to a printk() in get_page_from_l1e()

This avoids the log message being followed by

  <G><1>mm.c:5374:d0v0 could not get_page_from_l1e()

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agoxen.h: fix comment for vcpu_guest_context
Wei Liu [Mon, 10 Apr 2017 10:28:01 +0000 (11:28 +0100)]
xen.h: fix comment for vcpu_guest_context

Use the correct vcpuop name and delete one trailing white space.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
8 years agoxen/arm: Don't unflatten DT when booting with ACPI
Punit Agrawal [Fri, 21 Apr 2017 13:12:54 +0000 (14:12 +0100)]
xen/arm: Don't unflatten DT when booting with ACPI

Unflattening the device tree when booting with "acpi=force" leads to the
following stack trace on the AMD Seattle platform:

(XEN) Xen call trace:
(XEN)    [<0000000000204bfc>] dt_irq_translate+0x48/0x58 (PC)
(XEN)    [<0000000000204f5c>] dt_device_get_irq+0x34/0x38 (LR)
(XEN)    [<0000000000251a08>] platform_get_irq+0x14/0x44
(XEN)    [<00000000002952bc>] smmu.c#arm_smmu_dt_init+0x190/0x100c
(XEN)    [<0000000000299310>] device_init+0xa8/0xdc
(XEN)    [<00000000002950f0>] iommu_hardware_setup+0x34/0x68
(XEN)    [<0000000000294ef0>] iommu_setup+0x48/0x1c8
(XEN)    [<000000000029cecc>] start_xen+0xb94/0xd34
(XEN)    [<00000083fbba91dc>] 00000083fbba91dc

The problem arises due to the unflattened device tree being
unconditionally used in iommu_hardware_setup().

Let's rearrange the code, without changing the boot order, to unflatten the
device tree only when ACPI is disabled.
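
A hedged sketch of the intended condition (the function and flag names below follow existing Xen code, but the exact call site is an assumption):

    /* Only build the unflattened tree when it is actually going to be used. */
    if ( acpi_disabled )
        dt_unflatten_host_device_tree();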

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agox86/hvm: Corrections and improvements to unhandled vmexit logging
Andrew Cooper [Wed, 19 Apr 2017 15:56:32 +0000 (16:56 +0100)]
x86/hvm: Corrections and improvements to unhandled vmexit logging

 * Use gprintk rather than gdprintk.  These logging messages shouldn't
   disappear in release builds, as they usually happen immediately before a
   domain crash.  Raise them from WARNING to ERR.
 * Format the vmexit reason in the same base as is used in the vendor
   manuals (decimal for Intel, hex for AMD), and consistently use 0x for hex
   numbers.
 * Consistently use "Unexpected vmexit" terminology.

In particular, this corrects the information printed for nested VT-x, and
actually prints information for nested SVM.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agocorrect rcu_unlock_domain()
Jan Beulich [Fri, 21 Apr 2017 10:10:51 +0000 (12:10 +0200)]
correct rcu_unlock_domain()

Match rcu_lock_domain(), and remove the slightly misleading comment:
This isn't just the companion to rcu_lock_domain_by_id() (and that
latter function indeed also keeps the domain locked, not the domain
list).

No functional change, as rcu_read_{,un}lock() ignore their arguments
anyway.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86/vlapic: Don't reset APIC ID when handling INIT signal
Chao Gao [Fri, 21 Apr 2017 10:09:48 +0000 (12:09 +0200)]
x86/vlapic: Don't reset APIC ID when handling INIT signal

According to the SDM, "ADVANCED PROGRAMMABLE INTERRUPT CONTROLLER (APIC)" ->
"EXTENDED XAPIC (X2APIC)" -> "x2APIC State Transitions", the APIC mode
and APIC ID are preserved when handling an INIT signal, while a reset places
the APIC in xAPIC mode and sets the APIC base address to 0xFEE00000 (this part
is in "Local APIC" -> "Local APIC Status and Location"). So there are
two problems in the current code:
1. Using reset logic (aka vlapic_reset) to handle the INIT signal.
2. Forgetting to reset the APIC mode and base address in vlapic_reset().

This patch introduces a new function, vlapic_do_init(), and replaces the
wrongly used vlapic_reset(). Also reset the APIC mode and APIC base address
in vlapic_reset().

Note that LDR is read-only in x2APIC mode, so resetting it to zero there
would be unreasonable. This patch therefore also doesn't reset LDR when
handling an INIT signal in x2APIC mode.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agoxen/arm: Properly map the FDT in the boot page table
Julien Grall [Thu, 20 Apr 2017 15:12:28 +0000 (16:12 +0100)]
xen/arm: Properly map the FDT in the boot page table

Currently, Xen is assuming the FDT will always fit in a 2MB section.
Recently, I noticed an early crash on Xen when using GRUB with the
following call trace:

(XEN) Hypervisor Trap. HSR=0x96000006 EC=0x25 IL=1 Syndrome=0x6
(XEN) CPU0: Unexpected Trap: Hypervisor
(XEN) ----[ Xen-4.9-unstable  arm64  debug=y   Not tainted ]----
(XEN) CPU:    0
(XEN) PC:     0000000000264140 strlen+0x10/0x84
(XEN) LR:     00000000002401c0
(XEN) SP:     00000000002cfc20
(XEN) CPSR:   400003c9 MODE:64-bit EL2h (Hypervisor, handler)
(XEN)      X0: 0000000000801230  X1: 0000000000801230  X2: 0000000000005230
(XEN)      X3: 0000000000000030  X4: 0000000000000030  X5: 0000000000000038
(XEN)      X6: 0000000000000034  X7: 0000000000000000  X8: 7f7f7f7f7f7f7f7f
(XEN)      X9: 64622c6479687222 X10: 7f7f7f7f7f7f7f7f X11: 0101010101010101
(XEN)     X12: 0000000000000030 X13: ffffff00ff000000 X14: 0800000003000000
(XEN)     X15: ffffffffffffffff X16: 00000000fefff610 X17: 00000000000000f0
(XEN)     X18: 0000000000000004 X19: 0000000000000008 X20: 00000000007fc040
(XEN)     X21: 00000000007fc000 X22: 000000000000000e X23: 0000000000000000
(XEN)     X24: 00000000002a9f58 X25: 0000000000801230 X26: 00000000002a9f68
(XEN)     X27: 00000000002a9f58 X28: 0000000000298910  FP: 00000000002cfc20
(XEN)
(XEN)   VTCR_EL2: 80010c40
(XEN)  VTTBR_EL2: 0000082800203000
(XEN)
(XEN)  SCTLR_EL2: 30c5183d
(XEN)    HCR_EL2: 000000000038663f
(XEN)  TTBR0_EL2: 00000000f4912000
(XEN)
(XEN)    ESR_EL2: 96000006
(XEN)  HPFAR_EL2: 00000000e8071000
(XEN)    FAR_EL2: 0000000000801230
(XEN)
(XEN) Xen stack trace from sp=00000000002cfc20:
(XEN)    00000000002cfc70 0000000000240254 00000000002a9f58 00000000007fc000
(XEN)    0000000000000000 0000000000000000 0000000000000000 00000000007fc03c
(XEN)    00000000002cfd78 0000000000000000 00000000002cfca0 00000000002986fc
(XEN)    0000000000000000 00000000007fc000 0000000000000000 0000000000000000
(XEN)    00000000002cfcc0 0000000000298f1c 0000000000000000 00000000007fc000
(XEN)    00000000002cfdc0 000000000029904c 00000000f47fc000 00000000f4604000
(XEN)    00000000f47fc000 00000000007fc000 0000000000400000 0000000000000100
(XEN)    00000000f4604000 0000000000000001 0000000000000001 8000000000000002
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 00000000002cfdc0 0000000000299038
(XEN)    00000000f47fc000 00000000f4604000 00000000f47fc000 0000000000000000
(XEN)    00000000002cfe20 000000000029c420 00000000002d8000 00000000f4604000
(XEN)    00000000f47fc000 0000000000000000 0000000000400000 0000000000000100
(XEN)    00000000f4604000 0000000000000001 00000000f47fc000 000000000029c404
(XEN)    00000000fefff510 0000000000200624 00000000f4804000 00000000f4604000
(XEN)    00000000f47fc000 0000000000000000 0000000000400000 0000000000000100
(XEN)    0000000000000001 0000000000000001 0000000000000001 8000000000000002
(XEN)    00000000f47fc000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<0000000000264140>] strlen+0x10/0x84 (PC)
(XEN)    [<00000000002401c0>] fdt_get_property_namelen+0x9c/0xf0 (LR)
(XEN)    [<0000000000240254>] fdt_get_property+0x40/0x50
(XEN)    [<00000000002986fc>] bootfdt.c#device_tree_get_u32+0x18/0x5c
(XEN)    [<0000000000298f1c>] device_tree_for_each_node+0x84/0x144
(XEN)    [<000000000029904c>] boot_fdt_info+0x70/0x23c
(XEN)    [<000000000029c420>] start_xen+0x9c/0xd30
(XEN)    [<0000000000200624>] arm64/head.o#paging+0x84/0xbc
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) CPU0: Unexpected Trap: Hypervisor
(XEN)
(XEN) ****************************************

Indeed, the booting documentation for AArch32 and AArch64 only requires
the FDT to be placed on an 8-byte boundary. This means the Device-Tree can
cross a 2MB boundary.

Given that Xen limits the size of the FDT to 2MB, it will always fit in
a 4MB slot. So extend the fixmap slot for FDT from 2MB to 4MB.

The second 2MB superpage will only be mapped if the FDT crosses the 2MB
boundary.
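
A hedged sketch of the resulting layout (the constant names appear in the related commits below; the values and the base define are assumptions about the exact headers):

    #define BOOT_FDT_SLOT_SIZE  MB(4)
    #define BOOT_FDT_VIRT_END   (BOOT_FDT_VIRT_START + BOOT_FDT_SLOT_SIZE)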

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: Check if the FDT passed by the bootloader is valid
Julien Grall [Thu, 20 Apr 2017 15:12:27 +0000 (16:12 +0100)]
xen/arm: Check if the FDT passed by the bootloader is valid

There is currently no sanity check on the FDT passed by the bootloader.
Whilst such checks are strictly not necessary, they will save us from spending
hours trying to find out why booting does not work.

From the booting documentation for AArch32 [1] and AArch64 [2], the FDT must:
    - be placed on 8-byte boundary
    - not exceed 2MB (only on AArch64)

Even though AArch32 does not seem to limit the size, Xen is currently unable
to support an FDT larger than 2MB. It is better to fail with a clear error
message than to claim we support an FDT of any size.

The checks are mostly borrowed from the Linux code (see fixmap_remap_fdt
in arch/arm64/mm/mmu.c).

[1] Section 4b in linux/Documentation/arm/Booting
[2] Section 2 in linux/Documentation/arm64/booting.txt
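
A hedged sketch of the kind of check described (loosely modelled on Linux's fixmap_remap_fdt(); the function name and exact limit handling are illustrative):

    static bool fdt_is_usable(paddr_t fdt_paddr, size_t fdt_size)
    {
        if ( fdt_paddr & 0x7 )      /* must sit on an 8-byte boundary */
            return false;

        if ( fdt_size > MB(2) )     /* Xen currently caps the FDT at 2MB */
            return false;

        return true;
    }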

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: Move the code to map FDT in the boot tables from assembly to C
Julien Grall [Thu, 20 Apr 2017 15:12:26 +0000 (16:12 +0100)]
xen/arm: Move the code to map FDT in the boot tables from assembly to C

The FDT will not be accessed before start_xen (beginning of the C code) is
called, and the code will be easier to maintain as it can be common
between AArch32 and AArch64.

A new function early_fdt_map is introduced to map the FDT in the boot
page table.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: mm: Move create_mappings function earlier in the file
Julien Grall [Thu, 20 Apr 2017 15:12:25 +0000 (16:12 +0100)]
xen/arm: mm: Move create_mappings function earlier in the file

This function will be called by other functions later on. This avoids
a forward declaration and keeps the new function close to its sibling
ones.

This was moved just after *_fixmap helpers as they are page table
handling functions too.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: Add BOOT_FDT_VIRT_END and BOOT_FDT_SLOT_SIZE
Julien Grall [Thu, 20 Apr 2017 15:12:24 +0000 (16:12 +0100)]
xen/arm: Add BOOT_FDT_VIRT_END and BOOT_FDT_SLOT_SIZE

The 2 new defines will help to avoid hardcoding the size and the end of
the slot in the code.

The newlines are added for clarity.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/kexec: remove spinlock now that all KEXEC hypercall ops are protected at the...
Eric DeVolder [Wed, 19 Apr 2017 21:01:49 +0000 (16:01 -0500)]
xen/kexec: remove spinlock now that all KEXEC hypercall ops are protected at the top-level

The spinlock in kexec_swap_images() is removed: this function is only
reachable via the kexec hypercall, which is now protected at the top level
in do_kexec_op_internal(), so the local spinlock is no longer necessary.

Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agoxen/kexec: use hypercall_create_continuation to protect KEXEC ops
Eric DeVolder [Wed, 19 Apr 2017 21:01:48 +0000 (16:01 -0500)]
xen/kexec: use hypercall_create_continuation to protect KEXEC ops

When we concurrently try to unload and load crash
images we eventually get:

 Xen call trace:
    [<ffff82d08018b04f>] machine_kexec_add_page+0x3a0/0x3fa
    [<ffff82d08018b184>] machine_kexec_load+0xdb/0x107
    [<ffff82d080116e8d>] kexec.c#kexec_load_slot+0x11/0x42
    [<ffff82d08011724f>] kexec.c#kexec_load+0x119/0x150
    [<ffff82d080117c1e>] kexec.c#do_kexec_op_internal+0xab/0xcf
    [<ffff82d080117c60>] do_kexec_op+0xe/0x1e
    [<ffff82d08025c620>] pv_hypercall+0x20a/0x44a
    [<ffff82d080260116>] cpufreq.c#test_all_events+0/0x30

 Pagetable walk from ffff820040088320:
  L4[0x104] = 00000002979d1063 ffffffffffffffff
  L3[0x001] = 00000002979d0063 ffffffffffffffff
  L2[0x000] = 00000002979c7063 ffffffffffffffff
  L1[0x088] = 80037a91ede97063 ffffffffffffffff

The interesting thing is that the page bits (063) look legit.

The operation on which we blow up is trying to write
the L1 entry and finding that the L2 entry points to some
bizarre MFN. It smells of a race, and it looks like
the issue is due to the lack of concurrency locking when dealing
with the crash kernel space.

Specifically we concurrently call kimage_alloc_crash_control_page
which iterates over the kexec_crash_area.start -> kexec_crash_area.size
and once found:

  if ( page )
  {
      image->next_crash_page = hole_end;
      clear_domain_page(_mfn(page_to_mfn(page)));
  }

clears the page. Since the parameters of what MFN to use are provided
by the callers (and the area to search is bounded), the 'page'
is probably the same. So #1: we concurrently clear the
'control_code_page'.

The next step is us passing this 'control_code_page' to
machine_kexec_add_page. This function requires the MFNs:
page_to_maddr(image->control_code_page).

And this would always return the same virtual address, as
the MFN of the control_code_page is inside of the
kexec_crash_area.start -> kexec_crash_area.size area.

Then machine_kexec_add_page() updates the L1 entry, which can be done
concurrently, and on subsequent calls we mangle it up.

This is all a theory at this time, but testing reveals
that adding the hypercall_create_continuation() at the
kexec hypercall fixes the crash.

NOTE: This patch follows 5c5216 (kexec: clear kexec_image slot
when unloading kexec image) to prevent crashes during
simultaneous load/unloads.

NOTE: Consideration was given to using the existing flag
KEXEC_FLAG_IN_PROGRESS to denote a kexec hypercall in
progress. This, however, overloads the original intent of
the flag which is to denote that we are about-to/have made
the jump to the crash path. The overloading would lead to
failures in existing checks on this flag as the flag would
always be set at the top level in do_kexec_op_internal().
For this reason, the new flag KEXEC_FLAG_HC_IN_PROGRESS
was introduced.

While at it, fix the mismatched spacing of the #defines.
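
A hedged sketch of the top-level serialisation pattern (the flag name comes from this commit; the helper name kexec_do_op() and the continuation format string are assumptions):

    if ( test_and_set_bit(KEXEC_FLAG_HC_IN_PROGRESS, &kexec_flags) )
        /* Another kexec hypercall is in flight: retry via a continuation. */
        return hypercall_create_continuation(__HYPERVISOR_kexec_op,
                                             "lh", op, uarg);

    ret = kexec_do_op(op, uarg);    /* load/unload/exec/... (illustrative) */

    clear_bit(KEXEC_FLAG_HC_IN_PROGRESS, &kexec_flags);
    return ret;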

Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86/microcode: Use the return value from early_microcode_update_cpu
Ross Lagerwall [Thu, 20 Apr 2017 13:18:00 +0000 (14:18 +0100)]
x86/microcode: Use the return value from early_microcode_update_cpu

Use the return value from early_microcode_update_cpu rather than
ignoring it.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-Acked-by: Julien Grall <julien.grall@arm.com>
8 years agohotplug/FreeBSD: configure xenstored
Wei Liu [Tue, 18 Apr 2017 14:48:59 +0000 (15:48 +0100)]
hotplug/FreeBSD: configure xenstored

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agooxenstored: provide options to define xenstored devices
Wei Liu [Tue, 18 Apr 2017 14:42:43 +0000 (15:42 +0100)]
oxenstored: provide options to define xenstored devices

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agopaths.m4: provide XENSTORED_{KVA,PORT}
Wei Liu [Tue, 18 Apr 2017 14:20:03 +0000 (15:20 +0100)]
paths.m4: provide XENSTORED_{KVA,PORT}

The default values are Linux device names. No users yet.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86: Move microcode loading earlier 4.9.0-rc2
Ross Lagerwall [Tue, 18 Apr 2017 15:47:24 +0000 (16:47 +0100)]
x86: Move microcode loading earlier

Move microcode loading earlier for the boot CPU and secondary CPUs so
that it takes place before identify_cpu() is called for each CPU.
Without this, the detected features may be wrong if the new microcode
loading adjusts the feature bits. That could mean that some fixes (e.g.
d6e9f8d4f35d ("x86/vmx: fix vmentry failure with TSX bits in LBR"))
don't work as expected.

Previously during boot, the microcode loader was invoked for each
secondary CPU started and then again for each CPU as part of an
initcall. Simplify the code so that it is invoked exactly once for each
CPU during boot.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86emul: force CLZERO feature flag in test harness
Jan Beulich [Wed, 19 Apr 2017 11:30:27 +0000 (13:30 +0200)]
x86emul: force CLZERO feature flag in test harness

Commit b988e88cc0 ("x86/emul: Add feature check for clzero") added a
feature check to the emulator, which breaks the harness without this
flag being forced to true.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86/vioapic: allow holes in the GSI range for PVH Dom0
Roger Pau Monné [Wed, 19 Apr 2017 11:29:51 +0000 (13:29 +0200)]
x86/vioapic: allow holes in the GSI range for PVH Dom0

The current vIO APIC for PVH Dom0 doesn't allow non-contiguous GSIs, which
means that all GSIs must belong to an IO APIC. This doesn't match reality,
where there are systems with non-contiguous GSIs.

In order to fix this, add a base_gsi field to each hvm_vioapic struct to
store the base GSI of each emulated IO APIC. For PVH Dom0 those values are
populated based on the hardware ones.
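
A hedged sketch of how the new field allows holes (base_gsi is from this commit; the rest of the structure and the helper are illustrative):

    struct hvm_vioapic {
        unsigned int base_gsi;      /* first GSI served by this IO APIC */
        unsigned int nr_pins;       /* number of redirection entries    */
        /* ... redirection table etc. elided ... */
    };

    /* A GSI is handled by this IO APIC only if it falls inside its window,
     * so GSIs lying between IO APICs can simply stay uncovered. */
    static bool gsi_in_range(const struct hvm_vioapic *vioapic, unsigned int gsi)
    {
        return gsi >= vioapic->base_gsi &&
               gsi < vioapic->base_gsi + vioapic->nr_pins;
    }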

Reported-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86/HVM: don't uniformly report "MMIO" for various forms of failed emulation
Jan Beulich [Wed, 19 Apr 2017 11:29:14 +0000 (13:29 +0200)]
x86/HVM: don't uniformly report "MMIO" for various forms of failed emulation

This helps distinguishing the call paths leading there.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agoVMX: don't blindly enable descriptor table exiting control
Jan Beulich [Wed, 19 Apr 2017 11:26:55 +0000 (13:26 +0200)]
VMX: don't blindly enable descriptor table exiting control

This is an optional feature and hence we should check for it before
use.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86/HVM: restrict emulation in hvm_descriptor_access_intercept()
Jan Beulich [Wed, 19 Apr 2017 11:26:18 +0000 (13:26 +0200)]
x86/HVM: restrict emulation in hvm_descriptor_access_intercept()

While I did review d0a699a389 ("x86/monitor: add support for descriptor
access events") it didn't really occur to me that someone could be this
blunt and add unguarded emulation again just a few weeks after we
guarded all special purpose emulator invocations. Fix this.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86emul: always fill x86_insn_modrm()'s outputs
Jan Beulich [Wed, 19 Apr 2017 11:25:44 +0000 (13:25 +0200)]
x86emul: always fill x86_insn_modrm()'s outputs

The function is rather unlikely to be called for insns which don't have
ModRM bytes, and hence Coverity's recurring complaint (of callers
potentially consuming uninitialized data when they know that certain
opcodes have ModRM bytes) can be suppressed this way without unduly
adding overhead to fast paths.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86emul: add "unblock NMI" retire flag
Jan Beulich [Wed, 19 Apr 2017 11:24:18 +0000 (13:24 +0200)]
x86emul: add "unblock NMI" retire flag

No matter that we emulate IRET for (guest) real mode only right now, we
should get its effect on (virtual) NMI delivery right. Note that we can
simply check the other retire flags also in the !OKAY case, as the
insn emulator now guarantees them to only be set on OKAY.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agodocs: fix configuration syntax in xl.cfg manpage
Wei Liu [Tue, 11 Apr 2017 11:03:00 +0000 (12:03 +0100)]
docs: fix configuration syntax in xl.cfg manpage

No quote is required when a string is provided as part of a spec string.

Reported-by: Doug Freed <dwfreed@mtu.edu>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
8 years agotools: Use POSIX signal.h instead of sys/signal.h
Alistair Francis [Mon, 17 Apr 2017 21:33:11 +0000 (14:33 -0700)]
tools: Use POSIX signal.h instead of sys/signal.h

The POSIX spec specifies to use:
    #include <signal.h>
instead of:
    #include <sys/signal.h>
as seen here:
   http://pubs.opengroup.org/onlinepubs/009695399/functions/signal.html

This removes the warning:
    #warning redirecting incorrect #include <sys/signal.h> to <signal.h>
when building with the musl C-library.

Signed-off-by: Alistair Francis <alistair.francis@xilinx.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agotools: Use POSIX poll.h instead of sys/poll.h
Alistair Francis [Mon, 17 Apr 2017 21:33:10 +0000 (14:33 -0700)]
tools: Use POSIX poll.h instead of sys/poll.h

The POSIX spec specifies to use:
    #include <poll.h>
instead of:
    #include <sys/poll.h>
as seen here:
    http://pubs.opengroup.org/onlinepubs/009695399/functions/poll.html

This removes the warning:
    #warning redirecting incorrect #include <sys/poll.h> to <poll.h>
when building with the musl C-library.

Signed-off-by: Alistair Francis <alistair.francis@xilinx.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86/emul: Drop more redundant ctxt.event_pending checks
Andrew Cooper [Mon, 10 Apr 2017 12:11:06 +0000 (13:11 +0100)]
x86/emul: Drop more redundant ctxt.event_pending checks

Since c/s 92cf67888a, x86_emulate_wrapper() asserts stricter behaviour about
the relationship between X86EMUL_EXCEPTION and ctxt.event_pending.

These removals should have been included in the aforementioned changeset, and
were only omitted due an oversight.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Tim Deegan <tim@xen.org>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agox86/vIO-APIC: fix uninitialized variable warning
Jan Beulich [Thu, 13 Apr 2017 15:35:02 +0000 (17:35 +0200)]
x86/vIO-APIC: fix uninitialized variable warning

In a release build modern gcc validly complains about "pin" possibly
being uninitialized in vioapic_irq_positive_edge().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agoVT-d: correct a comment and remove an useless if() statement
Chao Gao [Thu, 13 Apr 2017 15:34:29 +0000 (17:34 +0200)]
VT-d: correct a comment and remove an useless if() statement

Fix two flaws in the patch (93358e8e VT-d: introduce update_irte to update
irte safely):
1. Expand a comment in update_irte() to make it clear that VT-d hardware
doesn't update IRTE and software can't update IRTE behind us since we hold
iremap_lock.
2. remove an useless if() statement

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agoclang: disable the gcc-compat warnings for read_atomic
Roger Pau Monné [Thu, 13 Apr 2017 15:33:21 +0000 (17:33 +0200)]
clang: disable the gcc-compat warnings for read_atomic

clang's gcc-compat warnings can wrongly fire when certain constructions are
used, at least in the following flow:

switch ( ... )
{
case ...:
    while ( ({ int x = 0; switch ( foo ) { case 1: x = 1; break; } x; }) )
    {
        ...

The above will cause clang to emit the warning "'break' is bound to loop, GCC
binds it to switch". This is a false positive: both gcc and clang bind
the break to the inner switch. In order to work around this issue, disable the
gcc-compat checks for the usage of the read_atomic macro.

This has been reported upstream as http://bugs.llvm.org/show_bug.cgi?id=32595.
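
One hedged way of expressing such a suppression around the macro (whether the patch uses exactly this mechanism is an assumption):

    /* Silence clang's gcc-compat diagnostics for this expansion only. */
    #pragma clang diagnostic push
    #pragma clang diagnostic ignored "-Wgcc-compat"
        /* ... expansion containing the nested switch/while construct ... */
    #pragma clang diagnostic pop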

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agotools:misc:xenpm: set max freq to all cpu with default cpuid
Luwei Kang [Thu, 13 Apr 2017 10:44:28 +0000 (18:44 +0800)]
tools:misc:xenpm: set max freq to all cpu with default cpuid

A user can set the max frequency of a specific CPU with
"xenpm set-scaling-maxfreq [cpuid] <HZ>"
or of all CPUs (the default cpuid) with
"xenpm set-scaling-maxfreq <HZ>".

Setting the max frequency with the default cpuid causes a
segmentation fault after commit d4906b5d05.
This patch fixes that issue and adds the ability
to set the max frequency with the default cpuid.

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Compile-tested-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agoxen: credit: change an ASSERT on nr_runnable so that it makes sense.
Dario Faggioli [Thu, 13 Apr 2017 07:49:54 +0000 (09:49 +0200)]
xen: credit: change an ASSERT on nr_runnable so that it makes sense.

Since the counter is unsigned, it's pointless/bogus to check
whether it is greater than or equal to zero.

Check that it is at least one before it's decremented, instead.

Spotted by Coverity.
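
A minimal hedged illustration of the change (the field name follows the commit subject; surrounding context is elided):

    /* For an unsigned counter, assert that there is something to decrement
     * *before* decrementing, instead of making an always-true check. */
    ASSERT(spc->nr_runnable >= 1);
    spc->nr_runnable--;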

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agoxen: credit2: cleanup patch for type betterness
Praveen Kumar [Tue, 11 Apr 2017 18:08:42 +0000 (23:38 +0530)]
xen: credit2: cleanup patch for type betterness

The patch actually doesn't impact the functionality as such. This only replaces
bool_t with bool in credit2.

Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agomisc/release-checklist.txt: Try to avoid wrong-tag mistakes
Ian Jackson [Wed, 12 Apr 2017 15:36:30 +0000 (16:36 +0100)]
misc/release-checklist.txt: Try to avoid wrong-tag mistakes

Add some better checking and make the runes a bit more robust.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years agomisc/release-checklist.txt: Preemptive updates
Ian Jackson [Wed, 12 Apr 2017 15:31:40 +0000 (16:31 +0100)]
misc/release-checklist.txt: Preemptive updates

These are things I noticed should be fixed, while trying to make
4.9.0-rc1.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years agoConfig.mk: Update for 4.9.0-rc1.2 4.9.0-rc1.2
Ian Jackson [Wed, 12 Apr 2017 15:18:57 +0000 (16:18 +0100)]
Config.mk: Update for 4.9.0-rc1.2

Contrary to what I wrote in d0db50ced1f7 "Config.mk: Update for
4.9.0-rc1.1", the build failure with 4.9.0-rc1 was not due to a wrong
qemu tag, but a wrong mini-os tag.  So burn 4.9.0-rc1.1 too :-(.  (We
can rewind the qemu-trad tag to 4.9.0-rc1; the -rc1 and -rc1.1 tags
are identical.)

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years agoConfig.mk: Update for 4.9.0-rc1.1 4.9.0-rc1.1
Ian Jackson [Wed, 12 Apr 2017 15:03:35 +0000 (16:03 +0100)]
Config.mk: Update for 4.9.0-rc1.1

In qemu-trad, I made xen-4.9.0-rc1 refer erroneously to the 4.8
branch.  That doesn't build.  So we are burning the version number
4.9.0-rc1 in xen.git and qemu-trad.  (The other trees can remain as
they are.)

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years agoConfig.mk, etc.: Prepare 4.9.0-rc1 4.9.0-rc1
Ian Jackson [Wed, 12 Apr 2017 14:45:39 +0000 (15:45 +0100)]
Config.mk, etc.: Prepare 4.9.0-rc1

* Update Config.mk REVISION values to refer to relevant tags.
* Update README version number
* Update xen/Makefile version number

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years agox86/atomic: fix cmpxchg16b inline assembly to work with clang
Roger Pau Monné [Mon, 10 Apr 2017 15:32:01 +0000 (17:32 +0200)]
x86/atomic: fix cmpxchg16b inline assembly to work with clang

clang doesn't understand the "=A" register constraint when used with 64-bit
assembly and spits out an internal error:

fatal error: error in backend: Cannot select: 0x7f9fb89c9390: i64 = build_pair 0x7f9fb89c92b0,
      0x7f9fb89c9320
  0x7f9fb89c92b0: i32,ch,glue = CopyFromReg 0x7f9fb89c9240, Register:i32 %EAX, 0x7f9fb89c9240:1
    0x7f9fb89c8c20: i32 = Register %EAX
    0x7f9fb89c9240: ch,glue = inlineasm 0x7f9fb89c90f0,
TargetExternalSymbol:i64'lock; cmpxchg16b $1', MDNode:ch<0x7f9fb8476c38>,
TargetConstant:i64<25>, TargetConstant:i32<18>, Register:i32 %EAX, Register:i32
%EDX, TargetConstant:i32<196622>, 0x7f9fb89c87c0, TargetConstant:i32<9>,
Register:i64 %RCX, TargetConstant:i32<9>, Register:i64 %RBX,
TargetConstant:i32<9>, Register:i64 %RDX, TargetConstant:i32<9>, Register:i64
%RAX, TargetConstant:i32<196622>, 0x7f9fb89c87c0, TargetConstant:i32<12>,
Register:i32 %EFLAGS, 0x7f9fb89c90f0:1
      0x7f9fb89c8a60: i64 = TargetExternalSymbol'lock; cmpxchg16b $1'
      0x7f9fb89c8b40: i64 = TargetConstant<25>
      0x7f9fb89c8bb0: i32 = TargetConstant<18>
      0x7f9fb89c8c20: i32 = Register %EAX
      0x7f9fb89c8c90: i32 = Register %EDX
      0x7f9fb89c8d00: i32 = TargetConstant<196622>
      0x7f9fb89c87c0: i64,ch = load<LD8[%4]> 0x7f9fb9053da0, FrameIndex:i64<1>, undef:i64
        0x7f9fb9053a90: i64 = FrameIndex<1>
        0x7f9fb9053e80: i64 = undef
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c8d70: i64 = Register %RCX
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c8ec0: i64 = Register %RBX
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c8fa0: i64 = Register %RDX
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c9080: i64 = Register %RAX
[...]

Fix this by specifying "rdx:rax" manually using the "d" and "a" constraints.
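
A hedged, self-contained sketch of the constraint change (function name and operand handling are illustrative; the real helper also reports the ZF result):

    #include <stdint.h>

    /* Name rdx:rax explicitly via "a"/"d" instead of using the "=A" pair.
     * 'ptr' must be 16-byte aligned for cmpxchg16b. */
    static inline void cmpxchg16b_sketch(volatile void *ptr,
                                         uint64_t *old_lo, uint64_t *old_hi,
                                         uint64_t new_lo, uint64_t new_hi)
    {
        asm volatile ( "lock cmpxchg16b %0"
                       : "+m" (*(volatile __uint128_t *)ptr),
                         "+a" (*old_lo), "+d" (*old_hi)
                       : "b" (new_lo), "c" (new_hi)
                       : "memory" );
    }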

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agoxsm: fix clang 3.5 build after c47d1d
Roger Pau Monné [Mon, 10 Apr 2017 15:31:42 +0000 (17:31 +0200)]
xsm: fix clang 3.5 build after c47d1d

The changes introduced in c47d1d broke the clang build due to undefined
references to __xsm_action_mismatch_detected, because clang doesn't optimize
the code in the way the construct expects. The following patch allows the
clang build to work again, while keeping the same functionality.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
8 years agooxenstored: make --restart option best-effort
Jonathan Davies [Fri, 7 Apr 2017 13:27:22 +0000 (14:27 +0100)]
oxenstored: make --restart option best-effort

Only attempt to restore from saved state if it exists.

Without this, oxenstored immediately exits with an exception if the
--restart option is provided but the state file is not present.

(The time-of-check to time-of-use race isn't a concern as oxenstored is
the only thing that should write the state file.)

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agooxenstored: improve event-channel binding logging
Jonathan Davies [Fri, 7 Apr 2017 13:27:21 +0000 (14:27 +0100)]
oxenstored: improve event-channel binding logging

It's useful to see a bit more detail when an inter-domain event-channel
is bound, especially over an oxenstored restart.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agooxenstored: save remote evtchn port, not local port
Jonathan Davies [Fri, 7 Apr 2017 13:27:20 +0000 (14:27 +0100)]
oxenstored: save remote evtchn port, not local port

Previously, Domain.dump output the number of the local port
corresponding to each domain's event-channel. However, when oxenstored
exits, it closes /dev/xen/evtchn which causes the kernel to close the
local port (evtchn_release), so this port is no longer useful.

Instead, store the remote port. This can be used to reconnect the
event-channel by binding the original remote port to a fresh local port.

Indeed, the logic for parsing the stored state already expects a remote
port as it passes the parsed port number to Domain.make (via
Domains.create), which takes a remote port.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agooxenstored: avoid leading slash in paths in saved store state
Jonathan Davies [Fri, 7 Apr 2017 13:27:19 +0000 (14:27 +0100)]
oxenstored: avoid leading slash in paths in saved store state

Internally, paths are represented as lists of strings, where
  * path "/" is represented by []
  * path "/local/domain/0" is represented by ["local"; "domain"; "0"]
(see comment for Store.Path.t).

However, the traversal function generated paths like
    [""; "local"; "domain"; "0"]
because the name of the root node is "". Change it to generate paths
correctly.

Furthermore, the function passed to Store.dump_fct would render the node
"foo" under the path [] as "//foo". Change this to return "/foo".

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agooxenstored: initialise logging earlier
Jonathan Davies [Fri, 7 Apr 2017 13:27:18 +0000 (14:27 +0100)]
oxenstored: initialise logging earlier

Otherwise we miss out on messages from things that try to log earlier in
the start-up procedure.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agoRevert "setup vwfi correctly on cpu0"
Stefano Stabellini [Fri, 7 Apr 2017 22:38:58 +0000 (15:38 -0700)]
Revert "setup vwfi correctly on cpu0"

This reverts commit b32d442abd92cdd4d8f2a2e7794cfee9dba7fe22. There is
no need for this patch after "xen/arm: Set and restore HCR_EL2 register
for each vCPU separately".

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoARM: GICv3 ITS: introduce device mapping
Andre Przywara [Fri, 7 Apr 2017 22:08:01 +0000 (23:08 +0100)]
ARM: GICv3 ITS: introduce device mapping

The ITS uses device IDs to map LPIs to a device. Dom0 will later use
those IDs, which we directly pass on to the host.
For this we have to map each device that Dom0 may request to a host
ITS device with the same identifier.
Allocate the respective memory and enter each device into an rbtree, to
later be able to iterate over it or to easily tear down guests.
Because device IDs are per ITS, we need to identify a virtual ITS. We
use the doorbell address for that purpose, as it is a nice architectural
MSI property and spares us from handling opaque pointers or breaking
the VGIC abstraction.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoARM: vGICv3: introduce ITS emulation stub
Andre Przywara [Fri, 7 Apr 2017 22:08:00 +0000 (23:08 +0100)]
ARM: vGICv3: introduce ITS emulation stub

Create a new file to hold the emulation code for the ITS widget.
This just holds the data structure and a init and free function for now.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoARM: GICv3 ITS: introduce host LPI array
Andre Przywara [Fri, 7 Apr 2017 22:07:59 +0000 (23:07 +0100)]
ARM: GICv3 ITS: introduce host LPI array

The number of LPIs on a host can be potentially huge (millions),
although in practice it will mostly be reasonable. So prematurely allocating
an array of struct irq_desc's for each LPI is not an option.
However Xen itself does not care about LPIs, as every LPI will be injected
into a guest (Dom0 for now).
Create a dense data structure (8 Bytes) for each LPI which holds just
enough information to determine the virtual IRQ number and the VCPU into
which the LPI needs to be injected.
Also to not artificially limit the number of LPIs, we create a 2-level
table for holding those structures.
This patch introduces functions to initialize these tables and to
create, lookup and destroy entries for a given LPI.
By using the naturally atomic access guarantee the native uint64_t data
type gives us, we allocate and access LPI information in a way that does
not require a lock.
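
A hedged sketch of the per-LPI record described above (field names and widths approximate the description, not the source):

    union host_lpi {
        uint64_t data;              /* read/written as one naturally aligned
                                     * 64-bit value, so no lock is needed */
        struct {
            uint32_t virt_lpi;      /* virtual LPI number to inject   */
            uint16_t dom_id;        /* target domain                  */
            uint16_t vcpu_id;       /* target vCPU within that domain */
        };
    };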

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoARM: GICv3 ITS: introduce ITS command handling
Andre Przywara [Fri, 7 Apr 2017 22:07:58 +0000 (23:07 +0100)]
ARM: GICv3 ITS: introduce ITS command handling

To be able to easily send commands to the ITS, create the respective
wrapper functions, which take care of the ring buffer.
The first two commands we implement provide methods to map a collection
to a redistributor (aka host core) and to flush the command queue (SYNC).
Start using these commands for mapping one collection to each host CPU.
As an ITS might choose between *two* ways of addressing a redistributor,
we store both the MMIO base address as well as the processor number in
a per-CPU variable to give each ITS what it wants.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoARM: GICv3 ITS: map ITS command buffer
Andre Przywara [Fri, 7 Apr 2017 22:07:57 +0000 (23:07 +0100)]
ARM: GICv3 ITS: map ITS command buffer

Instead of directly manipulating the tables in memory, an ITS driver
sends commands via a ring buffer in normal system memory to the ITS h/w
to create or alter the LPI mappings.
Allocate memory for that buffer and tell the ITS about it to be able
to send ITS commands.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoARM: GICv3 ITS: allocate device and collection table
Andre Przywara [Fri, 7 Apr 2017 22:07:56 +0000 (23:07 +0100)]
ARM: GICv3 ITS: allocate device and collection table

Each ITS maps a pair of a DeviceID (for instance derived from a PCI
b/d/f triplet) and an EventID (the MSI payload or interrupt ID) to a
pair of LPI number and collection ID, which points to the target CPU.
This mapping is stored in the device and collection tables, which software
has to provide for the ITS to use.
Allocate the required memory and hand it over to the ITS.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoARM: GICv3: allocate LPI pending and property table
Andre Przywara [Fri, 7 Apr 2017 22:07:55 +0000 (23:07 +0100)]
ARM: GICv3: allocate LPI pending and property table

The ARM GICv3 provides a new kind of interrupt called LPIs.
The pending bits and the configuration data (priority, enable bits) for
those LPIs are stored in tables in normal memory, which software has to
provide to the hardware.
Allocate the required memory, initialize it and hand it over to each
redistributor. The maximum number of LPIs to be used can be adjusted with
the command line option "max_lpi_bits", which defaults to 20 bits,
covering about one million LPIs.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoARM: GICv3 ITS: initialize host ITS
Andre Przywara [Fri, 7 Apr 2017 22:07:54 +0000 (23:07 +0100)]
ARM: GICv3 ITS: initialize host ITS

Map the registers frame for each host ITS and populate the host ITS
structure with some parameters describing the size of certain properties
like the number of bits for device IDs.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoARM: GICv3 ITS: parse and store ITS subnodes from hardware DT
Andre Przywara [Fri, 7 Apr 2017 22:07:53 +0000 (23:07 +0100)]
ARM: GICv3 ITS: parse and store ITS subnodes from hardware DT

Parse the GIC subnodes in the device tree to find every ITS MSI controller
the hardware offers. Store that information in a list to both propagate
all of them later to Dom0, but also to be able to iterate over all ITSes.
This introduces an ITS Kconfig option (as an EXPERT option); use
XEN_CONFIG_EXPERT=y on the make command line to see and use the option.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoxen: credit1: treat pCPUs more evenly during balancing.
Dario Faggioli [Fri, 7 Apr 2017 16:57:14 +0000 (18:57 +0200)]
xen: credit1: treat pCPUs more evenly during balancing.

Right now, we use cpumask_first() to go through
the busy pCPUs in csched_load_balance(). This means
not all pCPUs have equal chances of seeing their
pending work stolen. It also means there is more
runqueue lock pressure on lower-ID pCPUs.

To avoid all this, let's record and remember, for
each NUMA node, from what pCPU we have stolen for
last, and start from that the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: credit1: increase efficiency and scalability of load balancing.
Dario Faggioli [Fri, 7 Apr 2017 16:57:07 +0000 (18:57 +0200)]
xen: credit1: increase efficiency and scalability of load balancing.

During load balancing, we check the non idle pCPUs to
see if they have runnable but not running vCPUs that
can be stolen by and set to run on currently idle pCPUs.

If a pCPU has only one running (or runnable) vCPU,
though, we don't want to steal it from there, and
it's therefore pointless bothering with it
(especially considering that bothering means trying
to take its runqueue lock!).

On large systems, when load is only slightly higher
than the number of pCPUs (i.e., there are just a few
more active vCPUs than the number of the pCPUs), this
may mean that:
 - we go through all the pCPUs,
 - for each one, we (try to) take its runqueue locks,
 - we figure out there's actually nothing to be stolen!

To mitigate this, we introduce a counter for the number
of runnable vCPUs on each pCPU. In fact, unless there
are at least 2 runnable vCPUs --typically, one running,
and the others in the runqueue-- it does not make sense
to try stealing anything.
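
A hedged sketch of how the counter is meant to be used in the balancing loop (names approximate):

    /* Nothing can be stolen unless the peer has at least 2 runnable vCPUs
     * (one running plus something queued), so skip it without even trying
     * to take its runqueue lock. */
    if ( CSCHED_PCPU(peer_cpu)->nr_runnable < 2 )
        continue;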

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: credit2: avoid cpumask_any() in pick_cpu().
Dario Faggioli [Fri, 7 Apr 2017 16:57:00 +0000 (18:57 +0200)]
xen: credit2: avoid cpumask_any() in pick_cpu().

cpumask_any() is costly (because of the randomization).
And since it does not really matter which exact CPU is
selected within a runqueue, as that will be overridden
shortly after, in runq_tickle(), spending too much time
and achieving true randomization is pretty pointless.

As the picked CPU, however, would be used as a hint
within runq_tickle(), don't give up on it entirely,
and let's make sure we don't always return the same
CPU, or favour lower or higher ID CPUs.

To achieve that, let's record and remember, for each
runqueue, what CPU we picked for last, and start from
that the following time.
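
A hedged sketch of the described approach (cpumask_cycle() is an existing helper; the per-runqueue field name is an assumption):

    /* Start scanning just after the CPU picked last time, and remember
     * the new pick for the next call. */
    cpu = cpumask_cycle(rqd->pick_bias, &mask);     /* mask: candidate pCPUs */
    rqd->pick_bias = cpu;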

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen/tools: tracing: add record for credit1 runqueue stealing.
Dario Faggioli [Fri, 7 Apr 2017 16:56:52 +0000 (18:56 +0200)]
xen/tools: tracing: add record for credit1 runqueue stealing.

Including whether we actually tried stealing a vCPU from
a given pCPU, or skipped that one because of lock
contention.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: credit: consider tickled pCPUs as busy.
Dario Faggioli [Fri, 7 Apr 2017 16:56:45 +0000 (18:56 +0200)]
xen: credit: consider tickled pCPUs as busy.

Currently, it can happen that __runq_tickle(),
running on pCPU 2 because vCPU x woke up, decides
to tickle pCPU 3, because it's idle. Just after
that, but before pCPU 3 manages to schedule and
pick up x, either __runq_tickel() or
__csched_cpu_pick(), running on pCPU 6, sees that
idle pCPUs are 0, 1 and also 3, and for whatever
reason it also chooses 3 for waking up (or
migrating) vCPU y.

When pCPU 3 goes through the scheduler, it will
pick up, say, vCPU x, and y will sit in its
runqueue, even if there are idle pCPUs.

Alleviate this by marking a pCPU as no longer idle
right away when tickling it (like, e.g., happens
in Credit2).

Note that this does not eliminate the race. That
is not possible without introducing proper locking
for the cpumasks the scheduler uses. It significantly
reduces the window during which it can happen, though.

Introducing proper locking for the cpumasks can, in
theory, be done, and may be investigated in future.
It is a significant amount of work to do it properly
(e.g., avoiding deadlock), and it is likely to adversely
affect scalability, so it may be a path that is just
not worth following.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: credit: (micro) optimize csched_runq_steal().
Dario Faggioli [Fri, 7 Apr 2017 16:56:38 +0000 (18:56 +0200)]
xen: credit: (micro) optimize csched_runq_steal().

Checking whether or not a vCPU can be 'stolen'
from a peer pCPU's runqueue is relatively cheap.

Therefore, let's do that as early as possible,
avoiding potentially useless complex checks and
cpumask manipulations.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: credit1: simplify csched_runq_steal() a little bit.
Dario Faggioli [Fri, 7 Apr 2017 16:56:31 +0000 (18:56 +0200)]
xen: credit1: simplify csched_runq_steal() a little bit.

Since we're holding the lock on the pCPU from which we
are trying to steal, it can't have disappeared, so we
can drop the check for that (and convert it into an
ASSERT()).

And since we try to steal only from busy pCPUs, it's
unlikely for such pCPU to be idle, so we can:
 - tell the compiler this is actually unlikely,
 - bail early if the pCPU, unfortunately, turns out
   to really be idle.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
8 years agox86/svm: Correct event injection check in svm_vmcb_restore()
Andrew Cooper [Fri, 7 Apr 2017 15:38:53 +0000 (16:38 +0100)]
x86/svm: Correct event injection check in svm_vmcb_restore()

SVM's maximum valid event type is 4.  This appears to be a straight copy and
paste error in c/s e94e3f210a62, as VT-x's maximum is 6.
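
A hedged sketch of the corrected check (field names assumed to mirror the hvm_hw_cpu save record, not copied from the patch):

    /* Reject event types SVM cannot inject: 4 (software interrupt) is the
     * maximum valid type here, not VT-x's 6. */
    if ( c->pending_valid &&
         ((c->pending_type == 1) || (c->pending_type > 4)) )
        return -EINVAL;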

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
8 years agox86/svm: Fix indentation in svm_vmcb_restore()
Andrew Cooper [Fri, 7 Apr 2017 15:38:12 +0000 (16:38 +0100)]
x86/svm: Fix indentation in svm_vmcb_restore()

Introduced by c/s b706e1c6af274, spotted by Coverity.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
8 years agox86/emul: Poison the stubs with debug traps
Andrew Cooper [Wed, 8 Mar 2017 15:38:55 +0000 (15:38 +0000)]
x86/emul: Poison the stubs with debug traps

...rather than leaving fragments of old instructions in place.  This reduces
the chances of something going further wrong (as the debug trap will be caught
and terminate the guest) in a cascade failure where we end up executing the
instruction fragments.

Before:
    (XEN) d2v0 exception 6 (ec=0000) in emulation stub (line 6239)
    (XEN) d2v0 stub: c4 e1 44 77 c3 80 d0 82 ff ff ff d1 90 ec 90

After:
    (XEN) d3v0 exception 6 (ec=0000) in emulation stub (line 6239)
    (XEN) d3v0 stub: c4 e1 44 77 c3 cc cc cc cc cc cc cc cc cc cc
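
For illustration, the poisoning itself amounts to refilling the stub buffer with int3 bytes after use; a minimal sketch (the constant and field names are assumptions, not necessarily the actual ones):

    /* Refill the stub with 0xcc (int3) so stray execution traps immediately,
     * instead of running leftover bytes from the previous instruction. */
    memset(stub.ptr, 0xcc, STUB_BUF_SIZE);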

To make this work, the int3 handler needs to be extended to attempt recovery
rather than simply returning back to Xen context.  While altering do_int3(),
leave an obvious sign if an embedded breakpoint has been hit and not dealt
with by debugging facilities.

    (XEN) Hit embedded breakpoint at ffff82d0803d01f6 [extable.c#stub_selftest+0xda/0xee]

Extend the selftests to include int3, and add an extra printk indicating the
start of the recovery selftests, to avoid leaving otherwise-spurious faults
visible in the log.

    (XEN) build-id: 55d7e6f420b4f0ce277f776be620f43d7cb8646c
    (XEN) Running stub recovery selftests...
    (XEN) traps.c:3466: GPF (0000): ffff82d0bffff041 [ffff82d0bffff041] -> ffff82d08035937a
    (XEN) traps.c:813: Trap 12: ffff82d0bffff040 [ffff82d0bffff040] -> ffff82d08035937a
    (XEN) traps.c:1215: Trap 3: ffff82d0bffff041 [ffff82d0bffff041] -> ffff82d08035937a
    (XEN) ACPI sleep modes: S3

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/ioreq server: synchronously reset outstanding p2m_ioreq_server entries when an...
Yu Zhang [Fri, 7 Apr 2017 15:40:04 +0000 (17:40 +0200)]
x86/ioreq server: synchronously reset outstanding p2m_ioreq_server entries when an ioreq server unmaps

After an ioreq server has unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
synchronously by iterating the p2m table.

The synchronous resetting is necessary because we need to guarantee
that the p2m table is clean before another ioreq server is mapped. And
since the sweeping of the p2m table could be time-consuming, it is done
with hypercall continuation.
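
The preemptible sweep has roughly this shape (a sketch only; the helper name and batching interval are assumptions):

    for ( ; gfn <= last_gfn; ++gfn )
    {
        reset_one_entry(p2m, gfn);       /* p2m_ioreq_server -> p2m_ram_rw */
        if ( !(gfn & 0x3ffUL) && hypercall_preempt_check() )
            return -ERESTART;            /* dm-op layer creates the continuation */
    }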

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agox86/ioreq server: asynchronously reset outstanding p2m_ioreq_server entries
Yu Zhang [Fri, 7 Apr 2017 15:39:16 +0000 (17:39 +0200)]
x86/ioreq server: asynchronously reset outstanding p2m_ioreq_server entries

After an ioreq server has unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
asynchronously with the current p2m_change_entry_type_global()
interface.

New field entry_count is introduced in struct p2m_domain, to record
the number of p2m_ioreq_server p2m page table entries. One property of
these entries is that they only point to 4K-sized page frames, because
all p2m_ioreq_server entries originate from p2m_ram_rw ones in
p2m_change_type_one(). We therefore do not need to worry about counting
2M/1G-sized pages.

This patch disallows mapping of an ioreq server while there are still
p2m_ioreq_server entries left, in case another mapping occurs right after
the current one is unmapped and releases its lock, with the p2m table
not yet synced.

This patch also disallows live migration when there are remaining
p2m_ioreq_server entries in the p2m table. The core reason is that our current
implementation of p2m_change_entry_type_global() lacks information
to resync p2m_ioreq_server entries correctly if global_logdirty is
on.

We still need to handle other recalculations, however, which means
that when doing a recalculation, if the current type is
p2m_ioreq_server, we check whether p2m->ioreq.server is still valid.
If it is, we leave the entry as type p2m_ioreq_server; if not, we reset
it to p2m_ram as appropriate.

To avoid code duplication, lift recalc_type() out of p2m-pt.c and use
it for all type recalculations (both in p2m-pt.c and p2m-ept.c).
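
The recalculation decision described above, as a rough sketch (simplified; the logdirty handling and other cases are omitted, and the helper's exact shape is an assumption):

    static p2m_type_t recalc_type(bool recalc, p2m_type_t t,
                                  struct p2m_domain *p2m, unsigned long gfn)
    {
        if ( recalc && t == p2m_ioreq_server && !p2m->ioreq.server )
            return p2m_ram_rw;           /* no owning ioreq server left: reset */
        return t;                        /* otherwise keep the current type */
    }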

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/ioreq server: handle read-modify-write cases for p2m_ioreq_server pages
Paul Durrant [Fri, 7 Apr 2017 15:38:48 +0000 (17:38 +0200)]
x86/ioreq server: handle read-modify-write cases for p2m_ioreq_server pages

In ept_handle_violation(), write violations are also treated as
read violations. And when a VM is accessing a write-protected
address with read-modify-write instructions, the read emulation
process is triggered first.

For p2m_ioreq_server pages, the current ioreq server only forwards
write operations to the device model. Therefore, when such a page
is being accessed by a read-modify-write instruction, the read
operations should be emulated here in the hypervisor. This patch provides
such a handler to copy the data to the buffer.
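
A minimal sketch of such a read handler (function and constant names are approximations, not verbatim from the patch):

    static int ioreq_server_read(uint64_t addr, uint32_t size, uint64_t *data)
    {
        /* Satisfy the read half of a read-modify-write locally, from RAM. */
        if ( hvm_copy_from_guest_phys(data, addr, size) != HVMCOPY_okay )
            return X86EMUL_UNHANDLEABLE;
        return X86EMUL_OKAY;
    }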

Note: MMIOs with p2m_mmio_dm type do not need such special treatment
because both reads and writes will go to the device model.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/ioreq server: add device model wrappers for new DMOP
Yu Zhang [Fri, 7 Apr 2017 15:38:40 +0000 (17:38 +0200)]
x86/ioreq server: add device model wrappers for new DMOP

A new device model wrapper is added for the newly introduced
DMOP - XEN_DMOP_map_mem_type_to_ioreq_server.

Since currently this DMOP only supports the emulation of write
operations, attempts to trigger the DMOP with values other than
XEN_DMOP_IOREQ_MEM_ACCESS_WRITE or 0 (to unmap the ioreq server)
shall fail. The wrapper shall be updated once read operations
are also to be emulated in the future.

Also note currently this DMOP only supports one memory type,
and can be extended in the future to map multiple memory types
to multiple ioreq servers, e.g. mapping HVMMEM_ioreq_serverX to
ioreq server X. This wrapper shall be updated when such a change
is made.
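
A hedged usage sketch from a device model's point of view (the exact prototype is assumed; dmod/domid/ioservid are obtained earlier via the usual library calls):

    /* Claim write emulation for HVMMEM_ioreq_server pages ... */
    rc = xendevicemodel_map_mem_type_to_ioreq_server(
             dmod, domid, ioservid, HVMMEM_ioreq_server,
             XEN_DMOP_IOREQ_MEM_ACCESS_WRITE);

    /* ... and later disclaim it again by passing flags == 0. */
    rc = xendevicemodel_map_mem_type_to_ioreq_server(
             dmod, domid, ioservid, HVMMEM_ioreq_server, 0);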

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
8 years agox86/ioreq server: add DMOP to map guest ram with p2m_ioreq_server to an ioreq server
Paul Durrant [Fri, 7 Apr 2017 15:38:11 +0000 (17:38 +0200)]
x86/ioreq server: add DMOP to map guest ram with p2m_ioreq_server to an ioreq server

Previously, p2m_ioreq_server was used to write-protect guest ram
pages, which are tracked with the ioreq server's rangeset. However,
the number of ram pages to be tracked may exceed the upper limit of
the rangeset.

Now, a new DMOP - XEN_DMOP_map_mem_type_to_ioreq_server, is added
to let one ioreq server claim/disclaim its responsibility for the
handling of guest pages with p2m type p2m_ioreq_server. Users of
this DMOP can specify which kind of operation is supposed to be
emulated in a parameter named flags. Currently, this DMOP only
supports the emulation of write operations; it can be further
extended to support the emulation of reads if an ioreq server
has such a requirement in the future.

For now, we only support one ioreq server for this p2m type, so
once an ioreq server has claimed its ownership, subsequent calls
of the XEN_DMOP_map_mem_type_to_ioreq_server will fail. Users can
also disclaim the ownership of guest ram pages with p2m_ioreq_server,
by triggering this new DMOP, with ioreq server id set to the current
owner's and flags parameter set to 0.

Note:
a> both XEN_DMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
are only supported for HVMs with HAP enabled.

b> only after one ioreq server claims its ownership of p2m_ioreq_server,
will the p2m type change to p2m_ioreq_server be allowed.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agox86/ioreq server: release the p2m lock after mmio is handled
Yu Zhang [Fri, 7 Apr 2017 15:35:44 +0000 (17:35 +0200)]
x86/ioreq server: release the p2m lock after mmio is handled

Routine hvmemul_do_io() may need to peek the p2m type of a gfn to
select the ioreq server. For example, operations on gfns with
p2m_ioreq_server type will be delivered to a corresponding ioreq
server, and this requires that the p2m type not be switched back
to p2m_ram_rw during the emulation process. To avoid this race
condition, we delay the release of the p2m lock in hvm_hap_nested_page_fault()
until MMIO is handled.
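
The reordering in hvm_hap_nested_page_fault() then looks roughly like this (heavily simplified; most cases, arguments, and error handling omitted):

    mfn = get_gfn_type_access(p2m, gfn, &p2mt, &p2ma, 0);  /* takes the p2m lock */
    if ( p2mt == p2m_mmio_dm || p2mt == p2m_ioreq_server )
        handle_mmio_with_translation(gla, gfn, npfec);     /* emulate under the lock */
    __put_gfn(p2m, gfn);                 /* only now release the p2m lock */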

Note: previously in hvm_hap_nested_page_fault(), put_gfn() was moved
before the handling of MMIO, due to a deadlock risk between the p2m
lock and the event lock (in commit 77b8dfe). Later, a per-event-channel
lock was introduced in commit de6acb7 to send events, so we no longer
need to worry about that deadlock.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agotools: sched: add support for 'null' scheduler
Dario Faggioli [Fri, 7 Apr 2017 12:28:31 +0000 (14:28 +0200)]
tools: sched: add support for 'null' scheduler

Being very basic also means this scheduler does not need
much support at the tools level (for now).

Basically, just the definition of the scheduler's own
symbol and a couple of stubs.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: sched_null: support for hard affinity
Dario Faggioli [Fri, 7 Apr 2017 12:28:23 +0000 (14:28 +0200)]
xen: sched_null: support for hard affinity

As a (rudimentary) way of directing and affecting the
placement logic implemented by the scheduler, support
vCPU hard affinity.

Basically, a vCPU will now be assigned only to a pCPU
that is part of its own hard affinity. If such pCPU(s)
is (are) busy, the vCPU will wait, just as happens
when there are no free pCPUs.
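
In other words, a pCPU is a valid candidate for a vCPU only if it lies within the vCPU's hard affinity and is free; a sketch (the accessor for the per-pCPU assignment is hypothetical):

    static bool vcpu_can_be_assigned(const struct vcpu *v, unsigned int cpu)
    {
        return cpumask_test_cpu(cpu, v->cpu_hard_affinity) &&
               pcpu_assigned_vcpu(cpu) == NULL;   /* hypothetical accessor */
    }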

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: sched: introduce the 'null' semi-static scheduler
Dario Faggioli [Fri, 7 Apr 2017 12:28:15 +0000 (14:28 +0200)]
xen: sched: introduce the 'null' semi-static scheduler

In cases where one is absolutely sure that there will be
fewer vCPUs than pCPUs, having to pay the cost, mostly in
terms of overhead, of an advanced scheduler may not be
desirable.

The simple scheduler implemented here could be a solution.
Here is how it works (a rough sketch follows the list):
 - each vCPU is statically assigned to a pCPU;
 - if there are pCPUs without any vCPU assigned, they
   stay idle (as in, they run their idle vCPU);
 - if there are vCPUs which are not assigned to any
   pCPU (e.g., because there are more vCPUs than pCPUs)
   they *don't* run, until they get assigned;
 - if a vCPU assigned to a pCPU goes away, one of the
   vCPUs waiting to be assigned, if any, gets assigned
   to the pCPU and can run there.
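
A very rough sketch of that assignment rule (all identifiers are illustrative, not the scheduler's actual ones):

    static void null_vcpu_insert(struct vcpu *v)
    {
        unsigned int cpu = pick_free_pcpu(v);      /* honours hard affinity      */

        if ( cpu != nr_cpu_ids )
            assign_vcpu_to_pcpu(v, cpu);           /* v runs only there from now */
        else
            list_add_tail(&waitq_elem(v), &waitq); /* v waits, and doesn't run   */
    }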

This scheduler, therefore, if used in configurations
where every vCPU can be assigned to a pCPU, guarantees
low overhead, low latency, and consistent performance.

If used as the default scheduler at Xen boot, it is
recommended to limit the number of Dom0 vCPUs (e.g., with
'dom0_max_vcpus=x'). Otherwise, all the pCPUs will have
one of Dom0's vCPUs assigned, and there won't be room for
running any guest efficiently (if at all).

Target use cases are embedded and HPC, but it may well
be interesting also in other circumstances.

Kconfig and documentation are updated accordingly.

While there, also document the availability of sched=rtds
as boot parameter, which apparently had been forgotten.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: sched: make sure a pCPU added to a pool runs the scheduler ASAP
Dario Faggioli [Fri, 7 Apr 2017 12:28:08 +0000 (14:28 +0200)]
xen: sched: make sure a pCPU added to a pool runs the scheduler ASAP

When a pCPU is added to a cpupool, the pool's scheduler
should immediately run on it so, for instance, any runnable
but not running vCPU can start executing there.

This currently does not happen. Make it happen by raising
the scheduler softirq directly from the function that
sets up the new scheduler for the pCPU.
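
Concretely, that boils down to a single extra line at the end of the per-pCPU scheduler switch (surrounding context assumed):

    cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);   /* reschedule on 'cpu' right away */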

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: sched: improve robustness (and rename) DOM2OP()
Dario Faggioli [Fri, 7 Apr 2017 12:28:01 +0000 (14:28 +0200)]
xen: sched: improve robustness (and rename) DOM2OP()

Clarify and enforce (with ASSERTs) when the function
is called on the idle domain, and explain in comments
what it means and when it is ok to do so.

While there, change the name of the function to a more
self-explanatory one, and do the same to VCPU2OP.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
8 years agox86/mce: always re-initialize 'severity_cpu' in mcheck_cmn_handler()
Haozhong Zhang [Fri, 7 Apr 2017 13:56:09 +0000 (15:56 +0200)]
x86/mce: always re-initialize 'severity_cpu' in mcheck_cmn_handler()

mcheck_cmn_handler() does not always set 'severity_cpu' to override
its value taken from previous rounds of MC handling, which will
interfere with the current round of MC handling. Always re-initialize it to
clear the historical value.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/mce: make 'severity_cpu' private to its users
Haozhong Zhang [Fri, 7 Apr 2017 13:55:34 +0000 (15:55 +0200)]
x86/mce: make 'severity_cpu' private to its users

The current 'severity_cpu' is used by both mcheck_cmn_handler() and
mce_softirq(). If MC# happens during mce_softirq(), the values set in
mcheck_cmn_handler() and mce_softirq() may interfere with each
other. Use a private 'severity_cpu' for each function to fix this issue.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/monitor: add support for descriptor access events
Adrian Pop [Fri, 7 Apr 2017 13:39:32 +0000 (15:39 +0200)]
x86/monitor: add support for descriptor access events

Adds monitor support for descriptor access events (reads & writes of
IDTR/GDTR/LDTR/TR) for the x86 architecture (VMX and SVM).

Signed-off-by: Adrian Pop <apop@bitdefender.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[jb: minor cosmetic (hopefully!) cleanup]
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agopassthrough/io: fall back to remapping interrupt when we can't use VT-d PI
Chao Gao [Fri, 7 Apr 2017 13:38:40 +0000 (15:38 +0200)]
passthrough/io: fall back to remapping interrupt when we can't use VT-d PI

The current logic of using VT-d PI is: when the guest configures the pirq's
destination to a single vCPU, the corresponding IRTE is updated to the
posted format. If the destination of the pirq is multiple vCPUs, we
currently stay in posted format. Obviously, we should fall back to
remapped interrupt delivery when the guest wrongly configures the
destination of the pirq, or makes it have multiple destination vCPUs.
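
The resulting decision has roughly this shape (names approximate, context assumed to be the pirq binding path):

    /* Use the posted format only when the pirq resolves to exactly one,
     * valid destination vCPU; otherwise keep/restore remapped delivery. */
    vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
    if ( iommu_intpost && vcpu )
        pi_update_irte(vcpu, info, pirq_dpci->gmsi.gvec);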

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[jb: guard against vcpu being NULL]
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agoVT-d: introduce update_irte to update irte safely
Chao Gao [Fri, 7 Apr 2017 13:38:17 +0000 (15:38 +0200)]
VT-d: introduce update_irte to update irte safely

We used structure assignment to update the IRTE, which was non-atomic
when the whole IRTE was to be updated. This is unsafe if an interrupt
happens during the update; furthermore, no bug or warning would be
reported when this happened.

This patch introduces two variants, atomic and non-atomic, to update
the IRTE. For the initialization and release cases, the non-atomic
variant will be used; for other cases (such as reprogramming to set IRQ
affinity), the atomic variant will be used. If the caller requests an
atomic update but we can't meet that, we raise a bug.
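
A simplified sketch of the helper (the 128-bit cmpxchg path is only hinted at; the placeholder name is hypothetical):

    static void update_irte(struct iremap_entry *entry,
                            const struct iremap_entry *new_ire, bool atomic_p)
    {
        if ( cpu_has_cx16 )
            whole_entry_cmpxchg(entry, new_ire); /* placeholder for cmpxchg16b path */
        else
        {
            ASSERT(!atomic_p);      /* atomic update requested but not achievable */
            *entry = *new_ire;      /* plain, non-atomic structure assignment */
        }
    }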

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com> [x86]
8 years agoVMX: fixup PI descriptor when cpu is offline
Feng Wu [Fri, 7 Apr 2017 13:37:55 +0000 (15:37 +0200)]
VMX: fixup PI descriptor when cpu is offline

When a CPU goes offline, we need to move all the vCPUs in its blocking
list to another online CPU; this patch handles that.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
8 years agoVT-d: some cleanups
Feng Wu [Fri, 7 Apr 2017 13:37:33 +0000 (15:37 +0200)]
VT-d: some cleanups

Use type-safe structure assignment instead of memcpy().
Use sizeof(*iremap_entry).
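
The cleanup in a nutshell (the local variable name is illustrative):

    *iremap_entry = new_ire;   /* type-safe structure assignment ...            */
    /* ... instead of: memcpy(iremap_entry, &new_ire, sizeof(*iremap_entry)); */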

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
8 years agoVT-d: introduce new fields in msi_desc to track binding with guest interrupt
Feng Wu [Fri, 7 Apr 2017 13:37:07 +0000 (15:37 +0200)]
VT-d: introduce new fields in msi_desc to track binding with guest interrupt

msi_msg_to_remap_entry() is buggy when the live IRTE is in posted format: it
wrongly inherits the 'im' field (meaning the IRTE is in posted format) but
updates all the other fields to remapped format.

There are two situations that lead to the above issue. One is that some
callers really want to change the IRTE to remapped format. The other is that
some callers only want to update the MSI message (e.g. to set MSI affinity),
because they are not aware that this MSI is bound to a guest interrupt; we
should suppress the update in this second situation. To distinguish them, we
could simply let the caller specify which IRTE format it wants to update to,
but that isn't feasible: making all callers aware of the binding with a guest
interrupt would cause a far more complicated change (including the interfaces
exposed to IOAPIC and MSI). Also, some calls happen in interrupt context,
where we can't acquire d->event_lock to read struct hvm_pirq_dpci.

This patch introduces two new fields in msi_desc to track binding with a guest
interrupt such that msi_msg_to_remap_entry() can get the binding and update
IRTE accordingly. After that change, pi_update_irte() can utilize
msi_msg_to_remap_entry() to update IRTE to posted format.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agopassthrough: don't migrate pirq when it is delivered through VT-d PI
Chao Gao [Fri, 7 Apr 2017 13:36:20 +0000 (15:36 +0200)]
passthrough: don't migrate pirq when it is delivered through VT-d PI

When a vCPU is migrated to another pCPU, pt IRQs bound to this vCPU might
also need migration, as an optimization to reduce IPIs between pCPUs. When
VT-d PI is enabled, the interrupt vector is recorded in a main-memory-resident
data structure and a notification, whose destination is decided by NDST, is
generated. NDST is properly adjusted during vCPU migration, so a pirq directly
injected to the guest needn't be migrated.

This patch adds an indicator, @posted, to show whether the pt IRQ is delivered
through VT-d PI. It also fixes a bug whereby hvm_migrate_pirq() accessed
pirq_dpci->gmsi.dest_vcpu_id without checking the pirq_dpci's type.
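
A hedged sketch of the added checks in hvm_migrate_pirq() (flag and field names approximate):

    /* Only MSI-type dpci carries gmsi state, and a posted-delivery pirq does
     * not need software migration: the PI machinery updates NDST itself. */
    if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MACH_MSI) || pirq_dpci->gmsi.posted )
        return 0;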

Signed-off-by: Chao Gao <chao.gao@intel.com>
[jb: remove an extraneous check from hvm_migrate_pirq()]
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86: add multiboot2 protocol support for relocatable images
Daniel Kiper [Fri, 7 Apr 2017 11:37:24 +0000 (13:37 +0200)]
x86: add multiboot2 protocol support for relocatable images

Add multiboot2 protocol support for relocatable images. Only GRUB2 with
the "multiboot2: Add support for relocatable images" patch understands
this feature. Older multiboot protocol (regardless of version)
compatible loaders ignore it and everything works as usual.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
8 years agox86/boot: rename sym_phys() to sym_offs()
Daniel Kiper [Fri, 7 Apr 2017 11:37:02 +0000 (13:37 +0200)]
x86/boot: rename sym_phys() to sym_offs()

This way the macro name better describes its function.
Currently it is used to calculate a symbol's offset
relative to the beginning of the Xen image mapping.
However, the value returned by sym_offs() for a given
symbol is not always equal to its physical address.

There is no functional change.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
8 years agox86: make Xen early boot code relocatable
Daniel Kiper [Fri, 7 Apr 2017 11:36:32 +0000 (13:36 +0200)]
x86: make Xen early boot code relocatable

Every multiboot protocol (regardless of version) compatible image must
specify its load address (in the ELF or multiboot header), and a multiboot
protocol compatible loader has to load the image at that address. However,
there is no guarantee that the requested memory region (in the case of Xen it
starts at 2 MiB and ends at ~5 MiB), where the image should initially be
loaded, is RAM and is free (legacy BIOS platforms are merciful to Xen, but I
found at least one EFI platform on which the Xen load address conflicts with
EFI boot services; it is a Dell PowerEdge R820 with the latest firmware). To
cope with that problem we must make the Xen early boot code relocatable and
help the boot loader relocate the image properly, by suggesting allowed
address ranges instead of requesting specific load addresses as is done right
now. This patch does the former. It does not add the multiboot2 protocol
interface, which is done in the "x86: add multiboot2 protocol support for
relocatable images" patch.

This patch changes the following things:
  - the %esi register is used as storage for the Xen image load base address;
    it is mostly unused in early boot code and preserved across C function
    calls in 32-bit mode,
  - %fs is used as the base for Xen data relative addressing in 32-bit code
    where possible; %esi is used for that purpose during error printing,
    because it is not always possible to properly and efficiently
    initialize %fs.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/setup: use XEN_IMG_OFFSET instead of...
Daniel Kiper [Fri, 7 Apr 2017 11:36:01 +0000 (13:36 +0200)]
x86/setup: use XEN_IMG_OFFSET instead of...

...calculating its value at runtime.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>