Roger Pau Monne [Tue, 21 Feb 2017 15:41:44 +0000 (15:41 +0000)]
x86/vioapic: allow the vIO APIC to have a variable number of pins
Although it's still always set to VIOAPIC_NUM_PINS (48). Add a new field to the
hvm_hw_ioapic struct to contain the number of pins (number of IO redirection
table registers), and add the migration compatibility code.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Tue, 21 Feb 2017 15:41:44 +0000 (15:41 +0000)]
x86/vioapic: move domain out of hvm_vioapic struct
And then remove hvm_vioapic (since it just contains a hvm_hw_ioapic struct
now). This is a preparatory change for introducing support for multiple vIO
APICs per domain.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Tue, 21 Feb 2017 15:41:44 +0000 (15:41 +0000)]
xen/x86: setup PVHv2 Dom0 ACPI tables
Create a new MADT table that contains the topology exposed to the guest. A
new XSDT table is also created, in order to filter the tables that we want
to expose to the guest, plus the Xen crafted MADT. This in turn requires Xen
to also create a new RSDP in order to make it point to the custom XSDT.
Also, regions marked as E820_ACPI or E820_NVS are identity mapped into Dom0
p2m, plus any top-level ACPI tables that should be accessible to Dom0 and
reside in reserved regions. This is needed because some memory maps don't
properly account for all the memory used by ACPI, so it's common to find ACPI
tables in reserved regions.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v5:
- s/hvm_copy_to_guest_phys_vcpu/hvm_copy_to_guest_phys/.
- Move pvh_add_mem_range to previous patch.
- Add a comment regarding the current limitation to only 1 emulated IO APIC.
- s/dom0_max_vcpus()/max_vcpus/ in pvh_setup_acpi_madt.
- Cast structures to void when assigning.
- Declare banned_tables with the initconst annotation.
- Expand some comments messages.
- Initialize the RSDP local variable.
- Only provide x2APIC entries in the MADT.
Changes since v4:
- s/hvm/pvh.
- Use hvm_copy_to_guest_phys_vcpu.
- Don't allocate up to E820MAX entries for the Dom0 memory map and instead
allow pvh_add_mem_range to dynamically grow the memory map.
- Add a comment about the lack of x2APIC MADT entries.
- Change acpi_intr_overrides to unsigned int and the max iterator bound to
UINT_MAX.
- Set the MADT version as the minimum version between the hardware value and
our supported version (4).
- Set the MADT IO APIC ID to the current value of the domain vioapic->id.
- Use void * when subtracting two pointers.
- Fix indentation of nr_pages and use PFN_UP instead of DIV_ROUND_UP.
- Change wording of the pvh_acpi_table_allowed error message.
- Make j unsigned in pvh_setup_acpi_xsdt.
- Move initialization of local variables with declarations in
pvh_setup_acpi_xsdt.
- Reword the comment about the allocated size of the xsdt custom table.
- Fix line splitting.
- Add a comment regarding the layering violation caused by the usage of
acpi_tb_checksum.
- Pass IO APIC NMI sources found in the MADT to Dom0.
- Create x2APIC entries if the native MADT also contains them.
- s/acpi_intr_overrrides/acpi_intr_overrides/.
- Make sure the MADT is properly mapped into Dom0, or else Dom0 might not be
able to access the output of the _MAT method depending on the
implementation.
- Get the first ACPI processor ID and use that as the base processor ID of the
crafted MADT. This is done so that local/x2 APIC NMI entries match with the
local/x2 APIC objects.
Changes since v3:
- Use hvm_copy_to_phys in order to copy the tables to Dom0 memory.
- Return EEXIST for overlapping ranges in hvm_add_mem_range.
- s/ov/ovr/ for interrupt override parsing functions.
- Constify intr local variable in acpi_set_intr_ovr.
- Use structure assignment for type safety.
- Perform sizeof using local variables in hvm_setup_acpi_madt.
- Manually set revision of crafted/modified tables.
- Only map tables to guest that reside in reserved or ACPI memory regions.
- Copy the RSDP OEM signature to the crafted RSDP.
- Pair calls to acpi_os_map_memory/acpi_os_unmap_memory.
- Add memory regions for allowed ACPI tables to the memory map and then
perform the identity mappings. This avoids having to call modify_identity_mmio
multiple times.
- Add a FIXME comment regarding the lack of multiple vIO-APICs.
Roger Pau Monne [Tue, 21 Feb 2017 15:41:44 +0000 (15:41 +0000)]
xen/x86: Setup PVHv2 Dom0 CPUs
Initialize Dom0 BSP/APs and setup the memory and IO permissions. This also sets
the initial BSP state in order to match the protocol specified in
docs/misc/hvmlite.markdown.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v5:
- Make cpus and i unsigned ints.
- Use an initializer for cpu_ctx (and remove the memset).
- Move the clear_bit of vcpu 0 to the end of pvh_setup_cpus.
Roger Pau Monne [Tue, 21 Feb 2017 15:41:43 +0000 (15:41 +0000)]
xen/x86: parse Dom0 kernel for PVHv2
Introduce a helper to parse the Dom0 kernel.
A new helper is also introduced to libelf, that's used to store the destination
vcpu of the domain. This parameter is needed when loading the kernel on a HVM
domain (PVHv2), since hvm_copy_to_guest_phys requires passing the destination
vcpu.
While there also fix image_base and image_start to be of type "void *", and do
the necessary fixup of related functions.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v5:
- s/hvm_copy_to_guest_phys_vcpu/hvm_copy_to_guest_phys/.
- Use void * for image_base and image_start, make the necessary changes.
- Introduce elf_set_vcpu in order to store the destination vcpu in
elf_binary, and use it in elf_load_image. This avoids having to override
current.
- Style fixes.
- Round up the position of the modlist/start_info to an aligned address
depending on the kernel bitness.
Changes since v4:
- s/hvm/pvh.
- Use hvm_copy_to_guest_phys_vcpu.
Changes since v3:
- Change one error message.
- Indent "out" label by one space.
- Introduce hvm_copy_to_phys and slightly simplify the code in hvm_load_kernel.
Changes since v2:
- Remove debug messages.
- Don't hardcode the number of modules to 1.
Roger Pau Monne [Tue, 21 Feb 2017 15:41:43 +0000 (15:41 +0000)]
x86/libelf: pass the destination vCPU to libelf for Dom0 build
Allow setting the destination vCPU for libelf, so that elf_load_image can take
it into account when loading the kernel for Dom0. This is needed for PVHv2 Dom0
build, so that hvm_copy_to_guest_phys can be called with a Dom0 vCPU instead of
current (that contains the idle vCPU at this point).
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v6:
- New in this version.
Roger Pau Monne [Tue, 21 Feb 2017 15:41:43 +0000 (15:41 +0000)]
xen/x86: populate PVHv2 Dom0 physical memory map
Craft the Dom0 e820 memory map and populate it. Introduce a helper to remove
memory pages that are shared between Xen and a domain, and use it in order to
remove low 1MB RAM regions from dom_io in order to assign them to a PVHv2 Dom0.
On hardware lacking support for unrestricted mode also craft the identity page
tables and the TSS used for virtual 8086 mode.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v6:
- Rebase on top of Jan VM86 TSS fix.
- Use hvm_copy_to_guest_phys to zero the TSS area.
- Request the TSS memory area to be aligned to 128.
- Move write_32bit_pse_identmap to arch-specific mm.c file.
Changes since v5:
- Adjust the logic to set need_paging.
- Remove the usage of the _AC macro.
- Subtract memory from the end of regions instead of the start.
- Create the VM86_TSS before the identity page table, so that the page table
is aligned to a page boundary.
- Use MB1_PAGES in modify_identity_mmio.
- Move and simplify the ASSERT in pvh_setup_p2m.
- Move the creation of the PSE page tables to a separate function, and use it
in shadow_enable also.
- Make the map modify_identity_mmio parameter a constant.
- Add a comment to HVM_VM86_TSS_SIZE, although it seems this might need
further fixing.
- Introduce pvh_add_mem_range in order to mark the regions used by the VM86
TSS and the identity page tables as reserved in the memory map.
- Add a parameter to request aligned memory from pvh_steal_ram.
Changes since v4:
- Move process_pending_softirqs to previous patch.
- Fix off-by-one errors in some checks.
- Make unshare_xen_page_with_guest __init.
- Improve unshare_xen_page_with_guest by making use of already existing
is_xen_heap_page and put_page.
- s/hvm/pvh/.
- Use PAGE_ORDER_4K in pvh_setup_e820 in order to keep consistency with the
p2m code.
Changes since v3:
- Drop get_order_from_bytes_floor, it was only used by
hvm_populate_memory_range.
- Switch hvm_populate_memory_range to use frame numbers instead of full memory
addresses.
- Add a helper to steal the low 1MB RAM areas from dom_io and add them to Dom0
as normal RAM.
- Introduce unshare_xen_page_with_guest in order to remove pages from dom_io,
so they can be assigned to other domains. This is needed in order to remove
the low 1MB RAM regions from dom_io and assign them to the hardware_domain.
- Simplify the loop in hvm_steal_ram.
- Move definition of map_identity_mmio into this patch.
Changes since v2:
- Introduce get_order_from_bytes_floor as a local function to
domain_build.c.
- Remove extra asserts.
- Make hvm_populate_memory_range return an error code instead of panicking.
- Fix comments and printks.
- Use ULL suffix instead of casting to uint64_t.
- Rename hvm_setup_vmx_unrestricted_guest to
hvm_setup_vmx_realmode_helpers.
- Only subtract two pages from the memory calculation, that will be used
by the MADT replacement.
- Remove some comments.
- Remove printing allocation information.
- Don't stash any pages for the MADT, TSS or ident PT, those will be
subtracted directly from RAM regions of the memory map.
- Count the number of iterations before calling process_pending_softirqs
when populating the memory map.
- Move the initial call to process_pending_softirqs into construct_dom0,
and remove the ones from construct_dom0_hvm and construct_dom0_pv.
- Make memflags global so it can be shared between alloc_chunk and
hvm_populate_memory_range.
Changes since RFC:
- Use IS_ALIGNED instead of checking with PAGE_MASK.
- Use the new %pB specifier in order to print sizes in human readable form.
- Create a VM86 TSS for hardware that doesn't support unrestricted mode.
- Subtract guest RAM for the identity page table and the VM86 TSS.
- Split the creation of the unrestricted mode helper structures to a
separate function.
- Use preemption with paging_set_allocation.
- Use get_order_from_bytes_floor.
Roger Pau Monne [Tue, 21 Feb 2017 15:41:42 +0000 (15:41 +0000)]
xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
PVHv2 guests, unlike HVM guests, won't have the option to route interrupts
from physical or emulated devices over event channels using PIRQs. This
applies to both DomU and Dom0 PVHv2 guests.
Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
route physical interrupts (even from emulated devices) over event channels,
and is thus allowed to use some of the PHYSDEV ops.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v6:
- Rebase on top of the HVM hypercall changes (and drop Andrew's RB).
Changes since v5:
- Introduce a has_pirq macro to match other XEN_X86_EMU_ options, and simplify
some of the code.
Changes since v3:
- Update docs.
Changes since v2:
- Change local variable name to currd instead of d.
- Use currd where it makes sense.
Roger Pau Monne [Tue, 21 Feb 2017 15:41:42 +0000 (15:41 +0000)]
x86/VMX: sanitize VM86 TSS handling
The present way of setting this up is flawed: Leaving the I/O bitmap
pointer at zero means that the interrupt redirection bitmap lives
outside (ahead of) the allocated space of the TSS. Similarly setting a
TSS limit of 255 when only 128 bytes get allocated means that 128 extra
bytes may be accessed by the CPU during I/O port access processing.
Introduce a new HVM param to set the allocated size of the TSS, and
have the hypervisor actually take care of setting namely the I/O bitmap
pointer. Both this and the segment limit now take the allocated size
into account.
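The arithmetic behind the flaw can be checked with a small sketch. The helper names below are hypothetical and are not Xen functions; only the 255-byte limit and the 128-byte allocation come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes the CPU may touch beyond the allocation for a given segment limit. */
static uint32_t tss_overrun(uint32_t limit, uint32_t allocated)
{
    uint32_t reachable = limit + 1; /* a limit of N covers bytes 0..N */

    assert(allocated > 0);
    return reachable > allocated ? reachable - allocated : 0;
}

/* A limit derived from the allocated size, so nothing past it is reachable. */
static uint32_t tss_limit_for(uint32_t allocated)
{
    assert(allocated > 0);
    return allocated - 1;
}
```

With the old values, tss_overrun(255, 128) gives the 128 extra bytes mentioned above, while a limit taken from tss_limit_for(128) leaves no overrun.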
Andrew Cooper [Fri, 6 Jan 2017 15:10:34 +0000 (15:10 +0000)]
x86/cpuid: Always enable faulting for the control domain
The domain builder in libxc no longer depends on leaked CPUID information to
properly construct HVM domains. Remove the control domain exclusion.
On capable hardware, this prevents all unintended leakage of hardware CPUID
values into the control domain, and brings the hypervisor leaves into view.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
--- CC: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Fri, 17 Feb 2017 18:31:45 +0000 (18:31 +0000)]
x86/cpuid: Handle leaf 0xb in guest_cpuid()
Leaf 0xb is reserved by AMD, and uniformly hidden from guests by the toolstack
logic and hypervisor PV logic.
The previous dynamic logic filled in the x2APIC ID for all HVM guests. This
is modified to respect the entire leaf being reserved by AMD, but is altered
to include PV Intel guests, so they get more sensible values in their emulated
and faulted view of CPUID.
Sensibly exposing the rest of the leaf requires further topology
infrastructure.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
--- CC: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Fri, 17 Feb 2017 18:24:45 +0000 (18:24 +0000)]
x86/cpuid: Handle leaf 0xa in guest_cpuid()
Leaf 0xa is reserved by AMD, and only exposed to Intel guests when vPMU is
enabled. Leave the logic as-was, ready to be cleaned up when further
toolstack infrastructure is in place.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
--- CC: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Fri, 17 Feb 2017 18:03:58 +0000 (18:03 +0000)]
x86/cpuid: Handle leaf 0x6 in guest_cpuid()
The thermal/performance leaf was previously hidden from HVM guests, but fully
visible to PV guests. Most of the leaf refers to MSR availability, and there
is nothing an unprivileged PV guest can do with the information, so hide the
leaf entirely.
The PV MSR handling logic has minimal support for some thermal/perf operations
from the hardware domain, so leak through the implemented subset of features.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
--- CC: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Fri, 17 Feb 2017 17:32:29 +0000 (17:32 +0000)]
x86/cpuid: Handle leaf 0x5 in guest_cpuid()
The MONITOR flag isn't exposed to guests. The existing toolstack logic, and
pv_cpuid() in the hypervisor, zero the MONITOR leaf for queries.
However, the MONITOR leaf is still visible in the hardware domain's native
CPUID view, and Linux depends on this to set up C-state information. Leak the
host's MONITOR leaf under the same circumstances that the MONITOR feature is
leaked.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
--- CC: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Fri, 17 Feb 2017 17:21:35 +0000 (17:21 +0000)]
x86/cpuid: Handle leaf 0x4 in guest_cpuid()
Leaf 0x4 is reserved by AMD. For Intel, it is a multi-invocation leaf with
ecx enumerating different cache details.
Add a new union for it in struct cpuid_policy, collect it from hardware in
calculate_raw_policy(), audit it in recalculate_cpuid_policy() and update
guest_cpuid() and update_domain_cpuid_info() to properly insert/extract data.
A lot of the data here will need further auditing/refinement when better
topology support is introduced, but for now, this matches the existing
toolstack behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
--- CC: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Fri, 17 Feb 2017 17:10:50 +0000 (17:10 +0000)]
x86/cpuid: Handle leaf 0x1 in guest_cpuid()
The features words, ecx and edx, are already audited as part of the featureset
logic. The existing leaf 0x80000001 dynamic logic has its SYSCALL adjustment
split out, as the rest of the adjustments are common with leaf 0x1. The
existing leaf 0x1 feature adjustments from {pv,hvm}_cpuid() are moved
wholesale into guest_cpuid(), although deduped against the common adjustments.
The eax word is family/model/stepping information, and is fine to use as
provided by the toolstack, although with reserved bits cleared.
The ebx word is more problematic. The low 8 bits are the brand ID and safe to
pass straight through. The next 8 bits are the CLFLUSH line size. This value
is forwarded straight from hardware, as nothing good can possibly come of
providing an alternative value to the guest.
The next 8 bits are slightly different between Intel and AMD, but are both
some property of the number of logical cores in the current physical package.
For now, the toolstack value is used unchanged until better topology support
is available.
The final 8 bits are the initial legacy APIC ID. For HVM guests, this was
overridden to vcpu_id * 2. The same logic is now applied to PV guests, so
guests don't observe a constant number on all vcpus via their emulated or
faulted view.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
--- CC: Jan Beulich <JBeulich@suse.com> CC: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Boris: This conflicts textually but not functionally with your vPMU
adjustments. Whichever way round we end up needing to rebase should be easy.
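The ebx layout described above can be sketched as a small composition helper. The function and parameter names are hypothetical, not the guest_cpuid() code; only the field positions and the vcpu_id * 2 rule are taken from the commit message, and the restriction to IDs that fit in 8 bits is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compose the leaf 0x1 ebx word from a toolstack-provided value, the
 * hardware value and the vcpu id. Hypothetical helper for illustration.
 */
static uint32_t leaf1_ebx(uint32_t toolstack_ebx, uint32_t hw_ebx,
                          unsigned int vcpu_id)
{
    uint32_t brand_id = toolstack_ebx & 0x000000ffu; /* bits 7:0, passed through */
    uint32_t clflush  = hw_ebx        & 0x0000ff00u; /* bits 15:8, from hardware */
    uint32_t logical  = toolstack_ebx & 0x00ff0000u; /* bits 23:16, toolstack value */
    uint32_t apic_id;

    /* Sketch assumption: the legacy APIC ID must fit in 8 bits. */
    assert(vcpu_id < 128);
    apic_id = vcpu_id * 2u;                          /* bits 31:24 */

    return brand_id | clflush | logical | (apic_id << 24);
}
```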
Roger Pau Monné [Fri, 17 Feb 2017 15:10:28 +0000 (16:10 +0100)]
build: enable no-parentheses in clang
And fix the following errors reported:
traps.c:2014:25: error: equality comparison with extraneous parentheses
[-Werror,-Wparentheses-equality]
else if ( (port == RTC_PORT(0)) )
~~~~~^~~~~~~~~~~~~~
traps.c:2014:25: note: remove extraneous parentheses around the comparison to silence this warning
else if ( (port == RTC_PORT(0)) )
~ ^ ~
traps.c:2014:25: note: use '=' to turn this equality comparison into an assignment
else if ( (port == RTC_PORT(0)) )
^~
=
traps.c:2083:25: error: equality comparison with extraneous parentheses
[-Werror,-Wparentheses-equality]
else if ( (port == RTC_PORT(0)) )
~~~~~^~~~~~~~~~~~~~
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Fri, 17 Feb 2017 15:09:38 +0000 (16:09 +0100)]
build/printf: fix incorrect format specifiers
The following incorrect format specifiers and incorrect number of parameters
passed to printf like functions are reported by clang:
mce.c:601:18: error: data argument not used by format string [-Werror,-Wformat-extra-args]
smp_processor_id());
^
xenpm.c:102:23: error: data argument not used by format string [-Werror,-Wformat-extra-args]
what, argv[argc > 1]);
^
libxl_internal.c:25:69: error: data argument not used by format string
[-Werror,-Wformat-extra-args]
libxl__log(ctx, XTL_CRITICAL, ENOMEM, 0,0, func, INVALID_DOMID, L);
^
libxl_internal.c:24:17: note: expanded from macro 'L'
func, (unsigned long)nmemb, (unsigned long)size
^
libxl_internal.c:26:21: error: data argument not used by format string
[-Werror,-Wformat-extra-args]
fprintf(stderr, L);
^
libxl_internal.c:24:17: note: expanded from macro 'L'
func, (unsigned long)nmemb, (unsigned long)size
^
This patch contains the fixes for them and enables -Wformat for clang.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monné [Fri, 17 Feb 2017 15:08:37 +0000 (16:08 +0100)]
x86/vmx: fix compilation after 997382
997382 introduced the following errors:
intr.c:342:46: error: address of array 'vlapic->regs->data' will always evaluate to 'true'
[-Werror,-Wpointer-bool-conversion]
if ( vlapic && vlapic->regs->data )
~~ ~~~~~~~~~~~~~~^~~~
intr.c:352:42: error: address of array 'pi_desc->pir' will always evaluate to 'true'
[-Werror,-Wpointer-bool-conversion]
if ( pi_desc && pi_desc->pir )
~~ ~~~~~~~~~^~~
Both of those checks are done against static arrays, which doesn't seem to make
much sense, so just remove them.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
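Why clang flags such checks can be shown with a simplified sketch. The structure below is hypothetical and much smaller than the real descriptor; it only illustrates that an array embedded in a struct always has an address, so the enclosing pointer is the only thing that can be NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified shape; the real descriptor differs. */
struct pi_desc_sketch {
    uint32_t pir[8];   /* array embedded in the struct, not a pointer */
    uint32_t control;
};

/* Only the struct pointer can be NULL, so it is the only thing worth testing. */
static bool pi_desc_usable(const struct pi_desc_sketch *d)
{
    return d != NULL;
}

/* The array member decays to an address inside the enclosing object. */
static const uint32_t *pir_base(const struct pi_desc_sketch *d)
{
    assert(d != NULL);
    return d->pir;
}
```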
Jan Beulich [Fri, 17 Feb 2017 14:59:15 +0000 (15:59 +0100)]
console: avoid wrapping of console pointers
We particularly want/need to avoid accessing data outside (ahead of)
the ring buffer. Also latch both pointers into local variables to
avoid different steps of the calculation being done with different
values.
Reported-by: Quan Luo <a4651386@163.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Haozhong Zhang [Fri, 17 Feb 2017 14:55:47 +0000 (15:55 +0100)]
x86/mce: remove declarations of non-existing functions in mce.h
Remove declarations of functions
intel_mcheck_timer()
mce_intel_feature_init()
mce_cap_init()
x86_mcinfo_getptr()
whose definitions were removed a long time ago.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Norbert Manthey [Fri, 17 Feb 2017 14:51:37 +0000 (15:51 +0100)]
x86/mm: fix memory hotplug error cleanup
While destroying the m2p mapping, the loop variable was always incremented
by one, as the current version used a compare operator on the left hand side,
which always evaluated to true, i.e.
i += 1UL < (L2_PAGETABLE_SHIFT - 2)
The fix increments the value of the variable by the actual page size by
using the shift operator instead.
Signed-off-by: Norbert Manthey <nmanthey@amazon.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
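The difference between the two operators can be reproduced in isolation. In this sketch the value 21 for L2_PAGETABLE_SHIFT is a local stand-in chosen for illustration, and the helper names are hypothetical.

```c
#include <assert.h>

/* Local stand-in for the real constant, for illustration only. */
#define L2_PAGETABLE_SHIFT 21

/* Buggy step: '<' is a comparison, so the expression yields 0 or 1. */
static unsigned long buggy_step(void)
{
    return 1UL < (L2_PAGETABLE_SHIFT - 2);
}

/* Fixed step: '<<' is a shift, yielding 2^(L2_PAGETABLE_SHIFT - 2). */
static unsigned long fixed_step(void)
{
    return 1UL << (L2_PAGETABLE_SHIFT - 2);
}

/* Count the loop iterations needed to cover 'total' with a given step. */
static unsigned long iterations(unsigned long total, unsigned long step)
{
    unsigned long i, n = 0;

    assert(step != 0);
    for ( i = 0; i < total; i += step )
        n++;
    return n;
}
```

With the comparison the loop advances one entry at a time, so covering 2^20 entries takes 2^20 iterations instead of 2.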
Jan Beulich [Fri, 17 Feb 2017 14:51:03 +0000 (15:51 +0100)]
x86: package up context switch hook pointers
They're all solely dependent on guest type, so we don't need to repeat
all the same three pointers in every vCPU control structure. Instead use
static const structures, and store pointers to them in the domain
control structure.
Since touching it anyway, take the opportunity and expand
schedule_tail() in the only two places invoking it, allowing the macro
to be dropped.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
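The pattern described above, one static const hook table per guest type with a pointer to it in the domain structure, might look roughly like the following. All structure, field and function names are hypothetical and do not reproduce the Xen definitions.

```c
#include <assert.h>
#include <stddef.h>

struct vcpu_sketch;

/* Hook table shared by every vCPU of one guest type (hypothetical names). */
struct ctxt_switch_ops {
    void (*from)(struct vcpu_sketch *);
    void (*to)(struct vcpu_sketch *);
    void (*tail)(struct vcpu_sketch *);
};

struct domain_sketch {
    const struct ctxt_switch_ops *ctxt_switch; /* one pointer per domain */
};

struct vcpu_sketch {
    struct domain_sketch *domain;
    unsigned int trace; /* records which hooks ran, for illustration only */
};

static void pv_from(struct vcpu_sketch *v) { v->trace |= 1u; }
static void pv_to(struct vcpu_sketch *v)   { v->trace |= 2u; }
static void pv_tail(struct vcpu_sketch *v) { v->trace |= 4u; }

/* One static const instance per guest type, not three pointers per vCPU. */
static const struct ctxt_switch_ops pv_ops = {
    .from = pv_from,
    .to   = pv_to,
    .tail = pv_tail,
};

static void switch_vcpus(struct vcpu_sketch *prev, struct vcpu_sketch *next)
{
    assert(prev->domain->ctxt_switch && next->domain->ctxt_switch);
    prev->domain->ctxt_switch->from(prev);
    next->domain->ctxt_switch->to(next);
    next->domain->ctxt_switch->tail(next);
}
```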
Jan Beulich [Fri, 17 Feb 2017 14:49:56 +0000 (15:49 +0100)]
VMX: fix VMCS race on context-switch paths
When __context_switch() is being bypassed during original context
switch handling, the vCPU "owning" the VMCS partially loses control of
it: It will appear non-running to remote CPUs, and hence their attempt
to pause the owning vCPU will have no effect on it (as it already
looks to be paused). At the same time the "owning" CPU will re-enable
interrupts eventually (at the latest when entering the idle loop) and
hence becomes subject to IPIs from other CPUs requesting access to the
VMCS. As a result, when __context_switch() finally gets run, the CPU
may no longer have the VMCS loaded, and hence any accesses to it would
fail. Hence we may need to re-load the VMCS in vmx_ctxt_switch_from().
For consistency use the new function also in vmx_do_resume(), to
avoid leaving an open-coded incarnation of it around.
Reported-by: Kevin Mayer <Kevin.Mayer@gdata.de> Reported-by: Anshul Makkar <anshul.makkar@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com>
We don't need a lock in vgic_get_target_vcpu anymore, solving the
following lock inversion bug: the rank lock should be taken first, then
the vgic lock. However, gic_update_one_lr is called with the vgic lock
held, and it calls vgic_get_target_vcpu, which tries to obtain the rank
lock.
Jan Beulich [Thu, 16 Feb 2017 17:11:42 +0000 (18:11 +0100)]
x86emul: catch exceptions occurring in stubs
Before adding more use of stubs cloned from decoded guest insns, guard
ourselves against mistakes there: Should an exception (with the
noteworthy exception of #PF) occur inside the stub, forward it to the
guest.
Since the exception fixup table entry can't encode the address of the
faulting insn itself, attach it to the return address instead. This at
once provides a convenient place to hand the exception information
back: The return address is being overwritten by it before branching to
the recovery code.
Take the opportunity and (finally!) add symbol resolution to the
respective log messages (the new one is intentionally not being coded
that way, as it covers stub addresses only, which don't have symbols
associated).
Also take the opportunity and make search_one_extable() static again.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 14 Feb 2017 18:21:22 +0000 (18:21 +0000)]
x86/hypercall: Make the HVM hcall_64bit boolean common
HVM guests currently make use of arch.hvm_vcpu.hcall_64bit to track the ABI of
the hypercall in use.
The rest of Xen deals in terms of the compat ABI or not, so rename the boolean
and make it common, guarded by CONFIG_COMPAT to avoid bloat if a compat ABI is
not wanted/needed.
Set hcall_compat uniformly for PV guests as well as HVM guests. This removes
the remaining piece of guest-type-specific knowledge from
hypercall_create_continuation(), allowing it to operate only in terms of the
hypercall ABI in use.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 14 Feb 2017 17:02:04 +0000 (17:02 +0000)]
x86/hypercall: Make the HVM hcall_preempted boolean common
HVM guests currently make use of arch.hvm_vcpu.hcall_preempted to track
hypercall preemption in struct vcpu. Move this boolean to being common at the
top level of struct vcpu, which will allow it to be reused elsewhere.
Alter the PV preemption logic to use this boolean. This simplifies the code
by removing guest-type-specific knowledge, and removes the risk of accidentally
skipping backwards or forwards multiple times and corrupting %rip.
In pv_hypercall() the old_rip bodge can be removed, and parameter clobbering
can happen based on a more obvious condition.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 15 Feb 2017 18:04:58 +0000 (18:04 +0000)]
xen/include: Include xen/kconfig.h automatically
generated/autoconf.h is already included automatically so CONFIG_* defines are
available. However, the companion macros such as IS_ENABLED() are not
included.
Include them uniformly everywhere.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
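For readers unfamiliar with the companion macros, the following is a self-contained sketch of how an IS_ENABLED()-style test can be built from the preprocessor alone. It follows the well-known placeholder trick used by Linux-style kconfig.h headers, but every macro and option name here is local to the sketch and not taken from xen/kconfig.h.

```c
#include <assert.h>

#define CONFIG_FOO_SKETCH 1   /* hypothetical option that is enabled */
/* CONFIG_BAR_SKETCH is deliberately left undefined. */

/* An option defined to 1 turns into "0," and shifts the argument list. */
#define ARG_PLACEHOLDER_1 0,
#define take_second_arg(ignored, val, ...) val
#define is_defined_3(arg1_or_junk) take_second_arg(arg1_or_junk 1, 0, 0)
#define is_defined_2(value) is_defined_3(ARG_PLACEHOLDER_##value)
#define is_defined_1(x) is_defined_2(x)
#define IS_ENABLED_SKETCH(option) is_defined_1(option)

static_assert(IS_ENABLED_SKETCH(CONFIG_FOO_SKETCH) == 1,
              "an option defined to 1 evaluates to 1");

static int feature_level(void)
{
    /* Both branches are parsed and type-checked regardless of the option. */
    if ( IS_ENABLED_SKETCH(CONFIG_FOO_SKETCH) )
        return 1;
    return 0;
}
```

The benefit over #ifdef is that code for a disabled option still has to compile, which catches bit-rot early.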
George Dunlap [Wed, 15 Feb 2017 17:08:11 +0000 (17:08 +0000)]
tools/libxl: Introduce LIBXL_CPUPOOL_POOLID_ANY
Callers to libxl_cpupool_create() can either request a specific pool
id, or request that Xen do it for them. But at the moment, the
"automatic" selection is indicated by using a magic value, 0. This is
undesirable both because it doesn't obviously have meaning, but also
because '0' is a valid cpupool (albeit one which at the moment can't
be changed).
Introduce a constant, LIBXL_CPUPOOL_POOLID_ANY, to indicate this
instead. Still accept '0' as meaning "ANY" for backwards
compatibility.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
[ wei: removed two trailing spaces ] Signed-off-by: Wei Liu <wei.liu2@citrix.com>
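The backwards-compatibility rule described above can be sketched as a small normalisation step. The constant name and its value below are assumptions made for illustration; consult the libxl headers for the real definition of LIBXL_CPUPOOL_POOLID_ANY.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sentinel value, for illustration only. */
#define POOLID_ANY_SKETCH 0xFFFFFFFFu

static_assert(POOLID_ANY_SKETCH != 0u,
              "the sentinel must differ from the legacy magic value");

/* Map the legacy magic value 0 to "any"; pass everything else through. */
static uint32_t normalize_poolid(uint32_t requested)
{
    return requested == 0 ? POOLID_ANY_SKETCH : requested;
}
```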
George Dunlap [Wed, 15 Feb 2017 17:08:10 +0000 (17:08 +0000)]
tools/libxc: Introduce XC_CPUPOOL_POOLID_ANY
Callers to xc_cpupool_create() can either request a specific pool id,
or request that Xen do it for them. But at the moment, the
"automatic" selection is indicated by using a magic value, 0. This is
undesirable both because it doesn't obviously have meaning, but also
because '0' is a valid cpupool (albeit one which at the moment can't
be changed).
Introduce a constant, XC_CPUPOOL_POOLID_ANY, to indicate this instead.
Have it be the default for the python bindings.
Manually translate it, even though it's the same underlying value,
because we don't yet have a reliable way of enforcing that these
values are the same.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Wed, 15 Feb 2017 11:11:12 +0000 (12:11 +0100)]
libxl: correct xenstore entry for empty cdrom
Specifying an empty cdrom device will result in a Xenstore entry
params = aio:(null)
as the physical device path doesn't exist. This causes a domain booted
via OVMF to hang, as OVMF checks for "aio:" only in order to detect
the empty cdrom case.
Use an empty string for the physical device path in this case. As a
cdrom device for HVM is always backed by qdisk we only need to cover this
backend.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
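The "(null)" text is what glibc prints when a NULL pointer is passed for a %s conversion, which is formally undefined behaviour in C. A minimal sketch of the fix follows; format_params is a hypothetical helper and not a libxl function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the "params" value for a qdisk-backed device. A NULL path (an
 * empty cdrom) is written as an empty string, so the result is exactly
 * "aio:" and never "aio:(null)".
 */
static int format_params(char *buf, size_t len, const char *pdev_path)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "aio:%s", pdev_path ? pdev_path : "");
}
```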
Julien Grall [Fri, 3 Feb 2017 19:18:45 +0000 (19:18 +0000)]
xen/arm: acpi: Handle correctly detection of GICv2 on GICv3
When the GICv3 is not GICv2 compatible, the associated field in the MADT
will be zeroed. However, the rest of the code expects the variable to
be set to INVALID_PADDR.
This will result in false detection of GICv2 and give I/O access to page
0 for the hardware domain.
Thankfully, it will fail because the size of GICV has not been set.
Fix the detection by converting 0 to INVALID_PADDR for the GICC and
GICV base. At the same time only set the size of each region when the
base address is not 0.
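The conversion described above can be sketched as follows. The type, the INVALID_PADDR definition and the helper are all local to this example and only mirror the idea; they are not the Xen ACPI parsing code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t paddr_t;
#define INVALID_PADDR (~(paddr_t)0)  /* local definition for this sketch */

struct gic_region {
    paddr_t base;
    paddr_t size;
};

/* Hypothetical helper: turn a MADT address field into a region. */
static struct gic_region region_from_madt(paddr_t madt_addr, paddr_t size)
{
    struct gic_region r = { .base = INVALID_PADDR, .size = 0 };

    assert(size != 0);
    /* A zeroed field means "not present": keep INVALID_PADDR and size 0. */
    if ( madt_addr != 0 )
    {
        r.base = madt_addr;
        r.size = size;
    }
    return r;
}
```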
Julien Grall [Fri, 3 Feb 2017 19:21:13 +0000 (19:21 +0000)]
xen/arm: Remove the makefile target xen.axf
Since commit 4557c22 "xen: arm: rewrite start of day page table and cpu
bring up", Xen must be launched in NS HYP/EL2.
xen.axf is generated in order to directly boot Xen on ARM models (e.g.
Foundation). However they usually start in secure mode, which means Xen
cannot boot.
The way forward to boot Xen on models is using either EFI or
bootwrapper [1].
Andrew Cooper [Thu, 9 Feb 2017 17:08:44 +0000 (17:08 +0000)]
x86/asm: Use ASM_FLAG_OUT() to simplify atomic and bitop stubs
bitops.h cannot include asm_defns.h, because the static inlines in cpumasks.h
result in forward declarations of the bitops.h contents. Move ASM_FLAG_OUT()
to a new asm/compiler.h to compensate.
While making changes, switch bool_t to bool and use named asm parameters.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Wed, 15 Feb 2017 17:13:22 +0000 (17:13 +0000)]
xen/p2m: Fix p2m_flush_table for non-nested cases
Commit 71bb7304e7a7a35ea6df4b0cedebc35028e4c159 added flushing of
nested p2m tables whenever the host p2m table changed. Unfortunately
in the process, it added a filter to p2m_flush_table() function so
that the p2m would only be flushed if it was being used as a nested
p2m. This meant that the p2m was not being flushed at all for altp2m
callers.
Only check np2m_base if p2m_class indicates a nested p2m.
NB that this is not a security issue: The only time this codepath is
called is in cases where either nestedp2m or altp2m is enabled, and
neither of them is in security support.
Reported-by: Matt Leinhos <matt@starlab.io> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Tested-by: Tamas K Lengyel <tamas@tklengyel.com>
Dario Faggioli [Wed, 15 Feb 2017 15:47:29 +0000 (15:47 +0000)]
xen: sched: harmonize debug dump output among schedulers.
Information we currently print for idle vCPUs is
rather useless. Credit2 already stopped showing that,
do the same for Credit and RTDS.
Also, define a new CPU status dump hook, which is
not defined by those schedulers which already dump
such info in other ways (e.g., Credit2, which does
that while dumping runqueue information).
This also means that, still in Credit2, we can keep
the runqueue and pCPU info closer together.
That way we have one person who can: a) poke other maintainers
or pull them in when new drivers are introduced, b) we have
one maintainer who can shepherd the patches along instead of
depending on the REST maintainers which may be busy with
other responsibilities.
Acked-by: Ian Jackson <ian.jackson@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
There is a possible scenario when (d)->need_iommu remains unset
during guest domain execution. For example, when no devices
were assigned to it. Taking into account that teardown callback
is not called when (d)->need_iommu is unset we might have unreleased
resources after destroying domain.
So, always call teardown callback to roll back actions
that were performed in init callback.
This is XSA-207.
Signed-off-by: Oleksandr Tyshchenko <olekstysh@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Tested-by: Jan Beulich <jbeulich@suse.com> Tested-by: Julien Grall <julien.grall@arm.com>
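The leak and its fix can be illustrated with the following sketch. It is not the Xen IOMMU code: struct dom_sketch, the resource counter and all function names are hypothetical, and the only behaviour taken from the commit message is that teardown used to be skipped when need_iommu was unset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical domain state for illustration only. */
struct dom_sketch {
    bool need_iommu;       /* may stay unset if no device was ever assigned */
    int  iommu_resources;  /* allocated by the init callback */
};

/* Stand-in for the init callback: always allocates something. */
static void iommu_init_sketch(struct dom_sketch *d)
{
    d->iommu_resources = 1;
}

/* Stand-in for the teardown callback: rolls back what init did. */
static void iommu_teardown_sketch(struct dom_sketch *d)
{
    d->iommu_resources = 0;
}

/* Old behaviour: teardown is skipped when need_iommu is unset. */
static void destroy_old_sketch(struct dom_sketch *d)
{
    assert(d != NULL);
    if ( d->need_iommu )
        iommu_teardown_sketch(d);
}

/* New behaviour: teardown is called unconditionally. */
static void destroy_new_sketch(struct dom_sketch *d)
{
    assert(d != NULL);
    iommu_teardown_sketch(d);
}
```

In this sketch a domain that never had a device assigned keeps its resources after destroy_old_sketch() but releases them after destroy_new_sketch().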
Andrew Cooper [Mon, 13 Feb 2017 11:49:30 +0000 (11:49 +0000)]
x86/hvm: Improve physdev_op hypercall dispatching
hvm_physdev_op() and hvm_physdev_op_compat32() are almost identical, but there
is no need to have two functions instantiated at the end of different function
pointers.
Combine the two into a single hvm_physdev_op() and dispatch to
{do,compat}_physdev_op() based on the hcall_64bit setting.
This also fixes an inconsistency where 64bit PVH hardware domains were
permitted access to extra physdev ops, but 32bit domains weren't.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
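The shape of the combined dispatcher can be sketched as follows. The two handler stubs and their return values are hypothetical placeholders for the native and compat implementations; only the idea of selecting between them on a 64bit flag comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stubs standing in for the native and compat handlers. */
static long do_op_sketch(int cmd)     { return 64000 + cmd; }
static long compat_op_sketch(int cmd) { return 32000 + cmd; }

/*
 * One entry point instead of two near-identical functions: any shared
 * checks would live here once, then the call is routed by hypercall mode.
 */
static long dispatch_op_sketch(bool hcall_64bit, int cmd)
{
    assert(cmd >= 0);
    return hcall_64bit ? do_op_sketch(cmd) : compat_op_sketch(cmd);
}
```

Because both modes go through the same function, a permission check placed before the dispatch applies equally to 64bit and 32bit callers, which is how a single entry point avoids the kind of inconsistency mentioned above.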
hvm_grant_table_op() and hvm_grant_table_op_compat32() are almost identical,
but there is no need to have two functions instantiated at the end of
different function pointers.
Combine the two into a single hvm_grant_table_op() (folding
grant_table_op_is_allowed() into its now-single caller) and dispatch to
{do,compat}_grant_table_op() based on the hcall_64bit setting.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 13 Feb 2017 11:49:24 +0000 (11:49 +0000)]
x86/hvm: Improve memory_op hypercall dispatching
hvm_memory_op() and hvm_memory_op_compat32() are almost identical, but there
is no need to have two functions instantiated at the end of different function
pointers.
Combine the two into a single hvm_memory_op() which dispatches to
{do,compat}_memory_op() based on the hcall_64bit setting.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 3 Feb 2017 16:21:22 +0000 (16:21 +0000)]
x86/hvm: Rework HVM_HCALL_invalidate handling
Sending an invalidation to the device model is an internal detail of
completing the hypercall; callers should not need to be responsible for it.
Drop HVM_HCALL_invalidate entirely and call send_invalidate_req() when
appropriate.
This makes the function boolean in nature, although the existing
HVM_HCALL_{completed,preempted} constants are kept to aid code clarity. While
updating the return type, drop _do from the name, as it is redundant.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Boris Ostrovsky [Mon, 13 Feb 2017 14:23:58 +0000 (15:23 +0100)]
x86: adjust which files need vpmu.h
asm-x86/vmcs.h doesn't need it while asm-x86/domain.h does.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Mon, 13 Feb 2017 14:23:34 +0000 (15:23 +0100)]
x86/PVHv2: fix dom0_max_vcpus so it's capped to HVM_MAX_VCPUS for PVHv2 Dom0
PVHv2 Dom0 is limited to 128 vCPUs, as are all HVM guests at the moment. Fix
dom0_max_vcpus so it takes this limitation into account.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
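The capping described here amounts to a clamp. A minimal sketch, assuming a hypothetical helper name and a boolean that says whether Dom0 is PVHv2; the value 128 is the limit stated in the commit message, and the macro name is made up for the example:

```c
#include <assert.h>
#include <stdbool.h>

/* Limit stated in the commit message; macro name is illustrative. */
#define HVM_MAX_VCPUS_SKETCH 128u

/* Clamp the requested vCPU count only for a PVHv2 Dom0. */
static unsigned int cap_dom0_vcpus_sketch(unsigned int requested, bool is_pvhv2)
{
    assert(requested > 0);
    if ( is_pvhv2 && requested > HVM_MAX_VCPUS_SKETCH )
        return HVM_MAX_VCPUS_SKETCH;
    return requested;
}
```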
Roger Pau Monné [Mon, 13 Feb 2017 14:22:01 +0000 (15:22 +0100)]
x86: split Dom0 build into PV and PVHv2
Split the Dom0 builder into two different functions, one for PV (and classic
PVH), and another one for PVHv2. Introduce a new command line parameter called
'dom0' that can be used to request the creation of a PVHv2 Dom0 by setting the
'hvm' sub-option. A panic has also been added for the case where a user tries
to use dom0=hvm; it will be removed once all the code is in place.
While there, mark the dom0_shadow option that was used by PV Dom0 as deprecated:
it was lacking documentation and was not functional. Point users towards
dom0=shadow instead.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 13 Feb 2017 14:21:24 +0000 (15:21 +0100)]
x86/time: tsc_check_writability() may need to be run a second time
While we shouldn't remove its current invocation, we need to re-run it
for the case that the X86_FEATURE_TSC_RELIABLE feature flag has been
cleared, in order to avoid using the TSC rendezvous function in case
the TSC can't be written.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Chao Gao [Mon, 13 Feb 2017 14:19:42 +0000 (15:19 +0100)]
x86/vmx: dump PIR and vIRR before ASSERT()
Commit c7bdecae42 ("x86/apicv: fix RTC periodic timer and apicv issue")
added an assertion that intack.vector is the highest priority vector. However,
according to osstest, the assertion sometimes fails. More discussion can
be found in the thread
(https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg01019.html).
The assertion failure is hard to reproduce. To help root-cause the issue, this
patch adds logging that dumps PIR and vIRR when the failure takes place. It
should be reverted once the root cause is found.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 9 Feb 2017 18:22:50 +0000 (18:22 +0000)]
x86/bitops: Force __scanbit() to be always inline
It turns out that GCCs 4.9.2 and 6.3.0 instantiate __scanbit() in three
translation units, but never reference the result. All real uses of
__scanbit() are already suitably inline.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
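Forcing inlining can be sketched with the GCC/Clang always_inline attribute, which prevents the compiler from emitting an unused out-of-line copy. The helper below is an illustrative stand-in, not Xen's __scanbit(): the name, the macro, and the use of __builtin_ctzl in place of hand-written assembly are all assumptions made for the example.

```c
#include <assert.h>

/* GCC/Clang attribute; macro name is illustrative. */
#define always_inline_sketch inline __attribute__((__always_inline__))

/*
 * Return the index of the lowest set bit in val, or max if val is zero.
 * __builtin_ctzl() is undefined for 0, so that case is handled explicitly.
 */
static always_inline_sketch unsigned int
scanbit_sketch(unsigned long val, unsigned int max)
{
    return val ? (unsigned int)__builtin_ctzl(val) : max;
}

/* Small caller showing typical use: find the first set bit of a word. */
static unsigned int first_set_bit_sketch(unsigned long word)
{
    assert(word != 0);
    return scanbit_sketch(word, 8 * (unsigned int)sizeof(word));
}
```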
Juergen Gross [Wed, 8 Feb 2017 16:09:30 +0000 (17:09 +0100)]
libxl: carve out console specific functions from libxl.c
libxl.c has grown to an uncomfortable size. Carve out the console
related functions (including channels, keyboard and frame buffer)
to libxl_console.c.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>