Ian Jackson [Mon, 25 Jun 2018 14:48:32 +0000 (15:48 +0100)]
process docs: Add some detail about changes during branching
Split out the required work for the new and old branches and be more
specific about what is to be done. In the RT checklist, reformat and
expand the "turn off debug" instructions.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Mon, 25 Jun 2018 14:46:25 +0000 (15:46 +0100)]
process docs: Drop some obsolete stuff
* Drop reference to long-gone Citrix-internal HG trees
* Drop reference to RT-accessible web pages; web page editing
is now handled via the RM, community manager, etc.
* Drop reference to git description files; this is not needed
because now we have one tree with all branches, not one per branch
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 16 Mar 2018 14:04:53 +0000 (14:04 +0000)]
tools/ocaml: Drop int_array_of_uuid_string()
This function is entirely internal to xenctrl stubs, and serves only to
convert the uuid string to an integer array (making 16 memory allocations as
it goes), while the C stubs turn the integer array back into a binary array.
Instead, pass the string all the way down into C, and have sscanf() unpack it
directly into a xen_domain_handle_t object.
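A minimal sketch of the approach described, assuming the standard 16-byte handle type (illustrative, not the exact stub code):

    #include <stdio.h>
    #include <stdint.h>

    typedef uint8_t xen_domain_handle_t[16];

    /* Parse "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" straight into the
     * 16-byte handle, with no intermediate integer array. */
    static int uuid_to_handle(const char *uuid, xen_domain_handle_t h)
    {
        int rc = sscanf(uuid,
            "%2hhx%2hhx%2hhx%2hhx-%2hhx%2hhx-%2hhx%2hhx-"
            "%2hhx%2hhx-%2hhx%2hhx%2hhx%2hhx%2hhx%2hhx",
            &h[0], &h[1], &h[2], &h[3], &h[4], &h[5], &h[6], &h[7],
            &h[8], &h[9], &h[10], &h[11], &h[12], &h[13], &h[14], &h[15]);

        return rc == 16 ? 0 : -1; /* all 16 bytes must parse */
    }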
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Andrew Cooper [Wed, 11 Apr 2018 13:34:02 +0000 (13:34 +0000)]
x86/cpuid: Alter the policy logic for leaf 0xb to be multi-invocation
The new data lives in the .topo union, rather than being treated as a single
leaf in the basic union.
While adjusting cpuid_policy, pad .basic to CPUID_GUEST_NR_BASIC for the
benefit of people extending the number of leaves in the future.
Host data is scanned when filling in the raw policy, but Xen still discards
any toolstack settings for now.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Wed, 27 Jun 2018 12:26:36 +0000 (13:26 +0100)]
x86: Address "Bitwise-and with zero CONSTANT_EXPRESSION_RESULT" Coverity issues
Coverity complains at code which performs a bitwise-and with a constant
that happens to be zero. Rearrange the C to test the constant first and short
circuit the bitwise and.
No functional change.
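An illustrative sketch of the rearrangement (MASK and val are placeholder names, not the code actually touched):

    /* Before: Coverity warns when MASK is a constant expression that
     * evaluates to zero. */
    if ( val & MASK )
        do_something();

    /* After: test the constant first; when MASK is zero the bitwise-and
     * is short-circuited away. */
    if ( MASK && (val & MASK) )
        do_something();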
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Tue, 23 May 2017 16:32:30 +0000 (17:32 +0100)]
x86/vmx: Don't leak EFER.NXE into guest context
Intel hardware only uses 4 bits in MSR_EFER. Changes to LME and LMA are
handled automatically via the VMENTRY_CTLS.IA32E_MODE bit.
SCE is handled by ad-hoc logic in context_switch(), vmx_restore_guest_msrs()
and vmx_update_guest_efer(), and works by altering the host SCE value to match
the setting the guest wants. This works because, in HVM vcpu context, Xen
never needs to execute a SYSCALL or SYSRET instruction.
However, NXE has never been context switched. Unlike SCE, NXE cannot be
context switched at vcpu boundaries because disabling NXE makes PTE.NX bits
reserved, causing a pagefault when encountered. This means that the guest
always has Xen's setting in effect, irrespective of the bit it can see and
modify in its virtualised view of MSR_EFER.
This isn't a major problem for production operating systems because they, like
Xen, always turn NXE on when it is available. However, it does have an
observable effect on which guest PTE bits are valid, and whether
PFEC_insn_fetch is visible in a #PF error code.
Second generation VT-x hardware has host and guest EFER fields in the VMCS,
and support for loading and saving them automatically. First generation VT-x
hardware needs to use MSR load/save lists to cause an atomic switch of
MSR_EFER on vmentry/exit.
Therefore we update vmx_init_vmcs_config() to find and use guest/host EFER
support when available (and MSR load/save lists on older hardware) and drop
all ad-hoc alteration of SCE.
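A rough sketch of that selection, with hedged names (the predicate and list helpers are assumptions based on the series' conventions, not verified code):

    /* Sketch only: prefer the dedicated VMCS EFER fields when available. */
    if ( cpu_has_vmx_efer )                      /* assumed predicate */
    {
        __vmwrite(GUEST_EFER, guest_efer);
        __vmwrite(HOST_EFER, read_efer());
    }
    else
    {
        /* 1st-gen VT-x: atomic switch via the MSR load/save lists. */
        rc = vmx_add_guest_msr(v, MSR_EFER, guest_efer);
        if ( !rc )
            rc = vmx_add_host_load_msr(v, MSR_EFER, read_efer());
    }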
There are two minor complications when selecting the EFER setting:
* For shadow guests, NXE is a paging setting and must remain under host
control, but this is fine as Xen also handles the pagefaults.
* When the Unrestricted Guest control is clear, hardware doesn't tolerate LME
and LMA being different. This doesn't matter in practice as we intercept
all writes to CR0 and reads from MSR_EFER, so can provide architecturally
consistent behaviour from the guest's point of view.
With changing how EFER is loaded, vmcs_dump_vcpu() needs adjusting. Read EFER
from the appropriate information source, and identify when dumping the guest
EFER value which source was used.
As a result of fixing EFER context switching, we can remove the Intel-special
case from hvm_nx_enabled() and let guest_walk_tables() work with the real
guest paging settings.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Support load-only guest MSR list entries
Currently, the VMX_MSR_GUEST type maintains completely symmetric guest load
and save lists, by pointing VM_EXIT_MSR_STORE_ADDR and VM_ENTRY_MSR_LOAD_ADDR
at the same page, and setting VM_EXIT_MSR_STORE_COUNT and
VM_ENTRY_MSR_LOAD_COUNT to the same value.
However, for MSRs which we won't let the guest have direct access to, having
hardware save the current value on VMExit is unnecessary overhead.
To avoid this overhead, we must make the load and save lists asymmetric. By
making the entry load count greater than the exit store count, we can maintain
two adjacent lists of MSRs, the first of which is saved and restored, and the
second of which is only restored on VMEntry.
For simplicity:
* Both adjacent lists are still sorted by MSR index.
* It is undefined behaviour to insert the same MSR into both lists.
* The total size of both lists is still limited at 256 entries (one 4k page).
Split the current msr_count field into msr_{load,save}_count, and introduce a
new VMX_MSR_GUEST_LOADONLY type, and update vmx_{add,find}_msr() to calculate
which sublist to search, based on type. VMX_MSR_HOST has no logical sublist,
whereas VMX_MSR_GUEST has a sublist between 0 and the save count, while
VMX_MSR_GUEST_LOADONLY has a sublist between the save count and the load
count.
One subtle point is that inserting an MSR into the load-save list involves
moving the entire load-only list, and updating both counts.
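A minimal sketch of the sublist selection described above (field names follow the commit text; the surrounding code is assumed):

    /* Inside vmx_{add,find}_msr(): pick the sublist bounds by type. */
    switch ( type )
    {
    case VMX_MSR_GUEST:            /* saved on VMExit, loaded on VMEntry */
        start = 0;
        total = vmx->msr_save_count;
        break;

    case VMX_MSR_GUEST_LOADONLY:   /* loaded on VMEntry only */
        start = vmx->msr_save_count;
        total = vmx->msr_load_count - vmx->msr_save_count;
        break;
    }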
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Pass an MSR value into vmx_msr_add()
The main purpose of this change is to allow us to set a specific MSR value,
without needing to know whether there is already a load/save list slot for it.
Previously, callers wanting this property needed to call both vmx_add_*_msr()
and vmx_write_*_msr() to cover both cases, and there are no callers which want
the old behaviour of being a no-op if an entry already existed for the MSR.
As a result of this API improvement, the default value for guest MSRs need not
be 0, and the default for host MSRs need not be passed via a hardware register.
In practice, this cleans up the VPMU allocation logic, and avoids an MSR read
as part of vcpu construction.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Improvements to LBR MSR handling
The main purpose of this patch is to only ever insert the LBR MSRs into the
guest load/save list once, as a future patch wants to change the behaviour of
vmx_add_guest_msr().
The repeated processing of lbr_info and the guest's MSR load/save list is
redundant, and a guest using LBR itself will have to re-enable
MSR_DEBUGCTL.LBR in its #DB handler, meaning that Xen will repeat this
redundant processing every time the guest gets a debug exception.
Rename lbr_fixup_enabled to lbr_flags to be a little more generic, and use one
bit to indicate that the MSRs have been inserted into the load/save list.
Shorten the existing FIXUP* identifiers to reduce code volume.
Furthermore, handing the guest a #MC on an error isn't a legitimate action. Two
of the three failure cases are definitely hypervisor bugs, and the third is a
boundary case which shouldn't occur in practice. The guest also won't execute
correctly, so handle errors by cleanly crashing the guest.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Support remote access to the MSR lists
At the moment, all modifications of the MSR lists are in current context.
However, future changes may need to put MSR_EFER into the lists from domctl
hypercall context.
Plumb a struct vcpu parameter down through the infrastructure, and use
vmx_vmcs_{enter,exit}() for safe access to the VMCS in vmx_add_msr(). Use
assertions to ensure that access is either in current context, or while the
vcpu is paused.
Note these expectations beside the fields in arch_vmx_struct, and reorder the
fields to avoid unnecessary padding.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Factor locate_msr_entry() out of vmx_find_msr() and vmx_add_msr()
Instead of having multiple algorithms searching the MSR lists, implement a
single one. It has the semantics required by vmx_add_msr(), to identify the
position in which an MSR should live, if it isn't already present.
There will be a marginal improvement for vmx_find_msr() by avoiding the
function pointer calls to vmx_msr_entry_key_cmp(), and a major improvement for
vmx_add_msr() by using a binary search instead of a linear search.
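A sketch of the single algorithm, essentially a textbook lower-bound search (names follow the commit; the exact Xen code may differ slightly):

    struct vmx_msr_entry { uint32_t index, mbz; uint64_t data; };

    /* Return the slot containing `msr`, or the slot where it should be
     * inserted to keep [start, end) sorted by MSR index. */
    static struct vmx_msr_entry *locate_msr_entry(
        struct vmx_msr_entry *start, struct vmx_msr_entry *end, uint32_t msr)
    {
        while ( start < end )
        {
            struct vmx_msr_entry *mid = start + (end - start) / 2;

            if ( msr < mid->index )
                end = mid;
            else if ( msr > mid->index )
                start = mid + 1;
            else
                return mid;
        }

        return start;
    }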
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Internal cleanup for MSR load/save infrastructure
* Use an arch_vmx_struct local variable to reduce later code volume.
* Use start/total instead of msr_area/msr_count. This is in preparation for
more fine-grained handling with later changes.
* Use ent/end pointers (again for preparation), and to make the vmx_add_msr()
logic easier to follow.
* Make the memory allocation block of vmx_add_msr() unlikely, and calculate
virt_to_maddr() just once.
No practical change to functionality.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: API improvements for MSR load/save infrastructure
Collect together related infrastructure in vmcs.h, rather than having it
spread out. Turn vmx_{read,write}_guest_msr() into static inlines, as they
are simple enough.
Replace 'int type' with 'enum vmx_msr_list_type', and use switch statements
internally. Later changes are going to introduce a new type.
Rename the type identifiers for consistency with the other VMX_MSR_*
constants.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Wed, 20 Jun 2018 07:43:57 +0000 (15:43 +0800)]
x86/suspend: Fix restoration of guest state across S3/S4
The call to freeze_domains() in enter_state() guarantees that we are
running in idle context for the duration of S3/S4.
In restore_rest_processor_state(), the stts() is problematic as it
unilaterally sets %cr0.ts even in fully_eager FPU context. It also fails to
account for the non-lazy xsave state. Luckily, these are both latent bugs, as
the FPU state is corrected by the subsequent context switch away from the idle
vcpu.
Another aspect is that the !is_idle_vcpu(curr) paths in
restore_rest_processor_state() are actually dead code, and removing
these highlights that the segment saving logic is also unused.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 24 May 2018 17:41:53 +0000 (17:41 +0000)]
x86: Improvements to ler debugging
* Command line documentation for what the option does.
* Implement a canonicalise_addr() helper and replace the open-coded use in
sign_extend_msr() (see the sketch below).
* Canonicalise the ler pointers and print symbol information.
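A minimal sketch of such a helper, assuming 48 implemented virtual-address bits:

    #include <stdint.h>

    /* Sign-extend from bit 47 so the address is canonical. */
    static inline uint64_t canonicalise_addr(uint64_t addr)
    {
        return (uint64_t)((int64_t)(addr << 16) >> 16);
    }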
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 24 May 2018 17:20:09 +0000 (17:20 +0000)]
x86/vmx: Fix handling of MSR_DEBUGCTL on VMExit
Currently, whenever the guest writes a nonzero value to MSR_DEBUGCTL, Xen
updates a host MSR load list entry with the current hardware value of
MSR_DEBUGCTL.
On VMExit, hardware automatically resets MSR_DEBUGCTL to 0. Later, when the
guest writes to MSR_DEBUGCTL, the current value in hardware (0) is fed back
into the guest load list. As a practical result, `ler` debugging gets lost on any
PCPU which has ever scheduled an HVM vcpu, and in the common case when `ler`
debugging isn't active, guest actions result in an unnecessary load list entry
repeating the MSR_DEBUGCTL reset.
Restoration of Xen's debugging setting needs to happen from the very first
vmexit. Due to the automatic reset, Xen need take no action in the general
case, and only needs to load a value when debugging is active.
This could be fixed by using a host MSR load list entry set up during
construct_vmcs(). However, a more efficient option is to use an alternative
block in the VMExit path, keyed on whether hypervisor debugging has been
enabled.
In order to set this up, drop the per cpu ler_msr variable (as there is no
point having it per cpu when it will be the same everywhere), and use a single
read_mostly variable instead. Split calc_ler_msr() out of percpu_traps_init()
for clarity.
Finally, clean up do_debug(). Reinstate LBR early to help catch cascade
errors, which allows for the removal of the out label.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Wed, 27 Jun 2018 11:34:47 +0000 (12:34 +0100)]
x86/msr: Use the architectural layout for MSR_{MISC_ENABLES,PLATFORM_INFO}
This simplifies future interactions with the toolstack, by removing the need
for per-MSR custom accessors when shuffling data in/out of a policy.
Use a 32bit raw backing integer (for simplicity), and use a bitfield to move
the cpuid_faulting field to its appropriate position.
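A hedged sketch of the layout approach (the struct shape is illustrative; the bit positions match the architectural definitions, with faulting support reported in PLATFORM_INFO bit 31 and enabled via MISC_FEATURES_ENABLES bit 0):

    #include <stdbool.h>
    #include <stdint.h>

    /* 32-bit raw backing integer with cpuid_faulting at its
     * architectural position. */
    union msr_platform_info {
        uint32_t raw;
        struct {
            uint32_t :31;
            bool cpuid_faulting:1;   /* bit 31: faulting supported */
        };
    };

    union msr_misc_features {
        uint32_t raw;
        struct {
            bool cpuid_faulting:1;   /* bit 0: faulting enabled */
            uint32_t :31;
        };
    };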
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 27 Jun 2018 11:34:47 +0000 (11:34 +0000)]
x86/msr: Drop {MISC_ENABLES,PLATFORM_INFO}.available
These MSRs are non-architectural and the available booleans were used in lieu
of an architectural signal of availability.
However, in hindsight, the additional booleans make toolstack MSR interactions
more complicated. The MSRs are unconditionally available to HVM guests, but
for PV guests are currently hidden when CPUID faulting is unavailable.
Instead, switch them to being unconditionally readable, even for PV guests.
The new behaviour is:
* PLATFORM_INFO is unconditionally readable even for PV guests and will
indicate the presence or absence of CPUID Faulting in bit 31.
* MISC_FEATURES_ENABLES is unconditionally readable, and bit 0 may be set
iff PLATFORM_INFO reports that CPUID Faulting is available.
As a minor bugfix, CPUID Faulting for HVM guests is no longer restricted to
Intel/AMD hardware. In particular, VIA have a VT-x implementation conforming
to the Intel specification.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Fri, 29 Jun 2018 16:28:13 +0000 (16:28 +0000)]
xen: Plumb an is_priv boolean into domain_create()
The current mechanism of setting dom0->is_privileged after construction means
that the is_control_domain() predicate returns false during construction.
In particular, this means that the CPUID Faulting special case in
init_domain_msr_policy() fails to take effect. (In actual fact, faulting
support is advertised to dom0, but attempting to configure it is silently
ignored because of the dom0 special case in ctxt_switch_levelling().)
This could be implemented using a flag in xen_domctl_createdomain, but using
an extra boolean parameter like this means that we can't accidentally allow
domain_create() to create a second dom0 due to parameter mis-auditing.
While adjusting the setting of dom0->is_privileged, drop the redundant zeroing
of dom0->target.
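A sketch of the changed interface as described (the argument order is an assumption):

    struct domain *domain_create(domid_t domid,
                                 struct xen_domctl_createdomain *config,
                                 bool is_priv);

    /* Only the dom0 construction path passes true: */
    dom0 = domain_create(0, &dom0_cfg, true);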
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Mon, 2 Jul 2018 11:11:33 +0000 (13:11 +0200)]
x86: move per-vendor early CPU init declarations
They're local to cpu/, so they belong into cpu/cpu.h (and some of them
have been out of use for quite some time). Drop the asm/setup.h
inclusions then as well.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 2 Jul 2018 11:10:19 +0000 (13:10 +0200)]
x86/HPET: drop useless check
Commit 9e051a840d ("x86/hpet: Improve handling of timer_deadline")
removed all code between for_each_cpu() and cpumask_test_cpu(),
rendering the latter pointless.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
svm: don't clear interception for MSRs required for introspection
This patch mirrors the VMX code that doesn't allow
vmx_clear_msr_intercept() to clear interception of MSRs that an
introspection agent is trying to monitor.
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
The current update process of already-bound MSI interrupts is wrong
because unmap_domain_pirq calls pci_disable_msi, which disables MSI
interrupts on the device, while map_domain_pirq doesn't re-enable
MSI. As a result, the MSI control bit of already-enabled MSI
entries will be disabled by unmap_domain_pirq and not re-enabled by
map_domain_pirq.
In order to fix this avoid unmapping the PIRQs and just update the
binding of the PIRQ. A new arch helper to do that is introduced.
Note that MSI-X is not affected because unmap_domain_pirq only
disables the MSI enable control bit for the MSI case, for MSI-X the
bit is left untouched by unmap_domain_pirq.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Mon, 2 Jul 2018 11:06:49 +0000 (13:06 +0200)]
VT-d: reconcile iommu_inclusive_mapping and iommu=dom0-strict
The documentation for the iommu_inclusive_mapping Xen command line option
states:
"Use this to work around firmware issues providing incorrect RMRR entries"
Unfortunately this workaround does not function correctly if the dom0-strict
iommu option is also specified.
The documentation goes on to say:
"Rather than only mapping RAM pages for IOMMU accesses for Dom0, with this
option all pages up to 4GB, not marked as unusable in the E820 table, will
get a mapping established."
This patch modifies the VT-d hardware domain initialization code such that
the workaround will continue to function in dom0-strict mode, by mapping
all pages not marked as unusable *unless* they are RAM pages not assigned
to dom0.
NOTE: This patch modifies the test in drivers/passthrough/vtd/iommu.c from
need_iommu() to is_pv_domain() since dom0-strict implies need_iommu()
so we no longer want to gate invocation of vtd_set_hwdom_mapping()
on that.
It also exports the iommu_dom0_strict flag so that the implementation
of vtd_set_hwdom_mapping() can test it explicitly. It would be
possible to test need_iommu() instead, but it is more illustrative
to test the original flag rather than one of its side-effects.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Roger Pau Monne <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Paul Durrant [Mon, 2 Jul 2018 11:05:36 +0000 (13:05 +0200)]
VT-d: re-phrase logic in vtd_set_hwdom_mapping() for clarity
It is hard to reconcile the comment at the top of the loop in
vtd_set_hwdom_mapping() with the if statement following it. This patch
re-phrases the logic, preserving the semantics, but making it easier
to read.
The patch also modifies the Xen command line documentation to make it
clear that iommu_inclusive_mapping only applies to pages up to the 4GB
boundary.
NOTE: This patch also corrects the indentation of the printk() towards
the end of vtd_set_hwdom_mapping().
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Roger Pau Monne <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Thu, 28 Jun 2018 10:49:32 +0000 (12:49 +0200)]
gnttab: silence table expansion message
This currently shows up for basically every domain, when originally it
was logged only when going beyond the default table size. Restore that
behavior.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Thu, 28 Jun 2018 10:48:47 +0000 (12:48 +0200)]
x86/XPTI: use %r12 to write zero into xen_cr3
Now that we zero all registers early on all entry paths, use that to
avoid a couple of immediates here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Thu, 28 Jun 2018 10:12:07 +0000 (12:12 +0200)]
libxc: remove xch parameter from xc_cpuid_policy
It's not used by the function or any of the helpers called by it.
Reported-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Wed, 23 May 2018 16:53:17 +0000 (16:53 +0000)]
x86/vmx: Drop VMX signal for full real-mode
The hvmloader code which used this signal was deleted 10 years ago (c/s 50b12df83 "x86 vmx: Remove vmxassist"). Furthermore, the value gets discarded
anyway because the HVM domain builder unconditionally sets %rax to 0 in the
same action it uses to set %rip to the appropriate entrypoint.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Mon, 28 May 2018 14:02:34 +0000 (15:02 +0100)]
x86/vmx: Defer vmx_vmcs_exit() as long as possible in construct_vmcs()
paging_update_paging_modes() and vmx_vlapic_msr_changed() both operate on the
VMCS being constructed. Avoid dropping and re-acquiring the reference
multiple times.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 24 May 2018 13:15:32 +0000 (14:15 +0100)]
x86/vmx: Simplify PAT handling during vcpu construction
The host PAT value is a compile time constant, and doesn't need to be read out
of hardware. Merge this if block into the previous block, which has an
identical condition.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Thu, 24 May 2018 13:09:49 +0000 (14:09 +0100)]
x86/pat: Simplify host PAT handling
With the removal of the 32bit hypervisor build, host_pat is a constant value.
Drop the variable and the redundant cpu_has_pat predicate, and use a define
instead.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Tue, 8 May 2018 09:33:00 +0000 (11:33 +0200)]
pci: treat class 0 devices as endpoints
Class 0 devices are legacy pre PCI 2.0 devices that didn't have a
class code. Treat them as endpoints, so that they can be handled by
the IOMMU and properly passed-through to the hardware domain.
Such a device has been seen on a Super Micro server; lspci -vv reports:
00:13.0 Non-VGA unclassified device: Intel Corporation Device a135 (rev 31)
Subsystem: Super Micro Computer Inc Device 0931
Flags: bus master, fast devsel, latency 0, IRQ 11
Memory at df222000 (64-bit, non-prefetchable) [size=4K]
Capabilities: [80] Power Management version 3
Arguably this is not a legacy device (since this is a new server), but
in any case Xen needs to deal with it.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Ross Lagerwall [Mon, 14 May 2018 11:03:00 +0000 (13:03 +0200)]
x86/shutdown: use ACPI reboot method for Dell PowerEdge R540
When EFI booting the Dell PowerEdge R540 it consistently wanders into
the weeds and gets an invalid opcode in the EFI ResetSystem call. This
is the same bug which affects the PowerEdge R740 so fix it in the same
way: quirk this hardware to use the ACPI reboot method instead.
BIOS Information
Vendor: Dell Inc.
Version: 1.3.7
Release Date: 02/09/2018
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R540
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Daniel Kiper [Fri, 20 Apr 2018 09:54:00 +0000 (11:54 +0200)]
x86/setup: properly update PTEs if src/dst overlaps when relocating Xen image
Commit 0d31d16 (x86/setup: do not relocate Xen over current Xen image
placement) disallowed src/dst images overlaps when relocating Xen image.
Though it deliberately allowed the destination region between __image_base__
and (__image_base__ + XEN_IMG_OFFSET) to overlap with the end of the source
image. And here is the problem. If anything between __page_tables_start
and __page_tables_end in source image lands in the overlap then some or
even all page table entries may not be updated. This usually means boom
in early boot which will be difficult to investigate. So, I think
that we have three choices to fix the issue:
- drop XEN_IMG_OFFSET from
if ( (end > s) && (end - reloc_size + XEN_IMG_OFFSET >= __pa(_end)) )
- add XEN_IMG_OFFSET to xen_phys_start in PFN_DOWN(xen_phys_start)
used in loops as one of the conditions and replace ">" with ">=",
- change PFN_DOWN(xen_phys_start) to PFN_DOWN(xen_remap_end_pfn)
proposed in earlier version of this patch.
This patch implements the second option. This way we still allow source
and destination partial overlap as described above but PTEs are properly
updated now.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Olaf Hering [Tue, 12 Jun 2018 14:11:00 +0000 (16:11 +0200)]
unmodified_drivers: unplug the emulated devices at resume time
Since qemu-2.10 it is required to unplug emulated devices again after
a live migration. If this is not done, qemu's block-backend driver
will be unable to open the backing disk image because it is still held
open by qemu's IDE driver. As a result the domU's block-frontend driver will
be unable to access the disks, and the domU has to be destroyed.
libxl is unable to detect the situation.
Apply the same workaround for this qemu bug that was done already
years ago in linux.git with commit 512b109ec962 ("xen: unplug the
emulated devices at resume time") to make sure xenlinux based domUs
can be migrated to unfixed hosts.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Wed, 27 Jun 2018 14:33:00 +0000 (16:33 +0200)]
x86/cpuid: fix generation of auto cpuid header
The makefile rule to generate the cpuid-autogen.h header passes the
whole list of dependencies to gen-cpuid.py but only the first
dependency is actually needed.
So far this seems to be harmless.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Thu, 28 Jun 2018 07:08:04 +0000 (09:08 +0200)]
x86: guard against #NM
Just in case we still don't get CR0.TS handling right, prevent a host
crash by honoring exception fixups in do_device_not_available(). This
would in particular cover emulator stubs raising #NM.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 28 Jun 2018 07:07:06 +0000 (09:07 +0200)]
x86/HVM: don't cause #NM to be raised in Xen
The changes for XSA-267 did not touch management of CR0.TS for HVM
guests. In fully eager mode this bit should never be set when
respective vCPU-s are active, or else hvmemul_get_fpu() might leave it
wrongly set, leading to #NM in hypervisor context.
{svm,vmx}_enter() and {svm,vmx}_fpu_dirty_intercept() become unreachable
this way. Explicit {svm,vmx}_fpu_leave() invocations need to be guarded
now.
With no CR0.TS management necessary in fully eager mode, there's also no
need anymore to intercept #NM.
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Ian Jackson [Wed, 13 Jun 2018 14:54:53 +0000 (15:54 +0100)]
libxl: restore passing "readonly=" to qemu for SCSI disks
A read-only check was introduced for XSA-142, commit ef6cb76026 ("libxl:
relax readonly check introduced by XSA-142 fix") added the passing of
the extra setting, but commit dab0539568 ("Introduce COLO mode and
refactor relevant function") dropped the passing of the setting again,
quite likely due to improper re-basing.
Restore the readonly= parameter to SCSI disks. For IDE disks this is
supposed to be rejected; add an assert. And there is a bare ad-hoc
disk drive string in libxl__build_device_model_args_new, which we also
update.
This is XSA-266.
Reported-by: Andrew Reimers <andrew.reimers@orionvm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Wed, 13 Jun 2018 14:51:36 +0000 (15:51 +0100)]
libxl: qemu_disk_scsi_drive_string: Break out common parts of disk config
The generated configurations are identical apart from, in some cases,
reordering of the id=%s element. So, overall, no functional change.
This is part of XSA-266.
Reported-by: Andrew Reimers <andrew.reimers@orionvm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Andrew Cooper [Thu, 28 Jun 2018 07:04:20 +0000 (09:04 +0200)]
x86: Refine checks in #DB handler for faulting conditions
One of the fixes for XSA-260 (c/s 75d6828bc2 "x86/traps: Fix handling of #DB
exceptions in hypervisor context") added some safety checks to help avoid
livelocks of #DB faults.
While a General Detect #DB exception does have fault semantics, hardware
clears %dr7.gd on entry to the handler, meaning that it is actually safe to
return to. Furthermore, %dr6.gd is guest controlled and sticky (never cleared
by hardware). A malicious PV guest can therefore trigger the fatal_trap() and
crash Xen.
Instruction breakpoints are more tricky. The breakpoint match bits in %dr6
are not sticky, but the Intel manual warns that they may be set for
non-enabled breakpoints, so add a breakpoint enabled check.
Beyond that, because of the restriction on the linear addresses PV guests can
set, and the fault (rather than trap) nature of instruction breakpoints
(i.e. can't be deferred by a MovSS shadow), there should be no way to
encounter an instruction breakpoint in Xen context. However, for extra
robustness, deal with this situation by clearing the breakpoint configuration,
rather than crashing.
This is XSA-265.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 26 Jun 2018 13:23:08 +0000 (15:23 +0200)]
x86/EFI: further correct FPU state handling around runtime calls
We must not leave a vCPU with CR0.TS clear when it is not in fully eager
mode and has not touched non-lazy state. Instead of adding a 3rd
invocation of stts() to vcpu_restore_fpu_eager(), consolidate all of
them into a single one done at the end of the function.
Rename the function at the same time to better reflect its purpose, as
the patch touches all of its occurrences anyway.
The new function parameter is not really well named, but
"need_stts_if_not_fully_eager" seemed excessive to me.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Roger Pau Monné [Tue, 26 Jun 2018 06:48:14 +0000 (08:48 +0200)]
x86/dom0: add extra RAM regions as UNUSABLE for PVH memory map
When running as PVH Dom0 the native memory map is used in order to
craft a tailored memory map for Dom0 taking into account its memory
limit.
Dom0 memory is always going to be smaller than the total amount
of memory present on the host, so in order to prevent Dom0 from
relocating PCI BARs over RAM regions mark all the RAM regions not
available to Dom0 as UNUSABLE in the memory map.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 26 Jun 2018 06:47:17 +0000 (08:47 +0200)]
x86/HVM: alter completion-needed checking
The function only looks at the ioreq_t, so pass it a pointer to just
that. Also use it in hvmemul_do_io().
Suggested-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Jan Beulich [Tue, 26 Jun 2018 06:41:08 +0000 (08:41 +0200)]
x86/HVM: attempts to emulate FPU insns need to set fpu_initialised
My original way of thinking here was that this would be set anyway at
the point state gets reloaded after the adjustments hvmemul_put_fpu()
does, but the flag should already be set before that - after all the
guest may never again touch the FPU before e.g. getting migrated/saved.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Paul Durrant <paul.durrant@citrix.com>
Julien Grall [Tue, 12 Jun 2018 11:36:40 +0000 (12:36 +0100)]
xen/arm64: Implement a fast path for handling SMCCC_ARCH_WORKAROUND_2
The function ARM_SMCCC_ARCH_WORKAROUND_2 will be called by the guest for
enabling/disabling the ssbd mitigation. So we want the handling to
be as fast as possible.
The new sequence will forward guest's ARCH_WORKAROUND_2 call to EL3 and
also track the state of the workaround per-vCPU.
Note that since we need to execute branches, this always executes after
the spectre-v2 mitigation.
This code is based on KVM counterpart "arm64: KVM: Handle guest's
ARCH_WORKAROUND_2 requests" written by Marc Zyngier.
This is based on the Linux commit dea5e2a4c5bc "arm64: alternatives: Add
dynamic patching feature" written by Marc Zyngier:
We've so far relied on a patching infrastructure that only gave us
a single alternative, without any way to provide a range of potential
replacement instructions. For a single feature, this is an all or
nothing thing.
It would be interesting to have a more flexible, fine-grained way of patching the
kernel though, where we could dynamically tune the code that gets injected.
In order to achieve this, let's introduce a new form of dynamic patching,
associating a callback with a patching site. This callback gets source and
target locations of the patching request, as well as the number of
instructions to be patched.
Dynamic patching is declared with the new ALTERNATIVE_CB and alternative_cb
directives:
    asm volatile(ALTERNATIVE_CB("mov %0, #0\n", callback)
                 : "r" (v));
or
    alternative_cb callback
        mov x0, #0
    alternative_cb_end
where callback is the C function computing the alternative.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
This is part of XSA-263.
Julien Grall [Tue, 12 Jun 2018 11:36:37 +0000 (12:36 +0100)]
xen/arm: Simplify alternative patching of non-writable region
During the MMU setup process, Xen will set the SCTLR_EL2.WXN
(Write permission implies eXecute-Never) bit. Because of that, the alternative
code needs to remap the region at a different place in order to modify the
text section.
At the moment, the function patching the code is only aware of the
re-mapped region. This requires the caller to mess with Xen internals in
order to have functions such as is_active_kernel_text() work.
All the interactions with Xen internals can be removed by specifying the
offset between the patched region and the writable region used for updating
the instructions.
This simplification will also make it easier to integrate dynamic patching
in a follow-up patch. Indeed, the callback address should be in the
original region and not in the re-mapped region, which is writable but
non-executable.
Julien Grall [Tue, 12 Jun 2018 11:36:36 +0000 (12:36 +0100)]
xen/arm: Add ARCH_WORKAROUND_2 support for guests
In order to offer ARCH_WORKAROUND_2 support to guests, we need to track the
state of the workaround per-vCPU. The field 'pad' in cpu_info is now
repurposed to store flags easily accessible in assembly.
As the hypervisor will always run with the workaround enabled, we may
need to enable (on guest exit) or disable (on guest entry) the
workaround.
A follow-up patch will add fastpath for the workaround for arm64 guests.
Note that check_workaround_ssbd() is used instead of ssbd_get_state()
because the former is implemented using an alternative. Therefore the code
will be shortcut on affected platform.
Julien Grall [Tue, 12 Jun 2018 11:36:35 +0000 (12:36 +0100)]
xen/arm: Add command line option to control SSBD mitigation
On a system where the firmware implements ARCH_WORKAROUND_2, it may be
useful to either permanently enable or disable the workaround for cases
where the user decides that they'd rather not get a trap overhead, and
keep the mitigation permanently on or off instead of switching it on
exception entry/exit. In any case, default to mitigation being enabled.
The new command line option is implemented as a list of one option, to
follow the x86 option and also to allow extending it more easily in the future.
Note that for convenience, the full implementation of the workaround is
done in the .matches callback.
Lastly, an accessor is provided to query the state of the mitigation.
After this patch, there are 3 methods complementing each other to find the
state of the mitigation:
- The capability ARM_SSBD indicates the platform is affected by the
vulnerability. This will also return false if the user decides to
force-disable the mitigation (spec-ctrl="ssbd=force-disable"). The
capability is useful for putting shortcuts in place using alternatives.
- ssbd_state indicates the global state of the mitigation (e.g
unknown, force enable...). The global state is required to report
the state to a guest.
- The per-cpu ssbd_callback_required indicates whether a pCPU
needs to call the SMC. This allows the SMC call to be shortcut,
saving an entry/exit to EL3 (see the sketch below).
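A hedged sketch of that shortcut (variable names follow the commit text; the SMC helper and function ID macro are assumptions):

    /* Skip the EL3 round-trip entirely on unaffected pCPUs. */
    if ( !this_cpu(ssbd_callback_required) )
        return;

    /* Otherwise forward the request to firmware. */
    arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, enable, NULL);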
Julien Grall [Tue, 12 Jun 2018 11:36:31 +0000 (12:36 +0100)]
xen/arm: domain: Zero the per-vCPU cpu_info
A stack is allocated per vCPU to be used by Xen. The allocation is done
with alloc_xenheap_pages, which does not zero the memory returned. However
the top of the stack contains information that will be used to
store the initial state of the vCPU (see struct cpu_info). Some of the
fields may not be initialized and may lead to using/leaking bits of previous
memory in some cases on the first run of a vCPU (AFAICT this only happens on
vCPU0 for Dom0).
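A minimal sketch of the fix implied by the description (field names are assumptions):

    /* Zero the cpu_info area at the top of the freshly allocated stack,
     * since alloc_xenheap_pages() returns uninitialised memory. */
    memset(v->arch.cpu_info, 0, sizeof(*v->arch.cpu_info));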
Jan Beulich [Thu, 21 Jun 2018 09:35:46 +0000 (11:35 +0200)]
x86/EFI: fix FPU state handling around runtime calls
There are two issues. First, the nonlazy xstates were never restored
after returning from the runtime call.
Secondly, with the fully_eager_fpu mitigation for XSA-267 / LazyFPU, the
unilateral stts() is no longer correct, and hits an assertion later when
a lazy state restore tries to occur for a fully eager vcpu.
Fix both of these issues by calling vcpu_restore_fpu_eager(). As EFI
runtime services can be used in the idle context, the idle assertion
needs to move until after the fully_eager_fpu check.
Introduce a "curr" local variable and replace other uses of "current"
at the same time.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Juergen Gross <jgross@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Juergen Gross [Mon, 18 Jun 2018 07:18:56 +0000 (09:18 +0200)]
tools/libxc: retry hypercall in case of EFAULT
A hypercall issued via the privcmd driver can very rarely return
-EFAULT even if the hypercall buffers are locked in memory. This
happens for hypercall buffers in user memory when the Linux kernel
is doing memory scans e.g. for page migration or compaction.
Retry the getpageframeinfo3 hypercall up to 2 times in case
-EFAULT is returned and the hypervisor might see invalid PTEs for
user hypercall buffers (which should be the case only if the kernel
doesn't offer a /dev/xen/hypercall node).
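A sketch of the retry pattern described (the hypercall wrapper is hypothetical, standing in for the real getpageframeinfo3 invocation):

    #include <errno.h>

    static int retry_on_efault(int (*issue_hypercall)(void *), void *arg)
    {
        int rc, retries = 0;

        do {
            rc = issue_hypercall(arg);   /* e.g. getpageframeinfo3 */
        } while ( rc < 0 && errno == EFAULT && ++retries <= 2 );

        return rc;
    }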
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Juergen Gross [Mon, 18 Jun 2018 07:18:54 +0000 (09:18 +0200)]
tools/libxencall: use hypercall buffer device if available
Instead of using anonymous memory for hypercall buffers which is then
locked into memory, use the hypercall buffer device of the Linux
privcmd driver if available.
This has the advantage of needing just a single mmap() for allocating
the buffer and page migration or compaction can't make the buffer
unaccessible for the hypervisor.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Fri, 15 Jun 2018 09:49:06 +0000 (11:49 +0200)]
x86/HVM: account for fully eager FPU mode in emulation
In fully eager mode we must not clear fpu_dirtied, set CR0.TS, or invoke
the fpu_leave() hook. Instead do what the mode's name says: Restore
state right away.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86/spec-ctrl: Mitigations for LazyFPU
Intel Core processors since at least Nehalem speculate past #NM, which is the
mechanism by which lazy FPU context switching is implemented.
On affected processors, Xen must use fully eager FPU context switching to
prevent guests from being able to read FPU state (SSE/AVX/etc) from previously
scheduled vcpus.
This is part of XSA-267 / CVE-2018-3665
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Display and input protocols define the "unique-id" XenBus field as a string,
which is much more flexible for defining unique identifiers compared
to the integer used by the sound protocol. For example, this allows
UUIDs to be provided as unique IDs. Align the sound protocol with display
and input and redefine the "unique-id" field as a string.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Juergen Gross <jgross@suse.com>
If the frontend is configured to expose multiple connectors then the backend
may require a way to uniquely identify a specific virtual connector within the
frontend. This is useful for use-cases where a connector needs to be
matched to a physical display connector.
Add XenBus "unique-id" node parameter, so this sort of use-cases can
be implemented.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Juergen Gross <jgross@suse.com>
If the frontend is configured to expose multiple input device instances
then the backend may require a way to uniquely identify a specific input
device within the frontend. This is useful for use-cases where a
virtual input device needs to be matched to a physical input device.
Add XenBus "unique-id" node parameter, so this sort of use-cases can
be implemented.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Juergen Gross <jgross@suse.com>
xen/kbdif: Move multi-touch device parameters to backend nodes
In the current kbdif protocol definition multi-touch device parameters
are described as part of the frontend's XenBus configuration nodes while
they belong to the backend's configuration. Fix this by moving
the parameters to the proper section.
xen/arm: Enable errata for secondary CPU on hotplug after the boot
On boot, enabling errata workarounds will be triggered by the boot CPU
from start_xen(). On CPU hotplug (non-boot scenario) this would not be
done. This patch adds the code required to enable errata workarounds for
a CPU being hotplugged after the system boots. This is triggered using
a notifier. If the CPU fails to enable workarounds the notifier will
return an error and Xen will hit the BUG_ON() in notify_cpu_starting().
To avoid the BUG_ON() in an error case, either the enabling notifiers should
be fixed to return void (not propagate an error to notify_cpu_starting()),
with the errata notifier always returning success for the CPU_STARTING
event, or notify_cpu_starting() and other common code should be
fixed to expect an error at the CPU_STARTING phase.
xen/arm: Free memory allocated for sibling/core maps on CPU hot-unplug
The memory allocated in setup_cpu_sibling_map() when a CPU is hotplugged
has to be freed when the CPU is hot-unplugged. This is done in
remove_cpu_sibling_map() and called when the CPU dies. The call to
remove_cpu_sibling_map() is made from a notifier callback when
CPU_DEAD event is received.
xen/arm: Disable timers and release their interrupts on CPU hot-unplug
When a CPU is hot-unplugged we need to disable timers and release
their interrupts in order to free the memory that was allocated when
interrupts were requested (using request_irq()). The request_irq()
is called for each timer interrupt when the CPU gets hotplugged
(start_secondary->init_timer_interrupt->request_irq).
With this patch timers will be disabled and interrupts will be
released when the newly added callback receives CPU_DYING event.
xen/arm: Release maintenance interrupt when CPU is hot-unplugged
When a CPU is hot-unplugged the maintenance interrupt has to be
released in order to free the memory that was allocated when the CPU
was hotplugged and interrupt requested. The interrupt was requested
using request_irq() which is called from start_secondary->
init_maintenance_interrupt. With this patch the interrupt will be
released when the CPU_DYING event is received by the callback which
is added in gic.c.
xen/common: Restore IRQ affinity when hotplugging a pCPU
Non-boot pCPUs are being hot-unplugged during the system suspend to
RAM and hotplugged during the resume. When non-boot pCPUs are
hot-unplugged the interrupts that were targeted to them are migrated
to the boot pCPU.
On suspend, each guest could have its own wake-up devices/interrupts
(passthrough) that could trigger the system resume. These interrupts
could be targeted to a non-boot pCPU, e.g. if the guest's vCPU is
pinned to a non-boot pCPU. Due to the hot-unplug of non-boot pCPUs
during the suspend such interrupts will be migrated from non-boot pCPUs
to the boot pCPU (this is fine). However, when non-boot pCPUs are
hotplugged on resume, these interrupts are not migrated back to non-boot
pCPUs, i.e. IRQ affinity is not restored on resume (this is wrong).
This patch adds the restoration of IRQ affinity when a pCPU is hotplugged.
xen/arm: Setup virtual paging for non-boot CPUs on hotplug/resume
In existing code the virtual paging for non-boot CPUs is setup only on boot.
The setup is triggered from start_xen() after all CPUs are brought online.
In other words, the initialization of VTCR_EL2 register is done out of the
cpu_up/start_secondary() control flow. However, the cpu_up flow is also used
to hotplug non-boot CPUs on resume from suspend to RAM state, in which case
the virtual paging will not be configured.
With this patch the setting of paging is triggered from start_secondary()
function using cpu starting notifier (notify_cpu_starting() call). The
notifier is registered in p2m.c using init call. This has to be done with
init call rather than presmp_init because the registered callback depends
on vtcr configuration value which is setup after the presmp init calls
are executed (do_presmp_initcalls() called from start_xen()). Init calls
are executed after initial virtual paging is set up for all CPUs on boot.
This ensures that no callback can fire until the vtcr value is calculated
by Xen and virtual paging is set up initially for all CPUs. Also, this way
the virtual paging setup in boot scenario remains unchanged.
It is assumed here that any CPU executing start_secondary() after the
system has completed boot was also brought up when Xen itself
booted. According to this assumption non-boot CPUs will always be compliant
with the VTCR_EL2 value that was selected by Xen on boot.
Currently, there is no mechanism to trigger hotplugging of a CPU. This
will be added with the suspend to RAM support for ARM, where the hotplug
of non-boot CPUs will be triggered via enable_nonboot_cpus() call.
xen/arm: Remove __initdata and __init to enable CPU hotplug
CPU up flow is currently used during the initial boot to start secondary
CPUs. However, the same flow should be used for CPU hotplug, e.g. when
hotplugging secondary CPUs within the resume procedure (resume from the
suspend to RAM). Therefore, prefixes __initdata and __init had to be removed
from a few data structures and functions that are used within the cpu up flow.
During the system suspend to RAM non-boot CPUs will be hot-unplugged.
This will be triggered via the disable_nonboot_cpus() call. When
hot-unplugged, the CPU will end up in an infinite wfi loop in stop_cpu().
This patch adds a PSCI CPU_OFF call to EL3 with the aim of powering
down the calling CPU during the suspend. The CPU_OFF call will be made
only if the PSCI version is higher than v0.1 (Note that the CPU_OFF
function is mandatory since PSCI v0.2).
If the PSCI CPU_OFF call to EL3 succeeds it will not return. Otherwise,
when the PSCI CPU_OFF call returns we raise a panic, because the
calling CPU couldn't be enabled afterwards (it stays in the WFI loop forever).
Note that if the PSCI version is higher than v0.1 the CPU_OFF will be
called regardless of the system state. This is done because scenarios
other than suspend may benefit from powering off the CPU.
xen/arm: Ignore write to GICD_ISACTIVERn registers (vgic-v2)
Guests attempt to write into these registers on resume (for example Linux).
Without this patch a data abort exception will be raised to the guest.
This patch handles the write access by ignoring it, but only if the value
to be written is zero. This should be fine because reading these registers
is already handled as 'read as zero'.
xen/arm64: Added handling of the trapped access to OSLSR register
Linux/dom0 accesses the OSLSR register when saving CPU context during the
suspend procedure. Xen traps access to this register, but has no handling
for it. Consequently, Xen injects an undef exception into Linux, causing it
to crash. This patch handles the trapped access to OSLSR as read-only,
returning a fixed value.
For convenience, introduce a new helper handle_ro_read_val() based on
handle_ro_raz() that will return a specified value on read and
re-implement handle_ro_raz() based on the new helper.
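A hedged sketch of the two helpers (signatures and the trap-handling calls are assumptions based on Xen's arm traps code, not verified prototypes):

    void handle_ro_read_val(struct cpu_user_regs *regs, int regidx,
                            bool read, const union hsr hsr, int min_el,
                            register_t val)
    {
        if ( !read )
            return inject_undef_exception(regs, hsr); /* write: reject */

        set_user_reg(regs, regidx, val);              /* read: fixed value */
        advance_pc(regs, hsr);
    }

    /* Read-as-zero becomes a thin wrapper around the new helper. */
    void handle_ro_raz(struct cpu_user_regs *regs, int regidx, bool read,
                       const union hsr hsr, int min_el)
    {
        handle_ro_read_val(regs, regidx, read, hsr, min_el, 0);
    }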
Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Julien Grall <julien.grall@arm.com>
[julien: Add a word about the new helper]