xenbits.xensource.com Git - people/dariof/xen.git/log
5 years ago xen: sched: refactor the ASSERTs around vcpu_deassign() (rel/sched/null-fix-vcpu-hotplug-v2)
Dario Faggioli [Thu, 25 Jul 2019 17:51:19 +0000 (19:51 +0200)]
xen: sched: refactor the ASSERTs around vcpu_deassign()

Whenever we call vcpu_deassign(), the vcpu _must_ be assigned to a pCPU,
and hence that pCPU can't be free.

Therefore, move the ASSERT-s which check for these properties into that
function, where they belong.
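A minimal sketch of the idea, assuming a much simplified model: the
struct layout, the pcpu_owner[] table and NR_PCPUS are hypothetical and
are not the null scheduler's real data structures. The point is only
that the preconditions are checked once, inside the callee, instead of
at every call site.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PCPUS 4                  /* hypothetical, for illustration only */

struct vcpu {
    int id;
    int pcpu;                       /* -1 while not assigned to any pCPU */
};

/* Hypothetical table: which vCPU, if any, each pCPU currently hosts. */
struct vcpu *pcpu_owner[NR_PCPUS];

void vcpu_assign(struct vcpu *v, int cpu)
{
    assert(cpu >= 0 && cpu < NR_PCPUS);
    assert(v->pcpu == -1 && pcpu_owner[cpu] == NULL);
    pcpu_owner[cpu] = v;
    v->pcpu = cpu;
}

void vcpu_deassign(struct vcpu *v)
{
    /* The checks live here, so no caller can forget them. */
    assert(v->pcpu >= 0 && v->pcpu < NR_PCPUS);   /* vCPU is assigned   */
    assert(pcpu_owner[v->pcpu] == v);             /* ... so pCPU is busy */

    pcpu_owner[v->pcpu] = NULL;
    v->pcpu = -1;
}
```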

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years ago xen: sched: reassign vCPUs to pCPUs, when they come back online
Dario Faggioli [Thu, 25 Jul 2019 17:43:01 +0000 (19:43 +0200)]
xen: sched: reassign vCPUs to pCPUs, when they come back online

When a vcpu that was offline comes back online, we want it to either
be assigned to a pCPU, or go into the wait list.

Detecting that a vcpu is coming back online is a bit tricky. Basically,
if the vcpu is waking up, and is neither assigned to a pCPU, nor in the
wait list, it must be coming back from offline.

When this happens, we put it in the waitqueue, and we "tickle" an idle
pCPU (if any), to go pick it up.

Looking at the patch, it seems that the vcpu wakeup code is getting
complex, and hence that it could potentially introduce latencies.
However, all this new logic is triggered only by the case of a vcpu
coming online, so, basically, the overhead during normal operations is
just an additional 'if()'.
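The detection rule described above can be sketched as follows. This is
an illustration under stated assumptions, not the patch itself: struct
vcpu_state, enum wake_action and vcpu_wake_action() are hypothetical
names, and "tickling" an idle pCPU is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>

enum wake_action { WAKE_NORMAL, WAKE_REQUEUE };

/* Hypothetical per-vCPU bookkeeping. */
struct vcpu_state {
    bool assigned;      /* currently assigned to a pCPU */
    bool in_waitq;      /* currently in the wait list   */
};

enum wake_action vcpu_wake_action(struct vcpu_state *v)
{
    /* A vCPU is never both on a pCPU and in the wait list. */
    assert(!(v->assigned && v->in_waitq));

    /* The single extra 'if' paid on every wakeup. */
    if (!v->assigned && !v->in_waitq) {
        /* Neither placed nor waiting: it must be coming back online. */
        v->in_waitq = true;     /* queue it; caller tickles an idle pCPU */
        return WAKE_REQUEUE;
    }

    return WAKE_NORMAL;
}
```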

Signed-off-by: Dario Faggioli <dario.faggioli@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Roger Pau Monne <roger.pau@citrix.com>
5 years ago xen: sched: deal with vCPUs being or becoming online or offline
Dario Faggioli [Thu, 25 Jul 2019 16:46:56 +0000 (18:46 +0200)]
xen: sched: deal with vCPUs being or becoming online or offline

If a vCPU is, or is going, offline we want it to be neither
assigned to a pCPU, nor in the wait list, so:
- if an offline vcpu is inserted (or migrated) it must not
  go on a pCPU, nor in the wait list;
- if an offline vcpu is removed, we are sure that it is
  neither on a pCPU nor in the wait list already, so we
  should just bail, avoiding doing any further action;
- if a vCPU goes offline we need to remove it either from
  its pCPU or from the wait list.
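
The three cases above can be modelled roughly like this. The model is a
sketch only: struct vcpu_model, enum placement and the model_*()
helpers are hypothetical and ignore locking, migration details and the
real scheduler hooks.

```c
#include <assert.h>
#include <stdbool.h>

enum placement { PLACE_NONE, PLACE_PCPU, PLACE_WAITQ };

struct vcpu_model {
    bool online;
    enum placement where;
};

/* Insert (or migrate): an offline vCPU must stay unplaced. */
void model_insert(struct vcpu_model *v, bool pcpu_free)
{
    if (!v->online) {
        v->where = PLACE_NONE;
        return;
    }
    v->where = pcpu_free ? PLACE_PCPU : PLACE_WAITQ;
}

/* Remove: an offline vCPU is already unplaced, so just bail. */
void model_remove(struct vcpu_model *v)
{
    if (!v->online) {
        assert(v->where == PLACE_NONE);
        return;
    }
    v->where = PLACE_NONE;
}

/* Going offline: drop it from its pCPU or from the wait list. */
void model_offline(struct vcpu_model *v)
{
    v->online = false;
    v->where = PLACE_NONE;
}
```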

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Roger Pau Monne <roger.pau@citrix.com>
---
Changes from v1:
* improved wording in changelog and comments
* this patch is the result of the merge of patches 2 and 3 from v1

5 years ago xen: sched: refactor code around vcpu_deassign() in null scheduler
Dario Faggioli [Thu, 25 Jul 2019 16:40:23 +0000 (18:40 +0200)]
xen: sched: refactor code around vcpu_deassign() in null scheduler

vcpu_deassign() is called only once (in _vcpu_remove()).

Let's consolidate the two functions into one.

No functional change intended.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years ago tboot: remove maintainers and declare orphaned
Roger Pau Monne [Thu, 25 Jul 2019 13:51:12 +0000 (15:51 +0200)]
tboot: remove maintainers and declare orphaned

Gang Wei's Intel email address has been bouncing for some time now, and
the other maintainer is non-responsive to patches [0], so remove
maintainers and declare INTEL(R) TRUSTED EXECUTION TECHNOLOGY (TXT)
orphaned.

[0] https://lists.xenproject.org/archives/html/xen-devel/2019-05/msg00563.html

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/dmi: Constify quirks data
Andrew Cooper [Wed, 24 Jul 2019 14:08:16 +0000 (15:08 +0100)]
x86/dmi: Constify quirks data

All DMI quirks tables are mutable, but are only ever read.

Update dmi_check_system() and dmi_system_id.callback to pass a const pointer,
and move all quirks tables into __initconstrel.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/dmi: Drop trivial callback functions
Andrew Cooper [Wed, 24 Jul 2019 14:05:16 +0000 (15:05 +0100)]
x86/dmi: Drop trivial callback functions

dmi_check_system() returns the number of matches.  Checking whether this is
nonzero is more efficient than making a function pointer call to a trivial
function to modify a variable.

No functional change, but this results in less compiled code, which is
also (fractionally) quicker to run.
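
A rough sketch of the pattern, assuming a heavily simplified quirk
table: struct quirk, count_matches() and the "ExampleVendor" entry are
hypothetical stand-ins, not Xen's dmi_system_id or dmi_check_system().
The caller tests the returned match count instead of relying on a
callback that merely sets a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified quirk entry. */
struct quirk {
    const char *vendor;
};

/* Count matching entries; the table ends with a NULL vendor. */
int count_matches(const struct quirk *table, const char *sys_vendor)
{
    int n = 0;

    assert(table != NULL && sys_vendor != NULL);
    for (; table->vendor != NULL; table++)
        if (strcmp(table->vendor, sys_vendor) == 0)
            n++;
    return n;
}

/* Read-only table, so it can be const. */
const struct quirk reboot_quirks[] = {
    { "ExampleVendor" },
    { NULL }
};

/* No callback needed: a nonzero count already says "quirk applies". */
int needs_reboot_quirk(const char *sys_vendor)
{
    return count_matches(reboot_quirks, sys_vendor) != 0;
}
```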

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86: Drop CONFIG_ACPI_SLEEP
Andrew Cooper [Wed, 24 Jul 2019 17:10:52 +0000 (18:10 +0100)]
x86: Drop CONFIG_ACPI_SLEEP

This option is hardcoded to 1, and the #ifdef-ary doesn't exclude wakeup.S,
which makes it useless code noise.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/dmi: Drop warning with an obsolete URL
Andrew Cooper [Wed, 24 Jul 2019 17:47:25 +0000 (18:47 +0100)]
x86/dmi: Drop warning with an obsolete URL

This quirk doesn't change anything in Xen, and the web page doesn't exist.

The Wayback Machine confirms that the link disappeared somewhere between
2003-06-14 and 2004-07-07.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/iommu: avoid mapping the interrupt address range for hwdom
Roger Pau Monné [Thu, 25 Jul 2019 10:17:34 +0000 (12:17 +0200)]
x86/iommu: avoid mapping the interrupt address range for hwdom

Current code only prevents mapping the lapic page into the guest
physical memory map. Expand the range to be 0xFEEx_xxxx as described
in the Intel VT-d specification section 3.13 "Handling Requests to
Interrupt Address Range".

AMD also lists this address range in the AMD SR5690 Databook, section
2.4.4 "MSI Interrupt Handling and MSI to HT Interrupt Conversion".
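
For illustration, a check against the 0xFEEx_xxxx window could look
like the sketch below. The macro and function names are hypothetical,
4K pages are assumed, and this is not the code the patch adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_4K   12
#define INTR_FIRST_PFN  (0xFEE00000ULL >> PAGE_SHIFT_4K)   /* 0xFEE00 */
#define INTR_LAST_PFN   (0xFEEFFFFFULL >> PAGE_SHIFT_4K)   /* 0xFEEFF */

/*
 * True if the inclusive page frame range [first_pfn, last_pfn] touches
 * the interrupt address range and therefore must not be mapped.
 */
bool overlaps_interrupt_range(uint64_t first_pfn, uint64_t last_pfn)
{
    assert(first_pfn <= last_pfn);
    return first_pfn <= INTR_LAST_PFN && last_pfn >= INTR_FIRST_PFN;
}
```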

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago passthrough/amd: Clean iommu_hap_pt_share enabled code
Alexandru Isaila [Thu, 25 Jul 2019 10:16:58 +0000 (12:16 +0200)]
passthrough/amd: Clean iommu_hap_pt_share enabled code

At this moment IOMMU pt sharing is disabled by commit [1].

This patch cleans the unreachable code guarded by iommu_hap_pt_share.

[1] c2ba3db31ef2d9f1e40e7b6c16cf3be3d671d555

Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Brian Woods <brian.woods@amd.com>
5 years ago iommu / x86: move call to scan_pci_devices() out of vendor code
Paul Durrant [Thu, 25 Jul 2019 10:16:21 +0000 (12:16 +0200)]
iommu / x86: move call to scan_pci_devices() out of vendor code

It's not vendor specific so it doesn't really belong there.

Scanning the PCI topology also really doesn't have much to do with IOMMU
initialization. It doesn't depend on there even being an IOMMU. This patch
moves the call to the beginning of iommu_hardware_setup() but only
places it there because the topology information would be otherwise unused.

Subsequent patches will actually make use of the PCI topology during
(x86) IOMMU initialization.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: "Roger Pau Monné" <roger.pau@citrix.com>
Acked-by: Brian Woods <brian.woods@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/IOMMU: don't restrict IRQ affinities to online CPUs
Jan Beulich [Thu, 25 Jul 2019 10:14:52 +0000 (12:14 +0200)]
x86/IOMMU: don't restrict IRQ affinities to online CPUs

In line with "x86/IRQ: desc->affinity should strictly represent the
requested value" the internally used IRQ(s) also shouldn't be restricted
to online ones. Make set_desc_affinity() (set_msi_affinity() then does
by implication) cope with a NULL mask being passed (just like
assign_irq_vector() does), and have IOMMU code pass NULL instead of
&cpu_online_map (when, for VT-d, there's no NUMA node information
available).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Brian Woods <brian.woods@amd.com>
5 years ago x86/pv: Move async_exception_cleanup() into pv/iret.c
Andrew Cooper [Tue, 23 Jul 2019 19:46:35 +0000 (20:46 +0100)]
x86/pv: Move async_exception_cleanup() into pv/iret.c

All callers are in pv/iret.c.  Move the function and make it static.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago xen/x86: cleanup unused NMI/MCE code
Juergen Gross [Wed, 24 Jul 2019 11:26:57 +0000 (13:26 +0200)]
xen/x86: cleanup unused NMI/MCE code

pv_raise_interrupt() is only called for NMIs these days, so the MCE
specific part can be removed. Rename pv_raise_interrupt() to
pv_raise_nmi() and NMI_MCE_SOFTIRQ to NMI_SOFTIRQ.

Additionally there is no need to pin the vcpu which the NMI is delivered
to; that is a leftover of (already removed) MCE handling. So remove the
pinning, too. Note that pinning was introduced by commit 355b0469a8
adding MCE support (with NMI support existing already). MCE using that
pinning was removed with commit 3a91769d6e again without cleaning up the
code.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-and-tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago passthrough/vtd: Don't DMA to the stack in queue_invalidate_wait()
Andrew Cooper [Thu, 19 Oct 2017 10:50:18 +0000 (11:50 +0100)]
passthrough/vtd: Don't DMA to the stack in queue_invalidate_wait()

DMA-ing to the stack is considered bad practice.  In this case, if a
timeout occurs because of a sluggish device which is processing the
request, the completion notification will corrupt the stack of a
subsequent deeper call tree.

Place the poll_slot in a percpu area and DMA to that instead.

Fix the declaration of saddr in struct qinval_entry, to avoid a shift by
two.  The requirement here is that the DMA address is dword aligned,
which is covered by poll_slot's type.

This change does not address other issues.  Correlating completions
after a timeout with their request is a more complicated change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
5 years ago x86/iommu: add comment regarding setting of need_sync
Roger Pau Monné [Tue, 23 Jul 2019 15:00:07 +0000 (17:00 +0200)]
x86/iommu: add comment regarding setting of need_sync

Clarify why relaxed hardware domains don't need iommu page-table
syncing.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago pci: switch pci_conf_write32 to use pci_sbdf_t
Roger Pau Monné [Tue, 23 Jul 2019 14:59:23 +0000 (16:59 +0200)]
pci: switch pci_conf_write32 to use pci_sbdf_t

This reduces the number of parameters of the function to two, and
simplifies some of the calling sites.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Brian Woods <brian.woods@amd.com>
5 years ago pci: switch pci_conf_write16 to use pci_sbdf_t
Roger Pau Monné [Tue, 23 Jul 2019 14:58:42 +0000 (16:58 +0200)]
pci: switch pci_conf_write16 to use pci_sbdf_t

This reduces the number of parameters of the function to two, and
simplifies some of the calling sites.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago pci: switch pci_conf_write8 to use pci_sbdf_t
Roger Pau Monné [Tue, 23 Jul 2019 14:58:07 +0000 (16:58 +0200)]
pci: switch pci_conf_write8 to use pci_sbdf_t

This reduces the number of parameters of the function to two, and
simplifies some of the calling sites.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago pci: switch pci_conf_read32 to use pci_sbdf_t
Roger Pau Monné [Tue, 23 Jul 2019 14:54:38 +0000 (16:54 +0200)]
pci: switch pci_conf_read32 to use pci_sbdf_t

This reduces the number of parameters of the function to two, and
simplifies some of the calling sites.
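
To illustrate why a packed address shortens the parameter list, here is
a sketch under stated assumptions: sbdf_t, make_sbdf() and
conf_address() are hypothetical names, the bit layout (segment in bits
31-16, bus in 15-8, device in 7-3, function in 2-0) is assumed for this
example, and the address encoding is the legacy 0xCF8 style, which only
reaches segment 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packed segment/bus/device/function value. */
typedef struct {
    uint32_t sbdf;
} sbdf_t;

sbdf_t make_sbdf(unsigned seg, unsigned bus, unsigned dev, unsigned fn)
{
    assert(seg <= 0xffff && bus <= 0xff && dev <= 0x1f && fn <= 0x7);
    sbdf_t s = { ((uint32_t)seg << 16) | (bus << 8) | (dev << 3) | fn };
    return s;
}

unsigned sbdf_seg(sbdf_t s) { return s.sbdf >> 16; }
unsigned sbdf_bus(sbdf_t s) { return (s.sbdf >> 8) & 0xff; }
unsigned sbdf_dev(sbdf_t s) { return (s.sbdf >> 3) & 0x1f; }
unsigned sbdf_fn(sbdf_t s)  { return s.sbdf & 0x7; }

/*
 * Two parameters (device, register) instead of one per address field.
 * Builds a legacy 0xCF8-style address: enable bit, then bus/dev/fn,
 * then the dword-aligned register offset.
 */
uint32_t conf_address(sbdf_t s, unsigned reg)
{
    return 0x80000000u | ((s.sbdf & 0xffff) << 8) | (reg & 0xfc);
}
```

A call site then passes one value for the device, for example
conf_address(make_sbdf(0, 1, 2, 3), 0x10), rather than spelling out
segment, bus, device and function separately.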

While there convert {IGD/IOH}_DEV to be a pci_sbdf_t itself instead of
a device number.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Brian Woods <brian.woods@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago pci: switch pci_conf_read16 to use pci_sbdf_t
Roger Pau Monné [Tue, 23 Jul 2019 14:54:01 +0000 (16:54 +0200)]
pci: switch pci_conf_read16 to use pci_sbdf_t

This reduces the number of parameters of the function to two, and
simplifies some of the calling sites.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Brian Woods <brian.woods@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
5 years ago pci: switch pci_conf_read8 to use pci_sbdf_t
Roger Pau Monné [Tue, 23 Jul 2019 14:53:24 +0000 (16:53 +0200)]
pci: switch pci_conf_read8 to use pci_sbdf_t

This reduces the number of parameters of the function to two, and
simplifies some of the calling sites.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Brian Woods <brian.woods@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
5 years ago x86emul: unconditionally deliver #UD for LWP insns
Jan Beulich [Tue, 23 Jul 2019 14:52:19 +0000 (16:52 +0200)]
x86emul: unconditionally deliver #UD for LWP insns

This is to accompany commit 91f86f8634 ("x86/svm: Drop support for AMD's
Lightweight Profiling").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago xen/sched: fix locking in restore_vcpu_affinity()
Juergen Gross [Tue, 23 Jul 2019 09:20:55 +0000 (11:20 +0200)]
xen/sched: fix locking in restore_vcpu_affinity()

Commit 0763cd2687897b55e7 ("xen/sched: don't disable scheduler on cpus
during suspend") removed a lock in restore_vcpu_affinity() which needs
to stay: cpumask_scratch_cpu() must be protected by the scheduler
lock. restore_vcpu_affinity() is being called by thaw_domains(), so
with multiple domains in the system another domain might already be
running and the scheduler might make use of cpumask_scratch_cpu()
already.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
5 years ago xen/arm: remove unused dt_device_node parameter
Viktor Mitin [Tue, 18 Jun 2019 08:58:51 +0000 (11:58 +0300)]
xen/arm: remove unused dt_device_node parameter

Some of the functions generating nodes (e.g. make_timer_node)
take a dt_device_node parameter, but never use it.
It is actually misused when creating the DT for DomU.
So it is best to remove the parameter.

Suggested-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Viktor Mitin <viktor_mitin@epam.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
5 years ago x86/crash: fix kexec transition breakage
Igor Druzhinin [Fri, 19 Jul 2019 13:07:48 +0000 (14:07 +0100)]
x86/crash: fix kexec transition breakage

Following 6ff560f7f ("x86/SMP: don't try to stop already stopped CPUs")
an incorrect condition was placed into the kexec transition path,
leaving the crashing CPU always online and breaking entry into the kdump
kernel. Correct it by unifying the condition with smp_send_stop().

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
5 years ago AMD/IOMMU: pass IOMMU to amd_iommu_alloc_intremap_table()
Jan Beulich [Mon, 22 Jul 2019 10:06:10 +0000 (12:06 +0200)]
AMD/IOMMU: pass IOMMU to amd_iommu_alloc_intremap_table()

The function will want to know IOMMU properties (specifically the IRTE
size) subsequently.

Correct indentation of one of the call sites at this occasion.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Brian Woods <brian.woods@amd.com>
5 years ago AMD/IOMMU: pass IOMMU to iterate_ivrs_entries() callback
Jan Beulich [Mon, 22 Jul 2019 10:05:27 +0000 (12:05 +0200)]
AMD/IOMMU: pass IOMMU to iterate_ivrs_entries() callback

Both users will want to know IOMMU properties (specifically the IRTE
size) subsequently. Leverage this to avoid pointless calls to the
callback when IVRS mapping table entries are unpopulated. To avoid
leaking interrupt remapping tables (bogusly) allocated for IOMMUs
themselves, this requires suppressing their allocation in the first
place, taking a step further what commit 757122c0cf ('AMD/IOMMU: don't
"add" IOMMUs') had done.

Additionally suppress the call for alias entries, as again both users
don't care about these anyway. In fact this eliminates a fair bit of
redundancy from dump output.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Brian Woods <brian.woods@amd.com>
5 years ago AMD/IOMMU: process softirqs while dumping IRTs
Jan Beulich [Mon, 22 Jul 2019 10:03:46 +0000 (12:03 +0200)]
AMD/IOMMU: process softirqs while dumping IRTs

When there are sufficiently many devices listed in the ACPI tables (no
matter if they actually exist), output may take way longer than the
watchdog would like.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Brian Woods <brian.woods@amd.com>
5 years ago AMD/IOMMU: free more memory when cleaning up after error
Jan Beulich [Mon, 22 Jul 2019 09:59:01 +0000 (11:59 +0200)]
AMD/IOMMU: free more memory when cleaning up after error

The interrupt remapping in-use bitmaps were leaked in all cases. The
ring buffers and the mapping of the MMIO space were leaked for any IOMMU
that hadn't been enabled yet.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Brian Woods <brian.woods@amd.com>
5 years ago x86/vLAPIC: avoid speculative out of bounds accesses
Jan Beulich [Mon, 22 Jul 2019 09:50:58 +0000 (11:50 +0200)]
x86/vLAPIC: avoid speculative out of bounds accesses

Array indexes used in the MSR read/write emulation functions as well as
the direct VMX / APIC-V hook are derived from guest controlled values.
Restrict their ranges to limit the side effects of speculative
execution.
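
One common way to restrict an index without a conditional branch is a
mask computed from the index and the array size, similar in spirit to
helpers such as array_index_nospec(). The sketch below is generic and
hedged: index_mask() and clamp_index() are hypothetical names, the
commit message does not say which helper the patch uses, and plain C
cannot by itself guarantee that a compiler emits branch-free code. It
also relies on arithmetic right shift of negative values, which is
implementation-defined in C but is what common compilers provide.

```c
#include <assert.h>
#include <limits.h>

/*
 * All-ones when index < size, zero otherwise.  For an in-range index
 * neither operand of the OR has its top bit set, so the complement is
 * negative and the shift smears the sign bit across the whole word.
 */
unsigned long index_mask(unsigned long index, unsigned long size)
{
    assert(size != 0 && size <= LONG_MAX);
    return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
                           (sizeof(long) * CHAR_BIT - 1));
}

/* Out-of-range indexes collapse to 0, which is always a valid slot. */
unsigned long clamp_index(unsigned long index, unsigned long size)
{
    return index & index_mask(index, size);
}
```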

Along these lines also constrain the vlapic_lvt_mask[] access.

Remove the unused vlapic_lvt_{vector,dm}() instead of adjusting them.

This is part of the speculative hardening effort.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQ: move {,_}clear_irq_vector()
Jan Beulich [Mon, 22 Jul 2019 09:48:08 +0000 (11:48 +0200)]
x86/IRQ: move {,_}clear_irq_vector()

This is largely to drop a forward declaration. There's one functional
change - clear_irq_vector() gets marked __init, as its only caller is
check_timer(). Beyond this only a few stray blanks get removed.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQ: eliminate some on-stack cpumask_t instances
Jan Beulich [Mon, 22 Jul 2019 09:47:38 +0000 (11:47 +0200)]
x86/IRQ: eliminate some on-stack cpumask_t instances

Use scratch_cpumask where possible, to avoid creating these possibly
large stack objects. We can't use it in _assign_irq_vector() and
set_desc_affinity(), as these get called in IRQ context.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQ: tighten vector checks
Jan Beulich [Mon, 22 Jul 2019 09:47:06 +0000 (11:47 +0200)]
x86/IRQ: tighten vector checks

Use valid_irq_vector() rather than "> 0".

Also replace an open-coded use of IRQ_VECTOR_UNASSIGNED.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQ: drop redundant cpumask_empty() from move_masked_irq()
Jan Beulich [Mon, 22 Jul 2019 09:46:31 +0000 (11:46 +0200)]
x86/IRQ: drop redundant cpumask_empty() from move_masked_irq()

The subsequent cpumask_intersects() covers the "empty" case quite fine.
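
Why the extra emptiness test is redundant can be seen in a toy model
with a single-word mask. The mask_t type and the helper names are
hypothetical; Xen's cpumask_t is a multi-word bitmap, but the same
reasoning applies: an empty mask intersects nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t mask_t;    /* hypothetical: bit n stands for CPU n */

bool mask_empty(mask_t m)                { return m == 0; }
bool mask_intersects(mask_t a, mask_t b) { return (a & b) != 0; }

/* Before: explicit emptiness test followed by the intersection test. */
bool can_move_checked(mask_t pending, mask_t online)
{
    if (mask_empty(pending))
        return false;
    return mask_intersects(pending, online);
}

/* After: the intersection test alone gives the same answer. */
bool can_move(mask_t pending, mask_t online)
{
    return mask_intersects(pending, online);
}
```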

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQ: make fixup_irqs() skip unconnected internally used interrupts
Jan Beulich [Mon, 22 Jul 2019 09:45:58 +0000 (11:45 +0200)]
x86/IRQ: make fixup_irqs() skip unconnected internally used interrupts

Since the "Cannot set affinity ..." warning is a one time one, avoid
triggering it already at boot time when parking secondary threads and
the serial console uses a (still unconnected at that time) PCI IRQ.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQs: correct/tighten vector check in _clear_irq_vector()
Jan Beulich [Mon, 22 Jul 2019 09:45:28 +0000 (11:45 +0200)]
x86/IRQs: correct/tighten vector check in _clear_irq_vector()

If any particular value was to be checked against, it would need to be
IRQ_VECTOR_UNASSIGNED.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Be more strict though and use valid_irq_vector() instead.

Take the opportunity and also convert local variables to unsigned int.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQ: target online CPUs when binding guest IRQ
Jan Beulich [Mon, 22 Jul 2019 09:44:50 +0000 (11:44 +0200)]
x86/IRQ: target online CPUs when binding guest IRQ

fixup_irqs() skips interrupts without action. Hence such interrupts can
retain affinity to just offline CPUs. With "noirqbalance" in effect,
pirq_guest_bind() so far would have left them alone, resulting in a
non-working interrupt.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQ: fix locking around vector management
Jan Beulich [Mon, 22 Jul 2019 09:44:02 +0000 (11:44 +0200)]
x86/IRQ: fix locking around vector management

All of __{assign,bind,clear}_irq_vector() manipulate struct irq_desc
fields, and hence ought to be called with the descriptor lock held in
addition to vector_lock. This is currently the case for only
set_desc_affinity() (in the common case) and destroy_irq(), which also
clarifies what the nesting behavior between the locks has to be.
Reflect the new expectation by having these functions all take a
descriptor as parameter instead of an interrupt number.

Also take care of the two special cases of calls to set_desc_affinity():
set_ioapic_affinity_irq() and VT-d's dma_msi_set_affinity() get called
directly as well, and in these cases the descriptor locks hadn't got
acquired till now. For set_ioapic_affinity_irq() this means acquiring /
releasing of the IO-APIC lock can be plain spin_{,un}lock() then.

Drop one of the two leading underscores from all three functions at
the same time.

There's one case left where descriptors get manipulated with just
vector_lock held: setup_vector_irq() assumes its caller to acquire
vector_lock, and hence can't itself acquire the descriptor locks (wrong
lock order). I don't currently see how to address this.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com> [VT-d]
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQ: consolidate use of ->arch.cpu_mask
Jan Beulich [Mon, 22 Jul 2019 09:43:16 +0000 (11:43 +0200)]
x86/IRQ: consolidate use of ->arch.cpu_mask

Mixed meaning was implied so far by different pieces of code -
disagreement was in particular about whether to expect offline CPUs'
bits to possibly be set. Switch to a mostly consistent meaning
(exception being high priority interrupts, which would perhaps better
be switched to the same model as well in due course). Use the field to
record the vector allocation mask, i.e. potentially including bits of
offline (parked) CPUs. This implies that before passing the mask to
certain functions (most notably cpu_mask_to_apicid()) it needs to be
further reduced to the online subset.

The exception of high priority interrupts is also why for the moment
_bind_irq_vector() is left as is, despite looking wrong: It's used
exclusively for IRQ0, which isn't supposed to move off CPU0 at any time.

The prior lack of restricting to online CPUs in set_desc_affinity()
before calling cpu_mask_to_apicid() in particular allowed (in x2APIC
clustered mode) offlined CPUs to end up enabled in an IRQ's destination
field. (I wonder whether vector_allocation_cpumask_flat() shouldn't
follow a similar model, using cpu_present_map in favor of
cpu_online_map.)

For IO-APIC code it was definitely wrong to potentially store, as a
fallback, TARGET_CPUS (i.e. all online ones) into the field, as that
would have caused problems when determining on which CPUs to release
vectors when they've gone out of use. Disable interrupts instead when
no valid target CPU can be established (which code elsewhere should
guarantee to never happen), and log a message in such an unlikely event.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQ: desc->affinity should strictly represent the requested value
Jan Beulich [Mon, 22 Jul 2019 09:42:32 +0000 (11:42 +0200)]
x86/IRQ: desc->affinity should strictly represent the requested value

desc->arch.cpu_mask reflects the actual set of target CPUs. Don't ever
fiddle with desc->affinity itself, except to store caller requested
values. Note that assign_irq_vector() now takes a NULL incoming CPU mask
to mean "all CPUs", rather than just "all currently online CPUs".
This way no further affinity adjustment is needed after onlining further
CPUs.

This renders both set_native_irq_info() uses (which weren't using proper
locking anyway) redundant - drop the function altogether.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQ: deal with move cleanup count state in fixup_irqs()
Jan Beulich [Mon, 22 Jul 2019 09:41:55 +0000 (11:41 +0200)]
x86/IRQ: deal with move cleanup count state in fixup_irqs()

The cleanup IPI may get sent immediately before a CPU gets removed from
the online map. In such a case the IPI would get handled on the CPU
being offlined no earlier than in the interrupts disabled window after
fixup_irqs()' main loop. This is too late, however, because a possible
affinity change may incur the need for vector assignment, which will
fail when the IRQ's move cleanup count is still non-zero.

To fix this
- record the set of CPUs the cleanup IPIs actually get sent to alongside
  setting their count,
- adjust the count in fixup_irqs(), accounting for all CPUs that the
  cleanup IPI was sent to, but that are no longer online,
- bail early from the cleanup IPI handler when the CPU is no longer
  online, to prevent double accounting.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/IRQ: deal with move-in-progress state in fixup_irqs()
Jan Beulich [Mon, 22 Jul 2019 09:41:02 +0000 (11:41 +0200)]
x86/IRQ: deal with move-in-progress state in fixup_irqs()

The flag being set may prevent affinity changes, as these often imply
assignment of a new vector. When there's no possible destination left
for the IRQ, the clearing of the flag needs to happen right from
fixup_irqs().

Additionally _assign_irq_vector() needs to avoid setting the flag when
there's no online CPU left in what gets put into ->arch.old_cpu_mask.
The old vector can be released right away in this case.

Also extend the log message about broken affinity to include the new
affinity as well, making it possible to notice issues with affinity changes not
actually having taken place. Swap the if/else-if order there at the
same time to reduce the amount of conditions checked.

At the same time replace two open coded instances of the new helper
function.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago tools/libxc: allow controlling the max C-state sub-state
Ross Lagerwall [Mon, 22 Jul 2019 09:35:19 +0000 (11:35 +0200)]
tools/libxc: allow controlling the max C-state sub-state

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>

Make handling in do_pm_op() more homogeneous: Before interpreting
op->cpuid as such, handle all operations not acting on a particular
CPU. Also expose the setting via xenpm.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86: allow limiting the max C-state sub-state
Ross Lagerwall [Mon, 22 Jul 2019 09:34:32 +0000 (11:34 +0200)]
x86: allow limiting the max C-state sub-state

Allow limiting the max C-state sub-state by appending to the max_cstate
command-line parameter. E.g. max_cstate=1,0
The limit only applies to the highest legal C-state. For example:
 max_cstate = 1, max_csubstate = 0 ==> C0, C1 okay, but not C1E
 max_cstate = 1, max_csubstate = 1 ==> C0, C1 and C1E okay, but not C2
 max_cstate = 2, max_csubstate = 0 ==> C0, C1, C1E, C2 okay, but not C3
 max_cstate = 2, max_csubstate = 1 ==> C0, C1, C1E, C2 okay, but not C3
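
The examples above can be reproduced with a small sketch. It is an
illustration only: struct cstate_limit, parse_max_cstate() and
cstate_allowed() are hypothetical names, C1E is modelled as sub-state 1
of C1 as in the examples, and treating a missing sub-state as "no
sub-state limit" is an assumption of this sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

struct cstate_limit {
    unsigned max_cstate;
    unsigned max_csubstate;     /* UINT_MAX: no sub-state limit (assumed) */
};

/* Parse "<state>[,<substate>]"; returns false on malformed input. */
bool parse_max_cstate(const char *s, struct cstate_limit *out)
{
    char *end;

    assert(s != NULL && out != NULL);
    out->max_csubstate = UINT_MAX;
    out->max_cstate = (unsigned)strtoul(s, &end, 10);
    if (end == s)
        return false;
    if (*end == ',') {
        const char *sub = end + 1;

        out->max_csubstate = (unsigned)strtoul(sub, &end, 10);
        if (end == sub)
            return false;
    }
    return *end == '\0';
}

/* The sub-state limit only applies to the highest legal C-state. */
bool cstate_allowed(const struct cstate_limit *l, unsigned type, unsigned sub)
{
    if (type != l->max_cstate)
        return type < l->max_cstate;
    return sub <= l->max_csubstate;
}
```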

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/AMD: make C-state handling independent of Dom0
Jan Beulich [Mon, 22 Jul 2019 09:34:03 +0000 (11:34 +0200)]
x86/AMD: make C-state handling independent of Dom0

At least for more recent CPUs, following what BKDG / PPR suggest for the
BIOS to surface via ACPI we can make ourselves independent of Dom0
uploading respective data.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/cpuidle: really use C1 for "urgent" CPUs
Jan Beulich [Mon, 22 Jul 2019 09:32:20 +0000 (11:32 +0200)]
x86/cpuidle: really use C1 for "urgent" CPUs

For one, on recent AMD CPUs entering C1 (if available at all) requires
use of MWAIT, while HLT (i.e. default_idle()) would put the processor
into as deep as CC6. And then even on other vendors' CPUs we should
avoid entering default_idle() when the intended state can be reached
by using the active idle driver's facilities.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/cpuidle: switch to uniform meaning of "max_cstate="
Jan Beulich [Mon, 22 Jul 2019 09:31:38 +0000 (11:31 +0200)]
x86/cpuidle: switch to uniform meaning of "max_cstate="

While the MWAIT idle driver already takes it to mean an actual C state,
the ACPI idle driver so far used it as a list index. The list index,
however, is an implementation detail of Xen and affected by firmware
settings (i.e. not necessarily uniform for a particular system).

While touching this code also avoid invoking menu_get_trace_data()
when tracing is not active. For consistency do this also for the
MWAIT driver.

Note that I'm intentionally not adding any sorting logic to set_cx():
Before and after this patch we assume entries to arrive in order, so
this would be an orthogonal change.

Take the opportunity and add minimal documentation for the command line
option.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86/shadow: ditch dangerous declarations
Jan Beulich [Mon, 22 Jul 2019 09:30:10 +0000 (11:30 +0200)]
x86/shadow: ditch dangerous declarations

This started out with me noticing the latent bug of there being HVM
related declarations in common.c that their producer doesn't see, and
that hence could go out of sync at any time. However, go farther than
fixing just that and move the functions actually using these into hvm.c.
This way the items in question can simply become static, and no separate
declarations are needed at all.

Within the moved code constify and rename or outright delete the struct
vcpu * local variables and re-format a comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86/mtrr: Skip cache flushes on CPUs with cache self-snooping
Ricardo Neri [Fri, 19 Jul 2019 11:51:24 +0000 (13:51 +0200)]
x86/mtrr: Skip cache flushes on CPUs with cache self-snooping

Programming MTRR registers in multi-processor systems is a rather lengthy
process. Furthermore, all processors must program these registers in lock
step and with interrupts disabled; the process also involves flushing
caches and TLBs twice. As a result, the process may take a considerable
amount of time.

On some platforms, this can lead to a large skew of the refined-jiffies
clock source. Early when booting, if no other clock is available (e.g.,
booting with hpet=disabled), the refined-jiffies clock source is used to
monitor the TSC clock source. If the skew of refined-jiffies is too large,
Linux wrongly assumes that the TSC is unstable:

  clocksource: timekeeping watchdog on CPU1: Marking clocksource
               'tsc-early' as unstable because the skew is too large:
  clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last:
               fffedb90 mask: ffffffff
  clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4
               mask: ffffffffffffffff
  tsc: Marking TSC unstable due to clocksource watchdog

As per measurements, around 98% of the time needed by the procedure to
program MTRRs in multi-processor systems is spent flushing caches with
wbinvd(). As per Section 11.11.8 of the Intel 64 and IA-32
Architectures Software Developer's Manual, it is not necessary to flush
caches if the CPU supports cache self-snooping. Thus, skipping the cache
flushes can reduce by several tens of milliseconds the time needed to
complete the programming of the MTRR registers:

Platform                        Before    After
104-core (208 Threads) Skylake  1437ms     28ms
  2-core (  4 Threads) Haswell   114ms      2ms

Reported-by: Mohammad Etemadi <mohammad.etemadi@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
[Linux commit fd329f276ecaad7a371d6f91b9bbea031d0c3440]

Use alternatives patching instead of static_cpu_has() (which we don't
have [yet]).

Interestingly we've been lacking the 2nd wbinvd(), which I'm taking the
liberty to add here.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
Ricardo Neri [Fri, 19 Jul 2019 11:50:38 +0000 (13:50 +0200)]
x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata

From: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>

Processors which have self-snooping capability can handle conflicting
memory types across CPUs by snooping their own caches. However, there exist
CPU models in which having conflicting memory types still leads to
unpredictable behavior, machine check errors, or hangs.

Clear this feature on affected CPUs to prevent its use.

Suggested-by: Alan Cox <alan.cox@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
[Linux commit 1e03bff3600101bd9158d005e4313132e55bdec8]

Strip Yonah - as per ark.intel.com it doesn't look to be 64-bit capable.
Call the new function on the boot CPU only. Don't clear the CPU feature
flag itself, as it is exposed to guests (who could otherwise observe it
disappear after migration).

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/mem_sharing: compile mem_sharing subsystem only when kconfig is enabled
Tamas K Lengyel [Fri, 19 Jul 2019 11:49:47 +0000 (13:49 +0200)]
x86/mem_sharing: compile mem_sharing subsystem only when kconfig is enabled

Disable it by default as it is only an experimental subsystem.

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/mem_sharing: enable mem_share audit mode only in debug builds
Tamas K Lengyel [Fri, 19 Jul 2019 11:49:26 +0000 (13:49 +0200)]
x86/mem_sharing: enable mem_share audit mode only in debug builds

Improves performance for release builds.

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
5 years agox86/mem_sharing: copy a page_lock version to be internal to memshr
Tamas K Lengyel [Fri, 19 Jul 2019 11:48:38 +0000 (13:48 +0200)]
x86/mem_sharing: copy a page_lock version to be internal to memshr

Patch cf4b30dca0a "Add debug code to detect illegal page_lock and put_page_type
ordering" added extra sanity checking to page_lock/page_unlock for debug builds
with the assumption that no hypervisor path ever locks two pages at once.

This assumption doesn't hold during memory sharing so we copy a version of
page_lock/unlock to be used exclusively in the memory sharing subsystem
without the sanity checks.

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/mem_sharing: reorder when pages are unlocked and released
Tamas K Lengyel [Fri, 19 Jul 2019 11:47:17 +0000 (13:47 +0200)]
x86/mem_sharing: reorder when pages are unlocked and released

Calling _put_page_type while also holding the page_lock for that page
can cause a deadlock. There may be code-paths still in place where this
is an issue, but for normal sharing purposes this has been tested and
works.

The extra page reference previously grabbed at certain points is dropped
because it is no longer needed: with this reordering a reference is held
for as long as necessary, so the extra one is redundant.

The comment being dropped is out of date and hence incorrect.

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agoxen/trace: Implement TRACE_?D() in a more efficient fashion
Andrew Cooper [Thu, 18 Jul 2019 15:24:42 +0000 (16:24 +0100)]
xen/trace: Implement TRACE_?D() in a more efficient fashion

These can easily be expressed with a variadic macro. No functional change.
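
As a rough illustration of the technique (not the patch itself), a family
of fixed-arity trace macros can share one variadic helper. Everything
below is an assumption made for the example: record_trace() and the last_*
variables are hypothetical stand-ins for the real trace backend, and the
zero-argument form is left out because an empty initializer is not
portable C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sink that remembers the most recent trace record. */
static uint32_t last_event;
static unsigned int last_count;
static uint32_t last_data[6];

static void record_trace(uint32_t event, unsigned int count,
                         const uint32_t *data)
{
    assert(count <= 6);
    last_event = event;
    last_count = count;
    memcpy(last_data, data, count * sizeof(*data));
}

/* One variadic helper packs its arguments into an array ... */
#define TRACE_ND(event, ...)                                     \
    do {                                                         \
        const uint32_t d_[] = { __VA_ARGS__ };                   \
        record_trace((event), sizeof(d_) / sizeof(d_[0]), d_);   \
    } while ( 0 )

/* ... and the fixed-arity forms become thin wrappers around it. */
#define TRACE_1D(event, d1)         TRACE_ND(event, d1)
#define TRACE_2D(event, d1, d2)     TRACE_ND(event, d1, d2)
#define TRACE_3D(event, d1, d2, d3) TRACE_ND(event, d1, d2, d3)
```

The payload size falls out of sizeof() on the array, so no per-arity
bookkeeping is needed.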

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years agoxen/trace: Adjust types in function declarations
Andrew Cooper [Thu, 18 Jul 2019 13:41:48 +0000 (14:41 +0100)]
xen/trace: Adjust types in function declarations

Use uint32_t consistently for 'event', bool consistently for 'cycles',
and unsigned int consistently for 'extra'.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years agoxen/trace: Add trace.h to MAINTAINER
Andrew Cooper [Thu, 18 Jul 2019 16:53:03 +0000 (17:53 +0100)]
xen/trace: Add trace.h to MAINTAINER

... to match the existing trace.c entry.

Reported-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years agolibxl_qmp: wait for completion of device removal
Chao Gao [Fri, 19 Jul 2019 09:24:08 +0000 (10:24 +0100)]
libxl_qmp: wait for completion of device removal

To remove a device from a domain, a qmp command is sent to qemu. But it is
handled by qemu asynchronously. Even when the qmp command is reported as done,
the actual handling on the qemu side may happen later.
This behavior causes two problems:
1. Attaching a device back to a domain right after detaching the device from
that domain would fail with error:

libxl: error: libxl_qmp.c:341:qmp_handle_error_response: Domain 1:received an
error message from QMP server: Duplicate ID 'pci-pt-60_00.0' for device

2. Accesses to PCI configuration space in Qemu may overlap with later device
reset issued by 'xl' or by pciback.

To avoid the problems mentioned above, wait for the completion of device
removal by querying all pci devices using a qmp command and ensuring the target
device isn't listed. Only retry 5 times to avoid 'xl' potentially being blocked
by qemu.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Message-Id: <1562133373-19208-1-git-send-email-chao.gao@intel.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
5 years agogolang/xenlight: Fixing compilation for go 1.11
Daniel P. Smith [Thu, 18 Jul 2019 21:11:44 +0000 (22:11 +0100)]
golang/xenlight: Fixing compilation for go 1.11

This deals with two casting issues for compiling under go 1.11:
- explicitly cast to *C.xentoollog_logger for Ctx.logger pointer
- add cast to unsafe.Pointer for the C string cpath

Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years agoMAINTAINERS: Make myself libxl golang binding maintainer
George Dunlap [Mon, 8 Jul 2019 10:56:24 +0000 (06:56 -0400)]
MAINTAINERS: Make myself libxl golang binding maintainer

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years agox86emul: Ignore ssse3-{aes,pclmul}.[ch] as well
Andrew Cooper [Thu, 18 Jul 2019 15:09:27 +0000 (16:09 +0100)]
x86emul: Ignore ssse3-{aes,pclmul}.[ch] as well

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agoxen/trace: Fix build with !CONFIG_TRACEBUFFER
Andrew Cooper [Thu, 18 Jul 2019 13:29:35 +0000 (14:29 +0100)]
xen/trace: Fix build with !CONFIG_TRACEBUFFER

GCC reports:

In file included from hvm.c:24:0:
/local/xen.git/xen/include/xen/trace.h: In function ‘tb_control’:
/local/xen.git/xen/include/xen/trace.h:60:13: error: ‘ENOSYS’
undeclared (first use in this function)
     return -ENOSYS;
             ^~~~~~

Include xen/errno.h to resolve the issue.  While tweaking this, add comments
to the #else and #endif, as they are a fair distance apart.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/mm: Provide more useful information in diagnostics
Andrew Cooper [Sat, 13 Apr 2019 21:03:05 +0000 (22:03 +0100)]
x86/mm: Provide more useful information in diagnostics

 * alloc_l?_table() should identify the failure, not just state that there is
   one.
 * get_page() should use %pd for the two domains, to render system domains in
   a more obvious way.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86emul: add a PCLMUL/VPCLMUL test case to the harness
Jan Beulich [Wed, 17 Jul 2019 13:46:08 +0000 (15:46 +0200)]
x86emul: add a PCLMUL/VPCLMUL test case to the harness

Also use this for AVX512_VBMI2 VPSH{L,R}D{,V}{D,Q,W} testing (only the
quad word right shifts get actually used; the assumption is that their
"left" counterparts as well as the double word and word forms then work
as well).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: add a SHA test case to the harness
Jan Beulich [Wed, 17 Jul 2019 13:45:34 +0000 (15:45 +0200)]
x86emul: add a SHA test case to the harness

Also use this for AVX512VL VPRO{L,R}{,V}D as well as some further shifts
testing.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: add an AES/VAES test case to the harness
Jan Beulich [Wed, 17 Jul 2019 13:44:54 +0000 (15:44 +0200)]
x86emul: add an AES/VAES test case to the harness

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: restore ordering within main switch statement
Jan Beulich [Wed, 17 Jul 2019 13:43:57 +0000 (15:43 +0200)]
x86emul: restore ordering within main switch statement

Incremental additions and/or mistakes have led to some code blocks
sitting in "unexpected" places. Re-sort the case blocks (opcode space;
major opcode; 66/F3/F2 prefix; legacy/VEX/EVEX encoding).

As an exception the opcode space 0x0f EVEX-encoded VPEXTRW is left at
its current place, to keep it close to the "pextr" label.

Pure code movement.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: support GFNI insns
Jan Beulich [Wed, 17 Jul 2019 13:43:06 +0000 (15:43 +0200)]
x86emul: support GFNI insns

As to the feature dependency adjustment, while strictly speaking SSE is
a sufficient prereq (to have XMM registers), vectors of bytes and qwords
have got introduced only with SSE2. gcc, for example, uses a similar
connection in its respective intrinsics header.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: support VAES insns
Jan Beulich [Wed, 17 Jul 2019 13:41:58 +0000 (15:41 +0200)]
x86emul: support VAES insns

As to the feature dependency adjustment, just like for VPCLMULQDQ while
strictly speaking AVX is a sufficient prereq (to have YMM registers),
256-bit vectors of integers have got fully introduced with AVX2 only.

A new test case (also covering AESNI) will be added to the harness by a
subsequent patch.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: support VPCLMULQDQ insns
Jan Beulich [Wed, 17 Jul 2019 13:41:20 +0000 (15:41 +0200)]
x86emul: support VPCLMULQDQ insns

As to the feature dependency adjustment, while strictly speaking AVX is
a sufficient prereq (to have YMM registers), 256-bit vectors of integers
have got fully introduced with AVX2 only. Sadly gcc can't be used as a
reference here: They don't provide any AVX512-independent built-in at
all.

Along the lines of PCLMULQDQ, since the insns here and in particular
their memory access patterns follow the usual scheme, I didn't think it
was necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: support AVX512_VNNI insns
Jan Beulich [Wed, 17 Jul 2019 13:40:42 +0000 (15:40 +0200)]
x86emul: support AVX512_VNNI insns

Along the lines of the 4FMAPS case, convert the 4VNNIW-based table
entries to a decoder adjustment. Because of the current sharing of table
entries between different (implied) opcode prefixes and with the same
major opcodes being used for vp4dpwssd{,s}, which have a different
memory operand size and different Disp8 scaling, the pre-existing table
entries get converted to a decoder override. The table entries will now
represent the insns here, in line with other table entries preferably
representing the prefix-66 insns.

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: support AVX512_4VNNIW insns
Jan Beulich [Wed, 17 Jul 2019 13:39:54 +0000 (15:39 +0200)]
x86emul: support AVX512_4VNNIW insns

As in a few cases before, since the insns here and in particular their
memory access patterns follow the AVX512_4FMAPS scheme, I didn't think
it was necessary to add contrived tests specifically for them, beyond
the Disp8 scaling ones.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: support AVX512_4FMAPS insns
Jan Beulich [Wed, 17 Jul 2019 13:39:10 +0000 (15:39 +0200)]
x86emul: support AVX512_4FMAPS insns

A decoder adjustment is needed here because of the current sharing of
table entries between different (implied) opcode prefixes: The same
major opcodes are used for vfmsub{132,213}{p,s}{s,d}, which have a
different memory operand size and different Disp8 scaling.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: support remaining AVX512_VBMI2 insns
Jan Beulich [Wed, 17 Jul 2019 13:38:35 +0000 (15:38 +0200)]
x86emul: support remaining AVX512_VBMI2 insns

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: support of AVX512_IFMA insns
Jan Beulich [Wed, 17 Jul 2019 13:37:54 +0000 (15:37 +0200)]
x86emul: support of AVX512_IFMA insns

Once again take the liberty and also correct the (public interface) name
of the AVX512_IFMA feature flag to match the SDM, on the assumption that
no external consumer has actually been using that flag so far.

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86emul: support of AVX512* population count insns
Jan Beulich [Wed, 17 Jul 2019 13:37:00 +0000 (15:37 +0200)]
x86emul: support of AVX512* population count insns

Plus the only other AVX512_BITALG one.

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agoxen: make tracebuffer configurable
Baodong Chen [Wed, 17 Jul 2019 13:35:22 +0000 (15:35 +0200)]
xen: make tracebuffer configurable

Xen's internal running status (trace events at pre-defined trace points)
is saved to trace memory when enabled.
Trace event data and config params can be read/changed
by the system control hypercall at run time.

Can be disabled for smaller code footprint.

Signed-off-by: Baodong Chen <chenbaodong@mxnavi.com>
Acked-by: George Dunlap <george.dunlap@citrix.com> [tracing]
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agodom_cow is needed for mem-sharing only
Jan Beulich [Wed, 17 Jul 2019 13:34:23 +0000 (15:34 +0200)]
dom_cow is needed for mem-sharing only

A couple of adjustments are needed to code checking for dom_cow, but
since there are pretty few it is probably better to adjust those than
to set up and keep around a never used domain.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien.grall@arm.com>
5 years agox86/PV: drop page table ownership check from emul-priv-op.c:read_cr()
Jan Beulich [Wed, 17 Jul 2019 13:33:05 +0000 (15:33 +0200)]
x86/PV: drop page table ownership check from emul-priv-op.c:read_cr()

We have such a check here but nowhere else. It shouldn't have been
added by af909e7e16 ("M2P translation cannot be handled through flat
table with") in the first place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agoxen/doc: update ARM warning about testing gcov on arm
Viktor Mitin [Wed, 10 Jul 2019 04:57:37 +0000 (07:57 +0300)]
xen/doc: update ARM warning about testing gcov on arm

Update ARM code coverage warning about testing gcov on arm

Signed-off-by: Viktor Mitin <viktor_mitin@epam.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86/suspend: Don't save/restore %cr8
Andrew Cooper [Mon, 15 Jul 2019 16:21:02 +0000 (17:21 +0100)]
x86/suspend: Don't save/restore %cr8

%cr8 is an alias of APIC_TASKPRI, which is handled by
lapic_{suspend,resume}() with the rest of the Local APIC state.  Saving
and restoring the TPR state in isolation is not a clever idea.

Drop it all.

While editing wakeup_prot.S, trim its include list to just the headers
which are used, which is precisely none of them.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/smpboot: Remove redundant order calculations
Andrew Cooper [Thu, 11 Jul 2019 14:50:17 +0000 (09:50 -0500)]
x86/smpboot: Remove redundant order calculations

The GDT and IDT allocations are all order 0, and not going to change.

Use an explicit 0, instead of calling get_order_from_pages().  This
allows for the removal of the 'order' local parameter in both
cpu_smpboot_{alloc,free}().

While making this adjustment, rearrange cpu_smpboot_free() to fold the
two "if ( remove )" clauses.  There are no explicit requirements for the
order of free()s.

No practical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agomm.h: fix BUG_ON() condition in put_page_alloc_ref()
Paul Durrant [Tue, 16 Jul 2019 11:29:02 +0000 (13:29 +0200)]
mm.h: fix BUG_ON() condition in put_page_alloc_ref()

The BUG_ON() was misplaced when this function was introduced in commit
ec83f825 "mm.h: add helper function to test-and-clear _PGC_allocated".
It will fire incorrectly if _PGC_allocated is already clear on entry. Thus
it should be moved after the if statement.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agomm.h: add helper function to test-and-clear _PGC_allocated
Paul Durrant [Tue, 16 Jul 2019 07:10:36 +0000 (09:10 +0200)]
mm.h: add helper function to test-and-clear _PGC_allocated

The _PGC_allocated flag is set on a page when it is assigned to a domain
along with an initial reference count of at least 1. To clear this
'allocation' reference it is necessary to test-and-clear _PGC_allocated and
then only drop the reference if the test-and-clear succeeds. This is open-
coded in many places. It is also unsafe to test-and-clear _PGC_allocated
unless the caller holds an additional reference.

This patch adds a helper function, put_page_alloc_ref(), to replace all the
open-coded test-and-clear/put_page occurrences. That helper function
incorporates a check that an additional page reference is held and will
BUG() if it is not.
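
A minimal sketch of such a helper, using C11 atomics rather than Xen's own
primitives, is shown below. The struct layout, the PGC_* values and
put_page() are simplified assumptions for illustration, and assert() stands
in for BUG_ON(). The check sits inside the branch, so it cannot fire when
the flag is already clear on entry.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative layout: top bit is the flag, low bits the refcount. */
#define PGC_ALLOCATED  (1u << 31)
#define PGC_COUNT_MASK (PGC_ALLOCATED - 1)

struct page {
    atomic_uint count_info;
};

/* Simplified stand-in: drop one general reference. */
static void put_page(struct page *pg)
{
    atomic_fetch_sub(&pg->count_info, 1);
}

/* Test-and-clear the flag; drop the allocation reference only if set. */
static void put_page_alloc_ref(struct page *pg)
{
    unsigned int old = atomic_fetch_and(&pg->count_info, ~PGC_ALLOCATED);

    if (old & PGC_ALLOCATED) {
        /* The caller must hold a reference besides the allocation one. */
        assert((old & PGC_COUNT_MASK) > 1);
        put_page(pg);
    }
}
```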

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/hvm: make hvmemul_virtual_to_linear()'s reps parameter optional
Jan Beulich [Tue, 16 Jul 2019 07:09:44 +0000 (09:09 +0200)]
x86/hvm: make hvmemul_virtual_to_linear()'s reps parameter optional

A majority of callers want just a single iteration handled. Allow this to
be expressed by passing in a NULL pointer, instead of setting up a local
variable just to hold the "1" to pass in here.
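
The calling convention can be illustrated with a small stand-alone sketch.
handle_reps() and its max_reps parameter are hypothetical and far simpler
than the real function; only the "NULL means one iteration" pattern is the
point.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: clamp the requested repeat count to max_reps and
 * report how many iterations will be handled.  A NULL reps pointer means
 * the caller wants exactly one iteration.
 */
static unsigned long handle_reps(unsigned long *reps, unsigned long max_reps)
{
    unsigned long one = 1;

    assert(max_reps >= 1);

    if (reps == NULL)
        reps = &one; /* callee supplies the local "1" */

    if (*reps > max_reps)
        *reps = max_reps;

    return *reps;
}
```

Single-iteration callers then write handle_reps(NULL, limit) and need no
local variable of their own.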

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
5 years agox86/ept: pass correct level to p2m_entry_modify
Roger Pau Monné [Tue, 16 Jul 2019 07:05:28 +0000 (09:05 +0200)]
x86/ept: pass correct level to p2m_entry_modify

EPT differs from NPT and shadow when translating page orders to levels
in the physmap page tables. The EPT page table level for order 0 pages
is 0, while NPT and shadow instead use 1, i.e. EPT page table levels
start at 0 while NPT and shadow levels start at 1.

Fix the p2m_entry_modify call in atomic_write_ept_entry to always add
one to the level, in order to match NPT and shadow usage.

While there also add a check to ensure p2m_entry_modify is never
called with level == 0. That should help to catch future errors
related to the level parameter.
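
The off-by-one between the two conventions can be written out as a small
sketch. The function names are hypothetical, and the only assumption taken
from outside the commit message is that each x86-64 page table level covers
9 bits (orders 0, 9 and 18 for 4k, 2M and 1G pages).

```c
#include <assert.h>

#define PAGETABLE_ORDER 9 /* 512 entries per page table level */

/* EPT convention: order 0 pages live at level 0. */
static unsigned int ept_level_for_order(unsigned int page_order)
{
    return page_order / PAGETABLE_ORDER;
}

/* NPT/shadow convention: the same pages live at level 1. */
static unsigned int p2m_level_from_ept(unsigned int ept_level)
{
    return ept_level + 1;
}

/* Stand-in for the added sanity check: level 0 is never valid here. */
static unsigned int checked_p2m_level(unsigned int level)
{
    assert(level != 0);
    return level;
}
```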

Fixes: c7a4c088ad1c ('x86/mm: split p2m ioreq server pages special handling into helper')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
5 years agotools/xenstored: Drop mapping of the ring via foreign map
Andrew Cooper [Fri, 17 May 2019 10:08:56 +0000 (11:08 +0100)]
tools/xenstored: Drop mapping of the ring via foreign map

This is a vestigial remnant of the pre xenstored stub domain days.

Foreign mapping via MFN is a privileged operation which is not
necessary, because grant details are unconditionally set up during
domain construction.  In practice, this means xenstored never uses its
ability to foreign map the ring.

Drop the ability completely, which removes the penultimate use of the
unstable libxc interface.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
5 years agotools/xenstored: Make gnttab interface mandatory
Andrew Cooper [Fri, 17 May 2019 10:06:16 +0000 (11:06 +0100)]
tools/xenstored: Make gnttab interface mandatory

xenstored currently requires a libxc and evtchn interface, but leaves
the gnttab interface as optional.

gnttab is ubiquitous these days, and in practice mandatory in all cases
where xenstored isn't running as root in dom0 (due to the inability to
foreign map by MFN).

The toolstack has unconditionally set up grant details for many years
now, and longterm it would be good to phase out the use of libxc.  This
requires that xenstored map the store ring by grant map, rather than
foreign map.

No practical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: fix pci device re-assigning after domain reboot
Juergen Gross [Wed, 26 Jun 2019 13:37:26 +0000 (14:37 +0100)]
libxl: fix pci device re-assigning after domain reboot

After a reboot of a guest only the first pci device configuration will
be retrieved from Xenstore resulting in loss of any further assigned
passed through pci devices.

The main reason is that all passed through pci devices reside under a
common root device "0" in Xenstore. So when the device list is rebuilt
from Xenstore after a reboot the sub-devices below that root device
need to be selected instead of using the root device number as a
selector.

Fix that by adding a new member to struct libxl_device_type which when
set is used to get the number of devices. Add such a member for pci to
get the correct number of pci devices instead of implying it from the
number of pci root devices (which will always be 1).

While at it fix the type of libxl__device_pci_from_xs_be() to match
the one of the .from_xenstore member of struct libxl_device_type. This
fixes a latent bug checking the return value of a function returning
void.

Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
5 years agox86/ctxt-switch: Document and improve GDT handling
Andrew Cooper [Thu, 4 Jul 2019 15:13:32 +0000 (16:13 +0100)]
x86/ctxt-switch: Document and improve GDT handling

Calling virt_to_mfn() in the context switch path is a lot
of wasted cycles for a result which is constant after boot.

Begin by documenting how Xen handles the GDTs across context switch.

The loop in write_full_gdt_ptes() is unnecessary, because
NR_RESERVED_GDT_PAGES is 1.  Dropping it makes the code substantially
more clear, and with it dropped, write_full_gdt_ptes() becomes more
obviously a poor name, so rename it to update_xen_slot_in_full_gdt().

Furthermore, load_full_gdt() is completely independent of the current
CPU, and load_default_gdt() only needs the current CPU's regular
GDT.  (This is a change in behaviour, as previously it may have used the
compat GDT, but either will do.)

Add two extra per-cpu variables which cache the L1e for the regular and compat
GDT, calculated in cpu_smpboot_alloc()/trap_init() as appropriate, so
update_xen_slot_in_full_gdt() doesn't need to waste time performing the same
calculation on every context switch.

One performance scenario of Jürgen's (time to build the hypervisor on
an 8 CPU system, with two single-vCPU MiniOS VMs constantly interrupting
dom0 with events) shows the following, average over 5 measurements:

            elapsed  user   system
  Unpatched  66.51  232.93  109.21
  Patched    57.00  225.47  105.47

which is a substantial improvement.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agoxen/arm: use correct device tree root node name
Will Abele [Tue, 9 Jul 2019 13:22:23 +0000 (13:22 +0000)]
xen/arm: use correct device tree root node name

The root node of a device tree should not have a node name. This is
specified in section 2.2.1 of version 0.2 of the device tree
specification, available from devicetree.org.

Linux Kernel versions prior to 4.15 misinterpret flattened device trees
with a "/" as the name of the root node as an FDT version older than 16.
Linux then fails to parse the FDT.

Signed-off-by: Will Abele <will.abele@starlab.io>
Reviewed-by: Julien Grall <julien.grall@arm.com>
5 years agoxen/arm: optee: document OPTEE option in tee/Kconfig
Volodymyr Babchuk [Wed, 19 Jun 2019 17:54:26 +0000 (17:54 +0000)]
xen/arm: optee: document OPTEE option in tee/Kconfig

Add basic information about the OP-TEE mediator and note about
dependency on virtualization-aware OP-TEE.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Acked-by: Julien Grall <julien.grall@arm.com>
5 years agoxen/arm: optee: check if OP-TEE is virtualization-aware
Volodymyr Babchuk [Wed, 19 Jun 2019 17:54:24 +0000 (17:54 +0000)]
xen/arm: optee: check if OP-TEE is virtualization-aware

This is a workaround for OP-TEE 3.5. This is the first OP-TEE release
which supports virtualization, but there is no way to tell if
OP-TEE was built with that support enabled. We can probe for it
by calling SMC that is available only when OP-TEE is built with
virtualization support.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Acked-by: Julien Grall <julien.grall@arm.com>
5 years agoxen/arm: tee: place OP-TEE Kconfig option right after TEE
Volodymyr Babchuk [Wed, 19 Jun 2019 17:54:22 +0000 (17:54 +0000)]
xen/arm: tee: place OP-TEE Kconfig option right after TEE

It is nicer when options for particular TEE mediators (currently,
OP-TEE only) follow the generic "Enable TEE mediators support"
option in the menuconfig:

 [*] Enable TEE mediators support
 [ ]   Enable OP-TEE mediator

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
5 years agoget_maintainer: Improve patch recognition
Joe Perches [Tue, 18 Jun 2019 11:12:15 +0000 (11:12 +0000)]
get_maintainer: Improve patch recognition

There are mode change and rename only patches that are unrecognized
by the get_maintainer.pl script.

Recognize them.

[ Linux commit 0455c74788fd5aad4399f00e3fbbb7e87450ca58 ]

Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
CC: Julien Grall <julien.grall@arm.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Acked-by: Julien Grall <julien.grall@arm.com>
5 years agoxen/arm: domain_build: Black list devices using PPIs
Amit Singh Tomar [Sun, 23 Jun 2019 12:56:31 +0000 (18:26 +0530)]
xen/arm: domain_build: Black list devices using PPIs

Currently, the vGIC is not able to cope with hardware PPIs routed to guests.
One solution to this problem is to completely skip any device that uses a
PPI source while building the domain itself.

This patch goes through all the interrupt sources of a device and skips it
if one of the interrupt sources is a PPI. It fixes Xen boot on i.MX8MQ by
skipping the PMU node.
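
A minimal sketch of the check described above, assuming the device's
interrupts are already available as an array of GIC interrupt IDs. The
helper names irq_is_ppi() and device_uses_ppi() are made up for this
illustration and are not the functions used in domain_build; the only
fact relied upon is the GIC ID layout (SGIs 0-15, PPIs 16-31, SPIs from
32 upwards).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* GIC interrupt ID ranges: SGIs are 0-15, PPIs are 16-31, SPIs start at 32. */
#define PPI_FIRST 16u
#define PPI_LAST  31u

/* Hypothetical helper: true if the interrupt ID falls in the PPI range. */
static bool irq_is_ppi(unsigned int irq)
{
    return irq >= PPI_FIRST && irq <= PPI_LAST;
}

/*
 * Hypothetical helper: true if any interrupt of the device is a PPI,
 * in which case the caller would skip the device entirely.
 */
static bool device_uses_ppi(const unsigned int *irqs, size_t nr_irqs)
{
    assert(irqs != NULL || nr_irqs == 0);

    for (size_t i = 0; i < nr_irqs; i++)
        if (irq_is_ppi(irqs[i]))
            return true;

    return false;
}
```

A node whose only interrupt has ID 23 would be skipped, while one using
IDs 58 and 59 would be kept.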

Suggested-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Amit Singh Tomar <amittomer25@gmail.com>
Acked-by: Julien Grall <julien.grall@arm.com>
5 years ago x86/gnttab: Use explicit instruction size in gnttab_clear_flags()
Andrew Cooper [Mon, 8 Jul 2019 22:12:06 +0000 (23:12 +0100)]
x86/gnttab: Use explicit instruction size in gnttab_clear_flags()

The OpenSUSE Leap compilers complain about ambiguity:

In file included from grant_table.c:33:
In file included from ...xen/include/xen/grant_table.h:30:
...xen/include/asm/grant_table.h:67:19: error: ambiguous instructions require
an explicit suffix (could be 'andb', 'andw', 'andl', or 'andq')
    asm volatile ("lock and %1,%0" : "+m" (*addr) : "ir" ((uint16_t)~mask));
                  ^
<inline asm>:1:2: note: instantiated into assembly here
        lock and $-17,(%rsi)
        ^
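
One way to resolve the ambiguity reported above is to spell out the
operand size with a 'w' suffix, since the flags word is 16 bits wide.
The sketch below is an illustration rather than the exact change:
clear_flags16() is a made-up name, and the non-x86 branch using the
__atomic_fetch_and() compiler builtin is only there so that the example
builds elsewhere.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit "clear these flag bits" helper. */
static inline void clear_flags16(uint16_t *addr, uint16_t mask)
{
#if defined(__x86_64__) || defined(__i386__)
    /*
     * 'andw' fixes the operand width at 16 bits, so the assembler no
     * longer has to guess between andb/andw/andl/andq when the source
     * is an immediate and the destination is a memory operand.
     */
    __asm__ __volatile__ ("lock andw %1,%0"
                          : "+m" (*addr)
                          : "ir" ((uint16_t)~mask));
#else
    /* Portable fallback for other architectures (compiler builtin). */
    __atomic_fetch_and(addr, (uint16_t)~mask, __ATOMIC_SEQ_CST);
#endif
}
```

For example, clearing mask 0x0010 from a word holding 0x001F leaves
0x000F.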

Full logs: https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/247600284
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago xen/gnttab: Fold adjacent calls to gnttab_clear_flags()
Andrew Cooper [Wed, 21 Nov 2018 18:38:41 +0000 (18:38 +0000)]
xen/gnttab: Fold adjacent calls to gnttab_clear_flags()

Atomic operations are expensive to use, especially following XSA-295 for ARM.
It is wasteful to use two of them back-to-back when one will do.

Especially for a misbehaving guest on ARM, this will reduce the system
disruption required to complete the grant operations.
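
The folding can be illustrated with C11 atomics. This is a sketch under
assumptions, not the grant table code: FLAG_READING and FLAG_WRITING are
hypothetical bit values, and both functions are invented for the
comparison. The final value is the same either way; the folded version
simply issues one atomic read-modify-write instead of two.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag bits, chosen only for this illustration. */
#define FLAG_READING ((uint16_t)0x0008)
#define FLAG_WRITING ((uint16_t)0x0010)

/* Before: two back-to-back atomic operations, one per flag. */
static void clear_two_steps(_Atomic uint16_t *flags)
{
    atomic_fetch_and(flags, (uint16_t)~FLAG_WRITING);
    atomic_fetch_and(flags, (uint16_t)~FLAG_READING);
}

/* After: a single atomic operation with the masks combined. */
static void clear_folded(_Atomic uint16_t *flags)
{
    atomic_fetch_and(flags, (uint16_t)~(FLAG_WRITING | FLAG_READING));
}
```

Starting from 0x001F, both variants end with 0x0007.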

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>