xenbits.xensource.com Git - people/liuw/xen.git/log
6 years ago  tools/xen-foreign: Update python scripts to be Py3 compatible  for-4.13
Andrew Cooper [Mon, 4 Mar 2019 18:31:48 +0000 (18:31 +0000)]
tools/xen-foreign: Update python scripts to be Py3 compatible

The issues are:
 * dict.has_key() was completely removed in Py3
 * dict.keys() is an iterable rather than list in Py3, so .sort() doesn't work.
 * list.sort(cmp=) was deprecated in Py2.4 and removed in Py3.  Replace it
   with a key= sort instead.

This is all compatible with Py2.4 and later, which is when the sorted()
builtin was introduced.  Tested with Py2.7 and Py3.4
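
A minimal sketch of the three replacements listed above (illustrative only,
not taken from the xen-foreign scripts; the dictionary contents and variable
names are made up):

```python
# Py2.4+/Py3-compatible replacements for the three idioms listed above.
# The dictionary below is invented purely for illustration.
sizes = {"x86_64": 8, "x86_32": 4, "arm32": 4, "arm64": 8}

# dict.has_key(k)  ->  k in dict
has_arm64 = "arm64" in sizes

# keys = d.keys(); keys.sort()  ->  sorted(d.keys())
arches = sorted(sizes.keys())

# list.sort(cmp=...)  ->  sorted(..., key=...)
# Here the key orders by size first and by name second.
by_size = sorted(sizes.keys(), key=lambda name: (sizes[name], name))

print(has_arm64, arches, by_size)
```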

Reported-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
6 years ago  tools: add link path flag for local build to pkg-config files
Juergen Gross [Thu, 21 Feb 2019 17:36:13 +0000 (18:36 +0100)]
tools: add link path flag for local build to pkg-config files

The qemu build process requires the link path of Xen libraries
to be specified both with -L and -Wl,-rpath-link. Add the -L flag
to the local pkg-config files.

At the same time let the pkg-config files depend on the Makefile
creating them, too.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
6 years ago  vm_event: Add a new opcode to get VM_EVENT_INTERFACE_VERSION
Petre Pircalabu [Thu, 14 Feb 2019 14:18:11 +0000 (16:18 +0200)]
vm_event: Add a new opcode to get VM_EVENT_INTERFACE_VERSION

Currently, the VM_EVENT_INTERFACE_VERSION is determined at runtime, by
inspecting the corresponding field in a vm_event_request. This helper
opcode will query the hypervisor-supported version before the vm_event
related structures and layout are set up.

Signed-off-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
6 years ago  tools/xentop: Display '-' when stats are not available.
Pritha Srivastava [Fri, 22 Feb 2019 11:48:06 +0000 (11:48 +0000)]
tools/xentop: Display '-' when stats are not available.

Displaying 0 is misleading.

Signed-off-by: Pritha Srivastava <pritha.srivastava@citrix.com>
Signed-off-by: Ronan Abhamon <ronan.abhamon@vates.fr>
Acked-by: Wei Liu <wei.liu2@citrix.com>
6 years ago  xen: make grant table support configurable
Wei Liu [Fri, 18 Jan 2019 12:43:57 +0000 (12:43 +0000)]
xen: make grant table support configurable

Introduce CONFIG_GRANT_TABLE. Provide stubs and make sure x86 and arm
hypervisors build with grant table disabled.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>
6 years ago  x86/pv: Fix construction of 32bit dom0's
Andrew Cooper [Thu, 14 Feb 2019 11:10:09 +0000 (11:10 +0000)]
x86/pv: Fix construction of 32bit dom0's

dom0_construct_pv() has logic to transition dom0 into a compat domain when
booting an ELF32 image.

One aspect which is missing is the CPUID policy recalculation, meaning that a
32bit dom0 sees a 64bit policy, which differs by the Long Mode feature flag in
particular.  Another missing item is the x87_fip_width initialisation.

Update dom0_construct_pv() to use switch_compat(), rather than retaining the
opencoding.  Position the call to switch_compat() such that the compat32 local
variable can disappear entirely.

The 32bit monitor table is now created by setup_compat_l4(), avoiding the need
for manual creation later.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years ago  x86/cpuid: add missing PCLMULQDQ dependency
Jan Beulich [Tue, 5 Mar 2019 17:04:23 +0000 (18:04 +0100)]
x86/cpuid: add missing PCLMULQDQ dependency

Since we can't seem to be able to settle our discussion for the wider
adjustment previously posted, let's at least add the missing dependency
for 4.12. I'm not convinced though that attaching it to SSE is correct.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years ago  x86/dom0: propagate PVH vlapic EOIs to hardware
Roger Pau Monné [Tue, 5 Mar 2019 16:41:14 +0000 (17:41 +0100)]
x86/dom0: propagate PVH vlapic EOIs to hardware

The current check for MSI EOI is missing a special case for PVH Dom0,
which doesn't have a hvm_irq_dpci struct but requires EOIs to be
forwarded to the physical lapic for passed-through devices.

Add a short-circuit to allow EOIs from PVH Dom0 to be propagated.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years ago  tools/libfsimage: Add `XEN' to environment variable name
Ian Jackson [Tue, 5 Mar 2019 15:31:32 +0000 (15:31 +0000)]
tools/libfsimage: Add `XEN' to environment variable name

This library, which is private to Xen and was properly namespaced in
  1a814711881beb17f073f5f57e27e5bd4da1b956
  tools/libfsimage: Add `xen' to .h names and principal .so name
honours an environment variable to override the directory where
shared objects (ie filesystem plugins) are to be loaded from.

Rename that variable from FSIMAGE_FSDIR to XEN_FSIMAGE_FSDIR, to give
it a proper namespace prefix.
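
As an illustration of the lookup after the rename (a Python model, not the
library's C code; the default directory below is a hypothetical stand-in for
the compile-time constant, and fsimage_plugin_dir() is an invented name):

```python
import os

# Hypothetical stand-in for the compile-time default directory.
DEFAULT_FSDIR = "/usr/lib/xenfsimage"


def fsimage_plugin_dir(environ=os.environ):
    """Return the plugin directory, honouring only the new variable name."""
    return environ.get("XEN_FSIMAGE_FSDIR", DEFAULT_FSDIR)


# The old, un-namespaced name is no longer consulted.
print(fsimage_plugin_dir({"FSIMAGE_FSDIR": "/tmp/plugins"}))
```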

Nothing in xen.git sets this variable.  The three hits for the string
`FSIMAGE_FSDIR' are this getenv, and two references to a compile-time
manifest constant which provides the default value (the -D which sets
it, and the place it is used).

I have also checked the current Debian Xen package in buster and the
variable is not set there either.

CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: George Dunlap <george.dunlap@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years ago  x86/mm: fix #GP(0) in switch_cr3_cr4()
Jan Beulich [Tue, 5 Mar 2019 16:02:36 +0000 (17:02 +0100)]
x86/mm: fix #GP(0) in switch_cr3_cr4()

With "pcid=no-xpti" and opposite XPTI settings in two 64-bit PV domains
(achievable with one of "xpti=no-dom0" or "xpti=no-domu"), switching
from a PCID-disabled to a PCID-enabled 64-bit PV domain fails to set
CR4.PCIDE in time, as CR4.PGE would not be set in either (see
pv_fixup_guest_cr4(), in particular as used by write_ptbase()), and
hence the early CR4 write would be skipped.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years ago  x86/pv: _toggle_guest_pt() may not skip TLB flush for shadow mode guests
Jan Beulich [Tue, 5 Mar 2019 12:54:42 +0000 (13:54 +0100)]
x86/pv: _toggle_guest_pt() may not skip TLB flush for shadow mode guests

For shadow mode guests (e.g. PV ones forced into that mode as L1TF
mitigation, or during migration) update_cr3() -> sh_update_cr3() may
result in a change to the (shadow) root page table (compared to the
previous one when running the same vCPU with the same PCID). This can,
first and foremost, be a result of memory pressure on the shadow memory
pool of the domain. Shadow code legitimately relies on the original
(prior to commit 5c81d260c2 ["xen/x86: use PCID feature"]) behavior of
the subsequent CR3 write to flush the TLB of entries still left from
walks with an earlier, different (shadow) root page table.

Restore the flushing behavior, also for the second CR3 write on the exit
path to guest context when XPTI is active. For the moment accept that
this will introduce more flushes than are strictly necessary - no flush
would be needed when the (shadow) root page table doesn't actually
change, but this information isn't readily (i.e. without introducing a
layering violation) available here.

This is XSA-294.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years ago  x86/pv: Don't have %cr4.fsgsbase active behind a guest kernels back
Andrew Cooper [Tue, 5 Mar 2019 12:54:05 +0000 (13:54 +0100)]
x86/pv: Don't have %cr4.fsgsbase active behind a guest kernels back

Currently, a 64bit PV guest can appear to set and clear FSGSBASE in %cr4, but
the bit remains set in hardware.  Therefore, the {RD,WR}{FS,GS}BASE are usable
even when the guest kernel believes that they are disabled.

The FSGSBASE feature isn't currently supported in Linux, and its context
switch path has some optimisations which rely on userspace being unable to use
the WR{FS,GS}BASE instructions.  Xen's current behaviour undermines this
expectation.

In 64bit PV guest context, always load the guest kernel's setting of FSGSBASE
into %cr4.  This requires adjusting how Xen uses the {RD,WR}{FS,GS}BASE
instructions.

 * Delete the cpu_has_fsgsbase helper.  It is no longer safe, as users need to
   check %cr4 directly.
 * The raw __rd{fs,gs}base() helpers are only safe to use when %cr4.fsgsbase
   is set.  Comment this property.
 * The {rd,wr}{fs,gs}{base,shadow}() and read_msr() helpers are updated to use
   the current %cr4 value to determine which mechanism to use.
 * toggle_guest_mode() and save_segments() are updated to avoid reading
   fs/gsbase if the values in hardware cannot be stale WRT struct vcpu.  A
   consequence of this is that the write_cr() path needs to cache the current
   bases, as subsequent context switches will skip saving the values.
 * write_cr4() is updated to ensure that the shadow %cr4.fsgsbase value is
   observed in a safe way WRT the hardware setting, if an interrupt happens to
   hit in the middle.
 * load_segments() is updated to use the VMLOAD optimisation if FSGSBASE is
   unavailable, even if only gs_shadow needs updating.  As a minor perf
   improvement, check cpu_has_svm first to short circuit a context-dependent
   conditional on Intel hardware.
 * pv_make_cr4() is updated for 64bit PV guests to use the guest kernel's
   choice of FSGSBASE.

This is part of XSA-293.

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years ago  x86/pv: Rewrite guest %cr4 handling from scratch
Andrew Cooper [Tue, 5 Mar 2019 12:53:32 +0000 (13:53 +0100)]
x86/pv: Rewrite guest %cr4 handling from scratch

The PV cr4 logic is almost impossible to follow, and leaks bits into guest
context which definitely shouldn't be visible (in particular, VMXE).

The biggest problem however, and source of the complexity, is that it derives
new real and guest cr4 values from the current value in hardware - this is
context dependent and an inappropriate source of information.

Rewrite the cr4 logic to be invariant of the current value in hardware.

First of all, modify write_ptbase() to always use mmu_cr4_features for IDLE
and HVM contexts.  mmu_cr4_features *is* the correct value to use, and makes
the ASSERT() obviously redundant.

For PV guests, curr->arch.pv.ctrlreg[4] remains the guests view of cr4, but
all logic gets reworked in terms of this and mmu_cr4_features only.

Two masks are introduced; bits which the guest has control over, and bits
which are forwarded from Xen's settings.  One guest-visible change here is
that Xen's VMXE setting is no longer visible at all.

pv_make_cr4() follows fairly closely from pv_guest_cr4_to_real_cr4(), but
deliberately starts with mmu_cr4_features, and only alters the minimal subset
of bits.
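
The two-mask idea can be modelled roughly as follows.  This is a sketch, not
Xen's pv_make_cr4(): the bit positions are the architectural CR4 ones, but the
membership of each mask and the function names are assumptions chosen for
illustration.

```python
# Architectural CR4 bit positions.
X86_CR4_TSD = 1 << 2
X86_CR4_DE = 1 << 3
X86_CR4_PAE = 1 << 5
X86_CR4_MCE = 1 << 6
X86_CR4_PGE = 1 << 7
X86_CR4_VMXE = 1 << 13

# ASSUMPTION: illustrative mask membership, not Xen's actual definitions.
GUEST_CONTROLLED = X86_CR4_TSD | X86_CR4_DE   # bits the guest may flip
XEN_FORWARDED = X86_CR4_PAE | X86_CR4_MCE     # bits shown from Xen's settings


def make_real_cr4(xen_cr4, guest_cr4):
    """Start from Xen's value and alter only the guest-controlled bits."""
    return (xen_cr4 & ~GUEST_CONTROLLED) | (guest_cr4 & GUEST_CONTROLLED)


def guest_visible_cr4(xen_cr4, guest_cr4):
    """Guest's view: its own bits plus the bits forwarded from Xen."""
    return (guest_cr4 & GUEST_CONTROLLED) | (xen_cr4 & XEN_FORWARDED)


xen = X86_CR4_PAE | X86_CR4_MCE | X86_CR4_PGE | X86_CR4_VMXE
guest = X86_CR4_DE
print(hex(make_real_cr4(xen, guest)), hex(guest_visible_cr4(xen, guest)))
```

Neither result depends on the value currently in hardware, and VMXE falls in
neither mask, so it stays set in the real value but never appears in the
guest's view.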

The boot-time {compat_,}pv_cr4_mask variables are removed, as they are a
remnant of the pre-CPUID policy days.  pv_fixup_guest_cr4() gains a related
derivation from the policy.

Another guest visible change here is that a 32bit PV guest can now flip
FSGSBASE in its view of CR4.  While the {RD,WR}{FS,GS}BASE instructions are
unusable outside of a 64bit code segment, the ability to modify FSGSBASE
matches real hardware behaviour, and avoids the need for any 32bit/64bit
differences in the logic.

Overall, this patch shouldn't have a practical change in guest behaviour.
VMXE will disappear from view, and an inquisitive 32bit kernel can now see
FSGSBASE changing, but this new logic is otherwise bug-compatible with before.

This is part of XSA-293.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years ago  x86/mm: properly flush TLB in switch_cr3_cr4()
Jan Beulich [Tue, 5 Mar 2019 12:52:44 +0000 (13:52 +0100)]
x86/mm: properly flush TLB in switch_cr3_cr4()

The CR3 values used for contexts run with PCID enabled uniformly have
CR3.NOFLUSH set, resulting in the CR3 write itself not causing any
flushing at all. When the second CR4 write is skipped or doesn't do any
flushing, there's nothing so far which would purge TLB entries which may
have accumulated again if the PCID doesn't change; the "just in case"
flush only affects the case where the PCID actually changes. (There may
be particularly many TLB entries re-accumulated in case of a watchdog
NMI kicking in during the critical time window.)

Suppress the no-flush behavior of the CR3 write in this particular case.
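
For illustration only, suppressing the no-flush behaviour amounts to clearing
bit 63 of the value written to CR3.  The helper below is a Python model with
an invented name; deciding when a flush is needed is left to the caller and is
not modelled here.

```python
# Architectural: with CR4.PCIDE set, bit 63 of the value written to CR3
# requests that the write does not flush, and bits 11:0 hold the PCID.
X86_CR3_NOFLUSH = 1 << 63
X86_CR3_PCID_MASK = 0xfff


def cr3_for_write(cr3, need_flush):
    """Return the value to write, dropping NOFLUSH when a flush is needed."""
    if need_flush:
        return cr3 & ~X86_CR3_NOFLUSH
    return cr3


# Made-up example value: page table base 0x1234000, PCID 5, NOFLUSH set.
example = X86_CR3_NOFLUSH | 0x1234000 | 5
print(hex(cr3_for_write(example, True)), hex(cr3_for_write(example, False)))
```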

Similarly the second CR4 write may not cause any flushing of TLB entries
established again while the original PCID was still in use - it may get
performed because of unrelated bits changing. The flush of the old PCID
needs to happen nevertheless.

At the same time also eliminate a possible race with lazy context
switch: Just like for CR4, CR3 may change at any time while interrupts
are enabled, due to the __sync_local_execstate() invocation from the
flush IPI handler. It is for that reason that the CR3 read, just like
the CR4 one, must happen only after interrupts have been turned off.

This is XSA-292.

Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com>
6 years ago  x86/mm: don't retain page type reference when IOMMU operation fails
Jan Beulich [Tue, 5 Mar 2019 12:52:15 +0000 (13:52 +0100)]
x86/mm: don't retain page type reference when IOMMU operation fails

The IOMMU update in _get_page_type() happens between recording of the
new reference and validation of the page for its new type (if
necessary). If the IOMMU operation fails, there's no point in actually
carrying out validation. Furthermore, with this resulting in failure
getting indicated to the caller, the recorded type reference also needs
to be dropped again.

Note that in case of failure of alloc_page_type() there's no need to
undo the IOMMU operation: Only special types get handed to the function.
The function, upon failure, clears ->u.inuse.type_info, effectively
converting the page to PGT_none. The IOMMU mapping, however, solely
depends on whether the type is PGT_writable_page.

This is XSA-291.

Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years ago  x86/mm: add explicit preemption checks to L3 (un)validation
Jan Beulich [Tue, 5 Mar 2019 12:51:46 +0000 (13:51 +0100)]
x86/mm: add explicit preemption checks to L3 (un)validation

When recursive page tables are used at the L3 level, unvalidation of a
single L4 table may incur unvalidation of two levels of L3 tables, i.e.
a maximum iteration count of 512^3 for unvalidating an L4 table. The
preemption check in free_l2_table() as well as the one in
_put_page_type() may never be reached, so explicit checking is needed in
free_l3_table().

When recursive page tables are used at the L4 level, the iteration count
at L4 alone is capped at 512^2. As soon as a present L3 entry is hit
which itself needs unvalidation (and hence requires another nested loop
with 512 iterations), the preemption checks added here kick in, so no
further preemption checking is needed at L4 (until we decide to permit
5-level paging for PV guests).
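
To put numbers on the iteration counts above, and to show the general shape
of a loop with an explicit preemption check, here is a small Python model.
free_table(), the status strings and the preempt callback are invented for
illustration; this is not the free_l3_table() code.

```python
ENTRIES_PER_TABLE = 512

# Worst case quoted above for unvalidating one L4 table.
worst_case_l4_unvalidate = ENTRIES_PER_TABLE ** 3   # 134217728
# Cap at L4 alone when recursion happens at the L4 level.
l4_only_cap = ENTRIES_PER_TABLE ** 2                # 262144

ERESTART = "restart"  # stand-in for a "please re-invoke" status


def free_table(entries, start, preempt_pending):
    """Clear entries from 'start' onwards; return (status, next_index).

    A preemption check runs before every entry except the first one of an
    invocation, so each invocation makes forward progress.
    """
    for i in range(start, len(entries)):
        if i > start and preempt_pending():
            return ERESTART, i
        entries[i] = None
    return "done", len(entries)
```

A caller simply re-invokes the function with the returned index until it
reports completion.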

The validation side additions are done just for symmetry.

This is part of XSA-290.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years ago  x86/mm: also allow L2 (un)validation to be fully preemptible
Jan Beulich [Tue, 5 Mar 2019 12:51:18 +0000 (13:51 +0100)]
x86/mm: also allow L2 (un)validation to be fully preemptible

Commit c612481d1c ("x86/mm: Plumbing to allow any PTE update to fail
with -ERESTART") added assertions next to the {alloc,free}_l2_table()
invocations to document (and validate in debug builds) that L2
(un)validations are always preemptible.

The assertion in free_page_type() was now observed to trigger when
recursive L2 page tables get cleaned up.

In particular put_page_from_l2e()'s assumption that _put_page_type()
would always succeed is now wrong, resulting in a partially un-validated
page left in a domain, which has no other means of getting cleaned up
later on. If not causing any problems earlier, this would ultimately
trigger the check for ->u.inuse.type_info having a zero count when
freeing the page during cleanup after the domain has died.

As a result it should be considered a mistake to not have extended
preemption fully to L2 when it was added to L3/L4 table handling, which
this change aims to correct.

The validation side additions are done just for symmetry.

This is part of XSA-290.

Reported-by: Manuel Bouyer <bouyer@antioche.eu.org>
Tested-by: Manuel Bouyer <bouyer@antioche.eu.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years ago  xen: Make coherent PV IOMMU discipline
George Dunlap [Wed, 23 Jan 2019 11:57:46 +0000 (11:57 +0000)]
xen: Make coherent PV IOMMU discipline

In order for a PV domain to set up DMA from a passed-through device to
one of its pages, the page must be mapped in the IOMMU.  On the other
hand, before a PV page may be used as a "special" page type (such as a
pagetable or descriptor table), it _must not_ be writable in the IOMMU
(otherwise a malicious guest could DMA arbitrary page tables into the
memory, bypassing Xen's safety checks); and Xen's current rule is to
have such pages not in the IOMMU at all.

At the moment, in order to accomplish this, the code borrows HVM
domain's "physmap" concept: When a page is assigned to a guest,
guest_physmap_add_entry() is called, which for PV guests, will create
a writable IOMMU mapping; and when a page is removed,
guest_physmap_remove_entry() is called, which will remove the mapping.

Additionally, when a page gains the PGT_writable page type, the page
will be added into the IOMMU; and when the page changes away from a
PGT_writable type, the page will be removed from the IOMMU.

Unfortunately, borrowing the "physmap" concept from HVM domains is
problematic.  HVM domains have a lock on their p2m tables, ensuring
synchronization between modifications to the p2m; and all hypercall
parameters must first be translated through the p2m before being used.

Trying to mix this locked-and-gated approach with PV's lock-free
approach leads to several races and inconsistencies:

* A race between a page being assigned and it being put into the
  physmap; for example:
  - P1: call populate_physmap() { A = allocate_domheap_pages() }
  - P2: Guess page A's mfn, and call decrease_reservation(A).  A is owned by
        the domain, and so Xen will clear the PGC_allocated bit and free the
        page
  - P1: finishes populate_physmap() { guest_physmap_add_entry() }

  Now the domain has a writable IOMMU mapping to a page it no longer owns.

* Pages start out as type PGT_none, but with a writable IOMMU mapping.
  If a guest uses a page as a page table without ever having created a
  writable mapping, the IOMMU mapping will not be removed; the guest
  will have a writable IOMMU mapping to a page it is currently using
  as a page table.

* A newly-allocated page can be DMA'd into with no special actions on
  the part of the guest; however, if a page is promoted to a
  non-writable type, the page must be mapped with a writable type before
  DMA'ing to it again, or the transaction will fail.

To fix this, do away with the "PV physmap" concept entirely, and
replace it with the following IOMMU discipline for PV guests:
 - (type == PGT_writable) <=> in iommu (even if type_count == 0)
 - Upon a final put_page(), check to see if type is PGT_writable; if so,
   iommu_unmap.
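
The discipline can be pictured with a toy state model.  This is a Python
sketch under simplifying assumptions (no locking, no validation, string
constants in place of the real type bits); the function names mirror the Xen
ones only loosely and the bodies are invented.

```python
PGT_none = "none"
PGT_writable = "writable"
PGT_l1_page_table = "l1_page_table"


class Page(object):
    def __init__(self):
        self.type = PGT_none
        self.type_count = 0
        self.in_iommu = False


def get_page_type(page, new_type):
    """Take a type reference, keeping the IOMMU mapping in step with the type."""
    if page.type != new_type:
        if page.type_count != 0:
            return False                      # current type is still in use
        page.in_iommu = (new_type == PGT_writable)
        page.type = new_type
    page.type_count += 1
    return True


def put_page_type(page):
    """Drop a type reference; the type and the mapping survive at count 0."""
    page.type_count -= 1


def final_put_page(page):
    """Last general reference gone: unmap if the type is still writable."""
    if page.type == PGT_writable:
        page.in_iommu = False
    page.type = PGT_none


def invariant_holds(page):
    return (page.type == PGT_writable) == page.in_iommu
```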

In order to achieve that:

- Remove PV IOMMU related code from guest_physmap_*

- Repurpose cleanup_page_cacheattr() into a general
  cleanup_page_mappings() function, which will both fix up Xen
  mappings for pages with special cache attributes, and also check for
  a PGT_writable type and remove pages if appropriate.

- For compatibility with current guests, grab-and-release a
  PGT_writable_page type for PV guests in guest_physmap_add_entry().
  This will cause most "normal" guest pages to start out life with
  PGT_writable_page type (and thus an IOMMU mapping), but no type
  count (so that they can be used as special cases at will).

Also, note that there is one exception to the "PGT_writable => in
iommu" rule: xenheap pages shared with guests may be given a
PGT_writable type with one type reference.  This reference prevents
the type from changing, which in turn prevents the page from gaining an
IOMMU mapping in get_page_type().  It's not clear whether this was
intentional or not, but it's not something to change in a security
update.

This is XSA-288.

Reported-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
6 years ago  steal_page: Get rid of bogus struct page states
George Dunlap [Fri, 18 Jan 2019 15:00:34 +0000 (15:00 +0000)]
steal_page: Get rid of bogus struct page states

The original rules for `struct page` required the following invariants
at all times:

- refcount > 0 implies owner != NULL
- PGC_allocated implies refcount > 0

steal_page, in a misguided attempt to protect against unknown races,
violates both of these rules, thus introducing other races:

- Temporarily, the count_info has the refcount go to 0 while
  PGC_allocated is set

- It explicitly returns the page PGC_allocated set, but owner == NULL
  and page not on the page_list.
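
The two invariants and the two bogus states can be written down as a small
check.  This is a Python model for illustration only; the refcount of 1 in
the second state is an assumption (the PGC_allocated reference being held),
and page_state_is_valid() is an invented name.

```python
def page_state_is_valid(refcount, owner, allocated):
    """Check the two `struct page` invariants stated above."""
    if refcount > 0 and owner is None:
        return False          # refcount > 0 implies owner != NULL
    if allocated and refcount == 0:
        return False          # PGC_allocated implies refcount > 0
    return True


# State 1: refcount temporarily 0 while PGC_allocated is still set.
transient = page_state_is_valid(refcount=0, owner="d1", allocated=True)

# State 2: page returned with PGC_allocated set but no owner
# (ASSUMPTION: the allocation reference is held, i.e. refcount == 1).
returned = page_state_is_valid(refcount=1, owner=None, allocated=True)

print(transient, returned)
```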

The second one meant that page_get_owner_and_reference() could return
NULL even after having successfully grabbed a reference on the page,
leading the caller to leak the reference (since "couldn't get ref" and
"got ref but no owner" look the same).

Furthermore, rather than grabbing a page reference to ensure that the
owner doesn't change under its feet, it appears to rely on holding
d->page_alloc lock to prevent this.

Unfortunately, this is ineffective: page->owner remains non-NULL for
some time after the count has been set to 0; meaning that it would be
entirely possible for the page to be freed and re-allocated to a
different domain between the page_get_owner() check and the count_info
check.

Modify steal_page to instead follow the appropriate access discipline,
taking the page through a series of states similar to being freed and
then re-allocated with MEMF_no_owner:

- Grab an extra reference to make sure we don't race with anyone else
  freeing the page

- Drop both references and PGC_allocated atomically, so that (if
  successful), anyone else trying to grab a reference will fail

- Attempt to reset Xen's mappings

- Reset the rest of the state.

Then, modify the two callers appropriately:

- Leave count_info alone (it's already been cleared)
- Call free_domheap_page() directly if appropriate
- Call assign_pages() rather than open-coding a partial assign

With all callers to assign_pages() now passing in pages with the
type_info field clear, tighten the respective assertion there.

This is XSA-287.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
6 years ago  IOMMU/x86: fix type ref-counting race upon IOMMU page table construction
Jan Beulich [Tue, 5 Mar 2019 12:47:36 +0000 (13:47 +0100)]
IOMMU/x86: fix type ref-counting race upon IOMMU page table construction

When arch_iommu_populate_page_table() gets invoked for an already
running guest, simply looking at page types once isn't enough, as they
may change at any time. Add logic to re-check the type after having
mapped the page, unmapping it again if needed.

This is XSA-285.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tentatively-Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years ago  gnttab: set page refcount for copy-on-grant-transfer
Jan Beulich [Tue, 5 Mar 2019 12:45:58 +0000 (13:45 +0100)]
gnttab: set page refcount for copy-on-grant-transfer

Commit 5cc77f9098 ("32-on-64: Fix domain address-size clamping,
implement"), which introduced this functionality, took care of clearing
the old page's PGC_allocated, but failed to set the bit (and install the
associated reference) on the newly allocated one. Furthermore the "mfn"
local variable was never updated, and hence the wrong MFN was passed to
guest_physmap_add_page() (and back to the destination domain) in this
case, leading to an IOMMU mapping into an unowned page.

Ideally the code would use assign_pages(), but the call to
gnttab_prepare_for_transfer() sits in the middle of the actions
mirroring that function.

This is XSA-284.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
6 years ago  tools/tests: Drop obsolete test infrastructure
Andrew Cooper [Mon, 25 Feb 2019 13:06:22 +0000 (13:06 +0000)]
tools/tests: Drop obsolete test infrastructure

The regression/ directory was identified as already broken in 2012 (c/s
953953cc5).  The logic is intended to test *.py files in the Xen tree against
different versions of python, but every python version is obsolete as well.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years ago  npt/shadow: allow getting foreign page table entries
Roger Pau Monne [Wed, 27 Feb 2019 11:09:05 +0000 (12:09 +0100)]
npt/shadow: allow getting foreign page table entries

The current npt and shadow code to get an entry will always return
INVALID_MFN for foreign entries. Allow returning the entry mfn for
foreign entries, like it's done for grant table entries.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
6 years ago  x86/mm: handle foreign mappings in p2m_entry_modify
Roger Pau Monne [Wed, 27 Feb 2019 11:09:04 +0000 (12:09 +0100)]
x86/mm: handle foreign mappings in p2m_entry_modify

So that the specific handling can be removed from
atomic_write_ept_entry and be shared with npt and shadow code.

This commit also removes the check that prevents non-ept PVH dom0 from
mapping foreign pages.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
6 years ago  p2m: change write_p2m_entry to return an error code
Roger Pau Monne [Wed, 27 Feb 2019 11:09:03 +0000 (12:09 +0100)]
p2m: change write_p2m_entry to return an error code

This is in preparation for also changing p2m_entry_modify to return an
error code.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
6 years ago  x86/mm: split p2m ioreq server pages special handling into helper
Roger Pau Monne [Wed, 27 Feb 2019 11:09:02 +0000 (12:09 +0100)]
x86/mm: split p2m ioreq server pages special handling into helper

So that it can be shared by ept, npt and shadow code, instead of
duplicating it.

No change in functionality intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
6 years ago  x86/p2m: pass the p2m to write_p2m_entry handlers
Roger Pau Monne [Wed, 27 Feb 2019 11:09:01 +0000 (12:09 +0100)]
x86/p2m: pass the p2m to write_p2m_entry handlers

Current callers pass the p2m to paging_write_p2m_entry, but the
implementation-specific handlers of the write_p2m_entry hook get a
domain struct instead of a p2m, due to the handling done in
paging_write_p2m_entry.

Change the code so that the implementations of write_p2m_entry take a
p2m instead of a domain.

This is a non-functional change, but will be used by follow up
patches.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
6 years ago  x86/nmi: correctly check MSB of P6 performance counter MSR in watchdog
Igor Druzhinin [Tue, 26 Feb 2019 17:44:06 +0000 (17:44 +0000)]
x86/nmi: correctly check MSB of P6 performance counter MSR in watchdog

The logic currently tries to work out if a recent overflow (that indicates
that NMI comes from the watchdog) happened by checking MSB of performance
counter MSR that is initially sign extended from a negative value
that we program it to. A possibly incorrect assumption here is that
MSB is always bit 32 while on modern hardware it's usually 47 and
the actual bit-width is reported through CPUID. Checking bit 32 for
overflows is usually fine since we never program it to anything
exceeding 32-bits and NMI is handled shortly after overflow occurs.

A problematic scenario that we saw occurs on systems where SMIs taking
significant time are possible. In that case, NMI handling is deferred until
firmware exits SMI, which might take enough time for the counter
to go through bit 32 and set it to 1 again. So the logic described above
will misread it and report an unknown NMI erroneously.

Fortunately, we can use the actual MSB, which is usually higher than the
currently hardcoded 32, and treat this case correctly at least on modern
hardware.
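
The failure mode can be reproduced with a toy model of the counter.  This is
a Python sketch, not the watchdog code: the 48-bit width, the period and the
helper names are assumptions chosen to match the description above.

```python
def counter_after(width, period, ticks):
    """Counter value 'ticks' events after being programmed to -period.

    The programmed value is -period truncated to 'width' bits, so all of
    the upper bits start out set.
    """
    start = (1 << width) - period
    return (start + ticks) & ((1 << width) - 1)


def overflow_seen(counter, msb_index):
    """The watchdog treats a clear MSB as 'the counter has overflowed'."""
    return ((counter >> msb_index) & 1) == 0


WIDTH = 48          # assumed counter width, as would be reported by CPUID
PERIOD = 1000000    # made-up reload value

prompt = counter_after(WIDTH, PERIOD, PERIOD + 10)
delayed = counter_after(WIDTH, PERIOD, PERIOD + (1 << 32) + 10)

# Handled promptly: both checks agree that an overflow happened.
print(overflow_seen(prompt, 32), overflow_seen(prompt, WIDTH - 1))
# Handled late (e.g. after a long SMI): bit 32 is set again, so the
# hardcoded check misses the overflow while the real MSB still shows it.
print(overflow_seen(delayed, 32), overflow_seen(delayed, WIDTH - 1))
```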

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years ago  x86/altp2m: Fix build with !CONFIG_HVM
Andrew Cooper [Thu, 28 Feb 2019 12:49:13 +0000 (12:49 +0000)]
x86/altp2m: Fix build with !CONFIG_HVM

c/s 0ec9b4ef3148 "x86/vmx: Fix security issue when a guest balloons out the #VE
info page" introduced a caller of altp2m_vcpu_disable_ve() in a common path,
but c/s e72ecc761541 "x86/altp2m: Rework #VE enable/disable paths" didn't have
a suitable prototype in the !CONFIG_HVM case.

Introduce one to fix the build.

Spotted by Travis.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years ago  x86/hvm: Increase the triple fault log message level to XENLOG_ERR
Andrew Cooper [Thu, 28 Feb 2019 10:48:16 +0000 (10:48 +0000)]
x86/hvm: Increase the triple fault log message level to XENLOG_ERR

At INFO level, it doesn't get printed out by default in release builds,
leading to unqualified logging such as this:

  (XEN) [   66.995993] Freed 524kB init memory
  (XEN) [ 1993.144997] *** Dumping Dom9 vcpu#2 state: ***
  (XEN) [ 1993.145008] ----[ Xen-4.11.1  x86_64  debug=n   Not tainted ]----
  (XEN) [ 1993.145011] CPU:    21
  (XEN) [ 1993.145015] RIP:    0010:[<ffffe0002ba950ef>]
  (XEN) [ 1993.145018] RFLAGS: 0000000000010246   CONTEXT: hvm guest (d9v2)
  (XEN) [ 1993.145026] rax: 00000000ffffe000   rbx: ffffe0002d8e1440   rcx: 0000ffffe0002ba9
  (XEN) [ 1993.145031] rdx: 0000000000000000   rsi: ffffe0002ba93575   rdi: fffff803dfb9f340
  (XEN) [ 1993.145035] rbp: ffffd001cd791200   rsp: ffffd001cd791140   r8:  0000000000000130
  (XEN) [ 1993.145039] r9:  0000000080000000   r10: 0000000000000000   r11: 0000000000000020
  (XEN) [ 1993.145043] r12: ffffe0002ba9306d   r13: 0000000000000000   r14: 0000000000000001
  (XEN) [ 1993.145047] r15: fffff803dfb9f200   cr0: 0000000080050031   cr4: 0000000000170678
  (XEN) [ 1993.145051] cr3: 00000000001aa002   cr2: 0000020488403f70
  (XEN) [ 1993.145056] fsb: 0000000060f71000   gsb: ffffd001cc1af000   gss: 0000009d60f6f000
  (XEN) [ 1993.145060] ds: 002b   es: 002b   fs: 0053   gs: 002b   ss: 0018   cs: 0010

A triple fault is fatal to the domain under all circumstances (so will print
at most once), and in practice is always an error condition rather than a
reboot fallback.

Reported-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agotools/ocaml: Dup2 /dev/null to stdin in daemonize()
Christian Lindig [Wed, 27 Feb 2019 10:33:42 +0000 (10:33 +0000)]
tools/ocaml: Dup2 /dev/null to stdin in daemonize()

Don't close stdin in daemonize() but dup2 /dev/null instead.  Otherwise, fd 0
gets reused later:

  [root@idol ~]# ls -lav /proc/`pgrep xenstored`/fd
  total 0
  dr-x------ 2 root root  0 Feb 28 11:02 .
  dr-xr-xr-x 9 root root  0 Feb 27 15:59 ..
  lrwx------ 1 root root 64 Feb 28 11:02 0 -> /dev/xen/evtchn
  l-wx------ 1 root root 64 Feb 28 11:02 1 -> /dev/null
  l-wx------ 1 root root 64 Feb 28 11:02 2 -> /dev/null
  lrwx------ 1 root root 64 Feb 28 11:02 3 -> /dev/xen/privcmd
  ...
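
The same idea as a C sketch using plain POSIX calls. This is an illustration
only, not the OCaml code touched by the patch, and the function name is
hypothetical:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Point fd 0 at /dev/null instead of closing it, so that a later open()
 * cannot hand out descriptor 0. Returns 0 on success, -1 on error.
 */
static int redirect_stdin_to_devnull(void)
{
    int fd = open("/dev/null", O_RDONLY);

    if (fd < 0)
        return -1;

    if (fd != STDIN_FILENO) {
        if (dup2(fd, STDIN_FILENO) < 0) {
            close(fd);
            return -1;
        }
        close(fd);   /* fd 0 now refers to /dev/null; drop the spare */
    }
    return 0;
}
```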

Signed-off-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/vmx: Properly flush the TLB when an altp2m is modified
Andrew Cooper [Mon, 11 Feb 2019 13:31:02 +0000 (13:31 +0000)]
x86/vmx: Properly flush the TLB when an altp2m is modified

Modifications to an altp2m mark the p2m as needing flushing, but this was
never wired up in the return-to-guest path.  As a result, stale TLB entries
can remain after resuming the guest.

In practice, this manifests as a missing EPT_VIOLATION or #VE exception when
the guest subsequently accesses a page which has had its permissions reduced.

vmx_vmenter_helper() now has 11 p2ms to potentially invalidate, but issuing 11
INVEPT instructions isn't clever.  Instead, count how many contexts need
invalidating, and use INVEPT_ALL_CONTEXT if two or more are in need of
flushing.
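
The counting rule can be sketched as follows. The types and names are
hypothetical; the real code works on EPT pointers and issues INVEPT itself:

```c
#include <stdbool.h>
#include <stddef.h>

enum flush_kind { FLUSH_NONE, FLUSH_SINGLE, FLUSH_ALL };

/*
 * Hypothetical sketch: dirty[i] says whether context i needs flushing.
 * '*which' is only meaningful when FLUSH_SINGLE is returned.
 */
static enum flush_kind pick_flush(const bool *dirty, size_t nr, size_t *which)
{
    size_t count = 0;

    for (size_t i = 0; i < nr; i++) {
        if (dirty[i]) {
            *which = i;
            if (++count > 1)
                return FLUSH_ALL;   /* two or more: one global flush */
        }
    }
    return count ? FLUSH_SINGLE : FLUSH_NONE;
}
```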

This doesn't have an XSA because altp2m is not yet a security-supported
feature.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
6 years agox86/vmx: Fix security issue when a guest balloons out the #VE info page
Andrew Cooper [Thu, 17 Jan 2019 12:26:17 +0000 (12:26 +0000)]
x86/vmx: Fix security issue when a guest balloons out the #VE info page

The logic in altp2m_vcpu_{en,dis}able_ve() and vmx_vcpu_update_vmfunc_ve() is
dangerous.  After #VE has been set up, the guest can balloon out and free the
nominated GFN, after which the processor may write to it.  Also, the unlocked
GFN query means the MFN is stale by the time it is used.  Alternatively, a
guest can race two disable calls to cause one VMCS to still reference the
nominated GFN after the tracking information was dropped.

Rework the logic from scratch to make it safe.

Hold an extra page reference on the underlying frame, to account for the
VMCS's reference.  This means that if the GFN gets ballooned out, it isn't
freed back to Xen until #VE is disabled, and the VMCS no longer refers to the
page.

A consequence of this is that altp2m_vcpu_disable_ve() needs to be called
during the domain_kill() path, to drop the reference for domains which shut
down with #VE still enabled.

For domains using altp2m, we expect a single enable call and no disable for
the remaining lifetime of the domain.  However, to avoid problems with
concurrent calls, use cmpxchg() to locklessly maintain safety.
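
One way such lockless tracking can look, written with C11 atomics rather than
Xen's cmpxchg(). All names here are hypothetical, and the sketch only shows
the pattern: exactly one caller wins each transition, so the page reference
is taken and dropped once.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_PAGE UINT64_MAX   /* hypothetical "nothing nominated" marker */

struct ve_sketch {
    _Atomic uint64_t info_page;
};

/* Succeeds only if nothing was nominated before; racing enables lose. */
static bool ve_enable(struct ve_sketch *s, uint64_t page)
{
    uint64_t expected = NO_PAGE;

    return atomic_compare_exchange_strong(&s->info_page, &expected, page);
}

/*
 * Returns the page that was nominated, or NO_PAGE. Of two racing disable
 * calls only one sees the page, so only one drops the reference.
 */
static uint64_t ve_disable(struct ve_sketch *s)
{
    return atomic_exchange(&s->info_page, NO_PAGE);
}
```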

This doesn't have an XSA because altp2m is not yet a security-supported
feature.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/vioapic: block speculative out-of-bound accesses
Norbert Manthey [Tue, 26 Feb 2019 15:57:56 +0000 (16:57 +0100)]
x86/vioapic: block speculative out-of-bound accesses

When interacting with io apic, a guest can specify values that are used
as index to structures, and whose values are not compared against
upper bounds to prevent speculative out-of-bound accesses. This change
prevents these speculative accesses.

Furthermore, variables are initialized and the compiler is asked not to
optimize away these initializations, as the uninitialized variables might be
used in a speculative out-of-bound access. Out of the four initialized
variables, two are potentially problematic, namely ones in the functions
vioapic_irq_positive_edge and vioapic_get_trigger_mode.

As the two problematic variables are both used in the common function
gsi_vioapic, the mitigation is implemented there. As the access pattern
of the currently non-guest-controlled functions might change in the
future as well, the other variables are initialized as well.

This is part of the speculative hardening effort.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoevtchn: block speculative out-of-bound accesses
Norbert Manthey [Tue, 26 Feb 2019 15:57:18 +0000 (16:57 +0100)]
evtchn: block speculative out-of-bound accesses

Guests can issue event channel interaction with guest specified data.
To avoid speculative out-of-bound accesses, we use the nospec macros,
or the domain_vcpu function. Where appropriate, we use the vcpu_id of
the selected vcpu instead of the parameter that can be influenced by
the guest, so that only one access needs to be protected.

This is part of the speculative hardening effort.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/shadow: don't use map_domain_page_global() on paths that may not fail
Jan Beulich [Tue, 26 Feb 2019 15:56:26 +0000 (16:56 +0100)]
x86/shadow: don't use map_domain_page_global() on paths that may not fail

The assumption (according to one comment) and hope (according to
another) that map_domain_page_global() can't fail are both wrong on
large enough systems. Do away with the guest_vtable field altogether,
and establish / tear down the desired mapping as necessary.

The alternatives, discarded as being undesirable, would have been to
either crash the guest in sh_update_cr3() when the mapping fails, or to
bubble up an error indicator, which upper layers would have a hard time
dealing with (other than again by crashing the guest).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoviridian: fix the HvFlushVirtualAddress/List hypercall implementation
Paul Durrant [Tue, 26 Feb 2019 15:55:06 +0000 (16:55 +0100)]
viridian: fix the HvFlushVirtualAddress/List hypercall implementation

The current code uses hvm_asid_flush_vcpu() but this is insufficient for
a guest running in shadow mode, which results in guest crashes early in
boot if the 'hcall_remote_tlb_flush' is enabled.

This patch, instead of open coding a new flush algorithm, adapts the one
already used by the HVMOP_flush_tlbs Xen hypercall. The implementation is
modified to allow TLB flushing a subset of a domain's vCPUs. A callback
function determines whether or not a vCPU requires flushing. This mechanism
was chosen because, while it is the case that the currently implemented
viridian hypercalls specify a vCPU mask, there are newer variants which
specify a sparse HV_VP_SET and thus use of a callback will avoid needing to
expose details of this outside of the viridian subsystem if and when those
newer variants are implemented.
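
A sketch of the callback approach, with hypothetical names and a counter
standing in for the actual flush. The caller decides how its vCPU set is
encoded; the common code only asks "does this vCPU need flushing?":

```c
#include <stdbool.h>
#include <stdint.h>

typedef bool (*flush_vcpu_fn)(void *ctxt, unsigned int vcpu_id);

/* Common code: walks all vCPUs and flushes those the callback selects. */
static unsigned int flush_selected(unsigned int nr_vcpus,
                                   flush_vcpu_fn need_flush, void *ctxt)
{
    unsigned int flushed = 0;

    for (unsigned int id = 0; id < nr_vcpus; id++)
        if (need_flush(ctxt, id))
            flushed++;   /* a real implementation would flush here */

    return flushed;
}

/* Caller-side callback for a plain 64-bit vCPU mask. */
static bool in_vcpu_mask(void *ctxt, unsigned int vcpu_id)
{
    uint64_t mask = *(const uint64_t *)ctxt;

    return vcpu_id < 64 && ((mask >> vcpu_id) & 1);
}
```

A sparse set representation would only need a different callback; the common
function stays unchanged.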

NOTE: Use of the common flush function requires that the hypercalls are
      restartable and so, with this patch applied, viridian_hypercall()
      can now return HVM_HCALL_preempted. This is safe as no modification
      to struct cpu_user_regs is done before the return.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoxen/arm: domain_build: Panic message should end with a newline
Julien Grall [Mon, 18 Feb 2019 10:21:06 +0000 (10:21 +0000)]
xen/arm: domain_build: Panic message should end with a newline

Since commit 25eb5eec79 "xen: Fix inconsistent callers of panic()" all
panic messages should end with a newline. Unfortunately, some
commits pushed afterwards do not follow the rule.

Modify the offending panic messages to avoid more inconsistency.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoxen/arm: domain_build: Require the property "cpus" when building a domU
Julien Grall [Mon, 18 Feb 2019 10:14:36 +0000 (10:14 +0000)]
xen/arm: domain_build: Require the property "cpus" when building a domU

The 3rd argument of function dt_property_read_u32() is only valid when
the call succeeded. So we cannot assume the value will not be modified
in case of failure.

The documentation of Dom0less does not give a default value when the
property "cpus" is not set. So require the property in the configuration.

Coverity-ID: 1476825
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoxen/arm: psci: Populate arm_smccc_res on PSCI_FEATURES call
Julien Grall [Mon, 18 Feb 2019 09:42:27 +0000 (09:42 +0000)]
xen/arm: psci: Populate arm_smccc_res on PSCI_FEATURES call

Commit 0bc6a68da5 "xen/arm: Replace call_smc with arm_smccc_smc"
mistakenly forgot to populate arm_smccc_res. So a garbage value was
used as return value.

Coverity-ID: 1476827
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/altp2m: Rework #VE enable/disable paths
Andrew Cooper [Thu, 17 Jan 2019 12:26:17 +0000 (12:26 +0000)]
x86/altp2m: Rework #VE enable/disable paths

Split altp2m_vcpu_{enable,disable}_ve() out of the
HVMOP_altp2m_vcpu_{enable,disable}_notify marshalling logic.  A future change
is going to need to call altp2m_vcpu_disable_ve() from the domain_kill() path.

While at it, clean up the logic in altp2m_vcpu_{initialise,destroy}().
altp2m_vcpu_reset() has no external callers, so fold it into its two
callsites.  This in turn allows for altp2m_vcpu_destroy() to reuse
altp2m_vcpu_disable_ve() rather than opencoding it.

No practical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86: Improve the efficiency of domain_relinquish_resources()
Andrew Cooper [Wed, 20 Feb 2019 13:39:20 +0000 (13:39 +0000)]
x86: Improve the efficiency of domain_relinquish_resources()

pci_release_devices() takes the global PCI lock.  Once pci_release_devices()
has completed, it will be called redundantly each time paging_teardown() and
vcpu_destroy_pagetables() continue.

This is liable to be millions of times for a reasonably sized guest, and is a
serialising bottleneck now that domain_kill() can be run concurrently on
different domains.

Instead of propagating the opencoding of the relinquish state machine, take
the opportunity to clean it up.

Leave a proper set of comments explaining that domain_relinquish_resources()
implements a co-routine.  Introduce a documented PROGRESS() macro to avoid
latent bugs such as the RELMEM_xen case, and make the new PROG_* states
private to domain_relinquish_resources().
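
A reduced sketch of such a switch-based co-routine. The states, the structure
and the macro body are assumptions made for illustration and are not copied
from the patch:

```c
/* Stand-in for -ERESTART; the value is arbitrary in this sketch. */
#define SKETCH_ERESTART 85

enum { PROG_none, PROG_paging, PROG_done };

struct dom_sketch {
    int rel_priv;                /* where to resume on re-entry */
    unsigned int dev_releases;   /* how often the one-off step ran */
    unsigned int pages_left;     /* preemptible work still to do */
};

/* Record the new state, then fall through into its case label. */
#define PROGRESS(x) d->rel_priv = PROG_##x; /* fallthrough */ case PROG_##x

static int relinquish(struct dom_sketch *d)
{
    switch (d->rel_priv) {
    case PROG_none:
        d->dev_releases++;            /* expensive one-off step */

    PROGRESS(paging):
        if (d->pages_left) {
            d->pages_left--;
            return -SKETCH_ERESTART;  /* ask the caller to call again */
        }

    PROGRESS(done):
        break;
    }
    return 0;
}
```

Because the state is recorded before each step, a continuation re-enters at
the paging step and never repeats the one-off device release.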

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/shadow: don't pass wrong L4 MFN to guest_walk_tables()
Jan Beulich [Wed, 20 Feb 2019 16:07:17 +0000 (17:07 +0100)]
x86/shadow: don't pass wrong L4 MFN to guest_walk_tables()

64-bit PV guest user mode runs on a different L4 table. Make sure
- the accessed bit gets set in the correct table (and in log-dirty
  mode the correct page gets marked dirty) during guest walks,
- the correct table gets audited by sh_audit_gw(),
- correct info gets logged by print_gw().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/pmtimer: fix hvm_acpi_sleep_button behavior
Varad Gautam [Wed, 20 Feb 2019 16:06:25 +0000 (17:06 +0100)]
x86/pmtimer: fix hvm_acpi_sleep_button behavior

Commit 19fb14622e941 "x86/pmtimer: move ACPI registers from PMTState to
hvm_domain" misconfigures pm1a_sts for hvm_acpi_sleep_button with
PWRBTN_STS instead of SLPBTN_STS, which leads to
XEN_DOMCTL_SENDTRIGGER_SLEEP causing guest powerdowns. Fix this.
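
For illustration, the two status bits differ only in position. The bit numbers
below follow the ACPI PM1 status register layout; the helper functions are
hypothetical stand-ins for the two button handlers:

```c
#include <stdint.h>

/* PM1 status bits as laid out by the ACPI specification. */
#define PWRBTN_STS (UINT16_C(1) << 8)
#define SLPBTN_STS (UINT16_C(1) << 9)

/* Hypothetical stand-ins: each button must raise its own status bit. */
static void press_power_button(uint16_t *pm1a_sts)
{
    *pm1a_sts |= PWRBTN_STS;
}

static void press_sleep_button(uint16_t *pm1a_sts)
{
    *pm1a_sts |= SLPBTN_STS;
}
```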

Signed-off-by: Varad Gautam <vrd@amazon.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years agox86/vpmu: Improve documentation and parsing for vpmu=
Andrew Cooper [Fri, 1 Feb 2019 16:56:38 +0000 (16:56 +0000)]
x86/vpmu: Improve documentation and parsing for vpmu=

The behaviour of vpmu=<bool> being exclusive of vpmu=bts|ipc|arch is odd and
contrary to Xen's normal command line parsing behaviour.  Rewrite the parsing
to use the normal form, but retain the previous behaviour where the use of
bts/ipc/arch implies vpmu=true.
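
A simplified sketch of list-style parsing with the "feature implies enabled"
rule. The flag values and names are hypothetical, and only the literal
booleans "true" and "false" are handled here, whereas Xen's real boolean
parsing accepts more spellings:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag values for this sketch only. */
#define SK_BTS  (1u << 0)
#define SK_IPC  (1u << 1)
#define SK_ARCH (1u << 2)

struct vpmu_sketch {
    bool enabled;
    unsigned int features;
};

static bool tok_is(const char *s, size_t len, const char *word)
{
    return strlen(word) == len && strncmp(s, word, len) == 0;
}

/* Parse a comma-separated list; 0 on success, -1 on an unknown item. */
static int parse_vpmu(const char *s, struct vpmu_sketch *o)
{
    while (*s) {
        size_t len = strcspn(s, ",");
        unsigned int feat = 0;

        if (tok_is(s, len, "true"))
            o->enabled = true;
        else if (tok_is(s, len, "false"))
            o->enabled = false;
        else if (tok_is(s, len, "bts"))
            feat = SK_BTS;
        else if (tok_is(s, len, "ipc"))
            feat = SK_IPC;
        else if (tok_is(s, len, "arch"))
            feat = SK_ARCH;
        else
            return -1;

        if (feat) {
            o->features |= feat;
            o->enabled = true;   /* any feature implies vpmu=true */
        }

        s += len;
        if (*s == ',')
            s++;
    }
    return 0;
}
```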

Parts of the documentation are stale, most notably the HVM-only statement.
Update it for consistency and correctness.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years agolibs/gnttab: add missing FreeBSD functions
Roger Pau Monne [Tue, 19 Feb 2019 15:26:08 +0000 (16:26 +0100)]
libs/gnttab: add missing FreeBSD functions

The FreeBSD implementation is missing the following functions:

osdep_gnttab_dmabuf_exp_from_refs
osdep_gnttab_dmabuf_exp_wait_released
osdep_gnttab_dmabuf_imp_to_refs
osdep_gnttab_dmabuf_imp_release

These all deal with dmabufs, which only exist on Linux. Implement them
using abort, since such functions should never be called on FreeBSD.

FTR, I realized those functions were missing when attempting to use
pygrub:

Traceback (most recent call last):
  File "/usr/local/lib/xen/bin/pygrub", line 19, in <module>
    import xen.lowlevel.xc
ImportError: /usr/local/lib/libxengnttab.so.1: Undefined symbol "osdep_gnttab_dmabuf_exp_from_refs"

Fixes: ee8105 ("libgnttab: Add support for Linux dma-buf")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
6 years agovpci: reduce verboseness of BAR write warnings
Roger Pau Monne [Mon, 18 Feb 2019 16:24:28 +0000 (17:24 +0100)]
vpci: reduce verboseness of BAR write warnings

Avoid printing a warning message when writing to a BAR register with
memory decoding enabled if the value written is the same as the
current one.

No functional change.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/altp2m: fix HVMOP_altp2m_set_domain_state race
Razvan Cojocaru [Mon, 18 Feb 2019 12:46:02 +0000 (13:46 +0100)]
x86/altp2m: fix HVMOP_altp2m_set_domain_state race

HVMOP_altp2m_set_domain_state does not domain_pause(), presumably
on purpose (as it was originally supposed to cater to an in-guest
agent, and a domain pausing itself is not a good idea).

This can lead to domain crashes in the vmx_vmexit_handler() code
that checks if the guest has the ability to switch EPTP without an
exit. That code can __vmread() the host p2m's EPT_POINTER
(before HVMOP_altp2m_set_domain_state "for_each_vcpu()" has a
chance to run altp2m_vcpu_initialise(), but after
d->arch.altp2m_active is set).

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoaltp2m: Prevent deadlocks when a domain performs altp2m operations on itself
George Dunlap [Mon, 18 Feb 2019 12:45:24 +0000 (13:45 +0100)]
altp2m: Prevent deadlocks when a domain performs altp2m operations on itself

domain_pause_except_self() was introduced to allow a domain to pause
itself while doing altp2m operations.  However, as written, it has a
risk of deadlock if two vcpus enter the loop at the same time.

Luckily, there's already a solution for this: Attempt to take the domain's
hypercall_deadlock_mutex, and restart the entire hypercall if you
fail.

Make domain_pause_except_self() attempt to grab this mutex when
pausing itself, returning -ERESTART if it fails.  Have the callers
check for errors and pass the value up.  In both cases, the top-level
do_hvm_op() should DTRT when -ERESTART is returned.
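
The trylock-or-restart pattern can be sketched like this, using a C11
atomic_flag in place of Xen's spinlock. The structure, the counter and the
error value are hypothetical:

```c
#include <stdatomic.h>

/* Stand-in for -ERESTART; the value is arbitrary in this sketch. */
#define SKETCH_ERESTART 85

struct domain_sketch {
    atomic_flag deadlock_mutex;   /* set while some vCPU holds it */
    unsigned int pauses;          /* stand-in for pausing other vCPUs */
};

static int pause_except_self(struct domain_sketch *d)
{
    /* Never wait for the mutex: back out and let the hypercall restart. */
    if (atomic_flag_test_and_set(&d->deadlock_mutex))
        return -SKETCH_ERESTART;

    d->pauses++;
    atomic_flag_clear(&d->deadlock_mutex);
    return 0;
}
```

Since no vCPU ever blocks on the mutex, two vCPUs entering at the same time
cannot end up waiting for each other; the loser simply retries.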

The (necessary) reuse of the hypercall deadlock mutex poses the risk
of getting called from a context where the lock was already acquired
(e.g. someone may (say) call domctl_lock(), then afterwards call
domain_pause_except_self()). However, in the interest of not
overcomplicating things, no changes are made here to the mutex.
Attempted nesting of this lock isn't a security issue, because all
that will happen is that the vcpu will livelock taking continuations.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agopvh/dom0: warn when dom0_mem is not set
Roger Pau Monné [Mon, 18 Feb 2019 12:44:53 +0000 (13:44 +0100)]
pvh/dom0: warn when dom0_mem is not set

There have been several reports of the dom0 builder running out of
memory when building a PVH dom0 without having specified a dom0_mem
value. Print a warning message if dom0_mem is not set when booting in
PVH mode.

This is a temporary workaround until accounting for internal memory
required by Xen (ie: paging structures) is improved.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoamd/npt/shadow: replace assert that prevents creating 2M/1G MMIO entries
Roger Pau Monné [Mon, 18 Feb 2019 12:44:24 +0000 (13:44 +0100)]
amd/npt/shadow: replace assert that prevents creating 2M/1G MMIO entries

The assert was originally added to make sure that higher order
regions (> PAGE_ORDER_4K) could not be used to bypass the
mmio_ro_ranges check performed by p2m_type_to_flags.

This however is already checked in set_mmio_p2m_entry, which makes
sure that higher order mappings don't overlap with mmio_ro_ranges,
thus allowing the creation of high order MMIO mappings safely.

Replace the assert to allow 2M/1G entries to be created for MMIO
regions and add some extra asserts as a replacement to make sure
there's no overlapping with MMIO read-only ranges.

Note that 1G MMIO entries will not be created unless mmio_order is
changed to allow it.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/pvh: reorder PVH dom0 iommu initialization
Roger Pau Monné [Mon, 18 Feb 2019 12:43:50 +0000 (13:43 +0100)]
x86/pvh: reorder PVH dom0 iommu initialization

So that the iommu is initialized before populating the p2m, and
entries added get the corresponding iommu page table entries if
required. This requires splitting the current pvh_setup_p2m into two
different functions. One that crafts dom0 physmap and sets the paging
allocation, and another one that actually populates the p2m with RAM
regions.

Note that this allows removing the special casing done for the low
1MB in hwdom_iommu_map.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agodom0/pvh: align allocation and mapping order to start address
Roger Pau Monné [Mon, 18 Feb 2019 12:42:51 +0000 (13:42 +0100)]
dom0/pvh: align allocation and mapping order to start address

The p2m and iommu mapping code always had the requirement that
addresses and orders must be aligned when populating the p2m or the
iommu page tables.

PVH dom0 builder didn't take this requirement into account, and can
call into the p2m/iommu mapping helpers with addresses and orders that
are not aligned.

Fix this by making sure the orders passed to the physmap population
helpers are always aligned to the guest address to be populated.
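
A sketch of picking an order that respects both the alignment of the start
address and the remaining size. The function and parameter names are
hypothetical; nr_pages is assumed to be at least 1 and max_order below 64:

```c
#include <stdint.h>

/*
 * Largest order <= max_order such that start_pfn is a multiple of
 * 2^order pages and 2^order pages fit into nr_pages.
 */
static unsigned int aligned_order(uint64_t start_pfn, uint64_t nr_pages,
                                  unsigned int max_order)
{
    unsigned int order = max_order;

    while (order > 0 &&
           ((start_pfn & ((UINT64_C(1) << order) - 1)) != 0 ||
            (UINT64_C(1) << order) > nr_pages))
        order--;

    return order;
}
```

For example, a region of 1024 pages starting at PFN 0x200 can be mapped with
order 9 at most, because the start is only 512-page aligned.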

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agotools/libxendevicemodel: add xendevicemodel_modified_memory_bulk to map
Paul Durrant [Fri, 15 Feb 2019 10:02:00 +0000 (10:02 +0000)]
tools/libxendevicemodel: add xendevicemodel_modified_memory_bulk to map

Commit e3b93b3c595 "dmop: add xendevicemodel_modified_memory_bulk()" added
the implementation to the library almost 2 years ago, but the function
was not included in the map file, essentially making it useless. This
patch rectifies the situation.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/hvm: block speculative out-of-bound accesses
Norbert Manthey [Tue, 12 Feb 2019 14:20:15 +0000 (15:20 +0100)]
x86/hvm: block speculative out-of-bound accesses

There are multiple arrays in the HVM interface that are accessed
with indices that are provided by the guest. To avoid speculative
out-of-bound accesses, we use the array_index_nospec macro.

When blocking speculative out-of-bound accesses, we can classify arrays
into dynamic arrays and static arrays. Where the former are allocated
during run time, the size of the latter is known during compile time.
On static arrays, the compiler might be able to block speculative accesses
in the future.
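
For illustration, a branch-free index clamp in the style of the generic
fallback used for array_index_nospec-like macros. This is a sketch with
hypothetical names, not Xen's implementation; it assumes index and size do
not exceed LONG_MAX and that right-shifting a negative long is arithmetic,
as it is on common compilers. Real implementations typically rely on
architecture-specific code so that the compiler cannot turn the mask back
into a branch.

```c
#include <limits.h>

/* All ones when index < size, zero otherwise, computed without a branch. */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
                           (sizeof(long) * CHAR_BIT - 1));
}

/* Out-of-range indices collapse to 0, even on a mispredicted path. */
static unsigned long clamp_index(unsigned long index, unsigned long size)
{
    return index & index_mask(index, size);
}
```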

This is part of the speculative hardening effort.

Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoVMX: don't ignore P2M setup error xen-pt-allocation-1.1-base
Jan Beulich [Tue, 12 Feb 2019 10:54:57 +0000 (11:54 +0100)]
VMX: don't ignore P2M setup error

set_mmio_p2m_entry() may fail, in particular with -ENOMEM. Don't ignore
such an error, but instead cause domain creation to fail in such a case.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoiommu: fix iommu_ops initialization
Juergen Gross [Tue, 12 Feb 2019 10:54:07 +0000 (11:54 +0100)]
iommu: fix iommu_ops initialization

Commit 32a5ea00ec75ef53e ("IOMMU/x86: remove indirection from certain
IOMMU hook accesses") introduced iommu_ops initialized at boot time
with data declared as __initconstrel.

On Intel systems there is another path where iommu_ops is initialized
and this path is relevant on resume after returning from system suspend.
As the initialization data is no longer accessible in this case that
second initialization must be dropped in case the system isn't just
booting.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoasm: handle comments when creating header file
Norbert Manthey [Wed, 6 Feb 2019 14:09:33 +0000 (15:09 +0100)]
asm: handle comments when creating header file

In the early steps of compilation, the asm header files are created, such
as include/asm-$(TARGET_ARCH)/asm-offsets.h. These files depend on the
assembly file arch/$(TARGET_ARCH)/asm-offsets.s, which is generated
before. Depending on the used toolchain, there might be comments in the
assembly files. Especially the goto-gcc compiler of the bounded model
checker CBMC adds comments that start with a '#' symbol at the beginning
of the line.

This commit adds handling comments in assembler during the creation of the
asm header files, especially ignoring lines that start with '#', which
indicate comments for both ARM and x86 assembler. The used tool goto-as
produces exactly comments of this kind.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
Signed-off-by: Michael Tautschnig <tautschn@amazon.co.uk>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/shadow: adjust minimum allocation calculations
Jan Beulich [Mon, 11 Feb 2019 08:09:13 +0000 (09:09 +0100)]
x86/shadow: adjust minimum allocation calculations

A previously bad situation has become worse with the early setting of
->max_vcpus: The value returned by shadow_min_acceptable_pages() has
further grown, and hence now holds back even more memory from use for
the p2m.

Make sh_min_allocation() account for all p2m memory needed for
shadow_enable() to succeed during domain creation (at which point the
domain has no memory at all allocated to it yet, and hence use of
d->tot_pages is meaningless).

Also make shadow_min_acceptable_pages() no longer needlessly add 1 to
the vCPU count.

Finally make the debugging printk() in shadow_alloc_p2m_page() a little
more useful by logging some of the relevant domain settings.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agodocs: features/qemu-depriv formatting fixes
George Dunlap [Thu, 7 Feb 2019 12:41:17 +0000 (12:41 +0000)]
docs: features/qemu-depriv formatting fixes

Need a space between the paragraph and the list so pandoc knows it's a
list.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agodocs: Update credit/credit2 feature docs reflecting new default scheduler
George Dunlap [Thu, 7 Feb 2019 12:05:43 +0000 (12:05 +0000)]
docs: Update credit/credit2 feature docs reflecting new default scheduler

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agotools: init scripts: make XEN_RUN_DIR and XEN_LOCK_DIR mode 700
Ian Jackson [Thu, 7 Feb 2019 15:02:27 +0000 (15:02 +0000)]
tools: init scripts: make XEN_RUN_DIR and XEN_LOCK_DIR mode 700

These directories ought not to be even world-readable.  If this script
for some reason runs with a lax umask they might be created
overly-writeable.  Avoid any such bug by setting the mode explicitly.
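
The patch itself changes a shell init script. The same principle in a C
sketch, with a hypothetical function name: the mode passed to mkdir() is
filtered by the umask and is ignored if the directory already exists, whereas
chmod() sets exactly the requested mode.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create 'path' if needed and force its mode to 0700 either way. */
static int make_private_dir(const char *path)
{
    if (mkdir(path, 0700) != 0 && errno != EEXIST)
        return -1;

    /* chmod() is not subject to the umask. */
    return chmod(path, 0700);
}
```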

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agotools: init scripts: xencommons: Fixes to Description
Ian Jackson [Thu, 7 Feb 2019 15:02:26 +0000 (15:02 +0000)]
tools: init scripts: xencommons: Fixes to Description

`neeeded' is a typo.  And xend is long gone.

No functional change.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agotools: init scripts: xencommons: Provides `xen'
Ian Jackson [Thu, 7 Feb 2019 15:02:25 +0000 (15:02 +0000)]
tools: init scripts: xencommons: Provides `xen'

It is useful to have a single `xen' facility (in the LSB Provides
namespace).  That allows other facilities to specify that they should
go after `xen' without needing to know the implementation details.

This service name is already Provide'd by the (fairly different) init
scripts used in Debian.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoxen/arm: gic-v2: deactivate interrupts during initialization
Stefano Stabellini [Tue, 5 Feb 2019 21:38:53 +0000 (13:38 -0800)]
xen/arm: gic-v2: deactivate interrupts during initialization

Interrupts could be ACTIVE at boot. Make sure to deactivate them during
initialization.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
CC: julien.grall@arm.com
CC: peng.fan@nxp.com
CC: jgross@suse.com
6 years agodocs, argo: add design document for Argo
Christopher Clark [Wed, 6 Feb 2019 08:56:00 +0000 (09:56 +0100)]
docs, argo: add design document for Argo

Document provides a brief introduction to the Argo interdomain
communication mechanism and a detailed description of the granular
locking used within the Argo implementation.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoSUPPORT.md : add new entry for the Argo feature
Christopher Clark [Wed, 6 Feb 2019 09:04:00 +0000 (10:04 +0100)]
SUPPORT.md : add new entry for the Argo feature

Status: Experimental

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoMAINTAINERS: add new section for Argo and self as maintainer
Christopher Clark [Wed, 6 Feb 2019 08:56:00 +0000 (09:56 +0100)]
MAINTAINERS: add new section for Argo and self as maintainer

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoxsm, argo: notify: don't describe rings that cannot be sent to
Christopher Clark [Wed, 6 Feb 2019 08:56:00 +0000 (09:56 +0100)]
xsm, argo: notify: don't describe rings that cannot be sent to

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Tested-by: Chris Patterson <pattersonc@ainfosec.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoxsm, argo: XSM control for any access to argo by a domain
Christopher Clark [Wed, 6 Feb 2019 08:56:00 +0000 (09:56 +0100)]
xsm, argo: XSM control for any access to argo by a domain

Will inhibit initialization of the domain's argo data structure to
prevent receiving any messages or notifications and access to any of
the argo hypercall operations.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Tested-by: Chris Patterson <pattersonc@ainfosec.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoxsm, argo: XSM control for argo message send operation
Christopher Clark [Wed, 6 Feb 2019 09:02:00 +0000 (10:02 +0100)]
xsm, argo: XSM control for argo message send operation

Default policy: allow.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Tested-by: Chris Patterson <pattersonc@ainfosec.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoxsm, argo: XSM control for argo register
Christopher Clark [Wed, 6 Feb 2019 08:55:00 +0000 (09:55 +0100)]
xsm, argo: XSM control for argo register

XSM controls for argo ring registration with two distinct cases, where
the ring being registered is:

1) Single source:  registering a ring for communication to receive messages
                   from a specified single other domain.
   Default policy: allow.

2) Any source:     registering a ring for communication to receive messages
                   from any, or all, other domains (ie. wildcard).
   Default policy: deny, with runtime policy configuration via bootparam.

This commit modifies the signature of core XSM hook functions in order to
apply 'const' to arguments, needed in order for 'const' to be accepted in
signature of functions that invoke them.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Tested-by: Chris Patterson <pattersonc@ainfosec.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoargo: implement the notify op
Christopher Clark [Wed, 6 Feb 2019 08:55:00 +0000 (09:55 +0100)]
argo: implement the notify op

Queries for data about space availability in registered rings and
causes notification to be sent when space has become available.

The hypercall op populates a supplied data structure with information about
ring state and if insufficient space is currently available in a given ring,
the hypervisor will record the domain's expressed interest and notify it
when it observes that space has become available.

Checks for free space occur when this notify op is invoked, so it may be
intentionally invoked with no data structure to populate
(ie. a NULL argument) to trigger such a check and consequent notifications.

Limit the maximum number of notify requests in a single operation to a
simple fixed limit of 256.
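
A rough model of the behaviour described above, written in Python rather
than the hypervisor's C purely for illustration. The names used here (Ring,
pending, notify) are hypothetical and are not taken from the Argo
implementation; only the limit of 256 requests comes from this commit
message. The real op is scoped per domain, which this sketch does not model.

```python
MAX_NOTIFY_COUNT = 256  # fixed per-operation limit stated in the commit message


class Ring:
    """Toy ring: tracks free space and the domains waiting for space."""

    def __init__(self, size):
        self.size = size
        self.used = 0
        self.pending = {}  # domid -> bytes of space that domain is waiting for

    def space(self):
        return self.size - self.used


def notify(rings, caller, requests):
    """Model of the notify op (hypothetical signature).

    rings:    dict mapping a ring id to a Ring
    requests: list of (ring_id, space_required), or None to only trigger
              the free-space check
    Returns (results, signalled): per-request (ring_id, space, sufficient)
    tuples, and the sorted list of domains that would be signalled.
    """
    if requests is not None and len(requests) > MAX_NOTIFY_COUNT:
        raise ValueError("too many notify requests in one operation")

    # Free-space check: signal every waiter whose requirement is now met.
    signalled = set()
    for ring in rings.values():
        for domid, needed in list(ring.pending.items()):
            if ring.space() >= needed:
                del ring.pending[domid]
                signalled.add(domid)

    results = []
    for ring_id, needed in requests or []:
        ring = rings[ring_id]
        sufficient = ring.space() >= needed
        if not sufficient:
            ring.pending[caller] = needed  # record the expressed interest
        results.append((ring_id, ring.space(), sufficient))
    return results, sorted(signalled)
```

Passing None for requests mirrors the NULL-argument case: nothing is
populated, but waiters whose space requirement is now met are still found.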

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Tested-by: Chris Patterson <pattersonc@ainfosec.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoargo: implement the sendv op; evtchn: expose send_guest_global_virq
Christopher Clark [Wed, 6 Feb 2019 08:55:00 +0000 (09:55 +0100)]
argo: implement the sendv op; evtchn: expose send_guest_global_virq

sendv operation is invoked to perform a synchronous send of buffers
contained in iovs to a remote domain's registered ring.

It takes:
 * A destination address (domid, port) for the ring to send to.
   It performs a most-specific match lookup, to allow for wildcard.
 * A source address, used to inform the destination of where to reply.
 * The address of an array of iovs containing the data to send
 * .. and the length of that array of iovs
 * and a 32-bit message type, available to communicate message context
   data (eg. kernel-to-kernel, separate from the application data).
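
The most-specific match lookup mentioned in the list above can be sketched
as follows. This is a Python illustration under stated assumptions, not the
hypervisor code: the table layout, the find_ring name and the DOMID_ANY
placeholder are all hypothetical.

```python
DOMID_ANY = 0x7FF4  # placeholder wildcard marker; an assumption of this sketch


def find_ring(rings, dst_port, src_domid):
    """Return the ring a message from src_domid to dst_port should land in.

    rings maps (port, partner_domid) to a ring object. An exact
    (port, sender) entry wins over a wildcard (port, DOMID_ANY) entry.
    """
    ring = rings.get((dst_port, src_domid))
    if ring is None:
        ring = rings.get((dst_port, DOMID_ANY))
    return ring
```

Trying the exact key first is what makes the match "most specific": a
wildcard ring only receives traffic from senders that have no dedicated ring
on that port.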

If insufficient space exists in the destination ring, it will return
-EAGAIN and Xen will notify the caller when sufficient space becomes
available.

Accesses to the ring indices are appropriately atomic. The rings are
mapped into Xen's private address space to write as needed and the
mappings are retained for later use.

Notifications are sent to guests via VIRQ and send_guest_global_virq is
exposed in the change to enable argo to call it. VIRQ_ARGO is claimed
from the VIRQ previously reserved for this purpose (#11).

The VIRQ notification method is used rather than sending events using
evtchn functions directly because:

* no current event channel type is an exact fit for the intended
  behaviour. ECS_IPI is closest, but it disallows migration to
  other VCPUs which is not necessarily a requirement for Argo.

* at the point of argo_init, allocation of an event channel is
  complicated by none of the guest VCPUs being initialized yet
  and the event channel logic expects that a valid event channel
  has a present VCPU.

* at the point of signalling a notification, the VIRQ logic is already
  defensive: if d->vcpu[0] is NULL, the notification is just silently
  dropped, whereas the evtchn_send logic is not so defensive: vcpu[0]
  must not be NULL, otherwise a null pointer dereference occurs.

Using a VIRQ removes the need for the guest to query to determine which
event channel notifications will be delivered on. This is also likely to
simplify establishing future L0/L1 nested hypervisor argo communication.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Tested-by: Chris Patterson <pattersonc@ainfosec.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoargo: implement the unregister op
Christopher Clark [Wed, 6 Feb 2019 09:04:00 +0000 (10:04 +0100)]
argo: implement the unregister op

Takes a single argument: a handle to the ring unregistration struct,
which specifies the port and partner domain id or wildcard.

The ring's entry is removed from the hashtable of registered rings;
any entries for pending notifications are removed; and the ring is
unmapped from Xen's address space.

If the ring had been registered to communicate with a single specified
domain (ie. a non-wildcard ring) then the partner domain state is removed
from the partner domain's argo send_info hash table.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Chris Patterson <pattersonc@ainfosec.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoargo: implement the register op
Christopher Clark [Wed, 6 Feb 2019 08:55:00 +0000 (09:55 +0100)]
argo: implement the register op

The register op is used by a domain to register a region of memory for
receiving messages from either a specified other domain, or, if specifying a
wildcard, any domain.

This operation creates a mapping within Xen's private address space that
will remain resident for the lifetime of the ring. In subsequent commits,
the hypervisor will use this mapping to copy data from a sending domain into
this registered ring, making it accessible to the domain that registered the
ring to receive data.

Wildcard any-sender rings are default disabled and registration will be
refused with EPERM unless they have been specifically enabled with the
new mac-permissive flag that is added to the argo boot option here. The
reason why the default for wildcard rings is 'deny' is that there is
currently no means to protect the ring from DoS by a noisy domain
spamming the ring, affecting other domains' ability to send to it. This
will be addressed with XSM policy controls in subsequent work.

Since denying access to any-sender rings is a significant functional
constraint, the new option "mac-permissive" for the argo bootparam
enables overriding this. eg: "argo=1,mac-permissive=1"
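
A minimal sketch of how such an option string could gate wildcard
registration, in Python for illustration only. The function names, the set
of accepted "true" spellings and the silent skipping of unknown sub-options
are assumptions of this sketch, not the behaviour of Xen's parser.

```python
EPERM = 1  # standard errno value


def parse_argo_param(value):
    """Parse the text following "argo=", e.g. "1,mac-permissive=1".

    Returns a dict with 'enabled' and 'mac_permissive' booleans. Both
    default to False, matching the defaults stated in the commit messages.
    Unknown sub-options are skipped in this sketch.
    """
    opts = {"enabled": False, "mac_permissive": False}
    truthy = {"1", "on", "true", "yes", "enable"}  # assumed spellings
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        if "=" in item:
            key, val = item.split("=", 1)
            if key == "mac-permissive":
                opts["mac_permissive"] = val in truthy
        else:
            opts["enabled"] = item in truthy
    return opts


def check_register(opts, partner_is_wildcard):
    """Return 0 if registration may proceed, or -EPERM if it is refused."""
    if partner_is_wildcard and not opts["mac_permissive"]:
        return -EPERM
    return 0
```

With "argo=1" alone, a single-sender ring registers normally while a
wildcard ring is refused; adding mac-permissive=1 lifts that refusal.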

The p2m type of the memory supplied by the guest for the ring must be
p2m_ram_rw and the memory will be pinned as PGT_writable_page while the ring
is registered.

This hypercall op and its interface currently only supports 4K-sized pages.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Tested-by: Chris Patterson <pattersonc@ainfosec.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoxen/arm: introduce guest_handle_for_field()
Christopher Clark [Wed, 6 Feb 2019 08:55:00 +0000 (09:55 +0100)]
xen/arm: introduce guest_handle_for_field()

ARM port of c/s bb544585: "introduce guest_handle_for_field()"

This helper turns a field of a GUEST_HANDLE into a GUEST_HANDLE.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoerrno: add POSIX error codes EMSGSIZE, ECONNREFUSED to the ABI
Christopher Clark [Wed, 6 Feb 2019 08:55:00 +0000 (09:55 +0100)]
errno: add POSIX error codes EMSGSIZE, ECONNREFUSED to the ABI

EMSGSIZE: Argo's sendv operation will return EMSGSIZE when an excess amount
of data, across all iovs, has been supplied, exceeding either the statically
configured maximum size of a transmittable message, or the (variable) size
of the ring registered by the destination domain.

ECONNREFUSED: Argo's register operation will return ECONNREFUSED if a ring
is being registered to communicate with a specific remote domain that does
exist but is not argo-enabled.

These codes are described by POSIX here:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/errno.h.html
    EMSGSIZE     : "Message too large"
    ECONNREFUSED : "Connection refused".

The numeric values assigned to each are taken from Linux, as is the case
for the existing error codes.
    EMSGSIZE     : 90
    ECONNREFUSED : 111
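
To illustrate the distinction between the EMSGSIZE case described above and
the EAGAIN case described for sendv elsewhere in this log, here is a small
Python sketch. MAX_MSG_SIZE and sendv_check are hypothetical, and the
per-message header overhead of a real ring is ignored.

```python
EAGAIN = 11      # long-standing errno value shared with Linux
EMSGSIZE = 90    # value given in the commit message

MAX_MSG_SIZE = 1 << 20  # hypothetical static limit, for illustration only


def sendv_check(iov_lengths, ring_size, ring_free):
    """Return the byte count to send, or a negative errno value.

    -EMSGSIZE: the message can never fit (too big for the static limit or
               for the destination ring as a whole).
    -EAGAIN:   the message would fit, but not in the space currently free.
    """
    total = sum(iov_lengths)
    if total > MAX_MSG_SIZE or total > ring_size:
        return -EMSGSIZE
    if total > ring_free:
        return -EAGAIN
    return total
```

The order of the checks matters: a permanent failure is reported before a
temporary one, so a caller never waits for space that cannot become
sufficient.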

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoargo: init, destroy and soft-reset, with enable command line opt
Christopher Clark [Wed, 6 Feb 2019 08:55:00 +0000 (09:55 +0100)]
argo: init, destroy and soft-reset, with enable command line opt

Initialises basic data structures and performs teardown of argo state
for domain shutdown.

Inclusion of the Argo implementation is dependent on CONFIG_ARGO.

Introduces a new Xen command line parameter 'argo': bool to enable/disable
the argo hypercall. Defaults to disabled.

New headers:
  public/argo.h: with definitions of addresses and ring structure, including
  indexes for atomic update for communication between domain and hypervisor.

  xen/argo.h: to expose the hooks for integration into domain lifecycle:
    argo_init: per-domain init of argo data structures for domain_create.
    argo_destroy: teardown for domain_destroy and the error exit
                  path of domain_create.
    argo_soft_reset: reset of domain state for domain_soft_reset.

Adds a new field to struct domain: struct argo_domain *argo;

In accordance with recent work on _domain_destroy, argo_destroy is
idempotent. It will tear down: all rings registered by this domain, all
rings where this domain is the single sender (ie. specified partner,
non-wildcard rings), and all pending notifications where this domain is
awaiting signal about available space in the rings of other domains.

A count will be maintained of the number of rings that a domain has
registered in order to limit it below the fixed maximum limit defined here.

Macros are defined to verify the internal locking state within the argo
implementation. The macros are ASSERTed on entry to functions to validate
and document the required lock state prior to calling.

The hash function for the hashtables that hold ring state is derived from
the string hashing function djb2 (http://www.cse.yorku.ca/~oz/hash.html)
by Daniel J. Bernstein. Basic testing with a limited number of domains and
ports has shown reasonable distribution for the table size.
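
For reference, the classic djb2 function that the hash is derived from looks
like this when written in Python. The bucket helper, its key layout and the
table size of 32 are hypothetical and only show how a (port, domid) pair
might be reduced to a table slot; the actual derived function in the Argo
code may differ.

```python
def djb2(data):
    """Classic djb2 hash over a bytes object, truncated to 32 bits."""
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def bucket(port, domid, table_size=32):
    """Hypothetical use: hash a (port, domid) pair into a table slot."""
    key = port.to_bytes(4, "little") + domid.to_bytes(2, "little")
    return djb2(key) % table_size
```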

The software license on the public header is the BSD license, standard
procedure for the public Xen headers. The public header was originally
posted under a GPL license at: [1]:
https://lists.xenproject.org/archives/html/xen-devel/2013-05/msg02710.html

The following ACK by Lars Kurth is to confirm that only people being
employees of Citrix contributed to the header files in the series posted at
[1] and that thus the copyright of the files in question is fully owned by
Citrix. The ACK also confirms that Citrix is happy for the header files to
be published under a BSD license in this series (which is based on [1]).

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Acked-by: Lars Kurth <lars.kurth@citrix.com>
Reviewed-by: Ross Philipson <ross.philipson@oracle.com>
Tested-by: Chris Patterson <pattersonc@ainfosec.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoargo: define argo_dprintk for subsystem debugging
Christopher Clark [Wed, 6 Feb 2019 08:55:00 +0000 (09:55 +0100)]
argo: define argo_dprintk for subsystem debugging

A convenience for working on development of the argo subsystem:
setting a #define variable enables additional debug messages.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
6 years agoargo: introduce the argo_op hypercall boilerplate
Christopher Clark [Wed, 6 Feb 2019 08:55:00 +0000 (09:55 +0100)]
argo: introduce the argo_op hypercall boilerplate

Presence is gated upon CONFIG_ARGO.

Registers the hypercall previously reserved for this.
Takes 5 arguments, does nothing and returns -ENOSYS.

Implementation will provide a compat ABI so COMPAT_CALL is the selected
macro for the hypercall tables.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoargo: Introduce the Kconfig option to govern inclusion of Argo
Christopher Clark [Wed, 6 Feb 2019 08:55:00 +0000 (09:55 +0100)]
argo: Introduce the Kconfig option to govern inclusion of Argo

Defines CONFIG_ARGO when enabled. Default: disabled.

When the Kconfig option is enabled, the Argo hypercall implementation
will be included, allowing use of the hypervisor-mediated interdomain
communication mechanism.

Argo is implemented for x86 and ARM hardware platforms.

Availability of the option depends on EXPERT and Argo is currently an
experimental feature.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoarm: gic-v3: deactivate interrupts during initialization
Peng Fan [Tue, 5 Feb 2019 05:55:35 +0000 (05:55 +0000)]
arm: gic-v3: deactivate interrupts during initialization

On i.MX8, we implemented partition reboot which means Cortex-A reboot
will not impact M4 cores and System control Unit core. However GICv3 is
not reset because we also need to support A72 Cluster reboot without
affecting A53 Cluster.

The gic-v3 controller is configured with EOImode set to 1. During Xen
reboot there is a call to "smp_call_function(halt_this_cpu, NULL, 0);",
but halt_this_cpu never returns, so the other CPUs have no chance to
deactivate the SGI interrupt: the deactivate_irq operation is at
the end of do_sgi. During the next boot of Xen, CPU0 will issue
GIC_SGI_CALL_FUNCTION to other CPUs. As the Active state for SGI is left
untouched during the reboot, the GIC_SGI_CALL_FUNCTION will still be active
on the non-boot CPUs. This means the interrupt cannot be triggered again
until it gets deactivated.

And according to IHI0069D_gic_architecture_specification, chapter
"8.11.3 GICR_ICACTIVER0, Interrupt Clear-Active Register 0", the RW
field of GICR_ICACTIVER0 resets to a value that is architecturally UNKNOWN.
So make sure all interrupts are deactivated during initialization by
clearing the state.

Signed-off-by: Peng Fan <peng.fan@nxp.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
6 years agotools: drop obsolete xen-ringwatch
Wei Liu [Mon, 4 Feb 2019 13:58:24 +0000 (13:58 +0000)]
tools: drop obsolete xen-ringwatch

This utility can't possibly work with a modern Xen setup: none of the
sysfs paths used (under /sys/devices/xen-backend) are documented as
stable ABI in the upstream Linux kernel.

Archaeology shows that the path used could have been part of the
xenolinux fork which never got upstreamed.

Its utility is zero nowadays. Drop it.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
6 years agoxen/arm: irq: End cleanly spurious interrupt
Julien Grall [Mon, 28 Jan 2019 16:00:23 +0000 (16:00 +0000)]
xen/arm: irq: End cleanly spurious interrupt

no_irq_type handlers are used when an IRQ does not have an action attached.
This is useful to detect misconfiguration between the interrupt
controller and the software.

Currently, all the handlers do nothing on a spurious interrupt. This
means that if such an interrupt is received, its priority will not be
dropped and the processor will lose the ability to receive any
interrupt of lower or equal priority.

A spurious interrupt can happen while releasing an interrupt assigned to a
guest (this happens during domain destruction). The interaction is roughly:

CPU0                                CPU1
release_guest_irq(A)
spin_lock(&desc->lock)
gic_remove_irq_from_guest
                                    receive IRQ A
                                    spin_lock(&desc->lock)
    desc->handler->shutdown()
      set_bit(IRQ_DISABLED)
    desc->handler = &no_irq_type
spin_unlock(&desc->lock)
                                    desc->handler->end();
                                    spin_unlock(&desc->lock)

Because the no_irq_type.end callback is implemented as a NOP, CPU1 will
not drop the priority of the interrupt. So the CPU will not be able to
receive any interrupt routed to any guest afterwards.

The problem can be prevented by dropping the priority and deactivating
the interrupt via gic_hw_ops->gic_host_irq->end().
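
The effect of a missing priority drop can be modelled with a few lines of
Python. This is a toy model under stated assumptions: the CpuInterface class
and its methods are hypothetical, only the running-priority behaviour is
modelled, and interrupt deactivation is left out.

```python
class CpuInterface:
    """Toy CPU interface. Lower number means higher priority, as on the GIC."""

    IDLE = 0xFF

    def __init__(self):
        self.running = [self.IDLE]  # stack of running priorities

    def can_take(self, prio):
        # An interrupt is delivered only if it is strictly higher priority
        # (numerically lower) than the current running priority.
        return prio < self.running[-1]

    def acknowledge(self, prio):
        if not self.can_take(prio):
            raise RuntimeError("interrupt masked by running priority")
        self.running.append(prio)

    def end_nop(self):
        """Models a no-op end callback: the running priority stays raised."""

    def end_drop(self):
        """Models an end callback that performs the priority drop."""
        if len(self.running) > 1:
            self.running.pop()
```

After acknowledge() followed by end_nop(), can_take() keeps returning False
for interrupts of the same or lower priority, which is the stuck state
described in this commit; end_drop() restores delivery.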

Note that, for now, interrupts used by Xen are safe because they do not
use no_irq_type on release.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
6 years agotools/misc: Remove obsolete xen-bugtool
Hans van Kranenburg [Sun, 3 Feb 2019 20:35:18 +0000 (21:35 +0100)]
tools/misc: Remove obsolete xen-bugtool

xen-bugtool relies on code that has been removed in commit 9e8672f1c3
"tools: remove xend and associated python modules", more than 5 years
ago. Remove it, since it confuses users.

    -$ /usr/sbin/xen-bugtool
    Traceback (most recent call last):
      File "/usr/sbin/xen-bugtool", line 9, in <module>
        from xen.util import bugtool
    ImportError: No module named xen.util

Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=866380
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoautomation: introduce a QEMU smoke test for PVH Dom0
Wei Liu [Thu, 24 Jan 2019 14:03:48 +0000 (14:03 +0000)]
automation: introduce a QEMU smoke test for PVH Dom0

Make qemu-smoke-x86-64.sh take a variant argument. Make two new tests
in test.yaml.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Doug Goldstein <cardoe@cardoe.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agolibxl: When restricted, start QEMU paused
Anthony PERARD [Thu, 31 Jan 2019 10:57:48 +0000 (10:57 +0000)]
libxl: When restricted, start QEMU paused

libxl runs the command "cont" later during guest creation; i.e. it
is expecting that QEMU would not do any emulation.  Use the "-S"
command option to achieve this.

Unfortunately, when QEMU is started with "-S", it won't write QEMU's
readiness into xenstore. So only activate this option when we have a
QEMU startup notification via QMP available, i.e. when dm_restrict
is activated.

The -S option has the side-effect of suppressing the startup
notification via xenstore: libxl will only get the notification via
QMP.

It is important to rely only on QMP for notification when we have
QMP available, as (due to a qemu bug) not waiting for that QMP
notification may result in the QMP socket becoming blocked, so that
QEMU stops responding to new connections even if no existing ones
are active.

When the QEMU bug happens, the actions taken by both libxl and QEMU
are roughly as follows:
- libxl connects and handshakes with QEMU, then sends the
  cmd "query-status".
- QEMU prepares and maybe tries to send the response,
  while also writing "running" into xenstore.
- libxl sees via xenstore that QEMU is running and disconnects from the
  QMP socket before receiving the response from the cmd.
=> The QMP socket (monitor) is thereby blocked and will never reply
  to commands on new connections.

This is due to QEMU only responding to one command at a time, and
suspending its monitor (QMP) until the command has been processed and
sent. Disconnecting from the socket doesn't unsuspend the monitor. The
race described here is very likely to happen with QEMU 3.1.50 (during
3.2 development), but can be reproduced with QEMU 3.1.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
6 years agox86/svm: Improve diagnostics when svm_get_insn_len() fails
Andrew Cooper [Fri, 30 Nov 2018 13:50:54 +0000 (13:50 +0000)]
x86/svm: Improve diagnostics when svm_get_insn_len() fails

Sadly, a lone:

  (XEN) emulate.c:156:d2v0 svm_get_insn_len: Mismatch between expected and actual instruction: eip = fffff804564139c0

on the console is of no use when trying to identify what went wrong.  Dump as
much state as we can to help with diagnosis.

  (XEN) Insn mismatch: Expected opcode 0xf0031, modrm 0, got nrip_len 3, emul_len 3
  (XEN) SVM Insn len emulation failed (1): d1v0 64bit @ 0008:0010475f -> 0f 01 f9 0f 31 5b 31 ff 31 c0 e9 c2 db ff ff 00

Drop the debug-only early exit if the sources of length disagree, because the
only effect it has is to avoid the more detailed analysis of what went wrong.

Reported-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Brian Woods <brian.woods@amd.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/svm: Drop enum instruction_index and simplify svm_get_insn_len()
Andrew Cooper [Thu, 13 Dec 2018 17:01:24 +0000 (17:01 +0000)]
x86/svm: Drop enum instruction_index and simplify svm_get_insn_len()

Passing a 32-bit integer index into an array with entries containing less than
32 bits of data is wasteful, and creates an unnecessary error condition of
passing an out-of-range index.

The width of the X86EMUL_OPC() encoding is currently 20 bits for the
instructions used, which leaves room for a modrm byte.  Drop opc_tab[]
entirely, and encode the expected opcode/modrm information directly.
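
As an illustration of packing an opcode encoding and a ModRM byte into a
single integer, consider the following Python sketch. The helper names and
the exact bit positions are assumptions of this sketch rather than the
layout used in the patch; the 20-bit width comes from this commit message,
and the example value 0xf0031 is borrowed from the diagnostic output quoted
elsewhere in this log.

```python
OPC_BITS = 20    # width stated in the commit message
MODRM_BITS = 8   # one ModRM byte


def insn_enc(opc, modrm):
    """Pack an opcode encoding and an expected ModRM byte into one integer."""
    if not 0 <= opc < (1 << OPC_BITS):
        raise ValueError("opcode encoding does not fit in 20 bits")
    if not 0 <= modrm < (1 << MODRM_BITS):
        raise ValueError("modrm does not fit in one byte")
    return (opc << MODRM_BITS) | modrm


def insn_dec(enc):
    """Split a packed value back into (opcode encoding, modrm)."""
    return enc >> MODRM_BITS, enc & ((1 << MODRM_BITS) - 1)
```

Since 20 + 8 = 28 bits, the packed value fits in a 32-bit integer, so no
lookup table and no out-of-range index check are needed.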

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Brian Woods <brian.woods@amd.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/svm: Remove list functionality from __get_instruction_length_* infrastructure
Andrew Cooper [Thu, 13 Dec 2018 17:01:24 +0000 (09:01 -0800)]
x86/svm: Remove list functionality from __get_instruction_length_* infrastructure

The existing __get_instruction_length_from_list() has a single user
which uses the list functionality.  That user however should be looking
specifically for INVD or WBINVD, as reported by the vmexit exit reason.

Modify svm_vmexit_do_invalidate_cache() to ask for the correct
instruction, and drop all list functionality from the helper.

Take the opportunity to rename it to svm_get_insn_len(), and drop the
IOIO length handling which has never been used.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Brian Woods <brian.woods@amd.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86emul: correct AVX512BW write masking checks
Jan Beulich [Thu, 31 Jan 2019 10:38:24 +0000 (11:38 +0100)]
x86emul: correct AVX512BW write masking checks

For VPSADBW this likely was a result of bad copy-and-paste.

For VPS{L,R}LDQ comment and code were not in line, but then again the
comment also wasn't fully updated from the AVX2 original it got cloned
from.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agotools: fix build dependency upon generated header(s)
Jan Beulich [Thu, 31 Jan 2019 10:37:56 +0000 (11:37 +0100)]
tools: fix build dependency upon generated header(s)

Commit fd35f32b4b ("tools/x86emul: Use struct cpuid_policy in the
userspace test harnesses") didn't account for the dependencies of
cpuid-autogen.h to potentially change between incremental builds.
Putting the make invocation to produce the header together with the
directory tree creation therefore does not work. Introduce a separate
goal.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoxen/cmdline: Work around some specific command line warnings
Andrew Cooper [Tue, 29 Jan 2019 19:07:40 +0000 (19:07 +0000)]
xen/cmdline: Work around some specific command line warnings

Xen will warn when an unknown parameter is found in the command line.  e.g.

  (d8) [ 1556.334664] (XEN) parameter "pv-shim" unknown!

One case where this goes wrong is a workaround for an old grub bug, which
resulted in "placeholder" being prepended to the command line.

Another case is when booting a CONFIG_PV_SHIM_EXCLUSIVE build, in which the
parsing for the "pv-shim" parameter is discarded.

Introduce ignore_param() and OPT_IGNORE to cope with known cases, where
issuing a warning is the wrong course of action to take.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/pvh-boot: don't mandate validity of RSDP pointer
Wei Liu [Wed, 30 Jan 2019 13:55:55 +0000 (13:55 +0000)]
x86/pvh-boot: don't mandate validity of RSDP pointer

RSDP is not mandatory according to PVH spec. Remove the BUG_ON. The
guest (xen) will fall back to scanning if necessary.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agoxen/arm: gic-vgic: Fix the assert condition in vgic_connect_hw_irq
Andrii Anisov [Fri, 25 Jan 2019 17:06:02 +0000 (19:06 +0200)]
xen/arm: gic-vgic: Fix the assert condition in vgic_connect_hw_irq

Currently, the assert condition in vgic_connect_hw_irq does not
correspond to the comment above it, which results in the assertion
being hit on HW IRQ disconnection.

Fix the condition so it corresponds to the comment and allows IRQ
disconnection on debug builds.

Fixes: ec2a2f1 ("ARM: VGIC: factor out vgic_connect_hw_irq()")
Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
Suggested-by: Stefan Nuernberger <snu@amazon.de>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
[julieng: Reword the commit message]
Acked-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agolibxl: correctly dispose of dominfo list in libxl_name_to_domid
Wei Liu [Tue, 29 Jan 2019 11:37:59 +0000 (11:37 +0000)]
libxl: correctly dispose of dominfo list in libxl_name_to_domid

Tamas reported ssid_label was leaked. Use the designated function to
free dominfo list to fix the leakage.

Reported-by: Tamas K Lengyel <tamas@tklengyel.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Tested-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/hvm: Fix bit checking for CR4 and MSR_EFER
Andrew Cooper [Fri, 25 Jan 2019 16:23:46 +0000 (16:23 +0000)]
x86/hvm: Fix bit checking for CR4 and MSR_EFER

Before the cpuid_policy logic came along, %cr4/EFER auditing on migrate-in was
complicated, because at that point no CPUID information had been set for the
guest.  Auditing against the host CPUID was better than nothing, but not
ideal.

Similarly at the time, PVHv1 lacked the "CPUID passed through from hardware"
behaviour which PV guests had, and PVH dom0 had to be special-cased to be able
to boot.

Order of information in the migration stream is still an issue (hence we still
need to keep the restore parameter to cope with a nested virt corner case for
%cr4), but since Xen 4.9, all domains start with a suitable CPUID policy,
which is a more appropriate upper bound than host_cpuid_policy.

Finally, reposition the UMIP logic as it is the only row out of order.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agox86/p2m: Drop erroneous #VE-enabled check in ept_set_entry()
Andrew Cooper [Tue, 22 Jan 2019 18:58:56 +0000 (18:58 +0000)]
x86/p2m: Drop erroneous #VE-enabled check in ept_set_entry()

Code clearing the "Suppress VE" bit in an EPT entry isn't necessarily running
in current context.  In ALTP2M_external mode, it definitely is not, and in PV
context, vcpu_altp2m(current) acts upon the HVM union.

Even if we could sensibly resolve the target vCPU, it may legitimately not be
fully set up at this point, so rejecting the EPT modification would be buggy.

There is a path in hvm_hap_nested_page_fault() which explicitly emulates #VE
in the cpu_has_vmx_virt_exceptions case, so the -EOPNOTSUPP part of this
condition is also wrong.

Drop the !sve check entirely.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
6 years agopvh/dom0: fix deadlock in GSI mapping
Roger Pau Monne [Mon, 28 Jan 2019 14:22:45 +0000 (15:22 +0100)]
pvh/dom0: fix deadlock in GSI mapping

The current GSI mapping code can cause the following deadlock:

(XEN) *** Dumping CPU0 host state: ***
(XEN) ----[ Xen-4.12.0-rc  x86_64  debug=y   Tainted:  C   ]----
[...]
(XEN) Xen call trace:
(XEN)    [<ffff82d080239852>] vmac.c#_spin_lock_cb+0x32/0x70
(XEN)    [<ffff82d0802ed40f>] vmac.c#hvm_gsi_assert+0x2f/0x60 <- pick hvm.irq_lock
(XEN)    [<ffff82d080255cc9>] io.c#hvm_dirq_assist+0xd9/0x130 <- pick event_lock
(XEN)    [<ffff82d080255b4b>] io.c#dpci_softirq+0xdb/0x120
(XEN)    [<ffff82d080238ce6>] softirq.c#__do_softirq+0x46/0xa0
(XEN)    [<ffff82d08026f955>] domain.c#idle_loop+0x35/0x90
(XEN)
[...]
(XEN) *** Dumping CPU3 host state: ***
(XEN) ----[ Xen-4.12.0-rc  x86_64  debug=y   Tainted:  C   ]----
[...]
(XEN) Xen call trace:
(XEN)    [<ffff82d08023985d>] vmac.c#_spin_lock_cb+0x3d/0x70
(XEN)    [<ffff82d080281fc8>] vmac.c#allocate_and_map_gsi_pirq+0xc8/0x130 <- pick event_lock
(XEN)    [<ffff82d0802f44c0>] vioapic.c#vioapic_hwdom_map_gsi+0x80/0x130
(XEN)    [<ffff82d0802f4399>] vioapic.c#vioapic_write_redirent+0x119/0x1c0 <- pick hvm.irq_lock
(XEN)    [<ffff82d0802f4075>] vioapic.c#vioapic_write+0x35/0x40
(XEN)    [<ffff82d0802e96a2>] vmac.c#hvm_process_io_intercept+0xd2/0x230
(XEN)    [<ffff82d0802e9842>] vmac.c#hvm_io_intercept+0x22/0x50
(XEN)    [<ffff82d0802dbe9b>] emulate.c#hvmemul_do_io+0x21b/0x3c0
(XEN)    [<ffff82d0802db302>] emulate.c#hvmemul_do_io_buffer+0x32/0x70
(XEN)    [<ffff82d0802dcd29>] emulate.c#hvmemul_do_mmio_buffer+0x29/0x30
(XEN)    [<ffff82d0802dcc19>] emulate.c#hvmemul_phys_mmio_access+0xf9/0x1b0
(XEN)    [<ffff82d0802dc6d0>] emulate.c#hvmemul_linear_mmio_access+0xf0/0x180
(XEN)    [<ffff82d0802de971>] emulate.c#hvmemul_linear_mmio_write+0x21/0x30
(XEN)    [<ffff82d0802de742>] emulate.c#linear_write+0xa2/0x100
(XEN)    [<ffff82d0802dce15>] emulate.c#hvmemul_write+0xb5/0x120
(XEN)    [<ffff82d0802babba>] vmac.c#x86_emulate+0x132aa/0x149a0
(XEN)    [<ffff82d0802c04f9>] vmac.c#x86_emulate_wrapper+0x29/0x70
(XEN)    [<ffff82d0802db570>] emulate.c#_hvm_emulate_one+0x50/0x140
(XEN)    [<ffff82d0802e9e31>] vmac.c#hvm_emulate_one_insn+0x41/0x100
(XEN)    [<ffff82d080345066>] guest_4.o#sh_page_fault__guest_4+0x976/0xd30
(XEN)    [<ffff82d08030cc69>] vmac.c#vmx_vmexit_handler+0x949/0xea0
(XEN)    [<ffff82d08031411a>] vmac.c#vmx_asm_vmexit_handler+0xfa/0x270

To solve it, move the vioapic_hwdom_map_gsi call outside of the
locked region in vioapic_write_redirent. vioapic_hwdom_map_gsi does
not access any of the vioapic fields, so there's no need to call the
function while holding the hvm.irq_lock.
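
The shape of the fix can be shown with two ordinary locks in Python. This is
a sketch of the lock-ordering idea only: the function names are shortened
stand-ins, and the two threading.Lock objects merely represent hvm.irq_lock
and the event lock.

```python
import threading

irq_lock = threading.Lock()    # stands in for hvm.irq_lock
event_lock = threading.Lock()  # stands in for the event lock


def map_gsi(log):
    """Stand-in for the mapping step, which needs the event lock."""
    with event_lock:
        log.append("map under event_lock")


def write_redirent(log):
    """Update state under irq_lock, then map the GSI after dropping it."""
    with irq_lock:
        log.append("update redirection entry under irq_lock")
        need_map = True
    # irq_lock is no longer held here, so taking event_lock cannot create
    # an irq_lock -> event_lock ordering.
    if need_map:
        map_gsi(log)


def dirq_assist(log):
    """The other path: event_lock first, then irq_lock."""
    with event_lock:
        with irq_lock:
            log.append("assert GSI under event_lock + irq_lock")
```

Had map_gsi() been called inside the irq_lock block, the two paths would
take the same pair of locks in opposite orders, which is the deadlock shown
in the traces above.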

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>