xenbits.xensource.com Git - people/liuw/xen.git/log

x86/shim: copy back the result of EVTCHNOP_status

The event channel data was not copied back to guest memory, fix this
by doing the copy.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/hvm: Update code in HVMOP_altp2m_set_suppress_ve

Originally the gfn and altp2m_idx are assigned from the a.u.mem_access union.
This works because it's the same memory used. This patch addresses this
issue by changing the mem_access union with the suppress_ve union for
consistency.

Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/pv: Fix !CONFIG_PV build following XSA-299

PTF_* are declared within CONFIG_PV, and used outside:

  mm.c: In function ‘_put_page_type’:
  mm.c:2819:32: error: ‘PTF_preemptible’ undeclared (first use in this function)
       bool preemptible = flags & PTF_preemptible;
                                  ^~~~~~~~~~~~~~~
  mm.c:2819:32: note: each undeclared identifier is reported only once for each
  function it appears in
  mm.c:2842:24: error: ‘PTF_partial_set’ undeclared (first use in this function)
           if ( !(flags & PTF_partial_set) )
                          ^~~~~~~~~~~~~~~
  mm.c: In function ‘put_page_type_preemptible’:
  mm.c:3090:33: error: ‘PTF_preemptible’ undeclared (first use in this function)
       return _put_page_type(page, PTF_preemptible, NULL);
                                   ^~~~~~~~~~~~~~~
  mm.c: In function ‘put_old_guest_table’:
  mm.c:3108:25: error: ‘PTF_preemptible’ undeclared (first use in this function)
                           PTF_preemptible |
                           ^~~~~~~~~~~~~~~
  mm.c:3110:27: error: ‘PTF_partial_set’ undeclared (first use in this function)
                             PTF_partial_set : 0 ),
                             ^~~~~~~~~~~~~~~
  mm.c: In function ‘put_page_type_preemptible’:
  mm.c:3091:1: error: control reaches end of non-void function
  [-Werror=return-type]
   }
   ^
  cc1: all warnings being treated as errors

Re-position the definitions to be outside of the #ifdef CONFIG_PV

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/arm64: Don't blindly unmask interrupts on trap without a change of level

Some of the traps without a change of the level (i.e. hypervisor ->
hypervisor) will unmask interrupts regardless the state of them in the
interrupted context.

One of the consequences is IRQ will be unmasked when receiving a
synchronous exception (used by WARN*()). This could result to unexpected
behavior such as deadlock (if a lock was shared with interrupts).

In a nutshell, interrupts should only be unmasked when it is safe to
do. Xen only unmask IRQ and Abort interrupts, so the logic can stay
simple:
    - hyp_error: All the interrupts are now kept masked. SError should
      be pretty rare and if ever happen then we most likely want to
      avoid any other interrupts to be generated. The potential main
      "caller" is during virtual SError synchronization on the exit
      path from the guest (see check_pending_vserror).

    - hyp_sync: The interrupts state is inherited from the interrupted
      context.

    - hyp_irq: All the interrupts but IRQ state are inherited from the
      interrupted context. IRQ is kept masked.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>

xen/arm32: Don't blindly unmask interrupts on trap without a change of level

Exception vectors will unmask interrupts regardless the state of them in
the interrupted context.

One of the consequences is IRQ will be unmasked when receiving an
undefined instruction exception (used by WARN*) from the hypervisor.
This could result to unexpected behavior such as deadlock (if a lock was
shared with interrupts).

In a nutshell, interrupts should only be unmasked when it is safe to do.
Xen only unmask IRQ and Abort interrupts, so the logic can stay simple.

As vectors exceptions may be shared between guest and hypervisor, we now
need to have a different policy for the interrupts.

On exception from hypervisor, each vector will select the list of
interrupts to inherit from the interrupted context. Any interrupts not
listed will be kept masked.

On exception from the guest, the Abort and IRQ will be unmasked
depending on the exact vector.

The interrupts will be kept unmasked when the vector cannot used by
either guest or hypervisor.

Note that each vector is not anymore preceded by ALIGN. This is fine
because the alignment is already bigger than what we need.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>

xen/arm32: entry: Fold the macro SAVE_ALL in the macro vector

Follow-up rework will require the macro vector to distinguish between
a trap from a guest vs while in the hypervisor.

The macro SAVE_ALL already has code to distinguish between the two and
it is only called by the vector macro. So fold the former into the
latter. This will help to avoid duplicating the check.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>

xen/arm32: entry: Split __DEFINE_ENTRY_TRAP in two

The preprocessing macro __DEFINE_ENTRY_TRAP is used to generate trap
entry function. While the macro is fairly small today, follow-up patches
will increase the size signicantly.

In general, assembly macros are more readable as they allow you to name
parameters and avoid '\'. So the actual implementation of the trap is
now switched to an assembly macro.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>

passthrough: quarantine PCI devices

When a PCI device is assigned to an untrusted domain, it is possible for
that domain to program the device to DMA to an arbitrary address. The
IOMMU is used to protect the host from malicious DMA by making sure that
the device addresses can only target memory assigned to the guest. However,
when the guest domain is torn down the device is assigned back to dom0,
thus allowing any in-flight DMA to potentially target critical host data.

This patch introduces a 'quarantine' for PCI devices using dom_io. When
the toolstack makes a device assignable (by binding it to pciback), it
will now also assign it to DOMID_IO and the device will only be assigned
back to dom0 when the device is made unassignable again. Whilst device is
assignable it will only ever transfer between dom_io and guest domains.
dom_io is actually only used as a sentinel domain for quarantining purposes;
it is not configured with any IOMMU mappings. Assignment to dom_io simply
means that the device's initiator (requestor) identifier is not present in
the IOMMU's device table and thus any DMA transactions issued will be
terminated with a fault condition.

In addition, a fix to assignment handling is made for VT-d.  Failure
during the assignment step should not lead to a device still being
associated with its prior owner. Hand the device to DomIO temporarily,
until the assignment step has completed successfully.  Remove the PI
hooks from the source domain then earlier as well.

Failure of the recovery reassign_device_ownership() may not go silent:
There e.g. may still be left over RMRR mappings in the domain assignment
to which has failed, and hence we can't allow that domain to continue
executing.

NOTE: This patch also includes one printk() cleanup; the
      "XEN_DOMCTL_assign_device: " tag is dropped in iommu_do_pci_domctl(),
      since similar printk()-s elsewhere also don't log such a tag.

This is XSA-302.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

xen/arm: p2m: Don't check the return of p2m_get_root_pointer() with BUG_ON()

It turns out that the BUG_ON() was actually reachable with well-crafted
hypercalls. The BUG_ON() is here to prevent catch logical error, so
crashing Xen is a bit over the top.

While all the holes should now be fixed, it would be better to downgrade
the BUG_ON() to something less fatal to prevent any more DoS.

The BUG_ON() in p2m_get_entry() is now replaced by ASSERT_UNREACHABLE()
to catch mistake in debug build and return INVALID_MFN for production
build. The interface also requires to set page_order to give an idea of
the size of "hole". So 'level' is now set so we report a hole of size of
the an entry of the root page-table. This stays inline with what happen
when the GFN is higher than p2m->max_mapped_gfn.

The BUG_ON() in p2m_resolve_translation_fault() is now replaced by
ASSERT_UNREACHABLE() to catch mistake in debug build and just report a
fault for producion build.

This is part of XSA-301.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: p2m: Avoid off-by-one check on p2m->max_mapped_gfn

The code base is using inconsistently the field p2m->max_mapped_gfn.
Some of the useres expect that p2m->max_guest_gfn contain the highest
mapped GFN while others expect highest + 1.

p2m->max_guest_gfn is set as highest + 1, because of that the sanity
check on the GFN in p2m_resolved_translation_fault() and
p2m_get_entry() can be bypassed when GFN == p2m->max_guest_gfn.

p2m_get_root_pointer(p2m->max_guest_gfn) may return NULL if it is
outside of address range supported and therefore the BUG_ON() could be
hit.

The current value hold in p2m->max_mapped_gfn is inconsistent with the
expectation of the common code (see domain_get_maximum_gpfn()) and also
the documentation of the field.

Rather than changing the check in p2m_translation_fault() and
p2m_get_entry(), p2m->max_mapped_gfn is now containing the highest
mapped GFN and the callers assuming "highest + 1" are now adjusted.

Take the opportunity to use 1UL rather than 1 as page_order could
theoritically big enough to overflow a 32-bit integer.

Lastly, the documentation of the field max_guest_gfn to reflect how it
is computed.

This is part of XSA-301.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: p2m: Avoid aliasing guest physical frame

The P2M helpers implementation is quite lax and will end up to ignore
the unused top bits of a guest physical frame.

This effectively means that p2m_set_entry() will create a mapping for a
different frame (it is always equal to gfn & (mask unused bits)). Yet
p2m->max_mapped_gfn will be updated using the original frame.

At the moment, p2m_get_entry() and p2m_resolve_translation_fault()
assume that p2m_get_root_pointer() will always return a non-NULL pointer
when the GFN is smaller than p2m->max_mapped_gfn.

Unfortunately, because of the aliasing described above, it would be
possible to set p2m->max_mapped_gfn high enough so it covers frame that
would lead p2m_get_root_pointer() to return NULL.

As we don't sanity check the guest physical frame provided by a guest, a
malicious guest could craft a series of hypercalls that will hit the
BUG_ON() and therefore DoS Xen.

To prevent aliasing, the function p2m_get_root_pointer() is now reworked
to return NULL If any of the unused top bits are not zero. The caller
can then decide what's the appropriate action to do. Since the two paths
(i.e. P2M_ROOT_PAGES == 1 and P2M_ROOT_PAGES != 1) are now very
similarly, take the opportunity to consolidate them making the code a
bit simpler.

With this change, p2m_get_entry() will not try to insert a mapping as
the root pointer is invalid.

Note that root_table is now switch to unsigned long as unsigned int is
not enough to hold part of a GFN.

This is part of XSA-301.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

x86/mm: Don't drop a type ref unless you held a ref to begin with

Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible.  This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately.  Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PGT_partial_set bit set, and the entry
should hold a general reference count.  During de-validation,
put_page_type() is called on partially validated entries.

Unfortunately, there are a number of issues with the current algorithm.

First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again: at which point the put_page_type may drop
the only page ref on an in-use page.  Some examples are listed in the
appendix.

The core issue is that put_page_type() is being called both to clean
up PGT_partial, and to drop a type count; and has no way of knowing
which is which; and so if in between, PGT_partial is cleared,
put_page_type() will drop the type ref erroneously.

What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated

Fix this by telling put_page_type() which of the two activities you
intend.

When cleaning up a partial de/validation, take no action unless you
find a page partially validated.

If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.

In put_page_from_lNe, pass partial_flags on to _put_page_type().

old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries).  Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().

While here, delete stray trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:

Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.

P1: PIN_L3_TABLE
  A -> PGT_l3_table | 1 | valid
  B -> PGT_l2_table | 1 | valid

P1: UNPIN_TABLE
  > Arrange to interrupt after B has been de-validated
  B:
    type_info -> PGT_l2_table | 0
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_enties -> (less than x)

P2: mod_l4_entry to point to A
  > Arrange for this to be interrupted while B is being validated
  B:
    type_info -> PGT_l2_table | 1 | partial
    (nr_validated_entires &c set as appropriate)
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> x
    partial_pte = 1

P3: mod_l3_entry some other unrelated l3 to point to B:
  B:
    type_info -> PGT_l2_table | 1

P1: Restart UNPIN_TABLE

At this point, since A.nr_validate_entries == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other l3

A similar issue arises with old_guest_table.  Consider the following
scenario:

Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.

V1:  PIN_L2_TABLE(A)
  <Validate until we try to validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V1 -> old_guest_table = A
  <delayed>

V2: PIN_L2_TABLE(A)
  <Pick up where V1 left off, try to re-validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V2 -> old_guest_table = A
  <restart>
  put_old_guest_table()
    _put_page_type(A)
      A -> PGT_l2_table | 0

V1: <restart>
  put_old_guest_table()
    _put_page_type(A) # UNDERFLOW

Indeed, it is possible to engineer for old_guest_table for every vcpu
a guest has to point to the same page.

x86/mm: Fix nested de-validation on error

If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.

The invariant for partial pages should be:

* Entries [0, nr_validated_ptes) should be completely validated;
  put_page_type() will de-validate these.

* If [nr_validated_ptes] is partially validated, partial_flags should
  set PTF_partiaL_set.  put_page_type() will be called on this page to
  finish off devalidation, and the appropriate refcount adjustments
  will be done.

alloc_l[2-3]_table() indicates partial validation to its callers by
setting current->old_guest_table.

Unfortunately, this is mishandled.

Take the case where validating lNe[x] returns an error.

First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be.  nr_validated_ptes
is set to x; and since PFT_partial_set clear, de-validation resumes at
nr_validated_ptes-1.  This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2.  (Any other page type will
fail.)

Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1.  When de-validating, since
partial is 0, this will correctly resume calling put_page_type at [x];
but, if the put_page_type() is never called, but instead
get_page_type() is called, validation will pick up at [x+1],
neglecting to validate [x].  If the rest of the validation succeeds,
the l4 will be validated even though [x] is invalid.

Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.

While here, add some safety catches:
- old_guest_table must point to the page contained in
  [nr_validated_ptes].
- alloc_l1_page shouldn't set old_guest_table

If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question.  If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop.  Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mm: Properly handle linear pagetable promotion failures

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated, and a general reference count is held.

Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped.  (This scenario can be engineered mainly by the
use of interleaving of promoting and demoting a page which has "linear
pagetable" entries; see the appendix for a sketch.)  This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.

Fix this by setting page->partial_flags to the partial_flags local
variable.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix

Suppose A and B can both be promoted to L2 pages, and A[x] points to B.

V1: PIN_L2 B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY pointing something to A.
  In the process of validating A[x], grab an extra type / ref on B:
  B.type_count = 2 | PGT_validated
  B.count = 3 | PGC_allocated
  A.type_count = 1 | PGT_validated
  A.count = 2 | PGC_allocated

V1: UNPIN B.
  B.type_count = 1 | PGT_validate
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY removing the reference to A.
  De-validate A, down to A[x], which points to B.
  Drop the final type on B.  Arrange to be interrupted.
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = -1

V2: MOD_L3_ENTRY adds a reference to A.

At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.

x86/mm: Collapse PTF_partial_set and PTF_partial_general_ref into one

...now that they are equivalent. No functional change intended.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mm: Always retain a general ref on partial

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page struct:
nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PFT_partial_set is set but the page in
question has been assigned to another domain.  A sketch is provided in
the appendix.

Fix this by having the parent page table entry hold a general
reference count whenever PFT_partial_set is set.  (For clarity of
change, keep two separate flags.  These will be collapsed in a
subsequent changeset.)

This has two basic implications.  On the put_page_from_lNe() side,
this mean that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.

Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_pte.

(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)

On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.

NB that since l1 promotion can't be preempted, that get_page_from_l2e
can't return -ERESTART.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
* Appendix: Engineering PTF_partial_set while a page belongs to a
  foreign domain

Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B.  B has
PGC_allocated set but no other general references.

V1:  PIN_L3 A.
  A is validated, B is validated.
  A.type_count = 1 | PGT_validated | PGT_pinned
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated (A[x] holds a general ref)

V1: UNPIN A.
  A begins de-validation.
  Arrange to be interrupted when i < x
  V1->old_guest_table = A
  V1->old_guest_table_ref_held = false
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = i < x
  B.type_count = 0
  B.count = 1 | PGC_allocated

V2: MOD_L4_ENTRY to point some l4e to A.
  Picks up re-validation of A.
  Arrange to be interrupted halfway through B's validation
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = PTF_partial_set

V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
  Validates B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated ("other l3e" holds a general ref)

V3: MOD_L3_ENTRY to clear l3e pointing to B.
  Devalidates B.
  B.type_count = 0
  B.count = 1 | PGC_allocated

V3: decrease_reservation(B)
  Clears PGC_allocated
  B.count = 0 => B is freed

B gets assigned to a different domain

V1: Restarts UNPIN of A
  put_old_guest_table(A)
    ...
      free_l3_table(A)

Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.

If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was freed; so B wouldn't be freed until after
put_page_from_l3e() had happend on A[x].

x86/mm: Have alloc_l[23]_table clear partial_flags when preempting

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, when alloc_l[23]_table check hypercall_preempt_check()
and return -ERESTART, they set nr_entries_validated, but don't clear
partial_flags.

If we were picking up from a previously-interrupted promotion, that
means that PTF_partial_set would be set even though
[nr_entries_validated] was not partially validated.  This means that
if the page in this state were de-validated, put_page_type() would
erroneously be called on that entry.

Perhaps worse, if we were racing with a de-validation, then we might
leave both PTF_partial_set and PTF_partial_general_ref; and when
de-validation picked up again, both the type and the general ref would
be erroneously dropped from [nr_entries_validated].

In a sense, the real issue here is code duplication.  Rather than
duplicate the interruption code, set rc to -EINTR and fall through to
the code which already handles that case correctly.

Given the logic at this point, it should be impossible for
partial_flags to be non-zero; add an ASSERT() to catch any changes.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mm: Rework get_page_and_type_from_mfn conditional

Make it easier to read by declaring the conditions in which we will
retain the ref, rather than the conditions under which we release it.

The only way (page == current->arch.old_guest_table) can be true is if
preemptible is true; so remove this from the query itself, and add an
ASSERT() to that effect on the opposite path.

No functional change intended.

NB that alloc_lN_table() mishandle the "linear pt failure" situation
described in the comment; this will be addressed in a future patch.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mm: Use flags for _put_page_type rather than a boolean

This is in mainly in preparation for _put_page_type taking the
partial_flags value in the future. It also makes it easier to read in
the caller (since you see a flag name rather than `true` or `false`).

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mm: Separate out partial_pte tristate into individual flags

At the moment, partial_pte is a tri-state that contains two distinct bits
of information:

1. If zero, the pte at index [nr_validated_ptes] is un-validated.  If
   non-zero, the pte was last seen with PGT_partial set.

2. If positive, the pte at index [nr_validated_ptes] does not hold a
   general reference count.  If negative, it does.

To make future patches more clear, separate out this functionality
into two distinct, named bits: PTF_partial_set (for #1) and
PTF_partial_general_ref (for #2).

Additionally, a number of functions which need this information also
take other flags to control behavior (such as `preemptible` and
`defer`).  These are hard to read in the caller (since you only see
'true' or 'false'), and ugly when many are added together.  In
preparation for adding yet another flag in a future patch, collapse
all of these into a single `flag` variable.

NB that this does mean checking for what was previously the '-1'
condition a bit more ugly in the put_page_from_lNe functions (since
you have to check for both partial_set and general ref); but this
clause will go away in a future patch.

Also note that the original comment had an off-by-one error:
partial_flags (like partial_pte before it) concerns
plNe[nr_validated_ptes], not plNe[nr_validated_ptes+1].

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mm: Don't re-set PGT_pinned on a partially de-validated page

When unpinning pagetables, if an operation is interrupted,
relinquish_memory() re-sets PGT_pinned so that the un-pin will
pickedup again when the hypercall restarts.

This is appropriate when put_page_and_type_preemptible() returns
-EINTR, which indicates that the page is back in its initial state
(i.e., completely validated). However, for -ERESTART, this leads to a
state where a page has both PGT_pinned and PGT_partial set.

This happens to work at the moment, although it's not really a
"canonical" state; but in subsequent patches, where we need to make a
distinction in handling between PGT_validated and PGT_partial pages,
this causes issues.

Move to a "canonical" state by:
- Only re-setting PGT_pinned on -EINTR
- Re-dropping the refcount held by PGT_pinned on -ERESTART

In the latter case, the PGT_partial bit will be cleared further down
with the rest of the other PGT_partial pages.

While here, clean up some trainling whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mm: L1TF checks don't leave a partial entry

On detection of a potential L1TF issue, most validation code returns
-ERESTART to allow the switch to shadow mode to happen and cause the
original operation to be restarted.

However, in the validation code, the return value -ERESTART has been
repurposed to indicate 1) the function has partially completed
something which needs to be undone, and 2) calling put_page_type()
should cleanly undo it.  This causes problems in several places.

For L1 tables, on receiving an -ERESTART return from alloc_l1_table(),
alloc_page_type() will set PGT_partial on the page.  If for some
reason the original operation never restarts, then on domain
destruction, relinquish_memory() will call free_page_type() on the
page.

Unfortunately, alloc_ and free_l1_table() aren't set up to deal with
PGT_partial.  When returning a failure, alloc_l1_table() always
de-validates whatever it's validated so far, and free_l1_table()
always devalidates the whole page.  This means that if
relinquish_memory() calls free_page_type() on an L1 that didn't
complete due to an L1TF, it will call put_page_from_l1e() on "page
entries" that have never been validated.

For L2+ tables, setting rc to ERESTART causes the rest of the
alloc_lN_table() function to *think* that the entry in question will
have PGT_partial set.  This will cause it to set partial_pte = 1.  If
relinqush_memory() then calls free_page_type() on one of those pages,
then free_lN_table() will call put_page_from_lNe() on the entry when
it shouldn't.

Rather than indicating -ERESTART, indicate -EINTR.  This is the code
to indicate that nothing has changed from when you started the call
(which is effectively how alloc_l1_table() handles errors).

mod_lN_entry() shouldn't have any of these types of problems, so leave
potential changes there for a clean-up patch later.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/PV: check GDT/LDT limits during emulation

Accesses beyond the LDT limit originating from emulation would trigger
the ASSERT() in pv_map_ldt_shadow_page(). On production builds such
accesses would cause an attempt to promote the touched page (offset from
the present LDT base address) to a segment descriptor one. If this
happens to succeed, guest user mode would be able to elevate its
privileges to that of the guest kernel. This is particularly easy when
there's no LDT at all, in which case the LDT base stored internally to
Xen is simply zero.

Also adjust the ASSERT() that was triggering: It was off by one to
begin with, and for production builds we also better use
ASSERT_UNREACHABLE() instead with suitable recovery code afterwards.

This is XSA-298.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()

Since c/s 1d429034 "hypercall: update vcpu_op to take an unsigned vcpuid",
which incorrectly swapped 'i' for 'u' in the parameter type list, guests have
been able to hit the BUG() in next_args()'s default case.

Correct these back to 'i'.

In addition, make adjustments to prevent this class of issue from occurring in
the future - crashing Xen is not an appropriate form of parameter checking.

Capitalise NEXT_ARG() to catch all uses, to highlight that it is a macro doing
non-function-like things behind the scenes, and undef it when appropriate.
Implement a bad_fmt: block which prints an error, asserts unreachable, and
crashes the guest.

On the ARM side, drop all parameter checking of p. It is asymmetric with the
x86 side, and akin to expecting memcpy() or sprintf() to check their src/fmt
parameter before use. A caller passing "" or something other than a string
literal will be obvious during code review.

This is XSA-296.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien.grall@arm.com>

x86/livepatch: Fail the build if duplicate symbols exist

The binary diffing algorithm used by xen-livepatch depends on having unique
symbols.

The livepatch loading algorithm used by Xen resolves relocations by symbol
name, and thus also depends on having unique symbols.

Introduce CONFIG_ENFORCE_UNIQUE_SYMBOLS to control failing the build if
duplicate symbols are found, and disable it in the RANDCONFIG build.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/nospec: Rename and rework l1tf-barrier as branch-harden

l1tf-barrier is an inappropriate name, and came about because of restrictions
on could be discussed publicly when the patches were proposed.

In practice, it is for general Spectre v1 mitigations, and is necessary in all
cases. An adversary which can control speculation in Xen can leak data in
cross-core (BCBS, etc) or remote (NetSpectre) scenarios - the problem is not
limited to just L1TF with HT active.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/nospec: Introduce CONFIG_SPECULATIVE_HARDEN_BRANCH

Just as with CONFIG_SPECULATIVE_HARDEN_ARRAY, branch hardening should be
configurable at compile time.

The previous CONFIG_HVM was a consequence of what could be discussed publicly
at the time the patches were submitted, and wasn't actually correct. Later
patches will make further corrections.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/nospec: Use always_inline to fix code gen for evaluate_nospec

evaluate_nospec() is incredibly fragile, and this is one giant bodge.

To correctly protect jumps, the generated code needs to be of the form:

    cmp/test <cond>
    jcc 1f
    lfence
    ...
1: lfence
    ...

Critically, the lfence must be at the head of both basic blocks, later in the
instruction stream than the conditional jump in need of protection.

When the compiler chooses to out-of-line the condition calculation (e.g. by
not inlining a predicate), the code layout can end up as:

pred:
    lfence
    <calculate cond>
    ret

    call pred
    cmp $0, %eax
    jcc 1f
    ...
1: ...

which breaks the speculative safety, as the lfences are earlier in the
instruction stream than the jump in need of protection.

Any use of evaluate_nospec() needs all static inline predicates which use it
to be declared always_inline to prevent the optimiser having the flexibility
to generate unsafe code.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/vtx: Fixes to Haswell/Broadwell LBR TSX errata

Cross reference and list all errata, now that they are published.

These errata are specific to Haswell/Broadwell. They should have model and
vendor checks, as Intel isn't the only vendor to implement VT-x.

All affected models use the same MSR indicies, so these can be hard coded
rather than looking up and storing constant values.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/vtx: Corrections to BDF93 errata workaround

At the time of fixing c/s 20f1976b44, no obvious errata had been published,
and BDF14 looked like the most obvious candidate.  Subsequently, BDF93 has
been published and it is obviously this.

The erratum states that LER_TO_LIP is the only affected MSR.  The provisional
fix in Xen adjusted LER_FROM_LIP, but this is not correct.  The FROM MSRs are
intended to have TSX metadata, and for steppings with TSX enabled, it will
corrupt the value the guest sees, while for parts with TSX disabled, it is
redundant with FIXUP_TSX.  Drop the LER_FROM_LIP adjustment.

Replace BDF14 references with BDF93, drop the redundant 'bdw_erratum_' prefix,
and use an Intel vendor check, as other vendors implement VT-x.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

CONTRIBUTING: drop blktap2 and add tools/libs

Blktap2 is gone and tools/libs is missing in the document.

Signed-off-by: Wei Liu <wl@xen.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

tools/ocaml: Fix build error with Arch Linux

gcc (GCC) 9.2.0 complains:

xentoollog_stubs.c: In function ‘stub_xtl_ocaml_vmessage’:
xentoollog_stubs.c:93:16: error: initialization discards ‘const’ qualifier from pointer target type [-Werror=discarded-qualifiers]
93 | value *func = caml_named_value(xtl->vmessage_cb) ;
| ^~~~~~~~~~~~~~~~

This patch constifies the pointer returned by caml_named_value in order
to the accommodate newer versions of OCaml.
In OCaml >= 4.09 the return value pointer of caml_named_value is
declared const.

https://github.com/ocaml/ocaml/commit/4f03a1467d29cf587df5a191830f1525506ee0e3

Signed-off-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wl@xen.org>
[ wei: remove spaces before semicolons ]
Signed-off-by: Wei Liu <wl@xen.org>

libxl: libxl__spawn_stub_dm: Call domain_config_setdefault

Previously, defaulting and checking of some aspects of the domain
config was skipped for stub dms.  This has been the case forever.

In ad011ad08843 "libxl/xl: Overhaul passthrough setting logic" some
defaulting that was needed for stub dms was moved from
libxl__domain_create_info_setdefault to .._config_setdefault with the
result that for stub dms, libxl__domain_make fails with this
assertion:
  xl: libxl_create.c:582: libxl__domain_make: Assertion
  `info->passthrough != LIBXL_PASSTHROUGH_DEFAULT' failed.

Fix this by properly doing all defaulting and all checking for stub
dms.  This is more correct, but (especially at this stage of the
release) it is necessary to more closely evaluate the effects by
reviewing the body of _config_setdefault.  The changes are as follows:

One actual functional change:

* The new passthrough defaulting is properly done.  This is what we
  are trying to actually fix here.

And a lot of things that make no difference:

* shadow_memkb would now be set.  Whether this would be correct is not
  entirely clear.  It seems better to make this patch (whose purpose
  is to fix the passthrough defaulting) *not* include that semantic
  change, so here I have included a hunk to explicitly override this.

* FLASK ssid_label is processed.  But the actual ssidref is copied
  from the guest domain by spawn_stub_dm, and ssid_label is set to
  NULL.  So no change.

* We set iommu_memkb.  But to 0 since passthrough is disabled.

* cpuid pool_name is processed.  But this is not set by
  spawn_stub_dm.  (Arguably this is a bug: stub dms should inherit the
  parent cpupool.)  The effect is to leave poolid set to 0 and call
  libxl_cpupoolid_is_valid but that always succeeds for 0.  So no
  change.

* Various extra checks are done: reject PCI passthrough for HVM with
  POD (stub dm is PV); reject pod + vnuma, or PV + vnuma (stub dm has
  no vnuma); reject nested HVM or pod, with alt2pm-hvm (again, stub dm
  is PV).  So these checks will always pass.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>

libxl: domain_config_setdefault: Document use of domid

We are going to want to call this from a site which has a domid which
is good for logging but not the domid of the domain we are creating
(namely, the stub device domain).

Consequently, add the same comment to
libxl__arch_passthrough_mode_setdefault.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>

SUPPORT.md: Add PV display/sound, update keyboard

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/arm: platform: fix Raspberry Pi compatible string

Both upstream [1] and downstream [2] Linux kernels use "brcm,bcm2711"
as the compatible string for Raspberry Pi 4. Add this string to our
platform compatible list.

The brcm,bcm2838 convention is abandoned. Remove it.

Rename the variables within the file to a rpi4_* prefix since the file
is meant to cover the Raspberry Pi 4 platform.

If you are using a device tree with the old compatible string
brcm,bcm2838, you will need to upgrade your device tree to one that has
the new brcm,bcm2711 compatible string.

[1] https://patchwork.kernel.org/patch/11165407/
[2] https://github.com/raspberrypi/linux/commit/53fdd7b8c8cb9c87190caab4fd459f89e1b4a7f8

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@dornerworks.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

efi: use directmap to access runtime services table

Do not require switching page tables to access (static) information in
the runtime services table itself, use directmap for this. This allows
exiting early from XEN_EFI_query_capsule_capabilities,
XEN_EFI_update_capsule and XEN_EFI_query_variable_info (in case of not
supported call) without all the impact of page table switch.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

efi: optionally call SetVirtualAddressMap()

Some UEFI implementations are not happy about lack of
SetVirtualAddressMap() call. Likely abuse the address map change
notification to do things beyond the necessary ConvertPointer() calls.
Specifically, wihtout the SetVirtualAddressMap() call, some access
EfiBootServices{Code,Data}, or even totally unmapped areas. Example
crash of GetVariable() call on Thinkpad W540:

    Xen call trace:
       [<0000000000000080>] 0000000000000080
       [<8c2b0398e0000daa>] 8c2b0398e0000daa

    Pagetable walk from ffffffff858483a1:
       L4[0x1ff] = 0000000000000000 ffffffffffffffff

    ****************************************
    Panic on CPU 0:
    FATAL PAGE FAULT
    [error_code=0002]
    Faulting linear address: ffffffff858483a1
    ****************************************

Fix this by calling SetVirtualAddressMap() runtime service, giving it
1:1 map for areas marked as needed during runtime. The address space in
which EFI runtime services are called is unchanged, but UEFI view of it
may be.
Since it's fairly late in Xen 4.13 development cycle, disable it
by default and hide behind EXPERT.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

efi: remove old SetVirtualAddressMap() arrangement

Remove unused (#ifdef-ed out) code. Reviving it in its current shape
won't fly because:
- SetVirtualAddressMap() needs to be called with 1:1 mapping, which
isn't the case at this time
- it uses directmap, which may go away soon
- it uses directmap, which is mapped with NX, breaking EfiRuntimeServicesCode

No functional change.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/nospec: Two trivial fixes

The include of asm/cpuid.h in spec_ctrl.c was an artefact of an older version
of c/s 3860d5534df, and is not used in its current incarnation.

Fix a typo in a comment.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/VT-d: Misc initialisation cleanup

* Initialise all spinlock fields together
* No need for an atomic set_bit() to initialise domid_bitmap
* Avoid using partial-line printk()'s.
* Style fixes (too many, and too few spaces)

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

iommu: translate IO-APIC pins when enabling interrupt remapping

On Intel hardware there's currently no translation of already enabled
IO-APIC pins when interrupt remapping is enabled on the IOMMU, hence
introduce a logic similar to the one used in x2apic_bsp_setup in order
to save and mask all IO-APIC pins, and then translate and restore them
after interrupt remapping has been enabled.

With this change the AMD specific logic to deal with enabled pins
(amd_iommu_setup_ioapic_remapping) can be removed, thus unifying the
handling of IO-APIC when enabling interrupt remapping regardless of
the IOMMU vendor.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x2APIC: translate IO-APIC entries when enabling the IOMMU

When interrupt remapping is enabled as part of enabling x2APIC the
IO-APIC entries also need to be translated to the new format and added
to the interrupt remapping table.

This prevents IOMMU interrupt remapping faults when booting on
hardware that has unmasked IO-APIC pins.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x2APIC: simplify resume

There's no need to save and restore the IO-APIC entries, the entries
prior to suspension have already been saved by ioapic_suspend, and
will be restored by ioapic_resume. Note that at the point where
resume_x2apic gets called the IO-APIC has not yet resumed, and hence
all entries should be masked.

Note this shouldn't introduce any functional change.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

MAINTAINERS: correct description of M:

Let's reflect reality, its use by add_maintainers.pl / get_maintainer.pl,
as well as what
https://wiki.xenproject.org/wiki/Submitting_Xen_Project_Patches says.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86: fix off-by-one in is_xen_fixed_mfn()

__2M_rwdata_end marks the first byte after the Xen image, not its last
byte. Subtract 1 to obtain the upper bound to compare against. (Note
that instead switching from <= to < is less desirable, as in principle
__pa() might return rubbish for addresses outside of the Xen image.)

Since the & needs to be dropped from the line in question, also drop it
from the adjacent one.

Reported-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

libxl: On ARM, reject future new passthrough modes too

This is most pleasantly done by also changing the if to a switch.

Suggested-by: Julien Grall <julien@xen.org>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>

libxl/xl: Overhaul passthrough setting logic

LIBXL_PASSTHROUGH_UNKNOWN (aka "ENABLED" in an earlier uncommitted
version of this code) is doing double duty.  We actually need all of
the following to be specifiable:
  * "default": enable PT iff we have devices to
    pass through specified in the initial config file.
  * "enabled" (and fail if the platform doesn't support it).
  * "disabled" (and reject future PT hotplug).
  * "share_pt"/"sync_pt": enable PT and set a specific PT mode.

Defaulting and error checking should be done in libxl.  So, we make
several changes here.

We introduce "enabled", and rename "unknown" to "default".

We move all of the error checking and defaulting code from xl into
libxl.  Now, libxl__domain_config_setdefault has all of the necessary
information to get this right.  So we can do it all there.  Choosing
the specific mode is arch-specific.

We can also arrange to have only one place each which calculates
(i) whether passthrough needs to be enabled because pt devices were
specified (ii) whether pt_share can be used (for each arch).

xl now only has to parse the enum in the same way as it parses all
other enums.

This change fixes a regression from earlier 4.13-pre: until recent
changes, passthrough was only enabled by default if passthrough
devices were specified.  We restore this behaviour.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Andrew Cooper <Andrew.Cooper3@citrix.com>
CC: Paul Durrant <pdurrant@gmail.com>
CC: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>

libxl: Move domain_create_info_setdefault earlier

We need this before we start to figure out the passthrough mode.

I have checked that nothing in libxl__domain_create_info_setdefault
nor the two implementations of ..._arch_... accesses anything else,
other than (i) the domain type (which this function is responsible for
setting and nothing before it looks at) (ii) c_info->ssidref (which is
defaulted by flask code near the top of
libxl__domain_config_setdefault and not accessed afterwards).

So no functional change.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

libxl: create: setdefault: Move physinfo into config_setdefault

No functional change. This will let us refer to it in code we are
about to add to this function.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

libxl: create: setdefault: Make libxl_physinfo info[1]

No functional change. This will let us make it into a pointer without
textual change other than to the definition.

While we are here, fix some style errors (missing { }).

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

libxl: Remove/deprecate libxl_get_required_*_memory from the API

These are now redundant because shadow_memkb and iommu_memkb are now
defaulted automatically by libxl_domain_need_memory and
libxl_domain_create etc.  Callers should not now call these; instead,
they should just let libxl take care of it.

libxl_get_required_shadow_memory was introduced in f89f555827a6
  "remove late (on-demand) construction of IOMMU page tables"
We can freely remove it because it was never in any release.

libxl_get_required_shadow_memory has been in libxl approximately
forever.  It should probably not have survived the creation of
libxl_domain_create, but it seems the API awkwardnesses we see in
recent commits prevented this.  So we have to keep it.  It remains
functional but we can deprecate it.  Hopefully we can get rid of it
completely before we find the need to change the calculation to use
additional information which its arguments do not currently supply.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

libxl: Move shadow_memkb and iommu_memkb defaulting into libxl

Defaulting is supposed to be done by libxl.  So these calculations
should be here in libxl.  libxl__domain_config_setdefault has all the
necessary information including the values of max_memkb and max_vcpus.

The overall functional effect depends on the caller:

For xl, no change.  The code moves from xl to libxl.

For callers who set one or both shadow_memkb and iommu_memkb (whether
from libxl_get_required_shadow_memory or otherwise) before calling
libxl_domain_need_memory (any version): the new code will leave their
setting(s) unchanged.

For callers who do not call libxl_domain_need_memory at all, and who
fail to set one of these memory values: now they are both are properly
set.  The shadow and iommu memory to be properly accounted for as
intended.

For callers which call libxl_domain_need_memory and request the
current API (4.13) or which track libxl, the default values are also
now right and everything works as intended.

For callers which call libxl_domain_need_memory, and request an old
pre-4.13 libxl API, and which leave one of these memkb settings unset,
we take special measures to preserve the old behaviour.

This means that they don't get the additional iommu memory and are at
risk of the domain running out of memory as a result of f89f555827a6
"remove late (on-demand) construction of IOMMU page tables".  But this
is no worse than the state just after f89f555827a6, which already
broke such callers in that way.  This is perhaps justifiable because
of the API stability warning next to libxl_domain_need_memory.

An alternative would be to drop the special-casing of these callers.
That would cause a discrepancy between libxl_domain_need_memory and
libxl_domain_create: the former would not include the iommu memory and
the latter would.  That seems worse, but it's debateable.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

libxl: libxl_domain_need_memory: Make it take a domain_config

This should calculate the extra memory needed for shadow and iommu,
the defaults for which depend on values in c_info.  So we need this to
have the complete domain config available.

And the defaults should actually be updated and stored.  So make it
non-const.

We provide the usual kind of compatibility function for callers
expecting 4.12 and earlier.  This function becomes responsible for the
clone-and-modify of the b_info.

No overall functional change for external libxl callers which use the
API version system to request a particular API version.

Other external libxl callers will need to update their calling code,
and will then find that the new version of this function fills in most
of the defaults in d_config.  Because libxl__domain_config_setdefault
doesn't quite do all of the defaults, that's only partial.  For
present purposes that doesn't matter because none of the missing
settings are used by the memory calculations.  It does mean we need to
document in the API spec that the defaulting is only partial.

This lack of functional change is despite the fact that
numa_place_domain now no longer calls
libxl__domain_build_info_setdefault (via libxl_domain_need_memory).
That is OK because it's idempotent and numa_place_domain's one call
site is libxl__build_pre which is called from libxl__domain_build
which is called from domcreate_bootloader_done, well after the
defaults are set by initiate_domain_create.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

libxl: libxl__domain_config_setdefault: New function

Break out this into a new function.  We are going to want to call it
from a new call site.

Unfortunately not all of the defaults can be moved into the new
function without changing the order in which things are done.  That
does not seem wise at this stage of the release.  The effect is that
additional calls to libxl__domain_config_setdefault (which are going
to be introduced) do not quite set everything.  But they will do what
is needed.  After Xen 4.13 is done, we should move those settings into
the right order.

No functional change.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xl: Pass libxl_domain_config to freemem(), instead of b_info

We are going to change the libxl API in a moment and this change will
make it simpler.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

libxl: Offer API versions 0x040700 and 0x040800

According to git log -G:

0x040700 was introduced in 304400459ef0 (aka 4.7.0-rc1~481)
  "tools/libxl: rename remus device to checkpoint device"

0x040800 was introduced in 57f8b13c7240 (aka 4.8.0-rc1~437)
  "libxl: memory size in kb requires 64 bit variable"

It is surprising that no-one noticed this.

Anyway, in the meantime, we should fix it.  Backporting this is
probably a good idea: it won't change the behaviour for existing
callers but it will avoid errors for some older correct uses.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

MAINTAINERS: Switch SVM maintainership to x86

We are now down to 0 SVM maintainers who are active and wish to hold the
position. In agreement with AMD, Jan and I will take over maintainership in
the short term.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen/arm: domain_build: Don't expose IOMMU specific properties to hwdom

We always skip the IOMMU device when creating DT for hwdom if there is
an appropriate driver for it in Xen (device_get_class(iommu_node)
returns DEVICE_IOMMU). So, even if it is not used by Xen it will be skipped.

We should also skip the IOMMU specific properties of the master device
behind that IOMMU in order to avoid exposing an half complete IOMMU
bindings to hwdom.

According to the Linux's docs:
1. Documentation/devicetree/bindings/iommu/iommu.txt
2. Documentation/devicetree/bindings/pci/pci-iommu.txt

Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen: Fix strange byte in common/Kconfig

Current description of the file by `file`:
common/Kconfig: Non-ISO extended-ASCII text

Change that byte to an ascii quote so the file can become properly
encoded, and all ASCII.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/tsc: update vcpu time info on guest TSC adjustments

If a HVM/PVH guest writes to MSR_IA32_TSC{_ADJUST} and thus changes
the value of the time stamp counter the vcpu time info must also be
updated, or the time calculated by the guest using the Xen PV clock
interface will be skewed.

Update the vcpu time info when the guest writes to either MSR_IA32_TSC
or MSR_IA32_TSC_ADJUST. This fixes lockups seen when running the
pv-shim on AMD hardware, since the shim will aggressively try to keep
TSCs in sync by periodically writing to MSR_IA32_TSC if the TSC is not
reliable.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/pvhsim: fix cpu onlining

Since commit 8d3c326f6756d1 ("xen: let vcpu_create() select processor")
the initial processor for all pv-shim vcpus will be 0, as no other cpus
are online when the vcpus are created. Before that commit the vcpus
would have processors set not being online yet, which worked just by
chance.

When the pv-shim vcpu becomes active it will have a hard affinity
not matching its initial processor assignment leading to failing
ASSERT()s or other problems depending on the selected scheduler.

Fix that by doing the affinity setting after onlining the cpu but
before taking the vcpu up. For vcpu 0 this is still in
sched_setup_dom0_vcpus(), for the other vcpus setting the affinity
there can be dropped.

Fixes: 8d3c326f6756d1 ("xen: let vcpu_create() select processor")
Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

x86/vvmx: Fix the use of RDTSCP when it is intercepted at L0

Linux has started using RDTSCP as of v5.1.  This has highlighted a bug in Xen,
where virtual vmexit simply gives up.

  (XEN) d1v1 Unhandled nested vmexit: reason 51
  (XEN) domain_crash called from vvmx.c:2671
  (XEN) Domain 1 (vcpu#1) crashed on cpu#2:

Handle RDTSCP in the virtual vmexit hander in the same was as RDTSC
intercepts.

Reported-by: Sarah Newman <srn@prgmr.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Chris Brannon <cmb@prgmr.com>
Reviewed-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/VT-d: Drop unhelpful information in diagnostics

The virtual address of the base of the IOMMU's regsters is not useful for
diagnostic purposes, and is quite voluminous. The PCI coordinates is by far
the most useful piece of identifying information.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs: Extend with details about runtime microcode loading

The xen-ucode utility is new with the late loading improvements in 4.13.
Update the documentation suitably.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen/arm: domain_build: Indent correctly parameters of alloc_bank_memory()

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/arm: mm: Allow generic xen page-tables helpers to be called early

The current implementations of xen_{map, unmap}_table() expect
{map, unmap}_domain_page() to be usable. Those helpers are used to
map/unmap page tables while update Xen page-tables.

Since commit 022387ee1a "xen/arm: mm: Don't open-code Xen PT update in
{set, clear}_fixmap()", setup_fixmap() will make use of the helpers
mentioned above. When booting Xen using GRUB, setup_fixmap() may be used
before map_domain_page() can be called. This will result to data abort:

(XEN) Data Abort Trap. Syndrome=0x5
(XEN) CPU0: Unexpected Trap: Data Abort

[...]

(XEN) Xen call trace:
(XEN)    [<000000000025ab6c>] mm.c#xen_pt_update+0x2b4/0x59c (PC)
(XEN)    [<000000000025ab20>] mm.c#xen_pt_update+0x268/0x59c (LR)
(XEN)    [<000000000025ae70>] set_fixmap+0x1c/0x2c
(XEN)    [<00000000002a9c98>] copy_from_paddr+0x7c/0xdc
(XEN)    [<00000000002a4ae0>] has_xsm_magic+0x18/0x34
(XEN)    [<00000000002a5b5c>] bootfdt.c#early_scan_node+0x398/0x560
(XEN)    [<00000000002a5de0>] device_tree_for_each_node+0xbc/0x144
(XEN)    [<00000000002a5ed4>] boot_fdt_info+0x6c/0x260
(XEN)    [<00000000002ac0d0>] start_xen+0x108/0xc74
(XEN)    [<000000000020044c>] arm64/head.o#paging+0x60/0x88

During early boot, the page tables are either statically allocated in
Xen binary or allocated via alloc_boot_pages().

For statically allocated page-tables, they will already be mapped as
part of Xen binary. So we can easily find the virtual address.

For dynamically allocated page-tables, we need to rely
map_domain_page() to be functionally working.

For arm32, the call will be usable much before page can be dynamically
allocated (see setup_pagetables()). For arm64, the call will be usable
after setup_xenheap_mappings().

In both cases, memory are given to the boot allocator afterwards. So we
can rely on map_domain_page() for mapping page tables allocated
dynamically.

The helpers xen_{map, unmap}_table() are now updated to take into
account the case where page-tables are part of Xen binary.

Fixes: 022387ee1a ('xen/arm: mm: Don't open-code Xen PT update in {set, clear}_fixmap()')
Signed-off-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: setup: Calculate correctly the size of Xen

The current size of Xen is computed using _end - _start + 1. However,
_end is pointing one past the end of Xen, so the size of Xen is
off-by-one.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: Don't use _end in is_xen_fixed_mfn()

virt_to_maddr() is using the hardware page-table walk instructions to
translate a virtual address to physical address. The function should
only be called on virtual address mapped.

_end points past the end of Xen binary and may not be mapped when the
binary size is page-aligned. This means virt_to_maddr() will not be able
to do the translation and therefore crash Xen.

Note there is also an off-by-one issue in this code, but the panic will
trump that.

Both issues can be fixed by using _end - 1 in the check.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

golang/xenlight: fix calls to libxl_domain_unpause/pause

These functions require a third argument of type const *libxl_asyncop_how.

Pass nil to fix compilation errors. This will have the effect of
performing these operations synchronously.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs/sphinx: Introduction

Put together an introduction page for the Sphinx/RST docs, along with a
glossary which will accumulate over time.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Lars Kurth <lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

MAINTAINERS: drop Tim Deegan from 'The Rest'

I have not been active in this role for a while now.

Signed-off-by: Tim Deegan <tim@xen.org>

xen/arm: mm: Clear boot pagetables before bringing-up each secondary CPU

At the moment, boot pagetables are only cleared once at boot. This means
when booting CPU2 (and onwards) then boot pagetables will not be
cleared.

To keep the interface exactly the same for all secondary CPU, the boot
pagetables are now cleared before bringing-up each secondary CPU.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/arm: domain_build: Print the correct domain in dtb_load()

dtb_load() can be called by other domain than dom0. To avoid confusion
in the log, print the correct domain.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

stubdom/vtpm: include stdio.h for declaration of printf

The function read_vtpmblk uses printf(3), but stdio.h is not included
in this file. This results in a warning from gcc-7:

vtpmblk.c: In function 'read_vtpmblk':
vtpmblk.c:322:7: warning: implicit declaration of function 'printf' [-Wimplicit-function-declaration]
printf("Expected: ");
vtpmblk.c:322:7: warning: incompatible implicit declaration of built-in function 'printf'
vtpmblk.c:322:7: note: include '<stdio.h>' or provide a declaration of 'printf'

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs/sphinx: Indent cleanup

Sphinx, its linters, and RST modes in common editors, expect 3 spaces of
indentation. Some bits already conform to this expectation. Update the
rest to match.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Lars Kurth <lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/microcode: Drop trailing whitespace in printk()

This has actually been present since c/s bd7c09c0 in 2008, and survived
through all of the recent microcode refactoring.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

Prep for 4.13.0-rc1: Set version to -rc

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

Prep for 4.13.0-rc1: Pin QEMU_* and MINIOS to tags

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

xen/arm: domain_build: harden make_cpus_node()

make_cpus_node() is using a static buffer to generate the FDT node name.
While mpdir_aff is a 64-bit integer, we only ever use the bits [23:0] as
only AFF{0, 1, 2} are supported for now.

To avoid any potential issues in the future, check that mpdir_aff has
only bits [23:0] set.

Take the opportunity to reduce the size of the buffer. Indeed, only 8
characters are needed to print a 32-bit hexadecimal number. So
sizeof("cpu@") + 8 + 1 (for '\0') = 13 characters is sufficient.

Fixes: c81a791d34 (xen/arm: Set 'reg' of cpu node for dom0 to match MPIDR's affinity)
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/mm: don't needlessly veto migration

Now that xl.cfg has an option to explicitly enable IOMMU mappings for a
domain, migration may be needlessly vetoed due to the check of
is_iommu_enabled() in paging_log_dirty_enable().
There is actually no need to prevent logdirty from being enabled unless
devices are assigned to a domain.

NOTE: While in the neighbourhood, the bool_t parameter type in
paging_log_dirty_enable() is replaced with a bool and the format
of the comment in assign_device() is fixed.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/efi: properly handle 0 in pixel reserved bitmask

In some graphics modes firmware is allowed to return 0 in pixel reserved
bitmask which doesn't go against UEFI Spec 2.8 (12.9 Graphics Output Protocol).

Without this change non-TrueColor modes won't work which will cause
GOP init to fail - observed while trying to boot EFI Xen with Cirrus VGA.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/docs: arm: Update dom0less binding and example

The binding for the dom0less module does not match Xen implementation.
Any module should contain the compatible "multiboot,module" to be
recognized.

This was clearly an oversight as other examples with Xen code base
provide the compatible "multiboot,module".

So fix the binding and the example associated to it.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/hvm: Fix the use of "hap=0" following c/s c0902a9a143a

c/s c0902a9a143a refactored hvm_enable() a little, but dropped the logic which
cleared hap_supported in the case that the user had asked for it off.

This results in Xen booting up, claiming:

(XEN) HVM: Hardware Assisted Paging (HAP) detected but disabled

but with HAP advertised via sysctl, and XEN_DOMCTL_CDF_hap being accepted in
domain_create().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

pci: clear {host/guest}_maskall field on assign

The current implementation of host_maskall makes it sticky across
assign and deassign calls, which means that once a guest forces Xen to
set host_maskall the maskall bit is not going to be cleared until a
call to PHYSDEVOP_prepare_msix is performed. Such call however
shouldn't be part of the normal flow when doing PCI passthrough, and
hence the flag needs to be cleared when assigning in order to prevent
host_maskall being carried over from previous assignations.

Note that the entry maskbit is reset when the msix capability is
initialized, and the guest_maskall field is also cleared so that the
hardware value matches Xen's internal state (hardware maskall =
host_maskall | guest_maskall).

Also note that doing the reset of host_maskall there would allow the
guest to reset such field by enabling and disabling MSIX, which is not
intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

efi/boot: make sure graphics mode is set while booting through MB2

If a bootloader is using native driver instead of EFI GOP it might
reset graphics mode to be different from what has been originally set
by firmware. While booting through MB2 Xen either need to parse video
setting passed by MB2 and use them instead of what GOP reports or
reset the mode to synchronise it with firmware - prefer the latter.

Observed while booting Xen using MB2 with EFI GRUB2 compiled with
all possible video drivers where native drivers take priority over firmware.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

efi/boot: add missing pointer dereference in set_color

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

AMD/IOMMU: pre-fill all DTEs right after table allocation

Make sure we don't leave any DTEs unexpected requests through which
would be passed through untranslated. Set V and IV right away (with
all other fields left as zero), relying on the V and/or IV bits
getting cleared only by amd_iommu_set_root_page_table() and
amd_iommu_set_intremap_table() under special pass-through circumstances.
Switch back to initial settings in amd_iommu_disable_domain_device().

Take the liberty and also make the latter function static, constifying
its first parameter at the same time, at this occasion.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

AMD/IOMMU: allow callers to request allocate_buffer() to skip its memset()

The command ring buffer doesn't need clearing up front in any event.
Subsequently we'll also want to avoid clearing the device tables.

While playing with functions signatures replace undue use of fixed width
types at the same time, and extend this to deallocate_buffer() as well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

AMD/IOMMU: allocate one device table per PCI segment

Having a single device table for all segments can't possibly be right.
(Even worse, the symbol wasn't static despite being used in just one
source file.) Attach the device tables to their respective IVRS mapping
ones.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

iommu/arm: Remove arch_iommu_populate_page_table() completely

The Arm realization should have been removed in the following commit
as redundant:
f89f555 remove late (on-demand) construction of IOMMU page tables

So, remove unused function completely.

Fixes: f89f555 ('remove late (on-demand) construction of IOMMU page tables')
Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/arm: fix duplicate memory node in DT

When reserved-memory regions are present in the host device tree, dom0
is started with multiple memory nodes. Each memory node should have a
unique name, but today they are all called "memory" leading to Linux
printing the following warning at boot:

OF: Duplicate name in base, renamed to "memory#1"

This patch fixes the problem by appending a "@<unit-address>" to the
name, as per the Device Tree specification, where <unit-address> matches
the base of address of the first region.

Fixes: 248faa637d2 (xen/arm: add reserved-memory regions to the dom0 memory node)
Reported-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/arm: make_memory_node return error on nr_banks == 0

Call make_memory_node for reserved_memory only if we actually have any
reserved_memory regions to handle.

Add a check in make_memory_node to return an error if it has been called
with no memory banks as argument.

Fixes: 248faa637d2 (xen/arm: add reserved-memory regions to the dom0 memory node)
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs: update all URLs in man pages

Specifically
* xen.org to xenproject.org
* http to https
* Replaced pages where page has moved
(including on xen pages as well as external pages)
* Removed some URLs (e.g. downloads for Linux PV drivers)

Tested-by: Lars Kurth <lars.kurth@citrix.com>
Signed-off-by: Lars Kurth <lars.kurth@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/sched: let credit scheduler control its timer all alone

The credit scheduler is the only scheduler with tick_suspend and
tick_resume callbacks. Today those callbacks are invoked without being
guarded by the scheduler lock which is critical when at the same the
cpu those callbacks are active is being moved to or from a cpupool.

Crashes like the following are possible due to that race:

(XEN) ----[ Xen-4.13.0-8.0.12-d  x86_64  debug=y   Not tainted ]----
(XEN) CPU:    79
(XEN) RIP:    e008:[<ffff82d0802467dc>] set_timer+0x39/0x1f7
(XEN) RFLAGS: 0000000000010002   CONTEXT: hypervisor
<snip>
(XEN) Xen call trace:
(XEN)    [<ffff82d0802467dc>] set_timer+0x39/0x1f7
(XEN)    [<ffff82d08022c1f4>]
sched_credit.c#csched_tick_resume+0x54/0x59
(XEN)    [<ffff82d080241dfe>] sched_tick_resume+0x67/0x86
(XEN)    [<ffff82d0802eda52>] mwait-idle.c#mwait_idle+0x32b/0x357
(XEN)    [<ffff82d08027939e>] domain.c#idle_loop+0xa6/0xc2
(XEN)
(XEN) Pagetable walk from 0000000000000048:
(XEN)  L4[0x000] = 00000082cfb9c063 ffffffffffffffff
(XEN)  L3[0x000] = 00000082cfb9b063 ffffffffffffffff
(XEN)  L2[0x000] = 00000082cfb9a063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 79:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 0000000000000048
(XEN) ****************************************

The callbacks are used when the cpu is going to or coming from idle in
order to allow higher C-states.

The credit scheduler knows when it is going to schedule an idle
scheduling unit or another one after idle, so it can easily stop or
resume the timer itself removing the need to do so via the callback.
As this timer handling is done in the main scheduling function the
scheduler lock is still held, so the race with cpupool operations can
no longer occur. Note that calling the callbacks from schedule_cpu_rm()
and schedule_cpu_add() is no longer needed, as the transitions to and
from idle in the cpupool with credit active will automatically occur
and do the right thing.

With the last user of the callbacks gone those can be removed.

Suggested-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

xen/xsm: flask: Check xmalloc_array() return in security_sid_to_context()

xmalloc_array() may return NULL if there are memory. Rather than trying
to deference it directly, we should check the return value first.

Coverity-ID: 1381852
Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/xsm: flask: Prevent NULL deference in flask_assign_{, dt}device()

flask_assign_{, dt}device() may be used to check whether you can test if
a device is assigned. In this case, the domain will be NULL.

However, flask_iommu_resource_use_perm() will be called and may end up
to deference a NULL pointer. This can be prevented by moving the call
after we check the validity for the domain pointer.

Coverity-ID: 1486741
Fixes: 71e617a6b8 ('use is_iommu_enabled() where appropriate...')
Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Reviewed-by: Paul Durrant <paul@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/Kconfig: Invert the defaults for CONFIG_{PVH_GUEST,PV_SHIM}

This is a minor UI change, but users which have elected to enable
XEN_GUEST (which still defaults to no) will definitely need one of these
options, and will typically want both.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/nospec: Introduce CONFIG_SPECULATIVE_HARDEN_ARRAY

There are legitimate circumstance where array hardening is not wanted or
needed. Allow it to be turned off.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec-ctrl: Annotate remaining model names

The names in retpoline_safe() are copied from should_use_eager_fpu(). The
names in mds_calculations() come partly from Linux's intel-family.h, and
partly from conversations with Intel.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>