Jan Beulich [Tue, 18 Dec 2018 14:21:17 +0000 (15:21 +0100)]
x86emul: avoid triggering assertions with VME/PVI early #GP check
In commit efe9cba66c ("x86emul: VME and PVI modes require a #GP(0) check
first thing") I neglected the fact that the retire flags get zapped only
in x86_decode(), which hasn't been invoked yet at the point of the #GP(0)
check added. Move output state initialization into a helper function,
and invoke it from the callers of x86_decode() instead of doing it
(possibly too late) in that function.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 18 Dec 2018 14:19:47 +0000 (15:19 +0100)]
x86emul: work around SandyBridge errata
There are a number of exception condition related errata on SandyBridge
CPUs, some of which are unexpected #UD (others, of no interest here, are
lack of mandated exceptions, or exceptions of unexpected type). Annotate
the one workaround we already have, and add two more.
Due to the exception recovery we have in place for stub invocations
these aren't security issues.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 18 Dec 2018 13:27:09 +0000 (14:27 +0100)]
x86emul: fix 3-operand IMUL
While commit 75066cd4ea ("x86emul: fix {,i}mul and {,i}div") indeed did
as its title says, it broke the 3-operand form by uniformly using AL/AX/
EAX/RAX as second source operand. Fix this and add tests covering both
cases.
Reported-by: Andrei Lutas <vlutas@bitdefender.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
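To illustrate the difference between the two forms, here is a minimal sketch of the 32-bit semantics. The helper names (imul32, imul_1op, imul_3op) and the result structure are hypothetical and not taken from the emulator; the sketch only shows that the 3-operand form takes its second source from the immediate rather than from EAX. It models only the low 32 bits of the result and relies on the usual two's complement truncation done by GCC and Clang.

```c
#include <stdbool.h>
#include <stdint.h>

/* Result of a signed multiply: truncated value plus the CF/OF indication. */
struct imul_result {
    int32_t value;   /* low 32 bits of the product */
    bool overflow;   /* CF = OF = 1 when the full product does not fit */
};

/* Common core: multiply two signed 32-bit sources. */
static struct imul_result imul32(int32_t src1, int32_t src2)
{
    int64_t full = (int64_t)src1 * src2;
    struct imul_result r = { (int32_t)full, full != (int32_t)full };

    return r;
}

/* One-operand form: EAX is implicitly the second source. */
static struct imul_result imul_1op(int32_t eax, int32_t rm)
{
    return imul32(eax, rm);
}

/* Three-operand form: dst = r/m * imm; EAX plays no role. */
static struct imul_result imul_3op(int32_t rm, int32_t imm)
{
    return imul32(rm, imm);
}
```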
Jan Beulich [Tue, 18 Dec 2018 13:26:44 +0000 (14:26 +0100)]
x86emul/test: drop another instance of .byte
Now that we require use of the {evex} pseudo-prefix, we can also use
the q-suffixed encoding of VPCMPESTRI, which is available as of 2.29
just like {evex} is.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 30 Nov 2018 16:14:08 +0000 (16:14 +0000)]
x86/hvm: Corrections to RDTSCP intercept handling
For both VT-x and SVM, the RDTSCP intercept will trigger if the pipeline
supports the instruction, but the guest may not have RDTSCP in its featureset.
Bring the vmexit handlers in line with the main emulator behaviour by
optionally handing back #UD.
Next on the AMD side, if RDTSCP actually ends up being intercepted on a debug
build or first-gen SVM hardware which lacks NRIP, we first update regs->rcx,
then call __get_instruction_length() asking for RDTSC. As the two
instructions are different (and indeed, different lengths!),
__get_instruction_length_from_list() fails and hands back a #GP fault.
This can be demonstrated by putting a guest into tsc_mode="always emulate" and
executing an RDTSCP instruction:
(d1) --- Xen Test Framework ---
(d1) Environment: HVM 64bit (Long mode 4 levels)
(d1) Test rdtscp
(d1) TSC mode 1
(XEN) emulate.c:147:d1v0 __get_instruction_length: Mismatch between expected and actual instruction:
(XEN) emulate.c:152:d1v0 insn_index 8, opcode 0xf0031 modrm 0
(XEN) emulate.c:154:d1v0 rip 0x10475f, nextrip 0x104762, len 3
(XEN) SVM insn len emulation failed (1): d1v0 64bit @ 0008:0010475f -> 0f 01 f9 0f 31 5b 31 ff 31 c0 e9 c2 db ff ff 00
(d1) ******************************
(d1) PANIC: Unhandled exception at 0008:000000000010475f
(d1) Vec 13 #GP[0000]
(d1) ******************************
First, teach __get_instruction_length() to cope with RDTSCP, and improve
svm_vmexit_do_rdtsc() to ask for the correct instruction. Move the regs->rcx
adjustment into this function to ensure it gets done after we are done
potentially raising faults.
Reported-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
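The mismatch is visible in the instruction bytes of the log above: the stream starts with 0f 01 f9 (RDTSCP, 3 bytes), followed by 0f 31 (RDTSC, 2 bytes). A minimal sketch of an "expected instruction" check follows; the enum and function names are hypothetical and this is not the actual __get_instruction_length() logic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical identifiers for the two instructions of interest. */
enum tsc_insn { INSN_RDTSC, INSN_RDTSCP };

/*
 * Return the length of the expected instruction if the bytes at 'buf'
 * match its encoding, or 0 on mismatch.
 */
static size_t match_tsc_insn(enum tsc_insn expected,
                             const uint8_t *buf, size_t avail)
{
    static const uint8_t rdtsc[]  = { 0x0f, 0x31 };        /* RDTSC  */
    static const uint8_t rdtscp[] = { 0x0f, 0x01, 0xf9 };  /* RDTSCP */
    const uint8_t *enc = expected == INSN_RDTSCP ? rdtscp : rdtsc;
    size_t len = expected == INSN_RDTSCP ? sizeof(rdtscp) : sizeof(rdtsc);

    if ( avail < len || memcmp(buf, enc, len) != 0 )
        return 0;

    return len;
}
```

Asking for RDTSC at the start of that byte stream fails, while asking for RDTSCP yields the length of 3 reported in the log.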
Juergen Gross [Mon, 10 Dec 2018 11:44:22 +0000 (12:44 +0100)]
xen: add CONFIG item for default dom0 memory size
Now that a dom0_mem value can be specified depending on host memory
size on x86, make it easy for distros to specify a default dom0 size by
adding a CONFIG_DOM0_MEM item which presets the dom0_mem boot parameter
value.
It will be used only if no dom0_mem parameter was specified in the
boot parameters.
Andrii Anisov [Wed, 12 Dec 2018 18:20:55 +0000 (20:20 +0200)]
arm/irq: skip action availability check for non-debug build
Under desc->lock taken:
An IRQ with _IRQ_GUEST flag set always has an action.
An IRQ with _IRQ_DISABLED flag cleared always has an action.
Those flags checks cover all accesses to desc->action in do_IRQ,
so we can skip desc->action check in non-debug build.
Keep it in place for debug builds to help diagnose potential
misconfiguration.
Andrii Anisov [Wed, 12 Dec 2018 18:20:54 +0000 (20:20 +0200)]
gic-vgic: Drop an excessive clear_lrs
This action is excessive because for an invalid LR there is no need
to write another invalid value to a register. So we can skip it here,
saving a peripheral register write.
Keep clearing the LR for the DEBUG build. This would make dumped
invalid LRs be zero. That is more obvious than picking state bits
from a non-zero value.
Paul Durrant [Thu, 13 Dec 2018 11:01:50 +0000 (12:01 +0100)]
amd-iommu: remove page merging code
The page merging logic makes use of bits 1-8 and bit 63 of a PTE, which
used to be specified as 'ignored'. However, bits 5 and 6 are now specified
as 'accessed' and 'dirty' bits and their use only remains safe as long as
the DTE 'Host Access Dirty' bits remain unused by Xen, or by hardware
before the domain starts running. (XSA-275 disabled the operation of the
code after domain creation completes).
With the page merging logic present in its current form there are no spare
ignored bits in the PTE at all, but PV-IOMMU support will require at least
one spare bit to track which PTEs are added by hypercall.
This patch removes the code, freeing up the remaining PTE ignored bits
for other use, including PV-IOMMU support, as well as significantly
simplifying and shortening the source by ~170 lines. There may be some
marginal performance cost (but none has been observed in manual testing
with a passed-through NVIDIA GPU) since higher order mappings will now be
ruled out until a mapping order parameter is passed to iommu_ops. That will
be dealt with by a subsequent patch though.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
Julien Grall [Thu, 29 Nov 2018 11:37:43 +0000 (11:37 +0000)]
xen/arm: mm: Set-up page permission for Xen mappings earlier on
Xen mappings are first created using a 2MB page and then shattered into 4KB
pages for fine-grained permissions. However, it is not safe to break down a
superpage without going through an intermediate step invalidating
the entry.
As we are changing Xen mappings, we cannot go through the intermediate
step. The only solution is to create the Xen mappings using 4KB entries
directly. As Xen should always access the mappings according to
the runtime permissions, it is then possible to set up the permissions
while creating the mappings.
We are still playing with fire as there are still some
break-before-make issues in setup_pagetables (i.e. switching between 2 sets of
page-tables). But it should be slightly better than the current state.
Julien Grall [Thu, 29 Nov 2018 19:02:09 +0000 (19:02 +0000)]
xen/arm: p2m: Rework p2m_cache_flush_range
A follow-up patch will add support for preemption in p2m_cache_flush_range.
Because of the complexity of the 2 loops, it would be necessary to add
preemption in both of them.
This can be avoided by merging the 2 loops together and still keeping
the code fairly simple to read and extend.
Julien Grall [Mon, 26 Nov 2018 14:25:54 +0000 (14:25 +0000)]
xen/arm: traps: Rework leave_hypervisor_tail
The function leave_hypervisor_tail is called before each return to the
guest vCPU. It has two main purposes:
1) Process physical CPU work (e.g rescheduling) if required
2) Prepare the physical CPU to run the guest vCPU
2) will always be done once we have finished processing physical CPU work. At
the moment, it is done as part of the last iteration of 1), adding
some extra indentation in the code.
This could be streamlined by moving 2) out of the loop. At the same
time, 1) is moved into a separate function, making it more obvious what is
happening.
All those changes will help a follow-up patch where we would want to
introduce some vCPU work before returning to the guest vCPU.
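The resulting shape can be sketched as follows. The structure and function names here are hypothetical stand-ins (the counters replace real softirq processing and guest state setup); the sketch only shows 1) living in its own loop function and 2) running once afterwards.

```c
/* Hypothetical per-pCPU state standing in for the real bookkeeping. */
struct pcpu_sketch {
    unsigned int pending_work;   /* outstanding items of physical CPU work */
    unsigned int guest_prepared; /* how many times step 2) ran */
};

/* Step 1): loop until no physical CPU work is left. */
static void process_pcpu_work(struct pcpu_sketch *cpu)
{
    while ( cpu->pending_work > 0 )
        cpu->pending_work--;     /* stands in for processing one item */
}

/* Step 2) runs exactly once, after the loop has finished. */
static void leave_hypervisor_tail_sketch(struct pcpu_sketch *cpu)
{
    process_pcpu_work(cpu);
    cpu->guest_prepared++;       /* stands in for preparing the guest vCPU */
}
```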
Julien Grall [Mon, 6 Aug 2018 16:47:54 +0000 (17:47 +0100)]
xen/arm: p2m: Extend p2m_get_entry to return the value of bit[0] (valid bit)
With the recent changes, a P2M entry may be populated but may not be
valid. In some situations, it would be useful to know whether the entry
has been marked available to the guest in order to perform a specific
action. So extend p2m_get_entry to return the value of bit[0] (valid bit).
Julien Grall [Wed, 21 Feb 2018 14:18:44 +0000 (14:18 +0000)]
xen/arm: p2m: Allow to flush cache on any RAM region
Currently, we only allow flushing the cache on regions mapped as p2m_ram_{rw,ro}.
There is no real problem with flushing the cache on other RAM regions such as grants
and foreign mappings. Therefore, relax the check to allow flushing the
cache on any RAM region.
xen/arm: p2m: Introduce a function to resolve translation fault
Currently a Stage-2 translation fault can happen for one of the following reasons:
1) MMIO emulation
2) Another pCPU was modifying the P2M using Break-Before-Make
3) Guest Physical address is not mapped
A follow-up patch will re-purpose the valid bit in an entry to generate
a translation fault. This would be used to perform an action on each entry to
track pages used for a given period.
When receiving the translation fault, we would need to walk the page
tables to find the faulting entry and then toggle the valid bit. We can't use
p2m_lookup() for this purpose as it only tells us whether the mapping exists.
So this patch adds a new function to walk the page-tables and update
the entry. This function will also handle 2) as it also requires walking
the page-table.
The function is able to cope with both table and block entries having the
valid bit unset. This gives flexibility to the function clearing the
valid bits. To keep the algorithm simple, the fault will be propagated
one level down. This will be repeated until a block entry has been
reached.
At the moment, no action is taken when reaching a block/page entry
other than setting the valid bit to 1.
Julien Grall [Wed, 21 Feb 2018 14:18:40 +0000 (14:18 +0000)]
xen/arm: p2m: Handle translation fault in get_page_from_gva
A follow-up patch will re-purpose the valid bit of LPAE entries to
generate a fault even on entries containing valid information.
This means that translating a guest VA to a guest PA (i.e. IPA) will
fail if the Stage-2 entries used have the valid bit unset. Because of
that, we need to fall back to walking the page-table in software to check
whether the fault was expected.
This patch adds the software page-table walk on all translation
faults. It would be possible in the future to avoid a pointless walk when
the fault in PAR_EL1 is not a translation fault.
This function has only ever worked for guest RAM pages (no foreign mappings or
MMIO mappings) because we require the page to belong to the domain for
getting a reference. This means we can deny all non guest RAM pages.
Julien Grall [Wed, 21 Feb 2018 14:18:42 +0000 (14:18 +0000)]
xen/arm: p2m: Introduce p2m_is_valid and use it
The LPAE format allows information to be stored in an entry even with the
valid bit unset. In a follow-up patch, we will take advantage of this
feature to re-purpose the valid bit for generating a translation fault
even if an entry contains valid information.
So we need a different way to know whether an entry contains valid
information. It is possible to use the information held in the p2m_type
for that purpose. Indeed all entries containing valid
information will have a valid p2m type (i.e. p2m_type != p2m_invalid).
This patch introduces a new helper p2m_is_valid, which implements that
idea, and replaces most of the lpae_is_valid calls with the new helper. The ones
remaining are for TLB handling and entry accounting.
With the renaming there are 2 other changes required:
- Generate table entry with a valid p2m type
- Detect new mapping for proper stats accounting
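The distinction between the hardware valid bit and the software notion of validity can be sketched as below. The entry layout and the type names with a _sketch suffix are simplified assumptions for illustration, not the real LPAE or p2m definitions.

```c
#include <stdbool.h>

/* Hypothetical subset of p2m types; only p2m_invalid matters here. */
typedef enum { p2m_invalid = 0, p2m_ram_rw, p2m_ram_ro } p2m_type_sketch_t;

/* Simplified entry: the hardware valid bit plus the software type field. */
struct lpae_sketch {
    bool valid;               /* bit[0], may be cleared to force a fault */
    p2m_type_sketch_t type;   /* software-defined p2m type */
};

/* Hardware view: does bit[0] say the entry is valid? */
static bool lpae_is_valid_sketch(const struct lpae_sketch *pte)
{
    return pte->valid;
}

/* Software view: does the entry hold valid information? */
static bool p2m_is_valid_sketch(const struct lpae_sketch *pte)
{
    return pte->type != p2m_invalid;
}
```

An entry with the valid bit cleared but a type of p2m_ram_rw would fault in hardware while still being reported as containing valid information.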
Julien Grall [Wed, 21 Feb 2018 14:18:44 +0000 (14:18 +0000)]
xen/arm: Introduce helpers to clear/set flags in HCR_EL2
A couple of places in the code will need to clear/set flags in HCR_EL2
for a given vCPU and then replicate into the hardware. Introduce
helpers and replace open-coded version.
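A minimal sketch of such helpers, assuming a per-vCPU cached copy of the register. The structure, the function names and the hw_hcr_el2 variable (which stands in for the real system register write) are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical vCPU state: the cached HCR_EL2 value for this vCPU. */
struct vcpu_sketch { uint64_t hcr_el2; };

/* Stand-in for the hardware register; real code would write HCR_EL2. */
static uint64_t hw_hcr_el2;

/* Set flags in the cached value, then replicate it into "hardware". */
static void vcpu_hcr_set_flags_sketch(struct vcpu_sketch *v, uint64_t flags)
{
    v->hcr_el2 |= flags;
    hw_hcr_el2 = v->hcr_el2;
}

/* Clear flags in the cached value, then replicate it into "hardware". */
static void vcpu_hcr_clear_flags_sketch(struct vcpu_sketch *v, uint64_t flags)
{
    v->hcr_el2 &= ~flags;
    hw_hcr_el2 = v->hcr_el2;
}
```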
Juergen Gross [Tue, 11 Dec 2018 08:43:00 +0000 (09:43 +0100)]
x86: add dom0 memory sizing variants
Today the memory size of dom0 can be specified only in terms of bytes
(either an absolute value or "host-mem - value"). When dom0 shouldn't
be auto-ballooned this nearly always requires a manual adaptation of the
Xen boot parameters to reflect the actual host memory size.
Add more possibilities to specify memory sizes. Today we have:
dom0_mem= List of ( min:<size> | max:<size> | <size> )
with <size> being a positive or negative size value (e.g. 1G).
Modify that to:
dom0_mem= List of ( min:<sz> | max:<sz> | <sz> )
<sz>: <size> | [<size>+]<frac>%
<frac>: integer value < 100
With the following semantics:
<frac>% specifies a fraction of host memory size in percent.
<sz> is a percentage of host memory plus an offset.
So <sz> being 1G+25% on a 256G host would result in 65G.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
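The arithmetic behind the 1G+25% example can be sketched as follows. The function name is hypothetical, sizes are kept in bytes and the offset is unsigned for simplicity (the real option also accepts negative sizes), so this is not the actual parsing or sizing code.

```c
#include <stdint.h>

#define GIB (UINT64_C(1) << 30)

/*
 * Evaluate "<size>+<frac>%": an absolute offset plus a percentage of host
 * memory. 'frac' is expected to be an integer below 100. The multiply is
 * done first to avoid rounding; it cannot overflow for realistic host sizes.
 */
static uint64_t dom0_size_bytes(uint64_t host_bytes, uint64_t offset_bytes,
                                unsigned int frac)
{
    return offset_bytes + host_bytes * frac / 100;
}
```

With host_bytes = 256 * GIB, offset_bytes = 1 * GIB and frac = 25, the result is 65 * GIB, matching the example above.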
Juergen Gross [Tue, 11 Dec 2018 08:42:20 +0000 (09:42 +0100)]
modify parse_size_and_unit() to support percentage
Modify parse_size_and_unit() to support a value followed by a '%'
character. In this case ps is required to be non-NULL to ensure the
caller can detect that case. The returned value will be the integer
value s was pointing to and *ps will point to the '%' character.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
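The contract described above can be sketched with a standalone parser. This is not parse_size_and_unit() itself: the function name is hypothetical, unit suffixes are left out, and returning 0 when a '%' is seen with a NULL ps is an assumption about how the requirement might be enforced.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical parser: read an unsigned integer from 's'. If ps is
 * non-NULL, *ps is set to the first unparsed character, so a caller can
 * test for '%'. A trailing '%' with ps == NULL is rejected (returns 0).
 */
static unsigned long long parse_value_sketch(const char *s, const char **ps)
{
    char *end;
    unsigned long long val = strtoull(s, &end, 0);

    if ( *end == '%' && ps == NULL )
        return 0;               /* caller could not detect the percent case */

    if ( ps != NULL )
        *ps = end;

    return val;
}
```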
Andrew Cooper [Fri, 7 Dec 2018 17:00:47 +0000 (17:00 +0000)]
x86/VT-x: Don't activate VMCS Shadowing outside of nested vmx mode
By default on capable hardware, SECONDARY_EXEC_ENABLE_VMCS_SHADOWING is
activated unilaterally. The VMCS Link pointer is initialised to ~0, but the
VMREAD/VMWRITE bitmap pointers are not.
This causes the 16bit IVT and BIOS Data Area to get interpreted as the read/write
permission bitmap for guests which blindly execute VMREAD/VMWRITE
instructions.
This is not a security issue because the VMCS Link pointer being ~0 causes
VMREAD/VMWRITE to complete with VMFailInvalid (rather than modifying a
potential shadow VMCS), and the contents of MFN 0 have already been determined
not to contain any interesting data because of L1TF's ability to read that 4k
frame.
Leave VMCS Shadowing disabled by default, and toggle it in
nvmx_{set,clear}_vmcs_pointer(). This isn't the most efficient course of
action, but it is the most simple way of leaving nested-virt working as it did
before.
While editing construct_vmcs(), collect all default secondary_exec_control
modifications together. The disabling of PML is latently buggy because it
happens after secondary_exec_control are written into the VMCS, although there
is an unconditional update later which writes the correct value into hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 7 Dec 2018 13:43:27 +0000 (13:43 +0000)]
docs/cmdline: Rewrite the cpuid_mask_* section
A large amount of the information here is obsolete since Xen 4.7.
To begin with, however, this patch marks a change in style for section
headings, due to how HTML anchors are generated. Having more than one
parameter per heading makes an awkward anchor, especially when brace globbing
is used. Furthermore, the misc suffixes such as (AMD only) get included, as
do the escaping for the underscores.
Markdown doesn't require escaped underscores in headings (I'm not entirely
sure how we ended up with that style), so remove them and fully expand the
glob syntax. Also adjust com1,com2 while at it, which is the only other
multi-parameter heading. Move the misc suffixes into an "Applicability:" note
alongside the information about defaults.
This results in the headings being unadorned, and identical to how they are
expressed on the command line and in code.
For cpuid_mask_cpu option, collapse the long line of almost identical strings
using [] globbing. The result is much shorter and clearer to read. Add a
warning that this option no longer masks all features on Fam15h and above, due
to not making use of the leaf 7 masks.
For the remainder of the cpuid_mask_* options, collapse them all together into
a single description.
Finally, leave an explicit note explaining that people should not be using
these options for migration safety.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 7 Dec 2018 13:43:23 +0000 (13:43 +0000)]
docs/cmdline: Fix markdown syntax
* vwfi needs a closing `. rmrr needs one as well, and the opening ' switched
to `
* The com1/com2 example lines are already verbatim blocks and shouldn't
escape their underscores. This ends up in the rendered output.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Thu, 6 Dec 2018 14:05:34 +0000 (14:05 +0000)]
x86/pv: Code improvements to do_update_descriptor()
* Add "uint64_t raw" to seg_desc_t to remove the opencoded uint64_t casting
in this function. Change the parameter to be of type seg_desc_t.
* Rename the 'pa' parameter to 'gaddr', because it lives in GFN space rather
than physical address space.
* Use gfn_t and mfn_t rather than unsigned longs.
* Check the alignment and proposed new descriptor before taking a page
reference.
* Use the more flexible ACCESS_ONCE() accessor in preference to
write_atomic()
No expected change in behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
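The benefit of the "raw" member can be sketched with a small union. The type and helper names are hypothetical; the layout (two 32-bit halves overlaid with one 64-bit value) follows the description above and assumes a little-endian machine such as x86.

```c
#include <stdint.h>

/* Sketch of a segment descriptor viewable as two halves or one raw value. */
typedef union {
    struct {
        uint32_t a, b;   /* low and high 32 bits */
    };
    uint64_t raw;        /* whole descriptor, avoids open-coded casts */
} seg_desc_sketch_t;

/* Build a descriptor from its halves without any pointer casting. */
static seg_desc_sketch_t make_desc_sketch(uint32_t lo, uint32_t hi)
{
    seg_desc_sketch_t d = { .raw = ((uint64_t)hi << 32) | lo };

    return d;
}
```

A caller can then compare or store d.raw directly instead of casting a pointer to the structure to uint64_t.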
Andrew Cooper [Thu, 6 Dec 2018 14:05:29 +0000 (14:05 +0000)]
x86: Switch "struct desc_struct" to being seg_desc_t
The struct suffix is redundant in the name, and a future change will want to
turn it into a union, rather than a structure. As this represents a segment
descriptor, give it an appropriate typedef.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Ian Jackson [Mon, 3 Dec 2018 12:05:41 +0000 (12:05 +0000)]
docs/parse-support-md: Allow definition lists for features
Now, as well as a `code block', with
| Something: some status
we tolerate a definition list which in pandoc terms looks like this
|Term
|: Definition
This ought not usually be used for features but it will be useful
for linking to the release notes, because markup is not allowed in
code blocks but is in definitions.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Juergen Gross <jgross@suse.com>
Ian Jackson [Mon, 3 Dec 2018 12:01:55 +0000 (12:01 +0000)]
docs/parse-support-md: Correct handling of Status
In fact this was not markdown content, but just a string. We are
however going to make it be markdown content. So adjust the comments,
and the consumer.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Juergen Gross <jgross@suse.com>
Paul Durrant [Fri, 7 Dec 2018 17:50:08 +0000 (17:50 +0000)]
x86/hvm/viridian: stop open coding updates to APIC registers
The code in viridian_synic_wrmsr() duplicates logic in vlapic_reg_write()
to update the ICR, ICR2 and TASKPRI registers. Instead of doing this,
make vlapic_reg_write() non-static and call it.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Rename "offset" to "reg" for consistency with the rest of the vlapic API.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
When viridian APIC assist is active, the code in vlapic_has_pending_irq()
may end up re-calling vlapic_find_highest_isr() after emulating an EOI
whereas simply moving the call after the EOI emulation removes the need
for this duplication.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 6 Dec 2018 11:20:55 +0000 (12:20 +0100)]
console: adjust IRQ initialization
In order for a Xen internal PCI device driver to enable MSI on the
device, we need another hook which the driver can use to create the IRQ
(doing this in the init_preirq hook is too early, since IRQ code hasn't
got initialized at that time yet, and doing it in init_postirq is too
late because at least on x86 smp_intr_init() needs to know the IRQ
number).
On x86 this additionally requires a slight ordering change to IRQ
initialization, to facilitate calling the new hook between basic
initialization and the call path leading to smp_intr_init().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Thu, 6 Dec 2018 11:19:04 +0000 (12:19 +0100)]
make domain_adjust_tot_pages() __must_check
Even if unlikely, donate_page() should not ignore the possible need to
obtain a domain reference. To make people look more closely when they
add new uses of domain_adjust_tot_pages(), force its return value to be
checked. This in turn requires a benign change to assign_pages().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
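The effect of forcing a return value to be checked can be sketched with the GCC/Clang warn_unused_result attribute. The macro spelling MUST_CHECK, the structure and the function are hypothetical illustrations, not the real domain_adjust_tot_pages().

```c
/* Hypothetical spelling of the attribute; GCC and Clang both accept it. */
#define MUST_CHECK __attribute__((__warn_unused_result__))

/* Hypothetical stand-in for the per-domain page accounting. */
struct domain_sketch { unsigned long tot_pages; };

/* Adjust the page count and return the new total; callers must look at it. */
static MUST_CHECK unsigned long
adjust_tot_pages_sketch(struct domain_sketch *d, long pages)
{
    d->tot_pages += pages;

    return d->tot_pages;
}
```

Calling adjust_tot_pages_sketch(&d, 1); as a bare statement draws a compiler warning, which is what makes new callers look at the result.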
Andrew Cooper [Mon, 26 Feb 2018 12:45:58 +0000 (12:45 +0000)]
x86/hvm: Handle x2apic MSRs via the new guest_{rd,wr}msr() infrastructure
Dispatch from the guest_{rd,wr}msr() functions. The read side should be safe
outside of current context, but the write side is definitely not. As the
toolstack has no legitimate reason to access the APIC registers via this
interface (not least because whether they are accessible at all depends on
guest settings), unilaterally reject access attempts outside of current
context.
Rename to guest_{rd,wr}msr_x2apic() for consistency, and alter the functions
to use X86EMUL_EXCEPTION rather than X86EMUL_UNHANDLEABLE. The previous
callers turned UNHANDLEABLE into EXCEPTION, but using UNHANDLEABLE will now
interfere with the fallback to legacy MSR handling.
While altering guest_rdmsr_x2apic() make a couple of minor improvements.
Reformat the initialiser for readable[] so it indents in a more natural way,
and alter high to be a 64bit integer to avoid shifting 0 by 32 in the common
path.
Observant people might notice that we now don't let PV guests read the x2apic
MSRs. They should never have been able to in the first place.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Wed, 7 Mar 2018 16:48:01 +0000 (16:48 +0000)]
x86: Fix APIC MSR constant names
We currently have MSR_IA32_APICBASE and MSR_IA32_APICBASE_MSR which are
synonymous from a naming point of view, but refer to very different things.
Rename the x2APIC MSRs to MSR_X2APIC_*, which are shorter constants and
visually separate the register function from the generic APIC name. For the
case ranges, introduce MSR_X2APIC_LAST, rather than relying on the knowledge
that there are 0x3ff MSRs architecturally reserved for x2APIC functionality.
For functionality relating to the APIC_BASE MSR, use MSR_APIC_BASE for the MSR
itself, but drop the MSR prefix from the other constants to shorten the names.
In all cases, the fact that we are dealing with the APIC_BASE MSR is obvious
from the context.
No functional change (the combined binary is identical).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Thu, 29 Nov 2018 18:17:45 +0000 (18:17 +0000)]
x86/spec-ctrl: Drop the bti= command line option
bti= was introduced with the original Spectre fixes (Jan 2018), but by the
time Speculative Store Bypass came along (May 2018), it was superseded by the
more generic spec-ctrl=.
Since then, we've had LazyFPU (June 2018) and L1TF (August 2018), which means
no one will be using the option. Remove it entirely - anyone who happens to
accidentally be using it might now spot Xen complaining about an option it
doesn't understand.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 4 Dec 2018 13:04:54 +0000 (14:04 +0100)]
pci: apply workaround for Intel errata HSE43 and BDF2/BDX2
These errata affect the values read from the BAR registers, and could
render vPCI (and by extension PVH Dom0) unusable.
HSE43 is a Haswell erratum where a non-BAR register is implemented at
the position where the first BAR of the device should be found in a
Power Control Unit device. Note that there are no BARs on this device,
apart from the bogus CSR register positioned on top of the first BAR.
BDF2/BDX2 is a Broadwell erratum where BARs in the Home Agent device
will return bogus non-zero values.
In both cases the solution is to treat such devices as having no BARs
in the vPCI code.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Tue, 4 Dec 2018 13:04:20 +0000 (14:04 +0100)]
vmx: remove stale prototypes
Some prototypes in include/asm-x86/hvm/vmx/vmx.h have no related
implementation. Remove them.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Tue, 4 Dec 2018 13:02:46 +0000 (14:02 +0100)]
x86emul: skip VIF processing in VME mode for 16-bit POPF at IOPL 3
At IOPL 3 CR4.VME is irrelevant.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Thu, 29 Nov 2018 18:17:01 +0000 (18:17 +0000)]
tools/libxc: Fix error handling in get_cpuid_domain_info()
get_cpuid_domain_info() has two conflicting return styles - either -error for
local failures, or -1/errno for hypercall failures. Switch to consistently
use -error.
While fixing the xc_get_cpu_featureset(), take the opportunity to remove the
redundancy and move it to be adjacent to the other featureset handling.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
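The two return styles, and the conversion to a consistent -error style, can be sketched as follows. Both functions are hypothetical; fake_hypercall_sketch() merely imitates a call that fails with -1 and sets errno.

```c
#include <errno.h>

/* Hypothetical hypercall wrapper that fails in the -1/errno style. */
static int fake_hypercall_sketch(int fail_with)
{
    if ( fail_with )
    {
        errno = fail_with;
        return -1;
    }

    return 0;
}

/* Normalise to the -error style so every failure path looks the same. */
static int call_returning_neg_errno(int fail_with)
{
    int rc = fake_hypercall_sketch(fail_with);

    return rc < 0 ? -errno : rc;
}
```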
Andrew Cooper [Thu, 29 Nov 2018 18:10:38 +0000 (18:10 +0000)]
tools/libxc: Fix issues with libxc and Xen having different featureset lengths
In almost all cases, Xen and libxc will agree on the featureset length,
because they are built from the same source.
However, there are circumstances (e.g. security hotfixes) where the featureset
gets longer and dom0 will, after installing updates, be running with an old
Xen but new libxc. Despite writing the code with this scenario in mind, there
were some bugs.
First, xen-cpuid's get_featureset() erroneously allocates a buffer based on
Xen's featureset length, but records libxc's length, which may be longer.
In this situation, the hypercall bounce buffer code reads/writes the recorded
length, which is beyond the end of the allocated object, and a later free()
encounters corrupt heap metadata. Fix this by recording the same length that
we allocate.
Secondly, get_cpuid_domain_info() has a related bug when the passed-in
featureset is a different length to libxc's.
A large amount of the libxc cpuid functionality depends on info->featureset
being as long as expected, and it is allocated appropriately. However, in the
case that a shorter external featureset is passed in, the logic to check for
trailing nonzero bits may read off the end of it. Rework the logic to use the
correct upper bound.
In addition, leave a comment next to the fields in struct cpuid_domain_info
explaining the relationship between the various lengths, and how to cope with
different lengths.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
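The length handling for the second bug can be sketched as below: copy only as many words as both sides have, and look for trailing nonzero words only within the bounds of the external featureset. The function name and the choice of -EINVAL as the error value are assumptions, not the libxc code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static size_t min_len(size_t a, size_t b)
{
    return a < b ? a : b;
}

/*
 * Copy an external featureset of src_len words into a local buffer of
 * dst_len words. Returns 0 on success, or -EINVAL if the external set
 * has nonzero words that the local buffer cannot hold.
 */
static int import_featureset_sketch(uint32_t *dst, size_t dst_len,
                                    const uint32_t *src, size_t src_len)
{
    size_t i, n = min_len(dst_len, src_len);

    memcpy(dst, src, n * sizeof(*dst));

    /* Zero-fill when the source is shorter; never read past src_len. */
    for ( i = n; i < dst_len; i++ )
        dst[i] = 0;

    /* Trailing words exist only if the source is longer than the buffer. */
    for ( i = dst_len; i < src_len; i++ )
        if ( src[i] != 0 )
            return -EINVAL;

    return 0;
}
```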
Olaf Hering [Wed, 28 Nov 2018 12:24:34 +0000 (13:24 +0100)]
xl: free bitmaps on exit
Every invocation of xl via valgrind will show three leaks.
Since libxl_bitmap_alloc uses NOGC, the caller has to free the memory
after use. And since xl_ctx_free might be called before
parse_global_config, also move the libxl_bitmap_init calls into
xl_ctx_alloc.
Also move the call to atexit() after xl_ctx_alloc, because the latter is
also called again in postfork.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Fri, 30 Nov 2018 11:10:39 +0000 (12:10 +0100)]
x86/shadow: don't enable shadow mode with too small a shadow allocation
We've had more than one report of host crashes after failed migration,
and in at least one case we've had a hint towards a too far shrunk
shadow allocation pool. Instead of just checking the pool for being
empty, check whether the pool is smaller than what
shadow_set_allocation() would minimally bump it to if it was invoked in
the first place.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Roger Pau Monné [Fri, 30 Nov 2018 11:10:00 +0000 (12:10 +0100)]
amd/iommu: skip host bridge devices when updating IOMMU page tables
Host bridges are not behind an IOMMU, and are already special cased and
skipped in amd_iommu_add_device. Apply the same special casing when
updating page tables.
This is required or else update_paging_mode will fail and return an
error to the caller (amd_iommu_{un}map_page) which will destroy the
domain.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
Roger Pau Monné [Fri, 30 Nov 2018 11:09:09 +0000 (12:09 +0100)]
amd/iommu: assign iommu devices to Xen
AMD IOMMU devices are exposed on the PCI bus, and thus are assigned by
default to the hardware domain. This can cause issues because the
IOMMU devices themselves are not behind an IOMMU, so update_paging_mode will
return an error if Xen tries to expand the page tables of a domain
that has assigned devices not behind an IOMMU. update_paging_mode
failing will cause the domain to be destroyed.
Fix this by hiding PCI IOMMU devices, so they are not assigned to the
hardware domain.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Brian Woods <brian.woods@amd.com>
Jan Beulich [Fri, 30 Nov 2018 11:07:33 +0000 (12:07 +0100)]
ns16550/PCI: fix skipping of devices
Selecting between single/multiple BAR mode should happen after checking
whether to skip the present device, or else multi-BAR devices won't be
skipped correctly, due to port_idx getting set to zero in that case.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Mon, 26 Nov 2018 10:40:44 +0000 (10:40 +0000)]
tools: set Dom0 UUID if requested
Introduce XEN_DOM0_UUID in Xen's global configuration file. Make
xen-init-dom0 accept an extra argument for UUID.
Also switch xs_open error message in xen-init-dom0 to use perror.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Wed, 28 Nov 2018 14:50:26 +0000 (15:50 +0100)]
x86emul: correct 32-bit address handling for AVX2 gathers
As done for other cases by commit 7869e2bafe ("x86emul/fuzz: add
rudimentary limit checking"), address calculations should also use
truncate_ea() for the AVX2 gather insns.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
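The effect of truncating a gather element address to the address size can be sketched as follows. Both helpers are hypothetical simplifications (the emulator's truncate_ea() is not reproduced here); ad_bytes is the address size in bytes.

```c
#include <stdint.h>

/* Truncate an effective address to ad_bytes (2, 4 or 8) bytes. */
static uint64_t truncate_ea_sketch(uint64_t ea, unsigned int ad_bytes)
{
    if ( ad_bytes >= 8 )
        return ea;

    return ea & ((UINT64_C(1) << (ad_bytes * 8)) - 1);
}

/* Address of one gather element: base + index * scale, then truncated. */
static uint64_t gather_element_ea_sketch(uint64_t base, int32_t index,
                                         unsigned int scale,
                                         unsigned int ad_bytes)
{
    return truncate_ea_sketch(base + (uint64_t)((int64_t)index * scale),
                              ad_bytes);
}
```

With 32-bit addressing, a base of 0xfffffff0 and an offset of 32 wraps around to 0x10 instead of producing an address above 4GiB.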
Juergen Gross [Wed, 28 Nov 2018 12:32:36 +0000 (13:32 +0100)]
xen: remove trailing spaces from public headers
Several public header files have trailing spaces in them. This is
rather annoying when importing them into other projects as they might
be rejected for not complying with the coding style.
Remove the trailing spaces in all headers below xen/include/public/.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monne [Mon, 26 Nov 2018 17:55:48 +0000 (18:55 +0100)]
mm: make opt_bootscrub non-init
LLVM code generation can attempt to load from a variable in the next
condition of an expression under certain circumstances, thus turning
the following condition:
if ( system_state < SYS_STATE_active && opt_bootscrub == BOOTSCRUB_IDLE )
Andrew Cooper [Mon, 26 Nov 2018 12:03:07 +0000 (12:03 +0000)]
xen/tools: Fix gen-cpuid.py's ability to report errors
c/s 18596903 "xen/tools: support Python 2 and Python 3" unfortunately
introduced a TypeError when changing how Fail exceptions were printed:
/local/xen.git/xen/../xen/tools/gen-cpuid.py:Traceback (most recent call last):
File "/local/xen.git/xen/../xen/tools/gen-cpuid.py", line 483, in <module>
sys.stderr.write(e)
TypeError: expected a character buffer object
Coerce e to a string before printing. While changing this, fold the three
write() calls making up the line into a single one, and take the opportunity
to neaten the output.
A sample error is:
/local/xen.git/xen/tools/gen-cpuid.py: Fail: Aliased value between FOO and BAR
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
This happened because viridian_map_guest_page() was not written to cope
with being called multiple times, but this is unfortunately exactly what
happens when xen-hvmcrash re-loads the domain context (having clobbered
the values of RIP).
This patch simply makes viridian_map_guest_page() return immediately if it
finds the page already mapped (i.e. vp->ptr != NULL).
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
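The early return described above makes the mapping operation safe to repeat. A minimal sketch, assuming a simplified page structure: the names are hypothetical, and a static buffer with a counter stands in for the real guest page mapping.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-page mapping state. */
struct viridian_page_sketch {
    void *ptr;                 /* non-NULL once the page is mapped */
    unsigned int map_count;    /* counts real mapping operations */
};

static char backing_page[4096]; /* stands in for the mapped guest page */

/* Map the page unless it is already mapped; safe to call repeatedly. */
static void map_guest_page_sketch(struct viridian_page_sketch *vp)
{
    if ( vp->ptr != NULL )
        return;                /* already mapped: nothing to do */

    vp->ptr = backing_page;
    vp->map_count++;
}
```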
Jan Beulich [Mon, 26 Nov 2018 16:53:51 +0000 (17:53 +0100)]
x86emul: suppress default test harness build with incapable assembler
A top level "make build", as used e.g. by osstest, wants to build all
"all" targets in enabled tools subdirectories, which by default also
includes the emulator test harness. The use of, in particular, {evex}
insn pseudo-prefixes in, again in particular, test_x86_emulator.c causes
this build to fail though when the assembler is not new enough. Take
another big hammer and suppress the default harness build altogether
also when this and other pseudo-prefixes are not supported by the
specified (or defaulted to) assembler.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>