Jan Beulich [Tue, 6 Mar 2018 15:48:44 +0000 (16:48 +0100)]
x86: slightly reduce Meltdown band-aid overhead
I'm not sure why I didn't do this right away: By avoiding the use of
global PTEs in the cloned directmap, there's no need to fiddle with
CR4.PGE on any of the entry paths. Only the exit paths need to flush
global mappings.
The reduced flushing, however, requires that we now have interrupts off
on all entry paths until after the page table switch, so that flush IPIs
can't be serviced while on the restricted pagetables, leaving a window
where a potentially stale guest global mapping can be brought into the
TLB. Along those lines the "sync" IPI after L4 entry updates now needs
to become a real (and global) flush IPI, so that inside Xen we'll also
pick up such changes.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Sergey Dyasli [Tue, 6 Mar 2018 15:47:34 +0000 (16:47 +0100)]
pv_console: remove unnecessary #ifdefs
The header for PV console contains empty function definitions in case of
!CONFIG_XEN_GUEST, specifically to avoid #ifdefs in the code that uses them
and so keep that code cleaner.
Unfortunately, during the release of shim-comet, PV console functions
were enclosed into unnecessary #ifdefs CONFIG_X86. Remove them.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Tue, 6 Mar 2018 15:46:57 +0000 (16:46 +0100)]
x86/xpti: don't map stack guard pages
Other than for the main mappings, don't even do this in release builds,
as there are no huge page shattering concerns here.
Note that since we don't run on the restricted page tables while HVM
guests execute, the non-present mappings won't trigger the triple fault
issue AMD SVM is susceptible to with our current placement of STGI vs
TR loading.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 6 Mar 2018 15:46:27 +0000 (16:46 +0100)]
x86/xpti: really hide almost all of Xen image
Commit 422588e885 ("x86/xpti: Hide almost all of .text and all
.data/.rodata/.bss mappings") carefully limited the Xen image cloning to
just entry code, but then overwrote the just allocated and populated L3
entry with the normal one again covering both Xen image and stubs.
Drop the respective code in favor of an explicit clone_mapping()
invocation. This in turn now requires setup_cpu_root_pgt() to run after
stub setup in all cases. Additionally, with (almost) no unintended
mappings left, the BSP's IDT now also needs to be page aligned.
The moving ahead of cleanup_cpu_root_pgt() is not strictly necessary
for functionality, but things are more logical this way, and we retain
cleanup being done in the inverse order of setup.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
When creating a GICv3 devicetree node, we currently insert the
redistributor-stride and #redistributor-regions properties, with fixed
values which are actually the architected ones. Since those properties are
optional, and in the case of the stride only needed to cover for broken
platforms, we don't need to describe them if they don't differ from the
default values. This will always be the case for our constructed
DomU memory map.
So we drop those properties altogether and provide a clean and architected
GICv3 DT node for DomUs.
Signed-off-by: Andre Przywara <andre.przywara@linaro.org> Reviewed-by: Julien Grall <julien.grall@arm.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 5 May 2017 16:01:47 +0000 (17:01 +0100)]
x86/pv: Drop int80_bounce from struct pv_vcpu
The int80_bounce field of struct pv_vcpu is a bit of an odd special case,
because it is a simple derivation of trap_ctxt[0x80], which is also stored.
It is also the only use of {compat_,}create_bounce_frame() which isn't
referencing the plain trap_bounce field of struct pv_vcpu. (And altering this
property is the purpose of this patch.)
Remove the int80_bounce field entirely, along with init_int80_direct_trap(),
which in turn requires that the int80_direct_trap() path gain logic previously
contained in init_int80_direct_trap().
This does admittedly make the int80 fastpath slightly longer, but these few
instructions are in the noise compared to the architectural context switch
overhead, and it now matches the syscall/sysenter paths (which have far less
architectural overhead already).
No behavioural change from the guest's point of view.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 21 Feb 2018 13:00:23 +0000 (13:00 +0000)]
x86/entry: Correct comparisons against boolean variables
The correct way to check a boolean is `cmpb $0` or `testb $0xff`, whereas a
lot of our entry code uses `testb $1`. This will work in principle for values
which are really C _Bool types, but won't work for other integer types which
are intended to have boolean properties.
cmp is the more logical way of thinking about the operation, so adjust all
outstanding uses of `testb $1` against boolean values. Changing test to cmp
changes the logical mnemonic of the following condition from 'zero' to
'equal', but the actual encoding remains the same.
No functional change, as all uses are real C _Bool types, and confirmed by
diffing the disassembly.
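The difference between the two checks can be illustrated in C. This is only a sketch with hypothetical function names, not code from the patch: checking bit 0 mirrors `testb $1`, while comparing against zero mirrors `cmpb $0`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors `testb $1, var`: only bit 0 is examined. */
bool is_set_test_bit0(uint8_t v)
{
    return (v & 1) != 0;
}

/* Mirrors `cmpb $0, var`: any non-zero value counts as true. */
bool is_set_cmp_zero(uint8_t v)
{
    return v != 0;
}

/* The two checks agree for 0 and 1 (real _Bool values) but diverge for 2. */
void check_boolean_tests(void)
{
    assert(is_set_test_bit0(0) == is_set_cmp_zero(0));
    assert(is_set_test_bit0(1) == is_set_cmp_zero(1));
    assert(is_set_test_bit0(2) != is_set_cmp_zero(2));
}
```

For a genuine _Bool only 0 and 1 occur, which is why the change is not expected to alter behaviour.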
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 2 Mar 2018 17:45:52 +0000 (17:45 +0000)]
x86/boot: Annotate the multiboot headers with size and type information
This causes objdump not to try and disassemble the data.
While altering this area, switch to using .balign, and fill with 0xc2 to help
highlight the embedded padding (rather than having it filled with 0f 1f 40 00
which is a long nop). Also, shorten the labels by stripping off the _start
suffix.
Since commit "xen/arm: domain_build: Rework the way to allocate the
event channel interrupt", it is not possible for an irq to be both below 16
and greater than or equal to 32.
Also fix the reference to linux documentation while we're at it.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@dornerworks.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
[Slightly rework the commit message]
Julien Grall [Tue, 27 Feb 2018 15:15:54 +0000 (15:15 +0000)]
xen/arm: domain_build: Rework the way to allocate the event channel interrupt
At the moment, a placeholder will be created in the device-tree for the
event channel information. Later in the domain construction, the
interrupt for the event channel upcall will be allocated and the device-tree
fixed up.
Looking at the code, the current split is not necessary because all the
PPIs used by the hardware domain will be registered by the time we create the node in
the device-tree.
From now on, mandate that all interrupts are registered before
acpi_prepare() and dtb_prepare(). This allows us to rework the event
channel code and remove one placeholder.
Note, this will also help to fix the BUG(...) condition in set_interrupt_ppi
which is completely wrong. See in a follow-up patch.
Juergen Gross [Mon, 26 Feb 2018 08:46:12 +0000 (09:46 +0100)]
tools/xenstore: try to get minimum thread stack size for watch thread
When creating a pthread in xs_watch() try to get the minimal needed
size of the thread from glibc instead of using a constant. This avoids
problems when the library is used in programs with large per-thread
memory.
Use dlsym() to get the pointer to __pthread_get_minstack() in order to
avoid linkage problems and fall back to the current constant size if
not found.
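A rough sketch of the lookup-with-fallback idea follows. The helper name, the fallback value and the function pointer signature are assumptions made for illustration, and the sketch uses dlopen(NULL, ...) to obtain a handle, which may differ from what the patch does.

```c
#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>
#include <stddef.h>

/* Fallback constant; the value here is an assumption for illustration. */
#define DEFAULT_THREAD_STACKSIZE ((size_t)16 * 1024)

/* Assumed signature of glibc's internal __pthread_get_minstack(). */
typedef size_t (*minstack_fn)(const pthread_attr_t *attr);

/* Hypothetical helper: ask glibc for the minimum stack size if it can. */
size_t watch_thread_stacksize(const pthread_attr_t *attr)
{
    size_t size = DEFAULT_THREAD_STACKSIZE;
    void *self = dlopen(NULL, RTLD_NOW);   /* handle for the running program */

    assert(attr != NULL);

    if (self != NULL) {
        /* Looked up at run time, so there is no link-time dependency. */
        minstack_fn minstack =
            (minstack_fn)dlsym(self, "__pthread_get_minstack");

        if (minstack != NULL)
            size = minstack(attr);
        dlclose(self);
    }

    return size;   /* the constant if the symbol was not found */
}
```

On a C library that does not export the symbol, dlsym() returns NULL and the constant is used unchanged.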
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Tested-by: Jim Fehlig <jfehlig@suse.com>
Wei Liu [Fri, 2 Mar 2018 16:46:25 +0000 (16:46 +0000)]
x86: rename HAVE_GAS_* to HAVE_AS_*
Xen also uses clang's assembler when it is possible. Change the macro
names to not be GAS specific.
Patch produced with:
$ for f in `git grep HAVE_GAS_ | cut -d':' -f1`; \
do sed -i 's/HAVE_GAS_/HAVE_AS_/g' $f; done
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Sat, 11 Nov 2017 19:08:37 +0000 (19:08 +0000)]
x86/link: Don't merge .init.text and .init.data
c/s 1308f0170c merged .init.text and .init.data, because EFI might properly
write-protect r/o sections.
However, that change makes xen-syms unusable for disassembly analysis, in
particular for searching for indirect branches as part of the SP2/Spectre
mitigation series.
As the merging isn't necessary for ELF targets at all, make it conditional on
the EFI side of the build.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Paul Semel [Fri, 23 Feb 2018 22:48:57 +0000 (23:48 +0100)]
fuzz/x86_emulate: fix bounds for input size
The maximum input size was set to INPUT_SIZE, which is actually the size of
the data array inside the fuzz_corpus structure, and so did not allow the
user (or AFL) to fill in the whole structure. Changing it to
sizeof(struct fuzz_corpus) corrects this problem.
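The bound change can be sketched as follows. The structure layout, the INPUT_SIZE value and the helper names are hypothetical simplifications, not the fuzzer's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INPUT_SIZE 4096  /* size of the data array only (value assumed) */

/* Hypothetical, simplified layout: state fields followed by the data array. */
struct fuzz_corpus {
    uint64_t regs[16];               /* placeholder for register state */
    unsigned long options;
    unsigned char data[INPUT_SIZE];
};

/* The whole structure is strictly larger than its data array. */
_Static_assert(sizeof(struct fuzz_corpus) > INPUT_SIZE,
               "structure must be larger than its data array");

/* Old bound: inputs longer than the data array were rejected. */
int size_ok_old(size_t size)
{
    return size <= INPUT_SIZE;
}

/* New bound: the input may fill the whole structure. */
int size_ok_new(size_t size)
{
    return size <= sizeof(struct fuzz_corpus);
}
```

With the old bound, an input large enough to populate both the state fields and the full data array would have been refused.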
Signed-off-by: Paul Semel <semelpaul@gmail.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jim Fehlig [Mon, 26 Feb 2018 18:28:39 +0000 (11:28 -0700)]
libxl: set channel devid when not provided by application
Applications like libvirt may not populate a device devid field,
delegating that to libxl. If needed, the application can later
retrieve the libxl-produced devid. Indeed most devices are handled
this way in libvirt, channel devices included.
This works well when only one channel device is defined, but more
than one results in
qemu-system-i386: -chardev socket,id=libxl-channel-1,\
path=/tmp/test-org.qemu.guest_agent.00,server,nowait:
Duplicate ID 'libxl-channel-1' for chardev
Besides the odd '-1' value in the id, multiple channels have the same
id, causing qemu to fail. A simple fix is to set an uninitialized
devid (-1) to the dev_num passed to libxl__init_console_from_channel().
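A minimal sketch of the fix, using a hypothetical stand-in structure and helper name rather than the real libxl types:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a channel device; -1 marks an unset devid. */
struct channel_dev {
    int devid;
};

/* Use the caller's dev_num only if the application left devid unset. */
void channel_init_devid(struct channel_dev *channel, int dev_num)
{
    assert(channel != NULL);

    if (channel->devid == -1)
        channel->devid = dev_num;
}
```

Because each channel is passed a distinct dev_num, two channels that both start with devid -1 end up with different ids, while a devid chosen by the application is left alone.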
Signed-off-by: Jim Fehlig <jfehlig@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
libxl: do not fail device removal if backend domain is gone
Backend domain may be independently destroyed - there is no
synchronization of libxl structures (including /libxl tree) elsewhere.
Backend might also remove the device info from its backend xenstore
subtree on its own.
We have various cases (not comprehensive list):
- both frontend and backend operational: after setting
be/state=XenbusStateClosing the backend waits for frontend confirmation
and responds with be/state=XenbusStateClosed; then libxl in dom0
removes frontend entries and libxl in the backend domain (which may be the
same) removes backend entries
- unresponsive backend/frontend: after a timeout, force=1 is used to remove
frontend entries, instead of just setting
be/state=XenbusStateClosing; then wait for be/state=XenbusStateClosed.
If that times out too, remove both frontend and backend entries
- backend gone, with this patch: no place for setting/waiting on
be/state - go directly to removing frontend entries, without waiting
for be/state=XenbusStateClosed (this is the difference vs force=1)
Without this patch the end result is similar, both frontend and backend
entries are removed, but in case of backend gone:
- libxl waits for be/state=XenbusStateClosed (and obviously times out)
- the return value from the function signals an error, which for example
confuses libvirt - it thinks the device removal failed, so the device is
still there
If such a situation is detected, do not fail the removal, but finish the
cleanup of the frontend side and return 0.
This is just a workaround; the real fix should watch for when the device
backend is removed (including backend domain destruction) and remove the
frontend at that time. It should also report such an event to higher layer
code, so that, for example, libvirt could synchronize its state.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Thu, 1 Mar 2018 14:10:02 +0000 (15:10 +0100)]
firmware/shim: better filtering of dependency files during Xen tree setup
I have no idea what *.d1 is supposed to refer to - we only have .*.d
and .*.d2 files (note also the leading dot). Also switch to passing
-name instead of -path to find - that's a requirement for .*.d et al to
work, but would probably have been better from the beginning.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Thu, 1 Mar 2018 14:09:38 +0000 (15:09 +0100)]
libxc: really tolerate empty PV records
Commit 119ee4d773 ("tools/libxc: Tolerate specific zero-content records
in migration v2 streams") meant to tolerate those, but failed to set rc
accordingly.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Wed, 21 Feb 2018 18:10:00 +0000 (18:10 +0000)]
x86/hvm: Constify the read side of vlapic handling
This is in preparation to make hvm_x2apic_msr_read() take a const vcpu
pointer. One modification is to alter vlapic_get_tmcct() to not use current.
This in turn needs an alteration to hvm_get_guest_time_fixed(), which is safe
because the only mutable action it makes is to take the domain plt lock.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 26 Feb 2018 14:23:03 +0000 (14:23 +0000)]
x86/vmx: Simplify the default cases in vmx_msr_{read,write}_intercept()
The default case of vmx_msr_write_intercept() in particular is very tangled.
First of all, fold long_mode_do_msr_{read,write}() into their callers. These
functions were split out in the past because of the 32bit build of Xen, but it
is unclear why the cases weren't simply #ifdef'd in place.
Next, invert the vmx_write_guest_msr()/is_last_branch_msr() logic to break if
the condition is satisfied, rather than nesting if it wasn't. This allows the
wrmsr_hypervisor_regs() call to be un-nested with respect to the other default
logic.
No practical difference from a guest's point of view.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Razvan Cojocaru [Wed, 28 Feb 2018 10:38:15 +0000 (12:38 +0200)]
x86/hvm: fix domain crash when CR3 has the noflush bit set
In hardware, when PCID support is enabled and the NOFLUSH bit is set
when writing a CR3 value, the hardware will clear that bit and
change the CR3 without flushing the TLB. hvm_set_cr3(), however, was
ignoring this bit; the result was that post-vm_event checks detected
an invalid CR3 value and crashed the domain.
Handle NOFLUSH in hvm_set_cr3() by:
1. Clearing the bit
2. Passing a "noflush" flag to lower-level cr3 setting functions to
indicate that a flush should not be performed.
Also clear X86_CR3_NOFLUSH when reporting monitored CR3 writes.
This allows introspection to be used on VMs whose operating system uses
the NOFLUSH bit.
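The two steps can be sketched in C as below. The helper name and its signature are hypothetical; only the X86_CR3_NOFLUSH name comes from the text above, and its position (bit 63) is the architectural one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 63 of a CR3 write requests "no flush" when PCID is enabled. */
#define X86_CR3_NOFLUSH (UINT64_C(1) << 63)

/*
 * Hypothetical helper: strip the NOFLUSH bit from a guest-supplied value and
 * report, via *noflush, whether the TLB flush should be skipped.
 */
uint64_t cr3_strip_noflush(uint64_t value, bool pcid_enabled, bool *noflush)
{
    assert(noflush != NULL);

    *noflush = pcid_enabled && (value & X86_CR3_NOFLUSH) != 0;
    if (*noflush)
        value &= ~X86_CR3_NOFLUSH;

    return value;
}
```

The returned value is what later validity checks would see, so it no longer carries the bit that previously made the CR3 value look invalid.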
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reported-by: Bitweasil <bitweasil@cryptohaze.com> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Ian Jackson [Wed, 31 Jan 2018 13:02:32 +0000 (13:02 +0000)]
release-checklist.txt: Say to increment SUPPORT.md version number
CC: Andrew Cooper <andrew.cooper3@citrix.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Wed, 31 Jan 2018 13:02:01 +0000 (13:02 +0000)]
SUPPORT.md: increment version number
CC: Andrew Cooper <andrew.cooper3@citrix.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Andrew Cooper [Thu, 24 Aug 2017 14:31:08 +0000 (15:31 +0100)]
common/gnttab: Introduce command line feature controls
This patch was originally released as part of XSA-226. It retains the same
command line syntax (as various downstreams are mitigating XSA-226 using this
mechanism) but the defaults have been updated due to the revised XSA-226
patches, after which transitive grants are believed to be functioning
properly.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 27 Feb 2018 14:12:23 +0000 (15:12 +0100)]
x86/HVM: don't give the wrong impression of WRMSR succeeding
... for non-existent MSRs: wrmsr_hypervisor_regs()'s comment clearly
says that the function returns 0 for unrecognized MSRs, so
{svm,vmx}_msr_write_intercept() should not convert this into success. We
don't want to unconditionally fail the access though, as we can't be
certain the list of handled MSRs is complete enough for the guest types
we care about, so instead mirror what we do on the read paths and probe
the MSR to decide whether to raise #GP.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Roger Pau Monné [Tue, 27 Feb 2018 13:10:33 +0000 (14:10 +0100)]
vmx/hap: optimize CR4 trapping
There are a bunch of bits in CR4 that should be allowed to be set directly
by the guest without requiring Xen intervention. Currently this is
already done by passing through guest writes into the CR4 used when
running in non-root mode, but only after taking an expensive vmexit in
order to do so.
xenalyze reports the following when running a PV guest in shim mode:
Note that this optimized trapping is currently only applied to guests
running with HAP on Intel hardware. If using shadow paging more CR4
bits need to be unconditionally trapped, which makes this approach
unlikely to yield any important performance improvements.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Tue, 27 Feb 2018 13:10:00 +0000 (14:10 +0100)]
x86/PV: fix off-by-one in I/O bitmap limit check
With everyone having their tags below agreeing that putting things the
other way around in the comparison makes things easier to understand, do
that rearrangement while changing the line anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Alexandru Isaila [Tue, 27 Feb 2018 13:09:21 +0000 (14:09 +0100)]
hvm/svm: implement CPUID events
At this moment the CPUID events for the AMD architecture are not
forwarded to the monitor layer.
This patch adds the CPUID event to the common capabilities and then
forwards the event to the monitor layer.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Andrew Cooper [Tue, 27 Feb 2018 13:08:36 +0000 (14:08 +0100)]
x86/hvm: Disallow the creation of HVM domains without Local APIC emulation
There are multiple problems, not necessarily limited to:
* Guests which configure event channels via hvmop_set_evtchn_upcall_vector(),
or which hit %cr8 emulation will cause Xen to fall over a NULL vlapic->regs
pointer.
* On Intel hardware, disabling the TPR_SHADOW execution control without
reenabling CR8_{LOAD,STORE} interception means that the guest's %cr8
accesses interact with the real TPR. Amongst other things, setting the
real TPR to 0xf blocks even IPIs from interrupting this CPU.
* On hardware which sets up the use of Interrupt Posting, including
IOMMU-Posting, guests run without the appropriate non-root configuration,
which at a minimum will result in dropped interrupts.
Whether no-LAPIC mode is of any use at all remains to be seen.
This is XSA-256.
Reported-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 27 Feb 2018 13:07:12 +0000 (14:07 +0100)]
gnttab: don't blindly free status pages upon version change
There may still be active mappings, which would trigger the respective
BUG_ON(). Split the loop into one dealing with the page attributes and
the second (when the first fully passed) freeing the pages. Return an
error if any pages still have pending references.
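The split into a checking pass and a freeing pass can be sketched like this. The structure, the helper name and the choice of -EBUSY as the error are hypothetical and only illustrate the pattern.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a status page with a reference count. */
struct status_page {
    unsigned int refs;   /* outstanding mappings */
    int freed;
};

/*
 * Pass 1 verifies that no page is still referenced; pass 2 frees the pages
 * and only runs once pass 1 has fully succeeded.
 */
int free_status_pages(struct status_page *pages, size_t nr)
{
    size_t i;

    assert(pages != NULL || nr == 0);

    for (i = 0; i < nr; i++)
        if (pages[i].refs != 0)
            return -EBUSY;   /* nothing has been freed yet */

    for (i = 0; i < nr; i++)
        pages[i].freed = 1;

    return 0;
}
```

Doing all checks first means a failure leaves every page untouched, rather than leaving the array half freed.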
This is part of XSA-255.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 27 Feb 2018 13:04:44 +0000 (14:04 +0100)]
gnttab/ARM: don't corrupt shared GFN array
... by writing status GFNs to it. Introduce a second array instead.
Also implement gnttab_status_gmfn() properly now that the information is
suitably being tracked.
While touching it anyway, remove a misguided (but luckily benign) upper
bound check from gnttab_shared_gmfn(): We should never access beyond the
bounds of that array.
This is part of XSA-255.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 27 Feb 2018 13:03:27 +0000 (14:03 +0100)]
memory: don't implicitly unpin for decrease-reservation
It very likely was a mistake (copy-and-paste from domain cleanup code)
to implicitly unpin here: The caller should really unpin itself before
(or after, if they so wish) requesting the page to be removed.
This is XSA-252.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Mon, 19 Feb 2018 14:54:57 +0000 (14:54 +0000)]
x86/time: Rework pv_soft_rdtsc() to aid further cleanup
Having pv_soft_rdtsc() emulate all parts of an rdtscp is awkward, and gets in
the way of some intended cleanup.
* Drop the rdtscp parameter and always make the caller responsible for ecx
updates when appropriate.
* Switch the function from being void, and return the main timestamp in the
return value.
The regs parameter is still needed, but only for the stats collection, once
again bringing into question their utility. The parameter can however switch
to being const.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 19 Feb 2018 10:40:20 +0000 (10:40 +0000)]
x86/pv: Avoid leaking other guests' MSR_TSC_AUX values into PV context
If the CPU pipeline supports RDTSCP or RDPID, a guest can observe the value in
MSR_TSC_AUX, irrespective of whether the relevant CPUID features are
advertised/hidden.
At the moment, paravirt_ctxt_switch_to() only writes to MSR_TSC_AUX if
TSC_MODE_PVRDTSCP mode is enabled, but this is not the default mode.
Therefore, default PV guests can read the value from a previously scheduled
HVM vcpu, or TSC_MODE_PVRDTSCP-enabled PV guest.
Alter the PV path to always write to MSR_TSC_AUX, using 0 in the common case.
To amortise overhead cost, introduce wrmsr_tsc_aux() which performs a lazy
update of the MSR, and use this function consistently across the codebase.
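The lazy update can be sketched as follows. The MSR is simulated with a plain variable and a write counter, and a single cached value stands in for what would be per-CPU state, so this is an illustration of the idea and not the actual implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated MSR and a counter of actual writes, for illustration only. */
uint32_t sim_msr_tsc_aux;
unsigned int sim_wrmsr_count;

/* Last value written on this (single, simulated) CPU. */
static uint32_t cached_tsc_aux;

/* Lazy update: only touch the MSR when the value actually changes. */
void wrmsr_tsc_aux(uint32_t val)
{
    if (cached_tsc_aux != val) {
        sim_msr_tsc_aux = val;   /* stands in for the real WRMSR */
        sim_wrmsr_count++;
        cached_tsc_aux = val;
    }

    assert(sim_msr_tsc_aux == val);
}
```

In the common case where consecutive vcpus all want 0, the comparison succeeds and the comparatively expensive MSR write is skipped.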
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Julien Grall [Fri, 23 Feb 2018 18:57:29 +0000 (18:57 +0000)]
xen/arm: vpsci: Rework the logic to start AArch32 vCPU in Thumb mode
A 32-bit domain is able to select the instruction set (ARM vs Thumb) to use
when booting a new vCPU via CPU_ON. This is indicated via bit[0] of the
entry point address (see "T32 support" in PSCI v1.1 DEN0022D). bit[0]
must be cleared when setting the PC.
At the moment, Xen is setting CPSR.T but never clears bit[0]. Clear
it to match the specification.
At the same time, slightly rework the code to make clear thumb is only for
32-bit domain. Lastly, take the opportunity to switch is_thumb from int
to bool.
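The entry point handling described above can be sketched as below. The structure and function names are hypothetical and do not reflect the actual vPSCI code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result of decoding a CPU_ON entry point. */
struct entry_info {
    uint32_t pc;
    bool is_thumb;
};

/* bit[0] selects Thumb for 32-bit domains and must not end up in the PC. */
struct entry_info decode_entry_point(uint32_t entry_point, bool is_32bit_domain)
{
    struct entry_info info;

    info.is_thumb = is_32bit_domain && (entry_point & 1) != 0;
    info.pc = info.is_thumb ? (entry_point & ~UINT32_C(1)) : entry_point;

    assert(!info.is_thumb || (info.pc & 1) == 0);
    return info;
}
```

The is_thumb flag would then drive the setting of CPSR.T, while the PC receives the address with bit[0] cleared.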
Julien Grall [Fri, 23 Feb 2018 18:57:24 +0000 (18:57 +0000)]
xen/arm: vpsci: Remove parameter 'ver' from do_common_cpu
Currently, the behavior of do_common_cpu will slightly change depending
on the PSCI version passed as a parameter. Looking at the code, the
specific 0.2 behavior could be moved out of the function or adapted for 0.1:
- x0/r0 can be updated on PSCI 0.1 because general purpose registers
are undefined upon CPU on. This was deduced from the spec not
mentioning the state of general purpose registers on CPU on.
- PSCI 0.1 does not define PSCI_ALREADY_ON. However, it would be
safer to bail out if the CPU is already on.
Based on this, the parameter 'ver' is removed and do_psci_cpu_on
(implementation for PSCI 0.1) is adapted to avoid returning
PSCI_ALREADY_ON.
Julien Grall [Fri, 23 Feb 2018 18:57:23 +0000 (18:57 +0000)]
xen/arm64: Kill PSCI_GET_VERSION as a variant-2 workaround
Now that we've standardised on SMCCC v1.1 to perform the branch
prediction invalidation, let's drop the previous band-aid. If vendors
haven't updated their firmware to do SMCCC 1.1, they haven't updated
PSCI either, so we don't lose anything.
This is aligned with the Linux commit 3a0a397ff5ff.
One of the major improvements of SMCCC v1.1 is that it only clobbers the
first 4 registers, both on 32 and 64bit. This means that it becomes very
easy to provide an inline version of the SMC call primitive, and avoid
performing a function call to stash the registers that would otherwise
be clobbered by SMCCC v1.0.
This patch has been adapted to Xen from Linux commit f2d3b2e8759a. The
changes made are:
- Using Xen coding style
- Remove HVC as not used by Xen
- Add arm_smccc_res structure
Julien Grall [Fri, 23 Feb 2018 18:57:20 +0000 (18:57 +0000)]
xen/arm: psci: Detect SMCCC version
PSCI 1.0 and later allows the SMCCC version to be (indirectly) probed
via PSCI_FEATURES. If the PSCI_FEATURES does not exist (PSCI 0.2 or
earlier) and the function returns an error, then we assume SMCCC 1.0
is implemented.
Add macros SMCCC_VERSION, SMCCC_VERSION_{MINOR, MAJOR} to easily convert
between a 32-bit value and a version number. The encoding is based on
2.2.2 in "Firmware interfaces for mitigating CVE-2017-5715" (ARM DEN 0070A).
Also re-use them to define ARM_SMCCC_VERSION_1_0 and ARM_SMCCC_VERSION_1_1.
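A sketch of what such macros might look like is given below. It assumes the version is encoded with the major number in bits [30:16] and the minor number in bits [15:0]; the helper function at the end is hypothetical and only mirrors the "assume 1.0 on error" fallback.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding: major in bits [30:16], minor in bits [15:0]. */
#define SMCCC_VERSION_MAJOR_SHIFT 16
#define SMCCC_VERSION_MINOR_MASK \
    ((UINT32_C(1) << SMCCC_VERSION_MAJOR_SHIFT) - 1)
#define SMCCC_VERSION_MAJOR_MASK \
    (UINT32_C(0x7fff) << SMCCC_VERSION_MAJOR_SHIFT)

#define SMCCC_VERSION_MAJOR(ver) \
    (((uint32_t)(ver) & SMCCC_VERSION_MAJOR_MASK) >> SMCCC_VERSION_MAJOR_SHIFT)
#define SMCCC_VERSION_MINOR(ver) \
    ((uint32_t)(ver) & SMCCC_VERSION_MINOR_MASK)
#define SMCCC_VERSION(major, minor) \
    (((uint32_t)(major) << SMCCC_VERSION_MAJOR_SHIFT) | (uint32_t)(minor))

#define ARM_SMCCC_VERSION_1_0 SMCCC_VERSION(1, 0)
#define ARM_SMCCC_VERSION_1_1 SMCCC_VERSION(1, 1)

_Static_assert(ARM_SMCCC_VERSION_1_1 == UINT32_C(0x10001),
               "1.1 encodes as 0x10001 under the assumed layout");

/* Hypothetical helper: an error from the probe means SMCCC 1.0 is assumed. */
uint32_t smccc_version_or_default(int32_t probe_ret, uint32_t reported_version)
{
    return probe_ret < 0 ? ARM_SMCCC_VERSION_1_0 : reported_version;
}
```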
Julien Grall [Fri, 23 Feb 2018 18:57:17 +0000 (18:57 +0000)]
xen/arm64: Implement a fast path for handling SMCCC_ARCH_WORKAROUND_1
The function SMCCC_ARCH_WORKAROUND_1 will be called by the guest for
hardening the branch predictor. So we want the handling to be as fast as
possible.
As the mitigation is applied on every guest exit, we can check for the
call before saving all the context and return very early.
For now, only provide a fast path for the HVC64 call. Because the code relies
on 2 registers, x0 and x1 are saved in advance.
Julien Grall [Fri, 23 Feb 2018 18:57:15 +0000 (18:57 +0000)]
xen/arm: vsmc: Implement SMCCC_ARCH_WORKAROUND_1 BP hardening support
SMCCC 1.1 offers firmware-based CPU workarounds. In particular,
SMCCC_ARCH_WORKAROUND_1 provides BP hardening for variant 2 of XSA-254
(CVE-2017-5715).
If the hypervisor has some mitigation for this issue, report that we
deal with it using SMCCC_ARCH_WORKAROUND_1, as we apply the hypervisor
workaround on every guest exit.
Julien Grall [Fri, 23 Feb 2018 18:57:14 +0000 (18:57 +0000)]
xen/arm: vsmc: Implement SMCCC 1.1
The new SMC Calling Convention (v1.1) allows for a reduced overhead when
calling into the firmware, and provides a new feature discovery
mechanism. See "Firmware interfaces for mitigating CVE-2017-5715"
ARM DEN 0070A.
Julien Grall [Fri, 23 Feb 2018 18:57:13 +0000 (18:57 +0000)]
xen/arm: vpsci: Add support for PSCI 1.1
At the moment, Xen provides a virtual PSCI interface compliant with 0.1
and 0.2. Since then, the specification has been updated and the latest
version is 1.1 (see ARM DEN 0022D).
From an implementation point of view, only PSCI_FEATURES is mandatory.
The rest is optional and can be left unimplemented for now.
At the same time, the compatible string for the PSCI node has been updated to
expose "arm,psci-1.0".
Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: mirela.simonovic@aggios.com
Julien Grall [Fri, 23 Feb 2018 18:57:12 +0000 (18:57 +0000)]
xen/arm: psci: Rework the PSCI definitions
Some PSCI functions are only available in the 32-bit version. After
recent changes, Xen always needs to know whether the call was made using
a 32-bit id or a 64-bit id, so that we don't emulate reserved ones.
With the current naming scheme, it is not easy to know which call
supports 32-bit and 64-bit id. So rework the definitions to encode the
version in the name. From now on the functions will be named PSCI_0_2_FNxx
where xx is 32 or 64.
Andrew Cooper [Tue, 20 Feb 2018 11:08:32 +0000 (11:08 +0000)]
x86/hvm: Don't shadow the domain parameter in hvm_save_cpu_msrs()
c/s d2f86bf604 which introduced "struct hvm_save_descriptor *d" accidentally
ended up shadowing the "struct domain *d" function parameter. Rename the
former to desc.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Fri, 23 Feb 2018 14:11:00 +0000 (14:11 +0000)]
x86/clang: allow integrated assembler usage
If the required features are present.
Modify as-option-add to add an option in case the test fails, and use
it to detect whether the required clang integrated assembler features
are present.
This patch has been tested with clang 3.5, clang 6, gcc 6.4.0 without
retpoline support and gcc 7.3.1 with retpoline support.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Alan Robinson [Fri, 23 Feb 2018 13:24:56 +0000 (14:24 +0100)]
get_maintainers.pl: Avoid THE_REST when files are added or removed
When files are added or removed, /dev/null is used as a placeholder
name in the patch for the absent file. Don't try to
find a MAINTAINER for this placeholder: it only ever flags
and then spams THE REST. Behaviour for a real filename is
unchanged.
Signed-off-by: Alan Robinson <Alan.Robinson@ts.fujitsu.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Wed, 21 Feb 2018 17:58:04 +0000 (17:58 +0000)]
build: Help attempts to syntax highlight Config.mk
Some attempts to syntax highlight Config.mk end up thinking that most of
Config.mk is a string, due to the unbalanced squote. Provide a balancing
squote in a comment to compensate.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Doug Goldstein [Fri, 23 Feb 2018 10:05:35 +0000 (11:05 +0100)]
xen: append EXTRA_CFLAGS_XEN_CORE to CFLAGS
Allow a user to supply extra CFLAGS via the EXTRA_CFLAGS_XEN_CORE
environment variable for hypervisor builds. This is not a supported
configuration; it is only meant to help with testing and
troubleshooting when you need to make changes.
Signed-off-by: Doug Goldstein <cardoe@cardoe.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monné [Fri, 23 Feb 2018 10:05:19 +0000 (11:05 +0100)]
build: remove shim related targets
There's no need to have shim specific targets, so just use the regular
xen makefile targets in order to build the shim binary.
When the shim is built as part of the firmware directory, install the
stripped Xen binary to the firmware directory and place a binary with
symbols in the debug directory.
The objcopy step of the shim build is also removed in this patch:
since the shim is booted in PVH mode there's no need for the resulting
binary to be in elf32 format. Xen can load PVH kernels with either a
32 or 64bit elf header.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Brian Woods [Fri, 23 Feb 2018 10:03:36 +0000 (11:03 +0100)]
x86/svm: add support for pause filtering threshold
Add support for enabling the pause filtering threshold feature. This
causes the pause filtering count to reset if pause filtering threshold
cycles or more elapse between pauses. See AMD APM Vol 2 Section
15.14.4 for more details.
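The reset rule described above can be modelled in software. The sketch below is a simplified illustration only, not Xen or hardware code: the struct, the function names and the reload-on-intercept behaviour are assumptions made for the example, and the ">=" comparison follows this commit message's wording rather than the APM text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the pause filter (illustration only). */
struct pause_filter {
    uint16_t initial_count; /* pause filtering count set by the hypervisor */
    uint16_t threshold;     /* pause filtering threshold, in cycles */
    int32_t  counter;       /* running count, reloaded from initial_count */
    uint64_t last_pause;    /* cycle stamp of the previous PAUSE */
};

static void pause_filter_init(struct pause_filter *pf,
                              uint16_t count, uint16_t threshold)
{
    pf->initial_count = count;
    pf->threshold = threshold;
    pf->counter = count;
    pf->last_pause = 0;
}

/* Returns true when the modelled PAUSE would be intercepted. */
static bool pause_filter_on_pause(struct pause_filter *pf, uint64_t now)
{
    if (now - pf->last_pause >= pf->threshold)
        pf->counter = pf->initial_count; /* pauses far apart: start over */
    else
        pf->counter--;                   /* tight pause loop: count down */

    pf->last_pause = now;

    if (pf->counter < 0) {
        /* Assumption: the model reloads the counter after an intercept. */
        pf->counter = pf->initial_count;
        return true;
    }
    return false;
}
```

With a count of 2 and a threshold of 100 cycles, three PAUSEs spaced 10 cycles apart trigger an intercept on the third one, while a PAUSE arriving 460 cycles after the previous one only reloads the counter.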
The values of the pause filtering count and threshold were found by
iterating over different values of the count and threshold while running
kernbench and a pi spigot algorithm with yields placed in it. A
balanced setting for both variables provides:
(Using averaged elapsed time with kernbench)
old = 852.0
new = 848.8
improvement = .4%
For systems without pause filtering threshold support, the change of
the count from 3000 to 4000 should not negatively affect system
performance.
Signed-off-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Roger Pau Monné [Fri, 23 Feb 2018 10:00:31 +0000 (11:00 +0100)]
x86: fix indirect thunk usage of CONFIG_INDIRECT_THUNK
When indirect_thunk_asm.h is instantiated directly into assembly files
CONFIG_INDIRECT_THUNK might not be defined, and thus using .if against
it is wrong.
Add a check to define CONFIG_INDIRECT_THUNK to 0 if not defined, so
that using .if CONFIG_INDIRECT_THUNK is always correct.
This suppresses the following clang error:
<instantiation>:8:9: error: expected absolute expression
.if CONFIG_INDIRECT_THUNK == 1
^
<instantiation>:1:1: note: while in macro instantiation
INDIRECT_BRANCH call %rdx
^
entry.S:589:9: note: while in macro instantiation
INDIRECT_CALL %rdx
^
Note that this is a preparatory patch in order to enable clang's
integrated assembler; the integrated assembler is not yet enabled for
assembly files.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Haozhong Zhang [Fri, 23 Feb 2018 09:59:31 +0000 (10:59 +0100)]
VT-d: use two 32-bit writes to update DMAR fault address registers
The 64-bit DMAR fault address is composed of two 32-bit registers,
DMAR_FEADDR_REG and DMAR_FEUADDR_REG. According to VT-d spec:
"Software is expected to access 32-bit registers as aligned doublewords",
a hypervisor should use two 32-bit writes to DMAR_FEADDR_REG and
DMAR_FEUADDR_REG separately in order to update a 64-bit fault address,
rather than a 64-bit write to DMAR_FEADDR_REG. Note that when x2APIC
is not enabled DMAR_FEUADDR_REG is reserved and it's not necessary to
update it.
Though I haven't seen any errors caused by such a single 64-bit write
on real machines, it's still better to follow the specification.
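A minimal sketch of the split write, for illustration only. The helper name, the register-array model and the x2apic_enabled parameter are hypothetical, and the two offsets are assumed from the VT-d register layout and should be checked against the specification; this is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offsets assumed from the VT-d register layout; verify against the spec. */
#define DMAR_FEADDR_REG  0x40u
#define DMAR_FEUADDR_REG 0x44u

/*
 * Hypothetical helper: 'regs' models the register block as an array of
 * aligned 32-bit words, so every access is an aligned doubleword access.
 */
static void write_fault_addr(volatile uint32_t *regs, uint64_t addr,
                             bool x2apic_enabled)
{
    /* The low 32 bits always go to DMAR_FEADDR_REG. */
    regs[DMAR_FEADDR_REG / 4] = (uint32_t)addr;

    /* The upper half is reserved without x2APIC, so skip it in that case. */
    if (x2apic_enabled)
        regs[DMAR_FEUADDR_REG / 4] = (uint32_t)(addr >> 32);
}
```

Calling write_fault_addr(regs, addr, false) leaves the upper register untouched, matching the note above that DMAR_FEUADDR_REG need not be updated when x2APIC is not enabled.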
Fixes: ae05fd3912b ("VT-d: use qword MMIO access for MSI address writes") Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Brian Woods [Tue, 20 Feb 2018 22:27:02 +0000 (16:27 -0600)]
x86/svm: add EFER SVME support for VGIF/VLOAD
Only enable virtual VMLOAD/SAVE and VGIF if the guest EFER.SVME is set.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Julien Grall [Wed, 21 Feb 2018 14:02:44 +0000 (14:02 +0000)]
xen/tmem: Convert the file common/tmem_xen.c to use typesafe MFN
The file common/tmem_xen.c is now converted to use typesafe MFNs. This
requires overriding the macro page_to_mfn to make it work with mfn_t.
Note that all variables converted to mfn_t have their initial value,
when set, switched from 0 to INVALID_MFN. This is fine because the
initial values were always overridden before use.
Also add a couple of missing newlines suggested by Andrew in the code.
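For readers unfamiliar with the typesafe idiom, a simplified stand-alone sketch follows. It mirrors the names mfn_t, _mfn(), mfn_x(), mfn_eq() and INVALID_MFN, but the definitions here are a reduced illustration and not the hypervisor's own headers; mfn_is_valid() is a hypothetical helper added for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Wrapping the raw number in a struct stops it mixing with plain integers. */
typedef struct { unsigned long mfn; } mfn_t;

/* Box a raw frame number. */
static inline mfn_t _mfn(unsigned long n) { return (mfn_t){ n }; }

/* Unbox it again where a raw number is really needed. */
static inline unsigned long mfn_x(mfn_t m) { return m.mfn; }

static inline bool mfn_eq(mfn_t a, mfn_t b) { return mfn_x(a) == mfn_x(b); }

/* All-ones sentinel: unlike 0, it cannot be mistaken for a real frame. */
#define INVALID_MFN _mfn(~0UL)

/* Hypothetical helper for the example. */
static inline bool mfn_is_valid(mfn_t m) { return !mfn_eq(m, INVALID_MFN); }
```

Initialising a variable with INVALID_MFN instead of 0 means that a value which was never overwritten is detectable, whereas 0 is indistinguishable from frame number 0.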
Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Roger Pau Monne [Tue, 20 Feb 2018 14:10:12 +0000 (14:10 +0000)]
build: filter out command line assembler arguments
If the assembler is not used. This happens when using cc -E or cc -S
for example. GCC will just ignore the -Wa,... when the assembler is
not called, but clang will complain loudly and fail.
Also enable passing -Wa,-I$(BASEDIR)/include to clang now that it's
safe to do so.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monne [Tue, 20 Feb 2018 14:10:11 +0000 (14:10 +0000)]
build: do not hardcode AFLAGS for as-insn tests
Hardcoding as-insn to use AFLAGS is not correct. For one, the test is
performed using a C file with inline assembly; secondly, the flags
used can be passed by the caller together with the CC.
Fix as-insn-check to pass the flags given as parameter to the test.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
[Fix usage comments as they are changing] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Fri, 16 Feb 2018 14:59:56 +0000 (14:59 +0000)]
xen/arm: vgic: Make sure the number of SPIs is a multiple of 32
The vGIC relies on having a pending_irq available for every IRQ
described in the ranks. As each rank describes 32 interrupts, we need to
make sure the number of SPIs is a multiple of 32.
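The rounding itself can be sketched as below. SPIS_PER_RANK and round_up_spis() are hypothetical names chosen for the example; only the figure of 32 interrupts per rank comes from the description above.

```c
#include <assert.h>

/* Each rank describes 32 interrupts (hypothetical constant name). */
#define SPIS_PER_RANK 32u

/* Round a requested SPI count up to the next multiple of 32. */
static unsigned int round_up_spis(unsigned int nr_spis)
{
    /* Works because SPIS_PER_RANK is a power of two. */
    return (nr_spis + SPIS_PER_RANK - 1) & ~(SPIS_PER_RANK - 1);
}
```

A request for 33 SPIs is rounded up to 64, so both ranks are fully backed, while a request for exactly 32 is left unchanged.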
Sergey Dyasli [Mon, 19 Feb 2018 11:29:26 +0000 (11:29 +0000)]
x86/msr: add Raw and Host domain policies
Raw policy contains the actual values from H/W MSRs. Add the
PLATFORM_INFO MSR to the policy during probe_cpuid_faulting().
Host policy may have certain features disabled if Xen decides not
to use them. For now, make Host policy equal to Raw policy with
cpuid_faulting availability dependent on X86_FEATURE_CPUID_FAULTING.
Finally, derive HVM/PV max domain policies from the Host policy.
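The derivation chain can be pictured roughly as follows. This is a sketch under assumptions: the struct, the variable names and the two boolean parameters are hypothetical stand-ins, reduced to the single cpuid_faulting bit mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced policy holding only the bit discussed above. */
struct msr_policy_sketch {
    bool cpuid_faulting; /* availability of CPUID faulting */
};

static struct msr_policy_sketch raw_policy, host_policy;
static struct msr_policy_sketch hvm_max_policy, pv_max_policy;

/*
 * hw_reports_faulting: what the hardware MSR says (Raw).
 * xen_has_feature: stands in for X86_FEATURE_CPUID_FAULTING being set.
 */
static void calculate_policies(bool hw_reports_faulting, bool xen_has_feature)
{
    /* Raw: the actual value read from hardware. */
    raw_policy.cpuid_faulting = hw_reports_faulting;

    /* Host: a copy of Raw, with the bit following Xen's feature flag. */
    host_policy = raw_policy;
    host_policy.cpuid_faulting = xen_has_feature;

    /* HVM/PV max: both derived from Host. */
    hvm_max_policy = host_policy;
    pv_max_policy = host_policy;
}
```

Keeping Raw and Host separate means the hardware value stays visible even when the feature is not in use.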
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Igor Druzhinin [Tue, 20 Feb 2018 09:16:56 +0000 (10:16 +0100)]
x86/nmi: start NMI watchdog on CPU0 after SMP bootstrap
We're noticing a reproducible system boot hang on certain
Skylake platforms where the BIOS is configured in legacy
boot mode with x2APIC disabled. The system stalls immediately
after writing the first SMP initialization sequence into APIC ICR.
The cause of the problem is watchdog NMI handler execution -
somewhere near the end of NMI handling (after it's already
rescheduled the next NMI) it tries to access IO port 0x61
to get the actual NMI reason on CPU0. Unfortunately, this
port is emulated by BIOS using SMIs and this emulation for
some reason takes more time than we expect during INIT-SIPI-SIPI
sequence. As a result, the system is constantly moving between the
NMI and SMI handlers and not making any progress.
To avoid this, initialize the watchdog after SMP bootstrap on
CPU0 and, additionally, protect the NMI handler by moving
IO port access before NMI re-scheduling. The latter should also
help in case of post boot CPU onlining. Although we're running the
watchdog at a much lower frequency at this point, it's nevertheless
possible that we trigger the issue.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 20 Feb 2018 09:10:59 +0000 (10:10 +0100)]
shim: allow building of just the shim with build-ID-incapable linker
The ELF note the shim build inserts causes mkelf32 to choke on the
second program header. However, the output of mkelf32 isn't really
needed when building inside tools/firmware/ - an attempt to build it is
made solely because of a wrong dependency.
Further changes to the make logic will be needed to also allow building
a shim-enabled "normal" xen with such a linker (it looks as if the
--notes option will need passing not just when the linker supports
build ID generation).
Also drop a stray variable setting from the x86 Makefile.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Dario Faggioli [Fri, 16 Feb 2018 18:38:48 +0000 (19:38 +0100)]
tools: libxenstat: fix format string overflow
With gcc 7.3.0, the build fails like this:
src/xenstat_linux.c: In function ‘getBridge’
src/xenstat_linux.c:78:34: warning: ‘%s’ directive writing up to 255 bytes into a region of size 241 [-Wformat-overflow=]
sprintf(tmp, "/sys/class/net/%s/bridge", de->d_name);
^~
src/xenstat_linux.c:78:5: note: ‘sprintf’ output between 23 and 278 bytes into a destination of size 256
sprintf(tmp, "/sys/class/net/%s/bridge", de->d_name);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix by making the buffer bigger.
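As an illustration of the sizes involved, the sketch below derives a sufficient buffer length from the format string and the 255-byte name bound quoted in the warning. It also uses snprintf() to detect truncation, which is a different technique from the patch (the patch only enlarges the buffer). D_NAME_MAX, BRIDGE_FMT, BRIDGE_PATH_LEN and bridge_path() are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Upper bound on de->d_name quoted in the gcc warning above. */
#define D_NAME_MAX 255

#define BRIDGE_FMT "/sys/class/net/%s/bridge"

/*
 * sizeof(BRIDGE_FMT) counts the two bytes of "%s" plus the trailing NUL,
 * so subtract 2 and add the longest possible name: 25 - 2 + 255 = 278.
 */
#define BRIDGE_PATH_LEN (sizeof(BRIDGE_FMT) - 2 + D_NAME_MAX)

/* Returns 0 on success, -1 if the path did not fit or formatting failed. */
static int bridge_path(char *buf, size_t len, const char *ifname)
{
    int n = snprintf(buf, len, BRIDGE_FMT, ifname);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A buffer declared as char tmp[BRIDGE_PATH_LEN] is 278 bytes, which matches the upper bound gcc reports for the sprintf() output.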
Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Mon, 19 Feb 2018 13:00:31 +0000 (14:00 +0100)]
shut down domain when last vCPU goes down
I've just had to deal with an early boot crash of Linux which occurred
so early that even "earlyprintk=xen" did not produce any useful output.
Hence the domain appeared to hang, while in fact it had brought down its
only vCPU. By translating this to a shutdown, the situation becomes
easier to recognize.
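The idea can be sketched as below. This is an illustration only: the structures, the fixed vCPU array and vcpu_mark_down() are hypothetical and do not reflect the hypervisor's data structures or the actual shutdown path.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_VCPUS_SKETCH 4 /* hypothetical limit for the example */

struct domain_sketch {
    unsigned int nr_vcpus;            /* must be <= MAX_VCPUS_SKETCH */
    bool vcpu_down[MAX_VCPUS_SKETCH];
    bool is_shut_down;
};

/* Mark one vCPU as down; shut the domain down if none are left running. */
static void vcpu_mark_down(struct domain_sketch *d, unsigned int id)
{
    unsigned int i;

    d->vcpu_down[id] = true;

    for (i = 0; i < d->nr_vcpus; i++)
        if (!d->vcpu_down[i])
            return; /* at least one vCPU is still up */

    /* Last vCPU went down: report a shutdown instead of an apparent hang. */
    d->is_shut_down = true;
}
```

For a single-vCPU guest, the first call already flips is_shut_down, which is the early boot crash case described above.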
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Mon, 19 Feb 2018 12:59:37 +0000 (13:59 +0100)]
x86/PV: avoid indirect call/thunk in I/O emulation
The stub is within reach from the .text section, so there's no point
using an indirect call here. This has the added benefit of there no
longer being two sufficiently different approaches, breaking one of
which people may not even notice.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citix.com>
Uwe Dannowski [Fri, 16 Feb 2018 13:19:54 +0000 (13:19 +0000)]
x86/microcode: Propagate microcode update errors
Errors on updating the microcode in the processor were silently
dropped when invoked via the microcode_update hypercall. Also, the log
message was misleading.
Signed-off-by: Uwe Dannowski <uwed@amazon.de> Reviewed-by: Stefan Nuernberger <snu@amazon.de> Reviewed-by: Martin Pohlack <mpohlack@amazon.de> Reviewed-by: Amit Shah <aams@amazon.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>