Juergen Gross [Tue, 14 Mar 2017 13:31:12 +0000 (14:31 +0100)]
tools: add support for additional items in .pc files for local builds
Some libraries require different compiler-flags when being used in a
local build compared to a build using installed libraries.
Reflect that by supporting local cflags variables in generated
pkg-config files. The local variants will be empty in the installed
pkg-config files.
The flags for the linker in the local variants will have to specify
the search path for the library with "-Wl,-rpath-link=", while the
flags for the installed library will be "-L".
Add needed directory patterns.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
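A sketch of what the generated pkg-config content might look like in the two cases (variable and library names here are illustrative assumptions, not the actual generated files):

```
# local (in-tree) build: the local variables carry build-tree flags,
# and the linker needs -Wl,-rpath-link= to find uninstalled libraries
cflagslocal=-I${prefix}/tools/include
Libs: -Wl,-rpath-link=${libdir} -lxenfoo

# installed build: the local variables are empty, plain -L suffices
cflagslocal=
Libs: -L${libdir} -lxenfoo
```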
Juergen Gross [Tue, 14 Mar 2017 13:31:08 +0000 (14:31 +0100)]
tools: fix typo in tools/Rules.mk
Commit 78fb69ad9 ("tools/Rules.mk: Properly handle libraries with
recursive dependencies.") introduced a copy and paste error in
tools/Rules.mk:
LDLIBS_libxenstore and SHLIB_libxenstore don't use SHDEPS_libxenstore
but SHDEPS_libxenguest. This adds a superfluous dependency of
libxenstore on libxenevtchn.
Correct this bug.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Zhang Chen [Mon, 6 Mar 2017 02:59:25 +0000 (10:59 +0800)]
COLO-Proxy: Use socket to get checkpoint event.
We use the kernel COLO proxy's way of getting the checkpoint event
from qemu colo-compare.
Qemu colo-compare needs to add an API to support this (I will add this in qemu).
Qemu side patch:
https://lists.nongnu.org/archive/html/qemu-devel/2017-02/msg07265.html
Signed-off-by: Zhang Chen <zhangchen.fnst@cn.fujitsu.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Sergey Dyasli [Tue, 14 Mar 2017 11:25:14 +0000 (12:25 +0100)]
x86/vvmx: correct nested shadow VMCS handling
Currently Xen always sets the shadow VMCS-indicator bit on nested
vmptrld and always clears it on nested vmclear. This behavior is
wrong when the guest loads a shadow VMCS: the shadow bit will be lost
on nested vmclear.
Fix this by checking if the guest has provided a shadow VMCS.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Sergey Dyasli [Tue, 14 Mar 2017 11:24:38 +0000 (12:24 +0100)]
x86/vvmx: add mov-ss blocking check to vmentry
Intel SDM states that if there is a current VMCS and there is MOV-SS
blocking, VMFailValid occurs and control passes to the next instruction.
Implement such behaviour for nested vmlaunch and vmresume.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Fri, 17 Feb 2017 18:31:45 +0000 (18:31 +0000)]
x86/cpuid: Handle leaf 0xb in guest_cpuid()
Leaf 0xb is reserved by AMD, and uniformly hidden from guests by the toolstack
logic and hypervisor PV logic. The previous dynamic logic filled in the
x2APIC ID for all HVM guests.
In practice, leaf 0xb is tightly linked with x2APIC, and x2APIC is offered to
guests on AMD hardware, as Xen's APIC emulation is x2APIC capable even if
hardware isn't.
Sensibly exposing the rest of the leaf requires further topology
infrastructure.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 17 Feb 2017 18:24:45 +0000 (18:24 +0000)]
x86/cpuid: Handle leaf 0xa in guest_cpuid()
Leaf 0xa is reserved by AMD, and only exposed to Intel guests when vPMU is
enabled. Leave the logic as-was, ready to be cleaned up when further
toolstack infrastructure is in place.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 17 Feb 2017 18:03:58 +0000 (18:03 +0000)]
x86/cpuid: Handle leaf 0x6 in guest_cpuid()
The thermal/performance leaf was previously hidden from HVM guests, but fully
visible to PV guests. Most of the leaf refers to MSR availability, and there
is nothing an unprivileged PV guest can do with the information, so hide the
leaf entirely.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 17 Feb 2017 17:32:29 +0000 (17:32 +0000)]
x86/cpuid: Handle leaf 0x5 in guest_cpuid()
The MONITOR flag isn't exposed to guests. The existing toolstack logic, and
pv_cpuid() in the hypervisor, zero the MONITOR leaf for queries.
However, the MONITOR leaf is still visible in the hardware domain's native
CPUID view, and Linux depends on this to set up C-state information. Leak the
host's MONITOR leaf under the same circumstances that the MONITOR feature is
leaked.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 17 Feb 2017 17:21:35 +0000 (17:21 +0000)]
x86/cpuid: Handle leaf 0x4 in guest_cpuid()
Leaf 0x4 is reserved by AMD. For Intel, it is a multi-invocation leaf with
ecx enumerating different cache details.
Add a new union for it in struct cpuid_policy, collect it from hardware in
calculate_raw_policy(), audit it in recalculate_cpuid_policy() and update
guest_cpuid() and update_domain_cpuid_info() to properly insert/extract data.
A lot of the data here will need further auditing/refinement when better
topology support is introduced, but for now, this matches the existing
toolstack behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 24 May 2016 14:46:01 +0000 (15:46 +0100)]
x86/pagewalk: Consistently use guest_walk_*() helpers for translation
hap_p2m_ga_to_gfn() and sh_page_fault() currently use guest_l1e_get_gfn() to
obtain the translation of a pagewalk. This is conceptually wrong (the
semantics of gw.l1e is an internal detail), and will actually be wrong when
PSE36 superpage support is fixed. Switch them to using guest_walk_to_gfn().
guest_walk_tables() also uses guest_l1e_get_gfn(), and is updated for
consistency.
Take the opportunity to const-correct the walk_t parameter of the
guest_walk_to_*() helpers, and implement guest_walk_to_gpa() in terms of
guest_walk_to_gfn() to avoid duplicating the actual translation calculation.
While editing guest_walk_to_gpa(), fix a latent bug by causing it to return
INVALID_PADDR rather than 0 for a failed translation, as 0 is also a valid
successful result. The sole caller, sh_page_fault(), has already confirmed
that the translation is valid, so this doesn't cause a behavioural change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Acked-by: George Dunlap <george.dunlap@citrix.com>
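The sentinel fix can be illustrated with a minimal sketch (the helper name and the exact INVALID_PADDR definition here are assumptions for illustration, not the actual Xen code):

```c
#include <stdint.h>

#define INVALID_PADDR (~(uint64_t)0)   /* assumed sentinel; 0 is a valid gpa */

/* Hypothetical stand-in for guest_walk_to_gpa(): a failed walk must
 * return a value that cannot collide with a successful translation,
 * since gpa 0 is itself a legitimate result. */
static uint64_t walk_to_gpa(int walk_ok, uint64_t gfn, uint32_t offset)
{
    if ( !walk_ok )
        return INVALID_PADDR;          /* previously returned 0 */
    return (gfn << 12) | (offset & 0xfff);
}
```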
Andrew Cooper [Sun, 3 Jul 2016 12:04:34 +0000 (13:04 +0100)]
x86/shadow: Try to correctly identify implicit supervisor accesses
All actions which refer to the active ldt/gdt/idt or task register
(e.g. loading a new segment selector) are known as implicit supervisor
accesses, even when the access originates from user code.
Right away, this fixes a bug during userspace emulation where a pagewalk for a
system table was (incorrectly) performed as a user access, causing an access
violation in the common case, as system tables reside on supervisor mappings.
The implicit/explicit distinction is necessary in the pagewalk when SMAP is
enabled. Refer to Intel SDM Vol 3 "Access Rights" for the exact details.
Introduce a new pagewalk input, and make use of the new system segment
references in hvmemul_{read,write}(). While modifying those areas, move the
calculation of the appropriate pagewalk input before its first use.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Jan Beulich [Thu, 9 Mar 2017 16:42:55 +0000 (17:42 +0100)]
x86emul: suppress reads for unhandled 0f38/0f3a extension space insns
The way these extension spaces get handled we so far always end up
going through the generic SrcMem operand fetch path for unused table
entries. Suppress actual memory accesses happening by forcing op_bytes
to zero in those cases.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 9 Mar 2017 16:41:58 +0000 (17:41 +0100)]
x86emul: correct vzero{all,upper} for non-64-bit-mode
The registers only accessible in 64-bit mode need to be left alone in
this case.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Wed, 8 Mar 2017 18:06:00 +0000 (18:06 +0000)]
xen/arm: hvm_domain does not need to be cacheline aligned
hvm_domain only contains the HVM_PARAMs, which on ARM are not used often,
so it is not necessary to have hvm_domain fit in a cacheline. Drop the
alignment to save 128 bytes in struct arch_domain.
Haozhong Zhang [Wed, 8 Mar 2017 14:11:06 +0000 (15:11 +0100)]
x86/mce: remove ASSERT's about mce_[u|d]handler_num in mce_action()
Those assertions, as well as mce_[u|d]handlers[], mce_[u|d]handler_num
and mce_action(), were Intel-only and were lifted into the common code by
c/s 3a91769d6e1. However, MCE handling on AMD does not use mce_[u|d]handlers[]
either before or after that commit, so the assertions in mce_action() about
their size do not make sense for AMD. Worse, they can crash the debug
build on AMD. Remove them to make the debug build work on AMD.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Haozhong Zhang [Wed, 8 Mar 2017 14:10:45 +0000 (15:10 +0100)]
x86/mce: clear MSR_IA32_MCG_STATUS by writing 0
On Intel CPUs, an attempt to write any non-zero value to
MSR_IA32_MCG_STATUS results in #GP.
This commit writes 0 on AMD CPUs as well, instead of just clearing the MCIP
bit, because all non-reserved bits of MSR_IA32_MCG_STATUS have been
handled at this point.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Haozhong Zhang [Wed, 8 Mar 2017 14:10:29 +0000 (15:10 +0100)]
x86/vmce: fill MSR_IA32_MCG_STATUS on all vcpus in broadcast case
The current implementation only fills MC MSRs on vcpu0 and leaves MC
MSRs on other vcpus empty in the broadcast case. When guest reads 0
from MSR_IA32_MCG_STATUS on vcpuN (N > 0), it may conclude that execution
on that vcpu cannot be recovered and panic, even though the
MSR_IA32_MCG_STATUS filled on vcpu0 may imply the injected
vMCE is actually recoverable. To avoid such an unnecessary guest panic,
set MSR_IA32_MCG_STATUS on vcpuN (N > 0) to MCG_STATUS_MCIP|MCG_STATUS_RIPV.
In addition, fill_vmsr_data(mc_bank, ...) is changed to return -EINVAL
rather than 0, if an invalid domain ID is contained in mc_bank.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
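A minimal sketch of the broadcast behaviour described above (the helper function is hypothetical; the MCG_STATUS bit positions are the architectural ones):

```c
#include <stdint.h>

#define MCG_STATUS_RIPV (1ULL << 0)  /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1)  /* error IP valid */
#define MCG_STATUS_MCIP (1ULL << 2)  /* machine check in progress */

/* Hypothetical helper: in the broadcast case, vcpu0 keeps the fully
 * filled status while every other vcpu reports MCIP|RIPV, telling the
 * guest that execution on that vcpu may be resumed. */
static uint64_t vmce_broadcast_mcg_status(unsigned int vcpu_id,
                                          uint64_t vcpu0_status)
{
    return vcpu_id == 0 ? vcpu0_status
                        : (MCG_STATUS_MCIP | MCG_STATUS_RIPV);
}
```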
Haozhong Zhang [Wed, 8 Mar 2017 14:10:06 +0000 (15:10 +0100)]
x86/mce: set mcinfo_comm.type and .size in x86_mcinfo_reserve()
All existing calls to x86_mcinfo_reserve() are followed by statements
that set the size and the type of the reserved space, so move them into
x86_mcinfo_reserve() to simplify the code.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
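The refactor can be sketched as follows (the buffer layout and names are simplified assumptions, not the actual x86_mcinfo_reserve() code):

```c
#include <stdint.h>
#include <string.h>

struct mcinfo_common {
    uint16_t type;   /* record type */
    uint16_t size;   /* record size, including this header */
};

static unsigned char mcinfo_buf[128];
static size_t mcinfo_used;

/* Hypothetical reserve helper: it now fills in the common type and
 * size fields itself, so callers no longer repeat those assignments. */
static void *mcinfo_reserve(size_t size, uint16_t type)
{
    struct mcinfo_common *c;

    if ( mcinfo_used + size > sizeof(mcinfo_buf) )
        return NULL;
    c = (struct mcinfo_common *)(mcinfo_buf + mcinfo_used);
    mcinfo_used += size;
    memset(c, 0, size);
    c->type = type;
    c->size = (uint16_t)size;
    return c;
}
```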
Haozhong Zhang [Wed, 8 Mar 2017 14:09:46 +0000 (15:09 +0100)]
x86/mce: remove unused x86_mcinfo_add()
c/s 9d13fd9fd320a7740c6446c048ff6a2990095966 switched to updating the
mcinfo buffer in place instead of using x86_mcinfo_add(). The last
uses of x86_mcinfo_add() were removed by that commit as well,
so x86_mcinfo_add() is now effectively dead code.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Haozhong Zhang [Wed, 8 Mar 2017 14:09:16 +0000 (15:09 +0100)]
x86/mce: adjust comment of callback register functions
c/s e966818264908e842e2847f579ca4d94e586eaac added
mce_need_clearbank_register below the comment of
x86_mce_callback_register(). This commit (1) adjusts the first
paragraph of the comment to be a general statement covering all callback
register functions, and (2) moves the second paragraph to the
front of x86_mce_callback_register().
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 8 Mar 2017 14:07:41 +0000 (15:07 +0100)]
x86/MCE: sanitize domain/vcpu ID handling
Storing -1 into both fields was misleading consumers: We really should
have a manifest constant for "invalid vCPU" here, and the already
existing DOMID_INVALID should be used.
Also correct a bogus (dead code) check in mca_init_global(), at once
introducing a manifest constant for the early boot "invalid vCPU"
pointer (avoiding proliferation of the open coding). Make that pointer
a non-canonical address at once.
Finally, don't leave mc_domid uninitialized in mca_init_bank().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 8 Mar 2017 14:07:14 +0000 (15:07 +0100)]
MAINTAINERS: drop Christoph Egger
Other Amazon folks indicate he's not available as a maintainer anymore
at this point in time. Maintenance of the MCE sub-component will fall
back to the x86 maintainers.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christoph Egger <chegger@amazon.de>
Andrew Cooper [Tue, 7 Mar 2017 23:32:24 +0000 (23:32 +0000)]
x86/emul: Avoid #UD in SIMD stubs
v{,u}comis{s,d}, and vcvt{,t}s{s,d}2si are two-operand instructions, while
vzero{all,upper} take no operands. Each requires vex.reg set to ~0 to avoid
suffering #UD.
Spotted while fuzzing with AFL.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Tue, 7 Mar 2017 14:58:04 +0000 (14:58 +0000)]
vlapic/viridian: abort existing APIC assist if any vector is pending in ISR
The vlapic code already aborts an APIC assist if an interrupt is deferred
because a higher priority interrupt has already been delivered (and hence
its vector is pending in the ISR).
However, it is also necessary to abort an APIC assist in the case where a
higher priority is about to be delivered because, in either case, at least
two vectors will be pending in the ISR and hence an EOI is necessary.
Also, following on from the above reasoning, the decision to start a new
APIC assist should clearly be based upon whether any other vector is
pending in the ISR, regardless of whether it is lower or higher in
priority. (In fact the code in question cannot be reached if the
vector is lower in priority). Thus the single use of
vlapic_find_lowest_vector() can be replaced with a call to
vlapic_find_highest_isr() and the former function removed.
Without this patch, because the logic is flawed, a domain_crash() results
when an attempt is made to erroneously start a new APIC assist.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
There were a couple of unneeded packed attributes in several x86-specific
structures that are obviously aligned. The only non-trivial one is
vmcb_struct, which has been checked to have the same layout with and without
the packed attribute using pahole. In that case add a build-time size check to
be on the safe side.
No functional change is expected as a result of this commit.
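A build-time size check of the kind described might look like this (the structure and its size are illustrative, not the real vmcb_struct):

```c
#include <stdint.h>

/* Illustrative structure that is naturally aligned, so a packed
 * attribute adds nothing; instead assert the expected size at build
 * time to catch any accidental layout change. */
struct example_regs {
    uint64_t rax;
    uint64_t rsp;
    uint32_t eflags;
    uint32_t _pad;
};

_Static_assert(sizeof(struct example_regs) == 24,
               "example_regs layout changed unexpectedly");
```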
Jan Beulich [Tue, 7 Mar 2017 16:09:09 +0000 (17:09 +0100)]
x86emul: test coverage for SSE3/SSSE3/SSE4* insns
... and their AVX equivalents. Note that a few instructions aren't
covered (yet), but those all fall into common pattern groups, so I
would hope that for now we can make do with what is there.
Just like for SSE/SSE2, MMX insns aren't being covered at all, as
they're not easy to deal with: The compiler refuses to emit such for
other than uses of built-in functions.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 7 Mar 2017 16:06:38 +0000 (17:06 +0100)]
x86emul: test coverage for SSE/SSE2 insns
... and their AVX equivalents. Note that a few instructions aren't
covered (yet), but those all fall into common pattern groups, so I
would hope that for now we can make do with what is there.
MMX insns aren't being covered at all, as they're not easy to deal
with: The compiler refuses to emit such for other than uses of built-in
functions.
The current way of testing AVX insns is meant to be temporary only:
once we fully support that feature, the present tests should be
replaced rather than simply supplemented with full ones.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 7 Mar 2017 16:04:08 +0000 (17:04 +0100)]
x86emul: support MMX/SSE/SSE2 converts
Note that, unlike most scalar instructions, vcvt{,t}s{s,d}2si do #UD
when VEX.l is set on at least some Intel models. To be on the safe
side, implement the most restrictive mode here for now when emulating
an Intel CPU, and simply clear the bit when emulating an AMD one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 7 Mar 2017 16:03:45 +0000 (17:03 +0100)]
x86emul: support MMX/SSE{,2,3} moves
Previously supported insns are being converted to the new model, and
several new ones are being added.
To keep the stub handling reasonably simple, integrate SET_SSE_PREFIX()
into copy_REX_VEX(), at once switching the stubs to use an empty REX
prefix instead of a double DS: one (no byte registers are being
accessed, so an empty REX prefix has no effect), except (of course) for
the 32-bit test harness build.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 7 Mar 2017 16:02:53 +0000 (17:02 +0100)]
x86emul: support most memory accessing MMX/SSE{,2,3} insns
This aims at covering most MMX/SSEn/AVX instructions in the 0x0f-escape
space with memory operands. Not covered here are irregular moves,
converts, and {,U}COMIS{S,D} (modifying EFLAGS).
Note that the distinction between simd_*_fp isn't strictly needed, but
I've kept them as separate entries since in an earlier version I needed
them to be separate, and we may well find it useful down the road to
have that distinction.
Also take the opportunity to adjust the vmovdqu test case the new
LDDQU one here has been cloned from: to zero a ymm register we don't
need to jump through hoops, as 128-bit AVX insns zero the upper portion
of the destination register, and in the disabled AVX2 code there was a
wrong YMM register used.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
xen/arm: fix affected memory range by dcache clean functions
clean_dcache_va_range and clean_and_invalidate_dcache_va_range don't
calculate the range correctly when "end" is not cacheline aligned. As a
result, the last cacheline is not skipped. Fix the issue by aligning the
start address to the cacheline size.
In addition, make the code simpler and faster in
invalidate_dcache_va_range by removing the modulo operation and using
bitmasks instead. Also remove the size adjustments in
invalidate_dcache_va_range, because the size variable is not used later
on.
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Reviewed-by: Julien Grall <julien.grall@arm.com> Tested-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
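The bitmask-based alignment mentioned above can be sketched like this (the line size and helper names are assumptions for illustration, not the actual ARM code):

```c
#include <stdint.h>

#define CACHELINE_BYTES 64   /* assumed line size for the example */

/* Round an address down/up to a cacheline boundary using bitmasks
 * rather than a modulo operation, as the simplification above does. */
static uintptr_t cacheline_align_down(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHELINE_BYTES - 1);
}

static uintptr_t cacheline_align_up(uintptr_t addr)
{
    return (addr + CACHELINE_BYTES - 1) & ~(uintptr_t)(CACHELINE_BYTES - 1);
}
```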
Razvan Cojocaru [Mon, 6 Mar 2017 16:51:15 +0000 (17:51 +0100)]
x86/mem_access: fix vm_event emulation check with altp2m enabled
Currently, p2m_mem_access_emulate_check() uses p2m_get_mem_access()
to check if the page restrictions have been lifted between the time
of sending the vm_event out and the reception of the reply - in
which case emulation is no longer required. Unfortunately,
p2m_get_mem_access() uses p2m_get_hostp2m(d) which only checks the
default EPT (view 0 in altp2m parlance). This patch fixes this by
checking the active altp2m view instead, whenever applicable.
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Andrew Cooper [Thu, 2 Mar 2017 19:58:20 +0000 (19:58 +0000)]
x86/cpuid: Fix booting on AMD Phenom 6-core platform
c/s 5cecf60f4 "x86/cpuid: Handle leaf 0x1 in guest_cpuid()" causes Linux 4.10
to crash during boot.
It turns out to be because of the reported apic_id, which was altered to be
more consistent across guests. Revert to the previous behaviour by limiting
the apic_id adjustment to HVM guests only. Whoever gets to fix the topology
representation is going to have a lot of fun with non-power-of-2 AMD
boxes.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Introduce a new Xen command line parameter called "vwfi", which stands for
virtual wfi. The default is "trap": Xen traps guest wfi and wfe
instructions. In the case of wfi, Xen calls vcpu_block on the guest
vcpu; in the case of guest wfe, Xen calls vcpu_yield on the guest vcpu.
The behavior can be changed by setting vwfi to "native"; in that case
Xen traps neither wfi nor wfe, running them in guest context.
The result is a significant reduction in irq latency (from 5000ns to 2000ns,
measured using https://github.com/edgarigl/tbm, the physical timer, and
1 pcpu dedicated to 1 vcpu). The downside is that the scheduler thinks
that the guest is busy when it is actually sleeping, leading to suboptimal
scheduling decisions.
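The parameter described above would be passed on the Xen command line, for example (boot entries abbreviated):

```
xen ... vwfi=trap      # default: trap wfi (vcpu_block) and wfe (vcpu_yield)
xen ... vwfi=native    # run wfi/wfe in guest context for lower irq latency
```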
Jan Beulich [Fri, 3 Mar 2017 16:08:36 +0000 (17:08 +0100)]
x86/SVM: correct boot time cpu_data[] handling
start_svm() already runs after cpu_data[] was set up, so it shouldn't
modify it anymore (at least not directly). Constify the involved
pointers.
Furthermore LMSLE feature detection was broken by 566ddbe833 ("x86:
Fail CPU bringup cleanly if it cannot initialise HVM"), as Andrew
Cooper has pointed out: c couldn't possibly equal &boot_cpu_data
anymore. (But since it's unsafe migration-wise for some more time,
suppress the feature actually being enabled for us.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Feng Wu [Fri, 3 Mar 2017 16:07:08 +0000 (17:07 +0100)]
VMX: properly handle pi when all the assigned devices are removed
This patch handles some corner cases when the last assigned device
is removed from the domain. In this case we should carefully handle
pi descriptor and the per-cpu blocking list, to make sure:
- all the PI descriptors are in the right state when a device is next
assigned to the domain again.
- no remaining vcpus of the domain are left on the per-cpu blocking list.
Here we call vmx_pi_unblock_vcpu() to remove the vCPU from the blocking list
if it is on the list. However, this could race with vmx_vcpu_block(), in which
case we might incorrectly add the vCPU to the blocking list while the last
device is detached from the domain. Since the situation can only occur when
detaching the last device from the domain, and that is not a frequent
operation, we use domain_pause before the detach, which is considered a clean
and maintainable solution for the situation.
Signed-off-by: Feng Wu <feng.wu@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 2 Mar 2017 11:41:17 +0000 (11:41 +0000)]
x86/emul: Hold x86_emulate() to strict X86EMUL_EXCEPTION requirements
All known paths raising faults behind the back of the emulator have been
fixed. Reinstate the original intended assertion concerning the behaviour of
X86EMUL_EXCEPTION and ctxt->event_pending.
As x86_emulate_wrapper() now covers both PV and HVM guests properly, there is
no need for the PV assertions following calls to x86_emulate().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 2 Mar 2017 12:41:38 +0000 (12:41 +0000)]
x86/hvm: Don't raise #GP behind the emulators back for CR writes
hvm_set_cr{0,4}() are reachable from the emulator, but use
hvm_inject_hw_exception() directly.
Alter the API to make the callers of hvm_set_cr{0,3,4}() responsible for
raising #GP, and apply this change to all existing callers.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Issues identified which I am purposefully not fixing in this patch:
(I will try to get around to them, but probably not in the 4.9 timeframe, at
this point.)
* hvm_set_cr3() doesn't handle bad 32bit PAE PDPTRs properly, as it doesn't
actually have a path which raises #GP.
* There is a lot of redundancy in our HVM CR setting routines, but not enough
to trivially dedup at this point.
* Both nested VT-x and SVM are liable to raise #GP with L1, rather than
failing the virtual vmentry/vmexit. This is not a change in behaviour, but
it is far more obvious now.
* The hvm_do_resume() path for vm_event processing has the same bug as the
MSR side, where exceptions are raised after %rip has moved forwards. This
is also not a change in behaviour.
Quan Xu [Fri, 3 Mar 2017 11:00:35 +0000 (12:00 +0100)]
x86/apicv: fix wrong IPI suppression during posted interrupt delivery
__vmx_deliver_posted_interrupt() wrongly used a softirq bit to decide whether
to suppress an IPI. Its logic was: the first time an IPI was sent, we set
the softirq bit. Next time, we would check that softirq bit before sending
another IPI. If the 1st IPI arrived at the pCPU which was in
non-root mode, the hardware would consume the IPI and sync PIR to vIRR.
During the process, no one (both hardware and software) will clear the
softirq bit. As a result, the following IPI would be wrongly suppressed.
This patch discards the suppression check, always sending an IPI.
The softirq also needs to be raised, but with a small change: this patch
moves the place where we raise a softirq for the
'cpu != smp_processor_id()' case into the IPI interrupt handler.
Namely, don't raise a softirq for this case, and set the interrupt handler
to pi_notification_interrupt() (in which a softirq is raised) regardless of
whether VT-d PI is enabled. The only difference is that when an IPI arrives
at a pCPU that happens to be in non-root mode, the code will not raise a
useless softirq, since the IPI is consumed by hardware, rather than raising
a softirq unconditionally.
Signed-off-by: Quan Xu <xuquan8@huawei.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 3 Mar 2017 11:00:05 +0000 (12:00 +0100)]
x86emul: assert no duplicate mappings of stub space
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Fix this by adding appropriate checks for vmcs id during vmptrld
emulation.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Fix this by emulating VMfailInvalid if the address is invalid.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Feng Wu [Fri, 3 Mar 2017 10:58:13 +0000 (11:58 +0100)]
VMX: make sure PI is in proper state before install the hooks
We may hit the last ASSERT() in vmx_vcpu_block() in the current code,
since vmx_vcpu_block() may get called before vmx_pi_switch_to()
has been installed or executed. Here we use cmpxchg to update
the NDST field; this makes sure we only update NDST when
vmx_pi_switch_to() has not been called, so that NDST is in a
proper state in vmx_vcpu_block().
Suggested-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Feng Wu <feng.wu@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Feng Wu [Fri, 3 Mar 2017 10:57:30 +0000 (11:57 +0100)]
VMX: permanently assign PI hook vmx_pi_switch_to()
The PI hook vmx_pi_switch_to() is needed even after any previously
assigned device is detached from the domain. The 'SN' bit is
also used to control the CPU side of PI: we change the state of
the SN bit in vmx_pi_switch_to() and vmx_pi_switch_from(), and
evaluate this bit in vmx_deliver_posted_intr() when trying to
deliver the interrupt in posted way via software. The problem
is that if we deassign the hooks while the vCPU is runnable in the
runqueue with 'SN' set, all future notification events will
be suppressed. This patch makes the hook permanently assigned.
Signed-off-by: Feng Wu <feng.wu@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Tue, 5 Jul 2016 09:40:21 +0000 (10:40 +0100)]
x86/hvm: Adjust hvm_nx_enabled() to match how Xen behaves
On Intel hardware, EFER is not fully switched between host and guest contexts.
In practice, this means that Xen's EFER.NX setting leaks into guest context,
and influences the behaviour of the hardware pagewalker.
When servicing a pagefault, Xen's model of guests behaviour should match
hardware's behaviour, to allow correct interpretation of the pagefault error
code, and to avoid creating observable differences in behaviour from the
guest's point of view.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andy Lutomirski [Fri, 9 Dec 2016 18:24:07 +0000 (10:24 -0800)]
x86/microcode: Replace sync_core() with cpuid_eax()
The Intel microcode driver is using sync_core() to mean "do CPUID
with EAX=1".
Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Borislav Petkov <bp@alien8.de>
[Linux commit 484d0e5c7943644cc46e7308a8f9d83be598f2b9]
[Ported to Xen] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
keios [Tue, 3 Oct 2006 08:13:49 +0000 (01:13 -0700)]
xen/common: low performance of lib/sort.c
It is a non-standard heap-sort implementation because the index
of the child node is wrong. The sort function still produces correct
results, but its performance is O(n * (log(n) + 1)), about 10% ~ 20% worse
than the standard algorithm.
Signed-off-by: keios <keios.cn@gmail.com>
[Linux commit: d3717bdf8f08a0e1039158c8bab2c24d20f492b6]
[Ported to Xen] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
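The child-index fix can be illustrated with a minimal 0-based heapsort sketch (not the lib/sort.c code itself): the children of node i sit at 2*i+1 and 2*i+2, and an off-by-one child index still sorts correctly but does extra work.

```c
#include <stddef.h>

/* Move the element at 'root' down until the max-heap property holds. */
static void sift_down(int *a, size_t root, size_t n)
{
    for (;;) {
        size_t child = 2 * root + 1;       /* correct left-child index */
        if (child >= n)
            break;
        if (child + 1 < n && a[child + 1] > a[child])
            child++;                       /* pick the larger child */
        if (a[root] >= a[child])
            break;
        int tmp = a[root]; a[root] = a[child]; a[child] = tmp;
        root = child;
    }
}

static void heap_sort(int *a, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = n / 2; i-- > 0; )      /* build the max-heap */
        sift_down(a, i, n);
    for (size_t i = n - 1; i > 0; i--) {   /* repeatedly extract the max */
        int tmp = a[0]; a[0] = a[i]; a[i] = tmp;
        sift_down(a, 0, i);
    }
}
```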