Roger Pau Monne [Wed, 29 Jan 2020 11:38:00 +0000 (12:38 +0100)]
nvmx: always trap accesses to x2APIC MSRs
Nested VMX doesn't expose support for
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE,
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY or
SECONDARY_EXEC_APIC_REGISTER_VIRT, and hence the x2APIC MSRs should
always be trapped in the nested guest MSR bitmap, or else a nested
guest could access the hardware x2APIC MSRs given certain conditions.
Accessing the hardware MSRs could be achieved by forcing the L0 Xen to
use SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE and
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY or
SECONDARY_EXEC_APIC_REGISTER_VIRT (if supported), and then creating a
L2 guest with a MSR bitmap that doesn't trap accesses to the x2APIC
MSR range. Then OR'ing both L0 and L1 MSR bitmaps would result in a
bitmap that doesn't trap certain x2APIC MSRs and a VMCS that doesn't
have SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE and
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY or
SECONDARY_EXEC_APIC_REGISTER_VIRT set either.
Fix this by making sure x2APIC MSRs are always trapped in the nested
MSR bitmap.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
Changes since v4:
- Fix size of x2APIC region to use 0x100.
Changes since v3:
- Use bitmap_set.
Changes since v1:
- New in this version (split from #1 patch).
- Use non-locked set_bit.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v4:
- Introduce BIT_WORD in generic header bitops.h (instead of the x86
one).
- Include byteorder.h for __LITTLE_ENDIAN
- Remove EXPORT_SYMBOL.
Roger Pau Monne [Mon, 17 Feb 2020 10:16:48 +0000 (11:16 +0100)]
arm: rename BIT_{WORD/MASK/PER_WORD) to BITOP_*
So BIT_WORD can be imported from Linux. The difference between current
Linux implementation of BIT_WORD is that the size of the word unit is
a long integer, while the Xen one is hardcoded to 32 bits.
Current users of BITOP_WORD on Arm (which considers a word a long
integer) are switched to use the generic BIT_WORD which also operates
on long integers.
No functional change intended.
Suggested-by: Julien Grall <julien@xen.org> Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v5:
- Also rename BIT_MASK and BITS_PER_WORD.
Roger Pau Monne [Tue, 7 Jan 2020 11:32:39 +0000 (12:32 +0100)]
nvmx: implement support for MSR bitmaps
Current implementation of nested VMX has a half baked handling of MSR
bitmaps for the L1 VMM: it maps the L1 VMM provided MSR bitmap, but
doesn't actually load it into the nested vmcs, and thus the nested
guest vmcs ends up using the same MSR bitmap as the L1 VMM.
This is wrong as there's no assurance that the set of features enabled
for the L1 vmcs are the same that L1 itself is going to use in the
nested vmcs, and thus can lead to misconfigurations.
For example L1 vmcs can use x2APIC virtualization and virtual
interrupt delivery, and thus some x2APIC MSRs won't be trapped so that
they can be handled directly by the hardware using virtualization
extensions. On the other hand, the nested vmcs created by L1 VMM might
not use any of such features, so using a MSR bitmap that doesn't trap
accesses to the x2APIC MSRs will be leaking them to the underlying
hardware.
Fix this by crafting a merged MSR bitmap between the one used by L1
and the nested guest.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v4:
- Add static to vcpu_relinquish_resources.
Changes since v3:
- Free the merged MSR bitmap page in nvmx_purge_vvmcs.
Changes since v2:
- Pass shadow_ctrl into update_msrbitmap, and check there if
CPU_BASED_ACTIVATE_MSR_BITMAP is set.
- Do not enable MSR bitmap unless it's enabled in both L1 and L2.
- Rename L1 guest to L2 in nestedvmx struct comment.
Changes since v1:
- Split the x2APIC MSR fix into a separate patch.
- Move setting MSR_BITMAP vmcs field into load_vvmcs_host_state for
virtual vmexit.
- Allocate memory with MEMF_no_owner.
- Use tabs to align comment of the nestedvmx struct field.
Set 'e' correctly to reflect the location that Xen is actually relocated
to from its default 2MiB location. Not 2MiB below that.
This is only vaguely a bug fix. The "missing" 2MiB would have been used
in the end, and fed to the allocator. It's just that other things don't
get to sit right up *next* to the Xen image, and it isn't very tidy.
For live update, I'd quite like a single contiguous region for the
reserved bootmem and Xen, allowing the 'slack' in the former to be used
when Xen itself grows larger. Let's not allow 2MiB of random heap pages
to get in the way...
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
David Woodhouse [Sat, 1 Feb 2020 00:32:56 +0000 (00:32 +0000)]
x86/smp: reset x2apic_enabled in smp_send_stop()
Just before smp_send_stop() re-enables interrupts when shutting down
for reboot or kexec, it calls __stop_this_cpu() which in turn calls
disable_local_APIC(), which puts the APIC back in to the mode Xen found
it in at boot.
If that means turning x2APIC off and going back into xAPIC mode, then
a timer interrupt occurring just after interrupts come back on will
lead to a GP# when apic_timer_interrupt() attempts to ack the IRQ
through the EOI register in x2APIC MSR 0x80b:
We can't clear the global x2apic_enabled variable in disable_local_APIC()
itself because that runs on each CPU. Instead, correct it (by using
current_local_apic_mode()) in smp_send_stop() while interrupts are still
disabled immediately after calling __stop_this_cpu() for the boot CPU,
after all other CPUs have been stopped.
cf: d639bdd9bbe ("x86/apic: Disable the LAPIC later in smp_send_stop()")
... which didn't quite fix it completely.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Igor Druzhinin [Tue, 4 Feb 2020 21:49:36 +0000 (21:49 +0000)]
x86/shim: suspend and resume platform time correctly
Similarly to S3, platform time needs to be saved on guest suspend
and restored on resume respectively. This should account for expected
jumps in PV clock counter value after resume. time_suspend/resume()
are safe to use in PVH setting as is since any existing operations
with PIT/HPET that they do would simply be ignored if PIT/HPET is
not present.
Additionally, add resume callback for Xen PV clocksource to avoid
its breakage on migration.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Tue, 11 Feb 2020 15:02:31 +0000 (15:02 +0000)]
x86/amd: Avoid cpu_has_hypervisor evaluating true on native hardware
Currently when booting native on AMD hardware, cpuidmask_defaults._1cd gets
configured with the HYPERVISOR bit before native CPUID is scanned for feature
bits.
This results in cpu_has_hypervisor becoming set as part of identify_cpu(), and
ends up appearing in the raw and host CPU policies.
A combination of this bug, and c/s bb502a8ca59 "x86: check feature flags after
resume" which checks that feature bits don't go missing, results in broken S3
on AMD hardware.
Alter amd_init_levelling() to exclude the HYPERVISOR bit from
cpumask_defaults, and update domain_cpu_policy_changed() to allow it to be
explicitly forwarded.
This also fixes a bug on kexec, where the hypervisor bit is left enabled for
the new kernel to find.
These changes highlight a further but - dom0 construction is asymetric with
domU construction, by not having any calls to domain_cpu_policy_changed().
Extend arch_domain_create() to always call domain_cpu_policy_changed().
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Fri, 24 Jan 2020 14:49:35 +0000 (14:49 +0000)]
x86/vmx: use a MEMF_no_refcount domheap page for APIC_DEFAULT_PHYS_BASE
vmx_alloc_vlapic_mapping() currently contains some very odd looking code
that allocates a MEMF_no_owner domheap page and then shares with the guest
as if it were a xenheap page. This then requires vmx_free_vlapic_mapping()
to call a special function in the mm code: free_shared_domheap_page().
By using a MEMF_no_refcount domheap page instead, the odd looking code in
vmx_alloc_vlapic_mapping() can simply use get_page_and_type() to set up a
writable mapping before insertion in the P2M and vmx_free_vlapic_mapping()
can simply release the page using put_page_alloc_ref() followed by
put_page_and_type(). This then allows free_shared_domheap_page() to be
purged.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Paul Durrant [Thu, 30 Jan 2020 12:56:42 +0000 (12:56 +0000)]
mm: make pages allocated with MEMF_no_refcount safe to assign
Currently it is unsafe to assign a domheap page allocated with
MEMF_no_refcount to a domain because the domain't 'tot_pages' will not
be incremented, but will be decrement when the page is freed (since
free_domheap_pages() has no way of telling that the increment was skipped).
This patch allocates a new 'count_info' bit for a PGC_extra flag
which is then used to mark pages when alloc_domheap_pages() is called
with MEMF_no_refcount. assign_pages() because it still needs to call
domain_adjust_tot_pages() to make sure the domain is appropriately
referenced. Hence it is modified to do that for PGC_extra pages even if it
is passed MEMF_no_refount.
The number of PGC_extra pages assigned to a domain is tracked in a new
'extra_pages' counter, which is then subtracted from 'total_pages' in
the domain_tot_pages() helper. Thus 'normal' page assignments will still
be appropriately checked against 'max_pages'.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien@xen.org> Acked-by: George Dunlap <George.Dunlap@eu.citrix.com>
Paul Durrant [Thu, 30 Jan 2020 11:55:35 +0000 (11:55 +0000)]
add a domain_tot_pages() helper function
This patch adds a new domain_tot_pages() inline helper function into
sched.h, which will be needed by a subsequent patch.
No functional change.
NOTE: While modifying the comment for 'tot_pages' in sched.h this patch
makes some cosmetic fixes to surrounding comments.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: George Dunlap <George.Dunlap@eu.citrix.com> Acked-by: Julien Grall <julien@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Wei Liu [Thu, 13 Feb 2020 21:40:27 +0000 (21:40 +0000)]
libxl: mark parameters in stub functions as unused
Hopefully this can fix issues like:
In file included from ../../src/libxl/xen_xl.c:24:0:
/home/osstest/build.147035.build-amd64-libvirt/xendist/usr/local/include/libxl.h: In function 'libxl_cpuid_apply_policy':
/home/osstest/build.147035.build-amd64-libvirt/xendist/usr/local/include/libxl.h:2345:56: error: unused parameter 'ctx' [-Werror=unused-parameter]
static inline void libxl_cpuid_apply_policy(libxl_ctx *ctx, uint32_t domid) {}
Fixes: dacb80f9 ("tools/libxl: Remove libxl_cpuid_{set,apply_policy}() from the API") Signed-off-by: Wei Liu <wl@xen.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Thu, 13 Feb 2020 13:42:00 +0000 (13:42 +0000)]
automation: update debian:unstable-arm64v8 to have python3-config
The Arm container wasn't updated in the original patch.
Fixes: 1a3673da6482 ("automation: updating container to have python3-config binary") Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Wei Liu <wl@xen.org>
Juergen Gross [Wed, 12 Feb 2020 07:41:53 +0000 (08:41 +0100)]
xenstore: add console xenstore entries for xenstore stubdom
In order to be able to connect to the console of Xenstore stubdom we
need to create the appropriate entries in Xenstore.
For the moment we don't support xenconsoled living in another domain
than dom0, as this information isn't available other then via
Xenstore which we are just setting up.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org>
In order to be able to get access to the console of Xenstore stubdom
we need an appropriate granttab entry. So call xc_dom_gnttab_init()
when constructing the domain and preset some information needed
for that function in the dom structure.
We need to create the event channel for the console, too. Do that and
store all necessary data locally.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wl@xen.org>
Jan Beulich [Wed, 12 Feb 2020 09:52:20 +0000 (10:52 +0100)]
dom0-build: fix build with clang5
With non-empty CONFIG_DOM0_MEM clang5 produces
dom0_build.c:344:24: error: use of logical '&&' with constant operand [-Werror,-Wconstant-logical-operand]
if ( !dom0_mem_set && CONFIG_DOM0_MEM[0] )
^ ~~~~~~~~~~~~~~~~~~
dom0_build.c:344:24: note: use '&' for a bitwise operation
if ( !dom0_mem_set && CONFIG_DOM0_MEM[0] )
^~
&
dom0_build.c:344:24: note: remove constant to silence this warning
if ( !dom0_mem_set && CONFIG_DOM0_MEM[0] )
~^~~~~~~~~~~~~~~~~~~~~
1 error generated.
Obviously neither of the two suggestions are an option here. Oddly
enough swapping the operands of the && helps, while e.g. casting or
parenthesizing doesn't. Another workable variant looks to be the use of
!! on the constant.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien@xen.org> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Wed, 5 Feb 2020 15:25:21 +0000 (15:25 +0000)]
tools/libxl: Combine legacy CPUID handling logic
While we are in the process of overhauling boot time CPUID/MSR handling, the
existing logic is going to have to remain in roughly this form for backwards
compatibility.
Fold libxl__cpuid_apply_policy() and libxl__cpuid_set() together into a single
libxl__cpuid_legacy() to reduce the complexity for callers.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Thu, 5 Dec 2019 17:19:47 +0000 (17:19 +0000)]
MAINTAINERS: Add explicit check-in policy section
The "nesting" section in the MAINTAINERS file was not initially
intended to describe the check-in policy for patches, but only how
nesting worked; but since there was no check-in policy, it has been
acting as a de-facto policy.
One problem with this is that the policy is not complete: It doesn't
cover open objections, time to check-in, or so on. The other problem
with the policy is that, as written, it doesn't account for
maintainers submitting patches to files which they themselves
maintain. This is fine for situations where there are are multiple
maintainers, but not for situations where there is only one
maintainer.
Add an explicit "Check-in policy" section to the MAINTAINERS document
to serve as the canonical reference for the check-in policy. Move
paragraphs not explicitly related to nesting into it.
While here, "promote" the "The meaning of nesting" section title.
DISCUSSION
This seems to be a change from people's understanding of the current
policy. Most people's understanding of the current policy seems to be:
1. In order to get a change to a given file committed, it must have
an Ack or Review from at least one *maintainer* of that file other
than the submitter.
2. In the case where a file has only one maintainer, it must have an
Ack or Review from a "nested" maintainer.
I.e., if I submitted something to x86/mm, it would require an Ack from
Jan or Andy, or (in exceptional circumstances) The Rest; but an Ack from
(say) Roger or Juergen wouldn't suffice.
Let's call this the "maintainer-ack" approach (because it must have an
ack or r-b from a maintainer to be checked in), and the proposal in
this patch the "maintainer-approval" (since SoB from a maintainer
indicates approval).
The core issue I have with "maintainer-ack" is that it makes the
maintainer less privileged with regard to writing code than
non-maintainers. If component X has maintainers A and B, then a
non-maintainer can have code checked in if reviewed either by A or B.
If A or B wants code checked in, they have to wait for exactly one
person to review it.
In fact, if B is quite busy, the easiest way for A really to get their
code checked in might be to hand it to a non-maintainer N, and ask N
to submit it as their own. Then A can Ack the patches and check them
in.
The current system, therefore, either sets up a perverse incentive (if
you think the behavior described above is unacceptable) or unnecessary
bureaucracy (if you think it's acceptable). Either way I think we
should set up our system to avoid it.
Other variations on "maintainer-ack" have been proposed:
- Allow maintainer's patches to go in with an R-b from "designated
reviewers"
- Allow maintainer's patches to go in with an Ack from more general
maintainer
Both fundamentally make it harder for maintainers to get their code in
and/or reviewed effectively than non-maintainers, setting up the
perverse incentive / unnecessary bureaucracy.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 10 Feb 2020 18:33:26 +0000 (18:33 +0000)]
x86/pvh: Adjust dom0's starting state
Fixes: b25fb1a04e "xen/pvh: Fix segment selector ABI" Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Sun, 2 Feb 2020 18:23:47 +0000 (18:23 +0000)]
AMD/IOMMU: Treat head/tail pointers as byte offsets
The MMIO registers as already byte offsets. Using them in this form removes
the need to shift their values for use.
It is also inefficient to store both entries and alloc_size (which only differ
by entry_size). Rename alloc_size to size, and drop entries entirely, which
simplifies the allocation/deallocation helpers slightly.
Mark send_iommu_command() and invalidate_iommu_all() as static, as they have
no external declaration or callers.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Thu, 6 Feb 2020 16:48:10 +0000 (16:48 +0000)]
MAINTAINERS: cc community manager on patches to CHANGELOG.md
The purpose of the change-log is to be a human-readable list of notable
changes and, as such, additions to it are likely of interest to the
community manager in preparing blog entries, release summaries, etc.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org> Acked-by: Julien Grall <julien@xen.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Mon, 10 Feb 2020 11:27:32 +0000 (11:27 +0000)]
x86/svm: Reduce vmentry latency
Writing to the stack pointer in the middle of a line of pop operations is
specifically recommended against by the optimisation guide, and is a technique
used by Speculative Load Hardening to combat SpectreRSB.
In practice, it causes all further stack-relative accesses to block until the
write to the stack pointer retires, so the stack engine can get back in sync.
Pop into any dead register to discard %rax's value without clobbering the
stack engine. Smaller compiled code, and runs faster.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 3 Feb 2020 13:50:34 +0000 (13:50 +0000)]
AMD/IOMMU: Treat guest head/tail pointers as byte offsets
The MMIO registers as already formatted as byte offsets. Start by masking out
reserved bits, which fixes an implementation bug (reserved bits should be
read-only zero, rather than preserving their previously-written value). As a
consequence, we can use the values directly, instead of masking/shifting on
every use.
Store the buffer size, rather than the number of entries, to keep the same
units for comparison purposes.
This simplifies guest_iommu_get_table_mfn() by dropping the entry_size
parameter, and simplifies the map_domain_page() handling by being able to drop
the log_base variables.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Drop all status register MASK/SHIFT constants, and enumerate the bits
normally. Rename EVENT_OVERFLOW to EVENT_LOG_OVERFLOW for consistency. (The
field name in the spec is inconsistent, despite the description referring to
an overflow of the event log.)
The only semantic change is in iommu_reset_log() where 'run_bit' changes from
being a bit position to being a single-bit mask. Update some local variable
types to be more suitable.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 3 Feb 2020 13:09:17 +0000 (13:09 +0000)]
AMD/IOMMU: Move headers to be local
We currently have amd-iommu-defs.h, amd-iommu-proto.h and amd-iommu.h, and no
references outside of the AMD IOMMU driver.
Keep iommu-defs.h as is, but merge amd-iommu.h and amd-iommu-proto.h to just
iommu.h, and move them both into drivers/passthrough/amd/. (While merging,
drop the bogus #pragma pack around the *_entry structures.)
Take the opportunity to trim the include lists, including x86/mm/p2m.c
which (AFAICT) hasn't needed this include since c/s aef3f2275 "x86/mm/p2m:
break into common, pt-implementation and pod parts" in 2011.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Sat, 8 Feb 2020 14:47:48 +0000 (14:47 +0000)]
xen/pvh: Fix segment selector ABI
The written ABI states that %es will be set up, but libxc doesn't do so. In
practice, it breaks `rep movs` inside guests before they reload %es.
The written ABI doesn't mention %ss, but libxc does set it up. Having %ds
different to %ss is obnoxous to work with, as different registers have
different implicit segments.
Modify the spec to state that %ss is set up as a flat read/write segment.
This a) matches the Multiboot 1 spec, b) matches what is set up in practice,
and c) is the more sane behaviour for guests to use.
Fixes: 68e1183411b ('libxc: introduce a xc_dom_arch for hvm-3.0-x86_32 guests') Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jeff Kubascik [Tue, 4 Feb 2020 19:51:50 +0000 (14:51 -0500)]
xen/arm: Handle unimplemented VGICv3 registers as RAZ/WI
Per the ARM Generic Interrupt Controller Architecture Specification (ARM
IHI 0069E), reserved registers should generally be treated as RAZ/WI.
To simplify the VGICv3 design and improve guest compatibility, treat the
default case for GICD and GICR registers as read_as_zero/write_ignore.
Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com> Acked-by: Julien Grall <julien@xen.org>
Julien Grall [Thu, 6 Feb 2020 15:41:18 +0000 (15:41 +0000)]
xen/include: public: Document the padding in struct xen_hvm_param
There is an implicit padding of 2 bytes in struct xen_hvm_param between
the field domid and index. Make it explicit by introduce a padding
field. This can also serve as documentation.
Note that I don't think we can mandate it to be zero because a guest may
not have initialized the padding.
Signed-off-by: Julien Grall <jgrall@amazon.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org>
Jan Beulich [Thu, 6 Feb 2020 15:23:30 +0000 (16:23 +0100)]
x86/HVM: reduce scope of pfec in hvm_emulate_init_per_insn()
It needs calculating only in one out of three cases. Re-structure the
code a little such that the variable truly gets calculated only when we
don't get any insn bytes from elsewhere, and hence need to (try to)
fetch them. Also OR in PFEC_insn_fetch right in the initializer.
While in this mood, restrict addr's scope as well.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <pdurrant@amazon.com>
Wei Liu [Wed, 5 Feb 2020 18:02:24 +0000 (18:02 +0000)]
x86/guest/xen: only set HVM parameter on BSP
There is no need for every CPU to set a guest property.
Suggested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Wei Liu <wl@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Thu, 6 Feb 2020 08:55:18 +0000 (09:55 +0100)]
domctl/vNUMA: avoid arithmetic overflow
Checking the result of a multiplication against a certain limit has no
sufficient implication on the original value's range. In the case here
it is in particular problematic that while handling the domctl we do
if ( copy_from_guest(info->vdistance, uinfo->vdistance,
nr_vnodes * nr_vnodes) )
goto vnuma_fail;
which means copying sizeof(unsigned int) * (nr_vnodes * nr_vnodes)
bytes, and the handling of XENMEM_get_vnumainfo similarly has
Jan Beulich [Thu, 6 Feb 2020 08:53:12 +0000 (09:53 +0100)]
xmalloc: guard against integer overflow
There are hypercall handling paths (EFI ones are what this was found
with) needing to allocate buffers of a caller specified size. This is
generally fine, as our page allocator enforces an upper bound on all
allocations. However, certain extremely large sizes could, when adding
in allocator overhead, result in an apparently tiny allocation size,
which would typically result in either a successful allocation, but a
severe buffer overrun when using that memory block, or in a crash right
in the allocator code.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Thu, 6 Feb 2020 08:52:33 +0000 (09:52 +0100)]
EFI: don't leak heap contents through XEN_EFI_get_next_variable_name
Commit 1f4eb9d27d0e ("EFI: fix getting EFI variable list on some
systems") switched to using the caller provided size for the copy-out
without making sure the copied buffer is properly scrubbed.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Thu, 6 Feb 2020 08:51:17 +0000 (09:51 +0100)]
EFI: re-check {get,set}-variable name strings after copying in
A malicious guest given permission to invoke XENPF_efi_runtime_call may
play with the strings underneath Xen sizing them and copying them in.
Guard against this by re-checking the copyied in data for consistency
with the initial sizing. At the same time also check that the actual
copy-in is in fact successful, and switch to the lighter weight non-
checking flavor of the function.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Wei Liu [Wed, 15 Jan 2020 16:40:49 +0000 (16:40 +0000)]
x86/hyperv: setup hypercall page
Hyper-V uses a technique called overlay page for its hypercall page. It
will insert a backing page to the guest when the hypercall functionality
is enabled. That means we can use a page that is not backed by real
memory for hypercall page.
To avoid shattering L0 superpages and treading on any MMIO areas
residing in low addresses, use the top-most addressable page for that
purpose. Adjust e820 map accordingly.
We also need to register Xen's guest OS ID to Hyper-V. Use 0x3 as the
vendor ID. Fix the comment in hyperv-tlfs.h while at it.
Signed-off-by: Wei Liu <liuwe@microsoft.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Paul Durrant <pdurrant@amazon.com>
Wei Liu [Wed, 8 Jan 2020 21:35:23 +0000 (21:35 +0000)]
x86: provide executable fixmap facility
This allows us to set aside some address space for executable mapping.
This fixed map range starts from XEN_VIRT_END so that it is within reach
of the .text section.
Shift the percpu stub range and shrink livepatch range accordingly.
Signed-off-by: Wei Liu <liuwe@microsoft.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tamas K Lengyel [Wed, 5 Feb 2020 12:53:14 +0000 (13:53 +0100)]
x86/mem_sharing: use default_access in add_to_physmap
When plugging a hole in the target physmap don't use the access permission
returned by __get_gfn_type_access as it is non-sensical (p2m_access_n) in
the use-case add_to_physmap was intended to be used in. It leads to vm_events
being sent out for access violations at unexpected locations. Make use of
p2m->default_access instead and document the ambiguity surrounding "hole"
types and corner-cases with custom mem_access being set on holes.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tamas K Lengyel [Wed, 5 Feb 2020 12:52:29 +0000 (13:52 +0100)]
x86/hvm: introduce hvm_copy_context_and_params
Currently the hvm parameters are only accessible via the HVMOP hypercalls. In
this patch we introduce a new function that can copy both the hvm context and
parameters directly into a target domain.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Wed, 5 Feb 2020 12:50:46 +0000 (13:50 +0100)]
x86/vvmx: don't enable interrupt window when using virt intr delivery
If virtual interrupt delivery is used to inject the interrupt to the
guest the interrupt window shouldn't be enabled, as the interrupt is
already injected using the GUEST_INTR_STATUS vmcs field.
Reported-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Roger Pau Monné [Wed, 5 Feb 2020 12:50:09 +0000 (13:50 +0100)]
x86/vvmx: fix VM_EXIT_ACK_INTR_ON_EXIT handling
When VM_EXIT_ACK_INTR_ON_EXIT is clear in the vmexit control vmcs
register the bit 31 of VM_EXIT_INTR_INFO must be 0, in order to denote
that the field doesn't contain any interrupt information. This is not
currently acknowledged as the field always get filled with valid
interrupt information, regardless of whether VM_EXIT_ACK_INTR_ON_EXIT
is set.
Fix this and only fill VM_EXIT_INTR_INFO when VM_EXIT_ACK_INTR_ON_EXIT
is set. Note that this requires one minor change in
nvmx_update_apicv in order to obtain the interrupt information from
the internal state rather than the nested vmcs register.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Roger Pau Monné [Wed, 5 Feb 2020 12:49:09 +0000 (13:49 +0100)]
x86/vvmx: fix virtual interrupt injection when Ack on exit control is used
When doing a virtual vmexit (ie: a vmexit handled by the L1 VMM)
interrupts shouldn't be injected using the virtual interrupt delivery
mechanism unless the Ack on exit vmexit control bit isn't set in the
nested vmcs.
Gate the call to nvmx_update_apicv helper on whether the nested vmcs
has the Ack on exit bit set in the vmexit control field.
Note that this fixes the usage of x2APIC by the L1 VMM, at least when
the L1 VMM is Xen.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Tue, 4 Feb 2020 20:29:38 +0000 (20:29 +0000)]
libxc/restore: Fix REC_TYPE_X86_PV_VCPU_XSAVE data auditing (take 2)
It turns out that a bug (since forever) in Xen causes XSAVE records to have
non-architectural behaviour on xsave-capable hardware, when a PV guest has not
touched the state.
In such a case, the data record returned from Xen is 2*uint64_t, both claiming
the (illegitimate) state of %xcr0 and %xcr0_accum being 0.
Adjust the bound in handle_x86_pv_vcpu_blob() to cope with this.
Fixes: 2a62c22715b "libxc/restore: Fix data auditing in handle_x86_pv_vcpu_blob()" Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Wed, 5 Feb 2020 11:24:12 +0000 (11:24 +0000)]
libxl: fix assertion failure in stub domain creation
An assertion in libxl__domain_make():
'soft_reset || *domid == INVALID_DOMID'
does not hold true for stub domain creation, where soft_reset is false
but the passed in domid == 0. This is easily fixed by changing the
initializer in libxl__spawn_stub_dm().
NOTE: The comment for XEN_DOMCTL_createdomain in domctl.h is changed to
reflect reality.
Fixes: 75259239d85d ("libxl_create: make 'soft reset' explicit") Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Wei Liu <wl@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Stefan Bader [Tue, 4 Feb 2020 09:34:23 +0000 (09:34 +0000)]
tools/xenstore: Re-introduce (fake) xs_restrict call to preserve ABI
libxenstore3.0 in Xen 4.8 had this function. We don't really want to
bump the ABI version (soname) just for this, since we don't think
there are actual callers anywhere. But tools complain about the
symbol going away.
So, provide a function xs_restrict which conforms to the original
semantics, although it always fails.
Gbp-Pq: Topic xenstore
Gbp-Pq: Name tools-fake-xs-restrict.patch Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Mon, 3 Feb 2020 12:07:19 +0000 (13:07 +0100)]
x86/EPT: do away with hidden GUEST_TABLE_MAP_FAILED == 0 assumptions
The code is quite a bit easier to read and to reason about this way,
I think.
In ept_set_entry() additionally change the function's return value in
the MAP_FAILED case to -ENOMEM; -ENOENT would be applicable only when
ept_next_entry() was invoked with "read_only" set to true.
In two cases, where ept_next_level() follows an ept_split_superpage()
invocation, actually tighten the loop exit condition from
"== MAP_FAILED" to "!= NORMAL_PAGE". Continuing these loops for other
than NORMAL_PAGE is invalid, and there are ASSERT()s in place after
these loops.
Also reduce the scope of "ret" variables where possible, in particular
to better distinguish them from "rc" often used in the same function.
Finally drop pointless "else" in a few areas touched anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Juergen Gross [Mon, 3 Feb 2020 12:04:30 +0000 (13:04 +0100)]
xen: split parameter related definitions in own header file
Move the parameter related definitions from init.h into a new header
file param.h. This will avoid include hell when new dependencies are
added to parameter definitions.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Julien Grall <julien@xen.org> Acked-by: Dario Faggioli <dfaggioli@suse.com> Acked-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Mon, 27 Jan 2020 13:34:12 +0000 (13:34 +0000)]
xen/x86: domctl: Don't leak data via XEN_DOMCTL_gethvmcontext
The HVM context may not fill up the full buffer passed by the caller.
While we report corectly the size of the context, we will still be
copying back the full size of the buffer.
As the buffer is allocated through xmalloc(), we will be copying some
bits from the previous allocation.
Only copy back the part of the buffer used by the HVM context to prevent
any leak.
Note that per XSA-72, this is not a security issue.
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Mon, 20 Jan 2020 14:10:57 +0000 (14:10 +0000)]
xen/x86: domain: Remove specific case when allocating struct domain
Commit 8916fcf4577 "x86/domain: compile with lock_profile=y enabled"
allowed the struct domain to use more than a PAGE_SIZE (i.e 4096).
However, the function free_domheap_struct() will only free the first
page.
We could modify the free part to free the correct number of pages, but
the structure has been fitting in a page (even with lock profile
enabled) since commit 428607a410 "x86: shrink 'struct domain', was
already PAGE_SIZE" (part of Xen 4.7).
Therefore, the specific case for lock profile is now removed.
This is not a security issue because struct domain can only be bigger
than a page size for lock profiling. The feature can only be selected
in DEBUG and EXPERT mode.
Fixes: 8916fcf4577 ("x86/domain: compile with lock_profile=y enabled") Reported-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Fri, 31 Jan 2020 14:25:57 +0000 (15:25 +0100)]
tools/xenstore: don't apply write limiting for privileged domain
Xenstore write limiting should not be applied to dom0. Unfortunately
write limiting is disabled only for connections via sockets. When
running in a stubdom Xenstore will apply write limiting to dom0, too.
Change that by testing for the domain to be privileged as well.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Fri, 31 Jan 2020 15:01:45 +0000 (15:01 +0000)]
libxl: generalise libxl__domain_userdata_lock()
This function implements a file-based lock with a file name generated
from a domid.
This patch splits it into two, generalising the core of the locking code
into a new libxl__lock_file() function which operates on a specified file,
leaving just the file name generation in libxl__domain_userdata_lock().
This patch also generalises libxl__unlock_domain_userdata() to
libxl__unlock_file() and modifies all call-sites.
Suggested-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Paul Durrant [Fri, 31 Jan 2020 15:01:44 +0000 (15:01 +0000)]
libxl_create: make 'soft reset' explicit
The 'soft reset' code path in libxl__domain_make() is currently taken if a
valid domid is passed into the function. A subsequent patch will enable
higher levels of the toolstack to determine the domid of newly created or
restored domains and therefore this criteria for choosing 'soft reset'
will no longer be usable.
This patch adds an extra boolean option to libxl__domain_make() to specify
whether it is being invoked in soft reset context and appropriately
modifies callers to choose the right value. To facilitate this, a new
'soft_reset' boolean field is added to struct libxl__domain_create_state
and the 'domid_soft_reset' field is renamed to 'domid' in anticipation of
its wider remit. For the moment do_domain_create() will always set
domid to INVALID_DOMID and hence we can add an assertion into
libxl__domain_create() that, if it is not called in soft reset context,
the passed in domid is exactly that value.
Whilst in the neighbourhood, some checks of 'restore_fd > -1' have been
replaced by 'restore_fd >= 0' to be more conventional and consistent with
checks of 'restore_fd < 0'.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Paul Durrant [Fri, 31 Jan 2020 15:01:43 +0000 (15:01 +0000)]
libxl: add definition of INVALID_DOMID to the API
Currently both xl and libxl have internal definitions of INVALID_DOMID
which happen to be identical. However, for the purposes of describing the
behaviour of libxl_domain_create_new/restore() it is useful to have a
specified invalid value for a domain id.
This patch therefore moves the libxl definition from libxl_internal.h to
libxl.h and removes the internal definition from xl_utils.h. The hardcoded
'-1' passed back via domcreate_complete() is then updated to INVALID_DOMID
and comment above libxl_domain_create_new/restore() is accordingly
modified.
NOTE: The value of INVALID_DOMID (~0) is distinct from the hypervisor's
DOMID_INVALID. This patch preserves that value.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Fri, 31 Jan 2020 15:47:29 +0000 (16:47 +0100)]
x86/HVM: relinquish resources also from hvm_domain_destroy()
Domain creation failure paths don't call domain_relinquish_resources(),
yet allocations and alike done from hvm_domain_initialize() need to be
undone nevertheless. Call the function also from hvm_domain_destroy(),
after making sure all descendants are idempotent.
Note that while viridian_{domain,vcpu}_deinit() were already used in
ways suggesting they're idempotent, viridian_time_vcpu_deinit() actually
wasn't: One can't kill a timer that was never initialized.
For hvm_destroy_all_ioreq_servers()'s purposes make
relocate_portio_handler() return whether the to be relocated port range
was actually found. This seems cheaper than introducing a flag into
struct hvm_domain's ioreq_server sub-structure.
In hvm_domain_initialise() additionally
- use XFREE() also to replace adjacent xfree(),
- use hvm_domain_relinquish_resources() as being idempotent now.
There as well as in hvm_domain_destroy() the explicit call to
rtc_deinit() isn't needed anymore.
In hvm_domain_relinquish_resources() additionally drop a no longer
relevant if().
Fixes: e7a9b5e72f26 ("viridian: separately allocate domain and vcpu structures") Fixes: 26fba3c85571 ("viridian: add implementation of synthetic timers") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <pdurrant@amazon.com>
Jan Beulich [Thu, 30 Jan 2020 16:19:46 +0000 (17:19 +0100)]
x86: fold linker script pre-processing rules
There's no need to have twice almost the same rule. Simply add the extra
-DEFI to AFLAGS for the EFI variant, and specify both targets for the
then single rule.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 30 Jan 2020 16:18:12 +0000 (17:18 +0100)]
x86: undo part of "refine link time stub area related assertion"
The original check was not too strict: While we don't use one page of
memory per CPU, we do use ons page of VA space per CPU. It is the
latter which matters here.
Undo that part of the change, but leave everything else in place.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Anthony PERARD [Wed, 11 Dec 2019 13:55:06 +0000 (13:55 +0000)]
xen: Move CONFIG_INDIRECT_THUNK to Kconfig
Now that Kconfig has the capability to run shell command when
generating CONFIG_* we can use it in some cases to test CFLAGS.
CONFIG_INDIRECT_THUNK is a good example that wants to exist both in
Makefile and as a C macro, which Kconfig do. So use Kconfig to
generate CONFIG_INDIRECT_THUNK and have the CFLAGS depends on that.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Anthony PERARD [Wed, 4 Dec 2019 17:13:51 +0000 (17:13 +0000)]
xen: Import cc-ifversion from Kbuild
This is in preparation of importing Kbuild to build Xen. We won't be
able to include Config.mk so we will need a replacement for the macro
`cc-ifversion'.
This patch imports parts of "scripts/Kbuild.include" from Linux v5.4,
the macro cc-ifversion. It makes use of CONFIG_GCC_VERSION that
Kconfig now provides.
Since they are no other use of Xen's `cc-ifversion' macro, we can
remove it.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Anthony PERARD [Wed, 4 Dec 2019 16:33:23 +0000 (16:33 +0000)]
xen: Have Kconfig check $(CC)'s version
This import several files from Linux v5.3
- scripts/Kconfig.include
- scripts/clang-version.sh
- scripts/gcc-version.sh
and several config values from from Linux's init/Kconfig file.
But gcc-version.sh have been modified to return "0" when $CC isn't
GCC, like clang-version.sh do.
Files are copied into scripts/ directory because that's were the files
are found in Linux tree, and also because we are going to import more
of Kbuild from Linux which is located in scripts/.
CONFIG_GCC_VERSION and CONFIG_CC_IS_CLANG are going to be use in
follow-up patches.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Anthony PERARD [Tue, 17 Sep 2019 13:13:50 +0000 (14:13 +0100)]
xen: Update Kconfig to Linux v5.4
This patch updates Kconfig to a more recent version of Kconfig, found
in Linux v5.4.0, 219d54332a09 ("Linux 5.4").
With the updated version of Kconfig, other changes are necessary to
avoid breaking the build.
Kconfig files:
- fix Kconfig files that where using option env=*:
Since Linux commit 104daea149c4 ("kconfig: reference environment
variables directly and remove 'option env='"), we can access the
environment directly via $() and "option env=" as been removed.
- CONFIG_EXPERT='y' will now appear in .config file if
XEN_CONFIG_EXPERT=y in the environment. The alternative is to change
"EXPERT" to "$(XEN_CONFIG_EXPERT)" in all Kconfig files.
Makefile:
- silentoldconfig target as been removed from Kconfig. To update
include/generated/autoconf.h, we need to use syncconfig target
instead.
Makefile.kconfig:
- Import newer needed code from Linux's Makefile.lib and
Kbuild.include and Makefile.build.
- Set Q to empty, Xen build system doesn't silence commands. Having Q
empty mean we can import stuff from Linux without having to remove the
leading $(Q) from build commands. And quiet='' means commands will be
echoed.
- Add $(PHONY) to .PHONY. Like it is intended by Kbuild.
Makefile.host is also updated and copied from Linux.
Dependency change:
- Now depends on flex/bison, maybe we could _shipped those files like
before. Linux doesn't do that anymore.
The .gitignore in kconfig/ has more entries, compared to upstream, for
file generated by Makefile.host.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>