x86/mapcache: initialise the mapcache even for the idle domain
In situations where PMAP cannot be used or the mapcache of a domain is
simply not ready, we need to have a mapcache in the idle domain to map
pages when there is no direct map.
Wei Liu [Fri, 8 Feb 2019 17:19:26 +0000 (17:19 +0000)]
x86/mm: drop _new suffix for page table APIs
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyax@amazon.com>
---
Changed since v1:
- Fix rebase conflicts against new master and other changes since v1.
Changed since v2:
- Also drop _new for the fix of l2t leak.
Wei Liu [Tue, 29 Jan 2019 14:40:26 +0000 (14:40 +0000)]
x86_64/mm: switch to new APIs in paging_init
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyax@amazon.com>
---
Changed since v1:
- Use a global mapping for compat_idle_pg_table_l2, otherwise
l2_ro_mpt will unmap it.
Wei Liu [Tue, 29 Jan 2019 12:54:48 +0000 (12:54 +0000)]
x86/mm: change pl*e to l*t in virt_to_xen_l*e
We will need to have a variable named pl*e when we rewrite
virt_to_xen_l*e. Change pl*e to l*t to better reflect its purpose.
This will make reviewing later patches easier.
No functional change.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyax@amazon.com>
Wei Liu [Mon, 28 Jan 2019 18:10:10 +0000 (18:10 +0000)]
x86/mm: introduce l{1,2}t local variables to modify_xen_mappings
The pl2e and pl1e variables are heavily (ab)used in that function. It
is fine at the moment because all page tables are always mapped so
there is no need to track the life time of each variable.
We will soon have the requirement to map and unmap page tables. We
need to track the life time of each variable to avoid leakage.
Introduce some l{1,2}t variables with limited scope so that we can
track life time of pointers to xen page tables more easily.
Wei Liu [Mon, 28 Jan 2019 17:54:24 +0000 (17:54 +0000)]
x86/mm: introduce l{1,2}t local variables to map_pages_to_xen
The pl2e and pl1e variables are heavily (ab)used in that function. It
is fine at the moment because all page tables are always mapped so
there is no need to track the life time of each variable.
We will soon have the requirement to map and unmap page tables. We
need to track the life time of each variable to avoid leakage.
Introduce some l{1,2}t variables with limited scope so that we can
track life time of pointers to xen page tables more easily.
Wei Liu [Wed, 23 Jan 2019 15:33:07 +0000 (15:33 +0000)]
x86: introduce a new set of APIs to manage Xen page tables
We are going to switch to using domheap pages for page tables.
A new set of APIs is introduced to allocate, map, unmap and free pages
for page tables.
The allocation and deallocation work on mfn_t but not page_info,
because they are required to work even before frame table is set up.
Implement the old functions with the new ones. We will rewrite, site
by site, other mm functions that manipulate page tables to use the new
APIs.
Note these new APIs still use xenheap pages underneath and no actual
map and unmap is done so that we don't break Xen halfway. They will
be switched to use domheap and dynamic mappings when usage of old APIs
is eliminated.
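A minimal sketch of the allocate/map/unmap/free split described above. All
names here (toy_mfn_t, pt_alloc, pt_map, pt_unmap, pt_free) are hypothetical,
and the static array merely stands in for always-mapped xenheap pages; this is
not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE   4096u
#define TOY_NR_FRAMES   8u
#define TOY_INVALID_MFN UINT64_MAX

/* Hypothetical stand-in for mfn_t: a typed wrapper around a frame number. */
typedef struct { uint64_t m; } toy_mfn_t;

static unsigned char toy_frames[TOY_NR_FRAMES][TOY_PAGE_SIZE];
static unsigned char toy_in_use[TOY_NR_FRAMES];

/* Allocate a zeroed frame and return its frame number, not a pointer. */
static toy_mfn_t pt_alloc(void)
{
    for (uint64_t i = 0; i < TOY_NR_FRAMES; i++) {
        if (!toy_in_use[i]) {
            toy_in_use[i] = 1;
            memset(toy_frames[i], 0, TOY_PAGE_SIZE);
            return (toy_mfn_t){ i };
        }
    }
    return (toy_mfn_t){ TOY_INVALID_MFN };  /* pool exhausted */
}

/* Map a frame. Here the "mapping" is just an always-present address. */
static void *pt_map(toy_mfn_t mfn)
{
    assert(mfn.m < TOY_NR_FRAMES && toy_in_use[mfn.m]);
    return toy_frames[mfn.m];
}

/* Unmap is a no-op while every frame stays permanently mapped. */
static void pt_unmap(void *va)
{
    (void)va;
}

/* Return a frame to the pool; invalid frame numbers are ignored. */
static void pt_free(toy_mfn_t mfn)
{
    if (mfn.m < TOY_NR_FRAMES)
        toy_in_use[mfn.m] = 0;
}
```

Callers that pair every pt_map() with a pt_unmap() keep working unchanged if
the no-op unmap is later replaced by a real one.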
Andrew Cooper [Thu, 21 Nov 2019 17:22:52 +0000 (17:22 +0000)]
x86/svm: Write the correct %eip into the outgoing task
The TASK_SWITCH vmexit has fault semantics, and doesn't provide any NRIPs
assistance with instruction length. As a result, any instruction-induced task
switch has the outgoing task's %eip pointing at the instruction which caused
the switch, rather than after it.
This causes callers of task gates to livelock (repeatedly execute the call/jmp
to enter the task), and any restartable task to become a nop after its first
use (the (re)entry state points at the iret used to exit the task).
32bit Windows in particular is known to use task gates for NMI handling, and
to use NMI IPIs.
In the task switch handler, distinguish instruction-induced from
interrupt/exception-induced task switches, and decode the instruction under
%rip to calculate its length.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Mon, 25 Nov 2019 19:33:36 +0000 (19:33 +0000)]
x86/svm: Always intercept ICEBP
ICEBP isn't handled well by SVM.
The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
appropriate instruction boundary (fault or trap, as appropriate), except for
an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
rather than after it. As ICEBP isn't distinguished in the vectoring event
type, the state is ambiguous.
To add to the confusion, an ICEBP which occurs due to Introspection
intercepting the instruction, or from x86_emulate() will have %rip updated as
a consequence of partial emulation required to inject an ICEBP event in the
first place.
We could in principle spot the non-injected case in the TASK_SWITCH handler,
but this still results in complexity if the ICEBP instruction also has an
Instruction Breakpoint active on it (which genuinely has fault semantics).
Unconditionally intercept ICEBP. This does have NRIPs support as it is an
instruction intercept, which allows us to move %rip forwards appropriately
before the TASK_SWITCH intercept is hit. This makes #DB-vectored switches
have consistent behaviour however the ICEBP #DB came about, and avoids special
cases in the TASK_SWITCH intercept.
This in turn allows for the removal of the conditional
hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
instructions will now always be submitted for monitoring checks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 21 Nov 2019 17:22:52 +0000 (17:22 +0000)]
x86/vtx: Fix fault semantics for early task switch failures
The VT-x task switch handler adds inst_len to %rip before calling
hvm_task_switch(), which is problematic in two ways:
1) Early faults (i.e. ones delivered in the context of the old task) get
delivered with trap semantics, and break restartability.
2) The addition isn't truncated to 32 bits. In the corner case of a task
switch instruction crossing the 4G->0 boundary taking an early fault (with
trap semantics), a VMEntry failure will occur due to %rip being out of
range.
Instead, pass the instruction length into hvm_task_switch() and write it into
the outgoing TSS only, leaving %rip in its original location.
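A rough sketch of the idea, assuming a trimmed-down state structure. The names
toy_task_state and save_outgoing_eip are hypothetical and are not the
hvm_task_switch() interface. The resume point is stored in the outgoing TSS
with 32-bit truncation, while %rip itself is left alone.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of the state involved in a task switch. */
struct toy_task_state {
    uint64_t rip;      /* current instruction pointer, left untouched */
    uint32_t tss_eip;  /* %eip field saved in the outgoing TSS */
};

/*
 * Record the resume point in the outgoing TSS only. inst_len is 0 for
 * interrupt/exception-induced switches and the instruction length otherwise.
 */
static void save_outgoing_eip(struct toy_task_state *s, unsigned int inst_len)
{
    assert(inst_len <= 15);  /* x86 instructions are at most 15 bytes long */
    s->tss_eip = (uint32_t)(s->rip + inst_len);  /* truncation wraps at 4G */
}
```

With rip = 0xfffffffe and a 5-byte instruction, the saved %eip wraps to 3
instead of leaving an out-of-range value in %rip.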
For now, pass 0 on the SVM side. This highlights a separate preexisting bug
which will be addressed in the following patch.
While adjusting call sites, drop the unnecessary uint16_t cast.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Thu, 28 Nov 2019 16:47:25 +0000 (17:47 +0100)]
build: provide option to disambiguate symbol names
The .file assembler directives generated by the compiler do not include
any path components (gcc) or just the ones specified on the command line
(clang, at least version 5), and hence multiple identically named source
files (in different directories) may produce identically named static
symbols (in their kallsyms representation). The binary diffing algorithm
used by xen-livepatch, however, depends on having unique symbols.
Make the ENFORCE_UNIQUE_SYMBOLS Kconfig option control the (build)
behavior, and if enabled use objcopy to prepend the (relative to the
xen/ subdirectory) path to the compiler invoked STT_FILE symbols. Note
that this build option is made no longer depend on LIVEPATCH, but merely
defaults to its setting now.
Conditionalize explicit .file directive insertion in C files where it
exists just to disambiguate names in a less generic manner; note that
at the same time the redundant emission of STT_FILE symbols gets
suppressed for clang. Assembler files as well as multiply compiled C
ones using __OBJECT_FILE__ are left alone for the time being.
Since we now expect there not to be any duplicates anymore, also don't
force the selection of the option to 'n' anymore in allrandom.config.
Similarly COVERAGE no longer suppresses duplicate symbol warnings if
enforcement is in effect, which in turn allows
SUPPRESS_DUPLICATE_SYMBOL_WARNINGS to simply depend on
!ENFORCE_UNIQUE_SYMBOLS.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 28 Nov 2019 14:14:03 +0000 (15:14 +0100)]
x86/IRQ: make internally used IRQs also honor the pending EOI stack
At the time the pending EOI stack was introduced there were no
internally used IRQs which would have the LAPIC EOI issued from the
->end() hook. This had then changed with the introduction of IOMMUs,
but the interaction issue was presumably masked by
irq_guest_eoi_timer_fn() frequently EOI-ing interrupts way too early
(which got fixed by 359cf6f8a0ec ["x86/IRQ: don't keep EOI timer
running without need"]).
The problem is that with us re-enabling interrupts across handler
invocation, a higher priority (guest) interrupt may trigger while
handling a lower priority (internal) one. The EOI issued from
->end() (for ACKTYPE_EOI kind interrupts) would then mistakenly
EOI the higher priority (guest) interrupt, breaking (among other
things) pending EOI stack logic's assumptions.
Notes:
- In principle we could get away without the check_eoi_deferral flag.
I've introduced it just to make sure there's as little change as
possible to unaffected paths.
- Similarly the cpu_has_pending_apic_eoi() check in do_IRQ() isn't
strictly necessary.
- The new function's name isn't very helpful with its use in
end_level_ioapic_irq_new(). I did also consider eoi_APIC_irq() (to
parallel ack_APIC_irq()), but then liked this even less.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Diagnosed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monné [Thu, 28 Nov 2019 10:58:25 +0000 (11:58 +0100)]
x86/vmx: always sync PIR to IRR before vmentry
When using posted interrupts on Intel hardware it's possible that the
vCPU resumes execution with a stale local APIC IRR register because
depending on the interrupts to be injected vlapic_has_pending_irq
might not be called, and thus PIR won't be synced into IRR.
Fix this by making sure PIR is always synced to IRR in
hvm_vcpu_has_pending_irq regardless of what interrupts are pending.
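The effect of an unconditional sync can be illustrated with a small model. The
structure and function names below are hypothetical, and the sketch is
single-threaded, so it omits the atomic operations real hardware interaction
would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_VECTORS 256
#define TOY_WORDS      (TOY_NR_VECTORS / 64)

/* Hypothetical model: one bit per interrupt vector in each bitmap. */
struct toy_vlapic {
    uint64_t pir[TOY_WORDS];  /* posted-interrupt requests */
    uint64_t irr[TOY_WORDS];  /* interrupt request register */
};

/* Move every posted vector into the IRR and clear it from the PIR. */
static void toy_sync_pir_to_irr(struct toy_vlapic *v)
{
    assert(v != NULL);
    for (int i = 0; i < TOY_WORDS; i++) {
        v->irr[i] |= v->pir[i];
        v->pir[i] = 0;
    }
}

/* Return the highest pending vector, or -1 if none. Always syncs first. */
static int toy_highest_pending(struct toy_vlapic *v)
{
    toy_sync_pir_to_irr(v);
    for (int vec = TOY_NR_VECTORS - 1; vec >= 0; vec--)
        if (v->irr[vec / 64] & (UINT64_C(1) << (vec % 64)))
            return vec;
    return -1;
}
```

Because the sync happens before any decision is made, a vector that is only
present in the PIR can no longer be missed.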
Reported-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Joe Jin <joe.jin@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Sergey Dyasli [Wed, 27 Nov 2019 10:04:30 +0000 (10:04 +0000)]
x86/microcode: refuse to load the same revision ucode
Currently if a user tries to live-load the same or an older ucode revision
than the CPU already has, they will get a single message in the Xen log like:
(XEN) 128 cores are to update their microcode
No actual ucode loading will happen and this situation can be quite
confusing. Fix this by starting ucode update only when the provided
ucode revision is higher than the currently cached one (if any).
This is based on the property that if microcode_cache exists, all CPUs
in the system should have at least that ucode revision.
Additionally, print a user friendly message if no matching or newer
ucode can be found in the provided blob. This also requires ignoring
-ENODATA in AMD-side code, otherwise the message given to the user is:
(XEN) Parsing microcode blob error -61
Which actually means that a ucode blob was parsed fine, but no matching
ucode was found.
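The revision check amounts to a strict comparison against the cached patch. A
minimal sketch, assuming a simplified patch descriptor (toy_ucode and
toy_should_update are hypothetical names, not the microcode code's own):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal description of a microcode patch. */
struct toy_ucode {
    uint32_t revision;
};

/*
 * Decide whether an update should start: only when the provided patch is
 * strictly newer than the cached one. A NULL cache means nothing is cached.
 */
static bool toy_should_update(const struct toy_ucode *cached,
                              const struct toy_ucode *provided)
{
    assert(provided != NULL);
    return cached == NULL || provided->revision > cached->revision;
}
```

The same revision is rejected along with older ones, which is what avoids the
misleading "cores are to update" message.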
Igor Druzhinin [Tue, 26 Nov 2019 17:08:19 +0000 (17:08 +0000)]
AMD/IOMMU: honour IR setting while pre-filling DTEs
IV bit shouldn't be set in DTE if interrupt remapping is not
enabled. It's a regression in behavior of "iommu=no-intremap"
option which otherwise would keep interrupt requests untranslated
for all of the devices in the system regardless of whether it's
described as valid in IVRS or not.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
George Dunlap [Tue, 26 Nov 2019 15:49:20 +0000 (15:49 +0000)]
docs/xl: Document pci-assignable state
Changesets 319f9a0ba9 ("passthrough: quarantine PCI devices") and ba2ab00bbb ("IOMMU: default to always quarantining PCI devices")
introduced PCI device "quarantine" behavior, but did not document how
the pci-assignable-add and -remove functions act in regard to this.
Rectify this.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Paul Durrant <paul@xen.org> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 26 Nov 2019 13:17:45 +0000 (14:17 +0100)]
EFI: fix "efi=attr=" handling
Commit 633a40947321 ("docs: Improve documentation and parsing for efi=")
failed to honor the strcmp()-like return value convention of
cmdline_strcmp().
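The class of bug is easy to reproduce with any strcmp()-like helper. In the
sketch below toy_opt_cmp is a hypothetical stand-in built on strcmp(), not
cmdline_strcmp() itself, and the strings passed in are only sample inputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for a strcmp()-like helper: returns 0 when the
 * option value matches the keyword, nonzero otherwise.
 */
static int toy_opt_cmp(const char *value, const char *keyword)
{
    assert(value && keyword);
    return strcmp(value, keyword);
}

/* Correct use: a match is signalled by a return value of 0. */
static bool toy_is_match(const char *value, const char *keyword)
{
    return toy_opt_cmp(value, keyword) == 0;
}

/* The buggy pattern: a nonzero result is treated as "matched". */
static bool toy_is_match_buggy(const char *value, const char *keyword)
{
    return toy_opt_cmp(value, keyword) != 0;
}
```

With the buggy variant every keyword except the one given is accepted, and the
one given is rejected.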
Reported-by: Roman Shaposhnik <roman@zededa.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
There are two mappings active in the middle of do_recalc(), and hence
commit 0d0f4d78e5d1 ("p2m: change write_p2m_entry to return an error
code") should have added (or otherwise invoked) unmapping code just
like it did in p2m_next_level(), despite us not expecting any errors
here. Arrange for the existing unmap invocation to take effect in all
cases.
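The usual way to make an unmap take effect on every path is a single exit
label. A toy illustration follows; all names are hypothetical and the counter
only exists to make a leaked mapping observable.

```c
#include <assert.h>
#include <errno.h>

static int toy_live_mappings;  /* number of currently active mappings */

static int *toy_map(int *backing)
{
    toy_live_mappings++;
    return backing;
}

static void toy_unmap(int *p)
{
    (void)p;
    assert(toy_live_mappings > 0);
    toy_live_mappings--;
}

/* Hypothetical write that fails for negative values. */
static int toy_write_entry(int *entry, int val)
{
    if (val < 0)
        return -EINVAL;
    *entry = val;
    return 0;
}

/* Every exit, including the error path, goes through the unmap at "out". */
static int toy_recalc(int *backing, int val)
{
    int *table = toy_map(backing);
    int rc = toy_write_entry(table, val);

    if (rc)
        goto out;

    rc = toy_write_entry(table, val + 1);

 out:
    toy_unmap(table);
    return rc;
}
```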
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Tue, 26 Nov 2019 13:16:09 +0000 (14:16 +0100)]
x86/domctl: have XEN_DOMCTL_getpageframeinfo3 preemptible
This hypercall can take a long time to finish because it attempts to
grab the `hostp2m' lock up to 1024 times. The accumulated wait for the
lock can take several seconds.
This can easily happen with a guest with 32 vcpus and plenty of RAM,
during localhost migration.
While the patch doesn't fix the problem with the lock contention and
the fact that the `hostp2m' lock is currently global (and not on a
single page), it is still an improvement to the hypercall. It will in
particular, down the road, allow dropping the arbitrary limit of 1024
entries per request.
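A preemptible operation records how far it got and asks to be re-issued. The
sketch below models this with a fixed per-call budget; toy_request,
toy_process and TOY_BUDGET are hypothetical and stand in for the real
preemption check and continuation mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical preemption check: a fixed budget of entries per call. */
#define TOY_BUDGET 4u

struct toy_request {
    size_t num;   /* total entries requested */
    size_t done;  /* entries handled so far; doubles as the resume point */
};

/*
 * Handle entries until finished or the budget runs out. Returns true when
 * the caller must re-issue the request to continue from req->done.
 */
static bool toy_process(struct toy_request *req, unsigned int *out)
{
    size_t budget = TOY_BUDGET;

    assert(req->done <= req->num);
    while (req->done < req->num) {
        if (budget-- == 0)
            return true;   /* preempted: continuation needed */
        out[req->done] = (unsigned int)req->done * 2u;  /* placeholder work */
        req->done++;
    }
    return false;          /* finished */
}
```

Since progress is kept in the request itself, the total number of entries no
longer bounds how long a single invocation runs.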
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 26 Nov 2019 13:15:01 +0000 (14:15 +0100)]
IOMMU: default to always quarantining PCI devices
XSA-302 relies on the use of libxl's "assignable-add" feature to prepare
devices to be assigned to untrusted guests.
Unfortunately, this is not considered a strictly required step for
device assignment. The PCI passthrough documentation on the wiki
describes alternate ways of preparing devices for assignment, and
libvirt uses its own ways as well. Hosts where these alternate methods
are used will still leave the system in a vulnerable state after the
device comes back from a guest.
Default to always quarantining PCI devices, but provide a command line
option to revert back to prior behavior (such that people who both
sufficiently trust their guests and want to be able to use devices in
Dom0 again after they had been in use by a guest wouldn't need to
"manually" move such devices back from DomIO to Dom0).
This is XSA-306.
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
George Dunlap [Tue, 26 Nov 2019 10:32:42 +0000 (10:32 +0000)]
x86: Don't increase ApicIdCoreSize past 7
Changeset ca2eee92df44 ("x86, hvm: Expose host core/HT topology to HVM
guests") attempted to "fake up" a topology which would induce guest
operating systems to not treat vcpus as sibling hyperthreads. This
involved actually reporting hyperthreading as available, but giving
vcpus every other ApicId; which in turn led to doubling the ApicIds
per core by bumping the ApicIdCoreSize by one. In particular, Ryzen
3xxx series processors, and reportedly EPYC "Rome" cpus, have an
ApicIdCoreSize of 7; the "fake" topology increases this to 8.
Unfortunately, Windows running on modern AMD hardware -- including
Ryzen 3xxx series processors, and reportedly EPYC "Rome" cpus --
doesn't seem to cope with this value being higher than 7. (Linux
guests have so far continued to cope.)
A "proper" fix is complicated and it's too late to fix it either for
4.13, or to backport to supported branches. As a short-term fix,
limit this value to 7.
This does mean that a Linux guest, booted on such a system without
this change, and then migrating to a system with this change, with
more than 64 vcpus, would see an apparent topology change. This is a
low enough risk in practice that enabling this limit unilaterally, to
allow other guests to boot without manual intervention, is worth it.
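As a worked illustration of the limit and of where the 64 vcpu figure comes
from, consider the following sketch. The function names are hypothetical, and
the rule shown (only bump the value while it is below 7) is an assumption
about how "don't increase past 7" would be expressed.

```c
#include <assert.h>

#define TOY_APIC_ID_CORE_SIZE_LIMIT 7u

/*
 * Bump the host's ApicIdCoreSize by one to make room for the "fake"
 * topology, but never increase it past the limit.
 */
static unsigned int toy_guest_core_size(unsigned int host_size)
{
    assert(host_size <= 15);  /* the CPUID field is 4 bits wide */
    return host_size < TOY_APIC_ID_CORE_SIZE_LIMIT ? host_size + 1
                                                   : host_size;
}

/* With every other ApicId used, 2^size ApicIds hold 2^(size-1) vcpus. */
static unsigned int toy_vcpus_covered(unsigned int size)
{
    assert(size >= 1 && size < 32);
    return 1u << (size - 1);
}
```

A size of 7 gives 128 ApicIds, which at every other ApicId covers 64 vcpus;
guests beyond that are the ones that would notice the change.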
Reported-by: Steven Haigh <netwiz@crc.id.au> Reported-by: Andreas Kinzler <hfp@posteo.de> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
George Dunlap [Fri, 22 Nov 2019 18:52:02 +0000 (18:52 +0000)]
x86/mm: Adjust linear uses / entries when a page loses validation
"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable of the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).
XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to or is pointed to by another page
table, to prevent both from happening at the same time.
Additionally, XSA-299 introduced a mode whereby if a page was known to
have been only partially validated, _put_page_type() would be called
with PTF_partial_set, indicating that if the page had been
de-validated by someone else, the type count should be left alone.
Unfortunately, this change did not account for the required accounting
for linear page table uses and entries; in the case that a previously
partially-devalidated pagetable was fully-devalidated by someone else,
the linear_pt_counts are not updated.
This could happen in one of two places:
1. In the case a partially-devalidated page was re-validated by
someone else
2. During domain tear-down, when pages are force-invalidated while
leaving the type count intact.
The second could be ignored, since at that point the pages can no
longer be abused; but the first requires handling. Note however that
this would not be a security issue: having the counts be too high is
overly strict (i.e., will prevent a page from being used in a way
which is perfectly safe), but shouldn't cause any other issues.
Fix this by adjusting the linear counts when a page loses validation,
regardless of whether the de-validation completed or was only partial.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
libxl: make default path to add/remove all PV devices
Adding/removing a device is handled for specific devices only: VBD, VIF,
QDISK. This commit adds a default case to handle adding/removing for all PV
devices by default, except the QDISK device, which requires special handling.
If any other device requires special handling, it should be done by
implementing a separate case (similar to the QDISK device). The default
behaviour for adding a device is to wait until the backend goes to
XenbusStateInitWait and the default behaviour on removing a device is to
start the generic device remove procedure.
This commit also fixes the guest removal function: previously the guest was
removed when all VIF and VBD devices were removed. The fix removes the
guest when all created devices are removed. This is done by checking the
guest device list instead of checking num_vifs and num_vbds. The num_vifs and
num_vbds variables are removed as redundant in this case.
Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wl@xen.org> Release-acked-by: Juergen Gross <jgross@suse.com>
There are two kinds of VKBD devices: with a QEMU backend and with a user space
PV backend. In the current implementation they can't be distinguished as both
use the VKBD backend type. As a result, the user space PV KBD backend is started
and stopped as a QEMU backend. This commit adds a new device kind VINPUT to be
used as the backend type for the user space PV KBD backend.
Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 21 Nov 2019 18:21:49 +0000 (18:21 +0000)]
x86/vvmx: Fix livelock with XSA-304 fix
It turns out that the XSA-304 / CVE-2018-12207 fix of disabling executable
superpages doesn't work well with the nested p2m code.
Nested virt is experimental and not security supported, but is useful for
development purposes. In order to not regress the status quo, disable the
XSA-304 workaround until the nested p2m code can be improved.
Introduce a per-domain exec_sp control and set it based on the current
opt_ept_exec_sp setting. Take the opportunity to omit a PVH hardware domain
from the performance hit, because it is already permitted to DoS the system in
such ways as issuing a reboot.
When nested virt is enabled on a domain, force it to using executable
superpages and rebuild the p2m.
Having the setting per-domain involves rearranging the internals of
parse_ept_param_runtime() but it still retains the same overall semantics -
for each applicable domain whose setting needs to change, rebuild the p2m.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Tue, 5 Nov 2019 19:08:14 +0000 (19:08 +0000)]
x86/livepatch: Prevent patching with active waitqueues
The safety of livepatching depends on every stack having been unwound, but
there is one corner case where this is not true. The Sharing/Paging/Monitor
infrastructure may use waitqueues, which copy the stack frame sideways and
longjmp() to a different vcpu.
This case is rare, and can be worked around by pausing the offending
domain(s), waiting for their rings to drain, then performing a livepatch.
In the case that there is an active waitqueue, fail the livepatch attempt with
-EBUSY, which is preferable to the fireworks which occur from trying to unwind
the old stack frame at a later point.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monné [Fri, 22 Nov 2019 16:52:59 +0000 (17:52 +0100)]
x86/vlapic: allow setting APIC_SPIV_FOCUS_DISABLED in x2APIC mode
Current code unconditionally prevents setting APIC_SPIV_FOCUS_DISABLED
regardless of the processor model, which is not correct according to
the specification.
This issue was discovered while trying to boot a pvshim with x2APIC
enabled.
Always allow setting APIC_SPIV_FOCUS_DISABLED: the local APIC
provided to guests is emulated by Xen, and as such doesn't depend on
the features found on the hardware processor. Note for example that
Xen offers x2APIC support to guests even when the underlying hardware
doesn't have such a feature.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Julien Grall [Wed, 20 Nov 2019 13:37:51 +0000 (13:37 +0000)]
xen: Add missing va_end() in hypercall_create_continuation()
The documentation requires va_start() to always be matched with a
corresponding va_end(). However, this is not the case in the path used
for bad format.
This was introduced by XSA-296.
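The pattern is to route the error path through the same va_end() as the normal
path. A small self-contained example follows; toy_sum_args and its format
characters are hypothetical and unrelated to the hypercall continuation
format.

```c
#include <assert.h>
#include <stdarg.h>

/*
 * Hypothetical argument parser: 'i' consumes an int, 'l' a long; any other
 * format character is rejected. Every path after va_start() reaches va_end().
 */
static int toy_sum_args(long *sum, const char *fmt, ...)
{
    va_list args;
    int rc = 0;

    assert(sum && fmt);
    *sum = 0;
    va_start(args, fmt);

    for (const char *p = fmt; *p; p++) {
        switch (*p) {
        case 'i':
            *sum += va_arg(args, int);
            break;
        case 'l':
            *sum += va_arg(args, long);
            break;
        default:
            rc = -1;   /* bad format: still leave via va_end() below */
            goto out;
        }
    }

 out:
    va_end(args);
    return rc;
}
```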
Coverity-ID: 1488727 Fixes: 0bf9f8d3e3 ("xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()") Signed-off-by: Julien Grall <julien@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Sergey Dyasli [Wed, 30 Oct 2019 14:54:47 +0000 (14:54 +0000)]
x86/e820: fix 640k - 1M region reservation logic
Converting a guest from PV to PV-in-PVH makes the guest have 384k
less memory, which may confuse the guest's balloon driver. This happens
because Xen unconditionally reserves 640k - 1M region in E820 despite
the fact that it's really a usable RAM in PVH boot mode.
Fix this by skipping region type change in virtualised environments,
trusting whatever memory map our hypervisor has provided.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Mon, 9 Sep 2019 10:43:33 +0000 (11:43 +0100)]
x86/boot: Remove cached CPUID data from the trampoline
We have a cached cpuid_ext_features in the trampoline which is kept in sync by
various pieces of boot logic. This is complicated, and all it is actually
used for is to derive whether NX is safe to use.
Replace it with a canned value to load into EFER.
trampoline_setup() and efi_arch_cpu() now tweak trampoline_efer at the point
that they are stashing the main copy of CPUID data. Similarly,
early_init_intel() needs to tweak if it has re-enabled the use of NX.
This simplifies the AP boot and S3 resume paths by using trampoline_efer
directly, rather than locally turning FEATURE_NX into EFER_NX.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Wed, 20 Nov 2019 16:12:12 +0000 (17:12 +0100)]
x86/Makefile: remove $(guard) use from $(TARGET).efi target
Following the patch 65d104984c04 ("x86: fix race to build
arch/x86/efi/relocs-dummy.o"), the error message
nm: 'efi/relocs-dummy.o': No such file
started to appear on systems which can't build the .efi target. This is
because relocs-dummy.o isn't built anymore.
The error is printed by the evaluation of VIRT_BASE and ALT_BASE which
aren't used anyway.
But, we don't need that file as we don't want to build `$(TARGET).efi'
anyway. On such systems, $(guard) evaluates to the shell builtin ':',
which prevents any of the shell commands in `$(TARGET).efi' from being
executed.
Even if $(guard) is evaluated upon use, it depends on $(XEN_BUILD_PE)
which is evaluated at the assignment. So, we can replace $(guard) in
$(TARGET).efi by having two different rules depending on
$(XEN_BUILD_PE) instead.
The change with this patch is that none of the dependencies of
$(TARGET).efi will be built if the linker doesn't support PE
and VIRT_BASE and ALT_BASE don't get evaluated anymore, so nm will not
complain about the missing relocs-dummy.o file anymore.
Since prelink-efi.o isn't built on systems that can't build
$(TARGET).efi anymore, we can remove the $(guard) variable everywhere.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
efi: do not use runtime services table with efi=no-rs
Before dfcccc6631 "efi: use directmap to access runtime services table"
all usages of efi_rs pointer were guarded by efi_rs_enter(), which
implicitly refused to operate with efi=no-rs (by checking if
efi_l4_pgtable is NULL - which is the case for efi=no-rs). The said
commit (re)moved that call as unneeded for just reading content of
efi_rs structure - to avoid unnecessary page tables switch. But it
neglected to check if efi_rs access is legal.
Fix this by adding explicit check for runtime service being enabled in
the cases that do not use efi_rs_enter().
Reported-by: Roman Shaposhnik <roman@zededa.com> Fixes: dfcccc6631 "efi: use directmap to access runtime services table" Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
c/s ff66ccefe5 "x86/CPUID: adjust SSEn dependencies" made SSE4A depend on
SSSE3, but these processors really do have SSE4A without SSSE3.
This manifests as an upgrade regression, as the SSE4A feature disappears from
view.
Adjust the SSE4A feature to depend on SSE3 rather than SSSE3.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
> To embed Python into an application, a new --embed option must be
> passed to python3-config --libs --embed to get -lpython3.8 (link the
> application to libpython). To support both 3.8 and older, try
> python3-config --libs --embed first and fallback to python3-config
> --libs (without --embed) if the previous command fails.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Wei Liu <wl@xen.org>
[ wei: rerun autogen.sh ]
Ian Jackson [Tue, 29 Oct 2019 17:45:30 +0000 (17:45 +0000)]
tools/configure: Honour XEN_COMPILE_ARCH and _TARGET_ for shim
The pvshim can only be built 64-bit because the hypervisor is only
64-bit nowadays. The hypervisor build supports XEN_COMPILE_ARCH and
XEN_TARGET_ARCH which override the information from uname. The pvshim
build runs out of the tools/ directory but calls the hypervisor build
system.
If one runs in a Linux 32-bit userland with a 64-bit kernel, one used
to be able to set XEN_COMPILE_ARCH. But nowadays this does not work.
configure sees the target cpu as 64-bit and tries to build pvshim.
The build prints
echo "*** Xen x86/32 target no longer supported!"
and doesn't build anything. Then the subsequent Makefiles try to
install the non-built pieces.
Fix this anomaly by causing configure to honour the Xen hypervisor way
of setting the target architecture.
In principle this user behaviour is not handled quite right, because
configure will still see 64-bit and so all the autoconf-based
architecture testing will see 64-bit rather than 32-bit x86. But the
tools are in fact generally quite portable: this particular location
in configure{.ac,} is the only place in tools/ where 64-bit x86 is
treated differently from 32-bit x86, so the fix is sufficient and
correct for this use case.
It remains the case that setting XEN_COMPILE_ARCH or XEN_TARGET_ARCH to a
non-x86 architecture, when configure thinks things are x86, or vice
versa, will not work right.
(This is a bugfix to 8845155c831c
pvshim: make PV shim build selectable from configure
which inadvertently deleted the logic to only build the shim for
XEN_TARGET_ARCH != x86_32.)
I have rerun autogen.sh, so this patch contains the fix to configure
as well as the source fix to configure.ac.
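The decision described above can be sketched as below. This is not the configure.ac code: the function name should_build_pvshim is hypothetical, and the precedence (XEN_TARGET_ARCH, then XEN_COMPILE_ARCH, then the cpu configure detected) is an assumption modelled on how the hypervisor build defaults these variables.

```python
def should_build_pvshim(host_cpu, xen_target_arch=None, xen_compile_arch=None):
    """Decide whether the PV shim should be built.

    Hypothetical helper. `host_cpu` is what configure detected (for
    example "x86_64" or "i686"); the other two arguments are the
    optional overrides taken from the environment.
    """
    # Assumed precedence: an explicit target wins, then an explicit
    # compile arch, and only then the detected cpu.
    arch = xen_target_arch or xen_compile_arch or host_cpu
    # The shim embeds the hypervisor, which is 64-bit only.
    return arch == "x86_64"
```

In the 32-bit userland case, should_build_pvshim("x86_64", xen_compile_arch="x86_32") is false, so the shim is skipped instead of being half-built.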
Fixes: 8845155c831c59e867ee3dd31ee63e0cc6c7dcf2 Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> CC: Olaf Hering <olaf@aepfle.de> CC: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Jürgen Groß <jgross@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
libxl: gentypes: initialise array elements in json
Currently, array elements are initialised with calloc, which means
that all element fields are set to zero. If an entry is not
present in the json (which is entirely permitted), the element will be
all-bits-zero instead of the default value (which is wrong).
The fix is to initialise each array element before parsing it, using
the new libxl_C_type_do_init function.
With existing types this results in a lot of new calls like this:
(indentation adjusted). This looks right. To check what happens with
types which have nontrivial defaults but don't have init functions (of
which we currently have none in arrays), I (Ian) experimentally added:
("pnode", uint32), # physical node of this node
("vcpus", libxl_bitmap), # vcpus in this node
+ ("sporks", Array(MemKB, "num_sporks")),
])
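The difference between zero-filled and default-initialised array elements can be illustrated with a small sketch. It is not the generated C code: the field names and the -1 "unset" sentinel are assumptions chosen only to show a default that differs from zero.

```python
import json

# What calloc gives: every field zero.
ZEROED = {"pnode": 0, "memkb": 0}
# Assumed defaults: -1 stands in for an "unset" sentinel.
DEFAULTS = {"pnode": 0, "memkb": -1}


def parse_elements(text, initial):
    """Parse a JSON array, starting every element from `initial`."""
    elements = []
    for entry in json.loads(text):
        element = dict(initial)  # initialise the element before parsing
        element.update(entry)    # only fields present in the JSON change
        elements.append(element)
    return elements


doc = '[{"pnode": 1}, {"pnode": 2, "memkb": 4096}]'
calloc_style = parse_elements(doc, ZEROED)
init_style = parse_elements(doc, DEFAULTS)
```

The first element omits memkb: it ends up as 0 in calloc_style but keeps the sentinel in init_style, which is the behaviour the fix is after.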
Ian Jackson [Tue, 29 Oct 2019 15:19:33 +0000 (15:19 +0000)]
libxl: gentypes.py: Break out libxl_C_type_do_init
This is going to be the common way to initialise things.
_libxl_C_type_init remains the thing for generating the body of the
init function, and for some special cases.
No functional change with existing types: C output is identical.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Ian Jackson [Tue, 29 Oct 2019 15:17:58 +0000 (15:17 +0000)]
libxl: gentypes.py: Break out field_pass in ..._copy_deprecated
We are going to want this in a moment.
No functional change with existing types: C output is identical.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Ian Jackson [Tue, 29 Oct 2019 15:00:35 +0000 (15:00 +0000)]
tools/libxl: gentypes.py: Prefer init_val to init_fn
When both are provided, init_val is likely to be more direct.
No functional change with existing types: C output is identical.
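A minimal sketch of the preference, assuming a type description that may carry both attributes. TypeInfo and gen_init are hypothetical stand-ins, not the gentypes.py classes, and the C names in the usage note are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TypeInfo:
    """Hypothetical stand-in for an IDL type description."""
    init_val: Optional[str] = None
    init_fn: Optional[str] = None


def gen_init(ty, lvalue):
    """Return a C statement initialising `lvalue`, or '' if none is needed."""
    if ty.init_val is not None:
        # A plain assignment is the most direct form, so try it first.
        return f"{lvalue} = {ty.init_val};"
    if ty.init_fn is not None:
        return f"{ty.init_fn}(&{lvalue});"
    return ""
```

For a type that provides both, gen_init(TypeInfo("SOME_DEFAULT", "some_init"), "p->x") yields the assignment rather than the function call.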
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Anthony PERARD [Thu, 31 Oct 2019 12:17:27 +0000 (12:17 +0000)]
libxl_pci: Don't hold QMP connection while waiting
After sending the 'device_del' command for a PCI passthrough device,
we wait until QEMU has effectively deleted the device, which involves
executing more QMP commands. While waiting, libxl holds the connection.
It isn't necessary to hold the connection and it prevents others from
making progress, so this patch releases the QMP connection.
For background:
e.g., when a guest is created with several PCI passthrough devices
attached, on `xl destroy` all the devices need to be detached, and
this is usually what happens:
- 'device_del' called for the 1st pci device
- 'query-pci' checking if pci still there, it is
- wait 1s
- 'query-pci' checking again, and it's gone
-> only now can the same be done for the second PCI device, so
there is plenty of waiting on others when the PCI detaches could
be done in parallel.
On shutdown, libxl usually keeps waiting because QEMU never
releases the device because the guest kernel never responds to QEMU's
unplug queries. So detaching of the 1st device waits until a
timeout stops it, and since the same timeout is set up at the same
time for the other devices to detach, the 'device_del' command is
never sent for those.
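The effect of releasing the connection while waiting can be modelled with a toy simulation. This is not libxl code: an asyncio.Lock stands in for the single QMP connection, a short sleep stands in for the 1s wait, and all names are hypothetical.

```python
import asyncio


async def detach(dev, conn, log, release_while_waiting):
    """Send 'device_del', wait, then check with 'query-pci'."""
    await conn.acquire()
    log.append(("device_del", dev))
    if release_while_waiting:
        conn.release()           # let the other devices make progress
    await asyncio.sleep(0.01)    # stand-in for the wait before re-checking
    if release_while_waiting:
        await conn.acquire()     # reconnect only for the check
    log.append(("query-pci", dev))
    conn.release()


async def detach_all(devs, release_while_waiting):
    conn = asyncio.Lock()        # models the single QMP connection
    log = []
    await asyncio.gather(
        *(detach(d, conn, log, release_while_waiting) for d in devs)
    )
    return log
```

With release_while_waiting=False the log shows each device finishing before 'device_del' is sent for the next one; with True, 'device_del' goes out for every device before any of the checks run.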
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Mon, 18 Nov 2019 17:13:08 +0000 (17:13 +0000)]
libxl_qmp: Have a lock for QMP socket access
This patch works around the fact that it's not possible to connect
multiple times to a single QMP socket. QEMU listens on the socket with
a backlog value of 1, which means that on Linux when concurrent threads
call connect() on the socket, they get EAGAIN.
Background:
This happens when attempting to create a guest with multiple
PCI passthrough devices: libxl creates one connection per device to
attach and executes connect() on all at once before any single
connection has finished.
To work around this, we use a new lock.
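One way to serialise connection attempts is an advisory file lock taken around connect(). The sketch below assumes that approach and a Linux/POSIX host; the qmp_lock name and the lock file path are hypothetical, and the text above does not say how the new lock is implemented.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def qmp_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks while another holder exists
        yield
    finally:
        os.close(fd)                    # closing the descriptor drops the lock
```

A caller would wrap the connection attempt, for example `with qmp_lock(lock_path): sock.connect(socket_path)`, so that only one connect() is in flight against QEMU's backlog of 1 at any time.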
Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Mon, 18 Nov 2019 18:10:14 +0000 (18:10 +0000)]
libxl: Introduce libxl__ev_immediate
This new ev allows arranging for a non-reentrant callback to be called.
This happens immediately after the current event is processed and after
any other ev_immediates that have already been registered.
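The ordering guarantee can be sketched with a toy event loop. This is an illustration of the described semantics only, not libxl's implementation; the class and method names are hypothetical.

```python
from collections import deque


class EventLoop:
    """Toy event loop with deferred 'immediate' callbacks."""

    def __init__(self):
        self._immediates = deque()

    def register_immediate(self, callback):
        # Only queue the callback; it is never called from here, so the
        # caller is not re-entered.
        self._immediates.append(callback)

    def process_event(self, handler):
        handler()                        # finish the current event first
        while self._immediates:          # then drain in registration order
            self._immediates.popleft()()
```

Callbacks registered while the queue is draining are appended at the end, so they run after those that were already registered.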
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Mon, 18 Nov 2019 17:13:06 +0000 (17:13 +0000)]
libxl: libxl__ev_qmp_send now takes an egc
No functional changes.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>