Jan Beulich [Mon, 2 Sep 2019 12:38:37 +0000 (14:38 +0200)]
timers: limit heap size
First and foremost make timer_softirq_action() avoid growing the heap
if its new size can't be stored without truncation. 64k entries is a
lot, and I don't think we're at risk of actually running into the issue,
but I also think it's better not to allow for hard to debug problems to
occur in the first place.
Furthermore also adjust the code such that the size/limit fields becoming
unsigned int would at least work from a mere sizing point of view. For
this also switch various uses of plain int to unsigned int.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
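A minimal sketch of the truncation guard described above, assuming a heap header whose size and limit fields are 16 bits wide. The names heap_header and next_heap_limit are hypothetical and not taken from the Xen source, and the doubling growth policy is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical heap header: both fields are 16 bits wide. */
struct heap_header {
    uint16_t size;   /* entries currently in use */
    uint16_t limit;  /* entries the allocation can hold */
};

/*
 * Compute the limit a grown heap would have and report whether it still
 * fits the 16-bit field. The growth factor is an assumption of this sketch.
 */
static bool next_heap_limit(const struct heap_header *h,
                            unsigned int *new_limit)
{
    unsigned int candidate;

    assert(h != NULL && new_limit != NULL);

    candidate = ((unsigned int)h->limit + 1) * 2 - 1;
    if ( candidate > UINT16_MAX )
        return false;          /* would truncate: refuse to grow */

    *new_limit = candidate;
    return true;
}
```

A caller that gets false back would keep using the existing heap rather than storing a truncated limit.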
Igor Druzhinin [Fri, 30 Aug 2019 13:23:01 +0000 (15:23 +0200)]
x86/domain: don't destroy IOREQ servers on soft reset
Performing soft reset should not opportunistically kill IOREQ servers
for device emulators that might be currently running for a domain.
Every emulator is supposed to clean up IOREQ servers for itself on exit.
This allows a toolstack to elect whether or not a particular device
model should be restarted.
The original code was introduced in 3235cbfe ("arch-specific hooks for
domain_soft_reset()"), likely because the 'default' IOREQ server, which
existed in Xen at the time and was used by QEMU, had no API call to
destroy it. Since the removal of the 'default' IOREQ server from Xen this
reason has gone away.
Since commit ba7fdd64b ("xen: cleanup IOREQ server on exit") QEMU now
destroys its IOREQ server for itself, as every other device emulator
is supposed to do. It's now safe to remove this code from the soft reset
path. Existing systems with old QEMU should still work: even if there
are IOREQ servers left behind, a new QEMU instance will override their
ranges anyway.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 30 Aug 2019 13:21:54 +0000 (15:21 +0200)]
x86: move INVPCID_TYPE_* to x86-defns.h
This way the insn emulator can then too use the #define-s. In place of
the TYPE infix add an X86 prefix.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 30 Aug 2019 08:24:13 +0000 (10:24 +0200)]
x86/ACPI: re-park previously parked CPUs upon resume from S3
Aiui when resuming from S3, CPUs come back out of RESET/INIT. Therefore
they need to undergo the same procedure as was added elsewhere by
commits d8f974f1a6 ("x86: command line option to avoid use of secondary
hyper-threads") and 8797d20a6e ("x86: possibly bring up all CPUs even
if not all are supposed to be used").
Just like done at boot time, avoid (at least pointlessly) using
stop-machine logic.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 29 Aug 2019 13:10:07 +0000 (15:10 +0200)]
x86: properly gate clearing of PKU feature
setup_clear_cpu_cap() is __init and hence may not be called post-boot.
Note that opt_pku nevertheless is not getting __initdata added - see
e.g. commit 43fa95ae6a ("mm: make opt_bootscrub non-init").
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Thu, 29 Aug 2019 13:08:46 +0000 (15:08 +0200)]
partially revert "x86/mm: Clean IOMMU flags from p2m-pt code"
This partially reverts commit 854a49a7486a02edae5b3e53617bace526e9c1b1 by re-adding the logic that
propagates changes to the domain physmap done by p2m_pt_set_entry into
the iommu page tables. Without this logic changes to the guest physmap
are not propagated to the iommu, leaving stale iommu entries that can
leak data, or failing to add new entries.
Note that this commit doesn't re-introduce iommu flags to the cpu page
table entries, since the logic to add/remove entries to the iommu page
tables is based on the p2m type and the mfn.
Fixes: 854a49a7486a02 ('x86/mm: Clean IOMMU flags from p2m-pt code') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Wed, 1 May 2019 17:14:03 +0000 (18:14 +0100)]
x86/boot: Drop all use of lmsw
lmsw is an obsolete relic of the 286 processor - so much so that it even lacks
intercept assistance on AMD processors.
Use a plain mov to %cr0 which is easier to follow, certainly faster to
virtualise on AMD hardware, and almost certainly a faster microcode path in
real hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 12 Aug 2019 17:40:04 +0000 (18:40 +0100)]
x86/suspend: Simplify system table handling on resume
load_TR() is used exclusively in the resume path, but jumps through a lot of
unnecessary hoops. As suspend/resume is strictly on CPU0 in idle context, the
correct GDT to use is boot_gdt, which means it doesn't need saving on suspend.
Although doing more than strictly necessary, reuse load_system_tables(), which
is already used by APs on the S3 resume path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 7 Aug 2019 11:53:51 +0000 (12:53 +0100)]
xen: Drop XEN_DOMCTL_{get,set}_machine_address_size
This functionality is obsolete. It was introduced by c/s 41296317a31 into
Xend, but was never exposed in libxl.
Nothing limits this to PV guests, but it makes no sense for HVM guests.
Looking through the XenServer templates, this was used to work around bugs in
the 32bit RHEL/CentOS 4.7 and 4.8 kernels (fixed in 4.9) and RHEL/CentOS/OEL
5.2 and 5.3 kernels (fixed in 5.4). RHEL 4 as a major version went out of
support in 2017, whereas the 5.2/5.3 kernels went out of support when 5.4 was
released in 2009.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 7 Aug 2019 11:49:37 +0000 (12:49 +0100)]
xen: Drop XEN_DOMCTL_suppress_spurious_page_faults
This functionality is obsolete. It was introduced by c/s 39407bed9c0 into
Xend, but never exposed in libxl.
While not explicitly limited to PV guests, this is PV-only by virtue of its
position in the pagefault handler.
Looking through the XenServer templates, this was used to work around bugs in
the 32bit RHEL/CentOS 4.{5..7} kernels (fixed in 4.8). RHEL 4 as a major
version went out of support in 2017.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Wed, 28 Aug 2019 14:58:45 +0000 (16:58 +0200)]
x86/hvm/domain: remove the 'hap_enabled' flag
The hap_enabled() macro can determine whether the feature is available
using the domain 'options'; there is no need for a separate flag.
NOTE: Furthermore, by extending sanitizing of the domain 'options', the
macro can be transformed into an inline function and re-located to
xen/sched.h. This also makes hap_enabled() common, thus allowing
removal of an ugly ifdef CONFIG_X86 from the common iommu code.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Wed, 28 Aug 2019 14:57:36 +0000 (16:57 +0200)]
p2m/ept: pass correct level to atomic_write_ept_entry in ept_invalidate_emt
The level passed to ept_invalidate_emt corresponds to the EPT entry
passed as the mfn parameter, which is a pointer to an EPT page table,
hence the entries in that page table will have one level less than the
parent.
Fix the call to atomic_write_ept_entry to pass the correct level, ie:
one level less than the parent.
Fixes: 50fe6e73705 ('pvh dom0: add and remove foreign pages') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Julien Grall [Wed, 14 Aug 2019 09:36:07 +0000 (10:36 +0100)]
xen/arm: traps: Remove all zero padding before PRIregister format
Commit af156ff085 "xen/arm: types: Specify the zero padding in the
definition of PRIregister" moved the zero padding within the definition
of PRIregister.
However, some of the users still had zero padding before it, which results
in tens of zeros being printed when dumping the CPU state.
To prevent this, remove the last users of zero padding before
PRIregister.
Igor Druzhinin [Tue, 27 Aug 2019 11:48:05 +0000 (12:48 +0100)]
x86/mm: correctly initialise M2P entries on boot
Since guest resource management work it's now possible to have a page
assigned to a domain without a valid M2P entry. Some paths in the code
rely on the fact a GFN returned from mfn_to_gfn() for such a page
is not valid as well, i.e. see arch_iommu_populate_page_table().
For systems without 512GB contiguous RAM M2P entries were already
correctly initialised on boot with INVALID_M2P_ENTRY (~0UL) but
on systems where M2P could be covered by a single 1GB page directory
0x77 poison was used instead. That eventually resulted in a crash
during IOMMU construction on systems without shared PTs enabled.
While here fix up compat M2P entries as well.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
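To illustrate why a bytewise 0x77 poison does not read back as an invalid entry, here is a small sketch. Only INVALID_M2P_ENTRY (~0UL) comes from the commit message; the array and the helpers fill_m2p() and all_invalid() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define INVALID_M2P_ENTRY (~0UL)

/* Fill a (hypothetical) M2P range bytewise, as a poisoning memset would. */
static void fill_m2p(unsigned long *m2p, size_t nr, int byte)
{
    assert(m2p != NULL);
    memset(m2p, byte, nr * sizeof(*m2p));
}

/* True if every entry in the range reads back as invalid. */
static bool all_invalid(const unsigned long *m2p, size_t nr)
{
    for ( size_t i = 0; i < nr; i++ )
        if ( m2p[i] != INVALID_M2P_ENTRY )
            return false;

    return true;
}
```

Filling with 0x77 leaves every entry looking like a plausible frame number, whereas filling with 0xff yields all-ones entries that compare equal to INVALID_M2P_ENTRY.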
Igor Druzhinin [Mon, 19 Aug 2019 18:45:35 +0000 (19:45 +0100)]
tools/oxenstored: port XS_INTRODUCE evtchn rebind function from cxenstored
The C version of xenstored has had this ability since 61aaed0d5 ("Allow
XS_INTRODUCE to be used for rebinding the xenstore evtchn.") from 2007.
Copy it as is to the OCaml version.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Paul Durrant [Fri, 16 Aug 2019 17:19:55 +0000 (18:19 +0100)]
domain: remove the 'is_xenstore' flag
This patch introduces a convenience macro, is_xenstore_domain(), which
tests the domain 'options' directly and then uses that in place of
the 'is_xenstore' flag.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: "Roger Pau Monné" <roger.pau@citrix.com> Acked-by: George Dunlap <George.Dunlap@eu.citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
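The idea of deriving a property from the domain 'options' rather than caching it in a separate flag can be sketched as follows. The flag values, the CDF_* names and domain_set_options() are hypothetical stand-ins; only the is_xenstore_domain() name and the 'options' field come from the commit message.

```c
#include <assert.h>

/* Hypothetical flag values; the real bit assignments live in Xen's headers. */
#define CDF_HVM        (1U << 0)
#define CDF_XS_DOMAIN  (1U << 4)
#define CDF_ALL        (CDF_HVM | CDF_XS_DOMAIN)

struct domain {
    unsigned int options;   /* creation flags, kept for the domain's lifetime */
};

/* Record the creation flags once; unknown bits are a caller bug here. */
static void domain_set_options(struct domain *d, unsigned int options)
{
    assert((options & ~CDF_ALL) == 0);
    d->options = options;
}

/* Test 'options' directly instead of consulting a separate boolean. */
#define is_xenstore_domain(d) (((d)->options & CDF_XS_DOMAIN) != 0)
```

With a single source of truth there is no second field that could drift out of sync with the options the domain was created with.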
Paul Durrant [Fri, 16 Aug 2019 17:19:52 +0000 (18:19 +0100)]
passthrough: make deassign_device() static
This function is only ever called from within the same source module and
really has no business being declared in xen/iommu.h. This patch relocates
the function ahead of the first caller and makes it static.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
c/s 650c31d3af "x86/IRQ: fix locking around vector management" adjusted the
locking in adjust_irq_affinity().
The S3 path ends up here via iommu_resume() before interrupts are enabled, at
which point spin_lock_irq() fails ASSERT(local_irq_is_enabled()); but with no
working console.
Use spin_lock_irqsave() instead to cope with interrupts already being
disabled.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Julien Grall [Mon, 19 Aug 2019 17:13:05 +0000 (18:13 +0100)]
xen/console: Fix build when CONFIG_DEBUG_TRACE=y
Commit b5e6e1ee8da "xen/console: Don't treat NUL character as the end
of the buffer" extended sercon_puts to take the number of character
to print in argument.
Sadly, a couple of the callers in debugtrace_dump_worker()
were not converted. This results in a build failure when enabling
CONFIG_DEBUG_TRACE.
Spotted by Travis using randconfig Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Fri, 19 Jul 2019 12:25:45 +0000 (13:25 +0100)]
include/public/memory.h: remove the XENMEM_rsrc_acq_caller_owned flag
When commit 3f8f1228 "x86/mm: add HYPERVISOR_memory_op to acquire guest
resources" introduced the concept of directly mapping some guest resources,
it was envisaged that the memory for some resources associated with a guest
may not actually be assigned to that guest, specifically the IOREQ server
resource introduced in commit 6e387461 "x86/hvm/ioreq: add a new mappable
resource type...". Such resources were dubbed "caller owned", and acquiring
them resulted in the XENMEM_rsrc_acq_caller_owned flag being passed back to
the caller of the memory op.
Unfortunately the implementation led to XSA-276, which was mitigated
by commit f6b6ae78 "x86/hvm/ioreq: fix page referencing" and then a related
memory accounting problem was worked around by commit e862e6ce
"x86/hvm/ioreq: use ref-counted target-assigned shared pages". This latter
commit removed the only instance of a "caller owned" resource, but the
flag was left in the header and checked in one place in the core code.
This patch removes that now redundant check and removes the definition of
XENMEM_rsrc_acq_caller_owned from the public header. Also, since this was
the only flag defined for the XENMEM_acquire_resource memory op, it removes
the 'flags' field of struct xen_mem_acquire_resource and replaces it with
an equivalently sized 'pad' field.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
When match_watch_by_token() returns an error, it also sets an exception
within Python. This is generally the right thing to do, but when
xspy_read_watch() handles the EAGAIN error internally, the exception needs to
be cleared. Otherwise it will fail like this:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
(...)
result = self.handle.read_watch()
SystemError: <method 'read_watch' of 'xen.lowlevel.xs.xs' objects> returned a result with an error set
Fixes f6e1023412 "python: Extract registered watch search logic from xspy_read_watch()" Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Wed, 21 Aug 2019 08:22:58 +0000 (09:22 +0100)]
viridian: make viridian_time_domain_freeze() safe to call...
...on a partially destroyed domain.
viridian_time_domain_freeze() and viridian_time_vcpu_freeze() rely
(respectively) on the dynamically allocated per-domain and per-vcpu viridian
areas [1], which are freed during domain_relinquish_resources().
Because arch_domain_pause() can call viridian_time_domain_freeze() this
can lead to host crashes if e.g. a XEN_DOMCTL_pausedomain is issued after
domain_relinquish_resources() has run.
To prevent such crashes, this patch adds a check of is_dying into
viridian_time_domain_freeze(), and viridian_time_domain_thaw() which is
similarly vulnerable to indirection into freed memory.
NOTE: The patch also makes viridian_time_vcpu_freeze/thaw() static, since
they have no callers outside of the same source module.
[1] See commit e7a9b5e72f26 "viridian: separately allocate domain and vcpu
structures".
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
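The guard described above can be sketched with heavily simplified stand-ins for the real structures. Everything here is hypothetical apart from the idea of checking is_dying before touching the separately allocated area; in particular, is_dying is modelled as a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the dynamically allocated per-domain area. */
struct viridian_domain {
    bool time_frozen;
};

struct domain {
    bool is_dying;                      /* set once teardown has started */
    struct viridian_domain *viridian;   /* freed while relinquishing resources */
};

/* Returns true if the freeze was applied, false if it was skipped. */
static bool time_domain_freeze(struct domain *d)
{
    assert(d != NULL);

    if ( d->is_dying )
        return false;   /* the per-domain area may already be gone */

    d->viridian->time_frozen = true;
    return true;
}
```

The check comes before any dereference of the per-domain pointer, so a pause request that arrives after teardown never follows a stale pointer.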
x86/p2m: fix non-translated handling of iommu mappings
The current usage of need_iommu_pt_sync in p2m for non-translated
guests is wrong because it doesn't correctly handle a relaxed PV
hardware domain, which has need_sync set to false but still needs
entries to be added from calls to {set/clear}_identity_p2m_entry.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Tested-by: Roman Shaposhnik <roman@zededa.com>
Extend the list of xc() object methods with additional one to display
Xen's buildid. The implementation follows the libxl implementation
(e.g. max buildid size assumption being XC_PAGE_SIZE minus
sizeof(buildid->len)).
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Martin Mazein <amazein@amazon.de> Reviewed-by: Andra-Irina Paraschiv <andraprs@amazon.com> Reviewed-by: Norbert Manthey <nmanthey@amazon.de> Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
xen/arm: add reserved-memory regions to the dom0 memory node
Reserved memory regions are automatically remapped to dom0. Their device
tree nodes are also added to dom0 device tree. However, the dom0 memory
node is not currently extended to cover the reserved memory regions
ranges as required by the spec. This commit fixes it.
Change make_memory_node to take a struct meminfo * instead of a
kernel_info. Call it twice for dom0, once to create the first regular
memory node, and the second time to create a second memory node with the
ranges covering reserved-memory regions.
Also, make a small code style fix in make_memory_node.
xen/arm: don't iomem_permit_access for reserved-memory regions
Don't allow reserved-memory regions to be remapped into any unprivileged
guests, until reserved-memory regions are properly supported in Xen. For
now, do not call iomem_permit_access on them, because giving
iomem_permit_access to dom0 means that the toolstack will be able to
assign the region to a domU.
xen/arm: handle reserved-memory in consider_modules and dt_unreserved_regions
reserved-memory regions overlap with memory nodes. The overlapping
memory is reserved-memory and should be handled accordingly:
consider_modules and dt_unreserved_regions should skip these regions the
same way they are already skipping mem-reserve regions.
As we parse the device tree in Xen, keep track of the reserved-memory
regions as they need special treatment (follow-up patches will make use
of the stored information.)
Reuse process_memory_node to add reserved-memory regions to the
bootinfo.reserved_mem array.
Refuse to continue once we reach the max number of reserved memory
regions to avoid accidentally mapping any portions of them into a VM.
xen/arm: make process_memory_node a device_tree_node_func
Change the signature of process_memory_node to match
device_tree_node_func. Thanks to this change, the next patch will be
able to use device_tree_for_each_node to call process_memory_node on all
the children of a provided node.
Return error if there is no reg property or if nr_banks is reached. Let
the caller deal with the error.
Add a new parameter to device_tree_for_each_node: node, the node to
start the search from.
To avoid scanning device tree, and given that we only care about
relative increments of depth compared to the depth of the initial node,
we set the initial depth to 0. Then, we call func() for every node with
depth > 0.
Don't call func() on the parent node passed as an argument. Clarify the
change in the comment on top of the function. The current callers pass
the root node as argument: it is OK to skip the root node because no
relevant properties are in it, only subnodes.
A lot of legitimate error messages were hidden behind debug printk
only. Most of these messages can be triggered by loading a malformed
hotpatch payload and are priceless for understanding issues with such
payloads.
Thus, always display all relevant XENLOG_ERR messages.
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Amit Shah <aams@amazon.de> Reviewed-by: Martin Mazein <amazein@amazon.de> Reviewed-by: Bjoern Doebel <doebel@amazon.de> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
[Fix indentation and double LIVEPATCH prefixes, drop gratuitous punctuation] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Sat, 26 Jan 2019 15:58:48 +0000 (15:58 +0000)]
xen/x86: Use mfn_to_gfn rather than mfn_to_gmfn
mfn_to_gfn and mfn_to_gmfn are doing exactly the same except the former
is using mfn_t and gfn_t (return type).
Furthermore, the naming of the former is more consistent with the
current naming scheme (GFN/MFN). So replace mfn_to_gmfn with
mfn_to_gfn in x86 code.
Take the opportunity to convert some of the callers to use typesafe GFN and
format the message correctly.
No functional changes.
Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
--
Changes in v3:
- The hunk in x86/mm.c is not necessary anymore
- Update printk message to use GFN rather than frame when suitable
- Update commit message with some NITs
- Add Jan's reviewed-by
Changes in v2:
- mfn_to_gfn now returns a gfn_t
- Use %pd and PRI_gfn when possible in the message
- Don't split format string to help grep/ack.
Michał Kowalczyk [Mon, 19 Aug 2019 02:23:33 +0000 (04:23 +0200)]
x86: Restore IA32_MISC_ENABLE on wakeup
Code in intel.c:early_init_intel() modifies IA32_MISC_ENABLE MSR. Those
modifications must be restored after resuming from S3 (see e.g. Linux wakeup
code), otherwise bad things may happen (e.g. wakeup code may cause #GP when
trying to set IA32_EFER.NXE [1]).
This bug was noticed on a ThinkPad x230 with NX disabled in the BIOS:
Xen could correctly boot, but crashed when resuming from suspend.
Applying this patch fixed the problem.
[1] Intel SDM vol 3: "If the execute-disable capability is not
available, a write to set IA32_EFER.NXE produces a #GP exception."
Signed-off-by: Michał Kowalczyk <mkow@invisiblethingslab.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
At the moment, HYPERCALL_console_io is using signed int to describe the
command (@cmd) and the size of the buffer (@count).
* @cmd does not need to be signed, as it is used as a set of named values.
None of them are negative. If new ones are introduced they can be
positive.
* @count is used to know the size of the buffer. It makes little
sense to have a negative value here.
So both variables are now switched to use unsigned int.
Changing @count to unsigned type will result in a change of behavior for
the existing commands:
- write: Any buffer bigger than 2GB will now be printed rather than
being ignored (the command returns 0).
- read: The return value is a signed 32-bit value for 32-bit Xen.
To keep compatibility between the 32-bit and 64-bit ABIs, it
effectively means the return value is 32-bit (despite being long
on 64-bit). Negative values are used for errors and positive values
for the number of characters read. To avoid a clash between the two
sets, the buffer is still limited to 2GB. The only difference is
that an error is returned rather than claiming there are no characters.
The change only affects unlikely uses of the current interface, so
this is not a big concern regarding backward compatibility.
Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
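A minimal sketch of the read-side limit described above: the return value carries either a character count or a negative error in a signed 32-bit range, so an over-large request is refused. The function console_read() and the choice of -E2BIG as the error are assumptions made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical read handler. 'count' is the caller's buffer size and
 * 'available' the number of characters waiting to be read.
 */
static long console_read(unsigned int count, unsigned int available)
{
    unsigned int n;

    if ( count > INT32_MAX )
        return -E2BIG;      /* cannot be told apart from an error code */

    n = count < available ? count : available;
    assert(n <= INT32_MAX); /* result stays in the non-negative range */

    return (long)n;
}
```

With a signed count the same request would have shown up as a negative size; with an unsigned count it has to be rejected explicitly.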
Julien Grall [Tue, 26 Feb 2019 21:39:58 +0000 (21:39 +0000)]
xen/console: Don't treat NUL character as the end of the buffer
After upgrading Debian to Buster, I have begun to notice console
mangling when using zsh in Dom0. This is happening because output sent by
zsh to the console may contain NULs in the middle of the buffer.
The actual implementation of CONSOLEIO_write considers that a buffer
always terminates with a NUL and therefore will ignore anything after it.
In general, NULs are perfectly legitimate in terminal streams. For
instance, this could be used for padding slow terminals. See terminfo(5)
section `Delays and Padding`, or search for the pcre '\bpad'.
Other use cases include using the console for dumping non-human
readable information (e.g. debugger, file if no network...). With the
current behavior, the resulting stream will end up corrupted.
The documentation for CONSOLEIO_write is pretty limited (not to say
nonexistent). From the declaration, the hypercall takes a buffer and size.
So this could lead one to think the NUL character is allowed in the middle
of the buffer.
This patch updates the console API to pass the size along with the buffer
down so we can remove the reliance on the buffer being terminated by a NUL
character.
Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
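The difference between a NUL-terminated interface and a length-based one can be sketched with a toy sink standing in for the serial console. The struct sink type and the sink_puts()/sink_write() helpers are hypothetical and unrelated to the real console code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sink standing in for the serial console. */
struct sink {
    char   data[64];
    size_t used;
};

/* New style: the caller states how many bytes to write, NULs included. */
static void sink_write(struct sink *s, const char *buf, size_t len)
{
    assert(s->used <= sizeof(s->data));

    if ( len > sizeof(s->data) - s->used )
        len = sizeof(s->data) - s->used;   /* clamp to the space left */

    memcpy(s->data + s->used, buf, len);
    s->used += len;
}

/* Old style: the data is assumed to end at the first NUL. */
static void sink_puts(struct sink *s, const char *str)
{
    sink_write(s, str, strlen(str));
}
```

Given the five bytes 'a', 'b', NUL, 'c', 'd', sink_puts() stops after two bytes while sink_write() with a length of 5 passes all of them through.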
At this moment IOMMU pt sharing is disabled by commit [1].
This patch aims to clear the IOMMU hap share support as it will not be
used in the future. By doing this the IOMMU bits used in pte[52:58] can
be used in other ways.
Suggested-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Brian Woods <brian.woods@amd.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Julien Grall [Mon, 12 Aug 2019 11:23:43 +0000 (12:23 +0100)]
xen/arm: setup: Add Xen as boot module before printing all boot modules
Since commit f60658c6ae "xen/arm: Stop relocating Xen", the position of
Xen in memory is not printed anymore. This can make it difficult to debug
early code.
As Xen is not relocated anymore, we can add Xen as boot module before
calling boot_fdt_info(). With that, the function will print Xen module
information along with all the other modules.
Michael Young [Tue, 13 Aug 2019 20:15:02 +0000 (21:15 +0100)]
tools/pygrub: Failing to set value to 0 in Grub2ConfigFile
In Grub2ConfigFile the code to handle ${saved_entry} and ${next_entry}
sets arg = "0" but this now does nothing following c/s d1b93ea2615bd
"tools/pygrub: Make pygrub understand default entry in string format"
which replaced arg.strip() with arg_strip in the following line. This
patch restores the previous behaviour.
Signed-off-by: Michael Young <m.a.young@durham.ac.uk> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 13 Aug 2019 13:46:00 +0000 (14:46 +0100)]
tools/xenstat: Fix -Wformat-truncation= issue
Building with GCC 8.3 on Buster identifies:
src/xenstat_linux.c: In function 'xenstat_collect_networks':
src/xenstat_linux.c:307:32: warning: 'snprintf' output may be truncated before
the last format character [-Wformat-truncation=]
snprintf(devNoBridge, 16, "p%s", devBridge);
^
src/xenstat_linux.c:307:2: note: 'snprintf' output between 2 and 17 bytes into
a destination of size 16
snprintf(devNoBridge, 16, "p%s", devBridge);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
devNoBridge[] needs one character more than devBridge[], so allocate one byte
more. Replace a raw 16 in the snprintf() call with a sizeof() expression
instead.
Finally, libxenstat, unlike most of the rest of Xen, doesn't use -Werror
which is why this issue went unnoticed in CI. Fix this.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org>
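The sizing relationship in this fix can be sketched as below. The struct net_names wrapper, the BRIDGE_NAME_LEN macro and prefix_bridge_name() are hypothetical; only the devBridge/devNoBridge names and the "p%s" format come from the compiler output quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define BRIDGE_NAME_LEN 16

struct net_names {
    char devBridge[BRIDGE_NAME_LEN];
    char devNoBridge[BRIDGE_NAME_LEN + 1];  /* room for the extra 'p' */
};

/*
 * Build "p<devBridge>" in devNoBridge. devBridge must be NUL-terminated.
 * Returns the length written, excluding the terminating NUL.
 */
static int prefix_bridge_name(struct net_names *n)
{
    int len = snprintf(n->devNoBridge, sizeof(n->devNoBridge),
                       "p%s", n->devBridge);

    /* With the destination one byte larger, truncation cannot happen. */
    assert(len >= 0 && (size_t)len < sizeof(n->devNoBridge));

    return len;
}
```

Using sizeof() on the destination keeps the bound correct if the array size is ever changed again.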
Andrew Cooper [Thu, 8 Aug 2019 16:18:10 +0000 (17:18 +0100)]
x86/boot: Simplify %fs setup in trampoline_setup
mov/shr is easier to follow than shld, and doesn't have a merge dependency on
the previous value of %edx. Shorten the rest of the code by streamlining the
comments.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
Viktor Mitin [Wed, 7 Aug 2019 10:10:28 +0000 (13:10 +0300)]
xen/arm: domain_build: Consolidate make_timer_node() and make_timer_domU_node()
At the moment, the hwdom and domUs are creating the timer node
differently.
Technically the timer is exposed the same way for any domain, the only
difference should be the interrupts used. The two current other
differences are:
- compatible: The hwdom DT will use the same one as provided
by the host. The domUs DT will use "arm,armv7-timer" for
32-bit domain and "arm,armv8-timer" for 64-bit domain. The latter
matches the behavior of libxl when guests are created from
userspace.
- clock-frequency: The property is used on platforms with
broken firmware to indicate the clock frequency. This should
be used by all the domains, however this is not yet the case
for domUs created by Xen.
To avoid more discrepancy the two functions are now consolidated into
one place make_timer_node().
For simplicity, the compatible will now be based on the bitness even
for the hwdom. This means the compatible exposed for the hwdom may
differ. This should only have an impact on 32-bit hwdom booting on
Armv8 hardware.
Viktor Mitin [Wed, 7 Aug 2019 10:10:27 +0000 (13:10 +0300)]
xen/arm: extend fdt_property_interrupts to support DomU
The domain and fdt can be found in the structure kinfo.
Rather than adding an extra argument for the domain, pass kinfo
directly.
This also requires adapting the fdt_property_interrupts() prototype.
A follow-up patch will need to create the interrupts for either Dom0 or
DomU.
Andrew Cooper [Tue, 30 Jul 2019 14:19:04 +0000 (15:19 +0100)]
x86/vvmx: Fix nested virt on VMCS-Shadow capable hardware
c/s e9986b0dd "x86/vvmx: Simplify per-CPU memory allocations" had the wrong
indirection on its pointer check in nvmx_cpu_up_prepare(), causing the
VMCS-shadowing buffer never to be allocated. Fix it.
This in turn results in a massive quantity of logspam, as every virtual
vmentry/exit hits both gdprintk()s in the *_bulk() functions.
Switch these to using printk_once(), but still only in debug builds. The size
of the buffer is chosen at compile time, so complaining about it repeatedly is
of no benefit.
Finally, drop the runtime NULL pointer checks. It is not terribly appropriate
to be repeatedly checking infrastructure which is set up from start-of-day,
and in this case, actually hid the above bug.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 21 Nov 2018 13:50:21 +0000 (13:50 +0000)]
x86/atomic: Improvements and simplifications to assembly constraints
* Constraints in the form "=r" (x) : "0" (x) can be folded to just "+r" (x)
* Switch to using named parameters (mostly for legibility) which in
particular helps with...
* __xchg(), __cmpxchg() and __cmpxchg_user() modify their memory operand, so
must list it as an output operand. This only works because they each have
a memory clobber to give the construct full compiler-barrier properties.
* Every memory operand has an explicit known size. Letting the compiler see
the real size rather than obscuring it with __xg() allows for the removal
of the instruction size suffixes without introducing ambiguity.
* Drop semicolons after lock prefixes.
* Other misc style changes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 26 Jul 2019 18:48:48 +0000 (19:48 +0100)]
xen/percpu: Make DECLARE_PER_CPU() and __DEFINE_PER_CPU() common
These macros are identical across the architectures, and shouldn't be separate
from the DEFINE_PER_CPU*() infrastructure.
This converts the final asm/percpu.h includes, which were all using
DECLARE_PER_CPU(), to include xen/percpu.h instead.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Fri, 9 Aug 2019 13:16:06 +0000 (14:16 +0100)]
x86/xpti: Don't leak TSS-adjacent percpu data via Meltdown
The XPTI work restricted the visibility of most of memory, but missed a few
aspects when it came to the TSS.
Given that the TSS is just an object in percpu data, the 4k mapping for it
created in setup_cpu_root_pgt() maps adjacent percpu data, making it all
leakable via Meltdown, even when XPTI is in use.
Furthermore, no care is taken to check that the TSS doesn't cross a page
boundary. As it turns out, struct tss_struct is aligned on its size which
does prevent it straddling a page boundary.
Rework the TSS types while making this change. Rename tss_struct to tss64, to
mirror the existing tss32 structure we have in HVM's Task Switch logic. Drop
tss64's alignment and __cacheline_filler[] field.
Introduce tss_page which contains a single tss64 and keeps the rest of the
page clear, so no adjacent data can be leaked. Move the definition from
setup.c to traps.c, which is a more appropriate place for it to live.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
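One way to give a TSS a page of its own is to wrap it in a type whose only member is page-aligned, as sketched here in C11. The tss64 layout is a hypothetical placeholder (only its size matters), and the use of _Alignas is an assumption of this sketch rather than the notation used in Xen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Hypothetical, trimmed-down TSS: only the size matters for this sketch. */
struct tss64 {
    uint8_t bytes[104];
};

/*
 * Aligning the sole member to PAGE_SIZE makes the containing struct both
 * page-aligned and padded to a full page, so nothing else can share it.
 */
struct tss_page {
    _Alignas(PAGE_SIZE) struct tss64 tss;
};

static_assert(sizeof(struct tss_page) == PAGE_SIZE,
              "the TSS must own its whole page");
static_assert(_Alignof(struct tss_page) == PAGE_SIZE,
              "the TSS must start on a page boundary");
```

Because the padding is part of the object, a 4k mapping of a tss_page exposes the TSS and otherwise unused bytes, never a neighbouring per-CPU variable.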
Andrew Cooper [Mon, 12 Aug 2019 07:17:01 +0000 (09:17 +0200)]
x86/desc: Build boot_{,compat_}gdt[] in C
... where we can at least get the compiler to fill in the surrounding space
without having to do it manually. This also results in the symbols having
proper type/size information in the debug symbols.
Reorder 'raw' in the seg_desc_t union to allow for easier initialisation.
Leave a comment explaining the various restrictions we have on altering the
GDT layout.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Introduce SEL2GDT(). Correct GDT indices in public header comments.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Fri, 9 Aug 2019 12:14:40 +0000 (13:14 +0100)]
xen/page_alloc: Keep away MFN 0 from the buddy allocator
Combining of buddies happens only such that the resulting larger buddy
is still order-aligned. To cross a zone boundary while merging, the
implication is that both the buddy [0, 2^n-1] and the buddy
[2^n, 2^(n+1)-1] are free.
Ideally we want to fix the allocator, but for now we can just prevent
adding MFN 0 to the allocator to avoid merging across zone
boundaries.
On x86, MFN 0 is already kept away from the buddy allocator. So the
bug can only happen on Arm platforms where the first memory bank
starts at 0.
As this is specific to the allocator, MFN 0 is removed in the common code
to cater for all the architectures (current and future).
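The merging argument above can be sketched in C as follows. This is a simplified model and not the allocator's code: the helper names are hypothetical, and the zone of an MFN is assumed to be the position of its highest set bit (with MFN 0 sharing a zone with MFN 1).

```c
#include <assert.h>
#include <stdbool.h>

/* Buddy of an order-aligned block: flip bit 'order' of its first MFN. */
static unsigned long buddy_of(unsigned long mfn, unsigned int order)
{
    return mfn ^ (1UL << order);
}

/* First MFN of the merged, order + 1 block. */
static unsigned long merged_start(unsigned long mfn, unsigned int order)
{
    return mfn & ~(1UL << order);
}

/* Assumed zone model: index of the highest set bit; MFN 0 maps to zone 0. */
static unsigned int zone_of(unsigned long mfn)
{
    unsigned int zone = 0;

    while ( mfn > 1 )
    {
        mfn >>= 1;
        zone++;
    }

    return zone;
}

/* Would merging this block with its buddy join two different zones? */
static bool merge_crosses_zone(unsigned long mfn, unsigned int order)
{
    return zone_of(mfn) != zone_of(buddy_of(mfn, order));
}
```

In this model, two non-zero buddies always share their highest set bit, so the only merge that can cross a zone boundary is the one pairing [2^n, 2^(n+1)-1] with [0, 2^n-1], which is why keeping MFN 0 out of the allocator is sufficient.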
Andrew Cooper [Fri, 9 Aug 2019 14:36:58 +0000 (16:36 +0200)]
xen/link: Introduce .bss.percpu.page_aligned
Future changes are going to need to page align some percpu data.
Shuffle the exact link order of items within the BSS to give
.bss.percpu.page_aligned appropriate alignment, even on CPU0, which uses
.bss.percpu itself.
Insert explicit alignment such that there won't be a gap between
__per_cpu_start and the first actual per-CPU object. The POINTER_ALIGN
for __bss_end is to cover the lack of SMP_CACHE_BYTES alignment, as the
loops which zero the BSS use pointer-sized stores on all architectures.
Rework __DEFINE_PER_CPU() so the caller passes in all attributes, and
adjust DEFINE_PER_CPU{,_READ_MOSTLY}() to match. This has the added bonus
that it is now possible to grep for .bss.percpu and find all the users.
Finally, introduce DEFINE_PER_CPU_PAGE_ALIGNED() which specifies the
section attribute and verifies the type's alignment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Make DEFINE_PER_CPU_PAGE_ALIGNED() verify the alignment rather than
specifying it. It is the underlying type which should be suitably aligned.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
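A minimal sketch of "verify the alignment rather than specify it" could look like the following. The macro name DEFINE_PAGE_ALIGNED_SKETCH and the demo type are hypothetical, PAGE_SIZE is assumed to be 4k, and the section placement (.bss.percpu.page_aligned) done by the real macro is deliberately left out so the example builds as an ordinary program.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096 /* assumption: 4k pages */

/*
 * Hypothetical simplification: the macro does not add an alignment
 * attribute itself. It only checks that the type already carries page
 * alignment, then defines the object. The real macro would additionally
 * place the object in a dedicated section.
 */
#define DEFINE_PAGE_ALIGNED_SKETCH(type, name)                  \
    _Static_assert(_Alignof(type) % PAGE_SIZE == 0,             \
                   #type " must be page aligned");              \
    type name

/* The alignment belongs to the type, as the commit message argues. */
struct demo_page {
    uint8_t data[PAGE_SIZE];
} __attribute__((aligned(PAGE_SIZE)));

DEFINE_PAGE_ALIGNED_SKETCH(struct demo_page, demo_obj);
```

Using the macro with a type that lacks the alignment attribute fails at compile time, instead of silently producing a misaligned object.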
Specifically:
xen/lowlevel/xc/xc.c: In function ‘pyxc_domain_create’:
xen/lowlevel/xc/xc.c:147:24: error: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Werror=sign-compare]
147 | for ( i = 0; i < sizeof(xen_domain_handle_t); i++ )
| ^
xen/lowlevel/xc/xc.c: In function ‘pyxc_domain_sethandle’:
xen/lowlevel/xc/xc.c:312:20: error: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Werror=sign-compare]
312 | for ( i = 0; i < sizeof(xen_domain_handle_t); i++ )
| ^
xen/lowlevel/xc/xc.c: In function ‘pyxc_domain_getinfo’:
xen/lowlevel/xc/xc.c:391:24: error: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Werror=sign-compare]
391 | for ( j = 0; j < sizeof(xen_domain_handle_t); j++ )
| ^
xen/lowlevel/xc/xc.c: In function ‘pyxc_get_device_group’:
xen/lowlevel/xc/xc.c:677:20: error: comparison of integer expressions of different signedness: ‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare]
677 | for ( i = 0; i < num_sdevs; i++ )
| ^
xen/lowlevel/xc/xc.c: In function ‘pyxc_physinfo’:
xen/lowlevel/xc/xc.c:988:20: error: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Werror=sign-compare]
988 | for ( i = 0; i < sizeof(pinfo.hw_cap)/4; i++ )
| ^
xen/lowlevel/xc/xc.c:994:20: error: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Werror=sign-compare]
994 | for ( i = 0; i < ARRAY_SIZE(virtcaps_bits); i++ )
| ^
xen/lowlevel/xc/xc.c:998:24: error: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Werror=sign-compare]
998 | for ( i = 0; i < ARRAY_SIZE(virtcaps_bits); i++ )
| ^
xen/lowlevel/xs/xs.c: In function ‘xspy_ls’:
xen/lowlevel/xs/xs.c:191:23: error: comparison of integer expressions of different signedness: ‘int’ and ‘unsigned int’ [-Werror=sign-compare]
191 | for (i = 0; i < xsval_n; i++)
| ^
xen/lowlevel/xs/xs.c: In function ‘xspy_get_permissions’:
xen/lowlevel/xs/xs.c:297:23: error: comparison of integer expressions of different signedness: ‘int’ and ‘unsigned int’ [-Werror=sign-compare]
297 | for (i = 0; i < perms_n; i++) {
| ^
cc1: all warnings being treated as errors
Use size_t for loop iterators where it's compared with sizeof() or
similar construct.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
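The shape of the fix can be illustrated with a small C sketch. handle_sketch_t stands in for xen_domain_handle_t and sum_handle() is a hypothetical function; only the choice of iterator type mirrors the change described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for xen_domain_handle_t: a fixed-size byte array. */
typedef uint8_t handle_sketch_t[16];

static unsigned int sum_handle(const uint8_t *handle)
{
    unsigned int sum = 0;
    size_t i; /* was 'int i', which trips -Werror=sign-compare below */

    /* sizeof() yields size_t, so the iterator has matching signedness. */
    for ( i = 0; i < sizeof(handle_sketch_t); i++ )
        sum += handle[i];

    return sum;
}
```

The same reasoning applies to loops bounded by an unsigned count such as a uint32_t: an iterator of the same unsigned type avoids the mixed-signedness comparison.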
Commit 4941bfb "xen/arm64: macros: Introduce an assembly macro to alias
x30" moved
lr .req x30
to macros.h. A later patch (1396dab "xen/arm64: head: Don't clobber
x30/lr in the macro PRINT") started to use "lr" in head.S, however, it
didn't add an #include macros.h to head.S. This commit fixes it.
The lack of alias breaks the build with
gcc-linaro-5.3.1-2016.05-x86_64_aarch64-linux-gnu. The alias was added
later to binutils 2.29 in 2017.
Andrew Cooper [Fri, 26 Jul 2019 18:48:48 +0000 (19:48 +0100)]
xen/percpu: Drop unused asm/percpu.h includes
These files either don't use any PER_CPU() infrastructure at all, or use
DEFINE_PER_CPU_*(). This is declared in xen/percpu.h, not asm/percpu.h, which
means that xen/percpu.h is included via a different path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 26 Jul 2019 19:41:03 +0000 (20:41 +0100)]
arm/percpu: Move {get,set}_processor_id() into current.h
For cleanup purposes, it is necessary for asm/percpu.h to not use
DECLARE_PER_CPU() itself. asm/current.h is arguably a better place for this
functionality to live anyway.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
The current names, boot_cpu_{,compat_}gdt_table, have a table suffix which is
redundant with the T of GDT, and the cpu infix doesn't provide any meaningful
context. Drop them both.
Likewise, shorten the {,compat_}gdt{,_l1e} variables.
Finally, rename gdt_descr to boot_gdtr to more clearly identify its purpose.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 7 Aug 2019 10:11:22 +0000 (12:11 +0200)]
AMD/IOMMU: miscellaneous DTE handling adjustments
First and foremost switch boolean fields to bool. Adjust a few related
function parameters as well. Then
- in amd_iommu_set_intremap_table() don't use literal numbers,
- in iommu_dte_add_device_entry() use a compound literal instead of many
assignments,
- in amd_iommu_setup_domain_device()
- eliminate a pointless local variable,
- use || instead of && when deciding whether to clear an entry,
- clear the I field without any checking of ATS / IOTLB state,
- leave reserved fields unnamed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
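To illustrate the "compound literal instead of many assignments" point, here is a small C sketch. The structure and its field names are hypothetical and far simpler than a real device table entry; the technique shown is only the whole-struct assignment from a compound literal with designated initialisers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified table entry. */
struct dte_sketch {
    bool valid;
    bool translation_valid;
    bool iotlb;
    uint16_t domain_id;
    uint64_t root_ptr;
};

static void dte_init(struct dte_sketch *dte, uint16_t domid, uint64_t root)
{
    /*
     * One assignment replaces a series of field stores. Fields that are
     * not named (here: iotlb) are zero-initialised, so no stale state
     * from a previous use of the entry survives.
     */
    *dte = (struct dte_sketch){
        .valid = true,
        .translation_valid = true,
        .domain_id = domid,
        .root_ptr = root,
    };
}
```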
Roger Pau Monné [Wed, 7 Aug 2019 10:09:51 +0000 (12:09 +0200)]
x86/apic: enable x2APIC mode before doing any setup
Current code calls apic_x2apic_probe which does some initialization
and setup before having enabled x2APIC mode (if it's not already
enabled by the firmware).
This can lead to issues if the APIC ID doesn't match the x2APIC ID, as
apic_x2apic_probe calls init_apic_ldr_x2apic_cluster which depending
on the APIC mode might set cpu_2_logical_apicid using the APIC ID
instead of the x2APIC ID (because x2APIC might not be enabled yet).
Fix this by enabling x2APIC before calling apic_x2apic_probe.
As a remark, this was discovered while I was trying to figure out why
one of my test boxes didn't report any iommu faults. The root cause
was that the iommu MSI address field was set using the stale value in
cpu_2_logical_apicid, and thus the iommu fault interrupt would get
lost. Even if the MSI address field gets set to a correct value
afterwards, as soon as a single iommu fault is pending no further
interrupts would get injected, so losing a single iommu fault
interrupt is fatal.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Lukasz Hawrylko [Wed, 7 Aug 2019 10:09:31 +0000 (12:09 +0200)]
Intel TXT: add reviewer, move to Odd Fixes state
Support for Intel TXT has orphaned status right now because
no active maintainer is listed. Adding myself as reviewer
and moving it to Odd Fixes state.
Signed-off-by: Lukasz Hawrylko <lukasz.hawrylko@linux.intel.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 5 Aug 2019 16:40:36 +0000 (17:40 +0100)]
passthrough/amd: Drop "IOMMU not found" message
Since c/s 9fa94e10585 "x86/ACPI: also parse AMD IOMMU tables early", this
function is unconditionally called in all cases where a DMAR ACPI table
doesn't exist.
As a consequence, "AMD-Vi: IOMMU not found!" is printed in all cases where an
IOMMU isn't present, even on non-AMD systems. Drop the message - it isn't
terribly interesting anyway, and is now misleading in a number of common
cases.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Brian Woods <brian.woods@amd.com>
George Dunlap [Tue, 6 Aug 2019 11:19:55 +0000 (12:19 +0100)]
mm: Safe to clear PGC_allocated on xenheap pages without an extra reference
Commits ec83f825627 "mm.h: add helper function to test-and-clear
_PGC_allocated" (and subsequent fix-up 44a887d021d "mm.h: fix BUG_ON()
condition in put_page_alloc_ref()") introduced a BUG_ON() to detect
unsafe behavior of callers.
Unfortunately this condition still turns out to be too strict.
xenheap pages are somewhat "magic": calling free_domheap_pages() on
them will not cause free_heap_pages() to be called: whichever part of
Xen allocated them specially must call free_xenheap_pages()
specifically. (They'll also be handled appropriately at domain
destruction time.)
Only crash Xen when put_page_alloc_ref() finds only a single refcount
if the page is not a xenheap page.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
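The relaxed condition can be sketched as a small predicate in C. The flag values and the function name are hypothetical and chosen only for illustration; the sketch assumes the check is "reference count of at most one and not a xenheap page", which is one reading of the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, not the hypervisor's actual definitions. */
#define PGC_COUNT_MASK_SKETCH 0x00ffffffu
#define PGC_XEN_HEAP_SKETCH   0x40000000u

/*
 * Returns true when dropping the allocation reference would be treated
 * as a bug: only a single reference is held and the page is not a
 * xenheap page (xenheap pages are freed explicitly by their allocator).
 */
static bool alloc_ref_drop_is_bug(uint32_t count_info)
{
    return (count_info & PGC_COUNT_MASK_SKETCH) <= 1 &&
           !(count_info & PGC_XEN_HEAP_SKETCH);
}
```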
Andrew Cooper [Mon, 5 Aug 2019 13:48:21 +0000 (14:48 +0100)]
x86/shim: Fix parallel build following c/s 32b1d62887d0
Unfortunately, a parallel build from clean can fail in the following manner:
xen.git$ make -j4 -C tools/firmware/xen-dir/
make: Entering directory '/local/xen.git/tools/firmware/xen-dir'
mkdir -p xen-root
make: *** No rule to make target 'xen-root/xen/arch/x86/configs/pvshim_defconfig', needed by 'xen-root/xen/.config'. Stop.
make: *** Waiting for unfinished jobs....
The rule for pvshim_defconfig needs to depend on the linkfarm, rather than
$(D)/xen/.config specifically.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Dario Faggioli [Mon, 5 Aug 2019 10:50:56 +0000 (11:50 +0100)]
xen: sched: reassign vCPUs to pCPUs, when they come back online
When a vcpu that was offline comes back online, we do want it to either
be assigned to a pCPU, or go into the wait list.
Detecting that a vcpu is coming back online is a bit tricky. Basically,
if the vcpu is waking up, and is neither assigned to a pCPU, nor in the
wait list, it must be coming back from offline.
When this happens, we put it in the waitqueue, and we "tickle" an idle
pCPU (if any), to go pick it up.
Looking at the patch, it seems that the vcpu wakeup code is getting
complex, and hence that it could potentially introduce latencies.
However, all this new logic is triggered only by the case of a vcpu
coming online, so, basically, the overhead during normal operations is
just an additional 'if()'.
Signed-off-by: Dario Faggioli <dario.faggioli@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Message-Id: <156412236222.2385.236340632846050170.stgit@Palanthas>
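The detection rule described above can be sketched in C as follows. The structure, its fields and the function names are hypothetical and only model the decision; the real scheduler tracks this state differently and also performs the actual tickling of an idle pCPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vCPU scheduler state. */
struct vcpu_sketch {
    int pcpu;       /* assigned pCPU, or -1 if none */
    bool in_waitq;  /* true while queued in the wait list */
};

/* A waking vCPU that is neither assigned nor queued was offline. */
static bool coming_back_online(const struct vcpu_sketch *v)
{
    return v->pcpu < 0 && !v->in_waitq;
}

/*
 * Wakeup path: in normal operation this costs one extra 'if'. Returns
 * true when the caller should tickle an idle pCPU to pick the vCPU up.
 */
static bool wake_sketch(struct vcpu_sketch *v)
{
    if ( coming_back_online(v) )
    {
        v->in_waitq = true;
        return true;
    }

    return false;
}
```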
Dario Faggioli [Mon, 5 Aug 2019 10:50:55 +0000 (11:50 +0100)]
xen: sched: deal with vCPUs being or becoming online or offline
If a vCPU is, or is going, offline we want it to be neither
assigned to a pCPU, nor in the wait list, so:
- if an offline vcpu is inserted (or migrated) it must not
go on a pCPU, nor in the wait list;
- if an offline vcpu is removed, we are sure that it is
neither on a pCPU nor in the wait list already, so we
should just bail, avoiding doing any further action;
- if a vCPU goes offline we need to remove it either from
its pCPU or from the wait list.
Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Message-Id: <156412235656.2385.13861979113936528474.stgit@Palanthas>
openSUSE comes in two flavours: Leap, which is non-rolling, and released
annually, and Tumbleweed, which is rolling.
Reasons why it makes sense to have both (despite both being openSUSE,
package lists in dockerfiles being quite similar, etc) are:
- Leap shares a lot with SUSE Linux Enterprise. So, testing on Leap not
only catches regressions for all openSUSE Leap users, but also helps
prevent/catch regressions on SLE;
- Tumbleweed often has the most bleeding-edge software, so it will help
us prevent/catch regressions with newly released versions of
libraries, compilers, etc (e.g., at the time of writing this commit,
some build issues with GCC9 were discovered while trying to build
in a Tumbleweed image).
Note that, considering the rolling nature of Tumbleweed, the container
would need to be rebuilt (e.g., periodically), even if the docker file
does not change.