Use explicit casts to uintptr_t when it is not possible to use the
provided static inline functions.
M3CM: Rule-18.2: Subtraction between pointers shall only be applied to
pointers that address elements of the same array
Since we are changing the body of is_kernel_text and friends, take the
opportunity to remove the leading underscores in the local variable
names, which violate namespace rules. Also make the local p__
variable const.
In the case of __initcall_start, __presmp_initcall_end, and
__initcall_end, turn the three variables into two proper ranges
introducing __presmp_initcall_start.
QAVerify: 2761 Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> CC: JBeulich@suse.com CC: andrew.cooper3@citrix.com
---
Changes in v11:
- split (__initcall_start,__initcall_end) and
(__initcall_start,__initcall_end)
- make use of elf_note_bytediff
- use DECLARE_BOUNDS
Changes in v10:
- use DEFINE_SYMBOL
- move changes for _start, _end, _stext, _etext, _srodata, _erodata,
_sinittext, _einittext to a different patch
Changes in v9:
- use SYMBOLS_SUBTRACT and SYMBOLS_COMPARE
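As an illustration of the explicit-cast approach described above, here is a minimal sketch, not the code from this series. The array text_region and the names TEXT_START, TEXT_END, other_object and in_text_sketch() are hypothetical stand-ins for linker-provided symbols and for is_kernel_text. The comparison is done on uintptr_t values, so no relational or subtraction operator is applied to pointers into different objects.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for a region delimited by two linker symbols. */
static char text_region[64];
#define TEXT_START (&text_region[0])
#define TEXT_END   (&text_region[64])   /* one past the last byte */

/* An unrelated object, used only to show a pointer outside the region. */
static char other_object;

/* Range check done on integers rather than on the pointers themselves. */
static inline bool in_text_sketch(const void *p)
{
    const uintptr_t addr = (uintptr_t)p;

    return addr >= (uintptr_t)TEXT_START && addr < (uintptr_t)TEXT_END;
}
```

A call such as in_text_sketch(&other_object) is then well defined even though other_object is not part of text_region.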
possible to use the provided static inline functions.
M3CM: Rule-18.2: Subtraction between pointers shall only be applied to
pointers that address elements of the same array.
QAVerify: 2761 Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> CC: JBeulich@suse.com CC: andrew.cooper3@citrix.com
---
Changes in v11:
- change p type to struct abstract_per_cpu *
- move changes to alternative.c to a new patch
- use DECLARE_BOUNDS
Changes in v10:
- use DEFINE_SYMBOL
- move changes for _start, _end, _stext, _etext, _srodata, _erodata,
_sinittext, _einittext to a different patch
Changes in v9:
- use SYMBOLS_SUBTRACT and SYMBOLS_COMPARE
Introduce a macro to be used to declare array variables corresponding to
linker symbols, plus two static inline functions to be used for
comparing and subtracting pointers with the linker symbols.
Note that the start and end symbols are declared with different types to
help avoid errors and misuse of those variables.
Use a build-time assertion to check the proper alignment of the structs
passed as arguments to the static inline functions. Use BUILD_BUG_ON for
the implementation.
Suggested-by: Jan Beulich <JBeulich@suse.com> Suggested-by: Ian Jackson <ian.jackson@citrix.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
---
Changes in v11:
- add ptrdiff_t casts in _diff macro
- improve comment
- add build-time assertion on struct alignment
- add _bytediff function
- move the macros to xen/lib.h
- rename DEFINE_SYMBOL to __DECLARE_BOUNDS
- add wrappers
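The idea of giving the start and end symbols different types, with typed helpers and a build-time layout check, can be sketched roughly as follows. This is an illustration under assumptions, not the macro from the patch: item_t, item_start_t, item_end_t, item_diff() and item_lt() are hypothetical names, and C11 _Static_assert stands in for BUILD_BUG_ON.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element type stored between two linker symbols. */
typedef struct { unsigned long v; } item_t;

/* Distinct wrapper types, so start and end cannot be swapped silently. */
typedef struct { item_t i; } item_start_t;
typedef struct { item_t i; } item_end_t;

/* Build-time checks that the wrappers share the element's layout. */
_Static_assert(sizeof(item_start_t) == sizeof(item_t) &&
               _Alignof(item_start_t) == _Alignof(item_t),
               "start wrapper must match the element layout");
_Static_assert(sizeof(item_end_t) == sizeof(item_t) &&
               _Alignof(item_end_t) == _Alignof(item_t),
               "end wrapper must match the element layout");

/* Number of elements between start and end, computed on integers. */
static inline ptrdiff_t item_diff(const item_end_t *end,
                                  const item_start_t *start)
{
    return (ptrdiff_t)((uintptr_t)end - (uintptr_t)start) /
           (ptrdiff_t)sizeof(item_t);
}

/* True when p lies below the end symbol. */
static inline bool item_lt(const item_t *p, const item_end_t *end)
{
    return (uintptr_t)p < (uintptr_t)end;
}
```

Passing the arguments of item_diff() in the wrong order is then a type error rather than a silent sign flip.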
Introduce the new type "ptrdiff_t" which is defined as the signed
integer type of the result of subtracting two pointers. Use
__PTRDIFF_TYPE__ to define it.
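For illustration only, a definition along these lines might look as follows. __PTRDIFF_TYPE__ is a macro predefined by GCC and Clang; the names ptrdiff_sketch_t and elems_between() are hypothetical and chosen to avoid clashing with the real ptrdiff_t.

```c
/* __PTRDIFF_TYPE__ is predefined by GCC and Clang. */
typedef __PTRDIFF_TYPE__ ptrdiff_sketch_t;

/* The type has to be signed, since a pointer difference can be negative. */
_Static_assert((ptrdiff_sketch_t)-1 < 0, "pointer difference type must be signed");

/* Element distance between two pointers into the same array. */
static ptrdiff_sketch_t elems_between(const int *from, const int *to)
{
    return to - from;
}
```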
Sergey Dyasli [Wed, 9 Jan 2019 14:45:14 +0000 (15:45 +0100)]
mm/page_alloc: fix MEMF_no_dma allocations for single NUMA
Currently dma_bitsize is zero by default on single NUMA node machines.
This makes all alloc_domheap_pages() calls with MEMF_no_dma return NULL.
There is only 1 user of MEMF_no_dma: dom0_memflags, which are used
during memory allocation for Dom0. Failing allocation with default
dom0_memflags is especially severe for the PV Dom0 case: it makes
alloc_chunk() use a suboptimal 2MB allocation algorithm with a search
for higher memory addresses.
This can lead to the NMI watchdog timeout during PV Dom0 construction
on some machines, which can be worked around by specifying "dma_bits"
in Xen's cmdline manually.
Fix the issue by ignoring MEMF_no_dma in cases when dma_bitsize is zero,
which means there is no DMA zone. This shouldn't cause any issues for
Dom0 because alloc_heap_pages() will first use higher memory addresses
for satisfying memory allocation requests.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
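The fix described above amounts to masking the flag when there is no DMA zone. A minimal sketch, assuming a made-up flag value (MEMF_NO_DMA_SKETCH) and a hypothetical helper name (effective_memflags); the real flag and the place where it is checked are in Xen's page allocator:

```c
/* Hypothetical flag value; the real MEMF_no_dma is defined in Xen's headers. */
#define MEMF_NO_DMA_SKETCH (1u << 0)

/* Drop the no-DMA request when dma_bitsize is zero, i.e. no DMA zone exists. */
static unsigned int effective_memflags(unsigned int memflags,
                                       unsigned int dma_bitsize)
{
    if (dma_bitsize == 0)
        memflags &= ~MEMF_NO_DMA_SKETCH;

    return memflags;
}
```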
Anthony PERARD [Wed, 9 Jan 2019 11:07:30 +0000 (11:07 +0000)]
docs: Fix output of man/xen-vbd-interface
In pandoc's markdown, a code block needs at least 4 spaces to be
recognized as such. This patch fixes the rendering of the description of the
encoding in the VBD interface so that [1] can be readable.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
[ wei: rebase on top of staging ] Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monné [Tue, 8 Jan 2019 09:03:45 +0000 (10:03 +0100)]
x86/shim: only mark special pages as RAM in pvshim mode
When running Xen as a guest it's not necessary to mark such pages as
RAM because they won't be assigned to the initial domain memory map.
While there, move the functions to the PV shim specific file and rename
them accordingly.
No functional change expected.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Mon, 17 Dec 2018 09:22:59 +0000 (09:22 +0000)]
x86/mm/p2m: stop checking for IOMMU shared page tables in mmio_order()
Now that the iommu_map() and iommu_unmap() operations take an order
parameter and elide flushing there's no strong reason why modifying MMIO
ranges in the p2m should be restricted to a 4k granularity simply because
the IOMMU is enabled but shared page tables are not in operation.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Mon, 17 Dec 2018 09:22:58 +0000 (09:22 +0000)]
iommu: elide flushing for higher order map/unmap operations
This patch removes any implicit flushing that occurs in the implementation
of map and unmap operations and adds new iommu_map/unmap() wrapper
functions. To maintain semantics of the iommu_legacy_map/unmap() wrapper
functions, these are modified to call the new wrapper functions and then
perform an explicit flush operation.
Because VT-d currently performs two different types of flush dependent upon
whether a PTE is being modified versus merely added (i.e. replacing a non-
present PTE) 'iommu flush flags' are defined by this patch and the
iommu_ops map_page() and unmap_page() methods are modified to OR the type
of flush necessary for the PTE that has been populated or depopulated into
an accumulated flags value. The accumulated value can then be passed into
the explicit flush operation.
The ARM SMMU implementations of map_page() and unmap_page() currently
perform no implicit flushing and therefore the modified methods do not
adjust the flush flags.
NOTE: The per-cpu 'iommu_dont_flush_iotlb' is respected by the
iommu_legacy_map/unmap() wrapper functions and therefore this now
applies to all IOMMU implementations rather than just VT-d.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: Brian Woods <brian.woods@amd.com>
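The accumulation of flush flags described in this patch can be pictured with the sketch below. It is a toy model under assumptions: the flag names, the pte_table array and both functions are hypothetical, and a zero entry stands for a non-present PTE. Each per-page operation ORs the kind of flush it needs into an accumulator, and the caller issues one explicit flush with the combined value.

```c
#include <stddef.h>

/* Hypothetical flush flag values. */
enum {
    FLUSH_ADDED_SKETCH    = 1u << 0,  /* a non-present entry was populated */
    FLUSH_MODIFIED_SKETCH = 1u << 1,  /* a present entry was changed */
};

#define NR_PTES_SKETCH 8
static unsigned int pte_table[NR_PTES_SKETCH];  /* 0 means "not present" */

/* Map one page (val must be non-zero) and record which flush it needs. */
static void map_page_sketch(size_t idx, unsigned int val,
                            unsigned int *flush_flags)
{
    *flush_flags |= pte_table[idx] ? FLUSH_MODIFIED_SKETCH
                                   : FLUSH_ADDED_SKETCH;
    pte_table[idx] = val;
}

/*
 * Map a range without flushing per page. The returned flags are what the
 * caller would pass to a single explicit flush afterwards. The caller must
 * keep first + count within NR_PTES_SKETCH.
 */
static unsigned int map_range_sketch(size_t first, size_t count,
                                     unsigned int val)
{
    unsigned int flush_flags = 0;

    for (size_t i = 0; i < count; i++)
        map_page_sketch(first + i, val, &flush_flags);

    return flush_flags;
}
```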
Paul Durrant [Mon, 17 Dec 2018 09:22:57 +0000 (09:22 +0000)]
iommu: rename wrapper functions
A subsequent patch will add semantically different versions of
iommu_map/unmap() so, in advance of that change, this patch renames the
existing functions to iommu_legacy_map/unmap() and modifies all call-sites.
It also adjusts a comment that refers to iommu_map_page(), which was
renamed by a previous patch.
This patch is purely cosmetic. No functional change.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Paul Durrant [Mon, 17 Dec 2018 09:22:56 +0000 (09:22 +0000)]
amd-iommu: add flush iommu_ops
The iommu_ops structure contains two methods for flushing: 'iotlb_flush' and
'iotlb_flush_all'. This patch adds implementations of these for AMD IOMMUs.
The iotlb_flush method takes a base DFN and a (4k) page count, but the
flush needs to be done by page order (i.e. 0, 9 or 18). Because a flush
operation is fairly expensive to perform, the code calculates the minimum
order single flush that will cover the specified page range rather than
performing multiple flushes.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Brian Woods <brian.woods@amd.com>
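One way to picture the minimum-order calculation is the sketch below. It is an assumption-laden illustration, not the driver code: regions_touched() and min_flush_order() are hypothetical names, and a return value of -1 is used here to mean that no single flush of order 0, 9 or 18 covers the range (a caller could then fall back to flushing everything).

```c
/* How many naturally aligned 2^order-page regions does the range touch? */
static unsigned long regions_touched(unsigned long dfn,
                                     unsigned long page_count,
                                     unsigned int order)
{
    unsigned long first = dfn >> order;
    unsigned long last = (dfn + page_count - 1) >> order;

    return last - first + 1;
}

/*
 * Smallest of the orders 0, 9 and 18 for which one aligned region covers
 * [dfn, dfn + page_count), or -1 if none does.
 */
static int min_flush_order(unsigned long dfn, unsigned long page_count)
{
    /* Reject empty ranges and ranges that wrap around. */
    if (page_count == 0 || dfn + page_count < dfn)
        return -1;

    if (page_count == 1)
        return 0;
    if (regions_touched(dfn, page_count, 9) == 1)
        return 9;
    if (regions_touched(dfn, page_count, 18) == 1)
        return 18;

    return -1;
}
```

Note that a two-page range straddling a 2^9 boundary needs order 18 even though it is tiny, which is the price of issuing a single flush.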
Andrew Cooper [Wed, 2 Jan 2019 10:26:49 +0000 (10:26 +0000)]
docs/man: Fix/simplify generation of manpages
The original intent of this patch was to rename xen-vbd-interface.markdown.7
to xen-vbd-interface.pandoc.7 to remove the final markdown file from the docs/
tree.
The DOC_MANx lists are broken. They contain MANxSRC-y twice, the first half
with a partial %.pod substitution, and the second half with a partial
%.markdown substitution. This is also the root cause behind the filtering
activity in the uninstall-man$(i)-pages rule.
Furthermore, the logic for generating the manpage targets is unnecessarily
repetitive, owing to the layout of source files in the man/ directory.
Therefore, tackle the problem by renaming all of our manpage source files from
"$FORMAT.$SECTION" to "$SECTION.$FORMAT". For the two xl.cfg.5 and xl.1 which
are preprocessed by autoconf to contain path information, this requires
updating configure.ac and .gitignore. The markdown to pandoc conversion is
performed as well, as it is also a straight rename.
An ancillary benefit of this renaming is that text editors stand a chance of
being able to work out the correct mode to use.
As for the makefile:
1) Break the MAN_SECTIONS list out of the GENERATE_MANPAGE_RULES loop, as we
are going to use it a second time.
2) Do away with the individual MANxSRC-y variables. Use a single list,
derived from all *.pod and *.pandoc files, with their format suffixes
removed.
3) Use a $(foreach ...) to generate the DOC_MANx lists, filling them with the
correct content.
4) The DOC_HTML and DOC_TXT can now include all manpages with a single
substitution, as they don't need to separate the manpages by
section-numbered-directory.
5) Fix up the filenames in the manpage metarule to match the renaming.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Wed, 2 Jan 2019 10:26:47 +0000 (10:26 +0000)]
docs/markdown: Switch to using pandoc, and fix underscore escaping
c/s a3a99df44 "docs/cmdline: Rewrite the cpuid_mask_* section" completely
forgot about how markdown gets rendered to HTML (as opposed to PDF), because
we use different translators depending on the destination format.
markdown and pandoc are very similar markup languages, but a couple of details
about pandoc cause it to have far more user-friendly inline markup.
Switch all markdown documents to be pandoc (so we are using a single
translator, and therefore a single flavour of markdown), which fixes the
rendered docs on xenbits.xen.org/docs.
While changing the format, fix the remainder of the escaped underscores in the
same manner as the previous patch. The two problem cases here are __LINE__
and __FILE__ where the first underscore still needs escaping.
In addition, dmop.markdown and dom0less.markdown were not previously processed,
as only .markdown files in the misc/ directory got considered.
dom0less.pandoc gets picked up automatically now, due to being in the
features/ directory, but designs/ needs adding to the pandoc directory list
for dmop.pandoc to get processed.
While editing in appropriate areas, take the opportunity to fix some markup to
the surrounding style, and drop trailing whitespace.
No change in content - only formatting. This results in the text being easier
to read and grep.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Wed, 2 Jan 2019 10:26:45 +0000 (10:26 +0000)]
docs/pandoc: Don't escape underscores in the middle of text
Pandoc deliberately (and contrary to markdown) doesn't treat underscores in
the middle of normal text as emphasis markers, as this is almost always the
unhelpful interpretation.
For text which is emphasised using _, an underscore in the middle is
interpreted, but the emphasis marker can be switched to * instead.
One problem case is where we use {} globbing with identifier names, as it
counts as a word break. Therefore, we do need to retain the escaped
underscore immediately following a closing brace.
No change in content - only formatting. This results in the text being easier
to read and grep.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Roger Pau Monne [Fri, 21 Dec 2018 09:41:05 +0000 (10:41 +0100)]
x86/mm-locks: apply a bias to lock levels for control domain
The paging_log_dirty_op function takes mm locks from a subject domain and
then attempts to perform copy-to operations against the caller domain
in order to copy the result of the hypercall into the caller-provided
buffer.
This works fine when the caller is a non-paging domain, but triggers a
lock order panic when the caller is a paging domain due to the fact
that at the point where the copy to operation is performed the subject
domain paging lock is locked, and the copy operation requires
locking the caller p2m lock which has a lower level.
Fix this limitation by adding a bias to the level of control domain mm
locks, so that the lower control domain mm lock always has a level
greater than the higher unprivileged domain lock level. This allows
locking the subject domain mm locks and then locking the control
domain mm locks, while keeping the same lock ordering and the changes
mostly confined to mm-locks.h.
Note that so far only this flow (locking a subject domain's locks and
then the control domain ones) has been identified, but not all
possible code paths have been inspected. Hence this solution attempts
to be a non-intrusive fix for the problem at hand, without discarding
further changes in the future if other valid code paths are found that
require more complex lock level ordering.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
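The effect of the bias can be shown with a toy model. The level values, the bias and both helper names below are made up for illustration; the real levels and the ordering check live in mm-locks.h.

```c
#include <stdbool.h>

/* Made-up levels: the p2m lock ranks below the paging lock. */
#define P2M_LEVEL_SKETCH    16
#define PAGING_LEVEL_SKETCH 64
/* The bias must exceed the highest unbiased level. */
#define LEVEL_BIAS_SKETCH   128

/* Control domain locks get the bias added to their base level. */
static int lock_level(int base_level, bool is_control_domain)
{
    return base_level + (is_control_domain ? LEVEL_BIAS_SKETCH : 0);
}

/* Ordering rule: only a lock with a higher level than the held one may be taken. */
static bool may_take(int held_level, int wanted_level)
{
    return wanted_level > held_level;
}
```

With these numbers, holding the subject domain's paging lock (64) and taking the control domain's p2m lock (16 + 128) respects the ordering, whereas without the bias it would not.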
Lars Kurth [Mon, 10 Dec 2018 19:33:09 +0000 (11:33 -0800)]
CONTRIBUTING: Clarifications on how to handle license deviations
This patch makes a few clarifications which were discussed on
IRC recently.
Specifically:
- Highlight the principle that license deviations
should be brought to the attention of maintainers
- Add a requirement for GPLv2 compatibility
- Restructure the document to highlight use-cases for
"New components" and "Importing code" more clearly
- Add conventions and instructions for "New files"
Signed-off-by: Lars Kurth <lars.kurth@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
Anthony PERARD [Wed, 12 Dec 2018 14:53:46 +0000 (14:53 +0000)]
libxl_create: Re-order callbacks of initiate_domain_create
Callbacks should be in the order in which they are going to be executed.
This patch fixes the initiate_domain_create callbacks, and also
reorders the callback prototypes. That way, it's easier to follow the
flow.
This patch:
- move libxl__colo_restore_setup_done after domcreate_bootloader_done.
- move domcreate_attach_devices after domcreate_devmodel_started.
No functional change.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Razvan Cojocaru [Tue, 18 Dec 2018 15:11:44 +0000 (17:11 +0200)]
x86/altp2m: add altp2m_vcpu_disable_notify
Allow altp2m users to disable #VE/VMFUNC alone. Currently it is
only possible to disable this functionality when we disable altp2m
completely; #VE/VMFUNC can only be enabled once per altp2m session.
In addition to making things complete, disabling #VE is also a
workaround for CFW116 ("When Virtualization Exceptions are Enabled,
EPT Violations May Generate Erroneous Virtualization Exceptions")
on Xeon E-2100 CPUs.
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Fri, 28 Dec 2018 11:18:56 +0000 (12:18 +0100)]
x86/dom0: take alignment into account when populating p2m in PVH mode
Current code that allocates memory and populates the p2m for PVH Dom0
doesn't take the address alignment into account; this can lead to high
order allocations that start on a non-aligned address being broken
down into lower order entries on the p2m page tables.
Fix this by taking into account the p2m page sizes and alignment
requirements when allocating the memory and populating the p2m.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
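A generic way to take alignment into account when choosing an allocation order is sketched below. This is an assumption, not the Dom0 builder's code: pick_order() is a hypothetical helper that considers every power of two, whereas the real code works with the p2m's supported page sizes. max_order is assumed to be well below 63.

```c
#include <stdint.h>

/*
 * Largest order (capped at max_order) such that a block of 2^order pages
 * starting at start_pfn is naturally aligned and fits in nr_pages.
 */
static unsigned int pick_order(uint64_t start_pfn, uint64_t nr_pages,
                               unsigned int max_order)
{
    unsigned int order = 0;

    while (order < max_order) {
        uint64_t next_size = UINT64_C(1) << (order + 1);

        /* Stop if the next size is misaligned or too large for the range. */
        if ((start_pfn & (next_size - 1)) || next_size > nr_pages)
            break;
        order++;
    }

    return order;
}
```

A range starting at pfn 1 therefore begins with an order-0 allocation, which lets later chunks start on an aligned boundary instead of forcing every mapping to be split.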
Roger Pau Monne [Thu, 27 Dec 2018 15:26:35 +0000 (16:26 +0100)]
x86/dom0: allow stealing RAM from a region that starts in the low 1MB
As long as the memory stolen is always above 1MB. This allows the PVH
Dom0 builder to be used on a memory map that only has a single RAM
region starting at 0.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Thu, 20 Dec 2018 15:08:50 +0000 (15:08 +0000)]
x86/vtx: Improvements to ept= command line handling
Switch parse_ept_param() to use the parse_boolean() infrastructure for more
consistency with related command line parameters. Rename opt_pml_enabled to
opt_ept_pml for consistency with opt_ept_ad, and switch it to being bool.
Drop the leading comment for parse_ept_param(). It is stale, and just repeats
the command line documentation.
For the command line documentation, rewrite it largely from scratch, updating
to the latest metadata style. Document A/D first, including a note about
AVR41, and modify PML to note its dependency on A/D.
No practical changes to behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Razvan Cojocaru [Sat, 22 Dec 2018 09:43:52 +0000 (09:43 +0000)]
p2m: change_type_range: Only invalidate mapped gfns
change_type_range() invalidates gfn ranges to lazily change the type
of a range of gfns, and also modifies the logdirty rangesets of that
p2m. At the moment, it clips both down by the hostp2m.
While this will result in correct behavior, it's not entirely efficient,
since invalidated entries outside that range will, on fault, simply be
modified back to "empty" before faulting normally again.
Separate out the calculation of the two ranges. Keep using the
hostp2m's max_mapped_pfn to clip the logdirty ranges, but use the
current p2m's max_mapped_pfn to further clip the invalidation range
for alternate p2ms.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Tested-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
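The separation of the two ranges might be pictured as follows. The struct and both functions are hypothetical, and the exact boundary conditions are assumptions rather than a copy of change_type_range(): the logdirty range is clipped by the host p2m's highest mapped pfn only, while the invalidation range for an alternate p2m is additionally clipped by that p2m's own highest mapped pfn.

```c
#include <stdbool.h>

struct clipped_sketch {
    bool valid;           /* false when nothing is left after clipping */
    unsigned long start;
    unsigned long end;    /* inclusive */
};

/* Clip [start, end_excl) to max_mapped_pfn, returning an inclusive end. */
static struct clipped_sketch clip(unsigned long start, unsigned long end_excl,
                                  unsigned long max_mapped_pfn)
{
    struct clipped_sketch r = { false, 0, 0 };

    if (start >= end_excl || start > max_mapped_pfn)
        return r;

    r.valid = true;
    r.start = start;
    r.end = end_excl - 1;
    if (r.end > max_mapped_pfn)
        r.end = max_mapped_pfn;

    return r;
}

/* Invalidation range for an altp2m: limited by both host and altp2m maxima. */
static struct clipped_sketch clip_invalidate(unsigned long start,
                                             unsigned long end_excl,
                                             unsigned long host_max,
                                             unsigned long alt_max)
{
    return clip(start, end_excl, alt_max < host_max ? alt_max : host_max);
}
```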
Razvan Cojocaru [Sat, 22 Dec 2018 09:43:52 +0000 (09:43 +0000)]
p2m: Always use hostp2m when clipping rangesets
The logdirty rangesets of the altp2ms need to be kept in sync with the
hostp2m. This means when iterating through the altp2ms, we need to
use the host p2m to clip the rangeset, not the individual altp2m's
value.
This change also:
- Documents that the end is non-inclusive
- Calculates an "inclusive" value for the end once, rather than
open-coding the modification, and (worse) back-modifying updates so
that the calculation ends up correct
- Clarifies the logic deciding whether to call
change_entry_type_global() or change_entry_type_range()
- Handles the case where start >= hostp2m->max_mapped_pfn
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Tested-by: Tamas K Lengyel <tamas@tklengyel.com>
Razvan Cojocaru [Sat, 22 Dec 2018 09:43:51 +0000 (09:43 +0000)]
x86/altp2m: fix display frozen when switching to a new view early
When a new altp2m view is created very early on guest boot, the
display will freeze (although the guest will run normally). This
may also happen on resizing the display. The reason is the way
Xen currently (mis)handles logdirty VGA: it intentionally
misconfigures VGA pages so that they will fault.
The problem is that it only does this in the host p2m. Once we
switch to a new altp2m, the misconfigured entries will no longer
fault, so the display will not be updated.
This patch:
* updates ept_handle_misconfig() to use the active altp2m instead
of the hostp2m;
* modifies p2m_change_entry_type_global(),
p2m_memory_type_changed(), p2m_change_type_range() and
p2m_finish_type_change() to propagate their changes to all
valid altp2ms.
With the introduction of altp2m fields in p2m_memory_type_changed()
the whole function has been put under CONFIG_HVM.
Suggested-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Tested-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Razvan Cojocaru [Sat, 22 Dec 2018 09:43:50 +0000 (09:43 +0000)]
x86/p2m: refactor p2m_reset_altp2m()
Refactor p2m_reset_altp2m() so that it can be used to remove
redundant codepaths, fixing the locking while we're at it.
The previous code now replaced by p2m_reset_altp2m(d, i,
ALTP2M_DEACTIVATE) calls did not set p2m->min_remapped_gfn
and p2m->max_remapped_gfn because in those cases the altp2m
idx was disabled; so before getting used again,
p2m_init_altp2m_ept() would get called, which resets them.
Always setting them in p2m_reset_altp2m(), while redundant,
is preferable to an extra conditional.
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Tested-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Razvan Cojocaru [Sat, 22 Dec 2018 09:43:49 +0000 (09:43 +0000)]
x86/p2m: allocate logdirty_ranges for altp2ms
For now, only do allocation/deallocation; keeping them in sync
will be done in subsequent patches.
Logdirty synchronization will only be done for active altp2ms;
so allocate logdirty rangesets (copying the host logdirty
rangeset) when an altp2m is activated, and free it when
deactivated.
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Tested-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
George Dunlap [Sat, 22 Dec 2018 08:59:48 +0000 (08:59 +0000)]
libxl/dm_depriv: Fix non-debug build
Apparently, with older versions of gcc, the uninitialized variable logic
gets confused when building with debug=n. Distros on which a
non-debug build will fail include:
- Centos 7
- Debian Jessie
- Ubuntu Trusty
It seems to be one particular path confusing the logic; so just set it
on that path to keep the compiler happy, while still catching other
potential paths where it might be unset.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:11 +0000 (15:41 +0000)]
libxl: Introduce specific username to be used as a reaper
Untrusted device models must be killed by uid rather than by pid for
safety. To do this reliably, we need another uid, not used for any
other purpose, from which to make the kill system call.
When using xen-qemuuser-range-base, we can repurpose
xen-qemuuser-range-base itself as a UID from which to kill other
devicemodel uids (since domain ID 0 should never have a device model
associated with it).
However, we'd like people to be able to use the device_model_user
feature without also defining xen-qemuuser-range-base (which requires
the ability to 'reserve' 32k+ user IDs).
To that end, introduce the xen-qemuuser-reaper id. When killing by
UID, first look for and use that ID if available; then fall back to
xen-qemuuser-range-base.
Document the new call in docs/features/qemu-deprivilege.pandoc.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
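The lookup order described above (dedicated reaper user first, then the range base) can be sketched as below. Only the two user names come from the commit message. The callback type, find_reaper_uid() and the demo_lookup_* functions with their uid values are hypothetical, introduced so the fallback logic can be shown without touching the system's user database.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup callback: true and *uid set when the user exists. */
typedef bool (*user_lookup_fn)(const char *name, unsigned int *uid);

/* Try the dedicated reaper user first, then fall back to the range base. */
static bool find_reaper_uid(user_lookup_fn lookup, unsigned int *uid)
{
    static const char *const candidates[] = {
        "xen-qemuuser-reaper",
        "xen-qemuuser-range-base",
    };

    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++)
        if (lookup(candidates[i], uid))
            return true;

    return false;
}

/* Demo lookups with made-up uids, standing in for a real passwd query. */
static bool demo_lookup_both(const char *name, unsigned int *uid)
{
    if (!strcmp(name, "xen-qemuuser-reaper")) { *uid = 900; return true; }
    if (!strcmp(name, "xen-qemuuser-range-base")) { *uid = 1000; return true; }
    return false;
}

static bool demo_lookup_base_only(const char *name, unsigned int *uid)
{
    if (!strcmp(name, "xen-qemuuser-range-base")) { *uid = 1000; return true; }
    return false;
}

static bool demo_lookup_none(const char *name, unsigned int *uid)
{
    (void)name;
    (void)uid;
    return false;
}
```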
George Dunlap [Fri, 21 Dec 2018 15:41:10 +0000 (15:41 +0000)]
libxl: Kill QEMU with "reaper" ruid
Using kill(-1) to kill an untrusted dm process with the real uid
equal to the dm_uid isn't guaranteed to succeed: the process in
question may be able to kill the reaper process after the setresuid()
and before the kill().
Instead, set the real uid to the QEMU user for domain 0
(QEMU_USER_RANGE_BASE + 0). The reaper process will still be able to
kill the dm process, but not vice versa.
This, in turn, requires locking to make sure that only one reaper
process is using that uid at a time; otherwise one reaper process may
kill the other reaper process.
Create a lockfile in RUNDIR/dm-reaper-lock, and grab the lock before
executing kill.
In the event that we can't get the lock for some reason, go ahead with
the kill using dm_uid for both real and effective UIDs. This isn't
guaranteed to work, but it's no worse than not trying to kill the
process at all.
NB that this effectively requires admins using device_model_user to
also define xen_qemuuser_range_base; this will be addressed in
subsequent patches.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:09 +0000 (15:41 +0000)]
libxl: Kill QEMU by uid when possible
The privcmd fd that a dm_restrict'ed QEMU has gives it permission to
one specific domain ID. This domain ID will probably eventually be
used again. It is therefore necessary to make absolutely sure that a
rogue QEMU process cannot hang around after its domain has exited.
Killing QEMU by pid is insufficient in this situation, because QEMU
may be able to fork() to escape killing. It is surprisingly tricky to
kill a process which can call fork() without races; the only reliable
way is to use kill(-1) to kill all processes with a given uid.
We can use this method only when we're sure that there's only one QEMU
instance per uid. Add a dm_uid into the domain_build_state struct,
and set it in libxl__domain_get_device_model_uid() when it's safe to
kill by UID. Store this in xenstore next to device-model-pid.
On domain destroy, check to see if device-model-uid is present in
xenstore. If so, fork off a reaper process, setuid to that uid, and
do kill(-9) to kill all uids of that type. Otherwise, carry on
destroying by pid.
While we're here, make libxl__destroy_device_model() consistently:
1. Return an error when anything fails
2. But continue to do as much clean-up as possible
NOTE that this is not yet completely safe: with ruid == dm_uid, the
device model may be able to kill(-9) the 'reaper' process before the
reaper process can kill it. Further patches will address this.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:08 +0000 (15:41 +0000)]
libxl: Make killing of device model asynchronous
Or at least, give it an asynchronous interface so that we can make it
actually asynchronous in subsequent patches.
Create state structures and callback function signatures. Add the
state structure to libxl__destroy_domid_state. Break
libxl__destroy_domid down into two functions.
No functional change intended.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:07 +0000 (15:41 +0000)]
libxl: Do root checks once in libxl__domain_get_device_model_uid
At the moment, we check for equivalence to literal "root" before
deciding whether to add the `runas` command-line option to QEMU. This
is unsatisfactory for several reasons.
First, just because the string doesn't match "root" doesn't mean the
final uid won't end up being zero; in particular, the range_base
calculations may end up producing "0:NNN", which would be root in any
case.
Secondly, it's almost certainly a configuration error if the resulting
uid ends up being zero; rather than silently do what was specified but
probably not intended, throw an error.
To fix this, check for root once in
libxl__domain_get_device_model_uid. If the result is root, return an
error; if appropriate, set `runas`.
After that, assume that the presence of state->dm_runas implies that a
`runas` argument should be constructed.
One side effect of this is to check whether device_model_user exists
before passing it to qemu, resulting in better error reporting.
While we're here:
- Refactor the function to use the "goto out" idiom
- Use 'rc' rather than 'ret', in line with CODING_STYLE
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:06 +0000 (15:41 +0000)]
dm_depriv: Describe expected usage of device_model_user parameter
A number of subsequent patches rely on as-yet undefined behavior for
what the `device_model_user` parameter does. Rather than implement it
incorrectly (or randomly), or remove the feature, describe an expected
usage for the feature. Further patches will make decisions based on
this expected usage.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:06 +0000 (15:41 +0000)]
libxl: Clean up userlookup_helper_getpw* helper
Bring conventions more in line with libxl__xs_read_checked():
- If found, return 0 and set pointer to non-NULL
- If not found, return 0 and set pointer to NULL
- On error, return libxl-style error number.
Update documentation to match.
Use CODING_STYLE compliant `r` rather than `ret`.
On error, log the error code before returning instead of discarding
it.
Now that it only returns 0 or errno, update caller error checks to be
`if (ret)` rather than `if (ret < 0)`.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
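The three-way convention (found, not found, error) maps naturally onto POSIX getpwnam_r(). The wrapper below is a sketch under assumptions, not the libxl helper: lookup_user_sketch() is a hypothetical name, and it returns the raw error number from getpwnam_r() rather than a libxl-style error code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pwd.h>
#include <stddef.h>

/*
 * Found:     returns 0 and sets *result to a non-NULL pointer.
 * Not found: returns 0 and sets *result to NULL.
 * Error:     returns the error number and sets *result to NULL.
 */
static int lookup_user_sketch(const char *name, struct passwd *pwd,
                              char *buf, size_t buflen,
                              struct passwd **result)
{
    int r;

    *result = NULL;
    do {
        r = getpwnam_r(name, pwd, buf, buflen, result);
    } while (r == EINTR);   /* retry if interrupted by a signal */

    if (r)
        *result = NULL;

    return r;
}
```

Callers can then test the return value with a plain `if (r)` and treat a NULL result as "no such user", without conflating it with a lookup failure.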
George Dunlap [Fri, 21 Dec 2018 15:41:05 +0000 (15:41 +0000)]
libxl: Get rid of support for QEMU_USER_BASE (xen-qemuuser-domidNN)
QEMU_USER_BASE allows a user to specify the UID to use when running
the devicemodel for a specific domain number. Unfortunately, this is
not really practical: It requires nearly 32,000 entries in
/etc/passwd. QEMU_USER_RANGE_BASE is much more practical.
Remove support for QEMU_USER_BASE.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:04 +0000 (15:41 +0000)]
libxl: Move dm user determination logic into a helper function
To reliably kill an untrusted devicemodel, we need to know not only
its pid, but its uid. In preparation for this, move the userid
determination logic into a helper function.
Create a new field, `dm_runas`, in libxl__domain_build_state to store
the value during domain creation.
This change also removes unnecessary duplication of the argument
construction code.
While here, clean up some minor CODING_STYLE infractions (space
between * and variable name).
No functional change intended.
While here, delete some trailing whitespace.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Benjamin Sanda [Fri, 21 Dec 2018 16:26:49 +0000 (16:26 +0000)]
xenalyze: Build for Both ARM and x86 Platforms
Modified to provide building of the xenalyze binary for both ARM and
x86 platforms. The xenalyze binary is now built as part of the BIN
list for both platforms.
Signed-off-by: Benjamin Sanda <ben.sanda@dornerworks.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Julien Grall [Fri, 21 Dec 2018 16:26:47 +0000 (16:26 +0000)]
xen/arm: Allow a privileged domain to map foreign page from DOMID_XEN
For an auto-translated domain, the only way to map a page to itself is
by using the foreign map API. The current code does not allow mapping a
page from a special domain (such as DOMID_XEN).
As xentrace buffers are shared using DOMID_XEN, it is not possible to use
tracing on Arm.
This can be solved by using the helper get_pg_owner(). This helper is
able to get a reference on DOMID_XEN and therefore allows the mapping
for a privileged domain.
This patch replaces the call to rcu_lock_domain_by_any_id() with
get_pg_owner(). For consistency, all calls to rcu_unlock_domain() are
replaced by put_pg_owner().
Julien Grall [Fri, 21 Dec 2018 16:26:46 +0000 (16:26 +0000)]
xen/arm: Make get_page_from_gfn work with DOMID_XEN
DOMID_XEN is used to share pages belonging to the hypervisor
(e.g. trace buffers). Unlike other domains, DOMID_XEN is a non-auto
translated domain and therefore does not have a P2M.
This patch adds a special case for DOMID_XEN in get_page_from_gfn. We
may want to provide "non-auto translated helpers" in the future if we
see more cases.
Julien Grall [Fri, 21 Dec 2018 16:26:45 +0000 (16:26 +0000)]
xen/arm: Add support for read-only foreign mappings
Currently, foreign mappings can only be read-write. A follow-up patch will
extend foreign mappings to Xen backend memory (via DOMID_XEN); some of
that memory should only be readable by the mapping domain.
Introduce a new p2m_type to cater for read-only foreign mappings. For now,
the decision between the two foreign mapping types is based on the type
of the guest page mapped.
Julien Grall [Fri, 21 Dec 2018 16:26:43 +0000 (16:26 +0000)]
xen/arm: p2m: Introduce p2m_get_page_from_gfn
In a follow-up patch, we will need to handle get_page_from_gfn
differently for DOMID_XEN. To keep the code simple, move the current
content into a new separate helper, p2m_get_page_from_gfn.
Note that the new helper is no longer a static inline function, as it
is quite complex.
Finally, take the opportunity to use typesafe gfn as the change is
minor.
Benjamin Sanda [Tue, 23 Oct 2018 15:21:35 +0000 (16:21 +0100)]
xen/page_alloc: Move get_pg_owner()/put_pg_owner() from x86 to common code
get_pg_owner() and put_pg_owner() will be necessary in a follow-up
commit to support xentrace on Arm. So move the helpers to common code.
Take the opportunity to clean up the moved code a bit.
Signed-off-by: Benjamin Sanda <ben.sanda@dornerworks.com>
[julien: Rework commit title / turn put_pg_owner to a static inline] Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 21 Dec 2018 07:57:31 +0000 (08:57 +0100)]
x86emul: support AVX512{F,BW} shift/rotate insns
Note that simd_packed_fp for the opcode space 0f38 major opcodes 14 and
15 is not really correct, but sufficient for the purposes here. Further
adjustments may later be needed for the down conversion unsigned
saturating VPMOV* insns, first and foremost for the different Disp8
scaling those ones use.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 21 Dec 2018 07:56:35 +0000 (08:56 +0100)]
x86emul: rename evex.br to evex.brs
This is to better reflect that it's an abbreviation for "broadcast,
rounding, or SAE" rather than just "broadcast".
Take the opportunity and also add SDM naming comments to both union vex
and union evex.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
xen/arm: zynqmp: introduce zynqmp specific defines
Introduce zynqmp specific defines for the firmware calls.
See EEMI:
https://www.xilinx.com/support/documentation/user_guides/ug1200-eemi-api.pdf
The error codes are described under XilPM Error Codes:
https://www.xilinx.com/support/documentation/user_guides/ug1137-zynq-ultrascale-mpsoc-swdev.pdf
- pm_api_id
These are the EEMI function IDs. Unavoidable.
- pm_ret_status
These are the EEMI return statuses. Unavoidable.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
xen/arm: zynqmp: Forward platform specific firmware calls
Introduce zynqmp_eemi: a function responsible for implementing access
controls over the firmware calls. Only calls that are allowed are
forwarded to the firmware.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> Acked-by: Julien Grall <julien.grall@arm.com>
Introduce platform_smc as a way to handle firmware calls that Xen does
not know about in a platform specific way. This is particularly useful
for implementing the SiP (SoC implementation specific) service calls.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> Acked-by: Julien Grall <julien.grall@arm.com>
Julien Grall [Tue, 18 Dec 2018 13:07:39 +0000 (13:07 +0000)]
xen/arm: Stop relocating Xen
At the moment, Xen is relocated towards the end of the memory. While
this has the advantage of freeing space in low memory, the code is not
compliant with break-before-make because it requires switching
between two sets of page-tables. This is not entirely trivial to fix as
it would require us to go through an identity mapping and disable the MMU.
Furthermore, it looks like some platforms (such as the Hikey960)
may not be able to bring up secondary CPUs if the entry is too high.
While Xen should be quite tiny (< 2MB), the current algorithm to
allocate Dom0 memory will allocate memory chunks of at least 128MB.
Those memory chunks will always be 128MB. This means that depending on
where the modules are loaded, an extra 128MB may disappear.
As there are up to 4 modules (initramfs, XSM, kernel, DTB) loaded in
low memory, the problem is not entirely new: you could already waste
512MB of low memory. The right solution would be to fix the allocation
algorithm, but this is independent of this patch.
For users in control of the memory (such as in U-boot), all modules
should be loaded as close together as possible or outside low memory (i.e.
above 4GB). For other users (i.e. Grub/UEFI), I believe the bootloader is
already keeping everything together.
Based on the above, it would be fine to stop relocating Xen. This has
the advantage of simplifying the code and should speed up the boot as
relocation is no longer necessary.
Note that the break-before-make issue is not fixed by this patch.
Julien Grall [Tue, 18 Dec 2018 18:04:17 +0000 (18:04 +0000)]
xen/arm: Track page accessed between batch of Set/Way operations
At the moment, the implementation of Set/Way operations will go through
all the entries of the guest P2M and flush them. However, this is very
expensive and may render a guest OS using them unusable.
For instance, 32-bit Linux will use Set/Way operations during secondary
CPU bring-up. As the implementation is really expensive, it may be possible
to hit the CPU bring-up timeout.
To limit the Set/Way impact, we track which pages of the guest have
been accessed between batches of Set/Way operations. This is done
using bit[0] (aka valid bit) of the P2M entry.
This patch adds a new per-arch helper to perform actions just
before the guest is first unpaused. This will be used to invalidate the
P2M to track access from the start of the guest.
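As a rough illustration of the tracking idea, the toy model below clears bit[0] of every entry, sets it again when an access faults, and only cleans entries that have the bit set. Everything here (`struct toy_p2m`, the function names, the flat 8-entry table, the `flushed` counter) is an assumption made for the sketch and not the Xen P2M code.

```c
#include <stddef.h>
#include <stdint.h>

#define ENTRY_VALID (UINT64_C(1) << 0)  /* bit[0] of a toy entry */
#define NR_ENTRIES  8                   /* toy table size */

/* Hypothetical flat table standing in for the guest P2M. */
struct toy_p2m {
    uint64_t entry[NR_ENTRIES];
    unsigned int flushed;               /* entries cleaned so far */
};

/* Clear the valid bit everywhere so the next access to a page faults. */
void toy_invalidate_all(struct toy_p2m *p2m)
{
    size_t i;

    for (i = 0; i < NR_ENTRIES; i++)
        p2m->entry[i] &= ~ENTRY_VALID;
}

/* Fault path: re-validating the entry also records that it was accessed. */
void toy_handle_fault(struct toy_p2m *p2m, size_t idx)
{
    p2m->entry[idx] |= ENTRY_VALID;
}

/*
 * Set/Way batch: clean only the entries accessed since the last
 * invalidation, then invalidate again to restart the tracking.
 */
void toy_flush_accessed(struct toy_p2m *p2m)
{
    size_t i;

    for (i = 0; i < NR_ENTRIES; i++)
        if (p2m->entry[i] & ENTRY_VALID)
            p2m->flushed++;             /* stand-in for clean & invalidate */

    toy_invalidate_all(p2m);
}
```

In this model the cost of a flush is proportional to the number of pages touched since the previous batch rather than to the size of the table.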
Julien Grall [Tue, 18 Dec 2018 18:04:16 +0000 (18:04 +0000)]
xen/arm: Implement Set/Way operations
Set/Way operations are used to perform maintenance on a given cache.
At the moment, Set/Way operations are not trapped and therefore a guest
OS will directly act on the local cache. However, a vCPU may migrate to
another pCPU in the middle of the process. This will result in
caches with stale data (Set/Way operations are not propagated), potentially
causing crashes. This may be the cause of the heisenbug noticed in Osstest [1].
Furthermore, Set/Way operations are not available on system caches. This
means that an OS, such as 32-bit Linux, relying on those operations to
fully clean the cache before disabling the MMU may break because data may
sit in system caches and not in RAM.
For more details about Set/Way, see the talk "The Art of Virtualizing
Cache Maintenance" given at Xen Summit 2018 [2].
In the context of Xen, we need to trap Set/Way operations and emulate
them. From the Arm Arm (B1.14.4 in DDI 046C.c), Set/Way operations are
difficult to virtualize. So we can assume that a guest OS using them will
suffer the consequence (i.e. slowness) until developers remove all usage
of Set/Way.
As the software is not allowed to infer the Set/Way to Physical Address
mapping, Xen will need to go through the guest P2M and clean &
invalidate all the entries mapped.
Because Set/Way operations happen in batches (a loop over all Sets/Ways of
a cache), Xen would need to go through the P2M for every instruction. This
is quite expensive and would severely impact the guest OS. The
implementation re-uses the KVM policy to limit the number of flushes:
- If we trap a Set/Way operation, we enable VM trapping (i.e.
HCR_EL2.TVM) to detect the cache being turned on/off, and do a full
clean.
- We clean the caches when turning on and off
- Once the caches are enabled, we stop trapping VM instructions
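The policy in the list above can be modelled as a small state machine. This is a sketch under assumptions, not the Xen or KVM code: `struct toy_vcpu` and the function names are hypothetical, `trap_vm` stands in for the TVM trap bit, and skipping the flush when trapping is already enabled is one reading of how the policy limits the number of flushes.

```c
#include <stdbool.h>

/* Hypothetical per-vCPU state for the sketch. */
struct toy_vcpu {
    bool trap_vm;               /* models the TVM trap being enabled */
    bool cache_enabled;         /* guest's view of its cache enable bit */
    unsigned int full_flushes;  /* number of full cleans performed */
};

/* Stand-in for walking the P2M and cleaning every mapped entry. */
static void toy_full_flush(struct toy_vcpu *v)
{
    v->full_flushes++;
}

/* A Set/Way operation trapped: flush once, then watch for cache toggling. */
void toy_set_way_trapped(struct toy_vcpu *v)
{
    if (!v->trap_vm) {
        toy_full_flush(v);
        v->trap_vm = true;
    }
}

/* A trapped VM register write may have turned the guest cache on or off. */
void toy_cache_toggled(struct toy_vcpu *v, bool now_enabled)
{
    if (now_enabled != v->cache_enabled)
        toy_full_flush(v);

    v->cache_enabled = now_enabled;

    /* Once the caches are enabled again, stop trapping VM registers. */
    if (now_enabled)
        v->trap_vm = false;
}
```

A whole loop of Set/Way instructions then costs one full flush instead of one per instruction, at the price of trapping VM register writes until the caches are back on.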
Julien Grall [Tue, 18 Dec 2018 18:04:15 +0000 (18:04 +0000)]
xen/arm: vsysreg: Add wrapper to handle sysreg access trapped by HCR_EL2.TVM
A follow-up patch will require emulating some accesses to system
registers trapped by HCR_EL2.TVM. When set, all NS EL1 writes to the
virtual memory control registers will be trapped to the hypervisor.
This patch adds the infrastructure to pass through the access to the host
registers.
Note that HCR_EL2.TVM will be set in a follow-up patch dynamically.
Julien Grall [Tue, 18 Dec 2018 18:04:14 +0000 (18:04 +0000)]
xen/arm: vcpreg: Add wrappers to handle co-proc access trapped by HCR_EL2.TVM
A follow-up patch will require emulating some accesses to some
co-processor registers trapped by HCR_EL2.TVM. When set, all NS EL1 writes
to the virtual memory control registers will be trapped to the hypervisor.
This patch adds the infrastructure to pass through the access to host
registers. For convenience a bunch of helpers have been added to
generate the different helpers.
Note that HCR_EL2.TVM will be set in a follow-up patch dynamically.
Julien Grall [Mon, 26 Nov 2018 16:12:01 +0000 (16:12 +0000)]
xen/arm: p2m: Add support for preemption in p2m_cache_flush_range
p2m_cache_flush_range does not yet support preemption. This may be an
issue as cleaning the cache can take a long time. While the current
caller (XEN_DOMCTL_cacheflush) does not strictly require preemption, this
will be necessary for a new caller in a follow-up patch.
The preemption implemented is quite simple: a counter is incremented by:
- 1 on region skipped
- 10 for each page requiring a flush
When the counter reaches 512 or above, we will check if preemption is
needed. If not, the counter will be reset to 0. If yes, the function
will stop, update start (to allow resuming later on) and return
-ERESTART. This allows the caller to decide how the preemption will be
done.
For now, XEN_DOMCTL_cacheflush will continue to ignore the preemption.
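The counter scheme described above can be sketched as below. This is a simplified model, not the Xen function: `toy_flush_range`, `enum region_kind`, the `preempt_pending` flag (standing in for the real preemption check) and the numeric value of `TOY_ERESTART` are assumptions made for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_ERESTART      85   /* hypothetical error number */
#define PREEMPT_THRESHOLD 512

enum region_kind { REGION_SKIP, REGION_FLUSH };

/*
 * Walk regions[*start .. end). Returns 0 when the whole range has been
 * handled, or -TOY_ERESTART with *start updated to the resume point.
 */
int toy_flush_range(const enum region_kind *regions, size_t *start,
                    size_t end, bool preempt_pending)
{
    unsigned int count = 0;
    size_t i;

    for (i = *start; i < end; i++) {
        /* Only look at the (possibly costly) preemption state now and then. */
        if (count >= PREEMPT_THRESHOLD) {
            if (preempt_pending) {
                *start = i;
                return -TOY_ERESTART;
            }
            count = 0;
        }

        /* 1 for a skipped region, 10 for a page that needs a flush. */
        count += (regions[i] == REGION_FLUSH) ? 10 : 1;
        /* The real code would clean the page here. */
    }

    *start = end;
    return 0;
}
```

Because the function only reports -TOY_ERESTART and the resume point, the caller stays free to decide how to continue (or, like XEN_DOMCTL_cacheflush for now, to ignore it).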
Andrew Cooper [Mon, 19 Feb 2018 13:35:58 +0000 (13:35 +0000)]
x86/pv: Expose RDTSCP to PV guests
The final remnant of PVRDTSCP is that we would emulate RDTSCP even on
hardware which lacked the instruction. RDTSCP is available on almost all
64-bit x86 hardware.
Remove this emulation, drop the TSC_MODE_PVRDTSCP constant, and allow RDTSCP
in a PV guest's CPUID policy.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 15 Nov 2018 21:04:37 +0000 (21:04 +0000)]
x86/msr: Handle MSR_TSC_AUX consistently for PV and HVM guests
With PVRDTSCP mode removed, handling of MSR_TSC_AUX can move into the common
code. Move its storage into struct vcpu_msrs (dropping the HVM-specific
msr_tsc_aux), and add an RDPID feature check as this bit also enumerates the
presence of the MSR.
Introduce cpu_has_rdpid along with the synthesized cpu_has_msr_tsc_aux to
correct the context switch paths, as MSR_TSC_AUX is enumerated by either
RDTSCP or RDPID.
Drop hvm_msr_tsc_aux() entirely, and use v->arch.msrs->tsc_aux directly.
Update hvm_load_cpu_ctxt() to check that the incoming ctxt.msr_tsc_aux isn't
out of range. In practice, no previous version of Xen ever wrote an
out-of-range value. Add MSR_TSC_AUX to the list of MSRs migrated for PV
guests, but leave the HVM path using the existing space in hvm_hw_cpu.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Mon, 19 Feb 2018 14:27:04 +0000 (14:27 +0000)]
x86/pv: Remove deferred RDTSC{,P} handling in pv_emulate_privileged_op()
As noted in c/s 4999bf3e8b "x86/PV: use generic emulator for privileged
instruction handling", these hoops are jumped through to retain the older
behaviour, along with a note suggesting that we should reconsider things.
Part of the reason for retention of the old behaviour was removed by c/s
5b04262079 "x86/time: Rework pv_soft_rdtsc() to aid further cleanup" which in
particular caused it to not write regs->rcx directly.
It does not matter exactly when pv_soft_rdtsc() is called, as Xen's behaviour
is an opaque atomic action from the guest's point of view.
Drop all the deferral logic, and leave TSC_AUX uniformly at 0 as PVRDTSCP mode
is being removed. Later changes will make this behave architecturally.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
PVRDTSCP was an attempt to provide Xen-aware userspace with a stable monotonic
clock, and enough information for said userspace to cope with frequency
changes across migrate. However, the PVRDTSCP infrastructure has resulted in
very tangled code, and non-architectural behaviour even in non-PVRDTSCP cases.
Seeing as the functionality has been replaced entirely by improvements in PV
clocks (including being plumbed into the VDSO nowadays), or alternatively by
hardware TSC scaling features, and no-one is aware of any users of this mode,
take the opportunity to remove it.
For now, drop TSC_MODE_PVRDTSCP from tsc_{get,set}_info(). This will catch
and cleanly reject attempts to migrate in a VM configured to use PVRDTSCP,
rather than letting it run and have the wrong timing mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 13 Dec 2018 23:51:41 +0000 (15:51 -0800)]
tools/docs: Remove PVRDTSCP support
PVRDTSCP is believed-unused, and its implementation has adverse consequences
on unrelated functionality in the hypervisor. As a result, support is being
removed.
Modify libxl to provide a slightly more helpful error message if it encounters
PVRDTSCP being selected. While adjusting TSC handling, make libxl check for
errors from the set_tsc hypercall.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Tue, 20 Nov 2018 11:37:19 +0000 (11:37 +0000)]
x86/time: Alter tsc_set_info() to return an error value
Currently, tsc_set_info() performs no parameter checking, and an invalid
tsc_mode goes largely unnoticed. Fix it to reject invalid tsc_modes with
-EINVAL, and update the callers to check the return value.
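A minimal sketch of the "validate, then commit" shape this describes is shown below. The names `toy_tsc_set_info`, `struct toy_domain` and the `TOY_TSC_*` constants are hypothetical and do not reproduce the hypervisor's definitions.

```c
#include <errno.h>

/* Hypothetical mode constants for the sketch. */
enum toy_tsc_mode {
    TOY_TSC_DEFAULT = 0,
    TOY_TSC_ALWAYS_EMULATE = 1,
    TOY_TSC_NEVER_EMULATE = 2,
};

/* Hypothetical per-domain state. */
struct toy_domain {
    unsigned int tsc_mode;
};

/* Reject unknown modes before touching any domain state. */
int toy_tsc_set_info(struct toy_domain *d, unsigned int tsc_mode)
{
    switch (tsc_mode) {
    case TOY_TSC_DEFAULT:
    case TOY_TSC_ALWAYS_EMULATE:
    case TOY_TSC_NEVER_EMULATE:
        break;

    default:
        return -EINVAL;
    }

    d->tsc_mode = tsc_mode;

    return 0;
}
```

Callers then need to propagate the return value instead of assuming the call cannot fail.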
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 10 Dec 2018 11:42:13 +0000 (11:42 +0000)]
x86/hvm: Disallow moving the APIC MMIO window
See the code comment for a full discussion, but in short: guests which
currently run under Xen don't move the window, because moving it has never
worked properly. Implementing support for moving the window is never going to
work architecturally unless we switch to per-vcpu P2Ms (which seems very
unlikely), and would still be a substantial quantity of work for a feature
which is unused in practice.
Take the opportunity to rename vlapic_msr_set() to be consistent with the
other MSR handling functions, and return X86EMUL_* constants. Add logic to
check for reserved bits, including refusing x2APIC mode if it has not been
offered to the guest. Move the guest_{rd,wr}msr_x2apic() declarations into
vlapic.h which is a more appropriate place for them to live.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 18 Dec 2018 14:21:17 +0000 (15:21 +0100)]
x86emul: avoid triggering assertions with VME/PVI early #GP check
In commit efe9cba66c ("x86emul: VME and PVI modes require a #GP(0) check
first thing") I neglected the fact that the retire flags get zapped only
in x86_decode(), which hasn't been invoked yet at the point of the #GP(0)
check added. Move output state initialization into a helper function,
and invoke it from the callers of x86_decode() instead of doing it
(possibly too late) in that function.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 18 Dec 2018 14:19:47 +0000 (15:19 +0100)]
x86emul: work around SandyBridge errata
There are a number of exception condition related errata on SandyBridge
CPUs, some of which are unexpected #UD (others, of no interest here, are
lack of mandated exceptions, or exceptions of unexpected type). Annotate
the one workaround we already have, and add two more.
Due to the exception recovery we have in place for stub invocations
these aren't security issues.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 18 Dec 2018 13:27:09 +0000 (14:27 +0100)]
x86emul: fix 3-operand IMUL
While commit 75066cd4ea ("x86emul: fix {,i}mul and {,i}div") indeed did
as its title says, it broke the 3-operand form by uniformly using AL/AX/
EAX/RAX as second source operand. Fix this and add tests covering both
cases.
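For readers unfamiliar with the two forms, the difference in source operands can be modelled as below for the 32-bit case. This is a sketch of the architectural semantics only, with hypothetical function names; it is not the emulator code, and flag updates are left out.

```c
#include <stdint.h>

/*
 * One-operand form: the accumulator (EAX) is an implicit source and the
 * full 64-bit product is produced (EDX:EAX).
 */
int64_t toy_imul_1op(int32_t eax, int32_t rm)
{
    return (int64_t)eax * rm;
}

/*
 * Three-operand form: dest = r/m * imm. The accumulator is not involved;
 * only the low 32 bits of the product are kept.
 */
uint32_t toy_imul_3op(int32_t rm, int32_t imm)
{
    return (uint32_t)((int64_t)rm * imm);
}
```

Using the accumulator as a source in the second function would be the kind of mistake the commit message describes.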
Reported-by: Andrei Lutas <vlutas@bitdefender.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 18 Dec 2018 13:26:44 +0000 (14:26 +0100)]
x86emul/test: drop another instance of .byte
Now that we require use of the {evex} pseudo-prefix, we can also use
the q-suffixed encoding of VPCMPESTRI, which is available as of 2.29
just like {evex} is.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 30 Nov 2018 16:14:08 +0000 (16:14 +0000)]
x86/hvm: Corrections to RDTSCP intercept handling
For both VT-x and SVM, the RDTSCP intercept will trigger if the pipeline
supports the instruction, but the guest may not have RDTSCP in its featureset.
Bring the vmexit handlers in line with the main emulator behaviour by
optionally handing back #UD.
Next on the AMD side, if RDTSCP actually ends up being intercepted on a debug
build or first-gen SVM hardware which lacks NRIP, we first update regs->rcx,
then call __get_instruction_length() asking for RDTSC. As the two
instructions are different (and indeed, different lengths!),
__get_instruction_length_from_list() fails and hands back a #GP fault.
This can be demonstrated by putting a guest into tsc_mode="always emulate" and
executing an RDTSCP instruction:
(d1) --- Xen Test Framework ---
(d1) Environment: HVM 64bit (Long mode 4 levels)
(d1) Test rdtscp
(d1) TSC mode 1
(XEN) emulate.c:147:d1v0 __get_instruction_length: Mismatch between expected and actual instruction:
(XEN) emulate.c:152:d1v0 insn_index 8, opcode 0xf0031 modrm 0
(XEN) emulate.c:154:d1v0 rip 0x10475f, nextrip 0x104762, len 3
(XEN) SVM insn len emulation failed (1): d1v0 64bit @ 0008:0010475f -> 0f 01 f9 0f 31 5b 31 ff 31 c0 e9 c2 db ff ff 00
(d1) ******************************
(d1) PANIC: Unhandled exception at 0008:000000000010475f
(d1) Vec 13 #GP[0000]
(d1) ******************************
First, teach __get_instruction_length() to cope with RDTSCP, and improve
svm_vmexit_do_rdtsc() to ask for the correct instruction. Move the regs->rcx
adjustment into this function to ensure it gets done after we are done
potentially raising faults.
Reported-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Juergen Gross [Mon, 10 Dec 2018 11:44:22 +0000 (12:44 +0100)]
xen: add CONFIG item for default dom0 memory size
Now that a dom0_mem value can be specified depending on host memory
size on x86, make it easy for distros to specify a default dom0 size by
adding a CONFIG_DOM0_MEM item which presets the dom0_mem boot parameter
value.
It will be used only if no dom0_mem parameter was specified in the
boot parameters.
Andrii Anisov [Wed, 12 Dec 2018 18:20:55 +0000 (20:20 +0200)]
arm/irq: skip action availability check for non-debug build
With desc->lock taken:
An IRQ with the _IRQ_GUEST flag set always has an action.
An IRQ with the _IRQ_DISABLED flag cleared always has an action.
Those flag checks cover all accesses to desc->action in do_IRQ,
so we can skip the desc->action check in non-debug builds.
Keep it in place for debug builds to help diagnose potential
misconfiguration.
Andrii Anisov [Wed, 12 Dec 2018 18:20:54 +0000 (20:20 +0200)]
gic-vgic: Drop an excessive clear_lrs
This action is excessive because for an invalid LR there is no need
to write another invalid value to a register. So we can skip it here,
saving a peripheral register write.
Keep clearing the LR for the DEBUG build. This would make dumped
invalid LRs be zero. That is more obvious than picking state bits
from a non-zero value.
Paul Durrant [Thu, 13 Dec 2018 11:01:50 +0000 (12:01 +0100)]
amd-iommu: remove page merging code
The page merging logic makes use of bits 1-8 and bit 63 of a PTE, which
used to be specified as 'ignored'. However, bits 5 and 6 are now specified
as 'accessed' and 'dirty' bits and their use only remains safe as long as
the DTE 'Host Access Dirty' bits remain unused by Xen, or by hardware
before the domain starts running. (XSA-275 disabled the operation of the
code after domain creation completes).
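The overlap described above is easy to check with masks. The macro and function names below are made up for this illustration; only the bit positions come from the text.

```c
#include <stdint.h>

#define BIT64(n) (UINT64_C(1) << (n))

/* Bits used by the page merging logic: 1-8 and 63. */
#define MERGE_BITS   (((BIT64(9) - 1) & ~BIT64(0)) | BIT64(63))

/* Bits now specified as 'accessed' and 'dirty'. */
#define PTE_ACCESSED BIT64(5)
#define PTE_DIRTY    BIT64(6)

/* Returns the PTE bits claimed by both the merging code and the A/D bits. */
uint64_t toy_merge_conflicts(void)
{
    return MERGE_BITS & (PTE_ACCESSED | PTE_DIRTY);
}
```

Both bits 5 and 6 fall inside the range the merging code used, which is why that use is only safe while the accessed and dirty bits stay unused.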
With the page merging logic present in its current form there are no spare
ignored bits in the PTE at all, but PV-IOMMU support will require at least
one spare bit to track which PTEs are added by hypercall.
This patch removes the code, freeing up the remaining PTE ignored bits
for other use, including PV-IOMMU support, as well as significantly
simplifying and shortening the source by ~170 lines. There may be some
marginal performance cost (but none has been observed in manual testing
with a passed-through NVIDIA GPU) since higher order mappings will now be
ruled out until a mapping order parameter is passed to iommu_ops. That will
be dealt with by a subsequent patch though.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
Julien Grall [Thu, 29 Nov 2018 11:37:43 +0000 (11:37 +0000)]
xen/arm: mm: Set-up page permission for Xen mappings earlier on
The Xen mapping is first created using a 2MB page and then shattered into
4KB pages for fine-grained permissions. However, it is not safe to break
down a superpage without going through an intermediate step invalidating
the entry.
As we are changing Xen mappings, we cannot go through the intermediate
step. The only solution is to create the Xen mapping using 4KB entries
directly. As Xen should always access the mappings according to
the runtime permissions, it is then possible to set up the permissions
while creating the mapping.
We are still playing with fire as there are still some
break-before-make issues in setup_pagetables (i.e. switching between 2 sets
of page-tables). But it should be slightly better than the current state.