Wei Liu [Tue, 5 Feb 2019 12:02:00 +0000 (12:02 +0000)]
x86: switch root_pgt to mfn_t and use new APIs
This then requires moving declaration of root page table mfn into mm.h
and modify setup_cpu_root_pgt to have a single exit path.
We also need to force map_domain_page to use direct map when switching
per-domain mappings. This is contrary to our end goal of removing
direct map, but this will be removed once we make map_domain_page
context-switch safe in another (large) patch series.
Wei Liu [Tue, 29 Jan 2019 14:40:26 +0000 (14:40 +0000)]
x86_64/mm: switch to new APIs in paging_init
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyax@amazon.com>
---
Changed since v1:
* use a global mapping for compat_idle_pg_table_l2, otherwise
l2_ro_mpt will unmap it.
Wei Liu [Tue, 29 Jan 2019 12:59:55 +0000 (12:59 +0000)]
x86/mm: change pl3e to l3t in virt_to_xen_l3e
We will need to have a variable named pl3e when we rewrite
virt_to_xen_l3e. Change pl3e to l3t to reflect better its purpose.
This will make reviewing later patch easier.
Wei Liu [Tue, 29 Jan 2019 12:57:35 +0000 (12:57 +0000)]
x86/mm: change pl1e to l1t in virt_to_xen_l1e
We will need to have a variable named pl1e when we rewrite
virt_to_xen_l1e. Change pl1e to l1t to reflect better its purpose.
This will make reviewing later patch easier.
Wei Liu [Tue, 29 Jan 2019 12:54:48 +0000 (12:54 +0000)]
x86/mm: change pl2e to l2t in virt_to_xen_l2e
We will need to have a variable named pl2e when we rewrite
virt_to_xen_l2e. Change pl2e to l2t to reflect better its purpose.
This will make reviewing later patch easier.
Wei Liu [Mon, 28 Jan 2019 18:10:10 +0000 (18:10 +0000)]
x86/mm: introduce l{1,2}t local variables to modify_xen_mappings
The pl2e and pl1e variables are heavily (ab)used in that function. It
is fine at the moment because all page tables are always mapped so
there is no need to track the life time of each variable.
We will soon have the requirement to map and unmap page tables. We
need to track the life time of each variable to avoid leakage.
Introduce some l{1,2}t variables with limited scope so that we can
track life time of pointers to xen page tables more easily.
Wei Liu [Mon, 28 Jan 2019 17:54:24 +0000 (17:54 +0000)]
x86/mm: introduce l{1,2}t local variables to map_pages_to_xen
The pl2e and pl1e variables are heavily (ab)used in that function. It
is fine at the moment because all page tables are always mapped so
there is no need to track the life time of each variable.
We will soon have the requirement to map and unmap page tables. We
need to track the life time of each variable to avoid leakage.
Introduce some l{1,2}t variables with limited scope so that we can
track life time of pointers to xen page tables more easily.
Wei Liu [Wed, 23 Jan 2019 15:33:07 +0000 (15:33 +0000)]
x86: introduce a new set of APIs to manage Xen page tables
We are going to switch to using domheap page for page tables.
A new set of APIs is introduced to allocate, map, unmap and free pages
for page tables.
The allocation and deallocation work on mfn_t but not page_info,
because they are required to work even before frame table is set up.
Implement the old functions with the new ones. We will rewrite, site
by site, other mm functions that manipulate page tables to use the new
APIs.
Note these new APIs still use xenheap page underneath and no actual
map and unmap is done so that we don't break xen half way. They will
be switched to use domheap and dynamic mappings when usage of old APIs
is eliminated.
Commit f855dd9625 "sched: add minimalistic idle scheduler for free cpus"
introduce the use of ZERO_BLOCK_PTR in the scheduler code. However, the
define does not exist outside of xmalloc_tsf.c for non-x86 architecture.
This will result to a compilation error on Arm:
schedule.c: In function ‘sched_idle_alloc_vdata’:
schedule.c:100:12: error: ‘ZERO_BLOCK_PTR’ undeclared (first use in this function)
return ZERO_BLOCK_PTR;
^~~~~~~~~~~~~~
schedule.c:100:12: note: each undeclared identifier is reported only once for each function it appears in
schedule.c:101:1: error: control reaches end of non-void function [-Werror=return-type]
}
^
cc1: all warnings being treated as errors
To avoid the compilation error, the default definition for
ZERO_BLOCK_PTR is now moved in xen/config.h allowing all the code to use
the define.
Fixes: f855dd9625 ('sched: add minimalistic idle scheduler for free cpus') Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
sched: add minimalistic idle scheduler for free cpus
Instead of having a full blown scheduler running for the free cpus
add a very minimalistic scheduler for that purpose only ever scheduling
the related idle vcpu. This has the big advantage of not needing any
per-cpu, per-domain or per-scheduling unit data for free cpus and in
turn simplifying moving cpus to and from cpupools a lot.
Right now, CPUs that are not in any pool, still belong to Pool-0's
scheduler. This forces us to make, within the scheduler, extra effort
to avoid actually running vCPUs on those.
In the case of Credit1, this also cause issue to weights
(re)distribution, as the number of CPUs available to the scheduler is
wrong.
This is described in the changelog of commit e7191920261d ("xen:
credit2: never consider CPUs outside of our cpupool").
This new scheduler will just use a common lock for all free cpus.
As this new scheduler is not user selectable don't register it as an
official scheduler, but just include it in schedule.c.
Today a cpu which is removed from the system is taken directly from
Pool0 to the offline state. This will conflict with the new idle
scheduler, so remove it from Pool0 first. Additionally accept removing
a free cpu instead of requiring it to be in Pool0.
For the resume failed case we need to call the scheduler code for that
situation after the cpupool handling, so move the scheduler code into
a function and call it from cpupool_cpu_remove_forced() and remove the
CPU_RESUME_FAILED case from cpu_schedule_callback().
Note that we are calling now schedule_cpu_switch() in stop_machine
context so we need to switch from spinlock_irq to spinlock_irqsave.
Jan Beulich [Tue, 24 Sep 2019 08:50:33 +0000 (10:50 +0200)]
libxc/x86: avoid certain overflows in CPUID APIC ID adjustments
Recent AMD processors may report up to 128 logical processors in CPUID
leaf 1. Doubling this value produces 0 (which OSes sincerely dislike),
as the respective field is only 8 bits wide. Suppress doubling the value
(and its leaf 0x80000008 counterpart) in such a case.
Note that while there's a similar overflow in intel_xc_cpuid_policy(),
that one is being left alone for now.
Note further that while it was considered to suppress the multiplication
by 2 altogether if the host topology already provides at least one bit
of thread ID within APIC IDs, it was decided to avoid more change here
than really needed at this point.
Also zap leaf 4 (and at the same time leaf 2) EDX output for AMD, as it
should have been from the beginning.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Alexandru Isaila [Tue, 24 Sep 2019 08:49:36 +0000 (10:49 +0200)]
x86/emulate: send vm_event from emulate
A/D bit writes (on page walks) can be considered benign by an introspection
agent, so receiving vm_events for them is a pessimization. We try here to
optimize by filtering these events out.
Currently, we are fully emulating the instruction at RIP when the hardware sees
an EPT fault with npfec.kind != npfec_kind_with_gla. This is, however,
incorrect, because the instruction at RIP might legitimately cause an
EPT fault of its own while accessing a _different_ page from the original one,
where A/D were set.
The solution is to perform the whole emulation, while ignoring EPT restrictions
for the walk part, and taking them into account for the "actual" emulating of
the instruction at RIP. When we send out a vm_event, we don't want the emulation
to complete, since in that case we won't be able to veto whatever it is doing.
That would mean that we can't actually prevent any malicious activity, instead
we'd only be able to report on it.
When we see a "send-vm_event" case while emulating, we need to first send the
event out and then suspend the emulation (return X86EMUL_RETRY).
After the emulation stops we'll call hvm_vm_event_do_resume() again after the
introspection agent treats the event and resumes the guest. There, the
instruction at RIP will be fully emulated (with the EPT ignored) if the
introspection application allows it, and the guest will continue to run past
the instruction.
A common example is if the hardware exits because of an EPT fault caused by a
page walk, p2m_mem_access_check() decides if it is going to send a vm_event.
If the vm_event was sent and it would be treated so it runs the instruction
at RIP, that instruction might also hit a protected page and provoke a vm_event.
Now if npfec.kind == npfec_kind_in_gpt and d->arch.monitor.inguest_pagefault_disabled
is true then we are in the page walk case and we can do this emulation optimization
and emulate the page walk while ignoring the EPT, but don't ignore the EPT for the
emulation of the actual instruction.
In the first case we would have 2 EPT events, in the second case we would have
1 EPT event if the instruction at the RIP triggers an EPT event.
We use hvmemul_map_linear_addr() to intercept write access and
__hvm_copy() to intercept exec, read and write access.
A new return type was added, HVMTRANS_need_retry, in order to have all
the places that consume HVMTRANS* return X86EMUL_RETRY.
hvm_emulate_send_vm_event() can return false if there was no violation,
if there was an error from monitor_traps() or p2m_get_mem_access().
-ESRCH from p2m_get_mem_access() is treated as restricted access.
NOTE: hvm_emulate_send_vm_event() assumes the caller will enable/disable
arch.vm_event->send_event
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Paul Durrant <paul@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Jan Beulich [Tue, 24 Sep 2019 08:48:44 +0000 (10:48 +0200)]
x86/traps: widen condition for logging top-of-stack
Despite -fno-omit-frame-pointer the compiler may omit the frame pointer,
often for relatively simple leaf functions. (To give a specific example,
the case I've run into this with is _pci_hide_device() and gcc 8.
Interestingly the even more simple neighboring iommu_has_feature() does
get a frame pointer set up, around just a single instruction. But this
may be a result of the size-of-asm() effects discussed elsewhere.)
Log the top-of-stack value if it looks valid _or_ if RIP looks invalid.
Also annotate all stack trace entries with a marker, to indicate their
origin:
R: register state
F: frame pointer based
S: raw stack contents
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 24 Sep 2019 08:47:53 +0000 (10:47 +0200)]
x86/traps: guard top-of-stack reads
Nothing guarantees that the original frame's stack pointer points at
readable memory. Avoid a (likely nested) crash by attaching exception
recovery to the read (making it a single read at the same time). Don't
even invoke _show_trace() in case of a non-readable top slot.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Anthony PERARD [Mon, 23 Sep 2019 13:26:52 +0000 (14:26 +0100)]
libxl: Fix build when LIBXL_API_VERSION is set
The compatibility function mistakenly called itself.
Fixes: 95627b87c3159928458ee586e8c5c593bdd248d8 Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Wei Liu <wl@xen.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
We want to limit number of shared buffers that guest can register in
OP-TEE. Every such buffer consumes XEN resources and we don't want
guest to exhaust XEN. So we choose arbitrary limit for shared buffers.
xen/arm: optee: check for preemption while freeing shared buffers
We can check for hypercall_preempt_check() in the loop inside
optee_relinquish_resources() to increase hypervisor responsiveness in
case if preemption is required.
xen/arm: optee: impose limit on shared buffer size
We want to limit number of calls to lookup_and_pin_guest_ram_addr()
per one request. There are two ways to do this: either preempt
translate_noncontig() or limit size of one shared buffer size.
It is quite hard to preempt translate_noncontig(), because it is deep
nested. So we chose the second option. We will allow 129 pages per one
shared buffer. This corresponds to the GP standard, as it requires
that size limit for shared buffer should be at least 512kB. One extra
page (129th) is needed to cope with the fact that user's buffer is not
necessary aligned with page boundary.
Also, with this limitation OP-TEE still passes own "xtest" test suite,
so this is okay for now.
Anthony PERARD [Fri, 20 Sep 2019 16:19:02 +0000 (17:19 +0100)]
tools/ocaml: Build fix following libxl API changes
The following libxl API became asynchronous and gained an additional
`ao_how' parameter:
libxl_domain_pause()
libxl_domain_unpause()
libxl_send_trigger()
Adapt the ocaml binding.
Build tested only.
Fixes: edaa631ddcee665cdfae1cf6bc7492c791e01ef4 Fixes: 95627b87c3159928458ee586e8c5c593bdd248d8 Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
xen/arm: livepatch: Prevent CPUs to fetch stale instructions after livepatching
During livepatch, a single CPU will take care of applying the patch and
all the others will wait for the action to complete. They will then once
execute arch_livepatch_post_action() to flush the pipeline.
Per B2.2.5 "Concurrent modification and execution of instructions" in
DDI 0487E.a, flushing the instruction cache is not enough to ensure new
instructions are seen. All the PEs should also do an isb() to
synchronize the fetched instruction stream.
xen/arm32: setup: Give a xenheap page to the boot allocator
After commit 6e3e771203 "xen/arm: setup: Relocate the Device-Tree later on
in the boot", the boot allocator will not receive any xenheap page (i.e.
mapped page) on Arm32.
However, the boot allocator implicitly relies on having the first page
already mapped and therefore result to break boot on Arm32.
The easiest way for now is to give a xenheap page to the boot allocator.
We may want to rethink the interface in the future.
[stefano: fix grammar in commit message]
Fixes: 6e3e771203 ('xen/arm: setup: Relocate the Device-Tree later on in the boot') Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Anthony PERARD [Tue, 13 Aug 2019 14:48:27 +0000 (15:48 +0100)]
libxlu: Handle += in config files
Handle += of both strings and lists.
If += is used for config options expected to be numbers, then a
warning is printed and the config option ignored (because xl ignores
config options with errors).
This is to be used for development purposes, where modifying config
option can be done on the `xl create' command line.
One could have a cmdline= in the cfg file, and specify cmdline+= on
the `xl create` command line without the need to write the whole
cmdline in `xl' command line but simply append to the one in the cfg file.
Or add an extra vif or disk by simply having "vif += [ '', ];" in the
`xl' cmdline.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Thu, 19 Sep 2019 16:52:24 +0000 (17:52 +0100)]
libxl_pci: Extract common part of *qemu_trad_watch_state_cb
Functions pci_add_qemu_trad_watch_state_cb and
pci_remove_qemu_trad_watch_state_cb are similar so the common part is
extracted in a different function.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Thu, 30 May 2019 17:08:45 +0000 (18:08 +0100)]
libxl: Use ev_qmp in libxl_set_vcpuonline
Removed libxl__qmp_cpu_add since it's not used anymore.
`cpumap' arg of libxl__set_vcpuonline_xenstore is constified.
The QMP command "query-cpus" is going to be called from different
places, so the algorithm that parse the answer is in a separate
function, qmp_parse_query_cpus.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Tue, 30 Jul 2019 14:56:30 +0000 (15:56 +0100)]
libxl_pci: Only check if qemu-dm is running in qemu-trad case
QEMU upstream (or qemu-xen) may not have set "running" state in
xenstore. "running" with QEMU doesn't mean that the binary is
running, it means that the emulation have started. When adding a
pci-passthrough device to QEMU, we do so via QMP, we have a direct
answer to whether QEMU is running or not, no need to check ahead.
Moving the check to do it only with qemu-trad makes upcoming changes
simpler.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Thu, 9 May 2019 17:08:09 +0000 (18:08 +0100)]
libxl_pci: Coding style of do_pci_add
do_pci_add is going to be asynchronous, so we start by having a single
path out of the function. All `return`s instead set rc and goto out.
While here, some use of `rc' was used to store the return value of
libxc calls, change them to store into `r'. Also, add the value of `r'
in the error message of those calls.
There were an `out' label that was use it seems to skip setting up the
IRQ, the label has been renamed to `out_no_irq'.
No functional changes.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Wed, 8 May 2019 14:23:52 +0000 (15:23 +0100)]
libxl: Use aodev for libxl__device_usbdev_remove
This also mean libxl__initiate_device_usbctrl_remove, which uses
libxl__device_usbdev_remove synchronously, needs to be updated to use
it with multidev.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Tue, 7 May 2019 14:54:08 +0000 (15:54 +0100)]
libxl: Add device_{config,type} to libxl__ao_device
These two fields help to give more information about the device been
hotplug/hotunplug to callbacks.
There is already `dev' of type `libxl__device', but it is mostly
useful when the backend/frontend is xenstore. Some device (like
`usbdev') don't have devid, so `dev' can't be used.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Wed, 17 Apr 2019 16:16:07 +0000 (17:16 +0100)]
libxl: Add libxl__ev_qmp to libxl__ao_device
`aodev->qmp' is initialised in libxl__prepare_ao_device(), but since
there isn't a single exit path for a `libxl__ao_device', users of this
new `qmp' field will have to disposed of it.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Tue, 7 May 2019 16:18:56 +0000 (17:18 +0100)]
libxl: Inline do_usbdev_remove into libxl__device_usbdev_remove
Having the function do_usbdev_remove makes it harder to add asynchronous
calls into it. Move its body back into libxl__device_usbdev_remove and
adjust the latter as there are no reason to have a separated function.
No functional changes.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Thu, 18 Apr 2019 11:10:30 +0000 (12:10 +0100)]
libxl: Inline do_usbdev_add into libxl__device_usbdev_add
Having the function do_usbdev_add makes it harder to add asynchronous
calls into it. Move its body back into libxl__device_usbdev_add and
adjust the latter as there are no reason to have a separated function.
No functional changes.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Sun, 26 May 2019 12:37:44 +0000 (13:37 +0100)]
libxl: Re-introduce libxl__domain_resume
libxl__domain_resume is a rework libxl__domain_resume_deprecated. It
makes uses of ev_xswatch and ev_qmp, to replace synchronous QMP calls
and libxl__wait_for_device_model_deprecated call.
This patch also introduce libxl__dm_resume which is a sub-operation of
both libxl__domain_resume and libxl__domain_unpause and can be used
instead of libxl__domain_resume_device_model_deprecated.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Thu, 23 May 2019 14:07:52 +0000 (15:07 +0100)]
libxl: Deprecate libxl__domain_{unpause,resume}
These two functions are used from many places in libxl and need to
change to be able to accomodate libxl__ev_qmp calls and thus needs to
be asynchronous.
(There is also libxl__domain_resume_device_model in the mix.)
A later patch will introduce a new libxl__domain_resume and
libxl__domain_unpause which will make use of libxl__ev_qmp.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Fri, 17 May 2019 09:39:13 +0000 (10:39 +0100)]
libxl: Replace libxl__qmp_initializations by ev_qmp calls
Setup a timeout of 10s for all the commands. It used to be about 5s
per commands.
The order of command is changed, we call 'query-vnc' before
'change-vnc-password', but that should not matter. That makes it
easier to call 'change-vnc-password' conditionally.
Also 'change' command is replaced by 'change-vnc-password'
because 'change' is deprecated. The new command is available in all
QEMU versions that also have Xen support.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>