xenbits.xensource.com Git - people/pauldu/xen.git/log
people/pauldu/xen.git
5 years agox86/setup: do not map memory when passing to allocators for-hongyan
Hongyan Xia [Wed, 9 Oct 2019 13:54:23 +0000 (14:54 +0100)]
x86/setup: do not map memory when passing to allocators

Also destroy the existing mappings to the direct map region when passing
to allocators.

Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agox86: properly (un)map pages in restore_all_guests.
Hongyan Xia [Fri, 13 Sep 2019 16:59:30 +0000 (17:59 +0100)]
x86: properly (un)map pages in restore_all_guests.

Previously, the code assumed both cr3 values could be accessed via the
direct map. This is no longer true.

Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agox86/mm: don't call unmap on a xenheap allocation
Hongyan Xia [Mon, 30 Sep 2019 18:23:25 +0000 (19:23 +0100)]
x86/mm: don't call unmap on a xenheap allocation

This used to work when unmap on the direct map was just a no-op, which
is no longer true.

Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agox86/domain_page: only enable the fast paths when there is direct map
Hongyan Xia [Tue, 1 Oct 2019 09:50:24 +0000 (10:50 +0100)]
x86/domain_page: only enable the fast paths when there is direct map

Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agoxen/page_alloc: add a path for xenheap when there is no direct map
Hongyan Xia [Tue, 1 Oct 2019 09:30:45 +0000 (10:30 +0100)]
xen/page_alloc: add a path for xenheap when there is no direct map

When there is not an always-mapped direct map, xenheap allocations need
to be mapped and unmapped on-demand.

Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agoacpi: add a path to work with no direct map
Hongyan Xia [Fri, 13 Sep 2019 08:21:11 +0000 (09:21 +0100)]
acpi: add a path to work with no direct map

Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agoxen/page_alloc: statically allocate bootmem_region_list
Hongyan Xia [Tue, 1 Oct 2019 09:13:20 +0000 (10:13 +0100)]
xen/page_alloc: statically allocate bootmem_region_list

This is to avoid all sorts of bootstrapping problems, especially when
we do not have a direct map.

Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agox86/pv: refactor how building dom0 in PV handles domheap mappings
Hongyan Xia [Fri, 13 Sep 2019 14:23:52 +0000 (15:23 +0100)]
x86/pv: refactor how building dom0 in PV handles domheap mappings

Building a PV dom0 allocates from the domheap but uses the pages as if
they came from the xenheap. This is clearly wrong. Fix it.

Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agox86/pv: domheap pages should be mapped while relocating initrd
Wei Liu [Tue, 19 Feb 2019 15:45:12 +0000 (15:45 +0000)]
x86/pv: domheap pages should be mapped while relocating initrd

Xen shouldn't use domheap pages as if they were xenheap pages. Map and
unmap pages accordingly.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/domain_page: use PMAP when d/vcache is not ready
Hongyan Xia [Mon, 30 Sep 2019 17:06:51 +0000 (18:06 +0100)]
x86/domain_page: use PMAP when d/vcache is not ready

Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agox86/mapcache: initialise the mapcache even for the idle domain
Hongyan Xia [Wed, 11 Sep 2019 09:37:14 +0000 (10:37 +0100)]
x86/mapcache: initialise the mapcache even for the idle domain

In situations where PMAP cannot be used or the mapcache of a domain is
simply not ready, we need to have a mapcache in the idle domain to map
pages when there is no direct map.

Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agoKconfig: add an option to enable and disable the direct map
Hongyan Xia [Wed, 9 Oct 2019 13:41:40 +0000 (14:41 +0100)]
Kconfig: add an option to enable and disable the direct map

This only works for x86 for now but can include Arm at some point.

Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agox86: lift vcpu mapcache to arch_vcpu
Wei Liu [Mon, 17 Dec 2018 15:58:13 +0000 (15:58 +0000)]
x86: lift vcpu mapcache to arch_vcpu

It is going to be needed by HVM as well, because we want even HVM vcpus
to have a per-vcpu mapcache.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agox86: lift domain mapcache to arch_domain
Wei Liu [Mon, 17 Dec 2018 15:50:49 +0000 (15:50 +0000)]
x86: lift domain mapcache to arch_domain

It is going to be needed by HVM as well, because we want even HVM
domains to have a per-domain mapcache.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86: add Persistent Map (PMAP) infrastructure
Wei Liu [Fri, 11 Jan 2019 17:20:21 +0000 (17:20 +0000)]
x86: add Persistent Map (PMAP) infrastructure

The basic idea is like Persistent Kernel Map (PKMAP) in Linux. We
pre-populate all the relevant page tables before the system is fully
set up.

It is needed to bootstrap map domain page infrastructure -- we need
some way to map pages to set up the mapcache without a direct map.

In order to keep the number of entries minimal, this infrastructure
can only be used by one CPU at a time.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyax@amazon.com>
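
The slot idea described in the PMAP commit above can be sketched in
plain C. This is a toy model only, not the Xen implementation: the
names toy_pmap_map, toy_pmap_unmap and NUM_SLOTS are hypothetical, the
"mapping" is just a pointer into a static array, and the locking that
would enforce one CPU at a time is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SLOTS 8      /* hypothetical: deliberately small */
#define PAGE_SIZE 4096

/* Stand-in for the pre-populated virtual range: one page per slot. */
static unsigned char slot_window[NUM_SLOTS][PAGE_SIZE];
static uint64_t slot_mfn[NUM_SLOTS];   /* frame each slot currently "maps" */
static unsigned int slots_in_use;      /* bit i set => slot i is busy */

/* Claim the first free slot for frame mfn; NULL when every slot is busy. */
static void *toy_pmap_map(uint64_t mfn)
{
    for (unsigned int i = 0; i < NUM_SLOTS; i++) {
        if (!(slots_in_use & (1u << i))) {
            slots_in_use |= 1u << i;
            slot_mfn[i] = mfn;
            return slot_window[i];
        }
    }
    return NULL;
}

/* Release the slot that backs pointer p. */
static void toy_pmap_unmap(void *p)
{
    size_t i = ((uintptr_t)p - (uintptr_t)slot_window) / PAGE_SIZE;

    assert(i < NUM_SLOTS && (slots_in_use & (1u << i)));
    slots_in_use &= ~(1u << i);
}
```

Because the slots are fixed and set up in advance, mapping a page needs
no further allocation, which is what makes the scheme usable this early
in boot.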
5 years agox86/mm: drop _new suffix for page table APIs
Wei Liu [Fri, 8 Feb 2019 17:19:26 +0000 (17:19 +0000)]
x86/mm: drop _new suffix for page table APIs

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyax@amazon.com>
---
Changed since v1:
- Fix rebase conflicts against new master and other changes since v1.

Changed since v2:
- Also drop _new for the fix of l2t leak.

5 years agox86: switch to use domheap page for page tables
Wei Liu [Tue, 5 Feb 2019 17:20:11 +0000 (17:20 +0000)]
x86: switch to use domheap page for page tables

Modify all the _new APIs to handle domheap pages.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/mm: drop old page table APIs
Wei Liu [Tue, 5 Feb 2019 17:06:43 +0000 (17:06 +0000)]
x86/mm: drop old page table APIs

Now that we've switched all users to the new APIs, the old ones aren't
needed anymore.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86: remove lXe_to_lYe in __start_xen
Wei Liu [Tue, 5 Feb 2019 17:04:56 +0000 (17:04 +0000)]
x86: remove lXe_to_lYe in __start_xen

Properly map and unmap page tables where necessary.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/pv: properly map and unmap page table in dom0_construct_pv
Wei Liu [Tue, 5 Feb 2019 16:35:28 +0000 (16:35 +0000)]
x86/pv: properly map and unmap page table in dom0_construct_pv

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/pv: properly map and unmap page tables in mark_pv_pt_pages_rdonly
Wei Liu [Tue, 5 Feb 2019 16:32:54 +0000 (16:32 +0000)]
x86/pv: properly map and unmap page tables in mark_pv_pt_pages_rdonly

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/smpboot: remove lXe_to_lYe in cleanup_cpu_root_pgt
Wei Liu [Tue, 5 Feb 2019 13:51:12 +0000 (13:51 +0000)]
x86/smpboot: remove lXe_to_lYe in cleanup_cpu_root_pgt

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: map and unmap page tables in subarch_memory_op
Wei Liu [Tue, 5 Feb 2019 13:47:07 +0000 (13:47 +0000)]
x86_64/mm: map and unmap page tables in subarch_memory_op

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: map and unmap page tables in subarch_init_memory
Wei Liu [Tue, 5 Feb 2019 13:44:22 +0000 (13:44 +0000)]
x86_64/mm: map and unmap page tables in subarch_init_memory

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: map and unmap page tables in cleanup_frame_table
Wei Liu [Tue, 5 Feb 2019 13:35:19 +0000 (13:35 +0000)]
x86_64/mm: map and unmap page tables in cleanup_frame_table

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: map and unmap page tables in setup_compat_m2p_table
Wei Liu [Tue, 5 Feb 2019 13:25:05 +0000 (13:25 +0000)]
x86_64/mm: map and unmap page tables in setup_compat_m2p_table

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: map and unmap page tables in destroy_m2p_mapping
Wei Liu [Tue, 5 Feb 2019 13:19:43 +0000 (13:19 +0000)]
x86_64/mm: map and unmap page tables in destroy_m2p_mapping

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: map and unmap page tables in destroy_compat_m2p_mapping
Wei Liu [Tue, 5 Feb 2019 13:09:18 +0000 (13:09 +0000)]
x86_64/mm: map and unmap page tables in destroy_compat_m2p_mapping

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: map and unmap page tables in share_hotadd_m2p_table
Wei Liu [Tue, 5 Feb 2019 13:06:08 +0000 (13:06 +0000)]
x86_64/mm: map and unmap page tables in share_hotadd_m2p_table

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: map and unmap page tables in m2p_mapped
Wei Liu [Tue, 5 Feb 2019 12:56:41 +0000 (12:56 +0000)]
x86_64/mm: map and unmap page tables in m2p_mapped

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/shim: map and unmap page tables in replace_va_mapping
Wei Liu [Tue, 5 Feb 2019 12:48:03 +0000 (12:48 +0000)]
x86/shim: map and unmap page tables in replace_va_mapping

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/smpboot: drop lXe_to_lYe invocations from cleanup_cpu_root_pgt
Wei Liu [Mon, 4 Feb 2019 18:16:30 +0000 (18:16 +0000)]
x86/smpboot: drop lXe_to_lYe invocations from cleanup_cpu_root_pgt

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/smpboot: switch pl*e to use new APIs in clone_mapping
Wei Liu [Mon, 4 Feb 2019 17:57:33 +0000 (17:57 +0000)]
x86/smpboot: switch pl*e to use new APIs in clone_mapping

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agox86/smpboot: clone_mapping should have one exit path
Wei Liu [Mon, 4 Feb 2019 17:48:45 +0000 (17:48 +0000)]
x86/smpboot: clone_mapping should have one exit path

We will soon need to clean up page table mappings in the exit path.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/smpboot: add emacs block
Wei Liu [Mon, 4 Feb 2019 17:45:50 +0000 (17:45 +0000)]
x86/smpboot: add emacs block

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agoefi: switch EFI L4 table to use new APIs
Wei Liu [Mon, 4 Feb 2019 17:19:27 +0000 (17:19 +0000)]
efi: switch EFI L4 table to use new APIs

This requires storing the MFN instead of linear address of the L4
table. Adjust code accordingly.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agoefi: add emacs block to boot.c
Wei Liu [Mon, 4 Feb 2019 17:01:10 +0000 (17:01 +0000)]
efi: add emacs block to boot.c

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agoefi: use new page table APIs in efi_init_memory
Wei Liu [Mon, 4 Feb 2019 17:00:59 +0000 (17:00 +0000)]
efi: use new page table APIs in efi_init_memory

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agoefi: avoid using global variable in copy_mapping
Wei Liu [Mon, 4 Feb 2019 16:40:34 +0000 (16:40 +0000)]
efi: avoid using global variable in copy_mapping

We will soon switch efi_l4_table to use ephemeral mapping. Make
copy_mapping take a pointer to the mapping instead of using the global
variable.

No functional change intended.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agoefi: use new page table APIs in copy_mapping
Wei Liu [Mon, 4 Feb 2019 16:01:03 +0000 (16:01 +0000)]
efi: use new page table APIs in copy_mapping

After inspection, ARM doesn't have alloc_xen_pagetable, so this function
is x86 only, which means it is safe for us to change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
---
XXX test this in gitlab ci to be sure.

5 years agox86_64/mm: drop lXe_to_lYe invocations from setup_m2p_table
Wei Liu [Thu, 31 Jan 2019 19:04:23 +0000 (19:04 +0000)]
x86_64/mm: drop lXe_to_lYe invocations from setup_m2p_table

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: switch to new APIs in setup_m2p_table
Wei Liu [Thu, 31 Jan 2019 19:01:11 +0000 (19:01 +0000)]
x86_64/mm: switch to new APIs in setup_m2p_table

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: introduce pl2e in setup_m2p_table
Wei Liu [Thu, 31 Jan 2019 18:52:48 +0000 (18:52 +0000)]
x86_64/mm: introduce pl2e in setup_m2p_table

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm.c: remove code that serves no purpose in setup_m2p_table
Wei Liu [Thu, 31 Jan 2019 18:49:36 +0000 (18:49 +0000)]
x86_64/mm.c: remove code that serves no purpose in setup_m2p_table

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: drop l4e_to_l3e invocation from paging_init
Wei Liu [Thu, 31 Jan 2019 18:31:04 +0000 (18:31 +0000)]
x86_64/mm: drop l4e_to_l3e invocation from paging_init

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86_64/mm: switch to new APIs in paging_init
Wei Liu [Tue, 29 Jan 2019 14:40:26 +0000 (14:40 +0000)]
x86_64/mm: switch to new APIs in paging_init

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyax@amazon.com>
---
Changed since v1:
- Use a global mapping for compat_idle_pg_table_l2, otherwise
  l2_ro_mpt will unmap it.

5 years agox86_64/mm: introduce pl2e in paging_init
Wei Liu [Thu, 31 Jan 2019 18:06:53 +0000 (18:06 +0000)]
x86_64/mm: introduce pl2e in paging_init

Introduce pl2e so that we can use l2_ro_mpt to point to the page table
itself.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/mm: switch to new APIs in arch_init_memory
Wei Liu [Tue, 29 Jan 2019 14:15:47 +0000 (14:15 +0000)]
x86/mm: switch to new APIs in arch_init_memory

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/mm: drop lXe_to_lYe invocations from modify_xen_mappings
Wei Liu [Fri, 1 Feb 2019 13:15:59 +0000 (13:15 +0000)]
x86/mm: drop lXe_to_lYe invocations from modify_xen_mappings

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/mm: switch to new APIs in modify_xen_mappings
Wei Liu [Tue, 29 Jan 2019 14:03:48 +0000 (14:03 +0000)]
x86/mm: switch to new APIs in modify_xen_mappings

Page tables allocated in that function should be mapped and unmapped
now.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyax@amazon.com>
---
Changed since v1:
- Remove redundant lines.

5 years agox86/mm: drop lXe_to_lYe invocations in map_pages_to_xen
Wei Liu [Fri, 1 Feb 2019 12:39:26 +0000 (12:39 +0000)]
x86/mm: drop lXe_to_lYe invocations in map_pages_to_xen

Map and unmap page tables where necessary.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyax@amazon.com>
---
Changed since v2:
- Fix a leak in mapping l2t.

5 years agox86/mm: switch to new APIs in map_pages_to_xen
Wei Liu [Tue, 29 Jan 2019 13:56:43 +0000 (13:56 +0000)]
x86/mm: switch to new APIs in map_pages_to_xen

Page tables allocated in that function should be mapped and unmapped
now.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyax@amazon.com>
---
Changed since v1:
- Remove redundant lines.

5 years agox86/mm: rewrite virt_to_xen_l*e
Wei Liu [Tue, 29 Jan 2019 12:42:23 +0000 (12:42 +0000)]
x86/mm: rewrite virt_to_xen_l*e

Rewrite that function to use the new APIs. Modify its callers to unmap
the pointer returned.

Note that the change of virt_to_xen_l1e also requires vmap_to_mfn to
unmap the page.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agox86/mm: change pl*e to l*t in virt_to_xen_l*e
Wei Liu [Tue, 29 Jan 2019 12:54:48 +0000 (12:54 +0000)]
x86/mm: change pl*e to l*t in virt_to_xen_l*e

We will need to have a variable named pl*e when we rewrite
virt_to_xen_l*e. Change pl*e to l*t to better reflect its purpose.
This will make reviewing later patches easier.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyax@amazon.com>
5 years agox86/mm: add an end_of_loop label in modify_xen_mappings
Wei Liu [Mon, 28 Jan 2019 18:45:06 +0000 (18:45 +0000)]
x86/mm: add an end_of_loop label in modify_xen_mappings

We will soon need to clean up mappings whenever the outermost loop
ends. Add a new label and turn relevant continue's into goto's.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/mm: make sure there is one exit path for modify_xen_mappings
Wei Liu [Mon, 28 Jan 2019 18:41:26 +0000 (18:41 +0000)]
x86/mm: make sure there is one exit path for modify_xen_mappings

We will soon need to handle dynamically mapping / unmapping page
tables in the said function.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/mm: add an end_of_loop label in map_pages_to_xen
Wei Liu [Mon, 28 Jan 2019 18:35:52 +0000 (18:35 +0000)]
x86/mm: add an end_of_loop label in map_pages_to_xen

We will soon need to clean up mappings whenever the outermost loop
ends. Add a new label and turn relevant continue's into goto's.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
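
The effect of turning a continue into a goto can be sketched with a toy
loop. Nothing here is Xen code: toy_map, toy_unmap and count_present
are hypothetical, and live_mappings exists only so that a leak would be
visible.

```c
#include <stddef.h>

static int live_mappings;   /* map calls not yet balanced by an unmap */

static int *toy_map(int *table)   { live_mappings++; return table; }
static void toy_unmap(int *table) { if (table) live_mappings--; }

/* Count the tables whose first entry is non-zero. */
static int count_present(int *tables[], size_t n)
{
    int present = 0;

    for (size_t i = 0; i < n; i++) {
        int *t = toy_map(tables[i]);

        if (t[0] == 0)
            goto end_of_loop;   /* a plain "continue" would skip the unmap */

        present++;

    end_of_loop:
        toy_unmap(t);           /* runs on every iteration */
    }

    return present;
}
```

With the label in place, each iteration has a single point where its
mapping is dropped, however the loop body was left.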
5 years agox86/mm: map_pages_to_xen should have one exit path
Wei Liu [Mon, 28 Jan 2019 18:30:47 +0000 (18:30 +0000)]
x86/mm: map_pages_to_xen should have one exit path

We will soon rewrite the function to handle dynamically mapping and
unmapping of page tables.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/mm: introduce l{1,2}t local variables to modify_xen_mappings
Wei Liu [Mon, 28 Jan 2019 18:10:10 +0000 (18:10 +0000)]
x86/mm: introduce l{1,2}t local variables to modify_xen_mappings

The pl2e and pl1e variables are heavily (ab)used in that function. It
is fine at the moment because all page tables are always mapped so
there is no need to track the lifetime of each variable.

We will soon have the requirement to map and unmap page tables. We
need to track the lifetime of each variable to avoid leakage.

Introduce some l{1,2}t variables with limited scope so that we can
track the lifetime of pointers to xen page tables more easily.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/mm: introduce l{1,2}t local variables to map_pages_to_xen
Wei Liu [Mon, 28 Jan 2019 17:54:24 +0000 (17:54 +0000)]
x86/mm: introduce l{1,2}t local variables to map_pages_to_xen

The pl2e and pl1e variables are heavily (ab)used in that function. It
is fine at the moment because all page tables are always mapped so
there is no need to track the lifetime of each variable.

We will soon have the requirement to map and unmap page tables. We
need to track the lifetime of each variable to avoid leakage.

Introduce some l{1,2}t variables with limited scope so that we can
track the lifetime of pointers to xen page tables more easily.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86: introduce a new set of APIs to manage Xen page tables
Wei Liu [Wed, 23 Jan 2019 15:33:07 +0000 (15:33 +0000)]
x86: introduce a new set of APIs to manage Xen page tables

We are going to switch to using domheap pages for page tables.
A new set of APIs is introduced to allocate, map, unmap and free pages
for page tables.

The allocation and deallocation work on mfn_t but not page_info,
because they are required to work even before frame table is set up.

Implement the old functions with the new ones. We will rewrite, site
by site, other mm functions that manipulate page tables to use the new
APIs.

Note these new APIs still use xenheap page underneath and no actual
map and unmap is done so that we don't break xen half way. They will
be switched to use domheap and dynamic mappings when usage of old APIs
is eliminated.

No functional change intended in this patch.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
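
The allocate / map / unmap / free split described in the commit above
might look roughly like the following. This is a minimal sketch under
stated assumptions: all toy_* names are hypothetical, toy_mfn_t merely
stands in for mfn_t, and frames come from a small static pool rather
than from any real allocator.

```c
#include <stdint.h>
#include <string.h>

#define POOL_FRAMES 4
#define PT_ENTRIES  512
#define INVALID_MFN UINT64_MAX

typedef struct { uint64_t m; } toy_mfn_t;   /* hypothetical stand-in for mfn_t */

static uint64_t pool[POOL_FRAMES][PT_ENTRIES];
static int frame_used[POOL_FRAMES];
static int live_mappings;                   /* maps not yet unmapped */

/* Allocation hands back a frame number, not a pointer. */
static toy_mfn_t toy_alloc_pagetable(void)
{
    for (uint64_t i = 0; i < POOL_FRAMES; i++) {
        if (!frame_used[i]) {
            frame_used[i] = 1;
            memset(pool[i], 0, sizeof(pool[i]));
            return (toy_mfn_t){ i };
        }
    }
    return (toy_mfn_t){ INVALID_MFN };
}

/* The frame is only accessible between map and unmap. */
static void *toy_map_pagetable(toy_mfn_t mfn)
{
    if (mfn.m >= POOL_FRAMES)
        return NULL;
    live_mappings++;
    return pool[mfn.m];
}

static void toy_unmap_pagetable(void *va)
{
    if (va)
        live_mappings--;
}

static void toy_free_pagetable(toy_mfn_t mfn)
{
    if (mfn.m < POOL_FRAMES)
        frame_used[mfn.m] = 0;
}
```

Working on frame numbers keeps the allocation side independent of any
per-page metadata, while the map/unmap pair gives callers an explicit
lifetime to track.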
5 years agox86: move some xen mm function declarations
Wei Liu [Wed, 23 Jan 2019 15:17:41 +0000 (15:17 +0000)]
x86: move some xen mm function declarations

They were put into page.h but mm.h is more appropriate.

The real reason is that I will be adding some new functions which
take mfn_t. It turns out it is a bit difficult to do in page.h.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
5 years agox86/svm: Write the correct %eip into the outgoing task
Andrew Cooper [Thu, 21 Nov 2019 17:22:52 +0000 (17:22 +0000)]
x86/svm: Write the correct %eip into the outgoing task

The TASK_SWITCH vmexit has fault semantics, and doesn't provide any NRIPs
assistance with instruction length.  As a result, any instruction-induced task
switch has the outgoing task's %eip pointing at the instruction which caused
the switch, rather than after it.

This causes callers of task gates to livelock (repeatedly execute the call/jmp
to enter the task), and any restartable task to become a nop after its first
use (the (re)entry state points at the iret used to exit the task).

32bit Windows in particular is known to use task gates for NMI handling, and
to use NMI IPIs.

In the task switch handler, distinguish instruction-induced from
interrupt/exception-induced task switches, and decode the instruction under
%rip to calculate its length.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/svm: Always intercept ICEBP
Andrew Cooper [Mon, 25 Nov 2019 19:33:36 +0000 (19:33 +0000)]
x86/svm: Always intercept ICEBP

ICEBP isn't handled well by SVM.

The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
appropriate instruction boundary (fault or trap, as appropriate), except for
an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
rather than after it.  As ICEBP isn't distinguished in the vectoring event
type, the state is ambiguous.

To add to the confusion, an ICEBP which occurs due to Introspection
intercepting the instruction, or from x86_emulate() will have %rip updated as
a consequence of partial emulation required to inject an ICEBP event in the
first place.

We could in principle spot the non-injected case in the TASK_SWITCH handler,
but this still results in complexity if the ICEBP instruction also has an
Instruction Breakpoint active on it (which genuinely has fault semantics).

Unconditionally intercept ICEBP.  This does have NRIPs support as it is an
instruction intercept, which allows us to move %rip forwards appropriately
before the TASK_SWITCH intercept is hit.  This makes #DB-vectored switches
have consistent behaviour however the ICEBP #DB came about, and avoids special
cases in the TASK_SWITCH intercept.

This in turn allows for the removal of the conditional
hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
will now always be submitted for monitoring checks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/vtx: Fix fault semantics for early task switch failures
Andrew Cooper [Thu, 21 Nov 2019 17:22:52 +0000 (17:22 +0000)]
x86/vtx: Fix fault semantics for early task switch failures

The VT-x task switch handler adds inst_len to %rip before calling
hvm_task_switch(), which is problematic in two ways:

 1) Early faults (i.e. ones delivered in the context of the old task) get
    delivered with trap semantics, and break restartability.

 2) The addition isn't truncated to 32 bits.  In the corner case of a task
    switch instruction crossing the 4G->0 boundary taking an early fault (with
    trap semantics), a VMEntry failure will occur due to %rip being out of
    range.

Instead, pass the instruction length into hvm_task_switch() and write it into
the outgoing TSS only, leaving %rip in its original location.

For now, pass 0 on the SVM side.  This highlights a separate preexisting bug
which will be addressed in the following patch.

While adjusting call sites, drop the unnecessary uint16_t cast.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
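
The 4G->0 corner case in point 2 of the VT-x commit above is plain
integer arithmetic and can be shown in isolation. The helper names
below are hypothetical and do not correspond to anything in the
hypervisor.

```c
#include <assert.h>
#include <stdint.h>

/* %eip to record in the outgoing TSS: wraps at 4G like a 32-bit register. */
static uint32_t outgoing_eip(uint64_t rip, unsigned int inst_len)
{
    assert(inst_len <= 15);   /* x86 instructions are at most 15 bytes */
    return (uint32_t)(rip + inst_len);
}

/* What an untruncated 64-bit addition would leave in %rip instead. */
static uint64_t untruncated_rip(uint64_t rip, unsigned int inst_len)
{
    return rip + inst_len;
}
```

For an instruction at 0xFFFFFFFE with length 4, outgoing_eip() gives 2,
whereas untruncated_rip() gives 0x100000002, a value that no longer
fits in 32 bits.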
5 years agobuild: provide option to disambiguate symbol names
Jan Beulich [Thu, 28 Nov 2019 16:47:25 +0000 (17:47 +0100)]
build: provide option to disambiguate symbol names

The .file assembler directives generated by the compiler do not include
any path components (gcc) or just the ones specified on the command line
(clang, at least version 5), and hence multiple identically named source
files (in different directories) may produce identically named static
symbols (in their kallsyms representation). The binary diffing algorithm
used by xen-livepatch, however, depends on having unique symbols.

Make the ENFORCE_UNIQUE_SYMBOLS Kconfig option control the (build)
behavior, and if enabled use objcopy to prepend the (relative to the
xen/ subdirectory) path to the compiler invoked STT_FILE symbols. Note
that this build option is made no longer depend on LIVEPATCH, but merely
defaults to its setting now.

Conditionalize explicit .file directive insertion in C files where it
exists just to disambiguate names in a less generic manner; note that
at the same time the redundant emission of STT_FILE symbols gets
suppressed for clang. Assembler files as well as multiply compiled C
ones using __OBJECT_FILE__ are left alone for the time being.

Since we now expect there not to be any duplicates anymore, also don't
force the selection of the option to 'n' anymore in allrandom.config.
Similarly COVERAGE no longer suppresses duplicate symbol warnings if
enforcement is in effect, which in turn allows
SUPPRESS_DUPLICATE_SYMBOL_WARNINGS to simply depend on
!ENFORCE_UNIQUE_SYMBOLS.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Wei Liu <wl@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86/IRQ: make internally used IRQs also honor the pending EOI stack
Jan Beulich [Thu, 28 Nov 2019 14:14:03 +0000 (15:14 +0100)]
x86/IRQ: make internally used IRQs also honor the pending EOI stack

At the time the pending EOI stack was introduced there were no
internally used IRQs which would have the LAPIC EOI issued from the
->end() hook. This had then changed with the introduction of IOMMUs,
but the interaction issue was presumably masked by
irq_guest_eoi_timer_fn() frequently EOI-ing interrupts way too early
(which got fixed by 359cf6f8a0ec ["x86/IRQ: don't keep EOI timer
running without need"]).

The problem is that with us re-enabling interrupts across handler
invocation, a higher priority (guest) interrupt may trigger while
handling a lower priority (internal) one. The EOI issued from
->end() (for ACKTYPE_EOI kind interrupts) would then mistakenly
EOI the higher priority (guest) interrupt, breaking (among other
things) pending EOI stack logic's assumptions.

Notes:

- In principle we could get away without the check_eoi_deferral flag.
  I've introduced it just to make sure there's as little change as
  possible to unaffected paths.
- Similarly the cpu_has_pending_apic_eoi() check in do_IRQ() isn't
  strictly necessary.
- The new function's name isn't very helpful with its use in
  end_level_ioapic_irq_new(). I did also consider eoi_APIC_irq() (to
  parallel ack_APIC_irq()), but then liked this even less.

Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Diagnosed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/vmx: always sync PIR to IRR before vmentry
Roger Pau Monné [Thu, 28 Nov 2019 10:58:25 +0000 (11:58 +0100)]
x86/vmx: always sync PIR to IRR before vmentry

When using posted interrupts on Intel hardware it's possible that the
vCPU resumes execution with a stale local APIC IRR register because
depending on the interrupts to be injected vlapic_has_pending_irq
might not be called, and thus PIR won't be synced into IRR.

Fix this by making sure PIR is always synced to IRR in
hvm_vcpu_has_pending_irq regardless of what interrupts are pending.

Reported-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Joe Jin <joe.jin@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
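
The idea of always folding posted interrupts into the IRR before
looking for pending work can be sketched with C11 atomics. This is a
toy model, not the vlapic code: struct toy_vcpu and both functions are
hypothetical, and the 256 vectors are simply stored as four 64-bit
words.

```c
#include <stdatomic.h>
#include <stdint.h>

#define VEC_WORDS 4   /* 256 interrupt vectors as 4 x 64-bit words */

struct toy_vcpu {
    _Atomic uint64_t pir[VEC_WORDS];   /* posted by other agents */
    uint64_t irr[VEC_WORDS];           /* what injection logic looks at */
};

/* Atomically take every posted bit and merge it into the IRR. */
static void toy_sync_pir_to_irr(struct toy_vcpu *v)
{
    for (int w = 0; w < VEC_WORDS; w++)
        v->irr[w] |= atomic_exchange(&v->pir[w], (uint64_t)0);
}

/* Highest pending vector, or -1 if none. The sync is unconditional. */
static int toy_highest_pending(struct toy_vcpu *v)
{
    toy_sync_pir_to_irr(v);

    for (int w = VEC_WORDS - 1; w >= 0; w--)
        for (int b = 63; b >= 0; b--)
            if (v->irr[w] & ((uint64_t)1 << b))
                return w * 64 + b;

    return -1;
}
```

Putting the sync at the top of the query, rather than behind a check
that may be skipped, is the point of the sketch: a bit posted in the
PIR can never be missed because some other path returned early.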
5 years agoMAINTAINERS: Update path to the livepatch documentation
Julien Grall [Tue, 26 Nov 2019 13:30:23 +0000 (13:30 +0000)]
MAINTAINERS: Update path to the livepatch documentation

Commit d661611d08 "docs/markdown: Switch to using pandoc, and fix
underscore escaping" converted the livepatch documentation from markdown
to pandoc.

Update MAINTAINERS to reflect the change so the correct maintainers are
CCed to the patches.

Fixes: d661611d08 ("docs/markdown: Switch to using pandoc, and fix underscore escaping")
Signed-off-by: Julien Grall <julien@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/microcode: refuse to load the same revision ucode
Sergey Dyasli [Wed, 27 Nov 2019 10:04:30 +0000 (10:04 +0000)]
x86/microcode: refuse to load the same revision ucode

Currently if a user tries to live-load the same or older ucode revision
than the CPU already has, they will get a single message in Xen log like:

    (XEN) 128 cores are to update their microcode

No actual ucode loading will happen and this situation can be quite
confusing. Fix this by starting ucode update only when the provided
ucode revision is higher than the currently cached one (if any).
This is based on the property that if microcode_cache exists, all CPUs
in the system should have at least that ucode revision.

Additionally, print a user friendly message if no matching or newer
ucode can be found in the provided blob. This also requires ignoring
-ENODATA in AMD-side code, otherwise the message given to the user is:

    (XEN) Parsing microcode blob error -61

Which actually means that a ucode blob was parsed fine, but no matching
ucode was found.

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
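
The revision check described in the microcode commit above reduces to
one comparison. The sketch below assumes a hypothetical struct
toy_ucode that carries only a revision number; the real patch
structures and cache handling are not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_ucode {
    uint32_t rev;   /* hypothetical: revision is the only field modelled */
};

/*
 * Decide whether an update should start.  "cached" may be NULL when
 * nothing has been cached yet; "candidate" is NULL when the blob held
 * no matching ucode.
 */
static bool toy_should_update(const struct toy_ucode *cached,
                              const struct toy_ucode *candidate)
{
    if (candidate == NULL)
        return false;

    return cached == NULL || candidate->rev > cached->rev;
}
```

Equal revisions are rejected as well as older ones, which matches the
"same revision" case named in the subject line.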
5 years agoAMD/IOMMU: honour IR setting while pre-filling DTEs
Igor Druzhinin [Tue, 26 Nov 2019 17:08:19 +0000 (17:08 +0000)]
AMD/IOMMU: honour IR setting while pre-filling DTEs

IV bit shouldn't be set in DTE if interrupt remapping is not
enabled. It's a regression in behavior of "iommu=no-intremap"
option which otherwise would keep interrupt requests untranslated
for all of the devices in the system regardless of whether it's
described as valid in IVRS or not.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agodocs/xl: Document pci-assignable state
George Dunlap [Tue, 26 Nov 2019 15:49:20 +0000 (15:49 +0000)]
docs/xl: Document pci-assignable state

Changesets 319f9a0ba9 ("passthrough: quarantine PCI devices") and
ba2ab00bbb ("IOMMU: default to always quarantining PCI devices")
introduced PCI device "quarantine" behavior, but did not document how
the pci-assignable-add and -remove functions act in regard to this.
Rectify this.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Reviewed-by: Paul Durrant <paul@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoEFI: fix "efi=attr=" handling
Jan Beulich [Tue, 26 Nov 2019 13:17:45 +0000 (14:17 +0100)]
EFI: fix "efi=attr=" handling

Commit 633a40947321 ("docs: Improve documentation and parsing for efi=")
failed to honor the strcmp()-like return value convention of
cmdline_strcmp().
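
To illustrate the convention, here is a minimal Python sketch. The names
strcmp_like, parse_attr_buggy and parse_attr_fixed, and the value "uc",
are hypothetical stand-ins and not the Xen code; the only point is that a
strcmp()-style helper returns 0 on a match, so testing its result for
truthiness selects the mismatch case.

```python
def strcmp_like(a: str, b: str) -> int:
    """Hypothetical strcmp()-style comparison: 0 means the strings are equal."""
    return (a > b) - (a < b)


def parse_attr_buggy(value: str) -> bool:
    # Wrong: a non-zero (truthy) result means the strings differ.
    return bool(strcmp_like(value, "uc"))


def parse_attr_fixed(value: str) -> bool:
    # Right: a match is signalled by a zero return value.
    return strcmp_like(value, "uc") == 0
```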

Reported-by: Roman Shaposhnik <roman@zededa.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/p2m-pt: fix (latent) page table mapping leak on do_recalc() error paths
Jan Beulich [Tue, 26 Nov 2019 13:17:11 +0000 (14:17 +0100)]
x86/p2m-pt: fix (latent) page table mapping leak on do_recalc() error paths

There are two mappings active in the middle of do_recalc(), and hence
commit 0d0f4d78e5d1 ("p2m: change write_p2m_entry to return an error
code") should have added (or otherwise invoked) unmapping code just
like it did in p2m_next_level(), despite us not expecting any errors
here. Arrange for the existing unmap invocation to take effect in all
cases.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/domctl: have XEN_DOMCTL_getpageframeinfo3 preemptible
Anthony PERARD [Tue, 26 Nov 2019 13:16:09 +0000 (14:16 +0100)]
x86/domctl: have XEN_DOMCTL_getpageframeinfo3 preemptible

This hypercall can take a long time to finish because it attempts to
grab the `hostp2m' lock up to 1024 times. The accumulated wait for the
lock can take several seconds.

This can easily happen with a guest with 32 vcpus and plenty of RAM,
during localhost migration.

While the patch doesn't fix the problem with the lock contention and
the fact that the `hostp2m' lock is currently global (and not on a
single page), it is still an improvement to the hypercall. It will in
particular, down the road, allow dropping the arbitrary limit of 1024
entries per request.
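
As a rough illustration of what making a batched operation preemptible
means, the Python sketch below processes entries one at a time and stops
early when asked to. process_batch and should_preempt are hypothetical
names, and the per-entry work is a placeholder; this is not the
hypercall's actual code.

```python
from typing import Callable, List, Tuple


def process_batch(entries: List[int], start: int,
                  should_preempt: Callable[[], bool]) -> Tuple[int, bool]:
    """Handle entries[start:], stopping early if preemption is requested.

    Returns (next_index, finished). A caller would re-issue the request
    from next_index when finished is False.
    """
    i = start
    while i < len(entries):
        # Always make progress on at least one entry before yielding.
        if i > start and should_preempt():
            return i, False
        entries[i] *= 2  # stand-in for the per-entry work
        i += 1
    return i, True
```

Because progress is recorded in the returned index, the size of one
request no longer bounds how long the caller is blocked in one go.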

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoIOMMU: default to always quarantining PCI devices
Jan Beulich [Tue, 26 Nov 2019 13:15:01 +0000 (14:15 +0100)]
IOMMU: default to always quarantining PCI devices

XSA-302 relies on the use of libxl's "assignable-add" feature to prepare
devices to be assigned to untrusted guests.

Unfortunately, this is not considered a strictly required step for
device assignment. The PCI passthrough documentation on the wiki
describes alternate ways of preparing devices for assignment, and
libvirt uses its own ways as well. Hosts where these alternate methods
are used will still leave the system in a vulnerable state after the
device comes back from a guest.

Default to always quarantining PCI devices, but provide a command line
option to revert back to prior behavior (such that people who both
sufficiently trust their guests and want to be able to use devices in
Dom0 again after they had been in use by a guest wouldn't need to
"manually" move such devices back from DomIO to Dom0).

This is XSA-306.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
5 years agox86: Don't increase ApicIdCoreSize past 7
George Dunlap [Tue, 26 Nov 2019 10:32:42 +0000 (10:32 +0000)]
x86: Don't increase ApicIdCoreSize past 7

Changeset ca2eee92df44 ("x86, hvm: Expose host core/HT topology to HVM
guests") attempted to "fake up" a topology which would induce guest
operating systems to not treat vcpus as sibling hyperthreads.  This
involved actually reporting hyperthreading as available, but giving
vcpus every other ApicId; which in turn led to doubling the ApicIds
per core by bumping the ApicIdCoreSize by one.  In particular, Ryzen
3xxx series processors, and reportedly EPYC "Rome" cpus, have an
ApicIdCoreSize of 7; the "fake" topology increases this to 8.

Unfortunately, Windows running on modern AMD hardware -- including
Ryzen 3xxx series processors, and reportedly EPYC "Rome" cpus --
doesn't seem to cope with this value being higher than 7.  (Linux
guests have so far continued to cope.)

A "proper" fix is complicated and it's too late to fix it either for
4.13, or to backport to supported branches.  As a short-term fix,
limit this value to 7.

This does mean that a Linux guest, booted on such a system without
this change, and then migrating to a system with this change, with
more than 64 vcpus, would see an apparent topology change.  This is a
low enough risk in practice that enabling this limit unilaterally, to
allow other guests to boot without manual intervention, is worth it.
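
The arithmetic behind the limit and the "more than 64 vcpus" remark can
be sketched as follows. This assumes ApicIdCoreSize is a bit count (so a
value of n gives 2**n ApicIds) and that vcpus take every other ApicId;
the function names are hypothetical and this is not the code from the
patch.

```python
APIC_ID_CORE_SIZE_MAX = 7  # the limit described above


def fake_topology_core_size(host_core_size: int) -> int:
    """Bump the host value by one (doubling ApicIds per core), capped at 7."""
    return min(host_core_size + 1, APIC_ID_CORE_SIZE_MAX)


def max_vcpus_with_alternate_ids(core_size: int) -> int:
    """vcpus that fit if they take every other ApicId out of 2**core_size."""
    return (1 << core_size) // 2
```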

Reported-by: Steven Haigh <netwiz@crc.id.au>
Reported-by: Andreas Kinzler <hfp@posteo.de>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/mm: Adjust linear uses / entries when a page loses validation
George Dunlap [Fri, 22 Nov 2019 18:52:02 +0000 (18:52 +0000)]
x86/mm: Adjust linear uses / entries when a page loses validation

"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or at another pagetable of the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).

XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both.  To implement this, we keep track of the number of
outstanding times a page points to, or is pointed to by, another page
table, to prevent both from happening at the same time.

Additionally, XSA-299 introduced a mode whereby if a page was known to
have been only partially validated, _put_page_type() would be called
with PTF_partial_set, indicating that if the page had been
de-validated by someone else, the type count should be left alone.

Unfortunately, this change did not account for the required accounting
for linear page table uses and entries; in the case that a previously
partially-devalidated pagetable was fully-devalidated by someone else,
the linear_pt_counts are not updated.

This could happen in one of two places:

1. In the case a partially-devalidated page was re-validated by
someone else

2. During domain tear-down, when pages are force-invalidated while
leaving the type count intact.

The second could be ignored, since at that point the pages can no
longer be abused; but the first requires handling.  Note however that
this would not be a security issue: having the counts be too high is
overly strict (i.e., will prevent a page from being used in a way
which is perfectly safe), but shouldn't cause any other issues.

Fix this by adjusting the linear counts when a page loses validation,
regardless of whether the de-validation completed or was only partial.
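
The "point to or be pointed to, but not both" accounting can be modelled
with the simplified Python sketch below. The class and method names are
hypothetical and the two separate counters are an assumption made for
readability; this is not Xen's data layout. It also shows why a stale
count is overly strict rather than unsafe.

```python
class Page:
    """Simplified model: a page is either a pointer or a target, not both."""

    def __init__(self) -> None:
        self.linear_entries = 0  # times this page points to a same-level page
        self.linear_uses = 0     # times this page is pointed to

    def inc_entries(self) -> bool:
        if self.linear_uses:     # already a target: refuse to also point
            return False
        self.linear_entries += 1
        return True

    def inc_uses(self) -> bool:
        if self.linear_entries:  # already a pointer: refuse to be a target
            return False
        self.linear_uses += 1
        return True

    def dec_entries(self) -> None:
        assert self.linear_entries > 0
        self.linear_entries -= 1

    def dec_uses(self) -> None:
        assert self.linear_uses > 0
        self.linear_uses -= 1
```

If a decrement is skipped when a page loses validation, a later
inc_uses() or inc_entries() on that page fails even though the use
would be safe.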

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: make default path to add/remove all PV devices
Oleksandr Grytsov [Thu, 21 Nov 2019 18:13:00 +0000 (20:13 +0200)]
libxl: make default path to add/remove all PV devices

Adding/removing a device is handled for specific devices only: VBD, VIF,
QDISK. This commit adds a default case to handle adding/removing for all PV
devices by default, except the QDISK device, which requires special handling.
If any other device requires special handling, it should be done by
implementing a separate case (similar to the QDISK device). The default
behaviour for adding a device is to wait until the backend goes to
XenbusStateInitWait and the default behaviour on removing a device is to
start the generic device remove procedure.

This commit also fixes guest removal: previously the guest was
removed when all VIF and VBD devices had been removed. The fix removes the
guest when all created devices have been removed. This is done by checking the
guest device list instead of checking num_vifs and num_vbds. The num_vifs and
num_vbds variables are removed as redundant in this case.

Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: introduce new backend type VINPUT
Oleksandr Grytsov [Thu, 21 Nov 2019 18:12:58 +0000 (20:12 +0200)]
libxl: introduce new backend type VINPUT

There are two kinds of VKBD devices: those with a QEMU backend and those with
a user space PV backend. In the current implementation they can't be
distinguished as both use the VKBD backend type. As a result, the user space PV
KBD backend is started and stopped as a QEMU backend. This commit adds a new
device kind VINPUT to be used as the backend type for the user space PV KBD
backend.

Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/vvmx: Fix livelock with XSA-304 fix
Andrew Cooper [Thu, 21 Nov 2019 18:21:49 +0000 (18:21 +0000)]
x86/vvmx: Fix livelock with XSA-304 fix

It turns out that the XSA-304 / CVE-2018-12207 fix of disabling executable
superpages doesn't work well with the nested p2m code.

Nested virt is experimental and not security supported, but is useful for
development purposes.  In order to not regress the status quo, disable the
XSA-304 workaround until the nested p2m code can be improved.

Introduce a per-domain exec_sp control and set it based on the current
opt_ept_exec_sp setting.  Take the opportunity to omit a PVH hardware domain
from the performance hit, because it is already permitted to DoS the system in
such ways as issuing a reboot.

When nested virt is enabled on a domain, force it to use executable
superpages and rebuild the p2m.

Having the setting per-domain involves rearranging the internals of
parse_ept_param_runtime() but it still retains the same overall semantics -
for each applicable domain whose setting needs to change, rebuild the p2m.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/livepatch: Prevent patching with active waitqueues
Andrew Cooper [Tue, 5 Nov 2019 19:08:14 +0000 (19:08 +0000)]
x86/livepatch: Prevent patching with active waitqueues

The safety of livepatching depends on every stack having been unwound, but
there is one corner case where this is not true.  The Sharing/Paging/Monitor
infrastructure may use waitqueues, which copy the stack frame sideways and
longjmp() to a different vcpu.

This case is rare, and can be worked around by pausing the offending
domain(s), waiting for their rings to drain, then performing a livepatch.

In the case that there is an active waitqueue, fail the livepatch attempt with
-EBUSY, which is preferable to the fireworks which occur from trying to unwind
the old stack frame at a later point.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/vlapic: allow setting APIC_SPIV_FOCUS_DISABLED in x2APIC mode
Roger Pau Monné [Fri, 22 Nov 2019 16:52:59 +0000 (17:52 +0100)]
x86/vlapic: allow setting APIC_SPIV_FOCUS_DISABLED in x2APIC mode

Current code unconditionally prevents setting APIC_SPIV_FOCUS_DISABLED
regardless of the processor model, which is not correct according to
the specification.

This issue was discovered while trying to boot a pvshim with x2APIC
enabled.

Always allow setting APIC_SPIV_FOCUS_DISABLED: the local APIC
provided to guests is emulated by Xen, and as such doesn't depend on
the features found on the hardware processor. Note for example that
Xen offers x2APIC support to guests even when the underlying hardware
doesn't have such feature.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoxen: Add missing va_end() in hypercall_create_continuation()
Julien Grall [Wed, 20 Nov 2019 13:37:51 +0000 (13:37 +0000)]
xen: Add missing va_end() in hypercall_create_continuation()

The documentation requires va_start() to always be matched with a
corresponding va_end(). However, this is not the case in the path used
for bad format.

This was introduced by XSA-296.

Coverity-ID: 1488727
Fixes: 0bf9f8d3e3 ("xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()")
Signed-off-by: Julien Grall <julien@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/e820: fix 640k - 1M region reservation logic
Sergey Dyasli [Wed, 30 Oct 2019 14:54:47 +0000 (14:54 +0000)]
x86/e820: fix 640k - 1M region reservation logic

Converting a guest from PV to PV-in-PVH leaves the guest with 384k
less memory, which may confuse the guest's balloon driver. This happens
because Xen unconditionally reserves the 640k - 1M region in E820 despite
the fact that it is really usable RAM in PVH boot mode.

Fix this by skipping region type change in virtualised environments,
trusting whatever memory map our hypervisor has provided.
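
The size of the region in question, and the decision described above,
can be sketched in Python as follows. reserved_low_bytes and its
running_virtualised parameter are hypothetical names; this is only an
illustration of the logic, not the e820 code.

```python
KIB = 1024
LOW_START = 640 * KIB   # start of the 640k - 1M region
LOW_END = 1024 * KIB    # end of the 640k - 1M region


def reserved_low_bytes(running_virtualised: bool) -> int:
    """Bytes of the 640k - 1M region that get forced to 'reserved'."""
    if running_virtualised:
        # Trust the memory map supplied by the outer hypervisor.
        return 0
    return LOW_END - LOW_START
```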

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/boot: Cache cpu_has_hypervisor very early on boot
Andrew Cooper [Fri, 1 Nov 2019 20:07:31 +0000 (20:07 +0000)]
x86/boot: Cache cpu_has_hypervisor very early on boot

We cache Long Mode and No Execute early on boot, so take the opportunity to
cache HYPERVISOR early as well.

Replace opencoded early access to the feature bit.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/boot: Remove cached CPUID data from the trampoline
Andrew Cooper [Mon, 9 Sep 2019 10:43:33 +0000 (11:43 +0100)]
x86/boot: Remove cached CPUID data from the trampoline

We have a cached cpuid_ext_features in the trampoline which is kept in sync by
various pieces of boot logic.  This is complicated, and all it is actually
used for is to derive whether NX is safe to use.

Replace it with a canned value to load into EFER.

trampoline_setup() and efi_arch_cpu() now tweak trampoline_efer at the point
that they are stashing the main copy of CPUID data.  Similarly,
early_init_intel() needs to tweak if it has re-enabled the use of NX.

This simplifies the AP boot and S3 resume paths by using trampoline_efer
directly, rather than locally turning FEATURE_NX into EFER_NX.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/Makefile: remove $(guard) use from $(TARGET).efi target
Anthony PERARD [Wed, 20 Nov 2019 16:12:12 +0000 (17:12 +0100)]
x86/Makefile: remove $(guard) use from $(TARGET).efi target

Following the patch 65d104984c04 ("x86: fix race to build
arch/x86/efi/relocs-dummy.o"), the error message
  nm: 'efi/relocs-dummy.o': No such file
started to appear on systems which can't build the .efi target. This is
because relocs-dummy.o isn't built anymore.
The error is printed by the evaluation of VIRT_BASE and ALT_BASE, which
aren't used anyway.

But we don't need that file, as we don't want to build `$(TARGET).efi'
anyway.  On such systems, $(guard) evaluates to the shell builtin ':',
which prevents any of the shell commands in `$(TARGET).efi' from being
executed.

Even if $(guard) is evaluated upon use, it depends on $(XEN_BUILD_PE)
which is evaluated at the assignment. So, we can replace $(guard) in
$(TARGET).efi by having two different rules depending on
$(XEN_BUILD_PE) instead.

The change with this patch is that none of the dependencies of
$(TARGET).efi will be built if the linker doesn't support PE
and VIRT_BASE and ALT_BASE don't get evaluated anymore, so nm will not
complain about the missing relocs-dummy.o file anymore.

Since prelink-efi.o isn't built on systems that can't build
$(TARGET).efi anymore, we can remove the $(guard) variable everywhere.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoefi: do not use runtime services table with efi=no-rs
Marek Marczykowski-Górecki [Wed, 20 Nov 2019 16:10:59 +0000 (17:10 +0100)]
efi: do not use runtime services table with efi=no-rs

Before dfcccc6631 "efi: use directmap to access runtime services table"
all usages of efi_rs pointer were guarded by efi_rs_enter(), which
implicitly refused to operate with efi=no-rs (by checking if
efi_l4_pgtable is NULL - which is the case for efi=no-rs). The said
commit (re)moved that call as unneeded for just reading content of
efi_rs structure - to avoid unnecessary page tables switch. But it
neglected to check if efi_rs access is legal.

Fix this by adding explicit check for runtime service being enabled in
the cases that do not use efi_rs_enter().

Reported-by: Roman Shaposhnik <roman@zededa.com>
Fixes: dfcccc6631 "efi: use directmap to access runtime services table"
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/cpuid: Fix Lisbon/Magny-Cours Opterons WRT SSSE3/SSE4A
Andrew Cooper [Tue, 19 Nov 2019 16:40:26 +0000 (16:40 +0000)]
x86/cpuid: Fix Lisbon/Magny-Cours Opterons WRT SSSE3/SSE4A

c/s ff66ccefe5 "x86/CPUID: adjust SSEn dependencies" made SSE4A depend on
SSSE3, but these processors really do have SSE4A without SSSE3.

This manifests as an upgrade regression, as the SSE4A feature disappears from
view.

Adjust the SSE4A feature to depend on SSE3 rather than SSSE3.
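
The effect of the dependency change can be illustrated with a small
Python sketch. apply_deps is a hypothetical helper that drops any
feature whose prerequisite is absent; it models the idea only and is not
Xen's CPUID dependency machinery.

```python
from typing import Dict, Set


def apply_deps(features: Set[str], deps: Dict[str, str]) -> Set[str]:
    """Drop every feature whose prerequisite is (transitively) absent."""
    result = set(features)
    changed = True
    while changed:
        changed = False
        for feat, needs in deps.items():
            if feat in result and needs not in result:
                result.discard(feat)
                changed = True
    return result


# A CPU with SSE4A but without SSSE3, as described above.
cpu = {"SSE3", "SSE4A"}
old_view = apply_deps(cpu, {"SSE4A": "SSSE3"})  # SSE4A is dropped
new_view = apply_deps(cpu, {"SSE4A": "SSE3"})   # SSE4A survives
```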

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoconfigure: Fix test for python 3.8
Anthony PERARD [Fri, 15 Nov 2019 16:15:32 +0000 (16:15 +0000)]
configure: Fix test for python 3.8

https://docs.python.org/3.8/whatsnew/3.8.html#debug-build-uses-the-same-abi-as-release-build

> To embed Python into an application, a new --embed option must be
> passed to python3-config --libs --embed to get -lpython3.8 (link the
> application to libpython). To support both 3.8 and older, try
> python3-config --libs --embed first and fallback to python3-config
> --libs (without --embed) if the previous command fails.
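
The try-then-fall-back approach quoted above could look like this in
Python. It is a sketch of the idea rather than the configure test
itself; python_libs, run_config and the injectable run parameter are
hypothetical names introduced for illustration.

```python
import subprocess
from typing import Callable, List


def run_config(args: List[str]) -> str:
    """Run a command and return its stdout; raises on a non-zero exit."""
    done = subprocess.run(args, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def python_libs(config: str = "python3-config",
                run: Callable[[List[str]], str] = run_config) -> str:
    """Try '--libs --embed' first, then fall back to plain '--libs'."""
    try:
        return run([config, "--libs", "--embed"])
    except (subprocess.CalledProcessError, OSError):
        return run([config, "--libs"])
```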

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
[ wei: rerun autogen.sh ]

5 years agotools/configure: Honour XEN_COMPILE_ARCH and _TARGET_ for shim
Ian Jackson [Tue, 29 Oct 2019 17:45:30 +0000 (17:45 +0000)]
tools/configure: Honour XEN_COMPILE_ARCH and _TARGET_ for shim

The pvshim can only be built 64-bit because the hypervisor is only
64-bit nowadays.  The hypervisor build supports XEN_COMPILE_ARCH and
XEN_TARGET_ARCH which override the information from uname.  The pvshim
build runs out of the tools/ directory but calls the hypervisor build
system.

If one runs in a Linux 32-bit userland with a 64-bit kernel, one used
to be able to set XEN_COMPILE_ARCH.  But nowadays this does not work.
configure sees the target cpu as 64-bit and tries to build pvshim.
The build prints
  echo "*** Xen x86/32 target no longer supported!"
and doesn't build anything.  Then the subsequent Makefiles try to
install the non-built pieces.

Fix this anomaly by causing configure to honour the Xen hypervisor way
of setting the target architecture.

In principle this user behaviour is not handled quite right, because
configure will still see 64-bit and so all the autoconf-based
architecture testing will see 64-bit rather than 32-bit x86.  But the
tools are in fact generally quite portable: this particular location
in configure{.ac,} is the only place in tools/ where 64-bit x86 is
treated differently from 32-bit x86, so the fix is sufficient and
correct for this use case.

It remains the case that setting XEN_COMPILE_ARCH or XEN_TARGET_ARCH to a
non-x86 architecture, when configure thinks things are x86, or vice
versa, will not work right.

(This is a bugfix to 8845155c831c
  pvshim: make PV shim build selectable from configure
which inadvertently deleted the logic to only build the shim for
XEN_TARGET_ARCH != x86_32.)

I have rerun autogen.sh, so this patch contains the fix to configure
as well as the source fix to configure.ac.

Fixes: 8845155c831c59e867ee3dd31ee63e0cc6c7dcf2
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
CC: Olaf Hering <olaf@aepfle.de>
CC: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Jürgen Groß <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
5 years agolibxl: gentypes: initialise array elements in json
Oleksandr Grytsov [Mon, 28 Oct 2019 18:22:16 +0000 (18:22 +0000)]
libxl: gentypes: initialise array elements in json

Currently, array elements are initialized with calloc, which means
all element fields are initialized to zero.  If an entry is not
present in the json (which is entirely permitted), the element will be
all-bits-zero instead of the default value (which is wrong).

The fix is to initialise each array element before parsing it, using
the new libxl_C_type_do_init function.

With existing types this results in a lot of new calls like this:

      for (i=0; (t=libxl__json_array_get(x,i)); i++) {
 +            libxl_sched_params_init(&p->vcpus[i]);
              rc = libxl__sched_params_parse_json(gc, t, &p->vcpus[i]);

(indentation adjusted).  This looks right.  To check what happens with
types which have nontrivial defaults but don't have init functions (of
which we currently have none in arrays), I (Ian) experimentally added:

      ("pnode", uint32), # physical node of this node
      ("vcpus", libxl_bitmap), # vcpus in this node
 +    ("sporks", Array(MemKB, "num_sporks")),
      ])

The result was this:

          for (i=0; (t=libxl__json_array_get(x,i)); i++) {
 +                p->sporks[i] = LIBXL_MEMKB_DEFAULT;
                  rc = libxl__uint64_parse_json(gc, t, &p->sporks[i]);

where the context was added by adding "sporks" and "+" indicates a
line added by this patch, "initialise array elements in json".
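
The difference between zero-filling and initialising each element can
also be modelled in a few lines of Python. parse_array, the "memkb" key
and the MEMKB_DEFAULT value are hypothetical stand-ins chosen for this
sketch; they are not the generated libxl code.

```python
MEMKB_DEFAULT = -1  # hypothetical stand-in for a non-zero default value


def parse_array(json_items, init_elements: bool):
    """Parse a list of dicts; each may omit the 'memkb' key."""
    out = []
    for item in json_items:
        # calloc-style zero fill versus explicit per-element initialisation.
        elem = {"memkb": MEMKB_DEFAULT if init_elements else 0}
        elem.update(item)  # keys present in the JSON override the initial value
        out.append(elem)
    return out
```

With init_elements set to False, an element that omits the key ends up
as 0 and is indistinguishable from an explicit zero.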

Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
---
v2 [iwj]: Use libxl_C_type_do_init.
          Reword commit message and discuss spork testing.

5 years agolibxl: gentypes.py: Break out libxl_C_type_do_init
Ian Jackson [Tue, 29 Oct 2019 15:19:33 +0000 (15:19 +0000)]
libxl: gentypes.py: Break out libxl_C_type_do_init

This is going to be the common way to initialise things.
_libxl_C_type_init remains the thing for generating the body of the
init function, and for some special cases.

No functional change with existing types: C output is identical.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
5 years agolibxl: gentypes.py: Break out field_pass in ..._copy_deprecated
Ian Jackson [Tue, 29 Oct 2019 15:17:58 +0000 (15:17 +0000)]
libxl: gentypes.py: Break out field_pass in ..._copy_deprecated

We are going to want this in a moment.

No functional change with existing types: C output is identical.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
5 years agotools/libxl: gentypes.py: Prefer init_val to init_fn
Ian Jackson [Tue, 29 Oct 2019 15:00:35 +0000 (15:00 +0000)]
tools/libxl: gentypes.py: Prefer init_val to init_fn

When both are provided, init_val is likely to be more direct.

No functional change with existing types: C output is identical.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
5 years agolibxl_pci: Don't hold QMP connection while waiting
Anthony PERARD [Thu, 31 Oct 2019 12:17:27 +0000 (12:17 +0000)]
libxl_pci: Don't hold QMP connection while waiting

After sending the 'device_del' command for a PCI passthrough device,
we wait until QEMU has effectively deleted the device, which involves
executing more QMP commands. While waiting, libxl holds the connection.

It isn't necessary to hold the connection and it prevents others from
making progress, so this patch releases the QMP connection.

For background:
    e.g., when a guest is created with several pci passthrough devices
    attached, on `xl destroy` all the devices need to be detached, and
    this is usually what happens:
- 'device_del' called for the 1st pci device
- 'query-pci' checking if pci still there, it is
- wait 1s
- 'query-pci' checking again, and it's gone
-> now the same can be done for the second pci device, so
plenty of waiting on others when pci detach can be done in
parallel.

    On shutdown, libxl usually keeps waiting because QEMU never
    releases the device because the guest kernel never responds to QEMU's
    unplug queries. So detaching of the 1st device waits until a
    timeout stops it, and since the same timeout is setup at the same
    time for the other devices to detach, the 'device_del' command is
    never sent for those.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl_qmp: Have a lock for QMP socket access
Anthony PERARD [Mon, 18 Nov 2019 17:13:08 +0000 (17:13 +0000)]
libxl_qmp: Have a lock for QMP socket access

This patch works around the fact that it's not possible to connect
multiple times to a single QMP socket. QEMU listens on the socket with
a backlog value of 1, which means that on Linux when concurrent threads
call connect() on the socket, they get EAGAIN.

Background:
    This happens when attempting to create a guest with multiple
    passed-through pci devices: libxl creates one connection per device to
    attach and executes connect() on all at once before any single
    connection has finished.

To work around this, we use a new lock.
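
The general idea of serialising connection attempts can be sketched as
below. This uses Python threads and an in-process lock purely for
illustration; connect_lock and attach_device are hypothetical names, and
the sketch does not show how libxl implements its lock.

```python
import threading
import time

connect_lock = threading.Lock()  # hypothetical lock guarding the socket
stats_lock = threading.Lock()
in_flight = 0
max_in_flight = 0


def attach_device(dev: int) -> None:
    """Stand-in for 'connect, run commands, disconnect' for one device."""
    global in_flight, max_in_flight
    with connect_lock:               # only one connection attempt at a time
        with stats_lock:
            in_flight += 1
            max_in_flight = max(max_in_flight, in_flight)
        time.sleep(0.01)             # pretend to talk to the socket
        with stats_lock:
            in_flight -= 1


threads = [threading.Thread(target=attach_device, args=(d,)) for d in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After all threads finish, max_in_flight shows that no two attempts ever
overlapped.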

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: Introduce libxl__ev_immediate
Anthony PERARD [Mon, 18 Nov 2019 18:10:14 +0000 (18:10 +0000)]
libxl: Introduce libxl__ev_immediate

This new ev allows arranging for a non-reentrant callback to be called.
This happens immediately after the current event is processed and after
other ev_immediates that have already been registered.
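
The ordering guarantee can be modelled with a simple FIFO queue, as in
this Python sketch. register_immediate and process_event are
hypothetical names for illustration and do not reflect the libxl
interface.

```python
from collections import deque
from typing import Callable, Deque, List

immediates: Deque[Callable[[], None]] = deque()
log: List[str] = []


def register_immediate(cb: Callable[[], None]) -> None:
    """Queue cb; it is not called from inside the registering code."""
    immediates.append(cb)


def process_event(handler: Callable[[], None]) -> None:
    """Run one event handler, then drain queued callbacks in FIFO order."""
    handler()
    while immediates:
        immediates.popleft()()


def handler() -> None:
    register_immediate(lambda: log.append("first"))
    register_immediate(lambda: log.append("second"))
    log.append("handler done")


process_event(handler)
```

The callbacks run only once the handler has returned, which is what
makes them non-reentrant with respect to the code that registered them.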

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: libxl__ev_qmp_send now takes an egc
Anthony PERARD [Mon, 18 Nov 2019 17:13:06 +0000 (17:13 +0000)]
libxl: libxl__ev_qmp_send now takes an egc

No functional changes.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>