x86/mapcache: initialise the mapcache even for the idle domain
In situations where PMAP cannot be used or the mapcache of a domain is
simply not ready, we need to have a mapcache in the idle domain to map
pages when there is no direct map.
Wei Liu [Fri, 8 Feb 2019 17:19:26 +0000 (17:19 +0000)]
x86/mm: drop _new suffix for page table APIs
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyax@amazon.com>
---
Changed since v1:
- Fix rebase conflicts against new master and other changes since v1.
Changed since v2:
- Also drop _new for the fix of l2t leak.
Wei Liu [Tue, 29 Jan 2019 14:40:26 +0000 (14:40 +0000)]
x86_64/mm: switch to new APIs in paging_init
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyax@amazon.com>
---
Changed since v1:
- Use a global mapping for compat_idle_pg_table_l2, otherwise
l2_ro_mpt will unmap it.
Wei Liu [Tue, 29 Jan 2019 12:54:48 +0000 (12:54 +0000)]
x86/mm: change pl*e to l*t in virt_to_xen_l*e
We will need to have a variable named pl*e when we rewrite
virt_to_xen_l*e. Change pl*e to l*t to better reflect its purpose.
This will make reviewing later patches easier.
No functional change.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyax@amazon.com>
Wei Liu [Mon, 28 Jan 2019 18:10:10 +0000 (18:10 +0000)]
x86/mm: introduce l{1,2}t local variables to modify_xen_mappings
The pl2e and pl1e variables are heavily (ab)used in that function. This
is fine at the moment because all page tables are always mapped, so
there is no need to track the lifetime of each variable.
We will soon have the requirement to map and unmap page tables. We
need to track the lifetime of each variable to avoid leakage.
Introduce some l{1,2}t variables with limited scope so that we can
track the lifetime of pointers to Xen page tables more easily.
Wei Liu [Mon, 28 Jan 2019 17:54:24 +0000 (17:54 +0000)]
x86/mm: introduce l{1,2}t local variables to map_pages_to_xen
The pl2e and pl1e variables are heavily (ab)used in that function. This
is fine at the moment because all page tables are always mapped, so
there is no need to track the lifetime of each variable.
We will soon have the requirement to map and unmap page tables. We
need to track the lifetime of each variable to avoid leakage.
Introduce some l{1,2}t variables with limited scope so that we can
track the lifetime of pointers to Xen page tables more easily.
Wei Liu [Wed, 23 Jan 2019 15:33:07 +0000 (15:33 +0000)]
x86: introduce a new set of APIs to manage Xen page tables
We are going to switch to using domheap pages for page tables.
A new set of APIs is introduced to allocate, map, unmap and free pages
for page tables.
The allocation and deallocation work on mfn_t rather than page_info,
because they are required to work even before the frame table is set up.
Implement the old functions with the new ones. We will rewrite, site
by site, other mm functions that manipulate page tables to use the new
APIs.
Note that these new APIs still use xenheap pages underneath and no
actual map and unmap is done, so that we don't break Xen halfway. They
will be switched to use domheap and dynamic mappings when usage of the
old APIs is eliminated.
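The shape of such an API can be modelled in user space. The following is only a sketch under stated assumptions: the names pt_alloc(), pt_map(), pt_unmap(), pt_free() and pt_alloc_old() are hypothetical and are not the functions added by this commit, and a small static array stands in for physical memory. It illustrates allocation by frame number, a map step that is currently just an address computation, an unmap step that is a no-op, and an old-style pointer-returning allocator built on top.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PTES_PER_PAGE 512u        /* one 4 KiB page of 8-byte entries */
#define NR_FRAMES     8u          /* size of the toy "physical memory" */
#define INVALID_MFN   UINT64_MAX

typedef uint64_t pte_t;
typedef struct { uint64_t m; } mfn_t;   /* opaque frame number */

/* Hypothetical backing store: every frame is permanently addressable. */
static pte_t frames[NR_FRAMES][PTES_PER_PAGE];
static unsigned char frame_used[NR_FRAMES];

/* Allocate one zeroed page-table frame; INVALID_MFN when none is left. */
mfn_t pt_alloc(void)
{
    for (uint64_t i = 0; i < NR_FRAMES; i++) {
        if (!frame_used[i]) {
            frame_used[i] = 1;
            memset(frames[i], 0, sizeof(frames[i]));
            return (mfn_t){ i };
        }
    }
    return (mfn_t){ INVALID_MFN };
}

/* "Map" a frame: in this model it only computes an address. */
pte_t *pt_map(mfn_t mfn)
{
    return mfn.m < NR_FRAMES ? frames[mfn.m] : NULL;
}

/* "Unmap": deliberately a no-op, since nothing was dynamically mapped. */
void pt_unmap(const pte_t *va)
{
    (void)va;
}

/* Return a frame to the pool. */
void pt_free(mfn_t mfn)
{
    if (mfn.m < NR_FRAMES)
        frame_used[mfn.m] = 0;
}

/* Old-style allocator returning a pointer, built from the new calls. */
pte_t *pt_alloc_old(void)
{
    mfn_t mfn = pt_alloc();

    return mfn.m == INVALID_MFN ? NULL : pt_map(mfn);
}
```

Because callers already pair every pt_map() with a pt_unmap(), the bodies of those two functions could later be replaced by real dynamic mappings without touching the call sites.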
Roger Pau Monne [Tue, 1 Oct 2019 15:22:33 +0000 (17:22 +0200)]
libxl: wait for the ack when issuing power control requests
Currently only suspend power control requests wait for an ack from the
domain, while power off or reboot requests simply write the command to
xenstore and exit.
Introduce a 1 minute wait for the domain to acknowledge the request, or
else return an error. The suspend code is slightly modified to use the
new infrastructure added, but shouldn't have any functional change.
Fix the ocaml bindings and also provide a backwards compatible
interface for the reboot and poweroff libxl API functions.
Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
[ wei: change ret to rc to fix build ] Signed-off-by: Wei Liu <wl@xen.org>
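The wait-for-ack behaviour described above can be illustrated with a bounded polling loop. This is a simplified sketch and not the libxl implementation: wait_for_ack(), the ACK_* codes and the fake_guest test double are all hypothetical names, and the time limit is modelled as a maximum number of polls rather than a one minute timer.

```c
#include <stdbool.h>

#define ACK_OK        0
#define ACK_TIMEOUT (-1)

/* Callback reporting whether the request has been acknowledged. */
typedef bool (*ack_poll_fn)(void *ctx);

/*
 * Poll for an acknowledgement at most max_polls times.
 * Returns ACK_OK as soon as the callback reports an ack,
 * or ACK_TIMEOUT once the budget is exhausted.
 */
int wait_for_ack(ack_poll_fn acked, void *ctx, unsigned int max_polls)
{
    for (unsigned int i = 0; i < max_polls; i++) {
        if (acked(ctx))
            return ACK_OK;
        /* A real implementation would sleep or block on an event here. */
    }
    return ACK_TIMEOUT;
}

/* Test double: a "guest" that acknowledges after a fixed number of polls. */
struct fake_guest {
    unsigned int polls_until_ack;
};

bool fake_guest_acked(void *ctx)
{
    struct fake_guest *g = ctx;

    if (g->polls_until_ack == 0)
        return true;
    g->polls_until_ack--;
    return false;
}
```

With this shape, power off and reboot requests report an error when the guest never reacts, instead of returning success right after writing the command.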
Jan Beulich [Wed, 2 Oct 2019 11:38:02 +0000 (13:38 +0200)]
tools/xen-cpuid: avoid producing bogus output
I was (mistakenly, as - looking at the code - it's clearly not intended
to work) passing the tool "Raw" and "Host" as command line arguments.
Avoid printing just "Raw " with not even a newline at the end in
such a case. Instead report what wasn't understood by the parsing logic.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Wed, 2 Oct 2019 11:37:43 +0000 (13:37 +0200)]
MAINTAINERS: add tools/misc/xen-cpuid to "X86 ARCHITECTURE"
Along the lines of other x86-specific pieces under tools/.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Sergey Dyasli [Wed, 2 Oct 2019 11:35:44 +0000 (13:35 +0200)]
microcode: rendezvous CPUs in NMI handler and load ucode
When one core is loading ucode, handling an NMI on sibling threads or
on other cores in the system might be problematic. Rendezvousing
all CPUs in the NMI handler prevents NMI acceptance during ucode
loading.
Basically, some work previously done in stop_machine context is
moved to the NMI handler. Primary threads call in and load ucode in
the NMI handler. Secondary threads wait for the completion of ucode
loading on all CPU cores. An option is introduced to disable this
behavior.
The control thread doesn't rendezvous in the NMI handler by calling
self_nmi() (in case unknown_nmi_error() is triggered). The side effect
is that the control thread might be handling an NMI while other threads
are loading ucode. If a ucode update is to change something shared by a
whole socket, the control thread may be accessing things that are being
updated by the ucode loading on other cores. This is not safe. Update
ucode on the control thread first to mitigate this issue.
Igor Druzhinin [Tue, 1 Oct 2019 19:15:57 +0000 (20:15 +0100)]
x86/crash: force unlock console before printing on kexec crash
There is a small window in which a shootdown NMI might arrive at a CPU
(e.g. in the serial interrupt handler) while the console lock is taken.
To avoid leaving subsequent console prints waiting forever for shot-down
CPUs to release the lock, force unlock the console.
The race has been frequently observed while crashing nested Xen in
an HVM domain.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
xen/arm: Implement workaround for Cortex A-57 and Cortex A72 AT speculate
Both Cortex-A57 (erratum 1319537) and Cortex-A72 (erratum 1319367) can
end up with corrupted TLBs if they speculate an AT instruction while the
S1/S2 system registers are in an inconsistent state.
The workaround is the same as for Cortex-A76 implemented by commit a18be06aca "xen/arm: Implement workaround for Cortex-A76 erratum 1165522",
so it is only necessary to plumb in the cpuerrata framework.
Julien Grall [Wed, 21 Aug 2019 21:42:31 +0000 (22:42 +0100)]
xen/arm: domain_build: Don't continue if unable to allocate all dom0 banks
Xen will only print a warning if some memory is left unallocated when
using 1:1 mapping (only used by dom0). This also includes the case where
no memory has been allocated.
This can lead to all sorts of issues that are hard for users to diagnose
(the warning can be difficult to spot or may be disregarded).
If users request 1GB of memory, then most likely they want that exact
amount and not 512MB. So panic if not all of the memory has been
allocated.
After this change, the behavior is the same as for non-1:1 memory
allocation (used by domU).
At the same time, reflow the message to have the format on a single
line.
Julien Grall [Fri, 9 Aug 2019 12:59:15 +0000 (13:59 +0100)]
xen/arm: p2m: Free the p2m entry after flushing the IOMMU TLBs
When freeing a p2m entry, all the sub-tree behind it will also be freed.
This may include intermediate page-tables or any L3 entry that requires
dropping a reference (e.g. for foreign pages). As soon as pages are freed,
they may be re-used by Xen or another domain. Therefore it is necessary
to flush *all* the TLBs beforehand.
While CPU TLBs will be flushed before freeing the pages, this is not
the case for IOMMU TLBs. This can be solved by moving the IOMMU TLBs
flush earlier in the code.
This wasn't considered a security issue as device passthrough on Arm
is not security supported.
xen/arm: domain_build: Avoid implicit conversion from ULL to UL
Clang 8.0 will fail to build domain_build.c on Arm32 because of the
following error:
domain_build.c:448:21: error: implicit conversion from 'unsigned long long' to 'unsigned long' changes value from 1090921693184 to 0
[-Werror,-Wconstant-conversion]
bank_size = MIN(GUEST_RAM1_SIZE, kinfo->unassigned_mem);
Arm32 is able to support more than 4GB of physical memory, so it would
be theoretically possible to create a domain with more than 4GB of RAM.
Therefore, the size of a bank may not fit in 32 bits.
This can be resolved by switching the variable bank_size and the
parameter tot_size to "paddr_t".
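The truncation reported by Clang can be reproduced in isolation. The sketch below is illustrative only: the constant is the value quoted in the error message, uint32_t stands in for the 32-bit 'unsigned long' of Arm32, paddr_t is assumed to be 64 bits wide, and the function names are hypothetical.

```c
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Value taken from the Clang diagnostic: 1090921693184 == 0xfe00000000. */
#define SKETCH_GUEST_RAM1_SIZE 0xfe00000000ULL

/* Assumption for this sketch: a 64-bit physical address type. */
typedef uint64_t paddr_t;

/* Narrow variant: the result is squeezed into 32 bits, as on Arm32. */
uint32_t bank_size_narrow(uint64_t unassigned_mem)
{
    return (uint32_t)MIN(SKETCH_GUEST_RAM1_SIZE, unassigned_mem);
}

/* Wide variant: the result keeps its full 64-bit value. */
paddr_t bank_size_wide(paddr_t unassigned_mem)
{
    return MIN(SKETCH_GUEST_RAM1_SIZE, unassigned_mem);
}
```

The low 32 bits of 0xfe00000000 are all zero, which is why the diagnostic reports the value changing to 0.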
GAS 2.25.0 throws multiple errors when building arm32/head.S:
arm32/head.S: Assembler messages:
arm32/head.S:452: Error: invalid constant (f7f) after fixup
arm32/head.S:453: Error: invalid constant (f7f) after fixup
arm32/head.S:495: Error: invalid constant (f7f) after fixup
arm32/head.S:510: Error: invalid constant (f7f) after fixup
arm32/head.S:514: Error: invalid constant (f7f) after fixup
arm32/head.S:516: Error: invalid constant (f7f) after fixup
arm32/head.S:633: Error: invalid constant (f7f) after fixup
This makes sense because the instruction mov is only able to deal with a
specific set of immediates (see "modified immediate constants in ARM
instructions"). For any 16-bit immediate, the instruction movw should be
used.
It looks like newer versions of GAS will seamlessly switch to movw if the
immediate does not fit in the immediate encoding for mov. But we should
not rely on this. So switch to movw.
Fixes: 23dfe48d10 ("xen/arm32: head: Introduce macros to create table and mapping entry") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
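The encoding restriction can be checked mechanically. An A32 modified immediate is an 8-bit value rotated right by an even amount, so a constant fits mov only if some even left rotation brings it back into the range 0..0xff. The helper names below are hypothetical and serve only to illustrate why 0xf7f is rejected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by r bits (r is taken modulo 32). */
static uint32_t rotl32(uint32_t v, unsigned int r)
{
    r &= 31;
    return r ? (v << r) | (v >> (32 - r)) : v;
}

/*
 * True if v can be written as an 8-bit value rotated right by an even
 * amount, i.e. if it fits the A32 "modified immediate" form used by mov.
 */
bool arm_mov_imm_ok(uint32_t v)
{
    for (unsigned int rot = 0; rot < 32; rot += 2)
        if (rotl32(v, rot) <= 0xffu)
            return true;
    return false;
}

/* movw accepts any 16-bit immediate. */
bool arm_movw_imm_ok(uint32_t v)
{
    return v <= 0xffffu;
}
```

0xf7f has eleven bits set, so no rotation can fit it into eight bits, while it is well within the 16-bit range accepted by movw.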
Jan Beulich [Mon, 30 Sep 2019 13:46:24 +0000 (15:46 +0200)]
x86: correct bogus error indicator of cpu_add()
Commit 54ce2db8b8 ("x86/numa: adjust datatypes for node and pxm")
changed this from the -1 (i.e. -EPERM, which was already bogus) that
comes back from setup_node() to NUMA_NO_NODE (0xff). Use a proper error
indicator instead.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Mon, 30 Sep 2019 13:45:16 +0000 (15:45 +0200)]
x86emul: move ARPL #UD check
The #UD for being outside of protected mode gets raised for ARPL only
after having read the memory operand - correct this by moving up the
respective construct.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 3 Sep 2019 13:58:08 +0000 (15:58 +0200)]
ns16550: make PCI device hiding uniform
The difference between pci_hide_device() and pci_ro_device() is that
the former only prevents a device from getting assigned to a guest,
while the latter additionally arranges for Dom0 write attempts to the
device's config space to be ignored/discarded. Whether we want one or
the other certainly doesn't depend on whether the device is in our set
of known devices. All that matters is whether we use a PCI device: Call
pci_ro_device() in any such case.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
xen/sched: move struct task_slice into struct sched_unit
In order to prepare for multiple vcpus per schedule unit move struct
task_slice in schedule() from the local stack into struct sched_unit
of the currently running unit. To make access easier for the individual
schedulers, add the pointer to the currently running unit as a parameter
of do_schedule().
While at it switch the tasklet_work_scheduled parameter of
do_schedule() from bool_t to bool.
As struct task_slice is only ever modified with the local schedule
lock held it is safe to directly set the different units in struct
sched_unit instead of using an on-stack copy for returning the data.
xen/sched: Change vcpu_migrate_*() to operate on schedule unit
vcpu_migrate_start() and vcpu_migrate_finish() are used only to ensure
a vcpu is running on a suitable processor, so they can be switched to
operate on schedule units instead of vcpus.
While doing that rename them accordingly.
Call vcpu_sync_execstate() for each vcpu of the unit when changing
processors in order to make that an explicit action (otherwise this
would happen later when either the vcpu is scheduled on the new
processor or another non-idle vcpu is scheduled on the old processor).
vcpu_move_locked() is switched to schedule unit, too.
xen/sched: add runstate counters to struct sched_unit
Add counters to struct sched_unit summing up runstates of associated
vcpus. This allows doing quick checks whether a unit has any vcpu
running or whether only a single vcpu of a unit is running.
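A minimal model of such counters, assuming one counter per runstate, is sketched below. The struct, enum and function names are hypothetical and are not the ones used in the scheduler code. Each vcpu state transition decrements the counter of the old state and increments the counter of the new one.

```c
#include <stdbool.h>

/* The four runstates, used here as indices into the counter array. */
enum runstate {
    RS_RUNNING,
    RS_RUNNABLE,
    RS_BLOCKED,
    RS_OFFLINE,
    RS_COUNT
};

/* Hypothetical stand-in for the per-unit bookkeeping. */
struct unit_sketch {
    unsigned int runstate_cnt[RS_COUNT];
};

/* Account for a vcpu joining the unit in the given state. */
void unit_add_vcpu(struct unit_sketch *u, enum runstate initial)
{
    u->runstate_cnt[initial]++;
}

/* Account for one vcpu of the unit moving between two states. */
void unit_runstate_change(struct unit_sketch *u,
                          enum runstate from, enum runstate to)
{
    u->runstate_cnt[from]--;
    u->runstate_cnt[to]++;
}

/* Quick check: is any vcpu of the unit running? */
bool unit_any_running(const struct unit_sketch *u)
{
    return u->runstate_cnt[RS_RUNNING] > 0;
}

/* Quick check: is exactly one vcpu of the unit running? */
bool unit_single_running(const struct unit_sketch *u)
{
    return u->runstate_cnt[RS_RUNNING] == 1;
}
```

Both checks are constant time and avoid walking the vcpus of the unit.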
xen/sched: switch schedule() from vcpus to sched_units
Use sched_units instead of vcpus in schedule(). This includes the
introduction of sched_unit_runstate_change() as a replacement of
vcpu_runstate_change() in schedule().
xen/sched: use sched_resource cpu instead smp_processor_id in schedulers
Especially in the do_schedule() functions of the different schedulers,
using smp_processor_id() for the local cpu number is correct only if
the sched_unit is a single vcpu. As soon as larger sched_units are
used, most uses should be replaced by the master_cpu number of the local
sched_resource instead.
Add a helper to get that sched_resource master_cpu and modify the
schedulers to use it in a correct way.
Today there are two distinct scenarios for vcpu_create(): either for
creation of idle-domain vcpus (vcpuid == processor) or for creation of
"normal" domain vcpus (including dom0), where the caller selects the
initial processor on a round-robin scheme of the allowed processors
(allowed being based on cpupool and affinities).
Instead of passing the initial processor to vcpu_create() and passing
it on to sched_init_vcpu(), let sched_init_vcpu() do the processor
selection. For supporting dom0 vcpu creation use the node_affinity of
the domain as a base for selecting the processors. User domains will
have initially all nodes set, so this is no different behavior compared
to today. In theory this is not guaranteed as vcpus are created only
with XEN_DOMCTL_max_vcpus being called, but this call is going to be
removed in the future and the toolstack doesn't call
XEN_DOMCTL_setnodeaffinity before calling XEN_DOMCTL_max_vcpus.
To be able to use const struct domain * make cpupool_domain_cpumask()
take a const domain pointer, too.
A further simplification is possible by having a single function for
creating the dom0 vcpus with vcpu_id > 0 and doing the required pinning
for all vcpus after that. This allows making sched_set_affinity()
private to schedule.c and switching it to sched_units easily. Note that
this functionality is x86 only.
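The round-robin selection mentioned above amounts to picking the next allowed processor after the one used for the previous vcpu, wrapping around at the end of the mask. The sketch below models the allowed set as a 64-bit mask; mask_cycle() is a hypothetical helper written for this illustration and is not the function used in the scheduler code.

```c
#include <stdint.h>

/*
 * Return the next cpu set in 'allowed' strictly after 'prev', wrapping
 * around; 'prev' may be -1 to start from cpu 0. Returns -1 if the mask
 * is empty. If 'prev' is the only allowed cpu, it is returned again.
 */
int mask_cycle(int prev, uint64_t allowed)
{
    for (int i = 1; i <= 64; i++) {
        int cpu = (prev + i) % 64;

        if (allowed & (UINT64_C(1) << cpu))
            return cpu;
    }
    return -1;
}
```

Calling it once per newly created vcpu, each time with the processor chosen for the previous vcpu, spreads the vcpus over the allowed processors.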
xen: add sched_unit_pause_nosync() and sched_unit_unpause()
The credit scheduler calls vcpu_pause_nosync() and vcpu_unpause()
today. Add sched_unit_pause_nosync() and sched_unit_unpause() to
perform the same operations on scheduler units instead.
xen/sched: add is_running indicator to struct sched_unit
Add an is_running indicator to struct sched_unit which will be set
whenever the unit is being scheduled. Switch scheduler code to use
unit->is_running instead of vcpu->is_running for scheduling decisions.
At the same time introduce a state_entry_time field in struct
sched_unit being updated whenever the is_running indicator is changed.
Use that new field in the schedulers instead of the similar vcpu field.