Add a new memory policy option for the iomem parameter.
Possible values are:
- arm_dev_nGnRE: Device-nGnRE memory, the default on Arm
- arm_mem_WB: write-back cacheable memory
- x86_UC_minus: uncacheable memory, the default on x86
Store the parameter in a new field in libxl_iomem_range.
Pass the memory policy option to xc_domain_mem_map_policy.
Do the libxl to libxc value conversion in per-arch functions so that we
can return an error for x86 parameters on Arm and vice versa.
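A minimal sketch of the Arm-side conversion (function name and exact
error handling are illustrative, not the series' code):

    static int libxl__arch_memory_policy_to_xc(libxl__gc *gc,
                                               libxl_memory_policy policy,
                                               uint32_t *xc_policy)
    {
        switch (policy) {
        case LIBXL_MEMORY_POLICY_ARM_DEV_NGNRE:
            *xc_policy = MEMORY_POLICY_ARM_DEV_nGnRE;
            return 0;
        case LIBXL_MEMORY_POLICY_ARM_MEM_WB:
            *xc_policy = MEMORY_POLICY_ARM_MEM_WB;
            return 0;
        default:
            /* e.g. x86_UC_minus requested for an Arm guest. */
            LOG(ERROR, "memory policy not supported on this architecture");
            return ERROR_INVAL;
        }
    }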
Andrew suggested removing MEMORY_POLICY_X86_UC_MINUS and x86_UC_minus
completely. If that's the consensus, I am happy to respin the series
removing that code.
Changes in v3:
- s/nGRE/nGnRE/g
- s/LIBXL_MEMORY_POLICY_ARM_DEV_NGRE/LIBXL_MEMORY_POLICY_ARM_DEV_NGNRE/g
- s/arm_devmem/arm_dev_nGnRE/g
- s/arm_memory/arm_mem_WB/g
- improve commit message
- improve man page
- s/MEMORY_POLICY_X86_UC/MEMORY_POLICY_X86_UC_MINUS/g
- s/x86_uc/x86_UC_minus/g
- move security support clarification to a separate patch
Changes in v2:
- add #define LIBXL_HAVE_MEMORY_POLICY
- ability to parse the memory policy parameter even if gfn is not passed
- rename cache_policy to memory policy
- rename MEMORY_POLICY_DEVMEM to MEMORY_POLICY_ARM_DEV_nGRE
- rename MEMORY_POLICY_MEMORY to MEMORY_POLICY_ARM_MEM_WB
- rename memory to arm_memory and devmem to arm_devmem
- expand the non-security support status to non device passthrough iomem
configurations
- rename iomem options
- add x86 specific iomem option
Introduce a new libxc function that makes use of the new memory_policy
parameter added to the XEN_DOMCTL_memory_mapping hypercall.
The parameter values are the same as for the XEN_DOMCTL_memory_mapping
hypercall (0 is MEMORY_POLICY_DEFAULT). Pass MEMORY_POLICY_DEFAULT by
default -- no change in behavior.
We could extend xc_domain_memory_mapping, but QEMU makes use of it, so
it is easier and less disruptive to introduce a new libxc function and
change the implementation of xc_domain_memory_mapping to call into it.
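A sketch of the new call and the compatibility wrapper (prototypes
assumed; the real xc_domain_memory_mapping also batches large ranges,
which is omitted here):

    int xc_domain_mem_map_policy(xc_interface *xch, uint32_t domid,
                                 unsigned long first_gfn,
                                 unsigned long first_mfn,
                                 unsigned long nr_mfns, uint32_t add_mapping,
                                 uint32_t memory_policy)
    {
        DECLARE_DOMCTL;

        domctl.cmd = XEN_DOMCTL_memory_mapping;
        domctl.domain = domid;
        domctl.u.memory_mapping.first_gfn = first_gfn;
        domctl.u.memory_mapping.first_mfn = first_mfn;
        domctl.u.memory_mapping.nr_mfns = nr_mfns;
        domctl.u.memory_mapping.add_mapping = add_mapping;
        domctl.u.memory_mapping.memory_policy = memory_policy;

        return do_domctl(xch, &domctl);
    }

    int xc_domain_memory_mapping(xc_interface *xch, uint32_t domid,
                                 unsigned long first_gfn,
                                 unsigned long first_mfn,
                                 unsigned long nr_mfns, uint32_t add_mapping)
    {
        /* Existing callers (e.g. QEMU) keep today's behavior. */
        return xc_domain_mem_map_policy(xch, domid, first_gfn, first_mfn,
                                        nr_mfns, add_mapping,
                                        MEMORY_POLICY_DEFAULT);
    }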
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
CC: ian.jackson@eu.citrix.com
CC: wei.liu2@citrix.com
---
Changes in v2:
- rename cache_policy to memory policy
- rename MEMORY_POLICY_DEVMEM to MEMORY_POLICY_ARM_DEV_nGRE
- rename MEMORY_POLICY_MEMORY to MEMORY_POLICY_ARM_MEM_WB
- introduce xc_domain_mem_map_policy
xen: extend XEN_DOMCTL_memory_mapping to handle memory policy
Reuse the existing padding field to pass memory policy information. On
Arm, the caller can specify whether the memory should be mapped as
Device-nGnRE (Device Memory on Armv7) at stage-2, which is the default
and the only possibility today, or write-back cacheable memory. The
resulting memory attribute is a combination of the stage-2 and stage-1
attributes: in practice, the stronger (more restrictive) of the two
stages' attributes wins.
On x86, the only option is uncacheable. The current behavior becomes the
default (numerically '0'). Also explicitly set the memory_policy field
to 0 in libxc.
On ARM, map Device-nGnRE as p2m_mmio_direct_dev (as it is already done
today) and WB cacheable memory as p2m_mmio_direct_c.
On x86, return an error if the requested memory policy is not
MEMORY_POLICY_X86_UC_MINUS.
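A possible shape of the Arm-side policy handling in the DOMCTL
(placement and variable names are assumptions):

    p2m_type_t p2mt;

    switch ( op->u.memory_mapping.memory_policy )
    {
    case MEMORY_POLICY_ARM_DEV_nGnRE:   /* == MEMORY_POLICY_DEFAULT (0) */
        p2mt = p2m_mmio_direct_dev;
        break;
    case MEMORY_POLICY_ARM_MEM_WB:
        p2mt = p2m_mmio_direct_c;
        break;
    default:
        return -EOPNOTSUPP;
    }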
Andrew suggested removing MEMORY_POLICY_X86_UC_MINUS completely.
If that's the consensus, I am happy to respin the series removing that
code.
Changes in v3:
- error handling in default label of the switch
- set memory_policy to 0 in libxc
- improve commit message
- improve comments
- s/Device-nGRE/Device-nGnRE/g
- add in-code comment
- s/MEMORY_POLICY_X86_UC/MEMORY_POLICY_X86_UC_MINUS/g
- #ifdef hypercall defines according to arch
Changes in v2:
- rebase
- use p2m_mmio_direct_c
- use EOPNOTSUPP
- rename cache_policy to memory policy
- rename MEMORY_POLICY_DEVMEM to MEMORY_POLICY_ARM_DEV_nGRE
- rename MEMORY_POLICY_MEMORY to MEMORY_POLICY_ARM_MEM_WB
- add MEMORY_POLICY_X86_UC
- add MEMORY_POLICY_DEFAULT and use it
Add a p2mt parameter to map_mmio_regions, pass p2m_mmio_direct_dev on
ARM and p2m_mmio_direct on x86 -- no changes in behavior. On x86,
introduce a macro to strip away the last parameter and rename the
existing implementation of map_mmio_regions to __map_mmio_regions.
Use __map_mmio_regions in vpci as it is x86-only today.
On ARM, given the similarity between map_mmio_regions after the change
and map_regions_p2mt, remove un/map_regions_p2mt. Also add an ASSERT to
check that only p2m_mmio_* types are passed to it.
Also fix the style of the comment on top of map_mmio_regions while at
it.
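A possible x86 arrangement per the description above (a sketch):

    int __map_mmio_regions(struct domain *d, gfn_t start_gfn,
                           unsigned long nr, mfn_t mfn);

    /* Strip the type argument: x86 always uses p2m_mmio_direct here. */
    #define map_mmio_regions(d, start_gfn, nr, mfn, p2mt) \
        __map_mmio_regions(d, start_gfn, nr, mfn)

On Arm, the new parameter is instead sanity-checked, e.g.:

    ASSERT(p2mt == p2m_mmio_direct_dev || p2mt == p2m_mmio_direct_c);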
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
CC: JBeulich@suse.com
CC: andrew.cooper3@citrix.com
---
Changes in v3:
- code style
- introduce __map_mmio_regions on x86
- fix comment style on top of map_mmio_regions
- add an assert on allowed p2mt types in map_mmio_regions
Jan Beulich [Mon, 17 Jun 2019 15:38:35 +0000 (17:38 +0200)]
x86/x2APIC: tighten check in cluster mode IPI sending
It is only of limited use to check the full accumulated 32-bit value,
because the high half is the cluster ID. What needs to be non-zero is
the bitmap at the bottom, or else APIC errors will result.
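A minimal illustration of the point (identifier names are assumptions,
not necessarily the patch's):

    /* In cluster mode the logical destination is
     * (cluster ID << 16) | bitmap-of-CPUs, so gate the ICR write on the
     * bitmap half only. */
    if ( dest32 & 0xffff )
        apic_wrmsr(APIC_ICR, APIC_DM_FIXED | APIC_DEST_LOGICAL | vector |
                   ((uint64_t)dest32 << 32));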
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 17 Jun 2019 15:35:41 +0000 (17:35 +0200)]
x86/IO-APIC: dump full destination ID in x2APIC mode
In x2APIC mode it is 32 bits wide.
In __print_IO_APIC() drop logging of both physical and logical IDs:
The latter covers a superset of the bits of the former in the RTE, and
we write full 8-bit values anyway even in physical mode for all ordinary
interrupts, regardless of INT_DEST_MODE (see the users of SET_DEST()).
Adjust other column arrangement (and heading) a little as well.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Fri, 15 Mar 2019 21:19:43 +0000 (21:19 +0000)]
xen/arm: mm: Remove set_pte_flags_on_range()
set_pte_flags_on_range() is yet another function that open-codes
updates to a specific range in the Xen page-tables. It can be completely
dropped by using either modify_xen_mappings() or destroy_xen_mappings().
Note that modify_xen_mappings() will keep the field 'pxn' cleared for
all the cases. This is because the field is RES0 for the stage-1
hypervisor as only a single VA range is supported (see D5.4.5 in
DDI0487D.b).
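For instance (an illustrative fragment, assuming start/end/rc are in
scope; the actual call sites are the commit's):

    /* Tighten permissions on a VA range: */
    rc = modify_xen_mappings(start, end, PAGE_HYPERVISOR_RO);

    /* Or remove the mappings entirely: */
    rc = destroy_xen_mappings(start, end);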
Julien Grall [Sun, 2 Dec 2018 18:54:06 +0000 (18:54 +0000)]
xen/arm: mm: Don't open-code Xen PT update in {set, clear}_fixmap()
{set, clear}_fixmap() are currently open-coding updates to the Xen
page-tables. This can be avoided by using the generic helpers
map_pages_to_xen() and destroy_xen_mappings().
Neither function is meant to fail for the fixmap, hence the BUG_ON()
checking the return value.
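Roughly what the reworked helpers look like (a sketch based on the
description above, not the verbatim patch):

    void set_fixmap(unsigned map, mfn_t mfn, unsigned int flags)
    {
        /* Mapping a single page in the fixmap must not fail. */
        BUG_ON(map_pages_to_xen(FIXMAP_ADDR(map), mfn, 1, flags));
    }

    void clear_fixmap(unsigned map)
    {
        BUG_ON(destroy_xen_mappings(FIXMAP_ADDR(map),
                                    FIXMAP_ADDR(map) + PAGE_SIZE));
    }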
xen/arm: mm: Rework Xen page-tables walk during update
Currently, xen_pt_update_entry() is only able to update the region
covered by xen_second (i.e. 0 to 0x7fffffff).
Because of this restriction, we ended up with multiple functions in mm.c
modifying the page-tables differently.
Furthermore, we never walked the page-tables fully. This means that any
change in the layout may require a major rewrite of the page-tables
code.
Lastly, we have been quite lucky that no one ever tried to pass an
address outside this range, because it would have blown up.
xen_pt_update_entry() is reworked to walk over the page-tables every
time. The logic has been borrowed from arch/arm/p2m.c and contains some
limitations for the time being:
- Superpages cannot be shattered
- Only level-3 (i.e. 4KB) mappings can be done
Note that the parameter 'addr' has been renamed to 'virt' to make clear
we are dealing with a virtual address.
xen/arm: mm: Use {, un}map_domain_page() to map/unmap Xen page-tables
Currently, the virtual address of the 3rd level page-tables is obtained
using mfn_to_virt().
On Arm32, mfn_to_virt can only work on xenheap pages. While in theory
all the page-tables updated will reside in the xenheap, in practice the
page-tables covering Xen memory (e.g. xen_mapping) are part of the Xen
binary.
Furthermore, a follow-up change will update xen_pt_update_entry() to
walk all the levels and therefore be more generic. Some of the
page-tables will also be part of Xen memory and therefore will not be
reachable using mfn_to_virt().
The easiest way to reach those pages is to use {, un}map_domain_page().
While on arm32 this means an extra mapping in the normal case, this is
not very important as Xen page-tables are not updated often.
In order to allow future changes in the way Xen page-tables are mapped,
two new helpers are introduced to map/unmap the page-tables.
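The two helpers, roughly as described (bodies are a sketch inferred
from the text above):

    static lpae_t *xen_map_table(mfn_t mfn)
    {
        return map_domain_page(mfn);
    }

    static void xen_unmap_table(const lpae_t *table)
    {
        unmap_domain_page(table);
    }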
Julien Grall [Sat, 23 Mar 2019 11:44:44 +0000 (11:44 +0000)]
xen/arm: mm: Rework xen_pt_update_entry to avoid using xenmap_operation
With the newly introduced flags, it is now possible to know from the
flags alone how the page will be updated.
All uses of xenmap_operation are now replaced with the flags. At the
same time, the validity checks are removed, as they are now gathered in
xen_pt_check_entry().
Julien Grall [Mon, 18 Mar 2019 18:38:27 +0000 (18:38 +0000)]
xen/arm: mm: Sanity check any update of Xen page tables
The code handling Xen PT update has quite a few restrictions on what it
can do. This is not a bad thing as it keeps the code simple.
There are already a few checks scattered in the current page-table
handling. However, they are not sufficient, as they could still allow
modifying/removing an entry with the contiguous bit set.
The checks are divided into two sets:
- per-entry checks: gathered in a new function, they check whether an
update is valid based on the flags passed and the current value of the
entry.
- global checks: sanity checks on the xen_pt_update() parameters.
In addition to the contiguous-bit check, we now also check that the
caller is not trying to modify the memory attributes of an entry.
Lastly, it was probably a bit over the top to forbid removing an
invalid mapping. This could just be ignored. The new behavior will be
helpful in future changes.
Julien Grall [Mon, 18 Mar 2019 16:17:01 +0000 (16:17 +0000)]
xen/arm: mm: Introduce _PAGE_PRESENT and _PAGE_POPULATE
At the moment, the flags are not enough to describe what kind of update
will be done on the VA range. They need to be used in conjunction with
the enum xenmap_operation.
It would be more convenient to have all the information for the update
in a single place.
Two new flags are added to remove the reliance on xenmap_operation:
- _PAGE_PRESENT: Indicate whether we are adding/removing the mapping
- _PAGE_POPULATE: Indicate whether we only populate page-tables
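Illustrative definitions (the bit positions are assumptions, not the
patch's actual values):

    #define _PAGE_PRESENT   (1U << 10)
    #define _PAGE_POPULATE  (1U << 11)

    /*
     * Operation         Flags
     * add mapping       _PAGE_PRESENT (plus permission flags)
     * remove mapping    no _PAGE_PRESENT
     * populate tables   _PAGE_POPULATE (no mapping installed)
     */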
xen/arm: grant-table: Protect gnttab_clear_flag against guest misbehavior
The function gnttab_clear_flag is used to clear the access flags. On
Arm, it is implemented using a loop and guest_cmpxchg.
It is possible that guest_cmpxchg will always return a different value
than old. This can happen if the guest updates the memory before Xen has
time to do the exchange. Because of that, there is no way to guarantee
that the loop will end.
It is possible to make the current code safe by re-using the same
principle as applied in the guest atomic helpers. However, this patch
takes a different approach that should lead to more efficient code in
the default case.
A new helper is introduced to clear a set of bits on a 16-bit word.
This avoids an extra loop to check whether the cmpxchg succeeded.
Note that a mask is used instead of a bit, so the helper can be re-used
later on for clearing multiple flags at the same time.
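The concept, with a GCC builtin standing in for the actual
architecture-specific implementation (helper name as in the text above;
treat as a sketch):

    static inline void clear_mask16(uint16_t mask, volatile uint16_t *word)
    {
        /* One atomic AND-NOT replaces the guest_cmpxchg retry loop. */
        __atomic_fetch_and(word, (uint16_t)~mask, __ATOMIC_SEQ_CST);
    }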
This is part of XSA-295.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
xen: Use guest atomics helpers when modifying atomically guest memory
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when updating shared memory with a guest atomically.
This patch replaces all the atomic operations on shared memory with a
guest by the new guest atomics helpers. The x86 code was not audited
to know where guest atomics helpers could be used. I will leave that
to the x86 folks.
Note that some rework was required in order to plumb the new guest
atomics into the event channel and grant-table code.
Because guest_test_bit ignores the parameter "d" for now, there are a
lot of places that do not need to drop the const. We may want to
revisit this in the future if the parameter "d" becomes necessary.
xen/cmpxchg: Provide helper to safely modify guest memory atomically
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when updating shared memory with a guest atomically.
This patch adds a new helper that will update the guest memory safely.
For x86, it is already possible to use the current helper safely. So
just wrap it.
For Arm, we will first attempt to update the guest memory with a loop
bounded by a maximum number of iterations. If it fails, we will
pause the domain and try again.
Note that this heuristic assumes that a page can only be shared between
Xen and one domain, not Xen and multiple domains.
The maximum number of iterations is based on how many times a simple
load-store atomic operation can be executed in 1uS. The maximum value is
per-CPU, to cater for big.LITTLE, and is calculated when the CPU boots.
The heuristic was chosen somewhat arbitrarily and can be modified if it
impacts well-behaved guests too much.
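A rough shape of the Arm-side helper (identifier names are assumptions;
the bounded primitive is the one introduced earlier in the series):

    static inline unsigned long guest_cmpxchg(struct domain *d,
                                              volatile void *ptr,
                                              unsigned long old,
                                              unsigned long new,
                                              unsigned int size)
    {
        unsigned long oldval = old;

        /* Fast path: a bounded number of exclusive-access retries. */
        if ( __cmpxchg_timeout(ptr, &oldval, new, size,
                               this_cpu(guest_safe_atomic_max)) )
            return oldval;

        /* Slow path: a paused domain cannot dirty the page under us. */
        domain_pause_nosync(d);
        oldval = __cmpxchg(ptr, old, new, size);
        domain_unpause(d);

        return oldval;
    }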
xen/bitops: Provide helpers to safely modify guest memory atomically
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when updating shared memory with a guest atomically.
This patch adds a new set of helpers that will update the guest memory
safely. For x86, it is already possible to use the current helpers
safely, so just wrap them.
For Arm, we will first attempt to update the guest memory with a loop
bounded by a maximum number of iterations. If it fails, we will pause
the domain and try again.
Note that this heuristic assumes that a page can only be shared between
Xen and one domain, not Xen and multiple domains.
The maximum number of iterations is based on how many times a simple
load-store atomic operation can be executed in 1uS. The maximum value is
per-CPU, to cater for big.LITTLE, and is calculated when the CPU boots.
The heuristic was chosen somewhat arbitrarily and can be modified if it
impacts well-behaved guests too much.
Note that while test_bit does not require an atomic operation, a
wrapper for test_bit was added for completeness. In this case, the
domain stays constified to avoid major rework in the callers for the
time being.
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when updating shared memory with a guest atomically.
Recent patches introduced new helpers to update shared memory with a
guest atomically. Those helpers rely on a memory region being shared
only between Xen and a single guest.
At the moment, nothing prevents a guest from sharing a page with Xen as
well as with another guest (e.g. via the grant table).
For the scope of the XSA, the quickest way is to deny communication
between unprivileged guests, so this patch enables and uses SILO mode by
default on Arm.
Users wanting a finer-grained policy can write their own Flask policy.
This is part of XSA-295.
Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Wed, 22 May 2019 20:39:17 +0000 (13:39 -0700)]
xen/arm: cmpxchg: Provide a new helper that can timeout
Exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS
Xen when updating shared memory with a guest atomically.
To prevent an infinite loop, we introduce a new helper that can time
out. The timeout is based on a maximum number of iterations.
It will be used in a follow-up patch to make atomic operations on shared
memory safe.
xen/arm: bitops: Implement a new set of helpers that can timeout
Exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS
Xen when updating shared memory with a guest atomically.
To prevent an infinite loop, we introduce a new set of helpers that can
time out. The timeout is based on a maximum number of iterations.
They will be used in a follow-up patch to make atomic operations
on shared memory safe.
xen/arm32: cmpxchg: Simplify the cmpxchg implementation
The only difference between each case of the cmpxchg is the size
used. Rather than duplicating the code, provide a macro to generate
each case.
This makes the code easier to read and modify.
While doing the rework, the case for 64-bit cmpxchg is removed. It is
unused today (already commented out) and could not be used directly
anyway.
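In this spirit, following the classic arm32 ldrex/strex pattern (macro
and function names are illustrative, not the exact patch):

    #define __CMPXCHG_CASE(sz, name)                                        \
    static inline unsigned long __cmpxchg_case_##name(volatile void *ptr,   \
                                                      unsigned long old,    \
                                                      unsigned long new)    \
    {                                                                       \
        unsigned long oldval, res;                                          \
                                                                            \
        do {                                                                \
            asm volatile("ldrex" #sz "   %1, [%2]\n"                        \
                         "mov            %0, #0\n"                          \
                         "teq            %1, %3\n"                          \
                         "strex" #sz "eq %0, %4, [%2]\n"                    \
                         : "=&r" (res), "=&r" (oldval)                      \
                         : "r" (ptr), "Ir" (old), "r" (new)                 \
                         : "memory", "cc");                                 \
        } while ( res );                                                    \
                                                                            \
        return oldval;                                                      \
    }

    __CMPXCHG_CASE(b, 1)    /* 8-bit  */
    __CMPXCHG_CASE(h, 2)    /* 16-bit */
    __CMPXCHG_CASE( , 4)    /* 32-bit */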
xen/grant_table: Rework the prototype of _set_status* for readability
It is not clear from the parameter names whether domid and gt_version
correspond to the local or the remote domain. A follow-up patch will
make them more confusing.
So rename domid (resp. gt_version) to ldomid (resp. rgt_version). At
the same time re-order the parameters to hopefully make it more
readable.
This is part of XSA-295.
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
xen/arm: Add an isb() before reading CNTPCT_EL0 to prevent re-ordering
Per D8.2.1 in ARM DDI 0487C.a, "a read to CNTPCT_EL0 can occur
speculatively and out of order relative to other instructions executed
on the same PE."
Add an instruction barrier to get an accurate number of cycles when
requested in get_cycles(). For the other users of CNTPCT_EL0, replace
the direct read with a call to get_cycles().
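The resulting helper is essentially (a sketch):

    static inline cycles_t get_cycles(void)
    {
        isb();      /* prevent the counter read from happening early */
        return READ_SYSREG64(CNTPCT_EL0);
    }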
Julien Grall [Mon, 18 Mar 2019 18:06:55 +0000 (18:06 +0000)]
xen/arm: mm: Protect Xen page-table update with a spinlock
The function create_xen_entries() may be called concurrently. For
instance, while the vmap allocation is protected by a spinlock, the
mapping is not.
The implementation of create_xen_entries() contains quite a few TOCTOU
races, such as when allocating the 3rd-level page-tables.
Thankfully, they are pretty hard to reach, as page-tables are allocated
once and never released. Yet it is possible, so the updates need to be
protected with a spinlock to avoid corrupting the page-tables.
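Sketch of the arrangement (a single lock serializing all Xen
page-table updates):

    static DEFINE_SPINLOCK(xen_pt_lock);

    /* ... in the update path: */
    spin_lock(&xen_pt_lock);
    /* walk the tables, allocate intermediate levels, write entries */
    spin_unlock(&xen_pt_lock);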
xen/arm32: mm: Avoid cleaning the cache for secondary CPUs page-tables
The page-table walker is configured by TCR_EL2 to use the same
shareability and cacheability as the access performed when updating the
page-tables. This means cleaning the cache for the secondary CPUs'
runtime page-tables is unnecessary.
When a message is requeued in Xen's internal queue, the queue
entry contains the length of the message so that Xen knows to
send a VIRQ to the respective domain when enough space frees up
in the ring. Due to a small bug, however, Xen doesn't populate
the length of the message if a given write fails, so this length is
always reported as zero. This causes Xen to spuriously wake up
a domain even when the ring doesn't have enough space.
This patch makes sure that the message length is properly reported by
populating it in the event of a write failure.
Signed-off-by: Nicholas Tsirakis <tsirakisn@ainfosec.com>
Reviewed-by: Christopher Clark <christopher.w.clark@gmail.com>
Julien Grall [Mon, 18 Mar 2019 18:01:31 +0000 (18:01 +0000)]
xen/arm: mm: Flush the TLBs even if a mapping failed in create_xen_entries
At the moment, create_xen_entries will only flush the TLBs if the full
range has successfully been updated. This may leave stale entries in
the TLBs if we fail to update some entries.
All the TLB helpers that invalidate all the TLB entries use the same
pattern:
    DSB SY
    TLBI ...
    DSB SY
    ISB
This pattern follows the one recommended by the Arm ARM to ensure the
visibility of updates to translation tables (see K11.5.2 in ARM DDI
0487D.b).
We have been a bit too eager in Xen and use system-wide DSBs when this
can be limited to the inner-shareable domain.
Furthermore, the first DSB can be restricted further to only stores to
the inner-shareable domain. This is because that DSB is only there to
ensure the visibility of the update to translation table walks.
Lastly, there is a lack of documentation in most of the TLB helpers.
Rather than trying to update the helpers one by one, this patch
introduces a per-arch macro to generate them. This will make it easier
to update the TLB helpers and their documentation in the future.
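On arm64 the generator could look like this (instruction sequence from
the pattern above; the macro name is illustrative):

    #define TLB_HELPER(name, tlbop)                     \
    static inline void name(void)                       \
    {                                                   \
        asm volatile(                                   \
            "dsb  ishst;"  /* PT writes visible */      \
            "tlbi " # tlbop ";"                         \
            "dsb  ish;"    /* TLBI has completed */     \
            "isb;"                                      \
            : : : "memory");                            \
    }

    TLB_HELPER(flush_xen_tlb_local, alle2);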
Now that we dropped flush_xen_text_tlb_local(), we have only one set of
helpers acting on Xen TLBs. Their naming is quite confusing because the
TLB instructions used act on both data and instruction TLBs.
Take the opportunity to rework the documentation, which can be confusing
to read as it doesn't match the implementation. Note that the mention of
instruction cache maintenance has been removed, because modifying a
mapping does not require instruction cache maintenance.
Lastly, switch from unsigned long to vaddr_t, as the functions
technically deal with virtual addresses.
Julien Grall [Mon, 13 May 2019 15:02:18 +0000 (16:02 +0100)]
xen/arm: Don't boot Xen on platforms using AIVIVT instruction caches
The AIVIVT is a type of instruction cache available on Armv7. This is
the only cache not implementing the IVIPT extension and therefore
requiring specific care.
To simplify maintenance requirements, Xen will not boot on platforms
using an AIVIVT cache.
This should not be an issue because Xen Arm32 can only boot on a small
number of processors (see arch/arm/arm32/proc-v7.S), none of which uses
an AIVIVT cache.
Andrew Cooper [Wed, 12 Jun 2019 10:28:05 +0000 (11:28 +0100)]
x86/boot: Drop vestigial support for pre-SIPI APICs
The current code in do_boot_cpu() makes a CMOS write (even in the case of an
FADT reduced hardware configuration) and two writes into the BDA for the
start_eip segment and offset.
BDA 0x67 and 0x69 hail from the days of DOS and the 286, when IBM put
together the fast way to return from Protected Mode back to Real Mode (via a
deliberate triple fault). This vector, when set, redirects the early boot
logic back into OS control.
It is also used by early MP systems, before the Startup IPI message became
standard, which in practice was before Local APICs became integrated into CPU
cores.
Support for non-integrated APICs was dropped in c/s 7b0007af "xen/x86: Remove
APIC_INTEGRATED() checks" because there are no 64-bit capable systems without
them. Therefore, drop smpboot_{setup,restore}_warm_reset_vector().
Dropping smpboot_setup_warm_reset_vector() also lets us drop
TRAMPOLINE_{HIGH,LOW}, which lets us drop mach_wakecpu.h entirely. The final
function in smpboot_hooks.h is smpboot_setup_io_apic() and has a single
caller, so expand it inline and delete smpboot_hooks.h as well.
This removes all reliance on CMOS and the BDA from the AP boot path, which is
especially of interest on reduced_hardware boots and EFI systems.
This was discovered while investigating Xen's use of the BDA during kexec.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Pu Wen [Wed, 12 Jun 2019 12:54:25 +0000 (20:54 +0800)]
x86/pv: Add Hygon Dhyana support to emulate MSRs access
The Hygon Dhyana CPU supports lots of MSRs (such as perf event select
and counter MSRs, the hardware configuration MSR, the MMIO configuration
base address MSR, and the MPERF/APERF MSRs) as the AMD CPU does, so add
Hygon Dhyana support to the PV emulation infrastructure by using the AMD
code path.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Acked-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Tue, 28 May 2019 10:32:16 +0000 (12:32 +0200)]
xen/sched: let sched_switch_sched() return new lock address
Instead of setting the scheduler percpu lock address in each of the
switch_sched instances of the different schedulers, do that in
schedule_cpu_switch(), which is the single caller of that function.
For that purpose, let sched_switch_sched() just return the new lock
address.
This also allows setting the new struct scheduler and struct
schedule_data values in the percpu area in schedule_cpu_switch() instead
of in the schedulers.
It should be noted that in credit2 the lock used to be set while still
holding the global scheduler write lock, which will no longer be true
with the new scheme applied. This is actually no problem as the write
lock is meant to guard the call of init_pdata(), which is still the
case.
While there, turn the full barrier, which was overkill, into an
smp_wmb(), matching the one implicit in taking the lock.
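The caller-side shape after the change (field names as in Xen's
schedule.c; treat as a sketch):

    spinlock_t *new_lock;

    new_lock = sched_switch_sched(new_ops, cpu, ppriv, vpriv);

    per_cpu(scheduler, cpu) = new_ops;
    per_cpu(schedule_data, cpu).sched_priv = ppriv;

    /* Publish the lock pointer only after the data it protects is in
     * place; the matching barrier is implicit in taking the lock. */
    smp_wmb();
    per_cpu(schedule_data, cpu).schedule_lock = new_lock;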
Andrii Anisov [Wed, 12 Jun 2019 09:35:50 +0000 (12:35 +0300)]
schedule: move credit scheduler specific member to its privates
The vcpu structure member last_run_time is used by the credit scheduler
only.
In order to get better encapsulation, it is moved from a generic
structure to the credit scheduler private vcpu definition. Also, rename
the member to last_sched_time in order to reflect that it is the time
when the vcpu went through the scheduling path.
With this move we have slight changes in functionality:
- last_sched_time is not updated for an idle vcpu. But the idle vcpu is,
in fact, a per-pcpu stub and never migrates so last_sched_time is
meaningless for it.
- The value of last_sched_time is updated on every schedule, even if the
vcpu is not being changed. It is still ok, because last_sched_time is
only used for runnable vcpu migration decision, and we have it correct
at that moment. Scheduling parameters and statistics are tracked by
other entities.
Reducing code and data usage when not running the credit scheduler is
another nice side effect.
While here, also:
- turn last_sched_time into s_time_t, which is more appropriate.
- properly const-ify related argument of __csched_vcpu_is_cache_hot().
In its current state, if the destination ring is full, sendv()
will requeue the message and return the rc of pending_requeue(),
which will return 0 on success. This prevents the caller from
distinguishing between a successful write and a message that needs to
be resent at a later time.
Instead, capture the -EAGAIN value returned from ringbuf_insert()
and *only* overwrite it if the rc of pending_requeue() is non-zero.
This allows the caller to make intelligent decisions on -EAGAIN and
still be alerted if the pending message fails to requeue.
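The shape of the fix (argument lists abbreviated; a sketch):

    ret = ringbuf_insert(currd, ring_info, src_id, iovs, niov,
                         message_type, len);
    if ( ret == -EAGAIN )
    {
        /* Keep -EAGAIN for the caller unless the requeue itself failed. */
        int rc = pending_requeue(currd, ring_info, src_id, len);

        if ( rc )
            ret = rc;
    }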
Signed-off-by: Nicholas Tsirakis <tsirakisn@ainfosec.com>
Reviewed-by: Christopher Clark <christopher.w.clark@gmail.com>
Jan Beulich [Tue, 11 Jun 2019 15:21:34 +0000 (17:21 +0200)]
x86/AMD: make use of CPUID leaf 0xb when available
Initially I did simply stumble across a backport of Linux commit e0ceeae708 ("x86/CPU/hygon: Fix phys_proc_id calculation logic for
multi-die processors") to our kernels. There I got puzzled by the claim
that a similar change isn't needed on the AMD side. As per the web page
cited [1], there aren't supposed to be affected AMD processors, but
according to my reading there are: The EPYC 7000 series comes with 8,
16, 24, or 32 cores, which I imply to be 1, 2, 3, or 4 die processors.
And many of them have "1P/2P" in the "socket count" column. Therefore
our calculation, being based on CPUID.80000008.EBX[15:12], would be
similarly wrong on such 2-socket 1- or 2-die systems.
Checking Linux code I then found that they don't even rely on the
calculation we currently use anymore, at least not in the case when
leaf 0xb is available (which is the case on Fam17). Let's follow
Suravee's Linux commit 3986a0a805 ("x86/CPU/AMD: Derive CPU topology
from CPUID function 0xB when available") in this regard to address this.
To avoid logging duplicate information, make the function return bool.
Move its and detect_ht()'s declaration to a private header at the same
time.
Roger Pau Monne [Mon, 10 Jun 2019 16:32:46 +0000 (18:32 +0200)]
automation: add clang and lld 8 tests to gitlab
Using clang and lld 8 requires installing the packages from the
official llvm apt repositories, so modify the Debian Docker files for
stretch and unstable to add the llvm repo and install clang and lld
from it.
Also add some jobs to test building Xen with clang 8 and lld.
Andrii Anisov [Mon, 27 May 2019 09:29:30 +0000 (12:29 +0300)]
xen/arm: gic: Defer the decision to unmask interrupts to do_{LPI, IRQ}()
At the moment, interrupts are unmasked by gic_interrupt() before
calling do_{IRQ, LPI}(). In the case of handling an interrupt routed
to guests, its priority will be dropped, via desc->handler->end()
called from do_irq(), with interrupts unmasked.
In other words:
- Until the priority is dropped, only higher-priority interrupts
can be received. Today, only Xen interrupts have a higher priority.
- As soon as the priority is dropped, any interrupt can be received.
This means the purpose of the loop in gic_interrupt() is defeated as
all interrupts may get trapped earlier. To reinstate the purpose of
the loop (and prevent the trap), interrupts should be masked when
dropping the priority.
For interrupts routed to Xen, the priority will always be dropped with
interrupts masked, so the issue is not present. However, it means that
we pointlessly try to mask the interrupts.
To avoid conflicting behavior between the interrupt-handling paths,
gic_interrupt() now keeps interrupts masked and defers the decision to
do_{LPI, IRQ}().
xen/device-tree: Add ability to handle nodes with interrupts-extended prop
The "interrupts-extended" property is a special form for use when
a node needs to reference multiple interrupt parents.
According to:
Linux/Documentation/devicetree/bindings/interrupt-controller/interrupts.txt
But there are cases when the "interrupts-extended" property is used for
nodes outside the /soc node with a single interrupt parent, as an
equivalent of the ("interrupt-parent" + "interrupts") pair.
A good example here is the arch timer node for the R-Car Gen3/Gen2
family, which is a mandatory device for Xen usage on Arm. Without the
ability to handle such nodes, Xen fails to operate.
So this patch adds the required support for Xen to be able to handle
nodes with that property.
UBSAN (which I happened to have active in my build at the time) identifies the
problem explicitly:
(XEN) Using APIC driver default
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in /local/xen.git/xen/include/xsm/xsm.h:309:19
(XEN) member access within null pointer of type 'struct xsm_operations'
(XEN) ----[ Xen-4.13-unstable x86_64 debug=y Not tainted ]----
"adjust system domain creation (and call it earlier on x86)" didn't account
for the fact that domain_create() depends on XSM already being set up.
Therefore, domain_create() follows xsm_ops->alloc_security_domain() which is
offset 0 from a NULL pointer, meaning that we execute the 16bit IVT until
happening to explode in __x86_indirect_thunk_rax().
xsm_multiboot_init() does nothing more interesting than allocate
memory, which means that it is safe to move earlier during setup.
xen/arm: mm: Check start is always before end in {destroy, modify}_xen_mappings
The two helpers {destroy, modify}_xen_mappings don't check that the
start is before the end. This should never happen, but if it does, it
will result in unexpected behavior.
Catch such issues earlier on by adding an ASSERT in destroy_xen_mappings
and modify_xen_mappings.
Since commit f60658c6ae "xen/arm: Stop relocating Xen", the function
setup_page_tables() does not require any information from the FDT.
So the initialization of the page-tables can be done much earlier in the
boot process. The earliest setup_page_tables() can be called is after
traps have been initialized, so we can get a backtrace if an error
occurs.
Moving the initialization of the page-tables earlier also avoids the
dance of mapping the FDT again in the new set of page-tables.
xen/arm: mm: Introduce DEFINE_PAGE_TABLE{,S} and use it
We have multiple static page-tables defined in arch/arm/mm.c. The
current way to define them is difficult to read and does not help when
making modifications.
Two new helpers DEFINE_PAGE_TABLES (to define multiple page-tables) and
DEFINE_PAGE_TABLE (alias of DEFINE_PAGE_TABLES(..., 1)) are introduced
and now used to define static page-tables.
Note that DEFINE_PAGE_TABLES() alignment differs from what is currently
used for allocating page-tables. This is fine because page-tables are
only required to be aligned to the page size.
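The likely shape of the helpers (lpae_t and LPAE_ENTRIES are Arm's;
treat the exact bodies as a sketch):

    #define DEFINE_PAGE_TABLES(name, nr)                    \
        lpae_t __aligned(PAGE_SIZE) name[LPAE_ENTRIES * (nr)]

    #define DEFINE_PAGE_TABLE(name) DEFINE_PAGE_TABLES(name, 1)

    /* e.g.: */
    static DEFINE_PAGE_TABLE(xen_fixmap);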
xen/arm32: mm: Avoid to zero and clean cache for CPU0 domheap
The page-table walker is configured to use the same shareability and
cacheability as the access performed when updating the page-tables. This
means cleaning the cache for CPU0 domheap is unnecessary.
Furthermore, CPU0's page-tables are part of the Xen binary and will
already be zeroed before being used. So it is pointless to zero the
domheap again.
xen/arm32: head: Always zero r3 before updating a page-table entry
The boot code is using r2 and r3 to hold the page-table entry value.
While r2 is always updated before storing the value, this is not always
the case for r3.
Thankfully today, r3 will always be zero when we care. But this is
difficult to track and error-prone.
So always zero r3 in the few instructions before writing the
page-table entry.
There is no reason to assume the HW CPU ID will be 0 when the
processor is part of a uniprocessor system. At best, this will result
in conflicting output, as the rest of Xen uses the value directly read
from MPIDR.
So remove the zeroing and the logic checking whether the CPU is part of
a uniprocessor system.
xen/arm: p2m: configure stage-2 page table to support upto 42-bit PA systems
At the moment, on platforms supporting a 42-bit PA, Xen will only
expose 40 bits worth of IPA to all domains.
The limitation was to prevent allocating too much memory for the root
page-tables, as those platforms only support 3-level page-tables. At
the time, this was deemed acceptable because none of the platforms had
addresses wired above 40 bits.
However, newer platforms take advantage of the full address space. This
breaks Dom0 boot, as it can't access anything above 40 bits.
The only way to support a 42-bit IPA is to allocate 8 pages for the root
page-tables: a single 4KB level-1 table resolves 39 bits of IPA, so
covering 42 bits requires 2^(42 - 39) = 8 concatenated root pages. This
is a bit of a waste of memory, as Xen does not offer per-guest stage-2
configuration, but it is considered acceptable as current platforms
supporting a 42-bit PA have a lot of memory.
In the future, we may want to consider per-guest stage-2 configuration
to reduce the waste.
Jan Beulich [Fri, 31 May 2019 09:53:39 +0000 (03:53 -0600)]
Arm64: further speed-up to hweight{32,64}()
According to Linux commit e75bef2a4f ("arm64: Select
ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
variant using only bitwise operations on at least some hardware, and no
worse on others.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>
xen: actually skip the first MAX_ORDER bits in pfn_pdx_hole_setup
pfn_pdx_hole_setup is meant to skip the first MAX_ORDER bits, but
actually it only skips the first MAX_ORDER-1 bits. The issue was
probably introduced by bdb5439c3f ("x86_64: Ensure frame-table
compression leaves MAX_ORDER aligned"): when changing the loop to start
from MAX_ORDER-1, an adjustment by 1 was needed in the call to
find_next_bit() but was not done.
Fix the issue by passing j+1 and i+1 to find_next_zero_bit and
find_next_bit. Also add a check for i >= BITS_PER_LONG because
find_{,next_}zero_bit() are free to assume that their last argument is
less than their middle one.
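The fixed loop could look like this (a sketch of the logic described
above, not necessarily the exact patch):

    for ( j = MAX_ORDER - 1; ; )
    {
        i = find_next_zero_bit(&mask, BITS_PER_LONG, j + 1);
        if ( i >= BITS_PER_LONG )
            break;
        j = find_next_bit(&mask, BITS_PER_LONG, i + 1);
        if ( j >= BITS_PER_LONG )
            break;
        /* [i, j) is now a hole in the mask usable for compression. */
    }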
pfn_to_pdx expects an address, not a size, as a parameter. Specifically,
it expects the end address; the mask calculations then compensate for
any holes between start and end. Thus, we should pass the end address to
pfn_to_pdx.
The initial pdx is stored in frametable_base_pdx, so we can subtract the
result of pfn_to_pdx(start_address) from nr_pdxs; we know that we don't
need to cover any memory in the range 0-start in the frametable.
Remove the variable `nr_pages' because it is unused.
Pu Wen [Thu, 4 Apr 2019 13:48:13 +0000 (21:48 +0800)]
tools/libxc: Add Hygon Dhyana support
Add Hygon Dhyana support to calculate the cpuid policies for creating PV
or HVM guest by using the code path of AMD.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Acked-by: Wei Liu <wei.liu2@citrix.com>
[Rebase over 0cd074144cb "x86/cpu: Renumber X86_VENDOR_* to form a bitmap"]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Pu Wen [Thu, 4 Apr 2019 13:48:04 +0000 (21:48 +0800)]
x86/cpuid: Add Hygon Dhyana support
The Hygon Dhyana family 18h processor shares the same cpuid leaves as
the AMD family 17h one. So add Hygon Dhyana support to calculate the
cpuid policies as the AMD CPU does.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Acked-by: Jan Beulich <jbeulich@suse.com>
[Rebase over 0cd074144cb "x86/cpu: Renumber X86_VENDOR_* to form a bitmap"]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Pu Wen [Thu, 4 Apr 2019 13:47:40 +0000 (21:47 +0800)]
x86/domctl: Add Hygon Dhyana support
Add Hygon Dhyana support to update cpuid info for creating PV guest.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Acked-by: Jan Beulich <jbeulich@suse.com>
[Rebase over 0cd074144cb "x86/cpu: Renumber X86_VENDOR_* to form a bitmap"]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Pu Wen [Thu, 4 Apr 2019 13:47:29 +0000 (21:47 +0800)]
x86/domain: Add Hygon Dhyana support
Add Hygon Dhyana support to handle HyperTransport range.
Also, loading a null selector does not clear bases and limits on Hygon
CPUs, so add Hygon Dhyana support to the function preload_segment.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Acked-by: Jan Beulich <jbeulich@suse.com>
[Rebase over 0cd074144cb "x86/cpu: Renumber X86_VENDOR_* to form a bitmap"]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Pu Wen [Thu, 4 Apr 2019 13:46:33 +0000 (21:46 +0800)]
x86/spec_ctrl: Add Hygon Dhyana to the respective mitigation machinery
The Hygon Dhyana CPU has the same speculative-execution behavior as the
AMD family 17h, so share the AMD Retpoline and PTI mitigation code with
Hygon Dhyana.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Acked-by: Jan Beulich <jbeulich@suse.com>
[Rebase over 0cd074144cb "x86/cpu: Renumber X86_VENDOR_* to form a bitmap"]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Pu Wen [Thu, 4 Apr 2019 13:46:23 +0000 (21:46 +0800)]
x86/cpu/mce: Add Hygon Dhyana support to the MCA infrastructure
The machine check architecture for the Hygon Dhyana CPU is similar to
the AMD family 17h one. Add vendor checking for Hygon Dhyana to share
the code path of AMD family 17h.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Acked-by: Jan Beulich <jbeulich@suse.com>
[Rebase over 0cd074144cb "x86/cpu: Renumber X86_VENDOR_* to form a bitmap"]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Pu Wen [Thu, 4 Apr 2019 13:46:11 +0000 (21:46 +0800)]
x86/cpu/vpmu: Add Hygon Dhyana and AMD Zen support for vPMU
As the Hygon Dhyana CPU shares a similar PMU architecture with the AMD
family 17h one, add Hygon Dhyana support in vpmu_arch_initialise() and
vpmu_init() by sharing the AMD code path.
Split the common part of amd_vpmu_init() into a static function
_vpmu_init(), making AMD and Hygon call the shared function to
initialize the vPMU.
As the current vPMU does not yet support AMD Zen (family 17h), add 0x17
support to amd_vpmu_init().
Also create a function hygon_vpmu_init() for Hygon vPMU initialization.
Both AMD 17h and Hygon 18h have the same performance event select
and counter MSRs as AMD 15h, so reuse the 15h definitions for them.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Acked-by: Jan Beulich <jbeulich@suse.com>
Pu Wen [Thu, 4 Apr 2019 13:45:42 +0000 (21:45 +0800)]
x86/cpu: Fix common cpuid faulting probing for AMD and Hygon
There is no MSR_INTEL_PLATFORM_INFO for the AMD and Hygon families.
Reading this MSR will stop the Xen initialization process on some Hygon
systems or produce #GP(0). So directly return false from
probe_cpuid_faulting() if !cpu_has_hypervisor.
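The guard could look like this (vendor constants being bitmap values
after the renumbering cited below; a sketch):

    /* CPUID faulting is never present on bare-metal AMD/Hygon hardware. */
    if ( (boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) &&
         !cpu_has_hypervisor )
        return false;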
Signed-off-by: Pu Wen <puwen@hygon.cn>
Acked-by: Jan Beulich <jbeulich@suse.com>
[Rebase over 0cd074144cb "x86/cpu: Renumber X86_VENDOR_* to form a bitmap"]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>