Ian Jackson [Fri, 13 Dec 2019 17:01:44 +0000 (17:01 +0000)]
docs/process/branching-checklist: Fix a broken rune
cr-daily-branch ought to be called via cr-for-branches so that we take
the lock. Otherwise strange things can occur if cron runs
cr-daily-branch in the same directory - in particular, it will be
likely to update the osstest revision, breaking everything.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Fri, 26 Apr 2019 15:53:27 +0000 (16:53 +0100)]
xen/tasklet: Switch data parameter from unsigned long to void *.
Most users pass a vcpu pointer, and only stopmachine_action() takes an integer
parameter. Switch to using void * to substantially reduce the number of
explicit casts.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com>
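To illustrate why a void * parameter removes casts, here is a minimal sketch. The demo_* names and the struct layouts are hypothetical stand-ins for illustration, not Xen's real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not Xen's real structures. */
struct demo_vcpu {
    int vcpu_id;
};

struct demo_tasklet {
    void (*func)(void *data);   /* handler takes an opaque pointer */
    void *data;
};

void demo_tasklet_init(struct demo_tasklet *t, void (*func)(void *), void *data)
{
    assert(t != NULL && func != NULL);
    t->func = func;
    t->data = data;
}

/* The handler recovers the vcpu pointer without an explicit cast. */
void demo_mark_vcpu(void *data)
{
    struct demo_vcpu *v = data; /* void * converts implicitly in C */

    v->vcpu_id += 100;
}

void demo_tasklet_run(struct demo_tasklet *t)
{
    t->func(t->data);
}
```

With an unsigned long parameter, both the caller (pointer to integer) and the handler (integer back to pointer) would need an explicit cast.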
Andrew Cooper [Thu, 11 Apr 2019 12:54:36 +0000 (13:54 +0100)]
xen/tasklet: Fix return value truncation on arm64
The use of return_reg() assumes ARM's 32bit ABI. Therefore, a failure such as
-EINVAL will appear as a large positive number near 4 billion to a 64bit ARM
guest which happens to use continue_hypercall_on_cpu().
Introduce a new arch_hypercall_tasklet_result() hook which is implemented by
both architectures, and drop the return_reg() macros. This logic will be
extended in a later change to make continuations out of the tasklet work.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <julien@xen.org>
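The truncation can be reproduced in isolation. This is a sketch of the arithmetic only; the demo_* helpers are hypothetical and do not correspond to return_reg() or the new hook.

```c
#include <stdint.h>

#define DEMO_EINVAL 22  /* conventional EINVAL value, for illustration */

/* Store the result as a 32-bit quantity: the upper register half is zero. */
uint64_t demo_result_as_32bit(long rc)
{
    return (uint64_t)(uint32_t)rc;
}

/* Store the result sign-extended to the full register width. */
uint64_t demo_result_as_64bit(long rc)
{
    return (uint64_t)(int64_t)rc;
}

/* How a 64-bit guest interprets the register contents. */
int64_t demo_guest_view(uint64_t reg)
{
    return (int64_t)reg;
}
```

For -DEMO_EINVAL, the 32-bit store yields 4294967274 in the guest's view, while the 64-bit store is read back as -22.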
Andrew Cooper [Thu, 31 May 2018 17:50:50 +0000 (18:50 +0100)]
x86/debug: Plumb pending_dbg through the monitor and devicemodel interfaces
Like %cr2 for pagefaults, %dr6 contains ancillary information for debug
exceptions, and needs similar handling.
For xendevicemodel_inject_event(), no ABI change is needed (although an API
one would be ideal). Switch from 'cr2' to 'extra' in variable names which
don't constitute an API change, and update the documentation to match.
For the monitor interface, vm_event_debug needs extending with a pending_dbg
field. This shall behave like the VT-x PENDING_DBG control. Extend
hvm_monitor_debug() and for now, always pass in 0 - this will be fixed
eventually, when other hypervisor bugfixes are complete.
While modifying hvm_monitor_debug(), take the opportunity to correct trap type
and instruction length from unsigned long to unsigned int, as they are both
tiny values.
Finally, adjust xen-access.c to the new expectations. Introspection tools
intercepting debug exceptions should mirror the new pending_dbg field into
xendevicemodel_inject_event() for %dr6 to be processed correctly for the
guest.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Andrew Cooper [Fri, 20 Dec 2019 15:26:00 +0000 (15:26 +0000)]
tools/libxc: Fix HVM_PARAM_PAE_ENABLED handling in xc_cpuid_apply_policy()
Contrary to what c/s 685e922d6f3 suggested, not all HVM_PARAMs are handled
in the same way. HVM_PARAM_PAE_ENABLED is a toolstack-only value, and
xc_cpuid_apply_policy() used to be the only consumer.
Reinstate the old behaviour (mad as it is) to avoid regressions.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 19 Aug 2019 13:16:53 +0000 (14:16 +0100)]
x86/boot: Reposition trampoline data
... to separate code from data. In particular, trampoline_realmode_entry's
write to trampoline_cpu_started clobbers the I-cache line containing
trampoline_protmode_entry, which won't be great for AP startup performance.
Reformat the comments for trampoline_gdt to reduce their volume.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 27 Dec 2019 09:01:43 +0000 (10:01 +0100)]
x86/mm: avoid IOMMU operations in more cases in _get_page_type()
All that really matters is whether writability of a page changes; in
particular e.g. page table -> page table (but different levels)
transitions do not require unmapping the page from the IOMMU again.
Note that the XSA-288 fix did arrange for PGT_none pages not needing
special consideration here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
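The "only writability matters" rule can be sketched as a predicate on the type transition. The enum and helpers below are hypothetical (Xen's PGT_* values are bit fields), and the sketch assumes that only pages of the writable type are mapped in the IOMMU.

```c
#include <stdbool.h>

/* Hypothetical page types for illustration only. */
enum demo_page_type {
    DEMO_PGT_NONE,
    DEMO_PGT_WRITABLE,
    DEMO_PGT_L1_PAGE_TABLE,
    DEMO_PGT_L2_PAGE_TABLE,
};

bool demo_is_writable(enum demo_page_type t)
{
    return t == DEMO_PGT_WRITABLE;
}

/* An IOMMU map/unmap is only needed when writability flips. */
bool demo_iommu_update_needed(enum demo_page_type from,
                              enum demo_page_type to)
{
    return demo_is_writable(from) != demo_is_writable(to);
}
```

Under this assumption a page table -> page table transition at a different level needs no IOMMU operation, while writable -> page table still does.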
Jan Beulich [Fri, 27 Dec 2019 08:57:05 +0000 (09:57 +0100)]
x86: move vgc_flags to struct pv_vcpu
There's been effectively no use of the field for HVM.
Also shrink the field to unsigned int, even if this doesn't immediately
yield any space benefit for the structure itself. The resulting 32-bit
padding slot can eventually be used for some other field. The change in
size makes accesses slightly more efficient though, as no REX.W prefix
is going to be needed anymore on the respective insns.
Mirror the HVM side change here (dropping of setting the field to
VGCF_online) also to Arm, on the assumption that it was cloned like
this originally. VGCF_online really should simply and consistently be
the guest view of the inverse of VPF_down, and hence needs representing
only in the get/set vCPU context interfaces.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 27 Dec 2019 08:56:04 +0000 (09:56 +0100)]
x86: move and rename NR_VECTORS
This is an architectural definition, so move it to x86-defns.h and add
an X86_ prefix. This in particular allows removing the inclusion of
irq_vectors.h by virtually every source file, due to irq.h and
hvm/vmx/vmcs.h having needed to include it: Changes to IRQ vector usage
shouldn't really trigger full rebuilds.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 27 Dec 2019 08:54:59 +0000 (09:54 +0100)]
x86/IRQ: re-use legacy vector ranges on APs
The legacy vectors have been actively used on CPU 0 only. CPUs not
sharing vector space with CPU 0 can easily re-use them, slightly
increasing the relatively scarce resource of total vectors available in
the system. As a result the legacy vector range simply becomes a
sub-range of the dynamic one, with an extra check performed in
_assign_irq_vector() (we can't rely on the
"per_cpu(vector_irq, new_cpu)[vector] >= 0" check in the subsequent
loop, as we need to also exclude vectors of disabled legacy IRQs).
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 27 Dec 2019 08:54:19 +0000 (09:54 +0100)]
x86/IRQ: flip legacy and dynamic vector ranges
There's no reason to have the PIC vectors (which are typically entirely
unused on 64-bit systems anyway) right below the high priority ones. Put
them in the lowest possible range, and shift the dynamic vector range up
accordingly. This is to reduce the priority of PIC vectors in the LAPIC
vs all other ones.
Note that irq_move_cleanup_interrupt(), despite using
FIRST_DYNAMIC_VECTOR, does not get touched, as PIC interrupts aren't
movable.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 27 Dec 2019 08:53:35 +0000 (09:53 +0100)]
x86/IRQ: simplify pending EOI stack logic for internally used IRQs
In 5655ce8b1ec2 ("x86/IRQ: make internally used IRQs also honor the
pending EOI stack") it was mentioned that both the check_eoi_deferral
per-CPU variable and the cpu_has_pending_apic_eoi() were added just to
have as little impact on existing behavior as possible, to reduce the
risk of a last minute regression in 4.13.
Upon closer inspection, dropping the variable is an option only if all
callers of ->end() would assume the responsibility of also calling
flush_ready_eoi(). Therefore only drop the cpu_has_pending_apic_eoi()
guard now.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 27 Dec 2019 08:52:41 +0000 (09:52 +0100)]
x86/IRQ: move and rename __do_IRQ_guest()
This is for it to be next to do_IRQ(). Beyond the actual code movement
this
- drops the leading underscores,
- passes in desc and vector, rather than irq,
- flips the order of two ASSERT()s,
- changes i and sp to unsigned int,
- restricts the scope of d and sp,
- corrects style.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 27 Dec 2019 08:51:52 +0000 (09:51 +0100)]
x86/IRQ: move do_IRQ()
This is to avoid forward declarations of static functions. Beyond the
actual code movement this does
- u8 -> uint8_t,
- convert to Xen style,
- drop unnecessary parentheses and the like,
- strip trailing white space.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Fri, 27 Dec 2019 08:50:31 +0000 (09:50 +0100)]
x86/hvm/rtc: preserve guest RTC offset during suspend/resume/migrate
The emulated RTC is synchronized with the PV wallclock; any write to the
RTC will update struct domain's 'time_offset_seconds' field and call
update_domain_wallclock().
However, the value of 'time_offset_seconds' is not preserved in any save
record and indeed, when the RTC save record is loaded, the CMOS values
will be updated based on an offset value which may or may not have been
set by the toolstack [1]. This may result in making bogus values available
to the guest and messing up any calculations done in the call to
alarm_timer_update() at the end of rtc_load().
This patch extends the RTC save record to contain an offset value, which
will be zero filled on load of an older record. The 'time_offset_seconds'
field in struct domain is also modified into a 'time_offset' struct,
containing a 'seconds' field and a boolean 'set' field.
The code in rtc_load() then uses the new value in the save record to
update the value of struct domain's 'time_offset.seconds' unless
'time_offset.set' is true, which will only be the case if the toolstack has
already performed a XEN_DOMCTL_settimeoffset.
[1] There is currently no way for a toolstack to read the value of
'time_offset_seconds' from struct domain. In the past, any hope of
preservation of the value across a guest life-cycle operation was based
on relying on qemu-dm to write a value into xenstore whenever the RTC
was updated, in response to an IOREQ with type IOREQ_TYPE_TIMEOFFSET
being sent by Xen; see:
but this behaviour was never forward-ported into upstream QEMU, which
completely ignores that IOREQ type.
In either case, nothing in xl or libxl ever samples the value of
RTC offset from xenstore so any offset adjustment to a non-zero value
performed by the guest (which in the case of Windows is highly likely
as it normally writes RTC in local time, whereas Xen maintains time in
UTC) is completely lost with the de-facto toolstack, and always has
been. Instead, PV drivers are relied upon to paper over this gaping
hole.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien@xen.org>
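The rtc_load() rule described above (use the saved offset unless the toolstack already set one) can be sketched as follows. The demo_* types and functions are hypothetical simplifications, not the actual struct domain or save record layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the 'time_offset' struct in struct domain. */
struct demo_time_offset {
    int64_t seconds;
    bool set;           /* true once the toolstack has set the offset */
};

/* Hypothetical model of the extended RTC save record. */
struct demo_rtc_record {
    int64_t rtc_offset; /* zero when loading an older record */
};

/* Models the effect of a XEN_DOMCTL_settimeoffset. */
void demo_settimeoffset(struct demo_time_offset *t, int64_t seconds)
{
    t->seconds = seconds;
    t->set = true;
}

/* Models the load path: the record only wins if nothing was set. */
void demo_rtc_load(struct demo_time_offset *t,
                   const struct demo_rtc_record *rec)
{
    assert(t != NULL && rec != NULL);
    if (!t->set)
        t->seconds = rec->rtc_offset;
}
```

The 'set' flag is what lets an explicit toolstack value take precedence over the value carried in the save record.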
Roger Pau Monne [Tue, 24 Dec 2019 15:32:47 +0000 (16:32 +0100)]
x86/vvmx: virtualize x2APIC mode and APIC accesses can't both be enabled
According to the Intel SDM, "virtualize x2APIC mode" and "virtualize
APIC accesses" can't be enabled at the same time, or else a
vm{launch/entry} failure will happen. This was seen when running Xen
nested and with x2APIC enabled:
(XEN) d3v0 VMLAUNCH error: 0x7
[...]
(XEN) *** Control State ***
(XEN) PinBased=0000003f CPUBased=b6a075fe SecondaryExec=000014fb
[...]
Fix this by making sure nvmx_update_secondary_exec_control clears the
incompatible bits from the host vmcs before merging it with the nested
vmcs.
This fixes a regression reported by osstest in the
test-amd64-amd64-qemuu-nested-intel job.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
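The SecondaryExec value in the log above can be checked by hand. In the sketch below the two bit positions follow the SDM's secondary processor-based control layout (bit 0 and bit 4); the demo_* names are hypothetical, and the choice of which host bits to drop before merging is an assumption made for illustration, not the exact mask used by nvmx_update_secondary_exec_control.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_VIRT_APIC_ACCESSES (1u << 0) /* "virtualize APIC accesses" */
#define DEMO_VIRT_X2APIC_MODE   (1u << 4) /* "virtualize x2APIC mode" */

/* Both controls set at once makes the VM entry fail. */
bool demo_secondary_exec_conflict(uint32_t ctl)
{
    return (ctl & DEMO_VIRT_APIC_ACCESSES) && (ctl & DEMO_VIRT_X2APIC_MODE);
}

/* Drop selected host bits first, then merge in the nested settings. */
uint32_t demo_merge_secondary_exec(uint32_t host, uint32_t nested,
                                   uint32_t host_drop_mask)
{
    return (host & ~host_drop_mask) | nested;
}
```

0x14fb has both bit 0 and bit 4 set, which matches the reported VMLAUNCH failure.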
Andrew Cooper [Tue, 17 Dec 2019 17:49:47 +0000 (17:49 +0000)]
libxc/migration: Drop unimplemented domain types
x86 PVH is completely obsolete - it was intended for legacy PVH before that
idea was abandoned. There was an RFC series for ARM in 2015, but there is
plenty of outstanding work which hasn't been done yet.
No functional change. New types can be (re)introduced with the code which
actually implements them.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien@xen.org> Acked-by: Wei Liu <wl@xen.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Wed, 18 Dec 2019 19:43:18 +0000 (19:43 +0000)]
libxc/restore: Don't duplicate state in process_vcpu_basic()
vcpu_guest_context_any_t is currently allocated on the stack, and copied from
a mutable buffer which is freed immediately after its use here. Mutate the
buffer in place instead of duplicating it.
The code is as it is due to how it was developed. Originally,
process_vcpu_basic() operated on a const pointer from the X86_VCPU_BASIC
record, but during upstreaming, the addition of Remus support required
buffering of X86_VCPU_BASIC records each checkpoint.
By the time process_vcpu_basic() runs, we are committed to completing state
restoration and unpausing the guest.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Nick Rosbrook [Mon, 23 Dec 2019 15:17:06 +0000 (10:17 -0500)]
golang/xenlight: revise use of Context type
Remove the exported global context variable, 'Ctx.' Generally, it is
better to not export global variables for use through a Go package.
However, there are some exceptions that can be found in the standard
library.
Add a NewContext function instead, and remove the Open, IsOpen, and
CheckOpen functions as a result.
Also, comment-out an ineffectual assignment to 'err' inside the function
Context.CpupoolInfo so that compilation does not fail.
Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Mon, 23 Dec 2019 13:16:11 +0000 (14:16 +0100)]
x86emul: introduce CASE_SIMD_..._FP_VEX()
Since there are many AVX{,2} insns having legacy SIMD counterparts, have
macros covering both in one go. This (imo) improves readability and helps
prepare for optionally disabling SIMD support in the emulator.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 23 Dec 2019 13:13:37 +0000 (14:13 +0100)]
x86emul: introduce CASE_SIMD_PACKED_INT_VEX()
Since there are many AVX{,2} insns having legacy MMX and SIMD
counterparts, have a macro covering all three in one go. This (imo)
improves readability (simply by the shrunk number of lines) and helps
prepare for optionally disabling MMX and SIMD support in the emulator.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Mon, 23 Dec 2019 11:03:30 +0000 (11:03 +0000)]
x86/hyperv: change hv_tlb_flush_ex to fix clang build
Clang complains:
In file included from synic.c:15:
/builds/xen-project/xen/xen/include/asm/guest/hyperv-tlfs.h:900:18: error: field 'hv_vp_set' with variable sized type 'struct hv_vpset' not at the end of a struct or class is a GNU extension [-Werror,-Wgnu-variable-sized-type-not-at-end]
struct hv_vpset hv_vp_set;
^
1 error generated.
/builds/xen-project/xen/xen/Rules.mk:198: recipe for target 'synic.o' failed
make[6]: *** [synic.o] Error 1
Comment out the last variable size array from hv_tlb_flush_ex to fix
clang builds.
Fixes: bbba482664 ("x86: import hyperv-tlfs.h from Linux") Signed-off-by: Wei Liu <liuwe@microsoft.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Fri, 20 Dec 2019 19:47:49 +0000 (19:47 +0000)]
x86: Hyper-V clock source's offset should be signed
Also drop the useless inline keyword.
Fixes: 685d16bd5 (x86: implement Hyper-V clock source) Signed-off-by: Wei Liu <liuwe@microsoft.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
livepatch: Fix typos and other errors in tests Makefile
There were a bunch of typos (s/actions/action/) as well as one missing
config.h target dependency. Also, the xen_expectation target has an
unnecessary cyclic dependency.
Jan Beulich [Fri, 20 Dec 2019 15:46:20 +0000 (16:46 +0100)]
x86emul: use CASE_SIMD_PACKED_INT() where possible
This (imo) improves readability (simply by the shrunk number of lines)
and helps prepare for optionally disabling MMX and SIMD support in the
emulator.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Sergey Kovalev [Fri, 20 Dec 2019 15:45:32 +0000 (16:45 +0100)]
x86/vm_event: add short-circuit for breakpoints (aka "fast single step")
When using DRAKVUF (or another system using altp2m with shadow pages similar
to what is described in
https://xenproject.org/2016/04/13/stealthy-monitoring-with-xen-altp2m),
after a breakpoint is hit the system switches to the default
unrestricted altp2m view with singlestep enabled. When the singlestep
traps to Xen another vm_event is sent to the monitor agent, which then
normally disables singlestepping and switches the altp2m view back to
the restricted view.
This patch short-circuits that last part so that Xen doesn't need to send the
vm_event out for the singlestep event and switches back to the restricted
view automatically.
This optimization yields about a 35% speed-up.
Tested on the Debian branch of Xen 4.12. See:
https://github.com/skvl/xen/tree/debian/knorrie/4.12/fast-singlestep
Rebased on master:
https://github.com/skvl/xen/tree/fast-singlestep
Signed-off-by: Sergey Kovalev <valor@list.ru> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Igor Druzhinin [Fri, 20 Dec 2019 15:44:38 +0000 (16:44 +0100)]
x86/time: update vtsc_last with cmpxchg and drop vtsc_lock
Now that vtsc_last is the only entity protected by vtsc_lock we can
simply update it using a single atomic operation and drop the spinlock
entirely. This is extremely important for the case of running nested
(e.g. shim instance with lots of vCPUs assigned) since if preemption
happens somewhere inside the critical section that would immediately
mean that other vCPUs stop progressing (and probably get preempted
as well) waiting for the spinlock to be freed.
This fixes constant shim guest boot lockups with ~32 vCPUs if there is
vCPU overcommit present (which increases the likelihood of preemption).
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
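A lock-free update of a single "last value" can be sketched with C11 atomics. This is not Xen's cmpxchg() or its actual update rule: the demo_* function is hypothetical, and the policy "use now if it is ahead, otherwise old + 1" is an assumption chosen to keep the returned values strictly increasing.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Advance *last to a strictly larger value and return it, without a lock.
 * On a failed compare-exchange 'old' is refreshed with the current value,
 * so the loop recomputes the candidate and retries.
 */
uint64_t demo_vtsc_update(_Atomic uint64_t *last, uint64_t now)
{
    uint64_t old = atomic_load(last);
    uint64_t new_val;

    do {
        new_val = now > old ? now : old + 1;
    } while (!atomic_compare_exchange_weak(last, &old, new_val));

    return new_val;
}
```

Since there is no critical section, a vCPU preempted in the middle of this function cannot block other vCPUs; they simply win the exchange and the preempted one retries.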
Jan Beulich [Fri, 20 Dec 2019 15:30:13 +0000 (16:30 +0100)]
x86: explicitly disallow guest access to PPIN
To fulfill the "protected" in its name, don't let the real hardware
values leak. While we could report a control register value expressing
this (which I would have preferred), unconditionally raise #GP for all
accesses (in the interest of getting this done).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Fri, 20 Dec 2019 15:29:22 +0000 (16:29 +0100)]
x86/apic: allow enabling x2APIC mode regardless of interrupt remapping
x2APIC mode doesn't mandate interrupt remapping, and hence can be
enabled independently. This patch enables x2APIC when available,
regardless of whether there's interrupt remapping support.
This is especially beneficial when running in virtualized environments,
since it reduces the number of vmexits. For example when sending an
IPI in xAPIC mode Xen performs at least 3 different accesses to the
APIC MMIO region, while when using x2APIC mode a single wrmsr is used.
The following numbers are from a lock profiling of a Xen PV shim
running a Linux PV kernel with 32 vCPUs and xAPIC mode:
Roger Pau Monné [Fri, 20 Dec 2019 15:27:48 +0000 (16:27 +0100)]
x86/apic: force phys mode if interrupt remapping is disabled
Cluster mode can only be used with interrupt remapping support, since
the top 16 bits of the logical APIC ID are filled with the cluster ID.
Hence even on systems where the physical ID is still smaller than 255,
the cluster mode ID is not. Force x2APIC to use physical mode if there's
no interrupt remapping support.
Note that this requires a further patch in order to enable x2APIC
without interrupt remapping support.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
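The size problem can be shown with the architectural derivation of the x2APIC logical ID (cluster ID in bits 31:16, a one-hot bit for the CPU in bits 15:0). The demo_* helpers are hypothetical names used only for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Logical x2APIC ID = (ID[19:4] << 16) | (1 << ID[3:0]). */
uint32_t demo_x2apic_logical_id(uint32_t x2apic_id)
{
    return ((x2apic_id >> 4) << 16) | (1u << (x2apic_id & 0xf));
}

/* Without interrupt remapping only an 8-bit destination is available. */
bool demo_fits_8bit_dest(uint32_t dest)
{
    return dest <= 0xff;
}
```

For a CPU with physical ID 0x23 the physical ID fits in 8 bits, but the cluster mode ID is 0x00020008 and does not.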
Roger Pau Monné [Fri, 20 Dec 2019 15:26:09 +0000 (16:26 +0100)]
x86/ioapic: only use dest32 with x2apic and interrupt remapping enabled
The IO-APIC code assumes that x2apic being enabled also implies
interrupt remapping being enabled, and hence will use the 32bit
destination field in the IO-APIC entry.
This is safe now, but there's no reason to not enable x2APIC even
without interrupt remapping, and hence the IO-APIC code needs to use
the 32 bit destination field only when both interrupt remapping and
x2APIC are enabled.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 18 Dec 2019 20:17:42 +0000 (20:17 +0000)]
libxc/restore: Fix data auditing in handle_x86_pv_info()
handle_x86_pv_info() has a subtle bug. It uses an 'else if' chain with a
clause in the middle which doesn't exit unconditionally. In practice, this
means that when restoring a 32bit PV guest, later sanity checks are skipped.
Rework the logic a little to be simpler. There are exactly two valid
combinations of fields in X86_PV_INFO, so factor this out and check them all
in one go, before making adjustments to the current domain.
Once adjustments have been completed successfully, sanity check the result
against the X86_PV_INFO settings in one go, rather than piece-wise.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
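Checking the valid combinations in one go might look like the sketch below. The demo_* function is hypothetical, and it assumes the two valid pairs are a 64bit guest with 4 page table levels and a 32bit guest with 3 levels.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Accept only the two known-good (guest_width, pt_levels) pairs.
 * Everything else is rejected up front, so no later check can be
 * skipped by falling through an 'else if' chain.
 */
bool demo_pv_info_valid(uint8_t guest_width, uint8_t pt_levels)
{
    return (guest_width == 8 && pt_levels == 4) ||
           (guest_width == 4 && pt_levels == 3);
}
```

Validating the pair before touching the domain keeps the audit separate from the adjustments that follow it.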
Andrew Cooper [Wed, 18 Dec 2019 14:00:16 +0000 (14:00 +0000)]
tools/python: Python 3 compatibility
convert-legacy-stream is only used for incoming migration from pre-Xen 4.7,
and verify-stream-v2 appears to only be used by me during migration
development - it is little surprise that they missed the main conversion
effort in Xen 4.13.
Fix it all up.
Move open_file_or_fd() into a new util.py to avoid duplication, making it a
more generic wrapper around open() or fdopen().
In libxc.py, drop all long() conversion. Python 2 will DTRT with int => long
promotion, even on 32bit builds.
In convert-legacy-stream, don't pass empty strings to write_record(). Join on
the empty argl will do the right thing.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Andrew Cooper [Wed, 18 Dec 2019 12:43:48 +0000 (12:43 +0000)]
tools/python: Drop test.py
This file hasn't been touched since it was introduced in 2005 (c/s 0c6f36628)
and has a wildly obsolete shebang for Python 2.3. Most important for us is
that it isn't Python 3 compatible.
Drop the file entirely. Since the 2.3 days, automatic discovery of tests has
been included in standard functionality. Rewrite the test rule to use
"$(PYTHON) -m unittest discover" which is equivalent.
Dropping test.py drops the only piece of ZPL-2.0 code in the tree. Drop the
ancillary files, and adjust COPYING to match.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Lars Kurth <lars.kurth@citrix.com>
Tamas K Lengyel [Wed, 18 Dec 2019 19:40:41 +0000 (11:40 -0800)]
x86/mem_sharing: cleanup code and comments in various locations
No functional changes.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
[Further cleanup] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tamas K Lengyel [Wed, 18 Dec 2019 19:40:40 +0000 (11:40 -0800)]
tools/libxc: clean up memory sharing files
No functional changes.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com> Acked-by: Wei Liu <wl@xen.org>
[Further cleanup] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 18 Dec 2019 13:49:59 +0000 (14:49 +0100)]
x86: provide Dom0 access to PPIN via XENPF_resource_op
It was requested that we provide a way independent of the MCE reporting
interface that Dom0 software could use to get hold of the values for
particular CPUs.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 18 Dec 2019 13:49:10 +0000 (14:49 +0100)]
x86: include the PPIN in MCE records when available
Quoting the respective Linux commit:
Intel Xeons from Ivy Bridge onwards support a processor identification
number set in the factory. To the user this is a handy unique number to
identify a particular CPU. Intel can decode this to the fab/production
run to track errors. On systems that have it, include it in the machine
check record. I'm told that this would be helpful for users that run
large data centers with multi-socket servers to keep track of which CPUs
are seeing errors.
Newer AMD CPUs support this too, at different MSR numbers.
Take the opportunity and hide __MC_NMSRS from the public interface going
forward.
Andrew Cooper [Fri, 13 Dec 2019 17:56:40 +0000 (17:56 +0000)]
x86/S3: Restore cr4 later during resume
Just like the BSP/AP paths, %cr4 is loaded with only PAE. Defer restoring all
of %cr4 (MCE in particular) until all the system structures (IDT/TSS in
particular) have been loaded.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 13 Dec 2019 17:52:21 +0000 (17:52 +0000)]
x86/S3: Don't save unnecessary GPRs
Only the callee-preserved registers need saving/restoring. Spill them to the
stack like regular functions do. %rsp is now the only GPR which gets stashed
in .data.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 13 Dec 2019 17:36:09 +0000 (17:36 +0000)]
x86/S3: Clarify and improve the behaviour of do_suspend_lowlevel()
do_suspend_lowlevel() behaves as a function call, even when the trampoline
jumps back into the middle of it. Discuss this property, while renaming the
far-too-generic __ret_point to s3_resume.
Optimise the calling logic for acpi_enter_sleep_state(). $3 doesn't require a
64bit write, and the function isn't variadic so doesn't need to specify zero
FPU registers in use.
In the case of an acpi_enter_sleep_state() error, we didn't actually lose
state so don't need to restore it. Jump straight to the end.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
The existing code assumes that the first mfn passed to the boot
allocator is mapped, which creates problems when, e.g., we do not have
a direct map, and may create other bootstrapping problems in the
future. Make it static. The size is kept the same as before (1 page).
Nick Rosbrook [Mon, 16 Dec 2019 18:08:10 +0000 (18:08 +0000)]
golang/xenlight: implement keyed union C to Go marshaling
Switch over union key to determine how to populate 'union' in Go struct.
Since the unions of C types cannot be directly accessed in cgo, use a
typeof trick to typedef a struct in the cgo preamble that is analogous
to each inner struct of a keyed union. For example, to define a struct
for the hvm inner struct of libxl_domain_build_info, do:
Nick Rosbrook [Mon, 16 Dec 2019 18:08:09 +0000 (18:08 +0000)]
golang/xenlight: begin C to Go type marshaling
Begin implementation of fromC marshaling functions for generated struct
types. This includes support for converting fields that are basic
primitive types such as string and integer types, nested anonymous
structs, nested libxl structs, and libxl built-in types.
This patch does not implement conversion of arrays or keyed unions.
Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Nick Rosbrook [Mon, 16 Dec 2019 18:08:08 +0000 (18:08 +0000)]
golang/xenlight: generate structs from the IDL
Add struct and keyed union generation to gengotypes.py. For keyed unions,
use a method similar to gRPC's oneof to interpret C unions as Go types.
Meaning, for a given struct with a union field, generate a struct for
each sub-struct defined in the union. Then, define an interface of one
method which is implemented by each of the defined sub-structs. For
example:
    type domainBuildInfoTypeUnion interface {
            isdomainBuildInfoTypeUnion()
    }

    type DomainBuildInfoTypeUnionHvm struct {
            // HVM-specific fields...
    }
Then, remove existing struct definitions in xenlight.go that conflict
with the generated types, and modify existing marshaling functions to
align with the new type definitions. Notably, drop "time" package since
fields of type time.Duration are now of type uint64.
Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Re-name and modify signature of toGo function to fromC. The reason for
using 'fromC' rather than 'toGo' is that it is not a good idea to define
methods on the C types. Also, add error return type to Bitmap's toC function.
Finally, as code-cleanup, re-organize the Bitmap type's comments as per
Go conventions.
Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
--
Changes in v2:
- Use consistent variable naming for slice created from
libxl_bitmap.
Nick Rosbrook [Mon, 16 Dec 2019 18:07:59 +0000 (18:07 +0000)]
golang/xenlight: define Defbool builtin type
Define Defbool as a struct analogous to the C type, and define the type
'defboolVal' that represents true, false, and default defbool values.
Implement Set, Unset, SetIfDefault, IsDefault, Val, and String functions
on Defbool so that the type can be used in Go analogously to how it's
used in C.
Finally, implement fromC and toC functions.
Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
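For reference, the tri-state semantics being mirrored can be sketched on the C side as below. The demo_* names are hypothetical and this is not libxl's implementation; it only models a value that is true, false, or still at its default.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state: default means "not decided yet". */
enum demo_defbool_state {
    DEMO_DEFBOOL_FALSE = -1,
    DEMO_DEFBOOL_DEFAULT = 0,
    DEMO_DEFBOOL_TRUE = 1,
};

struct demo_defbool {
    enum demo_defbool_state val;
};

bool demo_defbool_is_default(const struct demo_defbool *d)
{
    return d->val == DEMO_DEFBOOL_DEFAULT;
}

void demo_defbool_set(struct demo_defbool *d, bool b)
{
    d->val = b ? DEMO_DEFBOOL_TRUE : DEMO_DEFBOOL_FALSE;
}

void demo_defbool_unset(struct demo_defbool *d)
{
    d->val = DEMO_DEFBOOL_DEFAULT;
}

/* Only takes effect if no explicit value was set before. */
void demo_defbool_setdefault(struct demo_defbool *d, bool b)
{
    if (demo_defbool_is_default(d))
        demo_defbool_set(d, b);
}

/* Reading a value that is still at its default is a caller bug. */
bool demo_defbool_get(const struct demo_defbool *d)
{
    assert(!demo_defbool_is_default(d));
    return d->val > 0;
}
```

The "set if default" operation is what distinguishes this from a plain bool: a later default never overrides an earlier explicit choice.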
Nick Rosbrook [Mon, 16 Dec 2019 18:07:59 +0000 (18:07 +0000)]
golang/xenlight: generate enum types from IDL
Introduce gengotypes.py to generate Go code from the IDL. As a first step,
implement 'enum' type generation.
As a result of the newly-generated code, remove the existing, and now
conflicting definitions in xenlight.go. In the case of the Error type,
rename the slice 'errors' to 'libxlErrors' so that it does not conflict
with the standard library package 'errors.' And, negate the values used
in 'libxlErrors' since the generated error values are negative.
Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Mon, 16 Dec 2019 16:37:09 +0000 (17:37 +0100)]
x86emul: correct far branch handling for 64-bit mode
AMD and friends explicitly specify that 64-bit operands aren't possible
for these insns. Nevertheless REX.W isn't fully ignored: It still
cancels a possible operand size override (0x66). Intel otoh explicitly
provides for 64-bit operands on the respective insn page of the SDM.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
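The vendor difference described above can be summarized as a small decision function. This is a sketch of the rule as stated in this message only; the demo_* function and its parameters are hypothetical and do not reflect the emulator's internal structure.

```c
#include <stdbool.h>

/*
 * Effective operand size (in bytes) of a far branch in 64-bit mode.
 *  - REX.W always cancels a 0x66 operand size override.
 *  - AMD-like CPUs have no 64-bit form, so REX.W leaves 32 bits.
 *  - Intel provides a 64-bit operand with REX.W.
 */
unsigned int demo_far_branch_op_bytes(bool rex_w, bool opsize_override,
                                      bool amd_like)
{
    if (rex_w)
        return amd_like ? 4 : 8;

    return opsize_override ? 2 : 4;
}
```

The interesting case is REX.W together with 0x66: 4 bytes on AMD-like hardware rather than 2, and 8 bytes on Intel.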
The version of this header present in the Linux source tree has contained
such macros for some time. These macros, as the names imply, allow front
or back rings to be set up for existent (rather than freshly created and
zeroed) shared rings.
This patch is to update this, the canonical version of the header, to
match the latest definition of these macros in the Linux source.
NOTE: The way the new macros are defined allows the FRONT/BACK_RING_INIT
macros to be re-defined in terms of them, thereby reducing
duplication.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Juergen Gross <jgross@suse.com>
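The relationship mentioned in the NOTE can be sketched with a heavily simplified ring. The DEMO_* macros and demo_* structs are hypothetical and much smaller than the real header; the point is only that initialising a fresh ring is the special case of attaching at index 0.

```c
#include <stdint.h>

/* Hypothetical, simplified shared ring and front ring state. */
struct demo_sring {
    uint32_t req_prod;
    uint32_t rsp_prod;
};

struct demo_front_ring {
    uint32_t req_prod_pvt;
    uint32_t rsp_cons;
    unsigned int nr_ents;
    struct demo_sring *sring;
};

/* Attach to an existing shared ring, resuming at index 'i'. */
#define DEMO_FRONT_RING_ATTACH(r, s, i, n) do { \
    (r)->req_prod_pvt = (i);                    \
    (r)->rsp_cons = (i);                        \
    (r)->nr_ents = (n);                         \
    (r)->sring = (s);                           \
} while (0)

/* A fresh ring is simply an attach at index 0. */
#define DEMO_FRONT_RING_INIT(r, s, n) DEMO_FRONT_RING_ATTACH(r, s, 0, n)
```

Attaching leaves the shared ring's own indices untouched, which is what makes it usable for rings that already carry state.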
Jan Beulich [Mon, 16 Dec 2019 16:35:50 +0000 (17:35 +0100)]
x86emul: correct LFS et al handling for 64-bit mode
AMD and friends explicitly specify that 64-bit operands aren't possible
for these insns. Nevertheless REX.W isn't fully ignored: It still
cancels a possible operand size override (0x66). Intel otoh explicitly
provides for 64-bit operands on the respective insn page of the SDM.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 16 Dec 2019 16:34:46 +0000 (17:34 +0100)]
x86emul: correct segment override decode for 64-bit mode
The legacy / compatibility mode ES, CS, SS, and DS overrides are fully
ignored prefixes in 64-bit mode, i.e. they in particular don't cancel an
earlier FS or GS one. (They don't violate the REX-prefix-must-be-last
rule though.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
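The decode rule above can be sketched as a loop over the legacy prefix bytes. This is an illustrative model only: the enum and function name are hypothetical, and the caller is assumed to pass just the prefix bytes (REX handling is left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum demo_seg {
    DEMO_SEG_NONE, DEMO_SEG_ES, DEMO_SEG_CS, DEMO_SEG_SS,
    DEMO_SEG_DS, DEMO_SEG_FS, DEMO_SEG_GS
};

/* Return the effective segment override after scanning the prefix bytes. */
enum demo_seg demo_decode_seg_override(const uint8_t *prefixes, size_t n,
                                       bool mode_64bit)
{
    enum demo_seg seg = DEMO_SEG_NONE;

    assert(prefixes != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        switch (prefixes[i]) {
        /* ES, CS, SS, DS: fully ignored in 64-bit mode. */
        case 0x26: if (!mode_64bit) seg = DEMO_SEG_ES; break;
        case 0x2e: if (!mode_64bit) seg = DEMO_SEG_CS; break;
        case 0x36: if (!mode_64bit) seg = DEMO_SEG_SS; break;
        case 0x3e: if (!mode_64bit) seg = DEMO_SEG_DS; break;
        /* FS and GS overrides take effect in every mode. */
        case 0x64: seg = DEMO_SEG_FS; break;
        case 0x65: seg = DEMO_SEG_GS; break;
        default: break;
        }
    }

    return seg;
}
```

For the byte sequence 0x64 0x2e, the sketch returns FS in 64-bit mode (the CS prefix does not cancel the earlier FS one) and CS otherwise.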
Igor Druzhinin [Fri, 13 Dec 2019 22:48:01 +0000 (22:48 +0000)]
x86/time: drop vtsc_{kern, user}count debug counters
They either need to be converted to atomics to work correctly
(currently they are left unprotected for HVM domains) or dropped entirely,
as taking a per-domain spinlock is too expensive for high-vCPU-count
domains, even in a debug build, given how often this lock would be taken.
Choose the latter, as they are not especially important anyway.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
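For context, the alternative that was not chosen (lock-free counters) might look roughly like the sketch below. It uses C11 atomics rather than Xen's own atomic primitives, and the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-domain debug counters, updated without a lock. */
struct demo_vtsc_stats {
    atomic_ulong kerncount;
    atomic_ulong usercount;
};

/* Count one emulated TSC read; safe to call from many vCPUs at once. */
void demo_count_tsc_read(struct demo_vtsc_stats *s, bool from_kernel)
{
    assert(s);

    /* Relaxed ordering suffices for a statistic nobody synchronises on. */
    atomic_fetch_add_explicit(from_kernel ? &s->kerncount : &s->usercount,
                              1, memory_order_relaxed);
}
```

Even this form adds a contended cache line to a hot path, which fits the message's conclusion that dropping the counters is the simpler option.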
Andrew Cooper [Mon, 16 Dec 2019 13:58:45 +0000 (13:58 +0000)]
x86/pv: Fix `global-pages` to match the documentation
c/s 5de961d9c09 "x86: do not enable global pages when virtualized on AMD or
Hygon hardware" in fact does. Fix the calculation in pge_init().
While fixing this, adjust the command line documentation, both to use the
newer style and to expand the description to cover cases where the option
might be useful but which Xen can't account for by default.
Fixes: 5de961d9c09 ('x86: do not enable global pages when virtualized on AMD or Hygon hardware') Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Thu, 12 Dec 2019 15:57:51 +0000 (15:57 +0000)]
x86/mm: More descriptive names for page de/validation functions
The functions alloc_page_type(), alloc_lN_table(), free_page_type()
and free_lN_table() are confusingly named: nothing is being allocated
or freed. Rather, the page being passed in is being either validated
or devalidated for use as the specific type; in the specific case of
pagetables, these may be promoted or demoted (i.e., grab appropriate
references for PTEs).
Rename alloc_page_type() and free_page_type() to validate_page() and
devalidate_page(). Also rename alloc_segdesc_page() to
validate_segdesc_page(), since this is what it's doing.
Rename alloc_lN_table() and free_lN_table() to promote_lN_table() and
demote_lN_table(), respectively.
After this change:
- get / put type consistently refer to increasing or decreasing the count
- validate / devalidate consistently refer to actions done when a
type count goes 0 -> 1 or 1 -> 0
- promote / demote consistently refer to acquiring or freeing
resources (in the form of type refs and general references) in order
to allow a page to be used as a pagetable.
No functional change.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
George Dunlap [Fri, 13 Dec 2019 14:09:46 +0000 (14:09 +0000)]
x86/mm: Use mfn_t in type get / put call tree
Replace `unsigned long` with `mfn_t` as appropriate throughout
alloc/free_lN_table, get/put_page_from_lNe, and
get_lN_linear_pagetable. This obviates the need for a load of
`mfn_x()` and `_mfn()` casts.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
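The idea behind a typesafe frame number can be sketched as below: wrapping the integer in a struct makes the compiler reject accidental mixing with plain integers, and conversions are only needed at the boundaries. This is a simplified illustration, not Xen's actual definition; _mfn() and mfn_x() are named after the helpers mentioned above, while the demo_ functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* A machine frame number that cannot be confused with a plain integer. */
typedef struct { unsigned long mfn; } mfn_t;

/* Wrap and unwrap; these are the only places the raw value is visible. */
static inline mfn_t _mfn(unsigned long n) { return (mfn_t){ n }; }
static inline unsigned long mfn_x(mfn_t m) { return m.mfn; }

/* Hypothetical helper: advance by 'i' frames, staying within the type. */
mfn_t demo_mfn_add(mfn_t m, unsigned long i)
{
    assert(mfn_x(m) + i >= mfn_x(m));   /* no wrap-around */
    return _mfn(mfn_x(m) + i);
}

/* Hypothetical helper: compare two frame numbers. */
bool demo_mfn_eq(mfn_t a, mfn_t b)
{
    return mfn_x(a) == mfn_x(b);
}
```

Once a whole call tree takes and returns mfn_t, callers pass the value straight through and the wrap/unwrap calls disappear from the intermediate layers.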
George Dunlap [Fri, 13 Dec 2019 12:53:04 +0000 (12:53 +0000)]
x86/mm: Use a more descriptive name for pagetable mfns
In many places, a PTE being modified is accompanied by the pagetable
mfn which contains the PTE (primarily in order to be able to maintain
linear mapping counts). In many cases, this mfn is stored in the
nondescript variable (or argument) "pfn".
Replace these names with lNmfn, to indicate that 1) this is a
pagetable mfn, and 2) that it is the same level as the PTE in
question. This should be enough to remind readers that it's the mfn
containing the PTE.
No functional change.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
George Dunlap [Fri, 13 Dec 2019 12:53:04 +0000 (12:53 +0000)]
x86/mm: Implement common put_data_pages for put_page_from_l[23]e
Both put_page_from_l2e and put_page_from_l3e handle having superpage
entries by looping over each page and "put"-ing each one individually.
As with putting page table entries, this code is functionally
identical but, for some reason, written differently. Moreover, there is already
a common function, put_data_page(), to handle automatically swapping
between put_page() (for read-only pages) or put_page_and_type() (for
read-write pages).
Replace this with put_data_pages() (plural), which does the entire
loop, as well as the put_page / put_page_and_type switch.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
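A rough model of the consolidated helper is sketched below. The page structure, the reference counters and every function name are hypothetical stand-ins (the real functions operate on Xen's page_info and have different signatures). The sketch only shows the shape: one loop over the pages of a superpage, with the read-only/read-write choice made inside it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page with separate general and type reference counts. */
struct demo_page {
    unsigned int general_refs;
    unsigned int type_refs;
};

/* Drop a general reference (what a read-only mapping holds). */
void demo_put_page(struct demo_page *pg)
{
    assert(pg->general_refs > 0);
    pg->general_refs--;
}

/* Drop a type reference and a general one (a read-write mapping). */
void demo_put_page_and_type(struct demo_page *pg)
{
    assert(pg->type_refs > 0);
    pg->type_refs--;
    demo_put_page(pg);
}

/* Put every page of a 2^order-page run mapped by one superpage entry. */
void demo_put_data_pages(struct demo_page *pg, bool writable,
                         unsigned int order)
{
    unsigned long count = 1UL << order;

    for (unsigned long i = 0; i < count; i++, pg++) {
        if (writable)
            demo_put_page_and_type(pg);
        else
            demo_put_page(pg);
    }
}
```

Both the L2 and L3 paths can then call the same helper, differing only in the order they pass.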
George Dunlap [Fri, 13 Dec 2019 12:53:04 +0000 (12:53 +0000)]
x86/mm: Refactor put_page_from_l*e to reduce code duplication
put_page_from_l[234]e have identical functionality for devalidating an
entry pointing to a pagetable. But mystifyingly, they duplicate the
code in slightly different arrangements that make it hard to tell that
it's the same.
Create a new function, put_pt_page(), which handles the common
functionality; and refactor all the functions to be symmetric,
differing only in the level of pagetable expected (and in whether they
handle superpages).
Other than put_page_from_l2e() gaining an ASSERT it probably should
have had already, no functional changes.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>