xen/arm: use p2m_mmio_direct_c to map reserved-memory
Don't use p2m_ram_rw for memory mapped into the guest with iomem, and
for reserved-memory regions. Instead, use p2m_mmio_direct_c which has
very similar pagetable properties but not the same security implications
(p2m_is_ram checks and memory allocations.)
xen/arm: add reserved-memory regions to the dom0 memory node
Reserved memory regions are automatically remapped to dom0. Their device
tree nodes are also added to dom0 device tree. However, the dom0 memory
node is not currently extended to cover the reserved memory regions
ranges as required by the spec. This commit fixes it.
xen/arm: map reserved-memory regions as normal memory in dom0
reserved-memory regions should be mapped as normal memory. At the
moment, they get remapped as device memory in dom0 because Xen doesn't
know any better. Add an explicit check for it.
However, reserved-memory regions are allowed to overlap partially or
completely with memory nodes. In these cases, the overlapping memory is
reserved-memory and should be handled accordingly.
As we parse the device tree in Xen, keep track of the reserved-memory
regions as they need special treatment (follow-up patches will make use
of the stored information.)
Parse a new cacheability option for the iomem parameter. It can be
"devmem" for device memory mappings, which is the default, or "memory"
for normal memory mappings.
Store the parameter in a new field in libxl_iomem_range.
Pass the cacheability option to xc_domain_memory_mapping.
Add an additional parameter to xc_domain_memory_mapping to pass
cacheability information. The parameter values are the same as for the
XEN_DOMCTL_memory_mapping hypercall (0 is device memory, 1 is normal
memory). Pass CACHEABILITY_DEVMEM by default -- no changes in behavior.
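The option parsing described above can be sketched as follows. This is a
minimal illustration, not the libxl code: parse_cacheability() and
CACHEABILITY_MEMORY are hypothetical names; only CACHEABILITY_DEVMEM, the
two option strings and the 0/1 values come from the commit messages.

```c
#include <stddef.h>
#include <string.h>

/* Values as stated in the commit message: 0 is device memory, 1 is
 * normal memory. CACHEABILITY_MEMORY is an assumed name. */
enum { CACHEABILITY_DEVMEM = 0, CACHEABILITY_MEMORY = 1 };

/* Map the optional iomem cacheability string to its numeric value.
 * Returns 0 on success and -1 for an unrecognised string. */
int parse_cacheability(const char *opt, unsigned int *out)
{
    if (opt == NULL || *opt == '\0' || strcmp(opt, "devmem") == 0) {
        *out = CACHEABILITY_DEVMEM;   /* default: device memory */
        return 0;
    }
    if (strcmp(opt, "memory") == 0) {
        *out = CACHEABILITY_MEMORY;
        return 0;
    }
    return -1;
}
```

Treating a missing option as "devmem" keeps existing configurations
unchanged, which matches the "no changes in behavior" note.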
xen: extend XEN_DOMCTL_memory_mapping to handle cacheability
Reuse the existing padding field to pass cacheability information about
the memory mapping, specifically, whether the memory should be mapped as
normal memory or as device memory (this is what we have today).
Add a cacheability parameter to map_mmio_regions. 0 means device
memory, which is what we have today.
On ARM, map device memory as p2m_mmio_direct_dev (as it is already done
today) and normal memory as p2m_ram_rw.
On x86, return error if the cacheability requested is not device memory.
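A rough sketch of the per-architecture behaviour described in this commit.
The enum and function names are hypothetical stand-ins, and -EINVAL is an
assumed error code; the commit message only says that x86 returns an error.

```c
#include <errno.h>

/* Stand-ins for the two p2m types named in the commit message. */
typedef enum { DEMO_P2M_MMIO_DIRECT_DEV, DEMO_P2M_RAM_RW } demo_p2m_type;

/* ARM: 0 (device memory) keeps today's mapping type, 1 (normal memory)
 * selects the RAM type. Anything else is rejected. */
int demo_arm_pick_p2m_type(unsigned int cacheability, demo_p2m_type *t)
{
    switch (cacheability) {
    case 0: *t = DEMO_P2M_MMIO_DIRECT_DEV; return 0;
    case 1: *t = DEMO_P2M_RAM_RW;          return 0;
    default: return -EINVAL;               /* assumed error code */
    }
}

/* x86: only device memory is accepted. */
int demo_x86_check_cacheability(unsigned int cacheability)
{
    return cacheability == 0 ? 0 : -EINVAL; /* assumed error code */
}
```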
Julien Grall [Tue, 23 Oct 2018 18:17:07 +0000 (19:17 +0100)]
xen/arm: gic: Ensure ordering between read of INTACK and shared data
When an IPI is generated by a CPU, the pattern looks roughly like:
<write shared data>
dsb(sy);
<write to GIC to signal SGI>
On the receiving CPU we rely on the fact that, once we've taken the
interrupt, then the freshly written shared data must be visible to us.
Put another way, the CPU isn't going to speculate taking an interrupt.
Unfortunately, this assumption turns out to be broken.
Consider that CPUx wants to send an IPI to CPUy, which will cause CPUy
to read some shared_data. Before CPUx has done anything, a random
peripheral raises an IRQ to the GIC and the IRQ line on CPUy is raised.
CPUy then takes the IRQ and starts executing the entry code, heading
towards gic_handle_irq. Furthermore, let's assume that a bunch of the
previous interrupts handled by CPUy were SGIs, so the branch predictor
kicks in and speculates that irqnr will be <16 and we're likely to
head into handle_IPI. The prefetcher then grabs a speculative copy of
shared_data which contains a stale value.
Meanwhile, CPUx gets round to updating shared_data and asking the GIC
to send an SGI to CPUy. Internally, the GIC decides that the SGI is
more important than the peripheral interrupt (which hasn't yet been
ACKed) but doesn't need to do anything to CPUy, because the IRQ line
is already raised.
CPUy then reads the ACK register on the GIC, sees the SGI value which
confirms the branch prediction and we end up with a stale shared_data
value.
This patch fixes the problem by adding an smp_rmb() to the IPI entry
code in do_SGI.
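The ordering requirement can be illustrated with portable C11 atomics. This
is an analogy under stated assumptions, not the Xen code: the flag
sgi_pending stands in for the GIC ACK register, the release fence plays the
role of dsb(sy) on the sender, and the acquire fence plays the role of the
added smp_rmb() on the receiver (an acquire fence is somewhat stronger than
a read barrier). All names are hypothetical.

```c
#include <stdatomic.h>

static int shared_data;          /* payload written by the sending CPU */
static atomic_int sgi_pending;   /* stands in for the GIC ACK register */

/* Sender: publish the data, then signal. */
void demo_send_ipi(int value)
{
    shared_data = value;
    atomic_thread_fence(memory_order_release);   /* role of dsb(sy) */
    atomic_store_explicit(&sgi_pending, 1, memory_order_relaxed);
}

/* Receiver: returns 1 and fills *out if an IPI was pending, else 0. */
int demo_handle_ipi(int *out)
{
    if (!atomic_load_explicit(&sgi_pending, memory_order_relaxed))
        return 0;
    /* Without this fence, the read of shared_data below could be
     * satisfied speculatively before the flag read, yielding a stale
     * value. This is the role of the smp_rmb() added in do_SGI. */
    atomic_thread_fence(memory_order_acquire);
    *out = shared_data;
    atomic_store_explicit(&sgi_pending, 0, memory_order_relaxed);
    return 1;
}
```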
Julien Grall [Tue, 23 Oct 2018 18:17:06 +0000 (19:17 +0100)]
xen/arm: gic: Ensure we have an ISB between ack and do_IRQ()
Devices that expose their interrupt status registers via system
registers (e.g. Statistical profiling, CPU PMU, DynamIQ PMU, arch timer,
vgic (although unused by Linux), ...) rely on a context synchronising
operation on the CPU to ensure that the updated status register is
visible to the CPU when handling the interrupt. This usually happens as
a result of taking the IRQ exception in the first place, but there are
two race scenarios where this isn't the case.
For example, let's say we have two peripherals (X and Y), where Y uses a
system register for its interrupt status.
Case 1:
1. CPU takes an IRQ exception as a result of X raising an interrupt
2. Y then raises its interrupt line, but the update to its system
register is not yet visible to the CPU
3. The GIC decides to expose Y's interrupt number first in the Ack
register
4. The CPU runs the IRQ handler for Y, but the status register is stale
Case 2:
1. CPU takes an IRQ exception as a result of X raising an interrupt
2. CPU reads the interrupt number for X from the Ack register and runs
its IRQ handler
3. Y raises its interrupt line and the Ack register is updated, but
again, the update to its system register is not yet visible to the
CPU.
4. Since the GIC drivers poll the Ack register, we read Y's interrupt
number and run its handler without a context synchronisation
operation, therefore seeing the stale register value.
In either case, we run the risk of missing an IRQ. This patch solves the
problem by ensuring that we execute an ISB in the GIC drivers prior
to invoking the interrupt handler.
Julien Grall [Wed, 31 Oct 2018 18:13:13 +0000 (18:13 +0000)]
xen/arm: Move vgic_* helpers from gic.h to vgic.h
Keep vgic_* helpers in a single place. At the same time remove gic.h
from event.h since the helpers have now been moved to vgic.h (included by
domain.h).
Julien Grall [Wed, 31 Oct 2018 18:13:02 +0000 (18:13 +0000)]
xen/arm: Move SYSREG accessors in sysregs.h
System registers accessors are self-contained and should not be included
everywhere in Xen. Move the accessors into sysregs.h and include the file
when necessary.
With that change, it is not necessary to include processor.h in time.h.
Julien Grall [Wed, 31 Oct 2018 18:12:57 +0000 (18:12 +0000)]
xen/arm: Consolidate CPU identification in cpufeature.{c,h}
At the moment, CPU identification is spread across cpu.c, cpufeature.c,
processor.h, cpufeature.h. It would be better to keep everything
together in a single place.
Julien Grall [Wed, 31 Oct 2018 18:12:55 +0000 (18:12 +0000)]
xen/arm: Remove __init from prototype
In Xen, it is common to add __init to the declaration and not the
prototype. Remove the few __init on some prototypes, which avoids the
need to include init.h in headers.
With these changes, init.h is now required to be included in some C
files. Also, add __init where it was missing in declaration.
x86/hvm: clean up the rest of bool_t from vm_event
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Fri, 9 Nov 2018 12:05:28 +0000 (13:05 +0100)]
pass-through: adjust pIRQ migration
For one it is quite pointless to iterate over all pIRQ-s the domain has
when just one is being adjusted. Introduce hvm_migrate_pirq() as an
externally accessible function.
Additionally it is bogus to migrate the pIRQ to a vCPU different from
the one the event is supposed to be posted to - if anything, it might be
worth considering not to migrate the pIRQ at all in the posting case.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Thu, 1 Nov 2018 10:16:58 +0000 (10:16 +0000)]
xen/grant_table: Remove stale comment on top of map_grant_ref
Remove the 2 part comment on top of map_grant_ref:
- The first part mentions the return value which has been void since
2006!
- The second part mentions a local variable 'addr' which does not
exist anymore.
Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 9 Nov 2018 10:42:10 +0000 (11:42 +0100)]
cpufreq: convert to a single post-init driver (hooks) instance
This reduces the post-init memory footprint, eliminates a pointless
level of indirection at the use sites, and allows for subsequent
alternatives call patching.
Take the opportunity and also add a name to the PowerNow! instance.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Xin Li [Fri, 9 Nov 2018 10:41:30 +0000 (11:41 +0100)]
xsm: remove printing from set_to_dummy_if_null()
Filling the null hooks of an xsm_operations structure with the dummy
module's hooks generates a debug message for each one. This becomes
boot time spew for a module like silo, which only sets a few hooks of
its own. So remove the printing to avoid boot time spew.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Xin Li <xin.li@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
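The fill-in-if-null pattern from the xsm commit above can be sketched like
this. The structure, hooks and macro below are hypothetical and much
smaller than xsm_operations; the point is only that unset hooks are
silently replaced by dummy ones.

```c
#include <stddef.h>

/* A tiny stand-in for an operations table with optional hooks. */
struct demo_ops {
    int (*domain_create)(int domid);
    int (*domain_destroy)(int domid);
};

/* Default hooks supplied by the "dummy" module. */
int dummy_domain_create(int domid)  { (void)domid; return 0; }
int dummy_domain_destroy(int domid) { (void)domid; return 0; }

/* A module that, like silo, only overrides a few hooks. */
int strict_domain_create(int domid) { (void)domid; return -1; }

/* Fill a hook only if the module left it unset. Deliberately silent. */
#define DEMO_SET_TO_DUMMY_IF_NULL(ops, hook)     \
    do {                                         \
        if ((ops)->hook == NULL)                 \
            (ops)->hook = dummy_##hook;          \
    } while (0)

void demo_fixup_ops(struct demo_ops *ops)
{
    DEMO_SET_TO_DUMMY_IF_NULL(ops, domain_create);
    DEMO_SET_TO_DUMMY_IF_NULL(ops, domain_destroy);
}
```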
Paul Durrant [Fri, 9 Nov 2018 10:40:12 +0000 (11:40 +0100)]
viridian: introduce struct viridian_page
The 'vp_assist' page is currently an example of a guest page which needs to
be kept mapped throughout the life-time of a guest, but there are other
such examples in the specification [1]. This patch therefore introduces a
generic 'viridian_page' type and converts the current vp_assist/apic_assist
related code to use it. Subsequent patches implementing other enlightenments
can then also make use of it.
This patch also renames the 'vp_assist_pending' field in struct
hvm_viridian_vcpu_context to 'apic_assist_pending' to more accurately
reflect its meaning. The term 'vp_assist' applies to the whole page rather
than just the EOI-avoidance enlightenment. New versions of the specification
have defined data structures for other enlightenments within the same page.
Paul Durrant [Fri, 9 Nov 2018 10:39:27 +0000 (11:39 +0100)]
viridian: define type for the 'virtual VP assist page'
The specification [1] defines a type so we should use it, rather than just
OR-ing and AND-ing magic bits.
No functional change.
NOTE: The type defined in the specification does include an anonymous
sub-struct in the page type but, as we currently use only the first
element, the struct declaration has been omitted.
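To illustrate the difference between a typed page and "magic bits", here is
a small sketch. The type and field names are assumptions made for
illustration (they are not copied from the specification or from Xen), the
4096-byte page size is assumed, and only the first element of the page is
modelled, as in the note above.

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096

/* First element of the page: one flag bit used for EOI avoidance.
 * Field names are hypothetical. */
struct demo_apic_assist {
    uint32_t no_eoi:1;
    uint32_t reserved_zero:31;
};

/* The page itself: the first element overlaid on a page of padding. */
union demo_vp_assist_page {
    struct demo_apic_assist apic_assist;
    uint8_t padding[DEMO_PAGE_SIZE];
};

/* Typed accessors replace "*(uint32_t *)va |= 1" style code. */
void demo_set_no_eoi(union demo_vp_assist_page *pg)
{
    pg->apic_assist.no_eoi = 1;
}

/* Clear the flag and report whether it had been set. */
int demo_test_and_clear_no_eoi(union demo_vp_assist_page *pg)
{
    int was_set = pg->apic_assist.no_eoi;

    pg->apic_assist.no_eoi = 0;
    return was_set;
}
```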
Paul Durrant [Fri, 9 Nov 2018 10:38:03 +0000 (11:38 +0100)]
viridian: separate time related enlightenment implementations...
...into new 'time' module.
This patch reduces the size of the main viridian source module by
moving time related enlightenments into their own source module. This is
done in anticipation of implementation of more such enlightenments and
a desire to not further lengthen the main source module when this work
is done.
While moving the code:
- Move the declaration of HV_REFERENCE_TSC_PAGE from the header file into
the new source module, since it is only used there.
- Clean up a bool_t.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Paul Durrant [Fri, 9 Nov 2018 10:36:52 +0000 (11:36 +0100)]
viridian: separate interrupt related enlightenment implementations...
...into new 'synic' module.
The SynIC (synthetic interrupt controller) is specified [1] to be a super-
set of a virtualized LAPIC, and its definition encompasses all
enlightenments related to virtual interrupt control.
This patch reduces the size of the main viridian source module by giving
these enlightenments their own module. This is done in anticipation of
implementation of more such enlightenments and a desire not to further
lengthen the main source module when this work is done.
Whilst moving the code:
- Fix various style issues.
- Move the MSR definitions into the header (since they are now needed in
more than one source module).
Roger Pau Monne [Thu, 8 Nov 2018 14:23:58 +0000 (15:23 +0100)]
amd/pvh: enable ACPI C1E disable quirk on PVH Dom0
PV Dom0 has a quirk for some AMD processors, where enabling ACPI can
also enable C1E mode. Apply the same workaround as done on PV for a
PVH Dom0, which consists of trapping accesses to the SMI command IO
port and disabling C1E if ACPI is enabled.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
This patch adds a couple of regs to the vm_event that are used by
the introspection. The base, limit and ar
bits are compressed into a uint64_t union so as not to enlarge the
vm_event.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Jan Beulich [Thu, 8 Nov 2018 14:59:14 +0000 (15:59 +0100)]
x86/genapic: remove indirection from genapic hook accesses
Instead of loading a pointer at each use site, have a single runtime
instance of struct genapic, copying into it from the individual
instances. The individual instances can this way also be moved to .init
(also adjust apic_probe[] at this occasion).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
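The "single runtime instance" idea used for the genapic hooks above (and,
similarly, for the cpufreq driver hooks) can be sketched as follows. The
structure and function names are hypothetical; in Xen the per-driver
instances could additionally live in .init, which plain C cannot express
here.

```c
/* A small hook table. Each driver provides one constant instance. */
struct demo_hooks {
    const char *name;
    int (*send)(int cpu);
};

int demo_flat_send(int cpu)    { return cpu; }
int demo_cluster_send(int cpu) { return cpu + 100; }

const struct demo_hooks demo_flat    = { "flat",    demo_flat_send };
const struct demo_hooks demo_cluster = { "cluster", demo_cluster_send };

/* The single runtime instance. Use sites call through this struct
 * directly instead of first loading a pointer to a driver instance. */
struct demo_hooks demo_active;

/* Done once at probe time: copy the chosen driver's hooks. */
void demo_select(const struct demo_hooks *chosen)
{
    demo_active = *chosen;
}

int demo_send(int cpu)
{
    return demo_active.send(cpu);
}
```

Copying by value removes one pointer load per call and leaves only one
table that later call patching would need to look at.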
Jan Beulich [Wed, 7 Nov 2018 08:35:14 +0000 (09:35 +0100)]
p2m: move p2m-common.h inclusion point
The header is (hence its name) supposed to be a helper for the per-arch
p2m.h files. It was never supposed to be included directly, and for the
purpose of putting common function declarations into the common header
it is more helpful if things like p2m_t are already available at the
inclusion point.
This also undoes parts of 02ede7dc03 ("memory: add
check_get_page_from_gfn() as a wrapper..."), which had been there just
because of the unhelpful original way of including p2m-common.h.
Take the opportunity and also ditch a duplicate public/memory.h from the
ARM header.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Sergey Dyasli [Wed, 7 Nov 2018 08:34:17 +0000 (09:34 +0100)]
mm/page_alloc: make bootscrub happen in idle-loop
Scrubbing RAM during boot may take a long time on machines with lots
of RAM. Add 'idle' option to bootscrub which marks all pages dirty
initially so they will eventually be scrubbed in idle-loop on every
online CPU.
It's guaranteed that the allocator will return scrubbed pages by doing
eager scrubbing during allocation (unless MEMF_no_scrub was provided).
Use the new 'idle' option as the default one.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
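A toy model of the idle bootscrub scheme described above: every page starts
out dirty, the idle loop scrubs pages one at a time, and the allocator
scrubs eagerly when it hands out a page that is still dirty. All names and
sizes are hypothetical; the no_scrub argument stands in for MEMF_no_scrub.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NR_PAGES  4
#define DEMO_PAGE_SIZE 16

struct demo_page {
    bool free;
    bool dirty;
    unsigned char data[DEMO_PAGE_SIZE];
};

struct demo_page demo_heap[DEMO_NR_PAGES];

static void demo_scrub(struct demo_page *pg)
{
    memset(pg->data, 0, sizeof(pg->data));
    pg->dirty = false;
}

/* Boot: no scrubbing yet, just mark every page free and dirty. */
void demo_boot(void)
{
    for (size_t i = 0; i < DEMO_NR_PAGES; i++) {
        demo_heap[i].free = true;
        demo_heap[i].dirty = true;
        memset(demo_heap[i].data, 0xAA, DEMO_PAGE_SIZE); /* stale data */
    }
}

/* One idle-loop step: scrub one free dirty page, if there is any. */
bool demo_idle_scrub_one(void)
{
    for (size_t i = 0; i < DEMO_NR_PAGES; i++) {
        if (demo_heap[i].free && demo_heap[i].dirty) {
            demo_scrub(&demo_heap[i]);
            return true;
        }
    }
    return false;
}

/* Allocation: scrub eagerly unless the caller opted out. */
struct demo_page *demo_alloc(bool no_scrub)
{
    for (size_t i = 0; i < DEMO_NR_PAGES; i++) {
        struct demo_page *pg = &demo_heap[i];

        if (!pg->free)
            continue;
        pg->free = false;
        if (pg->dirty && !no_scrub)
            demo_scrub(pg);
        return pg;
    }
    return NULL;
}
```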
Jan Beulich [Wed, 7 Nov 2018 08:33:24 +0000 (09:33 +0100)]
x86: work around HLE host lockup erratum
XACQUIRE prefixed accesses to the 4Mb range of memory starting at 1Gb
are liable to lock up the processor. Disallow use of this memory range.
Unfortunately the available Core Gen7 and Gen8 spec updates are pretty
old, so I can only guess that they're similarly affected when Core Gen6
is and the Xeon counterparts are, too.
This is part of XSA-282.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
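Disallowing a memory range amounts to an interval overlap test. A minimal
sketch follows, reading "4Mb" and "1Gb" in the erratum description as 4 MiB
and 1 GiB of address space (an assumption) and using hypothetical names;
how Xen actually excludes the range is not shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* The affected range: [1 GiB, 1 GiB + 4 MiB). */
#define DEMO_HLE_BAD_START (UINT64_C(1) << 30)
#define DEMO_HLE_BAD_END   (DEMO_HLE_BAD_START + (UINT64_C(4) << 20))

/* True if the half-open physical range [start, end) touches the
 * affected range and so must not be handed to the allocator. */
bool demo_overlaps_hle_range(uint64_t start, uint64_t end)
{
    return start < DEMO_HLE_BAD_END && end > DEMO_HLE_BAD_START;
}
```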
Sergey Dyasli [Thu, 21 Jun 2018 14:35:50 +0000 (16:35 +0200)]
x86/domctl: Implement XEN_DOMCTL_get_cpu_policy
This finally (after literally years of work!) marks the point where the
toolstack can ask the hypervisor for the current CPUID configuration of a
specific domain.
Introduce a new flask access vector and update the default policies.
Also extend xen-cpuid's --policy mode to be able to take a domid and dump a
specific domain's CPUID and MSR policy.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 2 Jul 2018 16:05:33 +0000 (16:05 +0000)]
x86: Introduce struct cpu_policy to refer to a group of individual policies
This is prep work for the following patch - please refer to it as well.
When auditing and manipulating policies, it is necessary to do so with a
complete set of policies, due to the interdependences of the contents. A
containing structure like this will allow for clearer APIs and code.
As a first user, this structure is convenient for the mapping used by
XEN_SYSCTL_get_cpu_policy (implemented in the next patch), and for auditing
(later when XEN_DOMCTL_set_cpu_policy is implemented).
At this point, the distinction between *_max and *_default is introduced into
the ABI. For now, *_default is mapped to *_max, but future development work
will result in *_default being a logical subset of *_max.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Thu, 21 Jun 2018 14:35:50 +0000 (16:35 +0200)]
libx86: Introduce a helper to serialise msr_policy objects
As with CPUID, an architectural form is used for representing the MSR data.
It is expected not to change moving forwards, but does have a 32 bit field
(currently reserved) which can be used compatibly if needs be.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 21 Jun 2018 14:35:49 +0000 (16:35 +0200)]
libx86: Introduce a helper to serialise cpuid_policy objects
The serialised form is made up of the leaf, subleaf and data tuple. As this
is the architectural form, it is expected not to change going forwards.
The serialisation of the Xen/Viridian leaves isn't fully implemented yet. It
is just enough to be bug-compatible with the current DOMCTL_set_cpuid
behaviour, but needs further hypervisor work before the toolstack can sensibly
control these values.
x86_cpuid_copy_to_buffer() is implemented using Xen's regular copy_to_guest
primitives, while an API-compatible memcpy() is used for the libxc half of the
build.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
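A sketch of what a serialised (leaf, subleaf, data) entry and a
memcpy()-based copy helper might look like. The structure and function
below are hypothetical and are not the libx86 definitions; the four data
words correspond to the registers that CPUID returns.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One serialised entry: the leaf and subleaf that were queried, plus
 * the four 32-bit registers returned (eax, ebx, ecx, edx). */
struct demo_cpuid_leaf {
    uint32_t leaf, subleaf;
    uint32_t a, b, c, d;
};

/* Append one entry to a caller-supplied buffer of nr entries.
 * Returns 0 on success, or -1 if the buffer is already full. */
int demo_copy_leaf(struct demo_cpuid_leaf *buf, size_t nr, size_t *used,
                   const struct demo_cpuid_leaf *src)
{
    if (*used >= nr)
        return -1;
    memcpy(&buf[*used], src, sizeof(*src));
    (*used)++;
    return 0;
}
```

In a hypervisor build the memcpy() would be replaced by a guest-copy
primitive with the same shape, which is the split the commit describes.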
George Dunlap [Tue, 6 Nov 2018 15:41:25 +0000 (15:41 +0000)]
tools/dm_depriv: Add first cut RLIMITs
Limit the ability of a potentially compromised QEMU to consume system
resources. Key limits:
- RLIMIT_FSIZE (file size): 256KiB
- RLIMIT_NPROC (after uid changes to a unique uid)
NB that we do not yet set RLIMIT_AS (total virtual memory) or
RLIMIT_NOFILE (number of open files), since these require more care
and/or more coordination with QEMU to implement.
Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v4:
- Put global headers before local headers (sugg by Paul)
- Move #undef inside the braces (sugg by Paul)
Changes since v3:
- Align RLIMIT_ENTRY list for easier reading
- Fix wrong format string specifier
- Get rid of some trailing whitespace
Changes since v2:
- Use a macro to define rlimit entries
- Use RLIMIT_NLIMITS as an end-of-list marker, rather than -1
- Various style clean-ups
CC: Ian Jackson <ian.jackson@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Anthony Perard <anthony.perard@citrix.com>
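A table-driven way of applying such limits, in the spirit of the macro
mentioned in the v2 changelog above, might look like this. It is a sketch
only: the names are hypothetical, only the RLIMIT_FSIZE value (256KiB) is
taken from the commit message, and the RLIMIT_NPROC entry is left out
because its value is not stated here.

```c
#include <stddef.h>
#include <sys/resource.h>

struct demo_rlimit_entry {
    int resource;
    struct rlimit limit;
};

/* One table row: the same value is used for the soft and hard limit. */
#define DEMO_RLIMIT_ENTRY(res, val) \
    { (res), { .rlim_cur = (val), .rlim_max = (val) } }

const struct demo_rlimit_entry demo_limits[] = {
    DEMO_RLIMIT_ENTRY(RLIMIT_FSIZE, 256 * 1024),   /* 256KiB */
};

#define DEMO_NR_LIMITS (sizeof(demo_limits) / sizeof(demo_limits[0]))

/* Apply every entry in the current process, to be called in the child
 * just before exec()ing the device model. Returns 0 on success. */
int demo_apply_limits(void)
{
    for (size_t i = 0; i < DEMO_NR_LIMITS; i++)
        if (setrlimit(demo_limits[i].resource, &demo_limits[i].limit) != 0)
            return -1;
    return 0;
}
```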
George Dunlap [Tue, 6 Nov 2018 15:41:24 +0000 (15:41 +0000)]
tools/dm_restrict: Unshare mount and IPC namespaces on Linux
QEMU running under Xen doesn't need mount or IPC functionality.
Create and enter separate namespaces for each of these before
executing QEMU, so that in the event that other restrictions fail, the
process won't be able to even name system mount points or existing
non-file-based IPC descriptors to attempt to attack them.
Unsharing is something a process can only do to itself (it would
seem); so add an os-specific "dm_preexec_restrict()" hook just before
we exec() the device model.
Also add checks to depriv-process-checker.sh to verify that dm is
running in a new namespace (or at least, a different one than the
caller).
Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v4:
- Fix function prototype for netbsd code
Changes since v3:
- Fix some more style issues
Changes since v2:
- Return an error rather than calling exit()
- Use LOGE() and print to the current stderr fd, rather than
printing to the new stderr fd via write()
- Use r for external return values rather than rc.
CC: Ian Jackson <ian.jackson@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Anthony Perard <anthony.perard@citrix.com>
George Dunlap [Tue, 6 Nov 2018 15:41:23 +0000 (15:41 +0000)]
tools/dm_restrict: Ask QEMU to chroot
When dm_restrict is enabled, ask QEMU to chroot into an empty directory.
* Create $XEN_RUN_DIR/qemu-root-<domid> (deleting the old one if it's there)
* Pass the -chroot option to QEMU
Rather than running `rm -rf` on the directory before creating it
(since there is no library function to do this), simply rmdir the
directory, relying on the fact that the previous QEMU instance, if
properly restricted, shouldn't have been able to write anything
anyway.
Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v4:
- Minor change to comment
- Update stale directory name in commit message
Changes since v2:
- Style fixes
- Testing moved to a different patch
CC: Ian Jackson <ian.jackson@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Anthony Perard <anthony.perard@citrix.com>
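The rmdir-then-create approach from the chroot commit above can be sketched
with plain POSIX calls. The function name and the 0700 mode are assumptions
for illustration; this is not the libxl implementation.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Make sure "path" exists as a fresh, empty directory.
 * rmdir() only removes empty directories, so a leftover directory that
 * a previous (misbehaving) instance managed to write into makes this
 * fail instead of being silently reused. Returns 0 on success. */
int demo_prepare_chroot_dir(const char *path)
{
    if (rmdir(path) != 0 && errno != ENOENT)
        return -1;                 /* exists but could not be removed */
    if (mkdir(path, 0700) != 0)    /* mode is an assumed value */
        return -1;
    return 0;
}
```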
George Dunlap [Tue, 6 Nov 2018 15:41:22 +0000 (15:41 +0000)]
SUPPORT.md: Add qemu-depriv section
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v4:
- Fix some grammar (s/attack/attacking/;)
Changes since v3:
- Moved from the qemu-depriv doc patches.
- Reword to include the possibility of having a non-dom0 "devicemodel"
domain which may want to be protected
- Specify `Linux dom0` as the currently-tech-supported window
CC: Ian Jackson <ian.jackson@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Andrew Cooper <andrew.cooper3@citrix.com> CC: Jan Beulich <jbeulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Konrad Wilk <konrad.wilk@oracle.com> CC: Stefano Stabellini <sstabellini@kernel.org> CC: Julien Grall <julien.grall@arm.com> CC: Anthony Perard <anthony.perard@citrix.com> CC: Ross Lagerwall <ross.lagerwall@citrix.com>
George Dunlap [Tue, 6 Nov 2018 15:41:22 +0000 (15:41 +0000)]
docs/qemu-deprivilege: Revise and update with status and future plans
docs/qemu-deprivilege.txt had some basic instructions for using
dm_restrict, but it was incomplete, misleading, and stale.
Update the docs in a number of ways.
First, separate user-facing documentation and technical description
into docs/features and docs/design, respectively.
In the feature doc:
* Introduce a section mentioning minimum versions of Linux, Xen, and
qemu required (TBD)
* Fix the discussion of qemu userid. Mention xen-qemuuser-range-base,
and provide example shell code that actually has some hope of working
(instead of failing out after creating 900 userids).
* Describe how to enable restrictions, as well as features which
probably don't or definitely don't work.
In the design doc, introduce a "Technical Details" section which
describes specifically what restrictions are currently done, and also
what restrictions we are looking at doing in the future.
The idea here is that as we implement the various items for the
future, we move them from "Restrictions still to do" to "Restrictions
done". This can also act as a design document -- a place for public
discussion of what can or should be done and how.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v4:
- Remove unnecessary FIXME
- Remove stale "Add SUPPORT.md"
Changes since v3:
- Fix typo (32->16)
- Use an example value not close to the `nobody` uids, but still a
multiple of 2^16.
- Mention that using a multiple of 2^16 may have advantages.
- Have the example create a group as well
- Reorganize two comments on the "range-base" method for clarity
Changes since v2:
- Extraneous privcmd / evtchn instances aren't closed
- Expand description of how to test fd deprivileging
- Rework and clarify two namespace sections, give reference for QEMU NAK
- Add more information about migration technical challenges
- In UID section, mention possibility of container ID collisions.
- Fix name of design document.
- Add SUPPORT.md statement. Specify Linux, to make sure that FreeBSD is
evaluated separately.
- Mention that `-sandbox` is a blacklist and why
Changes since v1:
- Break into two, and move into appropriate directories (rather than 'misc')
- Updated version requirements
- Distinguish between features which "don't yet work" and features which we never expect to work
- Update description of xen-restrict functionality
- Reorder and expand further restrictions
- Make it more clear which restrictions are available on Linux only
- Include detailed description of how to kill a process
- Add RLIMIT_NPROC as something we can do without further changes to qemu
- Document the need to check for the sandbox feature before using it
Thank you to Ross Lagerwall, whose description of what XenServer is
doing formed much of the basis for the text here.
CC: Ian Jackson <ian.jackson@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Andrew Cooper <andrew.cooper3@citrix.com> CC: Jan Beulich <jbeulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Konrad Wilk <konrad.wilk@oracle.com> CC: Stefano Stabellini <sstabellini@kernel.org> CC: Julien Grall <julien.grall@arm.com> CC: Anthony Perard <anthony.perard@citrix.com> CC: Ross Lagerwall <ross.lagerwall@citrix.com>
Ian Jackson [Mon, 5 Nov 2018 18:40:49 +0000 (18:40 +0000)]
tools: ipxe: Correct download error handling
This shell fragment lacked set -e. So, eg if the download failed a
broken ipxe.tar.gz would be left behind.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Tested-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Mon, 5 Nov 2018 18:37:05 +0000 (18:37 +0000)]
tools: Once again honour, but no longer advertise GIT_HTTP env var
In "build: add autoconf to replace custom checks in tools/check"
--enable-githttp was introduced. But we missed this comment where it
was advertised.
Also, that commit had the effect of unconditionally setting GIT_HTTP
from the configure variable. But the env var has been advertised in
some places as the way to specify this behaviour, and overriding it is
just unfriendly.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> CC: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Dario Faggioli [Fri, 19 Oct 2018 15:54:41 +0000 (17:54 +0200)]
tools: libxl/xl: run NUMA placement even when an hard-affinity is set
Right now, if either a hard or a soft affinity is explicitly specified
in a domain's config file, automatic NUMA placement is skipped. However,
automatic NUMA placement affects only the soft-affinity of the domain
which is being created.
Therefore, it is ok to let it run if a hard-affinity is specified. The
semantics will be that the best placement candidate would be found,
respecting the specified hard-affinity, i.e., using only the nodes that
contain the pcpus in the hard-affinity mask.
This is particularly helpful if global xl pinning masks are defined, as
made possible by commit aa67b97ed34279c43 ("xl.conf: Add global affinity
masks"). In fact, without this commit, defining a global affinity mask
would also mean disabling automatic placement, but that does not
necessarily have to be the case (especially in large systems).
Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
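"Using only the nodes that contain the pcpus in the hard-affinity mask" can
be illustrated with simple bitmasks. This is a hypothetical helper, not
libxl code: it assumes at most 64 pcpus and 32 nodes, and a cpu_to_node[]
array supplied by the caller.

```c
#include <stdint.h>

/* Return a bitmask of the NUMA nodes that host at least one pcpu set
 * in hard_affinity. cpu_to_node[i] is the node of pcpu i and must be
 * below 32. Placement would then only consider these nodes. */
uint32_t demo_candidate_nodes(uint64_t hard_affinity,
                              const unsigned int *cpu_to_node,
                              unsigned int nr_cpus)
{
    uint32_t nodes = 0;

    for (unsigned int cpu = 0; cpu < nr_cpus && cpu < 64; cpu++)
        if (hard_affinity & (UINT64_C(1) << cpu))
            nodes |= UINT32_C(1) << cpu_to_node[cpu];

    return nodes;
}
```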
Juergen Gross [Fri, 26 Oct 2018 13:13:44 +0000 (15:13 +0200)]
Release: add release note link to SUPPORT.md
In order to have a link to the release notes in the feature list
generated from SUPPORT.md add that link in the "Release Support"
section of that file.
The real link needs to be adapted when the version is being released.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 2 Nov 2018 17:46:38 +0000 (17:46 +0000)]
x86/vcpu: Remove struct vcpu allocation restriction when possible
There is no need for struct vcpu to live below the 4G boundary for PV guests,
or for HVM vcpus using HAP.
Plumb struct domain into alloc_vcpu_struct() so the x86 version can query the
domain's type and paging settings.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
Wei Liu [Fri, 2 Nov 2018 15:55:42 +0000 (15:55 +0000)]
x86: rearrange x86_64/entry.S
Split the file into two halves. The first half pertains to PV guest
code while the second half is mostly used by the hypervisor itself to
handle interrupts and exceptions.
No functional change intended.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Fri, 2 Nov 2018 15:55:39 +0000 (15:55 +0000)]
x86: make traps.c build with !CONFIG_PV
Provide a stub for pv_inject_event. Put code that accesses PV fields
and GDT / LDT fault handling code under CONFIG_PV. Move set_debugreg
to pv/misc-hypercalls.c.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 5 Nov 2018 10:13:59 +0000 (11:13 +0100)]
x86emul: VME and PVI modes require a #GP(0) check first thing
As explicitly spelled out by the SDM, EFLAGS.VIF and EFLAGS.VIP both set
at the start of an instruction trigger #GP(0) independent of actual
instruction.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 5 Nov 2018 10:13:09 +0000 (11:13 +0100)]
x86: deal with firmware setting bogus TSC_ADJUST values
The system Intel have handed me for AVX512 emulator work ("Gigabyte
Technology Co., Ltd. X299 AORUS Gaming 3 Pro/X299 AORUS Gaming 3
Pro-CF, BIOS F3 12/28/2017") would not come up under Xen - it hung in
the middle of Dom0 PCI initialization. As it turned out, Xen's time
management did not work because of the firmware setting (only) the boot
CPU's TSC_ADJUST MSR to a large negative value (on the order of -2^50).
Follow Linux (also shamelessly stealing their comments) in
- clearing the register for the boot CPU (we don't have a need for
exceptions here yet, as the only exception in Linux is a class of
systems Xen doesn't work on anyway as far as I'm aware),
- forcing non-negative values uniformly (commit 855615eee9 ["x86/tsc:
Remove the TSC_ADJUST clamp"] dropped this, but without this my
Haswell box won't boot anymore),
- syncing the registers within sockets.
Linux, prior to the aforementioned commit, capped at 0x7fffffff as well, but
as the description there says, this issue has been addressed with a microcode
update. Hence until someone runs into such a system without being able
to update its microcode, I think we should leave out that specific part.
In order to avoid making init_percpu_time() depend on running _before_
set_cpu_sibling_map() (and hence the booting CPU _not_ being accounted
in socket_cpumask[] yet), move that call slightly earlier in
start_secondary().
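The three steps listed above can be illustrated with a small model. This is a sketch under stated assumptions, not the hypervisor code: the function name sanitize_tsc_adjust and its arguments are hypothetical, the per-CPU MSR values are represented as a plain dictionary, and using the lowest-numbered CPU of each socket as the reference is an assumption made for the example.

```python
def sanitize_tsc_adjust(values, socket_of, boot_cpu=0):
    """Model of the TSC_ADJUST handling on a dict {cpu: msr_value}.

    socket_of maps each CPU to its socket. Both arguments are
    hypothetical stand-ins for state the hypervisor reads from hardware.
    """
    adjusted = dict(values)

    # 1. Clear the register on the boot CPU.
    adjusted[boot_cpu] = 0

    # 2. Force non-negative values uniformly.
    adjusted = {cpu: max(v, 0) for cpu, v in adjusted.items()}

    # 3. Sync the registers within each socket (assumption: the
    #    lowest-numbered CPU of a socket provides the reference value).
    reference = {}
    for cpu in sorted(adjusted):
        sock = socket_of[cpu]
        if sock not in reference:
            reference[sock] = adjusted[cpu]
        else:
            adjusted[cpu] = reference[sock]
    return adjusted


# Boot CPU 0 carries a bogus value on the order of -2^50.
msrs = {0: -2**50, 1: 5, 2: 4, 3: 7}
sockets = {0: 0, 1: 0, 2: 1, 3: 1}
result = sanitize_tsc_adjust(msrs, sockets)
```

With these inputs both CPUs of socket 0 end up at 0 and both CPUs of socket 1 at 4.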
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 5 Nov 2018 10:12:39 +0000 (11:12 +0100)]
x86/TSC: don't allow deadline timer to be used with unfixed errata
In preparation for writes to the TSC_ADJUST MSR, avoid the bad
interaction of writes to it and the TSC_DEADLINE one. Presumably the
original Linux commit bd9240a18e ("x86/apic: Add TSC_DEADLINE quirk due
to errata") refers to e.g. KBW092. (Of course this is an issue also
without us writing the TSC_ADJUST MSR, if instead firmware did already.)
The errata checking can't be put in init_apic_mappings() as Linux does,
as that runs before we update microcode on the boot CPU. It needs to
happen before consumers of tdt_enabled, i.e.
- __setup_APIC_LVTT() <- setup_APIC_timer() <- setup_boot_APIC_clock()
  <- calibrate_APIC_clock() <- setup_boot_APIC_clock()
- setup_boot_APIC_clock()
setup_boot_APIC_clock() gets called from smp_prepare_cpus(), which sits
after microcode loading (note that calibrate_APIC_clock() gets called
before setting tdt_enabled).
Also add an MFENCE as per Linux commit 5d7c631d92 ("x86/apic: Serialize
LVTT and TSC_DEADLINE writes"), but I see no reason to put a conditional
around it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Mon, 5 Nov 2018 10:11:39 +0000 (11:11 +0100)]
viridian: remove duplicate union types
The 'viridian_vp_assist', 'viridian_hypercall_gpa' and
'viridian_reference_tsc' union types are identical in layout. The layout
is also common throughout the specification [1].
This patch declares a common 'viridian_page_msr' type and converts the rest
of the code to use that type for both the hypercall and VP assist pages.
Also, rename 'viridian_guest_os_id' to 'viridian_guest_os_id_msr' since it
also is a union representing an MSR value.
Paul Durrant [Mon, 5 Nov 2018 10:10:55 +0000 (11:10 +0100)]
viridian: remove comments referencing section number in the spec
Microsoft has a habit of re-numbering sections in the spec, so avoid
referring to section numbers in comments. Also remove the URL for the
spec from the boilerplate, since Microsoft has a habit of changing
these too.
This patch also cleans up some > 80 character lines.
Purely cosmetic. No functional change.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Roger Pau Monne <roger.pau@citrix.com>
Wei Liu [Fri, 2 Nov 2018 12:34:12 +0000 (12:34 +0000)]
libxl/arm: fix guest type conversion
Commit 359970fd8b ("tools/libxl: Switch Arm guest type to PVH") missed
changing the type field in c_info. This issue didn't surface until
ef72c93df9, which broke creating PV guests on Arm.
Create libxl__arch_domain_create_info_setdefault and switch the type
there.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
The may_defer var was left with the older bool_t type. This patch
changes the type to bool.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Paul Durrant <paul.durrant@citrix.com>
Jan Beulich [Fri, 2 Nov 2018 11:15:33 +0000 (12:15 +0100)]
VMX: fix vmx_handle_eoi()
In commit 303066fdb1e ("VMX: fix interaction of APIC-V and Viridian
emulation") I screwed up: Instead of clearing SVI, other ISR bits
should be taken into account.
Introduce a new helper set_svi(), split out of vmx_process_isr(), and
use it also from vmx_handle_eoi().
Following the problems in vmx_intr_assist() (see the still present big
block of debugging code there) also warn (once) if EOI'd vector and
original SVI don't match.
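The intended behaviour can be sketched as follows: after an EOI, SVI should reflect the highest vector still in service rather than simply being cleared. This is a minimal model with hypothetical names (svi_after_eoi and its arguments), treating the ISR as an integer bitmap. It is not the set_svi() helper itself.

```python
def highest_vector(isr):
    """Highest set bit of an ISR bitmap held in an int, or -1 if empty."""
    return isr.bit_length() - 1


def svi_after_eoi(isr, eoi_vector, old_svi):
    """Return (new_svi, mismatch) after the guest EOIs eoi_vector.

    isr is the in-service bitmap before the EOI. mismatch flags the
    case the commit wants to warn about: EOI'd vector != original SVI.
    """
    remaining = isr & ~(1 << eoi_vector)
    # Other ISR bits are taken into account; SVI is 0 only if none remain.
    new_svi = max(highest_vector(remaining), 0)
    return new_svi, eoi_vector != old_svi


# Vectors 0x30 and 0x51 are in service; the guest EOIs 0x51.
isr_bitmap = (1 << 0x30) | (1 << 0x51)
new_svi, mismatch = svi_after_eoi(isr_bitmap, 0x51, old_svi=0x51)
```

Here SVI drops to 0x30 instead of 0, which is the difference from the earlier behaviour the commit describes.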
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Commit 81946a73dc975a7dafe9017a8e61d1e64fdbedbf removed
Xenctrl.with_intf based on its undesirable behaviour of opening and
closing a Xenctrl connection with every invocation. This commit
re-introduces with_intf but with an updated behaviour: it maintains a
global Xenctrl connection which is opened upon first usage and kept
open. This handle can be obtained by clients using new functions
get_handle() and close_handle().
The main motivation of re-introducing with_intf is that otherwise
clients will have to implement this functionality individually.
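The behaviour described above, a single connection opened on first use and kept open, can be sketched like this. The model is in Python rather than OCaml and only illustrates the pattern: get_handle, close_handle and with_intf mirror the names in the commit message, while open_connection and the state dictionary are hypothetical stand-ins for the real binding.

```python
# Shared state: the cached handle and a counter of how often we opened.
state = {"handle": None, "opens": 0}


def open_connection():
    """Hypothetical stand-in for opening a Xenctrl connection."""
    state["opens"] += 1
    return object()


def get_handle():
    """Return the global handle, opening it upon first usage."""
    if state["handle"] is None:
        state["handle"] = open_connection()
    return state["handle"]


def close_handle():
    """Drop the global handle; the next get_handle() reopens it."""
    state["handle"] = None


def with_intf(f):
    """Run f with the shared handle instead of a fresh connection."""
    return f(get_handle())


first = with_intf(lambda h: h)
second = with_intf(lambda h: h)
```

Both calls receive the same handle and only one connection is opened, which is the difference from the removed per-invocation open and close.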
Signed-off-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
tools/misc/xenpm: fix getting info when some CPUs are offline
Use physinfo.max_cpu_id instead of physinfo.nr_cpus to get max CPU id.
This fixes, for example, 'xenpm get-cpufreq-para' with smt=off, which
otherwise would miss half of the cores.
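A small model shows why the loop bound matters. It assumes, purely for illustration, that with smt=off only the even-numbered CPU ids stay online; the helper name cpus_visited is hypothetical, while nr_cpus and max_cpu_id stand for the physinfo fields named above.

```python
def cpus_visited(online_ids, upper_bound):
    """Online CPUs reached by a loop over ids 0 .. upper_bound - 1."""
    return [cpu for cpu in range(upper_bound) if cpu in online_ids]


# Assumed layout: 8 hardware threads, sibling threads (odd ids) offline.
online = {0, 2, 4, 6}
nr_cpus = len(online)      # number of online CPUs: 4
max_cpu_id = max(online)   # highest online CPU id: 6

old_result = cpus_visited(online, nr_cpus)         # bound by nr_cpus
new_result = cpus_visited(online, max_cpu_id + 1)  # bound by max_cpu_id
```

Bounding the loop by nr_cpus stops at id 3 and reaches only CPUs 0 and 2, while bounding it by max_cpu_id reaches all four online CPUs.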
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com>