amd_k8.c did a lot of common work and very little K8
specific work. So merge init functions of amd_f10.c and
amd_k8.c and move it into the common amd_mcheck_init
handler. With that done, there is not much left in either
file, so fold all code into just one file, mce_amd.c.
While at it, update the comments regarding documentation
with correct URLs and revision numbers.
Also, update copyright info.
Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Acked-by: Christoph Egger <chegger@amazon.de>
Paul Durrant [Mon, 2 Jun 2014 08:02:25 +0000 (10:02 +0200)]
ioreq-server: make buffered ioreq handling optional
Some emulators will only register regions that require non-buffered
access. (In practice the only region that a guest uses buffered access
for today is the VGA aperture from 0xa0000-0xbffff). This patch therefore
makes allocation of the buffered ioreq page and event channel optional for
secondary ioreq servers.
If a guest attempts buffered access to an ioreq server that does not
support it, the access will be handled via the normal synchronous path.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Mon, 2 Jun 2014 08:01:27 +0000 (10:01 +0200)]
ioreq-server: remove p2m entries when server is enabled
For secondary servers, add a hvm op to enable/disable the server. The
server will not accept IO until it is enabled and the act of enabling
the server removes its pages from the guest p2m, thus preventing the guest
from directly mapping the pages and synthesizing ioreqs.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Mon, 2 Jun 2014 07:40:43 +0000 (09:40 +0200)]
ioreq-server: add support for multiple servers
The previous single ioreq server that was created on demand now
becomes the default server and an API is created to allow secondary
servers, which handle specific IO ranges or PCI devices, to be added.
When the guest issues an IO the list of secondary servers is checked
for a matching IO range or PCI device. If none is found then the IO
is passed to the default server.
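
The lookup described above can be sketched as follows. This is only an illustration: struct ioreq_server_sketch and select_ioreq_server() are hypothetical names, the model covers a single address range per server and ignores PCI device matching, and none of it is taken from the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of an ioreq server; not Xen's real struct. */
struct ioreq_server_sketch {
    uint64_t start;                    /* first address of the claimed range */
    uint64_t end;                      /* last address claimed (inclusive) */
    struct ioreq_server_sketch *next;  /* next secondary server in the list */
};

/*
 * Return the first secondary server whose range contains addr, or the
 * default server if no secondary server claims the address.
 */
struct ioreq_server_sketch *
select_ioreq_server(struct ioreq_server_sketch *secondaries,
                    struct ioreq_server_sketch *default_server,
                    uint64_t addr)
{
    struct ioreq_server_sketch *s;

    assert(default_server != NULL);

    for (s = secondaries; s != NULL; s = s->next)
        if (addr >= s->start && addr <= s->end)
            return s;

    return default_server;
}
```

With one secondary server claiming 0xa0000-0xbffff, an access to 0xa8000 goes to that server and an access to 0xc0000 falls through to the default server.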
Secondary servers use guest pages to communicate with emulators, in
the same way as the default server. These pages need to be in the
guest physmap otherwise there is no suitable reference that can be
queried by an emulator in order to map them. Therefore a pool of
pages in the current E820 reserved region, just below the special
pages is used. Secondary servers allocate from and free to this pool
as they are created and destroyed.
The size of the pool is currently hardcoded in the domain build at a
value of 8. This should be sufficient for now and both the location and
size of the pool can be modified in future without any need to change the
API.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Fix build errors in xen/xsm/dummy.c and xen/xsm/flask/hooks.c with XSM
enabled.
Jan Beulich [Wed, 28 May 2014 08:57:18 +0000 (10:57 +0200)]
hvmloader: don't use AML operations on 64-bit fields
WinXP and Win2K3, while having no problem with the QWordMemory resource
(there was another one there before), don't like operations on 64-bit
fields. Split the fields d0688669 ("hvmloader: also cover PCI MMIO
ranges above 4G with UC MTRR ranges") added to 32-bit ones, handling
carry over explicitly.
Sadly the constructs needed to create the sub-fields - nominally
- can't be used: The former gets warned upon by newer iasl, i.e. would
need to be replaced by the latter just with the addend changed to 0,
and the latter doesn't translate properly with recent iasl). Hence,
short of having an ASL/iasl expert at hand, we need to work around the
shortcomings of various iasl versions. See the code comment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ross Lagerwall [Wed, 28 May 2014 08:07:50 +0000 (10:07 +0200)]
timers: set the deadline more accurately
Program the timer to the deadline of the closest timer if it is further
than 50us ahead, otherwise set it 50us ahead. This way a single event
fires on time rather than 50us late (as it would have previously) while
still preventing too many timer wakeups in the case of having many
timers scheduled close together.
(where 50us is the timer_slop)
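
The rule boils down to taking the later of "closest deadline" and "now plus slop". A minimal sketch, assuming nanosecond timestamps; next_timer_deadline() is a hypothetical helper, not the function the patch touches:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t s_time_t;                  /* nanoseconds */

#define MICROSECS(us) ((s_time_t)(us) * 1000)

/*
 * Choose the time to program the hardware timer for: the closest pending
 * deadline, but never sooner than `slop` from now.
 */
s_time_t next_timer_deadline(s_time_t now, s_time_t closest, s_time_t slop)
{
    s_time_t earliest = now + slop;

    assert(slop >= 0);

    return closest > earliest ? closest : earliest;
}
```

With a 50us slop, a lone timer due in 200us fires at 200us, while a timer due in 10us is deferred to 50us so that it can be batched with its neighbours.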
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Jan Beulich [Wed, 28 May 2014 07:51:07 +0000 (09:51 +0200)]
x86: don't use VA for cache flush when also flushing TLB
Doing both flushes at once is a strong indication for the address
mapping to either having got dropped (in which case the cache flush,
when done via INVLPG, would fault) or its physical address having
changed (in which case the cache flush would end up being done on the
wrong address range). There is no adverse effect (other than the
obvious performance one) using WBINVD in this case regardless of the
range's size; only map_pages_to_xen() uses combined flushes at present.
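
The decision can be pictured as below. The flag values, the enum and the size limit are all assumptions made for this sketch; they do not reproduce Xen's flush flags or its actual threshold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for the sketch. */
#define SKETCH_FLUSH_TLB   0x1u
#define SKETCH_FLUSH_CACHE 0x2u

enum cache_flush_method {
    CACHE_FLUSH_NONE,     /* no cache flush requested */
    CACHE_FLUSH_BY_VA,    /* flush the range through its virtual address */
    CACHE_FLUSH_WBINVD    /* write back and invalidate the whole cache */
};

/*
 * Pick a cache flush method. A combined TLB + cache flush never goes
 * through the virtual address, whatever the size of the range, because
 * the mapping may be gone or may point somewhere else by now.
 */
enum cache_flush_method choose_cache_flush(unsigned int flags, size_t bytes,
                                           size_t va_flush_limit)
{
    assert(va_flush_limit > 0);

    if (!(flags & SKETCH_FLUSH_CACHE))
        return CACHE_FLUSH_NONE;
    if (flags & SKETCH_FLUSH_TLB)
        return CACHE_FLUSH_WBINVD;

    return bytes <= va_flush_limit ? CACHE_FLUSH_BY_VA : CACHE_FLUSH_WBINVD;
}
```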
This problem was observed with the 2nd try backport of d6cb14b3 ("VT-d:
suppress UR signaling for desktop chipsets") to 4.2 (where ioremap()
needs to be replaced with set_fixmap_nocache(); the now commented out
__set_fixmap(, 0, 0) there to undo the mapping resulted in the first of
the above two scenarios).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 28 May 2014 07:50:33 +0000 (09:50 +0200)]
AMD IOMMU: don't free page table prematurely
iommu_merge_pages() still wants to look at the next level page table,
the TLB flush necessary before freeing too happens in that function,
and if it fails no free should happen at all. Hence the freeing must
be done after that function returned successfully, not before it's
being called.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Roger Pau Monné [Wed, 28 May 2014 07:48:56 +0000 (09:48 +0200)]
x86: fix setup of PVH Dom0 memory map
This patch adds the holes removed by MMIO regions to the end of the
memory map for PVH Dom0, so the guest OS doesn't have to manually
populate this memory.
Also, provide a suitable e820 memory map for PVH Dom0, that matches
the underlying p2m map. This means that PVH guests should always use
XENMEM_memory_map in order to obtain the e820, even when running as
Dom0.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 26 May 2014 10:28:46 +0000 (12:28 +0200)]
VT-d: fix mask applied to DMIBAR in desktop chipset XSA-59 workaround
In commit ("VT-d: suppress UR signaling for desktop chipsets")
the mask applied to the value read from DMIBAR is too narrow, only the
comment accompanying it was correct. Fix that and tag the literal
number as "long" at once to avoid eventual compiler warnings.
The widest possible value so far is 39 bits; all chipsets covered here
but having less than this number of bits have the remaining bits marked
reserved (zero), and hence there's no need for making the mask chipset
specific.
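
As an illustration of a 39-bit-wide mask with a suitably typed literal: the sketch below assumes a 4 KiB aligned base (that alignment is not stated above), and dmibar_base() is a made-up helper rather than code from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define DMIBAR_ADDR_BITS  39u   /* widest address width mentioned above */
#define DMIBAR_ALIGN_BITS 12u   /* assumption: 4 KiB aligned base */

/* Strip the low control bits and anything above the 39-bit address. */
uint64_t dmibar_base(uint64_t raw)
{
    /* 64-bit constants throughout, so nothing is truncated to int. */
    const uint64_t mask = ((UINT64_C(1) << DMIBAR_ADDR_BITS) - 1) &
                          ~((UINT64_C(1) << DMIBAR_ALIGN_BITS) - 1);

    assert(mask == UINT64_C(0x7ffffff000));

    return raw & mask;
}
```

Chipsets that implement fewer address bits read the remaining ones as zero, so a single wide mask returns the same result for them as a narrower one would.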
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
There are two problems with initialization of the ioreq_t in hvmemul_do_io():
- vp_eport is uninitialized (because it doesn't need to be) but because the
struct is the subject of a copy in hvm_send_assist_req(), this is flagged
as a problem.
- dir, addr, data_is_ptr, and data may be uninitialized when the struct is
passed to hvmtrace_io_assist(). This is clearly a bug, so the
initialization of at least those fields needs to be moved earlier.
This patch fixes both these problems.
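
One common way to address both points is to clear the whole structure first and fill in the fields needed by later consumers before anything reads it. The sketch below uses a cut-down, hypothetical stand-in for ioreq_t and may differ from what the patch actually does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Cut-down stand-in for ioreq_t; only the fields named above. */
struct ioreq_sketch {
    uint64_t addr;
    uint64_t data;
    uint32_t vp_eport;
    uint8_t  dir;
    uint8_t  data_is_ptr;
};

/*
 * Clear the whole request, then set the fields that tracing and the
 * emulator will read. vp_eport stays zero until someone assigns it, so
 * copying the struct never touches indeterminate bytes.
 */
void ioreq_prepare(struct ioreq_sketch *p, uint64_t addr, uint64_t data,
                   uint8_t dir, uint8_t data_is_ptr)
{
    assert(p != NULL);

    memset(p, 0, sizeof(*p));
    p->addr = addr;
    p->data = data;
    p->dir = dir;
    p->data_is_ptr = data_is_ptr;
}
```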
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 26 May 2014 10:25:01 +0000 (12:25 +0200)]
ACPI/ERST: fix table mapping
acpi_get_table(), when executed before reaching SYS_STATE_active, will
return a mapping valid only until the next invocation of that function.
Consequently storing the returned pointer for later use is incorrect.
Copy the logic used in VT-d's DMAR handling.
Jason Andryuk [Mon, 19 May 2014 18:36:37 +0000 (14:36 -0400)]
libxl: Reset toolstack_save file position in libxl
toolstack_save data is written to a temporary file in libxl and read
back in libxl-save-helper. The file position must be reset prior to
reading the file, which is done in libxl-save-helper with lseek.
lseek is unsupported for pipes and sockets, so a wrapper passing such an
fd to libxl-save-helper fails the lseek. Moving the lseek to libxl
avoids the error, allowing the save to continue.
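
The underlying behaviour is easy to reproduce: lseek() succeeds on a regular file but fails with ESPIPE on a pipe. A small sketch, with rewind_before_handoff() as a hypothetical helper name:

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Rewind fd in the process that wrote the data, before the descriptor is
 * handed to a helper. Returns 0 on success, or -1 with errno set (ESPIPE
 * when fd refers to a pipe or socket).
 */
int rewind_before_handoff(int fd)
{
    assert(fd >= 0);

    return lseek(fd, 0, SEEK_SET) == (off_t)-1 ? -1 : 0;
}
```

Doing the rewind on the side that owns the temporary file means the helper never has to seek on a descriptor it did not create.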
Signed-off-by: Jason Andryuk <andryuk@aero.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Thu, 22 May 2014 12:20:19 +0000 (14:20 +0200)]
hvmloader: fix build with certain iasl versions
While most of them support what we have now, Wheezy's dislikes the
empty range. Put a fake one in place - it's getting overwritten upon
evaluation of _CRS anyway.
The range could be grown (downwards) if necessary; the way it is now
it is
- the highest possible one below the 36-bit boundary (with 36 bits
being the lowest common denominator for all supported systems),
- the smallest possible one that said iasl accepts.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Jan Beulich [Wed, 21 May 2014 16:14:04 +0000 (18:14 +0200)]
hvmloader: PA range 0xfc000000-0xffffffff should be UC
Rather than leaving the range from PCI_MEM_END (0xfc000000) to 4G
uncovered, we should include this in the UC range created for the (low)
PCI range. Besides being more correct, this also has the advantage that
with the way pci_setup() currently works the range will always be
mappable with a single variable range MTRR (rather than from 2 to 5
depending on how much the lower boundary gets shifted down to
accommodate all devices).
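
The "2 to 5" figure can be checked with a greedy count of power-of-two sized, naturally aligned blocks, which is the shape a variable range MTRR can describe. The helper below is illustrative only and is not hvmloader code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the power-of-two sized, naturally aligned blocks needed to cover
 * [base, end) when carving greedily from the bottom.
 */
unsigned int mtrr_ranges_needed(uint64_t base, uint64_t end)
{
    unsigned int count = 0;

    assert(base < end);

    while (base < end) {
        /* Largest power of two that base is aligned to. */
        uint64_t size = base ? (base & (~base + 1)) : (UINT64_C(1) << 63);

        /* Shrink the block until it fits in what is left. */
        while (size > end - base)
            size >>= 1;

        base += size;
        count++;
    }

    return count;
}
```

For a low PCI hole starting at 0x80000000, covering only up to 0xfc000000 takes five blocks, whereas extending the range to 4G makes it a single 2G block.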
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Jan Beulich [Wed, 21 May 2014 16:13:36 +0000 (18:13 +0200)]
hvmloader: also cover PCI MMIO ranges above 4G with UC MTRR ranges
When adding support for BAR assignments to addresses above 4G, the MTRR
side of things was left out.
Additionally the MMIO ranges in the DSDT's \_SB.PCI0._CRS were having
memory types not matching the ones put into MTRRs: The legacy VGA range
is supposed to be WC, and the other ones should be UC.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Sylvain Munaut [Tue, 20 May 2014 14:56:43 +0000 (16:56 +0200)]
hotplug/linux: Fix the vif script to handle_iptable for tap interfaces
The TAP interfaces need the same iptables rules as the VIF; without them,
traffic will not be forwarded to/from them if the default FORWARD policy
is DROP/REJECT.
Signed-off-by: Sylvain Munaut <s.munaut@whatever-company.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Julien Grall [Mon, 19 May 2014 16:23:57 +0000 (17:23 +0100)]
xen/arm: p2m: Clean cache PT when the IOMMU doesn't support coherent walk
Some IOMMUs don't support coherent PT walk. When the p2m is shared with
the CPU, Xen has to make sure the PT changes have reached the memory.
Introduce new IOMMU function that will check if the IOMMU feature is enabled
for a specified domain.
On ARM, the platform can contain multiple IOMMUs. Each of them may not
have the same set of features. The domain parameter will be used to get the
set of features for IOMMUs used by this domain.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
libxc: check return values on mmap() and madvise() on xc_alloc_hypercall_buffer()
On a Thinkpad T4440p with OpenSUSE tumbleweed with v3.15-rc4
and today's latest xen tip from the git tree strace -f reveals
we end up on a never ending wait shortly after
This is right before we just wait on the qemu process which we
had mmap'd for. Without this you'll end up getting stuck in a
loop if mmap() worked but madvise() did not. While at it I noticed
that even the mmap() failure was not being checked; fix that too.
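
The general pattern is shown below: check mmap() against MAP_FAILED, check madvise(), and undo the mapping if the second step fails. alloc_advised_buffer() is a hypothetical stand-in and does not reproduce the libxc function.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS and madvise() on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory and apply the given madvise() advice.
 * Returns NULL if either step fails; the mapping is released in that case
 * so that no half-prepared buffer is handed to the caller.
 */
void *alloc_advised_buffer(size_t len, int advice)
{
    void *p;

    assert(len > 0);

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    if (madvise(p, len, advice) < 0) {
        munmap(p, len);
        return NULL;
    }

    return p;
}
```

Note that mmap() reports failure with MAP_FAILED rather than NULL, which is an easy check to get wrong.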
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Jason Andryuk [Tue, 20 May 2014 13:37:08 +0000 (09:37 -0400)]
libxc: Protect xc_domain_resume from clobbering domain registers
xc_domain_resume() expects the guest to be in state SHUTDOWN_suspend.
However, nothing verifies the state before modify_returncode() modifies
the domain's registers. This will crash guest processes or the kernel
itself.
This can be demonstrated with `LIBXL_SAVE_HELPER=/bin/false xl migrate`.
Signed-off-by: Jason Andryuk <andryuk@aero.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Julien Grall [Fri, 16 May 2014 14:40:27 +0000 (15:40 +0100)]
xen/arm: IRQ: Store IRQ type in arch_irq_desc
For now, ARM uses different IRQ functions to set up an interrupt handler. This
is a bit annoying for common drivers because we have to add ifdefery when
an IRQ is set up (see ns16550_init_postirq for an example).
To avoid completely forking the IRQ management code, we can introduce a field
to store the IRQ type (e.g level/edge ...).
This patch also adds platform_get_irq which will retrieve the IRQ from the
device tree and set up the IRQ type correctly.
In order to use this solution, we have to move init_IRQ earlier for the boot
CPU. It's fine because the code only depends on percpu.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Wei Liu [Tue, 13 May 2014 21:53:59 +0000 (22:53 +0100)]
libxl_json: allow basic JSON type objects generation
The original logic is that basic JSON types (number, string and null)
must be an element of JSON map or array. This assumption doesn't hold
true anymore when we need to return basic JSON types.
Returning basic JSON types is required for parsing number, string and
null objects back into libxl__json_object.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Wei Liu [Tue, 13 May 2014 21:53:55 +0000 (22:53 +0100)]
libxl_internal: make JSON_* types a bit-field
Libxl can generate number as type JSON_INTEGER, JSON_DOUBLE or
JSON_NUMBER, string as type JSON_STRING or JSON_NULL (if string is
null).
So make JSON_* type a bit-field and use it in libxl__json_map_get. This is
useful when parsing a libxl__json_object to libxl_FOO struct. We can
enforce type checking on libxl__json_object in an easy way.
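
With one bit per type, a caller can state every acceptable type in a single mask. The bit assignments and the helper below are assumptions made for illustration; they are not copied from libxl_internal.h.

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per node type; the bit positions are chosen for the sketch only. */
enum json_node_type {
    JSON_NULL    = 1 << 0,
    JSON_INTEGER = 1 << 1,
    JSON_DOUBLE  = 1 << 2,
    JSON_NUMBER  = 1 << 3,
    JSON_STRING  = 1 << 4
};

/* True if a node of type `actual` is acceptable under `expected_mask`. */
bool json_type_matches(enum json_node_type actual, unsigned int expected_mask)
{
    unsigned int bits = (unsigned int)actual;

    /* A node has exactly one type, so exactly one bit must be set. */
    assert(bits != 0 && (bits & (bits - 1)) == 0);

    return (bits & expected_mask) != 0;
}
```

A string field that may legitimately be null would be looked up with JSON_STRING | JSON_NULL, and a numeric field with JSON_INTEGER | JSON_DOUBLE | JSON_NUMBER.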
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Wei Liu [Tue, 13 May 2014 21:53:48 +0000 (22:53 +0100)]
libxl: fix memory leak in libxl_cpuid_dispose
libxl_cpuid_policy_list is not allocated with GC-aware allocation so it
needs to be freed manually, just like what libxl_string_list_dispose and
libxl_key_value_list_dispose do.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Jan Beulich [Tue, 20 May 2014 13:56:48 +0000 (15:56 +0200)]
x86/HVM: don't use confusing/non-suitable XSM checks
XSM_TARGET checks following rcu_lock_{,live_}remote_domain_by_id() are
rather pointless and potentially confusing. Use XSM_DM_PRIV there
instead.
Note that setting flask_ops.hvm_control to flask_hvm_param() (instead
of introducing flask_hvm_control()) is intentional - that function is
already separating the control and non-control sub-operations.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Juergen Gross [Tue, 20 May 2014 13:55:42 +0000 (15:55 +0200)]
move domain to cpupool0 before destroying it
Currently when a domain is destroyed it is removed from the domain_list
before all of its resources, including the cpupool membership, are freed.
This can lead to a situation where the domain is still member of a cpupool
without for_each_domain_in_cpupool() (or even for_each_domain()) being
able to find it any more. This in turn can result in rejection of removing
the last cpu from a cpupool, because there seems to be still a domain in
the cpupool, even if it can't be found by scanning through all domains.
This situation can be avoided by moving the domain to be destroyed to
cpupool0 first and then remove it from this cpupool BEFORE deleting it from
the domain_list. As cpupool0 is always active and a domain without any cpupool
membership is implicitly regarded as belonging to cpupool0, this poses no
problem.
Signed-off-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Jan Beulich [Tue, 20 May 2014 13:54:01 +0000 (15:54 +0200)]
VT-d: extend error report masking workaround to newer chipsets
Add two more PCI IDs to the set that has been taken care of with a
different workaround long before XSA-59, and (for consistency with the
newer workarounds) log a message here too.
Also move the function wide comment to the cases it applies to; this
should really have been done by d061d200 ("VT-d: suppress UR signaling
for server chipsets").
This is CVE-2013-3495 / XSA-59.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Xiantao Zhang <xiantao.zhang@intel.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
Jan Beulich [Tue, 20 May 2014 13:53:20 +0000 (15:53 +0200)]
VT-d: apply quirks at device setup time rather than only at boot
Accessing extended config space may not be possible at boot time, e.g.
when the memory space used by MMCFG is reserved only via ACPI tables,
but not in the E820/UEFI memory maps (which we need Dom0 to tell us
about). Consequently the change here still leaves the issue unaddressed
for systems where the extended config space remains inaccessible (due
to firmware bugs, i.e. not properly reserving the address space of
those regions).
With the respective messages now potentially getting logged more than
once, we ought to consider whether we should issue them only if we in
fact were required to do any masking (i.e. if the relevant mask bits
weren't already set).
This is CVE-2013-3495 / XSA-59.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Xiantao Zhang <xiantao.zhang@intel.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
Xiantao Zhang [Mon, 19 May 2014 14:10:56 +0000 (16:10 +0200)]
add Yang and Kevin as the new maintainer of VT-d stuff
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Restricted the change's effect to what its subject says: Replace the
VT-d maintainers, i.e. drop the new additions for the generic IOMMU
code for the time being.
Andrew Cooper [Mon, 19 May 2014 12:24:45 +0000 (14:24 +0200)]
x86/misc: post cleanup
* panic() now works on early boot. Replace EARLY_FAIL()
* Cleanup __set_intr_gate() & friends. The master IDT is fully constructed on
early boot, and only subsequently altered on the crash path. Make them
private to traps.c, move them into .init, and remove the loop over all idts,
as __set_intr_gate() will never find an AP to patch. (For some reason,
leaving out the noinline causes ~1.5k of code bloat from GCC inlining
everything)
* No need to clear X86_EFLAGS_NT in cpu_init(). This is covered by the eflags
reset in __high_start().
* Missing '\n' from unexpected MCE printk.
* load_system_tables() is x86 specific. Move its declaration into an x86 header.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 19 May 2014 12:24:04 +0000 (14:24 +0200)]
x86/irqs: move interrupt-stub generation out of C
In addition, generate stubs for reserved exceptions. These go through the
standard handle_exception mechanism, although the C handler do_reserved_trap()
is a terminal error path.
* Move all automatic stub generation out of i8259.c and into entry.S.
* Move patching of the master IDT into trap_init(). Provide ASSERT()s to
ensure we have fully populated the IDT and don't accidentally clobber any
preexisting traps.
* Demote TRAP_copro_seg and TRAP_spurious_int to being reserved exceptions
and remove their custom entry points.
* Point double_fault's exception_table entry at do_reserved_trap. We do not
ever expect to enter a real double fault this way.
* Acquaint Xen with #VE but leave it reserved.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Olaf Hering [Mon, 19 May 2014 09:50:19 +0000 (11:50 +0200)]
libxl: add option for discard support to xl disk configuration
Handle new boolean option discard/no-discard for disk configuration. It
is supposed to disable discard support if file based backing storage was
intentionally created non-sparse to avoid fragmentation of the file.
The option is intended for the backend driver. A new boolean property
"discard-enable" is written to the backend node. An upcoming patch for
qemu will make use of this property. The kernel blkback driver may be
updated as well to disable discard for phy based backing storage.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com>
Andrew Cooper [Fri, 16 May 2014 15:39:07 +0000 (17:39 +0200)]
x86/boot: install trap handlers much earlier on boot
Patch the trap handlers into the master idt very early on boot, and setup &
load the GDT, IDT, TR and LDT. Load the IDT before the TR so we stand a chance
of catching an invalid TSS exception rather than triple faulting.
This provides full exception support far earlier on boot than previously.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 16 May 2014 15:37:46 +0000 (17:37 +0200)]
x86/misc: early cleanup
Various bits of cleanup without functional impact as far as the series goes,
but make subsequent patches cleaner.
* WARN_ON(1) is just WARN().
* Replace hand-crafted rolled stack printing with fatal_trap().
* 16 BSS bytes is overkill for an empty idtr to triple fault with. Construct
it on the stack using an appropriate struct, and correct the asm memory
constraint.
* Fix watchdog asymmetry in panic(). machine_halt() needs just as much
watchdog care as machine_restart(), but it should be up to the arch
implementation of machine_{halt,restart}() to play with the watchdog.
* unsigned and const correctness for trapstr(), along with whitespace cleanup.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 16 May 2014 15:37:18 +0000 (17:37 +0200)]
x86/traps: make the main trap handlers safe for use early during Xen boot
Most of this patch is an analysis of the safety of the trap handlers.
Traps 0, 4, 5, 9-12, 16, 17 and 19 all end up in do_trap(). do_trap() is
mostly safe, performing an exception table search and possibly panic()s.
There is one complication with traps 16 and 19 which will see about calling
the fpu_exception_callback. This involves following current which is not
valid early on boot. The has_hvm_container_vcpu(curr) check is preceded with
a system_state check, so in the exceedingly unlikely case that Xen takes an
x87/SIMD trap while booting, it will panic() instead of following a bogus
current vcpu.
Traps 1, 3, 6-8, 13 and 15 are completely safe with respect to running during
early boot. They all have well formed and obvious differences between faults
in Xen and faults in guests, with the Xen faults doing little more than
exception table walks or panic()s.
Trap 2 is a complicated codepath, but appears safe. For the possible
injection of NMIs into dom0 there is a NULL domain pointer check. The
possible softirq raised for PCI SERR will not be delivered until we start the idle
vcpu, but is safe.
Trap 14 is very complicated. The code is certainly unsafe for boot as
fixup_page_fault() will dereference current to find the running domain. There
exists an explicit do_early_page_fault() handler which shall continue to be
used.
Trap 18 has a default handler before the MCE infrastructure is set up, which
has always been unsafe and liable to deadlock itself with the console lock.
As it is expected never to trigger, and if it did we would be in serious
problems, the simple printk() is replaced with a fatal error path.
Trap 20 (Virtualisation Exception) is currently not implemented. It is fatal
one way or another, and will become more explicitly so with later changes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 16 May 2014 15:36:40 +0000 (17:36 +0200)]
x86/traps: make panic and reboot paths safe during early boot
Reverse two conditions in show_registers(). For an early crash, it is not
safe to dereference 'current' for its HVM status before knowing that it is a
guest vcpu.
Introduce SYS_STATE_smp_boot to distinguish the point at which APs need
considering before boot is complete. There is one code change required as a
result; .init.text symbols are still in use before Xen is active, so alter its
predicate in is_active_kernel_text().
Make use of SYS_STATE_smp_boot in machine_{halt,restart}(). Before Xen starts
booting the APs, any execution here is certainly the BSP.
When halting or rebooting particularly early, this avoids the risks of a #PF
or #GP when accessing the LAPIC before generic_apic_probe(), as well as trying
to enable interrupts before init_IRQ() is complete.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Wed, 14 May 2014 13:14:54 +0000 (14:14 +0100)]
xen/arm: Drop event_mask in arch_vcpu
This field has not been used for a while; the last use was before the
commit 4df76b3 "xen/arm: disable the event optimization in the gic" back
in July 2012.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Jan Beulich [Thu, 15 May 2014 13:26:12 +0000 (15:26 +0200)]
switch internal hypercall restart indication from -EAGAIN to -ERESTART
-EAGAIN being a return value we want to return to the actual caller in
a couple of cases makes this unsuitable for restart indication, and x86
already developed two cases where -EAGAIN could not be returned as
intended due to this (which is being fixed here at once).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Reviewed-by: Tim Deegan <tim@xen.org>
Ian Campbell [Wed, 14 May 2014 14:12:01 +0000 (15:12 +0100)]
tools: arm: remove code to check for a DTB appended to the kernel
The code to check for an appended DTB was confusing and unnecessary. Since we
know the size of the kernel binary passed to us we should just load the entire
thing into guest RAM (subject to the limits checks). Removing this code avoids
a whole raft of overflow and alignment issues.
We also need to validate the limits of the segment where we intend to load the
kernel to avoid overflow issues.
For ARM32 we control the load address, but we need to validate the size. The
entry point is only relevant within the guest so we don't need to worry about
that.
For ARM64 we need to validate both the load address (which is the same as the
entry point) and the size.
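
A limits check of this kind has to avoid computing load + size directly, since that sum is exactly what can overflow. A sketch with a hypothetical helper name, not taken from the libxc change:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True iff [load, load + size) lies entirely inside the RAM region
 * [ram_base, ram_base + ram_size). Written so that no intermediate
 * addition can wrap around.
 */
bool kernel_segment_fits(uint64_t load, uint64_t size,
                         uint64_t ram_base, uint64_t ram_size)
{
    uint64_t offset;

    /* The RAM region itself must not wrap. */
    assert(ram_size <= UINT64_MAX - ram_base);

    if (size == 0 || load < ram_base)
        return false;

    offset = load - ram_base;
    if (offset >= ram_size)
        return false;

    return size <= ram_size - offset;
}
```

Comparing size against the space remaining after the load offset rejects both an oversized image and a size value large enough to wrap the address.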
This is XSA-95.
Reported-by: Thomas Leonard <talex5@gmail.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Julien Grall [Tue, 13 May 2014 15:50:26 +0000 (16:50 +0100)]
MAINTAINERS: Add drivers/passthrough/arm
Add the ARM IOMMU directory to "ARM ARCHITECTURE" part
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com> Cc: Keir Fraser <keir@xen.org> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com>
Julien Grall [Tue, 13 May 2014 15:50:25 +0000 (16:50 +0100)]
xen/passthrough: Introduce IOMMU ARM architecture
This patch contains the architecture to use IOMMUs on ARM. There is no
There are no IOMMU drivers in this patch.
In this implementation, IOMMU page table will be shared with the P2M.
The code will run through the device tree and will initialize every IOMMU.
It's possible to have multiple IOMMUs on the same platform, but they must
be handled with the same driver. For now, there is no support for using
multiple iommu drivers at runtime.
Each new IOMMU driver should contain:
    static const char * const myiommu_dt_compat[] __initconst =
    {
        /* List of devices compatible with the driver. Will be matched with
         * the "compatible" property in the device tree.
         */
        NULL,
    };
Julien Grall [Tue, 13 May 2014 15:50:24 +0000 (16:50 +0100)]
xen/passthrough: iommu: Basic support of device tree assignment
Add IOMMU helpers to support device tree assignment/deassignment. This patch
introduces 2 new fields in the dt_device_node:
- is_protected: Is the device protected by an IOMMU
- domain_list: Pointer to the next device assigned to the same
domain
This commit contains only support for protecting a device with DOM0.
Device passthrough to another guest won't work out of the box.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Cc: Xiantao Zhang <xiantao.zhang@intel.com>
Julien Grall [Tue, 13 May 2014 15:50:17 +0000 (16:50 +0100)]
xen/arm: Introduce flush_tlb_domain
The pattern p2m_load_VTTBR(d) -> flush_tlb -> p2m_load_VTTBR(current->domain)
is used in a few places.
Replace this usage by flush_tlb_domain which will take care of this pattern.
This will help the readability of apply_p2m_changes, which is beginning to get big.
Kai Huang [Wed, 14 May 2014 08:54:39 +0000 (10:54 +0200)]
x86/MCE: bypass uninitialized vcpu in vMCE injection
Dom0 may bring up fewer vCPUs than the Xen hypervisor actually created for
it. In this case, on Intel platforms, vMCE injection to dom0 will fail due to
injecting a vMCE into an uninitialized vcpu, and cause dom0 to crash.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Acked-by: Christoph Egger <chegger@amazon.de>
Julien Grall [Wed, 14 May 2014 08:51:37 +0000 (10:51 +0200)]
iommu: introduce arch specific code
Currently the structure hvm_iommu (xen/include/xen/hvm/iommu.h) contains
x86 specific fields.
This patch creates:
- arch_hvm_iommu structure which will contain architecture-dependent
fields
- arch_iommu_domain_{init,destroy} functions to execute arch-specific
code during domain creation/destruction
Also move iommu_use_hap_pt and domain_hvm_iommu in asm-x86/iommu.h.
Julien Grall [Wed, 14 May 2014 08:50:22 +0000 (10:50 +0200)]
iommu: split generic code
The generic IOMMU framework code (xen/drivers/passthrough/iommu.c) contains
functions specific to x86 and PCI.
Split the framework into 3 distinct files:
- iommu.c: contains generic functions shared between x86 and ARM
(when it will be supported)
- pci.c: contains specific functions for PCI passthrough
- x86/iommu.c: contains specific functions for x86
io.c contains x86 HVM specific code. Only compile for x86.
This patch is mostly code movement in new files.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Wed, 14 May 2014 08:49:17 +0000 (10:49 +0200)]
passthrough: rework hwdom_pvh_reqs to use it also on ARM
Hardware domain on ARM will have the same requirements as hwdom PVH when iommu
is enabled. Both PVH and ARM guests have paging mode translate enabled, so Xen
can use it to know if it needs to check the requirements.
Rename the function and remove "pvh" word in the panic message.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Feng Wu [Mon, 12 May 2014 15:04:50 +0000 (17:04 +0200)]
x86/hvm: add SMAP support to HVM guest
New Intel CPUs support SMAP (Supervisor Mode Access Prevention).
SMAP prevents supervisor-mode accesses to any linear address with
a valid translation for which the U/S flag (bit 2) is 1 in every
paging-structure entry controlling the translation for the linear
address.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Tim Deegan <tim@xen.org>
If CR4.SMAP = 1, supervisor-mode data accesses are not allowed
to linear addresses that are accessible in user mode. If CPL < 3,
SMAP protections are disabled if EFLAGS.AC = 1. If CPL = 3, SMAP
applies to all supervisor-mode data accesses (these are implicit
supervisor accesses) regardless of the value of EFLAGS.AC.
This patch enables SMAP in Xen to prevent the Xen hypervisor from
accessing pv guest data, whose translation paging-structure
entries' U/S flags are all set.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
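The SMAP rules quoted above can be modelled in a few lines. This is only an illustrative sketch, not Xen code: the function name smap_faults and its parameters are hypothetical, and it assumes the access being checked is already a supervisor-mode data access (at CPL = 3 that means an implicit supervisor access).

```python
def smap_faults(cr4_smap: bool, cpl: int, eflags_ac: bool,
                page_is_user: bool) -> bool:
    """Return True if a supervisor-mode data access would be blocked by SMAP.

    Hypothetical model of the rules in the commit message above:
    - SMAP only matters when CR4.SMAP = 1 and the target linear address is
      accessible in user mode (U/S = 1 in every paging-structure entry).
    - At CPL < 3, EFLAGS.AC = 1 disables the protection.
    - At CPL = 3 (implicit supervisor accesses), EFLAGS.AC is ignored.
    """
    if not cr4_smap or not page_is_user:
        return False
    if cpl < 3 and eflags_ac:
        return False
    return True
```

For example, smap_faults(True, 0, False, True) models the hypervisor touching a user-accessible page without first setting AC, which is the case SMAP is meant to catch.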
Feng Wu [Mon, 12 May 2014 15:03:09 +0000 (17:03 +0200)]
VMX: disable SMAP feature when guest is in non-paging mode
In hardware, SMAP is disabled when the CPU is in non-paging mode.
However, Xen always uses paging mode to emulate guest non-paging
mode with HAP. To emulate this behavior, SMAP needs to be manually
disabled when the guest switches to non-paging mode.
This logic is similar to that used for SMEP.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
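A minimal sketch of the masking idea described above, assuming the hypervisor derives the CR4 value it loads into hardware from the guest's view of CR0 and CR4. The function name effective_guest_cr4 is hypothetical; the bit positions (CR0.PG = bit 31, CR4.SMEP = bit 20, CR4.SMAP = bit 21) are the architectural ones.

```python
X86_CR0_PG = 1 << 31    # CR0.PG: paging enabled
X86_CR4_SMEP = 1 << 20  # CR4.SMEP
X86_CR4_SMAP = 1 << 21  # CR4.SMAP


def effective_guest_cr4(guest_cr0: int, guest_cr4: int) -> int:
    """Return the CR4 value to run the guest with (hypothetical helper).

    When the guest believes paging is off, hardware would ignore SMEP and
    SMAP, so they are masked out while the guest is emulated with paging on.
    """
    if not (guest_cr0 & X86_CR0_PG):
        return guest_cr4 & ~(X86_CR4_SMEP | X86_CR4_SMAP)
    return guest_cr4
```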
Feng Wu [Mon, 12 May 2014 15:02:25 +0000 (17:02 +0200)]
x86: temporarily disable SMAP to legally access user pages in kernel mode
Use STAC/CLAC to temporarily disable SMAP to allow legal accesses to
user pages in kernel mode.
STAC/CLAC is not needed for compat_create_bounce_frame, since in this
chunk of code, it only accesses the pv guest's kernel stack, which is
in ring 1 for 32-bit pv guests.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Feng Wu [Mon, 12 May 2014 15:01:47 +0000 (17:01 +0200)]
x86: clear AC bit in RFLAGS to protect Xen itself by SMAP
Clear the AC bit in RFLAGS at the beginning of exceptions, interrupts and
hypercalls, so that Xen itself is protected by the SMAP mechanism. This patch
also sets the AC bit at the beginning of double_fault and fatal_trap() to reduce
the likelihood of taking a further fault while trying to dump state.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Feng Wu [Mon, 12 May 2014 15:00:39 +0000 (17:00 +0200)]
x86: add support for STAC/CLAC instructions
The STAC/CLAC instructions are only available when the SMAP feature is
available. On the other hand, they aren't needed if SMAP is not
enabled or before we start to run userspace; in those cases, the
functions and macros do nothing.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Sat, 10 May 2014 01:18:33 +0000 (02:18 +0100)]
tools/pygrub: Fix error handling if no valid partitions are found
If no partitions at all are found, pygrub never creates the name 'fs',
resulting in a NameError indicating the lack of fs, rather than a
RuntimeError explaining that no partitions were found.
Set fs to None right at the start, and use the pythonic idiom "if fs is None:"
to protect against otherwise valid values for fs which compare equal to
0/False.
Reported-by: Sven Köhler <sven.koehler@gmail.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
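The pattern described above can be sketched as follows. This is not pygrub's actual code: find_filesystem, offsets and open_fs are hypothetical names used only to show why the name is bound to None up front and tested with "is None".

```python
def find_filesystem(offsets, open_fs):
    """Return the first filesystem that open_fs() can open (hypothetical).

    fs is bound before the loop, so an empty 'offsets' list cannot lead to
    a NameError later on.
    """
    fs = None
    for offset in offsets:
        candidate = open_fs(offset)
        if candidate is not None:
            fs = candidate
            break

    # "is None" rather than "not fs": a valid filesystem object may still
    # compare equal to 0/False (for example if it defines __len__).
    if fs is None:
        raise RuntimeError("no usable partitions found")
    return fs
```

With an empty offsets list the caller now sees a RuntimeError with a meaningful message instead of a NameError.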
Andrew Cooper [Fri, 9 May 2014 09:59:58 +0000 (10:59 +0100)]
tools/libxc: Issue individual DPRINTF()s rather than multiline ones.
For libxc users who log to syslog, this results in legible logging, rather
than long lines with #012's replacing newlines.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
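As a rough illustration of the problem, assuming a syslog daemon that escapes control characters as '#' plus a three-digit octal code (newline is octal 012): the helper names syslog_escape and log_each_line below are hypothetical and only mimic that behaviour.

```python
def syslog_escape(message: str) -> str:
    """Mimic a syslog daemon that writes newlines as '#012' (assumption)."""
    return message.replace("\n", "#012")


def log_each_line(emit, message: str) -> None:
    """Issue one log call per line instead of one multiline call."""
    for line in message.splitlines():
        emit(syslog_escape(line))
```

A single multiline message is stored as one long record such as "first#012second", while log_each_line() produces one legible record per line.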
Ian Campbell [Thu, 8 May 2014 15:13:55 +0000 (16:13 +0100)]
xen: arm: bitops take unsigned int
Xen bitmaps can be 4 rather than 8 byte aligned, so use the appropriate type.
Otherwise the compiler can generate unaligned 8 byte accesses and cause traps.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Paul Durrant [Mon, 12 May 2014 10:04:45 +0000 (12:04 +0200)]
add the facility to limit ranges per rangeset
A subsequent patch exposes rangesets to secondary emulators, so to allow a
limit to be placed on the amount of xenheap that an emulator can cause to be
consumed, the function rangeset_limit() has been created to set the allowed
number of ranges in a rangeset. By default, there is no limit.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
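The idea of capping the number of ranges can be modelled as below. This is a simplified sketch and not the Xen implementation: the RangeSet class, its method names and its return convention are assumptions made for illustration. Only the concept of a per-rangeset limit with "no limit" as the default comes from the commit message.

```python
class RangeSet:
    """Sorted, disjoint, inclusive integer ranges with an optional cap."""

    def __init__(self):
        self.ranges = []   # list of (start, end) tuples
        self.limit = None  # None means "no limit" (the default)

    def set_limit(self, limit):
        self.limit = limit

    def add(self, start, end):
        """Add [start, end]; return False if the limit would be exceeded."""
        if start > end:
            raise ValueError("start must not exceed end")
        merged = []
        new_start, new_end = start, end
        for s, e in self.ranges:
            if e + 1 < new_start or new_end + 1 < s:
                merged.append((s, e))  # neither overlapping nor adjacent
            else:
                new_start, new_end = min(s, new_start), max(e, new_end)
        merged.append((new_start, new_end))
        merged.sort()
        if self.limit is not None and len(merged) > self.limit:
            return False  # reject and leave the set unchanged
        self.ranges = merged
        return True
```

Because adjacent and overlapping ranges are merged before the check, an addition that fills a gap can still succeed on a set that is already at its limit.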
Paul Durrant [Mon, 12 May 2014 10:03:19 +0000 (12:03 +0200)]
ioreq-server: on-demand creation of ioreq server
This patch only creates the ioreq server when the legacy HVM parameters
are read (by an emulator).
A lock is introduced to protect access to the ioreq server by multiple
emulator/tool invocations should such an eventuality arise. The guest is
protected by creation of the ioreq server only being done whilst the
domain is paused.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
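On-demand creation under a lock can be sketched like this. The Domain class and get_ioreq_server() are hypothetical stand-ins, not Xen structures; the sketch only shows that concurrent callers end up sharing a single server instance.

```python
import threading


class Domain:
    """Hypothetical stand-in for a domain that lazily owns one ioreq server."""

    def __init__(self):
        self._lock = threading.Lock()
        self.ioreq_server = None
        self.created = 0  # number of times a server was actually created

    def get_ioreq_server(self):
        # The lock serialises concurrent emulator/tool invocations, so the
        # check and the creation happen as one atomic step.
        with self._lock:
            if self.ioreq_server is None:
                self.ioreq_server = object()
                self.created += 1
            return self.ioreq_server
```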
Paul Durrant [Mon, 12 May 2014 10:02:20 +0000 (12:02 +0200)]
ioreq-server: create basic ioreq server abstraction
Collect data structures concerning device emulation together into
a new struct hvm_ioreq_server.
Code that deals with the shared and buffered ioreq pages is extracted from
functions such as hvm_domain_initialise, hvm_vcpu_initialise and do_hvm_op
and consolidated into a set of hvm_ioreq_server manipulation functions. The
lock in the hvm_ioreq_page served two different purposes and has been
replaced by separate locks in the hvm_ioreq_server structure.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Mon, 12 May 2014 10:01:43 +0000 (12:01 +0200)]
ioreq-server: centralize access to ioreq structures
To simplify creation of the ioreq server abstraction in a subsequent patch,
this patch centralizes all use of the shared ioreq structure and the
buffered ioreq ring to the source module xen/arch/x86/hvm/hvm.c.
The patch moves an rmb() from inside hvm_io_assist() to hvm_do_resume()
because the former may now be passed a data structure on the stack, in which
case the barrier is unnecessary.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Paul Durrant [Mon, 12 May 2014 10:00:30 +0000 (12:00 +0200)]
ioreq-server: pre-series tidy up
This patch tidies up various parts of the code that following patches move
around. If these modifications were combined with the code motion it would
be easy to miss them.
There's also some function renaming to reflect purpose and a single
whitespace fix.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Edmund H White [Mon, 12 May 2014 09:59:19 +0000 (11:59 +0200)]
Nested VMX: load current_vmcs only when it exists
There may not be a valid vmcs on the current CPU, so only load it when it exists.
The original fix is from Edmund <edmund.h.white@intel.com>.
Signed-off-by: Edmund H White <edmund.h.white@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Mukesh Rathor [Thu, 8 May 2014 12:18:27 +0000 (14:18 +0200)]
pvh dom0: construct_dom0 changes
This patch changes construct_dom0() to boot in pvh mode:
- Make sure dom0 elf supports pvh mode.
- Call guest_physmap_add_page for pvh rather than simple p2m setting
- Map all non-RAM regions 1:1 up to the end region in e820 or 4GB,
whichever is higher.
- Allocate p2m, copying calculation from toolstack.
- Allocate shared info page from the virtual space so that dom0 PT
can be updated. Then update p2m for it with the actual mfn.
- Since we build the page tables for pvh the same as for pv, in
pvh_fixup_page_tables_for_hap we replace the mfns with pfns.
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
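The "map all non-RAM regions 1:1" step above can be illustrated with a small model. This is a sketch under stated assumptions, not the construct_dom0() logic: the function non_ram_ranges and the (start, size, is_ram) tuple layout are hypothetical, and the e820 entries are assumed to be sorted and non-overlapping.

```python
FOUR_GB = 1 << 32


def non_ram_ranges(e820):
    """Return [start, end) ranges that are not RAM and would be mapped 1:1.

    e820: list of (start, size, is_ram) tuples, sorted by start.
    The scan stops at the end of the last e820 region or at 4GB,
    whichever is higher.
    """
    last_end = max((start + size for start, size, _ in e820), default=0)
    limit = max(FOUR_GB, last_end)

    holes = []
    cursor = 0
    for start, size, is_ram in e820:
        if not is_ram:
            continue  # non-RAM entries fall into the holes between RAM
        if start > cursor:
            holes.append((cursor, start))
        cursor = max(cursor, start + size)
    if cursor < limit:
        holes.append((cursor, limit))
    return holes
```

For a map with RAM below 0xA0000 and from 1MB to 2GB, this yields the legacy hole at 0xA0000-0x100000 and everything from 2GB up to 4GB.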