Roger Pau Monne [Fri, 11 Aug 2017 15:32:34 +0000 (16:32 +0100)]
vpci/msix: add MSI-X handlers
Add handlers for accesses to the MSI-X message control field on the
PCI configuration space, and traps for accesses to the memory region
that contains the MSI-X table and PBA. These traps detect attempts by
the guest to configure MSI-X interrupts and set them up properly.
Note that accesses to the Table Offset, Table BIR, PBA Offset and PBA
BIR are not trapped by Xen at the moment.
Finally, turn the panic in the Dom0 PVH builder into a warning.
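As an illustration of the message control field referred to above, here is a minimal sketch of how its bits could be decoded. It assumes the standard MSI-X capability layout (table size in bits 10:0 encoded as N - 1, function mask in bit 14, enable in bit 15). The macro and function names are hypothetical and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the PCI MSI-X capability layout. */
#define MSIX_CTRL_TABLE_SIZE_MASK 0x07ffu /* bits 10:0, encoded as N - 1 */
#define MSIX_CTRL_FUNCTION_MASK   0x4000u /* bit 14: mask all vectors */
#define MSIX_CTRL_ENABLE          0x8000u /* bit 15: MSI-X enable */

/* Number of table entries advertised by the message control field. */
unsigned int msix_table_entries(uint16_t ctrl)
{
    return (ctrl & MSIX_CTRL_TABLE_SIZE_MASK) + 1;
}

/* PBA size in bytes: one pending bit per entry, rounded up to 64 bits. */
unsigned int msix_pba_bytes(unsigned int entries)
{
    return ((entries + 63) / 64) * 8;
}

/*
 * An entry can deliver interrupts only if MSI-X is enabled and neither
 * the function-wide mask nor the entry's own mask bit is set.
 */
bool msix_entry_active(uint16_t ctrl, bool entry_masked)
{
    return (ctrl & MSIX_CTRL_ENABLE) &&
           !(ctrl & MSIX_CTRL_FUNCTION_MASK) &&
           !entry_masked;
}
```

The last helper mirrors the point made in the v4 changelog: the function mask bit and the per-entry mask bit both have to be considered when deciding whether an entry is live.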
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
- Remove parentheses around offsetof.
- Add "being" to MSI-X enabling comment.
- Use INVALID_PIRQ.
- Add a simple sanity check to vpci_msix_arch_enable in order to
detect wrong MSI-X entries more quickly.
- Constify vpci_msix_arch_print entry argument.
- s/cpu/fixed/ in vpci_msix_arch_print.
- Dump the MSI-X info together with the MSI info.
- Fix vpci_msix_control_write to take into account changes to the
address and data fields when switching the function mask bit.
- Only disable/enable the entries if the address or data fields have
been updated.
- Use the BAR enable field to check if a BAR is mapped or not
(instead of reading the command register for each device).
- Fix error path in vpci_msix_read to set the return data to ~0.
- Simplify mask usage in vpci_msix_write.
- Cast data to uint64_t when shifting it 32 bits.
- Fix writes to the table entry control register to take into account
if the mask-all bit is set.
- Add some comments to clarify the intended behavior of the code.
- Align the PBA size to 64-bits.
- Remove the error label in vpci_init_msix.
- Try to compact the layout of the vpci_msix structure.
- Remove the local table_bar and pba_bar variables from
vpci_init_msix, they are used only once.
Changes since v3:
- Propagate changes from previous versions: remove xen_ prefix, use
the new fields in vpci_val and remove the return value from
handlers.
- Remove the usage of GENMASK.
- Move the arch-specific parts of the dump routine to the
x86/hvm/vmsi.c dump handler.
- Chain the MSI-X dump handler to the 'M' debug key.
- Fix the header BAR mappings so that the MSI-X regions inside of
BARs are unmapped from the domain p2m in order for the handlers to
work properly.
- Unconditionally trap and forward accesses to the PBA MSI-X area.
- Simplify the conditionals in vpci_msix_control_write.
- Fix vpci_msix_accept to use a bool type.
- Allow all supported accesses as described in the spec to the MSI-X
table.
- Truncate the returned address when the access is a 32b read.
- Always return X86EMUL_OKAY from the handlers, returning ~0 in the
read case if the access is not supported, or ignoring writes.
- Do not check that max_entries is != 0 in the init handler.
- Use trylock in the dump handler.
Changes since v2:
- Split out arch-specific code.
This patch has been tested with devices using both a single MSI-X
entry and multiple ones.
Roger Pau Monne [Fri, 11 Aug 2017 15:32:34 +0000 (16:32 +0100)]
vpci: add a priority parameter to the vPCI register initializer
This is needed for MSI-X, since MSI-X will need to be initialized
before parsing the BARs, so that the header BAR handlers are aware of
the MSI-X related holes and make sure they are not mapped in order for
the trap handlers to work properly.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
- Add a middle priority and add the PCI header to it.
Changes since v3:
- Add a numerical suffix to the section used to store the pointer to
each initializer function, and sort them at link time.
Roger Pau Monne [Fri, 11 Aug 2017 15:32:33 +0000 (16:32 +0100)]
vpci/msi: add MSI handlers
Add handlers for the MSI control, address, data and mask fields in
order to detect accesses to them and setup the interrupts as requested
by the guest.
Note that the pending register is not trapped, and the guest can
freely read/write to it.
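For readers unfamiliar with the MSI control field, the sketch below shows how the vector counts can be extracted from it. It assumes the standard MSI capability layout (enable in bit 0, multiple message capable in bits 3:1, multiple message enable in bits 6:4, both stored as log2 of the vector count). The names are hypothetical and do not come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the PCI MSI capability layout. */
#define MSI_CTRL_ENABLE     0x0001u /* bit 0: MSI enable */
#define MSI_CTRL_CAPABLE    0x000eu /* bits 3:1: log2(vectors supported) */
#define MSI_CTRL_CONFIGURED 0x0070u /* bits 6:4: log2(vectors enabled) */

/* Is MSI currently enabled for the function? */
bool msi_enabled(uint16_t ctrl)
{
    return (ctrl & MSI_CTRL_ENABLE) != 0;
}

/* Maximum number of vectors the device advertises. */
unsigned int msi_max_vectors(uint16_t ctrl)
{
    return 1u << ((ctrl & MSI_CTRL_CAPABLE) >> 1);
}

/* Number of vectors the guest has configured. */
unsigned int msi_configured_vectors(uint16_t ctrl)
{
    return 1u << ((ctrl & MSI_CTRL_CONFIGURED) >> 4);
}
```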
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v4:
- Fix commit message.
- Change the ASSERTs in vpci_msi_arch_mask into ifs.
- Introduce INVALID_PIRQ.
- Destroy the partially created bindings in case of failure in
vpci_msi_arch_enable.
- Just take the pcidevs lock once in vpci_msi_arch_disable.
- Print an error message in case of failure of pt_irq_destroy_bind.
- Make vpci_msi_arch_init return void.
- Constify the arch parameter of vpci_msi_arch_print.
- Use fixed instead of cpu for msi redirection.
- Separate the header includes in vpci/msi.c between xen and asm.
- Store the number of configured vectors even if MSI is not enabled
and always return it in vpci_msi_control_read.
- Fix/add comments in vpci_msi_control_write to clarify intended
behavior.
- Simplify usage of masks in vpci_msi_address_{upper_}write.
- Add comment to vpci_msi_mask_{read/write}.
- Don't use MASK_EXTR in vpci_msi_mask_write.
- s/msi_offset/pos/ in vpci_init_msi.
- Move control variable setup closer to its usage.
- Use d%d in vpci_dump_msi.
- Fix printing of bitfield mask in vpci_dump_msi.
- Fix definition of MSI_ADDR_REDIRECTION_MASK.
- Shuffle the layout of vpci_msi to minimize gaps.
- Remove the error label in vpci_init_msi.
Changes since v3:
- Propagate changes from previous versions: drop xen_ prefix, drop
return value from handlers, use the new vpci_val fields.
- Use MASK_EXTR.
- Remove the usage of GENMASK.
- Add GFLAGS_SHIFT_DEST_ID and use it in msi_flags.
- Add "arch" to the MSI arch specific functions.
- Move the dumping of vPCI MSI information to dump_msi (key 'M').
- Remove the guest_vectors field.
- Allow the guest to change the number of active vectors without
having to disable and enable MSI.
- Check the number of active vectors when parsing the disable
mask.
- Remove the debug messages from vpci_init_msi.
- Move the arch-specific part of the dump handler to x86/hvm/vmsi.c.
- Use trylock in the dump handler to get the vpci lock.
Changes since v2:
- Add an arch-specific abstraction layer. Note that this is only implemented
for x86 currently.
- Add a wrapper to detect MSI enabling for vPCI.
NB: I've only been able to test this with devices using a single MSI interrupt
and no mask register. I will try to find hardware that supports the mask
register and more than one vector, but I cannot make any promises.
If there are doubts about the untested parts we could always force Xen to
report no per-vector masking support and only 1 available vector, but I would
rather avoid doing it.
Roger Pau Monne [Fri, 11 Aug 2017 15:32:33 +0000 (16:32 +0100)]
vpci/bars: add handlers to map the BARs
Introduce a set of handlers that trap accesses to the PCI BARs and the command
register, in order to snoop BAR sizing and BAR relocation.
The command handler is used to detect changes to the memory space bit
(bit 1, mask 0x2: response to memory space accesses), and maps/unmaps
the BARs of the device into
the guest p2m. A rangeset is used in order to figure out which memory
to map/unmap. This makes it easier to keep track of the possible
overlaps with other BARs, and will also simplify MSI-X support, where
certain regions of a BAR might be used for the MSI-X table or PBA.
The BAR register handlers are used to detect attempts by the guest to size or
relocate the BARs.
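The two ideas described above can be sketched as follows: detecting a toggle of the memory decoding bit in the command register, and carving a hole (for example an MSI-X table) out of a BAR range before mapping it. This is a simplified stand-in for a rangeset, using inclusive ranges. The struct and function names are hypothetical and are not Xen's rangeset API.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2u /* memory space enable in the command register */

/* Inclusive address range, as a stand-in for one rangeset entry. */
struct range {
    uint64_t start, end;
};

/* Did the write flip the memory decoding bit? */
bool memory_decoding_toggled(uint16_t old_cmd, uint16_t new_cmd)
{
    return ((old_cmd ^ new_cmd) & PCI_COMMAND_MEMORY) != 0;
}

/*
 * Remove 'hole' from 'bar'. The pieces of 'bar' that remain are written
 * to out[0..1] and their count (0, 1 or 2) is returned.
 */
unsigned int range_remove(struct range bar, struct range hole,
                          struct range out[2])
{
    unsigned int n = 0;

    if ( hole.end < bar.start || hole.start > bar.end )
    {
        out[n++] = bar; /* no overlap, the BAR is left intact */
        return n;
    }

    if ( hole.start > bar.start )
        out[n++] = (struct range){ bar.start, hole.start - 1 };
    if ( hole.end < bar.end )
        out[n++] = (struct range){ hole.end + 1, bar.end };

    return n;
}
```

Only the remaining pieces would be mapped into the p2m, so that accesses to the hole keep trapping.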
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: George Dunlap <George.Dunlap@eu.citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tim Deegan <tim@xen.org> Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v4:
- Expand commit message to mention the reason behind the usage of
rangesets.
- Fix comment related to the inclusiveness of rangesets.
- Fix off-by-one error in the calculation of the end of memory
regions.
- Store the state of the BAR (mapped/unmapped) in the vpci_bar
enabled field, which was previously only used by ROMs.
- Fix double negation of return code.
- Modify vpci_cmd_write so it has a single call to pci_conf_write16.
- Print a warning when trying to write to the BAR with memory
decoding enabled (and ignore the write).
- Remove header_type local variable, it's used only once.
- Move the read of the command register.
- Restore previous command register value in the exit paths.
- Only set address to INVALID_PADDR if the initial BAR value matches
~0 & PCI_BASE_ADDRESS_MEM_MASK.
- Don't disable the enabled bit in the expansion ROM register, memory
decoding is already disabled and takes precedence.
- Don't use INVALID_PADDR, just set the initial BAR address to the
value found in the hardware.
- Introduce rom_enabled to store the status of the
PCI_ROM_ADDRESS_ENABLE bit.
- Reorder fields of the structure to prevent holes.
Changes since v3:
- Propagate previous changes: drop xen_ prefix and use u8/u16/u32
instead of the previous half_word/word/double_word.
- Constify some of the parameters.
- s/VPCI_BAR_MEM/VPCI_BAR_MEM32/.
- Simplify the number of fields stored for each BAR, a single address
field is stored and contains the address of the BAR both on Xen and
in the guest.
- Allow the guest to move the BARs around in the physical memory map.
- Add support for expansion ROM BARs.
- Do not cache the value of the command register.
- Remove a label used in vpci_cmd_write.
- Fix the calculation of the sizing mask in vpci_bar_write.
- Check the memory decode bit in order to decide if a BAR is
positioned or not.
- Disable memory decoding before sizing the BARs in Xen.
- When mapping/unmapping BARs check if there's overlap between BARs,
in order to avoid unmapping memory required by another BAR.
- Introduce a macro to check whether a BAR is mappable or not.
- Add a comment regarding the lack of support for SR-IOV.
- Remove the usage of the GENMASK macro.
Changes since v2:
- Detect unset BARs and allow the hardware domain to position them.
Roger Pau Monne [Fri, 11 Aug 2017 15:32:32 +0000 (16:32 +0100)]
pci: split code to size BARs from pci_add_device
So that it can be called from outside in order to get the size of regular PCI
BARs. This will be required in order to map the BARs from PCI devices into PVH
Dom0 p2m.
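As background, the arithmetic behind sizing a memory BAR can be sketched as below. The function takes the values read back after writing all 1s to the BAR register(s), and returns the number of BAR slots consumed (1 for 32b, 2 for 64b). It does not touch hardware. The name, signature, and the handling of a 64b BAR in the last slot (size 0, one slot consumed) are assumptions made for this sketch, not the behaviour of pci_size_mem_bar.

```c
#include <stdbool.h>
#include <stdint.h>

#define BAR_MEM_TYPE_MASK 0x06u       /* bits 2:1 encode the memory BAR type */
#define BAR_MEM_TYPE_64   0x04u       /* 64-bit memory BAR */
#define BAR_MEM_ADDR_MASK 0xfffffff0u /* address bits of a memory BAR */

/*
 * lo/hi: values read back after writing ~0 to the BAR (and to the next
 * BAR for the upper half of a 64-bit one). Stores the size in *psize
 * and returns how many BAR slots were consumed.
 */
unsigned int size_mem_bar(uint32_t lo, uint32_t hi, bool last,
                          uint64_t *psize)
{
    bool is64 = (lo & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
    uint32_t low = lo & BAR_MEM_ADDR_MASK;

    if ( is64 && last )
    {
        /* Assumption: no room for the upper half, report an empty BAR. */
        *psize = 0;
        return 1;
    }

    if ( is64 )
    {
        uint64_t mask = ((uint64_t)hi << 32) | low;

        /* The size is the two's complement of the writable address bits. */
        *psize = mask ? ~mask + 1 : 0;
        return 2;
    }

    *psize = low ? (uint32_t)(~low + 1) : 0;
    return 1;
}
```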
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com>
---
Changes since v4:
- Restore printing whether the BAR is from a vf.
- Make the psize pointer parameter not optional.
- s/u64/uint64_t.
- Remove some unneeded parentheses.
- Assert the return value is never 0.
Changes since v3:
- Rename function to size BARs to pci_size_mem_bar.
- Change the parameters passed to the function. Pass the position and
whether the BAR is the last one, instead of the (base, max_bars,
*index) tuple.
- Make the function return the number of BARs consumed (1 for 32b, 2
for 64b BARs).
- Change the dprintk back to printk.
- Do not log another error message in pci_add_device in case
pci_size_mem_bar fails.
Roger Pau Monne [Fri, 11 Aug 2017 15:32:32 +0000 (16:32 +0100)]
mm: move modify_identity_mmio to global file and drop __init
And also allow it to do non-identity mappings by adding a new
parameter.
This function will be needed in order to map the BARs from PCI devices
into the Dom0 p2m (and is also used by the x86 Dom0 builder). While
there fix the function to use gfn_t and mfn_t instead of unsigned long
for memory addresses.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
- Guard the function with CONFIG_HAS_PCI only.
- s/non-trival/non-negligible in the comment.
- Change XENLOG_G_WARNING to XENLOG_WARNING like the original
function.
Changes since v3:
- Remove the dummy modify_identity_mmio helper in dom0_build.c
- Try to make the comment in modify MMIO less scary.
- Clarify commit message.
- Only build the function for x86 or if there's PCI support.
Changes since v2:
- Use mfn_t and gfn_t.
- Remove stray newline.
Roger Pau Monne [Fri, 11 Aug 2017 15:32:31 +0000 (16:32 +0100)]
x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0
So that MMCFG regions not present in the MCFG ACPI table can be added
at run time by the hardware domain.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
- Change the hardware_domain check in hvm_physdev_op to a vpci check.
- Only register the MMCFG area, but don't scan it.
Roger Pau Monne [Fri, 11 Aug 2017 15:32:31 +0000 (16:32 +0100)]
x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas
Introduce a set of handlers for the accesses to the MMCFG areas. Those
areas are set up based on the contents of the hardware MMCFG tables,
and the list of handled MMCFG areas is stored inside of the hvm_domain
struct.
The read/writes are forwarded to the generic vpci handlers once the
address is decoded in order to obtain the device and register the
guest is trying to access.
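The address decoding step can be sketched as follows, assuming the usual MMCFG (ECAM) layout where each function owns 4KiB of config space: bits 27:20 select the bus, 19:15 the device, 14:12 the function and 11:0 the register. Here `base` is assumed to be the address at which `start_bus` begins. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct pci_addr {
    uint8_t bus, dev, func;
    uint16_t reg;
};

/* Decode a guest physical address inside an MMCFG region. */
struct pci_addr mmcfg_decode(uint64_t base, uint8_t start_bus, uint64_t addr)
{
    uint64_t off = addr - base;
    struct pci_addr a;

    a.bus = (uint8_t)(start_bus + ((off >> 20) & 0xff));
    a.dev = (off >> 15) & 0x1f;
    a.func = (off >> 12) & 0x7;
    a.reg = off & 0xfff;

    return a;
}

/* Reject accesses that would run past the 4KiB space of one function. */
bool mmcfg_access_ok(uint16_t reg, unsigned int size)
{
    return reg + size <= 0x1000;
}
```

The second helper corresponds to the v4 changelog entry about checking that reg + size does not cross a device boundary.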
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v4:
- Change the attribute of pvh_setup_mmcfg to __hwdom_init.
- Try to add as many MMCFG regions as possible, even if one fails to
add.
- Change some fields of the hvm_mmcfg struct: turn size into a
unsigned int, segment into uint16_t and bus into uint8_t.
- Convert some address parameters from unsigned long to paddr_t for
consistency.
- Make vpci_mmcfg_decode_addr return the decoded register in the
return of the function.
- Introduce a new macro to convert a MMCFG address into a BDF, and
use it in vpci_mmcfg_decode_addr to clarify the logic.
- In vpci_mmcfg_{read/write} unify the logic for 8B accesses and
smaller ones.
- Add the __hwdom_init attribute to register_vpci_mmcfg_handler.
- Test that reg + size doesn't cross a device boundary.
Changes since v3:
- Propagate changes from previous patches: drop xen_ prefix for vpci
functions, pass slot and func instead of devfn and fix the error
paths of the MMCFG handlers.
- s/ecam/mmcfg/.
- Move the destroy code to a separate function, so the hvm_mmcfg
struct can be private to hvm/io.c.
- Constify the return of vpci_mmcfg_find.
- Use d instead of v->domain in vpci_mmcfg_accept.
- Allow 8byte accesses to the mmcfg.
Roger Pau Monne [Fri, 11 Aug 2017 15:32:31 +0000 (16:32 +0100)]
vpci: introduce basic handlers to trap accesses to the PCI config space
This functionality is going to reside in vpci.c (and the corresponding
vpci.h header), and should be arch-agnostic. The handlers introduced
in this patch set up the basic functionality required in order to trap
accesses to the PCI config space, and allow decoding the address and
finding the corresponding handler that should handle the access
(although no handlers are implemented).
Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are
set up inside of an x86 HVM file, since that's not shared with other
arches.
A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen
whether a domain should use the newly introduced vPCI handlers, this
is only enabled for PVH Dom0 at the moment.
A very simple user-space test is also provided, so that the basic
functionality of the vPCI traps can be verified. This has proven
quite helpful during development, since the logic to handle partial
accesses or accesses that span multiple registers is not
trivial.
The handlers for the registers are added to a linked list that's kept
sorted at all times. Both the read and write handlers support accesses
that span multiple emulated registers and contain gaps that are not
emulated.
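To illustrate how an access spanning several registers can be assembled, the sketch below merges the value returned by one register handler into a partial 32-bit result. The changelog mentions a merge_result helper; the body here is only a guess at what such a helper could look like, and it assumes size is between 1 and 4 bytes with offset + size <= 4.

```c
#include <stdint.h>

/*
 * Place the low 'size' bytes of 'val' into 'data' at byte 'offset',
 * leaving the other bytes of 'data' untouched.
 * Assumes 1 <= size <= 4 and offset + size <= 4.
 */
uint32_t merge_result(uint32_t data, uint32_t val, unsigned int size,
                      unsigned int offset)
{
    uint32_t mask = 0xffffffffu >> (32 - 8 * size);

    return (data & ~(mask << (offset * 8))) | ((val & mask) << (offset * 8));
}
```

A dispatcher walking the sorted handler list could start from the bytes read from hardware for the non-emulated gaps and call this once per emulated register that overlaps the access.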
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v4:
* User-space test harness:
- Do not redirect the output of the test.
- Add main.c and emul.h as dependencies of the Makefile target.
- Use the same rule to modify the vpci and list headers.
- Remove underscores from local macro variables.
- Add _check suffix to the test harness multiread function.
- Change the value written by every different size in the multiwrite
test.
- Use { } to initialize the r16 and r20 arrays (instead of { 0 }).
- Perform some of the read checks with the local variable directly.
- Expand some comments.
- Implement a dummy rwlock.
* Hypervisor code:
- Guard the linker script changes with CONFIG_HAS_PCI.
- Rename vpci_access_check to vpci_access_allowed and make it return
bool.
- Make hvm_pci_decode_addr return the register as return value.
- Use ~3 instead of 0xfffc to remove the register offset when
checking accesses to IO ports.
- s/head/prev in vpci_add_register.
- Add parentheses around & in vpci_add_register.
- Fix register removal.
- Change the BUGs in vpci_{read/write}_hw helpers to
ASSERT_UNREACHABLE.
- Make merge_result static and change the computation of the mask to
avoid using a uint64_t.
- Modify vpci_read to only read from hardware the not-emulated gaps.
- Remove the vpci_val union and use a uint32_t instead.
- Change handler read type to return a uint32_t instead of modifying
a variable passed by reference.
- Constify the data opaque parameter of read handlers.
- Change the size parameter of the vpci_{read/write} functions to
unsigned int.
- Place the array of initialization handlers in init.rodata or
.rodata depending on whether late-hwdom is enabled.
- Remove the pci_devs lock, assume the Dom0 is well behaved and won't
remove the device while trying to access it.
- Change the recursive spinlock into a rw lock for performance
reasons.
Changes since v3:
* User-space test harness:
- Fix spaces in container_of macro.
- Implement dummy locking functions.
- Remove the 'current' macro and make current a pointer to the
statically allocated vcpu.
- Remove unneeded parentheses in the pci_conf_readX macros.
- Fix the name of the write test macro.
- Remove the dummy EXPORT_SYMBOL macro (this was needed by the RB
code only).
- Import the max macro.
- Test all possible read/write size combinations with all possible
emulated register sizes.
- Introduce a test for register removal.
* Hypervisor code:
- Use a sorted list in order to store the config space handlers.
- Remove some unneeded 'else' branches.
- Make the IO port handlers always return X86EMUL_OKAY, and set the
data to all 1's in case of read failure (write are simply ignored).
- In hvm_select_ioreq_server reuse local variables when calling
XEN_DMOP_PCI_SBDF.
- Store the pointers to the initialization functions in the .rodata
section.
- Do not ignore the return value of xen_vpci_add_handlers in
setup_one_hwdom_device.
- Remove the vpci_init macro.
- Do not hide the pointers inside of the vpci_{read/write}_t
typedefs.
- Rename priv_data to private in vpci_register.
- Simplify checking for register overlap in vpci_register_cmp.
- Check that the offset and the length match before removing a
register in xen_vpci_remove_register.
- Make vpci_read_hw return a value rather than storing it in a
pointer passed by parameter.
- Handler dispatcher functions vpci_{read/write} no longer return an
error code, errors on reads/writes should be treated like hardware
(writes ignored, reads return all 1's or garbage).
- Make sure pcidevs is locked before calling pci_get_pdev_by_domain.
- Use a recursive spinlock for the vpci lock, so that spin_is_locked
checks that the current CPU is holding the lock.
- Make the code less error-chatty by removing some of the printk's.
- Pass the slot and the function as separate parameters to the
handler dispatchers (instead of passing devfn).
- Allow handlers to be registered with either a read or write
function only, the missing handler will be replaced by a dummy
handler (writes ignored, reads return 1's).
- Introduce PCI_CFG_SPACE_* defines from Linux.
- Simplify the handler dispatchers by removing the recursion, now the
dispatchers iterate over the list of sorted handlers and call them
in order.
- Remove the GENMASK_BYTES, SHIFT_RIGHT_BYTES and ADD_RESULT macros,
and instead provide a merge_result function in order to merge a
register output into a partial result.
- Rename the fields of the vpci_val union to u8/u16/u32.
- Remove the return values from the read/write handlers, errors
should be handled internally and signaled as would be done on
native hardware.
- Remove the usage of the GENMASK macro.
Changes since v2:
- Generalize the PCI address decoding and use it for IOREQ code also.
Changes since v1:
- Allow access to cross a word-boundary.
- Add locking.
- Add cleanup to xen_vpci_add_handlers in case of failure.
Roger Pau Monne [Fri, 11 Aug 2017 15:32:30 +0000 (16:32 +0100)]
x86/pci: introduce hvm_pci_decode_addr
And use it in the ioreq code to decode accesses to the PCI IO ports
into bus, slot, function and register values.
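For reference, the decoding of a 0xcf8 address value can be sketched as below. It assumes the standard PCI configuration mechanism #1 layout (bit 31 enable, bits 23:16 bus, 15:11 slot, 10:8 function, 7:2 register), with the low two bits of the register taken from the data port actually accessed (0xcfc to 0xcff). The struct and function names are hypothetical and this is not the signature of hvm_pci_decode_addr.

```c
#include <stdbool.h>
#include <stdint.h>

struct pci_cfg_addr {
    bool enabled;
    uint8_t bus, slot, func;
    uint16_t reg;
};

/* Decode the latched 0xcf8 value plus the data port being accessed. */
struct pci_cfg_addr cf8_decode(uint32_t cf8, uint16_t port)
{
    struct pci_cfg_addr a;

    a.enabled = (cf8 >> 31) != 0;
    a.bus = (cf8 >> 16) & 0xff;
    a.slot = (cf8 >> 11) & 0x1f;
    a.func = (cf8 >> 8) & 0x7;
    /* The dword-aligned register comes from cf8, the byte offset from the port. */
    a.reg = (cf8 & 0xfc) | (port & 3);

    return a;
}
```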
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Paul Durrant <paul.durrant@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
- New in this version.
Roger Pau Monne [Fri, 11 Aug 2017 11:03:12 +0000 (12:03 +0100)]
x86/dom0: re-order DMA remapping enabling for PVH Dom0
Make sure the reserved regions are set up before enabling the DMA
remapping in the IOMMU, by calling dom0_setup_permissions before
iommu_hwdom_init. Also, in order to work around IOMMU issues seen on
pre-Haswell Intel hardware, as described in patch "introduce a PVH
implementation of iommu_inclusive_mapping" make sure the DMA remapping
is enabled after populating Dom0 p2m.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since RFC:
- Expand commit message to reference patch #3.
Roger Pau Monne [Fri, 11 Aug 2017 11:03:12 +0000 (12:03 +0100)]
x86/vtd: introduce a PVH implementation of iommu_inclusive_mapping
On certain Intel systems, as far as I can tell almost all pre-Haswell ones,
trying to boot a PVH Dom0 will freeze the box completely, up to the point that
not even the watchdog works. The freeze happens exactly when enabling the DMA
remapping in the IOMMU, the last line seen is:
In order to work around this (which seems to be a lack of proper RMRR entries,
plus the IOMMU being unable to generate faults and freezing the entire system)
add a PVH specific implementation of iommu_inclusive_mapping, that maps
non-RAM, non-unusable regions into Dom0 p2m. Note that care is taken to not map
device MMIO regions that Xen is emulating, like the local APIC or the IO APIC.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Kevin Tian <kevin.tian@intel.com>
Roger Pau Monne [Fri, 11 Aug 2017 11:03:11 +0000 (12:03 +0100)]
x86/dom0: prevent access to MMCFG areas for PVH Dom0
They are emulated by Xen, so they must not be mapped into Dom0 p2m.
Introduce a helper function to add the MMCFG areas to the list of
denied iomem regions for PVH Dom0.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since RFC:
- Introduce as helper instead of exposing the internal mmcfg
variables to the Dom0 builder.
Andrew Cooper [Wed, 26 Jul 2017 09:18:02 +0000 (10:18 +0100)]
common/domain_page: Drop domain_mmap_cache infrastructure
This infrastructure is used exclusively by the x86 do_mmu_update() hypercall.
Mapping and unmapping domain pages is probably not the slow part of that
function, but even with an opencoded caching implementation, Bloat-o-meter
reports:
function                                     old     new   delta
do_mmu_update                               6815    6573    -242
The !CONFIG_DOMAIN_PAGE stub code has a mismatch between mapping and
unmapping, which is a latent bug.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 10 Aug 2017 10:37:24 +0000 (12:37 +0200)]
x86/HVM: fix boundary check in hvmemul_insn_fetch() (again)
Commit 5a992b670b ("x86/hvm: Fix boundary check in
hvmemul_insn_fetch()") went a little too far in its correction to
commit 0943a03037 ("x86/hvm: Fixes to hvmemul_insn_fetch()"): Keep the
start offset check, but restore the original end offset one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
cpufreq: only stop ondemand governor if already started
On CPUFREQ_GOV_STOP in cpufreq_governor_dbs, shortcut to
return success if the governor is already stopped.
Avoid executing dbs_timer_exit, to prevent tripping an assertion
within a call to kill_timer on a timer that has not been prepared
with init_timer, if the CPUFREQ_GOV_START case has not
run beforehand.
kill_timer validates timer state:
* itself, via BUG_ON(this_cpu(timers).running == timer);
* within active_timer, ASSERTing timer->status is within bounds;
* within list_del, which ASSERTs timer inactive list membership.
Patch is synonymous to an OpenXT patch produced at Citrix prior to
June 2014.
Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/tboot: disable interrupts after map_pages_to_xen() in tboot_shutdown()
Move the point where interrupts are disabled in tboot_shutdown
to slightly later, to after the call to map_pages_to_xen.
This patch originated in OpenXT with the following report:
"Disabling interrupts early causes debug assertions.
This is only seen with debug builds but since it causes assertions it is
probably a bigger problem. It clearly says in map_pages_to_xen that it
should not be called with interrupts disabled. Moved disabling to just
after that call."
The Xen code comment ahead of map_pages_to_xen notes that the CPU cache
flushing in map_pages_to_xen differs depending on whether interrupts are
enabled or not. The flush logic with interrupts enabled is more
conservative, flushing all CPUs' TLBs/caches, rather than just local.
This is just before the tboot memory integrity MAC calculation is performed
in the case of entering S3.
Original patch author credit: Ross Philipson.
Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 10 Aug 2017 10:34:21 +0000 (12:34 +0200)]
AMD IOMMU: drop amd_iommu_setup_hwdom_device()
By moving its bridge special casing to amd_iommu_add_device(), we can
pass the latter to setup_hwdom_pci_devices() and at once consistently
handle bridges discovered at boot time as well as such reported by Dom0
later on.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
User-Mode Instruction Prevention (UMIP) is a security feature present in
new Intel Processors. With this feature, when the UMIP bit in CR4 is set,
the following instructions cannot be executed if CPL > 0: SGDT, SIDT,
SLDT, SMSW, and STR. An attempt at such execution causes a general-
protection exception (#GP).
This patch simply adds necessary definitions to expose this feature to
hvm guests.
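The rule stated above reduces to a simple predicate, sketched here. CR4.UMIP is bit 11. The function name is hypothetical, and the sketch covers only the check itself, not how a hypervisor exposes the feature.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_UMIP (1u << 11) /* User-Mode Instruction Prevention */

/*
 * Would SGDT, SIDT, SLDT, SMSW or STR raise #GP when executed at the
 * given privilege level with the given CR4 value?
 */
bool umip_faults(uint32_t cr4, unsigned int cpl)
{
    return (cr4 & X86_CR4_UMIP) && cpl > 0;
}
```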
Signed-off-by: Boqun Feng (Intel) <boqun.feng@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Chao Gao [Thu, 10 Aug 2017 10:32:16 +0000 (12:32 +0200)]
VT-d PI: disable VT-d PI when CPU-side PI isn't enabled
From the context calling pi_desc_init(), we can conclude the current
implementation of VT-d PI depends on CPU-side PI. If we enable VT-d PI
and disable CPU-side PI by disabling APICv explicitly in xen boot
command line, we would get an assertion failure.
This patch clears iommu_intpost once it finds that CPU-side PI won't be
enabled. This is safe because it is done before the flag starts taking
effect. Also take this chance to remove the useless check of "acknowledge
interrupt on exit", which is a minimal requirement that has been checked
earlier.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Olaf Hering [Fri, 23 Jun 2017 17:35:04 +0000 (19:35 +0200)]
vtpmmgr: make inline functions static
gcc7 is stricter with functions marked as inline. They are not
automatically inlined. Instead a function call is generated, but the
actual code is not visible to the linker.
Do a mechanical change and mark every 'inline' as 'static inline'. For
simpler review the static goes into an extra line.
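A minimal sketch of the pattern after the change, using hypothetical function names. With C99 inline semantics, a plain `inline` definition does not provide an external definition, so a call that the compiler chooses not to inline can be left unresolved at link time. Marking the function `static inline` gives each translation unit its own copy.

```c
/*
 * 'static' goes on its own line, following the layout described above.
 * Without it, a non-inlined call to add_one() could fail to link.
 */
static
inline int add_one(int x)
{
    return x + 1;
}

/* Ordinary caller; works whether or not the compiler inlines add_one(). */
int call_add_one(int x)
{
    return add_one(x);
}
```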
Signed-off-by: Olaf Hering <olaf@aepfle.de> Tested-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Yi Sun [Mon, 7 Aug 2017 01:50:49 +0000 (09:50 +0800)]
x86: adjust place of an ASSERT to avoid crash when destroy a domain.
In 'psr_free_cos', we should not use 'ASSERT(socket_info)' at the beginning
because the 'socket_info' is allocated only if 'psr' boot parameter is set.
So adjust its place to avoid crash.
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
libxl: do not start dom0 qemu for stubdomain when not needed
Do not setup vfb+vkb when no access method was configured. Then check if
qemu is really needed.
After this change, the only non-configurable thing forcing qemu to run
in dom0 is the consoles used for save/restore. But even in that case, a
much smaller part of qemu is exposed.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Rusty Bird [Thu, 3 Aug 2017 10:40:25 +0000 (12:40 +0200)]
VT-d: don't panic/warn on iommu=no-igfx
When operating on an Intel graphics device, iommu_enable_translation()
panicked (force_iommu==1) or warned (force_iommu==0) about the BIOS if
is_igd_vt_enabled_quirk() returned 0. That's good if the actual BIOS
problem has been detected. But since commit 1463411, returning 0 could
also happen if the user simply passed "iommu=no-igfx", in which case
bailing out with an info message (instead of a panic/warning) would be
more appropriate.
The panic broke the combination "iommu=force,no-igfx", and also the case
where "iommu=no-igfx" is passed but force_iommu=1 is set automatically
by x2apic_bsp_setup().
Move the iommu_igfx check from is_igd_vt_enabled_quirk() into its only
caller iommu_enable_translation(), and tweak the logic.
Signed-off-by: Rusty Bird <rustybird@openmailbox.org> Acked-by: Kevin Tian <kevin.tian@intel.com>
Yi Sun [Tue, 1 Aug 2017 09:05:00 +0000 (11:05 +0200)]
tools: L2 CAT: support set cbm for L2 CAT.
This patch implements the xl/xc changes to support set CBM
for L2 CAT.
The new level option is introduced to original CAT setting
command in order to set CBM for specified level CAT.
- 'xl psr-cat-set' is updated to set cache capacity bitmasks(CBM)
for a domain according to input cache level.
root@:~$ xl psr-cat-set -l2 1 0x7f
root@:~$ xl psr-cat-show -l2 1
Socket ID : 0
Default CBM : 0xff
ID NAME CBM
1 ubuntu14 0x7f
Signed-off-by: He Chen <he.chen@linux.intel.com> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Yi Sun [Tue, 1 Aug 2017 09:05:00 +0000 (11:05 +0200)]
tools: L2 CAT: support show cbm for L2 CAT.
This patch implements the xl/xc changes to support
showing the CBM of L2 CAT.
A new level option is added to the existing CAT showing
command in order to show the CBM for the specified cache level.
- 'xl psr-cat-show' is updated to show CBM of a domain
according to input cache level.
Examples:
root@:~$ xl psr-cat-show -l2 1
Socket ID : 0
Default CBM : 0xff
ID NAME CBM
1 ubuntu14 0x7f
Signed-off-by: He Chen <he.chen@linux.intel.com> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Yi Sun [Tue, 1 Aug 2017 09:05:00 +0000 (11:05 +0200)]
tools: L2 CAT: support get HW info for L2 CAT.
This patch implements the xl/xc changes to support getting HW info
for L2 CAT.
'xl psr-hwinfo' is updated to show both L3 CAT and L2 CAT
info.
Example (on a machine which only supports L2 CAT):
Cache Monitoring Technology (CMT):
Enabled : 0
Cache Allocation Technology (CAT): L2
Socket ID : 0
Maximum COS : 3
CBM length : 8
Default CBM : 0xff
Signed-off-by: He Chen <he.chen@linux.intel.com> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
x86: refactor psr: L3 CAT: set value: implement cos id picking flow.
Continue from previous patch:
'x86: refactor psr: L3 CAT: set value: implement cos finding flow.'
If no matching COS ID is found, we need to pick a new COS ID for the domain. Only a
COS ID whose ref[COS_ID] is 0 or 1 can be picked to hold a new set of feature values.
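The picking rule can be sketched roughly as follows. This is an illustration only, not the psr.c code: pick_avail_cos, MAX_COS and the parameter names are hypothetical, and two details are assumptions that go beyond the text above, namely that COS 0 is kept as the shared default and that a ref of 1 only qualifies when the single user is the requesting domain itself.

```c
#include <assert.h>

#define MAX_COS 16  /* hypothetical upper bound for this sketch */

/*
 * Return a COS ID that may receive a new set of feature values, or -1
 * if none is available.
 * Assumptions (not stated in the commit message): COS 0 is the shared
 * default and is never picked; ref == 1 only qualifies when the single
 * user is the requesting domain (i.e. the COS ID is old_cos).
 */
static int pick_avail_cos(const unsigned int ref[], unsigned int cos_max,
                          unsigned int old_cos)
{
    unsigned int cos;

    assert(cos_max < MAX_COS);

    /* The domain is the only user of its current COS: overwrite in place. */
    if (old_cos != 0 && old_cos <= cos_max && ref[old_cos] == 1)
        return (int)old_cos;

    /* Otherwise take the first COS ID that nobody references. */
    for (cos = 1; cos <= cos_max; cos++)
        if (ref[cos] == 0)
            return (int)cos;

    return -1;
}
```

A return value of -1 would correspond to the case where every COS ID is shared by other domains, so the request cannot be satisfied.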
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
x86: refactor psr: L3 CAT: set value: assemble features value array.
Only one COS ID can be used by one domain at a time. That means all enabled
features' COS registers at this COS ID are valid for this domain at that time.
When the user updates a feature's value, we need to make sure all other features'
values are not affected. So, we first need to gather an array which contains
all features' current values and replace the value of the feature being set
with the new value.
Then, we can try to find a COS ID on which all features' COS
register values are the same as the array. If we find one, we just use this COS
ID. If not, we need to pick a new COS ID.
This patch implements the value array assembling flow.
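A minimal sketch of the assembling step, assuming two features and a small table of per-COS register values. gather_val_array, NR_FEATS, NR_COS and the table layout are hypothetical names chosen for illustration; the real code works through per-feature callbacks.

```c
#include <assert.h>
#include <stdint.h>

#define NR_FEATS 2   /* hypothetical: e.g. L3 CAT and L2 CAT */
#define NR_COS   4   /* hypothetical number of COS IDs */

/*
 * cos_reg_val[f][c] holds feature f's register value at COS ID c.
 * Copy every feature's value at old_cos into val[], then overwrite the
 * entry of the feature being set with the new input value.
 */
static void gather_val_array(uint32_t val[NR_FEATS],
                             const uint32_t cos_reg_val[NR_FEATS][NR_COS],
                             unsigned int old_cos,
                             unsigned int feat, uint32_t new_val)
{
    unsigned int f;

    assert(old_cos < NR_COS && feat < NR_FEATS);

    for (f = 0; f < NR_FEATS; f++)
        val[f] = cos_reg_val[f][old_cos];

    val[feat] = new_val;
}
```

The resulting array is what later gets compared against each COS ID's registers.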
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
x86: refactor psr: L3 CAT: set value: implement framework.
As the set value flow is the most complicated one in psr, it is
divided into several patches to make things clearer. This patch
implements the set value framework to show the whole picture first.
It also changes the domctl interface to make it more general.
To make the set value flow general and able to support multiple features
at the same time, it includes the steps below:
1. Test and set the dom_ids bit corresponding to the domain. If the old bit is 0,
which means the domain's COS ID is invalid, restore the COS ID to 0. If the
COS ID is valid, get the COS ID that the current domain is using.
2. Gather a value array storing all features' current values
and replace the current value of the feature which is
being set with the new input value.
3. Find whether there is already a COS ID on which all features'
values are the same as the array. If so, we can reuse this COS
ID.
4. If none is found, we need to pick an available COS ID. Only a COS ID whose ref
is 0 or 1 can be picked.
5. Write the feature's MSRs according to the COS ID.
6. Update ref according to the COS ID.
7. Save the COS ID into the current domain's psr_cos_ids[socket] so that we
know which COS the domain is using on the socket.
So, some functions are abstracted and the callback functions will be
implemented in the next patches.
Here is an example to understand the process. The CPU supports
two features, e.g. L3 CAT and L2 CAT. The user wants to set L3 CAT
of Dom1 to 0x1ff.
1. Initially, the old_cos of Dom1 is 0. The COS register values
are as below at this time.
         -------------------------------
         | COS 0 | COS 1 | COS 2 | ... |
         -------------------------------
  L3 CAT | 0x7ff | 0x7ff | 0x7ff | ... |
         -------------------------------
  L2 CAT | 0xff  | 0xff  | 0xff  | ... |
         -------------------------------
2. Gather the value array and insert new value into it:
val[0]: 0x1ff
val[1]: 0xff
3. It cannot find a matching COS.
4. Pick COS 1 to store the value set.
5. Write the L3 CAT COS 1 registers. The COS registers values are
changed to below now.
         -------------------------------
         | COS 0 | COS 1 | COS 2 | ... |
         -------------------------------
  L3 CAT | 0x7ff | 0x1ff | ...   | ... |
         -------------------------------
  L2 CAT | 0xff  | 0xff  | ...   | ... |
         -------------------------------
6. The ref[1] is increased to 1 because Dom1 is using it now.
7. Save 1 to Dom1's psr_cos_ids[socket].
Then, user wants to set L3 CAT of Dom2 to 0x1ff too. The old_cos
of Dom2 is 0 too. Repeat above flow.
The val array assembled is:
val[0]: 0x1ff
val[1]: 0xff
So, it can find a matching COS, COS 1. Then, it can reuse COS 1
for Dom2.
The ref[1] is increased to 2 now because both Dom1 and Dom2 are
using this COS ID. Set 1 to Dom2's psr_cos_ids[socket].
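The Dom2 case above (an existing COS ID already holds the wanted values) can be sketched as follows. find_cos, update_ref and the array sizes are hypothetical names for illustration; leaving COS 0 out of the reference counting is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define NR_FEATS 2   /* hypothetical: L3 CAT and L2 CAT */
#define NR_COS   4   /* hypothetical number of COS IDs */

/* Return the first COS ID whose registers equal val[] for every feature. */
static int find_cos(const uint32_t val[NR_FEATS],
                    const uint32_t cos_reg_val[NR_FEATS][NR_COS])
{
    unsigned int cos, f;

    for (cos = 0; cos < NR_COS; cos++)
    {
        for (f = 0; f < NR_FEATS; f++)
            if (cos_reg_val[f][cos] != val[f])
                break;

        if (f == NR_FEATS)   /* every feature matched at this COS ID */
            return (int)cos;
    }

    return -1;
}

/*
 * Move one domain's reference from old_cos to new_cos.
 * Assumption: COS 0 is the default and is not reference counted here.
 */
static void update_ref(unsigned int ref[NR_COS],
                       unsigned int old_cos, unsigned int new_cos)
{
    assert(old_cos < NR_COS && new_cos < NR_COS);

    if (new_cos == old_cos)
        return;

    if (new_cos != 0)
        ref[new_cos]++;

    if (old_cos != 0 && ref[old_cos] > 0)
        ref[old_cos]--;
}
```

With the register values from step 5 of the example, the array {0x1ff, 0xff} matches COS 1, and moving Dom2 from COS 0 to COS 1 raises ref[1] from 1 to 2.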
One thing needs emphasizing: we need to restore a domain's COS ID to
0 when the socket goes offline. Otherwise, a wrong COS ID will be used when the
socket comes online again, which may cause the user to see a wrong CBM. But it
takes much time to iterate over all domains to restore the COS ID to 0. So, we define
a 'dom_ids[]' bitmap to represent all domains, where one bit corresponds to one domain.
If the bit is 0 when entering 'psr_ctxt_switch_to', that means this is the
first time the domain is switched to this socket or the domain's COS ID has not
been set since the socket came online. So, the COS ID set in the ASSOC register on
this socket should be the default value, 0. If not, that means the domain's COS
ID has been set while the socket was online. So, this COS ID is valid and we
can directly use it. We restore the domain's COS ID to 0 if the bit
corresponding to the domain is 0 but the domain's COS ID is not 0 when
'psr_get_val' or 'psr_set_val' is called. This avoids the CPU serialization
that would occur if the restoring action were executed in 'psr_ctxt_switch_to'.
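The lazy reset described above might look roughly like this. It is a single-threaded sketch: struct socket_state, MAX_DOMS and the function names are hypothetical, and the real code would need an atomic test-and-set.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DOMS      64   /* hypothetical domain ID limit */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define DOM_WORDS     ((MAX_DOMS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct socket_state {
    unsigned long dom_ids[DOM_WORDS];   /* one bit per domain */
};

/* Set the domain's bit and return its previous value (not atomic here). */
static bool test_and_set_dom(struct socket_state *s, unsigned int domid)
{
    unsigned long *word, mask;
    bool old;

    assert(domid < MAX_DOMS);

    word = &s->dom_ids[domid / BITS_PER_WORD];
    mask = 1UL << (domid % BITS_PER_WORD);
    old = (*word & mask) != 0;
    *word |= mask;

    return old;
}

/* Get/set value paths: a clear bit means the stored COS ID is stale. */
static unsigned int validate_cos(struct socket_state *s, unsigned int domid,
                                 unsigned int *cos_id)
{
    if (!test_and_set_dom(s, domid))
        *cos_id = 0;

    return *cos_id;
}

/* Socket offline: clear the bitmap instead of walking every domain. */
static void socket_offline(struct socket_state *s)
{
    size_t i;

    for (i = 0; i < DOM_WORDS; i++)
        s->dom_ids[i] = 0;
}
```

Clearing the bitmap costs a handful of word writes regardless of how many domains exist, which is the point of the design.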
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
This patch implements the Domain init/free and schedule flows.
- When a domain is initialized, its psr resources should be allocated.
- When a domain is freed, its psr resources should be freed too.
- When a domain is scheduled, its COS ID on the socket should be
set into the ASSOC register to make the corresponding COS MSR value
take effect.
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
x86: refactor psr: L3 CAT: implement main data structures, CPU init and free flows.
To construct an extensible framework, we need to analyze PSR features
and abstract the common things and the feature specific things. Then,
encapsulate them into different data structures.
By analyzing PSR features, we can get the map below.
+------+------+------+
--------->| Dom0 | Dom1 | ... |
| +------+------+------+
| |
|Dom ID | cos_id of domain
| V
| +-----------------------------------------------------------------------------+
User --------->| PSR |
Socket ID | +--------------+---------------+---------------+ |
| | Socket0 Info | Socket 1 Info | ... | |
| +--------------+---------------+---------------+ |
| | cos_id=0 cos_id=1 ... |
| | +-----------------------+-----------------------+-----------+ |
| |->Ref : | ref 0 | ref 1 | ... | |
| | +-----------------------+-----------------------+-----------+ |
| | +-----------------------+-----------------------+-----------+ |
| |->L3 CAT: | cos 0 | cos 1 | ... | |
| | +-----------------------+-----------------------+-----------+ |
| | +-----------------------+-----------------------+-----------+ |
| |->L2 CAT: | cos 0 | cos 1 | ... | |
| | +-----------------------+-----------------------+-----------+ |
| | +-----------+-----------+-----------+-----------+-----------+ |
| |->CDP : | cos0 code | cos0 data | cos1 code | cos1 data | ... | |
| +-----------+-----------+-----------+-----------+-----------+ |
+-----------------------------------------------------------------------------+
So, we need to define a socket info data structure, 'struct
psr_socket_info', to manage information per socket. It contains a
reference count array indexed by COS ID and a feature array to
manage all enabled features. Every entry of the reference count
array records how many domains are using the COS registers
of that COS ID. For example, L3 CAT and L2 CAT are enabled and
Dom1 uses the COS_ID=1 registers of both features to save CBM values, like
below.
         +-------+-------+-------+-----+
         | COS 0 | COS 1 | COS 2 | ... |
         +-------+-------+-------+-----+
  L3 CAT | 0x7ff | 0x1ff | ...   | ... |
         +-------+-------+-------+-----+
  L2 CAT | 0xff  | 0xff  | ...   | ... |
         +-------+-------+-------+-----+
If Dom2 has the same CBM values, it can reuse the registers of COS_ID=1.
That means both Dom1 and Dom2 use the same COS registers (ID=1) to keep the same
L3/L2 values. So, the value of ref[1] is 2, which means 2 domains are using
COS_ID 1.
To manage a feature, we need to define a feature node data structure,
'struct feat_node', to manage the feature's specific HW info and an array of all
COS register values of this feature.
To manage feature properties, we need to define a feature property data structure,
'struct feat_props', to manage common properties (callback functions, into which
all of a feature's specific behaviors are encapsulated,
and generic values, e.g. the cos_max), i.e. the feature independent values.
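As a rough picture of how the three structures relate, consider the sketch below. The structure names come from the text, but every field name, the enum, the array sizes and the cat_get_val callback are hypothetical and chosen only for illustration; the definitions in psr.c may differ.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_COS_NUM 16   /* hypothetical array size */

enum feat_type { FEAT_L3_CAT, FEAT_L3_CDP, FEAT_L2_CAT, FEAT_NUM };

/* Per-feature HW info plus the values of all its COS registers. */
struct feat_node {
    unsigned int cbm_len;
    uint32_t cos_reg_val[MAX_COS_NUM];
};

/* Feature independent view: generic values and behavior callbacks. */
struct feat_props {
    unsigned int cos_max;   /* generic value named in the text */
    unsigned int cos_num;   /* array entries used per COS ID */
    uint32_t (*get_val)(const struct feat_node *feat, unsigned int cos);
};

/* Per-socket bookkeeping: reference counts and the enabled features. */
struct psr_socket_info {
    unsigned int ref[MAX_COS_NUM];
    struct feat_node *features[FEAT_NUM];
};

/* Example callback for a CAT-like feature: one array entry per COS ID. */
static uint32_t cat_get_val(const struct feat_node *feat, unsigned int cos)
{
    assert(cos < MAX_COS_NUM);
    return feat->cos_reg_val[cos];
}

static const struct feat_props cat_props = {
    .cos_max = MAX_COS_NUM - 1,
    .cos_num = 1,
    .get_val = cat_get_val,
};
```

Generic code would only ever call through the callbacks in the property structure, so adding a feature means adding a node type and a set of callbacks rather than touching the main flows.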
CDP is a special feature which uses two entries of the array
for one COS ID. So, the number of CDP COS registers is half that of L3
CAT. E.g. if L3 CAT has 16 COS registers, then CDP has 8 COS registers if
it is enabled. CDP uses the COS registers array as below.
For more details, please refer to the SDM and the patches implementing 'get value' and
'set value'.
This patch also implements the CPU init and free flow including L3 CAT
initialization and some resources free. It includes below flows:
1. presmp init:
- parse command line parameter.
- allocate socket info for every socket.
- allocate feature resource.
- initialize socket info, get feature info and add feature into feature
array per cpuid result.
- free resources allocated if error happens.
- register cpu notifier to handle cpu events.
2. cpu notifier:
- handle cpu online events, if initialization work has been done before,
do nothing.
- handle cpu offline events, if it is the last cpu offline, free some
socket resources.
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
x86: refactor psr: remove L3 CAT/CDP codes.
The current cache allocation code in psr.c does not take
future feature additions into account and is hard to extend.
To make psr.c more flexible for adding new features and to follow
the open/closed principle (open for extension but closed for
modification), we have to refactor psr.c:
1. Analyze cache allocation features and abstract general data
structures.
2. Analyze the init flow and all other function flows, and abstract all
steps for which different features may have different implementations.
Make these steps callback functions and register feature
specific functions. Then, the main processes will not change
when a new feature is introduced.
Because the amount of refactored code is large and the logic
changes a lot, reviewers would be confused if the old code were
just modified in place, as they would have to understand both the old code and the new
implementation. After review iterations from V1 to V3, Jan
proposed removing all old cache allocation code first, then
implementing the new code step by step. This helps make the code
more easily reviewable.
There is no construction without destruction. So, this patch
removes all current L3 CAT/CDP code in psr.c. The following
patches will introduce the new mechanism.
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Yi Sun [Tue, 1 Aug 2017 09:04:00 +0000 (11:04 +0200)]
docs: create Cache Allocation Technology (CAT) and Code and Data Prioritization (CDP) feature document
This patch creates the CAT and CDP feature document in doc/features/. It describes
the key points of implementing L3 CAT/CDP and L2 CAT, which are described in detail in
Intel SDM "INTEL® RESOURCE DIRECTOR TECHNOLOGY (INTEL® RDT) ALLOCATION FEATURES".
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Move pre-existing PAGE_(SHIFT|SIZE|MASK|ALIGN)_(4K|64K) and
introduce corresponding defines for 16K page granularity to/in a
common place in xen/page-defs.h to allow later commits to use the
consolidated defines.
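For illustration, defines following the PAGE_(SHIFT|SIZE|MASK|ALIGN)_(4K|16K|64K) naming pattern could look like the sketch below. The macro bodies are an assumption based on the usual shift/size/mask idiom; the actual contents of xen/page-defs.h may differ.

```c
/* 4K granule: 2^12 bytes */
#define PAGE_SHIFT_4K        12
#define PAGE_SIZE_4K         (1UL << PAGE_SHIFT_4K)
#define PAGE_MASK_4K         (~(PAGE_SIZE_4K - 1))
#define PAGE_ALIGN_4K(addr)  (((addr) + PAGE_SIZE_4K - 1) & PAGE_MASK_4K)

/* 16K granule: 2^14 bytes */
#define PAGE_SHIFT_16K       14
#define PAGE_SIZE_16K        (1UL << PAGE_SHIFT_16K)
#define PAGE_MASK_16K        (~(PAGE_SIZE_16K - 1))
#define PAGE_ALIGN_16K(addr) (((addr) + PAGE_SIZE_16K - 1) & PAGE_MASK_16K)

/* 64K granule: 2^16 bytes */
#define PAGE_SHIFT_64K       16
#define PAGE_SIZE_64K        (1UL << PAGE_SHIFT_64K)
#define PAGE_MASK_64K        (~(PAGE_SIZE_64K - 1))
#define PAGE_ALIGN_64K(addr) (((addr) + PAGE_SIZE_64K - 1) & PAGE_MASK_64K)
```

The ALIGN variants round an address up to the next granule boundary, while masking with the MASK variants rounds down.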
Signed-off-by: Sergej Proskurin <proskurin@sec.in.tum.de> Acked-by: Jan Beulich <jbeulich@suse.com>
Praveen Kumar [Thu, 3 Aug 2017 10:24:25 +0000 (12:24 +0200)]
rbtree: changes to align the code with Linux tree
The patch aligns the code of rbtree related files with Linux tree.
This will minimize the conflicts during any future porting from Linux tree.
Linux commit till f4b477c47332367d35686bd2b808c2156b96d7c7 for rbtree.h
This includes addition of commented inline functions in rbtree.h, to have
complete replica from Linux tree.
Olaf Hering [Wed, 26 Jul 2017 14:39:50 +0000 (16:39 +0200)]
docs: add pod variant of xl-numa-placement
Convert source for xl-numa-placement.7 from markdown to pod.
This removes the buildtime requirement for pandoc, and subsequently the
need for ghc, in the chain for BuildRequires of xen.rpm.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Olaf Hering [Wed, 26 Jul 2017 14:39:49 +0000 (16:39 +0200)]
docs: add pod variant of xl-network-configuration.5
Convert source for xl-network-configuration.5 from markdown to pod.
This removes the buildtime requirement for pandoc, and subsequently the
need for ghc, in the chain for BuildRequires of xen.rpm.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Wei Liu <wei.liu2@citrix.com>
Olaf Hering [Wed, 26 Jul 2017 14:39:48 +0000 (16:39 +0200)]
docs: add pod variant of xen-pv-channel.7
Convert source for xen-pv-channel.7 from markdown to pod.
This removes the buildtime requirement for pandoc, and subsequently the
need for ghc, in the chain for BuildRequires of xen.rpm.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Wei Liu <wei.liu2@citrix.com>
Running "make uninstall" does not remove all installed files, a
situation which might cause link related issues if xen is re-installed
in a different location.
To make uninstall remove the files correctly, the process should
be done recursively, mirroring each "install"
target with an "uninstall" target that removes the installed files.
An exception to this rule is uninstalling the files produced by
"qemu-xen-dir-remote" and "qemu-xen-traditional-dir", which are external
to the project. These projects do not implement an "uninstall" target so
the files have to be removed manually.
Signed-off-by: Petre Pircalabu <ppircalabu@bitdefender.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
If xc_gntshr_open fails, the only cleanup needed is to free the allocated
memory. So instead of calling libxenvchan_close (which assumes
valid calculated buffers have already been mmapped), free the memory and return.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 30 Jun 2017 12:24:19 +0000 (12:24 +0000)]
x86/svm: Alias the VMCB segment registers as an array
This allows svm_{get,set}_segment_register() to access the user segments by
array index, as the x86_seg_* constants match the hardware encoding.
While making this alteration, add some newlines for clarity, switch an int for
a bool, and make the functions fail safe in a release build, rather than
crashing Xen.
Bloat-o-meter reports some modest improvements:
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-130 (-130)
function old new delta
svm_set_segment_register 662 653 -9
svm_get_segment_register 409 288 -121
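The aliasing idea can be sketched as below. This is not the real struct vmcb_struct: vmcb_sketch, get_sreg and the field layout of segment_register are simplified stand-ins for illustration. The enum order follows the x86 hardware segment encoding (ES=0, CS=1, SS=2, DS=3, FS=4, GS=5), and the sketch relies on C11 anonymous structs and unions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values follow the x86 hardware segment register encoding. */
enum x86_segment {
    x86_seg_es, x86_seg_cs, x86_seg_ss,
    x86_seg_ds, x86_seg_fs, x86_seg_gs,
};

/* Simplified stand-in for the common segment_register layout. */
struct segment_register {
    uint16_t sel;
    uint16_t attr;
    uint32_t limit;
    uint64_t base;
};

/* The same storage is reachable by name or by array index. */
struct vmcb_sketch {
    union {
        struct segment_register sreg[6];
        struct {
            struct segment_register es, cs, ss, ds, fs, gs;
        };
    };
};

/* Build-time check that a named field lines up with its array slot. */
_Static_assert(offsetof(struct vmcb_sketch, ds) ==
               x86_seg_ds * sizeof(struct segment_register),
               "named segment fields must alias the array");

/* Fail safe on an out-of-range index instead of crashing. */
static struct segment_register *get_sreg(struct vmcb_sketch *vmcb,
                                         enum x86_segment seg)
{
    if ((unsigned int)seg > x86_seg_gs)
        return NULL;

    return &vmcb->sreg[seg];
}
```

Indexing replaces a per-segment switch statement, which is plausibly where size reductions like those reported above come from.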
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Wed, 19 Jul 2017 09:28:03 +0000 (10:28 +0100)]
x86/vvmx: Fix handling of the MSR_BITMAP field with VMCS shadowing
Currently, the following sequence of actions:
* VMPTRLD (creates a mapping, likely pointing at gfn 0 for an empty vmcs)
* VMWRITE CPU_BASED_VM_EXEC_CONTROL (completed by hardware)
* VMWRITE MSR_BITMAP (completed by hardware)
* VMLAUNCH
results in an L2 guest running with ACTIVATE_MSR_BITMAP set, but Xen using a
stale mapping (likely gfn 0) when reading the interception bitmap. The
MSR_BITMAP field needs unconditionally intercepting even with VMCS shadowing,
so Xen's mapping of the bitmap can be updated.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Tue, 18 Jul 2017 14:55:03 +0000 (14:55 +0000)]
x86/vvmx: Switch nested MSR intercept handling to use struct vmx_msr_bitmap
Rename vmx_check_msr_bitmap() to vmx_msr_is_intercepted() in order to more
clearly identify what the boolean return value means. Change the int
access_type to bool is_write.
The NULL pointer check is moved out, as it doesn't pertain to whether the MSR
is intercepted or not. The check is moved into nvmx_n2_vmexit_handler(),
where it becomes a hard error in the case that ACTIVATE_MSR_BITMAP is set.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Tue, 18 Jul 2017 14:44:05 +0000 (14:44 +0000)]
x86/vmx: Introduce and use struct vmx_msr_bitmap
This avoids opencoding the bitmap bases in accessor functions. Introduce a
build_assertions() function to check the structure layout against the manual
definiton. In addition, drop some stale comments and ASSERT() that callers
pass an in-range MSR.
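A sketch of what a structured view of the VMX MSR bitmap can look like, with build-time layout checks. The 4K page is split into four 1K quarters (read low, read high, write low, write high) covering MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff, as laid out in the Intel SDM. The type and function names here (msr_bitmap_sketch, msr_quarter and so on) are hypothetical, and the bit indexing assumes a little-endian machine.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MSR_WORDS     (0x2000 / BITS_PER_LONG)   /* 8192 MSRs per quarter */

struct msr_bitmap_sketch {
    unsigned long read_low[MSR_WORDS];    /* 0x00000000 - 0x00001fff */
    unsigned long read_high[MSR_WORDS];   /* 0xc0000000 - 0xc0001fff */
    unsigned long write_low[MSR_WORDS];
    unsigned long write_high[MSR_WORDS];
};

/* Check the structure layout against the documented offsets. */
_Static_assert(sizeof(struct msr_bitmap_sketch) == 0x1000, "one 4K page");
_Static_assert(offsetof(struct msr_bitmap_sketch, read_high) == 0x400, "");
_Static_assert(offsetof(struct msr_bitmap_sketch, write_low) == 0x800, "");
_Static_assert(offsetof(struct msr_bitmap_sketch, write_high) == 0xc00, "");

/* Return the quarter covering msr, or NULL if it is outside both ranges. */
static unsigned long *msr_quarter(struct msr_bitmap_sketch *bm,
                                  uint32_t msr, bool is_write)
{
    if (msr <= 0x1fff)
        return is_write ? bm->write_low : bm->read_low;
    if (msr >= 0xc0000000 && msr <= 0xc0001fff)
        return is_write ? bm->write_high : bm->read_high;
    return NULL;
}

static void msr_set_intercept(struct msr_bitmap_sketch *bm,
                              uint32_t msr, bool is_write)
{
    unsigned long *q = msr_quarter(bm, msr, is_write);

    if (!q)
        return;   /* out-of-range MSRs are always intercepted anyway */

    msr &= 0x1fff;
    q[msr / BITS_PER_LONG] |= 1UL << (msr % BITS_PER_LONG);
}

static bool msr_is_intercepted(struct msr_bitmap_sketch *bm,
                               uint32_t msr, bool is_write)
{
    const unsigned long *q = msr_quarter(bm, msr, is_write);

    if (!q)
        return true;   /* MSRs outside both ranges always cause a VM exit */

    msr &= 0x1fff;
    return (q[msr / BITS_PER_LONG] >> (msr % BITS_PER_LONG)) & 1;
}
```

Naming the four quarters removes the need for callers to compute byte offsets such as 0x400 or 0x800 by hand.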
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Tue, 18 Jul 2017 14:33:13 +0000 (14:33 +0000)]
x86/vpmu: Use vmx_{clear,set}_msr_intercept() rather than opencoding them
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Tue, 18 Jul 2017 14:14:32 +0000 (14:14 +0000)]
x86/vmx: Improvements to vmx_{dis,en}able_intercept_for_msr()
* Shorten the names to vmx_{clear,set}_msr_intercept()
* Use an enumeration for MSR_TYPE rather than a plain integer
* Introduce VMX_MSR_RW, as most callers alter both the read and write
intercept at the same time.
No functional change.
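The enumeration idea can be illustrated as follows. Only the VMX_MSR_RW name appears above; the other identifiers, the numeric values and the msr_intercepts structure are hypothetical stand-ins so that the sketch is self-contained.

```c
#include <stdbool.h>

/* Hypothetical values: read and write are separate bits, RW is both. */
enum msr_type_sketch {
    MSR_SKETCH_R  = 1,
    MSR_SKETCH_W  = 2,
    MSR_SKETCH_RW = MSR_SKETCH_R | MSR_SKETCH_W,
};

/* Stand-in for one MSR's read/write bits in the real bitmap. */
struct msr_intercepts {
    bool read;
    bool write;
};

static void set_msr_intercept(struct msr_intercepts *i,
                              enum msr_type_sketch type)
{
    if (type & MSR_SKETCH_R)
        i->read = true;
    if (type & MSR_SKETCH_W)
        i->write = true;
}

static void clear_msr_intercept(struct msr_intercepts *i,
                                enum msr_type_sketch type)
{
    if (type & MSR_SKETCH_R)
        i->read = false;
    if (type & MSR_SKETCH_W)
        i->write = false;
}
```

A caller that wants both intercepts changed passes the combined value once instead of making two calls.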
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Tue, 25 Jul 2017 18:48:43 +0000 (19:48 +0100)]
x86/hvm: Fix boundary check in hvmemul_insn_fetch()
c/s 0943a03037 added some extra protection against overflowing the emulation
instruction cache, but Coverity points out that the boundary condition is off by
one when memcpy()'ing out of the buffer.
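As a generic illustration of this class of bug (not the actual hvmemul_insn_fetch() code), a bounds check before copying out of a fixed-size buffer has to accept a copy that ends exactly at the end of the buffer and reject anything beyond it. The buffer size and function name below are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INSN_BUF_SIZE 16   /* hypothetical cache size */

/*
 * Copy bytes out of buf starting at off. The check is written so that
 * off + bytes cannot overflow, and so that off + bytes == INSN_BUF_SIZE
 * is allowed while off + bytes == INSN_BUF_SIZE + 1 is rejected.
 */
static bool copy_from_insn_buf(uint8_t *dst,
                               const uint8_t buf[INSN_BUF_SIZE],
                               size_t off, size_t bytes)
{
    if (off > INSN_BUF_SIZE || bytes > INSN_BUF_SIZE - off)
        return false;

    memcpy(dst, buf + off, bytes);
    return true;
}
```

Swapping a comparison such as > for >= (or the reverse) in a check like this shifts the accepted range by exactly one byte.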
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Wei Liu [Wed, 26 Jul 2017 07:44:56 +0000 (08:44 +0100)]
libxc: bail immediately when PV superpage is discovered
The original code was added with the hope that PV superpage migration
might work. But it was never proven that the code actually worked.
Now that PV superpage is gone, simplify the code by returning error
immediately.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Wei Liu [Wed, 26 Jul 2017 07:44:55 +0000 (08:44 +0100)]
tools: nuke superpage parameters in code
Also fix the manpage because there is no superpages option in xl.cfg.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Wei Liu [Wed, 26 Jul 2017 07:44:54 +0000 (08:44 +0100)]
x86: nuke PV superpage option and code
Delete the user visible option and code for PV superpage support. The
mm code is modified as if the option is set to false (the default
value).
Return the address space occupied by spage_info back to the reserved
address space.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
GRUB_MODULES="boot chain configfile echo efinet eval ext2 fat font gettext gfxterm gzio help linux loadenv lsefi normal part_gpt part_msdos read regexp search search_fs_file search_fs_uuid search_label terminal terminfo test tftp time xen_boot"
The null scheduler does not really use hard-affinity for
scheduling, it uses it for 'placement', i.e., for deciding
to what pCPU to statically assign a vCPU.
Let's use soft-affinity in the same way, of course with the
difference that, if there's no free pCPU within the vCPU's
soft-affinity, we go checking the hard-affinity, instead of
putting the vCPU in the waitqueue.
This has no impact on the scheduling overhead, because
soft-affinity is only considered in cold-path (like when a
vCPU joins the scheduler for the first time, or is manually
moved between pCPUs by the user).
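The placement order described above can be sketched with plain 64-bit masks standing in for CPU sets. pick_cpu and its parameters are hypothetical names, and returning -1 stands for "put the vCPU in the waitqueue"; the real scheduler uses cpumasks and per-pCPU state.

```c
#include <stdint.h>

/*
 * Pick a pCPU for a vCPU: first try free pCPUs inside the soft-affinity
 * (restricted to the hard-affinity), then fall back to any free pCPU in
 * the hard-affinity. Return -1 if no free pCPU qualifies.
 */
static int pick_cpu(uint64_t free_cpus, uint64_t hard, uint64_t soft)
{
    uint64_t cand = free_cpus & hard & soft;
    int cpu;

    if (!cand)
        cand = free_cpus & hard;   /* soft-affinity step found nothing */

    if (!cand)
        return -1;                 /* caller would queue the vCPU */

    /* cand is non-zero here, so the loop terminates at the lowest set bit. */
    for (cpu = 0; !(cand & 1); cpu++)
        cand >>= 1;

    return cpu;
}
```

Because this runs only when a vCPU is placed or moved, the extra soft-affinity pass does not touch the per-schedule hot path.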
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Mon, 5 Jun 2017 16:19:27 +0000 (17:19 +0100)]
x86/emul: Drop segment_attributes_t
The amount of namespace resolution is unnecessarily large, as all code deals
in terms of struct segment_register. This removes the attr.fields part of all
references, and alters attr.bytes to just attr.
Three areas of code using initialisers for segment_register are tweaked to
compile with older versions of GCC. arch_set_info_hvm_guest() has its SEG()
macros altered to use plain comma-based initialisation, while
{rm,vm86}_{cs,ds}_attr are simplified to plain numbers which matches their
description in the manuals.
No functional change. (For some reason, the old {rm,vm86}_{cs,ds}_attr causes
GCC to create variable in .rodata, whereas the new code uses immediate
operands. As a result, vmx_{get,set}_segment_register() are slightly
shorter.)
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 30 Jun 2017 12:12:00 +0000 (12:12 +0000)]
x86/svm: Drop svm_segment_register_t
Most SVM code already uses struct segment_register. Drop the typedef and
adjust the definitions in struct vmcb_struct, and svm_dump_sel(). Introduce
some build-time assertions that struct segment_register from the common
emulation code is usable in struct vmcb_struct.
While making these adjustments, fix some comments to not mix decimal and
hexadecimal offsets, and drop all trailing whitespace in vmcb.h.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Mon, 24 Jul 2017 16:28:25 +0000 (17:28 +0100)]
x86/pagewalk: Remove opt_allow_superpage check from guest_can_use_l2_superpages()
The purpose of guest_walk_tables() is to match the behaviour of real hardware.
A PV guest can have 2M superpages in its pagetables, via the M2P (and for dom0
via the initial P2M), even if the guest isn't permitted to create arbitrary 2M
superpage mappings.
guest_can_use_l2_superpages() checking opt_allow_superpage is a piece of PV
guest policy enforcement, rather than its intended purpose of meaning "would
hardware tolerate finding an L2 superpage with these control settings?"
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Wei Liu <wei.liu2@citrix.com>