Len Brown [Mon, 22 Apr 2013 12:00:16 +0000 (14:00 +0200)]
x86/mwait_idle: stop using driver_data for static flags
The (Linux) commit 4202735e8ab6ecfb0381631a0d0b58fefe0bd4e2
(cpuidle: Split cpuidle_state structure and move per-cpu statistics fields)
observed that the MWAIT flags for Cn on every processor to date were the
same, and created get_driver_data() to supply them.
Unfortunately, that assumption is false, going forward.
So here we restore the MWAIT flags to the cpuidle_state table.
However, instead of restoring the old "driver_data" field,
we put the flags into the existing "flags" field,
where they probably should have lived all along.
This patch does not change any operation.
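As a rough illustration of keeping the MWAIT hint inside the "flags" field, the sketch below packs a hint into the top byte of a 32-bit flags word. All names (ex_cpuidle_state, EX_MWAIT2FLG, EX_FLG2MWAIT, the EX_FLAG_* bits) and the hint values are assumptions made for this example, not the identifiers or table used by the patch.

```c
#include <stdint.h>

/* Hypothetical generic state flags living in the low bits. */
#define EX_FLAG_TIME_VALID   0x01u
#define EX_FLAG_TLB_FLUSHED  0x02u

/* Hypothetical encoding: the MWAIT hint occupies the top byte of flags. */
#define EX_MWAIT2FLG(hint)   (((uint32_t)(hint) & 0xFFu) << 24)
#define EX_FLG2MWAIT(flags)  (((flags) >> 24) & 0xFFu)

struct ex_cpuidle_state {
    const char *name;
    uint32_t flags;               /* generic flags plus per-state MWAIT hint */
    unsigned int exit_latency_us;
};

/* Per-state table: each entry carries its own hint, so states may differ. */
static const struct ex_cpuidle_state ex_states[] = {
    { "C1", EX_MWAIT2FLG(0x00) | EX_FLAG_TIME_VALID, 3 },
    { "C3", EX_MWAIT2FLG(0x10) | EX_FLAG_TIME_VALID | EX_FLAG_TLB_FLUSHED, 80 },
};

/* Recover the MWAIT hint for a state from its flags word. */
static unsigned int ex_mwait_hint(const struct ex_cpuidle_state *s)
{
    return EX_FLG2MWAIT(s->flags);
}
```

Because the hint travels with each table entry, a processor-specific table can supply different hints without a separate lookup function.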
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Jan Beulich [Mon, 22 Apr 2013 11:58:01 +0000 (13:58 +0200)]
x86/EFI: pass boot services variable info to runtime code
EFI variables can be flagged as being accessible only within boot services.
This makes it awkward for us to figure out how much space they use at
runtime. In theory we could figure this out by simply comparing the results
from QueryVariableInfo() to the space used by all of our variables, but
that fails if the platform doesn't garbage collect on every boot. Thankfully,
calling QueryVariableInfo() while still inside boot services gives a more
reliable answer. This patch passes that information from the EFI boot stub
up to the efi platform code.
Based on a similarly named Linux patch by Matthew Garrett <matthew.garrett@nebula.com>.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Ben Guthro [Fri, 19 Apr 2013 10:29:01 +0000 (12:29 +0200)]
x86/S3: Fix cpu pool scheduling after suspend/resume
This patch addresses another S3 scheduler problem with the system_state
variable introduced with the following changeset:
http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=269f543ea750ed567d18f2e819e5d5ce58eda5c5
Specifically, the problem lies in the cpu_callback function, which takes a CPU
down during suspend and brings it back up during resume. We were seeing situations where,
after S3, only CPU0 was in cpupool0. Guest performance suffered
greatly, since all vcpus were only on a single pcpu. Guests under high
CPU load showed the problem much more quickly than an idle guest.
Removing this if condition forces the CPUs to go through the expected
online/offline state, and be properly scheduled after S3.
This also includes a necessary partial change proposed earlier by
Tomasz Wroblewski here:
http://lists.xen.org/archives/html/xen-devel/2013-01/msg02206.html
It should also resolve the issues discussed in this thread:
http://lists.xen.org/archives/html/xen-devel/2012-11/msg01801.html
Signed-off-by: Ben Guthro <benjamin.guthro@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Daniel De Graaf [Fri, 19 Apr 2013 08:50:08 +0000 (10:50 +0200)]
x86: remove IS_PRIV bypass on IRQ check
This prevents a process in dom0 from granting a domU access to an IRQ without
adding the IRQ to the domU's list of permitted IRQs. This operation currently
succeeds in dom0 but would fail if the device model were running in a stubdom,
so making the failure consistent should ease debugging of the device-model
stubdoms.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Daniel De Graaf [Thu, 18 Apr 2013 15:01:45 +0000 (17:01 +0200)]
x86: remove IS_PRIV access check bypasses
Several domctl functions dealing with rangesets contain a short-circuit
bypass if the domain is privileged. Since the construction of domain 0
permits access to all I/O ranges, the call to irq_access_permitted will
normally return true even without the IS_PRIV check, and the presence of
the IS_PRIV check prevents the creation of a privileged domain without
access to specific devices or IO memory ranges.
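The effect of dropping the short-circuit can be sketched with a toy permission model. The structure and function names below (ex_domain, ex_irq_access_permitted, ex_check_old, ex_check_new) are hypothetical and only mimic the shape of the check; they are not the hypervisor's code.

```c
#include <stdbool.h>

/* Toy domain: bit n of irq_caps set means IRQ n is permitted. */
struct ex_domain {
    bool is_privileged;
    unsigned long irq_caps;
};

/* Stand-in for a rangeset lookup such as irq_access_permitted(). */
static bool ex_irq_access_permitted(const struct ex_domain *d, unsigned int irq)
{
    return irq < 8 * sizeof(d->irq_caps) && ((d->irq_caps >> irq) & 1ul);
}

/* Old behaviour: privilege bypasses the permission list entirely. */
static bool ex_check_old(const struct ex_domain *d, unsigned int irq)
{
    return d->is_privileged || ex_irq_access_permitted(d, irq);
}

/* New behaviour: only the permission list decides, privileged or not. */
static bool ex_check_new(const struct ex_domain *d, unsigned int irq)
{
    return ex_irq_access_permitted(d, irq);
}
```

With the bypass gone, a privileged domain that was deliberately built without access to some IRQ is actually denied that IRQ.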
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Jan Beulich [Thu, 18 Apr 2013 14:11:23 +0000 (16:11 +0200)]
x86: fix various issues with handling guest IRQs
- properly revoke IRQ access in map_domain_pirq() error path
- don't permit replacing an in use IRQ
- don't accept inputs in the GSI range for MAP_PIRQ_TYPE_MSI
- track IRQ access permission in host IRQ terms, not guest IRQ ones
(and with that, also disallow Dom0 access to IRQ0)
This is CVE-2013-1919 / XSA-46.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Jan Beulich [Thu, 18 Apr 2013 14:00:35 +0000 (16:00 +0200)]
x86: clear EFLAGS.NT in SYSENTER entry path
... as it causes problems if we happen to exit back via IRET: In the
course of trying to handle the fault, the hypervisor creates a stack
frame by hand, and uses PUSHFQ to set the respective EFLAGS field, but
expects to be able to IRET through that stack frame to the second
portion of the fixup code (which causes a #GP due to the stored EFLAGS
having NT set).
And even if this worked (e.g if we cleared NT in that path), it would
then (through the fail safe callback) cause a #GP in the guest with the
SYSENTER handler's first instruction as the source, which in turn would
allow guest user mode code to crash the guest kernel.
Inject a #GP on the fake (NULL) address of the SYSENTER instruction
instead, just like in the case where the guest kernel didn't register
a corresponding entry point.
This is CVE-2013-1917 / XSA-44.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Ian Campbell [Wed, 17 Apr 2013 12:52:34 +0000 (13:52 +0100)]
arm: vgic: fix race in vgic_vcpu_inject_irq
The initial check for a still pending interrupt (!list_empty(&n->inflight))
needs to be covered by the vgic lock to avoid trying to insert the IRQ into the
inflight list simultaneously on 2 pCPUS. Expand the area covered by the lock
appropriately.
Also consolidate the unlocks on the exit path into one location.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Yang Zhang [Thu, 18 Apr 2013 09:36:28 +0000 (11:36 +0200)]
VMX: Use posted interrupt to deliver virtual interrupt
Deliver virtual interrupts via the posted-interrupt mechanism if posted
interrupts are enabled.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Jun Nakajima <jun.nakajima@intel.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)
Yang Zhang [Thu, 18 Apr 2013 09:34:49 +0000 (11:34 +0200)]
VMX: Add posted interrupt supporting
Add support for using posted interrupts to deliver interrupts.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Jun Nakajima <jun.nakajima@intel.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)
Yang Zhang [Thu, 18 Apr 2013 09:34:04 +0000 (11:34 +0200)]
VMX: Turn on posted interrupt bit in vmcs
Turn on posted interrupts for a vcpu if posted interrupts are available.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Jun Nakajima <jun.nakajima@intel.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)
Yang Zhang [Thu, 18 Apr 2013 09:32:02 +0000 (11:32 +0200)]
VMX: Detect posted interrupt capability
Check whether the hardware supports the posted interrupt capability.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Jun Nakajima <jun.nakajima@intel.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)
Daniel De Graaf [Fri, 12 Apr 2013 15:22:26 +0000 (11:22 -0400)]
libxl: properly initialize device structures
This avoids returning unallocated memory in the libxl_device_vtpm
structure in libxl_device_vtpm_list, and uses libxl_device_nic_init
instead of memset when initializing libxl_device_nics.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Daniel De Graaf [Mon, 15 Apr 2013 14:33:25 +0000 (10:33 -0400)]
libxl: postpone backend name resolution
This adds a backend_domname field in libxl devices that contain a
backend_domid field, allowing either a domid or a domain name to be
specified in the configuration structures. The domain name is resolved
into a domain ID in the _setdefault function when adding the device.
This change allows the backend of the block devices to be specified
(which previously required passing the libxl_ctx down into the block
device parser), and will simplify specification of backend domains in
other users of libxl.
The check on run_hotplug_scripts in parse_config_data is removed because
it is a duplicate of the one in libxl__device_nic_setdefault, and is
removed here because it no longer has the resolved domain ID to check.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- reran flex ]
Node-affinity is now something that is under (some) control of the
user, so show it upon request as part of the output of `xl list'
by the `-n' option.
Re the patch, the print_bitmap() related hunk is _mostly_ code motion,
although there is a very minor change in the code, basically to allow
using the function for printing both cpu and node bitmaps (as, in case
all bits are set, it used to print "any cpu", which doesn't fit the
nodemap case).
libxl: automatic placement deals with node-affinity
Which basically means the following two things:
1) during domain creation, it is the node-affinity of
the domain --rather than the vcpu-affinities of its
VCPUs-- that is affected by automatic placement;
2) during automatic placement, when counting how many
VCPUs are already "bound" to a placement candidate
(as part of the process of choosing the best
candidate), both vcpu-affinity and node-affinity
are considered.
libxl: optimize the calculation of how many VCPUs can run on a candidate
For choosing the best NUMA placement candidate, we need to figure out
how many VCPUs are runnable on each of them. That requires going through
all the VCPUs of all the domains and check their affinities.
With this change, instead of doing the above for each candidate, we
do it once for all, populating an array while counting. This way, when
we later are evaluating candidates, all we need is summing up the right
elements of the array itself.
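The count-once-then-sum idea can be sketched as follows. This is a minimal model under stated assumptions: affinities are plain node bitmasks, EX_MAX_NODES and both function names are hypothetical, and a VCPU that can run on several nodes of a candidate is counted once per node, which is a simplification of whatever the real placement code does.

```c
#include <stddef.h>

#define EX_MAX_NODES 8  /* hypothetical upper bound on NUMA nodes */

/* One pass over all VCPUs: vcpus_on_node[n] counts the VCPUs whose
 * affinity mask includes node n. */
static void ex_count_vcpus_per_node(const unsigned int *vcpu_nodemasks,
                                    size_t nr_vcpus,
                                    unsigned int vcpus_on_node[EX_MAX_NODES])
{
    size_t v;
    unsigned int n;

    for (n = 0; n < EX_MAX_NODES; n++)
        vcpus_on_node[n] = 0;

    for (v = 0; v < nr_vcpus; v++)
        for (n = 0; n < EX_MAX_NODES; n++)
            if (vcpu_nodemasks[v] & (1u << n))
                vcpus_on_node[n]++;
}

/* Per candidate: sum the precomputed entries for the candidate's nodes,
 * without walking the VCPUs again. */
static unsigned int ex_candidate_vcpus(unsigned int candidate_nodemask,
                                       const unsigned int vcpus_on_node[EX_MAX_NODES])
{
    unsigned int n, total = 0;

    for (n = 0; n < EX_MAX_NODES; n++)
        if (candidate_nodemask & (1u << n))
            total += vcpus_on_node[n];

    return total;
}
```

The expensive walk over every VCPU happens once in ex_count_vcpus_per_node; evaluating each candidate afterwards costs only a loop over the nodes.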
This reduces the complexity of the overall algorithm, as it moves a
potentially expensive operation (for_each_vcpu_of_each_domain {})
outside from the core placement loop, so that it is performed only
once instead of (potentially) tens or hundreds of times. More
specifically, we go from a worst case computation time complexity of:
xen: allow for explicitly specifying node-affinity
Make it possible to pass the node-affinity of a domain to the hypervisor
from the upper layers, instead of always being computed automatically.
Note that this also required generalizing the Flask hooks for setting
and getting the affinity, so that they now deal with both vcpu and
node affinity.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Acked-by: Keir Fraser <keir@xen.org>
xen: sched_credit: let the scheduler know about node-affinity
As vcpu-affinity tells where VCPUs must run, node-affinity tells
where they prefer to. While respecting vcpu-affinity remains mandatory,
node-affinity is not that strict, it only expresses a preference,
although honouring it will bring significant performance benefits
(especially as compared to not having any affinity at all).
This change modifies the VCPUs load balancing algorithm (for the
credit scheduler only), introducing a two steps logic. During the
first step, we use both the vcpu-affinity and the node-affinity
masks (by looking at their intersection). The aim is giving precedence
to the PCPUs where the domain prefers to run, as expressed by its
node-affinity (with the intersection with the vcpu-affinity being
necessary in order to avoid running a VCPU where it never should).
If that fails in finding a valid PCPU, the node-affinity is just
ignored and, in the second step, we fall back to using vcpu-affinity
only.
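The two-step logic can be sketched with 64-bit CPU masks. The function names and the use of an "idle CPUs" mask as the set of valid targets are assumptions made for this illustration; the scheduler's actual balancing code is more involved.

```c
#include <stdint.h>

#define EX_NO_CPU (-1)

/* Return the lowest-numbered CPU set in mask, or EX_NO_CPU if none. */
static int ex_first_cpu(uint64_t mask)
{
    int i;

    for (i = 0; i < 64; i++)
        if ((mask >> i) & 1u)
            return i;
    return EX_NO_CPU;
}

/* Step 1 prefers CPUs in both the vcpu-affinity and the node-affinity;
 * step 2 falls back to the vcpu-affinity alone, which stays mandatory. */
static int ex_pick_cpu(uint64_t vcpu_affinity, uint64_t node_affinity,
                       uint64_t idle_cpus)
{
    uint64_t preferred = vcpu_affinity & node_affinity & idle_cpus;

    if (preferred != 0)
        return ex_first_cpu(preferred);

    return ex_first_cpu(vcpu_affinity & idle_cpus);
}
```

In neither step can a CPU outside vcpu_affinity be returned, which mirrors the point that node-affinity is only a preference.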
xen: sched_credit: when picking, make sure we get an idle one, if any
The pcpu picking algorithm treats two threads of a SMT core the same.
More specifically, if one is idle and the other one is busy, they both
will be assigned a weight of 1. Therefore, when picking begins, if the
first target pcpu is the busy thread (and if there are no other idle
pcpu than its sibling), that will never change.
This change fixes this by ensuring that, before entering the core of
the picking algorithm, the target pcpu is an idle one (if there is an
idle pcpu at all, of course).
More specifically:
1. replaces xenctl_cpumap with xenctl_bitmap;
2. provides bitmap_to_xenctl_bitmap and the reverse;
3. re-implements cpumask_to_xenctl_bitmap with
bitmap_to_xenctl_bitmap and the reverse.
Other than #3, no functional changes. The interface is only slightly
affected.
This is in preparation of introducing NUMA node-affinity maps.
The current code allows the PVHVM guest to make this hypercall.
But for PVHVM guest it always returns -EINVAL (-22) for Xen 4.2
and above. Xen 4.1 and earlier worked.
The reason is that the check in map_vcpu_info would fail
at:
if ( v->arch.vcpu_info_mfn != INVALID_MFN )
The reason is that the vcpu_info_mfn for PVHVM guests ends up by
default with the value of zero (introduced by c/s 23143).
The code in vcpu_initialise which initialized vcpu_info_mfn to a
valid value (INVALID_MFN), would never be called for PVHVM:
xl: Fix 'free_memory' to include outstanding_claims value.
Updating to make it clear that free_memory reported by 'xl info'
is influenced by the outstanding claim value. That is the free
memory that will be available to the host once all outstanding
claims have been completed. This modifies the behavior that the
patch titled "xl: 'xl info' print outstanding claims if enabled
(claim_mode=1 in xl.conf)" had - which reported the
outstanding claims and nothing else.
The free_pages as reported by the hypervisor is the currently
available count of pages on the heap. The outstanding pages is
the total amount of pages reserved for guests (so not taken from
the heap yet). As guests are being populated the memory from the
heap shrinks and the outstanding count of pages decreases.
The total memory used for guests increases.
As the available count of pages on the heap and outstanding
claims are intertwined, report the amount of free memory available
to be a combination of that. That is free heap memory minus the
outstanding pages.
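The reported figure described above amounts to a single subtraction. In this sketch the function name is hypothetical, and clamping at zero is an assumption added so the unsigned arithmetic cannot wrap; the patch text does not say how that case is handled.

```c
#include <stdint.h>

/* Free memory as reported: free heap pages minus pages already claimed
 * (reserved for guests but not yet taken from the heap). */
static uint64_t ex_reported_free_pages(uint64_t free_heap_pages,
                                       uint64_t outstanding_pages)
{
    /* Assumption: never report a negative (wrapped) value. */
    if (outstanding_pages > free_heap_pages)
        return 0;
    return free_heap_pages - outstanding_pages;
}
```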
We also make some odd choices in reporting. By default we will
only display 'outstanding_claims' if the claim_mode is enabled
in the global configuration file. However, if there are outstanding
claims, we will ignore the claim_mode and report these values.
Suggested-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xl: 'xl claims' print outstanding per domain claims
This is similar to "xl: 'xl info' print outstanding claims if enabled
(claim_mode=1 in xl.conf)" which exposes the global claim value.
This patch provides the value of the currently outstanding pages
claimed for each domain. This is a per-domain value, which is added
to the global claim value that influences the hypervisor's MM system.
When a claim call is done, a reservation for a specific amount of pages
is set (and this patch lists said number) and also a global value is
incremented. This global value is then reduced as the domain's memory
is populated and eventually reaches zero.
The toolstack (libxc) also sets the domain's claim to zero when the population
of memory has completed as an extra step. Any call to destroy the domain
will also set the domain's claim to zero.
If the reservation cannot be met the guest creation fails immediately
instead of taking seconds or minutes (depending on the size of the guest)
while the toolstack populates memory.
See patch: "xl: Implement XENMEM_claim_pages support via 'claim_mode'
global config" for details on how it is implemented.
The value fluctuates quite often so the value is stale once it is provided
to the user-space. However it is useful for diagnostic purposes.
It is printed regardless of the global "claim_mode" option in xl.conf(5).
That is because the user might have enabled it, launched a guest, and then
disabled the option - and we should still report the correct outstanding
claim value. The 'man xl' shows the details of this argument.
The output is close to what 'xl list' looks like:
Name          ID   Mem VCPUs      State   Time(s)  Claimed
Domain-0       0  2047     4     r-----      19.7        0
OL5            2  2048     1     --p---       0.0      847
OL6            3  1024     4     r-----       5.9        0
Windows_XP     4  2047     1     --p---       0.0     1989
[In which it can be seen that the OL5 guest still has 847MB of claimed
memory (out of the total 2048MB where 1201MB has been allocated to
the guest).]
Please note that the 'Mem' column has the cumulative value of outstanding
claims and the total amount of memory that has been allocated to the guest.
[v1: claims, not claim-list]
[v2: Add outstanding and current memkb in the output list]
[v3: Clarify docs and relax some checks]
[v4: Removed comments about guest config memory being the same as 'Mem'] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch provides the value of the currently outstanding pages
claimed for a specific domain. This is a value that influences
the global outstanding claims value (See patch: "xl: 'xl info'
print outstanding claims if enabled") returned via
xc_domain_get_outstanding_pages hypercall. This domain value
decrements as the memory is populated for the guest and
eventually reaches zero.
With this patch it is possible to utilize this field.
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[v2: s/unclaimed/outstanding/ per Tim's suggestion]
[v3: Don't use SXP printout file per Ian's suggestion] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Dan Magenheimer [Mon, 25 Feb 2013 20:10:08 +0000 (15:10 -0500)]
xc: export outstanding_pages value in xc_dominfo structure.
This patch provides the value of the currently outstanding pages
claimed for a specific domain. This is a value that influences
the global outstanding claims value (See patch: "xl: 'xl info'
print outstanding claims if enabled") returned via
xc_domain_get_outstanding_pages hypercall. This domain value
decrements as the memory is populated for the guest and
eventually reaches zero.
This patch is necessary for "xl: export 'outstanding_pages' value
from xcinfo" patch.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v2: s/unclaimed_pages/outstanding_pages/ per Tim's suggestion] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
xl: 'xl info' print outstanding claims if enabled (claim_mode=1 in xl.conf)
This patch provides the value of the currently outstanding pages
claimed for all domains. This is a total global value that influences
the hypervisor's MM system.
When a claim call is done, a reservation for a specific amount of pages
is set and also a global value is incremented. This global value is then
reduced as the domain's memory is populated and eventually reaches zero.
The toolstack (libxc) also sets the domain's claim to zero when the population
of memory has completed as an extra step. Any call to destroy the domain
will also set the domain's claim to zero.
If the reservation cannot be met the guest creation fails immediately
instead of taking seconds or minutes (depending on the size of the guest)
while the toolstack populates memory.
See patch: "xl: Implement XENMEM_claim_pages support via 'claim_mode'
global config" for details on how it is implemented.
The value fluctuates quite often so the value is stale once it is provided
to the user-space. However it is useful for diagnostic purposes.
It is only printed when the global "claim_mode" option in xl.conf(5)
is set to enabled (1). The 'man xl' shows the details of this item.
[v1: s/unclaimed/outstanding/]
[v2: Made libxl_get_claiminfo return just MemKB suggested by Ian Campbell]
[v3: Made libxl_get_claiminfo return MemMB to conform to the other values printed]
[v4: Improvements suggested by Ian Jackson, also added docs to xl.pod.1]
[v5: Clarify how claims are cancelled, split >72 characters - Ian Jackson] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
xl: Implement XENMEM_claim_pages support via 'claim_mode' global config
The XENMEM_claim_pages hypercall operates per domain and it should be
used system wide. As such this patch introduces a global configuration
option 'claim_mode' that by default is disabled.
If this option is enabled then when a guest is created there will be a
guarantee that there is memory available for the guest. This is a
particularly acute problem on hosts with memory over-provisioned guests
that use tmem and have self-balloon enabled (which is the default option
for them). The self-balloon mechanism can deflate/inflate the balloon
quickly and the amount of free memory (which 'xl info' can show) is stale
the moment it is printed. When claim is enabled a reservation for the
amount of memory ('memory' in guest config) is set, which is then reduced
as the domain's memory is populated and eventually reaches zero.
If the reservation cannot be met the guest creation fails immediately
instead of taking seconds/minutes (depending on the size of the guest)
while the guest is populated.
Note that to enable tmem type guests, one needs to provide 'tmem' on the
Xen hypervisor argument and as well on the Linux kernel command line.
There are two boolean options:
(0) No claim is made. Memory population during guest creation will be
attempted as normal and may fail due to memory exhaustion.
(1) Normal memory and freeable pool of ephemeral pages (tmem) is used when
calculating whether there is enough memory free to launch a guest.
This gives immediate feedback on whether the guest can be launched or would
fail due to memory exhaustion (which can otherwise take a long time to find
out when launching very large guests, possibly in parallel).
[v1: Removed own claim_mode type, using just bool, improved docs, all per
Ian's suggestion]
[v2: Updated the comments]
[v3: Rebase on top 733b9c524dbc2bec318bfc3588ed1652455d30ec (xl: add vif.default.script)]
[v4: Fixed up comments]
[v5: s/global_claim_mode/claim_mode/]
[v6: Ian Jackson's feedback: use libxl_defbool, better comments, etc] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Dan Magenheimer [Mon, 25 Feb 2013 20:19:14 +0000 (15:19 -0500)]
xc: use XENMEM_claim_pages hypercall during guest creation.
We add an extra parameter to the structures passed to the
PV routine (arch_setup_meminit) and HVM routine (setup_guest)
that determines whether the claim hypercall is to be done.
The contents of the 'claim_enabled' is defined as an 'int'
in case the hypercall expands in the future with extra
flags (for example for per-NUMA allocation). For right now
the proper values are: 0 to disable it or 1 to enable
it.
If the hypervisor does not support this function, the
xc_domain_claim_pages and xc_domain_get_outstanding_pages
will silently return 0 (and set errno to zero).
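The "silently return 0" behaviour on an older hypervisor can be sketched as an error-mapping wrapper. Everything here is a stand-in: ex_do_claim_op simulates the hypercall, ex_hypervisor_supports_claim and EX_FREE_PAGES are invented knobs for the simulation, and ENOSYS as the "not supported" error is an assumption of this example.

```c
#include <errno.h>

/* Simulation knobs (hypothetical): does the hypervisor know the call,
 * and how many pages could it reserve? */
static int ex_hypervisor_supports_claim = 0;
#define EX_FREE_PAGES 4096ul

/* Stand-in for the hypercall: -1 with errno set on failure, 0 on success. */
static int ex_do_claim_op(unsigned long nr_pages)
{
    if (!ex_hypervisor_supports_claim) {
        errno = ENOSYS;
        return -1;
    }
    if (nr_pages > EX_FREE_PAGES) {
        errno = ENOMEM;
        return -1;
    }
    return 0;
}

/* Wrapper: an unsupported call is treated as success with errno cleared,
 * while genuine failures are passed through to the caller. */
static int ex_domain_claim_pages(unsigned long nr_pages)
{
    int err = ex_do_claim_op(nr_pages);

    if (err == -1 && errno == ENOSYS)
        err = errno = 0;

    return err;
}
```

A toolstack built against newer headers can then run on an out-of-sync hypervisor: guest creation simply proceeds without a claim.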
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v2: Updated per Ian's recommendations]
[v3: Added support for out-of-sync hypervisor] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Yang Zhang [Tue, 16 Apr 2013 08:36:05 +0000 (10:36 +0200)]
VTD: Remove the check for reserved device scope type
Though there are only four valid types now, new types may be added in the future.
It's better to remove the check and only deal with the type that we can
recognize.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Xiantao Zhang <xiantao.zhang@Intel.com> Acked-by: Keir Fraser <keir@xen.org>
Add log message for this case.
On the crash path in nmi_shootdown_cpus(), we shut down the IOMMU, then
disable the IOAPIC.
On systems which support interrupt remapping, the variable iommu_intremap
remains set, meaning that disable_IO_APIC() issues interrupt remapping
invalidate requests.
IOAPIC interrupt remapping used to be conditional on iommu_enabled, but is now
conditional on iommu_intremap, following the above changeset.
This behaviour can be fixed by also indicating that interrupt remapping is not
enabled after shutting down the IOMMU.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Boris Ostrovsky [Mon, 15 Apr 2013 09:27:32 +0000 (11:27 +0200)]
x86/VPMU: Save/restore VPMU only when necessary
VPMU doesn't need to always be saved during context switch. If we are
coming back to the same processor and no other VCPU has run here we can
simply continue running. This is especially useful on Intel processors
where Global Control MSR is stored in VMCS, thus not requiring us to stop
the counters during save operation. On AMD we need to explicitly stop the
counters but we don't need to save them.
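The "same processor, nobody else ran here" test can be sketched with two pieces of bookkeeping. The names (ex_vcpu, ex_last_vpmu_vcpu, ex_vpmu_needs_load, ex_vpmu_ran_on) and the fixed CPU count are hypothetical; this only illustrates the condition, not the patch's data structures.

```c
#include <stdbool.h>

#define EX_NR_PCPUS 4
#define EX_NO_VCPU  (-1)

struct ex_vcpu {
    int id;
    int last_pcpu;  /* physical CPU this VCPU last ran on, or -1 */
};

/* Per physical CPU: which VCPU's PMU state is still sitting in hardware. */
static int ex_last_vpmu_vcpu[EX_NR_PCPUS] = {
    EX_NO_VCPU, EX_NO_VCPU, EX_NO_VCPU, EX_NO_VCPU
};

/* A reload is needed unless we return to the same physical CPU and no
 * other VCPU has used the PMU there in the meantime. */
static bool ex_vpmu_needs_load(const struct ex_vcpu *v, int pcpu)
{
    return !(v->last_pcpu == pcpu && ex_last_vpmu_vcpu[pcpu] == v->id);
}

/* Record that v's PMU state now occupies the hardware on pcpu. */
static void ex_vpmu_ran_on(struct ex_vcpu *v, int pcpu)
{
    v->last_pcpu = pcpu;
    ex_last_vpmu_vcpu[pcpu] = v->id;
}
```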
Boris Ostrovsky [Mon, 15 Apr 2013 09:25:18 +0000 (11:25 +0200)]
x86/AMD: Stop counters on VPMU save
Stop the counters during VPMU save operation since they shouldn't be
running when the VCPU that controls them is not. This also makes it
unnecessary to check for overflow in context_restore().
Set LVTPC vector before loading the context during vpmu_restore().
Otherwise it is possible to trigger an interrupt without proper vector.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Boris Ostrovsky [Mon, 15 Apr 2013 09:24:52 +0000 (11:24 +0200)]
x86/AMD: Load context when attempting to read VPMU MSRs
Load context (and mark it as LOADED) on any MSR access. This will allow
us to always read the most up-to-date value of an MSR: guest may write
into an MSR without enabling it (thus not marking the context as RUNNING)
and then be migrated. Without first loading the context reading this MSR
from HW will not match the previous write since registers will not be
loaded into HW in amd_vpmu_load().
In addition, we should be saving the context when it is LOADED, not
RUNNING --- otherwise we need to save it any time it becomes non-RUNNING,
which may be a frequent occurrence.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Boris Ostrovsky [Mon, 15 Apr 2013 09:23:25 +0000 (11:23 +0200)]
x86/AMD: Allow more fine-grained control of VMCB MSR Permission Map
Currently VMCB's MSRPM can be updated to either intercept both reads and
writes to an MSR or intercept neither. In some cases we may want to
be more selective and intercept one but not the other.
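Selective interception fits a permission map that keeps two bits per MSR, the low bit for reads and the high bit for writes, as the SVM MSR permission map does. The sketch below uses a flat, hypothetical index space (EX_NR_MSRS) rather than real MSR address ranges, and the EX_INTERCEPT_* names and helper functions are invented for the example.

```c
#include <stdint.h>

#define EX_INTERCEPT_NONE  0u
#define EX_INTERCEPT_READ  1u  /* low bit of the pair */
#define EX_INTERCEPT_WRITE 2u  /* high bit of the pair */
#define EX_INTERCEPT_RW    (EX_INTERCEPT_READ | EX_INTERCEPT_WRITE)

#define EX_NR_MSRS 64  /* hypothetical, flat index space */

/* Two bits per MSR, four MSRs per byte. */
static uint8_t ex_msrpm[EX_NR_MSRS * 2 / 8];

/* Set the read/write intercept bits for one MSR index. */
static void ex_set_msr_intercept(unsigned int idx, unsigned int flags)
{
    unsigned int bit = idx * 2;
    unsigned int shift = bit % 8;  /* always even, so a pair never straddles bytes */
    uint8_t *byte = &ex_msrpm[bit / 8];

    *byte = (uint8_t)((*byte & ~(3u << shift)) | ((flags & 3u) << shift));
}

/* Read back the intercept bits for one MSR index. */
static unsigned int ex_get_msr_intercept(unsigned int idx)
{
    unsigned int bit = idx * 2;

    return (ex_msrpm[bit / 8] >> (bit % 8)) & 3u;
}
```

Passing EX_INTERCEPT_WRITE alone, for instance, lets the guest read an MSR directly while writes still trap.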
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Jan Beulich [Mon, 15 Apr 2013 08:33:48 +0000 (10:33 +0200)]
IOMMU: allow MSI message to IRTE propagation to fail
With the need to allocate multiple contiguous IRTEs for multi-vector
MSI, the chance of failure here increases. While on the AMD side
there's no allocation of IRTEs at present at all (and hence no way for
this allocation to fail, which is going to change with a later patch in
this series), VT-d currently ignores any error here, which this
patch fixes.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
Daniel De Graaf [Thu, 21 Mar 2013 20:11:28 +0000 (16:11 -0400)]
stubdom/grub: send kernel measurements to vTPM
This allows a domU with an arbitrary kernel and initrd to take advantage
of the static root of trust provided by a vTPM.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Acked-by: Matthew Fioravante <matthew.fioravante@jhuapl.edu>
Daniel De Graaf [Thu, 21 Mar 2013 20:11:25 +0000 (16:11 -0400)]
stubdom/vtpm: make state save operation atomic
This changes the save format of the vtpm stubdom to include two copies
of the saved data: one active, and one inactive. When saving the state,
data is written to the inactive slot before updating the key and hash
saved with the TPM Manager, which determines the active slot when the
vTPM starts up.
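The two-slot scheme can be sketched in memory as follows. The ex_store structure, the slot size, and the "active" field standing in for the key and hash held by the TPM Manager are all assumptions of this illustration; real storage, encryption, and hashing are omitted.

```c
#include <stdint.h>
#include <string.h>

#define EX_SLOT_SIZE 64  /* hypothetical slot size */

struct ex_store {
    uint8_t slot[2][EX_SLOT_SIZE];
    int active;  /* stand-in for the record that selects the valid slot */
};

/* Write the new state into the inactive slot first; only then flip the
 * selector. An interruption before the flip leaves the old state intact. */
static int ex_save_state(struct ex_store *st, const void *data, size_t len)
{
    int inactive = 1 - st->active;

    if (len > EX_SLOT_SIZE)
        return -1;

    memset(st->slot[inactive], 0, EX_SLOT_SIZE);
    memcpy(st->slot[inactive], data, len);

    st->active = inactive;  /* commit point */
    return 0;
}

/* On startup, the selector decides which slot holds the current state. */
static const uint8_t *ex_load_state(const struct ex_store *st)
{
    return st->slot[st->active];
}
```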
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Daniel De Graaf [Thu, 21 Mar 2013 20:11:24 +0000 (16:11 -0400)]
stubdom/vtpm: Support locality field
The vTPM protocol now contains a field allowing the locality of a
command to be specified; pass this to the TPM when processing a packet.
While the locality is not currently checked for validity, a binding
between locality and some distinguishing feature of the client domain
(such as the XSM label) will need to be defined in order to properly
support a multi-client vTPM.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Matthew Fioravante <matthew.fioravante@jhuapl.edu>
Daniel De Graaf [Thu, 21 Mar 2013 20:11:23 +0000 (16:11 -0400)]
stubdom/vtpm: correct the buffer size returned by TPM_CAP_PROP_INPUT_BUFFER
The vtpm2 ABI supports packets of up to 4088 bytes by default; expose
this property through the TPM's interface so clients do not attempt to
send larger packets.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Daniel De Graaf [Thu, 21 Mar 2013 20:11:21 +0000 (16:11 -0400)]
mini-os/tpmback: Replace UUID field with opaque pointer
Instead of only recording the UUID field, which may not be of interest
to all tpmback implementations, provide a user-settable opaque pointer
associated with the tpmback instance.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Daniel De Graaf [Thu, 21 Mar 2013 20:11:20 +0000 (16:11 -0400)]
mini-os/tpmback: set up callbacks before enumeration
The open/close callbacks in tpmback cannot be properly initialized in
order to catch the initial enumeration events because init_tpmback
clears the callbacks and then asynchronously starts the enumeration of
existing tpmback devices. Fix this by passing the callbacks to
init_tpmback so they can be installed before enumeration.
This also removes the unused callbacks for suspend and resume.
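The ordering problem and its fix can be sketched with a toy initializer. All names here are hypothetical (ex_init_tpmback only borrows the shape of init_tpmback), the device list is invented, and enumeration runs synchronously for simplicity, whereas the commit message describes it as asynchronous.

```c
#include <stddef.h>

typedef void (*ex_open_cb_t)(int domid);

/* Hypothetical devices that already exist when the backend starts. */
static const int ex_existing_devices[] = { 3, 7 };

static ex_open_cb_t ex_open_cb;
static int ex_open_events;

/* Example callback: count how many open events were observed. */
static void ex_count_open(int domid)
{
    (void)domid;
    ex_open_events++;
}

/* Report every existing device to the installed callback, if any. */
static void ex_enumerate(void)
{
    size_t i;
    size_t n = sizeof(ex_existing_devices) / sizeof(ex_existing_devices[0]);

    for (i = 0; i < n; i++)
        if (ex_open_cb != NULL)
            ex_open_cb(ex_existing_devices[i]);
}

/* The callback is taken as an argument and installed before enumeration
 * starts, so the initial open events cannot be missed. */
static void ex_init_tpmback(ex_open_cb_t open_cb)
{
    ex_open_cb = open_cb;
    ex_enumerate();
}
```

Had the callback been set only after ex_init_tpmback returned, the two initial devices would have been enumerated with no callback installed.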
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Daniel De Graaf [Thu, 21 Mar 2013 20:11:19 +0000 (16:11 -0400)]
mini-os/tpm{back, front}: Allow device reopens
Allow the vtpm device to be disconnected and reconnected so that a
bootloader (like pv-grub) can submit measurements and return the vtpm
device to its initial state before booting the target kernel.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Daniel De Graaf [Thu, 11 Apr 2013 16:20:25 +0000 (12:20 -0400)]
mini-os/tpm{back, front}: Change shared page ABI
This changes the vTPM shared page ABI from a copy of the Xen network
interface to a single-page interface that better reflects the expected
behavior of a TPM: only a single request packet can be sent at any given
time, and every packet sent generates a single response packet. This
protocol change should also increase efficiency as it avoids mapping and
unmapping grants when possible. The vtpm xenbus device now requires a
feature-protocol-v2 node in xenstore to avoid conflicts with existing
(xen-patched) kernels supporting the old interface.
While the contents of the shared page have been defined to allow packets
larger than a single page (actually 4088 bytes) by allowing the client
to add extra grant references, the mapping of these extra references has
not been implemented; a feature node in xenstore may be used in the
future to indicate full support for the multi-page protocol. Most uses
of the TPM should not require this feature.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Cc: Jan Beulich <JBeulich@suse.com>
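The one-request, one-response behaviour described in the shared page ABI commit above can be sketched as a single-slot state machine. Everything here is an assumption made for illustration: the struct layout, the 8-byte header (chosen only because 4096 - 8 matches the 4088 bytes quoted above), the state names and the helper functions are hypothetical and are not the actual vTPM shared page definition. Memory barriers and event notification are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES   4096u
#define HEADER_BYTES 8u                            /* assumed header size */
#define PAYLOAD_MAX  (PAGE_BYTES - HEADER_BYTES)   /* 4088 bytes */

/* Hypothetical slot states: the page holds at most one packet at a time. */
enum slot_state { SLOT_IDLE, SLOT_REQUEST, SLOT_RESPONSE };

struct shared_slot {
    uint32_t length;               /* bytes valid in payload */
    uint8_t  state;                /* one of enum slot_state */
    uint8_t  pad[3];
    uint8_t  payload[PAYLOAD_MAX];
};

static_assert(sizeof(struct shared_slot) == PAGE_BYTES,
              "slot must fill exactly one page");

/* Client side: place a request; fails while another packet is in flight. */
static int slot_submit(struct shared_slot *s, const void *req, size_t len)
{
    if (s->state != SLOT_IDLE || len > PAYLOAD_MAX)
        return -1;
    memcpy(s->payload, req, len);
    s->length = (uint32_t)len;
    s->state = SLOT_REQUEST;
    return 0;
}

/* Backend side: overwrite the request with exactly one response. */
static int slot_respond(struct shared_slot *s, const void *resp, size_t len)
{
    if (s->state != SLOT_REQUEST || len > PAYLOAD_MAX)
        return -1;
    memcpy(s->payload, resp, len);
    s->length = (uint32_t)len;
    s->state = SLOT_RESPONSE;
    return 0;
}

/* Client side: copy the response out and free the slot for reuse. */
static long slot_collect(struct shared_slot *s, void *out, size_t cap)
{
    if (s->state != SLOT_RESPONSE || s->length > cap)
        return -1;
    size_t len = s->length;
    memcpy(out, s->payload, len);
    s->state = SLOT_IDLE;
    return (long)len;
}
```

A second slot_submit before the response has been collected returns -1, which is the "only a single request packet at any given time" rule in miniature.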
Ian Campbell [Tue, 5 Feb 2013 16:19:53 +0000 (16:19 +0000)]
tools+stubdom: install under /usr/local by default.
Now that the hotplug scripts have been fixed to remove hardcoded paths let's
try this again. From 26470:acaf29203cf9:
This is the de facto (or FHS mandated?) standard location for software
built from source, in order to avoid clashing with packaged software
which is installed under /usr/bin etc.
I think there is benefit in having Xen's install behave more like the
majority of other OSS software out there.
The major downside here is in the transition from 4.2 to 4.3 where
people who have built from source will inevitably discover breakage
because 4.3 no longer overwrites stuff in /usr like it used to, so they
pick up old stale bits from /usr instead of new stuff from /usr/local.
Packages will use ./configure --prefix=/usr or whatever helper macro
their package manager gives them. I have confirmed that doing this
results in the same list of installed files as before this patch was
applied.
The hypervisor remains in /boot/ and there is no intention to move it.
Wei Liu [Mon, 25 Mar 2013 11:17:31 +0000 (11:17 +0000)]
Switch to poll() in cxenstored's IO loop
poll() can support more file descriptors than select(). We've done this for
xenconsoled, now do this for cxenstored as well.
The code is taken from xenconsoled and modified to adapt to cxenstored.
Note that poll() semantics are a bit different from select(). On Linux, if a fd
is set in IN/OUT fd_set and error occurs inside select(), this fd is still
considered readable / writable, and it is set in the returned IN/OUT fd_set.
So in later handle_input / handle_output, the connection will eventually be
talloc_free'ed(). After switching to poll(), we should take care of any error
right away, making the code clearer.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
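The "take care of any error right away" point in the poll() commit above can be sketched as follows. This is a minimal illustration using only the standard poll() interface; the enum and function names are hypothetical and the code is not taken from cxenstored.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

enum fd_status { FD_BROKEN = -1, FD_IDLE = 0, FD_READABLE = 1 };

/* With poll(), error conditions arrive as separate revents bits, so they
 * can be checked before any attempt to read from the descriptor. */
static enum fd_status classify_revents(short revents)
{
    if (revents & (POLLERR | POLLHUP | POLLNVAL))
        return FD_BROKEN;
    if (revents & POLLIN)
        return FD_READABLE;
    return FD_IDLE;
}

/* Poll a single descriptor without blocking and classify the result. */
static enum fd_status check_fd(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };

    assert(fd >= 0);
    if (poll(&p, 1, 0) < 0)
        return FD_BROKEN;
    return classify_revents(p.revents);
}

/* Tear the descriptor down immediately when poll() reports an error,
 * instead of waiting for a later read or write to fail. */
static enum fd_status service_fd(int fd)
{
    enum fd_status st = check_fd(fd);

    if (st == FD_BROKEN)
        close(fd);
    return st;
}
```

Checking the error bits first is a design choice of this sketch: a descriptor can report POLLHUP together with POLLIN, and a loop that must not lose buffered data would read before closing.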
Daniel De Graaf [Thu, 21 Mar 2013 20:11:29 +0000 (16:11 -0400)]
stubdom/Makefile: Fix gmp extract rule
When NEWLIB_STAMPFILE is updated but gmp has already been extracted, the mv
command will incorrectly create a subdirectory instead of renaming. Remove the
old target before renaming to fix this.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Linux uses GICC_CTLR.EOImodeNS set to 0, which means both priority drop and
deactivate interrupt functionality are made when something is written in
GICC_EOIR.
As the ARM manual specifies: "having an active interrupt in the List registers
with a priority that is not set in the corresponding Active Priorities
register" when GICV_CTLR.EOImode (ie GICC_CTLR.EOImodeNS in the guest context)
results in unpredictable behavior, we need to save/restore GICH_APR.
David Scott [Wed, 20 Mar 2013 20:24:42 +0000 (20:24 +0000)]
ocaml: eventchn: add a 'type t' to represent an event channel
It's a common OCaml convention to add a 'type t' in a module to
represent the main "thing" that the module is about. We add an
opaque type t and to_int/of_int functions for those who really
need it, in particular:
1. to_int is needed for debug logging; and
2. both to_int and of_int are needed for anyone who communicates
a port number through xenstore.
Signed-off-by: David Scott <dave.scott@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Fri, 15 Mar 2013 13:15:50 +0000 (13:15 +0000)]
xen: arm: remove PSR_MODE_MASK from public interface.
This is also defined in sys/ptrace.h on arm64 which breaks the tools build due
to multiple definitions. I expect this is really a bug in the kernel and/or
glibc but we don't really need this symbol in the public headers, at least not
right now, so move it into include/asm instead.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Ian Campbell [Fri, 15 Mar 2013 13:15:47 +0000 (13:15 +0000)]
tools: memshr: arm64 support
I'm not mad keen on propagating these sorts of asm atomic operations throughout
our code base. Other options would be:
- use libatomic-ops, http://www.hpl.hp.com/research/linux/atomic_ops/, although
this doesn't seem to be as widespread as I would like (not in RHEL5 for
example)
- use a pthread lock. This is probably the simplest/best option but I wasn't
able to figure out the locking hierarchy of this code
So I've copped out and just copied the appropriate inlines here.
I also nuked some stray ia64 support and fixed a comment in the arm32 version
while I was here.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Ian Campbell [Fri, 15 Mar 2013 13:15:42 +0000 (13:15 +0000)]
tools: Use AC_SYS_LARGEFILE instead of calling getconf(1)
getconf is not cross-compile friendly since it reports the features of the host
and not the target. There doesn't appear to be a $triplet-getconf.
AC_SYS_LARGEFILE arranges for #defines to appear in config.h; however, Xen's
build system expects these to be part of C{PP}FLAGS. Since I'm not confident
that everything in Xen includes config.h I instead arrange for the result of
running AC_SYS_LARGEFILE to end up in CFLAGS.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
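The reason the large-file defines must reach every translation unit, as discussed in the AC_SYS_LARGEFILE commit above, is that they only take effect when seen before the first system header. A minimal sketch, assuming a glibc-style platform where _FILE_OFFSET_BITS selects the width of off_t; the #define below stands in for what -D_FILE_OFFSET_BITS=64 in CFLAGS would supply, and off_t_bits is a hypothetical helper.

```c
/* Stand-in for -D_FILE_OFFSET_BITS=64 on the command line. It has to
 * precede every system header; a source file that skipped config.h
 * would not see it, which is why the commit puts it in CFLAGS. */
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <limits.h>
#include <sys/types.h>

/* Fails at compile time if large-file support did not take effect. */
static_assert(sizeof(off_t) * CHAR_BIT >= 64,
              "off_t is narrower than 64 bits");

/* Width of off_t in bits as seen by this translation unit. */
static int off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}
```

On targets where off_t is already 64 bits wide the define changes nothing; the check matters on 32-bit targets.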
As well as using x<N> rather than r<N> registers for passing arguments/results
also mandate the use of x16 as the hypercall number.
Add some pedantry about struct alignment layout referencing the ARM Procedure
Calling Standard to avoid confusion with the previous "OABI" convention. While
at it also mandate that hypercall argument structs are always little endian.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Ian Campbell [Wed, 6 Mar 2013 08:54:34 +0000 (08:54 +0000)]
arm: vgic: fix race between evtchn upcall and evtchnop_send
On ARM the evtchn upcall is done by using a local PPI interrupt. However the
guest will clear the evtchn_upcall_pending bit before it EOIs that PPI (which
happens late). This means vgic_vcpu_inject_irq (called via
vcpu_mark_events_pending) sees the PPI as in flight and ends up not reinjecting
it. If this happens after the guest has finished its event channel processing
loop but before the EOI then we have lost the upcall.
To fix this we need to check if an evtchn upcall is pending when returning to
the guest and if so reinject the PPI.
We therefore also need to call gic_restore_pending_irqs on the exit to guest
path in order to pick up any newly injected IRQ and propagate it into a free LR.
This doesn't currently support bumping a lower priority interrupt out of the
LRs in order to inject a new higher priority interrupt. We don't yet implement
interrupt prioritisation (and guests don't use it either) so this will do for
now.
Since gic_restore_pending_irqs is now called in the return to guest path it is
called with interrupts disabled and accordingly must use the
irqsave/irqrestore spinlock primitives.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
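The lost-upcall sequence described in the vgic commit above can be replayed with a toy model. This is a sketch under stated assumptions: struct vcpu_model and all function names are hypothetical, the model is single-threaded, and it does not reflect the actual Xen vgic data structures or locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one vcpu's event channel upcall state. */
struct vcpu_model {
    bool upcall_pending;   /* models evtchn_upcall_pending */
    bool ppi_inflight;     /* PPI injected but not yet EOIed */
    int  ppi_deliveries;   /* how many times the guest saw the PPI */
};

/* Inject the PPI; it must never be injected while already in flight. */
static void deliver(struct vcpu_model *v)
{
    assert(!v->ppi_inflight);
    v->ppi_inflight = true;
    v->ppi_deliveries++;
}

/* Event arrives: mark pending, but skip injection if the PPI is in flight. */
static void send_event(struct vcpu_model *v)
{
    v->upcall_pending = true;
    if (!v->ppi_inflight)
        deliver(v);
}

/* Guest clears the pending flag early in its handler... */
static void guest_clear_pending(struct vcpu_model *v)
{
    v->upcall_pending = false;
}

/* ...and EOIs the PPI late, after its processing loop has finished. */
static void guest_eoi(struct vcpu_model *v)
{
    v->ppi_inflight = false;
}

/* The fix: on the return-to-guest path, reinject if an upcall is pending. */
static void return_to_guest(struct vcpu_model *v)
{
    if (v->upcall_pending && !v->ppi_inflight)
        deliver(v);
}
```

Replaying send_event, guest_clear_pending, send_event, guest_eoi leaves upcall_pending set with only one delivery, which is the lost upcall; a following return_to_guest produces the second delivery.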