Reorder the SCHED_SWITCH trace before the runstate change trace to fix
a problem with the lost records "resume" code.
Namely: The "lost records" trace includes the currently running
process. But during SCHED_SWITCH, it reads the wrong value, confusing
xenalyze. Making sure there are no trace records between runstate
change trace and the actual context switch fixes it.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
If OOS mode is enabled, after last possible resync, read the guest l1e
one last time. If it's different than the original read, start over
again.
This fixes a race which can result in inconsistent in-sync shadow
tables, leading to corruption:
v1: take page fault, read gl1e from an out-of-sync PT.
v2: modify gl1e, lowering permissions
[v1,v3]: resync l1 which was just read.
v1: propagate change to l1 shadow using stale gl1e
Now we have an in-sync shadow with more permissions than the guest.
The resync can happen either as a result of a 3rd vcpu doing a cr3
update, or under certain conditions by v1 itself.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
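The re-read-and-retry idea in the out-of-sync fix above can be sketched roughly as follows. This is a single-threaded toy model, not the shadow code itself: propagate_with_recheck, resync_hook_t and demo_lower_permissions_once are hypothetical names, and the concurrent guest write is simulated by a callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook standing in for "the last possible resync point". */
typedef void (*resync_hook_t)(uint64_t *gl1e);

/*
 * Read the guest entry, run the resync hook, then read the entry one
 * last time. If the value changed, start over so that a stale value is
 * never propagated to the shadow. Returns the number of attempts made.
 */
static int propagate_with_recheck(uint64_t *gl1e, uint64_t *shadow,
                                  resync_hook_t resync)
{
    int attempts = 0;

    assert(gl1e != NULL && shadow != NULL);

    for (;;) {
        uint64_t seen = *gl1e;      /* original read */

        attempts++;
        if (resync != NULL)
            resync(gl1e);           /* another vcpu may write here */

        if (*gl1e != seen)          /* final re-read differs */
            continue;               /* start over */

        *shadow = seen;             /* value is current: propagate */
        return attempts;
    }
}

/* Demo only: lowers permissions (clears bit 1) the first time it runs. */
static int demo_fired;

static void demo_lower_permissions_once(uint64_t *gl1e)
{
    if (!demo_fired) {
        demo_fired = 1;
        *gl1e &= ~(uint64_t)2;
    }
}
```

With the demo hook, the first attempt sees the entry change underneath it and is discarded; the second attempt propagates the lowered-permission value.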
ssh is used as the transport by default, although this can be
overridden by specifying a different sshcommand. This is a very
standard approach nowadays and avoids the need for daemons at the
target host in the default configuration, while providing flexibility
to admins. (In the future it might be nice to support plain
unencrypted migration over TCP, which we do not rule out now, although
it is not currently implemented.)
Properties of the migration protocol:
* The domain on the target machine is named "<domname>--incoming"
while it is being transferred.
* The domain on the source machine is renamed
"<domain>--migratedaway"
before we give the target permission to rename and unpause.
* The locking in libxl_domain_rename ensures that of two
simultaneous migration attempts no more than one will succeed.
* We go to some considerable effort to avoid leaving the domain in
a bad state if something goes wrong with one of the ends or the
network, although there is still (inevitably) a possibility of an
unresolvable state (in case of very badly timed network failure)
which is probably best resolved by destroying the domain at both
ends.
Incidental changes:
create_domain now returns a libxl error code rather than exiting on
error.
New ERROR_BADFAIL error code for reporting unpleasant failures.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
* Make create_domain always return to caller
* Have create_domain set its log callback sooner
* Actually write things to logfile, and some error checking
With some combinations of options, create_domain would never return to
the caller, since it would have called daemon and will later exit. So
we fork an additional time, so that we can call daemon in the child
and also return to the caller in the parent. It's a shame that
there's no version of daemon(3) that allows us to do this without the
extra code and pointless extra fork.
daemon(0,0) closes all the fds. So we need to call daemon(0,1) and
organise detaching our stdin/out/err ourselves. Doing this makes
messages actually appear in the xl logfile in /var/log/xen.
Finally, make create_domain call libxl_ctx_set_log sooner. This makes
some lost messages appear.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
xl: New savefile format. Save domain config when saving a domain.
We introduce a new format for saved domains. The new format, in
contrast to the old:
* Has a magic number which can distinguish it from other kinds of
file
* Is extensible
* Can contain the domain configuration file
On domain creation we remember the actual config file used (using the
toolstack data feature of libxl, just introduced), and by default save
it to the save file.
However, options are provided for the following:
* When saving a domain, supplying an alternative config file to
store in the savefile.
* When restoring a domain, supplying an alternative config file.
If a domain is restored with a different config file, it is the
responsibility of the xl user to ensure that the two configs are
"compatible". Changing the targets of virtual devices is supported;
changing other features of the domain is not recommended. Bad changes
may lead to undefined behaviour in the domain, and are in practice
likely to cause resume failures or crashes.
Old format save files generated by old versions of xl are not
supported.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
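To illustrate the three properties listed for the new savefile format (magic number, extensibility, optional embedded config), here is a minimal sketch of a header with those properties. The layout, the magic string and all demo_* names are hypothetical; the actual xl format is not described in this message and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic string; not the one xl uses. */
static const char demo_magic[] = "demo-savefile\n";

struct demo_save_hdr {
    char     magic[16];        /* identifies the file type */
    uint32_t mandatory_flags;  /* reader must understand every set bit */
    uint32_t optional_flags;   /* reader may ignore unknown bits */
    uint32_t config_len;       /* 0 if no config file is embedded */
};

static void demo_hdr_init(struct demo_save_hdr *h, uint32_t config_len)
{
    assert(h != NULL);
    memset(h, 0, sizeof(*h));
    memcpy(h->magic, demo_magic, sizeof(demo_magic) - 1);
    h->config_len = config_len;
}

/*
 * Returns 0 if the header is acceptable, -1 if the magic does not match,
 * -2 if it uses a mandatory feature this reader does not know about.
 */
static int demo_hdr_check(const struct demo_save_hdr *h,
                          uint32_t known_mandatory)
{
    assert(h != NULL);
    if (memcmp(h->magic, demo_magic, sizeof(demo_magic) - 1) != 0)
        return -1;
    if (h->mandatory_flags & ~known_mandatory)
        return -2;
    return 0;
}
```

Splitting flags into mandatory and optional sets is one common way to make a format extensible: old readers reject files they cannot handle safely and ignore the rest.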
xl: Remove some duplicated boilerplate. (Improves logging slightly.)
We remove six lines of boilerplate from the top of each function, and
instead have a single struct libxl_ctx which is initialised once at
the top of main.
Likewise we wrap domain_qualifier_to_domid in a new function
find_domain, which does the error handling, and stores the domid and
the specified name (if applicable).
This reduces the size of xl.c by 7% (!)
As a beneficial side effect, the earlier call to libxl_ctx_set_log in
main makes some lost messages appear.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
libxl: Per-domain data storage for the convenience of the library user
We provide a mechanism whereby a user of the libxl library is able to
store some information alongside the domain. The information stored
is a block of bytes. Its lifetime is that of the domain - ie the
userdata is garbage collected alongside the domain if the domain is
destroyed. (This is why the feature needs to be in libxl and cannot
be implemented in the user itself or in libxlutil.)
If a libxl caller does not need to use this feature it can ignore it.
The data is tagged with the (self-declared) name of the libxl user, so
that different users cannot accidentally trip over each others'
userdata. The data is not interpreted at all by libxl.
To assist developers and people debugging, there is a registry of the
known userdata userids, and the corresponding data format as declared
by that libxl user, in libxl.h next to these declarations:
int libxl_userdata_store(struct libxl_ctx *ctx, uint32_t domid,
const char *userdata_userid,
const uint8_t *data, int datalen);
int libxl_userdata_retrieve(struct libxl_ctx *ctx, uint32_t domid,
const char *userdata_userid,
uint8_t **data_r, int *datalen_r);
The next patch will introduce the data for the userid "xl".
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
xl would like to use libxl_report_exitstatus, so expose it in
libxl_utils.h to avoid having to write it twice. Also, give it a
"level" argument to set the loglevel of the resulting message.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
xenstore,libxl: cleanup of xenstore connections across fork()
Provide a new function xs_daemon_destroy_postfork which can be called
by a libxenstore user who has called fork, to close the fd for the
connection to xenstored and free the memory, without trying to do
anything to any threads which libxenstore may have created.
Use this new function in libxl_fork, to avoid accidental use of a
xenstore connection in both parent and child.
Also, fix the doc comment for libxl_spawn_spawn to have the success
return codes the right way round.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
libxl: Expose functions for helping with subprocesses.
* Expose libxl_fork in libxl_utils.h
* Expose libxl_pipe in libxl_utils.h
* Make libxl_exec put SIGPIPE back (so that libxl callers may
have SIGPIPE ignored)
xl would like to use libxl_fork (which is like fork(2) except that it
logs errors) and also a similar function libxl_pipe. So put these in
libxl_utils.[ch] and use them in libxl.c as appropriate, to avoid
having to duplicate code between xl and libxl.
Also, make sure that subprocesses spawned by libxl have SIGPIPE set
back to SIG_DFL as they are entitled to expect. This means that a
libxl caller which sets SIGPIPE to SIG_IGN is no longer buggy. (This
is relevant for xl migration, because xl would like to be such a
caller.)
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
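The SIGPIPE point matters because an ignored signal stays ignored across exec. A minimal sketch of resetting it before exec might look like this; demo_reset_sigpipe and demo_sigpipe_is_ignored are hypothetical helpers, not the code added to libxl_exec.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Put SIGPIPE back to its default disposition. Returns 0 on success. */
static int demo_reset_sigpipe(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_DFL;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, NULL);
}

/* Returns 1 if SIGPIPE is currently ignored in this process, else 0. */
static int demo_sigpipe_is_ignored(void)
{
    struct sigaction cur;
    int rc = sigaction(SIGPIPE, NULL, &cur);

    assert(rc == 0);
    (void)rc;
    return cur.sa_handler == SIG_IGN;
}
```

A child would call demo_reset_sigpipe() after fork and before exec, so the parent is free to keep SIGPIPE ignored.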
libxl: libxl_domain_restore: Put fd back to blocking mode
libxl_domain_restore calls, indirectly, xc_domain_restore. The
latter, when doing a live migration, sets the fd from blocking mode
(which it must be on entry, or things go wrong) to nonblocking mode
and leaves it this way. Arguably this is a bug in libxc, but to avoid
disrupting any callers we fix it in libxl.
So libxl_domain_restore now puts the fd back into blocking mode
before returning.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
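Restoring blocking mode on an fd is a small fcntl exercise. The sketch below shows one way to do it; demo_set_blocking is a hypothetical helper and not necessarily how libxl_domain_restore implements it.

```c
#include <assert.h>
#include <fcntl.h>

/*
 * Clear O_NONBLOCK on fd if it is set.
 * Returns 0 on success, -1 if fcntl fails (errno is set by fcntl).
 */
static int demo_set_blocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    if (!(flags & O_NONBLOCK))
        return 0;                       /* already blocking */
    if (fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) < 0)
        return -1;

    /* Postcondition: the fd is now in blocking mode. */
    assert(!(fcntl(fd, F_GETFL) & O_NONBLOCK));
    return 0;
}
```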
libxl: New utility functions for reading and writing files.
We introduce these functions in libxl_utils.h:
int libxl_read_file_contents(struct libxl_ctx *ctx, const char *filename,
                             void **data_r, int *datalen_r);
int libxl_read_exactly(struct libxl_ctx *ctx, int fd, void *data,
                       ssize_t sz,
                       const char *filename, const char *what);
int libxl_write_exactly(struct libxl_ctx *ctx, int fd, const void *data,
                        ssize_t sz, const char *filename, const char *what);
They will be needed by the following patches. They have to be in
libxl.a rather than libxlutil.a because they will be used, amongst
other places, in libxl itself.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
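The "exactly" in these function names refers to looping until the full byte count has been transferred. A minimal sketch of that technique, without the ctx, filename and what arguments used for logging, could look like this. The demo_* functions and their return conventions are assumptions for illustration, not the libxl implementations.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read exactly sz bytes from fd.
 * Returns 0 on success, -1 on read error, -2 on premature end of file.
 */
static int demo_read_exactly(int fd, void *data, size_t sz)
{
    char *p = data;

    assert(data != NULL || sz == 0);
    while (sz > 0) {
        ssize_t got = read(fd, p, sz);

        if (got < 0) {
            if (errno == EINTR)
                continue;           /* interrupted: retry */
            return -1;
        }
        if (got == 0)
            return -2;              /* EOF before sz bytes arrived */
        p += got;
        sz -= (size_t)got;
    }
    return 0;
}

/* Write exactly sz bytes to fd. Returns 0 on success, -1 on error. */
static int demo_write_exactly(int fd, const void *data, size_t sz)
{
    const char *p = data;

    assert(data != NULL || sz == 0);
    while (sz > 0) {
        ssize_t put = write(fd, p, sz);

        if (put < 0) {
            if (errno == EINTR)
                continue;           /* interrupted: retry */
            return -1;
        }
        p += put;
        sz -= (size_t)put;
    }
    return 0;
}
```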
This is needed by the following patches. It makes it much more
convenient for libxl functions to return the errno value from the
failure, when they fail.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
To eliminate a race between the dbs timer handler and cpufreq_del_cpu,
use kill_timer instead of stop_timer, to make sure the timer handler
has finished executing before cpufreq_del_cpu does anything else.
Also fix a missed case in the cpufreq_statistic_lock taking sequence.
The final, flushing call to discard_file_cache also discards any
errors from fsync. Call fsync explicitly before leaving, to check if
all VM memory actually made it to the disk.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
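The general pattern behind the fsync change is to check fsync's own return value before closing, rather than relying on a later call that may swallow the error. A minimal sketch, with a hypothetical demo_write_durably helper that is unrelated to the actual save code:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes to path and make sure they reached the disk.
 * Returns 0 only if open, every write, fsync and close all succeeded.
 */
static int demo_write_durably(const char *path, const void *data, size_t len)
{
    const char *p = data;
    int fd, rc = 0;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t put = write(fd, p, len);

        if (put < 0) {
            if (errno == EINTR)
                continue;
            rc = -1;
            break;
        }
        p += put;
        len -= (size_t)put;
    }

    if (rc == 0 && fsync(fd) < 0)   /* explicit check: did it hit disk? */
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```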
Starting with Family 0x10, model 10 processors, some AMD processors
will have support for the APERF/MPERF MSRs. This patch adds the
checks necessary to support those MSRs.
It also makes the get_measured_perf function defined inside cpufreq.c
driver independent. max_freq is taken from the policy definition
instead of being a private argument in struct acpi_cpufreq_data.
The struct member is entirely removed from the function since it
is no longer used.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Add support for disabling AMD's Boost feature. Boost is similar to
Intel's Turbo and uses the same high level interface. The low
level implementation is different and encapsulated in the powernow
driver for cpufreq.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Refactor the existing code that supports the Intel Turbo feature to
move all the driver specific bits into the cpufreq driver. Create
a tri-state interface for the Turbo feature that can distinguish
amongst enabled Turbo, disabled Turbo, and processors that don't
support Turbo at all.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
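A tri-state interface of the kind described can be as simple as an enum with an explicit "unsupported" value. The names below are hypothetical and only illustrate the shape; they are not the identifiers introduced by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Three distinct states instead of a boolean on/off flag. */
enum demo_turbo_state {
    DEMO_TURBO_UNSUPPORTED,   /* processor has no Turbo at all */
    DEMO_TURBO_ENABLED,
    DEMO_TURBO_DISABLED
};

/*
 * Enable or disable Turbo. Returns 0 on success, or -1 (leaving the
 * state untouched) if the processor does not support Turbo.
 */
static int demo_turbo_set(enum demo_turbo_state *state, int enable)
{
    assert(state != NULL);
    if (*state == DEMO_TURBO_UNSUPPORTED)
        return -1;
    *state = enable ? DEMO_TURBO_ENABLED : DEMO_TURBO_DISABLED;
    return 0;
}
```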
tmem: fix ia64 build
/xen/common/built_in.o: In function `tmh_get_first_byte':
/xen/include/xen/tmem_xen.h:350: undefined reference to
`__map_domain_page'
xen: allow guests to set caching attributes for MMIOs
This patch allows guests that have directly mapped MMIO regions to set
the caching attributes for them, and only for them.
Currently we have just an on/off check for a directly assigned device
instead of looking for directly mapped MMIO regions.
'xm info' command now also gives the cpu topology & host numa
information. This will be later used to build guest numa support. The
patch basically changes physinfo sysctl, and adds topology_info &
numa_info sysctls, and also changes the python & libxc code
accordingly.
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
This fixes xenbus initialization of blkfront, netfront and pcifront
by making them consistent with fbfront: after writing parameters, set
the state to initialised, then wait for the backend to switch to the
connect state, and only then read its parameters and switch to the
connect state.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
When no fb is available, init_fbfront will return, so the local
semaphore for synchronization with the kbd thread would get dropped.
Using a global static semaphore instead fixes this.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
tmem: add page deduplication with optional compression or trailing-zero-elimination
Add "page deduplication" capability (with optional compression
and trailing-zero elimination) to Xen's tmem.
(Transparent to tmem-enabled guests.) Ephemeral pages
that have the exact same content are "combined" so that only
one page frame is needed. Since ephemeral pages are essentially
read-only, no C-O-W (and thus no equivalent of swapping) is
necessary. Deduplication can be combined with compression
or "trailing zero elimination" for even more space savings.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
This patch adds a new field in hvm to indicate whether 1GB pages are
supported by the CPU. In addition, users can turn the 1GB feature on/off
using a Xen option ("hap_1gb", default is off). Per Tim's suggestion, I
also added an assertion check in shadow/common.c to prevent this from
affecting the shadow code.
This patch changes Xen tools to allocate 1GB first. If such requests
fail, it will fall back to 2MB and then 4KB. We skip 1GB allocation
for the MMIO space between 3GB and 4GB.
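The fallback order described above (1GB, then 2MB, then 4KB, with no 1GB pages in the 3GB-4GB MMIO range) can be sketched as below. All demo_* names and the allocator callback are hypothetical; this is not the code in the Xen tools.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SZ_4K (1ULL << 12)
#define DEMO_SZ_2M (1ULL << 21)
#define DEMO_SZ_1G (1ULL << 30)

/* Hypothetical allocator: returns 0 if a page of `size` fits at `addr`. */
typedef int (*demo_alloc_fn)(uint64_t addr, uint64_t size);

static int demo_in_mmio_range(uint64_t addr)
{
    return addr >= 3 * DEMO_SZ_1G && addr < 4 * DEMO_SZ_1G;
}

/*
 * Try the largest page size first and fall back to smaller ones.
 * Returns the size that was allocated, or 0 if every size failed.
 */
static uint64_t demo_populate_one(uint64_t addr, uint64_t remaining,
                                  demo_alloc_fn alloc)
{
    static const uint64_t sizes[] = { DEMO_SZ_1G, DEMO_SZ_2M, DEMO_SZ_4K };
    size_t i;

    assert(alloc != NULL);
    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        uint64_t sz = sizes[i];

        if (sz == DEMO_SZ_1G && demo_in_mmio_range(addr))
            continue;               /* no 1GB pages between 3GB and 4GB */
        if (addr % sz != 0 || remaining < sz)
            continue;               /* misaligned or too little left */
        if (alloc(addr, sz) == 0)
            return sz;
    }
    return 0;
}

/* Demo allocators: one always succeeds, one refuses 1GB requests. */
static int demo_alloc_always(uint64_t addr, uint64_t size)
{
    (void)addr; (void)size;
    return 0;
}

static int demo_alloc_no_1g(uint64_t addr, uint64_t size)
{
    (void)addr;
    return size == DEMO_SZ_1G ? -1 : 0;
}
```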
Limit the number of idle cpus tickled for vcpu migration to ONLY ONE,
to get rid of a lot of IPI events which may impact the
average cpu idle residency time.
This optimization is on by default; the option 'tickle_one_idle_cpu=0'
can be used to disable it if needed.
cpuidle: mwait on softirq_pending & remove wakeup ipis
For cpus which enter a deep C state via monitor/mwait, wakeup can be done
by writing to the monitored memory. So once softirq_pending is the
monitored location, we can remove the redundant wakeup ipis.
Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Wei Gang <gang.wei@intel.com>
Allow all unused GSI to be configured via IO-APIC by new pv_ops dom0
Currently Xen disallows setting up any GSI < 16. This makes it
impossible for the kernel to use any PCI device that has no ACPI
override but is mapped to one of these interrupts via the IO-APIC.
The patch allows all unused interrupts to be setup via IO-APIC.
Keir Fraser [Wed, 31 Mar 2010 09:21:19 +0000 (10:21 +0100)]
x86/hvm: accelerate I/O intercept handling
Currently we go through the emulator every time an HVM guest does an
I/O port access (in/out). This is unnecessary most of the time, as
both VMX and SVM already provide all the necessary information in the
VMCS/VMCB. String instructions are not covered by this shortcut, but
they are quite rare and we would need to access the guest memory
anyway. This patch decodes the information from the VMCB/VMCS and calls
a simple handle_mmio wrapper. In handle_mmio() itself the emulation part
is simply skipped; this approach avoids code duplication. Since
the vendor specific part is quite trivial, I implemented both the VMX
and SVM parts; please check the VMX part for sanity.
I boot-tested both versions and ran some simple benchmarks. A micro
benchmark (hammering an I/O port in a tight loop) shows a significant
performance improvement (down to 66% of the time needed to handle the
intercept on an AMD K8, measured in the guest with TSC). Even with
reading a 1GB file from an emulated IDE harddisk (Dom0 cached) I could
get a 4-5% improvement. Some guest code (e.g. the TCP stack in some
Windows versions) exercises the PM-Timer I/O port (0x1F48) very often
(tens of thousands of times per second); these workloads also benefit,
with up to 5% improvement from this patch.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Keir Fraser [Wed, 31 Mar 2010 09:12:29 +0000 (10:12 +0100)]
x86: start PCI IRQs Xen uses from Dom0-invoked io_apic_set_pci_routing()
When using a serial port from an add-in PCI card, and that IRQ is (as
usual) outside of the legacy range (0...15), Xen would never really
enable the IRQ, as at the time setup_irq() runs the handler for the
IRQ still is &no_irq_type. Consequently, once the trigger mode and
polarity of the interrupt become known to Xen, it should start such
IRQ(s) it uses for itself.
The question is whether the same should also be done in
ioapic_guest_write(): Legacy kernels don't use PHYSDEVOP_setup_gsi
(and hence don't trigger the code path modified).
Note however that even when a kernel is using PHYSDEVOP_setup_gsi in
the way the pv-ops kernel currently does, there's still no guarantee
that the call would ever be issued for IRQs Xen may be using, since
this happens only when devices get enabled. For Xen's purposes, this
function should be called for *all* device IRQs, regardless of
whether those would actually be (attempted to be) used by the kernel,
i.e. in a subsys_initcall() from drivers/acpi/pci_irq.c iterating
over all PCI devices and doing mostly what acpi_pci_irq_enable() does
except for calling this function in place of acpi_register_gsi(). The
downside of this approach is that without extra filtering in Xen
(based on a hint from Dom0), vectors will then get allocated even for
IRQs that are unused by both hypervisor and kernel.
Keir Fraser [Wed, 31 Mar 2010 09:11:41 +0000 (10:11 +0100)]
ns16550: enable PCI serial card usage
On some machines, there is no built-in serial port and no LPC
connection. To use a serial port, we have to plug in a serial
card. Sometimes the BIOS doesn't enable the BARs for the PCI devices,
so Xen can't use the add-in serial port for early log
output. This patch tries to initialize the serial card and the related
PCI bridge to make it usable for Xen.
Usage:
Step 1. Boot into bare metal Linux and get the information for the PCI
serial ports and the related PCI bridge. In my case:
06:02.0 Serial controller: Lava Computer mfg Inc Lava DSerial-PCI Port
A (prog-if 02 [16550])
Region 0: I/O ports at 5000 [size=8]
Step 2. Revise grub.conf to include
'com1=115200,8n1,0xPPPP,0,<port-bdf>,<bridge-bdf> console=com1' in the
Xen cmdline. 0xPPPP is the base I/O port address obtained for the
serial port in bare metal Linux. In my case, it is 0x5000. The 0
after 0xPPPP enables polling mode for the serial port.
<port-bdf> is the serial port BDF, 06:02.0 in my case;
<bridge-bdf> is the BDF of the bridge behind which the serial card sits,
00:1e.0 in my case.
Keir Fraser [Tue, 30 Mar 2010 17:31:39 +0000 (18:31 +0100)]
xend: Fix bug of cpu affinity/ vcpu pin under ia32pae
c/s 21040 and 21044 were meant to lift the cpu number limit (<=64).
However, they introduced bugs under the ia32pae model:
1. affinity goes wrong, pinning all vcpus to the same cpu at the
same time;
2. when 'xm vcpu-pin' is used to pin a vcpu to a cpu, xend exits
abnormally.
Keir Fraser [Tue, 30 Mar 2010 17:30:30 +0000 (18:30 +0100)]
svm: vmcb intercept enumeration
The attached patch additionally annotates the vmcb intercept enumeration
with hexadecimal numbers. This makes looking up the intercept number easier.
No functional changes.
Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Keir Fraser [Tue, 30 Mar 2010 07:32:34 +0000 (08:32 +0100)]
mcheck: Small fix for CMCI Threshold set problem.
When generating a new threshold value, we must first clear the old value
before OR-ing in the new one, since the new value might differ
from the old (the BIOS might have pre-set some threshold).
Signed-off-by: Liping Ke <liping.ke@intel.com>
Signed-off-by: Ying Huang <ying.huang@intel.com>
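The clear-then-set step in the CMCI fix is the usual read-modify-write on a bit field. A minimal sketch follows; DEMO_THRESHOLD_MASK (the low 15 bits) and demo_set_threshold are assumptions for illustration, not the names or values used in the mcheck code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for this sketch: the threshold occupies the low 15 bits. */
#define DEMO_THRESHOLD_MASK 0x7fffULL

/* Return reg with its threshold field replaced by `threshold`. */
static uint64_t demo_set_threshold(uint64_t reg, uint64_t threshold)
{
    assert((threshold & ~DEMO_THRESHOLD_MASK) == 0);

    reg &= ~DEMO_THRESHOLD_MASK;    /* clear any pre-set value first */
    reg |= threshold;               /* then OR in the new value */
    return reg;
}
```

Without the clearing step, a pre-set value of 0x7f0 OR-ed with a new value of 0x001 would leave 0x7f1 in the field rather than 0x001.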
Keir Fraser [Tue, 30 Mar 2010 07:31:16 +0000 (08:31 +0100)]
When flushing by tlb mask, we need to take the cpu_online_map into account.
The same is true for EPT flushes.
We noticed occasional system hangs in the cpu online/offline stress test.
The reason is that flush_tlb_mask, called from __get_page_type, loops
forever. This is caused by a small window during cpu offline. The
cpu_online_map is changed and interrupts are disabled in
take_cpu_down() for the to-be-offlined CPU.
However, __sync_lazy_execstate() is called from idle_task_exit() in
the idle_loop() of the to-be-offlined CPU. At that time,
stop_machine_run has already finished, and __get_page_type may be
called on another CPU before __sync_lazy_execstate() runs.
Thanks to Jan for pointing out an issue in my original patch.
Keir Fraser [Fri, 26 Mar 2010 08:49:13 +0000 (08:49 +0000)]
cpufreq: fix statistic lock problem
cpufreq_statistic_lock must protect not only the statistic memory
pointed to by cpufreq_statistic_data[cpu], but also the
pointer in cpufreq_statistic_data[cpu] itself. So move the read
of cpufreq_statistic_data[cpu] to after
spin_lock(cpufreq_statistic_lock).
Keir Fraser [Thu, 25 Mar 2010 10:01:05 +0000 (10:01 +0000)]
VT-d: should not disable VT-d when find unknown DMAR structure type
Currently 4 DMAR structure types are supported (type values 0 ~ 3). Type
values > 3 are reserved for future use. The current implementation
disables VT-d when it finds an unknown DMAR structure type; this may
lead to VT-d being disabled on future platforms until Xen supports the
new types. For forward compatibility, just skip unknown structures by
skipping the number of bytes indicated by the Length
field, so that VT-d can still be used.
Signed-off-by: Weidong Han <weidong.han@intel.com>
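Skipping unknown structures by their Length field is a standard type/length walk. The sketch below assumes each structure starts with a 16-bit type followed by a 16-bit length, both little-endian; demo_walk_structures and DEMO_MAX_KNOWN_TYPE are hypothetical names, and this is not the VT-d parser itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_KNOWN_TYPE 3   /* type values 0..3 are understood */
#define DEMO_HDR_LEN        4   /* 16-bit type + 16-bit length */

/*
 * Walk a buffer of type/length structures. Known types are counted as
 * handled, unknown types are skipped using their length field.
 * Returns the number of handled structures, or -1 if the buffer is
 * malformed. *skipped receives the number of unknown structures.
 */
static int demo_walk_structures(const uint8_t *buf, size_t len, int *skipped)
{
    size_t off = 0;
    int handled = 0;

    assert(buf != NULL && skipped != NULL);
    *skipped = 0;

    while (off < len) {
        uint16_t type, length;

        if (len - off < DEMO_HDR_LEN)
            return -1;                          /* truncated header */
        type = (uint16_t)(buf[off] | (buf[off + 1] << 8));
        length = (uint16_t)(buf[off + 2] | (buf[off + 3] << 8));
        if (length < DEMO_HDR_LEN || length > len - off)
            return -1;                          /* bogus length */

        if (type <= DEMO_MAX_KNOWN_TYPE)
            handled++;                          /* would be parsed here */
        else
            (*skipped)++;                       /* unknown: just skip */

        off += length;
    }
    return handled;
}
```

The length sanity check matters: a zero length on an unknown structure would otherwise make the walk spin forever.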
Keir Fraser [Thu, 25 Mar 2010 09:19:33 +0000 (09:19 +0000)]
x86: s3: write_msi_msg: entry->msg should be in the compatibility format
When Interrupt Remapping is used, after Dom0 S3, Dom0's filesystem
might become inaccessible as the SATA disk's MSI interrupt becomes
buggy. The cause is: After set_msi_affinity() or setup_msi_irq()
invokes write_msi_msg(), entry->msg records the remappable format
message; during S3 resume, Dom0 invokes the PHYSDEVOP_restore_msi
hypercall to restore the MSI registers of devices, and in
pci_restore_msi_state() -> write_msi_msg(), the 'entry->msg' of
remappable format is passed, but in write_msi_msg() -> ... ->
msi_msg_to_remap_entry(), the 'msg' is assumed to be in compatibility
format. As a result, after s3, the IRTE is corrupted.
Actually the only users of 'entry->msg' are pci_restore_msi_state()
and dump_msi(). That's why the issue only shows up with Dom0 S3.
Keir Fraser [Thu, 25 Mar 2010 07:41:55 +0000 (07:41 +0000)]
Fix gdbserver-xen support on older kernels.
The xc_ptrace API relies on errno for passing success/failure
indication back to callers. However, mapping operations that fall
back on legacy APIs may leave errno set to a non-zero result even
though the operation is successful. This patch resets errno after
successful map operations so that xc_ptrace doesn't inadvertently
return a failure.
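The errno problem can be shown in miniature. In the sketch below, demo_map_with_fallback stands in for a mapping call whose preferred path fails (setting errno) before a legacy path succeeds; all names are hypothetical and none of this is the libxc code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Stand-in for a mapping call: the preferred interface "fails" and
 * sets errno, then the legacy path succeeds and returns the mapping.
 */
static void *demo_map_with_fallback(void *page)
{
    errno = ENOSYS;     /* left behind by the failed first attempt */
    return page;        /* legacy path result (NULL means failure) */
}

/* Wrapper that clears the stale errno when the map succeeded. */
static void *demo_map_checked(void *page)
{
    void *p = demo_map_with_fallback(page);

    if (p != NULL)
        errno = 0;      /* success must not look like failure */
    return p;
}

/* errno-style caller: the return value alone cannot signal failure. */
static long demo_peek(long *page)
{
    long *p;

    errno = 0;
    p = demo_map_checked(page);
    if (p == NULL)
        return -1;
    assert(errno == 0); /* holds because demo_map_checked cleared it */
    return *p;
}
```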
Keir Fraser [Thu, 25 Mar 2010 07:40:09 +0000 (07:40 +0000)]
x86: fix improper return value from relinquish_memory()
While apparently only a theoretical possibility (domain_kill() has a
BUG_ON() that wasn't reported to trigger so far), I still think it is
better to have the code cleaned up.
Keir Fraser [Wed, 24 Mar 2010 11:06:48 +0000 (11:06 +0000)]
Fix 21051:bcc09eb7379f "x86_32: Relocate multiboot modules to below 1GB."
Copy the modules in ascending order in memory, rather than descending
order. This reduces the likelihood of the second relocation (in
setup.c) corrupting modules through accidental overwriting.