Keir Fraser [Tue, 4 May 2010 08:30:53 +0000 (09:30 +0100)]
Remus: python netlink fixes
Fix deprecation warning in Qdisc class under python 2.6.
Fix rtattr length and padding (rta_len is unaligned).
Null-terminate qdisc name in rtnl messages.
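The rtattr length rule can be illustrated in C (the Remus change itself is in Python). In rtnetlink, rta_len counts the 4-byte attribute header plus the payload with no padding, while the next attribute starts at the next 4-byte boundary. A minimal sketch, assuming a locally redeclared copy of the struct rtattr layout; attr_put_string and its parameters are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct rtattr in <linux/rtnetlink.h>, redeclared here so
 * the sketch builds on any platform. */
struct rtattr_hdr {
    uint16_t rta_len;   /* header + payload, NOT rounded up */
    uint16_t rta_type;
};

/* Attributes start on 4-byte boundaries (cf. RTA_ALIGN). */
#define ATTR_ALIGN(len) (((len) + 3u) & ~(size_t)3u)

/* Hypothetical helper: append a NUL-terminated string attribute at buf+off.
 * Returns the offset where the next attribute starts, or 0 if it won't fit. */
static size_t attr_put_string(uint8_t *buf, size_t cap, size_t off,
                              uint16_t type, const char *s)
{
    size_t payload = strlen(s) + 1;                   /* include the NUL */
    size_t len = sizeof(struct rtattr_hdr) + payload; /* value for rta_len */
    size_t next = off + ATTR_ALIGN(len);              /* space actually used */
    struct rtattr_hdr h;

    if (next > cap || len > UINT16_MAX)
        return 0;
    h.rta_len = (uint16_t)len;
    h.rta_type = type;
    memcpy(buf + off, &h, sizeof h);
    memcpy(buf + off + sizeof h, s, payload);
    memset(buf + off + len, 0, next - (off + len));   /* zero the padding */
    return next;
}
```

Appending "abcd" this way records rta_len 9 but advances 12 bytes: the length stays unaligned and only the position of the next attribute is padded.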
x86, shadow: propagate PAT caching attributes to the shadow l1
PAT caching attributes were only propagated if has_arch_pdevs() was true,
causing the result of hvm_get_mem_pinned_cacheattr() to be ignored
in the non-passthrough case.
l1_disallow_mask() needs to be relaxed.
Signed-off-by: Jean Guyader <jean.guyader@citrix.com>
"The motivation comes from distributors that configure their
crashkernel command line automatically with some configuration tool
(YaST, you know ;)). Of course that tool knows the value of System
RAM, but if the user removes RAM, then the system becomes unbootable
or at least unusable and error handling is very difficult."
For x86, unlike Linux, we pass the actual amount of RAM rather than
the address of the highest page (to cope with sparse physical address
maps).
console: Make initial static console buffer __initdata.
The previous scheme (freeing an area of BSS) did not interact well
with device passthrough, as the guest device pagetables used by the IOMMU
will not contain any Xen BSS area. Hence, if the freed BSS space is
allocated to a guest, DMA to the guest's own memory can fail.
The simple solution here is to always free the static buffer at the end
of boot (initmem is specially handled for IOMMUs) and to require that a
dynamically allocated buffer always be created.
xend: remove the tapdisk device backend from xenstore earlier, to
release the resources allocated by the backend driver, which lives in
the dom0 kernel
The Blktapctl thread uses the qemu-dm connection instead of tapdisk-ioemu
in the case of an FV VM. We found that resources such as memory allocated
for this guest could not be freed, because the backend driver could not
be closed in qemu-dm.
This patch removes the tapdisk device backend from xenstore earlier,
which triggers qemu-dm to notify the backend driver to release the
allocated resources.
I have tested this patch in the following cases:
1. save && restore
2. destroy && shutdown
3. snapshot
Signed-off-by: James ( Song Wei ) <jsong@novell.com>
The info subcommand was missing from the xl tool. Use the new libxl
wrapper functions to create a clone of "xm info". The split into
several smaller functions is inspired by the implementation in
XendNode.py.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Acked-by: Vincent Hanquez <vincent.hanquez@eu.citrix.com>
Xen provides a xen_version hypercall to query the values of several
interesting things (such as the hypervisor version, the command line
used, the actual changeset, etc.). Create a user-friendly and efficient
wrapper around the libxc function to provide values for the xl info
output.
Since the information is static during the whole runtime, we store
it in the libxl_ctx structure and simply return the pointer on
subsequent calls.
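The caching scheme can be sketched as follows. This is not the libxl code: demo_ctx, demo_version_info and demo_query_version are hypothetical stand-ins, and the version numbers are placeholders. Because the data never changes at runtime, a lazily filled pointer in the context is enough and no invalidation is needed:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the libxl context and version structure. */
struct demo_version_info {
    int xen_major, xen_minor;
    char extra[16];
};

struct demo_ctx {
    struct demo_version_info *version;  /* NULL until first requested */
    int queries;                        /* counts underlying queries */
};

/* Stand-in for the libxc/xen_version calls; values are placeholders. */
static void demo_query_version(struct demo_ctx *ctx,
                               struct demo_version_info *out)
{
    ctx->queries++;
    out->xen_major = 4;
    out->xen_minor = 0;
    strcpy(out->extra, ".0");
}

/* The first call queries and caches; later calls return the same pointer. */
static const struct demo_version_info *
demo_get_version_info(struct demo_ctx *ctx)
{
    if (ctx->version == NULL) {
        struct demo_version_info *v = malloc(sizeof *v);
        if (v == NULL)
            return NULL;
        demo_query_version(ctx, v);
        ctx->version = v;
    }
    return ctx->version;
}
```

The cached block would be released together with the context.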
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Acked-by: Vincent Hanquez <vincent.hanquez@eu.citrix.com>
The libxl version of the physinfo sysctl does not contain some
fields like nr_nodes or capabilities needed for xl info output.
Add them to the structure and the retrieving function.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Acked-by: Vincent Hanquez <vincent.hanquez@eu.citrix.com>
BSD sed does not support '+' in basic regular expressions, while GNU sed
does. BSD sed supports '+' in extended regular expressions, enabled with
the -E flag, whereas GNU sed uses -r.
The only difference from the original version is that the '+'
quantifier is replaced with '\{1\,\}', which should work with both BSD
sed and GNU sed.
Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Export the CPU online/offline and memory online/offline hotplug
interfaces, so that users can freely perform these (memory/CPU) hotplug
actions with the following command lines:
x86: No need to sync_local_execstate() during CPU hot-unplug.
This is done implicitly when we enter stopmachine_run() context,
because the underlying tasklet mechanism performs the sync before
running a tasklet handler.
Synchronise lazy execstate before calling tasklet handlers.
This ensures we are properly running on idle-vcpu state, which certain
things (e.g., use of vmx_vmcs_{enter,exit}) rely on. It also means we
don't need to do the same thing in the stopmachine_run handler.
Late in the 4.0 release cycle it was discovered that certain order>0
allocations could fail and had no fallback. This conflicted with
tmem, especially when combined with aggressive ballooning.
A hacky workaround patch was added in time for 4.0 that
reduced (but did not completely eliminate) the problem, but
tmem was left disabled by default for the 4.0 release.
Re-enable it by default in xen-unstable to help identify cases
where the workaround is insufficient. Tmem can be
disabled with the no-tmem Xen boot option. Please report
failures (that are fixed with the no-tmem option) to me.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Implement tasklets as running in VCPU context (specifically, idle-VCPU context)
...rather than in softirq context. This is expected to avoid a lot of
subtle deadlocks relating to the fact that softirqs can interrupt a
scheduled vcpu.
Move tasklet implementation into its own source files.
This is preparation for implementing tasklets in vcpu context rather
than softirq context. There is no change to the implementation of
tasklets in this patch.
I noticed that 2 scripts in Xen 4.0.0 call "gawk". In most
distributions, gawk is considered a specific implementation of awk.
Calling "gawk" rather than "awk" generally means that you need
features specific to the GNU version of awk, as opposed to "mawk",
which is another implementation of the same tool.
So, unless I misread the scripts, Xen doesn't need gawk but
just any implementation of awk, and the attached patch can safely be
applied.
If I am wrong (which, at first glance, I don't think I am) and
there is a reason why gawk is used rather than awk, then IMHO the
top-level README should mention it in the prerequisites.
From: Thomas Goirand <thomas@goirand.fr>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
mce: Add an x86_mcinfo_reserve() function, to reserve space from mc_info.
With this method, we don't need to collect bank and global information
in a local variable and then call x86_mcinfo_add() to copy that
information into mc_info. This avoids a copy, and we also learn earlier
whether there is enough space in mc_info.
Also extract the code that gets global/bank information into separate
functions, mca_init_bank/mca_init_global.
It is meaningless to get the current information in MCE context; it is
kept here for now but should be removed in the future.
Also, a flag is added to mc_info to indicate that some information was
lost due to OOM.
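A minimal sketch of the reserve-in-place idea, assuming a simplified flat buffer; demo_mc_info, demo_mcinfo_reserve and the flag value are hypothetical and do not reproduce Xen's mc_info layout. The caller writes the record straight into the reserved space, and a full buffer is reported both to the caller (NULL) and in the flags:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MCINFO_BUFSZ 256
#define DEMO_MCINFO_FLAG_LOST 0x1u  /* hypothetical "data lost" flag */

/* Hypothetical, simplified mc_info: a flat buffer of variable-size records. */
struct demo_mc_info {
    uint32_t flags;
    size_t used;
    uint64_t data[DEMO_MCINFO_BUFSZ / sizeof(uint64_t)];  /* 8-byte aligned */
};

struct demo_mcinfo_bank {
    uint16_t bank;
    uint64_t status;
};

/* Reserve 'size' bytes in place and return a pointer the caller fills in
 * directly, so no local copy is needed. If the buffer is full, set the
 * "lost" flag and return NULL, so the caller knows before collecting data. */
static void *demo_mcinfo_reserve(struct demo_mc_info *mi, size_t size)
{
    size = (size + 7) & ~(size_t)7;    /* keep every record 8-byte aligned */
    if (size > sizeof mi->data - mi->used) {
        mi->flags |= DEMO_MCINFO_FLAG_LOST;
        return NULL;
    }
    unsigned char *p = (unsigned char *)mi->data + mi->used;
    mi->used += size;
    memset(p, 0, size);
    return p;
}
```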
Clean up MCA MSR virtualization and vMCE injection
Move all virtual-MCE-related code into a separate file.
This also does some clean-up of the vMCE code, including:
a) renaming some functions, e.g. mce_init_msr/mce_rdmsr to
vmce_init_msr/vmce_rdmsr, to make their purpose more obvious,
b) making vmca_msrs a pointer in arch_domain,
to decrease arch_domain's size,
c) extracting per-bank MCA MSR access into separate functions
(bank_mce_wrmsr/bank_mce_rdmsr) to make the code a bit cleaner,
d) adding a new file, xen/include/asm-x86/mce.h, for vMCE-related
declarations.
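Per-bank MCA MSRs are laid out architecturally as four consecutive MSRs per bank starting at IA32_MC0_CTL (0x400): MCi_CTL, MCi_STATUS, MCi_ADDR and MCi_MISC. A separate per-bank accessor therefore only has to decode the bank number and register kind from the MSR index. A sketch of that decoding step, with a hypothetical helper name (it is not bank_mce_rdmsr itself):

```c
#include <stdint.h>

#define MSR_IA32_MC0_CTL 0x400u   /* first per-bank MCA MSR */

enum demo_mca_reg { MCA_CTL = 0, MCA_STATUS = 1, MCA_ADDR = 2, MCA_MISC = 3 };

/* Hypothetical helper: if 'msr' is one of the per-bank MSRs of 'nr_banks'
 * banks, store its bank number and register kind and return 1; else 0. */
static int demo_decode_bank_msr(uint32_t msr, unsigned int nr_banks,
                                unsigned int *bank, enum demo_mca_reg *reg)
{
    if (msr < MSR_IA32_MC0_CTL || msr - MSR_IA32_MC0_CTL >= 4u * nr_banks)
        return 0;
    *bank = (msr - MSR_IA32_MC0_CTL) / 4u;
    *reg = (enum demo_mca_reg)((msr - MSR_IA32_MC0_CTL) % 4u);
    return 1;
}
```

A vMCE rdmsr/wrmsr handler could call a decoder like this first and dispatch to a per-bank function only when it returns 1.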
x86: Revert how we calculate 'total system RAM' after c/s 20236.
This approach is more straightforward, in that it simply works from the
original e820 map. It's what the user expects, and reporting a smaller
value is never appreciated. ;-)
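Computing total RAM from the e820 map means summing the lengths of the usable-RAM entries (type 1). On a map with holes this is smaller than the end address of the highest RAM range, which is the distinction drawn earlier for the crashkernel case. A minimal sketch with a hypothetical entry struct:

```c
#include <stddef.h>
#include <stdint.h>

#define E820_RAM 1  /* e820 type value for usable RAM */

/* Hypothetical entry layout for the sketch. */
struct demo_e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Total RAM: sum the sizes of all RAM ranges. */
static uint64_t demo_e820_total_ram(const struct demo_e820entry *map, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (map[i].type == E820_RAM)
            total += map[i].size;
    return total;
}

/* For comparison: end address of the highest RAM range. */
static uint64_t demo_e820_ram_end(const struct demo_e820entry *map, size_t n)
{
    uint64_t end = 0;
    for (size_t i = 0; i < n; i++)
        if (map[i].type == E820_RAM && map[i].addr + map[i].size > end)
            end = map[i].addr + map[i].size;
    return end;
}
```

For a map with RAM at 0 to 0x9f000, 1 MiB to 3 GiB and 4 GiB to 5 GiB, the first function returns just under 4 GiB, while the second returns 5 GiB.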
We recently found that a FreeBSD 8.0 guest failed to install and boot
on Xen. The reason was that FreeBSD detected the clflush feature and
used this instruction to flush MMIO space. This caused a page fault, but
x86_emulate.c failed to emulate the instruction (it was not supported).
As a result, the page fault was delivered to the FreeBSD guest. A
similar issue was reported earlier.
Due to changes in grub2, menu entry titles now have single quotes
around them rather than double quotes, but the memtest entries still
use double quotes, so we need to match both.
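The quote handling amounts to accepting either quote character after menuentry and closing on the same one. A sketch in C with a hypothetical helper name; it ignores escaped quotes inside titles and is not the code the patch changed:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the title of a grub2 "menuentry" line into
 * out. Accepts 'single' or "double" quotes. Returns 1 on success, else 0. */
static int demo_menuentry_title(const char *line, char *out, size_t outsz)
{
    static const char kw[] = "menuentry";
    const char *p = strstr(line, kw);
    if (p == NULL)
        return 0;
    p += sizeof kw - 1;                 /* skip the keyword */
    while (*p == ' ' || *p == '\t')
        p++;
    if (*p != '\'' && *p != '"')
        return 0;
    char quote = *p++;                  /* remember which quote opened it */
    const char *end = strchr(p, quote); /* and close on the same one */
    if (end == NULL || (size_t)(end - p) >= outsz)
        return 0;
    memcpy(out, p, (size_t)(end - p));
    out[end - p] = '\0';
    return 1;
}
```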
Signed-off-by: David Markey <david.markey@citrix.com>
Since we moved several NUMA info fields from physinfo into separate
functions/structures, we must adapt the node picking algorithm, too.
Currently xm create complains about undefined hash values.
The patch uses the new Python xc binding to get the information and
creates a reverse mapping for node_to_cpu, since we now only have a
cpu_to_node field.
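Building node_to_cpu from cpu_to_node is a simple inversion. The patch does it in Python; the same idea in C looks like this, with hypothetical fixed limits and function name:

```c
#include <stddef.h>

#define DEMO_MAX_NODES 8   /* hypothetical limits for the sketch */
#define DEMO_MAX_CPUS  64

/* Invert cpu_to_node: node_cpus[n][k] is the k-th CPU on node n and
 * node_ncpus[n] is how many CPUs node n has. Returns -1 if an entry
 * names an out-of-range node, else 0. */
static int demo_build_node_to_cpu(const int *cpu_to_node, int nr_cpus,
                                  int node_cpus[DEMO_MAX_NODES][DEMO_MAX_CPUS],
                                  int node_ncpus[DEMO_MAX_NODES])
{
    for (int n = 0; n < DEMO_MAX_NODES; n++)
        node_ncpus[n] = 0;
    for (int cpu = 0; cpu < nr_cpus && cpu < DEMO_MAX_CPUS; cpu++) {
        int node = cpu_to_node[cpu];
        if (node < 0 || node >= DEMO_MAX_NODES)
            return -1;
        node_cpus[node][node_ncpus[node]++] = cpu;
    }
    return 0;
}
```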
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
acpi sleep: Rearrange code for entering system sleep states.
We cannot freeze_domains in hypercall-continuation context any more,
since that is a softirq context which can interrupt an arbitrary
vcpu. Hence sleeping all vcpus in that context can easily deadlock
(against the vcpu we interrupted). So rearrange the code to
freeze_domains before calling continue_hypercall_on_cpu().
Update comments around spin_trylock() usage for sysctl and xenpf locks.
Since the execution of stop_machine_run() via cpu_down() is now always
deferred to a hypercall continuation context, the above locks are not
held at that time. Hence the trylock is not specifically to avoid
deadlock with stop_machine_run(), but rather reflects general paranoia
about deadlocks.
continue_hypercall_on_cpu() always defers execution of the continuation
...even when scheduled to run on the current physical cpu. This
ensures that locks get dropped correctly before executing the
continuation code, and also allows the original caller to determine
whether the continuation has executed or will execute, based on
c_h_o_c()'s immediate return code.
This is the core credit2 patch. It adds the new credit2 scheduler to
the hypervisor, as the non-default scheduler. It should be emphasized
that this is still in the development phase, and is probably still
unstable. It is known to be suboptimal for multi-socket systems.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Credit2 shares a runqueue between several cpus. Rather than having
double locking and dealing with the cpu-to-runqueue races, allow
the scheduler to redefine the sched_lock-to-cpu mapping.
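The lock-remapping idea can be sketched with pthreads. This is an analogue rather than Xen's implementation, and all names are hypothetical. Each CPU keeps a pointer to the lock that currently protects it, the scheduler may point several CPUs at one runqueue lock, and lockers re-check the pointer after acquiring it in case the mapping changed while they waited:

```c
#include <pthread.h>
#include <stdatomic.h>

#define DEMO_NR_CPUS 4

/* Hypothetical per-CPU scheduler data: each CPU owns a lock, but 'lock'
 * says which lock actually protects this CPU's scheduling state. */
struct demo_percpu_sched {
    pthread_mutex_t own_lock;
    _Atomic(pthread_mutex_t *) lock;
};

static struct demo_percpu_sched demo_sched[DEMO_NR_CPUS];
static pthread_mutex_t demo_runq_lock = PTHREAD_MUTEX_INITIALIZER;

static void demo_sched_init(void)
{
    for (int i = 0; i < DEMO_NR_CPUS; i++) {
        pthread_mutex_init(&demo_sched[i].own_lock, NULL);
        atomic_init(&demo_sched[i].lock, &demo_sched[i].own_lock);
    }
}

/* A credit2-like scheduler points every CPU sharing a runqueue at the
 * runqueue's lock. In this sketch, call it before the CPUs are in use. */
static void demo_share_runqueue(int first, int last)
{
    for (int i = first; i <= last; i++)
        atomic_store(&demo_sched[i].lock, &demo_runq_lock);
}

/* Take whichever lock currently guards 'cpu'. If the mapping changed while
 * we were waiting, drop the stale lock and try again. */
static pthread_mutex_t *demo_sched_lock(int cpu)
{
    for (;;) {
        pthread_mutex_t *l = atomic_load(&demo_sched[cpu].lock);
        pthread_mutex_lock(l);
        if (l == atomic_load(&demo_sched[cpu].lock))
            return l;
        pthread_mutex_unlock(l);
    }
}
```

Callers unlock the pointer that demo_sched_lock returned, so CPUs on a shared runqueue serialize on one lock without any double locking.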
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Because credit2 shares a runqueue between several cpus, it needs
to know when a scheduled-out process has finally been context-switched
away so that it can be added to the runqueue again. (Otherwise it may
be grabbed by another processor before the context has been properly
saved.)
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Add an option that will set up the buffers and listen for updates,
but will not enable tracing. This is useful if you have hacks
in Xen to enable tracing at key points (for example, debugging a
shadow bug).
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Unfortunately, the latter c/s' change to mpparse.c rendered the former
patch non-functional: Xen's serial port IRQ is not in IRQ_DISABLED
state, yet it must be allowed to have its trigger mode and polarity set
up in order for it to be usable.
Reorder the SCHED_SWITCH trace before the runstate change trace to fix
a problem with the lost records "resume" code.
Namely: The "lost records" trace includes the currently running
process. But during SCHED_SWITCH, it reads the wrong value, confusing
xenalyze. Making sure there are no trace records between runstate
change trace and the actual context switch fixes it.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
If OOS mode is enabled, after the last possible resync, read the guest
l1e one last time. If it differs from the value originally read, start
over.
This fixes a race which can result in inconsistent in-sync shadow
tables, leading to corruption:
v1: take page fault, read gl1e from an out-of-sync PT.
v2: modify gl1e, lowering permissions
[v1,v3]: resync l1 which was just read.
v1: propagate change to l1 shadow using stale gl1e
Now we have an in-sync shadow with more permissions than the guest.
The resync can happen either as a result of a 3rd vcpu doing a cr3
update, or under certain conditions by v1 itself.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
ssh is used as the transport by default, although this can be
overridden by specifying a different sshcommand. This is a very
standard approach nowadays and avoids the need for daemons at the
target host in the default configuration, while providing flexibility
to admins. (In the future it might be nice to support plain
unencrypted migration over TCP, which we do not rule out now, although
it is not currently implemented.)
Properties of the migration protocol:
* The domain on the target machine is named "<domname>--incoming"
while it is being transferred.
* The domain on the source machine is renamed
"<domname>--migratedaway"
before we give the target permission to rename and unpause.
* The locking in libxl_domain_rename ensures that of two
simultaneous migration attempts no more than one will succeed.
* We go to some considerable effort to avoid leaving the domain in
a bad state if something goes wrong with one of the ends or the
network, although there is still (inevitably) a possibility of an
unresolvable state (in case of a very badly timed network failure),
which is probably best resolved by destroying the domain at both
ends.
Incidental changes:
create_domain now returns a libxl error code rather than exiting on
error.
New ERROR_BADFAIL error code for reporting unpleasant failures.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
* Make create_domain always return to caller
* Have create_domain set its log callback sooner
* Actually write things to logfile, and some error checking
With some combinations of options, create_domain would never return to
the caller, since it would have called daemon() and would later exit. So
we fork an additional time, so that we can call daemon in the child
and also return to the caller in the parent. It's a shame that
there's no version of daemon(3) that allows us to do this without the
extra code and pointless extra fork.
daemon(0,0) redirects stdin, stdout and stderr to /dev/null. So we need to call daemon(0,1) and
organise detaching our stdin/out/err ourselves. Doing this makes
messages actually appear in the xl logfile in /var/log/xen.
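The fork-then-daemon arrangement can be sketched as follows. daemon(3) is a BSD/glibc extension rather than POSIX, and demo_spawn_daemon, demo_body and the logfd parameter are hypothetical; xl's actual code differs. The parent only reaps the intermediate child, which exits as soon as daemon() has forked again:

```c
#define _DEFAULT_SOURCE   /* exposes daemon(3) on glibc */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run body() as a daemon while the caller keeps running. Returns 0 to the
 * caller once the daemon is detached, or -1 on failure. */
static int demo_spawn_daemon(int logfd, void (*body)(void))
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* nochdir=0: chdir("/"); noclose=1: leave fds 0-2 for us to set up. */
        if (daemon(0, 1) < 0)
            _exit(127);
        int nullfd = open("/dev/null", O_RDONLY);
        if (nullfd < 0 || dup2(nullfd, STDIN_FILENO) < 0 ||
            dup2(logfd, STDOUT_FILENO) < 0 || dup2(logfd, STDERR_FILENO) < 0)
            _exit(127);
        if (nullfd > STDERR_FILENO)
            close(nullfd);
        body();                     /* output now goes to the logfile */
        _exit(0);
    }
    /* daemon() forked again and our child exited; reap it. */
    int status;
    if (waitpid(child, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return 0;
}

/* Hypothetical daemon body for the example. */
static void demo_body(void)
{
    (void)write(STDERR_FILENO, "daemon running\n", 15);
}
```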
Finally, make create_domain call libxl_ctx_set_log sooner. This makes
some lost messages appear.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
xl: New savefile format. Save domain config when saving a domain.
We introduce a new format for saved domains. The new format, in
contrast to the old:
* Has a magic number which can distinguish it from other kinds of
file
* Is extensible
* Can contain the domain configuration file
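A header with those properties might look like the following. The layout, magic string and flag semantics are assumptions made for illustration, not xl's actual savefile format: a magic identifies the file, a byte-order word detects endianness mismatches, unknown mandatory flags are rejected, and unknown optional data can be skipped using its length:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative layout only; not xl's on-disk format. */
static const char DEMO_SAVE_MAGIC[8] = { 'D','E','M','O','S','A','V','\n' };
#define DEMO_SAVE_BYTEORDER 0x01020304u
#define DEMO_KNOWN_MANDATORY 0u        /* no mandatory features defined yet */

struct demo_save_header {
    char magic[8];             /* distinguishes this file from other kinds */
    uint32_t byteorder;        /* DEMO_SAVE_BYTEORDER in the writer's order */
    uint32_t mandatory_flags;  /* readers must refuse bits they don't know */
    uint32_t optional_flags;   /* readers may ignore bits they don't know */
    uint32_t optional_len;     /* bytes of optional data (e.g. config) */
};

static void demo_save_header_init(struct demo_save_header *h,
                                  uint32_t optional_len)
{
    memset(h, 0, sizeof *h);
    memcpy(h->magic, DEMO_SAVE_MAGIC, sizeof h->magic);
    h->byteorder = DEMO_SAVE_BYTEORDER;
    h->optional_len = optional_len;
}

/* 0 = usable; -1 = not a savefile; -2 = other byte order; -3 = needs an
 * unknown mandatory feature. Unknown optional flags are fine. */
static int demo_save_header_check(const struct demo_save_header *h)
{
    if (memcmp(h->magic, DEMO_SAVE_MAGIC, sizeof h->magic) != 0)
        return -1;
    if (h->byteorder != DEMO_SAVE_BYTEORDER)
        return -2;
    if (h->mandatory_flags & ~DEMO_KNOWN_MANDATORY)
        return -3;
    return 0;
}
```

Splitting flags into mandatory and optional sets is what makes such a format extensible: new writers can add optional data that old readers skip, while anything old readers cannot safely ignore is marked mandatory.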
On domain creation we remember the actual config file used (using the
toolstack data feature of libxl, just introduced), and by default save
it to the save file.
However, options are provided for the following:
* When saving a domain, supplying an alternative config file to
store in the savefile.
* When restoring a domain, supplying an alternative config file.
If a domain is restored with a different config file, it is the
responsibility of the xl user to ensure that the two configs are
"compatible". Changing the targets of virtual devices is supported;
changing other features of the domain is not recommended. Bad changes
may lead to undefined behaviour in the domain, and are in practice
likely to cause resume failures or crashes.
Old format save files generated by old versions of xl are not
supported.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
xl: Remove some duplicated boilerplate. (Improves logging slightly.)
We remove six lines of boilerplate from the top of each function, and
instead have a single struct libxl_ctx which is initialised once at
the top of main.
Likewise we wrap domain_qualifier_to_domid in a new function
find_domain, which does the error handling, and stores the domid and
the specified name (if applicable).
This reduces the size of xl.c by 7% (!)
As a beneficial side effect, the earlier call to libxl_ctx_set_log in
main makes some lost messages appear.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
libxl: Per-domain data storage for the convenience of the library user
We provide a mechanism whereby a user of the libxl library is able to
store some information alongside the domain. The information stored
is a block of bytes. Its lifetime is that of the domain, i.e. the
userdata is garbage collected alongside the domain if the domain is
destroyed. (This is why the feature needs to be in libxl and cannot
be implemented in the user itself or in libxlutil.)
If a libxl caller does not need to use this feature it can ignore it.
The data is tagged with the (self-declared) name of the libxl user, so
that different users cannot accidentally trip over each other's
userdata. The data is not interpreted at all by libxl.
To assist developers and people debugging, there is a registry of the
known userdata userids, and the corresponding data format as declared
by that libxl user, in libxl.h next to these declarations:
int libxl_userdata_store(struct libxl_ctx *ctx, uint32_t domid,
                         const char *userdata_userid,
                         const uint8_t *data, int datalen);
int libxl_userdata_retrieve(struct libxl_ctx *ctx, uint32_t domid,
                            const char *userdata_userid,
                            uint8_t **data_r, int *datalen_r);
The next patch will introduce the data for the userid "xl".
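The semantics described for this feature (opaque bytes, tagged by userid, lifetime tied to the domain) can be modelled in memory. This is a sketch with hypothetical ud_* names, not libxl's implementation; the process-local table and the return conventions are assumptions made purely for illustration:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define UD_MAX 16

/* One stored blob: opaque bytes tagged with (domid, userid). */
struct ud_entry {
    int in_use;
    uint32_t domid;
    char userid[32];
    uint8_t *data;
    int len;
};

static struct ud_entry ud_table[UD_MAX];

static struct ud_entry *ud_find(uint32_t domid, const char *userid)
{
    for (int i = 0; i < UD_MAX; i++)
        if (ud_table[i].in_use && ud_table[i].domid == domid &&
            strcmp(ud_table[i].userid, userid) == 0)
            return &ud_table[i];
    return NULL;
}

/* Store a private copy of the blob, replacing any previous one. */
static int ud_store(uint32_t domid, const char *userid,
                    const uint8_t *data, int len)
{
    if (len < 0 || strlen(userid) >= sizeof ud_table[0].userid)
        return -1;
    struct ud_entry *e = ud_find(domid, userid);
    for (int i = 0; e == NULL && i < UD_MAX; i++)
        if (!ud_table[i].in_use)
            e = &ud_table[i];
    if (e == NULL)
        return -1;                        /* table full */
    uint8_t *copy = malloc(len > 0 ? (size_t)len : 1);
    if (copy == NULL)
        return -1;
    if (len > 0)
        memcpy(copy, data, (size_t)len);  /* bytes are never interpreted */
    free(e->data);
    e->in_use = 1;
    e->domid = domid;
    strcpy(e->userid, userid);
    e->data = copy;
    e->len = len;
    return 0;
}

/* Return a malloc'd copy, or *data_r = NULL and *len_r = 0 if none. */
static int ud_retrieve(uint32_t domid, const char *userid,
                       uint8_t **data_r, int *len_r)
{
    struct ud_entry *e = ud_find(domid, userid);
    *data_r = NULL;
    *len_r = 0;
    if (e == NULL)
        return 0;
    uint8_t *copy = malloc(e->len > 0 ? (size_t)e->len : 1);
    if (copy == NULL)
        return -1;
    if (e->len > 0)
        memcpy(copy, e->data, (size_t)e->len);
    *data_r = copy;
    *len_r = e->len;
    return 0;
}

/* When a domain is destroyed, its userdata goes with it. */
static void ud_domain_destroyed(uint32_t domid)
{
    for (int i = 0; i < UD_MAX; i++)
        if (ud_table[i].in_use && ud_table[i].domid == domid) {
            free(ud_table[i].data);
            ud_table[i].data = NULL;
            ud_table[i].in_use = 0;
        }
}
```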
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
xl would like to use libxl_report_exitstatus, so expose it in
libxl_utils.h to avoid having to write it twice. Also, give it a
"level" argument to set the loglevel of the resulting message.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
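An exit-status report with a level argument can be sketched using the standard wait-status macros. The function name, the enum and the output format are hypothetical, and this sketch formats into a buffer rather than logging:

```c
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical log levels, standing in for the library's own. */
enum demo_log_level { DEMO_LOG_DEBUG, DEMO_LOG_INFO, DEMO_LOG_WARNING,
                      DEMO_LOG_ERROR };

/* Format a waitpid() status for 'what' into buf, tagged with the level the
 * caller chose. Returns 1 for a clean exit (status 0), else 0. */
static int demo_report_exitstatus(enum demo_log_level level, const char *what,
                                  int status, char *buf, size_t bufsz)
{
    if (WIFEXITED(status)) {
        snprintf(buf, bufsz, "[%d] %s exited with status %d",
                 (int)level, what, WEXITSTATUS(status));
        return WEXITSTATUS(status) == 0;
    }
    if (WIFSIGNALED(status)) {
        snprintf(buf, bufsz, "[%d] %s died from signal %d",
                 (int)level, what, WTERMSIG(status));
        return 0;
    }
    snprintf(buf, bufsz, "[%d] %s: unexpected wait status %#x",
             (int)level, what, (unsigned)status);
    return 0;
}
```

Letting the caller pick the level means an expected exit can be logged quietly while an unexpected one is reported as an error, using the same formatting code.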
xenstore,libxl: cleanup of xenstore connections across fork()
Provide a new function xs_daemon_destroy_postfork which can be called
by a libxenstore user who has called fork, to close the fd for the
connection to xenstored and free the memory, without trying to do
anything to any threads which libxenstore may have created.
Use this new function in libxl_fork, to avoid accidental use of a
xenstore connection in both parent and child.
Also, fix the doc comment for libxl_spawn_spawn to have the success
return codes the right way round.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
libxl: Expose functions for helping with subprocesses.
* Expose libxl_fork in libxl_utils.h
* Expose libxl_pipe in libxl_utils.h
* Make libxl_exec put SIGPIPE back (so that libxl callers may
have SIGPIPE ignored)
xl would like to use libxl_fork (which is like fork(2) except that it
logs errors) and also a similar function libxl_pipe. So put these in
libxl_utils.[ch] and use them in libxl.c as appropriate, to avoid
having to duplicate code between xl and libxl.
Also, make sure that subprocesses spawned by libxl have SIGPIPE set
back to SIG_DFL as they are entitled to expect. This means that a
libxl caller which sets SIGPIPE to SIG_IGN is no longer buggy. (This
is relevant for xl migration, because xl would like to be such a
caller.)
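Resetting SIGPIPE matters because a signal that is ignored (SIG_IGN) stays ignored across exec, whereas caught signals are reset to their defaults. A sketch of the child side, with a hypothetical function name (this is not libxl_exec itself):

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>   /* for reaping the child with waitpid() */
#include <unistd.h>

/* Fork and exec argv[0] (searched in PATH) with SIGPIPE back at SIG_DFL.
 * Returns the child's pid to the parent, or -1 if fork() failed. */
static pid_t demo_spawn_default_sigpipe(char *const argv[])
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;                 /* parent (or -1 on error) */
    signal(SIGPIPE, SIG_DFL);       /* SIG_IGN would otherwise survive exec */
    execvp(argv[0], argv);
    perror("execvp");
    _exit(127);
}
```

If the caller has done signal(SIGPIPE, SIG_IGN), running sh -c 'kill -s PIPE $$' through this helper terminates the shell with SIGPIPE, just as it would when started from an ordinary shell.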
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>