David Scott [Fri, 9 Apr 2010 18:57:43 +0000 (19:57 +0100)]
CA-39291: Work around firewalls which kill idle TCP connections by inserting a small empty block into an export every 5s or so.
The failure happens whenever a disk has a lot of zeroes in it: the TCP connection goes idle while the server is scanning for the next non-zero block. Even setting SO_KEEPALIVE on the stunnel sockets and reducing the window probe interval down to 30s didn't fix it. We wish to keep the ability to have a basic client do an export via HTTP GET so we can't add application-level keepalives to the protocol... we must add them to the export itself.
Note this change is backwards compatible. The receiver code expects:
* a common prefix
* a monotonically increasing chunk number
* the first and last blocks to be the same size and included verbatim (even if all zeroes)
* blocks of zeroes the same size as the first block represented as gaps in the increasing chunk number sequence
Therefore extra files of length 0 included in the stream will be ignored, provided they
* share the common prefix and chunk numbering scheme
* are not the first or last blocks
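The compatibility argument can be illustrated with a small receiver model. This is only a sketch, not the xapi import code: the `reassemble` function and the `(chunk_number, data)` representation are assumptions made for illustration.

```python
def reassemble(chunks):
    """Rebuild a disk image from (chunk_number, data) pairs.

    Hypothetical model of the receiver rules listed above: gaps in the
    chunk numbering stand for all-zero blocks, and zero-length chunks
    (the keepalive blocks) contribute no data.
    """
    first_number, first_data = chunks[0]
    block_size = len(first_data)      # the first block is always sent verbatim
    image = bytearray()
    next_block = first_number         # next block index still to be written
    last_number = None
    for number, data in chunks:
        if last_number is not None and number <= last_number:
            raise ValueError("chunk numbers must increase monotonically")
        last_number = number
        if not data:
            continue                  # keepalive: nothing to write
        # Every skipped chunk number represents one block of zeroes.
        image.extend(bytes(block_size) * (number - next_block))
        image.extend(data)
        next_block = number + 1
    return bytes(image)
```

In this model a stream with an empty chunk inserted in the middle of a gap produces the same image as the stream without it.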
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
David Scott [Fri, 28 May 2010 14:25:56 +0000 (15:25 +0100)]
Move the generation of ocaml/util/version.ml into the Makefile so that it can be done before the main build. This unbreaks the rpm build since 'hg id' will fail when run in a plain non-repo directory of sources.
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
Ewan Mellor [Thu, 29 Apr 2010 09:22:35 +0000 (10:22 +0100)]
CA-37279: Stunnel error from WLB connect "No route to host"
Understand "No route to host" as an expected error message from stunnel, and turn that into API error WLB_UNKNOWN_HOST if we see it when contacting WLB.
This is two patches, one for xen-api-libs.hg, and one for xen-api.hg.
Signed-off-by: Ewan Mellor <ewan.mellor@eu.citrix.com>
Zheng Li [Tue, 20 Apr 2010 18:18:53 +0000 (19:18 +0100)]
Cope with the stunnel zombie process issue.
Some versions of stunnel (old versions, or some new ones built with particular Linux distribution versions) have a zombie process issue when called from xe. When it happens, the main stunnel process won't exit for a long time after xe closes its communication channel; it seems to be waiting for its child processes, which are however stuck in "defunct" status. The issue was also reported on the server side when stunnel is called by the xapi daemon, so it would be useful to set the Stunnel.disconnect arguments properly there as well. However this demands more work to identify which setting is safe for each occurrence, so I'll leave it for the future. Moreover, xe currently doesn't wait for the second stunnel connection process (HTTP GET/PUT); this should also be fixed in the future. The question is: does xe really need to care about the status of stunnel in most cases? If not, why not use the double-fork trick everywhere? If yes, an SSL library might be more appropriate than third-party tools such as stunnel.
Ian Campbell [Tue, 20 Apr 2010 16:51:08 +0000 (17:51 +0100)]
The OpenVswitch project is trying to standardise on using OpenVswitch rather than simply vswitch so accept both in network.conf and report the mode as openvswitch.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Yang Hongyang [Tue, 20 Apr 2010 13:53:51 +0000 (14:53 +0100)]
Add a strict check on the VCPUs-params param-key: according to the Xen Cloud Platform Administrator's Guide, only the 'weight', 'cap' and 'mask' param-keys are available.
Zheng Li [Sat, 10 Apr 2010 09:15:21 +0000 (10:15 +0100)]
Fix a bug in the previous -debug-on-fail patch
I didn't realize that xe would initiate an auxiliary stunnel link in certain operations, so some of the global variables we stored for the main link might be overwritten unexpectedly. Protection against this is now added. This is a typical example of how global mutable variables are bad, but I don't really regret having chosen such a solution, because it demanded the least code modification and is hence safer for such complex software in that sense.
David Scott [Mon, 12 Apr 2010 13:44:41 +0000 (14:44 +0100)]
CA-40134: when looking up a domid, always look up the domain by uuid rather than relying on the value in the master's database (which could be out of sync).
In particular, if the master is suddenly powercycled then recent domid updates might be lost. This results in VMs which are impossible to shut down.
Looking up the domain by uuid should be strictly better than using the master's version: both are moving targets (eg over migrate) and so we already rely on the per-VM lock to protect (most) accesses.
The only problem is the code is slightly inefficient: it still has to contact the master to look up the VM's uuid and then has to list all domains on the host to build up a table of uuid -> domid. This can be improved later.
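The lookup described here amounts to building a table from a listing of the host's domains. A minimal sketch, assuming the listing is available as `(domid, uuid)` pairs; the function name and data shape are hypothetical, not the xapi ones:

```python
def domid_of_uuid(domains, uuid):
    """Return the domid of the domain with the given uuid, or None.

    `domains` is assumed to be a fresh listing of all domains on the
    local host as (domid, uuid) pairs, so the answer never depends on
    a possibly stale value in the master's database.
    """
    table = {dom_uuid: domid for domid, dom_uuid in domains}
    return table.get(uuid)
```

Rebuilding the table on every call is the inefficiency the message mentions.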
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
Rob Hoes [Fri, 9 Apr 2010 09:56:14 +0000 (10:56 +0100)]
CA-38844: CPUID maskability
Distinguish between not maskable ("no") (CPU does not have Intel FlexMigration or AMD Extended Migration), only base features are maskable ("base"), and base+extended features are maskable ("full") in Host.cpu_info:maskable.
Note: this patch should go in together with the flexmigration patch in xen-api-libs.hg.
David Scott [Wed, 7 Apr 2010 20:09:16 +0000 (21:09 +0100)]
CA-39952: explicitly blank {allowed,current}_operations fields in VM exports. The values stored here are redundant sources of potential import failures on older s/w versions.
The operation enums are often extended, rendering them unparsable by older software versions. Although we don't guarantee that a new export can be imported on an old host, it should nevertheless almost always work (apart from this).
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
David Scott [Wed, 7 Apr 2010 16:43:56 +0000 (17:43 +0100)]
CA-38463: backend vifs now have proper "device" symlinks in /sys so to tell the difference between them and a real "physical" interface, look to see whether they link to devices/xen-backend/...
This prevents PIF.scan from accidentally introducing vifX.Y as PIFs...
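A minimal sketch of such a check, assuming the interfaces are found under `/sys/class/net` (the message only says `/sys`) and using hypothetical function names:

```python
import os


def links_to_xen_backend(target):
    """True if a "device" symlink target points into devices/xen-backend/."""
    return "devices/xen-backend/" in target


def is_backend_vif(name, sysfs_root="/sys/class/net"):
    """True if interface `name` has a "device" symlink into xen-backend.

    The sysfs_root default is an assumption made for illustration.
    Interfaces without a "device" symlink are treated as not being
    backend vifs.
    """
    device = os.path.join(sysfs_root, name, "device")
    if not os.path.islink(device):
        return False
    return links_to_xen_backend(os.readlink(device))
```

A scan would then skip any interface for which `is_backend_vif` returns true.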
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
David Scott [Wed, 7 Apr 2010 16:06:31 +0000 (17:06 +0100)]
CA-39972: try extra hard to kill stunnel in 'service xapissl stop'
Previously we sent one SIGTERM and waited for up to 3 minutes. In a quick test, out of 1000 back-to-back 'service xapissl restart' calls, one took the full 3 minutes, as if the signal was ignored.
Now we send additional SIGTERMs as we go around the loop, one per second. In a quick test, 10000 back-to-back 'service xapissl restart' calls completed without any taking more than a few seconds.
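The retry loop can be sketched as follows. This is an illustration rather than the init script itself: the function name, the injected `send_signal` and `is_running` callbacks, and the 180-second limit carried over from the old behaviour are assumptions.

```python
import signal
import time


def stop_with_retries(pid, send_signal, is_running,
                      max_seconds=180, sleep=time.sleep):
    """Send SIGTERM once per second until the process is gone.

    Returns True if the process exited within max_seconds, else False.
    send_signal(pid, sig) and is_running(pid) are injected so that the
    loop can be exercised without a real process.
    """
    for _ in range(max_seconds):
        if not is_running(pid):
            return True
        # Re-send on every pass in case an earlier signal was lost.
        send_signal(pid, signal.SIGTERM)
        sleep(1)
    return not is_running(pid)
```

With a real process, `os.kill` could be passed as `send_signal`.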
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
David Scott [Wed, 7 Apr 2010 16:06:30 +0000 (17:06 +0100)]
CA-39887: Totally throw away VM RRD updates when domains are paused.
Previously we kept the RRD data but silenced the dirty_memory signal. This would allow the following sequence:
1. domain created
     domain = paused; memory = 0
     <-- RRD updated with memory = 0
     <-- memory not considered 'dirty' because domain is paused
2. domain built
     domain = paused; memory = some interesting value
     <-- RRD updated with memory = some interesting value
     <-- memory not considered 'dirty' because domain is paused
3. domain unpaused
     domain = running; memory = some interesting value
     <-- RRD updated with memory = some interesting value
     <-- memory not considered 'dirty' because memory value has *not* changed in the RRD
Now we ignore the RRD updates when the domain is paused. This means that, when the domain is finally unpaused, the new memory value will always be considered to have changed.
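The new behaviour can be modelled with a small tracker. The class and method names are hypothetical; the sketch only shows why dropping updates for paused domains makes the first update after unpausing count as a change.

```python
class MemoryTracker:
    """Toy model of the per-VM memory value and its 'dirty' flag."""

    def __init__(self):
        self.last_memory = {}   # vm uuid -> last recorded memory value
        self.dirty = set()      # vm uuids whose memory value has changed

    def update(self, uuid, memory, paused):
        if paused:
            # Throw the update away entirely: nothing is recorded, so
            # the value seen after unpausing will differ from the
            # stored one.
            return
        if self.last_memory.get(uuid) != memory:
            self.last_memory[uuid] = memory
            self.dirty.add(uuid)
```

Replaying the three steps above (paused with 0, paused with a real value, running with the same value) leaves the VM marked dirty.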
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
David Scott [Wed, 7 Apr 2010 10:46:46 +0000 (11:46 +0100)]
CA-39401: on server boot only start VMs with auto_poweron=true which have their affinity set to the local host.
The auto_poweron=true mechanism made sense when all pools were of size 1. It's a bit odd with multi-host pools: surely these days you would HA protect VMs you care about?
This change makes the auto_poweron mechanism a bit less inexplicable. At some point we ought to merge this with HA.
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
David Scott [Wed, 7 Apr 2010 10:42:09 +0000 (11:42 +0100)]
CA-34888: the Java bindings tests expect there to be enough memory to install from a Win7 (64 bit) template plus a bit. Bump up the default size of the simulated host in xiu.
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
David Scott [Wed, 7 Apr 2010 10:36:23 +0000 (11:36 +0100)]
CA-39745: When a failure to hotplug a disk is detected, check whether the underlying device was a physical CDROM with an empty drive. In this case throw HOST_CD_DRIVE_EMPTY.
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
David Scott [Wed, 7 Apr 2010 10:35:06 +0000 (11:35 +0100)]
CA-39745: watch /local/domain/%d/error/backend/vbd/%d/%d/error for informative errors such as "2 creating vbd structure". This prevents us timing out after 20 minutes if something goes wrong with a blkback device.
A further patch will be needed to provide some decent error diagnosis.
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
Rob Hoes [Wed, 7 Apr 2010 10:34:09 +0000 (11:34 +0100)]
CA-39461: Fix Network.pool_introduce
The bug in Network.pool_introduce led to problems when adding a host to a pool, when this host has a Network with a name different from any Network on the pool. Networks and PIFs were not properly recreated on the pool in this case.
David Scott [Tue, 6 Apr 2010 09:57:56 +0000 (10:57 +0100)]
CA-39889: hotplug PCI devices into guests in the order specified in the (original) other-config key.
There were two problems:
1. xapi was accidentally reversing the list provided in the other_config key (a side-effect of a fold)
2. when listing the PCI devices present in xenstore, the Device.PCI.list function was passing the device order number back in the place reserved for "PCI bus ID". This caused the devices to be hotplugged in reverse over reboot.
Now the behaviour is:
1. devices are hotplugged in the order they are found in the other_config key
2. the hotplug ordering is stable across start/internal reboot/external reboot
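The first problem is easy to reproduce in any language with a fold. A minimal sketch, assuming a comma-separated device list (the exact format of the other_config value is not given here) and hypothetical function names:

```python
from functools import reduce


def parse_reversed(spec):
    """Prepending inside a fold returns the devices in reverse order."""
    return reduce(lambda acc, dev: [dev] + acc, spec.split(","), [])


def parse_in_order(spec):
    """Appending inside the fold preserves the order given by the user."""
    return reduce(lambda acc, dev: acc + [dev], spec.split(","), [])
```

Prepending to an accumulator is the idiomatic cheap operation on linked lists, which is how a fold can reverse a list by accident.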
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
[CA-39743] Improvements to the wait_xen_free_mem function.
Improvements include:
* We read the values of free_memory and scrub_memory atomically (rather than separately).
* We exit early if (free_memory < required_memory) and (scrub_memory = 0) as it's unlikely that more will become free.
* We time out more quickly than before (64 seconds rather than 256 seconds).
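The three improvements fit into one polling loop. This is a sketch under assumptions, not the xapi function: `read_memory` is a hypothetical callback returning both values from a single query, and the one-second polling interval is invented for illustration.

```python
import time


def wait_for_free_memory(read_memory, required,
                         timeout_seconds=64, sleep=time.sleep):
    """Wait until enough memory is free; return True on success.

    read_memory() is assumed to return (free, scrub) from one query,
    so the two values are always consistent with each other.
    """
    for _ in range(timeout_seconds):
        free, scrub = read_memory()
        if free >= required:
            return True
        if scrub == 0:
            # Nothing is left to scrub, so more memory is unlikely to
            # become free: give up early.
            return False
        sleep(1)
    return False
```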
Signed-off-by: Jonathan Knowles <jonathan.knowles@eu.citrix.com>
Marcus Granado [Thu, 1 Apr 2010 10:53:47 +0000 (11:53 +0100)]
CA-38974: improve parameter reporting during obj.destroy in auditlog
Obj.destroy has the nasty side-effect of removing the object name from the database.
Therefore, for .destroy actions, no obj name is shown in the audit record parameters.
This patch caches the object name before Obj.destroy is called, working around this
side-effect.
Signed-off-by: Marcus Granado <marcus.granado@eu.citrix.com>
David Scott [Thu, 1 Apr 2010 10:52:20 +0000 (11:52 +0100)]
CA-38723: make "xe snapshot-" commands advertise "uuid" parameters (rather than "snapshot-uuid") for consistency with other commands but still secretly accept "snapshot-uuid" to avoid breaking anything.
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
Jon Ludlam [Mon, 29 Mar 2010 09:11:36 +0000 (10:11 +0100)]
CA-36244: Broadcast the management_ip_cond condition variable unconditionally in 'on_dom0_networking_change'. The only thing waiting on this is immediately above, where we're waiting for a management IP. If the IP obtained by DHCP was the same as the one we had before, the old logic wouldn't have signalled, and therefore the loop would never have woken.
Signed-off-by: Jon Ludlam <Jonathan.Ludlam@eu.citrix.com>
Rob Hoes [Mon, 29 Mar 2010 09:10:00 +0000 (10:10 +0100)]
CA-39188: Change illegal MTU value to 1500 when creating Network
There is already a default of 1500 in the datamodel for the Network.MTU field. However, language bindings tend to fill in their own default of 0 (which is illegal), when the field is not specified by the user.
Rob Hoes [Mon, 29 Mar 2010 09:10:00 +0000 (10:10 +0100)]
Add CPUID features whitelisting for Pool.join.
This allows the user to exclude CPUID features from the Pool.join checks by setting a mask in Pool.other_config:cpuid_feature_mask. When comparing the CPUID features of the pool and the joining host for equality, this mask is applied before the comparison. The format is the same as the format of the feature field in Host.cpu_info. The default is "ffffff7f-ffffffff-ffffffff-ffffffff", which defines the EST feature, bit 7 of the base ecx flags, as "don't care".
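The masked comparison can be sketched as follows, assuming the feature string is four hyphen-separated 32-bit hexadecimal words as in the default mask above. The function names are hypothetical.

```python
DEFAULT_MASK = "ffffff7f-ffffffff-ffffffff-ffffffff"


def parse_features(text):
    """Parse a feature string such as the default mask into integers."""
    return [int(word, 16) for word in text.split("-")]


def features_equal(pool, host, mask=DEFAULT_MASK):
    """Compare two feature strings, ignoring bits cleared in the mask."""
    words = zip(parse_features(pool), parse_features(host),
                parse_features(mask))
    return all((p & m) == (h & m) for p, h, m in words)
```

With the default mask, two hosts that differ only in bit 7 of the first word compare as equal; with an all-ones mask they do not.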
Jon Ludlam [Wed, 24 Mar 2010 11:08:12 +0000 (11:08 +0000)]
CA-32170: On start, xapi GCs messages. If any of those fails to parse, the exception bubbles up to the top and causes xapi to exit. This patch fixes that bad behaviour.
Signed-off-by: Jon Ludlam <Jonathan.Ludlam@eu.citrix.com>
David Scott [Wed, 24 Mar 2010 11:06:53 +0000 (11:06 +0000)]
CA-36934: in the API call patch "precheck", if the patch has been downloaded from the master then delete the patch afterwards. This makes leaks of patch files less likely (although they're going to be rare since there aren't going to be many patches)
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
David Scott [Wed, 17 Mar 2010 17:57:21 +0000 (17:57 +0000)]
CA-36410: When importing VM metadata, if the VM is Suspended then preserve this. If the VM was in any other state (eg Running) then it can't be any more: set it to Halted.
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
David Scott [Wed, 17 Mar 2010 15:33:01 +0000 (15:33 +0000)]
CA-38030: improve the RRD behaviour when the system clock moves backwards (earlier)
Moving the clock backwards (earlier) is a bad thing to do: it will cause many time-related functions to fail. Most things can be cleared with a reboot... except RRDs which are persistent.
When the clock moves towards the future, lots of recent RRD data is lost as it is averaged and then shuffled into more approximate RRDs.
When the clock moves towards the past, the previous behaviour was to reject old updates as invalid. This unfortunately meant that fields driven from RRDs (eg memory-actual) froze (often at 0). The new behaviour is to:
* accept the update even though we know something is wrong
* log a warning message
* generate a single HOST_CLOCK_WENT_BACKWARDS ALERT.
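A minimal sketch of this policy. The class name and the `send_alert` callback are hypothetical, and "single" is interpreted here as once per tracker lifetime, which is an assumption.

```python
import logging


class RrdUpdateClock:
    """Accept every update, but notice when time moves backwards."""

    def __init__(self, send_alert):
        self.last_timestamp = None
        self.alert_sent = False
        self.send_alert = send_alert   # hypothetical alert callback

    def accept(self, timestamp):
        went_back = (self.last_timestamp is not None
                     and timestamp < self.last_timestamp)
        if went_back:
            logging.warning("clock moved backwards: %s -> %s",
                            self.last_timestamp, timestamp)
            if not self.alert_sent:
                self.send_alert("HOST_CLOCK_WENT_BACKWARDS")
                self.alert_sent = True
        # The update is accepted even though something is wrong, so
        # fields driven from the RRD keep moving.
        self.last_timestamp = timestamp
        return True
```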
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
David Scott [Tue, 16 Mar 2010 22:29:33 +0000 (22:29 +0000)]
CA-31448: unconditionally read all the HTTP headers resulting from an HTTP PUT, not just the first line.
xapi HTTP handlers (eg the patch upload one) expect to be able to write a complete set of HTTP response headers, without someone closing the socket and triggering an EPIPE.
Signed-off-by: David Scott <dave.scott@eu.citrix.com>
Alex Zeffertt [Thu, 14 Jan 2010 13:22:53 +0000 (13:22 +0000)]
[CP-1552] SLES 10 SP3 x86_64 support
Add template for SLES 10 SP3 x86_64.
Note: We cannot add a template for the 32 bit version since between
SP2 and SP3 Novell removed the PAE guest kernel from the installer
ISO, leaving only the non-PAE guest kernel. This is not a problem for
them since they have compiled their Xen 3.2 hypervisor to support
non-PAE PV guests (instead of PAE PV guests.) However, since v3.2 this
compile option has been removed from Xen.
This means that we cannot boot the SLES 10 SP3 i386 installer ISO. To
install SLES 10 SP3 i386 it is necessary to install SP2 and upgrade.
Jonathan Knowles [Tue, 16 Mar 2010 14:53:56 +0000 (14:53 +0000)]
CA-38971: Sometimes Windows guests take a long time to shut down cleanly. This patch lengthens the timeout for clean shutdown requests from 20 minutes to 60 minutes.
Windows guests sometimes have genuine reasons to take longer than 20 minutes to shut down.
Possible reasons include:
* there are updates to install.
* the guest is overloaded.
* the host is overloaded.
Signed-off-by: Jonathan Knowles <jonathan.knowles@eu.citrix.com>
Acked-by: David Scott <dave.scott@eu.citrix.com>
Daniel Stodden [Tue, 16 Mar 2010 14:42:59 +0000 (14:42 +0000)]
CA-35059: LVHDRT/TC8713 wants a noninteractive PV guest.
Password init on the console, during an interactive first boot, blocks
the xs-tools daemon start, which prevents a VM.suspend, which TC8713
relies on.
The behavioral difference sneaked in with xgts.hg cset
559:7d7540a5ff43 (utilities support for Debian Squeeze). Fixed by
setting the "noninteractive" PV attribute.
Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
Ewan Mellor [Tue, 16 Mar 2010 14:40:52 +0000 (14:40 +0000)]
CA-38109: HostInstallation/TCQuicktest: quicktest failed: SR_OPERATION_NOT_SUPPORTED: OpaqueRef:2c96e6a1-afdc-19a6-b6cb-37af7e90e504
CP-1603: CR-50: R4. Include REQ259 - Writable ISO SRs
Update quicktest to match new writable ISO SR support. The XenServer
Tools SR is special-cased to prevent writing even if the ISO SR backend
itself allows it, and quicktest needed to be updated to match this.
Also, the VM import code has a special case to protect importing into
ISO SRs, so quicktest needs to know about this, too.
Signed-off-by: Ewan Mellor <ewan.mellor@eu.citrix.com>