mdrabe [Wed, 24 May 2017 20:56:13 +0000 (15:56 -0500)]
Query deleted instance records during _destroy_evacuated_instances
_destroy_evacuated_instances is responsible for cleaning up the
remnants of instance evacuations from the source host. Currently
this method doesn't account for instances that have been deleted
after being evacuated.
John Hua [Thu, 11 Aug 2016 06:48:47 +0000 (14:48 +0800)]
Use physical utilisation for cached images
Since glance images are downloaded and snapshotted before they are used,
only a small proportion of each cached VDI will ever be in use and it will
never grow, so physical utilisation better reflects its real disk usage.
Disks connected to VMs continue to use the virtual utilisation as they
are able to expand.
Dan Smith [Fri, 16 Jun 2017 14:25:40 +0000 (07:25 -0700)]
Fix regression preventing reporting negative resources for overcommit
In Nova prior to Ocata, the scheduler computes available resources for
a compute node, attempting to mirror the same calculation that happens
locally. It does this to determine if a new instance should fit on the
node. If overcommit is being used, some of these numbers can be negative.
In change 016b810f675b20e8ce78f4c82dc9c679c0162b7a we changed the
compute side to never report negative resources, which was an ironic-
specific fix for nodes that are offline. That, however, has been
corrected for ironic nodes in 047da6498dbb3af71bcb9e6d0e2c38aa23b06615.
Since the base change to the resource tracker has caused the scheduler
and compute to do different math, we need to revert it to avoid the
scheduler sending instances to nodes where it believes -NNN is the
lower limit (with overcommit), but the node is reporting zero.
This doesn't actually affect Ocata because of our use of the placement
engine. However, this code is still in master and needs to be backported.
This part of the change actually didn't even have a unit test, so
this patch adds one to validate that the resource tracker will
calculate and report negative resources.
Rikimaru Honjo [Fri, 26 May 2017 05:04:44 +0000 (14:04 +0900)]
Calculate stopped instance's disk sizes for disk_available_least
disk_available_least is a free disk size information of hypervisors.
This is calculated by the following formula:
disk_available_least = <free disk size> - <Total gap between virtual
disk size and actual disk size for all instances>
But stopped instances' virtual disk sizes have not been calculated
since the following patch merged in the Juno cycle:
https://review.openstack.org/#/c/105127
So disk_available_least might be larger than actual free disk size.
As a result, instances might be scheduled beyond the actual free
disk size if stopped instances were on a host.
This patch fixes it: stopped instances' disks are now included in the
calculation.
Matt Riedemann [Thu, 8 Jun 2017 13:35:42 +0000 (09:35 -0400)]
libvirt: handle missing rbd_secret_uuid from old connection info
Change Idcbada705c1d38ac5fd7c600141c2de7020eae25 in Ocata
started preferring Cinder connection info for getting RBD auth
values since Nova needs to be using the same settings as Cinder
for volume auth.
However, that introduced a problem for guest connections made
before that change, where the secret_uuid might not have been
configured on the Cinder side and that's what is stored in the
block_device_mappings.connection_info column and is what we're
checking in _set_auth_config_rbd. Before Ocata this wasn't a
problem because we'd use the Nova configuration values for the
rbd_secret_uuid if set. But since Ocata it is a problem since
we don't consult nova.conf if auth was enabled, but not completely
configured, on the Cinder side.
So this adds a fallback check to set the secret_uuid from
nova.conf if it wasn't set in the connection_info via Cinder
originally. A note is also added to caution about removing
any fallback mechanism on the nova side - something we'd
need to consider before we could likely drop this code.
Co-Authored-By: Tadas Ustinavičius <tadas@ring.lt>
NOTE(mriedem): The unit test is modified slightly to not
pass an instance to the disconnect_volume method as that
was only available starting in Pike: b66b7d4f9d
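A rough sketch of the fallback logic (variable names here illustrate the
structure and are not copied verbatim from the driver):

    # Prefer what Cinder stored in the connection_info; fall back to
    # nova.conf for guests attached before Cinder had a secret configured.
    netdisk_properties = connection_info['data']
    if netdisk_properties.get('auth_enabled'):
        secret_uuid = netdisk_properties.get('secret_uuid')
        if secret_uuid is None:
            secret_uuid = CONF.libvirt.rbd_secret_uuid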
Matt Riedemann [Thu, 9 Feb 2017 23:41:11 +0000 (18:41 -0500)]
libvirt: fix and break up _test_attach_detach_interface
The detach_interface flow in this test was broken because
it wasn't mocking out domain.detachDeviceFlags so the xml
it was expecting to be passed to that method wasn't actually
being verified. The same thing is broken in test
test_detach_interface_device_with_same_mac_address because
it copies the other broken test code.
This change breaks apart the monster attach/detach test method
and converts the detach_interface portion to mock and fixes
the broken assertion.
test_detach_interface_device_with_same_mac_address is just
fixed, not converted to mock.
[BugFix] Release the memory quota for video RAM when deleting an instance.
When creating an instance, the memory quota includes the video RAM, but
deleting the instance did not release the quota consumed by the vram.
Deleting an instance should release that memory quota as well.
Dan Smith [Thu, 20 Apr 2017 16:12:45 +0000 (09:12 -0700)]
Warn the user about orphaned extra records during keypair migration
Operators who have manually deleted Instance records with FK constraints
disabled may have orphaned InstanceExtra records which will prevent the
keypair migration from running. Normally, this violation of the data
model would be something that earns no sympathy. However, that solution
was (incorrectly) offered up as a workaround in bug 1511466 and multiple
deployments have broken their data as a result. Since the experience
is an unhelpful error message and a blocked migration, this patch attempts
to at least highlight the problem, even though it is on the operator to
actually fix the problem.
Artom Lifshitz [Wed, 17 May 2017 00:22:34 +0000 (00:22 +0000)]
Use VIR_DOMAIN_BLOCK_REBASE_COPY_DEV when rebasing
Previously, in swap_volume, the VIR_DOMAIN_BLOCK_REBASE_COPY_DEV flag
was not passed to virDomainBlockRebase. In the case of iSCSI-backed
disks, this caused the XML to change from <source dev=/dev/iscsi/lun>
to <source file=/dev/iscsi/lun>. This was a problem because
/dev/iscsi/lun is not a regular file. This patch passes the
VIR_DOMAIN_BLOCK_REBASE_COPY_DEV flag to virDomainBlockRebase, causing
the correct <source dev=/dev/iscsi/lun> to be generated upon
volume-update.
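For illustration only, the libvirt-python call ends up looking roughly like
this (disk and path values are examples; dom is a virDomain object):

    import libvirt

    flags = (libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
             libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY_DEV |
             libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT)
    # COPY_DEV tells libvirt the copy destination is a block device, so
    # the rebased disk keeps <source dev=...> rather than <source file=...>.
    dom.blockRebase('vdb', '/dev/iscsi/lun', 0, flags)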
Guang Yee [Thu, 18 May 2017 23:38:16 +0000 (16:38 -0700)]
make sure to rebuild claim on recreate
On recreate where the instance is being evacuated to a different node,
we should be rebuilding the claim so the migration context is available
when rebuilding the instance.
Eric Berglund [Wed, 24 May 2017 02:26:28 +0000 (22:26 -0400)]
Add strict option to discover_hosts
This adds a --strict option that can be passed in when calling the
nova-manage cell_v2 discover_hosts command. When this option is used,
the command will only return success if a new host has been found.
In any other case it is considered a failure.
Sylvain Bauza [Tue, 6 Jun 2017 21:28:59 +0000 (23:28 +0200)]
Fix cell0 naming when QS params on the connection
We had a problem when the nova connection string included parameters on the
query string like charset encoding.
Note that the connection string necessarily needs to be RFC1738 compliant as
per SQLAlchemy rules, so it's totally safe to just unquote what the SQLA
helper method gives us as a result.
Also removed a tested connection string since it wasn't RFC1738 compatible.
In some cases, trying to delete a floating IP multiple times within a short
period can trigger an exception because the floating IP deletion
operation is not atomic. If neutronclient's call to delete fails with a
NotFound error, we raise a 404 error to nova's client instead of a 500.
Change-Id: I49ea7e52073148457e794d641ed17d4ef58616f8 Co-Authored-By: Stephen Finucane <sfinucan@redhat.com>
Closes-Bug: #1649852
(cherry picked from commit d99197aece6451013d1de1f08c1af16832ee0e7e)
Matt Riedemann [Thu, 25 May 2017 19:46:22 +0000 (15:46 -0400)]
Avoid lazy-load error when getting instance AZ
When [cinder]cross_az_attach=False (not the default) and doing
boot from volume, the API code validates the BDM by seeing if
the instance and the volume are in the same availability zone.
To get the AZ for the instance, the code is first trying to get
the instance.host value.
In Ocata we stopped creating the instance in the API and moved that
to conductor for cells v2. So the Instance object in this case now
is created in the _provision_instances method and stored in the
BuildRequest object. Since there is no host to set on the instance
yet and the Instance object wasn't populated from DB values, which
before would set the host field on the instance object to None by
default, trying to get instance.host will lazy-load the field and
it blows up with ObjectActionError.
The correct thing to do here is check if the host attribute is set
on the Instance object. There is clear intent to assume host is
not set in the instance since it was using instance.get('host'),
probably from way back in the days when the instance in this case
was a dict. So it's expecting to handle None, but we need to
modernize how that is checked.
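A minimal sketch of the modernized check, assuming the oslo.versionedobjects
obj_attr_is_set() API (the AZ lookup shown is illustrative):

    # The Instance may have been created in _provision_instances and never
    # loaded from the DB, so reading instance.host directly would lazy-load
    # the field and raise ObjectActionError.
    host = instance.host if instance.obj_attr_is_set('host') else None
    if host:
        az = availability_zones.get_host_availability_zone(context, host)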
Huan Xie [Thu, 2 Mar 2017 02:58:02 +0000 (18:58 -0800)]
Make xenapi driver compatible with assert_can_migrate
The newly released XenServer 7.1 has changed its API
VM.assert_can_migrate() to check vif_map; if the vif_map isn't set, it
raises an exception with VIF_NOT_IN_MAP. At this point the destination
compute node doesn't have a dest_vif_map yet, so this patch makes the
xenapi driver compatible with XenServer's changes.
Matt Riedemann [Fri, 26 May 2017 21:48:10 +0000 (17:48 -0400)]
Fix MarkerNotFound when paging and marker was found in cell0
If we're paging over cells and the marker was found in cell0,
we need to null it out so we don't attempt to lookup by marker
from any other cells if there is more room in the limit.
Matt Riedemann [Fri, 26 May 2017 21:21:30 +0000 (17:21 -0400)]
Add recreate functional test for regression bug 1689692
When paging through instances, if the marker is found in cell0
and there are more instances under the limit, we continue paging
through the cell(s) to fill the limit. However, since the marker
was found in cell0 it's not going to be in any other cell database
so we'll end up failing with a marker not found error.
This change adds a functional recreate test for the bug.
The fix will build on this to show when the bug is fixed and the
test will be changed to assert expected normal behavior.
_ensure_console_log_for_instance[1] ensures the VM console.log exists.
A change[2] updated it to succeed if the file exists but nova is unable
to read it (which typically happens when libvirt rewrites its uid/gid)
by ignoring EPERM errors.
It seems the method should ignore EACCES errors instead. EACCES is
raised when an action is not permitted because of insufficient
permissions, whereas EPERM is raised when an action is not permitted at all.
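A minimal sketch of the intended handling (the path variable is illustrative):

    import errno

    try:
        # Make sure console.log exists so libvirt can append to it.
        open(console_log_path, 'a').close()
    except (IOError, OSError) as err:
        if err.errno != errno.EACCES:
            raise
        # The file exists but libvirt has chown'd it away from us;
        # nothing more to do.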
Matt Riedemann [Fri, 26 May 2017 01:35:09 +0000 (21:35 -0400)]
Avoid lazy-loading instance.id when cross_az_attach=False
The instance is no longer created in the API so the id attribute
won't be set, which means when checking the instance AZ against
the volume AZ, if they don't match we can't put the instance.id
in the error message. We shouldn't have been putting the instance
primary key in the error message anyway.
This fixes the bug by using the instance.uuid which is set in
this object in _provision_instances.
Kevin_Zheng [Tue, 23 May 2017 12:28:28 +0000 (20:28 +0800)]
Exclude deleted service records when calling hypervisor statistics
Hypervisor statistics could be incorrect if deleted service records
are not excluded from the DB.
A user may stop the 'nova-compute' service on some
compute nodes and delete the service from nova.
Deleting a 'nova-compute' service soft-deletes
the corresponding db records in both the 'services'
table and the 'compute_nodes' table if the
compute_nodes record is old, i.e. it is linked
to the service record. Modern compute_nodes
records aren't linked to the services table,
so deleting the services record will not delete
the compute_nodes record, and the ResourceTracker
won't recreate the compute_nodes record if the host
and hypervisor_hostname still match the existing
record, but restarting the process after deleting
the service will create a new services table record
with the same host/binary/topic.
If the 'nova-compute' service on that server
restarts, it will automatically add a record
to the 'compute_nodes' table (assuming it was deleted
because it was an old-style record) and a corresponding
record to the 'services' table, and if the host name
of the compute node did not change, the newly
created records in the 'services' and 'compute_nodes'
tables will be identical to the previously soft-deleted
records except for the 'deleted' column.
When calculating hypervisor statistics, the DB layer
joins records across the whole deployment by
comparing the host field selected from the
services table with the host field selected
from the compute_nodes table, and the calculated
results can be multiplied if multiple records
in the services table share the same host field,
which is exactly the scenario the actions above
produce.
Co-Authored-By: Matt Riedemann <mriedem.os@gmail.com>
Change-Id: I9dfa15f69f8ef9c6cb36b2734a8601bd73e9d6b3
Closes-Bug: #1692397
(cherry picked from commit 3d3e9cdd774efe96f468f2bcba6c09a40f5e71d3)
Jackie Truong [Thu, 4 May 2017 16:51:22 +0000 (12:51 -0400)]
Fix decoding of encryption key passed to dmcrypt
This patch fixes the decoding of the encryption key passed to dmcrypt.
During the key management move from Nova to Castellan, in the Newton
release, conversion of the encryption key (from a string to list of
unsigned ints) was removed from the key retrieval method. This patch
updates dmcrypt to decode an encryption key string, rather than a list
of unsigned ints. See the linked bug for more information.
The method used to decode the encryption key has been updated to use
binascii, as done in os-brick [1], to maintain consistency. The key
generation and decoding portions of test_dmcrypt have been updated to
reflect this change and ensure compatibility with both, Python 2 and
Python 3.
Kevin_Zheng [Mon, 15 May 2017 07:02:00 +0000 (15:02 +0800)]
Catch exception.OverQuota when create image for volume backed instance
When creating an image of a volume-backed instance, nova
creates snapshots of all volumes attached to the instance
in Cinder, and if a quota is exceeded in Cinder, an HTTP 500 is
raised; we should capture this error and raise a 403 instead.
Matt Riedemann [Thu, 11 May 2017 22:29:42 +0000 (18:29 -0400)]
Handle special characters in database connection URL netloc
When calling "nova-manage cell_v2 simple_cell_setup" or
"nova-manage cell_v2 map_cell0" without passing in the
--database_connection option, we read the [database]/connection
URL from nova.conf, try to split the URL and then create a
default connection based on the name of the original connection,
so if your cell database's name is 'nova' you'd end up with
'nova_cell0' for the cell0 database name in the URL.
The problem is the database connection URL has credentials in the
netloc and if the password has special characters in it, those can
mess up the URL split, like splitting on '?', which normally denotes
the start of the query string in a URL.
This change handles special characters in the password by using
a nice DB connection URL parsing utility method available in
sqlalchemy to get the database name out of the connection URL string
so we can replace it properly with the _cell0 suffix.
Adds a release note as this bug causes issues when upgrading.
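The core of the approach, sketched with SQLAlchemy's URL parser (not the
exact nova-manage code):

    from sqlalchemy.engine import url as sqla_url

    url = sqla_url.make_url(connection)
    # make_url copes with credentials containing '?', '/' or '@' in the
    # netloc, so the database name is extracted reliably.
    cell0_database = (url.database or 'nova') + '_cell0'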
melanie witt [Tue, 2 May 2017 21:47:12 +0000 (21:47 +0000)]
Use six.text_type() when logging Instance object
We're seeing a trace in gate jobs, for example:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
402: ordinal not in range(128)
when attempting to log an Instance object with a unicode display name.
This resurfaced relatively recently because of the change in devstack
to use the new OSJournalHandler with use_journal=True which is
suspected of causing some deadlock issues [1] unrelated to this bug.
The problem occurs in code that logs an entire Instance object when
the object has a field with unicode characters in it (display_name).
When the object is sent to logging, the UnicodeDecodeError is raised
while formatting the log record here [2]. This implies an implicit
conversion attempt to unicode at this point.
I found that with the Instance object, the conversion to unicode fails
with the UnicodeDecodeError unless the encoding 'utf-8' is explicitly
specified to six.text_type(). And when specifying an encoding to
six.text_type(), the argument to convert must be a string, not an
Instance object, so this does the conversion in two steps as a utility
function:
1. Get the string representation of the Instance with repr()
2. Call six.text_type(instance_repr, 'utf-8') passing the encoding
if not six.PY3
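A sketch of that utility (the real helper name in nova may differ):

    import six

    def _instance_to_text(instance):
        # repr() first, because six.text_type() only accepts an encoding
        # argument when converting a str, not an arbitrary object.
        instance_repr = repr(instance)
        if six.PY3:
            return instance_repr
        return six.text_type(instance_repr, 'utf-8')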
melanie witt [Tue, 16 May 2017 10:25:42 +0000 (10:25 +0000)]
Cache database and message queue connection objects
Recently in the gate we have seen a trace on some work-in-progress
patches:
OperationalError: (pymysql.err.OperationalError)
(1040, u'Too many connections')
and at least one operator has reported that the number of database
connections increased significantly going from Mitaka to Newton.
It was suspected that the increase was caused by creating new oslo.db
transaction context managers on-the-fly when switching database
connections for cells. Comparing the dstat --tcp output of runs of the
gate-tempest-dsvm-neutron-full-ubuntu-xenial job with and without
caching of the database connections showed a difference of 445 active
TCP connections and 1495 active TCP connections, respectively [1].
This adds caching of the oslo.db transaction context managers and the
oslo.messaging transports to avoid creating a large number of objects
that are not being garbage-collected as expected.
NOTE(melwitt): Conflicts caused by the fact that the set_target_cell
function doesn't exist in Ocata and message queue connections were
not stored on the context in Ocata.
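A minimal sketch of the caching approach (the factory call is a placeholder
for the oslo.db/oslo.messaging setup nova actually performs):

    # Cache engines/transports per cell connection string so repeated
    # target-cell switches reuse existing objects instead of creating
    # new ones that never get garbage-collected.
    _CELL_CACHE = {}

    def get_context_manager(db_connection):
        if db_connection not in _CELL_CACHE:
            _CELL_CACHE[db_connection] = create_context_manager(db_connection)
        return _CELL_CACHE[db_connection]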
Kaitlin Farr [Fri, 10 Mar 2017 23:09:49 +0000 (18:09 -0500)]
Parse algorithm from cipher for ephemeral disk encryption
Nova's keymgr implementation used to have default values
for the algorithm and bit length. Castellan does not have
default values, and when Castellan replaced keymgr in
Ib563b0ea4b8b4bc1833bf52bf49a68546c384996, the parameters
to the create_key method were not updated. This change
parses the algorithm from the cipher value and passes it
to Castellan's key manager interface.
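The parsing amounts to splitting the cipher name (a sketch; e.g. a cipher of
'aes-xts-plain64' yields the algorithm 'aes'):

    def _get_algorithm(cipher):
        # Castellan's create_key() has no defaults, so derive the
        # algorithm from the configured cipher value.
        return cipher.split('-', 1)[0] if cipher else None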
Chris Friesen [Mon, 15 May 2017 18:48:37 +0000 (12:48 -0600)]
fix InvalidSharedStorage exception message
The exception message for InvalidSharedStorage is grammatically
complex and ignores the possibility of block migration, which
results in a misleading and confusing message for the user.
When a VM creation fails because the 'gigabytes', 'volumes', or
'per_volume_gigabytes' quota is exceeded, the error message
generated is always the one specific to the 'volumes' quota, which says
"Volume resource quota exceeded". Instead, the error message
should be specific to the quota that was exceeded.
Lee Yarwood [Thu, 20 Apr 2017 18:43:32 +0000 (19:43 +0100)]
libvirt: Always disconnect_volume after rebase failures
Previously failures when rebasing onto a new volume would leave the
volume connected to the compute host. For some volume backends such as
iSCSI the subsequent call to terminate_connection would then result in
leftover devices remaining on the host.
This change simply catches any error associated with the rebase and
ensures that disconnect_volume is called for the new volume prior to
terminate_connection finally being called.
NOTE(lyarwood): Conflict caused by MIN_LIBVIRT_VERSION being 1.2.1 in
stable/ocata making I81c32bbea0f04ca876f2078ef2ae0e1975473584
unsuitable. The is_job_complete polling loop removed by that change in
master is now moved into the try block of this change ensuring we catch
any errors that might be thrown while waiting for the async pivot. The
log.exception message also requires translation in Ocata.
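In outline, the rebase and the polling loop are wrapped so any failure
disconnects the new volume before terminate_connection runs (a sketch from
inside the driver method; helper names follow the libvirt driver loosely):

    try:
        dom.blockRebase(disk_dev, new_path, 0, flags)
        while not self._wait_for_block_job(dom, disk_dev):
            time.sleep(0.5)
    except Exception:
        with excutils.save_and_reraise_exception():
            LOG.exception("Failure rebasing volume %s on %s.",
                          new_path, disk_dev)
            self._disconnect_volume(new_connection_info, disk_dev)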
Steven Webster [Mon, 27 Mar 2017 16:18:23 +0000 (12:18 -0400)]
Fix port update exception when unshelving an instance with PCI devices
It is possible that _update_port_binding_for_instance() is called
without a migration object, such as when a user unshelves an instance.
If the instance has a port(s) with a PCI device binding, the current
logic extracts a pci mapping from old to new devices from the migration
object and migration context. If a 'new' device is not found in the
PCI mapping, an exception is thrown.
In the case of an unshelve, there is no migration object (or migration
context), and as such we have an empty pci mapping.
This fix will only check for a new device if we have a migration object.
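The guard is essentially the following (a sketch; the mapping helper name is
hypothetical):

    # Unshelve has no migration object, so there is no old->new PCI
    # device mapping to build or apply.
    pci_mapping = {}
    if migration:
        pci_mapping = _get_pci_mapping_for_migration(instance, migration)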
Clark Boylan [Wed, 12 Apr 2017 22:46:31 +0000 (15:46 -0700)]
Fix libvirt group selection in live migration test
The custom start scripting for the nova compute service assumed that the
libvirt group is "libvirtd". Unfortunately "libvirtd" is no longer used by
Debian/Ubuntu, which use "libvirt" instead. Add a simple check against
/etc/group for an existing libvirtd group, otherwise use libvirt.
Change-Id: Idbda49587f3b62a0870d10817291205bde0e821e
Depends-On: If2dbc53d082fea779448998ea12b821bd037a14e
(cherry picked from commit ea8463679c1c25b496ffca1be6bd9bd026c29225)
Steven Webster [Wed, 5 Apr 2017 13:05:07 +0000 (09:05 -0400)]
Fix mitaka online migration for PCI devices
Currently, a validation error is thrown if we find any PCI device
records which have not populated the parent_addr column on a nova
upgrade. However, the only PCI records for which a parent_addr
makes sense are those with a device type of 'type-VF' (i.e. an
SRIOV virtual function). PCI records with a device type of 'type-PF'
or 'type-PCI' will not have a parent_addr. If any of those records
are present on upgrade, the validation will fail.
This change checks that the device type of the PCI record is
'type-VF' when making sure the parent_addr has been correctly
populated.
In the db API, when we process filters, we didn't
use deepcopy. For the "tags" and "not-tags" filters
we used pop to get the first tag, filtered the
results, and then joined with the other tags for
later filtering. When we did pop(), the original
value was deleted but the "tags"/"not-tags" key remained.
In the cells scenario, both single-cell (where we
query cell0 and the other cell) and multi-cell,
we have to query all the cells in a loop, so the
tags list in the filter keeps getting popped.
This leads to either an HTTP 500 error (popping
from an empty list) or an incorrect result (when
the number of tags in the list is larger than the
number of cells, no HTTP 500 shows, but the filter
results for each cell differ as each loop pops
one more tag).
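The fix boils down to copying the filters before mutating them (sketch):

    import copy

    # Work on a copy so popping 'tags'/'not-tags' while building this
    # cell's query does not mutate the dict the caller reuses for the
    # next cell.
    filters = copy.deepcopy(filters)
    tags = filters.pop('tags', [])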
Matt Riedemann [Mon, 17 Apr 2017 00:45:25 +0000 (20:45 -0400)]
Add regression test for server filtering by tags bug 1682693
There was a latent bug in the DB API code such that when we
process filters when listing instances, the various tags
filters have their values popped out of the filters dict and
the values (which are lists) for the filter also have the
first item in the list popped out to build the query.
This latent bug was exposed in Newton when we started listing
instances in the API from both cell0 and the main cell database,
because the query to cell0 would pop an item and then it
would not be in the 2nd query to the main cell database. If we
only had one tag in the filter list, we get an IndexError on
the 2nd pop() call.
Note that we also use the build_requests table in the API to
list instances, but there would not be any tagged servers in
that table since a server has to be ACTIVE before you can tag it,
and build_requests only exist until the instance is put into a
cell and starts building there (so it won't be ACTIVE yet).
John Garbutt [Tue, 7 Feb 2017 19:12:50 +0000 (19:12 +0000)]
Stop failed live-migrates getting stuck migrating
When there are failures in driver.cleanup, we are seeing live-migrations
that get stuck in the live-migrating state. While there has been a patch
to stop the cause listed in the bug this closes, there are other
failures (such as a token timeout when talking to cinder or neutron)
that could trigger this same failure mode.
When we hit an error this late in live-migration, it should be a very
rare event, so it's best to just put the instance and migration into an
error state, and help alert both the operator and API user to the
failure that has occurred.
Fix HTTP 500 raised for getConsoleLog for stopped instance
Stopped instances with pty console will not contain
`source_node` information, and in the current
implementation the pty variable used later will
result in an UnboundLocalError, which results in a
500 error out of the API.
Matt Riedemann [Wed, 5 Apr 2017 20:27:41 +0000 (16:27 -0400)]
Perform old-style local delete for shelved offloaded instances
This fixes a regression from some local delete code added for cells v2
where it assumed that if an instance did not have a host, it wasn't
scheduled to a cell yet. That assumption misses the fact that the
instance won't have a host if it was shelved offloaded. And to be
shelved offloaded, the instance had to have first been built on a host
in a cell.
So we simply duplicate the same check as later in the _delete() method
for instance.host or shelved-offloaded to decide what the case is.
Obviously this is all a giant mess of duplicate delete path code that
needs to be unwound, and that's the plan, but first we're fixing
regressions and then we can start rolling this duplication all back
so we can get back to the single local delete flow that we know and love.
Matt Riedemann [Fri, 24 Mar 2017 16:06:07 +0000 (12:06 -0400)]
Set size/status during image create with FakeImageService
This is needed for an upcoming change which introduces
a functional test which shelves a server. Shelving a server
creates a snapshot image and in the real world, glance sets
the size and status attributes on the image when it's created
in glance. Our FakeImageService wasn't doing that, so tests
that are running at the same time with the same fake noauth
credentials are listing images and picking up the shelve
snapshot image which doesn't have size or status set and
that produces a KeyError in the API code.
Matt Riedemann [Sat, 1 Apr 2017 01:08:55 +0000 (21:08 -0400)]
Commit usage decrement after destroying instance
This fixes a regression in Ocata where we were always
decrementing quota usage during instance delete even
if we failed to delete the instance. Now the reservation
is properly committed after the instance is destroyed.
The related functional test is updated to show this working
correctly now.
Matt Riedemann [Fri, 31 Mar 2017 23:56:14 +0000 (19:56 -0400)]
Add regression test for quota decrement bug 1678326
This was spotted from someone validating the fix for
bug 1670627. They reported that even though they failed
to delete an instance in ERROR state that was in cell0,
the quota usage was decremented.
This is because we committed the quota reservation
to decrement the usage before actually attempting to destroy
the instance, rather than upon successful deletion.
The rollback after InstanceNotFound is a noop because of
how the Quotas.rollback method noops if the reservations
were already committed. That is in itself arguably a bug,
but not fixed here, especially since the counting quotas
work in Pike will remove all of the reservations commit and
rollback code.
Matt Riedemann [Wed, 5 Apr 2017 19:12:41 +0000 (15:12 -0400)]
Short-circuit local delete path for cells v2 and InstanceNotFound
When we're going down the local delete path for cells v2 in the API
and instance.destroy() fails with an InstanceNotFound error, we are
racing with a concurrent delete request and know that the instance
is already deleted, so we can just return rather than fall through to
the rest of the code in the _delete() method, like for BDMs and
console tokens.
Matt Riedemann [Fri, 24 Mar 2017 02:07:03 +0000 (22:07 -0400)]
Do not attempt to load osinfo if we do not have os_distro
We get a warning logged every time we try to load up osinfo
with an image metadata that does not have the 'os_distro'
property set. We should be smarter and just not try to load
osinfo at all if we know we cannot get results.
Matt Riedemann [Tue, 21 Mar 2017 17:18:08 +0000 (13:18 -0400)]
libvirt: conditionally set script path for ethernet vif types
Change I4f97c05e2dec610af22a5150dd27696e1d767896 worked around
a change introduced in libvirt 1.3.3 where the script path on
a LibvirtConfigGuestInterface could not be the empty string
because libvirt would literally take that as the path and couldn't
resolve it, when in fact it used to indicate to libvirt that the
script path is a noop. This has been fixed in libvirt 3.1.
On Ubuntu with libvirt<1.3.3, if the script path is None then
it defaults to /etc/qemu-ifup which is blocked by AppArmor.
So this change adds a conditional check when setting the script
path value based on the libvirt version so we can straddle releases.
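A hedged sketch of the conditional (the version constant name is
illustrative):

    # libvirt >= 1.3.3 treats script="" as a literal path, so omit the
    # element there; older libvirt on Ubuntu needs script="" to avoid
    # defaulting to /etc/qemu-ifup, which AppArmor blocks.
    if host.has_min_version(MIN_LIBVIRT_VIF_SCRIPT_NONE):
        conf.script = None
    else:
        conf.script = ''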
Starting from version 1.3.5, libvirt allows setting a vlan tag for macvtap
passthrough mode on SR-IOV VFs. Libvirt also removes any vlan tags that
have been set externally, e.g. by the ip link command.
In order to support older libvirt versions, this code makes the
behaviour backward compatible by checking the libvirt version.
This can be completely removed once the minimum libvirt version increases.
Evgeny Antyshev [Mon, 6 Mar 2017 14:27:06 +0000 (14:27 +0000)]
get_model method missing for Ploop image
Image.get_model is called in the partition injection code,
and currently the partition injection attempt fails unconditionally.
This patch makes use of disk/api.py inject_data's failure tolerance:
it doesn't fail unless the injected data is mandatory.
Balazs Gibizer [Fri, 17 Mar 2017 10:24:49 +0000 (11:24 +0100)]
do not include context to exception notification
The wrap_exception decorator optionally emitted a notification.
Based on the code comments, the original intention was not to include the
context in that notification for security reasons. However, the
implementation did include the context in the payload of the legacy
notification.
Recently we saw circular reference errors during the payload serialization
of this notification. Based on the logs, the only complex data structure
that could cause a circular reference is the context. So this patch
removes the context from the legacy exception notification.
The versioned exception notification is not affected as it does not
contain the args of the decorated function.
ShunliZhou [Fri, 10 Mar 2017 06:05:57 +0000 (14:05 +0800)]
Add populate_retry to schedule_and_build_instances
When booting an instance fails on the compute node, nova will
not retry the boot on another host.
https://review.openstack.org/#/c/319379/ changed the create
instance workflow to call schedule_and_build_instances, which does
not populate the retry info into the filter properties, so nova will
not retry when the boot on the compute node fails. This patch
populates the retry info when schedule_and_build_instances is called.
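The missing step is roughly the following, using the existing scheduler
utility (a sketch of where it fits in schedule_and_build_instances):

    from nova.scheduler import utils as scheduler_utils

    # Record num_attempts and the attempted hosts in filter_properties so
    # a failed boot on the compute node can be rescheduled elsewhere.
    scheduler_utils.populate_retry(filter_properties, instance.uuid)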
melanie witt [Thu, 16 Mar 2017 18:24:23 +0000 (18:24 +0000)]
Fix functional regression/recreate test for bug 1671648
There are a couple of issues with the test:
1. It doesn't consider both hosts from the two compute services
during scheduling.
2. There is a race where sometimes claims.Claim.__init__ won't
be called because if the RT instance_claim runs before
update_available_resource has run, it will create a
claims.NopClaim instead.
This adds the RetryFilter to enabled_filters, adds set_nodes() calls
to set the nodenames of each compute service to match its host,
resulting in consideration of both hosts for scheduling, and stubs
resource_tracker.ResourceTracker.instance_claim instead of
claims.Claim.__init__.
NOTE(mriedem): The conflict is due to this patch coming after cb4ce72f5f092644aa9b84fa58bcb9fd89b6bedc in Pike. Since this
is a fix for the functional test that the bug fix builds on,
we actually want this to come *before* the bug fix backport.
Matt Riedemann [Wed, 15 Mar 2017 20:58:11 +0000 (16:58 -0400)]
Add a functional regression/recreate test for bug 1671648
This adds a test which recreates the regression bug introduced
in Ocata where build retries are not populated when creating
instances in conductor for cells v2.
The change that fixes the bug will go on top of this and modify
the test to show the bug is fixed.
Matt Riedemann [Tue, 14 Mar 2017 16:34:59 +0000 (12:34 -0400)]
Add release notes for 15.0.1 Ocata bug fix release
There are several high severity, high impact bug fixes going
into the upcoming 15.0.1 Ocata release. This change adds release
notes highlighting the most important fixes along with a known
issue for another regression bug that is not yet fixed in Ocata,
but should be shortly.
Matt Riedemann [Thu, 9 Mar 2017 02:51:07 +0000 (21:51 -0500)]
Decrement quota usage when deleting an instance in cell0
When we fail to schedule an instance, e.g. there are no hosts
available, conductor creates the instance in the cell0 database
and deletes the build request. At this point quota usage
has been incremented in the main 'nova' database.
When the instance is deleted, the build request is already gone
so _delete_while_booting returns False and we lookup the instance
in cell0 and delete it from there, but that flow wasn't decrementing
quota usage like _delete_while_booting was.
This change adds the same quota usage decrement handling that
_delete_while_booting performs.
NOTE(mriedem): This change also pulls in some things from
I7de87dce216835729283bca69f0eff59a679b624 which is not being
backported to Ocata since in Pike it solves a slightly different
part of this quota usage issue. In Pike the cell mapping db_connection
is actually stored on the context object when we get the instance
from nova.compute.api.API.get(). So the fix in Pike is slightly
different from Ocata. However, what we need to pull from that Pike
change is:
1. We need to target the cell that the instance lives in to get the
flavor information when creating the quota reservation.
2. We need to change the functional regression test to assert that
the bug is fixed.
The code and tests are adjusted to be a sort of mix between both
changes in Pike without requiring a full backport of the 2nd
part of the fix in Pike.
This adds a functional regression test for bug 1670627.
This is the recreate scenario. Patches that are proposed to
fix the bug will build on top of this and change its assertions
to know when it's properly fixed.
Balazs Gibizer [Thu, 9 Mar 2017 16:28:02 +0000 (17:28 +0100)]
Fix missing instance.delete notification
Change I8742071b55f018f864f5a382de20075a5b444a79 introduced cases where an
instance object is destroyed without the instance.delete notification
being emitted.
This patch adds the necessary notification to restore legacy
behaviour.