mdrabe [Wed, 24 May 2017 20:56:13 +0000 (15:56 -0500)]
Query deleted instance records during _destroy_evacuated_instances
_destroy_evacuated_instances is responsible for cleaning up the
remnants of instance evacuations from the source host. Currently
this method doesn't account for instances that have been deleted
after being evacuated.
John Hua [Thu, 11 Aug 2016 06:48:47 +0000 (14:48 +0800)]
Use physical utilisation for cached images
Since glance images are downloaded and snapshotted before they are used,
only a small proportion of each cached VDI will ever be in use and it will
never grow, so physical utilisation better reflects its real disk usage.
Disks connected to VMs continue to use the virtual utilisation as they
are able to expand.
Dan Smith [Fri, 16 Jun 2017 14:25:40 +0000 (07:25 -0700)]
Fix regression preventing reporting negative resources for overcommit
In Nova prior to Ocata, the scheduler computes available resources for
a compute node, attempting to mirror the same calculation that happens
locally. It does this to determine if a new instance should fit on the
node. If overcommit is being used, some of these numbers can be negative.
In change 016b810f675b20e8ce78f4c82dc9c679c0162b7a we changed the
compute side to never report negative resources, which was an ironic-
specific fix for nodes that are offline. That, however, has been
corrected for ironic nodes in 047da6498dbb3af71bcb9e6d0e2c38aa23b06615.
Since the base change to the resource tracker has caused the scheduler
and compute to do different math, we need to revert it to avoid the
scheduler sending instances to nodes where it believes -NNN is the
lower limit (with overcommit), but the node is reporting zero.
This doesn't actually affect Ocata because of our use of the placement
engine. However, this code is still in master and needs to be backported.
This part of the change actually didn't even have a unit test, so
this patch adds one to validate that the resource tracker will
calculate and report negative resources.
Rikimaru Honjo [Fri, 26 May 2017 05:04:44 +0000 (14:04 +0900)]
Calculate stopped instance's disk sizes for disk_available_least
disk_available_least is a free disk size information of hypervisors.
This is calculated by the following formula:
disk_available_least = <free disk size> - <Total gap between virtual
disk size and actual disk size for all instances>
But stopped instances' virtual disk sizes have not been calculated
since the following patch merged in the Juno cycle:
https://review.openstack.org/#/c/105127
So disk_available_least might be larger than actual free disk size.
As a result, instances might be scheduled beyond the actual free
disk size if stopped instances were on a host.
This patch fixes it: stopped instances' disks are now included in the
calculation.
Matt Riedemann [Thu, 8 Jun 2017 13:35:42 +0000 (09:35 -0400)]
libvirt: handle missing rbd_secret_uuid from old connection info
Change Idcbada705c1d38ac5fd7c600141c2de7020eae25 in Ocata
started preferring Cinder connection info for getting RBD auth
values since Nova needs to be using the same settings as Cinder
for volume auth.
However, that introduced a problem for guest connections made
before that change, where the secret_uuid might not have been
configured on the Cinder side and that's what is stored in the
block_device_mappings.connection_info column and is what we're
checking in _set_auth_config_rbd. Before Ocata this wasn't a
problem because we'd use the Nova configuration values for the
rbd_secret_uuid if set. But since Ocata it is a problem since
we don't consult nova.conf if auth was enabled, but not completely
configured, on the Cinder side.
So this adds a fallback check to set the secret_uuid from
nova.conf if it wasn't set in the connection_info via Cinder
originally. A note is also added to caution about removing
any fallback mechanism on the nova side - something we'd
need to consider before we could likely drop this code.
Co-Authored-By: Tadas Ustinavičius <tadas@ring.lt>
NOTE(mriedem): The unit test is modified slightly to not
pass an instance to the disconnect_volume method as that
was only available starting in Pike: b66b7d4f9d
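A rough sketch of the fallback logic (variable names here illustrate the
structure and are not copied verbatim from the driver):

    # Prefer what Cinder stored in the connection_info; fall back to
    # nova.conf for guests attached before Cinder had a secret configured.
    netdisk_properties = connection_info['data']
    if netdisk_properties.get('auth_enabled'):
        secret_uuid = netdisk_properties.get('secret_uuid')
        if secret_uuid is None:
            secret_uuid = CONF.libvirt.rbd_secret_uuid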
Matt Riedemann [Thu, 9 Feb 2017 23:41:11 +0000 (18:41 -0500)]
libvirt: fix and break up _test_attach_detach_interface
The detach_interface flow in this test was broken because
it wasn't mocking out domain.detachDeviceFlags so the xml
it was expecting to be passed to that method wasn't actually
being verified. The same thing is broken in test
test_detach_interface_device_with_same_mac_address because
it copies the other broken test code.
This change breaks apart the monster attach/detach test method
and converts the detach_interface portion to mock and fixes
the broken assertion.
test_detach_interface_device_with_same_mac_address is just
fixed, not converted to mock.
[BugFix] Release the memory quota for video RAM when deleting an instance.
When creating an instance, the memory quota includes the video RAM, but
deleting the instance did not release the quota consumed by the vram.
Deleting an instance should release that memory quota as well.
Dan Smith [Thu, 20 Apr 2017 16:12:45 +0000 (09:12 -0700)]
Warn the user about orphaned extra records during keypair migration
Operators who have manually deleted Instance records with FK constraints
disabled may have orphaned InstanceExtra records which will prevent the
keypair migration from running. Normally, this violation of the data
model would be something that earns no sympathy. However, that solution
was (incorrectly) offered up as a workaround in bug 1511466 and multiple
deployments have broken their data as a result. Since the experience
is an unhelpful error message and a blocked migration, this patch attempts
to at least highlight the problem, even though it is on the operator to
actually fix the problem.
Artom Lifshitz [Wed, 17 May 2017 00:22:34 +0000 (00:22 +0000)]
Use VIR_DOMAIN_BLOCK_REBASE_COPY_DEV when rebasing
Previously, in swap_volume, the VIR_DOMAIN_BLOCK_REBASE_COPY_DEV flag
was not passed to virDomainBlockRebase. In the case of iSCSI-backed
disks, this caused the XML to change from <source dev=/dev/iscsi/lun>
to <source file=/dev/iscsi/lun>. This was a problem because
/dev/iscsi/lun is not a regular file. This patch passes the
VIR_DOMAIN_BLOCK_REBASE_COPY_DEV flag to virDomainBlockRebase, causing
the correct <source dev=/dev/iscsi/lun> to be generated upon
volume-update.
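For illustration only, the libvirt-python call ends up looking roughly like
this (disk and path values are examples; dom is a virDomain object):

    import libvirt

    flags = (libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
             libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY_DEV |
             libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT)
    # COPY_DEV tells libvirt the copy destination is a block device, so
    # the rebased disk keeps <source dev=...> rather than <source file=...>.
    dom.blockRebase('vdb', '/dev/iscsi/lun', 0, flags)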
Guang Yee [Thu, 18 May 2017 23:38:16 +0000 (16:38 -0700)]
make sure to rebuild claim on recreate
On recreate where the instance is being evacuated to a different node,
we should be rebuilding the claim so the migration context is available
when rebuilding the instance.
Eric Berglund [Wed, 24 May 2017 02:26:28 +0000 (22:26 -0400)]
Add strict option to discover_hosts
This adds a --strict option that can be passed in when calling the
nova-manage cell_v2 discover_hosts command. When this option is used,
the command will only return success if a new host has been found.
In any other case it is considered a failure.
Sylvain Bauza [Tue, 6 Jun 2017 21:28:59 +0000 (23:28 +0200)]
Fix cell0 naming when QS params on the connection
We had a problem when the nova connection string included parameters on the
query string like charset encoding.
Note that the connection string necessarily needs to be RFC1738 compliant as
per SQLAlchemy rules, so it's totally safe to just unquote what the SQLA
helper method gives us as a result.
Also removed a tested connection string since it wasn't RFC1738 compatible.
In some cases, trying to delete a floating IP multiple times within a short
period can trigger an exception because the floating IP deletion
operation is not atomic. If neutronclient's call to delete fails with a
NotFound error, we raise a 404 error to nova's client instead of a 500.
Change-Id: I49ea7e52073148457e794d641ed17d4ef58616f8 Co-Authored-By: Stephen Finucane <sfinucan@redhat.com>
Closes-Bug: #1649852
(cherry picked from commit d99197aece6451013d1de1f08c1af16832ee0e7e)
Matt Riedemann [Thu, 25 May 2017 19:46:22 +0000 (15:46 -0400)]
Avoid lazy-load error when getting instance AZ
When [cinder]cross_az_attach=False (not the default) and doing
boot from volume, the API code validates the BDM by seeing if
the instance and the volume are in the same availability zone.
To get the AZ for the instance, the code is first trying to get
the instance.host value.
In Ocata we stopped creating the instance in the API and moved that
to conductor for cells v2. So the Instance object in this case now
is created in the _provision_instances method and stored in the
BuildRequest object. Since there is no host to set on the instance
yet and the Instance object wasn't populated from DB values, which
before would set the host field on the instance object to None by
default, trying to get instance.host will lazy-load the field and
it blows up with ObjectActionError.
The correct thing to do here is check if the host attribute is set
on the Instance object. There is clear intent to assume host is
not set in the instance since it was using instance.get('host'),
probably from way back in the days when the instance in this case
was a dict. So it's expecting to handle None, but we need to
modernize how that is checked.
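A minimal sketch of the modernized check, assuming the oslo.versionedobjects
obj_attr_is_set() API (the AZ lookup shown is illustrative):

    # The Instance may have been created in _provision_instances and never
    # loaded from the DB, so reading instance.host directly would lazy-load
    # the field and raise ObjectActionError.
    host = instance.host if instance.obj_attr_is_set('host') else None
    if host:
        az = availability_zones.get_host_availability_zone(context, host)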
Huan Xie [Thu, 2 Mar 2017 02:58:02 +0000 (18:58 -0800)]
Make xenapi driver compatible with assert_can_migrate
The newly released XenServer 7.1 has changed its API
VM.assert_can_migrate() to check vif_map; if the vif_map isn't set, it
raises an exception with VIF_NOT_IN_MAP. At this point the destination
compute node doesn't have a dest_vif_map yet, so this patch makes the
xenapi driver compatible with XenServer's changes.
Matt Riedemann [Fri, 26 May 2017 21:48:10 +0000 (17:48 -0400)]
Fix MarkerNotFound when paging and marker was found in cell0
If we're paging over cells and the marker was found in cell0,
we need to null it out so we don't attempt to lookup by marker
from any other cells if there is more room in the limit.
Matt Riedemann [Fri, 26 May 2017 21:21:30 +0000 (17:21 -0400)]
Add recreate functional test for regression bug 1689692
When paging through instances, if the marker is found in cell0
and there are more instances under the limit, we continue paging
through the cell(s) to fill the limit. However, since the marker
was found in cell0 it's not going to be in any other cell database
so we'll end up failing with a marker not found error.
This change adds a functional recreate test for the bug.
The fix will build on this to show when the bug is fixed and the
test will be changed to assert expected normal behavior.
_ensure_console_log_for_instance[1] ensures the VM console.log exists.
A change[2] updated it to succeed if the file exists but nova is unable
to read it (which typically happens when libvirt rewrites its uid/gid)
by ignoring EPERM errors.
It seems the method should ignore EACCES errors instead. EACCES is
raised when an action is not permitted because of insufficient
permissions, whereas EPERM is raised when an action is not permitted at all.
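A minimal sketch of the intended handling (the path variable is illustrative):

    import errno

    try:
        # Make sure console.log exists so libvirt can append to it.
        open(console_log_path, 'a').close()
    except (IOError, OSError) as err:
        if err.errno != errno.EACCES:
            raise
        # The file exists but libvirt has chown'd it away from us;
        # nothing more to do.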
Matt Riedemann [Fri, 26 May 2017 01:35:09 +0000 (21:35 -0400)]
Avoid lazy-loading instance.id when cross_az_attach=False
The instance is no longer created in the API so the id attribute
won't be set, which means when checking the instance AZ against
the volume AZ, if they don't match we can't put the instance.id
in the error message. We shouldn't have been putting the instance
primary key in the error message anyway.
This fixes the bug by using the instance.uuid which is set in
this object in _provision_instances.
Kevin_Zheng [Tue, 23 May 2017 12:28:28 +0000 (20:28 +0800)]
Exclude deleted service records when calling hypervisor statistics
Hypervisor statistics could be incorrect if deleted service records
are not excluded from the DB.
A user may stop the 'nova-compute' service on some
compute nodes and delete the service from nova.
Deleting a 'nova-compute' service soft-deletes
the corresponding db records in both the 'services'
table and the 'compute_nodes' table if the
compute_nodes record is old, i.e. it is linked
to the service record. Modern compute_nodes
records aren't linked to the services table,
so deleting the services record will not delete
the compute_nodes record, and the ResourceTracker
won't recreate the compute_nodes record if the host
and hypervisor_hostname still match the existing
record, but restarting the process after deleting
the service will create a new services table record
with the same host/binary/topic.
If the 'nova-compute' service on that server
restarts, it will automatically add a record
to the 'compute_nodes' table (assuming it was deleted
because it was an old-style record) and a corresponding
record to the 'services' table, and if the host name
of the compute node did not change, the newly
created records in the 'services' and 'compute_nodes'
tables will be identical to the previously soft-deleted
records except for the 'deleted' column.
When calculating hypervisor statistics, the DB layer
joins records across the whole deployment by
comparing the host field selected from the
services table with the host field selected
from the compute_nodes table, and the calculated
results can be multiplied if multiple records
in the services table share the same host field,
which is exactly the scenario the actions above
produce.
Co-Authored-By: Matt Riedemann <mriedem.os@gmail.com>
Change-Id: I9dfa15f69f8ef9c6cb36b2734a8601bd73e9d6b3
Closes-Bug: #1692397
(cherry picked from commit 3d3e9cdd774efe96f468f2bcba6c09a40f5e71d3)
Jackie Truong [Thu, 4 May 2017 16:51:22 +0000 (12:51 -0400)]
Fix decoding of encryption key passed to dmcrypt
This patch fixes the decoding of the encryption key passed to dmcrypt.
During the key management move from Nova to Castellan, in the Newton
release, conversion of the encryption key (from a string to list of
unsigned ints) was removed from the key retrieval method. This patch
updates dmcrypt to decode an encryption key string, rather than a list
of unsigned ints. See the linked bug for more information.
The method used to decode the encryption key has been updated to use
binascii, as done in os-brick [1], to maintain consistency. The key
generation and decoding portions of test_dmcrypt have been updated to
reflect this change and ensure compatibility with both, Python 2 and
Python 3.
Kevin_Zheng [Mon, 15 May 2017 07:02:00 +0000 (15:02 +0800)]
Catch exception.OverQuota when create image for volume backed instance
When creating an image of a volume-backed instance, nova
creates snapshots of all volumes attached to the instance
in Cinder, and if a quota is exceeded in Cinder, an HTTP 500 is
raised; we should capture this error and raise a 403 instead.
Matt Riedemann [Thu, 11 May 2017 22:29:42 +0000 (18:29 -0400)]
Handle special characters in database connection URL netloc
When calling "nova-manage cell_v2 simple_cell_setup" or
"nova-manage cell_v2 map_cell0" without passing in the
--database_connection option, we read the [database]/connection
URL from nova.conf, try to split the URL and then create a
default connection based on the name of the original connection,
so if your cell database's name is 'nova' you'd end up with
'nova_cell0' for the cell0 database name in the URL.
The problem is the database connection URL has credentials in the
netloc and if the password has special characters in it, those can
mess up the URL split, like splitting on '?', which normally denotes
the start of the query string in a URL.
This change handles special characters in the password by using
a nice DB connection URL parsing utility method available in
sqlalchemy to get the database name out of the connection URL string
so we can replace it properly with the _cell0 suffix.
Adds a release note as this bug causes issues when upgrading.
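The core of the approach, sketched with SQLAlchemy's URL parser (not the
exact nova-manage code):

    from sqlalchemy.engine import url as sqla_url

    url = sqla_url.make_url(connection)
    # make_url copes with credentials containing '?', '/' or '@' in the
    # netloc, so the database name is extracted reliably.
    cell0_database = (url.database or 'nova') + '_cell0'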
melanie witt [Tue, 2 May 2017 21:47:12 +0000 (21:47 +0000)]
Use six.text_type() when logging Instance object
We're seeing a trace in gate jobs, for example:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
402: ordinal not in range(128)
when attempting to log an Instance object with a unicode display name.
This resurfaced relatively recently because of the change in devstack
to use the new OSJournalHandler with use_journal=True which is
suspected of causing some deadlock issues [1] unrelated to this bug.
The problem occurs in code that logs an entire Instance object when
the object has a field with unicode characters in it (display_name).
When the object is sent to logging, the UnicodeDecodeError is raised
while formatting the log record here [2]. This implies an implicit
conversion attempt to unicode at this point.
I found that with the Instance object, the conversion to unicode fails
with the UnicodeDecodeError unless the encoding 'utf-8' is explicitly
specified to six.text_type(). And when specifying an encoding to
six.text_type(), the argument to convert must be a string, not an
Instance object, so this does the conversion in two steps as a utility
function:
1. Get the string representation of the Instance with repr()
2. Call six.text_type(instance_repr, 'utf-8') passing the encoding
if not six.PY3
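A sketch of that utility (the real helper name in nova may differ):

    import six

    def _instance_to_text(instance):
        # repr() first, because six.text_type() only accepts an encoding
        # argument when converting a str, not an arbitrary object.
        instance_repr = repr(instance)
        if six.PY3:
            return instance_repr
        return six.text_type(instance_repr, 'utf-8')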
melanie witt [Tue, 16 May 2017 10:25:42 +0000 (10:25 +0000)]
Cache database and message queue connection objects
Recently in the gate we have seen a trace on some work-in-progress
patches:
OperationalError: (pymysql.err.OperationalError)
(1040, u'Too many connections')
and at least one operator has reported that the number of database
connections increased significantly going from Mitaka to Newton.
It was suspected that the increase was caused by creating new oslo.db
transaction context managers on-the-fly when switching database
connections for cells. Comparing the dstat --tcp output of runs of the
gate-tempest-dsvm-neutron-full-ubuntu-xenial job with and without
caching of the database connections showed a difference of 445 active
TCP connections and 1495 active TCP connections, respectively [1].
This adds caching of the oslo.db transaction context managers and the
oslo.messaging transports to avoid creating a large number of objects
that are not being garbage-collected as expected.
NOTE(melwitt): Conflicts caused by the fact that the set_target_cell
function doesn't exist in Ocata and message queue connections were
not stored on the context in Ocata.
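A minimal sketch of the caching approach (the factory call is a placeholder
for the oslo.db/oslo.messaging setup nova actually performs):

    # Cache engines/transports per cell connection string so repeated
    # target-cell switches reuse existing objects instead of creating
    # new ones that never get garbage-collected.
    _CELL_CACHE = {}

    def get_context_manager(db_connection):
        if db_connection not in _CELL_CACHE:
            _CELL_CACHE[db_connection] = create_context_manager(db_connection)
        return _CELL_CACHE[db_connection]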
Kaitlin Farr [Fri, 10 Mar 2017 23:09:49 +0000 (18:09 -0500)]
Parse algorithm from cipher for ephemeral disk encryption
Nova's keymgr implementation used to have default values
for the algorithm and bit length. Castellan does not have
default values, and when Castellan replaced keymgr in
Ib563b0ea4b8b4bc1833bf52bf49a68546c384996, the parameters
to the create_key method were not updated. This change
parses the algorithm from the cipher value and passes it
to Castellan's key manager interface.
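The parsing amounts to splitting the cipher name (a sketch; e.g. a cipher of
'aes-xts-plain64' yields the algorithm 'aes'):

    def _get_algorithm(cipher):
        # Castellan's create_key() has no defaults, so derive the
        # algorithm from the configured cipher value.
        return cipher.split('-', 1)[0] if cipher else None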
Chris Friesen [Mon, 15 May 2017 18:48:37 +0000 (12:48 -0600)]
fix InvalidSharedStorage exception message
The exception message for InvalidSharedStorage is grammatically
complex and ignores the possibility of block migration, which
results in a misleading and confusing message for the user.
When a VM creation fails because the 'gigabytes', 'volumes', or
'per_volume_gigabytes' quota is exceeded, the error message
generated is always the one specific to the 'volumes' quota, which says
"Volume resource quota exceeded". Instead, the error message
should be specific to the quota that was exceeded.
Lee Yarwood [Thu, 20 Apr 2017 18:43:32 +0000 (19:43 +0100)]
libvirt: Always disconnect_volume after rebase failures
Previously failures when rebasing onto a new volume would leave the
volume connected to the compute host. For some volume backends such as
iSCSI the subsequent call to terminate_connection would then result in
leftover devices remaining on the host.
This change simply catches any error associated with the rebase and
ensures that disconnect_volume is called for the new volume prior to
terminate_connection finally being called.
NOTE(lyarwood): Conflict caused by MIN_LIBVIRT_VERSION being 1.2.1 in
stable/ocata making I81c32bbea0f04ca876f2078ef2ae0e1975473584
unsuitable. The is_job_complete polling loop removed by that change in
master is now moved into the try block of this change ensuring we catch
any errors that might be thrown while waiting for the async pivot. The
log.exception message also requires translation in Ocata.
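In outline, the rebase and the polling loop are wrapped so any failure
disconnects the new volume before terminate_connection runs (a sketch from
inside the driver method; helper names follow the libvirt driver loosely):

    try:
        dom.blockRebase(disk_dev, new_path, 0, flags)
        while not self._wait_for_block_job(dom, disk_dev):
            time.sleep(0.5)
    except Exception:
        with excutils.save_and_reraise_exception():
            LOG.exception("Failure rebasing volume %s on %s.",
                          new_path, disk_dev)
            self._disconnect_volume(new_connection_info, disk_dev)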
Steven Webster [Mon, 27 Mar 2017 16:18:23 +0000 (12:18 -0400)]
Fix port update exception when unshelving an instance with PCI devices
It is possible that _update_port_binding_for_instance() is called
without a migration object, such as when a user unshelves an instance.
If the instance has a port(s) with a PCI device binding, the current
logic extracts a pci mapping from old to new devices from the migration
object and migration context. If a 'new' device is not found in the
PCI mapping, an exception is thrown.
In the case of an unshelve, there is no migration object (or migration
context), and as such we have an empty pci mapping.
This fix will only check for a new device if we have a migration object.
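The guard is essentially the following (a sketch; the mapping helper name is
hypothetical):

    # Unshelve has no migration object, so there is no old->new PCI
    # device mapping to build or apply.
    pci_mapping = {}
    if migration:
        pci_mapping = _get_pci_mapping_for_migration(instance, migration)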
Clark Boylan [Wed, 12 Apr 2017 22:46:31 +0000 (15:46 -0700)]
Fix libvirt group selection in live migration test
The custom start scripting for the nova compute service assumed that the
libvirt group is "libvirtd". Unfortunately "libvirtd" is no longer used by
Debian/Ubuntu, which use "libvirt" instead. Add a simple check against
/etc/group for an existing libvirtd group, otherwise use libvirt.
Change-Id: Idbda49587f3b62a0870d10817291205bde0e821e
Depends-On: If2dbc53d082fea779448998ea12b821bd037a14e
(cherry picked from commit ea8463679c1c25b496ffca1be6bd9bd026c29225)
Steven Webster [Wed, 5 Apr 2017 13:05:07 +0000 (09:05 -0400)]
Fix mitaka online migration for PCI devices
Currently, a validation error is thrown if we find any PCI device
records which have not populated the parent_addr column on a nova
upgrade. However, the only PCI records for which a parent_addr
makes sense are those with a device type of 'type-VF' (i.e. an
SRIOV virtual function). PCI records with a device type of 'type-PF'
or 'type-PCI' will not have a parent_addr. If any of those records
are present on upgrade, the validation will fail.
This change checks that the device type of the PCI record is
'type-VF' when making sure the parent_addr has been correctly
populated.
In the db API, when we process filters, we didn't
use deepcopy. For the "tags" and "not-tags" filters
we used pop to get the first tag, filtered the
results, and then joined with the other tags for
later filtering. When we did pop(), the original
value was deleted but the "tags"/"not-tags" key remained.
In the cells scenario, both single-cell (where we
query cell0 and the other cell) and multi-cell,
we have to query all the cells in a loop, so the
tags list in the filter keeps getting popped.
This leads to either an HTTP 500 error (popping
from an empty list) or an incorrect result (when
the number of tags in the list is larger than the
number of cells, no HTTP 500 shows, but the filter
results for each cell differ as each loop pops
one more tag).
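The fix boils down to copying the filters before mutating them (sketch):

    import copy

    # Work on a copy so popping 'tags'/'not-tags' while building this
    # cell's query does not mutate the dict the caller reuses for the
    # next cell.
    filters = copy.deepcopy(filters)
    tags = filters.pop('tags', [])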
Matt Riedemann [Mon, 17 Apr 2017 00:45:25 +0000 (20:45 -0400)]
Add regression test for server filtering by tags bug 1682693
There was a latent bug in the DB API code such that when we
process filters when listing instances, the various tags
filters have their values popped out of the filters dict and
the values (which are lists) for the filter also have the
first item in the list popped out to build the query.
This latent bug was exposed in Newton when we started listing
instances in the API from both cell0 and the main cell database,
because the query to cell0 would pop an item and then it
would not be in the 2nd query to the main cell database. If we
only had one tag in the filter list, we get an IndexError on
the 2nd pop() call.
Note that we also use the build_requests table in the API to
list instances, but there would not be any tagged servers in
that table since a server has to be ACTIVE before you can tag it,
and build_requests only exist until the instance is put into a
cell and starts building there (so it won't be ACTIVE yet).
John Garbutt [Tue, 7 Feb 2017 19:12:50 +0000 (19:12 +0000)]
Stop failed live-migrates getting stuck migrating
When there are failures in driver.cleanup, we are seeing live-migrations
that get stuck in the live-migrating state. While there has been a patch
to stop the cause listed in the bug this closes, there are other
failures (such as a token timeout when talking to cinder or neutron)
that could trigger this same failure mode.
When we hit an error this late in live-migration, it should be a very
rare event, so it's best to just put the instance and migration into an
error state, and help alert both the operator and API user to the
failure that has occurred.
Fix HTTP 500 raised for getConsoleLog for stopped instance
Stopped instances with pty console will not contain
`source_node` information, and in the current
implementation the pty variable used later will
result in an UnboundLocalError, which results in a
500 error out of the API.
Matt Riedemann [Wed, 5 Apr 2017 20:27:41 +0000 (16:27 -0400)]
Perform old-style local delete for shelved offloaded instances
This fixes a regression from some local delete code added for cells v2
where it assumed that if an instance did not have a host, it wasn't
scheduled to a cell yet. That assumption misses the fact that the
instance won't have a host if it was shelved offloaded. And to be
shelved offloaded, the instance had to have first been built on a host
in a cell.
So we simply duplicate the same check as later in the _delete() method
for instance.host or shelved-offloaded to decide what the case is.
Obviously this is all a giant mess of duplicate delete path code that
needs to be unwound, and that's the plan, but first we're fixing
regressions and then we can start rolling this duplication all back
so we can get back to the single local delete flow that we know and love.
Matt Riedemann [Fri, 24 Mar 2017 16:06:07 +0000 (12:06 -0400)]
Set size/status during image create with FakeImageService
This is needed for an upcoming change which introduces
a functional test which shelves a server. Shelving a server
creates a snapshot image and in the real world, glance sets
the size and status attributes on the image when it's created
in glance. Our FakeImageService wasn't doing that, so tests
that are running at the same time with the same fake noauth
credentials are listing images and picking up the shelve
snapshot image which doesn't have size or status set and
that produces a KeyError in the API code.
Matt Riedemann [Sat, 1 Apr 2017 01:08:55 +0000 (21:08 -0400)]
Commit usage decrement after destroying instance
This fixes a regression in Ocata where we were always
decrementing quota usage during instance delete even
if we failed to delete the instance. Now the reservation
is properly committed after the instance is destroyed.
The related functional test is updated to show this working
correctly now.
Matt Riedemann [Fri, 31 Mar 2017 23:56:14 +0000 (19:56 -0400)]
Add regression test for quota decrement bug 1678326
This was spotted from someone validating the fix for
bug 1670627. They reported that even though they failed
to delete an instance in ERROR state that was in cell0,
the quota usage was decremented.
This is because we committed the quota reservation
to decrement the usage before actually attempting to destroy
the instance, rather than upon successful deletion.
The rollback after InstanceNotFound is a noop because of
how the Quotas.rollback method noops if the reservations
were already committed. That is in itself arguably a bug,
but not fixed here, especially since the counting quotas
work in Pike will remove all of the reservations commit and
rollback code.
Matt Riedemann [Wed, 5 Apr 2017 19:12:41 +0000 (15:12 -0400)]
Short-circuit local delete path for cells v2 and InstanceNotFound
When we're going down the local delete path for cells v2 in the API
and instance.destroy() fails with an InstanceNotFound error, we are
racing with a concurrent delete request and know that the instance
is already deleted, so we can just return rather than fall through to
the rest of the code in the _delete() method, like for BDMs and
console tokens.
Matt Riedemann [Fri, 24 Mar 2017 02:07:03 +0000 (22:07 -0400)]
Do not attempt to load osinfo if we do not have os_distro
We get a warning logged every time we try to load up osinfo
with an image metadata that does not have the 'os_distro'
property set. We should be smarter and just not try to load
osinfo at all if we know we cannot get results.
Matt Riedemann [Tue, 21 Mar 2017 17:18:08 +0000 (13:18 -0400)]
libvirt: conditionally set script path for ethernet vif types
Change I4f97c05e2dec610af22a5150dd27696e1d767896 worked around
a change introduced in libvirt 1.3.3 where the script path on
a LibvirtConfigGuestInterface could not be the empty string
because libvirt would literally take that as the path and couldn't
resolve it, when in fact it used to indicate to libvirt that the
script path is a noop. This has been fixed in libvirt 3.1.
On Ubuntu with libvirt<1.3.3, if the script path is None then
it defaults to /etc/qemu-ifup which is blocked by AppArmor.
So this change adds a conditional check when setting the script
path value based on the libvirt version so we can straddle releases.
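A hedged sketch of the conditional (the version constant name is
illustrative):

    # libvirt >= 1.3.3 treats script="" as a literal path, so omit the
    # element there; older libvirt on Ubuntu needs script="" to avoid
    # defaulting to /etc/qemu-ifup, which AppArmor blocks.
    if host.has_min_version(MIN_LIBVIRT_VIF_SCRIPT_NONE):
        conf.script = None
    else:
        conf.script = ''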
Starting from version 1.3.5, libvirt allows setting a vlan tag for macvtap
passthrough mode on SR-IOV VFs. Libvirt also removes any vlan tags that
have been set externally, e.g. by the ip link command.
In order to support older libvirt versions, this code makes the
behaviour backward compatible by checking the libvirt version.
This can be completely removed once the minimum libvirt version increases.
Evgeny Antyshev [Mon, 6 Mar 2017 14:27:06 +0000 (14:27 +0000)]
get_model method missing for Ploop image
Image.get_model is called in the partition injection code,
and currently the partition injection attempt fails unconditionally.
This patch makes use of disk/api.py inject_data's failure tolerance:
it doesn't fail unless the injected data is mandatory.
Balazs Gibizer [Fri, 17 Mar 2017 10:24:49 +0000 (11:24 +0100)]
do not include context to exception notification
The wrap_exception decorator optionally emitted a notification.
Based on the code comments, the original intention was not to include the
context in that notification for security reasons. However, the
implementation did include the context in the payload of the legacy
notification.
Recently we saw circular reference errors during the payload serialization
of this notification. Based on the logs, the only complex data structure
that could cause a circular reference is the context. So this patch
removes the context from the legacy exception notification.
The versioned exception notification is not affected as it does not
contain the args of the decorated function.
ShunliZhou [Fri, 10 Mar 2017 06:05:57 +0000 (14:05 +0800)]
Add populate_retry to schedule_and_build_instances
When booting an instance fails on the compute node, nova will
not retry the boot on another host.
https://review.openstack.org/#/c/319379/ changed the create
instance workflow to call schedule_and_build_instances, which does
not populate the retry info into the filter properties, so nova will
not retry when the boot on the compute node fails. This patch
populates the retry info when schedule_and_build_instances is called.
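The missing step is roughly the following, using the existing scheduler
utility (a sketch of where it fits in schedule_and_build_instances):

    from nova.scheduler import utils as scheduler_utils

    # Record num_attempts and the attempted hosts in filter_properties so
    # a failed boot on the compute node can be rescheduled elsewhere.
    scheduler_utils.populate_retry(filter_properties, instance.uuid)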
melanie witt [Thu, 16 Mar 2017 18:24:23 +0000 (18:24 +0000)]
Fix functional regression/recreate test for bug 1671648
There are a couple of issues with the test:
1. It doesn't consider both hosts from the two compute services
during scheduling.
2. There is a race where sometimes claims.Claim.__init__ won't
be called because if the RT instance_claim runs before
update_available_resource has run, it will create a
claims.NopClaim instead.
This adds the RetryFilter to enabled_filters, adds set_nodes() calls
to set the nodenames of each compute service to match its host,
resulting in consideration of both hosts for scheduling, and stubs
resource_tracker.ResourceTracker.instance_claim instead of
claims.Claim.__init__.
NOTE(mriedem): The conflict is due to this patch coming after cb4ce72f5f092644aa9b84fa58bcb9fd89b6bedc in Pike. Since this
is a fix for the functional test that the bug fix builds on,
we actually want this to come *before* the bug fix backport.
Matt Riedemann [Wed, 15 Mar 2017 20:58:11 +0000 (16:58 -0400)]
Add a functional regression/recreate test for bug 1671648
This adds a test which recreates the regression bug introduced
in Ocata where build retries are not populated when creating
instances in conductor for cells v2.
The change that fixes the bug will go on top of this and modify
the test to show the bug is fixed.
Matt Riedemann [Tue, 14 Mar 2017 16:34:59 +0000 (12:34 -0400)]
Add release notes for 15.0.1 Ocata bug fix release
There are several high severity, high impact bug fixes going
into the upcoming 15.0.1 Ocata release. This change adds release
notes highlighting the most important fixes along with a known
issue for another regression bug that is not yet fixed in Ocata,
but should be shortly.
Matt Riedemann [Thu, 9 Mar 2017 02:51:07 +0000 (21:51 -0500)]
Decrement quota usage when deleting an instance in cell0
When we fail to schedule an instance, e.g. there are no hosts
available, conductor creates the instance in the cell0 database
and deletes the build request. At this point quota usage
has been incremented in the main 'nova' database.
When the instance is deleted, the build request is already gone
so _delete_while_booting returns False and we lookup the instance
in cell0 and delete it from there, but that flow wasn't decrementing
quota usage like _delete_while_booting was.
This change adds the same quota usage decrement handling that
_delete_while_booting performs.
NOTE(mriedem): This change also pulls in some things from
I7de87dce216835729283bca69f0eff59a679b624 which is not being
backported to Ocata since in Pike it solves a slightly different
part of this quota usage issue. In Pike the cell mapping db_connection
is actually stored on the context object when we get the instance
from nova.compute.api.API.get(). So the fix in Pike is slightly
different from Ocata. However, what we need to pull from that Pike
change is:
1. We need to target the cell that the instance lives in to get the
flavor information when creating the quota reservation.
2. We need to change the functional regression test to assert that
the bug is fixed.
The code and tests are adjusted to be a sort of mix between both
changes in Pike without requiring a full backport of the 2nd
part of the fix in Pike.
This adds a functional regression test for bug 1670627.
This is the recreate scenario. Patches that are proposed to
fix the bug will build on top of this and change its assertions
to know when it's properly fixed.
Balazs Gibizer [Thu, 9 Mar 2017 16:28:02 +0000 (17:28 +0100)]
Fix missing instance.delete notification
Change I8742071b55f018f864f5a382de20075a5b444a79 introduced cases where an
instance object is destroyed without the instance.delete notification
being emitted.
This patch adds the necessary notification to restore legacy
behaviour.