--- /dev/null
+% Live Update Handover Protocol
+% David Woodhouse <<dwmw@amazon.co.uk>>
+% Revision 1
+
+Introduction
+============
+
+Purpose
+-------
+
+Live update performs a _kexec_ from one running version of Xen to
+another, preserving all running domains in a form of guest-transparent
+live migration.
+
+This document outlines the memory layout requirements and data stream
+used in the handover protocol, to ensure that pages used by running
+domains are preserved during the transition from one version of Xen
+to the next.
+
+
+Compatibility
+-------------
+
+It cannot be repeated often enough that information passed over live
+update is an ABI. It is expected that live update can be performed from
+one major version of Xen to another, or even hypothetically to a system
+which is not Xen at all.
+
+It is necessary that some data are handed over "in place"; in
+particular the memory pages of the running domains. However, no
+internal Xen data structures may be transferred in this fashion; at
+least not without retrospectively declaring them to be ABI, with the
+restrictions that places on subsequent changes.
+
+
+
+Handover
+========
+
+
+Memory Usage Restrictions
+-------------------------
+
+The new Xen must take care not to use any memory pages which already
+belong to guests. To facilitate this, a contiguous region of memory
+is reserved for the boot allocator, known as _live update bootmem_.
+
+This region is reserved by the original Xen during its own boot, and
+the location made available to the _kexec(8)_ user space tool
+through the `kexec_get_range` hypercall using a new region type
+`KEXEC_RANGE_MA_LIVEUPDATE`. It is passed to the new Xen on the
+command line, using the `liveupdate=` parameter.
+
+The new Xen must not use any pages outside this region until it has
+consumed the live update data stream and determined which pages are
+already in use by running domains.
+
+At run time, Xen may use memory from the reserved region for any
+purpose that does not require preservation over a live update; in
+particular it must not be mapped to a domain.
+
+The new Xen executable image must be loaded by kexec to the same
+physical location as the running Xen, since that region of memory is
+known to be available. For that reason, freed init memory from the
+Xen image is also treated as reserved _live update bootmem_.
+
+
+Live Update Data Stream
+-----------------------
+
+During handover, the running Xen pauses all domains and creates a
+_live update data stream_ containing all the information required by
+the new Xen to restore them. This is largely the same as
+guest-transparent live migration.
+
+Data pages for this stream may be allocated anywhere in physical
+memory outside the _live update bootmem_ regions.
+
+Xen creates a physically contiguous array of MFNs of the allocated
+data pages, suitable for passing to `vmap()` to obtain a virtually
+contiguous mapping of the whole data stream.
+
+
+Breadcrumb
+----------
+
+The live update data stream is created during the final `kexec_exec`
+hypercall, whereas the command line for the new Xen has to be set up
+by `kexec(8)` in userspace long beforehand. The address of the stream
+therefore cannot be passed to the new Xen on the command line.
+
+Thus, to allow the new Xen to find the data stream, the old Xen places
+a _breadcrumb_ in the first words of the _live update bootmem_, containing
+the number of data pages, and the physical address of the contiguous MFN
+array.
+
+The breadcrumb is written as the last action of the `kexec_reloc()`
+routine during the `kexec` handover. It cannot overwrite anything
+important, by virtue of the existing guarantee that Xen will not place
+any data in that region which needs to survive across a live update.
+
+A restriction of the `kexec_reloc()` mechanism for writing the breadcrumb
+is that the values are host-endian and are masked with PAGE_MASK; the low
+bits are zeroed. This is actually perfect for the magic value used
+to recognise a live update breadcrumb, since it neatly prevents any attempt
+to live update to a Xen which uses a different endianness or page size.
+
+For the physical address of the MFN list it's perfectly fine, since
+that list is page-aligned anyway. For the number of pages, it means
+the value must be shifted accordingly. Hence the use of `shifted_nr_pages`
+in the breadcrumb structure below:
+
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | live_update_magic                               |
+    +-------------------------------------------------+
+    | mfn_array_physaddr                              |
+    +-------------------------------------------------+
+    | shifted_nr_pages                                |
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field               Description
+------------------- ------------------------------------------------
+live_update_magic   "LiveUpda" (0x4c69766555706461) stored in the
+                    host endianness and masked with PAGE_MASK.
+                    For example on x86_64: `00 60 70 55 65 76 69 4c`.
+
+mfn_array_physaddr  Machine address of MFN list for data stream.
+
+shifted_nr_pages    Number of data pages, shifted left by PAGE_SHIFT
+                    to avoid the limitation of kexec_reloc().
+--------------------------------------------------------------------
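+
+As an illustration of the masking and shifting described above, the
+following is a minimal sketch in C, not the code from this series. The
+names `struct lu_breadcrumb`, `lu_breadcrumb_encode()` and
+`lu_breadcrumb_decode()` are hypothetical, and the `LU_PAGE_SHIFT`
+value of 12 assumes x86_64 with 4KiB pages.
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+/* Assumed values for x86_64 with 4KiB pages; not taken from the patch. */
+#define LU_PAGE_SHIFT 12
+#define LU_PAGE_MASK  (~((UINT64_C(1) << LU_PAGE_SHIFT) - 1))
+#define LU_MAGIC      UINT64_C(0x4c69766555706461)   /* "LiveUpda" */
+
+/* Hypothetical C view of the three breadcrumb words. */
+struct lu_breadcrumb {
+    uint64_t live_update_magic;   /* LU_MAGIC & LU_PAGE_MASK */
+    uint64_t mfn_array_physaddr;  /* page-aligned machine address */
+    uint64_t shifted_nr_pages;    /* nr_pages << LU_PAGE_SHIFT */
+};
+
+/* Fill in a breadcrumb so that every word survives PAGE_MASK masking. */
+static void lu_breadcrumb_encode(struct lu_breadcrumb *bc,
+                                 uint64_t mfn_array_physaddr,
+                                 uint64_t nr_pages)
+{
+    bc->live_update_magic  = LU_MAGIC & LU_PAGE_MASK;
+    bc->mfn_array_physaddr = mfn_array_physaddr & LU_PAGE_MASK;
+    bc->shifted_nr_pages   = nr_pages << LU_PAGE_SHIFT;
+}
+
+/* Check the magic and recover the page count; false if no breadcrumb. */
+static bool lu_breadcrumb_decode(const struct lu_breadcrumb *bc,
+                                 uint64_t *mfn_array_physaddr,
+                                 uint64_t *nr_pages)
+{
+    if (bc->live_update_magic != (LU_MAGIC & LU_PAGE_MASK))
+        return false;
+    *mfn_array_physaddr = bc->mfn_array_physaddr;
+    *nr_pages = bc->shifted_nr_pages >> LU_PAGE_SHIFT;
+    return true;
+}
+```
+
+A receiver built with a different page size computes a different masked
+magic, so the comparison in the decode step fails, which is the
+protective property noted above.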
+
+
+IOMMU
+-----
+
+Where devices are passed through to domains, it may not be possible
+to quiesce those devices for the purpose of performing the update.
+
+If performing live update with assigned devices, the original Xen will
+leave the IOMMU mappings active during the handover (thus implying
+that IOMMU page tables may not be allocated in the _live update
+bootmem_ region either).
+
+The new Xen must resume control of the IOMMU without causing those mappings
+to become invalid even for a short period of time. On hardware which does not
+support Posted Interrupts, interrupts may need to be generated on resume.
+
+_This section will be expanded once we actually have it working._
+
+\clearpage
+
+Data Stream Overview
+====================
+
+Once discovered and mapped, the live update data stream forms a
+virtually contiguous stream of records following the basic form
+documented in the LibXenCtrl Domain Image Format at
+`docs/specs/libxc-migration-stream.pandoc`.
+
+Some record types from the LibXenCtrl Domain Image Format are used
+as-is, such as the `X86_PV_INFO`, `X86_PV_VCPU_BASIC`, `HVM_CONTEXT`
+and other records containing domain-specific data.
+
+The Domain Header from that document is not used in that form, and a new
+record of type `LU_DOMAIN_INFO` is defined below.
+
+Other new record types specific to the live update process are defined in
+this document. Of those, some contain global state such as the M2P table
+information, while others are domain-specific.
+
+The live update data stream starts with records containing global
+information, followed any number of times by a `LU_DOMAIN_INFO` record
+and subsequent domain-specific records for that domain.
+
+There is a single `END` record at the end of the live update data stream,
+indicating that no more `LU_DOMAIN_INFO` records are present.
+
+\clearpage
+
+As defined in the LibXenCtrl Domain Image Format document, a record
+has the following structure. Record type values defined for live update
+have bit 30 set, and are thus in the range 0x40000000-0x7FFFFFFF for
+mandatory live update records, and 0xC0000000-0xFFFFFFFF for optional
+live update records _(of which there are none at the present time)_.
+
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | type                  | body_length             |
+    +-----------+-----------+-------------------------+
+    | body...                                         |
+    ...
+    |           | padding (0 to 7 octets)             |
+    +-----------+-------------------------------------+
+
+--------------------------------------------------------------------
+Field        Description
+-----------  -------------------------------------------------------
+type         0x40000000: LU_VERSION
+
+             0x40000001: LU_M2P
+
+             0x40000002: LU_M2P_COMPAT
+
+             0x40000003: LU_DOMAIN_INFO
+
+             0x40000004 - 0x7FFFFFFF: Reserved for future _mandatory_
+             live update records.
+
+             0xC0000000 - 0xFFFFFFFF: Reserved for future _optional_
+             live update records.
+
+body_length  Length in octets of the record body.
+
+body         Content of the record.
+
+padding      0 to 7 octets of zeros to pad the whole record to a
+             multiple of 8 octets.
+--------------------------------------------------------------------
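+
+The record framing above could be walked along the following lines.
+This is a sketch under stated assumptions rather than the receiving
+Xen's implementation: the helper names are hypothetical, header fields
+are assumed to be host-endian, and the `END` type value of 0x00000000
+is taken from the LibXenCtrl Domain Image Format.
+
+```c
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#define REC_TYPE_END            UINT32_C(0x00000000) /* from libxc format */
+#define REC_TYPE_LU_DOMAIN_INFO UINT32_C(0x40000003)
+#define REC_TYPE_LU_BIT         (UINT32_C(1) << 30)  /* live update record */
+#define REC_TYPE_OPTIONAL_BIT   (UINT32_C(1) << 31)  /* optional record */
+
+struct rec_hdr {
+    uint32_t type;
+    uint32_t body_length;
+};
+
+static bool rec_is_live_update(uint32_t type)
+{
+    return (type & REC_TYPE_LU_BIT) != 0;
+}
+
+static bool rec_is_optional(uint32_t type)
+{
+    return (type & REC_TYPE_OPTIONAL_BIT) != 0;
+}
+
+/* Octets occupied by a whole record: header, body and padding to 8. */
+static uint64_t rec_total_size(uint32_t body_length)
+{
+    return sizeof(struct rec_hdr) +
+           (((uint64_t)body_length + 7) & ~UINT64_C(7));
+}
+
+/*
+ * Walk a mapped stream and count LU_DOMAIN_INFO records.
+ * Returns -1 if the stream is truncated or has no END record.
+ */
+static int lu_count_domains(const uint8_t *stream, size_t len)
+{
+    size_t off = 0;
+    int domains = 0;
+
+    while (len - off >= sizeof(struct rec_hdr)) {
+        struct rec_hdr hdr;
+        uint64_t total;
+
+        memcpy(&hdr, stream + off, sizeof(hdr));
+        total = rec_total_size(hdr.body_length);
+        if (total > len - off)
+            return -1;              /* record runs past the mapping */
+        if (hdr.type == REC_TYPE_END)
+            return domains;
+        if (hdr.type == REC_TYPE_LU_DOMAIN_INFO)
+            domains++;
+        off += (size_t)total;
+    }
+    return -1;                      /* ran out of data before END */
+}
+```
+
+A real consumer would dispatch on `type` instead of counting, and would
+reject any unrecognised record for which `rec_is_optional()` is false.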
+
+
+\clearpage
+
+Global Records
+==============
+
+LU_VERSION
+----------
+
+The version field indicates the version of Xen from which the system
+is live updating. In theory this should never be relevant, but it
+allows for version-specific workarounds to be implemented in the receiving
+Xen should they become necessary.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | xen_major             | xen_minor               |
+    +-----------------------+-------------------------+
+
+
+--------------------------------------------------------------------
+Field       Description
+----------- --------------------------------------------------------
+xen_major   The Xen major version from which the system is updating.
+
+xen_minor   The Xen minor version from which the system is updating.
+--------------------------------------------------------------------
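+
+A receiver might read this record and gate a workaround on it roughly
+as follows. This is only a sketch: `struct lu_version` and both helper
+functions are hypothetical names, and the fields are assumed to be two
+host-endian 32-bit values as drawn in the diagram.
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+#include <string.h>
+
+/* Hypothetical in-memory form of an LU_VERSION record body. */
+struct lu_version {
+    uint32_t xen_major;
+    uint32_t xen_minor;
+};
+
+/* Copy an LU_VERSION body into *v; false if the length is unexpected. */
+static bool lu_version_parse(const void *body, uint32_t body_length,
+                             struct lu_version *v)
+{
+    if (body_length != sizeof(*v))
+        return false;
+    memcpy(v, body, sizeof(*v));    /* fields assumed host-endian */
+    return true;
+}
+
+/* True if the sending Xen is older than major.minor. */
+static bool lu_version_before(const struct lu_version *v,
+                              uint32_t major, uint32_t minor)
+{
+    return v->xen_major < major ||
+           (v->xen_major == major && v->xen_minor < minor);
+}
+```
+
+A version-specific workaround would then be guarded by a call such as
+`lu_version_before(&v, 4, 14)`, where the version numbers are purely
+illustrative.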
+
+\clearpage
+
+LU_M2P / LU_M2P_COMPAT
+----------------------
+
+The M2P and compatibility M2P records contain a scatter/gather list of
+pages containing native or 32-bit M2P data.
+
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+    | m2p_page_data[0]...                             |
+    ...
+    +-------------------------------------------------+
+    | m2p_page_data[N-1]...                           |
+    ...
+    +-------------------------------------------------+
+
+--------------------------------------------------------------------
+Field          Description
+-------------- -----------------------------------------------------
+m2p_page_data  A 64-bit value containing the physical address of
+               the next page of M2P data, encoding the _order_ of
+               the page into the low 12 bits. Thus, a 1GiB page at
+               0x4C0000000 would be encoded as 0x4C000001E, where
+               0x1E is 30, the base-2 logarithm of the page size
+               in bytes.
+
+               In case the M2P does not contiguously cover pages
+               starting from MFN zero, a discontiguity is indicated
+               by a field with order set to zero. The high bits of
+               the field then provide the MFN for which the
+               subsequent M2P data page provides data.
+--------------------------------------------------------------------
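+
+The packing of address and order into one field can be unpicked as in
+the sketch below. The names are hypothetical. For a discontiguity
+marker the sketch returns the high bits unmodified, since the text
+above does not say how the MFN is positioned within them.
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+#define M2P_ORDER_MASK UINT64_C(0xfff)   /* low 12 bits of the field */
+
+/* Hypothetical decoded form of one m2p_page_data field. */
+struct m2p_entry {
+    bool     is_gap;     /* true for a discontiguity marker (order 0) */
+    unsigned order;      /* value of the low 12 bits */
+    uint64_t high_bits;  /* field with the low 12 bits cleared */
+};
+
+static struct m2p_entry m2p_decode(uint64_t field)
+{
+    struct m2p_entry e;
+
+    e.order     = (unsigned)(field & M2P_ORDER_MASK);
+    e.is_gap    = (e.order == 0);
+    e.high_bits = field & ~M2P_ORDER_MASK;
+    return e;
+}
+
+/* Inverse of m2p_decode() for a data page; addr must be 4KiB aligned. */
+static uint64_t m2p_encode_page(uint64_t addr, unsigned order)
+{
+    return (addr & ~M2P_ORDER_MASK) | (order & M2P_ORDER_MASK);
+}
+```
+
+With the example from the table, `m2p_decode(0x4C000001E)` yields an
+order of 30 and a `high_bits` value of 0x4C0000000.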
+
+\clearpage
+
+Domain Specific Records
+=======================
+
+
+LU_DOMAIN_INFO
+--------------
+
+The domain info record contains general properties necessary to
+recreate a domain in the receiving Xen, and marks the start of a set
+of other domain-specific records pertaining to that domain.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-----------+-------------+
+    | type                  | page_shift| domain_id   |
+    +-----------------------+-----------+-------------+
+    | domain_handle[0-7]                              |
+    +-------------------------------------------------+
+    | domain_handle[8-15]                             |
+    +-----------------------+-------------------------+
+    | ssidref               | flags                   |
+    +-----------------------+-------------------------+
+    | max_vcpus             | emulation_flags         |
+    +-----------------------+-------------------------+
+    | extra_flags           | (padding)               |
+    +-----------------------+-------------------------+
+
+
+--------------------------------------------------------------------
+Field           Description
+--------------- ----------------------------------------------------
+type            0x0000: Reserved.
+
+                0x0001: x86 PV.
+
+                0x0002: x86 HVM.
+
+                0x0003 - 0xFFFFFFFF: Reserved.
+
+page_shift      Size of a guest page as a power of two.
+
+                i.e., page size = 2 ^page_shift^.
+
+domain_id       Domain ID.
+
+domain_handle   UUID domain handle.
+
+ssidref         Security Identifier Index.
+
+flags           Domain flags using `XEN_DOMCTL_CDF_`.
+
+max_vcpus       Maximum vCPUs for domain.
+
+emulation_flags Emulation flags using `XEN_X86_EMU_`.
+
+extra_flags     Additional flags:
+
+                0x00000001: Is privileged
+--------------------------------------------------------------------
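+
+For reference, the layout in the diagram corresponds to a 48-octet
+body. The following C sketch, whose struct, macro and helper names are
+all hypothetical, expresses that layout and checks the field offsets
+at compile time. Field widths are read off the diagram above.
+
+```c
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#define LU_DOMAIN_TYPE_X86_PV      UINT32_C(0x0001)
+#define LU_DOMAIN_TYPE_X86_HVM     UINT32_C(0x0002)
+#define LU_DOMAIN_EXTRA_PRIVILEGED UINT32_C(0x00000001)
+
+/* Hypothetical C view of the LU_DOMAIN_INFO record body. */
+struct lu_domain_info {
+    uint32_t type;
+    uint16_t page_shift;
+    uint16_t domain_id;
+    uint8_t  domain_handle[16];
+    uint32_t ssidref;
+    uint32_t flags;
+    uint32_t max_vcpus;
+    uint32_t emulation_flags;
+    uint32_t extra_flags;
+    uint32_t pad;
+};
+
+/* The compiler must not insert padding beyond what the diagram shows. */
+_Static_assert(sizeof(struct lu_domain_info) == 48, "unexpected size");
+_Static_assert(offsetof(struct lu_domain_info, domain_handle) == 8,
+               "unexpected domain_handle offset");
+_Static_assert(offsetof(struct lu_domain_info, ssidref) == 24,
+               "unexpected ssidref offset");
+_Static_assert(offsetof(struct lu_domain_info, extra_flags) == 40,
+               "unexpected extra_flags offset");
+
+/* Guest page size in bytes, or 0 if page_shift is out of range. */
+static uint64_t lu_domain_page_size(const struct lu_domain_info *di)
+{
+    if (di->page_shift >= 64)
+        return 0;
+    return UINT64_C(1) << di->page_shift;
+}
+
+static bool lu_domain_is_privileged(const struct lu_domain_info *di)
+{
+    return (di->extra_flags & LU_DOMAIN_EXTRA_PRIVILEGED) != 0;
+}
+```
+
+Because every field is naturally aligned, no packing attribute is
+needed for the struct to match the on-stream layout on common ABIs.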
+
+\clearpage
+
+Future Extensions
+=================
+
+All changes to this specification should bump the revision number in
+the title block.
+
+All changes to the image or domain headers require the image version
+to be increased.
+
+The format may be extended by adding additional record types.
+
+Extending an existing record type must be done by adding a new record
+type. This allows old images with the old record to still be
+restored.
+
+The image header may only be extended by _appending_ additional
+fields. In particular, the `marker`, `id` and `version` fields must
+never change size or location.
+
+