From 41beacd9259babc58c37c206122f755b3ae66c55 Mon Sep 17 00:00:00 2001
From: "Daniel P. Berrange"
-The libvirt LXC driver manages "Linux Containers". Containers are sets of processes
-with private namespaces which can (but don't always) look like separate machines, but
-do not have their own OS. Here are two example configurations. The first is a very
-light-weight "application container" which does not have its own root image.
+The libvirt LXC driver manages "Linux Containers". At their simplest, containers
+can just be thought of as a collection of processes, separated from the main
+host processes via a set of resource namespaces and constrained via control
+group resource tunables. The libvirt LXC driver has no dependency on the LXC
+userspace tools hosted on sourceforge.net. It directly utilizes the relevant
+kernel features to build the container environment. This allows many libvirt
+technologies to be shared across both the QEMU/KVM and LXC drivers, in
+particular sVirt for mandatory access control, auditing of operations,
+integration with control groups and many other features.
+In order to control the resource usage of processes inside containers, the
+libvirt LXC driver requires that certain cgroups controllers are mounted on
+the host OS. The minimum required controllers are 'cpuacct', 'memory' and
+'devices', while recommended extra controllers are 'cpu', 'freezer' and
+'blkio'. Libvirt will not mount the cgroups filesystem itself, leaving
+this task to the init system. Systemd will do the right thing
+in this respect, while for other init systems the
-The libvirt LXC driver requires that certain cgroups controllers are
-mounted on the host OS. The minimum required controllers are 'cpuacct',
-'memory' and 'devices', while recommended extra controllers are
-'cpu', 'freezer' and 'blkio'. The /etc/cgconfig.conf & cgconfig
-init service used to mount cgroups at host boot time. To manually
-mount them use:
+In order to separate processes inside a container from those in the
+primary "host" OS environment, the libvirt LXC driver requires that
+certain kernel namespaces are compiled in. Libvirt currently requires
+the 'mount', 'ipc', 'pid', and 'uts' namespaces to be available. If
+separate network interfaces are desired, then the 'net' namespace is
+required. In the near future, the 'user' namespace will optionally be
+supported.
+NOTE: In the absence of support for the 'user' namespace,
+processes inside containers cannot be securely isolated from host
+processes without the use of a mandatory access control technology
+such as SELinux or AppArmor.
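Each namespace that the running kernel supports normally has an entry under /proc/self/ns; the mount namespace appears there as `mnt`. A rough host-side check is therefore to look for those entries. The sketch below is a hypothetical helper, not a libvirt tool. It takes the directory as a parameter so the logic can be tried against a scratch directory, and it only checks kernel support, not whether libvirt will actually use a given namespace.

```shell
#!/bin/sh
# Hypothetical helper (not part of libvirt): print each namespace that the
# LXC driver needs but that has no entry in NS_DIR (default /proc/self/ns).
# Extra namespace names, such as "net", may be passed after NS_DIR.
# Usage: missing_namespaces [NS_DIR] [NAME...]
missing_namespaces() {
    ns_dir=${1:-/proc/self/ns}
    if [ "$#" -gt 0 ]; then
        shift
    fi
    # "mnt" is the /proc entry name for the mount namespace.
    for ns in mnt ipc pid uts "$@"; do
        if [ ! -e "$ns_dir/$ns" ]; then
            echo "$ns"
        fi
    done
}

# Example on the host, also requiring the network namespace:
# missing_namespaces /proc/self/ns net
```

Empty output means every requested namespace has an entry.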
+
-NB, the blkio controller in some kernels will not allow creation of nested
-sub-directories which will prevent correct operation of the libvirt LXC
-driver. On such kernels, it may be necessary to unmount the blkio controller.
+When the container "init" process is started, it will typically
+not be given any command line arguments (e.g. the equivalent of
+the bootloader args visible in
When the container "init" process is started, it will be given several useful
-environment variables.
+environment variables. The following standard environment variables are mandated
+by the systemd container interface
+to be provided by all container technologies on Linux.
+
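Libvirt sets the standard `container` variable to `libvirt-lxc`. A script run as the container init can therefore use it to detect where it is running. This is a minimal sketch: `is_libvirt_lxc` is a hypothetical name, and other container technologies are assumed to set a different value, or none.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the "container" variable from the
# systemd container interface identifies libvirt as the creator.
is_libvirt_lxc() {
    [ "${container:-}" = "libvirt-lxc" ]
}

# Example use in an init script:
# if is_libvirt_lxc; then echo "started by the libvirt LXC driver"; fi
```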
+In addition to the standard variables, the following libvirt specific
+environment variables are also provided
+In the absence of any explicit configuration, the container will
+inherit the host OS filesystem mounts. A number of mount points will
+be made read-only, or re-mounted with new instances to provide
+container-specific data. The following special mounts are set up
+by libvirt
+
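Inside a running container, you can check whether a given mount point really is read-only by reading the options column of /proc/self/mounts. This sketch assumes the usual layout of that file: source, mount point, type, comma-separated options, and so on. The helper name `mount_is_readonly` is hypothetical, and the file is a parameter so the logic can be exercised against a saved copy. Mount points containing spaces, which that file escapes as \040, are not handled.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the last entry for MOUNTPOINT in a
# /proc/self/mounts-style file carries the "ro" option.
# Usage: mount_is_readonly MOUNTPOINT [MOUNTS_FILE]
mount_is_readonly() {
    mp=$1
    file=${2:-/proc/self/mounts}
    # Field 2 is the mount point, field 4 the options. A later mount on the
    # same path shadows earlier ones, so keep the last match.
    opts=$(awk -v mp="$mp" '$2 == mp { o = $4 } END { print o }' "$file")
    case ",$opts," in
        *,ro,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example, run inside the container:
# mount_is_readonly /sys && echo "/sys is read-only"
```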
+The container init process will be started with
+In addition, for every console defined in the guest configuration,
+a symlink will be created from
+Further block or character devices will be made available to containers
+depending on their configuration.
+
+In the absence of the "user" namespace being used, containers cannot
+be considered secure against exploits of the host OS. The sVirt SELinux
+driver provides a way to secure containers even when the "user" namespace
+is not used. The cost is that writing a policy to allow execution of an
+arbitrary OS is not practical. The SELinux sVirt policy is typically
+tailored to work with a simpler application confinement use case,
+as provided by the "libvirt-sandbox" project.
+
+The LXC driver is integrated with libvirt's auditing subsystem, which
+causes audit messages to be logged whenever there is an operation
+performed against a container which has impact on host resources.
+For example, start/stop and device hotplug operations will all log audit
+messages providing details about what action occurred and any resources
+associated with it. There are three types of audit message.
+
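On hosts running auditd, these records are normally written to /var/log/audit/audit.log. The `ausearch -m VIRT_CONTROL` command from the audit package is usually the better way to query them. For a quick look without ausearch, a plain text filter like the hypothetical one below can pick out container start events. It assumes that each record occupies one line and carries `type=VIRT_CONTROL` plus a whitespace-separated `op=start` field. That record layout is an assumption, not something this document specifies.

```shell
#!/bin/sh
# Hypothetical filter: print VIRT_CONTROL audit records for container
# starts. Assumes one record per line, with "op=start" preceded by
# whitespace somewhere after the record type.
# Usage: lxc_start_events [AUDIT_LOG]
lxc_start_events() {
    log=${1:-/var/log/audit/audit.log}
    grep -E 'type=VIRT_CONTROL[[:space:]].*[[:space:]]op=start([[:space:]]|$)' "$log"
}
```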
+All containers are launched with the CAP_MKNOD capability cleared
+and removed from the bounding set. Libvirt will ensure that the
+/dev filesystem is pre-populated with all devices that a container
+is allowed to use. In addition, the cgroup "devices" controller is
+configured to block read/write/mknod from all devices except those
+that a container is authorized to use.
+
+As with any libvirt virtualization driver, LXC containers can be
+managed via a wide variety of libvirt based tools. At the lowest
+level the
-In both cases, you can define and start a container using:
LXC container driver
+
+
+
Project Links
+Control groups Requirements
-
-
+cgconfig
+init service will be required. For further information, consult the general
+libvirt cgroups documentation.
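The kernel lists the controllers it supports in /proc/cgroups, one per line. The columns are subsystem name, hierarchy ID, number of cgroups and an enabled flag. The hypothetical helper below lists any of the minimum required controllers that are missing or disabled there. It is only a sketch: it checks kernel support, not whether the init system has actually mounted the controllers.

```shell
#!/bin/sh
# Hypothetical helper: print each controller required by the libvirt LXC
# driver that is absent or disabled in a /proc/cgroups-style file.
# Usage: missing_cgroup_controllers [CGROUPS_FILE]
missing_cgroup_controllers() {
    file=${1:-/proc/cgroups}
    for ctrl in cpuacct memory devices; do
        # Columns: subsys_name hierarchy num_cgroups enabled.
        # The "#subsys_name" header line never matches a controller name.
        enabled=$(awk -v c="$ctrl" '$1 == c { print $4 }' "$file")
        [ "$enabled" = "1" ] || echo "$ctrl"
    done
}
```

Run it with no argument to check the live kernel. Empty output means all three controllers are enabled.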
+Cgroups Requirements
+Namespace requirements
- # mount -t cgroup cgroup /dev/cgroup -o cpuacct,memory,devices,cpu,freezer,blkio
-
+Default container setup
+
+Command line arguments
/proc/cmdline
). If
+any arguments are desired, then they must be explicitly set in the
+container XML configuration via one or more initarg
+elements. For example, to run systemd --unit emergency.service,
+the following XML would be used:
+ <os>
+ <type arch='x86_64'>exe</type>
+ <init>/bin/systemd</init>
+ <initarg>--unit</initarg>
+ <initarg>emergency.service</initarg>
+ </os>
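When a script generates the init command line instead of someone writing it by hand, each argument maps to its own initarg element and has to be XML-escaped. This is a minimal sketch with two hypothetical helpers. The `arch='x86_64'` value is copied from the example above and is an assumption about the host.

```shell
#!/bin/sh
# Hypothetical helpers: emit an <os> element whose <init> and <initarg>
# children carry the given command line, escaping XML special characters.
xml_escape() {
    # '&' is replaced first so the entities added afterwards are not re-escaped.
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Usage: os_block INIT [ARG...]
os_block() {
    init=$1
    shift
    printf '<os>\n'
    printf "  <type arch='x86_64'>exe</type>\n"
    printf '  <init>%s</init>\n' "$(xml_escape "$init")"
    for arg in "$@"; do
        # One <initarg> element per argument, preserving order.
        printf '  <initarg>%s</initarg>\n' "$(xml_escape "$arg")"
    done
    printf '</os>\n'
}
```

Calling `os_block /bin/systemd --unit emergency.service` prints the same elements as the configuration above, apart from indentation.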
+
-Environment setup for the container init
+Environment variables
+
+
+libvirt-lxc
to identify libvirt as the creator
/bin:/usr/bin
linux
@@ -54,9 +105,152 @@ environment variables.
+initarg
config element.
Filesystem mounts
+
+
+
+
+
/dev
a new "tmpfs" pre-populated with authorized device nodes
/dev/pts
a new private "devpts" instance for console devices
/sys
the host "sysfs" instance remounted read-only
/proc
a new instance of the "proc" filesystem
/proc/sys
the host "/proc/sys" bind-mounted read-only
/sys/fs/selinux
the host "selinux" instance remounted read-only
/sys/fs/cgroup/NNNN
the host cgroups controllers bind-mounted to
+only expose the sub-tree associated with the container
/proc/meminfo
a FUSE-backed file reflecting memory limits of the container
Device nodes
+
+CAP_MKNOD
+capability removed and blocked from re-acquiring it. As such, it will
+not be able to create any device nodes in /dev or anywhere
+else in its filesystems. Libvirt itself will take care of pre-populating
+the /dev filesystem with any devices that the container
+is authorized to use. The current devices that will be made available
+to all containers are
+
+
+
+/dev/zero
/dev/null
/dev/full
/dev/random
/dev/urandom
/dev/stdin
symlinked to /proc/self/fd/0
/dev/stdout
symlinked to /proc/self/fd/1
/dev/stderr
symlinked to /proc/self/fd/2
/dev/fd
symlinked to /proc/self/fd
/dev/ptmx
symlinked to /dev/pts/ptmx
/dev/console
symlinked to /dev/pts/0
/dev/ttyN
symlinked to the corresponding /dev/pts/M pseudo TTY device. The
+first console will be /dev/tty1, with further consoles
+numbered incrementally from there.
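From inside a container, the pre-populated /dev can be compared against the entries listed above. The hypothetical helper below prints any that are absent. The per-console /dev/ttyN links are left out because they depend on which consoles are configured. Symlinks count as present even when their target does not resolve, since several entries point into /proc or /dev/pts.

```shell
#!/bin/sh
# Hypothetical check: print each standard container /dev entry missing
# from DEV_DIR (default /dev). Per-console ttyN links are not checked.
# Usage: missing_dev_entries [DEV_DIR]
missing_dev_entries() {
    dev=${1:-/dev}
    for name in zero null full random urandom \
                stdin stdout stderr fd ptmx console; do
        # -L also accepts a symlink whose target does not currently resolve.
        if [ ! -e "$dev/$name" ] && [ ! -L "$dev/$name" ]; then
            echo "$name"
        fi
    done
}
```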
+Container security
+
+sVirt SELinux
+
+Auditing
+
+
+
+
+VIRT_MACHINE_ID - details of the SELinux process and
+image security labels assigned to the container.
+VIRT_CONTROL - details of an action / operation
+performed against a container. There are the following types of
+operation:
+  op=start - a container has been started. Provides
+    the machine name, uuid and PID of the libvirt_lxc
+    controller process.
+  op=init - the init PID of the container has been
+    started. Provides the machine name, uuid and PID of the
+    libvirt_lxc controller process and PID of the
+    init process (in the host PID namespace).
+  op=stop - a container has been stopped. Provides
+    the machine name and uuid.
+VIRT_RESOURCE - details of a host resource
+associated with a container action.
Device access
+
+Example configurations
Example config version 1
@@ -121,21 +315,158 @@ debootstrap, whatever) under /opt/vm-1-root:
</domain>
+
+Container usage / management
+
virsh command can be used to perform many
+tasks, by passing the -c lxc:/// argument. As an
+alternative to repeating the URI with every command, the LIBVIRT_DEFAULT_URI
+environment variable can be set to lxc:///. The
+examples that follow outline some common operations with virsh
+and LXC. For further details about the usage of virsh, consult its
+manual page.
+Defining (saving) container configuration
+
+The virsh define command takes an XML configuration
+document and loads it into libvirt, saving the configuration on disk.
+
+# virsh -c lxc:/// define myguest.xml
+The virsh dumpxml command can be used to view the
+current XML configuration of a container. By default, the XML
+output reflects the current state of the container. If the
+container is running, it is possible to explicitly request the
+persistent configuration, instead of the current live configuration,
+using the --inactive flag.
+
+# virsh -c lxc:/// dumpxml myguest
+The virsh start command can be used to start a
+container from a previously defined persistent configuration.
+
+# virsh -c lxc:/// start myguest
+It is also possible to start so-called "transient" containers,
+which do not require a persistent configuration to be saved
+by libvirt, using the virsh create command.
+
-virsh --connect lxc:/// define v1.xml
-virsh --connect lxc:/// start vm1
+# virsh -c lxc:/// create myguest.xml
-and then get a console using:
+The virsh shutdown command can be used
+to request a graceful shutdown of the container. By default
+this command will first attempt to send a message to the
+init process via the /dev/initctl device node.
+If no such device node exists, then it will send SIGTERM
+to PID 1 inside the container.
+
-virsh --connect lxc:/// console vm1
+# virsh -c lxc:/// shutdown myguest
-Now doing 'ps -ef' will only show processes in the container, for
-instance. You can undefine it using
+If the container does not respond to the graceful shutdown
+request, it can be forcibly stopped using the virsh destroy command.
+
-virsh --connect lxc:/// undefine vm1
+# virsh -c lxc:/// destroy myguest
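Scripts often combine the two commands: request a graceful shutdown, wait for a bounded time, then fall back to destroy. The sketch below uses a 30 second default timeout, an arbitrary choice, and polls `virsh domstate`. The name `stop_container` is hypothetical. A failed state lookup, which happens once a transient container has gone away, is treated as stopped.

```shell
#!/bin/sh
# Hypothetical helper: graceful shutdown with a forced fallback.
# Usage: stop_container NAME [TIMEOUT_SECONDS]
stop_container() {
    name=$1
    remaining=${2:-30}
    virsh -c lxc:/// shutdown "$name" >/dev/null || return 1
    while [ "$remaining" -gt 0 ]; do
        # A lookup failure (e.g. a transient container that no longer
        # exists) is treated the same as "shut off".
        state=$(virsh -c lxc:/// domstate "$name" 2>/dev/null) || return 0
        [ "$state" = "shut off" ] && return 0
        sleep 1
        remaining=$((remaining - 1))
    done
    # The container ignored the request within the timeout: force it off.
    virsh -c lxc:/// destroy "$name" >/dev/null
}
```

With LIBVIRT_DEFAULT_URI set to lxc:///, the explicit -c lxc:/// arguments could be dropped.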
+The virsh reboot command can be used
+to request a graceful reboot of the container. By default
+this command will first attempt to send a message to the
+init process via the /dev/initctl device node.
+If no such device node exists, then it will send SIGHUP
+to PID 1 inside the container.
+
+# virsh -c lxc:/// reboot myguest
+The virsh undefine command can be used to delete the
+persistent configuration of a container. If the guest is currently
+running, this will turn it into a "transient" guest.
+
+# virsh -c lxc:/// undefine myguest
+The virsh console command can be used to connect
+to the text console associated with a container. If the container
+has been configured with multiple console devices, then the
+--devname argument can be used to choose the
+console to connect to.
+
+# virsh -c lxc:/// console myguest
+The virsh lxc-enter-namespace command can be used
+to enter the namespaces and security context of a container
+and then execute an arbitrary command.
+
+# virsh -c lxc:/// lxc-enter-namespace myguest -- /bin/ls -al /dev
+The virt-top command can be used to monitor the
+activity and resource utilization of all containers on a
+host.
+
+# virt-top -c lxc:///
-- 
2.39.5