diff Documentation/cpusets.txt @ 0:831230e53067

Import 2.6.18 from kernel.org tarball.
author Ian Campbell <ian.campbell@xensource.com>
date Wed Apr 11 14:15:44 2007 +0100 (2007-04-11)
line diff
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/Documentation/cpusets.txt	Wed Apr 11 14:15:44 2007 +0100
     1.3 @@ -0,0 +1,618 @@
     1.4 +				CPUSETS
     1.5 +				-------
     1.6 +
     1.7 +Copyright (C) 2004 BULL SA.
     1.8 +Written by Simon.Derr@bull.net
     1.9 +
    1.10 +Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
    1.11 +Modified by Paul Jackson <pj@sgi.com>
    1.12 +Modified by Christoph Lameter <clameter@sgi.com>
    1.13 +
    1.14 +CONTENTS:
    1.15 +=========
    1.16 +
    1.17 +1. Cpusets
    1.18 +  1.1 What are cpusets ?
    1.19 +  1.2 Why are cpusets needed ?
    1.20 +  1.3 How are cpusets implemented ?
    1.21 +  1.4 What are exclusive cpusets ?
    1.22 +  1.5 What does notify_on_release do ?
    1.23 +  1.6 What is memory_pressure ?
    1.24 +  1.7 What is memory spread ?
    1.25 +  1.8 How do I use cpusets ?
    1.26 +2. Usage Examples and Syntax
    1.27 +  2.1 Basic Usage
    1.28 +  2.2 Adding/removing cpus
    1.29 +  2.3 Setting flags
    1.30 +  2.4 Attaching processes
    1.31 +3. Questions
    1.32 +4. Contact
    1.33 +
    1.34 +1. Cpusets
    1.35 +==========
    1.36 +
    1.37 +1.1 What are cpusets ?
    1.38 +----------------------
    1.39 +
    1.40 +Cpusets provide a mechanism for assigning a set of CPUs and Memory
    1.41 +Nodes to a set of tasks.
    1.42 +
    1.43 +Cpusets constrain the CPU and Memory placement of tasks to only
     1.44 +the resources within a task's current cpuset.  They form a nested
    1.45 +hierarchy visible in a virtual file system.  These are the essential
    1.46 +hooks, beyond what is already present, required to manage dynamic
    1.47 +job placement on large systems.
    1.48 +
    1.49 +Each task has a pointer to a cpuset.  Multiple tasks may reference
    1.50 +the same cpuset.  Requests by a task, using the sched_setaffinity(2)
    1.51 +system call to include CPUs in its CPU affinity mask, and using the
    1.52 +mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
     1.53 +in its memory policy, are both filtered through that task's cpuset,
    1.54 +filtering out any CPUs or Memory Nodes not in that cpuset.  The
    1.55 +scheduler will not schedule a task on a CPU that is not allowed in
    1.56 +its cpus_allowed vector, and the kernel page allocator will not
     1.57 +allocate a page on a node that is not allowed in the requesting task's
    1.58 +mems_allowed vector.
    1.59 +
    1.60 +User level code may create and destroy cpusets by name in the cpuset
    1.61 +virtual file system, manage the attributes and permissions of these
    1.62 +cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
    1.63 +specify and query to which cpuset a task is assigned, and list the
    1.64 +task pids assigned to a cpuset.
    1.65 +
    1.66 +
    1.67 +1.2 Why are cpusets needed ?
    1.68 +----------------------------
    1.69 +
    1.70 +The management of large computer systems, with many processors (CPUs),
    1.71 +complex memory cache hierarchies and multiple Memory Nodes having
    1.72 +non-uniform access times (NUMA) presents additional challenges for
    1.73 +the efficient scheduling and memory placement of processes.
    1.74 +
     1.75 +Frequently, more modest-sized systems can be operated with adequate
    1.76 +efficiency just by letting the operating system automatically share
    1.77 +the available CPU and Memory resources amongst the requesting tasks.
    1.78 +
    1.79 +But larger systems, which benefit more from careful processor and
    1.80 +memory placement to reduce memory access times and contention,
    1.81 +and which typically represent a larger investment for the customer,
    1.82 +can benefit from explicitly placing jobs on properly sized subsets of
    1.83 +the system.
    1.84 +
    1.85 +This can be especially valuable on:
    1.86 +
    1.87 +    * Web Servers running multiple instances of the same web application,
    1.88 +    * Servers running different applications (for instance, a web server
    1.89 +      and a database), or
    1.90 +    * NUMA systems running large HPC applications with demanding
    1.91 +      performance characteristics.
     1.92 +    * Also, cpu_exclusive cpusets are useful for servers running orthogonal
     1.93 +      workloads, such as RT applications requiring low latency and HPC
     1.94 +      applications that are throughput sensitive.
    1.95 +
     1.96 +These subsets, or "soft partitions", must be able to be dynamically
     1.97 +adjusted, as the job mix changes, without impacting other concurrently
     1.98 +executing jobs. The location of the running jobs' pages may also be moved
     1.99 +when the memory locations are changed.
   1.100 +
   1.101 +The kernel cpuset patch provides the minimum essential kernel
   1.102 +mechanisms required to efficiently implement such subsets.  It
   1.103 +leverages existing CPU and Memory Placement facilities in the Linux
   1.104 +kernel to avoid any additional impact on the critical scheduler or
   1.105 +memory allocator code.
   1.106 +
   1.107 +
   1.108 +1.3 How are cpusets implemented ?
   1.109 +---------------------------------
   1.110 +
   1.111 +Cpusets provide a Linux kernel mechanism to constrain which CPUs and
   1.112 +Memory Nodes are used by a process or set of processes.
   1.113 +
   1.114 +The Linux kernel already has a pair of mechanisms to specify on which
   1.115 +CPUs a task may be scheduled (sched_setaffinity) and on which Memory
   1.116 +Nodes it may obtain memory (mbind, set_mempolicy).
   1.117 +
    1.118 +Cpusets extend these two mechanisms as follows:
   1.119 +
   1.120 + - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   1.121 +   kernel.
   1.122 + - Each task in the system is attached to a cpuset, via a pointer
   1.123 +   in the task structure to a reference counted cpuset structure.
    1.124 + - Calls to sched_setaffinity are filtered to just those CPUs
    1.125 +   allowed in that task's cpuset.
    1.126 + - Calls to mbind and set_mempolicy are filtered to just
    1.127 +   those Memory Nodes allowed in that task's cpuset.
    1.128 + - The root cpuset contains all the system's CPUs and Memory
    1.129 +   Nodes.
    1.130 + - For any cpuset, one can define child cpusets containing a subset
    1.131 +   of the parent's CPU and Memory Node resources.
   1.132 + - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   1.133 +   browsing and manipulation from user space.
    1.134 + - A cpuset may be marked exclusive, which ensures that no other
    1.135 +   cpuset (except direct ancestors and descendants) may contain
    1.136 +   any overlapping CPUs or Memory Nodes.
    1.137 +   A cpu_exclusive cpuset is also associated with a sched
    1.138 +   domain.
   1.139 + - You can list all the tasks (by pid) attached to any cpuset.
   1.140 +
    1.141 +The implementation of cpusets requires a few simple hooks
   1.142 +into the rest of the kernel, none in performance critical paths:
   1.143 +
   1.144 + - in init/main.c, to initialize the root cpuset at system boot.
   1.145 + - in fork and exit, to attach and detach a task from its cpuset.
    1.146 + - in sched_setaffinity, to mask the requested CPUs by what's
    1.147 +   allowed in that task's cpuset.
   1.148 + - in sched.c migrate_all_tasks(), to keep migrating tasks within
   1.149 +   the CPUs allowed by their cpuset, if possible.
   1.150 + - in sched.c, a new API partition_sched_domains for handling
   1.151 +   sched domain changes associated with cpu_exclusive cpusets
   1.152 +   and related changes in both sched.c and arch/ia64/kernel/domain.c
    1.153 + - in the mbind and set_mempolicy system calls, to mask the requested
    1.154 +   Memory Nodes by what's allowed in that task's cpuset.
   1.155 + - in page_alloc.c, to restrict memory to allowed nodes.
    1.156 + - in vmscan.c, to restrict page reclaim to the current cpuset.
   1.157 +
    1.158 +In addition, a new file system of type "cpuset" may be mounted,
   1.159 +typically at /dev/cpuset, to enable browsing and modifying the cpusets
   1.160 +presently known to the kernel.  No new system calls are added for
   1.161 +cpusets - all support for querying and modifying cpusets is via
   1.162 +this cpuset file system.
   1.163 +
   1.164 +Each task under /proc has an added file named 'cpuset', displaying
   1.165 +the cpuset name, as the path relative to the root of the cpuset file
   1.166 +system.
   1.167 +
   1.168 +The /proc/<pid>/status file for each task has two added lines,
    1.169 +displaying the task's cpus_allowed (on which CPUs it may be scheduled)
   1.170 +and mems_allowed (on which Memory Nodes it may obtain memory),
   1.171 +in the format seen in the following example:
   1.172 +
   1.173 +  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
   1.174 +  Mems_allowed:   ffffffff,ffffffff
   1.175 +
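On a running system these two lines can be read back directly; as a
quick sketch, reading the current shell's own status:

```shell
# Print the current task's allowed-CPUs and allowed-Memory-Nodes masks
# from its /proc status file.
grep -E '^(Cpus|Mems)_allowed' /proc/self/status
```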
   1.176 +Each cpuset is represented by a directory in the cpuset file system
   1.177 +containing the following files describing that cpuset:
   1.178 +
   1.179 + - cpus: list of CPUs in that cpuset
   1.180 + - mems: list of Memory Nodes in that cpuset
    1.181 + - memory_migrate flag: if set, move pages to the cpuset's nodes
   1.182 + - cpu_exclusive flag: is cpu placement exclusive?
   1.183 + - mem_exclusive flag: is memory placement exclusive?
   1.184 + - tasks: list of tasks (by pid) attached to that cpuset
   1.185 + - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
    1.186 + - memory_pressure: measure of the paging pressure in that cpuset
   1.187 +
    1.188 +In addition, only the root cpuset has the following file:
   1.189 + - memory_pressure_enabled flag: compute memory_pressure?
   1.190 +
   1.191 +New cpusets are created using the mkdir system call or shell
   1.192 +command.  The properties of a cpuset, such as its flags, allowed
   1.193 +CPUs and Memory Nodes, and attached tasks, are modified by writing
    1.194 +to the appropriate file in that cpuset's directory, as listed above.
   1.195 +
   1.196 +The named hierarchical structure of nested cpusets allows partitioning
   1.197 +a large system into nested, dynamically changeable, "soft-partitions".
   1.198 +
    1.199 +The attachment of each task to a cpuset, automatically inherited at
    1.200 +fork by any children of that task, allows organizing the work load
    1.201 +on a system into related sets of tasks such that each set is constrained
    1.202 +to using the CPUs and Memory Nodes of a particular cpuset.  A task
    1.203 +may be re-attached to any other cpuset, if allowed by the permissions
    1.204 +on the necessary cpuset file system directories.
   1.205 +
   1.206 +Such management of a system "in the large" integrates smoothly with
   1.207 +the detailed placement done on individual tasks and memory regions
   1.208 +using the sched_setaffinity, mbind and set_mempolicy system calls.
   1.209 +
   1.210 +The following rules apply to each cpuset:
   1.211 +
    1.212 + - Its CPUs and Memory Nodes must be a subset of its parent's.
    1.213 + - It can only be marked exclusive if its parent is.
    1.214 + - If its cpu or memory placement is exclusive, its CPUs or Memory
    1.215 +   Nodes may not overlap those of any sibling.
   1.215 +
   1.216 +These rules, and the natural hierarchy of cpusets, enable efficient
   1.217 +enforcement of the exclusive guarantee, without having to scan all
    1.218 +cpusets every time any of them change to ensure nothing overlaps an
    1.219 +exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
   1.220 +to represent the cpuset hierarchy provides for a familiar permission
   1.221 +and name space for cpusets, with a minimum of additional kernel code.
   1.222 +
   1.223 +The cpus file in the root (top_cpuset) cpuset is read-only.
   1.224 +It automatically tracks the value of cpu_online_map, using a CPU
   1.225 +hotplug notifier.  If and when memory nodes can be hotplugged,
   1.226 +we expect to make the mems file in the root cpuset read-only
   1.227 +as well, and have it track the value of node_online_map.
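As a sketch, the online CPU map that the root cpuset's 'cpus' file
tracks can be compared against what sysfs reports (the /dev/cpuset
path assumes the cpuset file system is mounted there):

```shell
# cpu_online_map, as exported through sysfs:
cat /sys/devices/system/cpu/online
# The read-only root cpuset 'cpus' file tracks the same set, if mounted:
if [ -r /dev/cpuset/cpus ]; then
    cat /dev/cpuset/cpus
fi
```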
   1.228 +
   1.229 +
   1.230 +1.4 What are exclusive cpusets ?
   1.231 +--------------------------------
   1.232 +
   1.233 +If a cpuset is cpu or mem exclusive, no other cpuset, other than
    1.234 +a direct ancestor or descendant, may share any of the same CPUs or
   1.235 +Memory Nodes.
   1.236 +
   1.237 +A cpuset that is cpu_exclusive has a scheduler (sched) domain
   1.238 +associated with it.  The sched domain consists of all CPUs in the
   1.239 +current cpuset that are not part of any exclusive child cpusets.
   1.240 +This ensures that the scheduler load balancing code only balances
   1.241 +against the CPUs that are in the sched domain as defined above and
   1.242 +not all of the CPUs in the system. This removes any overhead due to
   1.243 +load balancing code trying to pull tasks outside of the cpu_exclusive
   1.244 +cpuset only to be prevented by the tasks' cpus_allowed mask.
   1.245 +
   1.246 +A cpuset that is mem_exclusive restricts kernel allocations for
   1.247 +page, buffer and other data commonly shared by the kernel across
   1.248 +multiple users.  All cpusets, whether mem_exclusive or not, restrict
   1.249 +allocations of memory for user space.  This enables configuring a
   1.250 +system so that several independent jobs can share common kernel data,
    1.251 +such as file system pages, while isolating each job's user allocation in
   1.252 +its own cpuset.  To do this, construct a large mem_exclusive cpuset to
   1.253 +hold all the jobs, and construct child, non-mem_exclusive cpusets for
   1.254 +each individual job.  Only a small amount of typical kernel memory,
   1.255 +such as requests from interrupt handlers, is allowed to be taken
   1.256 +outside even a mem_exclusive cpuset.
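The layout described above can be sketched as follows; the cpuset
names ('jobs', 'job1', 'job2') are hypothetical, the mount point is
passed in as an argument, and each new cpuset would still need its
cpus and mems assigned before use:

```shell
# Sketch: one large mem_exclusive parent so the jobs share kernel data,
# with non-mem_exclusive children isolating each job's user allocations.
# $1 is the cpuset fs mount point (normally /dev/cpuset).
make_job_cpusets() {
    base="$1"
    mkdir "$base/jobs"                       # parent cpuset holding all jobs
    /bin/echo 1 > "$base/jobs/mem_exclusive" # jobs share kernel data inside
    mkdir "$base/jobs/job1"                  # per-job children, left
    mkdir "$base/jobs/job2"                  # non-mem_exclusive
}
```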
   1.257 +
   1.258 +
   1.259 +1.5 What does notify_on_release do ?
   1.260 +------------------------------------
   1.261 +
   1.262 +If the notify_on_release flag is enabled (1) in a cpuset, then whenever
   1.263 +the last task in the cpuset leaves (exits or attaches to some other
   1.264 +cpuset) and the last child cpuset of that cpuset is removed, then
   1.265 +the kernel runs the command /sbin/cpuset_release_agent, supplying the
   1.266 +pathname (relative to the mount point of the cpuset file system) of the
   1.267 +abandoned cpuset.  This enables automatic removal of abandoned cpusets.
   1.268 +The default value of notify_on_release in the root cpuset at system
   1.269 +boot is disabled (0).  The default value of other cpusets at creation
    1.270 +is the current value of their parent's notify_on_release setting.
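A minimal release agent could look like the following sketch, written
here as a shell function (it would normally be installed as the script
/sbin/cpuset_release_agent; the CPUSET_MOUNT override is an assumption
added for illustration):

```shell
# Sketch of a release agent.  The kernel invokes it with one argument:
# the abandoned cpuset's path, relative to the cpuset fs mount point.
# Removing the now-empty directory destroys the cpuset.
cpuset_release_agent() {
    mount_point=${CPUSET_MOUNT:-/dev/cpuset}   # overridable for testing
    rmdir "$mount_point$1"
}
```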
   1.271 +
   1.272 +
   1.273 +1.6 What is memory_pressure ?
   1.274 +-----------------------------
    1.275 +The memory_pressure of a cpuset provides a simple per-cpuset metric
    1.276 +of the rate at which the tasks in a cpuset are attempting to free up
    1.277 +in-use memory on the nodes of the cpuset in order to satisfy additional
    1.278 +memory requests.
   1.279 +
   1.280 +This enables batch managers monitoring jobs running in dedicated
   1.281 +cpusets to efficiently detect what level of memory pressure that job
   1.282 +is causing.
   1.283 +
   1.284 +This is useful both on tightly managed systems running a wide mix of
   1.285 +submitted jobs, which may choose to terminate or re-prioritize jobs that
    1.286 +are trying to use more memory than allowed on the nodes assigned to them,
   1.287 +and with tightly coupled, long running, massively parallel scientific
   1.288 +computing jobs that will dramatically fail to meet required performance
   1.289 +goals if they start to use more memory than allowed to them.
   1.290 +
   1.291 +This mechanism provides a very economical way for the batch manager
   1.292 +to monitor a cpuset for signs of memory pressure.  It's up to the
   1.293 +batch manager or other user code to decide what to do about it and
   1.294 +take action.
   1.295 +
   1.296 +==> Unless this feature is enabled by writing "1" to the special file
   1.297 +    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
   1.298 +    code of __alloc_pages() for this metric reduces to simply noticing
   1.299 +    that the cpuset_memory_pressure_enabled flag is zero.  So only
   1.300 +    systems that enable this feature will compute the metric.
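Enabling the metric then amounts to a single write; a hedged helper
(the mount point is passed in rather than hard-coded):

```shell
# Turn on system-wide computation of the memory_pressure metric.
# $1 is the cpuset fs mount point (normally /dev/cpuset).
enable_memory_pressure() {
    /bin/echo 1 > "$1/memory_pressure_enabled"
}
```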
   1.301 +
   1.302 +Why a per-cpuset, running average:
   1.303 +
   1.304 +    Because this meter is per-cpuset, rather than per-task or mm,
   1.305 +    the system load imposed by a batch scheduler monitoring this
   1.306 +    metric is sharply reduced on large systems, because a scan of
   1.307 +    the tasklist can be avoided on each set of queries.
   1.308 +
   1.309 +    Because this meter is a running average, instead of an accumulating
   1.310 +    counter, a batch scheduler can detect memory pressure with a
   1.311 +    single read, instead of having to read and accumulate results
   1.312 +    for a period of time.
   1.313 +
   1.314 +    Because this meter is per-cpuset rather than per-task or mm,
   1.315 +    the batch scheduler can obtain the key information, memory
   1.316 +    pressure in a cpuset, with a single read, rather than having to
   1.317 +    query and accumulate results over all the (dynamically changing)
   1.318 +    set of tasks in the cpuset.
   1.319 +
   1.320 +A per-cpuset simple digital filter (requires a spinlock and 3 words
   1.321 +of data per-cpuset) is kept, and updated by any task attached to that
   1.322 +cpuset, if it enters the synchronous (direct) page reclaim code.
   1.323 +
   1.324 +A per-cpuset file provides an integer number representing the recent
   1.325 +(half-life of 10 seconds) rate of direct page reclaims caused by
   1.326 +the tasks in the cpuset, in units of reclaims attempted per second,
   1.327 +times 1000.
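A reading can thus be converted back to a rate by dividing by 1000; a
small sketch with a made-up raw value:

```shell
# memory_pressure reports reclaims attempted per second, times 1000.
raw=5000                 # example value, as if read from memory_pressure
rate=$((raw / 1000))     # integer reclaims attempted per second
echo "$rate"             # -> 5
```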
   1.328 +
   1.329 +
   1.330 +1.7 What is memory spread ?
   1.331 +---------------------------
    1.332 +There are two boolean flag files per cpuset that control where the
    1.333 +kernel allocates pages for the file system buffers and related
    1.334 +in-kernel data structures.  They are called 'memory_spread_page' and
    1.335 +'memory_spread_slab'.
   1.336 +
   1.337 +If the per-cpuset boolean flag file 'memory_spread_page' is set, then
   1.338 +the kernel will spread the file system buffers (page cache) evenly
   1.339 +over all the nodes that the faulting task is allowed to use, instead
   1.340 +of preferring to put those pages on the node where the task is running.
   1.341 +
   1.342 +If the per-cpuset boolean flag file 'memory_spread_slab' is set,
    1.343 +then the kernel will spread some file system related slab caches,
    1.344 +such as those for inodes and dentries, evenly over all the nodes that the
   1.345 +faulting task is allowed to use, instead of preferring to put those
   1.346 +pages on the node where the task is running.
   1.347 +
   1.348 +The setting of these flags does not affect anonymous data segment or
   1.349 +stack segment pages of a task.
   1.350 +
   1.351 +By default, both kinds of memory spreading are off, and memory
   1.352 +pages are allocated on the node local to where the task is running,
    1.353 +except perhaps as modified by the task's NUMA mempolicy or cpuset
   1.354 +configuration, so long as sufficient free memory pages are available.
   1.355 +
   1.356 +When new cpusets are created, they inherit the memory spread settings
   1.357 +of their parent.
   1.358 +
    1.359 +Setting memory spreading causes allocations for the affected page
    1.360 +or slab caches to ignore the task's NUMA mempolicy and be spread
    1.361 +instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
    1.362 +mempolicies will not notice any change in these calls as a result of
    1.363 +their containing task's memory spread settings.  If memory spreading
    1.364 +is turned off, then the currently specified NUMA mempolicy once again
    1.365 +applies to memory page allocations.
   1.366 +
   1.367 +Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
   1.368 +files.  By default they contain "0", meaning that the feature is off
   1.369 +for that cpuset.  If a "1" is written to that file, then that turns
   1.370 +the named feature on.
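Turning both flags on for one cpuset is then two writes; a hedged
helper (the cpuset directory path is an assumption passed in by the
caller):

```shell
# Turn on both kinds of memory spreading for one cpuset.
# $1 is the cpuset's directory, e.g. /dev/cpuset/Charlie (hypothetical).
enable_memory_spread() {
    /bin/echo 1 > "$1/memory_spread_page"
    /bin/echo 1 > "$1/memory_spread_slab"
}
```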
   1.371 +
   1.372 +The implementation is simple.
   1.373 +
   1.374 +Setting the flag 'memory_spread_page' turns on a per-process flag
   1.375 +PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
    1.376 +joins that cpuset.  The page allocation calls for the page cache
    1.377 +are modified to perform an inline check for this PF_SPREAD_PAGE task
   1.378 +flag, and if set, a call to a new routine cpuset_mem_spread_node()
   1.379 +returns the node to prefer for the allocation.
   1.380 +
    1.381 +Similarly, setting 'memory_spread_slab' turns on the flag
   1.382 +PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
   1.383 +pages from the node returned by cpuset_mem_spread_node().
   1.384 +
   1.385 +The cpuset_mem_spread_node() routine is also simple.  It uses the
   1.386 +value of a per-task rotor cpuset_mem_spread_rotor to select the next
    1.387 +node in the current task's mems_allowed to prefer for the allocation.
   1.388 +
   1.389 +This memory placement policy is also known (in other contexts) as
   1.390 +round-robin or interleave.
   1.391 +
   1.392 +This policy can provide substantial improvements for jobs that need
   1.393 +to place thread local data on the corresponding node, but that need
   1.394 +to access large file system data sets that need to be spread across
    1.395 +the several nodes in the job's cpuset in order to fit.  Without this
   1.396 +policy, especially for jobs that might have one thread reading in the
    1.397 +data set, the memory allocation across the nodes in the job's cpuset
   1.398 +can become very uneven.
   1.399 +
   1.400 +
   1.401 +1.8 How do I use cpusets ?
   1.402 +--------------------------
   1.403 +
   1.404 +In order to minimize the impact of cpusets on critical kernel
   1.405 +code, such as the scheduler, and due to the fact that the kernel
   1.406 +does not support one task updating the memory placement of another
   1.407 +task directly, the impact on a task of changing its cpuset CPU
   1.408 +or Memory Node placement, or of changing to which cpuset a task
   1.409 +is attached, is subtle.
   1.410 +
   1.411 +If a cpuset has its Memory Nodes modified, then for each task attached
   1.412 +to that cpuset, the next time that the kernel attempts to allocate
   1.413 +a page of memory for that task, the kernel will notice the change
    1.414 +in the task's cpuset, and update its per-task memory placement to
    1.415 +remain within the new cpuset's memory placement.  If the task was using
   1.416 +mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
   1.417 +its new cpuset, then the task will continue to use whatever subset
   1.418 +of MPOL_BIND nodes are still allowed in the new cpuset.  If the task
   1.419 +was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
   1.420 +in the new cpuset, then the task will be essentially treated as if it
   1.421 +was MPOL_BIND bound to the new cpuset (even though its numa placement,
   1.422 +as queried by get_mempolicy(), doesn't change).  If a task is moved
    1.423 +from one cpuset to another, then the kernel will adjust the task's
   1.424 +memory placement, as above, the next time that the kernel attempts
   1.425 +to allocate a page of memory for that task.
   1.426 +
   1.427 +If a cpuset has its CPUs modified, then each task using that
   1.428 +cpuset does _not_ change its behavior automatically.  In order to
   1.429 +minimize the impact on the critical scheduling code in the kernel,
   1.430 +tasks will continue to use their prior CPU placement until they
   1.431 +are rebound to their cpuset, by rewriting their pid to the 'tasks'
   1.432 +file of their cpuset.  If a task had been bound to some subset of its
   1.433 +cpuset using the sched_setaffinity() call, and if any of that subset
   1.434 +is still allowed in its new cpuset settings, then the task will be
   1.435 +restricted to the intersection of the CPUs it was allowed on before,
   1.436 +and its new cpuset CPU placement.  If, on the other hand, there is
    1.437 +no overlap between a task's prior placement and its new cpuset CPU
   1.438 +placement, then the task will be allowed to run on any CPU allowed
   1.439 +in its new cpuset.  If a task is moved from one cpuset to another,
    1.440 +its CPU placement is updated in the same way as if the task's pid were
   1.441 +rewritten to the 'tasks' file of its current cpuset.
   1.442 +
   1.443 +In summary, the memory placement of a task whose cpuset is changed is
   1.444 +updated by the kernel, on the next allocation of a page for that task,
    1.445 +but the processor placement is not updated until that task's pid is
   1.446 +rewritten to the 'tasks' file of its cpuset.  This is done to avoid
   1.447 +impacting the scheduler code in the kernel with a check for changes
    1.448 +in a task's processor placement.
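The rebinding described above can be forced from user space by
rewriting every pid in a cpuset's 'tasks' file back into it; a hedged
sketch (the cpuset directory path is supplied by the caller):

```shell
# Re-apply a cpuset's CPU placement to all of its current tasks by
# rewriting each pid back into the 'tasks' file, one write per pid.
# $1 is the cpuset's directory, e.g. /dev/cpuset/Charlie (hypothetical).
rebind_cpuset_tasks() {
    for pid in $(cat "$1/tasks"); do
        /bin/echo "$pid" > "$1/tasks"
    done
}
```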
   1.449 +
   1.450 +Normally, once a page is allocated (given a physical page
   1.451 +of main memory) then that page stays on whatever node it
   1.452 +was allocated, so long as it remains allocated, even if the
    1.453 +cpuset's memory placement policy 'mems' subsequently changes.
    1.454 +If the cpuset flag file 'memory_migrate' is set true, then when
    1.455 +a task is attached to that cpuset, any pages that task had
    1.456 +allocated to it on nodes in its previous cpuset are migrated
    1.457 +to the task's new cpuset. The relative placement of the page within
    1.458 +the cpuset is preserved during these migration operations if possible.
    1.459 +For example, if the page was on the second valid node of the prior cpuset,
    1.460 +then the page will be placed on the second valid node of the new cpuset.
   1.461 +
    1.462 +Also, if 'memory_migrate' is set true, then if that cpuset's
    1.463 +'mems' file is modified, pages allocated to tasks in that
    1.464 +cpuset, that were on nodes in the previous setting of 'mems',
    1.465 +will be moved to nodes in the new setting of 'mems'.
    1.466 +Pages that were not in the task's prior cpuset, or in the cpuset's
    1.467 +prior 'mems' setting, will not be moved.
   1.468 +
   1.469 +There is an exception to the above.  If hotplug functionality is used
   1.470 +to remove all the CPUs that are currently assigned to a cpuset,
    1.471 +then the kernel will automatically update the cpus_allowed of all
    1.472 +tasks attached to that cpuset to allow all CPUs.  When memory
   1.473 +hotplug functionality for removing Memory Nodes is available, a
   1.474 +similar exception is expected to apply there as well.  In general,
   1.475 +the kernel prefers to violate cpuset placement, over starving a task
   1.476 +that has had all its allowed CPUs or Memory Nodes taken offline.  User
   1.477 +code should reconfigure cpusets to only refer to online CPUs and Memory
   1.478 +Nodes when using hotplug to add or remove such resources.
   1.479 +
   1.480 +There is a second exception to the above.  GFP_ATOMIC requests are
   1.481 +kernel internal allocations that must be satisfied, immediately.
    1.482 +The kernel may drop some requests, or in rare cases even panic, if a
    1.483 +GFP_ATOMIC alloc fails.  If the request cannot be satisfied within
    1.484 +the current task's cpuset, then we relax the cpuset, and look for
   1.485 +memory anywhere we can find it.  It's better to violate the cpuset
   1.486 +than stress the kernel.
   1.487 +
   1.488 +To start a new job that is to be contained within a cpuset, the steps are:
   1.489 +
   1.490 + 1) mkdir /dev/cpuset
   1.491 + 2) mount -t cpuset none /dev/cpuset
   1.492 + 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
   1.493 +    the /dev/cpuset virtual file system.
   1.494 + 4) Start a task that will be the "founding father" of the new job.
   1.495 + 5) Attach that task to the new cpuset by writing its pid to the
   1.496 +    /dev/cpuset tasks file for that cpuset.
   1.497 + 6) fork, exec or clone the job tasks from this founding father task.
   1.498 +
    1.499 +For example, the following sequence of commands will set up a cpuset
   1.500 +named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
   1.501 +and then start a subshell 'sh' in that cpuset:
   1.502 +
   1.503 +  mount -t cpuset none /dev/cpuset
   1.504 +  cd /dev/cpuset
   1.505 +  mkdir Charlie
   1.506 +  cd Charlie
   1.507 +  /bin/echo 2-3 > cpus
   1.508 +  /bin/echo 1 > mems
   1.509 +  /bin/echo $$ > tasks
   1.510 +  sh
   1.511 +  # The subshell 'sh' is now running in cpuset Charlie
   1.512 +  # The next line should display '/Charlie'
   1.513 +  cat /proc/self/cpuset
   1.514 +
   1.515 +In the future, a C library interface to cpusets will likely be
   1.516 +available.  For now, the only way to query or modify cpusets is
   1.517 +via the cpuset file system, using the various cd, mkdir, echo, cat,
   1.518 +rmdir commands from the shell, or their equivalent from C.
   1.519 +
   1.520 +The sched_setaffinity calls can also be done at the shell prompt using
   1.521 +SGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
   1.522 +calls can be done at the shell prompt using the numactl command
   1.523 +(part of Andi Kleen's numa package).
   1.524 +
   1.525 +2. Usage Examples and Syntax
   1.526 +============================
   1.527 +
   1.528 +2.1 Basic Usage
   1.529 +---------------
   1.530 +
    1.531 +Creating, modifying, and using cpusets can be done through the cpuset
    1.532 +virtual file system.
   1.533 +
   1.534 +To mount it, type:
   1.535 +# mount -t cpuset none /dev/cpuset
   1.536 +
   1.537 +Then under /dev/cpuset you can find a tree that corresponds to the
   1.538 +tree of the cpusets in the system. For instance, /dev/cpuset
   1.539 +is the cpuset that holds the whole system.
   1.540 +
   1.541 +If you want to create a new cpuset under /dev/cpuset:
   1.542 +# cd /dev/cpuset
   1.543 +# mkdir my_cpuset
   1.544 +
   1.545 +Now you want to do something with this cpuset.
   1.546 +# cd my_cpuset
   1.547 +
   1.548 +In this directory you can find several files:
   1.549 +# ls
   1.550 +cpus  cpu_exclusive  mems  mem_exclusive  tasks
   1.551 +
    1.552 +Reading them will give you information about the state of this cpuset:
    1.553 +the CPUs and Memory Nodes it can use, the processes that are using
    1.554 +it, and its properties.  By writing to these files you can manipulate
    1.555 +the cpuset.
   1.556 +
   1.557 +Set some flags:
   1.558 +# /bin/echo 1 > cpu_exclusive
   1.559 +
   1.560 +Add some cpus:
   1.561 +# /bin/echo 0-7 > cpus
   1.562 +
   1.563 +Now attach your shell to this cpuset:
   1.564 +# /bin/echo $$ > tasks
   1.565 +
   1.566 +You can also create cpusets inside your cpuset by using mkdir in this
   1.567 +directory.
   1.568 +# mkdir my_sub_cs
   1.569 +
   1.570 +To remove a cpuset, just use rmdir:
   1.571 +# rmdir my_sub_cs
   1.572 +This will fail if the cpuset is in use (has cpusets inside, or has
   1.573 +processes attached).
   1.574 +
   1.575 +2.2 Adding/removing cpus
   1.576 +------------------------
   1.577 +
    1.578 +This is the syntax to use when writing to the cpus or mems files
   1.579 +in cpuset directories:
   1.580 +
   1.581 +# /bin/echo 1-4 > cpus		-> set cpus list to cpus 1,2,3,4
   1.582 +# /bin/echo 1,2,3,4 > cpus	-> set cpus list to cpus 1,2,3,4
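Both forms above describe the same set. As a sketch, such a list
(ranges and comma-separated values) can be expanded into individual
CPU numbers, one per line:

```shell
# Expand a cpus/mems list like "1-4" or "1,2,3,4" into one number
# per line: split on commas, then expand each "lo-hi" range with seq.
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        seq "$lo" "${hi:-$lo}"
    done
}
```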
   1.583 +
   1.584 +2.3 Setting flags
   1.585 +-----------------
   1.586 +
   1.587 +The syntax is very simple:
   1.588 +
   1.589 +# /bin/echo 1 > cpu_exclusive 	-> set flag 'cpu_exclusive'
   1.590 +# /bin/echo 0 > cpu_exclusive 	-> unset flag 'cpu_exclusive'
   1.591 +
   1.592 +2.4 Attaching processes
   1.593 +-----------------------
   1.594 +
   1.595 +# /bin/echo PID > tasks
   1.596 +
   1.597 +Note that it is PID, not PIDs. You can only attach ONE task at a time.
   1.598 +If you have several tasks to attach, you have to do it one after another:
   1.599 +
   1.600 +# /bin/echo PID1 > tasks
   1.601 +# /bin/echo PID2 > tasks
   1.602 +	...
   1.603 +# /bin/echo PIDn > tasks
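The one-pid-per-write rule can be wrapped in a small helper; a sketch
(the cpuset directory is supplied by the caller):

```shell
# Attach several tasks to a cpuset, issuing one write per pid as the
# cpuset fs requires.  $1 is the cpuset directory; the rest are pids.
attach_tasks() {
    dir="$1"; shift
    for pid in "$@"; do
        /bin/echo "$pid" > "$dir/tasks"
    done
}
```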
   1.604 +
   1.605 +
   1.606 +3. Questions
   1.607 +============
   1.608 +
    1.609 +Q: What's up with this '/bin/echo' ?
    1.610 +A: bash's builtin 'echo' command does not check its calls to write()
    1.611 +   for errors. If you use it in the cpuset file system, you won't be
    1.612 +   able to tell whether a command succeeded or failed.
   1.613 +
    1.614 +Q: When I attach processes, only the first one on the line really gets attached !
    1.615 +A: We can only return one error code per call to write().  So you
    1.616 +   should put only ONE pid per write.
   1.617 +
   1.618 +4. Contact
   1.619 +==========
   1.620 +
   1.621 +Web: http://www.bullopensource.org/cpuset