annotate Documentation/cpusets.txt @ 0:831230e53067

Import 2.6.18 from kernel.org tarball.
author Ian Campbell <ian.campbell@xensource.com>
date Wed Apr 11 14:15:44 2007 +0100 (2007-04-11)
rev   line source
ian@0 1 CPUSETS
ian@0 2 -------
ian@0 3
ian@0 4 Copyright (C) 2004 BULL SA.
ian@0 5 Written by Simon.Derr@bull.net
ian@0 6
ian@0 7 Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
ian@0 8 Modified by Paul Jackson <pj@sgi.com>
ian@0 9 Modified by Christoph Lameter <clameter@sgi.com>
ian@0 10
ian@0 11 CONTENTS:
ian@0 12 =========
ian@0 13
ian@0 14 1. Cpusets
ian@0 15 1.1 What are cpusets ?
ian@0 16 1.2 Why are cpusets needed ?
ian@0 17 1.3 How are cpusets implemented ?
ian@0 18 1.4 What are exclusive cpusets ?
ian@0 19 1.5 What does notify_on_release do ?
ian@0 20 1.6 What is memory_pressure ?
ian@0 21 1.7 What is memory spread ?
ian@0 22 1.8 How do I use cpusets ?
ian@0 23 2. Usage Examples and Syntax
ian@0 24 2.1 Basic Usage
ian@0 25 2.2 Adding/removing cpus
ian@0 26 2.3 Setting flags
ian@0 27 2.4 Attaching processes
ian@0 28 3. Questions
ian@0 29 4. Contact
ian@0 30
ian@0 31 1. Cpusets
ian@0 32 ==========
ian@0 33
ian@0 34 1.1 What are cpusets ?
ian@0 35 ----------------------
ian@0 36
ian@0 37 Cpusets provide a mechanism for assigning a set of CPUs and Memory
ian@0 38 Nodes to a set of tasks.
ian@0 39
ian@0 40 Cpusets constrain the CPU and Memory placement of tasks to only
ian@0 41 the resources within a task's current cpuset. They form a nested
ian@0 42 hierarchy visible in a virtual file system. These are the essential
ian@0 43 hooks, beyond what is already present, required to manage dynamic
ian@0 44 job placement on large systems.
ian@0 45
ian@0 46 Each task has a pointer to a cpuset. Multiple tasks may reference
ian@0 47 the same cpuset. Requests by a task, using the sched_setaffinity(2)
ian@0 48 system call to include CPUs in its CPU affinity mask, and using the
ian@0 49 mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
ian@0 50 in its memory policy, are both filtered through that task's cpuset,
ian@0 51 filtering out any CPUs or Memory Nodes not in that cpuset. The
ian@0 52 scheduler will not schedule a task on a CPU that is not allowed in
ian@0 53 its cpus_allowed vector, and the kernel page allocator will not
ian@0 54 allocate a page on a node that is not allowed in the requesting task's
ian@0 55 mems_allowed vector.
ian@0 56
ian@0 57 User level code may create and destroy cpusets by name in the cpuset
ian@0 58 virtual file system, manage the attributes and permissions of these
ian@0 59 cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
ian@0 60 specify and query to which cpuset a task is assigned, and list the
ian@0 61 task pids assigned to a cpuset.
ian@0 62
ian@0 63
ian@0 64 1.2 Why are cpusets needed ?
ian@0 65 ----------------------------
ian@0 66
ian@0 67 The management of large computer systems, with many processors (CPUs),
ian@0 68 complex memory cache hierarchies and multiple Memory Nodes having
ian@0 69 non-uniform access times (NUMA) presents additional challenges for
ian@0 70 the efficient scheduling and memory placement of processes.
ian@0 71
ian@0 72 Frequently more modest sized systems can be operated with adequate
ian@0 73 efficiency just by letting the operating system automatically share
ian@0 74 the available CPU and Memory resources amongst the requesting tasks.
ian@0 75
ian@0 76 But larger systems, which benefit more from careful processor and
ian@0 77 memory placement to reduce memory access times and contention,
ian@0 78 and which typically represent a larger investment for the customer,
ian@0 79 can benefit from explicitly placing jobs on properly sized subsets of
ian@0 80 the system.
ian@0 81
ian@0 82 This can be especially valuable on:
ian@0 83
ian@0 84 * Web Servers running multiple instances of the same web application,
ian@0 85 * Servers running different applications (for instance, a web server
ian@0 86 and a database), or
ian@0 87 * NUMA systems running large HPC applications with demanding
ian@0 88 performance characteristics.
ian@0 89 * Also cpu_exclusive cpusets are useful for servers running orthogonal
ian@0 90 workloads such as RT applications requiring low latency and HPC
ian@0 91 applications that are throughput sensitive.
ian@0 92
ian@0 93 These subsets, or "soft partitions", must be able to be dynamically
ian@0 94 adjusted, as the job mix changes, without impacting other concurrently
ian@0 95 executing jobs. The location of the running job's pages may also be moved
ian@0 96 when the memory locations are changed.
ian@0 97
ian@0 98 The kernel cpuset patch provides the minimum essential kernel
ian@0 99 mechanisms required to efficiently implement such subsets. It
ian@0 100 leverages existing CPU and Memory Placement facilities in the Linux
ian@0 101 kernel to avoid any additional impact on the critical scheduler or
ian@0 102 memory allocator code.
ian@0 103
ian@0 104
ian@0 105 1.3 How are cpusets implemented ?
ian@0 106 ---------------------------------
ian@0 107
ian@0 108 Cpusets provide a Linux kernel mechanism to constrain which CPUs and
ian@0 109 Memory Nodes are used by a process or set of processes.
ian@0 110
ian@0 111 The Linux kernel already has a pair of mechanisms to specify on which
ian@0 112 CPUs a task may be scheduled (sched_setaffinity) and on which Memory
ian@0 113 Nodes it may obtain memory (mbind, set_mempolicy).
ian@0 114
ian@0 115 Cpusets extend these two mechanisms as follows:
ian@0 116
ian@0 117 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
ian@0 118 kernel.
ian@0 119 - Each task in the system is attached to a cpuset, via a pointer
ian@0 120 in the task structure to a reference counted cpuset structure.
ian@0 121 - Calls to sched_setaffinity are filtered to just those CPUs
ian@0 122 allowed in that task's cpuset.
ian@0 123 - Calls to mbind and set_mempolicy are filtered to just
ian@0 124 those Memory Nodes allowed in that task's cpuset.
ian@0 125 - The root cpuset contains all the system's CPUs and Memory
ian@0 126 Nodes.
ian@0 127 - For any cpuset, one can define child cpusets containing a subset
ian@0 128 of the parent's CPU and Memory Node resources.
ian@0 129 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
ian@0 130 browsing and manipulation from user space.
ian@0 131 - A cpuset may be marked exclusive, which ensures that no other
ian@0 132 cpuset (except direct ancestors and descendants) may contain
ian@0 133 any overlapping CPUs or Memory Nodes.
ian@0 134 Also a cpu_exclusive cpuset would be associated with a sched
ian@0 135 domain.
ian@0 136 - You can list all the tasks (by pid) attached to any cpuset.
ian@0 137
ian@0 138 The implementation of cpusets requires a few, simple hooks
ian@0 139 into the rest of the kernel, none in performance critical paths:
ian@0 140
ian@0 141 - in init/main.c, to initialize the root cpuset at system boot.
ian@0 142 - in fork and exit, to attach and detach a task from its cpuset.
ian@0 143 - in sched_setaffinity, to mask the requested CPUs by what's
ian@0 144 allowed in that task's cpuset.
ian@0 145 - in sched.c migrate_all_tasks(), to keep migrating tasks within
ian@0 146 the CPUs allowed by their cpuset, if possible.
ian@0 147 - in sched.c, a new API partition_sched_domains for handling
ian@0 148 sched domain changes associated with cpu_exclusive cpusets
ian@0 149 and related changes in both sched.c and arch/ia64/kernel/domain.c
ian@0 150 - in the mbind and set_mempolicy system calls, to mask the requested
ian@0 151 Memory Nodes by what's allowed in that task's cpuset.
ian@0 152 - in page_alloc.c, to restrict memory to allowed nodes.
ian@0 153 - in vmscan.c, to restrict page recovery to the current cpuset.
ian@0 154
ian@0 155 In addition a new file system, of type "cpuset" may be mounted,
ian@0 156 typically at /dev/cpuset, to enable browsing and modifying the cpusets
ian@0 157 presently known to the kernel. No new system calls are added for
ian@0 158 cpusets - all support for querying and modifying cpusets is via
ian@0 159 this cpuset file system.
ian@0 160
ian@0 161 Each task under /proc has an added file named 'cpuset', displaying
ian@0 162 the cpuset name, as the path relative to the root of the cpuset file
ian@0 163 system.
ian@0 164
ian@0 165 The /proc/<pid>/status file for each task has two added lines,
ian@0 166 displaying the task's cpus_allowed (on which CPUs it may be scheduled)
ian@0 167 and mems_allowed (on which Memory Nodes it may obtain memory),
ian@0 168 in the format seen in the following example:
ian@0 169
ian@0 170 Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
ian@0 171 Mems_allowed: ffffffff,ffffffff
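
The masks above are comma-separated 32-bit hexadecimal words, with the
rightmost word holding bits 0-31. As a rough illustration, the sketch
below turns such a mask into a list of CPU or Memory Node numbers. The
helper name mask_to_list is hypothetical and not part of any cpuset
tool.

```shell
# mask_to_list MASK
# Print the bit numbers set in a mask such as "00000000,0000000c".
# Hypothetical helper: assumes 32-bit words, most significant word first.
mask_to_list() {
    # Reverse the word order so that the first word processed holds bits 0-31.
    rev=""
    for w in $(printf '%s' "$1" | tr ',' ' '); do
        rev="$w $rev"
    done
    out=""
    base=0
    for w in $rev; do
        val=$((0x$w))
        bit=0
        while [ "$bit" -lt 32 ]; do
            if [ $(( (val >> bit) & 1 )) -eq 1 ]; then
                out="$out$((base + bit)) "
            fi
            bit=$((bit + 1))
        done
        base=$((base + 32))
    done
    # Drop the trailing space before printing.
    printf '%s\n' "${out% }"
}
```

For instance, mask_to_list 0000000c prints "2 3", the CPUs used by the
"Charlie" example later in this document.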
ian@0 172
ian@0 173 Each cpuset is represented by a directory in the cpuset file system
ian@0 174 containing the following files describing that cpuset:
ian@0 175
ian@0 176 - cpus: list of CPUs in that cpuset
ian@0 177 - mems: list of Memory Nodes in that cpuset
ian@0 178 - memory_migrate flag: if set, move pages to cpuset's nodes
ian@0 179 - cpu_exclusive flag: is cpu placement exclusive?
ian@0 180 - mem_exclusive flag: is memory placement exclusive?
ian@0 181 - tasks: list of tasks (by pid) attached to that cpuset
ian@0 182 - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
ian@0 183 - memory_pressure: measure of how much paging pressure in cpuset
ian@0 184
ian@0 185 In addition, only the root cpuset has the following file:
ian@0 186 - memory_pressure_enabled flag: compute memory_pressure?
ian@0 187
ian@0 188 New cpusets are created using the mkdir system call or shell
ian@0 189 command. The properties of a cpuset, such as its flags, allowed
ian@0 190 CPUs and Memory Nodes, and attached tasks, are modified by writing
ian@0 191 to the appropriate file in that cpuset's directory, as listed above.
ian@0 192
ian@0 193 The named hierarchical structure of nested cpusets allows partitioning
ian@0 194 a large system into nested, dynamically changeable, "soft-partitions".
ian@0 195
ian@0 196 The attachment of each task, automatically inherited at fork by any
ian@0 197 children of that task, to a cpuset allows organizing the work load
ian@0 198 on a system into related sets of tasks such that each set is constrained
ian@0 199 to using the CPUs and Memory Nodes of a particular cpuset. A task
ian@0 200 may be re-attached to any other cpuset, if allowed by the permissions
ian@0 201 on the necessary cpuset file system directories.
ian@0 202
ian@0 203 Such management of a system "in the large" integrates smoothly with
ian@0 204 the detailed placement done on individual tasks and memory regions
ian@0 205 using the sched_setaffinity, mbind and set_mempolicy system calls.
ian@0 206
ian@0 207 The following rules apply to each cpuset:
ian@0 208
ian@0 209 - Its CPUs and Memory Nodes must be a subset of its parent's.
ian@0 210 - It can only be marked exclusive if its parent is.
ian@0 211 - If its cpu or memory is exclusive, they may not overlap any sibling.
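
The subset and sibling-overlap rules can be pictured with two small
list checks. This is only a user-space sketch of the rules as stated
here, not the kernel's validation code. The helper names is_subset and
overlaps are hypothetical, and the lists are space-separated CPU or
Memory Node numbers.

```shell
# is_subset "CHILD" "PARENT"
# Succeed if every number in CHILD also appears in PARENT.
is_subset() {
    for c in $1; do
        found=0
        for p in $2; do
            if [ "$c" -eq "$p" ]; then
                found=1
                break
            fi
        done
        if [ "$found" -ne 1 ]; then
            return 1
        fi
    done
    return 0
}

# overlaps "A" "B"
# Succeed if the two lists share at least one number.
overlaps() {
    for a in $1; do
        for b in $2; do
            if [ "$a" -eq "$b" ]; then
                return 0
            fi
        done
    done
    return 1
}
```

Under these rules a child holding CPUs "2 3" is acceptable below a
parent holding "0 1 2 3", and an exclusive sibling holding "3 4" would
be rejected because overlaps "2 3" "3 4" succeeds.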
ian@0 212
ian@0 213 These rules, and the natural hierarchy of cpusets, enable efficient
ian@0 214 enforcement of the exclusive guarantee, without having to scan all
ian@0 215 cpusets every time any of them change to ensure nothing overlaps an
ian@0 216 exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
ian@0 217 to represent the cpuset hierarchy provides for a familiar permission
ian@0 218 and name space for cpusets, with a minimum of additional kernel code.
ian@0 219
ian@0 220 The cpus file in the root (top_cpuset) cpuset is read-only.
ian@0 221 It automatically tracks the value of cpu_online_map, using a CPU
ian@0 222 hotplug notifier. If and when memory nodes can be hotplugged,
ian@0 223 we expect to make the mems file in the root cpuset read-only
ian@0 224 as well, and have it track the value of node_online_map.
ian@0 225
ian@0 226
ian@0 227 1.4 What are exclusive cpusets ?
ian@0 228 --------------------------------
ian@0 229
ian@0 230 If a cpuset is cpu or mem exclusive, no other cpuset, other than
ian@0 231 a direct ancestor or descendant, may share any of the same CPUs or
ian@0 232 Memory Nodes.
ian@0 233
ian@0 234 A cpuset that is cpu_exclusive has a scheduler (sched) domain
ian@0 235 associated with it. The sched domain consists of all CPUs in the
ian@0 236 current cpuset that are not part of any exclusive child cpusets.
ian@0 237 This ensures that the scheduler load balancing code only balances
ian@0 238 against the CPUs that are in the sched domain as defined above and
ian@0 239 not all of the CPUs in the system. This removes any overhead due to
ian@0 240 load balancing code trying to pull tasks outside of the cpu_exclusive
ian@0 241 cpuset only to be prevented by the tasks' cpus_allowed mask.
ian@0 242
ian@0 243 A cpuset that is mem_exclusive restricts kernel allocations for
ian@0 244 page, buffer and other data commonly shared by the kernel across
ian@0 245 multiple users. All cpusets, whether mem_exclusive or not, restrict
ian@0 246 allocations of memory for user space. This enables configuring a
ian@0 247 system so that several independent jobs can share common kernel data,
ian@0 248 such as file system pages, while isolating each job's user allocation in
ian@0 249 its own cpuset. To do this, construct a large mem_exclusive cpuset to
ian@0 250 hold all the jobs, and construct child, non-mem_exclusive cpusets for
ian@0 251 each individual job. Only a small amount of typical kernel memory,
ian@0 252 such as requests from interrupt handlers, is allowed to be taken
ian@0 253 outside even a mem_exclusive cpuset.
ian@0 254
ian@0 255
ian@0 256 1.5 What does notify_on_release do ?
ian@0 257 ------------------------------------
ian@0 258
ian@0 259 If the notify_on_release flag is enabled (1) in a cpuset, then whenever
ian@0 260 the last task in the cpuset leaves (exits or attaches to some other
ian@0 261 cpuset) and the last child cpuset of that cpuset is removed, then
ian@0 262 the kernel runs the command /sbin/cpuset_release_agent, supplying the
ian@0 263 pathname (relative to the mount point of the cpuset file system) of the
ian@0 264 abandoned cpuset. This enables automatic removal of abandoned cpusets.
ian@0 265 The default value of notify_on_release in the root cpuset at system
ian@0 266 boot is disabled (0). The default value of other cpusets at creation
ian@0 267 is the current value of their parent's notify_on_release setting.
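
A release agent that removes abandoned cpusets could be as small as
the sketch below. The function name release_cpuset is hypothetical.
It takes the mount point as its first argument (normally /dev/cpuset)
so that it can be tried on an ordinary directory, and the pathname
supplied by the kernel as its second.

```shell
# release_cpuset ROOT RELPATH
# ROOT is the cpuset mount point, RELPATH the path the kernel passes to the
# agent. Hypothetical helper, not shipped with the kernel.
release_cpuset() {
    root=$1
    rel=$2
    # rmdir fails on a cpuset that still has child cpusets or attached tasks,
    # so a cpuset that was reused in the meantime is left alone.
    rmdir "$root/$rel"
}

# As the body of /sbin/cpuset_release_agent this might be called as:
#   release_cpuset /dev/cpuset "$1"
```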
ian@0 268
ian@0 269
ian@0 270 1.6 What is memory_pressure ?
ian@0 271 -----------------------------
ian@0 272 The memory_pressure of a cpuset provides a simple per-cpuset metric
ian@0 273 of the rate that the tasks in a cpuset are attempting to free up
ian@0 274 in-use memory on the nodes of the cpuset to satisfy additional memory
ian@0 275 requests.
ian@0 276
ian@0 277 This enables batch managers monitoring jobs running in dedicated
ian@0 278 cpusets to efficiently detect what level of memory pressure that job
ian@0 279 is causing.
ian@0 280
ian@0 281 This is useful both on tightly managed systems running a wide mix of
ian@0 282 submitted jobs, which may choose to terminate or re-prioritize jobs that
ian@0 283 are trying to use more memory than allowed on the nodes assigned them,
ian@0 284 and with tightly coupled, long running, massively parallel scientific
ian@0 285 computing jobs that will dramatically fail to meet required performance
ian@0 286 goals if they start to use more memory than allowed to them.
ian@0 287
ian@0 288 This mechanism provides a very economical way for the batch manager
ian@0 289 to monitor a cpuset for signs of memory pressure. It's up to the
ian@0 290 batch manager or other user code to decide what to do about it and
ian@0 291 take action.
ian@0 292
ian@0 293 ==> Unless this feature is enabled by writing "1" to the special file
ian@0 294 /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
ian@0 295 code of __alloc_pages() for this metric reduces to simply noticing
ian@0 296 that the cpuset_memory_pressure_enabled flag is zero. So only
ian@0 297 systems that enable this feature will compute the metric.
ian@0 298
ian@0 299 Why a per-cpuset, running average:
ian@0 300
ian@0 301 Because this meter is per-cpuset, rather than per-task or mm,
ian@0 302 the system load imposed by a batch scheduler monitoring this
ian@0 303 metric is sharply reduced on large systems, because a scan of
ian@0 304 the tasklist can be avoided on each set of queries.
ian@0 305
ian@0 306 Because this meter is a running average, instead of an accumulating
ian@0 307 counter, a batch scheduler can detect memory pressure with a
ian@0 308 single read, instead of having to read and accumulate results
ian@0 309 for a period of time.
ian@0 310
ian@0 311 Because this meter is per-cpuset rather than per-task or mm,
ian@0 312 the batch scheduler can obtain the key information, memory
ian@0 313 pressure in a cpuset, with a single read, rather than having to
ian@0 314 query and accumulate results over all the (dynamically changing)
ian@0 315 set of tasks in the cpuset.
ian@0 316
ian@0 317 A per-cpuset simple digital filter (requires a spinlock and 3 words
ian@0 318 of data per-cpuset) is kept, and updated by any task attached to that
ian@0 319 cpuset, if it enters the synchronous (direct) page reclaim code.
ian@0 320
ian@0 321 A per-cpuset file provides an integer number representing the recent
ian@0 322 (half-life of 10 seconds) rate of direct page reclaims caused by
ian@0 323 the tasks in the cpuset, in units of reclaims attempted per second,
ian@0 324 times 1000.
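
Because the file holds the reclaim rate multiplied by 1000, a monitor
has to scale the value back down. The sketch below does that with
integer arithmetic only. The names pressure_rate and
read_pressure_rate are hypothetical, and the cpuset directory is
passed in as an argument.

```shell
# pressure_rate VALUE
# Convert a memory_pressure reading (reclaims per second times 1000)
# into reclaims per second with three decimal places.
pressure_rate() {
    printf '%d.%03d\n' $(( $1 / 1000 )) $(( $1 % 1000 ))
}

# read_pressure_rate DIR
# Read DIR/memory_pressure, where DIR is a cpuset directory such as
# /dev/cpuset/Charlie, and print the scaled rate.
read_pressure_rate() {
    pressure_rate "$(cat "$1/memory_pressure")"
}
```

A reading of 2500 would therefore be reported as 2.500 direct reclaims
attempted per second.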
ian@0 325
ian@0 326
ian@0 327 1.7 What is memory spread ?
ian@0 328 ---------------------------
ian@0 329 There are two boolean flag files per cpuset that control where the
ian@0 330 kernel allocates pages for the file system buffers and related
ian@0 331 in-kernel data structures. They are called 'memory_spread_page' and
ian@0 332 'memory_spread_slab'.
ian@0 333
ian@0 334 If the per-cpuset boolean flag file 'memory_spread_page' is set, then
ian@0 335 the kernel will spread the file system buffers (page cache) evenly
ian@0 336 over all the nodes that the faulting task is allowed to use, instead
ian@0 337 of preferring to put those pages on the node where the task is running.
ian@0 338
ian@0 339 If the per-cpuset boolean flag file 'memory_spread_slab' is set,
ian@0 340 then the kernel will spread some file system related slab caches,
ian@0 341 such as for inodes and dentries evenly over all the nodes that the
ian@0 342 faulting task is allowed to use, instead of preferring to put those
ian@0 343 pages on the node where the task is running.
ian@0 344
ian@0 345 The setting of these flags does not affect anonymous data segment or
ian@0 346 stack segment pages of a task.
ian@0 347
ian@0 348 By default, both kinds of memory spreading are off, and memory
ian@0 349 pages are allocated on the node local to where the task is running,
ian@0 350 except perhaps as modified by the task's NUMA mempolicy or cpuset
ian@0 351 configuration, so long as sufficient free memory pages are available.
ian@0 352
ian@0 353 When new cpusets are created, they inherit the memory spread settings
ian@0 354 of their parent.
ian@0 355
ian@0 356 Setting memory spreading causes allocations for the affected page
ian@0 357 or slab caches to ignore the task's NUMA mempolicy and be spread
ian@0 358 instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
ian@0 359 mempolicies will not notice any change in these calls as a result of
ian@0 360 their containing task's memory spread settings. If memory spreading
ian@0 361 is turned off, then the currently specified NUMA mempolicy once again
ian@0 362 applies to memory page allocations.
ian@0 363
ian@0 364 Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
ian@0 365 files. By default they contain "0", meaning that the feature is off
ian@0 366 for that cpuset. If a "1" is written to that file, then that turns
ian@0 367 the named feature on.
ian@0 368
ian@0 369 The implementation is simple.
ian@0 370
ian@0 371 Setting the flag 'memory_spread_page' turns on a per-process flag
ian@0 372 PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
ian@0 373 joins that cpuset. The page allocation calls for the page cache
ian@0 374 are modified to perform an inline check for this PF_SPREAD_PAGE task
ian@0 375 flag, and if set, a call to a new routine cpuset_mem_spread_node()
ian@0 376 returns the node to prefer for the allocation.
ian@0 377
ian@0 378 Similarly, setting 'memory_spread_slab' turns on the flag
ian@0 379 PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
ian@0 380 pages from the node returned by cpuset_mem_spread_node().
ian@0 381
ian@0 382 The cpuset_mem_spread_node() routine is also simple. It uses the
ian@0 383 value of a per-task rotor cpuset_mem_spread_rotor to select the next
ian@0 384 node in the current task's mems_allowed to prefer for the allocation.
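
The rotor selection can be pictured in user space as follows. This is
a sketch of round-robin selection only, not the kernel routine. The
function name next_spread_node is hypothetical, and the allowed nodes
are assumed to be given as an ascending, space-separated list.

```shell
# next_spread_node ROTOR "ALLOWED"
# Print the first allowed node numbered above ROTOR, wrapping around to the
# lowest allowed node. ALLOWED must be in ascending order.
next_spread_node() {
    rotor=$1
    first=""
    for n in $2; do
        # Remember the lowest node for the wrap-around case.
        if [ -z "$first" ]; then
            first=$n
        fi
        if [ "$n" -gt "$rotor" ]; then
            printf '%s\n' "$n"
            return 0
        fi
    done
    printf '%s\n' "$first"
}
```

Feeding each result back in as the next rotor value visits nodes
"0 2 3" in the order 2, 3, 0, 2, and so on.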
ian@0 385
ian@0 386 This memory placement policy is also known (in other contexts) as
ian@0 387 round-robin or interleave.
ian@0 388
ian@0 389 This policy can provide substantial improvements for jobs that need
ian@0 390 to place thread local data on the corresponding node, but that need
ian@0 391 to access large file system data sets that need to be spread across
ian@0 392 the several nodes in the job's cpuset in order to fit. Without this
ian@0 393 policy, especially for jobs that might have one thread reading in the
ian@0 394 data set, the memory allocation across the nodes in the job's cpuset
ian@0 395 can become very uneven.
ian@0 396
ian@0 397
ian@0 398 1.8 How do I use cpusets ?
ian@0 399 --------------------------
ian@0 400
ian@0 401 In order to minimize the impact of cpusets on critical kernel
ian@0 402 code, such as the scheduler, and due to the fact that the kernel
ian@0 403 does not support one task updating the memory placement of another
ian@0 404 task directly, the impact on a task of changing its cpuset CPU
ian@0 405 or Memory Node placement, or of changing to which cpuset a task
ian@0 406 is attached, is subtle.
ian@0 407
ian@0 408 If a cpuset has its Memory Nodes modified, then for each task attached
ian@0 409 to that cpuset, the next time that the kernel attempts to allocate
ian@0 410 a page of memory for that task, the kernel will notice the change
ian@0 411 in the task's cpuset, and update its per-task memory placement to
ian@0 412 remain within the new cpuset's memory placement. If the task was using
ian@0 413 mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
ian@0 414 its new cpuset, then the task will continue to use whatever subset
ian@0 415 of MPOL_BIND nodes are still allowed in the new cpuset. If the task
ian@0 416 was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
ian@0 417 in the new cpuset, then the task will be essentially treated as if it
ian@0 418 was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
ian@0 419 as queried by get_mempolicy(), doesn't change). If a task is moved
ian@0 420 from one cpuset to another, then the kernel will adjust the task's
ian@0 421 memory placement, as above, the next time that the kernel attempts
ian@0 422 to allocate a page of memory for that task.
ian@0 423
ian@0 424 If a cpuset has its CPUs modified, then each task using that
ian@0 425 cpuset does _not_ change its behavior automatically. In order to
ian@0 426 minimize the impact on the critical scheduling code in the kernel,
ian@0 427 tasks will continue to use their prior CPU placement until they
ian@0 428 are rebound to their cpuset, by rewriting their pid to the 'tasks'
ian@0 429 file of their cpuset. If a task had been bound to some subset of its
ian@0 430 cpuset using the sched_setaffinity() call, and if any of that subset
ian@0 431 is still allowed in its new cpuset settings, then the task will be
ian@0 432 restricted to the intersection of the CPUs it was allowed on before,
ian@0 433 and its new cpuset CPU placement. If, on the other hand, there is
ian@0 434 no overlap between a task's prior placement and its new cpuset CPU
ian@0 435 placement, then the task will be allowed to run on any CPU allowed
ian@0 436 in its new cpuset. If a task is moved from one cpuset to another,
ian@0 437 its CPU placement is updated in the same way as if the task's pid is
ian@0 438 rewritten to the 'tasks' file of its current cpuset.
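
The rebinding rule in the paragraph above (use the intersection if
there is one, otherwise the whole new cpuset) can be written out as a
small sketch. The function name rebind_cpus is hypothetical, and both
arguments are space-separated CPU numbers. It models the rule as
described here rather than the kernel code.

```shell
# rebind_cpus "PRIOR" "CPUSET"
# PRIOR is the task's earlier CPU placement, CPUSET the CPUs of its cpuset.
# Print the intersection, or all of CPUSET when the two do not overlap.
rebind_cpus() {
    out=""
    for c in $2; do
        for p in $1; do
            if [ "$c" -eq "$p" ]; then
                out="$out$c "
                break
            fi
        done
    done
    if [ -n "$out" ]; then
        printf '%s\n' "${out% }"
    else
        printf '%s\n' "$2"
    fi
}
```

A task previously limited to CPUs "2 3 4" whose cpuset becomes "3 4 5"
ends up on "3 4", while a task limited to "0 1" ends up on all of
"3 4 5".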
ian@0 439
ian@0 440 In summary, the memory placement of a task whose cpuset is changed is
ian@0 441 updated by the kernel, on the next allocation of a page for that task,
ian@0 442 but the processor placement is not updated, until that task's pid is
ian@0 443 rewritten to the 'tasks' file of its cpuset. This is done to avoid
ian@0 444 impacting the scheduler code in the kernel with a check for changes
ian@0 445 in a task's processor placement.
ian@0 446
ian@0 447 Normally, once a page is allocated (given a physical page
ian@0 448 of main memory) then that page stays on whatever node it
ian@0 449 was allocated on, so long as it remains allocated, even if the
ian@0 450 cpuset's memory placement policy 'mems' subsequently changes.
ian@0 451 If the cpuset flag file 'memory_migrate' is set true, then when
ian@0 452 tasks are attached to that cpuset, any pages that task had
ian@0 453 allocated to it on nodes in its previous cpuset are migrated
ian@0 454 to the task's new cpuset. The relative placement of the page within
ian@0 455 the cpuset is preserved during these migration operations if possible.
ian@0 456 For example, if the page was on the second valid node of the prior cpuset
ian@0 457 then the page will be placed on the second valid node of the new cpuset.
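
The "relative placement" idea can be sketched as a position lookup:
find where the page's node sits in the old node list and take the node
at the same position in the new list. The function name remap_node is
hypothetical. Wrapping around when the new list is shorter is an
assumption of this sketch, since the text only says the placement is
preserved "if possible".

```shell
# remap_node NODE "OLD" "NEW"
# Print the node in NEW at the same position that NODE has in OLD.
# A NODE that is not in OLD is printed unchanged (the page is not moved).
remap_node() {
    node=$1
    count=0
    for n in $3; do
        count=$((count + 1))
    done
    pos=0
    for o in $2; do
        if [ "$o" -eq "$node" ] && [ "$count" -gt 0 ]; then
            # Assumption: wrap around when NEW has fewer nodes than OLD.
            want=$((pos % count))
            i=0
            for n in $3; do
                if [ "$i" -eq "$want" ]; then
                    printf '%s\n' "$n"
                    return 0
                fi
                i=$((i + 1))
            done
        fi
        pos=$((pos + 1))
    done
    printf '%s\n' "$node"
}
```

With old nodes "4 5" and new nodes "8 9", a page on node 5 (the second
node) maps to node 9, and a page on node 7 stays where it is.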
ian@0 458
ian@0 459 Also if 'memory_migrate' is set true, then if that cpuset's
ian@0 460 'mems' file is modified, pages allocated to tasks in that
ian@0 461 cpuset, that were on nodes in the previous setting of 'mems',
ian@0 462 will be moved to nodes in the new setting of 'mems'.
ian@0 463 Pages that were not in the task's prior cpuset, or in the cpuset's
ian@0 464 prior 'mems' setting, will not be moved.
ian@0 465
ian@0 466 There is an exception to the above. If hotplug functionality is used
ian@0 467 to remove all the CPUs that are currently assigned to a cpuset,
ian@0 468 then the kernel will automatically update the cpus_allowed of all
ian@0 469 tasks attached to CPUs in that cpuset to allow all CPUs. When memory
ian@0 470 hotplug functionality for removing Memory Nodes is available, a
ian@0 471 similar exception is expected to apply there as well. In general,
ian@0 472 the kernel prefers to violate cpuset placement, over starving a task
ian@0 473 that has had all its allowed CPUs or Memory Nodes taken offline. User
ian@0 474 code should reconfigure cpusets to only refer to online CPUs and Memory
ian@0 475 Nodes when using hotplug to add or remove such resources.
ian@0 476
ian@0 477 There is a second exception to the above. GFP_ATOMIC requests are
ian@0 478 kernel internal allocations that must be satisfied immediately.
ian@0 479 The kernel may drop some request, in rare cases even panic, if a
ian@0 480 GFP_ATOMIC alloc fails. If the request cannot be satisfied within
ian@0 481 the current task's cpuset, then we relax the cpuset, and look for
ian@0 482 memory anywhere we can find it. It's better to violate the cpuset
ian@0 483 than stress the kernel.
ian@0 484
ian@0 485 To start a new job that is to be contained within a cpuset, the steps are:
ian@0 486
ian@0 487 1) mkdir /dev/cpuset
ian@0 488 2) mount -t cpuset none /dev/cpuset
ian@0 489 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
ian@0 490 the /dev/cpuset virtual file system.
ian@0 491 4) Start a task that will be the "founding father" of the new job.
ian@0 492 5) Attach that task to the new cpuset by writing its pid to the
ian@0 493 /dev/cpuset tasks file for that cpuset.
ian@0 494 6) fork, exec or clone the job tasks from this founding father task.
ian@0 495
ian@0 496 For example, the following sequence of commands will set up a cpuset
ian@0 497 named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
ian@0 498 and then start a subshell 'sh' in that cpuset:
ian@0 499
ian@0 500 mount -t cpuset none /dev/cpuset
ian@0 501 cd /dev/cpuset
ian@0 502 mkdir Charlie
ian@0 503 cd Charlie
ian@0 504 /bin/echo 2-3 > cpus
ian@0 505 /bin/echo 1 > mems
ian@0 506 /bin/echo $$ > tasks
ian@0 507 sh
ian@0 508 # The subshell 'sh' is now running in cpuset Charlie
ian@0 509 # The next line should display '/Charlie'
ian@0 510 cat /proc/self/cpuset
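
The creation steps of this example can also be wrapped in a function
that stops at the first failing step. The sketch below uses the same
order as the example (mkdir, then cpus, then mems). The function name
make_cpuset is hypothetical, and the mount point is passed as an
argument rather than hard-coded.

```shell
# make_cpuset ROOT NAME CPUS MEMS
# Create cpuset NAME under the cpuset mount point ROOT and assign its CPUs
# and Memory Nodes. Hypothetical helper, returns non-zero on the first error.
make_cpuset() {
    dir=$1/$2
    mkdir "$dir" || return 1
    # /bin/echo is used so that a rejected write shows up as a failure.
    /bin/echo "$3" > "$dir/cpus" || return 1
    /bin/echo "$4" > "$dir/mems" || return 1
}

# The "Charlie" example would then read:
#   make_cpuset /dev/cpuset Charlie 2-3 1 &&
#       /bin/echo $$ > /dev/cpuset/Charlie/tasks
```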
ian@0 511
ian@0 512 In the future, a C library interface to cpusets will likely be
ian@0 513 available. For now, the only way to query or modify cpusets is
ian@0 514 via the cpuset file system, using the various cd, mkdir, echo, cat,
ian@0 515 rmdir commands from the shell, or their equivalent from C.
ian@0 516
ian@0 517 The sched_setaffinity calls can also be done at the shell prompt using
ian@0 518 SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
ian@0 519 calls can be done at the shell prompt using the numactl command
ian@0 520 (part of Andi Kleen's numa package).
ian@0 521
ian@0 522 2. Usage Examples and Syntax
ian@0 523 ============================
ian@0 524
ian@0 525 2.1 Basic Usage
ian@0 526 ---------------
ian@0 527
ian@0 528 Creating, modifying, and using cpusets can be done through the cpuset
ian@0 529 virtual filesystem.
ian@0 530
ian@0 531 To mount it, type:
ian@0 532 # mount -t cpuset none /dev/cpuset
ian@0 533
ian@0 534 Then under /dev/cpuset you can find a tree that corresponds to the
ian@0 535 tree of the cpusets in the system. For instance, /dev/cpuset
ian@0 536 is the cpuset that holds the whole system.
ian@0 537
ian@0 538 If you want to create a new cpuset under /dev/cpuset:
ian@0 539 # cd /dev/cpuset
ian@0 540 # mkdir my_cpuset
ian@0 541
ian@0 542 Now you want to do something with this cpuset.
ian@0 543 # cd my_cpuset
ian@0 544
ian@0 545 In this directory you can find several files:
ian@0 546 # ls
ian@0 547 cpus cpu_exclusive mems mem_exclusive tasks
ian@0 548
ian@0 549 Reading them will give you information about the state of this cpuset:
ian@0 550 the CPUs and Memory Nodes it can use, the processes that are using
ian@0 551 it, its properties. By writing to these files you can manipulate
ian@0 552 the cpuset.
ian@0 553
ian@0 554 Set some flags:
ian@0 555 # /bin/echo 1 > cpu_exclusive
ian@0 556
ian@0 557 Add some cpus:
ian@0 558 # /bin/echo 0-7 > cpus
ian@0 559
ian@0 560 Now attach your shell to this cpuset:
ian@0 561 # /bin/echo $$ > tasks
ian@0 562
ian@0 563 You can also create cpusets inside your cpuset by using mkdir in this
ian@0 564 directory.
ian@0 565 # mkdir my_sub_cs
ian@0 566
ian@0 567 To remove a cpuset, just use rmdir:
ian@0 568 # rmdir my_sub_cs
ian@0 569 This will fail if the cpuset is in use (has cpusets inside, or has
ian@0 570 processes attached).
ian@0 571
ian@0 572 2.2 Adding/removing cpus
ian@0 573 ------------------------
ian@0 574
ian@0 575 This is the syntax to use when writing in the cpus or mems files
ian@0 576 in cpuset directories:
ian@0 577
ian@0 578 # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4
ian@0 579 # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4
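
Both forms describe the same set. To make the list syntax concrete,
the sketch below expands a list into individual numbers. The function
name expand_cpulist is hypothetical, and mixing ranges and single
numbers in one list (for example 0,2-3) is assumed to be accepted.

```shell
# expand_cpulist LIST
# Expand a list such as "1-4" or "0,2-3" into space-separated numbers.
expand_cpulist() {
    out=""
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-} ;;   # a range "lo-hi"
            *)   lo=$part; hi=$part ;;             # a single number
        esac
        i=$lo
        while [ "$i" -le "$hi" ]; do
            out="$out$i "
            i=$((i + 1))
        done
    done
    # Drop the trailing space before printing.
    printf '%s\n' "${out% }"
}
```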
ian@0 580
ian@0 581 2.3 Setting flags
ian@0 582 -----------------
ian@0 583
ian@0 584 The syntax is very simple:
ian@0 585
ian@0 586 # /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive'
ian@0 587 # /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive'
ian@0 588
ian@0 589 2.4 Attaching processes
ian@0 590 -----------------------
ian@0 591
ian@0 592 # /bin/echo PID > tasks
ian@0 593
ian@0 594 Note that it is PID, not PIDs. You can only attach ONE task at a time.
ian@0 595 If you have several tasks to attach, you have to do it one after another:
ian@0 596
ian@0 597 # /bin/echo PID1 > tasks
ian@0 598 # /bin/echo PID2 > tasks
ian@0 599 ...
ian@0 600 # /bin/echo PIDn > tasks
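
A loop keeps this manageable when there are many tasks. The sketch
below issues one write per pid and reports any pid that could not be
attached. The function name attach_pids is hypothetical.

```shell
# attach_pids TASKS_FILE PID...
# Write each pid to TASKS_FILE with its own write, one at a time.
# Returns non-zero if any attach failed.
attach_pids() {
    tasks=$1
    shift
    failed=0
    for pid in "$@"; do
        if ! /bin/echo "$pid" > "$tasks"; then
            echo "could not attach pid $pid" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example:
#   attach_pids /dev/cpuset/my_cpuset/tasks 1234 1235 1236
```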
ian@0 601
ian@0 602
ian@0 603 3. Questions
ian@0 604 ============
ian@0 605
ian@0 606 Q: What's up with this '/bin/echo' ?
ian@0 607 A: bash's builtin 'echo' command does not check calls to write() against
ian@0 608 errors. If you use it in the cpuset file system, you won't be
ian@0 609 able to tell whether a command succeeded or failed.
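
In a script, the exit status of /bin/echo can be tested directly. The
sketch below wraps a single write and reports a failure. The function
name checked_write is hypothetical.

```shell
# checked_write FILE VALUE
# Write VALUE to FILE with /bin/echo and report whether the write succeeded.
checked_write() {
    if /bin/echo "$2" > "$1"; then
        return 0
    fi
    echo "write of '$2' to $1 failed" >&2
    return 1
}

# Example:
#   checked_write /dev/cpuset/my_cpuset/cpu_exclusive 1 || exit 1
```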
ian@0 610
ian@0 611 Q: When I attach processes, only the first pid on the line really gets attached !
ian@0 612 A: We can only return one error code per call to write(). So you should
ian@0 613 write only ONE pid at a time.
ian@0 614
ian@0 615 4. Contact
ian@0 616 ==========
ian@0 617
ian@0 618 Web: http://www.bullopensource.org/cpuset