annotate Documentation/md.txt @ 0:831230e53067

Import 2.6.18 from kernel.org tarball.
author Ian Campbell <ian.campbell@xensource.com>
date Wed Apr 11 14:15:44 2007 +0100 (2007-04-11)
Tools that manage md devices can be found at
   http://www.<country>.kernel.org/pub/linux/utils/raid/....


Boot time assembly of RAID arrays
---------------------------------

You can boot with your md device with the following kernel command
lines:

for old raid arrays without persistent superblocks:
  md=<md device no.>,<raid level>,<chunk size factor>,<fault level>,dev0,dev1,...,devn

for raid arrays with persistent superblocks
  md=<md device no.>,dev0,dev1,...,devn
or, to assemble a partitionable array:
  md=d<md device no.>,dev0,dev1,...,devn

md device no. = the number of the md device ...
              0 means md0,
              1 md1,
              2 md2,
              3 md3,
              4 md4

raid level = -1 linear mode
              0 striped mode
              other modes are only supported with persistent super blocks

chunk size factor = (raid-0 and raid-1 only)
              Set the chunk size as 4k << n.

fault level = totally ignored

dev0-devn: e.g. /dev/hda1,/dev/hdc1,/dev/sda1,/dev/sdb1

A possible loadlin line (Harald Hoyer <HarryH@Royal.Net>) looks like this:

e:\loadlin\loadlin e:\zimage root=/dev/md0 md=0,0,4,0,/dev/hdb2,/dev/hdc3 ro
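
As an illustration only, the md= syntax above can be sketched in
Python. The function names are hypothetical and not part of any md
tool; the sketch only formats strings and computes the chunk size
as 4k << n. It does not check that the devices exist.

```python
PAGE_4K = 4096  # the "4k" base used for the chunk size factor


def chunk_size_bytes(factor: int) -> int:
    """Chunk size in bytes for chunk size factor n, i.e. 4k << n."""
    if factor < 0:
        raise ValueError("chunk size factor must be non-negative")
    return PAGE_4K << factor


def legacy_md_param(md_no, raid_level, chunk_factor, fault_level, devices):
    """Build md= for an old array without persistent superblocks."""
    fields = [md_no, raid_level, chunk_factor, fault_level, *devices]
    return "md=" + ",".join(str(field) for field in fields)


def persistent_md_param(md_no, devices, partitionable=False):
    """Build md= for an array with persistent superblocks."""
    prefix = "d" if partitionable else ""
    return "md=" + ",".join([f"{prefix}{md_no}", *devices])
```

With the loadlin line above, legacy_md_param(0, 0, 4, 0,
["/dev/hdb2", "/dev/hdc3"]) gives the same md= string, and a chunk
size factor of 4 corresponds to 64k chunks.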


Boot time autodetection of RAID arrays
--------------------------------------

When md is compiled into the kernel (not as a module), partitions of
type 0xfd are scanned and automatically assembled into RAID arrays.
This autodetection may be suppressed with the kernel parameter
"raid=noautodetect". As of kernel 2.6.9, only drives with a type 0
superblock can be autodetected and run at boot time.

The kernel parameter "raid=partitionable" (or "raid=part") means
that all auto-detected arrays are assembled as partitionable.

Boot time assembly of degraded/dirty arrays
-------------------------------------------

If a raid5 or raid6 array is both dirty and degraded, it could have
undetectable data corruption. This is because the fact that it is
'dirty' means that the parity cannot be trusted, and the fact that it
is degraded means that some datablocks are missing and cannot reliably
be reconstructed (due to no parity).

For this reason, md will normally refuse to start such an array. This
requires the sysadmin to take action to explicitly start the array
despite possible corruption. This is normally done with
   mdadm --assemble --force ....

This option is not really available if the array has the root
filesystem on it. In order to support booting from such an
array, md supports a module parameter "start_dirty_degraded" which,
when set to 1, bypasses the checks and allows dirty degraded
arrays to be started.

So, to boot with a root filesystem of a dirty degraded raid[56], use

   md-mod.start_dirty_degraded=1


Superblock formats
------------------

The md driver can support a variety of different superblock formats.
Currently, it supports superblock formats "0.90.0" and the "md-1" format
introduced in the 2.5 development series.

The kernel will autodetect which format superblock is being used.

Superblock format '0' is treated differently to others for legacy
reasons - it is the original superblock format.


General Rules - apply for all superblock formats
------------------------------------------------

An array is 'created' by writing appropriate superblocks to all
devices.

It is 'assembled' by associating each of these devices with a
particular md virtual device. Once it is completely assembled, it can
be accessed.

An array should be created by a user-space tool. This will write
superblocks to all devices. It will usually mark the array as
'unclean', or with some devices missing so that the kernel md driver
can create appropriate redundancy (copying in raid1, parity
calculation in raid4/5).

When an array is assembled, it is first initialized with the
SET_ARRAY_INFO ioctl. This contains, in particular, a major and minor
version number. The major version number selects which superblock
format is to be used. The minor number might be used to tune handling
of the format, such as suggesting where on each device to look for the
superblock.

Then each device is added using the ADD_NEW_DISK ioctl. This
provides, in particular, a major and minor number identifying the
device to add.

The array is started with the RUN_ARRAY ioctl.

Once started, new devices can be added. They should have an
appropriate superblock written to them, and then be passed in with
ADD_NEW_DISK.

Devices that have failed or are not yet active can be detached from an
array using HOT_REMOVE_DISK.
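
The ordering of the calls above can be pictured with a small toy
model in Python. This is only a sketch: the class and its methods
are hypothetical, it issues no real ioctls, and the ordering checks
it enforces are an assumption drawn from the text rather than a
description of what the kernel accepts or rejects.

```python
class ArrayAssemblyModel:
    """Toy model of the assembly sequence; it touches no real device."""

    def __init__(self):
        self.version = None    # (major, minor) superblock version
        self.devices = set()   # (major, minor) device numbers
        self.running = False

    def set_array_info(self, major_version, minor_version):
        # Models SET_ARRAY_INFO: selects the superblock format.
        if self.version is not None:
            raise RuntimeError("array already initialized")
        self.version = (major_version, minor_version)

    def add_new_disk(self, dev_major, dev_minor):
        # Models ADD_NEW_DISK: allowed before and after RUN_ARRAY.
        if self.version is None:
            raise RuntimeError("SET_ARRAY_INFO must come first")
        self.devices.add((dev_major, dev_minor))

    def run_array(self):
        # Models RUN_ARRAY: the array becomes accessible.
        if not self.devices:
            raise RuntimeError("no devices have been added")
        self.running = True

    def hot_remove_disk(self, dev_major, dev_minor):
        # Models HOT_REMOVE_DISK: detach a failed or inactive device.
        self.devices.discard((dev_major, dev_minor))
```

A typical sequence is set_array_info(), one add_new_disk() per
component device, then run_array().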


Specific Rules that apply to format-0 super block arrays, and
       arrays with no superblock (non-persistent).
-------------------------------------------------------------

An array can be 'created' by describing the array (level, chunksize
etc) in a SET_ARRAY_INFO ioctl. This must have major_version==0 and
raid_disks != 0.

Then uninitialized devices can be added with ADD_NEW_DISK. The
structure passed to ADD_NEW_DISK must specify the state of the device
and its role in the array.

Once started with RUN_ARRAY, uninitialized spares can be added with
HOT_ADD_DISK.



MD devices in sysfs
-------------------
md devices appear in sysfs (/sys) as regular block devices,
e.g.
   /sys/block/md0

Each 'md' device will contain a subdirectory called 'md' which
contains further md-specific information about the device.

All md devices contain:
  level
     a text file indicating the 'raid level'. This may be a standard
     numerical level prefixed by "RAID-" - e.g. "RAID-5", or some
     other name such as "linear" or "multipath".
     If no raid level has been set yet (array is still being
     assembled), this file will be empty.

  raid_disks
     a text file with a simple number indicating the number of devices
     in a fully functional array. If this is not yet known, the file
     will be empty. If an array is being resized (not currently
     possible) this will contain the larger of the old and new sizes.
     Some raid levels (RAID1) allow this value to be set while the
     array is active. This will reconfigure the array. Otherwise
     it can only be set while assembling an array.

  chunk_size
     This is the size in bytes for 'chunks' and is only relevant to
     raid levels that involve striping (0,4,5,6,10). The address space
     of the array is conceptually divided into chunks and consecutive
     chunks are striped onto neighbouring devices.
     The size should be at least PAGE_SIZE (4k) and should be a power
     of 2. This can only be set while assembling an array.
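
A minimal Python sketch of the chunk_size constraints and of the
chunk striping idea follows. The function names are hypothetical,
PAGE_SIZE is assumed to be 4k as stated above, and the mapping is
plain round-robin with no parity (a raid0-like simplification), not
the layout used by any particular raid level.

```python
PAGE_SIZE = 4096  # assumed 4k page size


def is_valid_chunk_size(size: int, page_size: int = PAGE_SIZE) -> bool:
    """True if size is at least page_size and a power of two."""
    return size >= page_size and (size & (size - 1)) == 0


def chunk_location(offset: int, chunk_size: int, n_devices: int):
    """Map an array byte offset to (device index, chunk index on device).

    Consecutive chunks go to neighbouring devices in round-robin order.
    """
    chunk = offset // chunk_size
    return chunk % n_devices, chunk // n_devices
```

For example, with 64k chunks on three devices, the fifth chunk of
the array (chunk number 4) lands on device 1 as that device's
second chunk.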

  component_size
     For arrays with data redundancy (i.e. not raid0, linear, faulty,
     multipath), all components must be the same size - or at least
     there must be a size that they all provide space for. This is a key
     part of the geometry of the array. It is measured in sectors
     and can be read from here. Writing to this value may resize
     the array if the personality supports it (raid1, raid5, raid6),
     and if the component drives are large enough.

  metadata_version
     This indicates the format that is being used to record metadata
     about the array. It can be 0.90 (traditional format), 1.0, 1.1,
     1.2 (newer format in varying locations) or "none" indicating that
     the kernel isn't managing metadata at all.

  level
     The raid 'level' for this array. The name will often (but not
     always) be the same as the name of the module that implements the
     level. To be auto-loaded the module must have an alias
        md-$LEVEL  e.g. md-raid5
     This can be written only while the array is being assembled, not
     after it is started.

  layout
     The "layout" for the array for the particular level. This is
     simply a number that is interpreted differently by different
     levels. It can be written while assembling an array.

  resync_start
     The point at which resync should start. If no resync is needed,
     this will be a very large number. At array creation it will
     default to 0, though starting the array as 'clean' will
     set it much larger.

  new_dev
     This file can be written but not read. The value written should
     be a block device number as major:minor. e.g. 8:0
     This will cause that device to be attached to the array, if it is
     available. It will then appear at md/dev-XXX (depending on the
     name of the device) and further configuration is then possible.
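
The major:minor string expected by new_dev can be derived from a
device node. The following Python sketch uses the standard os
module; the helper names are hypothetical, and actually writing the
result to new_dev is left out.

```python
import os


def dev_number_string(rdev: int) -> str:
    """Format a raw device number as 'major:minor'."""
    return f"{os.major(rdev)}:{os.minor(rdev)}"


def new_dev_value(device_path: str) -> str:
    """Return major:minor for an existing block device node."""
    return dev_number_string(os.stat(device_path).st_rdev)
```

Here dev_number_string(os.makedev(8, 0)) yields "8:0", the example
value given above.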

  safe_mode_delay
     When an md array has seen no write requests for a certain period
     of time, it will be marked as 'clean'. When another write
     request arrives, the array is marked as 'dirty' before the write
     commences. This is known as 'safe_mode'.
     The 'certain period' is controlled by this file which stores the
     period as a number of seconds. The default is 200msec (0.200).
     Writing a value of 0 disables safemode.

  array_state
     This file contains a single word which describes the current
     state of the array. In many cases, the state can be set by
     writing the word for the desired state, however some states
     cannot be explicitly set, and some transitions are not allowed.

     clear
         No devices, no size, no level
         Writing is equivalent to STOP_ARRAY ioctl
     inactive
         May have some settings, but array is not active
            all IO results in error
         When written, doesn't tear down array, but just stops it
     suspended (not supported yet)
         All IO requests will block. The array can be reconfigured.
         Writing this, if accepted, will block until array is quiescent
     readonly
         no resync can happen. no superblocks get written.
         write requests fail
     read-auto
         like readonly, but behaves like 'clean' on a write request.

     clean - no pending writes, but otherwise active.
         When written to inactive array, starts without resync
         If a write request arrives then
           if metadata is known, mark 'dirty' and switch to 'active'.
           if not known, block and switch to write-pending
         If written to an active array that has pending writes, then fails.
     active
         fully active: IO and resync can be happening.
         When written to inactive array, starts with resync

     write-pending
         clean, but writes are blocked waiting for 'active' to be written.

     active-idle
         like active, but no writes have been seen for a while (safe_mode_delay).
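
Reading array_state from a script might look like the Python sketch
below. The function names are hypothetical; the set of words is
taken from the list above, and rejecting any other word is an
assumption of this sketch, since the list of states could change.

```python
from pathlib import Path

# State words as listed above.
ARRAY_STATES = {
    "clear", "inactive", "suspended", "readonly", "read-auto",
    "clean", "active", "write-pending", "active-idle",
}


def parse_array_state(text: str) -> str:
    """Return the single state word, rejecting unknown words."""
    word = text.strip()
    if word not in ARRAY_STATES:
        raise ValueError(f"unexpected array_state: {word!r}")
    return word


def read_array_state(md_dir: str) -> str:
    """Read the state from e.g. /sys/block/md0/md/array_state."""
    return parse_array_state((Path(md_dir) / "array_state").read_text())
```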


  sync_speed_min
  sync_speed_max
     These are similar to /proc/sys/dev/raid/speed_limit_{min,max}
     however they only apply to the particular array.
     If no value has been written to these, or if the word 'system'
     is written, then the system-wide value is used. If a value,
     in kibibytes-per-second, is written, then it is used.
     When the files are read, they show the currently active value
     followed by "(local)" or "(system)" depending on whether it is
     a locally set or system-wide value.
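
The value read from sync_speed_min or sync_speed_max can be split
into its number and its scope. This Python sketch assumes the two
parts are separated by whitespace, as in "1000 (system)"; the
function name is hypothetical.

```python
def parse_sync_speed_limit(text: str):
    """Split e.g. '1000 (system)' into (1000, 'system')."""
    value, scope = text.split()
    if scope not in ("(local)", "(system)"):
        raise ValueError(f"unexpected scope: {scope!r}")
    return int(value), scope.strip("()")
```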

  sync_completed
     This shows the number of sectors that have been completed of
     whatever the current sync_action is, followed by the number of
     sectors in total that could need to be processed. The two
     numbers are separated by a '/' thus effectively showing one
     value, a fraction of the process that is complete.
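
The two numbers in sync_completed can be turned into a fraction as
in the Python sketch below. The function names are hypothetical,
and returning None when the total is zero is a choice made for this
sketch, not documented behaviour.

```python
from fractions import Fraction


def parse_sync_completed(text: str):
    """Split 'done / total' into a pair of sector counts."""
    done, total = (int(part) for part in text.split("/"))
    return done, total


def fraction_complete(text: str):
    """Fraction of the current sync_action that is complete."""
    done, total = parse_sync_completed(text)
    return Fraction(done, total) if total else None
```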

  sync_speed
     This shows the current actual speed, in K/sec, of the current
     sync_action. It is averaged over the last 30 seconds.


As component devices are added to an md array, they appear in the 'md'
directory as new directories named
      dev-XXX
where XXX is a name that the kernel knows for the device, e.g. hdb1.
Each directory contains:

      block
        a symlink to the block device in /sys/block, e.g.
             /sys/block/md0/md/dev-hdb1/block -> ../../../../block/hdb/hdb1

      super
        A file containing an image of the superblock read from, or
        written to, that device.

      state
        A file recording the current state of the device in the array
        which can be a comma separated list of
              faulty      - device has been kicked from active use due to
                            a detected fault
              in_sync     - device is a fully in-sync member of the array
              writemostly - device will only be subject to read
                            requests if there are no other options.
                            This applies only to raid1 arrays.
              spare       - device is working, but not a full member.
                            This includes spares that are in the process
                            of being recovered to
        This list may grow in future.
        This can be written to.
        Writing "faulty" simulates a failure on the device.
        Writing "remove" removes the device from the array.
        Writing "writemostly" sets the writemostly flag.
        Writing "-writemostly" clears the writemostly flag.
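
Because the state file holds a comma separated list that may grow,
a reader should tolerate unknown words. A minimal Python sketch,
with hypothetical function names, follows.

```python
def parse_device_state(text: str) -> set:
    """Return the set of flags in a comma separated state string."""
    return {flag.strip() for flag in text.strip().split(",") if flag.strip()}


def is_healthy_member(text: str) -> bool:
    """True if the device is in_sync and has not been marked faulty."""
    flags = parse_device_state(text)
    return "in_sync" in flags and "faulty" not in flags
```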

      errors
        An approximate count of read errors that have been detected on
        this device but have not caused the device to be evicted from
        the array (either because they were corrected or because they
        happened while the array was read-only). When using version-1
        metadata, this value persists across restarts of the array.

        This value can be written while assembling an array thus
        providing an ongoing count for arrays with metadata managed by
        userspace.

      slot
        This gives the role that the device has in the array. It will
        either be 'none' if the device is not active in the array
        (i.e. is a spare or has failed) or an integer less than the
        'raid_disks' number for the array indicating which position
        it currently fills. This can only be set while assembling an
        array. A device for which this is set is assumed to be working.

      offset
        This gives the location in the device (in sectors from the
        start) where data from the array will be stored. Any part of
        the device before this offset is not touched, unless it is
        used for storing metadata (Formats 1.1 and 1.2).

      size
        The amount of the device, after the offset, that can be used
        for storage of data. This will normally be the same as the
        component_size. This can be written while assembling an
        array. If a value less than the current component_size is
        written, component_size will be reduced to this value.


An active md device will also contain an entry for each active device
in the array. These are named

    rdNN

where 'NN' is the position in the array, starting from 0.
So for a 3 drive array there will be rd0, rd1, rd2.
These are symbolic links to the appropriate 'dev-XXX' entry.
Thus, for example,
       cat /sys/block/md*/md/rd*/state
will show 'in_sync' on every line.
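
The same check as the cat command above can be scripted. This
Python sketch uses hypothetical function names and reports any rdNN
entry whose state list does not include 'in_sync'.

```python
import glob


def read_rd_states(sys_block: str = "/sys/block") -> dict:
    """Map each md*/md/rd*/state path to its contents."""
    states = {}
    for path in sorted(glob.glob(f"{sys_block}/md*/md/rd*/state")):
        with open(path) as state_file:
            states[path] = state_file.read().strip()
    return states


def not_in_sync(states: dict) -> list:
    """Paths whose comma separated state lacks 'in_sync'."""
    return [path for path, state in states.items()
            if "in_sync" not in state.split(",")]
```

An empty list from not_in_sync(read_rd_states()) corresponds to
every line of the cat output containing 'in_sync'.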



Active md devices for levels that support data redundancy (1,4,5,6)
also have

   sync_action
     a text file that can be used to monitor and control the rebuild
     process. It contains one word which can be one of:
       resync        - redundancy is being recalculated after unclean
                       shutdown or creation
       recover       - a hot spare is being built to replace a
                       failed/missing device
       idle          - nothing is happening
       check         - A full check of redundancy was requested and is
                       happening. This reads all blocks and checks
                       them. A repair may also happen for some raid
                       levels.
       repair        - A full check and repair is happening. This is
                       similar to 'resync', but was requested by the
                       user, and the write-intent bitmap is NOT used to
                       optimise the process.

      This file is writable, and each of the strings that could be
      read are meaningful for writing.

       'idle' will stop an active resync/recovery etc. There is no
           guarantee that another resync/recovery may not be automatically
           started again, though some event will be needed to trigger
           this.
       'resync' or 'recover' can be used to restart the
           corresponding operation if it was stopped with 'idle'.
       'check' and 'repair' will start the appropriate process
           providing the current state is 'idle'.
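
The rules for writing sync_action can be sketched in Python as
follows. The function names are hypothetical; the only rule encoded
is the one stated above, that 'check' and 'repair' start only when
the current state is 'idle'. What the kernel does with a write in
other situations is not modelled.

```python
from pathlib import Path

# Words that can be read from, and written to, sync_action.
SYNC_ACTIONS = {"resync", "recover", "idle", "check", "repair"}


def may_request(action: str, current: str) -> bool:
    """Whether writing `action` is expected to start given `current`."""
    if action not in SYNC_ACTIONS:
        raise ValueError(f"unknown sync_action: {action!r}")
    if action in ("check", "repair"):
        return current == "idle"
    return True


def request_sync_action(md_dir: str, action: str) -> bool:
    """Write `action` to <md_dir>/sync_action if the rules allow it."""
    path = Path(md_dir) / "sync_action"
    current = path.read_text().strip()
    if not may_request(action, current):
        return False
    path.write_text(action + "\n")
    return True
```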

   mismatch_cnt
      When performing 'check' and 'repair', and possibly when
      performing 'resync', md will count the number of errors that are
      found. The count in 'mismatch_cnt' is the number of sectors
      that were re-written, or (for 'check') would have been
      re-written. As most raid levels work in units of pages rather
      than sectors, this may be larger than the number of actual errors
      by a factor of the number of sectors in a page.
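
To make the sectors-per-page factor concrete, the Python sketch
below converts a mismatch_cnt value into a number of pages. The
512-byte sector and 4k page sizes are assumptions of this sketch,
and the function names are hypothetical.

```python
SECTOR_SIZE = 512   # assumed sector size in bytes
PAGE_SIZE = 4096    # assumed page size in bytes


def sectors_per_page(page_size: int = PAGE_SIZE,
                     sector_size: int = SECTOR_SIZE) -> int:
    """Number of sectors in one page."""
    return page_size // sector_size


def mismatched_pages(mismatch_cnt: int, page_size: int = PAGE_SIZE) -> int:
    """Pages covered by mismatch_cnt sectors, rounded up."""
    per_page = sectors_per_page(page_size)
    return -(-mismatch_cnt // per_page)  # ceiling division
```

Under these assumptions a page holds 8 sectors, so a mismatch_cnt
of 16 corresponds to 2 pages.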

Each active md device may also have attributes specific to the
personality module that manages it.
These are specific to the implementation of the module and could
change substantially if the implementation changes.

These currently include

  stripe_cache_size  (currently raid5 only)
      number of entries in the stripe cache. This is writable, but
      there are upper and lower limits (32768, 16). Default is 128.
  stripe_cache_active (currently raid5 only)
      number of active entries in the stripe cache