ia64/linux-2.6.18-xen.hg

diff Documentation/md.txt @ 0:831230e53067

Import 2.6.18 from kernel.org tarball.
author Ian Campbell <ian.campbell@xensource.com>
date Wed Apr 11 14:15:44 2007 +0100 (2007-04-11)
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/Documentation/md.txt	Wed Apr 11 14:15:44 2007 +0100
     1.3 @@ -0,0 +1,424 @@
     1.4 +Tools that manage md devices can be found at
     1.5 +   http://www.<country>.kernel.org/pub/linux/utils/raid/....
     1.6 +
     1.7 +
     1.8 +Boot time assembly of RAID arrays
     1.9 +---------------------------------
    1.10 +
    1.11 +You can boot with your md device with the following kernel command
    1.12 +lines:
    1.13 +
    1.14 +for old raid arrays without persistent superblocks:
    1.15 +  md=<md device no.>,<raid level>,<chunk size factor>,<fault level>,dev0,dev1,...,devn
    1.16 +
    1.17 +for raid arrays with persistent superblocks
    1.18 +  md=<md device no.>,dev0,dev1,...,devn
    1.19 +or, to assemble a partitionable array:
    1.20 +  md=d<md device no.>,dev0,dev1,...,devn
    1.21 +  
    1.22 +md device no. = the number of the md device ... 
    1.23 +              0 means md0, 
    1.24 +	      1 md1,
    1.25 +	      2 md2,
    1.26 +	      3 md3,
    1.27 +	      4 md4
    1.28 +
    1.29 +raid level = -1 linear mode
    1.30 +              0 striped mode
    1.31 +	      other modes are only supported with persistent super blocks
    1.32 +
    1.33 +chunk size factor = (raid-0 and raid-1 only)
    1.34 +              Set  the chunk size as 4k << n.
    1.35 +	      
    1.36 +fault level = totally ignored
    1.37 +			    
    1.38 +dev0-devn: e.g. /dev/hda1,/dev/hdc1,/dev/sda1,/dev/sdb1
    1.39 +			    
    1.40 +A possible loadlin line (Harald Hoyer <HarryH@Royal.Net>)  looks like this:
    1.41 +
    1.42 +e:\loadlin\loadlin e:\zimage root=/dev/md0 md=0,0,4,0,/dev/hdb2,/dev/hdc3 ro
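          +
          +Similarly, a partitionable array with a persistent superblock could, for
          +example, be assembled at boot with a line like (device names are purely
          +illustrative):
          +
          +   md=d0,/dev/sda1,/dev/sdb1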
    1.43 +
    1.44 +
    1.45 +Boot time autodetection of RAID arrays
    1.46 +--------------------------------------
    1.47 +
    1.48 +When md is compiled into the kernel (not as module), partitions of
    1.49 +type 0xfd are scanned and automatically assembled into RAID arrays.
    1.50 +This autodetection may be suppressed with the kernel parameter
    1.51 +"raid=noautodetect".  As of kernel 2.6.9, only drives with a type 0
    1.52 +superblock can be autodetected and run at boot time.
    1.53 +
    1.54 +The kernel parameter "raid=partitionable" (or "raid=part") means
    1.55 +that all auto-detected arrays are assembled as partitionable.
    1.56 +
    1.57 +Boot time assembly of degraded/dirty arrays
    1.58 +-------------------------------------------
    1.59 +
    1.60 +If a raid5 or raid6 array is both dirty and degraded, it could have
    1.61 +undetectable data corruption.  This is because the fact that it is
    1.62 +'dirty' means that the parity cannot be trusted, and the fact that it
    1.63 +is degraded means that some datablocks are missing and cannot reliably
    1.64 +be reconstructed (due to no parity).
    1.65 +
    1.66 +For this reason, md will normally refuse to start such an array.  This
    1.67 +requires the sysadmin to take action to explicitly start the array
     1.68 +despite possible corruption.  This is normally done with
    1.69 +   mdadm --assemble --force ....
    1.70 +
    1.71 +This option is not really available if the array has the root
     1.72 +filesystem on it.  In order to support booting from such an
    1.73 +array, md supports a module parameter "start_dirty_degraded" which,
     1.74 +when set to 1, bypasses the checks and allows dirty degraded
    1.75 +arrays to be started.
    1.76 +
    1.77 +So, to boot with a root filesystem of a dirty degraded raid[56], use
    1.78 +
    1.79 +   md-mod.start_dirty_degraded=1
    1.80 +
    1.81 +
    1.82 +Superblock formats
    1.83 +------------------
    1.84 +
    1.85 +The md driver can support a variety of different superblock formats.
    1.86 +Currently, it supports superblock formats "0.90.0" and the "md-1" format
    1.87 +introduced in the 2.5 development series.
    1.88 +
    1.89 +The kernel will autodetect which format superblock is being used.
    1.90 +
    1.91 +Superblock format '0' is treated differently to others for legacy
    1.92 +reasons - it is the original superblock format.
    1.93 +
    1.94 +
    1.95 +General Rules - apply for all superblock formats
    1.96 +------------------------------------------------
    1.97 +
    1.98 +An array is 'created' by writing appropriate superblocks to all
    1.99 +devices.
   1.100 +
    1.101 +It is 'assembled' by associating each of these devices with a
   1.102 +particular md virtual device.  Once it is completely assembled, it can
   1.103 +be accessed.
   1.104 +
   1.105 +An array should be created by a user-space tool.  This will write
   1.106 +superblocks to all devices.  It will usually mark the array as
   1.107 +'unclean', or with some devices missing so that the kernel md driver
   1.108 +can create appropriate redundancy (copying in raid1, parity
   1.109 +calculation in raid4/5).
   1.110 +
   1.111 +When an array is assembled, it is first initialized with the
   1.112 +SET_ARRAY_INFO ioctl.  This contains, in particular, a major and minor
   1.113 +version number.  The major version number selects which superblock
   1.114 +format is to be used.  The minor number might be used to tune handling
   1.115 +of the format, such as suggesting where on each device to look for the
   1.116 +superblock.
   1.117 +
   1.118 +Then each device is added using the ADD_NEW_DISK ioctl.  This
   1.119 +provides, in particular, a major and minor number identifying the
   1.120 +device to add.
   1.121 +
   1.122 +The array is started with the RUN_ARRAY ioctl.
   1.123 +
   1.124 +Once started, new devices can be added.  They should have an
    1.125 +appropriate superblock written to them, and then be passed in with
   1.126 +ADD_NEW_DISK.
   1.127 +
   1.128 +Devices that have failed or are not yet active can be detached from an
   1.129 +array using HOT_REMOVE_DISK.
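          +
          +The following is a rough sketch of what this sequence can look like from
          +user space.  It only illustrates the ioctls described above and is not a
          +complete tool (mdadm handles many more details); the device names, the
          +0.90 version numbers and the minimal error handling are all just
          +examples.
          +
          +   #include <fcntl.h>
          +   #include <stdio.h>
          +   #include <string.h>
          +   #include <sys/ioctl.h>
          +   #include <sys/stat.h>
          +   #include <sys/sysmacros.h>
          +   #include <sys/types.h>
          +   #include <unistd.h>
          +   #include <linux/major.h>        /* MD_MAJOR */
          +   #include <linux/raid/md_u.h>    /* mdu_array_info_t, mdu_disk_info_t */
          +
          +   int main(void)
          +   {
          +       mdu_array_info_t info;
          +       mdu_disk_info_t disk;
          +       struct stat st;
          +       int fd = open("/dev/md0", O_RDWR);       /* example md device */
          +
          +       if (fd < 0) {
          +           perror("open /dev/md0");
          +           return 1;
          +       }
          +
          +       /* 1. Initialise the array; the major version selects the
          +        *    superblock format (0 => 0.90.0). */
          +       memset(&info, 0, sizeof(info));
          +       info.major_version = 0;
          +       info.minor_version = 90;
          +       if (ioctl(fd, SET_ARRAY_INFO, &info) < 0)
          +           perror("SET_ARRAY_INFO");
          +
          +       /* 2. Add one component device, identified by major:minor. */
          +       if (stat("/dev/sda1", &st) == 0) {        /* example component */
          +           memset(&disk, 0, sizeof(disk));
          +           disk.major = major(st.st_rdev);
          +           disk.minor = minor(st.st_rdev);
          +           if (ioctl(fd, ADD_NEW_DISK, &disk) < 0)
          +               perror("ADD_NEW_DISK");
          +       }
          +       /* ... repeat ADD_NEW_DISK for the remaining components ... */
          +
          +       /* 3. Start the array. */
          +       if (ioctl(fd, RUN_ARRAY, NULL) < 0)
          +           perror("RUN_ARRAY");
          +
          +       close(fd);
          +       return 0;
          +   }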
   1.130 +
   1.131 +
   1.132 +Specific Rules that apply to format-0 super block arrays, and
   1.133 +       arrays with no superblock (non-persistent).
   1.134 +-------------------------------------------------------------
   1.135 +
   1.136 +An array can be 'created' by describing the array (level, chunksize
    1.137 +etc) in a SET_ARRAY_INFO ioctl.  This must have major_version==0 and
   1.138 +raid_disks != 0.
   1.139 +
   1.140 +Then uninitialized devices can be added with ADD_NEW_DISK.  The
   1.141 +structure passed to ADD_NEW_DISK must specify the state of the device
    1.142 +and its role in the array.
   1.143 +
   1.144 +Once started with RUN_ARRAY, uninitialized spares can be added with
   1.145 +HOT_ADD_DISK.
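          +
          +As a small, hypothetical sketch of that last step (the array and spare
          +paths below are only examples), the hot-add could look like:
          +
          +   #include <fcntl.h>
          +   #include <stdio.h>
          +   #include <sys/ioctl.h>
          +   #include <sys/stat.h>
          +   #include <unistd.h>
          +   #include <linux/major.h>        /* MD_MAJOR */
          +   #include <linux/raid/md_u.h>    /* HOT_ADD_DISK */
          +
          +   /* Add an uninitialised spare to a running format-0 array. */
          +   int hot_add_spare(const char *md_dev, const char *spare)
          +   {
          +       struct stat st;
          +       int fd;
          +
          +       if (stat(spare, &st) < 0) {
          +           perror("stat spare");
          +           return -1;
          +       }
          +       fd = open(md_dev, O_RDWR);
          +       if (fd < 0) {
          +           perror("open md device");
          +           return -1;
          +       }
          +       /* The argument is the spare's device number, passed by value. */
          +       if (ioctl(fd, HOT_ADD_DISK, (unsigned long)st.st_rdev) < 0)
          +           perror("HOT_ADD_DISK");
          +       close(fd);
          +       return 0;
          +   }
          +
          +   /* e.g. hot_add_spare("/dev/md0", "/dev/sdc1"); */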
   1.146 +
   1.147 +
   1.148 +
   1.149 +MD devices in sysfs
   1.150 +-------------------
   1.151 +md devices appear in sysfs (/sys) as regular block devices,
   1.152 +e.g.
   1.153 +   /sys/block/md0
   1.154 +
   1.155 +Each 'md' device will contain a subdirectory called 'md' which
   1.156 +contains further md-specific information about the device.
   1.157 +
   1.158 +All md devices contain:
   1.159 +  level
   1.160 +     a text file indicating the 'raid level'.  This may be a standard
   1.161 +     numerical level prefixed by "RAID-" - e.g. "RAID-5", or some
   1.162 +     other name such as "linear" or "multipath".
   1.163 +     If no raid level has been set yet (array is still being
   1.164 +     assembled), this file will be empty.
   1.165 +
   1.166 +  raid_disks
   1.167 +     a text file with a simple number indicating the number of devices
   1.168 +     in a fully functional array.  If this is not yet known, the file
   1.169 +     will be empty.  If an array is being resized (not currently
   1.170 +     possible) this will contain the larger of the old and new sizes.
    1.171 +     Some raid levels (RAID1) allow this value to be set while the
   1.172 +     array is active.  This will reconfigure the array.   Otherwise
   1.173 +     it can only be set while assembling an array.
   1.174 +
   1.175 +  chunk_size
    1.176 +     This is the size in bytes for 'chunks' and is only relevant to
    1.177 +     raid levels that involve striping (0,4,5,6,10). The address space
   1.178 +     of the array is conceptually divided into chunks and consecutive
   1.179 +     chunks are striped onto neighbouring devices.
    1.180 +     The size should be at least PAGE_SIZE (4k) and should be a power
    1.181 +     of 2.  This can only be set while assembling an array.
   1.182 +
   1.183 +  component_size
   1.184 +     For arrays with data redundancy (i.e. not raid0, linear, faulty,
   1.185 +     multipath), all components must be the same size - or at least
    1.186 +     there must be a size that they all provide space for.  This is a key
    1.187 +     part of the geometry of the array.  It is measured in sectors
   1.188 +     and can be read from here.  Writing to this value may resize
   1.189 +     the array if the personality supports it (raid1, raid5, raid6),
   1.190 +     and if the component drives are large enough.
   1.191 +
   1.192 +  metadata_version
   1.193 +     This indicates the format that is being used to record metadata
   1.194 +     about the array.  It can be 0.90 (traditional format), 1.0, 1.1,
   1.195 +     1.2 (newer format in varying locations) or "none" indicating that
   1.196 +     the kernel isn't managing metadata at all.
   1.197 +
   1.198 +  level
   1.199 +     The raid 'level' for this array.  The name will often (but not
   1.200 +     always) be the same as the name of the module that implements the
   1.201 +     level.  To be auto-loaded the module must have an alias
   1.202 +        md-$LEVEL  e.g. md-raid5
   1.203 +     This can be written only while the array is being assembled, not
   1.204 +     after it is started.
   1.205 +
   1.206 +  layout
   1.207 +     The "layout" for the array for the particular level.  This is
    1.208 +     simply a number that is interpreted differently by different
   1.209 +     levels.  It can be written while assembling an array.
   1.210 +
   1.211 +  resync_start
   1.212 +     The point at which resync should start.  If no resync is needed,
   1.213 +     this will be a very large number.  At array creation it will
   1.214 +     default to 0, though starting the array as 'clean' will
   1.215 +     set it much larger.
   1.216 +
   1.217 +   new_dev
   1.218 +     This file can be written but not read.  The value written should
   1.219 +     be a block device number as major:minor.  e.g. 8:0
   1.220 +     This will cause that device to be attached to the array, if it is
   1.221 +     available.  It will then appear at md/dev-XXX (depending on the
   1.222 +     name of the device) and further configuration is then possible.
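          +     (A short C sketch of writing such an attribute appears after the
          +     end of this list.)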
   1.223 +
   1.224 +   safe_mode_delay
   1.225 +     When an md array has seen no write requests for a certain period
   1.226 +     of time, it will be marked as 'clean'.  When another write
    1.227 +     request arrives, the array is marked as 'dirty' before the write
    1.228 +     commences.  This is known as 'safe_mode'.
   1.229 +     The 'certain period' is controlled by this file which stores the
   1.230 +     period as a number of seconds.  The default is 200msec (0.200).
   1.231 +     Writing a value of 0 disables safemode.
   1.232 +
   1.233 +   array_state
   1.234 +     This file contains a single word which describes the current
   1.235 +     state of the array.  In many cases, the state can be set by
    1.236 +     writing the word for the desired state; however, some states
   1.237 +     cannot be explicitly set, and some transitions are not allowed.
   1.238 +
   1.239 +     clear
   1.240 +         No devices, no size, no level
   1.241 +         Writing is equivalent to STOP_ARRAY ioctl
   1.242 +     inactive
   1.243 +         May have some settings, but array is not active
   1.244 +            all IO results in error
   1.245 +         When written, doesn't tear down array, but just stops it
   1.246 +     suspended (not supported yet)
   1.247 +         All IO requests will block. The array can be reconfigured.
    1.248 +         Writing this, if accepted, will block until the array is quiescent
   1.249 +     readonly
   1.250 +         no resync can happen.  no superblocks get written.
   1.251 +         write requests fail
   1.252 +     read-auto
   1.253 +         like readonly, but behaves like 'clean' on a write request.
   1.254 +
   1.255 +     clean - no pending writes, but otherwise active.
   1.256 +         When written to inactive array, starts without resync
   1.257 +         If a write request arrives then
   1.258 +           if metadata is known, mark 'dirty' and switch to 'active'.
   1.259 +           if not known, block and switch to write-pending
   1.260 +         If written to an active array that has pending writes, then fails.
   1.261 +     active
   1.262 +         fully active: IO and resync can be happening.
   1.263 +         When written to inactive array, starts with resync
   1.264 +
   1.265 +     write-pending
   1.266 +         clean, but writes are blocked waiting for 'active' to be written.
   1.267 +
   1.268 +     active-idle
   1.269 +         like active, but no writes have been seen for a while (safe_mode_delay).
   1.270 +
   1.271 +
   1.272 +   sync_speed_min
   1.273 +   sync_speed_max
    1.274 +     These are similar to /proc/sys/dev/raid/speed_limit_{min,max}
   1.275 +     however they only apply to the particular array.
    1.276 +     If no value has been written to these, or if the word 'system'
   1.277 +     is written, then the system-wide value is used.  If a value,
   1.278 +     in kibibytes-per-second is written, then it is used.
   1.279 +     When the files are read, they show the currently active value
   1.280 +     followed by "(local)" or "(system)" depending on whether it is
   1.281 +     a locally set or system-wide value.
   1.282 +
   1.283 +   sync_completed
   1.284 +     This shows the number of sectors that have been completed of
   1.285 +     whatever the current sync_action is, followed by the number of
   1.286 +     sectors in total that could need to be processed.  The two
   1.287 +     numbers are separated by a '/'  thus effectively showing one
    1.288 +     value, the fraction of the process that is complete.
   1.289 +
   1.290 +   sync_speed
   1.291 +     This shows the current actual speed, in K/sec, of the current
   1.292 +     sync_action.  It is averaged over the last 30 seconds.
   1.293 +
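          +All of the attributes above are ordinary sysfs files, so they can be
          +driven from scripts or from C.  As a small sketch (the helper name, the
          +'md0' path and the 8:0 device number are only examples, following the
          +'new_dev' description above):
          +
          +   #include <fcntl.h>
          +   #include <stdio.h>
          +   #include <string.h>
          +   #include <unistd.h>
          +
          +   /* Write a value to one attribute under /sys/block/<md>/md/ */
          +   static int write_md_attr(const char *md, const char *attr,
          +                            const char *val)
          +   {
          +       char path[128];
          +       int fd, ret = 0;
          +
          +       snprintf(path, sizeof(path), "/sys/block/%s/md/%s", md, attr);
          +       fd = open(path, O_WRONLY);
          +       if (fd < 0)
          +           return -1;
          +       if (write(fd, val, strlen(val)) < 0)
          +           ret = -1;
          +       close(fd);
          +       return ret;
          +   }
          +
          +   int main(void)
          +   {
          +       /* Attach the device numbered 8:0 (the example above) to md0. */
          +       if (write_md_attr("md0", "new_dev", "8:0") < 0)
          +           perror("new_dev");
          +       return 0;
          +   }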
   1.294 +
   1.295 +As component devices are added to an md array, they appear in the 'md'
   1.296 +directory as new directories named
   1.297 +      dev-XXX
   1.298 +where XXX is a name that the kernel knows for the device, e.g. hdb1.
   1.299 +Each directory contains:
   1.300 +
   1.301 +      block
   1.302 +        a symlink to the block device in /sys/block, e.g.
   1.303 +	     /sys/block/md0/md/dev-hdb1/block -> ../../../../block/hdb/hdb1
   1.304 +
   1.305 +      super
   1.306 +        A file containing an image of the superblock read from, or
   1.307 +        written to, that device.
   1.308 +
   1.309 +      state
   1.310 +        A file recording the current state of the device in the array
   1.311 +	which can be a comma separated list of
   1.312 +	      faulty   - device has been kicked from active use due to
   1.313 +                         a detected fault
   1.314 +	      in_sync  - device is a fully in-sync member of the array
   1.315 +	      writemostly - device will only be subject to read
   1.316 +		         requests if there are no other options.
   1.317 +			 This applies only to raid1 arrays.
   1.318 +	      spare    - device is working, but not a full member.
   1.319 +			 This includes spares that are in the process
    1.320 +			 of being recovered to
    1.321 +	This list may grow in future.
   1.322 +	This can be written to.
   1.323 +	Writing "faulty"  simulates a failure on the device.
   1.324 +	Writing "remove" removes the device from the array.
   1.325 +	Writing "writemostly" sets the writemostly flag.
   1.326 +	Writing "-writemostly" clears the writemostly flag.
   1.327 +
   1.328 +      errors
   1.329 +	An approximate count of read errors that have been detected on
   1.330 +	this device but have not caused the device to be evicted from
   1.331 +	the array (either because they were corrected or because they
   1.332 +	happened while the array was read-only).  When using version-1
   1.333 +	metadata, this value persists across restarts of the array.
   1.334 +
   1.335 +	This value can be written while assembling an array thus
   1.336 +	providing an ongoing count for arrays with metadata managed by
   1.337 +	userspace.
   1.338 +
   1.339 +      slot
   1.340 +        This gives the role that the device has in the array.  It will
   1.341 +	either be 'none' if the device is not active in the array
   1.342 +        (i.e. is a spare or has failed) or an integer less than the
    1.343 +	'raid_disks' number for the array indicating which position
   1.344 +	it currently fills.  This can only be set while assembling an
   1.345 +	array.  A device for which this is set is assumed to be working.
   1.346 +
   1.347 +      offset
   1.348 +        This gives the location in the device (in sectors from the
   1.349 +        start) where data from the array will be stored.  Any part of
    1.350 +        the device before this offset is not touched, unless it is
   1.351 +        used for storing metadata (Formats 1.1 and 1.2).
   1.352 +
   1.353 +      size
   1.354 +        The amount of the device, after the offset, that can be used
   1.355 +        for storage of data.  This will normally be the same as the
   1.356 +	component_size.  This can be written while assembling an
   1.357 +        array.  If a value less than the current component_size is
   1.358 +        written, component_size will be reduced to this value.
   1.359 +
   1.360 +
    1.361 +An active md device will also contain an entry for each active device
   1.362 +in the array.  These are named
   1.363 +
   1.364 +    rdNN
   1.365 +
    1.366 +where 'NN' is the position in the array, starting from 0.
   1.367 +So for a 3 drive array there will be rd0, rd1, rd2.
   1.368 +These are symbolic links to the appropriate 'dev-XXX' entry.
   1.369 +Thus, for example,
   1.370 +       cat /sys/block/md*/md/rd*/state
   1.371 +will show 'in_sync' on every line.
   1.372 +
   1.373 +
   1.374 +
   1.375 +Active md devices for levels that support data redundancy (1,4,5,6)
   1.376 +also have
   1.377 +
   1.378 +   sync_action
   1.379 +     a text file that can be used to monitor and control the rebuild
   1.380 +     process.  It contains one word which can be one of:
   1.381 +       resync        - redundancy is being recalculated after unclean
   1.382 +                       shutdown or creation
   1.383 +       recover       - a hot spare is being built to replace a
   1.384 +                       failed/missing device
   1.385 +       idle          - nothing is happening
   1.386 +       check         - A full check of redundancy was requested and is
    1.387 +                       happening.  This reads all blocks and checks
   1.388 +                       them. A repair may also happen for some raid
   1.389 +                       levels.
   1.390 +       repair        - A full check and repair is happening.  This is
   1.391 +                       similar to 'resync', but was requested by the
   1.392 +                       user, and the write-intent bitmap is NOT used to
   1.393 +		       optimise the process.
   1.394 +
   1.395 +      This file is writable, and each of the strings that could be
    1.396 +      read is meaningful for writing.
   1.397 +
   1.398 +       'idle' will stop an active resync/recovery etc.  There is no
    1.399 +           guarantee that another resync/recovery will not be automatically
   1.400 +	   started again, though some event will be needed to trigger
   1.401 +           this.
   1.402 +	'resync' or 'recovery' can be used to restart the
   1.403 +           corresponding operation if it was stopped with 'idle'.
   1.404 +	'check' and 'repair' will start the appropriate process
   1.405 +           providing the current state is 'idle'.
   1.406 +
    1.407 +   mismatch_cnt
   1.408 +      When performing 'check' and 'repair', and possibly when
   1.409 +      performing 'resync', md will count the number of errors that are
   1.410 +      found.  The count in 'mismatch_cnt' is the number of sectors
   1.411 +      that were re-written, or (for 'check') would have been
   1.412 +      re-written.  As most raid levels work in units of pages rather
    1.413 +      than sectors, this may be larger than the number of actual errors
   1.414 +      by a factor of the number of sectors in a page.
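          +
          +As an illustration of writing 'sync_action' and reading 'mismatch_cnt'
          +(a sketch only; the md0 paths and the once-a-second polling are
          +arbitrary choices), a check pass could be requested and its result read
          +back like this:
          +
          +   #include <fcntl.h>
          +   #include <stdio.h>
          +   #include <string.h>
          +   #include <unistd.h>
          +
          +   /* Read a sysfs attribute into buf and NUL-terminate it. */
          +   static int read_attr(const char *path, char *buf, size_t len)
          +   {
          +       ssize_t n;
          +       int fd = open(path, O_RDONLY);
          +
          +       if (fd < 0)
          +           return -1;
          +       n = read(fd, buf, len - 1);
          +       close(fd);
          +       if (n < 0)
          +           return -1;
          +       buf[n] = '\0';
          +       return 0;
          +   }
          +
          +   int main(void)
          +   {
          +       char buf[64];
          +       int fd = open("/sys/block/md0/md/sync_action", O_WRONLY);
          +
          +       /* Request a full redundancy check. */
          +       if (fd < 0 || write(fd, "check", 5) < 0) {
          +           perror("start check");
          +           return 1;
          +       }
          +       close(fd);
          +
          +       /* Crude wait: poll until the action returns to 'idle'. */
          +       do {
          +           sleep(1);
          +           if (read_attr("/sys/block/md0/md/sync_action",
          +                         buf, sizeof(buf)) < 0)
          +               return 1;
          +       } while (strncmp(buf, "idle", 4) != 0);
          +
          +       if (read_attr("/sys/block/md0/md/mismatch_cnt",
          +                     buf, sizeof(buf)) == 0)
          +           printf("mismatch_cnt: %s", buf);
          +       return 0;
          +   }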
   1.415 +
   1.416 +Each active md device may also have attributes specific to the
   1.417 +personality module that manages it.
   1.418 +These are specific to the implementation of the module and could
   1.419 +change substantially if the implementation changes.
   1.420 +
   1.421 +These currently include
   1.422 +
   1.423 +  stripe_cache_size  (currently raid5 only)
   1.424 +      number of entries in the stripe cache.  This is writable, but
   1.425 +      there are upper and lower limits (32768, 16).  Default is 128.
    1.426 +  stripe_cache_active (currently raid5 only)
   1.427 +      number of active entries in the stripe cache