ia64/linux-2.6.18-xen.hg

annotate Documentation/networking/cxgb.txt @ 897:329ea0ccb344

balloon: try harder to balloon up under memory pressure.

Currently if the balloon driver is unable to increase the guest's
reservation it assumes the failure was due to reaching its full
allocation, gives up on the ballooning operation and records the limit
it reached as the "hard limit". The driver will not try again until
the target is set again (even to the same value).

However it is possible that ballooning has in fact failed due to
memory pressure in the host and therefore it is desirable to keep
attempting to reach the target in case memory becomes available. The
most likely scenario is that some guests are ballooning down while
others are ballooning up and therefore there is temporary memory
pressure while things stabilise. You would not expect a well behaved
toolstack to ask a domain to balloon to more than its allocation nor
would you expect it to deliberately over-commit memory by setting
balloon targets which exceed the total host memory.

This patch drops the concept of a hard limit and causes the balloon
driver to retry increasing the reservation on a timer in the same
manner as when decreasing the reservation.

Also if we partially succeed in increasing the reservation
(i.e. receive less pages than we asked for) then we may as well keep
those pages rather than returning them to Xen.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
author Keir Fraser <keir.fraser@citrix.com>
date Fri Jun 05 14:01:20 2009 +0100 (2009-06-05)
parents 831230e53067
rev   line source
ian@0 1 Chelsio N210 10Gb Ethernet Network Controller
ian@0 2
ian@0 3 Driver Release Notes for Linux
ian@0 4
ian@0 5 Version 2.1.1
ian@0 6
ian@0 7 June 20, 2005
ian@0 8
ian@0 9 CONTENTS
ian@0 10 ========
ian@0 11 INTRODUCTION
ian@0 12 FEATURES
ian@0 13 PERFORMANCE
ian@0 14 DRIVER MESSAGES
ian@0 15 KNOWN ISSUES
ian@0 16 SUPPORT
ian@0 17
ian@0 18
ian@0 19 INTRODUCTION
ian@0 20 ============
ian@0 21
ian@0 22 This document describes the Linux driver for the Chelsio 10Gb Ethernet Network
ian@0 23 Controller. This driver supports the Chelsio N210 NIC and is backward
ian@0 24 compatible with the Chelsio N110 model 10Gb NICs.
ian@0 25
ian@0 26
ian@0 27 FEATURES
ian@0 28 ========
ian@0 29
ian@0 30 Adaptive Interrupts (adaptive-rx)
ian@0 31 ---------------------------------
ian@0 32
ian@0 33 This feature provides an adaptive algorithm that adjusts the interrupt
ian@0 34 coalescing parameters, allowing the driver to dynamically adapt the latency
ian@0 35 settings to achieve the highest performance during various types of network
ian@0 36 load.
ian@0 37
ian@0 38 The interface used to control this feature is ethtool. Please see the
ian@0 39 ethtool manpage for additional usage information.
ian@0 40
ian@0 41 By default, adaptive-rx is disabled.
ian@0 42 To enable adaptive-rx:
ian@0 43
ian@0 44 ethtool -C <interface> adaptive-rx on
ian@0 45
ian@0 46 To disable adaptive-rx, use ethtool:
ian@0 47
ian@0 48 ethtool -C <interface> adaptive-rx off
ian@0 49
ian@0 50 After disabling adaptive-rx, the timer latency value will be set to 50us.
ian@0 51 You may set the timer latency after disabling adaptive-rx:
ian@0 52
ian@0 53 ethtool -C <interface> rx-usecs <microseconds>
ian@0 54
ian@0 55 An example to set the timer latency value to 100us on eth0:
ian@0 56
ian@0 57 ethtool -C eth0 rx-usecs 100
ian@0 58
ian@0 59 You may also provide a timer latency value while disabling adaptive-rx:
ian@0 60
ian@0 61 ethtool -C <interface> adaptive-rx off rx-usecs <microseconds>
ian@0 62
ian@0 63 If adaptive-rx is disabled and a timer latency value is specified, the timer
ian@0 64 will be set to the specified value until changed by the user or until
ian@0 65 adaptive-rx is enabled.
ian@0 66
ian@0 67 To view the status of the adaptive-rx and timer latency values:
ian@0 68
ian@0 69 ethtool -c <interface>
ian@0 70
ian@0 71
ian@0 72 TCP Segmentation Offloading (TSO) Support
ian@0 73 -----------------------------------------
ian@0 74
ian@0 75 This feature, also known as "large send", enables a system's protocol stack
ian@0 76 to offload portions of outbound TCP processing to a network interface card
ian@0 77 thereby reducing system CPU utilization and enhancing performance.
ian@0 78
ian@0 79 The interface used to control this feature is ethtool version 1.8 or higher.
ian@0 80 Please see the ethtool manpage for additional usage information.
ian@0 81
ian@0 82 By default, TSO is enabled.
ian@0 83 To disable TSO:
ian@0 84
ian@0 85 ethtool -K <interface> tso off
ian@0 86
ian@0 87 To enable TSO:
ian@0 88
ian@0 89 ethtool -K <interface> tso on
ian@0 90
ian@0 91 To view the status of TSO:
ian@0 92
ian@0 93 ethtool -k <interface>
ian@0 94
ian@0 95
ian@0 96 PERFORMANCE
ian@0 97 ===========
ian@0 98
ian@0 99 The following information is provided as an example of how to change system
ian@0 100 parameters for "performance tuning" and what values to use. You may or may not
ian@0 101 want to change these system parameters, depending on your server/workstation
ian@0 102 application. Doing so is not warranted in any way by Chelsio Communications,
ian@0 103 and is done at "YOUR OWN RISK". Chelsio will not be held responsible for loss
ian@0 104 of data or damage to equipment.
ian@0 105
ian@0 106 Your distribution may have a different way of doing things, or you may prefer
ian@0 107 a different method. These commands are shown only to provide an example of
ian@0 108 what to do and are by no means definitive.
ian@0 109
ian@0 110 Making any of the following system changes will only last until you reboot
ian@0 111 your system. You may want to write a script that runs at boot-up which
ian@0 112 includes the optimal settings for your system.
ian@0 113
ian@0 114 Setting PCI Latency Timer:
ian@0 115 setpci -d 1425:* 0x0c.l=0x0000F800
ian@0 116
ian@0 117 Disabling TCP timestamp:
ian@0 118 sysctl -w net.ipv4.tcp_timestamps=0
ian@0 119
ian@0 120 Disabling SACK:
ian@0 121 sysctl -w net.ipv4.tcp_sack=0
ian@0 122
ian@0 123 Setting large number of incoming connection requests:
ian@0 124 sysctl -w net.ipv4.tcp_max_syn_backlog=3000
ian@0 125
ian@0 126 Setting maximum receive socket buffer size:
ian@0 127 sysctl -w net.core.rmem_max=1024000
ian@0 128
ian@0 129 Setting maximum send socket buffer size:
ian@0 130 sysctl -w net.core.wmem_max=1024000
ian@0 131
ian@0 132 Set smp_affinity (on a multiprocessor system) to a single CPU:
ian@0 133 echo 1 > /proc/irq/<interrupt_number>/smp_affinity
ian@0 134
ian@0 135 Setting default receive socket buffer size:
ian@0 136 sysctl -w net.core.rmem_default=524287
ian@0 137
ian@0 138 Setting default send socket buffer size:
ian@0 139 sysctl -w net.core.wmem_default=524287
ian@0 140
ian@0 141 Setting maximum option memory buffers:
ian@0 142 sysctl -w net.core.optmem_max=524287
ian@0 143
ian@0 144 Setting maximum backlog (# of unprocessed packets before kernel drops):
ian@0 145 sysctl -w net.core.netdev_max_backlog=300000
ian@0 146
ian@0 147 Setting TCP read buffers (min/default/max):
ian@0 148 sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000"
ian@0 149
ian@0 150 Setting TCP write buffers (min/default/max):
ian@0 151 sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000"
ian@0 152
ian@0 153 Setting TCP buffer space (min/pressure/max):
ian@0 154 sysctl -w net.ipv4.tcp_mem="10000000 10000000 10000000"
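
As a rough illustration of the boot-time script suggested above, the
following Python sketch renders some of the sysctl settings from this
section into shell script text. The names SETTINGS and render_boot_script
are hypothetical, and the selection of settings is only an example; adjust
or drop entries to suit your system.

```python
# Hypothetical helper: turns a mapping of sysctl keys to values into the
# text of a shell script that re-applies them at boot.
SETTINGS = {
    "net.ipv4.tcp_timestamps": "0",
    "net.ipv4.tcp_sack": "0",
    "net.ipv4.tcp_max_syn_backlog": "3000",
    "net.core.rmem_max": "1024000",
    "net.core.wmem_max": "1024000",
    "net.core.netdev_max_backlog": "300000",
}


def render_boot_script(settings):
    """Return shell script text with one 'sysctl -w' line per setting."""
    lines = ["#!/bin/sh"]
    for key, value in settings.items():
        # Quote the value so multi-field settings such as tcp_rmem survive.
        lines.append('sysctl -w {}="{}"'.format(key, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_boot_script(SETTINGS), end="")
```

The output can be saved to a file and run from whatever boot-time hook your
distribution provides; which hook to use is left open here.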
ian@0 155
ian@0 156 TCP window size for single connections:
ian@0 157 The receive buffer (RX_WINDOW) size must be at least as large as the
ian@0 158 Bandwidth-Delay Product of the communication link between the sender and
ian@0 159 receiver. Due to the variations of RTT, you may want to increase the buffer
ian@0 160 size up to 2 times the Bandwidth-Delay Product. Reference page 289 of
ian@0 161 "TCP/IP Illustrated, Volume 1, The Protocols" by W. Richard Stevens.
ian@0 162 At 10Gb speeds, use the following formula:
ian@0 163 RX_WINDOW >= 1.25MBytes * RTT(in milliseconds)
ian@0 164 Example for an RTT of 100us: RX_WINDOW = (1,250,000 * 0.1) = 125,000 bytes
ian@0 165 RX_WINDOW sizes of 256KB - 512KB should be sufficient.
ian@0 166 Setting the min, default, and max receive buffer (RX_WINDOW) size:
ian@0 167 sysctl -w net.ipv4.tcp_rmem="<min> <default> <max>"
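
The formula above can be checked with a few lines of Python. This is a
minimal sketch; the function name rx_window_bytes and its headroom
parameter (the "up to 2 times" factor) are assumptions made for
illustration.

```python
def rx_window_bytes(rtt_us, link_bits_per_sec=10_000_000_000, headroom=1):
    """Bandwidth-Delay Product in bytes for a round-trip time in microseconds.

    headroom multiplies the result to allow for RTT variation.
    """
    # bits/second * seconds / 8 bits per byte, kept in integer arithmetic.
    bdp = link_bits_per_sec * rtt_us // (8 * 1_000_000)
    return bdp * headroom


if __name__ == "__main__":
    print(rx_window_bytes(100))              # RTT of 100us -> 125000
    print(rx_window_bytes(100, headroom=2))  # doubled for RTT variation -> 250000
```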
ian@0 168
ian@0 169 TCP window size for multiple connections:
ian@0 170 The receive buffer (RX_WINDOW) size may be calculated the same way as for a
ian@0 171 single connection, but should be divided by the number of connections. The
ian@0 172 smaller window prevents congestion and facilitates better pacing,
ian@0 173 especially if/when MAC level flow control does not work well or when it is
ian@0 174 not supported on the machine. Experimentation may be necessary to attain
ian@0 175 the correct value. This method is provided as a starting point for the
ian@0 176 correct receive buffer size.
ian@0 177 Setting the min, default, and max receive buffer (RX_WINDOW) size is
ian@0 178 performed in the same manner as for a single connection.
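
As a starting point for the multiple-connection case, the division described
above might be sketched as follows. The function name per_connection_window
is hypothetical, and the result is only an initial value to experiment from.

```python
def per_connection_window(rtt_us, connections, link_bits_per_sec=10_000_000_000):
    """Bandwidth-Delay Product in bytes, split evenly across connections."""
    if connections < 1:
        raise ValueError("connections must be at least 1")
    # Same Bandwidth-Delay Product as for a single connection...
    bdp = link_bits_per_sec * rtt_us // (8 * 1_000_000)
    # ...divided by the number of concurrent connections.
    return bdp // connections


if __name__ == "__main__":
    print(per_connection_window(100, 4))  # 125000 / 4 -> 31250
```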
ian@0 179
ian@0 180
ian@0 181 DRIVER MESSAGES
ian@0 182 ===============
ian@0 183
ian@0 184 The following messages are the most common messages logged by syslog. These
ian@0 185 may be found in /var/log/messages.
ian@0 186
ian@0 187 Driver up:
ian@0 188 Chelsio Network Driver - version 2.1.1
ian@0 189
ian@0 190 NIC detected:
ian@0 191 eth#: Chelsio N210 1x10GBaseX NIC (rev #), PCIX 133MHz/64-bit
ian@0 192
ian@0 193 Link up:
ian@0 194 eth#: link is up at 10 Gbps, full duplex
ian@0 195
ian@0 196 Link down:
ian@0 197 eth#: link is down
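
To pick the link messages out of a log, a small filter such as the one below
may help. It is a sketch only: the patterns are modelled on the sample
messages above, the function name link_events is hypothetical, and the
timestamp and host prefix in the sample lines are made up.

```python
import re

# Real log lines carry a timestamp and host prefix, so the patterns are
# searched for anywhere in the line rather than anchored at its start.
LINK_UP = re.compile(r"(eth\d+): link is up")
LINK_DOWN = re.compile(r"(eth\d+): link is down")


def link_events(lines):
    """Yield (interface, state) tuples for link messages found in lines."""
    for line in lines:
        match = LINK_UP.search(line)
        if match:
            yield match.group(1), "up"
            continue
        match = LINK_DOWN.search(line)
        if match:
            yield match.group(1), "down"


if __name__ == "__main__":
    sample = [
        "Jun 20 10:00:01 host kernel: eth2: link is up at 10 Gbps, full duplex",
        "Jun 20 10:05:42 host kernel: eth2: link is down",
    ]
    for interface, state in link_events(sample):
        print(interface, state)
```

Passing an open file object for /var/log/messages instead of the sample list
works the same way, since the function only iterates over lines.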
ian@0 198
ian@0 199
ian@0 200 KNOWN ISSUES
ian@0 201 ============
ian@0 202
ian@0 203 These issues have been identified during testing. The following information
ian@0 204 is provided as a workaround for each problem. In some cases, the problem is
ian@0 205 inherent to Linux or to a particular Linux distribution and/or hardware
ian@0 206 platform.
ian@0 207
ian@0 208 1. Large number of TCP retransmits on a multiprocessor (SMP) system.
ian@0 209
ian@0 210 On a system with multiple CPUs, the interrupt (IRQ) for the network
ian@0 211 controller may be bound to more than one CPU. This can cause TCP
ian@0 212 retransmits if packet data is split across different CPUs and
ian@0 213 re-assembled in a different order than expected.
ian@0 214
ian@0 215 To eliminate the TCP retransmits, set smp_affinity on the particular
ian@0 216 interrupt to a single CPU. You can locate the interrupt (IRQ) used on
ian@0 217 the N110/N210 by using ifconfig:
ian@0 218 ifconfig <dev_name> | grep Interrupt
ian@0 219 Set the smp_affinity to a single CPU:
ian@0 220 echo 1 > /proc/irq/<interrupt_number>/smp_affinity
ian@0 221
ian@0 222 It is strongly recommended that you do not run the irqbalance daemon on your
ian@0 223 system, as this will change any smp_affinity setting you have applied.
ian@0 224 The irqbalance daemon runs on a 10 second interval and binds interrupts
ian@0 225 to the least loaded CPU determined by the daemon. To disable this daemon:
ian@0 226 chkconfig --level 2345 irqbalance off
ian@0 227
ian@0 228 By default, some Linux distributions enable the kernel feature,
ian@0 229 irqbalance, which performs the same function as the daemon. To disable
ian@0 230 this feature, add the following line to your bootloader:
ian@0 231 noirqbalance
ian@0 232
ian@0 233 Example using the Grub bootloader:
ian@0 234 title Red Hat Enterprise Linux AS (2.4.21-27.ELsmp)
ian@0 235 root (hd0,0)
ian@0 236 kernel /vmlinuz-2.4.21-27.ELsmp ro root=/dev/hda3 noirqbalance
ian@0 237 initrd /initrd-2.4.21-27.ELsmp.img
ian@0 238
ian@0 239 2. After running insmod, the driver is loaded and the incorrect network
ian@0 240 interface is brought up without running ifup.
ian@0 241
ian@0 242 When using 2.4.x kernels, including RHEL kernels, the Linux kernel
ian@0 243 invokes a script named "hotplug". This script is primarily used to
ian@0 244 automatically bring up USB devices when they are plugged in; however,
ian@0 245 the script also attempts to automatically bring up a network interface
ian@0 246 after loading the kernel module. The hotplug script does this by scanning
ian@0 247 the ifcfg-eth# config files in /etc/sysconfig/network-scripts, looking
ian@0 248 for HWADDR=<mac_address>.
ian@0 249
ian@0 250 If the hotplug script does not find the HWADDR within any of the
ian@0 251 ifcfg-eth# files, it will bring up the device with the next available
ian@0 252 interface name. If this interface is already configured for a different
ian@0 253 network card, your new interface will have an incorrect IP address and
ian@0 254 network settings.
ian@0 255
ian@0 256 To solve this issue, you can add the HWADDR=<mac_address> key to the
ian@0 257 interface config file of your network controller.
ian@0 258
ian@0 259 To disable this "hotplug" feature, you may add the driver (module name)
ian@0 260 to the "blacklist" file located in /etc/hotplug. It has been noted that
ian@0 261 this does not work for network devices because the net.agent script
ian@0 262 does not use the blacklist file. Simply remove, or rename, the net.agent
ian@0 263 script located in /etc/hotplug to disable this feature.
ian@0 264
ian@0 265 3. Transport Protocol (TP) hangs when running heavy multi-connection traffic
ian@0 266 on an AMD Opteron system with HyperTransport PCI-X Tunnel chipset.
ian@0 267
ian@0 268 If your AMD Opteron system uses the AMD-8131 HyperTransport PCI-X Tunnel
ian@0 269 chipset, you may experience the "133-MHz Mode Split Completion Data
ian@0 270 Corruption" bug identified by AMD while using a 133MHz PCI-X card on the
ian@0 271 PCI-X bus.
ian@0 272
ian@0 273 AMD states, "Under highly specific conditions, the AMD-8131 PCI-X Tunnel
ian@0 274 can provide stale data via split completion cycles to a PCI-X card that
ian@0 275 is operating at 133 Mhz", causing data corruption.
ian@0 276
ian@0 277 AMD provides three workarounds for this problem; however, Chelsio
ian@0 278 recommends the first option for best performance with this bug:
ian@0 279
ian@0 280 For 133Mhz secondary bus operation, limit the transaction length and
ian@0 281 the number of outstanding transactions, via BIOS configuration
ian@0 282 programming of the PCI-X card, to the following:
ian@0 283
ian@0 284 Data Length (bytes): 1k
ian@0 285 Total allowed outstanding transactions: 2
ian@0 286
ian@0 287 Please refer to AMD 8131-HT/PCI-X Errata 26310 Rev 3.08 August 2004,
ian@0 288 section 56, "133-MHz Mode Split Completion Data Corruption" for more
ian@0 289 details on this bug and the workarounds suggested by AMD.
ian@0 290
ian@0 291 It may be possible to work outside AMD's recommended PCI-X settings; try
ian@0 292 increasing the Data Length to 2k bytes for increased performance. If you
ian@0 293 have issues with these settings, please revert to the "safe" settings
ian@0 294 and duplicate the problem before submitting a bug or asking for support.
ian@0 295
ian@0 296 NOTE: The default setting on most systems is 8 outstanding transactions
ian@0 297 and 2k bytes data length.
ian@0 298
ian@0 299 4. On multiprocessor systems, it has been noted that an application which
ian@0 300 is handling 10Gb networking can switch between CPUs causing degraded
ian@0 301 and/or unstable performance.
ian@0 302
ian@0 303 If running on an SMP system and taking performance measurements, it
ian@0 304 is suggested you either run the latest netperf-2.4.0+ or use a binding
ian@0 305 tool such as Tim Hockin's procstate utilities (runon)
ian@0 306 <http://www.hockin.org/~thockin/procstate/>.
ian@0 307
ian@0 308 Binding netserver and netperf (or other applications) to particular
ian@0 309 CPUs can make a significant difference in performance measurements.
ian@0 310 You may need to experiment to find which CPU to bind the application to
ian@0 311 in order to achieve the best performance for your system.
ian@0 312
ian@0 313 If you are developing an application designed for 10Gb networking,
ian@0 314 please keep in mind you may want to look at the sched_setaffinity and
ian@0 315 sched_getaffinity system calls to bind your application.
ian@0 316
ian@0 317 If you are just running user-space applications such as ftp, telnet,
ian@0 318 etc., you may want to try the runon tool provided by Tim Hockin's
ian@0 319 procstate utility. You could also try binding the interface to a
ian@0 320 particular CPU: runon 0 ifup eth0
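
For issue 4, the sched_setaffinity and sched_getaffinity calls are also
exposed by Python's os module on Linux, which makes for a compact
illustration of binding a process to one CPU. The helper name pin_to_cpu is
hypothetical; this is a sketch, not a recommendation of which CPU to choose.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Bind process pid (0 means the caller) to one CPU; return the new set."""
    allowed = os.sched_getaffinity(pid)
    if cpu not in allowed:
        raise ValueError(
            "CPU {} is not in the allowed set {}".format(cpu, sorted(allowed))
        )
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Pin this process to the lowest-numbered CPU it is allowed to run on.
    first = min(os.sched_getaffinity(0))
    print(pin_to_cpu(first))
```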
ian@0 321
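
For issue 1, the value written to smp_affinity is a hexadecimal bitmask in
which bit N selects CPU N, so "echo 1" selects CPU 0. The sketch below
computes the mask for another single CPU. The function name
smp_affinity_mask is hypothetical, and masks for CPU numbers of 32 and above
may need a different format that is not handled here.

```python
def smp_affinity_mask(cpu):
    """Return the hex bitmask string that selects a single CPU (0-based)."""
    if cpu < 0:
        raise ValueError("cpu must be non-negative")
    # Bit N of the mask corresponds to CPU N.
    return format(1 << cpu, "x")


if __name__ == "__main__":
    for cpu in (0, 1, 3, 4):
        print(cpu, smp_affinity_mask(cpu))
```

The returned string is what would be echoed into
/proc/irq/<interrupt_number>/smp_affinity in place of the 1 used above.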
ian@0 322
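
For issue 2, it can be useful to check whether an ifcfg-eth# file already
carries a HWADDR key before loading the driver. The following sketch parses
the text of such a file. The function name find_hwaddr is hypothetical, and
the MAC address in the example is made up.

```python
def find_hwaddr(config_text):
    """Return the HWADDR value from ifcfg-style text, or None if absent."""
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip comments and anything that is not a KEY=value line.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "HWADDR":
            return value.strip().strip('"').upper()
    return None


if __name__ == "__main__":
    example = "DEVICE=eth1\nHWADDR=00:07:43:aa:bb:cc\nONBOOT=yes\n"
    print(find_hwaddr(example))
```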
ian@0 323 SUPPORT
ian@0 324 =======
ian@0 325
ian@0 326 If you have problems with the software or hardware, please contact our
ian@0 327 customer support team via email at support@chelsio.com or check our website
ian@0 328 at http://www.chelsio.com
ian@0 329
ian@0 330 ===============================================================================
ian@0 331
ian@0 332 Chelsio Communications
ian@0 333 370 San Aleso Ave.
ian@0 334 Suite 100
ian@0 335 Sunnyvale, CA 94085
ian@0 336 http://www.chelsio.com
ian@0 337
ian@0 338 This program is free software; you can redistribute it and/or modify
ian@0 339 it under the terms of the GNU General Public License, version 2, as
ian@0 340 published by the Free Software Foundation.
ian@0 341
ian@0 342 You should have received a copy of the GNU General Public License along
ian@0 343 with this program; if not, write to the Free Software Foundation, Inc.,
ian@0 344 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
ian@0 345
ian@0 346 THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
ian@0 347 WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
ian@0 348 MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
ian@0 349
ian@0 350 Copyright (c) 2003-2005 Chelsio Communications. All rights reserved.
ian@0 351
ian@0 352 ===============================================================================