annotate Documentation/networking/tcp.txt @ 897:329ea0ccb344

balloon: try harder to balloon up under memory pressure.

Currently if the balloon driver is unable to increase the guest's
reservation it assumes the failure was due to reaching its full
allocation, gives up on the ballooning operation and records the limit
it reached as the "hard limit". The driver will not try again until
the target is set again (even to the same value).

However it is possible that ballooning has in fact failed due to
memory pressure in the host and therefore it is desirable to keep
attempting to reach the target in case memory becomes available. The
most likely scenario is that some guests are ballooning down while
others are ballooning up and therefore there is temporary memory
pressure while things stabilise. You would not expect a well behaved
toolstack to ask a domain to balloon to more than its allocation nor
would you expect it to deliberately over-commit memory by setting
balloon targets which exceed the total host memory.

This patch drops the concept of a hard limit and causes the balloon
driver to retry increasing the reservation on a timer in the same
manner as when decreasing the reservation.

Also if we partially succeed in increasing the reservation
(i.e. receive less pages than we asked for) then we may as well keep
those pages rather than returning them to Xen.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
author Keir Fraser <keir.fraser@citrix.com>
date Fri Jun 05 14:01:20 2009 +0100 (2009-06-05)
parents 831230e53067
rev   line source
ian@0 1 TCP protocol
ian@0 2 ============
ian@0 3
ian@0 4 Last updated: 21 June 2005
ian@0 5
ian@0 6 Contents
ian@0 7 ========
ian@0 8
ian@0 9 - Congestion control
ian@0 10 - How the new TCP output machine [nyi] works
ian@0 11
ian@0 12 Congestion control
ian@0 13 ==================
ian@0 14
ian@0 15 The following variables are used in the tcp_sock for congestion control:
ian@0 16 snd_cwnd          The size of the congestion window
ian@0 17 snd_ssthresh      Slow start threshold. We are in slow start if
ian@0 18                   snd_cwnd is less than this.
ian@0 19 snd_cwnd_cnt      A counter used to slow down the rate of increase
ian@0 20                   once we exceed slow start threshold.
ian@0 21 snd_cwnd_clamp    This is the maximum size that snd_cwnd can grow to.
ian@0 22 snd_cwnd_stamp    Timestamp for when congestion window last validated.
ian@0 23 snd_cwnd_used     Used as a highwater mark for how much of the
ian@0 24                   congestion window is in use. It is used to adjust
ian@0 25                   snd_cwnd down when the link is limited by the
ian@0 26                   application rather than the network.
ian@0 27
ian@0 28 As of 2.6.13, Linux supports pluggable congestion control algorithms.
ian@0 29 A congestion control mechanism can be registered through functions in
ian@0 30 tcp_cong.c. The functions used by the congestion control mechanism are
ian@0 31 registered by passing a tcp_congestion_ops struct to
ian@0 32 tcp_register_congestion_control. At a minimum, name, ssthresh,
ian@0 33 cong_avoid and min_cwnd must be valid.
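
The "must be valid" requirement can be sketched as a check on the ops
structure. The code below is a userspace model, not the kernel interface:
struct model_congestion_ops keeps only the four required members, the
callback signatures are simplified assumptions, and model_check_ops() and
the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_sock;   /* opaque stand-in for the kernel's tcp_sock */

/* Simplified stand-in for tcp_congestion_ops: required members only. */
struct model_congestion_ops {
    const char *name;
    uint32_t (*ssthresh)(struct model_sock *tp);
    void     (*cong_avoid)(struct model_sock *tp, uint32_t ack,
                           uint32_t in_flight);
    uint32_t (*min_cwnd)(struct model_sock *tp);
};

/* Returns 0 if every required member is set, -1 otherwise. */
static int model_check_ops(const struct model_congestion_ops *ca)
{
    assert(ca != NULL);
    if (!ca->name || !ca->ssthresh || !ca->cong_avoid || !ca->min_cwnd)
        return -1;
    return 0;
}

/* Trivial callbacks so that a complete ops structure can be built. */
static uint32_t demo_ssthresh(struct model_sock *tp)
{
    (void)tp;
    return 2;
}

static void demo_cong_avoid(struct model_sock *tp, uint32_t ack,
                            uint32_t in_flight)
{
    (void)tp;
    (void)ack;
    (void)in_flight;
}

static uint32_t demo_min_cwnd(struct model_sock *tp)
{
    (void)tp;
    return 2;
}

static const struct model_congestion_ops demo_ops = {
    .name       = "demo",
    .ssthresh   = demo_ssthresh,
    .cong_avoid = demo_cong_avoid,
    .min_cwnd   = demo_min_cwnd,
};
```

In this model, model_check_ops(&demo_ops) returns 0, and a copy of demo_ops
with any one of the four members cleared is rejected.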
ian@0 34
ian@0 35 Private data for a congestion control mechanism is stored in tp->ca_priv.
ian@0 36 tcp_ca(tp) returns a pointer to this space. This is preallocated space, so
ian@0 37 it is important to check that your private data fits in this space;
ian@0 38 alternatively, space can be allocated elsewhere and a pointer to it
ian@0 39 stored here.
ian@0 40
ian@0 41 There are currently three kinds of congestion control algorithms. The
ian@0 42 simplest ones are derived from TCP reno (highspeed, scalable) and just
ian@0 43 provide an alternative congestion window calculation. More complex
ian@0 44 ones like BIC try to look at other events to provide better
ian@0 45 heuristics. There are also round trip time based algorithms like
ian@0 46 Vegas and Westwood+.
ian@0 47
ian@0 48 Good TCP congestion control is a complex problem because the algorithm
ian@0 49 needs to maintain fairness and performance. Please review current
ian@0 50 research and RFCs before developing new modules.
ian@0 51
ian@0 52 The congestion control mechanism that is used is determined by the
ian@0 53 setting of the sysctl net.ipv4.tcp_congestion_control.
ian@0 54 The default congestion control will be the last one registered (LIFO),
ian@0 55 so if you build everything as modules, the default will be reno. If you
ian@0 56 build with the defaults from Kconfig, then BIC will be built in (not a module)
ian@0 57 and it will end up the default.
ian@0 58
ian@0 59 If you really want a particular default value then you will need
ian@0 60 to set it with the sysctl. If you use a sysctl, the module will be autoloaded
ian@0 61 if needed and you will get the expected protocol. If you ask for an
ian@0 62 unknown congestion method, then the sysctl attempt will fail.
ian@0 63
ian@0 64 If you remove a TCP congestion control module, then you will get the next
ian@0 65 available one. Since reno cannot be built as a module and cannot be
ian@0 66 deleted, it will always be available.
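
The selection rules above (last registered wins, unknown names fail, reno
always remains) can be modelled with a small list. This is a userspace
sketch under stated assumptions, not the kernel implementation: struct
model_registry and the model_* functions are hypothetical, and module
autoloading is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_CA 8

/* Registered names; the most recently registered one sits at index 0. */
struct model_registry {
    const char *names[MODEL_MAX_CA];
    size_t count;
};

/* Reno is built in, so it is always present as the oldest entry. */
static void model_registry_init(struct model_registry *r)
{
    assert(r != NULL);
    r->names[0] = "reno";
    r->count = 1;
}

/* Returns the index of name, or -1 if it is not registered. */
static int model_find(const struct model_registry *r, const char *name)
{
    size_t i;

    for (i = 0; i < r->count; i++)
        if (strcmp(r->names[i], name) == 0)
            return (int)i;
    return -1;
}

/* Register at the head of the list (LIFO). Returns 0 on success, -1 if
 * the list is full or the name is already registered. */
static int model_register(struct model_registry *r, const char *name)
{
    if (r->count == MODEL_MAX_CA || model_find(r, name) >= 0)
        return -1;
    memmove(&r->names[1], &r->names[0], r->count * sizeof(r->names[0]));
    r->names[0] = name;
    r->count++;
    return 0;
}

/* Unregister a module. Reno cannot be removed. Returns 0 or -1. */
static int model_unregister(struct model_registry *r, const char *name)
{
    int i = model_find(r, name);

    if (i < 0 || strcmp(name, "reno") == 0)
        return -1;
    memmove(&r->names[i], &r->names[i + 1],
            (r->count - (size_t)i - 1) * sizeof(r->names[0]));
    r->count--;
    return 0;
}

/* The default is the most recently registered entry. */
static const char *model_default(const struct model_registry *r)
{
    return r->names[0];
}
```

Registering "bic" and then "vegas" makes "vegas" the default in this model;
removing "vegas" falls back to "bic", and removing "bic" falls back to reno.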
ian@0 67
ian@0 68 How the new TCP output machine [nyi] works
ian@0 69 ==========================================
ian@0 70
ian@0 71 Data is kept on a single queue. The skb->users flag tells us if the frame is
ian@0 72 one that has been queued already. To add a frame we throw it on the end. Ack
ian@0 73 walks down the list from the start.
ian@0 74
ian@0 75 We keep a set of control flags:
ian@0 76
ian@0 77
ian@0 78     sk->tcp_pend_event
ian@0 79
ian@0 80         TCP_PEND_ACK        Ack needed
ian@0 81         TCP_ACK_NOW         Needed now
ian@0 82         TCP_WINDOW          Window update check
ian@0 83         TCP_WINZERO         Zero probing
ian@0 84
ian@0 85
ian@0 86     sk->transmit_queue      The transmission frame begin
ian@0 87     sk->transmit_new        First new frame pointer
ian@0 88     sk->transmit_end        Where to add frames
ian@0 89
ian@0 90     sk->tcp_last_tx_ack     Last ack seen
ian@0 91     sk->tcp_dup_ack         Dup ack count for fast retransmit
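
Since this design is not yet implemented, the following is only a userspace
sketch of how the single queue and its three pointers might fit together.
The flag values, struct model_frame, struct model_sk and the model_*
functions are assumptions made for illustration, and sequence number
wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names come from the text above; the bit values are assumptions. */
enum {
    TCP_PEND_ACK = 1 << 0,   /* Ack needed */
    TCP_ACK_NOW  = 1 << 1,   /* Needed now */
    TCP_WINDOW   = 1 << 2,   /* Window update check */
    TCP_WINZERO  = 1 << 3    /* Zero probing */
};

/* Hypothetical frame: a list node plus the last sequence number it covers. */
struct model_frame {
    struct model_frame *next;
    unsigned int end_seq;
};

struct model_sk {
    unsigned int tcp_pend_event;         /* pending event flags */
    struct model_frame *transmit_queue;  /* start of the queue */
    struct model_frame *transmit_new;    /* first frame not yet sent */
    struct model_frame *transmit_end;    /* where to add frames */
};

/* To add a frame we throw it on the end of the queue. */
static void model_queue_frame(struct model_sk *sk, struct model_frame *f)
{
    assert(sk != NULL && f != NULL);
    f->next = NULL;
    if (sk->transmit_end)
        sk->transmit_end->next = f;
    else
        sk->transmit_queue = f;
    sk->transmit_end = f;
    if (!sk->transmit_new)
        sk->transmit_new = f;
}

/* An ack walks the list from the start and unlinks every frame that is
 * fully covered by ack_seq. Returns the number of frames released. */
static int model_ack(struct model_sk *sk, unsigned int ack_seq)
{
    int released = 0;

    while (sk->transmit_queue && sk->transmit_queue->end_seq <= ack_seq) {
        struct model_frame *f = sk->transmit_queue;

        if (sk->transmit_new == f)
            sk->transmit_new = f->next;
        sk->transmit_queue = f->next;
        if (!sk->transmit_queue)
            sk->transmit_end = NULL;
        released++;
    }
    return released;
}
```

Keeping all frames on one list means an ack only ever touches the head,
while new data only ever touches the tail.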
ian@0 92
ian@0 93
ian@0 94 Frames are queued for output by tcp_write. We do our best to send the frames
ian@0 95 off immediately if possible, but otherwise queue and compute the body
ian@0 96 checksum in the copy.
ian@0 97
ian@0 98 When a write is done we try to clear any pending events and piggyback them.
ian@0 99 If the window is full we queue full sized frames. On the first timeout in
ian@0 100 zero window we split this.
ian@0 101
ian@0 102 On a timer we walk the retransmit list to send any retransmits, update the
ian@0 103 backoff timers etc. A change of route table stamp causes a change of header
ian@0 104 and recompute. We add any new tcp level headers and refinish the checksum
ian@0 105 before sending.
ian@0 106