Those things about TCP (Part 2)

Those things about TCP (Part 2)

This article is the next one, so if you are not familiar with TCP, please take a look at the previous article " Things about TCP ( Part 1)" ". In the previous article, we introduced the protocol header, state machine, and data of TCP. Something in the retransmission. But TCP has to solve a big problem, which is to dynamically adjust its own packet sending speed in a network according to different situations, small to make its connection more stable, and large to make the entire network more stable. Before you read the next article, you need to be prepared. There are many algorithms and strategies in this article, which may trigger your various thoughts and allow your brain to allocate a lot of memory and computing resources. Therefore, it is not suitable for reading in the toilet. .

TCP's RTT algorithm

From the previous TCP retransmission mechanism, we know that the setting of Timeout is very important for retransmission.

  • If the setting is long, the retransmission will be slow, and it will be retransmitted after a long time, which is inefficient and poor performance;
  • If it is set short, it may cause retransmission without losing it. As a result, the retransmission is fast, which will increase network congestion, leading to more timeouts, and more timeouts leading to more retransmissions.

Moreover, there is no way to set a dead value for this timeout under different network conditions. It can only be set dynamically. In order to set it dynamically, TCP introduces RTT-Round Trip Time, which is the time from when a data packet is sent to when it comes back. In this way, the sender knows approximately how much time is needed, so that Timeout-RTO (Retransmission TimeOut) can be easily set to make our retransmission mechanism more efficient. It sounds simple. It seems to be t0 when the sender sends the packet, and then the receiver writes a t1 when the ack comes back, so RTT = t1-t0. It's not that simple. This is just a sampling and cannot represent the general situation.

Classical Algorithm

 The classic algorithm defined in RFC793 is like this:

1) 1. sample the RTT first, and write down the most recent RTT values.

2) Then do smoothing calculation SRTT (Smoothed RTT). The formula is: (The value of is between 0.8 and 0.9. This algorithm is called Exponential weighted moving average in English, and weighted moving average in Chinese)

SRTT = ( * SRTT) + ((1- ) * RTT)

3) Start calculating RTO. The formula is as follows:

RTO = min [UBOUND, max [LBOUND, ( * SRTT)]]

among them:

  • UBOUND is the maximum timeout time, the upper limit
  • LBOUND is the minimum timeout time, the lower limit
  • The value is generally between 1.3 and 2.0.
Karn/Partridge algorithm

But the above algorithm will have an ultimate problem when retransmitting-do you use the time of the first data transmission and the time of ack return as the RTT sample value, or use the time of retransmission and the time of ACK return to do it. RTT sample value?

Regardless of which end you choose, this question is just a matter of pressing the gourd. As shown below:

  • Case (a) is that ack did not come back, so retransmission. If you calculate the time for the first transmission and ACK, then it is obviously too big.
  • Case (b) is that the ack comes back slowly, but it causes a retransmission, but not long after the retransmission, the ACK comes back before. If you calculate the difference between the retransmission time and the ACK return time, it will be short.

So in 1987, a Karn/Partridge Algorithm was developed. The biggest feature of this algorithm is that it ignores retransmissions and does not sample the retransmitted RTT (you see, you don't need to solve non-existent problems).

However, in this way, it will cause a big bug-if at a certain time, the network flickers and suddenly slows down, resulting in a relatively large delay. This delay causes all packets to be retransmitted (because of the previous The RTO is very small), so the RTO will not be updated because the re-rotation is not counted. This is a disaster. So the Karn algorithm uses a tricky method-as soon as a retransmission occurs, the existing RTO value is doubled (this is the so-called Exponential backoff). Obviously, this dead rule is for a RTT that needs to be estimated more accurately. It's not reliable.

Jacobson/Karels algorithm

The first two algorithms use "weighted moving average". The biggest problem with this method is that if there is a large fluctuation in RTT, it is difficult to find because it is smoothed out. Therefore, in 1988, someone introduced a new algorithm called Jacobson/Karels Algorithm (see RFC6289 ). This algorithm introduces the difference between the latest RTT sampling and the smoothed SRTT as a factor for calculation. The formula is as follows: (where DevRTT means Deviation RTT)

SRTT = SRTT + (RTT SRTT)-calculate smooth RTT

DevRTT = (1- )*DevRTT + *(|RTT-SRTT|) Calculate the difference between the smoothed RTT and the real (weighted moving average)

RTO = * SRTT + *DevRTT-God-like formula

(Among them: under Linux, = 0.125, = 0.25, = 1, = 4-this is the "good parameter adjustment" in the algorithm, nobody knows why, it just works...) The final algorithm It is used in today's TCP protocol (Linux source code is: tcp_rtt_estimator ).

TCP sliding window

Need to explain, if you don't understand the TCP sliding window, you don't understand the TCP protocol. We all know that TCP must solve the problems of reliable transmission and packet reordering. Therefore, TCP must know the actual data processing bandwidth or data processing speed of the network, so as not to cause network congestion and packet loss. .

Therefore, TCP has introduced some technologies and designs for network flow control, and Sliding Window is one of the technologies. As we said earlier, there is a field in the TCP header called Window, also called Advertised-Window. This field is used by the receiver to tell the sender how many buffers it has to receive data. Therefore, the sending end can send data according to the processing capability of the receiving end, without causing the receiving end to process it. In order to illustrate the sliding window, we need to first look at some data structures of the TCP buffer:

In the picture above, we can see:

  • The receiving end LastByteRead points to the position read in the TCP buffer, NextByteExpected points to the last position of the received continuous packet, LastByteRcved points to the last position of the received packet, we can see that some data is still in the middle It has not arrived, so there is a blank area of data.
  • The LastByteAcked of the sender points to the location that was Acked by the receiver (indicating a successful sending confirmation), LastByteSent said it was sent, but the Ack that was successfully confirmed has not yet been received, and LastByteWritten points to the place where the upper application is writing.


  • The receiver will report its AdvertisedWindow = MaxRcvBuffer LastByteRcvd 1 in the ACK back to the sender;
  • The sender will control the size of the sent data according to this window to ensure that the receiver can process it.

Let's take a look at the schematic diagram of the sender's sliding window:

( Image source )

The above figure is divided into four parts, namely: (the black model is the sliding window)

  • #1 The data confirmed by ack has been received.
  • #2Send has not received ack yet.
  • #3 has not been sent out in the window (the receiver still has space).
  • #4Data outside the window (no room for the receiver)

The following is a schematic diagram after sliding (received 36 ack, and sent 46-51 bytes):

Let's take a look at a diagram of the receiving end controlling the sending end:

( Image source )

Zero Window

In the above figure, we can see how a slow server (receiving end) reduces the TCP Sliding Window of the client (sending end) to 0. At this point, you must ask, if Window becomes 0, what will happen to TCP? Does the sender stop sending data? Yes, the sender will not send data. You can imagine it as "Window Closed". Then you must ask, if the sender does not send data, and the receiver will have Window size available for a while, how to notify the sender?

To solve this problem, TCP uses Zero Window Probe technology, abbreviated as ZWP, that is to say, after the window becomes 0, the sender will send a ZWP packet to the receiver, and let the receiver ack his Window size, generally this The value will be set to 3 times, the first time is about 30-60 seconds (different implementations may be different). If it is still 0 after 3 times, some TCP implementations will send RST to disconnect the link.

Note: DDoS attacks may occur as long as there is a waiting place, and Zero Window is no exception. Some attackers will set Window to 0 after establishing a link with HTTP and sending a GET request, and then the server can only wait for it to proceed. ZWP, so the attacker will send a large number of such requests concurrently, exhausting the resources on the server side. (For this attack, you can take a look at the SockStress entry on Wikipedia )

In addition, in Wireshark, you can use tcp.analysis.zero_window to filter the packets, and then use the follow TCP stream in the right-click menu, you can see the ZeroWindowProbe and ZeroWindowProbeAck packets.

Silly Window Syndrome

Silly Window Syndrome translates into Chinese as "Confused Window Syndrome". As you can see above, if our receiver is too busy to take away the data in Receive Windows, it will cause the sender to get smaller and smaller. In the end, if the receiver vacates a few bytes and tells the sender that there is a window of a few bytes, our sender will send these bytes without hesitation.

You know, our TCP+IP header has 40 bytes. For a few bytes, it is too uneconomical to achieve such a large overhead.

In addition, you need to know that there is an MTU on the network. For Ethernet, the MTU is 1500 bytes. After removing the 40 bytes of the TCP+IP header, the real data transmission can be 1460. This is the so-called MSS (Max Segment Size) Note that the TCP RFC defines the default value of this MSS as 536. This is because  RFC 791 says that any IP device must receive at least the size of 576 (in fact, 576 is the MTU of the dial-up network, and 576 Minus the 20 bytes of the IP header is 536).

If your network packets can fill up the MTU, then you can use the entire bandwidth, if not, then you will waste bandwidth. (Packages larger than MTU have two endings, one is directly lost, the other is re-divided and packaged and sent) You can imagine that one MTU is equivalent to the maximum number of people that can be installed on an airplane, if this If the aircraft is fully loaded, the bandwidth is the highest. If only one person is transported by an aircraft, the cost will undoubtedly increase, which is also equivalent.

So, the phenomenon of Silly Windows Syndrome is like you have only one or two people in a 200-person airplane. It is not difficult to solve this problem. It is to avoid responding to a small window size until there is a large enough window size to respond. This idea can be implemented at both ends of the sender and receiver at the same time.

  • If the problem is caused by the Receiver side, then David D Clark's solution will be used. On the receiver side, if the received data causes the window size to be smaller than a certain value, you can directly ack(0) back to the sender, which will close the window and prevent the sender from sending data again, and wait until the receiver has processed some data After the windows size is greater than or equal to MSS, or half of the receiver buffer is empty, you can open the window and let send send data.
  • If the problem is caused by the sender side, then the famous Nagle's algorithm will be used  . The idea of this algorithm is also to delay processing. It has two main conditions: 1) Wait until Window Size>=MSS or Data Size>=MSS, 2) Receive the ack response packet of the data sent before, then he will send it. Data, otherwise you are accumulating data.

In addition, the Nagle algorithm is turned on by default, so for some programs that require small packet scenarios-such as interactive programs such as telnet or ssh, you need to turn off this algorithm. You can set the TCP_NODELAY option in the Socket to turn off this algorithm (turning off the Nagle algorithm has no global parameters and needs to be turned off according to the characteristics of each application)

setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, (

In addition, some articles on the Internet say that the socket option of TCP_CORK also closes the Nagle algorithm, which is wrong. TCP_CORK is actually an updated radical Nagle fortune teller, which completely prohibits the sending of small packets, and the Nagle algorithm does not prohibit the sending of small packets, but prohibits the sending of a large number of small packets. It is best not to set both options.

Congestion Handling of TCP-Congestion Handling

As we know above, TCP uses Sliding Window to do flow control (Flow Control), but TCP feels that this is not enough, because Sliding Window needs to rely on the sender and receiver of the connection, and it does not know what is happening in the middle of the network. The designers of TCP felt that it was not enough for a great and awesome protocol to only achieve flow control, because flow control is only a matter of layer 4 and above of the network model, and TCP should also be more intelligent about the entire network.

Specifically, we know that TCP uses a timer to sample RTT and calculate RTO. However, if the delay on the network increases suddenly, then TCP's response to this matter is to retransmit the data. However, retransmission will cause network problems. The burden is heavier, so it will cause greater delay and more packet loss, so this situation will enter a vicious circle and be continuously amplified. Imagine that if there are thousands of TCP connections in a network that act like this, then a "network storm" will immediately form, and the TCP protocol will bring down the entire network. This is a disaster.

Therefore, TCP cannot ignore what is happening on the network and retransmit data without thinking, causing greater damage to the network. The design philosophy of TCP is: TCP is not a selfish protocol. When congestion occurs, you must sacrifice yourself. Just like a traffic jam, every car should give way instead of grabbing the road.

For papers on congestion control, please refer to " Congestion Avoidance and Control " (PDF)

Congestion control is mainly based on four algorithms: 1) slow start, 2) congestion avoidance, 3) congestion occurrence, 4) fast recovery. These four algorithms were not developed in one day. The development of these four algorithms has gone through a lot of time, and it is still being optimized today. Remarks:

  • In 1988, TCP-Tahoe proposed 1) slow start, 2) congestion avoidance, 3) fast retransmission when congestion occurs
  • In 1990, TCP Reno added 4) Fast recovery on the basis of Tahoe
Slow Start Algorithm-Slow Start

1. let's take a look at the slow hot start of TCP. Slow start means that the connection that has just joined the network will speed up little by little. Don't take up the road as aggressively as those privileged cars. New students still have to be slow when getting on the high speed. Don't mess up the order already on the high speed.

The slow start algorithm is as follows (cwnd stands for Congestion Window):

1) Initialize cwnd = 1 after the connection is established, indicating that data of the size of MSS can be transmitted.

2) Whenever an ACK is received, cwnd++; rises linearly

3) Whenever an RTT has passed, cwnd = cwnd*2; increases exponentially

4) There is also a ssthresh (slow start threshold), which is an upper limit. When cwnd >= ssthresh, it will enter the "congestion avoidance algorithm" (this algorithm will be described later)

Therefore, we can see that if the network speed is fast, the ACK will return quickly and the RTT will be short, so this slow start is not slow at all. The figure below illustrates this process.

Here, I need to mention that a Google paper " An Argument for Increasing TCP's Initial Congestion Window " adopted the suggestion of this paper after Linux 3.0-initialize cwnd to 10 MSS. Before Linux 3.0, such as 2.6, Linux adopted RFC3390 , cwnd is changed with the value of MSS, if MSS< 1095, cwnd = 4; if MSS> 2190, cwnd = 2; otherwise, it is 3 .

 Congestion Avoidance Algorithm-Congestion Avoidance

As mentioned earlier, there is also a ssthresh (slow start threshold), which is an upper limit. When cwnd >= ssthresh, it will enter the "congestion avoidance algorithm". Generally speaking, the value of ssthresh is 65535, and the unit is byte. When cwnd reaches this value, the algorithm is as follows:

1) When an ACK is received, cwnd = cwnd + 1/cwnd

2) When each RTT passes, cwnd = cwnd + 1

In this way, you can avoid network congestion caused by excessive growth, and slowly increase and adjust to the optimal value of the network. Obviously, it is a linearly increasing algorithm.

Algorithm in congestion state

As we said before, when packets are lost, there are two situations:

1) Wait until the RTO expires and retransmit the data packet. TCP thinks this situation is too bad, and the response is very strong.

    • sshthresh = cwnd/2
    • cwnd reset to 1
    • Enter the slow start process

2) Fast Retransmit algorithm, that is, retransmission is started when 3 duplicate ACKs are received, without waiting for the RTO to time out.

    • The implementation of TCP Tahoe is the same as the RTO timeout.
    • The implementation of TCP Reno is:

      • cwnd = cwnd/2
      • sshthresh = cwnd
      • Enter the fast recovery algorithm-Fast Recovery

Above we can see that after the RTO timeout, sshthresh will become half of cwnd, which means that if cwnd<=sshthresh there is a packet loss, then TCP sshthresh will be reduced by half, and then wait for cwnd to quickly When the exponential increase climbs to this place, it will become a slow linear increase. We can see how TCP quickly and carefully found the balance point of website traffic through this strong earthquake.

Fast Recovery Algorithm-Fast Recovery

TCP Reno

This algorithm is defined in RFC5681 . Fast retransmission and fast recovery algorithms are generally used at the same time. The fast recovery algorithm thinks that you still have 3 Duplicated Acks indicating that the network is not so bad, so there is no need to be as strong as the RTO timeout. Note, as mentioned earlier, before entering Fast Recovery, cwnd and sshthresh have been updated:

  • cwnd = cwnd/2
  • sshthresh = cwnd

Then, the real Fast Recovery algorithm is as follows:

  • cwnd = sshthresh + 3 * MSS (3 means to confirm that 3 packets have been received)
  • Retransmit the data packet specified by Duplicated ACKs
  • If you receive duplicated Acks again, then cwnd = cwnd +1
  • If a new Ack is received, then cwnd = sshthresh, and then enter the congestion avoidance algorithm.

If you think about the above algorithm carefully, you will know that the above algorithm also has a problem, that is-it relies on 3 repeated Acks. Note that 3 repeated Acks does not mean that only one packet has been lost, it is very likely that many packets have been lost. But this algorithm will only retransmit one, and the remaining packets can only wait until the RTO timeout, so it enters the nightmare mode-a timeout window will be halved, and multiple timeouts will exceed the TCP transmission speed. The number has dropped, and the Fast Recovery algorithm will not be triggered anymore.

Generally speaking, as we said earlier, the SACK or D-SACK method can make Fast Recovery or Sender smarter in making decisions, but not all TCP implementations support SACK (SACK needs to be supported on both ends ), therefore, a solution without SACK is needed. The algorithm for congestion control through SACK is FACK (will be discussed later)

TCP New Reno

Therefore, in 1995, the TCP New Reno (see  RFC 6582  ) algorithm was proposed, mainly to improve the Fast Recovery algorithm without the support of SACK

  • When the sender receives 3 Duplicated Acks and enters the Fast Retransimit mode, the developer retransmits the packet indicated by the repeated Acks. If only this packet is lost, then the Ack that comes back after retransmitting this packet will ack back the entire data that has been transmitted by the sender. If not, it means that multiple packets have been lost. We call this ACK Partial ACK.
  • Once the sender finds that the Partial ACK appears, the sender can infer that multiple packets have been lost, so it will continue to retransmit the first packet that is not acked in the sliding window. The Fast Recovery process is not really ended until the Partial Ack is no longer received

We can see that this "Fast Recovery change" is a very radical gameplay. He also extended the process of Fast Retransmit and Fast Recovery.

Algorithm diagram

Let's take a look at a simple diagram to see what the various algorithms above look like at the same time:


FACK algorithm

The full name of FACK is the Forward Acknowledgment algorithm. The address of the paper is here (PDF). Forward Acknowledgement: Refining TCP Congestion Control  This algorithm is based on SACK. We said earlier that SACK uses the TCP extension field Ack. Which data is received and which data is not? Received, his advantage over Fast Retransmit's three duplicated acks is that the former only knows that there are packets lost, not one or more, and SACK can accurately know which packets are lost. Therefore, SACK allows the sender to retransmit the lost packets during the retransmission process, instead of transmitting them one by one. However, if the retransmitted packet data is more, it will cause the original The busy network is even busier. Therefore, FACK is used for congestion flow control in the retransmission process.

  • This algorithm will save the largest Sequence Number in SACK in the variable snd.fack. The update of snd.fack will be carried by ack. If everything is in order, it will be the same as snd.una (snd.una has not received the ack yet. Place, which is the first place of category #2 in the sliding window)
  • Then define an awnd = snd.nxt-snd.fack (snd.nxt points to the place to be sent in the sliding window of the sender-the first position of category#3 in the sliding windows icon above), so that awnd means Data on the web. (The so-called awnd means: actual quantity of data outstanding in the network)
  • If the data needs to be retransmitted, then awnd = snd.nxt snd.fack + retran_data, that is to say, awnd is the transmitted data + the retransmitted data.
  • Then the condition to trigger Fast Recovery is: ((snd.fack snd.una)> (3*MSS)) || (dupacks == 3)). In this way, there is no need to wait for 3 duplicated acks to retransmit, but as long as the largest data in the sack and the ack data are longer (3 MSS), the retransmission is triggered. The cwnd remains unchanged during the entire retransmission process. Until the first packet loss snd.nxt<=snd.una (that is, the retransmitted data is confirmed), then the congestion avoidance mechanism-cwnd rises linearly.

We can see that if there is no FACK, then in the case of more packet loss, the original conservative algorithm will underestimate the size of the window that needs to be used, and it will take a few RTTs to complete the recovery, and FACK will be more aggressive To do this. However, if FACK is in a network where network packets will be reordered, there will be a big problem.

Introduction to other congestion control algorithms

TCP Vegas congestion control algorithm

This algorithm was proposed in 1994, and it mainly modified TCP Reno. This algorithm calculates a benchmark RTT through very heavy monitoring of the RTT. Then use this benchmark RTT to estimate the actual bandwidth of the current network. If the actual bandwidth is smaller or more active than our expected bandwidth, then start to linearly reduce or increase the size of cwnd. If the calculated RTT is greater than Timeout, then it will be retransmitted without waiting for the ack timeout. (The core idea of Vegas is to use the value of RTT to affect the congestion window, not through packet loss) The paper on this algorithm is " TCP Vegas: End to End Congestion Avoidance on a Global Internet " This paper was given to Vegas and New Reno Compared:

For the implementation of this algorithm, you can refer to the Linux source code: /net/ipv4/tcp_vegas.h/net/ipv4/tcp_vegas.c

HSTCP (High Speed TCP) algorithm

This algorithm comes from RFC 3649 ( Wikipedia entry ). It changed the most basic algorithm, which made the Congestion Window rise faster and slower. among them:

  • Window growth method for congestion avoidance: cwnd = cwnd + (cwnd)/cwnd
  • Window drop mode after packet loss: cwnd = (1- (cwnd))*cwnd

Note: (cwnd) and (cwnd) are both functions. If you want them to be the same as standard TCP, then let (cwnd)=1 and (cwnd)=0.5. The value of (cwnd) and (cwnd) is a dynamic change. For the implementation of this algorithm, you can refer to the Linux source code: /net/ipv4/tcp_highspeed.c

 TCP BIC algorithm

In 2004, the BIC algorithm was produced within production. Now you can also check the relevant news "Google: American scientists developed the BIC-TCP protocol six thousand times faster than DSL " BIC, the full name of Binary Increase Congestion control , is the default congestion control algorithm in Linux 2.6.8. The inventors of BIC have reported that so many congestion control algorithms are trying to find a suitable cwnd-Congestion Window, and the proponents of BIC-TCP have seen through the essence of the matter, in fact, this is a search process, so the BIC algorithm is mainly Binary Search is used to do this. For the implementation of this algorithm, you can refer to the Linux source code: /net/ipv4/tcp_bic.c

TCP WestWood algorithm

Westwood uses the same slow start algorithm and congestion avoidance algorithm as Reno. Westwood's main improvement: bandwidth estimation is done at the sending end, when packet loss is detected, the congestion window and slow start threshold are set according to the bandwidth value. So, how does this algorithm measure bandwidth? The bandwidth is measured once every RTT time. The formula for measuring bandwidth is very simple, which is how many bytes have been successfully acked in this period of RTT. Because this bandwidth is the same as using RTT to calculate RTO, it also needs to be smoothed to a value from each sample-also using a weighted moving average formula. In addition, we know that if the bandwidth of a network can send X bytes per second, and RTT is when the data is sent to confirm the need, therefore, X * RTT should be the size of our buffer. Therefore, in this algorithm, the value of ssthresh is est_BD * min-RTT (the smallest RTT value). If the packet loss is caused by Duplicated ACKs, then if cwnd> ssthresh, then cwin = ssthresh. If it is caused by RTO, cwnd = 1, enter slow start. For the implementation of this algorithm, you can refer to the Linux source code:  /net/ipv4/tcp_westwood.c


For more algorithms, you can  find relevant clues in the TCP Congestion Avoidance Algorithm entry of Wikipedia 


Well, I think it can be over here. TCP has developed to this day, and there are several books in it. The main purpose of this article is to bring you into these classical basic technologies and knowledge. I hope this article will enable you to understand TCP, and I hope this article will enable you to start learning these basic or low-level knowledge with interest and confidence.

Of course, there are too many things about TCP, different people may have different understandings, and there may be some absurdities and even errors in this article, and I hope to get your feedback and criticism.

(End of full text)