Freescale Semiconductor, Inc.
White Paper
AltiVecTCPIPWP/D
Rev. 1, 1/2003
For More Information On This Product, Go to: www.freescale.com

Enhanced TCP/IP Performance with AltiVec

Jacob Pan
CPD Applications

Abstract

Today's high-speed networks are capable of delivering data at gigabit-per-second rates to desktop computers and some embedded systems. This dramatic improvement is taking place both in local area networks (LANs), where gigabit Ethernet is widely deployed, and in wide area networks (WANs), where fiber optics is becoming dominant. The bottleneck in communications has shifted from the physical transmission media to the host processor, which typically runs a Transmission Control Protocol/Internet Protocol (TCP/IP) protocol stack to provide interconnection to most networking applications. In addition, a TCP/IP implementation places contradictory demands on the processor, which must cope with both quick control branches and large data streams. This paper illustrates how the performance of a TCP/IP protocol stack can be significantly improved by using AltiVec technology with little or no impact on the protocol source code.

Performance Enhancement

TCP/IP typically refers to the protocol suite that connects host computers through the internet. It encompasses protocols such as Internet Protocol (IP), Transmission Control Protocol (TCP), User Datagram Protocol (UDP), Internet Control Message Protocol (ICMP), Address Resolution Protocol (ARP), and File Transfer Protocol (FTP). This paper focuses mainly on enhancing TCP performance as an example of applying AltiVec technology in a real protocol stack implementation. TCP was chosen because it is the most complex protocol in the suite (it accounts for about 50 percent of the code in the entire TCP/IP stack) and also because real network traffic shows that TCP packets make up roughly 80% of all WAN traffic [1]. At the LAN level, TCP packets also make up a significant portion of the network traffic but may not be dominant. This is mainly due to large amounts of data link layer traffic that does not carry IP. However, TCP also demonstrates strong locality between hosts, which implies that for some applications, TCP processing is still intensive at the LAN level.

In the following sections, an AltiVec-enabled software solution for TCP performance enhancement is developed step by step, beginning with a bottleneck analysis of the protocol stack. Various solutions are then compared within stand-alone functions. Finally, the overall performance impact is presented with benchmark results. In addition, the process of identifying code fragments that can be vectorized is summarized; the same approach can be extended to other networking applications.

Bottleneck Analysis in TCP/IP Protocol Stack

The implementation of a TCP/IP stack can vary greatly depending on hardware resources, memory footprint, and the type of data services required. Essentially, networking is about moving data from one location to another in a controlled fashion. To ensure either a reliable or a best-effort data transfer service, the protocol stack needs to integrate fairly complex control logic to cope with the following:

* Packet differentiation
* Encapsulation/decapsulation
* Segmentation/reassembly
* Time-out and retransmission
* Acknowledgment
* Duplicate packet detection
* Tracking every byte of data with a sequence/acknowledge number
* Flow control
* Congestion avoidance

A complete implementation of the TCP/IP protocol stack often requires over ten thousand lines of C code. The code complexity is attributed mainly to the need for managing connection admission and dealing with network anomalies.
However, in a well-designed network, the host processor spends much less time handling error conditions than performing normal data packet processing. But even the best implementation of a TCP/IP stack cannot avoid performing a data copy and a checksum operation at least once in each direction. Therefore, the most common and expensive portion of the code is data manipulation, such as memory copy, memory initialization, and the checksum calculation. This characteristic is more significant with networking applications, such as FTP and SMTP, that use larger packet sizes.

In addition, with the explosive growth of gigabit Ethernet, non-standard Ethernet jumbo packets (up to 9 Kbytes) are increasingly attractive in some LAN environments. This is because larger frames usually mean fewer CPU interrupts and less processing overhead for a given data transfer size. Often the per-packet processing overhead sets the limit of TCP performance in the LAN environment. The following equation gives an upper bound on TCP throughput:

    Throughput ~ 0.7 * MSS / (RTT * sqrt(packet_loss))

where RTT is the round trip time, MSS is the maximum segment size, and packet_loss is the packet loss rate. Therefore, maximum TCP throughput is directly proportional to MSS. MSS is defined as the size of the maximum transfer unit (MTU) minus the number of bytes in the TCP/IP headers:

    MSS = MTU - TCP/IP headers

When processing a large standard TCP segment (up to 1460 bytes, because of the 1500-byte payload limit of standard Ethernet) or a jumbo frame, the number of control statements executed is independent of the block size. Processing may still involve a few `if' statements, but they are essentially the same as for smaller packets. However, data manipulation for a jumbo frame requires CPU clock cycles in proportion to the packet size. For example, computing a TCP checksum is generally
considered to be one of the most expensive operations in the TCP/IP protocol stack because the checksum is based on both the header (a TCP header and a pseudo IP header) and the entire TCP payload. Figure 1 shows a typical checksum function in C.

    unsigned short checksum(unsigned short *addr, int len,
                            unsigned long sum, int byte_swap)
    {
        unsigned short *pData = addr;

        /* accumulate the data as 16-bit words; sum may carry a partial sum in */
        while (len > 1) {
            sum += *pData++;
            len -= 2;
        }

        /* process odd byte; byte_swap selects which half of the word it fills */
        if (len == 1) {
            if (byte_swap == 0) {
                sum += *(unsigned char *)pData;
            } else {
                sum += (*(unsigned char *)pData) << 8;
            }
        }

        /* add sum of carry bits back */
        sum = (sum >> 16) + (sum & 0xffff);
        sum += (sum >> 16);

        /* one's complement of the folded sum, truncated to 16 bits */
        return (unsigned short)(~sum);
    }

Figure 1. checksum Function

The code in the checksum function was compiled with a GNU C compiler at optimization level 2 and executed on a test bench with the configuration shown in Table 1.

Table 1. Test Configurations for Generic C checksum Function

    Platform          Motorola Sandpoint X3
    Processor         MPC7455
    Core frequency    750 MHz
    Bus speed         100 MHz
    Test function     checksum
    Compiler          GNU C 2.95 with -O2 option

In the best case (where data is resident in the data cache), this code can generate checksums at a rate of 3.07 clock cycles per byte, which is equivalent to 244 Mbytes/sec. In real TCP/IP processing, the worst-case scenario occurs in the receive direction, where data packets are not cache-resident. This case was also measured by flushing the data cache before each test run. The result on the same MPC7455 processor shows that a throughput of only 102 Mbytes/sec can be achieved. Clearly, this performance poses a serious bottleneck to high-speed or bursty traffic such as gigabit Ethernet.
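The relationship between the cycle counts and the throughput figures quoted above is simple arithmetic, and it can help when comparing them against a link rate. The following C sketch is not part of the original test bench; the function names are illustrative, and 1 Mbyte is assumed to mean 10^6 bytes, which is consistent with 750 MHz at 3.07 cycles per byte giving about 244 Mbytes/sec.

```c
#include <assert.h>

/* Throughput in Mbytes/sec (10^6 bytes/sec, an assumption) for a routine
 * that costs cycles_per_byte clock cycles per byte on a core running at
 * core_mhz MHz.  MHz means 10^6 cycles/sec, so the 10^6 factors cancel. */
double throughput_mbytes_per_sec(double core_mhz, double cycles_per_byte)
{
    assert(cycles_per_byte > 0.0);
    return core_mhz / cycles_per_byte;
}

/* Inverse relation: the number of cycles per byte that may be spent if a
 * given data rate in Mbytes/sec is to be sustained. */
double cycle_budget_per_byte(double core_mhz, double mbytes_per_sec)
{
    assert(mbytes_per_sec > 0.0);
    return core_mhz / mbytes_per_sec;
}
```

For instance, cycle_budget_per_byte(750.0, 102.0) gives roughly 7.4 cycles per byte for the cold-cache measurement, whereas keeping up with the 125 Mbytes/sec raw rate of a 1 Gbit/sec link at 750 MHz would leave at most 6 cycles per byte for all per-byte processing combined.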
Some embedded processors provide a built-in TCP/IP hardware assist unit to accelerate checksum computation; however, this approach requires extensive software changes to the existing stack and results in poor software portability.

In addition to computing the checksum, a protocol stack needs to move data in both receive and transmit directions between the application layer and physical devices. Packet segmentation and queueing also require processing resources. As a result, accelerating data manipulation and movement becomes a key factor in easing or eliminating this bottleneck in the TCP/IP stack. Later in this paper, the profile of a realistic TCP benchmark demonstrates how much a few frequently used data processing routines, such as memcpy(), checksum(), and memcpy_and_checksum(), can affect performance.

Using AltiVec Technology in the TCP/IP Stack

The AltiVec technology in the MPC74xx processors is based on separate vector/SIMD (single instruction stream, multiple data streams) execution units that have a high degree of data parallelism. A single AltiVec instruction operates on multiple data items, allowing for a faster and more efficient way to process large quantities of data. AltiVec also adds a 128-bit vector register file to the existing integer and floating-point registers. The quad-word registers allow large amounts of data to be processed at the same time.

Data manipulations in a TCP/IP stack are ideally suited for AltiVec instructions. For example, a TCP checksum is computed as the 16-bit one's complement of the one's complement sum of all 16-bit words in the TCP segment header and TCP segment data. If the data does not end on a 16-bit boundary, it is padded with zeroes at the end.
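The checksum definition above can be written out byte by byte, which makes the zero padding of an odd trailing byte explicit. The following is a minimal C sketch under stated assumptions: the function names are illustrative and not taken from any particular stack, 16-bit words are formed in network (big-endian) byte order, and the pseudo IP header is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's complement sum of the buffer taken as big-endian 16-bit words.
 * An odd trailing byte is treated as the high byte of a word whose low
 * byte is zero, which is the zero padding described in the text. */
uint16_t ones_complement_sum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    assert(data != NULL || len == 0);
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)
        sum += (uint32_t)data[i] << 8;          /* pad with a zero byte */

    /* fold the carries back in (end-around carry) until 16 bits remain */
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* The checksum field is the 16-bit one's complement of that sum. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    return (uint16_t)~ones_complement_sum(data, len);
}
```

As a check, the eight bytes 00 01 f2 03 f4 f5 f6 f7 sum to 0xddf2, so the checksum is 0x220d; this is the worked example given in RFC 1071.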
The C code implementation is shown in the code segment in Figure 1. If the checksum algorithm is coded with AltiVec instructions, the middle loop looks like the following:

    vaddcuw  V_Carry_Current,VD1,VD2        // add data and store carries
    vadduwm  V_Temp_Sum,VD1,VD2             // add data (no carries)
    vaddcuw  V_Carry_Sum,V_Temp_Sum,V_Sum   // carry from sum update
    vadduwm  V_Sum,V_Temp_Sum,V_Sum         // update sum
    vadduwm  VCAR,V_Carry_Current,VCAR      // update carries from previous adds
    vadduwm  VCAR,V_Carry_Sum,VCAR          // add carries from previous sum

Only six instructions are needed to compute a checksum for two quad words (32 bytes), and each instruction needs only one clock cycle on an MPC74xx processor. AltiVec also provides load and store instructions that can operate on 16 bytes at a time.

To speed up checksum and other data processing, most operating systems, such as Linux, implement these functions in scalar assembly code that is optimized for each processor architecture. The assembly functions most commonly used by a TCP/IP stack include computing the checksum, memory copy, and computing the checksum in the memory copy loop. Compared to scalar assembly instructions, computing the checksum or performing memcpy with AltiVec instructions requires not only far fewer logic and arithmetic operations but also much less frequent use of the more time-consuming (expensive) load and store instructions. The AltiVec instruction set also allows efficient processing of various alignment cases.

In most cases, function interfaces are standardized and abstracted, so assembly functions can be linked directly with the protocol stack without changing the stack itself. The same convenient interfaces are also provided with AltiVec-enabled library functions. Therefore, with little porting effort, an existing TCP/IP stack can take advantage of the vector/SIMD engine that is available with the MPC74xx processor family.
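To make the six-instruction loop shown earlier in this section easier to follow, the sketch below emulates it in portable C. It is an illustration, not the AltiVec library code: the vec_u32 type and the emu_* helpers are hypothetical stand-ins for a 128-bit register and for the vaddcuw and vadduwm instructions, the buffer length is assumed to be a multiple of 32 bytes, and the final reduction to 16 bits is one possible way to finish, relying on the fact that 2^32 is congruent to 1 modulo 0xffff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one 128-bit vector register: four 32-bit lanes. */
typedef struct { uint32_t w[4]; } vec_u32;

/* Emulates vadduwm: per-lane add, modulo 2^32. */
static vec_u32 emu_vadduwm(vec_u32 a, vec_u32 b)
{
    vec_u32 r;
    for (int i = 0; i < 4; i++)
        r.w[i] = (uint32_t)(a.w[i] + b.w[i]);
    return r;
}

/* Emulates vaddcuw: per-lane carry-out (0 or 1) of the 32-bit add. */
static vec_u32 emu_vaddcuw(vec_u32 a, vec_u32 b)
{
    vec_u32 r;
    for (int i = 0; i < 4; i++)
        r.w[i] = (uint32_t)((uint32_t)(a.w[i] + b.w[i]) < a.w[i]);
    return r;
}

/* Loads 16 bytes as four big-endian 32-bit words (PowerPC byte order). */
static vec_u32 load16_be(const uint8_t *p)
{
    vec_u32 r;
    for (int i = 0; i < 4; i++)
        r.w[i] = ((uint32_t)p[4 * i] << 24) | ((uint32_t)p[4 * i + 1] << 16) |
                 ((uint32_t)p[4 * i + 2] << 8) | (uint32_t)p[4 * i + 3];
    return r;
}

/* One's complement sum of a buffer whose length is a multiple of 32 bytes. */
uint16_t vec_style_sum(const uint8_t *data, size_t len)
{
    vec_u32 v_sum = {{0, 0, 0, 0}}, vcar = {{0, 0, 0, 0}};
    uint64_t total = 0;

    assert(len % 32 == 0);
    for (size_t off = 0; off < len; off += 32) {
        vec_u32 vd1 = load16_be(data + off);
        vec_u32 vd2 = load16_be(data + off + 16);
        vec_u32 carry_cur = emu_vaddcuw(vd1, vd2);        /* carries of VD1+VD2 */
        vec_u32 temp_sum  = emu_vadduwm(vd1, vd2);        /* VD1+VD2, no carries */
        vec_u32 carry_sum = emu_vaddcuw(temp_sum, v_sum); /* carries of sum update */
        v_sum = emu_vadduwm(temp_sum, v_sum);             /* update sum */
        vcar  = emu_vadduwm(carry_cur, vcar);             /* collect data carries */
        vcar  = emu_vadduwm(carry_sum, vcar);             /* collect sum carries */
    }
    /* Each saved carry stands for 2^32, which counts as 1 after folding. */
    for (int i = 0; i < 4; i++)
        total += (uint64_t)v_sum.w[i] + vcar.w[i];
    while (total >> 16)
        total = (total & 0xffffu) + (total >> 16);
    return (uint16_t)total;
}

/* Scalar reference over big-endian 16-bit words, for cross-checking. */
uint16_t scalar_sum(const uint8_t *data, size_t len)
{
    uint64_t total = 0;

    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2)
        total += ((uint32_t)data[i] << 8) | data[i + 1];
    while (total >> 16)
        total = (total & 0xffffu) + (total >> 16);
    return (uint16_t)total;
}
```

Accumulating in 32-bit lanes and tracking the carries separately is what lets the loop avoid any per-word folding; the reduction to a single 16-bit value happens once, after the loop.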
To measure the performance of AltiVec-enabled library functions, the same hardware configuration as in Table 1 was used to compare the following implementations:

* C/glibc
* Linux PPC Assembly
* AltiVec Assembly

Table 2 shows which figure provides the test case for the function being compared.

Table 2. Test Cases for Stand-Alone Functions

    Function Name      Warm Cache   Cold Cache
    checksum           Figure 2     Figure 3
    memcpy             Figure 4     Figure 5
    memcpy+checksum    Figure 6     Figure 7

Note that the AltiVec assembly functions used in this paper are not optimized only for limited scenarios, such as a given alignment; they are written for general-purpose data processing, and their performance is fairly invariant to alignment. For example, in this test all source and destination data buffers are only word aligned (aligned to 4-byte boundaries), not quad-word aligned, even though AltiVec load and store instructions operate on quad-word boundaries.

The following figures give side-by-side performance comparisons of each function. In addition, to measure real-world benefits, AltiVec-enabled library functions are tested with a TCP benchmark. Unless otherwise stated, all results were obtained on the test bed described in Table 1.

Figure 2 shows the throughput comparison of the checksum function shown in Figure 1 coded in AltiVec assembly, in hand-coded scalar assembly from the Linux 2.4.12 kernel, and as the generic C function compiled by the GNU C compiler. The test calls the checksum function 100 times on the same data buffer for each size, so most of the computation can be assumed to run with a warm cache. To ensure fair comparisons, data caches are flushed each time before the data size is incremented.

Figure 2. checksum Throughput with Warm Data Cache

Minor differences are observed with small data packet sizes. However, as packet sizes increase, the AltiVec function provides an increasing performance advantage over scalar code. It can generate a checksum about four times faster than the hand-coded scalar assembly.

Figure 3 shows the same test in the worst-case scenario. Bandwidth of the memory subsystem imposes a limit on this case. In most real-world TCP/IP implementations, this scenario is typically avoided by computing the checksum in the same loop as the memory copy. For example, in the receive direction, the checksum is computed while data is being copied from a receive ring buffer to a socket buffer. In the transmit direction, most data is obtained from the application layer via a socket buffer that is most likely cache-resident. Therefore, it is reasonable to assume that data is cache-resident (or even register-resident) while a checksum for TCP packets is computed.

Figure 3. checksum Throughput with Cold Data Cache

The checksum examples show increased performance even without the use of data streaming or large loop unrolling. These techniques, in addition to dcbz before stores, are used in the memcpy examples in Figure 4 and Figure 5.

Figure 4. memcpy Throughput with Warm Data Cache

The warm cache example in Figure 4 yields an improvement similar to that of the warm cache case for checksum; however, using data streaming, loop unrolling, and dcbz before stores greatly improves the performance in the cold cache case, shown in Figure 5.

Figure 5.
memcpy Throughput with Cold Data Cache

Figure 6 and Figure 7 show the performance of the combined checksum and memcpy() functions with both warm and cold data caches. Again, AltiVec-enabled functions provide a significant speedup for larger data sizes. Note that the checksum computation is nearly free for AltiVec if combined with memcpy. The reason is that, in the memcpy loop, all of the data is already resident in the quad-word AltiVec vector registers. Therefore, computing the checksum on the resident data takes only three single-cycle instructions. By contrast, the scalar memcpy is 10 to 20 percent slower when combined with the checksum.

As shown in Figure 6, the combined memcpy and checksum routine coded with AltiVec instructions is about four times faster than the hand-coded scalar assembly routine. Notice that the performance and the actual execution sequence of the combined checksum and memcpy C code are very compiler-dependent. In Figure 6 and Figure 7, the C implementation combines two separate function calls instead of computing the checksum in the same loop as memcpy.

Figure 6. checksum_and_memcpy Throughput with Warm Data Cache

The data stream touch (dst) instruction is used in checksum_and_memcpy(). The result in Figure 7 shows that a speedup of more than two times can be achieved with medium-sized data packets.

Figure 7. checksum_and_memcpy Throughput with Cold Data Cache

Benchmark Results

To show how these AltiVec-enabled library functions can benefit TCP performance in a real networking environment, a TCP benchmark was used to evaluate the performance impact of the above three functions.
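The idea behind the combined routine measured in Figure 6 and Figure 7, summing the data while it is being copied so that each byte is loaded only once, can be sketched in scalar C as follows. This is an illustration only, not the Linux or AltiVec implementation; the name copy_and_sum and its signature are hypothetical, words are formed in big-endian order, and the partial argument assumes that every chunk except the last has an even length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies len bytes from src to dst and, in the same pass, adds the data to
 * a running one's complement sum.  'partial' is the folded sum of the
 * preceding chunks (0 for the first chunk).  Returns the new folded sum. */
uint16_t copy_and_sum(uint8_t *dst, const uint8_t *src, size_t len,
                      uint16_t partial)
{
    uint64_t sum = partial;
    size_t i;

    assert((dst != NULL && src != NULL) || len == 0);
    for (i = 0; i + 1 < len; i += 2) {
        uint8_t hi = src[i], lo = src[i + 1];   /* each byte is loaded once */
        dst[i] = hi;
        dst[i + 1] = lo;
        sum += ((uint32_t)hi << 8) | lo;
    }
    if (i < len) {                              /* odd trailing byte */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;           /* padded with a zero byte */
    }
    while (sum >> 16)                           /* end-around carry */
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}
```

A caller would complement the final sum to obtain the checksum field. Passing the result of one call as the partial argument of the next allows a segment to be copied in several pieces, which is one way to realize a partial checksum during the copy.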
Inside the TCP benchmark, two tasks are created to simulate a client-server conversation, each with its own TCP protocol control block (TCB). In a single-thread environment, the two tasks simply alternate in a busy loop.

Two simple emulation layers were created to measure the performance related to the TCP-network layer and the TCP-application layer interfaces. To avoid overhead associated with maintaining emulation layers, the implementation adopted the simple and effective mechanism described in the following paragraphs.

At the application emulation layer, all buffer descriptors are initialized as a ring buffer; thus, data can be quickly retrieved by the server/client task at run time. The network interface emulation layer is implemented as two doubly linked lists with queue heads and tails stored in static data structures attached to the corresponding TCB. In addition to reducing overhead, this approach allows network error conditions to be simulated by simply changing status bits in buffer descriptors or by swapping pointers for out-of-order conditions.

Figure 8 shows the architecture of the TCP benchmark. Each TCB maintains an event-driven state machine for connection management and exception handling. Data flow in the egress direction includes two queues: the unsent queue and the unacknowledged queue. Input data from the network emulation layer is checked for matching address information and data integrity. No further data processing is performed.

In summary, this TCP benchmark models all normal data transactions, segmentation, queuing, and connection management. Because network anomalies, such as corrupted, duplicate, lost, and out-of-order packets, occur rarely [2] in the real network, modeling of those exception cases is not included in this TCP benchmark.
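The network interface emulation layer described above, a doubly linked packet queue whose head and tail live in a static structure, with error conditions injected by flipping status bits or swapping pointers, might look roughly like the following C sketch. The benchmark source is not reproduced in this paper, so every name here (struct pkt, struct pkt_queue, PKT_CORRUPT, and the pktq_* functions) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PKT_CORRUPT 0x1u            /* hypothetical status bit */

/* Hypothetical packet descriptor linked into a doubly linked queue. */
struct pkt {
    struct pkt *prev, *next;
    unsigned    seq;                /* sequence number, for illustration */
    unsigned    status;             /* status bits such as PKT_CORRUPT */
};

/* Queue head and tail, as would be attached to a TCB. */
struct pkt_queue {
    struct pkt *head, *tail;
};

/* Appends a packet at the tail of the queue. */
void pktq_push_tail(struct pkt_queue *q, struct pkt *p)
{
    p->next = NULL;
    p->prev = q->tail;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
}

/* Removes and returns the packet at the head, or NULL if the queue is empty. */
struct pkt *pktq_pop_head(struct pkt_queue *q)
{
    struct pkt *p = q->head;

    if (p == NULL)
        return NULL;
    q->head = p->next;
    if (q->head)
        q->head->prev = NULL;
    else
        q->tail = NULL;
    p->prev = p->next = NULL;
    return p;
}

/* Simulates a corrupted packet by setting a status bit; no data is touched. */
void pkt_mark_corrupt(struct pkt *p)
{
    p->status |= PKT_CORRUPT;
}

/* Simulates out-of-order delivery by swapping a packet with its successor.
 * Only pointers change: before -> a -> b -> after becomes
 * before -> b -> a -> after. */
void pktq_swap_with_next(struct pkt_queue *q, struct pkt *a)
{
    struct pkt *b = a->next;
    struct pkt *before, *after;

    assert(b != NULL);
    before = a->prev;
    after = b->next;
    b->prev = before;
    b->next = a;
    a->prev = b;
    a->next = after;
    if (before)
        before->next = b;
    else
        q->head = b;
    if (after)
        after->prev = a;
    else
        q->tail = a;
}
```

Because both operations only touch descriptors, the payload buffers never move, which matches the stated goal of keeping the emulation layers from adding overhead of their own.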
The following five standard tests are available in the benchmark to simulate different network traffic patterns:

* The bulk data transfer test uses the maximum standard TCP segment size (1460 bytes) to emulate FTP-like data flow.
* The interactive data transfer test uses small data packets to simulate TELNET or RLOGIN types of traffic.
* The mixed packet size test is based on statistics collected from real network traffic.
* The connection request/response test is configured such that the connection setup and close process is repeatedly performed in the loop with no data exchanged.
* The jumbo frame size test uses a non-standard gigabit Ethernet jumbo frame of 8000 bytes.

Figure 8 shows how the network interface data structure (NIF) models the data as necessary for interacting with the network and lower layers. Some data passing arrows in the graph include computing a partial checksum in memcpy.

Figure 8. TCP Benchmark Internal Architecture
(Diagram: an application emulation layer holding application data; a TCP layer with a client task and a server task, each with a TCP control block and unsent, unacked, and input queues; and a network interface emulation layer with a client NIF and a server NIF holding packets and the information needed for TCP. Arrows are labeled "pass data" and "pass reference/checksum".)

The profile of the TCP benchmark is shown in Figure 9. It shows the percentage of execution time consumed by the C checksum and memcpy routines (C and GLIBC, compiled by the Solaris native GNU C compiler version 2.95.3) on a Solaris system (Sun OS 5.8). The data was generated by GNU profiler 2.12.1. Because Solaris is a multitasking platform, this profile may differ slightly from the later test cases, which run in a single-task environment.

Figure 9. Profile Result of TCP Benchmark on Solaris
(Bar chart: percentage of execution time, on a scale of 0 to 60, spent in checksum and memcpy for the (B)ulk, (I)nteractive, (J)umbo packet, (M)ixed size, and (R)esponse tests.)

As shown in Figure 10, in three of the five standard test cases, AltiVec-enabled library functions achieved a significant speedup over scalar C and assembly code.

Figure 10. TCP Benchmark Standard Test Performance (Normalized)
(Bar chart titled "TCP BM Standard Test Performance": Libc, Linux assembly, and AltiVec results on a scale of 0 to 2.5 for the Bulk, Intx, Jumbo, Mixed, and Resp tests.)

Figure 11. TCP Benchmark Standard Test Result (Normalized)
(Bar chart titled "TCP BM throughput": Libc, Linux assembly, and AltiVec results on a scale of 0 to 180 for the Bulk, Intx, Jumbo, Mixed, and Resp tests.)

Because the interactive data flow and response time tests consist mainly of signaling packets and tiny data packets (averaging 1 to 2 bytes of payload in the interactive data flow), little performance difference can be observed in those two cases, and their throughput is minimal compared to the other three cases. This characteristic is easily seen in Figure 11, which shows the actual data throughput for the five tests. In reality, the round trip time is more important in the interactive data flow and response time tests, whereas data throughput is the focus of the bulk data flow cases, whether with a standard packet size, a jumbo frame size, or a mixed packet environment.

Concluding Remarks

This paper presents the benefits of applying AltiVec technology in today's challenging network computing environment. A systematic and quantitative approach is illustrated using TCP performance enhancement as the example.
In summary, AltiVec can be used whenever multiple streams of data can be operated on in parallel or when large data blocks need to be processed. An AltiVec implementation of a function may differ from its scalar version, but a software abstraction layer can be used to isolate the code changes. Ultimately, AltiVec-enabled library functions can be linked in with minimal porting effort.

Another key to optimization is detecting opportunities to vectorize generic scalar code. In-depth knowledge of the target environment and efficient profiling tools can help detect the bottlenecks. Combined with knowledge of the AltiVec architecture, this makes it possible to realize the potential performance gains.

References

[1] Ramon Caceres, Peter B. Danzig, Sugih Jamin, Danny J. Mitzel. Characteristics of Wide-Area TCP/IP Conversations. CSD, Univ. of Southern California, 1991.
[2] Jeffrey C. Mogul. Observing TCP Dynamics in Real Networks. WRL, DEC, 1992.
[3] V. Paxson. Known TCP Implementation Problems, RFC 2525. Network Working Group, 1999.
[4] W. Richard Stevens. TCP/IP Illustrated, 1994.
[5] Jon B. Postel. Transmission Control Protocol, RFC 793. Network Information Center, SRI International, 1981.
HOW TO REACH US:

USA/EUROPE/LOCATIONS NOT LISTED:
Motorola Literature Distribution
P.O. Box 5405, Denver, Colorado 80217
1-303-675-2140
(800) 441-2447

JAPAN:
Motorola Japan Ltd.
SPS, Technical Information Center
3-20-1, Minami-Azabu Minato-ku
Tokyo 106-8573 Japan
81-3-3440-3569

ASIA/PACIFIC:
Motorola Semiconductors H.K. Ltd.
Silicon Harbour Centre, 2 Dai King Street
Tai Po Industrial Estate, Tai Po, N.T., Hong Kong
852-26668334

TECHNICAL INFORMATION CENTER:
(800) 521-6274

HOME PAGE:
http://www.motorola.com/semiconductors

Information in this document is provided solely to enable system and software implementers to use Motorola products. There are no express or implied copyright licenses granted hereunder to design or fabricate any integrated circuits or integrated circuits based on the information in this document.

Motorola reserves the right to make changes without further notice to any products herein. Motorola makes no warranty, representation or guarantee regarding the suitability of its products for any particular purpose, nor does Motorola assume any liability arising out of the application or use of any product or circuit, and specifically disclaims any and all liability, including without limitation consequential or incidental damages. "Typical" parameters which may be provided in Motorola data sheets and/or specifications can and do vary in different applications and actual performance may vary over time. All operating parameters, including "Typicals" must be validated for each customer application by customer's technical experts. Motorola does not convey any license under its patent rights nor the rights of others.
Motorola products are not designed, intended, or authorized for use as components in systems intended for surgical implant into the body, or other applications intended to support or sustain life, or for any other application in which the failure of the Motorola product could create a situation where personal injury or death may occur. Should Buyer purchase or use Motorola products for any such unintended or unauthorized application, Buyer shall indemnify and hold Motorola and its officers, employees, subsidiaries, affiliates, and distributors harmless against all claims, costs, damages, and expenses, and reasonable attorney fees arising out of, directly or indirectly, any claim of personal injury or death associated with such unintended or unauthorized use, even if such claim alleges that Motorola was negligent regarding the design or manufacture of the part.

Motorola and the Stylized M Logo are registered in the U.S. Patent and Trademark Office. digital dna is a trademark of Motorola, Inc. All other product or service names are the property of their respective owners. Motorola, Inc. is an Equal Opportunity/Affirmative Action Employer.

(c) Motorola, Inc. 2002
AltiVecTCPIPWP/D