Enhanced TCP/IP Performance with AltiVec
MOTOROLA
In addition to computing the checksum, a protocol stack needs to move data in both receive and transmit
directions between the application layer and physical devices. Packet segmentation and queueing also
require processing resources.
As a result, accelerating data manipulation and movement is key to easing or eliminating this
bottleneck in the TCP/IP stack. Later in this paper, the profiling results of a realistic TCP benchmark
demonstrate how strongly a few frequently used data processing routines, such as memcpy(),
checksum(), and memcpy_and_checksum(), affect performance.
Using AltiVec Technology in the TCP/IP Stack
The AltiVec technology in the MPC74xx processors is based on the implementation of separate
vector/SIMD (single instruction stream, multiple data streams) execution units that have a high degree of
data parallelism. A single AltiVec instruction operates on multiple data items, allowing for a faster and
more efficient way to process large quantities of data. AltiVec also adds a 128-bit vector register file to the
existing integer and floating-point register files. The quad-word registers allow several data items to be
processed at the same time.
Data manipulations in a TCP/IP stack are ideally suited for AltiVec instructions. For example, a TCP
checksum is computed as the 16-bit one's complement of the one's complement sum of all 16-bit words in
the TCP segment header and TCP segment data. If the data does not end on a 16-bit boundary, it is padded
with zeroes at the end. The C code implementation is shown in the code segment in Figure 1. If the
checksum algorithm is coded with AltiVec instructions, the middle loop looks like the following:
vaddcuw V_Carry_Current,VD1,VD2 // add data and store carries
vadduwm V_Temp_Sum,VD1,VD2 // add data (no carries)
vaddcuw V_Carry_Sum,V_Temp_Sum,V_Sum // carry from sum update
vadduwm V_Sum,V_Temp_Sum,V_Sum // update sum
vadduwm VCAR,V_Carry_Current,VCAR // update carries from previous adds
vadduwm VCAR,V_Carry_Sum,VCAR // add carries from previous sum
Only six instructions are needed for computing a checksum for two quad words (32 bytes). Only one clock
cycle is needed for each instruction on an MPC74xx processor. AltiVec also provides load and store
instructions that can operate on 16 bytes at a time.
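The six-instruction loop can be modelled in portable C to show why accumulating the carries separately still yields the correct checksum. The following is only a sketch under stated assumptions, not the code from Figure 1: the names `vec4`, `checksum_scalar`, and `checksum_lanes` are hypothetical, each 128-bit register is emulated as four 32-bit lanes, data is read in big-endian order, and the length is assumed to be a multiple of 32 bytes. Each recorded carry stands for 2^32, which is congruent to 1 modulo 2^16 - 1, so the carries can simply be added back in before the final fold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit vector register: four 32-bit lanes. */
typedef struct { uint32_t w[4]; } vec4;

/* Fold a wide sum to 16 bits with end-around carry. */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Scalar reference: one's complement of the one's complement sum of
 * big-endian 16-bit words; an odd trailing byte is padded with zero. */
uint16_t checksum_scalar(const uint8_t *p, size_t len)
{
    uint64_t sum = 0;
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (i < len)
        sum += (uint32_t)p[i] << 8;
    return (uint16_t)~fold16(sum);
}

/* Load 16 bytes as four big-endian 32-bit words. */
static vec4 load16(const uint8_t *p)
{
    vec4 v;
    for (int i = 0; i < 4; i++)
        v.w[i] = ((uint32_t)p[4 * i] << 24) | ((uint32_t)p[4 * i + 1] << 16) |
                 ((uint32_t)p[4 * i + 2] << 8) | p[4 * i + 3];
    return v;
}

/* Model of vadduwm: per-lane add modulo 2^32. */
static vec4 add_modulo(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.w[i] = (uint32_t)(a.w[i] + b.w[i]);
    return r;
}

/* Model of vaddcuw: per-lane carry-out (0 or 1) of the 32-bit add. */
static vec4 add_carry_out(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.w[i] = (uint32_t)((uint32_t)(a.w[i] + b.w[i]) < a.w[i]);
    return r;
}

/* Emulates the six-instruction loop, then folds the lanes and carries. */
uint16_t checksum_lanes(const uint8_t *p, size_t len)
{
    vec4 sum = {{0, 0, 0, 0}}, car = {{0, 0, 0, 0}};
    assert(len % 32 == 0);  /* sketch handles whole 32-byte blocks only */

    for (size_t off = 0; off < len; off += 32) {
        vec4 d1 = load16(p + off), d2 = load16(p + off + 16);
        vec4 carry_cur = add_carry_out(d1, d2);    /* vaddcuw V_Carry_Current */
        vec4 tmp       = add_modulo(d1, d2);       /* vadduwm V_Temp_Sum */
        vec4 carry_sum = add_carry_out(tmp, sum);  /* vaddcuw V_Carry_Sum */
        sum = add_modulo(tmp, sum);                /* vadduwm V_Sum */
        car = add_modulo(carry_cur, car);          /* vadduwm VCAR */
        car = add_modulo(carry_sum, car);          /* vadduwm VCAR */
    }

    uint64_t total = 0;
    for (int i = 0; i < 4; i++)
        total += (uint64_t)sum.w[i] + car.w[i];    /* each carry weighs 2^32 = 1 mod 65535 */
    return (uint16_t)~fold16(total);
}
```

For any buffer whose length is a multiple of 32 bytes, `checksum_lanes(buf, len)` should agree with `checksum_scalar(buf, len)`. A real AltiVec routine would additionally handle leading and trailing partial blocks, which this sketch leaves out.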
To speed up checksum and other data processing, most operating systems, such as Linux, implement these
functions in scalar assembly code that is optimized for each processor architecture. The assembly functions
most commonly used by the TCP/IP stack compute the checksum, copy memory, and compute the
checksum within the memory copy loop.
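Computing the checksum inside the copy loop means each source byte is read from memory only once. A minimal scalar C sketch of that idea follows; `copy_and_checksum` is a hypothetical name and signature chosen for illustration and is not the interface of memcpy_and_checksum() or of any Linux routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy len bytes from src to dst and return the
 * 16-bit one's complement checksum of the copied data in one pass. */
uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint64_t sum = 0;
    size_t i = 0;

    assert(dst != NULL && src != NULL);

    /* Copy and accumulate one big-endian 16-bit word per iteration. */
    for (; i + 1 < len; i += 2) {
        uint8_t hi = src[i], lo = src[i + 1];
        dst[i] = hi;
        dst[i + 1] = lo;
        sum += ((uint32_t)hi << 8) | lo;
    }

    /* An odd trailing byte is treated as if padded with a zero byte. */
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }

    /* Fold with end-around carry, then take the one's complement. */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A caller would use the return value exactly as it would use the result of a separate checksum pass over `dst`, while avoiding the second read of the data.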
Compared to scalar assembly instructions, computing the checksum or performing memcpy with AltiVec
instructions requires not only far fewer logic/algebraic operations but also much less frequent use of the
more time-consuming (expensive) load and store instructions. The AltiVec instruction set also allows
efficient processing of various alignment cases.
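One common AltiVec idiom for misaligned data combines two aligned loads (lvx) with a permute (vperm) whose control vector comes from lvsl. The text does not say which technique the library functions use, so the following C model is only an illustration of that idiom: `vec16`, `load_aligned`, `shift_left_control`, `permute`, and `load_unaligned` are hypothetical names, and addresses are modelled as offsets from a base that is assumed to be 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit vector register as 16 bytes. */
typedef struct { uint8_t b[16]; } vec16;

/* Model of lvx: the low four bits of the address are ignored. */
static vec16 load_aligned(const uint8_t *base, size_t offset)
{
    vec16 v;
    memcpy(v.b, base + (offset & ~(size_t)15), 16);
    return v;
}

/* Model of lvsl: control vector {sh, sh+1, ..., sh+15}, sh = offset & 15. */
static vec16 shift_left_control(size_t offset)
{
    vec16 c;
    for (int i = 0; i < 16; i++)
        c.b[i] = (uint8_t)((offset & 15) + (size_t)i);
    return c;
}

/* Model of vperm: each result byte is selected from the 32 bytes a||b. */
static vec16 permute(vec16 a, vec16 b, vec16 ctl)
{
    vec16 r;
    for (int i = 0; i < 16; i++) {
        unsigned idx = ctl.b[i] & 0x1Fu;
        r.b[i] = idx < 16 ? a.b[idx] : b.b[idx - 16];
    }
    return r;
}

/* Fetch the 16 bytes starting at base + offset using aligned loads only.
 * The second load uses offset + 15, so an already aligned offset reads
 * the same block twice instead of touching the following block. */
vec16 load_unaligned(const uint8_t *base, size_t offset)
{
    assert(base != NULL);
    vec16 lo = load_aligned(base, offset);
    vec16 hi = load_aligned(base, offset + 15);
    return permute(lo, hi, shift_left_control(offset));
}
```

In a copy or checksum loop, the block loaded as `hi` in one iteration can be reused as `lo` in the next, so a misaligned source costs roughly one extra permute per 16 bytes rather than extra loads.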
In most cases, function interfaces are standardized and abstracted, so assembly functions can be linked
directly with the protocol stack without changing the stack itself. The same convenient interfaces are also
provided with AltiVec-enabled library functions. Therefore, with little porting effort, an existing TCP/IP
stack can take advantage of the vector/SIMD engine that is available with the MPC74xx processor family.
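The value of a stable interface can be illustrated with a small C sketch in which the stack calls one fixed prototype and the implementation behind it is swapped. Note the assumptions: the text describes replacing routines at link time, whereas this sketch substitutes a run-time function pointer purely so that the example is self-contained, and all names (`checksum_fn`, `checksum_c`, `checksum_tuned`, `select_checksum`, `stack_checksum`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stable interface seen by the protocol stack. */
typedef uint16_t (*checksum_fn)(const uint8_t *data, size_t len);

static uint16_t fold_and_invert(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Plain C implementation: one 16-bit word per iteration. */
uint16_t checksum_c(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)
        sum += (uint32_t)data[i] << 8;
    return fold_and_invert(sum);
}

/* Stand-in for a tuned routine: four bytes per iteration. A 32-bit
 * big-endian word contributes the same value modulo 2^16 - 1 as its
 * two 16-bit halves, so the folded result matches checksum_c. */
uint16_t checksum_tuned(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + 3 < len; i += 4)
        sum += ((uint32_t)data[i] << 24) | ((uint32_t)data[i + 1] << 16) |
               ((uint32_t)data[i + 2] << 8) | data[i + 3];
    for (; i < len; i++)  /* tail: even index is the high byte of a word */
        sum += (i & 1) ? (uint32_t)data[i] : (uint32_t)data[i] << 8;
    return fold_and_invert(sum);
}

/* The stack only ever calls through this pointer. */
static checksum_fn active_checksum = checksum_c;

void select_checksum(checksum_fn fn)
{
    assert(fn != NULL);
    active_checksum = fn;
}

/* Entry point used by stack code; unchanged when the routine is swapped. */
uint16_t stack_checksum(const uint8_t *data, size_t len)
{
    return active_checksum(data, len);
}
```

Calling `select_checksum(checksum_tuned)` changes which routine runs without touching any caller of `stack_checksum`, which mirrors the point that the stack itself needs no modification.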
To evaluate the performance of AltiVec-enabled library functions, the same hardware configuration
as in Table 1 was used to compare the following cases:
• C/glibc
Freescale Semiconductor, Inc.
For More Information On This Product,
Go to: www.freescale.com