AltiVec™ Technology
On May 7, 1998, Motorola disclosed a new technology which integrates
into the existing PowerPC Architecture a new high bandwidth, parallel
operation vector execution unit
Motorola’s new AltiVec™ technology expands the capabilities of
PowerPC Microprocessors by providing leading-edge, general-purpose
processing performance while concurrently addressing high-bandwidth
data processing and algorithmic-intensive computations in a single-chip
solution
Branding of the new AltiVec technology defines and expresses Motorola’s
leadership and long-term commitment to the PowerPC Architecture and
its customers
AltiVec is a trademark of Motorola, Inc.
AltiVec™ Technology in Publications
“The ability to access a high-performance vector unit in addition to the integer and
floating point units in the RISC core provides a leap in performance beyond the
traditional advances in CPU clock frequency. . . In the end, AltiVec should have a
profound impact on digital signal processing applications that require both
flexibility and performance. The fact that it will be available in a high-performance
general-purpose RISC processor is icing on the cake.” Richard Jaenicke, SKY
Computers, Real Time Computing 5/98.
Jim Turley, a senior editor at Microprocessor Report, said AltiVec is similar to
MMX, but it doubles the amount of data that can be processed in the MMX
technology. Turley said that as the PowerPC Architecture moves further away
from the desktop, its network capabilities will become more important. “AltiVec
makes PowerPC a much more interesting alternative than it was just last week,”
Turley said. “It really does bring quite a lot to the party.” Austin American
Statesman 5/8/98.
“Think of the chip as having a doorway in and out of the chip that’s 32 bits wide,
but the hallways are 128 bits wide,” Turley offers. “Once data is moved out of
(main memory) and into the chip, they can really swing lots of data around and do
special number crunching, more so than other chips.” CNET 5/6/98.
Technology Introduction
Development of AltiVec technology was driven by customer demands for
next-generation performance for high bandwidth applications
Motorola’s premier installed base and long term customer relationships in
the networking and telecommunications markets provide early and
widespread design win opportunities
Combination of the powerful superscalar PowerPC microprocessor core
with the new functionality provided by AltiVec technology on a single chip
creates a unique new product class
This new class of product will enable new microprocessor applications
and performance levels through ultra-high bandwidth and parallel data
manipulation
Provides leading-edge general-purpose processing performance while
concurrently addressing high-bandwidth data processing and
algorithmic-intensive computations in a single-chip solution
Targets networking, computing and other performance-driven applications
The existing comprehensive tool vendor support for PowerPC
Architecture will be leveraged to rapidly deliver full infrastructure support
for the AltiVec technology
Motorola is working with leading tools providers to develop simulators,
assemblers, linkers and compilers to assure full support for the AltiVec
technology
Goals of AltiVec
Meet computational demands of networking infrastructure
Enable faster, more secure encryption methods optimized for SIMD
processing model
Provide compelling performance on multimedia-oriented desktop
computers, desktop publishing, and digital video processing
Enable real-time processing of the most demanding data streams
(MPEG-2 decode, continuous speech recognition, real-time, high-
resolution 3D graphics . . .)
What is a Vector Architecture?
A vector architecture allows the simultaneous processing of many data
items in parallel
Vector architecture has roots in supercomputing, which attempted to
extract large amounts of parallelism from software
Massive parallel capabilities but limited data types
Great for computation intensive applications but not for systems requiring
more diverse processing and real-time constraints
Operations are performed on multiple data elements by a single
instruction – referred to as Single Instruction Multiple Data (SIMD)
parallel processing
AltiVec technology is a short vector architecture
Uses 128-bit wide registers to provide 4-, 8-, or 16-way parallelism
Supports a wide variety of data types
SIMD extension to PowerPC ISA
Processes multiple data streams/blocks in a single cycle
Common approach to accelerate processing of next-gen data types (audio,
video, packet data)
Benefits of AltiVec
Provides a single high-performance RISC microprocessor with DSP-like
compute power for controller and signal processing functions
Supplements the performance-leading PowerPC architecture with a new class
of execution unit
New vector processing engine provides for highly parallel operations, allowing
for the simultaneous execution of up to 16 operations in a single clock cycle
Can accelerate many traditional computing and embedded processing
operations with its wide datapaths and wide field operations
Provides product designers and customers with a new “one part – one
code base” approach to product design while also providing a tremendous
jump in performance
Offers a programmable solution that can easily migrate via software
upgrades to follow changing standards and customer requirements
Because this integrated solution is still 100% compatible with the industry
standard PowerPC architecture, design and support are simplified
Leverage PowerPC compatibility and legacy code, add AltiVec performance as
you need it
AltiVec – A Solution to Many Embedded Computing Problems
Access Concentrators/DSLAMs
ADSL and Digital Data Concentrators
Speech Recognition
Voice/Sound Processing
Image and Video Processing
Array Numeric Processing
Basestation Processing
Infrastructure
AltiVec – The Best Solution to Computing Problems
High bandwidth data communications
Realtime Continuous Speech I/O
HMM, Viterbi Acceleration, Neural Algorithms
Soft-Modem
V.34, 56K
3D Graphics
Games, Entertainment
High precision CAD
Virtual Reality
Motion Video
MPEG2, MPEG4
H.234
High Fidelity Audio
3D Audio, AC-3
Machine Intelligence
AltiVec Features
Provides SIMD (Single Instruction, Multiple Data) functionality for
embedded applications with massive data processing needs
Key elements of Motorola’s AltiVec technology include:
128-bit vector execution unit with 32-entry 128-bit register file
Parallel processing with Vector permute unit and Vector ALU
162 new instructions
New data types:
»Packed byte, halfword, and word integers
»Packed IEEE single-precision floats
Saturation arithmetic
Simplified architecture
No interrupts other than data storage interrupt on loads and stores
No hardware unaligned access support
No penalty for running AltiVec and standard PowerPC instructions
simultaneously
Streamlined architecture to facilitate efficient implementation
Maintains PowerPC architecture’s RISC register-to-register
programming model
AltiVec Features
Supports parallel operation on byte, halfword, word and 128-bit operands
Intra and inter-element arithmetic instructions
Intra and inter-element conditional instructions
Powerful Permute, Shift and Rotate instructions
Vector integer and floating-point arithmetic
Data types
»8-, 16- and 32-bit signed- and unsigned-integer data types
»32-bit IEEE single-precision floating-point data type
»8-, 16-, and 32-bit Boolean data types (e.g. 0xFFFF = 16-bit TRUE)
Modulo & saturation integer arithmetic
32-bit “IEEE-default” single-precision floating-point arithmetic
»IEEE-default exception handling
»IEEE-default “round-to-nearest”
»fast non-IEEE mode (e.g. denorms flushed to zero)
Control flow with highly flexible bit manipulation engine
Compare creates field mask used by select function
Compare Rc bit enables setting Condition Register
»Trivial accept/reject in 3D graphics
»Exception detection via software polling
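The difference between the modulo and saturation integer arithmetic listed above can be sketched in portable C. This is a scalar emulation over sixteen 8-bit lanes, not AltiVec code; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16 /* 128 bits / 8-bit elements */

/* Modulo add: each lane wraps around on overflow (250 + 10 -> 4). */
void add_u8_modulo(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a && b && out);
    for (int i = 0; i < LANES; i++)
        out[i] = (uint8_t)(a[i] + b[i]);
}

/* Saturating add: each lane clamps at the largest value (250 + 10 -> 255). */
void add_u8_saturate(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a && b && out);
    for (int i = 0; i < LANES; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        out[i] = (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
    }
}
```

Saturation is usually what pixel and audio code wants, since a wrapped value shows up as a visible or audible glitch while a clamped one does not.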
AltiVec’s Vector Execution Unit
Concurrency with integer and floating-point units
Separate, dedicated 32 128-bit vector registers
Larger namespace = reduced register pressure/spillage
Longer vector length = more data-level parallelism
Separate files can all be accessed by execution units in parallel
Deeper register files allow more sophisticated software optimizations
No penalty for mingling integer, floating point and AltiVec technology
operations
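[Figure: PowerPC execution flow. The instruction stream is dispatched to the integer unit (IU, 32-bit GPRs), the floating-point unit (FPU, 64-bit FPRs), and the vector unit (128-bit vector register file), all of which connect to cache/memory.]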
128-bit Vector Architecture
128-bit wide data paths between L1 cache, L2 cache, Load/Store Units and
Registers
Wider data paths speed save and restore operations
Offers SIMD processing support for
16-way parallelism for 8-bit signed and unsigned integers and characters
8-way parallelism for 16-bit signed and unsigned integers
4-way parallelism for 32-bit signed and unsigned integers and IEEE floating
point numbers
Two fully pipelined independent execution units
Vector Permute Unit is a highly flexible byte manipulation engine
Vector ALU (Arithmetic Logical Unit) performs up to 16 operations in a single
clock cycle
»Contains Vector Simple Fixed-Point, Vector Complex Fixed-Point, and Vector
Floating-Point execution engines
Dual AltiVec instruction issue: one arithmetic, one “permute”
Sample Based Processing
SISD
AC3 - Audio Decode
do {
    decode (channel 1)
    decode (channel 2)
    decode (channel 3)
    decode (channel 4)
    decode (channel 5)
    decode (channel 6)
} while (Amplifier is on; step time)
SIMD
AC3 - Audio Decode
do {
    decode (ch1, ch2, ch3, ch4, ch5, ch6)
} while (Amplifier is on; step time)
Approx 6x performance improvement.
Simplified Systems Design
Existing DSP based systems model is Single MPU with Multiple DSPs
2 different architectures, software code bases, and hardware types
DSPs limit frequency and algorithm support
Upgrades in performance require new hardware, and at least 1 major
software change
Result: High-cost upgrades, slow time-to-market, in-field hardware
swap required
New Model is Single or Multiple MPUs
One architecture, one code base, and one hardware type
Very high frequency with total algorithm support
Performance upgrades available with only a software change
Result: Low cost upgrades, fast time-to-market, no in-field hardware
swap
AltiVec Instruction Set Features
162 new instructions added to the PowerPC ISA
4-operand, non-destructive instructions
Up to three source operands and a single destination operand
Supports advanced “multiply-add/sum” and permute primitives
All instructions fully pipelined with single-cycle throughput
Simple ops: 1 cycle latency
Compound ops: 3-4 cycle latency
No restriction on issue with scalar instructions
Enhanced cache/memory interface
Software hints for data re-use probability
Prefetch support (stride-N access)
Simplified load/store architecture
Simple byte, halfword, word and quadword loads & stores
No unaligned accesses – software-managed via permute instruction
In summary . . .
AltiVec™ Technology provides a robust instruction set
In terms of operations
In terms of data types and sizes
In terms of other options (such as saturation, data movement)
Allows a streamlined hardware implementation (cycle time,
latency, throughput)
AltiVec enables a broad range of embedded and
computing applications
AltiVec Instruction Set
Vector Intra-element Instructions
Integer Arithmetic Instructions
Floating-Point Arithmetic Instructions
Conditional Control Flow Instructions
Rotate, Shift and Logical Instructions
Memory Access Instructions
Vector Inter-element Instructions
Permute Instruction
Data Movement/Manipulation Instructions
Integer Multiply Odd/Even Instructions
Integer Multiply-Sum Instructions
Integer Sum Across Instructions
SIMD Intra-element Instructions
16 x 8-bit elements
8 x 16-bit elements
4 x 32-bit elements
[Figure: for each element size, the operation (op) is applied independently to corresponding elements of source registers VA, VB, and VC, and each result is written to the matching element of target register VT.]
Integer Arithmetic Instructions
Vector {Add, Subtract} {Unsigned, Signed} {Byte, Halfword, Word} {Modulo, Saturate}
Vector {Add, Subtract} Unsigned Word and Write Carry Out
Vector Multiply {Odd, Even} {Unsigned, Signed} {Byte, Halfword, Word}
Vector Multiply-Low-Add Unsigned Halfword Modulo
Vector Multiply-High[-Round]-Add Signed Halfword Saturate
Vector Average {Unsigned, Signed} {Byte, Halfword, Word}
[Figure: op applied element by element to VA, VB, and VC, giving VT.]
Multiply Odd and Even
[Figure: Multiply Odd and Multiply Even, shown for byte and half-word elements. The odd-numbered (or even-numbered) elements of VA and VB are multiplied, and each double-width product is written to VT.]
Multiply-High and Add
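[Figure: each signed halfword of VA is multiplied by the corresponding halfword of VB (prod); the high-order 17 bits of each product are added to the corresponding halfword of VC (temp); each sum is saturated (sat) to 16 bits and written to VT.]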
Vector Multiply High-Add Signed Halfword Saturate
vmhaddshs VT, VA, VB, VC
Multiply-High Round and Add
[Figure: each signed halfword of VA is multiplied by the corresponding halfword of VB (prod); a rounding constant (const) is added to each product; the high-order 17 bits of each rounded product are added to the corresponding halfword of VC (temp); each sum is saturated (sat) to 16 bits and written to VT.]
Vector Multiply High-Round and Add Signed Halfword Saturate
vmhraddshs VT, VA, VB, VC
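A scalar sketch of one lane of vmhaddshs and vmhraddshs may help in reading the two figures. It assumes the semantics as the figures suggest them: a 32-bit product, its high-order 17 bits added to the VC element, and a 16-bit saturated result, with 0x4000 as the rounding constant in the "Round" form. The function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wider value into the signed 16-bit range. */
static int16_t saturate_s16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Floor division by 2^15, written so that it does not depend on how the
   compiler shifts negative values. Keeps the top 17 bits of a product. */
static int32_t floor_shift15(int32_t x)
{
    return x >= 0 ? x >> 15 : -((-x + 32767) >> 15);
}

/* One lane of multiply-high and add. */
int16_t mhadd_s16(int16_t a, int16_t b, int16_t c)
{
    int32_t prod = (int32_t)a * (int32_t)b;
    return saturate_s16(floor_shift15(prod) + c);
}

/* One lane of multiply-high round and add: 0x4000 rounds to nearest. */
int16_t mhradd_s16(int16_t a, int16_t b, int16_t c)
{
    int32_t prod = (int32_t)a * (int32_t)b + 0x4000;
    return saturate_s16(floor_shift15(prod) + c);
}

/* All eight halfword lanes of a 128-bit vector. */
void vec_mhradd_s16(const int16_t *va, const int16_t *vb,
                    const int16_t *vc, int16_t *vt)
{
    assert(va && vb && vc && vt);
    for (int i = 0; i < 8; i++)
        vt[i] = mhradd_s16(va[i], vb[i], vc[i]);
}
```

Read as Q15 fixed point, `mhadd_s16(16384, 16384, 0)` is 0.5 times 0.5 and yields 8192, which is 0.25.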
Multiply Low and Add
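[Figure: each unsigned halfword of VA is multiplied by the corresponding halfword of VB (prod); the low-order 16 bits of each product are added, modulo 2^16, to the corresponding halfword of VC (temp) and written to VT.]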
Vector Multiply-Low and Add Unsigned Halfword Modulo
vmladduhm VT, VA, VB, VC
Floating-Point Arithmetic Instructions
Vector Multiply-Add Floating-Point
Vector Negative Multiply-Subtract Floating-Point
Vector {Reciprocal, Reciprocal Square Root, Log base 2, 2 raised to Exponent} Estimate Floating-Point
Vector {Add, Subtract} Floating-Point
[Figure: op applied to each of the four 32-bit elements of VA, VB, and VC, giving VT.]
Performs four 32-bit “IEEE-default” single-precision floating point arithmetic
operations in parallel
Pipelined 4-cycle latency for Add, Subtract, and Fused Multiply-Add
Floating-Point Conversion Instructions
Vector Round to Floating-Point Integer {to Nearest, toward Zero, toward +infinity, toward -infinity}
Vector Convert {From, To} {Unsigned, Signed} Fixed-Point Word
[Figure: op applied to each of the four 32-bit elements of VB, giving VT.]
Conditional Instructions
Vector Compare {EQ Unsigned, GT Unsigned, GT Signed} {Byte, Halfword, Word} [and Record]
Vector Compare {EQ, GT, GE, Bounds} Floating-Point [and Record]
Vector Select
Vector {Maximum, Minimum} {Unsigned, Signed} {Byte, Halfword, Word}
Vector {Maximum, Minimum} Floating-Point
[Figure: op applied element by element to VA, VB, and VC, giving VT.]
Compare-Select
Supports IF-THEN-ELSE by computing results for both branch paths
and then selecting correct path’s results
Reduces inefficient control flow feedback to branch unit
Compare creates field masks
target element set to ones where comparison is true
target element set to zeroes where comparison false
Select performs a bit-wise selection based on generated mask
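The compare-then-select pattern described above can be sketched in scalar C over sixteen byte lanes. This emulates the behaviour rather than using AltiVec itself, and the helper names (`cmpeq_u8`, `sel`, `replace_zero_bytes`) are invented for the example. It computes the branch-free equivalent of "if (x == 0) x = fill" for every byte.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16

/* Compare equal: a lane becomes all ones where a == b, all zeroes otherwise. */
void cmpeq_u8(const uint8_t *a, const uint8_t *b, uint8_t *mask)
{
    assert(a && b && mask);
    for (int i = 0; i < LANES; i++)
        mask[i] = (a[i] == b[i]) ? 0xFF : 0x00;
}

/* Select: for every bit, take b where the mask bit is 1, else a. */
void sel(const uint8_t *a, const uint8_t *b, const uint8_t *mask, uint8_t *out)
{
    assert(a && b && mask && out);
    for (int i = 0; i < LANES; i++)
        out[i] = (uint8_t)((a[i] & ~mask[i]) | (b[i] & mask[i]));
}

/* Both "branch paths" exist as whole vectors (data and fill);
   the mask picks the right one per lane, with no branch taken. */
void replace_zero_bytes(const uint8_t *data, uint8_t fill, uint8_t *out)
{
    uint8_t zero[LANES] = {0}, fillv[LANES], mask[LANES];
    for (int i = 0; i < LANES; i++)
        fillv[i] = fill;
    cmpeq_u8(data, zero, mask);
    sel(data, fillv, mask, out);
}
```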
Compare-Select
[Figure: vcmp compares corresponding elements of VA and VB and writes a mask to VT, each element all ones (FF...FF) where the comparison is true and all zeroes (00...00) where it is false. vsel then uses the mask in VC, bit by bit, to choose between VA and VB for each bit of VT.]
vcmp VT, VA, VB
vsel VT,VA,VB,VC
Rc Bit
Rc bit enables setting CR field 6 with summary
comparison status
CR bit 24 set if all compare result elements are equal to all ones
CR bit 26 set if all compare result elements are equal to all zeroes
Provides limited but critical control flow support
Trivial accept/reject on 3D clipping
3D lighting
Special handling of exceptions
»Software polls for occurrence of saturation
»Image processing application identified that needs to reset thresholds
of other pixel components when one component saturates
Algorithmic Features
Vector Dot Product (FIR)
Application:
Used heavily in DSP code
Performance:
PowerPC: 36 instructions (18 cycles throughput)
PowerPC + AltiVec: 2 instructions (2 cycles throughput)
[Figure: vmsum multiplies corresponding elements and adds the products in groups; vsum then sums the partial results across the vector.]
Vector Compare and Select
Application:
Permits removal of control flow
Performance:
PowerPC: 48 instructions (32 cycles throughput)
PowerPC + AltiVec: 2 instructions (3 cycles throughput)
Example: vmx_cmpeq( ) followed by vmx_sel( )
Input:        C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1
Compare with: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Mask:         00 FF FF FF 00 00 00 00 FF 00 FF FF 00 FF 00 00
Fill:         9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A
Result:       C1 9A 9A 9A 1A 1A C1 1A 9A C1 9A 9A 1A 9A 1A C1
Rotate, Shift and Logical Instructions
Vector {Rotate Left, Shift Left} Unsigned {Byte, Halfword, Word}
Vector Shift Right {Unsigned, Signed} {Byte, Halfword, Word}
VB = shift count
[Figure: each element of VA is rotated or shifted by the count in the corresponding element of VB, giving VT.]
Load/Store Vector Instructions
{Load, Store} Vector {MRU, LRU}
Load Vector for Shift {Left, Right}
{Load, Store} Vector Element {Byte, Halfword, Word}
[Figure: vector register VT loaded from, or stored to, a 16-byte-aligned block of memory (addresses 0x0000, 0x0010, 0x0020, 0x0030 shown).]
Enhanced Cache/Memory Interface
Load & store with LRU instructions
Performs memory access and marks cache block “next to be replaced”
Avoids flushing cache with multimedia data exhibiting limited data reuse
“Software-managed” memory buffer in cache
Data stream touch and stop instructions
Enables reuse of data cache as a memory access buffer
Alleviates memory access latency by enabling early data prefetch
Data stream touch offers software directed cache prefetch
»Specifies N blocks of data to be prefetched from memory into cache
»Each block to be prefetched is K bytes in length
»The first block to be prefetched is located at address X, the second at address
X+S, the third at address X+2S, . . . , where S is the stride in bytes
Data stream stop
»Allows stopping of a data stream operation that was started speculatively and turned out to be incorrect
»Selectively stop any stream via ID tag
“Transient/static” hint for cache placement strategy
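The address pattern that a data stream touch describes can be sketched as follows. This only illustrates the arithmetic (start address, block count, and a signed stride between block starts); the `stream_desc` type and its field names are assumptions made for the example, not the encoding the instruction actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one prefetch stream. */
typedef struct {
    uint32_t start;       /* address of the first block */
    uint32_t block_bytes; /* length of each block in bytes */
    uint32_t count;       /* number of blocks */
    int32_t  stride;      /* signed distance between block starts, in bytes */
} stream_desc;

/* Writes the start address of each block into out (at most cap entries)
   and returns how many were written. Addresses wrap modulo 2^32. */
size_t stream_block_addresses(const stream_desc *s, uint32_t *out, size_t cap)
{
    assert(s && out);
    size_t n = s->count < cap ? s->count : cap;
    for (size_t i = 0; i < n; i++)
        out[i] = s->start + (uint32_t)((int64_t)s->stride * (int64_t)i);
    return n;
}
```

A stride equal to the row pitch of an image, for example, walks a column of tiles; a negative stride walks memory backwards.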
[Figure: a data stream in memory made up of blocks 1, 2, 3, ... N. Block size = 0-32 vectors, stride = ±32 KBytes, 0-256 blocks.]
AltiVec Data Stream Prefetch
Provides a hardware engine for Data Stream Prefetching
Four simultaneous streams are supported, independent and asynchronous,
addressable via 2-bit ID tag
[Figure: block diagram showing instruction dispatch (DST), select, VTQ, the data stream engine, the load/store unit, the data cache, the data MMU, and the bus.]
Permute Instruction
Provides full byte-wide data crossbar for 128-bit registers
Selects any 16 8-bit elements from 2x16 8-bit elements
Other AltiVec instructions are special cases of permute
Pack, unpack, and merge
Splat (element or literal replication)
Shift left long immediate
Permute also supports other higher-level functions
Software-managed unaligned access
Table look-up
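As a rough illustration of the byte crossbar, the sketch below emulates permute in portable C: each control byte picks one of the 32 bytes formed by VA followed by VB. The function names are made up for the example, and `splat_byte` shows how a splat falls out as a special case with a constant control vector.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* vt[i] = byte number (vc[i] & 0x1F) of the 32-byte concatenation va:vb. */
void vperm(const uint8_t *va, const uint8_t *vb, const uint8_t *vc, uint8_t *vt)
{
    uint8_t tmp[16]; /* lets vt alias any of the inputs */
    assert(va && vb && vc && vt);
    for (int i = 0; i < 16; i++) {
        unsigned idx = vc[i] & 0x1Fu;
        tmp[i] = idx < 16 ? va[idx] : vb[idx - 16];
    }
    memcpy(vt, tmp, 16);
}

/* Splat as a special case of permute: every control byte is the same index. */
void splat_byte(const uint8_t *va, unsigned n, uint8_t *vt)
{
    uint8_t ctl[16];
    assert(n < 16);
    memset(ctl, (int)n, 16);
    vperm(va, va, ctl, vt);
}
```

With VA holding A0 through AF and VB holding B0 through BF, a control byte of 0x15 selects B5 and a control byte of 0x00 selects A0.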
[Figure: permute example. Input vectors VA = A0 ... AF and VB = B0 ... BF form 32 bytes numbered 0x00 to 0x1F. Each byte of control vector VC selects one of them for the corresponding byte of output vector VT; for example, 0x15 selects B5, 0x00 selects A0, and 0x14 selects B4.]
Data Management Features
Vector Permute
Applications:
Byte interleaving (conv. encoding)
Dynamic memory address alignment
Fast 32-element table lookup
Performance:
PowerPC: 5 to 50 instructions, depending on application
PowerPC + AltiVec: 1 instruction (1 cycle throughput)
[Figure: control vector, input vectors, and output vector of a vector permute.]
Data Stream Prefetch
Applications:
Software-directed prefetch into cache
Up to 4 simultaneous streams
Hides memory latency for stream processing applications (MPEG, FIR, etc)
Performance:
PowerPC: Does not exist
PowerPC + AltiVec: 1 instruction (can save hundreds of clocks)
[Figure: a data stream in memory made up of blocks 1, 2, 3, ... N. Block size = 0-32 vectors, stride = ±32 KBytes, 0-256 blocks.]
Unaligned Accesses
Loading N unaligned vectors takes N+1 memory accesses when alignment is handled in software, versus 2N accesses when it is handled in hardware
For a series of array elements: A0, A1, A2, A3
[Figure: array elements A0, A1, A2, A3 stored at an unaligned address in memory (addresses 0x00 through 0xA0 shown); each element straddles two aligned 16-byte vectors.]
For a single vector, A0 (shift count = mod16(addr(A0))):
lvx v00,&A0
lvx v01,&A1
lvsl v02,&A0
vperm v10,v00,v01,v02
For the series A0 to A3 (results in v10-v13):
lvx v00,&A0
lvx v01,&A1
lvsl v02,&A0
vperm v10,v00,v01,v02
lvx v00,&A2
vperm v11,v01,v00,v02
lvx v01,&A3
vperm v12,v00,v01,v02
lvx v00,&(A3+16)
vperm v13,v01,v00,v02
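The software approach to unaligned loads can be sketched in portable C. The helpers below only emulate what lvx, lvsl, and vperm are understood to do (an aligned 16-byte load that ignores the low four address bits, a control vector built from the low four address bits, and a byte select across two registers); memory is modelled as a byte array and all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Aligned load: the low four address bits are ignored. */
static void lvx(const uint8_t *mem, size_t addr, uint8_t *v)
{
    memcpy(v, mem + (addr & ~(size_t)15), 16);
}

/* Control vector for a left shift by (addr mod 16) bytes. */
static void lvsl(size_t addr, uint8_t *v)
{
    for (int i = 0; i < 16; i++)
        v[i] = (uint8_t)((addr & 15) + (size_t)i);
}

static void vperm(const uint8_t *va, const uint8_t *vb,
                  const uint8_t *vc, uint8_t *vt)
{
    for (int i = 0; i < 16; i++) {
        unsigned idx = vc[i] & 0x1Fu;
        vt[i] = idx < 16 ? va[idx] : vb[idx - 16];
    }
}

/* Loads n consecutive unaligned vectors starting at addr.
   Returns the number of aligned loads issued, which is n + 1.
   mem must extend to the end of the aligned block after the last vector. */
size_t load_unaligned(const uint8_t *mem, size_t addr, size_t n,
                      uint8_t out[][16])
{
    uint8_t lo[16], hi[16], ctl[16];
    size_t loads = 0;
    assert(mem && out);
    lvsl(addr, ctl);
    lvx(mem, addr, lo);
    loads++;
    for (size_t k = 0; k < n; k++) {
        lvx(mem, addr + 16 * (k + 1), hi);
        loads++;
        vperm(lo, hi, ctl, out[k]);
        memcpy(lo, hi, 16); /* reuse the upper block for the next vector */
    }
    return loads;
}
```

Each aligned block is loaded once and shared by the two vectors that straddle it, which is where the N+1 figure comes from.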
Unpack Instructions
Vector Unpack High {Signed Byte, Signed Halfword, Pixel*}
Vector Unpack Low {Signed Byte, Signed Halfword, Pixel*}
* unpacks 1-5-5-5 pixels into 4 8-bit components
[Figure: each element from the high (or low) half of VB is sign-extended (s) to twice its width in VT.]
Pack Instructions
Vector Pack Unsigned {Halfword, Word} {Modulo, Saturate}
Vector Pack Signed {Halfword, Word} Saturate
Vector Pack Pixel*
* packs 4 8-bit components into 1-5-5-5 pixels
[Figure: the elements of VA and VB are narrowed to half their width and concatenated in VT.]
Merge Instructions
Vector Merge High {Byte, Halfword, Word}
Vector Merge Low {Byte, Halfword, Word}
[Figure: elements from the high (or low) halves of VA and VB are interleaved in VT.]
Splat Instructions
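Vector Splat [Immediate Signed] {Byte, Halfword, Word}
• Useful for scalar operands
[Figure: the element of VA selected by IMM (or, in the immediate form, the sign-extended literal) is replicated into every element of VT.]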
Shift Left Double Immediate
Immediate form of permute
Supports 128-bit rotate/shift left/right
Rotate via specifying VA=VB
Shift left via specifying (VB)=0
Shift right via specifying (VA)=0 & shift left count = 16–shift right
count
Example: vsldi VT,VA,VB,#0004
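The rotate and shift rules above follow directly from treating VA and VB as one 32-byte value and taking 16 bytes starting at the immediate offset. A minimal C emulation, with made-up function names, might look like this:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* vt = bytes sh .. sh+15 of the 32-byte concatenation va:vb. */
void sld(const uint8_t *va, const uint8_t *vb, unsigned sh, uint8_t *vt)
{
    uint8_t cat[32];
    assert(sh < 16);
    memcpy(cat, va, 16);
    memcpy(cat + 16, vb, 16);
    memcpy(vt, cat + sh, 16);
}

/* Rotate left by n octets: VA = VB. */
void rotate_left_octets(const uint8_t *v, unsigned n, uint8_t *vt)
{
    sld(v, v, n, vt);
}

/* Shift left by n octets with zero fill: (VB) = 0. */
void shift_left_octets(const uint8_t *v, unsigned n, uint8_t *vt)
{
    uint8_t zero[16] = {0};
    sld(v, zero, n, vt);
}

/* Shift right by n octets (1..15) with zero fill:
   (VA) = 0 and shift left count = 16 - n. */
void shift_right_octets(const uint8_t *v, unsigned n, uint8_t *vt)
{
    uint8_t zero[16] = {0};
    assert(n >= 1 && n < 16);
    sld(zero, v, 16 - n, vt);
}
```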
[Figure, IMM = 4:
VA = a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af
VB = b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
VT = a4 a5 a6 a7 a8 a9 aa ab ac ad ae af b0 b1 b2 b3]
Shift Quadword Instructions
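Vector Shift {Left, Right} by Octet
Shift vector left or right by octets, zero fill.
[Figure: the shift count (5 in the example) is taken from bits 121-124 of VB; the remaining bits are don't care. VA is shifted by that many octets into VT, with zero fill.]
Vector Shift
Vector Shift {Left, Right}
Shift entire vector left or right by bit, up to 7 bits, zero fill.
[Figure: the shift count (6 in the example) is taken from bits 125-127 of VB and is shown replicated in every byte of VB. VA is shifted by that many bits into VT, with zero fill.]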
Multiply-Sum Integer Instructions
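Vector Multiply-Sum {Unsigned, Mixed-signed} Byte Modulo
[Figure: 8x8-bit multiplies of VA by VB; each group of four 16-bit products is added to a 32-bit element of VC (4x16-bit + 32-bit sum) and written to VT.]
Vector Multiply-Sum {Unsigned, Signed} Halfword {Modulo, Saturate}
[Figure: 16x16-bit multiplies of VA by VB; each pair of 32-bit products is added to a 32-bit element of VC (3x32-bit sum) and written to VT.]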
Sum Across Integer Instructions
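Vector Sum Across Signed Integer32 Saturate
[Figure: the four 32-bit elements of VA and one 32-bit element of VB are added (5x32-bit sum) into one element of VT. A second diagram shows two 3x32-bit sums, each adding two elements of VA and one element of VB.]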
Sum Across Partial Integer Instructions
Vector Sum Across Partial (1/2) Signed Integer32 Saturate
Vector Sum Across Partial (1/4) Signed Integer8 Saturate
Vector Sum Across Partial (1/4) Signed Integer16 Saturate
[Figure: for 16-bit elements, each 32-bit element of VT is the sum of two 16-bit elements of VA and one 32-bit element of VB (2x16-bit + 32-bit sum, four times).]
Partial Product
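[Figure: a 16-bit operand A multiplied by a 32-bit operand B. B is split into Bhigh and Blow, one half padded with zeroes and the other sign-extended (s), and each half is multiplied by a copy of A to form the partial products (labelled "24 bit partial products").]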
Use permute to duplicate the A operands and spread the B operands
Can then use mulsum to produce up to 40 bits of accumulation
Need to combine partial sums after the main loop
128 Bit Permute
Problem: arbitrary permute of each bit in a vector register, assume
static permute
Strategy:
1) break 128 bit vector into 16 bytes
2) pre-calculate permute, rotate and select control vectors
3) use permute to get byte to correct location
4) use rotate to get bit to correct location within each byte
5) use select to merge bit into result byte
6) repeat for all eight bits in each byte
[Figure: bits 0 through 127 of the source vector mapped to arbitrary bit positions in the result.]
[Figure: the permute, rotate, and select steps, with a select mask that picks one bit position in each byte.]
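The six-step strategy for a static 128-bit permutation can be sketched in scalar C, with each inner loop standing in for one vector instruction (permute, rotate, select). The plan structure, the function names, and the bit numbering (bit 0 is the most significant bit of byte 0) are assumptions chosen for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t byte_src[8][16]; /* permute control: source byte for each lane */
    uint8_t rot[8][16];      /* rotate-left count for each lane */
} bitperm_plan;

static uint8_t rotl8(uint8_t x, unsigned r)
{
    r &= 7u;
    return (uint8_t)((x << r) | (x >> ((8u - r) & 7u)));
}

/* Step 2: precompute the control vectors. perm[j] (0..127) is the source
   bit that lands in result bit j. */
void bitperm_prepare(const uint8_t *perm, bitperm_plan *plan)
{
    assert(perm && plan);
    for (int i = 0; i < 16; i++) {
        for (int k = 0; k < 8; k++) {
            unsigned src = perm[8 * i + k];
            assert(src < 128);
            plan->byte_src[k][i] = (uint8_t)(src / 8);
            /* rotating left by r moves bit position s to position s - r */
            plan->rot[k][i] = (uint8_t)((src % 8 + 8 - (unsigned)k) % 8);
        }
    }
}

/* Steps 3 to 6: one permute, one rotate, one select per bit position. */
void bitperm_apply(const bitperm_plan *plan, const uint8_t *in, uint8_t *out)
{
    uint8_t result[16] = {0}, t[16];
    for (int k = 0; k < 8; k++) {
        uint8_t mask = (uint8_t)(0x80u >> k);
        for (int i = 0; i < 16; i++)            /* permute */
            t[i] = in[plan->byte_src[k][i]];
        for (int i = 0; i < 16; i++)            /* rotate */
            t[i] = rotl8(t[i], plan->rot[k][i]);
        for (int i = 0; i < 16; i++)            /* select */
            result[i] = (uint8_t)((result[i] & ~mask) | (t[i] & mask));
    }
    memcpy(out, result, 16);
}
```

Because the permutation is static, `bitperm_prepare` runs once; each application then costs eight rounds of three vector operations in this sketch.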
Multiply-Sum Integer Instructions
Vector Multiply-Sum {Unsigned, Mixed-signed} Byte Modulo
[Figure: sixteen 8x8-bit multiplies of the elements of VA and VB. Each group of four 16-bit products is added to the corresponding 32-bit element of VC (4x16-bit + 32-bit sum, four times) and written to VT.]
Vector Multiply-Sum {Unsigned, Signed} Halfword {Modulo, Saturate}
[Figure: eight 16x16-bit multiplies of the elements of VA and VB. Each pair of 32-bit products is added to the corresponding 32-bit element of VC (3x32-bit sum, four times) and written to VT.]
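A scalar C sketch of the unsigned byte, modulo form of multiply-sum, together with a dot product built on it, is given below. The names are illustrative, and the final reduction is a plain modulo add across the four lanes, which simplifies the saturating sum-across instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For each of the four 32-bit lanes: t = c + sum of four byte products.
   Arithmetic is modulo 2^32. t may be the same array as c. */
void msum_u8(const uint8_t *a, const uint8_t *b, const uint32_t *c, uint32_t *t)
{
    assert(a && b && c && t);
    for (int w = 0; w < 4; w++) {
        uint32_t sum = c[w];
        for (int j = 0; j < 4; j++)
            sum += (uint32_t)a[4 * w + j] * (uint32_t)b[4 * w + j];
        t[w] = sum;
    }
}

/* Dot product of n bytes (n a multiple of 16): one multiply-sum per
   16 elements in the loop, then one reduction across lanes at the end. */
uint32_t dot_u8(const uint8_t *x, const uint8_t *y, size_t n)
{
    uint32_t acc[4] = {0, 0, 0, 0};
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16)
        msum_u8(x + i, y + i, acc, acc);
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Keeping four partial sums in the accumulator and combining them only once after the loop is what lets an FIR inner loop consist of a single multiply-sum per vector of samples.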
Parallel Table Lookup
Example: 64 entry table, 8-Bit entries
Strategy
1. Divide table into two 32-Entry tables
2. Load table into vector registers
3. Load values to be looked up (indices) into VC
4. Do two permutes, one using the upper half of the table as inputs,
one using the lower half
5. Do a vector compare greater than of each index with the constant
0001 1111. This results in 0x00 when the index is 31 or less (entry
is in first half of table) and 0xFF when the index is greater than 31
(entry is in second half of table)
6. Use vector select with the result of the compare to select (MUX) the
correct entry. This is equivalent to a MUX using sixth bit of the
index as the select input.
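The six steps above can be followed in a portable C emulation. The helpers mimic permute, unsigned compare greater than, and select on 16-byte arrays; they are stand-ins written for this sketch rather than AltiVec intrinsics, and `lookup64` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static void vperm(const uint8_t *va, const uint8_t *vb,
                  const uint8_t *vc, uint8_t *vt)
{
    for (int i = 0; i < 16; i++) {
        unsigned idx = vc[i] & 0x1Fu; /* only the low 5 index bits are used */
        vt[i] = idx < 16 ? va[idx] : vb[idx - 16];
    }
}

static void cmpgt_u8(const uint8_t *a, const uint8_t *b, uint8_t *mask)
{
    for (int i = 0; i < 16; i++)
        mask[i] = (a[i] > b[i]) ? 0xFF : 0x00;
}

static void sel(const uint8_t *a, const uint8_t *b,
                const uint8_t *mask, uint8_t *out)
{
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)((a[i] & ~mask[i]) | (b[i] & mask[i]));
}

/* Sixteen parallel lookups into a 64-entry byte table. */
void lookup64(const uint8_t *table, const uint8_t *idx, uint8_t *out)
{
    uint8_t lo[16], hi[16], limit[16], mask[16];
    assert(table && idx && out);
    for (int i = 0; i < 16; i++)
        assert(idx[i] < 64);
    memset(limit, 0x1F, 16);                /* replicated constant 0001 1111 */
    vperm(table, table + 16, idx, lo);      /* lookup in entries 0..31 */
    vperm(table + 32, table + 48, idx, hi); /* lookup in entries 32..63 */
    cmpgt_u8(idx, limit, mask);             /* 0xFF where index > 31 */
    sel(lo, hi, mask, out);                 /* pick the matching half */
}
```

Both permutes see the same index vector; the permute ignores everything above the low five bits, so the sixth bit only matters to the compare.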
Parallel Table Lookup
Parallel lookup is scalable:
table can be up to size of vector register file (minus a few)
32 entry table = 16 lookup/inst
64 entry table = 4 lookup/inst
256 entry table = 4/5 lookup/inst
non-SIMD code ~ 1/2 lookup/inst but table size unlimited
Can be further improved
piecewise linear approx.
Applications
Galois Field multiplication (used in Reed-Solomon ECC)
image/video processing (color correction)
Parallel Table Lookup
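[Figure: index bits 0:4 drive two 32:1 lookups (VPERM), one into the first half of the table (entries 0-31, held in V11 and V12) and one into the second half (entries 32-63, held in V13 and V14). Index bit 5, obtained with VCMPGT, drives a 2:1 selection (VSEL) between the two results.]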
Parallel Table Lookup
AltiVec code for 64-entry parallel table lookup
16-element parallel lookup into a table of 64 byte elements
; Assume V31 holds the 16 valid 6-bit index values
; which are to be looked up from a 64-element table,
; contained in V11 through V14.
; Assume V02 holds the looked up values.
; Assume following replicated constant: 0b00011111, in V27.
vperm V00,V11,V12,V31 ; lookup within first 32 elems
vperm V01,V13,V14,V31 ; lookup within second 32 elems
vcmpgtub V08,V31,V27 ; this comparison with 0b00011111
; will splat/replicate the 6th bit
vsel V02,V00,V01,V08 ; use 6th bit for choosing between
; the two lookups
256-Byte Table
[Figure: a 256-byte table divided into 32-byte chunks. Index bits 0:4 select within each chunk using VPERM; index bits 5, 6, and 7 each drive a VCMP/VSEL stage that chooses between chunks.]