ALTIVECPR - Motorola / Freescale Semiconductor

AltiVec™ Technology

➤On May 7, 1998, Motorola disclosed a new technology which integrates

into the existing PowerPC Architecture a new high bandwidth, parallel

operation vector execution unit

➤Motorola’s new AltiVec™ technology expands the capabilities of

PowerPC Microprocessors by providing leading-edge, general-purpose

processing performance while concurrently addressing high-bandwidth

data processing and algorithmic-intensive computations in a single-chip

solution

➤Branding of the new AltiVec technology defines and expresses Motorola’s

leadership and long-term commitment to the PowerPC Architecture and

its customers

AltiVec is a trademark of Motorola, Inc.

AltiVec™ Technology in Publications

➤“The ability to access a high-performance vector unit in addition to the integer and

floating point units in the RISC core provides a leap in performance beyond the

traditional advances in CPU clock frequency. . . In the end, AltiVec should have a

profound impact on digital signal processing applications that require both

flexibility and performance. The fact that it will be available in a high-performance

general-purpose RISC processor is icing on the cake.” Richard Jaenicke, SKY

Computers, Real Time Computing 5/98.

➤Jim Turley, a senior editor at Microprocessor Report, said AltiVec is similar to

MMX, but it doubles the amount of data that can be processed in the MMX

technology. Turley said that as the PowerPC Architecture moves further away

from the desktop, its network capabilities will become more important. “AltiVec

makes PowerPC a much more interesting alternative than it was just last week,”

Turley said. “It really does bring quite a lot to the party.” Austin American

Statesman 5/8/98.

➤“Think of the chip as having a doorway in and out of the chip that’s 32 bits wide,

but the hallways are 128 bits wide,” Turley offers. “Once data is moved out of

(main memory) and into the chip, they can really swing lots of data around and do

special number crunching, more so than other chips.” CNET 5/6/98.

Technology Introduction

➤Development of AltiVec technology was driven by customer demands for

next-generation performance for high bandwidth applications

➤Motorola’s premier installed base and long term customer relationships in

the networking and telecommunications markets provide early and

widespread design win opportunities

➤Combination of the powerful superscalar PowerPC microprocessor core

with the new functionality provided by AltiVec technology on a single chip

creates a unique new product class

➤This new class of product will enable new microprocessor applications

and performance levels through ultra-high bandwidth and parallel data

manipulation

•Provides leading-edge, general-purpose processing performance while

concurrently addressing high-bandwidth data processing and

algorithmic-intensive computations in a single-chip solution

•Targets networking, computing and other performance-driven applications

➤The existing comprehensive tool vendor support for PowerPC

Architecture will be leveraged to rapidly deliver full infrastructure support

for the AltiVec technology

•Motorola is working with leading tools providers to develop simulators,

assemblers, linkers and compilers to assure full support for the AltiVec

technology

Goals of AltiVec

➤Meet computational demands of networking infrastructure

➤Enable faster, more secure encryption methods optimized for SIMD

processing model

➤Provide compelling performance on multimedia-oriented desktop

computers, desktop publishing, and digital video processing

➤Enable real-time processing of the most demanding data streams

(MPEG-2 decode, continuous speech recognition, real-time,

high-resolution 3D graphics . . .)

What is a Vector Architecture?

➤A vector architecture allows the simultaneous processing of many data

items in parallel

➤Vector architecture has roots in supercomputing, which attempted to

extract large amounts of parallelism from software

•Massive parallel capabilities but limited data types

•Great for computation intensive applications but not for systems requiring

more diverse processing and real-time constraints

➤Operations are performed on multiple data elements by a single

instruction – referred to as Single Instruction Multiple Data (SIMD)

parallel processing

➤AltiVec technology is a short vector architecture

•Uses 128-bit wide registers to provide 4-, 8-, or 16-way parallelism

•Supports a wide variety of data types

➤SIMD extension to PowerPC ISA

•Processes multiple data streams/blocks in a single cycle

•Common approach to accelerate processing of next-gen data types (audio,

video, packet data)
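The 4-, 8-, and 16-way parallelism described above can be sketched with a small scalar model. This is only an illustration in Python, not AltiVec code: a 128-bit register is modeled as 16 bytes, and the helper names (`split_lanes`, `join_lanes`, `vec_add_modulo`) are hypothetical.

```python
def split_lanes(vec, width):
    """Split a 16-byte register image into big-endian unsigned lanes of `width` bytes."""
    assert len(vec) == 16 and width in (1, 2, 4)
    return [int.from_bytes(vec[i:i + width], "big") for i in range(0, 16, width)]


def join_lanes(lanes, width):
    """Pack unsigned lanes back into a 16-byte register image."""
    return b"".join(lane.to_bytes(width, "big") for lane in lanes)


def vec_add_modulo(a, b, width):
    """One 'instruction': add every lane of a and b, wrapping within each lane."""
    mask = (1 << (8 * width)) - 1
    sums = [(x + y) & mask for x, y in zip(split_lanes(a, width), split_lanes(b, width))]
    return join_lanes(sums, width)


a = bytes(range(16))
ones = bytes([1] * 16)
print(len(split_lanes(a, 1)), len(split_lanes(a, 2)), len(split_lanes(a, 4)))
print(vec_add_modulo(a, ones, 1).hex())
```

The same 16 bytes are treated as 16, 8, or 4 independent elements depending on the element width, which is where the 16-, 8-, and 4-way figures come from.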

Benefits of AltiVec

➤Provides a single high-performance RISC microprocessor with DSP-like

compute power for controller and signal processing functions

•Supplements the performance-leading PowerPC architecture with a new class

of execution unit

•New vector processing engine provides for highly parallel operations, allowing

for the simultaneous execution of up to 16 operations in a single clock cycle

•Can accelerate many traditional computing and embedded processing

operations with its wide datapaths and wide field operations

➤Provides product designers and customers with a new “one part – one

code base” approach to product design while also providing a tremendous

jump in performance

➤Offers a programmable solution that can easily migrate via software

upgrades to follow changing standards and customer requirements

•Because this integrated solution is still 100% compatible with the industry

standard PowerPC architecture, design and support are simplified

•Leverage PowerPC compatibility and legacy code, add AltiVec performance as

you need it

AltiVec – A Solution to Many Embedded Computing Problems

➤Access Concentrators/DSLAMs

•ADSL and Digital Data Concentrators

➤Speech Recognition

➤Voice/Sound Processing

➤Image and Video Processing

➤Array Numeric Processing

➤Basestation Processing

Infrastructure

AltiVec – The Best Solution to Computing Problems

➤High bandwidth data communications

➤Realtime Continuous Speech I/O

•HMM, Viterbi Acceleration, Neural Algorithms

➤Soft-Modem

•V.34, 56K

➤3D Graphics

•Games, Entertainment

•High precision CAD

➤Virtual Reality

➤Motion Video

•MPEG2, MPEG4

•H.234

➤High Fidelity Audio

•3D Audio, AC-3

➤Machine Intelligence

AltiVec Features

➤Provides SIMD (Single Instruction, Multiple Data) functionality for

embedded applications with massive data processing needs

➤Key elements of Motorola’s AltiVec technology include:

•128-bit vector execution unit with 32-entry 128-bit register file

•Parallel processing with Vector permute unit and Vector ALU

•162 new instructions

•New data types:

»Packed byte, halfword, and word integers

»Packed IEEE single-precision floats

•Saturation arithmetic

➤Simplified architecture

•No interrupts other than data storage interrupt on loads and stores

•No hardware unaligned access support

•No penalty for running AltiVec and standard PowerPC instructions

simultaneously

•Streamlined architecture to facilitate efficient implementation

➤Maintains PowerPC architecture’s RISC register-to-register

programming model

AltiVec Features

➤Supports parallel operation on byte, halfword, word and 128-bit operands

•Intra and inter-element arithmetic instructions

•Intra and inter-element conditional instructions

•Powerful Permute, Shift and Rotate instructions

➤Vector integer and floating-point arithmetic

•Data types

»8-, 16- and 32-bit signed- and unsigned-integer data types

»32-bit IEEE single-precision floating-point data type

»8-, 16-, and 32-bit Boolean data types (e.g. 0xFFFF ⇒ 16-bit TRUE)

•Modulo & saturation integer arithmetic

•32-bit “IEEE-default” single-precision floating-point arithmetic

»IEEE-default exception handling

»IEEE-default “round-to-nearest”

»fast non-IEEE mode (e.g. denorms flushed to zero)

➤Control flow with highly flexible bit manipulation engine

•Compare creates field mask used by select function

•Compare Rc bit enables setting Condition Register

»Trivial accept/reject in 3D graphics

»Exception detection via software polling

AltiVec’s Vector Execution Unit

➤Concurrency with integer and floating-point units

➤Separate, dedicated 32 128-bit vector registers

•Larger namespace = reduced register pressure/spillage

•Longer vector length = more data-level parallelism

•Separate files can all be accessed by execution units in parallel

•Deeper register files allow more sophisticated software optimizations

➤No penalty for mingling integer, floating point and AltiVec technology

operations

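[Figure: PowerPC execution flow. The instruction stream is dispatched to the integer unit (IU, with GPRs), the floating-point unit (FPU, with FPRs), and the vector unit (with the vector register file). The three register files connect to cache/memory over 32-, 64-, and 128-bit paths respectively.]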

128-bit Vector Architecture

➤128-bit wide data paths between L1 cache, L2 cache, Load/Store Units and

Registers

•Wider data paths speed save and restore operations

➤Offers SIMD processing support for

•16-way parallelism for 8-bit signed and unsigned integers and characters

•8-way parallelism for 16-bit signed and unsigned integers

•4-way parallelism for 32-bit signed and unsigned integers and IEEE floating

point numbers

➤Two fully pipelined independent execution units

•Vector Permute Unit is a highly flexible byte manipulation engine

•Vector ALU (Arithmetic Logical Unit) performs up to 16 operations in a single

clock cycle

»Contains Vector Simple Fixed-Point, Vector Complex Fixed-Point, and Vector

Floating-Point execution engines

•Dual AltiVec instruction issue: one arithmetic, one “permute”

Sample Based Processing

SISD: AC3 - Audio Decode

do {
    decode ( channel 1 )
    decode ( channel 2 )
    decode ( channel 3 )
    decode ( channel 4 )
    decode ( channel 5 )
    decode ( channel 6 )
} while (Amplifier is on; step time)

SIMD: AC3 - Audio Decode

do {
    decode ( channel 1, channel 2, channel 3, channel 4, channel 5, channel 6 )
} while (Amplifier is on; step time)

Approx. 6x performance improvement.

Simplified Systems Design

➤Existing DSP based systems model is Single MPU with Multiple DSPs

•2 different architectures, software code bases, and hardware types

•DSPs limit frequency and algorithm support

•Upgrades in performance require new hardware, and at least 1 major

software change

•Result: High-cost upgrades, slow time-to-market, in-field hardware

swap required

➤New Model is Single or Multiple MPUs

•One architecture, one code base, and one hardware type

•Very high frequency with total algorithm support

•Performance upgrades available with only a software change

•Result: Low cost upgrades, fast time-to-market, no in-field hardware

swap

AltiVec Instruction Set Features

➤162 new instructions added to the PowerPC ISA

➤4-operand, non-destructive instructions

•Up to three source operands and a single destination operand

•Supports advanced “multiply-add/sum” and permute primitives

➤All instructions fully pipelined with single-cycle throughput

•Simple ops: 1 cycle latency

•Compound ops: 3-4 cycle latency

•No restriction on issue with scalar instructions

➤Enhanced cache/memory interface

•Software hints for data re-use probability

•Prefetch support (stride-N access)

➤Simplified load/store architecture

•Simple byte, halfword, word and quadword loads & stores

•No unaligned accesses – software-managed via permute instruction

In summary . . .

➤AltiVec™ Technology provides a robust instruction set

•In terms of operations

•In terms of data types and sizes

•In terms of other options (such as saturation, data movement)

•Allows a streamlined hardware implementation (cycle time,

latency, throughput)

➤AltiVec enables a broad range of embedded and

computing applications

AltiVec Instruction Set

➤Vector Intra-element Instructions

•Integer Arithmetic Instructions

•Floating-Point Arithmetic Instructions

•Conditional Control Flow Instructions

•Rotate, Shift and Logical Instructions

•Memory Access Instructions

➤Vector Inter-element Instructions

•Permute Instruction

•Data Movement/Manipulation Instructions

•Integer Multiply Odd/Even Instructions

•Integer Multiply-Sum Instructions

•Integer Sum Across Instructions

SIMD Intra-element Instructions

➤16 x 8-bit elements

➤8 x 16-bit elements

➤4 x 32-bit elements

[Figure: for each element size, every element of VT is computed as op(VA, VB, VC) from the elements in the same position of the source registers: 16 independent operations for bytes, 8 for halfwords, 4 for words.]

Integer Arithmetic Instructions

Alternatives in braces each select one instruction variant:

•Vector {Add | Subtract} {Unsigned | Signed} {Byte | Halfword | Word} {Modulo | Saturate}

•Vector {Add | Subtract} Unsigned Word and Write Carry Out

•Vector Multiply {Odd | Even} {Unsigned | Signed} {Byte | Halfword}

•Vector Multiply-Low-Add Unsigned Halfword Modulo

•Vector Multiply-High[-Round]-Add Signed Halfword Saturate

•Vector Average {Unsigned | Signed} {Byte | Halfword | Word}

[Figure: intra-element operation on eight elements. Each element of VT is op(VA, VB, VC) on the elements in the same position.]
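The Modulo and Saturate variants differ only in how they treat a result that does not fit its element. A minimal sketch of the difference for byte elements, written as a scalar Python model with hypothetical function names:

```python
def add_unsigned_byte_modulo(a, b):
    """Wrap each sum to 8 bits (overflow is discarded)."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def add_unsigned_byte_saturate(a, b):
    """Clamp each sum at the largest unsigned byte, 255."""
    return [min(x + y, 0xFF) for x, y in zip(a, b)]


def add_signed_byte_saturate(a, b):
    """Clamp each sum to the signed byte range -128..127."""
    return [max(-128, min(127, x + y)) for x, y in zip(a, b)]


print(add_unsigned_byte_modulo([250, 10], [10, 20]))
print(add_unsigned_byte_saturate([250, 10], [10, 20]))
print(add_signed_byte_saturate([100, -100], [100, -100]))
```

Saturation is usually what pixel and audio code wants: a bright pixel that overflows stays white instead of wrapping around to a dark value.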

Multiply Odd and Even

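[Figure: Multiply Odd and Multiply Even, shown for byte and halfword elements. Each product is twice as wide as its source elements, so Multiply Even multiplies the even-numbered elements of VA and VB and Multiply Odd multiplies the odd-numbered elements; each result fills all of VT.]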
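As a rough illustration of why the multiplies come in odd and even halves, the sketch below models the unsigned byte case in Python. It assumes elements are numbered from 0 at the left (most significant) end of the register; the function names are hypothetical.

```python
def multiply_even_unsigned_byte(a, b):
    """Products of elements 0, 2, 4, ..., 14: eight 16-bit results."""
    return [a[i] * b[i] for i in range(0, 16, 2)]


def multiply_odd_unsigned_byte(a, b):
    """Products of elements 1, 3, 5, ..., 15: eight 16-bit results."""
    return [a[i] * b[i] for i in range(1, 16, 2)]


a = list(range(16))
b = [255] * 16
even = multiply_even_unsigned_byte(a, b)
odd = multiply_odd_unsigned_byte(a, b)
print(even)
print(odd)
```

Sixteen byte products need 16 x 16 = 256 bits, so the full set is split across two instructions, each producing eight full-precision halfword products that exactly fill one 128-bit register.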

Multiply-High and Add

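[Figure: each signed 16-bit element of VA is multiplied by the corresponding element of VB. The upper 17 bits of each product (prod) are added to the sign-extended element of VC (temp), and the sum is saturated (sat) to 16 bits in VT.]

Vector Multiply High-Add Signed Halfword Saturate

vmhaddshs VT, VA, VB, VC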

Multiply-High Round and Add

[Figure: each signed 16-bit element of VA is multiplied by the corresponding element of VB, and a rounding constant (const) is added to each product (prod). The upper 17 bits of the rounded product are added to the sign-extended element of VC (temp), and the sum is saturated (sat) to 16 bits in VT.]

Vector Multiply High-Round and Add Signed Halfword Saturate

vmhraddshs VT, VA, VB, VC
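The multiply-high-and-add pair is essentially a fixed-point (Q15) multiply-accumulate. The sketch below is a scalar Python model under the assumption that the rounding constant is 0x4000 (half of the discarded 15 bits) and that the upper 17 bits of the product are obtained by an arithmetic shift right by 15; the function names are hypothetical.

```python
def saturate_s16(x):
    """Clamp to the signed 16-bit range."""
    return max(-32768, min(32767, x))


def multiply_high_add(a, b, c, rounding=False):
    """Per element: sat16(((a * b [+ 0x4000]) >> 15) + c) on signed halfwords."""
    out = []
    for x, y, z in zip(a, b, c):
        prod = x * y
        if rounding:
            prod += 0x4000  # assumed rounding constant for the -Round form
        out.append(saturate_s16((prod >> 15) + z))  # >> is arithmetic in Python
    return out


half = 16384  # 0.5 in Q15
print(multiply_high_add([half], [half], [0]))                 # 0.25 in Q15
print(multiply_high_add([-32768], [-32768], [0]))             # saturates
print(multiply_high_add([1], [half], [0], rounding=True))     # rounds up
```

Without rounding the discarded low bits are simply truncated; the rounding form reduces the bias that truncation introduces in long filter chains.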

Multiply Low and Add

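[Figure: each unsigned 16-bit element of VA is multiplied by the corresponding element of VB. The low 16 bits of each product (prod) are added to the element of VC (temp), and the low 16 bits of the sum are placed in VT.]

Vector Multiply-Low and Add Unsigned Halfword Modulo

vmladduhm VT, VA, VB, VC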
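For comparison with the multiply-high forms, a minimal scalar sketch of the multiply-low-and-add behavior follows. It is a Python model with a hypothetical function name: everything above bit 15 of the result is discarded.

```python
def multiply_low_add_modulo(a, b, c):
    """Per element: (a * b + c) modulo 2**16 on unsigned halfwords."""
    return [(x * y + z) & 0xFFFF for x, y, z in zip(a, b, c)]


print(multiply_low_add_modulo([3, 0xFFFF], [5, 0xFFFF], [7, 1]))
```

This is the form used for ordinary integer arithmetic such as address or index calculations, where wrap-around is the expected behavior.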

Floating-Point Arithmetic Instructions

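•Vector {Add | Subtract} Floating-Point

•Vector {Multiply-Add | Negative Multiply-Subtract} Floating-Point

•Vector {Reciprocal | Reciprocal Square Root | Log base 2 | 2 raised to Exponent} Estimate Floating-Point

[Figure: intra-element operation on four 32-bit elements. Each element of VT is op(VA, VB, VC) on the elements in the same position.]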

➤Performs four 32-bit “IEEE-default” single-precision floating point arithmetic

operations in parallel

➤Pipelined 4-cycle latency for Add, Subtract, and Fused Multiply-Add

Floating-Point Conversion Instructions

•Vector Round to Floating-Point Integer {to Nearest | toward Zero | toward +infinity | toward –infinity}

•Vector Convert {From | To} {Unsigned | Signed} Fixed-Point Word

[Figure: each of the four 32-bit elements of VB is converted independently into the element in the same position of VT.]

Conditional Instructions

•Vector Compare {EQ Unsigned | GT Unsigned | GT Signed} {Byte | Halfword | Word} [and Record]

•Vector Compare {EQ | GT | GE | Bounds} Floating-Point [and Record]

•Vector Select

•Vector {Maximum | Minimum} {Unsigned | Signed} {Byte | Halfword | Word}

•Vector {Maximum | Minimum} Floating-Point

[Figure: intra-element operation on eight elements. Each element of VT is op(VA, VB, VC) on the elements in the same position.]

Compare-Select

➤Supports IF-THEN-ELSE by computing results for both branch paths

and then selecting correct path’s results

➤Reduces inefficient control flow feedback to branch unit

➤Compare creates field masks

•target element set to ones where comparison is true

•target element set to zeroes where comparison false

➤Select performs a bit-wise selection based on generated mask
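The compare-then-select pattern can be sketched with a scalar Python model. The example replaces every zero byte of a data vector with 0x9A without a branch; the function names are hypothetical, and the select is assumed to take bits from its second input where the mask is one and from its first input where the mask is zero.

```python
def compare_equal_bytes(a, b):
    """Field mask: 0xFF where the bytes are equal, 0x00 elsewhere."""
    return bytes(0xFF if x == y else 0x00 for x, y in zip(a, b))


def select(a, b, mask):
    """Bit-wise select: take b where the mask bit is 1, a where it is 0."""
    return bytes((x & ~m & 0xFF) | (y & m) for x, y, m in zip(a, b, mask))


data = bytes.fromhex("C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1")
zeroes = bytes(16)
fill = bytes([0x9A]) * 16

mask = compare_equal_bytes(data, zeroes)   # both "branches" are already available
result = select(data, fill, mask)          # the mask picks the right one per byte
print(mask.hex(" "))
print(result.hex(" "))
```

The scalar equivalent would be `x if x != 0 else 0x9A` evaluated 16 times with a branch each time; here both outcomes exist up front and the mask chooses between them.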

Compare-Select

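[Figure: vcmp compares VA and VB element by element and writes a field mask to VT, e.g. 00...00 FF...FF 00...00 00...00 00...00 FF...FF FF...FF FF...FF. vsel then uses the mask in VC bit by bit (e.g. 01001100 . . .) to take each bit of VT from VA or VB.]

vcmp VT, VA, VB

vsel VT, VA, VB, VC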

Rc Bit

➤Rc bit enables setting CR field 6 with summary

comparison status

•CR bit 24 set if all compare result elements are equal to all ones

•CR bit 26 set if all compare result elements are equal to all zeroes

➤Provides limited but critical control flow support

•Trivial accept/reject on 3D clipping

•3D lighting

•Special handling of exceptions

»Software polls for occurrence of saturation

»An image processing application has been identified that needs to reset

the thresholds of other pixel components when one component saturates
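A sketch of how the recorded summary supports trivial accept/reject follows. It is a scalar Python model: the two summary flags stand in for the "all ones" and "all zeroes" condition bits, the packing into the Condition Register itself is not modeled, and the function names and the clipping scenario are illustrative assumptions.

```python
def compare_greater_record(a, b):
    """Return (field mask, all_true, all_false) for an element-wise a > b."""
    mask = [0xFFFFFFFF if x > y else 0 for x, y in zip(a, b)]
    all_true = all(m == 0xFFFFFFFF for m in mask)
    all_false = all(m == 0 for m in mask)
    return mask, all_true, all_false


def classify(coords, limit):
    """One compare and one branch decide the fate of four coordinates at once."""
    _, all_out, all_in = compare_greater_record(coords, [limit] * len(coords))
    if all_in:
        return "accept"   # nothing exceeds the limit
    if all_out:
        return "reject"   # everything exceeds the limit
    return "clip"         # mixed: fall back to per-element handling


print(classify([1, 2, 3, 4], 10))
print(classify([11, 12, 13, 14], 10))
print(classify([1, 12, 3, 4], 10))
```

Only the mixed case needs further per-element work, which is why the summary bits are described as limited but critical control flow support.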

Algorithmic Features

Vector Dot Product (FIR)

Application:

Used heavily in DSP code

Performance:

PowerPC: 36 instructions ( 18 cycles throughput )

PowerPC + AltiVec: 2 instructions ( 2 cycles throughput )

[Figure: vmsum multiplies paired elements and adds the products within each word; vsum then adds across the words.]

Vector Compare and Select

Application:

Permits removal of control flow

Performance:

PowerPC: 48 instructions ( 32 cycles throughput )

PowerPC + AltiVec: 2 instructions ( 3 cycles throughput )

Example (vcmp followed by vsel, one row per 16-byte vector):

00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   vmx_cmpeq( ) input
C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1   vmx_cmpeq( ) input
00 FF FF FF 00 00 00 00 FF 00 FF FF 00 FF 00 00   vmx_cmpeq( ) result (mask)
C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1   vmx_sel( ) input
9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A   vmx_sel( ) input
C1 9A 9A 9A 1A 1A C1 1A 9A C1 9A 9A 1A 9A 1A C1   vmx_sel( ) result

Rotate, Shift and Logical Instructions

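•Vector Rotate Left Unsigned {Byte | Halfword | Word}

•Vector Shift Left Unsigned {Byte | Halfword | Word}

•Vector Shift Right {Unsigned | Signed} {Byte | Halfword | Word}

[Figure: each element of VA is rotated or shifted by the count held in the corresponding element of VB (VB = shift count), giving the element of VT.]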

Load/Store Vector Instructions

•{Load | Store} Vector {MRU | LRU}

•Load Vector for Shift {Left | Right}

•{Load | Store} Vector Element {Byte | Halfword | Word}

[Figure: two panels, each showing memory at addresses 0x0000, 0x0010, 0x0020, and 0x0030 and the vector register VT.]

Enhanced Cache/Memory Interface

➤Load & store with LRU instructions

•Performs memory access and marks cache block “next to be replaced”

•Avoids flushing cache with multimedia data exhibiting limited data reuse

•“Software-managed” memory buffer in cache

➤Data stream touch and stop instructions

•Enables reuse of data cache as a memory access buffer

•Alleviates memory access latency by enabling early data prefetch

•Data stream touch offers software directed cache prefetch

»Specifies N blocks of data to be prefetched from memory into cache

»Each block to be prefetched is K bytes in length

»The first block to be prefetched is located at address X, the second at address

X+S, the third at address X+2S, . . . , where S is the signed stride in bytes

•Data stream stop

»Allows stopping of a data stream operation that was started speculatively and proved incorrect

»Selectively stop any stream via ID tag

➤“Transient/static” hint for cache placement strategy

[Figure: blocks 1, 2, 3, . . . N laid out in memory. Block size = 0-32 vectors; stride = ±32 KBytes; 0-256 blocks.]
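The address pattern of a data stream touch can be sketched as a simple generator of byte ranges. This Python model is illustrative only: the function names and parameters are hypothetical, the stride between block start addresses is called `stride` here, and the real instruction encodes these values in a register rather than taking them as arguments.

```python
def stream_block_addresses(start, block_count, stride):
    """Start address of each block: start, start + stride, start + 2*stride, ..."""
    return [start + i * stride for i in range(block_count)]


def stream_byte_ranges(start, block_bytes, block_count, stride):
    """(first, one-past-last) byte address of every block in the stream."""
    return [(addr, addr + block_bytes)
            for addr in stream_block_addresses(start, block_count, stride)]


# Example: prefetch a 16-byte-wide column strip from an image whose rows are
# 640 bytes apart, 8 rows at a time (one 16-byte vector per block).
for first, last in stream_byte_ranges(0x1000, 16, 8, 640):
    print(hex(first), hex(last))
```

A signed stride allows a stream to walk backwards through memory, and a stride larger than the block size skips the bytes in between, which suits two-dimensional data such as image tiles.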

Data Stream Prefetch

➤Provides a hardware engine for Data Stream Prefetching

➤Four simultaneous streams are supported, independent and asynchronous,

addressable via 2-bit ID tag

[Figure: AltiVec Data Stream Prefetch. Block diagram with instruction dispatch, DST select, VTQ, data stream engine, load/store unit, data MMU, data cache, and bus.]

Permute Instruction

➤Provides full byte-wide data crossbar for 128-bit registers

➤Selects any 16 8-bit elements from 2x16 8-bit elements

➤Other AltiVec instructions are special cases of permute

•Pack, unpack, and merge

•Splat (element or literal replication)

•Shift left long immediate

➤Permute also supports other higher-level functions

•Software-managed unaligned access

•Table look-up

[Figure: permute example. VA holds bytes A0 to AF and VB holds bytes B0 to BF; together they form 32 source bytes numbered 0x00 to 0x1F. Each byte of the control vector VC names the source byte copied to the same position of VT, so 0x00 selects A0, 0x16 selects B6, and 0x1F selects BF.]
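The permute operation is compact enough to model in a few lines. The following Python sketch assumes that only the low five bits of each control byte are used; the function name `vperm` simply mirrors the instruction, and the control values are chosen for illustration.

```python
def vperm(va, vb, vc):
    """Each control byte picks one of the 32 bytes of va followed by vb."""
    source = va + vb
    return bytes(source[c & 0x1F] for c in vc)


va = bytes(range(0xA0, 0xB0))   # A0 .. AF
vb = bytes(range(0xB0, 0xC0))   # B0 .. BF
control = bytes([0x00, 0x16, 0x0F, 0x1A, 0x0A, 0x17, 0x0C, 0x1F,
                 0x08, 0x1D, 0x1E, 0x1C, 0x05, 0x15, 0x00, 0x14])
print(vperm(va, vb, control).hex(" "))

# Two of the "special cases" expressed as control vectors:
splat_byte_3 = bytes([3] * 16)                                         # replicate element 3
merge_high = bytes(i // 2 + (16 if i % 2 else 0) for i in range(16))   # 0,16,1,17,...
print(vperm(va, vb, splat_byte_3).hex(" "))
print(vperm(va, vb, merge_high).hex(" "))
```

Because the control vector is ordinary register data, it can be computed at run time, which is what makes permute usable for alignment and table lookup as well as fixed shuffles.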

Data Management Features

Vector Permute

Applications:

• Byte interleaving (conv. encoding)

• Dynamic memory address alignment

• Fast 32-element table lookup

Performance:

PowerPC: 5 to 50 instructions, depending on application

PowerPC + AltiVec: 1 instruction ( 1 cycle throughput )

[Figure: Vector Permute. The control vector selects bytes from the two input vectors to form the output vector.]

Data Stream Prefetch

Applications:

• Software-directed prefetch into cache

• Up to 4 simultaneous streams

• Hides memory latency for stream processing applications (MPEG, FIR, etc)

Performance:

PowerPC: Does not exist

PowerPC + AltiVec: 1 instruction ( Can save hundreds of clocks )

[Figure: blocks 1, 2, 3, . . . N laid out in memory. Block size = 0-32 vectors; stride = ±32 KBytes; 0-256 blocks.]

Unaligned Accesses

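➤Realigning a series of N unaligned vectors in software takes N+1 memory

accesses, versus 2N accesses for hardware-handled unaligned loads

➤For a single vector, A0 (shift amount = mod16(addr(A0))):

lvx   v00,&A0            ; aligned quadword containing the start of A0
lvx   v01,&A1            ; next aligned quadword
lvsl  v02,&A0            ; permute control derived from the address of A0
vperm v10,v00,v01,v02    ; v10 = A0

➤For a series of array elements: A0, A1, A2, A3 ⇒ vT (v10-v13)

lvx   v00,&A0
lvx   v01,&A1
lvsl  v02,&A0
vperm v10,v00,v01,v02
lvx   v00,&A2
vperm v11,v01,v00,v02
lvx   v01,&A3
vperm v12,v00,v01,v02
lvx   v00,&(A3+16)
vperm v13,v01,v00,v02

[Figure: memory from 0x00 to 0xA0 in 16-byte steps, with the unaligned elements A0 to A3 straddling the 16-byte boundaries. Each vperm merges two adjacent aligned loads into one realigned element.]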
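The lvx / lvsl / vperm idiom can be modeled to show where the N+1 count comes from. This is a scalar Python sketch under stated assumptions: memory is a `bytes` object, `lvx` ignores the low four address bits, and `lvsl` produces the control bytes sh, sh+1, ..., sh+15 for sh = address mod 16. The `load_unaligned_stream` helper is hypothetical.

```python
def lvx(memory, addr):
    """Aligned 16-byte load: the low four address bits are ignored."""
    base = addr & ~0xF
    return memory[base:base + 16]


def lvsl(addr):
    """Permute control for realignment: sh, sh+1, ..., sh+15 with sh = addr mod 16."""
    sh = addr & 0xF
    return bytes(range(sh, sh + 16))


def vperm(va, vb, vc):
    source = va + vb
    return bytes(source[c & 0x1F] for c in vc)


def load_unaligned_stream(memory, addr, count):
    """Realign `count` consecutive vectors starting at addr; returns (vectors, loads)."""
    control = lvsl(addr)            # computed once, reused for every element
    previous = lvx(memory, addr)
    loads = 1
    vectors = []
    for i in range(count):
        following = lvx(memory, addr + 16 * (i + 1))
        loads += 1
        vectors.append(vperm(previous, following, control))
        previous = following        # each aligned load is used twice
    return vectors, loads


memory = bytes(range(256))
vectors, loads = load_unaligned_stream(memory, 0x13, 4)
print(loads)
print(vectors[0].hex(" "))
```

Each aligned quadword is loaded once and shared by two neighboring results, so four realigned vectors cost five loads rather than eight.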

Unpack Instructions

•Vector Unpack {High | Low} Signed {Byte | Halfword}

•Vector Unpack {High | Low} Pixel*

* unpacks 1-5-5-5 pixels into 4 8-bit components

[Figure: Unpack High takes the elements in the high-order half of VB and Unpack Low those in the low-order half; each element is sign-extended (s) to twice its width in VT.]

Pack Instructions

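•Vector Pack Unsigned {Halfword | Word} {Modulo | Saturate}

•Vector Pack Signed {Halfword | Word} Saturate

•Vector Pack Pixel*

* packs 4 8-bit components into 1-5-5-5 pixels

[Figure: the elements of VA and VB are each narrowed to half their width and placed side by side in VT.]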

Merge Instructions

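•Vector Merge High {Byte | Halfword | Word}

•Vector Merge Low {Byte | Halfword | Word}

[Figure: Merge High interleaves the elements from the high-order halves of VA and VB into VT; Merge Low does the same with the low-order halves.]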

Splat Instructions

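• Useful for scalar operands

•Vector Splat [Immediate Signed] {Byte | Halfword | Word}

[Figure: one element of VA, chosen by the select field, or the immediate value IMM is replicated into every element of VT.]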

Shift Left Double Immediate

➤Immediate form of permute

➤Supports 128-bit rotate/shift left/right

•Rotate via specifying VA=VB

•Shift left via specifying (VB)=0

•Shift right via specifying (VA)=0 & shift left count = 16–shift right

count

➤Example: vsldi VT,VA,VB,#0004

[Figure: with IMM = 4, VA = a0 a1 a2 . . . af and VB = b0 b1 b2 . . . bf, the result is VT = a4 a5 a6 a7 a8 a9 aa ab ac ad ae af b0 b1 b2 b3.]
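The three uses listed above (rotate, shift left, shift right) all fall out of one operation on the 32-byte concatenation of VA and VB. A minimal Python model, with a hypothetical function name and the count given in bytes:

```python
def shift_left_double(va, vb, count):
    """Take 16 bytes starting `count` bytes into the concatenation va:vb."""
    assert len(va) == 16 and len(vb) == 16 and 0 <= count <= 15
    return (va + vb)[count:count + 16]


va = bytes(range(0xA0, 0xB0))
vb = bytes(range(0xB0, 0xC0))
zero = bytes(16)

print(shift_left_double(va, vb, 4).hex(" "))        # the example with IMM = 4
print(shift_left_double(va, va, 4).hex(" "))        # rotate left: VA = VB
print(shift_left_double(va, zero, 4).hex(" "))      # shift left: (VB) = 0
print(shift_left_double(zero, va, 16 - 4).hex(" ")) # shift right by 4: (VA) = 0
```

The shift-right case uses a left count of 16 minus the desired right count, as the last bullet above states.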

Shift Quadword Instructions

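•Vector Shift {Left | Right} by Octet

Shift vector left or right by octets, zero fill. The shift count is taken from bits 121:124 of VB; the other bits of VB are don’t care.

[Figure: VA shifted by a count of 5 octets into VT, with five 00 bytes filled in.]

•Vector Shift {Left | Right} (by bit)

Shift entire vector left or right by bit, up to 7 bits, zero fill. The shift count is taken from bits 125:127 of VB.

[Figure: VA shifted by a count of 6 bits into VT, with zero fill (0..0); the count 6 appears in every byte of VB.]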

Multiply-Sum Integer Instructions

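•Vector Multiply-Sum {Unsigned | Mixed-signed} Byte Modulo

[Figure: for each 32-bit word, four 8x8-bit multiplies of VA and VB elements give four 16-bit products, which are added to the 32-bit word of VC (4x16-bit + 32-bit sum) to form the word of VT.]

•Vector Multiply-Sum {Unsigned | Signed} Halfword {Modulo | Saturate}

[Figure: for each 32-bit word, two 16x16-bit multiplies of VA and VB elements give two 32-bit products, which are added to the 32-bit word of VC (3x32-bit sum) to form the word of VT.]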

Sum Across Integer Instructions

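•Vector Sum Across Signed Integer32 Saturate

[Figure: the four 32-bit elements of VA and one 32-bit element of VB are added (5x32-bit sum) into one element of VT.]

[Figure: two 3x32-bit sums, each adding two 32-bit elements of VA and one 32-bit element of VB into one element of VT.]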
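A multiply-sum followed by a sum-across gives a complete dot product, which is the two-instruction FIR kernel this deck refers to. The Python sketch below models the signed halfword modulo multiply-sum and the saturating sum across; it assumes the sum-across result lands in the last word with the other words cleared, and the function names are hypothetical.

```python
def to_s32(x):
    """Wrap an integer to signed 32-bit (modulo arithmetic)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def saturate_s32(x):
    return max(-(1 << 31), min((1 << 31) - 1, x))


def multiply_sum_signed_halfword_modulo(a, b, c):
    """a, b: 8 signed halfwords; c: 4 signed words. Two products + c per word."""
    return [to_s32(a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1] + c[i])
            for i in range(4)]


def sum_across_signed_word_saturate(a, b):
    """5x32-bit sum: four words of a plus the last word of b, saturated."""
    return [0, 0, 0, saturate_s32(sum(a) + b[3])]


def dot_product(x, h):
    partial = multiply_sum_signed_halfword_modulo(x, h, [0, 0, 0, 0])
    return sum_across_signed_word_saturate(partial, [0, 0, 0, 0])[3]


print(dot_product([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]))
```

In a longer filter the multiply-sum would be repeated in a loop with its own previous result passed back in as the accumulator operand, and the sum across would run once at the end.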

Sum Across Partial Integer Instructions

•Vector Sum Across Partial (1/2) Signed Integer32 Saturate

•Vector Sum Across Partial (1/4) Signed Integer8 Saturate

•Vector Sum Across Partial (1/4) Signed Integer16 Saturate

[Figure: Sum Across Partial (1/4) for 16-bit elements. For each 32-bit word, two 16-bit elements of VA are added to the 32-bit word of VB (2x16-bit + 32-bit sum) to form the word of VT.]

Partial Product

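[Figure: a 16-bit operand A multiplied by a 32-bit operand B. B is split into Bhigh and Blow, with zero (0) and sign (s) fill, and each part is multiplied by A; the results are labeled “24 bit partial products”.]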

➤Use permute to duplicate the A operands and spread the B operands

➤Can then use multiply-sum to produce up to 40 bits of accumulation

➤Need to combine partial sums after the main loop

128 Bit Permute

Problem: arbitrary permute of each bit in a vector register, assume

static permute

Strategy:

1) break 128 bit vector into 16 bytes

2) pre-calculate permute, rotate and select control vectors

3) use permute to get byte to correct location

4) use rotate to get bit to correct location within each byte

5) use select to merge bit into result byte

6) repeat for all eight bits in each byte

[Figure: 128 Bit Permute. Bits 0 to 127 of the source vector are routed to the result in three stages: permute, rotate, select.]
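The permute / rotate / select strategy can be checked with a scalar model. The Python sketch below assumes bit 0 is the most significant bit of byte 0 and that `perm[j]` names the source bit that should land in result bit j; all function names are hypothetical, and the reference routine exists only to verify the result.

```python
def rotate_left_byte(value, count):
    count &= 7
    return ((value << count) | (value >> (8 - count))) & 0xFF


def bit_permute(vec, perm):
    """Arbitrary 128-bit permutation in eight permute/rotate/select passes."""
    result = bytes(16)
    for k in range(8):  # step 6: one pass per bit position within a byte
        # Step 2: control vectors for this pass (precomputable for a static perm).
        byte_ctl = [perm[8 * d + k] // 8 for d in range(16)]
        rot_ctl = [(perm[8 * d + k] % 8 - k) % 8 for d in range(16)]
        sel_mask = 0x80 >> k
        moved = [vec[s] for s in byte_ctl]                                  # step 3
        aligned = [rotate_left_byte(v, r) for v, r in zip(moved, rot_ctl)]  # step 4
        result = bytes((r & ~sel_mask & 0xFF) | (a & sel_mask)              # step 5
                       for r, a in zip(result, aligned))
    return result


def bit_permute_reference(vec, perm):
    """Bit-at-a-time version, used only to check the sketch above."""
    bits = [(vec[j // 8] >> (7 - j % 8)) & 1 for j in range(128)]
    out = bytearray(16)
    for j in range(128):
        out[j // 8] |= bits[perm[j]] << (7 - j % 8)
    return bytes(out)


vec = bytes((i * 73 + 19) & 0xFF for i in range(16))
perm = [(j * 37 + 5) % 128 for j in range(128)]   # an arbitrary fixed permutation
print(bit_permute(vec, perm) == bit_permute_reference(vec, perm))
```

Each pass corresponds to three vector instructions, so with the 24 control vectors computed ahead of time the whole permutation costs on the order of 24 instructions in this model.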

Multiply-Sum Integer Instructions

•Vector Multiply-Sum {Unsigned | Mixed-signed} Byte Modulo

[Figure: sixteen 8x8-bit multiplies of VA and VB elements. Within each 32-bit word, the four products are added to the word of VC (4x16-bit + 32-bit sum) to form the word of VT.]

•Vector Multiply-Sum {Unsigned | Signed} Halfword {Modulo | Saturate}

[Figure: eight 16x16-bit multiplies of VA and VB elements. Within each 32-bit word, the two products are added to the word of VC (3x32-bit sum) to form the word of VT.]

Parallel Table Lookup

➤Example: 64 entry table, 8-Bit entries

Strategy

1. Divide table into two 32-Entry tables

2. Load table into vector registers

3. Load values to be looked up (indices) into VC

4. Do two permutes, one using the upper half of the table as inputs,

one using the lower half

5. Do a vector compare greater than of each index against the constant

0001 1111. This results in 0x00 when the index is 31 or less (entry

is in first half of table) and 0xFF when the index is greater than 31

(entry is in second half of table)

6. Use vector select with the result of the compare to select (MUX) the

correct entry. This is equivalent to a MUX using sixth bit of the

index as the select input.

Parallel Table Lookup

➤Parallel lookup is scalable:

•table can be up to size of vector register file (minus a few)

•32 entry table = 16 lookup/inst

•64 entry table = 4 lookup/inst

•256 entry table = 4/5 lookup/inst

•non-SIMD code ~ 1/2 lookup/inst but table size unlimited

➤Can be further improved

•piecewise linear approx.

➤Applications

•Galois Field multiplication (used in Reed-Solomon ECC)

•image/video processing (color correction)

Parallel Table Lookup

[Figure: index bits 0:4 drive two 32:1 selections (VPERM), one over table entries 0 to 31 in V11 and V12 and one over entries 32 to 63 in V13 and V14. Index bit 5 (VCMPGT) drives the final 2:1 selection (VSEL).]

Parallel Table Lookup

AltiVec code for 64-entry parallel table lookup

16-element parallel lookup into a table of 64 byte elements

; Assume V31 holds the 16 valid 6-bit index values
; which are to be looked up from a 64-element table,
; contained in V11 through V14.
; V02 receives the looked-up values.
; Assume the following replicated constant in V27: 0b00011111.

vperm    V00,V11,V12,V31  ; lookup within first 32 elems
vperm    V01,V13,V14,V31  ; lookup within second 32 elems
vcmpgtub V08,V31,V27      ; this comparison with 0b00011111
                          ; will splat/replicate the 6th bit
vsel     V02,V00,V01,V08  ; use 6th bit for choosing between
                          ; the two lookups
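The four-instruction lookup can be checked against a direct table index with a small Python model. The helper functions mirror the instruction names, but they are only scalar stand-ins; the permute is assumed to use just the low five bits of each index byte, which is what lets both permutes share the same index vector.

```python
def vperm(va, vb, vc):
    source = va + vb
    return bytes(source[c & 0x1F] for c in vc)


def vcmpgtub(va, vb):
    """Unsigned byte compare: 0xFF where va > vb, else 0x00."""
    return bytes(0xFF if x > y else 0x00 for x, y in zip(va, vb))


def vsel(va, vb, vc):
    """Take bits from vb where the mask is 1, from va where it is 0."""
    return bytes((x & ~m & 0xFF) | (y & m) for x, y, m in zip(va, vb, vc))


def lookup64(table, indices):
    """16 parallel lookups into a 64-byte table, following the listing above."""
    v11, v12, v13, v14 = (table[i:i + 16] for i in range(0, 64, 16))
    v27 = bytes([0b00011111]) * 16
    v00 = vperm(v11, v12, indices)   # candidates from entries 0..31
    v01 = vperm(v13, v14, indices)   # candidates from entries 32..63
    v08 = vcmpgtub(indices, v27)     # 0xFF where index > 31
    return vsel(v00, v01, v08)


table = bytes((i * 7 + 3) & 0xFF for i in range(64))
indices = bytes([0, 31, 32, 63, 5, 40, 17, 62, 1, 33, 2, 34, 3, 35, 4, 36])
print(lookup64(table, indices) == bytes(table[i] for i in indices))
```

Both candidate lookups are always computed; the compare result only decides which candidate survives, so no per-element branch is needed.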

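[Figure: 256-byte table lookup. The table is split into eight 32-byte chunks; index bits 0:4 select within each chunk (VPERM), and index bits 5, 6, and 7 each drive one level of VCMP / VSEL 2:1 selection between chunks.]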