makebettergames
About This Talk
• Will discuss how to do quaternion math on PS2
• Assume that you already know and want to use quaternions
• Assume that you already know something about how the VU works

About Me
• Lead engineer at Red Storm Entertainment
• Not a quaternion god
• Not a vector unit god
• Not really familiar with VCL
• Just a 3D guy trying to get by…

About the Code
• Most examples written in macro mode (VU0)
• Easy to translate to micro mode
• Examples that would be faster in micro mode are discussed separately

Matrices on PS2
• PS2 is really well set up to do matrices
• Multiplies are highly parallel
• Not so good for quaternions

Matrix Multiply
• This is what we’re up against
• Takes 4/7 cycles to transform a point
• Takes 16/19 cycles to concat matrices (9/12 cycles for 3x3 matrix)
    vmulax ACC, vf2, vf1x
    vmadday ACC, vf3, vf1y
    vmaddaz ACC, vf4, vf1z
    vmaddw vf6, vf5, vf1w

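As a plain reference for what the four-instruction sequence computes, here is a minimal Python sketch. It assumes vf2 to vf5 hold the four rows of the matrix and vf1 holds the point, so the result is the point times the matrix in row-vector form. The function name `transform_point` and the sample matrix are illustrative only, not part of the talk.

```python
def transform_point(rows, p):
    """Return p * M, where rows holds the four rows of M (row-vector convention)."""
    acc = [0.0, 0.0, 0.0, 0.0]
    for row, component in zip(rows, p):
        # One multiply-accumulate per VU instruction: acc += row * component
        acc = [a + r * component for a, r in zip(acc, row)]
    return acc


# Translation by (5, 6, 7) in row-vector form: the offset sits in the last row.
translate = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [5.0, 6.0, 7.0, 1.0],
]
moved = transform_point(translate, [1.0, 2.0, 3.0, 1.0])
```

Each pass through the loop corresponds to one broadcast multiply-add, which is why the transform costs four instructions.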
Why Quaternions?
• Quaternions take up less space: 4 floats vs. 9 (best case)
• Quaternions interpolate well
• Avoid floating point drift (normalize vs. Gram-Schmidt orthogonalization)

Quaternions on VU
• Fit very well
• Four floats, aligned to 16-byte boundary
• Work just like homogeneous point
• Make sure stored (x,y,z,w) not (w,x,y,z)

Quaternion Multiplication
• If quaternion is (x, y, z, w) or (v, w) then
$$q_1 q_2 = (w_1 \mathbf{v}_2 + w_2 \mathbf{v}_1 + \mathbf{v}_1 \times \mathbf{v}_2,\; w_1 w_2 - \mathbf{v}_1 \cdot \mathbf{v}_2)$$
• All standard vector operations
> Add, scale, dot product, cross product

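Quaternion Mult on PS2
• Interleaves dot product and rest via accumulator
• Takes advantage of linearity of cross product
• Cycle count: 8/11
• Less than matrix!
    vmul vf4, vf1, vf2            # vf4 = (x1x2, y1y2, z1z2, w1w2); kept apart from vf3, which vopmsub overwrites
    vopmula.xyz acc, vf1, vf2     # first half of v1 × v2
    vmaddaw.xyz acc, vf2, vf1w    # acc += w1·v2
    vmaddaw.xyz acc, vf1, vf2w    # acc += w2·v1
    vopmsub.xyz vf3, vf2, vf1     # vf3.xyz = v (second half of cross product)
    vsubaz.w acc, vf4, vf4z       # acc.w = w1w2 - z1z2
    vmsubax.w acc, vf0, vf4x      # acc.w -= x1x2 (vf0.w = 1)
    vmsuby.w vf3, vf0, vf4y       # vf3.w = w
$$w = w_1 w_2 - \mathbf{v}_1 \cdot \mathbf{v}_2$$
$$\mathbf{v} = w_1 \mathbf{v}_2 + w_2 \mathbf{v}_1 + \mathbf{v}_1 \times \mathbf{v}_2$$
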
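For checking VU output against a known-good answer, a scalar reference of the same product is handy. This is a minimal Python sketch using the (x, y, z, w) storage order from the slides; the helper names `cross` and `quat_mul` are illustrative.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def quat_mul(q1, q2):
    """Product q1 q2 of quaternions stored as (x, y, z, w)."""
    v1, w1 = q1[:3], q1[3]
    v2, w2 = q2[:3], q2[3]
    c = cross(v1, v2)
    # Vector part: w1*v2 + w2*v1 + v1 x v2
    v = tuple(w1 * b + w2 * a + k for a, b, k in zip(v1, v2, c))
    # Scalar part: w1*w2 - v1 . v2
    w = w1 * w2 - sum(a * b for a, b in zip(v1, v2))
    return v + (w,)


i = (1.0, 0.0, 0.0, 0.0)
j = (0.0, 1.0, 0.0, 0.0)
k = quat_mul(i, j)   # the unit quaternions satisfy i j = k
```

Swapping the operands flips the sign of the cross-product term, so `quat_mul(j, i)` gives the negated result.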
Vector Rotation
• Formula for vector rotation:
$$p' = q\,p\,q^{-1}, \qquad w_p = 0$$
• Two mults take 16 cycles, plus the inverse
• Can do better

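Vector Rotation, Take Two
• If q is normalized, then can do:
$$\mathbf{p}' = (\mathbf{v} \cdot \mathbf{p})\,\mathbf{v} + w^2\,\mathbf{p} + 2w\,(\mathbf{v} \times \mathbf{p}) + \mathbf{v} \times (\mathbf{v} \times \mathbf{p})$$
• This is faster than two straight multiplies on serial processor
• Faster on vector processor, too!
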
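A quick numerical check that the expanded formula agrees with the two-multiply form can be sketched in Python as follows. It assumes a unit quaternion in (x, y, z, w) order, so the inverse is simply the conjugate; the function names are illustrative.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def quat_mul(q1, q2):
    v1, w1, v2, w2 = q1[:3], q1[3], q2[:3], q2[3]
    c = cross(v1, v2)
    v = tuple(w1 * b + w2 * a + k for a, b, k in zip(v1, v2, c))
    return v + (w1 * w2 - sum(a * b for a, b in zip(v1, v2)),)


def rotate_sandwich(q, p):
    """p' = q p q^-1; for unit q the inverse is the conjugate."""
    conj = (-q[0], -q[1], -q[2], q[3])
    return quat_mul(quat_mul(q, p + (0.0,)), conj)[:3]


def rotate_expanded(q, p):
    """p' = (v.p) v + w^2 p + 2w (v x p) + v x (v x p)."""
    v, w = q[:3], q[3]
    vxp = cross(v, p)
    vxvxp = cross(v, vxp)
    vdp = sum(a * b for a, b in zip(v, p))
    return tuple(vdp * a + w * w * b + 2.0 * w * c + d
                 for a, b, c, d in zip(v, p, vxp, vxvxp))


half = math.radians(90.0) / 2.0
q = (0.0, 0.0, math.sin(half), math.cos(half))   # 90 degrees about z
via_sandwich = rotate_sandwich(q, (1.0, 0.0, 0.0))
via_expanded = rotate_expanded(q, (1.0, 0.0, 0.0))
```

Both paths send the x axis to the y axis for this quaternion; the expanded form needs two cross products and one dot product instead of two full quaternion products.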
Vector Rotation on VU
• p in vf1, q in vf2
    vmul vf11, vf1, vf2
    vopmula.xyz acc, vf2, vf1
    vopmsub.xyz vf5, vf1, vf2
    vmul.w vf6w, vf2w, vf2w
    vadd.w vf7w, vf2w, vf2w
    vmulax.w accw, vf0w, vf11x
    vmadday.w accw, vf0w, vf11y
    vmaddz.w vf11w, vf0w, vf11z
    vopmula.xyz acc, vf2, vf5
    vmaddaw.xyz acc, vf5, vf7w
    vmaddaw.xyz acc, vf1, vf6w
    vmaddaw.xyz acc, vf2, vf11w
    vopmsub.xyz vf3, vf5, vf2
$$\mathbf{p}' = (\mathbf{v} \cdot \mathbf{p})\,\mathbf{v} + w^2\,\mathbf{p} + 2w\,(\mathbf{v} \times \mathbf{p}) + \mathbf{v} \times (\mathbf{v} \times \mathbf{p})$$

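Vector Rotation on VU
• First part builds all the pieces
• Second part adds ‘em all together
• Cycles: 13/16
• Better than straight multiply
• Worse than matrix
    # First part: build the pieces
    vmul vf11, vf1, vf2             # componentwise products for v • p
    vopmula.xyz acc, vf2, vf1
    vopmsub.xyz vf5, vf1, vf2       # vf5 = v × p
    vmul.w vf6w, vf2w, vf2w         # vf6.w = w²
    vadd.w vf7w, vf2w, vf2w         # vf7.w = 2w
    vmulax.w accw, vf0w, vf11x
    vmadday.w accw, vf0w, vf11y
    vmaddz.w vf11w, vf0w, vf11z     # vf11.w = v • p
    # Second part: add them all together
    vopmula.xyz acc, vf2, vf5       # first half of v × (v × p)
    vmaddaw.xyz acc, vf5, vf7w      # acc += 2w·(v × p)
    vmaddaw.xyz acc, vf1, vf6w      # acc += w²·p
    vmaddaw.xyz acc, vf2, vf11w     # acc += (v • p)·v
    vopmsub.xyz vf3, vf5, vf2       # vf3 = p'
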
Full Transforms
• Combination of translation vector t, quat r, 3 scale factors s
• Once again, want to transform point
• Basic formula:
$$\mathbf{p}' = \mathbf{t} + r\,(\mathbf{s}\,\mathbf{p})\,r^{-1}$$

Point Transformation
• p in vf1, q in vf2
• scale in vf3
• translation in vf4
• Takes four extra cycles for scale (including stalls), one extra for xlate
• Cycle count: 18/21
    vmul vf1, vf1, vf3
    vmul vf11, vf1, vf2
    vopmula.xyz acc, vf2, vf1
    vopmsub.xyz vf5, vf1, vf2
    vmul.w vf6w, vf2w, vf2w
    vadd.w vf7w, vf2w, vf2w
    vmulax.w accw, vf0w, vf11x
    vmadday.w accw, vf0w, vf11y
    vmaddz.w vf11w, vf0w, vf11z
    vopmula.xyz acc, vf2, vf5
    vmaddaw.xyz acc, vf5, vf7w
    vmaddaw.xyz acc, vf1, vf6w
    vmaddaw.xyz acc, vf2, vf11w
    vmaddaw.xyz acc, vf4, vf0w
    vopmsub.xyz vf3, vf5, vf2

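The same scale, rotate, translate order can be written as a scalar reference. The sketch below is Python, assumes a unit quaternion in (x, y, z, w) order and per-component scale, and uses the illustrative names `rotate` and `transform_point`.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(q, p):
    """Rotate 3-vector p by unit quaternion q using the expanded formula."""
    v, w = q[:3], q[3]
    vxp = cross(v, p)
    vxvxp = cross(v, vxp)
    vdp = sum(a * b for a, b in zip(v, p))
    return tuple(vdp * a + w * w * b + 2.0 * w * c + d
                 for a, b, c, d in zip(v, p, vxp, vxvxp))


def transform_point(t, r, s, p):
    """p' = t + r (s p) r^-1: scale first, then rotate, then translate."""
    scaled = tuple(a * b for a, b in zip(s, p))
    rotated = rotate(r, scaled)
    return tuple(a + b for a, b in zip(t, rotated))


h = math.sqrt(0.5)
quarter_turn_z = (0.0, 0.0, h, h)            # 90 degrees about z
result = transform_point((10.0, 0.0, 0.0), quarter_turn_z,
                         (2.0, 1.0, 1.0), (1.0, 0.0, 0.0))
```

Here the point (1, 0, 0) is stretched to (2, 0, 0), turned onto the y axis, and then offset by the translation.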
Transform Concatenation
• Look at formula:
$$\mathbf{t} = \mathbf{t}_2 + r_2\,(\mathbf{s}_2\,\mathbf{t}_1)\,r_2^{-1}$$
$$r = r_2\,r_1$$
$$\mathbf{s} = \mathbf{s}_1\,\mathbf{s}_2$$
• Have to transform point and multiply two quaternions and multiply scales

Transform Concatenation
• Takes 8 cycles for quat multiply, 18 for transform, 1 for scale
• Have three stall cycles available
• Bottom line: 24/27 cycles
• Much slower than matrix multiplication
• Not recommended

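The concatenation formula can be sanity-checked with a small Python sketch. Transforms are represented here as hypothetical `(t, r, s)` tuples, which is an assumption for illustration only. One caveat worth stating: multiplying the scales per component matches applying the two transforms in sequence exactly only when the scale commutes with the rotation, for example uniform scale, which is what the example uses.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def quat_mul(q1, q2):
    v1, w1, v2, w2 = q1[:3], q1[3], q2[:3], q2[3]
    c = cross(v1, v2)
    v = tuple(w1 * b + w2 * a + k for a, b, k in zip(v1, v2, c))
    return v + (w1 * w2 - sum(a * b for a, b in zip(v1, v2)),)


def rotate(q, p):
    v, w = q[:3], q[3]
    vxp = cross(v, p)
    vxvxp = cross(v, vxp)
    vdp = sum(a * b for a, b in zip(v, p))
    return tuple(vdp * a + w * w * b + 2.0 * w * c + d
                 for a, b, c, d in zip(v, p, vxp, vxvxp))


def apply(xf, p):
    """p' = t + r (s p) r^-1 for a transform xf = (t, r, s)."""
    t, r, s = xf
    scaled = tuple(a * b for a, b in zip(s, p))
    return tuple(a + b for a, b in zip(t, rotate(r, scaled)))


def concat(xf1, xf2):
    """Transform equivalent to applying xf1 first, then xf2."""
    t1, r1, s1 = xf1
    t2, r2, s2 = xf2
    t = apply(xf2, t1)                         # t = t2 + r2 (s2 t1) r2^-1
    r = quat_mul(r2, r1)                       # r = r2 r1
    s = tuple(a * b for a, b in zip(s1, s2))   # s = s1 s2, per component
    return (t, r, s)


h = math.sqrt(0.5)
first = ((1.0, 2.0, 3.0), (0.0, 0.0, h, h), (2.0, 2.0, 2.0))     # 90 deg about z
second = ((-1.0, 0.0, 4.0), (h, 0.0, 0.0, h), (0.5, 0.5, 0.5))   # 90 deg about x
p = (1.0, 1.0, 1.0)
combined = apply(concat(first, second), p)
sequential = apply(second, apply(first, p))
```

The translation part is itself a full point transform, which is where most of the concatenation cost comes from.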
Matrix Conversion
• Quat-vector transformation not as efficient as matrix-vector transformation (13 cycles vs. 4)
• To do multiple points, want to convert quaternion to a 4x4 matrix

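Matrix Conversion
• Corresponding 4x4 matrix to normalized quat q = (x,y,z,w) is (rows as held in registers, row-vector convention):
$$
M_q = \begin{bmatrix}
1-2y^2-2z^2 & 2xy+2wz & 2xz-2wy & 0 \\
2xy-2wz & 1-2x^2-2z^2 & 2yz+2wx & 0 \\
2xz+2wy & 2yz-2wx & 1-2x^2-2y^2 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$
• Not obvious how to do this efficiently
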
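Written out directly, the conversion looks like the following Python sketch. It assumes a unit quaternion in (x, y, z, w) order and produces the matrix as four rows for use with row vectors (point times matrix); `quat_to_rows` and `mul_row_vector` are illustrative names.

```python
import math


def quat_to_rows(q):
    """Rows of the 4x4 rotation matrix for unit quaternion q = (x, y, z, w)."""
    x, y, z, w = q
    return [
        [1 - 2*y*y - 2*z*z, 2*x*y + 2*w*z,     2*x*z - 2*w*y,     0.0],
        [2*x*y - 2*w*z,     1 - 2*x*x - 2*z*z, 2*y*z + 2*w*x,     0.0],
        [2*x*z + 2*w*y,     2*y*z - 2*w*x,     1 - 2*x*x - 2*y*y, 0.0],
        [0.0,               0.0,               0.0,               1.0],
    ]


def mul_row_vector(p, rows):
    """p * M for a 4-component row vector p."""
    return [sum(p[k] * rows[k][j] for k in range(4)) for j in range(4)]


h = math.sqrt(0.5)
rows = quat_to_rows((0.0, 0.0, h, h))              # 90 degrees about z
rotated = mul_row_vector([1.0, 0.0, 0.0, 1.0], rows)
```

For this quaternion the x axis lands on the y axis, matching the direct quaternion rotation.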
Matrix Conversion
• Two approaches
• One works well in macro mode
• One in micro mode
> uses Lower instructions to achieve better parallelism

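Matrix Conversion (macro)
• Idea: matrix is built from two other matrices
$$
M_q =
\begin{bmatrix}
w & z & -y & x \\
-z & w & x & y \\
y & -x & w & z \\
-x & -y & -z & w
\end{bmatrix}
\begin{bmatrix}
w & z & -y & -x \\
-z & w & x & -y \\
y & -x & w & -z \\
x & y & z & w
\end{bmatrix},
\qquad q = (x, y, z, w)
$$
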
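The factorization is easy to confirm numerically. The Python sketch below builds both factors for a unit quaternion, multiplies them, and compares against the direct matrix formula. The sign placement in the two factors is inferred from the register contents and multiply-add sequence in the macro-mode listings, so treat it as an assumption; all function names are illustrative.

```python
import math


def left_matrix(q):
    x, y, z, w = q
    return [[ w,  z, -y, x],
            [-z,  w,  x, y],
            [ y, -x,  w, z],
            [-x, -y, -z, w]]


def right_matrix(q):
    x, y, z, w = q
    return [[ w,  z, -y, -x],
            [-z,  w,  x, -y],
            [ y, -x,  w, -z],
            [ x,  y,  z,  w]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def quat_to_rows(q):
    x, y, z, w = q
    return [
        [1 - 2*y*y - 2*z*z, 2*x*y + 2*w*z,     2*x*z - 2*w*y,     0.0],
        [2*x*y - 2*w*z,     1 - 2*x*x - 2*z*z, 2*y*z + 2*w*x,     0.0],
        [2*x*z + 2*w*y,     2*y*z - 2*w*x,     1 - 2*x*x - 2*y*y, 0.0],
        [0.0,               0.0,               0.0,               1.0],
    ]


n = math.sqrt(30.0)
q = (1.0 / n, 2.0 / n, 3.0 / n, 4.0 / n)          # arbitrary unit quaternion
product = matmul(left_matrix(q), right_matrix(q))
expected = quat_to_rows(q)
```

The last row and last column of the product collapse to (0, 0, 0, 1) only because q has unit length.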
Matrix Conversion (macro)
• Simplification: matrix multiply is series of row vector multiplies
• Create right matrix, generate left matrix via accumulator tricks
$$
R_q = \begin{bmatrix}
w & z & -y & -x \\
-z & w & x & -y \\
y & -x & w & -z \\
x & y & z & w
\end{bmatrix}
$$

Matrix Conversion (macro)
• Look at one row in matrix multiply:
    vmulax ACC, vf5, vf1x
    vmadday ACC, vf6, vf1y
    vmaddaz ACC, vf7, vf1z
    vmaddw vf9, vf8, vf1w
• Or could just do:
    vmulaw ACC, vf8, vf1w
    vmadday ACC, vf6, vf1y
    vmaddaz ACC, vf7, vf1z
    vmaddx vf9, vf5, vf1x
• Is linear, so order doesn’t matter

Matrix Conversion (macro)
• Idea: all values we need for left matrix are in quaternion
• Load accumulator with mula by w value (always positive)
• vmadd or vmsub to multiply by positive or negative value and accumulate
    vmulaw.xyz acc, vf2, vf5w
    vmaddax.xyz acc, vf3, vf5x
    vmadday.xyz acc, vf4, vf5y
    vmsubz.xyz vf13, vf1, vf5z
$$
\mathrm{vf13} = \begin{bmatrix} -z & w & x & y \end{bmatrix}
\begin{bmatrix} \mathrm{vf1} \\ \mathrm{vf2} \\ \mathrm{vf3} \\ \mathrm{vf4} \end{bmatrix},
\qquad \mathrm{vf5} = (x, y, z, w)
$$

Matrix Conversion (macro)
• More simplification:
> Last row of Mq always (0,0,0,1), don’t compute!
> Last column always 0 too, don’t compute!
> Last row of Rq just the quat in VU format
• Just build:
$$
R_q = \begin{bmatrix}
w & z & -y & \sim \\
-z & w & x & \sim \\
y & -x & w & \sim \\
x & y & z & \sim
\end{bmatrix}
$$

Matrix Conversion (macro)
• Stage one:
> Load quat in vf4
> Build right matrix
> Clear right column of result
    vaddw.x vf1, vf0, vf4
    vaddz.y vf1, vf0, vf4
    vsuby.z vf1, vf0, vf4
    vsubz.x vf2, vf0, vf4
    vaddw.y vf2, vf0, vf4
    vaddx.z vf2, vf0, vf4
    vaddy.x vf3, vf0, vf4
    vsubx.y vf3, vf0, vf4
    vaddw.z vf3, vf0, vf4
    vmr32.w vf12, vf0
    vmr32.w vf13, vf0
    vmr32.w vf14, vf0
Register contents afterwards:
    vf1 = (w, z, -y, ~)
    vf2 = (-z, w, x, ~)
    vf3 = (y, -x, w, ~)
    vf4 = (x, y, z, w)

Matrix Conversion (macro)
• Stage two:
> Matrix multiply to get first three rows
> Clear bottom row
    vmulaw.xyz acc, vf1, vf4w
    vmaddaz.xyz acc, vf2, vf4z
    vmsubay.xyz acc, vf3, vf4y
    vmaddx.xyz vf12, vf4, vf4x
    vmulaw.xyz acc, vf2, vf4w
    vmaddax.xyz acc, vf3, vf4x
    vmadday.xyz acc, vf4, vf4y
    vmsubz.xyz vf13, vf1, vf4z
    vmulaw.xyz acc, vf3, vf4w
    vmaddaz.xyz acc, vf4, vf4z
    vmadday.xyz acc, vf1, vf4y
    vmsubx.xyz vf14, vf2, vf4x
    vmove.xyzw vf15, vf0
• Note: accumulate only on xyz (w already cleared)
• Cycles: 25/28

Matrix Conversion (micro)
• Lots of duplicate calculations in matrix
$$
M_q = \begin{bmatrix}
1-2y^2-2z^2 & 2xy+2wz & 2xz-2wy & 0 \\
2xy-2wz & 1-2x^2-2z^2 & 2yz+2wx & 0 \\
2xz+2wy & 2yz-2wx & 1-2x^2-2y^2 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$
• Idea: calculate only what we need, use shifting and accumulator tricks to parallelize efficiently
• Devised by Colin Hughes of SCEE

Matrix Conversion (micro)
    mula acc, vf1, vf1            loi SQRT_2
    muli vf3, vf1, I              mr32.w vf24, vf0
    madd vf2, vf1, vf1            nop
    addw vf4, vf0, vf0w           nop
    opmula acc, vf3, vf3          move vf27, vf0
    msubw vf5, vf3, vf3w          mr32.w vf26, vf0
    maddw vf6, vf3, vf3w          mr32.w vf25, vf0
    addaw.xyz acc, vf0, vf0w      nop
    msubax.yz acc, vf4, vf2x      nop
    msuby.z vf26, vf4, vf2y       mr32 vf3, vf5
    msubay.xz acc, vf4, vf2y      mr32 vf7, vf6
    msubz.y vf25, vf4, vf2z       mr32.y vf24, vf5
    msubz.x vf24, vf4, vf2z       mr32.x vf26, vf5
    addy.z vf24, vf0, vf6y        mr32.z vf25, vf3
    addx.y vf26, vf0, vf6x        mr32.x vf25, vf7
• Three parts
• Calculate elements
• Clear matrix
• Shift, add and copy into place
• 16/19 cycles

Matrix Conversion
• If you’re converting a quaternion and going to use it immediately, can make some assumptions
• Don’t create bottom row (just use vf0)
• Don’t clear right column (just use xyz)
• Saves four cycles in macro mode case

Transform to Matrix
• Use one of the quaternion matrix techniques
• Scale first three rows by each scale factor
• Replace last row with translation
• Results:
> 29/32 for macro mode
> 20/23 for micro mode

Normalization
• Need to normalize quaternion to keep it useful for rotation
> (Also avoids floating point drift)
• Fortunately PS2 has reciprocal square root instruction
• Unfortunately it takes a while

Normalization
• Compute dot product
• Compute 1/length
• Scale quaternion
• With stalls, takes 24/27 cycles
    vmul vf2, vf1, vf1
    vaddaz.w acc, vf2, vf2
    vmaddax.w acc, vf0, vf2
    vmaddy.w vf2, vf0, vf2
    vrsqrt Q, vf0w, vf2w
    vwaitq
    vmulq vf1, vf1, Q

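The three steps (dot product, reciprocal length, scale) map onto a few lines of scalar code. This Python sketch is only a reference for checking results; `normalize` is an illustrative name.

```python
import math


def normalize(q):
    """Scale a 4-component quaternion to unit length."""
    length_sq = sum(c * c for c in q)        # dot product of q with itself
    inv_len = 1.0 / math.sqrt(length_sq)     # the step vrsqrt performs on the VU
    return tuple(c * inv_len for c in q)


unit = normalize((0.0, 0.0, 3.0, 4.0))
```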
Normalization
• Another approach
> From “The Inner Product”, March 2002 Game Developer, by Jonathan Blow
> Approximate 1/√x via Newton-Raphson iteration
> First iteration takes (looks like) 4/7 cycles on VU0
> Second iteration takes as long as RSQRT
> Recommend: if x > 0.91521198, use approx
> Otherwise use RSQRT

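To make the idea concrete, here is a minimal Python sketch of a single Newton-Raphson step for 1/√x starting from the guess 1, with the threshold from the slide selecting between it and the exact value. The plain Newton step is an assumption made for illustration; the article may use tuned constants instead. The sketch also assumes x is the squared length of a quaternion that is already close to unit length, so x stays near 1.

```python
import math

APPROX_THRESHOLD = 0.91521198   # cutoff quoted on the slide


def inv_sqrt_newton(x):
    """One Newton-Raphson step for 1/sqrt(x) from the initial guess y0 = 1."""
    # General step: y1 = y0 * (3 - x * y0^2) / 2; with y0 = 1 this is (3 - x) / 2.
    return 0.5 * (3.0 - x)


def inv_length(length_sq):
    """Approximate near 1, exact otherwise (stand-in for RSQRT)."""
    if length_sq > APPROX_THRESHOLD:
        return inv_sqrt_newton(length_sq)
    return 1.0 / math.sqrt(length_sq)


approx = inv_length(0.95)
exact = 1.0 / math.sqrt(0.95)
```

The single step costs one subtract and one multiply, which is why it is attractive when the input has only drifted slightly from unit length.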
Interpolation
• This is where it’s at
• It would be great if it was fast
• Um, well…

Interpolation
• First look at spherical linear interp
$$\mathrm{slerp}(q, r, t) = \frac{\sin((1-t)\theta)\,q + \sin(t\theta)\,r}{\sin\theta}$$
• That’s a lot of sines
• Could precompute θ, 1/sin θ
• But at least 28 cycles for one of the other sines
• We (RSE) don’t use slerp anyway

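For comparison with the cheaper methods, a direct scalar slerp can be sketched in Python as below. It assumes unit quaternions in (x, y, z, w) order; the small-angle fallback is an addition for numerical safety, not something from the slides.

```python
import math


def slerp(q, r, t):
    """Spherical linear interpolation between unit quaternions q and r."""
    cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(q, r))))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if sin_theta < 1e-6:
        return q   # inputs nearly parallel: the formula would divide by ~0
    a = math.sin((1.0 - t) * theta) / sin_theta
    b = math.sin(t * theta) / sin_theta
    return tuple(a * qi + b * ri for qi, ri in zip(q, r))


h = math.sqrt(0.5)
identity = (0.0, 0.0, 0.0, 1.0)
quarter_turn_z = (0.0, 0.0, h, h)                  # 90 degrees about z
halfway = slerp(identity, quarter_turn_z, 0.5)     # 45 degrees about z
```

Three sine evaluations and an arccosine per interpolation is the cost the slide is complaining about.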
Interpolation
• Lerp, then
$$\mathrm{lerp}(q, r, t) = (1-t)\,q + t\,r = q - t\,q + t\,r$$
is simply (q in vf1, r in vf2, t in vf3w)
    vaddax acc, vf1, vf0x
    vmsubaw acc, vf1, vf3w
    vmaddw vf1, vf2, vf3w
• Need to normalize afterwards
• Makes 30/33 cycles

Interpolation
• Not quite that simple
• Problem: if q•r < 0, interpolation will take long way around sphere
• Need to negate one quat
• Gives the same orientation, but the interpolation takes the short route

Linear Interpolation
• Compute dot product
• Check for negative
• Interpolate
• Follow up with normalization
• Takes 43/46 cycles
    vmul vf4, vf1, vf2
    vaddaz.w acc, vf4, vf4
    vmaddax.w acc, vf0, vf4
    vmaddy.w vf4, vf0, vf4        # vf4.w = q • r
    vnop
    vnop
    vnop
    cfc2 t0, $16                  # read VU0 status flags
    and t0, t0, 0x0002            # isolate the sign flag
    vaddax acc, vf1, vf0x         # acc = q
    beq t0, zero, Add
    vmsubaw acc, vf2, vf3w        # negative dot product: acc -= t·r
    b Finish
Add:
    vmaddaw acc, vf2, vf3w        # otherwise: acc += t·r
Finish:
    vmsubw vf1, vf1, vf3w         # vf1 = acc - t·q

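The same four steps can be written as a scalar reference for testing. This Python sketch assumes unit quaternions in (x, y, z, w) order; `nlerp` is an illustrative name for lerp followed by normalization.

```python
import math


def nlerp(q, r, t):
    """Lerp along the short route, then normalize."""
    d = sum(a * b for a, b in zip(q, r))          # dot product
    if d < 0.0:
        r = tuple(-c for c in r)                  # same orientation, short way round
    blended = tuple(a - t * a + t * b for a, b in zip(q, r))   # q - t q + t r
    inv_len = 1.0 / math.sqrt(sum(c * c for c in blended))
    return tuple(c * inv_len for c in blended)


h = math.sqrt(0.5)
identity = (0.0, 0.0, 0.0, 1.0)
quarter_turn_z = (0.0, 0.0, h, h)                 # 90 degrees about z
flipped = tuple(-c for c in quarter_turn_z)       # same rotation, opposite sign
mid_a = nlerp(identity, quarter_turn_z, 0.5)
mid_b = nlerp(identity, flipped, 0.5)
```

Feeding in either sign of the second quaternion gives the same result, which is the point of the dot-product check. At t = 0.5 the normalized lerp coincides with slerp; elsewhere it deviates slightly in angle.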
Linear Interpolation
• There’s more we can do
• Jonathan Blow’s article, again
• Use spline to correct error in lerp
• More investigation needed
• Initial results: takes about 24-26 more cycles
• Looks faster than slerp, more accurate than lerp

How We’re Using All This
• A bit research-y at the moment
• VU0-based math library
• Optimization in specific routines
• In particular, concatenation and interpolation for bones animation
• More memory savings: store quat as 4.12 fixed-point shorts

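The fixed-point storage idea can be illustrated with a short Python sketch. It reads "4.12" as a signed 16-bit value with 12 fractional bits (scale factor 4096) and packs the four components little-endian; both the byte order and the function names are assumptions for illustration.

```python
import struct

FRACTION_BITS = 12
SCALE = 1 << FRACTION_BITS   # 4096 steps per unit


def pack_quat(q):
    """Pack four floats into four signed 16-bit fixed-point values (8 bytes)."""
    shorts = [max(-32768, min(32767, round(c * SCALE))) for c in q]
    return struct.pack('<4h', *shorts)


def unpack_quat(data):
    """Recover floats from the packed form; renormalize before use as a rotation."""
    return tuple(s / SCALE for s in struct.unpack('<4h', data))


original = (0.1, -0.2, 0.3, 0.9)
packed = pack_quat(original)
restored = unpack_quat(packed)
```

This halves the storage from 16 bytes to 8 per quaternion, at the price of a quantization error of at most half a step (1/8192) per component.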
Conclusions
• Quaternions useful on PS2
• Cheaper to concatenate (alone)
• Convert to matrix to transform
• Use linear interpolation
• Check out Jonathan Blow’s article

References
• Shoemake, Ken, “Animating Rotation with Quaternion Curves,” Computer Graphics, Vol. 19, No. 3 (July 1985).
• EE Core Instruction Set Manual
• VU User’s Manual
• Sony newsgroups
• Blow, Jonathan, “Hacking Quaternions,” Game Developer, Vol. 9, No. 3 (March 2002). [get updated source from www.gdmag.com/code.htm]

Questions?
• Please hand in comment sheets
• Slides available at:
http://obiwan.redstorm.com/~jimvv