It’s a Fast, Smooth, Well-lit World Out There: Confetti, MLAA, and

Intel® Microarchitecture Codename Sandy Bridge Remodel Unreality

By Nancy Nicolaisen

As documented by extensive neurobiologic research, about 80 percent of the entire bandwidth of

the human brain is devoted to vision. Another 12 percent or so is devoted to fun. This second bit is

conjecture but represents a reasonable guess based on casual observation of the electronic

entertainment industry. All this goes a long way toward explaining why, of all the dynamic, rapidly

evolving fields of technology, computer-generated visuals are among the most vibrant in terms of

innovation. One thing that makes graphics software development such fertile ground for

innovators is that as rendering algorithms have become increasingly sophisticated, cutting-edge

graphics hardware has advanced dramatically in performance and fallen in price. This trend is

making systems formerly accessible to the fortunate few commonplace in home and personal

gaming systems. High-end graphics developers Wolfgang Engel and Peter Santoki were in the best

imaginable position to see the opportunity this trajectory posed when they decided to strike out

on their own and leave Rockstar Games, publisher of some of the most successful games of all

time.

Engel and Santoki enjoyed the sort of jobs that most in the game industry would kill for: Engel was

lead graphics programmer and a key game engine developer, while Santoki was an acknowledged

wizard in the creation and application of visual effects. However, the two friends and colleagues

hungered for that elusive “something more.” Driven, as Engel puts it, “to follow their dream,” the

two founded Confetti Special Effects Inc. to pursue research in leading-edge graphics

programming technology and to become a game middleware powerhouse (Figure 1). Dreams, it

seems, can come true. Confetti has become one of the earliest providers of advanced graphics

middleware tools based on Morphological Anti-aliasing (MLAA) technology.

2 Confetti, MLAA, and Intel® Sandy Bridge

v4a DRAFT 14 February 2011

Figure 1. Launching a rocket using depth of field, lit soft particles, and point light shadows

Smooth Those Edges and Make It Snappy

MLAA is a relatively new technique but has the potential to dramatically improve game

experiences. Initially showcased on high-end graphics processing units (GPUs), the technology

enables developers to boost the image quality of their games efficiently and in real time. To

understand how much improvement MLAA represents, you need to look back. Historically,

dynamically rendering computer graphics dictated a simple trade-off: You could have it look good,

happen fast, or run well on inexpensive platforms, but it was a “pick any two” situation. In part,

this limitation stems from the physiology of human eyesight. It turns out that, given a random

sample of the population, the number of color-sensitive cones on one person's retina may differ by

up to 40 times relative to that of another person, but there will be no measurable difference in

their ability to see. This is because we see mostly with our brains, not our eyes. More important

still, we don't “see” everything in our field of view equally.

Human vision is fine-tuned to detect alignment, which is why edges that appear to have “stair

steps” are much more noticeable and visually displeasing in video graphics than minor variations

in color, to which we are significantly less sensitive. (The phenomenon of blocky, stair-stepped

edges is known by computer scientists as spatial aliasing.) For these reasons, it has long been

standard practice in graphics programming to “soften away” spatial aliases by applying gradients

of color change at edge boundaries, commonly known as anti-aliasing (AA). Detecting edge

boundaries in complex shapes—particularly shapes with holes in them, like grates or fences—is a

demanding problem that can be computationally costly to solve, particularly in the case of

repetitively used graphic objects such as texture maps. Although many programming techniques

exist to smooth edges and give textures a pleasing or realistic look and feel, they all suffer from

one or both of these problems: They are resource intensive, requiring extraordinary amounts of

computing power for reasonable performance, or they omit remediation of some amount of

aliasing, causing images to appear to shimmer or have obvious remaining “jaggies.” One of

Confetti’s noteworthy early successes was to create cutting-edge AA middleware tools for use by

game developers. Those AA tools finally achieve the graphic programmer’s trifecta of happy

outcomes: They create smooth, sinuous edges; they’re fast; and they run well on the most common

configurations used by PC gamers (Figure 2).

Figure 2. Launching a rocket using depth particles, fire, point lights and shadows, and sparks

Confetti’s Road to MLAA

Why did Confetti settle on a plan to implement MLAA in its product offering? Confetti CTO Engel, a

widely published authority on computer-generated visuals, reflects here on his previous

experience with AA techniques:

“I wrote a significant part of the game engine used in Rockstar’s products, so optimizing image quality is

something I’ve thought about a lot. If you use older software anti-aliasing strategies like Multi Sampling

Anti-aliasing (MSAA), you get excellent image quality, but it can quickly become expensive in terms of performance

impacts and sacrifices gamers with low-end platforms. Similarly, if you code to hardware anti-aliasing, then you

limit your audience size to a specific installed hardware base. MLAA seemed to me to have the potential to

overcome a lot of barriers for us at once, because it works in a compartment image space, making it efficient and

portable.”

Engel explains the conceptual way in which Confetti implements MLAA in its deferred lighting

engine like this:

1. Confetti’s engine detects edges, discovering verticals and horizontals.

2. Edges are assigned values.

3. Based on the edge values, the image space is blurred in horizontal and vertical directions.

“It sounds simple,” says Engel, “but there are challenges. How do you detect edges? And how do

you apply a chroma filter in vertical and horizontal directions?”
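
The three steps above can be sketched on the CPU in a few lines of Python. This is a deliberately simplified illustration, not Confetti's implementation: a production MLAA pass classifies edge shapes and computes coverage areas, whereas this sketch only thresholds contrast, uses that contrast as the edge value, and blends across flagged edges. The names `edge_value` and `smooth_edges`, the threshold, and the blend strength are all assumptions made for the example.

```python
def edge_value(a, b, threshold):
    """Steps 1 and 2: flag an edge between two pixels and give it a value.

    The value is the luminance contrast, or 0.0 if it is below the threshold.
    """
    contrast = abs(a - b)
    return contrast if contrast > threshold else 0.0


def smooth_edges(img, threshold=0.1, strength=0.25):
    """Step 3: blend each pixel toward its neighbours across detected edges.

    img is a list of rows of luminance values in [0, 1]. Only the four
    horizontal and vertical neighbours are considered.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            p = img[y][x]
            total, weight = 0.0, 0.0
            for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h:
                    e = edge_value(p, img[ny][nx], threshold)
                    total += e * img[ny][nx]
                    weight += e
            if weight > 0.0:
                # Weighted average of the neighbours that lie across an edge.
                neighbour_avg = total / weight
                out[y][x] = (1.0 - strength) * p + strength * neighbour_avg
    return out
```

Because the filter needs only the final image, flat regions pass through untouched and only pixels next to a detected edge change, which is what makes an image-space approach cheap.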

One thing that most impressed the Confetti team in their early MLAA

implementation experience was the technology’s performance. Their MLAA-reliant lighting engine

ran as fast as if there were no AA being performed at all. Image quality was another standout

benefit.

“The quality is really close to MSAA, but doing anti-aliasing in image space is so much simpler and easier. It gives

you all kinds of flexibility, like the ability to switch anti-aliasing on and off without rebooting the whole game. For

example, you can do things like toggling anti-aliasing on or off depending on the workload or the camera style.”
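
A minimal sketch of what such a workload-based toggle could look like follows. Everything in it is hypothetical: the class name `AdaptiveAAToggle`, the frame budget, and the margin are illustrative assumptions, not part of Confetti's middleware. The margin adds hysteresis so the pass does not flicker on and off when the frame time hovers near the budget.

```python
class AdaptiveAAToggle:
    """Switch an image-space AA pass on or off based on measured frame time."""

    def __init__(self, budget_ms=16.7, margin_ms=2.0):
        self.budget_ms = budget_ms    # assumed target frame time
        self.margin_ms = margin_ms    # headroom required before re-enabling
        self.enabled = True

    def update(self, frame_ms):
        """Feed the last frame time; returns whether AA should run next frame."""
        if self.enabled and frame_ms > self.budget_ms:
            self.enabled = False
        elif not self.enabled and frame_ms < self.budget_ms - self.margin_ms:
            self.enabled = True
        return self.enabled
```

Because the pass operates on the finished frame, skipping it requires no change to render targets or geometry passes, which is why no restart is needed.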

The Intel® Microarchitecture Codename Sandy Bridge Piece of the Solution

It is not surprising that MLAA image enhancement works beautifully with the 2nd generation

Intel® Core™ processor family (featuring Intel microarchitecture codename Sandy Bridge),

because in a real sense, the two were made for each other (see Figure 3 and Figure 4). The

efficiency of this pairing wasn't lost on Engel and his team. When asked to describe their typical

workflow and design process, he made these observations:

“We look at the hardware first and see what it can do. From there, we try to come up with assumptions about

how we can get a certain effect from a given hardware configuration. Of course, at a very early point in our

evaluations, we also look at potential market penetration for a given platform. We closely monitor the monthly

Steam hardware survey to identify trends.

“In December 2010, I believe the last survey showed something like 74 percent Intel® architecture usage. Given

these statistics, one thing we realized when we first looked at Sandy Bridge was that it would certainly have a

large market presence. In our early evaluation group, we thought ‘Oh, that’s so cool, because suddenly we have

a very powerful platform that will have lots of users.’ The next thought was, ‘How can we use this?’”

(Steam conducts a voluntary survey to collect data about what kinds of computer hardware and

software people are using. Find it at http://store.steampowered.com/hwsurvey.)



Figure 3. MLAA before and after. The image on the left is rendered without MLAA; the image on

the right has MLAA applied.

Figure 4. Detail: Notice the spatial aliasing in the shoulder line before applying MLAA and the

smoother profile of the right image after the MLAA processing.

Intel microarchitecture codename Sandy Bridge provided many advantages to Confetti. It offered

increased memory, much better rendering performance, and flexibility in designing new solutions.

“I’ve worked a great deal with previous generations of graphics hardware, where it was very difficult to do

deferred lighting because the memory bandwidth was just not there. We did a couple of test runs on [Intel

microarchitecture codename] Sandy Bridge, and it really surprised us with the fact that there was suddenly so

much more memory bandwidth. That was actually the first time we could run on processor graphics.”

(Processor graphics are a defining feature of the 2nd Generation Intel® Core™ i7/i5/i3

processors, wherein all the graphics capabilities are built right into the CPU chip.) Engel and his

team were most pleased about being able to run with their new shadow system on a piece of

hardware that will be common in the market. As Engel notes:

“What really surprised us was that when we started to do performance measurements, we found that the cost

per light and the cost per shadow map were really good. We all still have our gaming console hardware and lots



of specialized entertainment hardware, so coming from that background, we were very surprised that we can use

more lights and we can render more shadows on the [Intel] Sandy Bridge platforms. We are quite excited about

this, because it means that whatever we do now will run on the majority of PCs.”

Doing the Metrics

Confetti did a lot of analysis before implementing its deferred lighting engine, because it was a

“no-going-back” kind of step. In earlier versions of the company’s middleware and tools, the team

left this feature out. Seductive as it was, it was simply too big a market risk. Says Engel:

“Two years ago, we were thinking about doing deferred lighting and decided against including it, because the

low-end consumer platforms just couldn’t handle it. They really couldn’t do it at all. And because deferred

lighting influences the look and feel of your game, we couldn’t just say, ‘Okay let’s have a fall-back method.’ If

you have deferred lighting in your engine, you can’t dynamically drop back to a low-end approach.”

The Confetti team had to be sure they were creating a product that could be used by as wide a

variety of gamers as possible, and they needed empirical performance metrics to back up their

decision, so they used the Intel® Graphics Performance Analyzers (Intel® GPA) to give solid

foundations to their estimations of rendering performance and characteristics.

Engel and his peers liked Intel GPA—a lot:

“Let me just first say GPA is great. It’s awesome. For someone like me, coming from a game console background,

the standard of comparison is high. Video game consoles have really great profilers, so we were kind of spoiled.

Targeting consoles, we had been able to go down to the nitty gritty details. Intel GPA is the first PC-based tool

where we can say ‘Okay, this is comparable to game console tools.’ That pretty much says it all. We get a very

detailed view. We can also reuse our custom tagging system. This is key, because we were already productive

with that tool, and we were comfortable with it. We tag parts of our code and can see, for example, how we

render lights and how we render shadows. We get millisecond orders of feedback on performance, down to

whatever level of granularity we want. It just worked. And that whole system was very reliable. One of the

optimizations we figured out with [Intel] GPA was an improvement in rendering cube shadow maps. With

[Microsoft*] DirectX* 10, you can render in all six faces of a cube shadow map with one draw call. The geometry

shader will then replicate—if necessary—triangles into the six faces. It also does frustum and triangle culling in

the geometry shader, so the geometry shader is pretty busy.”

The code in this inner loop might look like Listing 1.

Listing 1. Example Inner-loop Code

// Loop over cube faces
[unroll]
for (int i = 0; i < 6; i++)
{
    // Translate the view projection matrix to the position of the light
    float4x4 pViewProjArray = viewProjArray[i];

    // Translate
    // Access the row: HLSL[row][column]
    pViewProjArray[0].w += dot(pViewProjArray[0].xyz, -In[0].lightpos.xyz);
    pViewProjArray[1].w += dot(pViewProjArray[1].xyz, -In[0].lightpos.xyz);
    pViewProjArray[2].w += dot(pViewProjArray[2].xyz, -In[0].lightpos.xyz);
    pViewProjArray[3].w += dot(pViewProjArray[3].xyz, -In[0].lightpos.xyz);

    float4 pos[3];
    pos[0] = mul(pViewProjArray, float4(In[0].position.xyz, 1.0));
    pos[1] = mul(pViewProjArray, float4(In[1].position.xyz, 1.0));
    pos[2] = mul(pViewProjArray, float4(In[2].position.xyz, 1.0));

    // Use frustum culling to improve performance
    float4 t0 = saturate(pos[0].xyxy * float4(-1, -1, 1, 1) - pos[0].w);
    float4 t1 = saturate(pos[1].xyxy * float4(-1, -1, 1, 1) - pos[1].w);
    float4 t2 = saturate(pos[2].xyxy * float4(-1, -1, 1, 1) - pos[2].w);
    float4 t = t0 * t1 * t2;

    [branch]
    if (!any(t))
    {
        // Use backface culling to improve performance
        float2 d0 = pos[1].xy * pos[0].w - pos[0].xy * pos[1].w;
        float2 d1 = pos[2].xy * pos[0].w - pos[0].xy * pos[2].w;

        [branch]
        if (d1.x * d0.y > d0.x * d1.y || min(min(pos[0].w, pos[1].w), pos[2].w) < 0.0)
        {
            Out.face = i;

            [unroll]
            for (int k = 0; k < 3; k++)
            {
                Out.position = pos[k];
                Stream.Append(Out);
            }
            Stream.RestartStrip();
        }
    }
}
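
The frustum test in Listing 1 is compact enough that its logic is easy to miss. The following CPU-side Python sketch restates it under the assumption that each vertex is a clip-space tuple (x, y, z, w); the helper names `saturate`, `outside_mask`, and `triangle_culled` are invented for this illustration. Each vertex gets four values that are positive only when it lies outside the left, bottom, right, or top clip plane. Multiplying the three vertices' values component-wise leaves a nonzero entry only when all three vertices are outside the same plane, in which case the triangle cannot touch that cube face.

```python
def saturate(v):
    """Clamp to [0, 1], like the HLSL intrinsic."""
    return max(0.0, min(1.0, v))


def outside_mask(pos):
    """Mirror of saturate(pos.xyxy * float4(-1, -1, 1, 1) - pos.w)."""
    x, y, _z, w = pos
    return [
        saturate(-x - w),  # > 0 when x < -w (outside left plane)
        saturate(-y - w),  # > 0 when y < -w (outside bottom plane)
        saturate(x - w),   # > 0 when x >  w (outside right plane)
        saturate(y - w),   # > 0 when y >  w (outside top plane)
    ]


def triangle_culled(tri):
    """True when all three vertices are outside one common side plane."""
    m0, m1, m2 = (outside_mask(p) for p in tri)
    t = [a * b * c for a, b, c in zip(m0, m1, m2)]
    return any(component != 0.0 for component in t)
```

The test is conservative: a triangle whose vertices lie outside different planes is kept, so nothing visible is ever dropped.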

To relieve the workload of the geometry shader, Engel’s team moved the offset and transformation

code into the vertex shader. This was a performance gain of more than 25 percent. Listing 2 shows

the source code.

Listing 2. Vertex Shader

float4x4 viewProjArray[6];
float3 LightPos;

GsIn main(VsIn In)
{
    GsIn Out;
    float3 position = In.position - LightPos;

    [unroll]
    for (int i=0; i<3; ++i)
    {
        Out.position[i] = mul(viewProjArray[i*2], float4(position.xyz, 1.0));
        Out.extraZ[i] = mul(viewProjArray[i*2+1], float4(position.xyz, 1.0)).z;
    }
    return Out;
}

//------------------------------------------------------------------------------
[Geometry shader]

#define POSITIVE_X 0
#define NEGATIVE_X 1
#define POSITIVE_Y 2
#define NEGATIVE_Y 3
#define POSITIVE_Z 4
#define NEGATIVE_Z 5

float4 UnpackPositionForFace(GsIn data, int face)
{
    float4 res = data.position[face/2];

    [flatten]
    if (face%2)
    {
        res.w = -res.w;
        res.z = data.extraZ[face/2];

        [flatten]
        if (face==NEGATIVE_Y)
            res.y = -res.y;
        else
            res.x = -res.x;
    }
    return res;
}

[maxvertexcount(18)]
void main(triangle GsIn In[3], inout TriangleStream<PsIn> Stream)
{
    PsIn Out;

    // Loop over cube faces
    [unroll]
    for (int i = 0; i < 6; i++)
    {
        float4 pos[3];
        pos[0] = UnpackPositionForFace(In[0], i);
        pos[1] = UnpackPositionForFace(In[1], i);
        pos[2] = UnpackPositionForFace(In[2], i);

        // Use frustum culling to improve performance
        float4 t0 = saturate(pos[0].xyxy * float4(-1, -1, 1, 1) - pos[0].w);
        float4 t1 = saturate(pos[1].xyxy * float4(-1, -1, 1, 1) - pos[1].w);
        float4 t2 = saturate(pos[2].xyxy * float4(-1, -1, 1, 1) - pos[2].w);
        float4 t = t0 * t1 * t2;

        [branch]
        if (!any(t))
        {
            // Use backface culling to improve performance
            float2 d0 = pos[1].xy * pos[0].w - pos[0].xy * pos[1].w;
            float2 d1 = pos[2].xy * pos[0].w - pos[0].xy * pos[2].w;

            [branch]
            if (d1.x * d0.y > d0.x * d1.y || min(min(pos[0].w, pos[1].w), pos[2].w) < 0.0)
            {
                Out.face = i;

                [unroll]
                for (int k = 0; k < 3; k++)
                {
                    Out.position = pos[k];
                    Stream.Append(Out);
                }
                Stream.RestartStrip();
            }
        }
    }
}
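
Listing 2 works because opposite cube faces share most of their clip-space result, so the vertex shader transforms only three times and the geometry shader rebuilds the other three faces with sign flips. The Python sketch below checks that idea numerically. It rests on assumptions the article does not state: Direct3D-style cube face orientations, a 90-degree field of view with square aspect, and arbitrary near and far planes. The function names `project`, `pack`, and `unpack` are hypothetical stand-ins for the shader stages.

```python
import math

# Assumed (right, up, forward) basis per cube face, Direct3D-style ordering.
FACES = [
    ((0, 0, -1), (0, 1, 0), (1, 0, 0)),   # POSITIVE_X
    ((0, 0, 1), (0, 1, 0), (-1, 0, 0)),   # NEGATIVE_X
    ((1, 0, 0), (0, 0, -1), (0, 1, 0)),   # POSITIVE_Y
    ((1, 0, 0), (0, 0, 1), (0, -1, 0)),   # NEGATIVE_Y
    ((1, 0, 0), (0, 1, 0), (0, 0, 1)),    # POSITIVE_Z
    ((-1, 0, 0), (0, 1, 0), (0, 0, -1)),  # NEGATIVE_Z
]
NEGATIVE_Y = 3
NEAR, FAR = 0.1, 100.0  # assumed clip planes


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(face, p):
    """Clip-space [x, y, z, w] of light-relative point p for one face."""
    right, up, forward = FACES[face]
    depth = dot(forward, p)
    z = (FAR * depth - NEAR * FAR) / (FAR - NEAR)
    return [dot(right, p), dot(up, p), z, depth]


def pack(p):
    """Vertex-shader stage: three full transforms plus three extra z values."""
    positions = [project(2 * i, p) for i in range(3)]
    extra_z = [project(2 * i + 1, p)[2] for i in range(3)]
    return positions, extra_z


def unpack(packed, face):
    """Geometry-shader stage: mirrors UnpackPositionForFace in Listing 2."""
    positions, extra_z = packed
    res = list(positions[face // 2])
    if face % 2:
        res[3] = -res[3]
        res[2] = extra_z[face // 2]
        if face == NEGATIVE_Y:
            res[1] = -res[1]
        else:
            res[0] = -res[0]
    return res


def matches_direct_projection(p):
    """Compare the unpacked result with a direct per-face transform."""
    packed = pack(p)
    return all(
        math.isclose(a, b, abs_tol=1e-9)
        for face in range(6)
        for a, b in zip(unpack(packed, face), project(face, p))
    )
```

Only z needs its own stored value, because depth is an affine function of the view distance and does not simply change sign on the opposite face.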

Optimizing the Geometry Shader for Intel® Microarchitecture Codename Sandy Bridge

Using Intel® GPA

Confetti used Intel GPA to tune code and establish hard metrics about optimization results

(Figure 5).



Figure 5. The baseline in Intel® GPA as the Confetti team identified areas of potential for

optimizing shader code. Note the GPU time in GS:1447 on the Shaders tab: On January 8, this was

1444.0 ms.

“We integrated [Intel] GPA very quickly and used it a lot in optimizing for [Intel microarchitecture

codename] Sandy Bridge,” said Engel (Figure 6).



Figure 6. GPA metrics quantifying improvements in shader performance. Note the GPU time in

GS:1357 on the Shaders tab: For the baseline, this was 1098.5 ms.

“I felt like it was the best system compared to other systems. I don’t know what else we can say.

It’s just cool.” Engel is right: There’s nothing more to add to that story. Except maybe graphics. The

next two figures show Confetti’s Dynamic Skydome System during a 24-hour day–night cycle.

Figure 7 shows Confetti’s depth-of-field and point light shadows technology.



Figure 7. Confetti’s depth-of-field and point light shadows technology are important components

of its Dynamic Skydome technology.

Figure 8 provides a detail of light in-scattering.



Figure 8. Detail of the use of in-scattering of light in a rendered scene

Summary

The Confetti team have a long history of aggressively implementing advanced technologies and

also have a broad cross-platform background. Based on that depth of experience, they approached

Intel microarchitecture codename Sandy Bridge-based platforms with expectations of finding

good graphics performance and real opportunities to expand their audience. They got more than

that, however: dramatically improved rendering performance; increased memory bandwidth and

storage; an architecture that allowed them to implement MLAA in a fashion entirely compatible

with moderately priced systems; and best-of-breed optimization tools, so they could know to a

certainty they were delivering beautiful, immersive game experiences for the typical user. For

more information on Confetti, go to http://www.conffx.com or become a friend of Confetti Special

Effects on Facebook.



About the Author

Nancy Nicolaisen is the author of numerous books on software engineering techniques. She

specializes in the design and development of solutions for small mobile and embedded systems.

Her involvement with the game industry dates back to 1981, when she worked at gaming pioneer

Imagic, developer of Demon Attack and other classics.
