It’s a Fast, Smooth, Well-lit World Out There: Confetti, MLAA, and
Intel® Microarchitecture Codename Sandy Bridge Remodel Unreality
By Nancy Nicolaisen
As documented by extensive neurobiologic research, about 80 percent of the entire bandwidth of
the human brain is devoted to vision. Another 12 percent or so is devoted to fun. This second bit is
conjecture but represents a reasonable guess based on casual observation of the electronic
entertainment industry. All this goes a long way toward explaining why, of all the dynamic, rapidly
evolving fields of technology, computer-generated visuals are among the most vibrant in terms of
innovation. One thing that makes graphics software development such fertile ground for
innovators is that as rendering algorithms have become increasingly sophisticated, cutting-edge
graphics hardware has advanced dramatically in performance and fallen in price. This trend is
making capabilities formerly reserved for the fortunate few commonplace in home and personal
gaming systems. High-end graphics developers Wolfgang Engel and Peter Santoki were in the best
imaginable position to see the opportunity this trajectory posed when they decided to strike out
on their own and leave Rockstar Games, publisher of some of the most successful games of all
time.
Engel and Santoki enjoyed the sort of jobs that most in the game industry would kill for: Engel was
lead graphics programmer and a key game engine developer, while Santoki was an acknowledged
wizard in the creation and application of visual effects. However, the two friends and colleagues
hungered for that elusive “something more.” Driven, as Engel puts it, “to follow their dream,” the
two founded Confetti Special Effects Inc. to pursue research in leading-edge graphics
programming technology and to become a game middleware powerhouse (Figure 1). Dreams, it
seems, can come true. Confetti has become one of the earliest providers of advanced graphics
middleware tools based on Morphological Anti-aliasing (MLAA) technology.
2 Confetti, MLAA, and Intel® Sandy Bridge
v4a DRAFT 14 February 2011
Figure 1. Launching a rocket using depth of field, lit soft particles, and point light shadows
Smooth Those Edges and Make It Snappy
MLAA is a relatively new technique but has the potential to dramatically improve game
experiences. Initially showcased on high-end graphics processing units (GPUs), the technology
enables developers to boost the image quality of their games efficiently and in real time. To
understand how much improvement MLAA represents, you need to look back. Historically,
dynamically rendering computer graphics dictated a simple trade-off: You could have it look good,
happen fast, or run well on inexpensive platforms, but it was a “pick any two” situation. In part,
this limitation stems from the physiology of human eyesight. It turns out that, given a random
sample of the population, the number of color-sensitive cones on one person's retina may differ by
up to 40 times relative to that of another person, but there will be no measurable difference in
their ability to see. This is because we see mostly with our brains, not our eyes. More important
still, we don't “see” everything in our field of view equally.
Human vision is fine-tuned to detect alignment, which is why edges that appear to have “stair
steps” are much more noticeable and visually displeasing in video graphics than minor variations
in color, to which we are significantly less sensitive. (The phenomenon of blocky, stair-stepped
edges is known by computer scientists as spatial aliasing.) For these reasons, it has long been
standard practice in graphics programming to “soften away” spatial aliases by applying gradients
of color change at edge boundaries, a technique commonly known as anti-aliasing (AA). Detecting edge
boundaries in complex shapes—particularly shapes with holes in them, like grates or fences—is a
demanding problem that can be computationally costly to solve, particularly in the case of
repetitively used graphic objects such as texture maps. Although many programming techniques
exist to smooth edges and give textures a pleasing or realistic look and feel, they all suffer from
one or both of these problems: They are resource intensive, requiring extraordinary amounts of
computing power for reasonable performance, or they omit remediation of some amount of
aliasing, causing images to appear to shimmer or have obvious remaining “jaggies.” One of
Confetti’s noteworthy early successes was to create cutting-edge AA middleware tools for use by
game developers. Those AA tools finally achieve the graphic programmer’s trifecta of happy
outcomes: They create smooth, sinuous edges; they’re fast; and they run well on the most common
configurations used by PC gamers (Figure 2).
Figure 2. Launching a rocket using depth particles, fire, point lights and shadows, and sparks
Confetti’s Road to MLAA
Why did Confetti settle on a plan to implement MLAA in its product offering? Confetti CTO Engel, a
widely published authority on computer-generated visuals, reflects here on his previous
experience with AA techniques:
“I wrote a significant part of the game engine used in Rockstar’s products, so optimizing image quality is
something I’ve thought about a lot. If you use older software anti-aliasing strategies like Multisample
Anti-aliasing (MSAA), you get excellent image quality, but it can quickly become expensive in terms of performance
impacts and sacrifices gamers with low-end platforms. Similarly, if you code to hardware anti-aliasing, then you
limit your audience size to a specific installed hardware base. MLAA seemed to me to have the potential to
overcome a lot of barriers for us at once, because it works in image space, making it efficient and
portable.”
Engel explains the conceptual way in which Confetti implements MLAA in its deferred lighting
engine like this:
1. Confetti’s engine detects edges, discovering verticals and horizontals.
2. Edges are assigned values.
3. Based on the edge values, the image space is blurred in horizontal and vertical directions.
“It sounds simple,” says Engel, “but there are challenges. How do you detect edges? And how do
you apply a chroma filter in vertical and horizontal directions?”
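The three steps above can be sketched as a single post-process pixel shader. This is a minimal illustration only, not Confetti's implementation and not full MLAA, which also classifies the shape of each edge before blending. The names SceneTex, PointClamp, and EdgeThreshold are assumptions introduced for the example, as is the choice of a luminance difference as the edge detector.

```hlsl
// Simplified image-space edge smoothing. A sketch under stated assumptions:
// not Confetti's code, and not a complete MLAA implementation.
Texture2D    SceneTex   : register(t0); // hypothetical: the rendered frame
SamplerState PointClamp : register(s0); // hypothetical: point-filtered, clamped

cbuffer AAParams : register(b0)
{
    float EdgeThreshold; // hypothetical: luminance step treated as an edge (> 0), e.g. 0.1
};

// Approximate brightness of a color (Rec. 601 luma weights).
float Luma(float3 c)
{
    return dot(c, float3(0.299, 0.587, 0.114));
}

float4 main(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    // Fetch the pixel and its four direct neighbors.
    float3 center = SceneTex.Sample(PointClamp, uv).rgb;
    float3 left   = SceneTex.Sample(PointClamp, uv, int2(-1,  0)).rgb;
    float3 right  = SceneTex.Sample(PointClamp, uv, int2( 1,  0)).rgb;
    float3 up     = SceneTex.Sample(PointClamp, uv, int2( 0, -1)).rgb;
    float3 down   = SceneTex.Sample(PointClamp, uv, int2( 0,  1)).rgb;

    // Step 1: detect edges from luminance differences.
    // diff.x responds to vertical edges, diff.y to horizontal edges.
    float lc = Luma(center);
    float2 diff;
    diff.x = max(abs(lc - Luma(left)), abs(lc - Luma(right)));
    diff.y = max(abs(lc - Luma(up)),   abs(lc - Luma(down)));

    // Step 2: assign each direction an edge value in [0, 1].
    float2 threshold = float2(EdgeThreshold, EdgeThreshold);
    float2 edge = smoothstep(threshold, 2.0 * threshold, diff);

    // Step 3: blur horizontally and vertically in proportion to the edge values.
    float3 hBlur = 0.25 * (left + right) + 0.5 * center;
    float3 vBlur = 0.25 * (up + down)    + 0.5 * center;
    float3 result = lerp(center, hBlur, edge.x);
    result = lerp(result, vBlur, edge.y);

    return float4(result, 1.0);
}
```

Because the filter reads only the finished frame, it can be skipped for a given frame simply by not running this pass, which is the kind of flexibility Engel describes for image-space anti-aliasing.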
What most impressed the Confetti team in their early MLAA
implementation experience was the technology’s performance. Their MLAA-reliant lighting engine
ran as fast as if there were no AA being performed at all. Image quality was another standout
benefit.
“The quality is really close to MSAA, but doing anti-aliasing in image space is so much simpler and easier. It gives
you all kinds of flexibility, like the ability to switch anti-aliasing on and off without rebooting the whole game. For
example, you can do things like toggling anti-aliasing on or off depending on the workload or the camera style.”
The Intel® Microarchitecture Codename Sandy Bridge Piece of the Solution
It is not surprising that MLAA image enhancement works beautifully with the 2nd generation
Intel® Core™ Processor family, featuring Intel microarchitecture codename Sandy Bridge,
because in a real sense, the two were made for each other (see Figure 3 and Figure 4). The
efficiency of this pairing wasn't lost on Engel and his team. When asked to describe their typical
workflow and design process, he made these observations:
“We look at the hardware first and see what it can do. From there, we try to come up with assumptions about
how we can get a certain effect from a given hardware configuration. Of course, at a very early point in our
evaluations, we also look at potential market penetration for a given platform. We closely monitor the monthly
Steam hardware survey to identify trends.
“In December 2010, I believe the last survey showed something like 74 percent Intel® architecture usage. Given
these statistics, one thing we realized when we first looked at Sandy Bridge was that it would certainly have a
large market presence. In our early evaluation group, we thought ‘Oh, that’s so cool, because suddenly we have
a very powerful platform that will have lots of users.’ The next thought was, ‘How can we use this?’”
(Steam conducts a voluntary survey to collect data about what kinds of computer hardware and
software people are using. Find it at http://store.steampowered.com/hwsurvey.)
Figure 3. MLAA before and after. The image on the left is rendered without MLAA; the image on
the right has MLAA applied.
Figure 4. Detail: Notice the spatial aliasing in the shoulder line before applying MLAA and the
smoother profile of the right image after the MLAA processing.
Intel microarchitecture codename Sandy Bridge provided many advantages to Confetti. It offered
increased memory, much better rendering performance, and flexibility in designing new solutions.
“I’ve worked a great deal with previous generations of graphics hardware, where it was very difficult to do
deferred lighting because the memory bandwidth was just not there. We did a couple of test runs on [Intel
microarchitecture codename] Sandy Bridge, and it really surprised us with the fact that there was suddenly so
much more memory bandwidth. That was actually the first time we could run on processor graphics.”
(Processor graphics are a defining feature of the 2nd Generation Intel® Core™ i7/i5/i3
processors, wherein all the graphics capabilities are built right into the CPU chip.) Engel and his
team were most pleased about being able to run with their new shadow system on a piece of
hardware that will be common in the market. As Engel notes:
“What really surprised us was that when we started to do performance measurements, we found that the cost
per light and the cost per shadow map were really good. We all still have our gaming console hardware and lots
of specialized entertainment hardware, so coming from that background, we were very surprised that we can use
more lights and we can render more shadows on the [Intel] Sandy Bridge platforms. We are quite excited about
this, because it means that whatever we do now will run on the majority of PCs.”
Doing the Metrics
Confetti did a lot of analysis before implementing its deferred lighting engine, because it was a
“no-going-back” kind of step. In earlier versions of the company’s middleware and tools, the team
left this feature out. Seductive as it was, it was simply too big a market risk. Says Engel:
“Two years ago, we were thinking about doing deferred lighting and decided against including it, because the
low-end consumer platforms just couldn’t handle it. They really couldn’t do it at all. And because deferred
lighting influences the look and feel of your game, we couldn’t just say, ‘Okay let’s have a fall-back method.’ If
you have deferred lighting in your engine, you can’t dynamically drop back to a low-end approach.”
The Confetti team had to be sure they were creating a product that could be used by as wide a
variety of gamers as possible, and they needed empirical performance metrics to back up their
decision, so they used the Intel® Graphics Performance Analyzers (Intel® GPA) to give solid
foundations to their estimations of rendering performance and characteristics.
Engel and his peers liked Intel GPA—a lot:
“Let me just first say GPA is great. It’s awesome. For someone like me, coming from a game console background,
the standard of comparison is high. Video game consoles have really great profilers, so we were kind of spoiled.
Targeting consoles, we had been able to go down to the nitty gritty details. Intel GPA is the first PC-based tool
where we can say ‘Okay, this is comparable to game console tools.’ That pretty much says it all. We get a very
detailed view. We can also reuse our custom tagging system. This is key, because we were already productive
with that tool, and we were comfortable with it. We tag parts of our code and can see, for example, how we
render lights and how we render shadows. We get millisecond orders of feedback on performance, down to
whatever level of granularity we want. It just worked. And that whole system was very reliable. One of the
optimizations we figured out with [Intel] GPA was an improvement in rendering cube shadow maps. With
[Microsoft*] DirectX* 10, you can render in all six faces of a cube shadow map with one draw call. The geometry
shader will then replicate—if necessary—triangles into the six faces. It also does frustum and triangle culling in
the geometry shader, so the geometry shader is pretty busy.”
The code in this inner loop might look like Listing 1.
Listing 1. Example Inner-loop Code
// Loop over cube faces
[unroll]
for (int i = 0; i < 6; i++)
{
    // Translate the view projection matrix to the position of the light
    float4x4 pViewProjArray = viewProjArray[i];

    //
    // translate
    //
    // access the row HLSL[row][column]
    pViewProjArray[0].w += dot(pViewProjArray[0].xyz, -In[0].lightpos.xyz);
    pViewProjArray[1].w += dot(pViewProjArray[1].xyz, -In[0].lightpos.xyz);
    pViewProjArray[2].w += dot(pViewProjArray[2].xyz, -In[0].lightpos.xyz);
    pViewProjArray[3].w += dot(pViewProjArray[3].xyz, -In[0].lightpos.xyz);

    float4 pos[3];
    pos[0] = mul(pViewProjArray, float4(In[0].position.xyz, 1.0));
    pos[1] = mul(pViewProjArray, float4(In[1].position.xyz, 1.0));
    pos[2] = mul(pViewProjArray, float4(In[2].position.xyz, 1.0));

    // Use frustum culling to improve performance
    float4 t0 = saturate(pos[0].xyxy * float4(-1, -1, 1, 1) - pos[0].w);
    float4 t1 = saturate(pos[1].xyxy * float4(-1, -1, 1, 1) - pos[1].w);
    float4 t2 = saturate(pos[2].xyxy * float4(-1, -1, 1, 1) - pos[2].w);
    float4 t = t0 * t1 * t2;

    [branch]
    if (!any(t))
    {
        // Use backface culling to improve performance
        float2 d0 = pos[1].xy * pos[0].w - pos[0].xy * pos[1].w;
        float2 d1 = pos[2].xy * pos[0].w - pos[0].xy * pos[2].w;

        [branch]
        if (d1.x * d0.y > d0.x * d1.y || min(min(pos[0].w, pos[1].w), pos[2].w) < 0.0)
        {
            Out.face = i;

            [unroll]
            for (int k = 0; k < 3; k++)
            {
                Out.position = pos[k];
                Stream.Append(Out);
            }
            Stream.RestartStrip();
        }
    }
}
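The two culling tests in Listing 1 are compact, so it may help to see them pulled out into separate functions with the reasoning written as comments. The helper names OutsideSamePlane and PassesBackfaceTest are hypothetical and do not appear in Confetti's code; the arithmetic is copied from the listing.

```hlsl
// Hypothetical helper names; the math is the same as in Listing 1.

// True when all three clip-space vertices lie outside the same side plane of
// the frustum, in which case the triangle cannot touch this cube face.
bool OutsideSamePlane(float4 p0, float4 p1, float4 p2)
{
    // For one vertex p, each component below is positive only when p is
    // outside one plane:
    //   .x: -p.x - p.w > 0  <=>  p.x < -p.w
    //   .y: -p.y - p.w > 0  <=>  p.y < -p.w
    //   .z:  p.x - p.w > 0  <=>  p.x >  p.w
    //   .w:  p.y - p.w > 0  <=>  p.y >  p.w
    const float4 signs = float4(-1, -1, 1, 1);
    float4 t0 = saturate(p0.xyxy * signs - p0.w);
    float4 t1 = saturate(p1.xyxy * signs - p1.w);
    float4 t2 = saturate(p2.xyxy * signs - p2.w);

    // A component of the product is nonzero only if all three vertices are
    // outside that same plane.
    return any(t0 * t1 * t2);
}

// True when the triangle should be kept by the backface test.
bool PassesBackfaceTest(float4 p0, float4 p1, float4 p2)
{
    // d0 and d1 are the projected edges (p1 - p0) and (p2 - p0), scaled by
    // p0.w * p1.w and p0.w * p2.w so that no division by w is needed.
    float2 d0 = p1.xy * p0.w - p0.xy * p1.w;
    float2 d1 = p2.xy * p0.w - p0.xy * p2.w;

    // With all w positive, the sign of the 2D cross product gives the winding.
    bool wantedWinding = d1.x * d0.y > d0.x * d1.y;

    // With a negative w the scaled edges can flip sign, so the winding test
    // is unreliable and the triangle is kept.
    bool anyNegativeW = min(min(p0.w, p1.w), p2.w) < 0.0;

    return wantedWinding || anyNegativeW;
}
```

With these helpers, the body of the loop would emit a triangle when `!OutsideSamePlane(pos[0], pos[1], pos[2]) && PassesBackfaceTest(pos[0], pos[1], pos[2])`. Note that the frustum test checks only the four side planes, not the near and far planes.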
To relieve the workload of the geometry shader, Engel’s team moved the offset and transformation
code into the vertex shader. This change produced a performance gain of more than 25 percent. Listing 2 shows
the source code.
Listing 2. Vertex Shader and Geometry Shader
float4x4 viewProjArray[6];
float3 LightPos;

GsIn main(VsIn In)
{
    GsIn Out;

    float3 position = In.position - LightPos;

    [unroll]
    for (int i=0; i<3; ++i)
    {
        Out.position[i] = mul(viewProjArray[i*2], float4(position.xyz, 1.0));
        Out.extraZ[i] = mul(viewProjArray[i*2+1], float4(position.xyz, 1.0)).z;
    }

    return Out;
}

//------------------------------------------------------------------------------
[Geometry shader]

#define POSITIVE_X 0
#define NEGATIVE_X 1
#define POSITIVE_Y 2
#define NEGATIVE_Y 3
#define POSITIVE_Z 4
#define NEGATIVE_Z 5

float4 UnpackPositionForFace(GsIn data, int face)
{
    float4 res = data.position[face/2];

    [flatten]
    if (face%2)
    {
        res.w = -res.w;
        res.z = data.extraZ[face/2];

        [flatten]
        if (face==NEGATIVE_Y)
            res.y = -res.y;
        else
            res.x = -res.x;
    }

    return res;
}

[maxvertexcount(18)]
void main(triangle GsIn In[3], inout TriangleStream<PsIn> Stream)
{
    PsIn Out;

    // Loop over cube faces
    [unroll]
    for (int i = 0; i < 6; i++)
    {
        float4 pos[3];
        pos[0] = UnpackPositionForFace(In[0], i);
        pos[1] = UnpackPositionForFace(In[1], i);
        pos[2] = UnpackPositionForFace(In[2], i);

        // Use frustum culling to improve performance
        float4 t0 = saturate(pos[0].xyxy * float4(-1, -1, 1, 1) - pos[0].w);
        float4 t1 = saturate(pos[1].xyxy * float4(-1, -1, 1, 1) - pos[1].w);
        float4 t2 = saturate(pos[2].xyxy * float4(-1, -1, 1, 1) - pos[2].w);
        float4 t = t0 * t1 * t2;

        [branch]
        if (!any(t))
        {
            // Use backface culling to improve performance
            float2 d0 = pos[1].xy * pos[0].w - pos[0].xy * pos[1].w;
            float2 d1 = pos[2].xy * pos[0].w - pos[0].xy * pos[2].w;

            [branch]
            if (d1.x * d0.y > d0.x * d1.y || min(min(pos[0].w, pos[1].w), pos[2].w) < 0.0)
            {
                Out.face = i;

                [unroll]
                for (int k = 0; k < 3; k++)
                {
                    Out.position = pos[k];
                    Stream.Append(Out);
                }
                Stream.RestartStrip();
            }
        }
    }
}
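Listing 2 transforms each vertex for only three of the six faces and stores one extra z value per opposite face. UnpackPositionForFace then rebuilds the opposite face by negation. The article does not show the matrices in viewProjArray, so the sketch below is an illustration under stated assumptions: the standard Direct3D cube map face orientation and a 90-degree, square-aspect perspective projection. The function names and the depth constants A and B are hypothetical.

```hlsl
// Hypothetical illustration of why opposite cube faces can share one transform.
// p is the vertex position relative to the light. A and B are the usual
// Direct3D depth scale and bias: A = Far / (Far - Near), B = -Near * A.

float4 ProjectPositiveX(float3 p, float A, float B)
{
    // Face looks down +X: right = -Z, up = +Y, view depth = +X.
    return float4(-p.z, p.y, A * p.x + B, p.x);
}

float4 ProjectNegativeX(float3 p, float A, float B)
{
    // Face looks down -X: right = +Z, up = +Y, view depth = -X.
    return float4(p.z, p.y, -A * p.x + B, -p.x);
}

// Rebuilds the -X result from the +X result plus one separately computed z,
// mirroring what UnpackPositionForFace does for odd-numbered faces.
float4 NegativeXFromPositiveX(float4 posX, float negXDepth)
{
    float4 res = posX;
    res.x = -res.x;     // the right vector flips
    res.w = -res.w;     // the view depth flips
    res.z = negXDepth;  // not a simple negation, because of the bias B
    return res;
}
```

Under these assumptions, x and w of the -X result are exact negations of the +X result and y is unchanged, so only z needs a separate value. The same pattern holds for the Z pair, while for the Y pair it is y rather than x that flips, which matches the NEGATIVE_Y branch in Listing 2.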
Optimizing the Geometry Shader for Intel® Microarchitecture Codename Sandy Bridge
Using Intel® GPA
Confetti used Intel GPA to tune code and establish hard metrics about optimization results
(Figure 5).
Figure 5. The baseline in Intel® GPA as the Confetti team identified areas of potential for
optimizing shader code. Note the GPU time in GS:1447 on the Shaders tab: On January 8, this was
1444.0 ms.
“We integrated [Intel] GPA very quickly and used it a lot in optimizing for [Intel microarchitecture
codename] Sandy Bridge,” said Engel (Figure 6).
Figure 6. GPA metrics quantifying improvements in shader performance. Note the GPU time in
GS:1357 on the Shaders tab: For the baseline, this was 1098.5 ms.
“I felt like it was the best system compared to other systems. I don’t know what else we can say.
It’s just cool.” Engel is right: There’s nothing more to add to that story. Except maybe graphics. The
next two figures show Confetti’s Dynamic Skydome System during a 24-hour day–night cycle.
Figure 7 shows Confetti’s depth-of-field and point light shadows technology.
Figure 7. Confetti’s depth-of-field and point light shadows technology are important components
of its Dynamic Skydome technology.
Figure 8 provides a detail of light in-scattering.
Figure 8. Detail of the use of in-scattering of light in a rendered scene
Summary
The Confetti team have a long history of aggressively implementing advanced technologies and
also have a broad cross-platform background. Based on that depth of experience, they approached
Intel microarchitecture codename Sandy Bridge-based platforms with expectations of finding
good graphics performance and real opportunities to expand their audience. They got more than
that, however: dramatically improved rendering performance; increased memory bandwidth and
storage; an architecture that allowed them to implement MLAA in a fashion entirely compatible
with moderately priced systems; and best-of-breed optimization tools, so they could know to a
certainty they were delivering beautiful, immersive game experiences for the typical user. For
more information on Confetti, go to http://www.conffx.com or become a friend of Confetti Special
Effects on Facebook.
About the Author
Nancy Nicolaisen is the author of numerous books on software engineering techniques. She
specializes in the design and development of solutions for small mobile and embedded systems.
Her involvement with the game industry dates back to 1981, when she worked at gaming pioneer
Imagic, developer of Demon Attack and other classics.