Mobile Performance Tools and GPU Performance Tuning
Lars M. Bishop, NVIDIA Handheld DevTech
Jason Allen, NVIDIA Handheld DevTools



    Copyright NVIDIA Corporation 2004

    NVIDIA GoForce5500 Overview

    World-class 3D
        HW geometry pipeline
        16/32bpp textures and color buffers
        Programmable pixel shading
        QCIF, QVGA, VGA, XGA screen sizes!

    Integrated multimedia features
        HW video decode (video textures!)
        HW video encode (videoconferencing)
        HW camera support (live camera into a texture!)
        HW audio support


    NVIDIA GoForce5500 3D

    Geometry pipeline
        HW transforms
        Vertex Buffer Object support

    High-performance texturing
        1024x1024 textures
        Mipmapping with trilinear filtering
        Compressed textures

    Powerful pixel shading programs
        Up to 5 textures (and 12 texture samples!) per pass
        Complex shader instructions
        Access to additional per-pixel components


    Performance Considerations

    Warning: many of these items may look familiar
        3D on handhelds is not wildly different from desktop or mobile 3D
        But the specifics and balances are different
        We'll focus on these a bit


    Holistic Performance Considerations

    The GPU doesn't exist in isolation

    Balance the three major system components:
        CPU
        System bus
        GPU

    Any one of them can kill performance

    But on HW-accelerated handhelds, the GPU is the least likely candidate as the initial bottleneck today


    CPUs

    CPUs on today's handhelds (especially low-power devices) are limited compared to PCs, even PCs from years ago

    ARM9s with no FPUs are very common

    ARM11s are gaining in popularity
        But are still a small subset, and the FPU is optional

    Caches are smaller


    Minimizing CPU Work

    Know your CPU
        Avoid floating point on an ARM9!
        Be careful with it even on ARM11+VFP
        Be cache-friendly (avoid many passes over vertex arrays)
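A minimal sketch of the integer math that can stand in for floating point on an FPU-less ARM9, assuming a 16.16 fixed-point layout (the same layout OpenGL ES 1.x uses for GL_FIXED). The helper names are hypothetical, and Python is used only to illustrate the arithmetic; in C the product needs a 64-bit intermediate before the shift.

```python
FRAC_BITS = 16           # assumption: 16.16 fixed point
ONE = 1 << FRAC_BITS     # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert once, offline or at load time, not per vertex per frame."""
    return int(round(x * ONE))


def to_float(f: int) -> float:
    """Only for debugging output; avoid in inner loops."""
    return f / ONE


def fx_mul(a: int, b: int) -> int:
    """Multiply two 16.16 values; the raw product carries 32 fraction bits."""
    return (a * b) >> FRAC_BITS


def fx_dot3(u, v) -> int:
    """Dot product of two 3-vectors stored as 16.16 integers."""
    return fx_mul(u[0], v[0]) + fx_mul(u[1], v[1]) + fx_mul(u[2], v[2])
```

Addition and subtraction of fixed-point values need no adjustment; only multiplication and division require the shift.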

    Avoid redundant render state changes
        Needless driver work can cost
        Many drivers don't fast-path redundant calls
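One way to avoid redundant state changes is to filter them on the application side before they reach the driver. The sketch below assumes a hypothetical `driver_set` callback standing in for the real GL call; the class and its counters are illustrative, not part of any NVIDIA API.

```python
_UNSET = object()  # sentinel: state has never been set through the cache


class StateCache:
    """Forwards a state change only when the value actually differs."""

    def __init__(self, driver_set):
        self._driver_set = driver_set  # hypothetical stand-in for a GL state call
        self._current = {}
        self.forwarded = 0
        self.skipped = 0

    def set(self, name, value):
        if self._current.get(name, _UNSET) == value:
            self.skipped += 1          # redundant: never reaches the driver
            return
        self._current[name] = value
        self.forwarded += 1
        self._driver_set(name, value)
```

The cache is only valid if every state change goes through it; any code that sets state behind its back would need to invalidate the cached value.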

    Optimize triangles per call
        Batch triangles
        Use multitexture/shaders to avoid multipass rendering


    System Bus

    Narrower and slower than the PC

    16-bit still common, indirect buses still common

    Result is lower bus bandwidth
        Especially when data is sent in small bursts


    Minimizing System Bus Traffic

    Use VBOs wherever possible
        Mark them as GL_STATIC_DRAW when possible
        Use VBOs for index buffers, too! (almost always static)

    Avoid texture loads per frame
        Use render-to-texture for dynamic textures

    Don't read back the framebuffer
        Unless you are taking screenshots
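A rough back-of-the-envelope sketch of why static VBOs help, assuming that client-side vertex arrays cross the system bus on every draw while a GL_STATIC_DRAW VBO is uploaded once. The function names and mesh numbers are hypothetical, and real drivers may behave differently.

```python
def mesh_bytes(vertex_count, bytes_per_vertex, index_count, bytes_per_index=2):
    """Size of one mesh's vertex and index data in bytes."""
    return vertex_count * bytes_per_vertex + index_count * bytes_per_index


def bus_bytes(mesh_size, frames, static_vbo):
    """Estimated bus traffic for drawing the mesh once per frame.

    Assumption: a static VBO is sent once; client arrays are sent every frame.
    """
    return mesh_size if static_vbo else mesh_size * frames


# Hypothetical mesh: 1000 vertices of 24 bytes each, 3000 16-bit indices.
size = mesh_bytes(1000, 24, 3000)
one_second_client = bus_bytes(size, frames=30, static_vbo=False)
one_second_vbo = bus_bytes(size, frames=30, static_vbo=True)
```

Under these assumptions the client-array path moves 30 times as much data per second at 30 frames per second.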


    GPUs

    Moving ahead very quickly (perf and features)

    So the key is to feed them well

    But it is still possible to choke a good GPU well below its peak rate

    In order to minimize power consumption, various rendering features are not free
        I.e. you don't pay for them when you don't need them


    Maximizing GPU Performance

    Maximize texture throughput
        Format
        Dimensions
        Access coherence

    Maximize triangles-per-call
        Single-pass effects
        Batching


    Texture Formats

    Use compressed textures
        GoForce supports DXT1/3/5 natively at full performance!

    Save 16- and 32-bpp textures for when you need them
        And prove that you do!

    Use single-channel (8bpp) textures when you can
        Often useful in shaders
        Good precision without large size
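To make the format trade-off concrete, the sketch below compares the storage needed for the base level of one texture in each format class mentioned above. DXT1 stores each 4x4 block in 8 bytes and DXT3/DXT5 in 16 bytes; the labels "L8", "RGB565" and "RGBA8888" are illustrative names chosen here for the 8-, 16- and 32-bpp cases.

```python
# Bytes per 4x4 texel block for the block-compressed formats.
BLOCK_BYTES = {"DXT1": 8, "DXT3": 16, "DXT5": 16}

# Bytes per texel for uncompressed formats (labels are illustrative).
BYTES_PER_TEXEL = {"L8": 1, "RGB565": 2, "RGBA8888": 4}


def texture_bytes(width, height, fmt):
    """Storage for the base level only (no mipmaps)."""
    if fmt in BLOCK_BYTES:
        blocks_w = (width + 3) // 4   # partial blocks still cost a full block
        blocks_h = (height + 3) // 4
        return blocks_w * blocks_h * BLOCK_BYTES[fmt]
    return width * height * BYTES_PER_TEXEL[fmt]


def compare(width, height):
    """Size of the same texture in every known format."""
    formats = list(BLOCK_BYTES) + list(BYTES_PER_TEXEL)
    return {fmt: texture_bytes(width, height, fmt) for fmt in formats}
```

For a 256x256 texture, DXT1 takes one eighth of the space of a 32-bpp texture, and DXT5 or a single 8-bit channel takes one quarter.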


    Optimizing Texture Sizes

    Don't just blindly turn off mipmapping!
        Dropping just the finest mip-level saves ~3/4 of the pyramid
        See if you are even using the finest mip-level!
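The weight of the finest mip-level can be checked with a little arithmetic. The sketch below counts texels per level for a full pyramid; the function names are hypothetical.

```python
def mip_level_texels(width, height):
    """Texel count of every level, from the finest down to 1x1."""
    sizes = []
    w, h = width, height
    while True:
        sizes.append(w * h)
        if w == 1 and h == 1:
            break
        w = max(1, w // 2)
        h = max(1, h // 2)
    return sizes


def fraction_saved_by_dropping_finest(width, height):
    """Share of the whole pyramid occupied by level 0."""
    sizes = mip_level_texels(width, height)
    return sizes[0] / sum(sizes)
```

For a 256x256 texture the full pyramid holds 87381 texels, of which 65536 belong to the finest level, so that level alone accounts for about three quarters of the pyramid.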

    Create large virtual textures
        Use single-pass multitexture and shaders
        Compose smaller ones at different scales
        E.g. (detail * base * darkmap * 2)
        3 256x256 textures can create the effect of 1024x1024 in 3/16 the space
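A small sketch of the two numbers behind this slide: the per-pixel combine and the space ratio. It assumes channel values normalized to the 0..1 range and a result clamped to 1; the function names are hypothetical.

```python
from fractions import Fraction


def modulate2x(detail, base, darkmap):
    """(detail * base * darkmap * 2), clamped to the displayable range.

    With the factor of 2, a detail value of 0.5 leaves the base unchanged,
    so the detail layer can both brighten and darken.
    """
    return min(1.0, detail * base * darkmap * 2.0)


def space_ratio(layer_count, layer_size, virtual_size):
    """Texels stored in the small layers relative to one large texture."""
    return Fraction(layer_count * layer_size * layer_size,
                    virtual_size * virtual_size)
```

Three 256x256 layers hold 196608 texels against 1048576 for a single 1024x1024 texture, which is the 3/16 figure quoted above.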


    Mipmapping

    Use mipmapping to increase performance

    But don't waste mipmap pyramids
        Skip them for 2D UI elements

    Remember that embedded memory on handheld is optimized for power, not just speed

    But use trilinear filtering only when needed


    Isolating Performance Bottlenecks

    All of these recommendations are great, but how do I figure out which ones will help my app?

    Old-school: modify the application to add
        Performance counters, timers, wrappers around your OpenGL calls, ability to turn off functionality
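The old-school approach can be sketched as a wrapper that counts and times each call and lets a call be switched off. This is a minimal illustration with hypothetical names; the wrapped function stands in for a real OpenGL entry point.

```python
import time
from collections import defaultdict


class CallStats:
    """Counts, times, and optionally disables wrapped calls."""

    def __init__(self):
        self.count = defaultdict(int)
        self.seconds = defaultdict(float)
        self.enabled = defaultdict(lambda: True)

    def wrap(self, name, fn):
        def wrapper(*args, **kwargs):
            if not self.enabled[name]:
                return None               # functionality turned off
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                self.count[name] += 1
                self.seconds[name] += time.perf_counter() - start
        return wrapper
```

Usage would look like `draw = stats.wrap("draw", real_draw)`, after which `stats.enabled["draw"] = False` skips the call entirely. The cost is that every call site must be routed through the wrapper.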

    The new way: NVIDIA PerfHUD ES
        The PC developer's good friend is now available for GoForce mobile GPUs!
        Makes it easy to do initial performance analysis of OpenGL ES applications


    PerfHUD ES: What is it?

    PerfHUD ES is the OpenGL ES analogue to the popular and powerful PC PerfHUD tool
        Provides an instrumented driver and a client UI

    Supports
        Live performance monitoring of running app
        Directed tests to help isolate bottlenecks

    Unlike the PC PerfHUD, the ES version displays the results on a host PC, not the handheld screen
        Would you want to see stats on a 3" VGA/QVGA?


    PerfHUD ES: How it works

    GoForce HW devkit: runs target app and the instrumented driver
        GL ES 3D Application
        Instrumented GL ES Driver

    Host PC: runs PerfHUD ES UI Client and serves devkit file system
        PerfHUD ES

    Network connection between the devkit and the host PC


    PerfHUD ES Client


    Performance Statistics

    Current (running mean) frame rate

    Timelines with per frame reporting of lots of data

    Histogram of batch sizes (tris per call)

    Indicator lights that flash on expensive operations
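How PerfHUD ES computes its running mean is not stated here. One common way, sketched below under that caveat, is to average over a fixed window of recent frame times; the class name and window size are hypothetical.

```python
from collections import deque


class FrameRateMeter:
    """Running mean frame rate over the last `window` frames."""

    def __init__(self, window=30):
        self._times = deque(maxlen=window)  # oldest entries fall off automatically

    def add_frame(self, seconds):
        self._times.append(seconds)

    def fps(self):
        total = sum(self._times)
        return len(self._times) / total if total > 0 else 0.0
```

Dividing the frame count by the summed time weights slow frames correctly, which averaging per-frame rates would not.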


    Timelines

    Total frame time
    Frame time spent in the driver/GL ES
    Number of draw calls in frame
    Video memory used
    System memory used


    Batch Histogram

    A graph of the number of triangles per draw call

    Most experienced developers are pretty good these days at getting their batch sizes up

    But look out for particle and text systems, which are sometimes the causes of poor batching
        Especially when they are used more heavily than they were designed for
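A batch histogram can also be built by hand from a log of triangles per draw call. The sketch below uses hypothetical bucket edges; the buckets PerfHUD ES itself uses are not given here.

```python
import bisect


def batch_histogram(tris_per_call, edges=(10, 100, 1000)):
    """Count draw calls per triangle-count bucket (edges are an assumption)."""
    labels = ([f"<{edges[0]}"]
              + [f"{lo}-{hi - 1}" for lo, hi in zip(edges, edges[1:])]
              + [f">={edges[-1]}"])
    counts = {label: 0 for label in labels}
    for tris in tris_per_call:
        # bisect_right puts a value equal to an edge into the higher bucket
        counts[labels[bisect.bisect_right(edges, tris)]] += 1
    return counts


def mean_batch_size(tris_per_call):
    """Mean triangles per draw call; 0.0 for an empty log."""
    return sum(tris_per_call) / len(tris_per_call) if tris_per_call else 0.0
```

A frame dominated by the smallest bucket, for example many two-triangle quads from a particle or text system, is the pattern to look for.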


    Event Indicator Lights

    Loading new texel data to a texture

    Loading new data to a VBO

    Creation of a new pixel shader

    Each of these should probably be investigated if they are blinking every frame (or frequently)


    Using the Statistics as Triage

    DevTech can use the passive stats to quickly analyze a newly-received app. We frequently look at:

    The lights
        Are textures or VBOs getting created/updated frequently?
        Textures are of particular interest

    The batch-size histogram
        Is the mean number of tris per batch low?

    The total-time and driver-time timelines
        What is the overall ratio of app time (total minus driver) and driver time?
        Focus where the time is being spent
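The app-time versus driver-time split is simple arithmetic on the two timelines. A minimal sketch, with a hypothetical function name and example numbers:

```python
def time_split(total_ms, driver_ms):
    """Return (app share, driver share) of the frame time.

    App time is defined as total frame time minus time spent in the driver.
    """
    app_ms = total_ms - driver_ms
    return app_ms / total_ms, driver_ms / total_ms


# Hypothetical frame: 40 ms in total, 10 ms of it inside the driver.
app_share, driver_share = time_split(40.0, 10.0)
```

In this made-up frame three quarters of the time is spent in the application, so that is where tuning effort would go first.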


    The Directed Tests

    Optional modes that intercept rendering and state calls to change the rendering without having to modify the app
        Replace all textures with a 2x2-texel texture
        Ignore glDraw* calls
        Ignore gl* calls
        Disable VSYNC
        Disable pixel (AKA alpha) blending
        Disable GL_LIGHTING
        NULL viewport


    Directed Test Use-case Examples

    Think you might be fill-rate bound?
        Enable "NULL viewport"
        If the frame rate shoots up, you may be fill-bound

    Think you might be texture-bandwidth bound?
        Enable "2x2 textures"
        If the frame rate shoots up, you may be texture-bound

    Think you might be app-bound?
        Enable "Ignore all GL calls"
        If the frame rate does not shoot up, you are likely app-bound

    And so on... All without touching your source code
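The three use cases above amount to a small decision procedure over measured frame rates. The sketch below encodes it; the function name and the 1.5x threshold for "shoots up" are assumptions, not PerfHUD ES behavior.

```python
def triage(baseline_fps, null_viewport_fps, tiny_texture_fps, ignore_gl_fps,
           speedup=1.5):
    """Suggest likely bottlenecks from directed-test frame rates.

    `speedup` is an assumed threshold for what counts as "shoots up".
    """
    findings = []
    if null_viewport_fps >= baseline_fps * speedup:
        findings.append("possibly fill-rate bound")
    if tiny_texture_fps >= baseline_fps * speedup:
        findings.append("possibly texture-bandwidth bound")
    if ignore_gl_fps < baseline_fps * speedup:
        findings.append("likely app-bound")
    return findings
```

The results are hints for where to look next rather than a diagnosis, which matches the hedged wording of the slide ("may be", "likely").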


    Speed Control

    Advanced feature: requires a simple modification to the app (timer extension)
        Allows the user to slow or even stop time in the app
        Makes it possible to replay the same frame over and over while changing directed tests, etc.
        Useful for isolating bottlenecks in a particular scene
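The details of the timer extension are not given here. The sketch below only illustrates the general idea of an application clock whose rate can be scaled or stopped; the class and method names are hypothetical.

```python
class ScaledClock:
    """Application time that can run slower than real time, or stop."""

    def __init__(self, real_now):
        self._real_now = real_now        # function returning real seconds
        self._last_real = real_now()
        self._app_time = 0.0
        self._scale = 1.0

    def now(self):
        real = self._real_now()
        self._app_time += (real - self._last_real) * self._scale
        self._last_real = real
        return self._app_time

    def set_scale(self, scale):
        self.now()                       # bank time elapsed at the old scale
        self._scale = scale              # 0.0 freezes the app's time
```

If the app drives all animation from `now()` rather than from the system clock, a scale of zero makes every frame render the same scene, which is what allows directed tests to be compared on identical content.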


    Coming Attractions

    This is just the first version of PerfHUD ES
        Much more to come!
        Some features will be similar to those in PC PerfHUD
        Others will continue to be very handheld/embedded-specific


    Questions??

    handset-dev@nvidia.com
