Next Generation Visual Computing
(Making GPU Computing a Reality with Mali™)
Taipei, 18 June 2013
Roberto Mijat
ARM
2
Addressing Computational Challenges
Trends
Growing display sizes and resolutions
Increasing computational power and novel applications
Users' persistent expectation of an improved experience
Limitations
Limited and restricted energy and thermal budgets
In mobile, processing power greatly outgrowing battery capacity
Traditional scaling solutions not sustainable
Necessities
Increase computational efficiency of processing platforms
Make use of heterogeneous and parallel computing
Leverage new technologies such as GPU Compute
3
Complementary Compute Architectures
Note: characteristics of generic CPUs and GPUs
4
Heterogeneous Computing
Cost effective, efficient, great floating point performance
2D/3D graphics
Advanced Image Processing
Accelerate/Complement ISP functionality
Offload video codec blocks
Accelerate physics computation
Operating System
Most application processing
Programmable through C-like languages and APIs
GPU used as computational accelerator or companion processor
[Diagram: CPU and GPU block layouts, labelled Control, Caches, ALU and RAM]
5
Benefits of GPU Computing
Performance
Faster computation
Offload and acceleration of non-graphical applications
Energy Efficiency
Free-up CPU resource by offloading to GPU
Better load-balance across system resources
Increased system efficiency using the best processor for the job
Cost Reduction
Reduced cost through h/w consolidation and software flexibility
Simpler interface to parallel programming through modern APIs
Improved user experience
Remove computational barriers
Enable new use cases and applications
6
Adoption of Mobile GPU Compute
2012 2013 2014 2015+
OpenCL™ Full Profile Khronos conformant GPUs in mobile SoCs
GPU Compute capable devices start shipping
OEMs and SiPs evaluating leading GPU Compute solutions
Gradual roll-out of GPU Compute APIs in mobile/embedded platforms
Android™ RenderScript computation first enabled on GPU
7
Adoption of Mobile GPU Compute
First public demonstrations of GPU Compute Mobile benchmarks
ISVs and OEMs start porting/optimizing libraries and key use-case functionality using GPU Compute
Computational Photography and Advanced Imaging GPU acceleration
Codec vendors develop GPU Compute enabled HEVC decoders
Exploration by mainstream developers
2012 2013 2014 2015+
8
Adoption of Mobile GPU Compute
Mainstream support for GPU Computing in Mobile and Embedded
GPU Compute widely available and utilized by developers/libraries
Introduction of GPUs implementing HSA™ features, full system coherency
Hardware consolidation and software cost reduction through migration of selected ISP/DSP functionality to GPU
New use cases, innovation
2012 2013 2014 2015+
9
OPENCL
10
OpenCL Overview
OpenCL enables easier, better programming of heterogeneous parallel compute systems, and unleashes the general purpose computational power of GPUs needed by emerging workloads
OpenCL is
A framework to enable general purpose parallel computing
A computing language portable across heterogeneous processing platforms
An API to define and control the platforms
A royalty-free open standard, interoperable with existing APIs
OpenCL and the OpenCL logo are trademarks of Apple Inc.
11
OpenCL Programming Model
Application Program
Runtime Compiler
Kernel object
Kernel
- OpenCL kernel
- Native kernel
Index space (NDRange)
Execute command
Can use static compilation
Binaries are cached
Can be built to target any supported device
Optimize performance critical code
The kernel is executed over each element of the N-dimensional index space
Work-item: instance of a kernel executing on a point in the index space
Work-group: collection of work-items
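The index-space terms above can be illustrated with a small emulation in plain Python. This is only a sketch and not OpenCL code: `run_ndrange`, `scale_kernel` and the sequential loop are hypothetical stand-ins for what an OpenCL runtime would schedule in parallel on the device.

```python
from itertools import product


def run_ndrange(kernel, global_size, local_size, *args):
    """Call `kernel` once per point of an N-dimensional index space.

    Hypothetical helper: a real runtime would run work-items in parallel
    on the device; here they simply execute one after another.
    """
    if len(global_size) != len(local_size):
        raise ValueError("global and local sizes need the same rank")
    if any(g % l for g, l in zip(global_size, local_size)):
        raise ValueError("each global size must be a multiple of the local size")
    for global_id in product(*(range(g) for g in global_size)):
        # Work-group that owns this work-item, and its position inside it.
        group_id = tuple(g // l for g, l in zip(global_id, local_size))
        local_id = tuple(g % l for g, l in zip(global_id, local_size))
        kernel(global_id, group_id, local_id, *args)


def scale_kernel(global_id, group_id, local_id, src, dst, groups_seen):
    """One work-item: double a single element of a 1-D buffer."""
    i = global_id[0]
    dst[i] = 2 * src[i]
    groups_seen.add(group_id)


src = list(range(8))
dst = [0] * 8
groups = set()
# 8 work-items in total, split into work-groups of 4 work-items each.
run_ndrange(scale_kernel, (8,), (4,), src, dst, groups)
print(dst)             # [0, 2, 4, 6, 8, 10, 12, 14]
print(sorted(groups))  # [(0,), (1,)]
```

Each call to `scale_kernel` plays the role of one work-item; the two distinct `group_id` values correspond to the two work-groups.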
12
The ARM OpenCL Implementation
Implements the latest version of the standard
Implements Full Profile, supports 64-bit
Optimized for interoperability with existing Mali software stack
Optimized for interoperability between CPU and GPU
Architected for Cache Coherent Interconnect support
Extensible design
13
With Full Profile you know what you get
Full Profile defines the baseline set of features for OpenCL
Embedded Profile defines a subset of the specification
Designed to enable OpenCL on less capable devices
Making optional a large set of features, restricting developers
Reducing precision of floating point maths
Key Feature                                      Embedded   Full
FP32 precision                                   Relaxed    IEEE-754
Built-in atomic operations                       Optional   Supported
64-bit integer                                   Optional   Supported
Online compiler                                  Optional   Supported
3D image writes                                  Optional   Supported
Linear interpolation for floating point images   Optional   Supported
Size of buffers and memory                       Limited    Supported
Image data type requirements                     Reduced    Supported
14
RENDERSCRIPT
15
Introduction to RenderScript
Compute framework and API for Android
Officially introduced in Honeycomb
Cross-platform control-slave architecture, with runtime compilation
The graphics engine component has been deprecated since Jelly Bean
Complements existing APIs by adding:
A compute API for parallel processing similar to OpenCL
A scripting language based on C99 supporting vector data types
Designed for portability, performance, usability
On-device JIT compilation and dynamic thread launch
Native code optimization to maximize the performance of critical algorithms
Mali-T604™ is the first GPU to support RenderScript
16
How RenderScript works
Java App
RenderScript Script
Portable Bitcode
Machine Code
libRS
Reflected Layer
llvm-rs-cc
libbcc
Dalvik JIT
Executable
ARM Compute System
(Cortex™ CPU + Mali GPU + AMBA™ 4)
Online compilation
17
DESIGNED FOR GPU COMPUTE
18
Mali-T600™ : Designed for GPU Compute
Comprehensive support for general purpose data types
8/16/32/64-bit signed/unsigned integer
FP16, FP32, FP64
2,3,4,8,16 wide vectors
2D/3D images
Floating Point precision & performance
Full IEEE 754-2008 compliance
100s of GFLOPS for non-graphical workloads
Sustainable and proven performance for real life workloads
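To give a feel for what the three floating point widths listed above mean in practice, the sketch below round-trips one value through the IEEE 754 binary16, binary32 and binary64 storage formats using Python's standard `struct` module. It runs on the host CPU and says nothing about Mali hardware behaviour; the helper name `round_trip` and the test value 0.1 are chosen purely for illustration.

```python
import struct


def round_trip(value, fmt):
    """Store `value` in the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


x = 0.1
as_fp16 = round_trip(x, "<e")  # binary16 (FP16)
as_fp32 = round_trip(x, "<f")  # binary32 (FP32)
as_fp64 = round_trip(x, "<d")  # binary64 (FP64, Python's own float)

# Absolute representation error relative to the FP64 value.
err16 = abs(as_fp16 - x)
err32 = abs(as_fp32 - x)
err64 = abs(as_fp64 - x)
print(err16, err32, err64)
```

The error shrinks as the format widens, which is the trade-off a developer weighs when choosing between FP16, FP32 and FP64 for a compute kernel.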
19
Mali-T600: Designed for GPU Compute
Hardware acceleration
Most common mathematical functions implemented in h/w
>70% coverage within newest industry APIs
Most operations compute in one cycle
Optimal memory throughput and latency
Optimized for stream and generic load/store operations
Tight integration with system using latest AMBA interfaces
Leverage on new Cache Coherent Interconnect technologies
Task management implemented in hardware
Optimal automatic distribution of compute workloads
Optimal dynamic power management
Efficient use of processing resources
20
GPU Compute on Mali: here today!
Passed Khronos™ Conformance
The only OpenCL 1.1 Full Profile implementation on Linux and Android outside the console and desktop space
Proven in Silicon
Samsung® Exynos™ 5 Dual, implements Full Profile
OpenCL and RenderScript DDKs available now
Mali-T600 shipping in real products
Google® Chromebook™
Google Nexus™ 10
InSignal® Arndale™ Community Board
API exposed for developers
RenderScript on Android for Nexus 10
21
USE CASES
Example of the benefits of GPU Compute from the real world
22
Example use cases for GPU Computing
Mobile
• Computational Photography
• Physics in games
• Moving and still image real-time stabilization
• Information extraction: object detection, classification and tracking
• Imaging: correction, improvement, consolidation
• Content and context understanding
• HDR
• Augmented Reality
DTV/STB
• 2D to 3D conversion
• Super resolution
• Pre and post processing
• Camera based UI
• Trans-coding
• Information extraction and superimposition
Automotive
• Lane Detection
• Smart Head-Light
• Road Sign Recognition
• Night Vision
• Object Classification
• Pedestrian, Vehicle and Collision Detection
• Vehicle Detection
• Dynamic cruise control
100s of GFLOPS of efficient processing power: improve existing use cases, enable next-generation use cases
23
Advanced Image Processing
RenderScript is the official Heterogeneous Compute Android API
Since Android 4.2 (Jelly Bean) it has been enabled to target the GPU
Complex image filters can be greatly accelerated by GPU Compute
Filter Speed-up [1]
MotionBlur 3.5x
Cloud 4.2x
Labyrinth 3.8x
TitleReflection 7.3x
WhirlPinch 3.6x
Wave 7.0x
Bicubic 15.4x
[1] Acceleration compares RenderScript compiled on device (LLVM) on dual-core Cortex™-A15 and Mali™-T604 on a stock Google Nexus™ 10
Image size: 2560x1920
24
Video Processing APK
Proprietary Transcoding/Processing Pipeline
Image filters implemented using RenderScript
Optimized for ARM + Mali-T600 GPU Compute
Filter                      FPS (GPU+CPU / CPU only)   Speed-up
Deshake (720p)              28 / 8                     3.5x
Upscaling (720p to 1080p)   20 / 3                     6.7x
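The speed-up column follows directly from the two FPS figures in each row. As a quick check, a minimal sketch (the names `measurements` and `speed_up` are made up for this example; the numbers are copied from the table above):

```python
# FPS figures from the table above: (GPU+CPU, CPU only).
measurements = {
    "Deshake (720p)": (28, 8),
    "Upscaling (720p to 1080p)": (20, 3),
}


def speed_up(accelerated_fps, baseline_fps):
    """Ratio of accelerated frame rate to baseline frame rate."""
    return accelerated_fps / baseline_fps


for name, (gpu_cpu, cpu_only) in measurements.items():
    print(f"{name}: {speed_up(gpu_cpu, cpu_only):.1f}x")
# Deshake (720p): 3.5x
# Upscaling (720p to 1080p): 6.7x
```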
25
GPU Compute accelerated superscaling
Accelerated using RenderScript
On Google Nexus 10 (Mali-T604)
26
Next Generation Multimedia Codecs
High Efficiency Video Coding (HEVC)
Latest video compression standard ratified by ITU in Jan 2013
Improved video quality and double the data compression of H.264
Can support up to 8k UHD
ARM is collaborating with multiple codec vendors
Ensuring widest availability of HEVC across multiple ARM platforms
Enabling HEVC early, in software, through NEON and GPU Compute
Flexibility of software solutions critical as HEVC rolls out
27
Why GPU Compute for HEVC
High resolution HEVC decoding maximises CPU load
GPUs are traditionally idle during video playback
GPU architecture suits acceleration of parallel codec blocks
Offloading computation to the GPU frees up the CPU to perform other (system) tasks
Combining CPU (NEON) and GPU Compute enables the most efficient HEVC decode
“Mali GPUs are well suited for Video Acceleration with significant power/performance benefits” – Ittiam Systems
28
Physics (Cloth Simulation)
29
ISP Pipeline Offload to GPU (OpenCL)
Entire ISP pipeline offloaded to the GPU using OpenCL
More flexibility
Sensor and camera module vendors can invest in optimized portable software libraries instead of hardware ISP
SoC implementers can reduce BoM by offloading ISP blocks to the GPU
Mali-T604 demo was previewed at MWC13
[Pipeline diagram: raw data from HDR sensor; stages labelled Noise reduction, HDR reconstruction, Tone mapping, Colour conversion, Gamma correction, De-noising and Rendering; APIs labelled OpenCL and OpenGL ES]
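Stages such as tone mapping and gamma correction operate on each pixel independently, which is what makes them a natural fit for a GPU work-item per pixel. The sketch below shows that per-pixel structure in plain Python. The specific operators (a Reinhard-style curve and a 2.2 power-law gamma) and all function names are assumptions picked for illustration; the slides do not say which algorithms the demo used.

```python
def tone_map(luminance):
    """Reinhard-style operator (assumed): compress [0, inf) into [0, 1)."""
    return luminance / (1.0 + luminance)


def gamma_correct(value, gamma=2.2):
    """Power-law gamma encoding (assumed) of a linear value in [0, 1]."""
    return value ** (1.0 / gamma)


def process_pixel(hdr_value):
    """Work done by one hypothetical work-item: one pixel in, one pixel out."""
    return round(255 * gamma_correct(tone_map(hdr_value)))


# A made-up row of linear HDR luminance values.
hdr_row = [0.0, 0.5, 1.0, 4.0, 16.0]
ldr_row = [process_pixel(v) for v in hdr_row]
print(ldr_row)
```

Because `process_pixel` depends only on its own input, every pixel could be handled by a separate work-item with no synchronization between them.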
30
Gesture User Interfaces
eyeSight™’s gesture recognition technology using GPU Compute on ARM’s Mali-T600 offers unique capabilities
Reduction of overall power consumption
Reduction of load from the CPU
Robust recognition in challenging lighting conditions
Enhanced user experience
Higher FPS for more gesture capabilities and features
31
Computer Vision Based Applications
Computer Vision entails the acquisition, processing, analysis and understanding of sensor data (images), in order to derive information to enable decisions to be made
[Chart: energy used per unit of work (lower is better)]
Face detection study on Mali-T604 based silicon
In this example:
Consistent 6x speed up
~5x more energy efficiency
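The two figures can be related through energy = average power × time. Assuming both refer to the same unit of work (an assumption; the slide does not state the measurement setup), a 6x speed-up combined with 5x lower energy implies that the accelerated configuration draws roughly 1.2x the average power, but for one sixth of the time. The function name below is hypothetical.

```python
def relative_power(speed_up, energy_gain):
    """Average power of the accelerated run relative to the baseline.

    Assumes energy = average power * time for the same unit of work.
    """
    time_ratio = 1 / speed_up       # accelerated time / baseline time
    energy_ratio = 1 / energy_gain  # accelerated energy / baseline energy
    return energy_ratio / time_ratio


ratio = relative_power(6, 5)
print(round(ratio, 2))  # 1.2
```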
32
Conclusions
Improve energy efficiency through heterogeneous computing
Use the best processor for the task
Balance workload across system resources
Offload heavy parallel computation to the GPU
Bring the benefits of GPU Compute to key use cases
Computational Photography and Advanced Imaging
Next generation of multimedia codecs
Computer Vision applications
The Mali Ecosystem is making GPU Compute a reality