Upload
lamnga
View
215
Download
2
Embed Size (px)
Citation preview
Calin Cascaval, [email protected] Sept. 10, 2013
Parallel Programming for Mobile Devices
© 2013 Qualcomm Technologies, Inc.
Power
Qualcomm Research!
Outline
• Mobile compu-ng market -‐-‐ what do users care about?
• Mobile compu-ng landscape -‐-‐ how does the plaHorm look like?
• Applica-ons and developers -‐-‐ who is wri-ng code for mobile?
• Browser wars revisited -‐-‐ why does it maLer?
• Parallel heterogeneous programming made easier using MARE
2
Qualcomm Research!
Mobile Market Size
0
100
200
300
400
500
600
700
800
900
2010 2012 2017 (forecast)
Billion
s of U
S $
Mobile Desktop Servers HPC
Source: IDC. HPC forecast Intersect360
4
Size of mobile market roughly doubling every two years
> 20x
Qualcomm Research!
The Internet of Everything
7
Connec-vity Context Control An estimated 25B devices connected by 2020, talking to each other, making choices, and taking decisions
Use context awareness to filter the data that reaches every person: time, location, taste
Allow users to control and customize experience through new, mobile centric interfaces
Enabling whole new classes of applica-ons, where power is even more cri-cal
Qualcomm Research!
Anatomy of a Mobile SoC: Qualcomm Snapdragon 800
9
Source: hLp://www.qualcomm.com/snapdragon/processors/800
Qualcomm Research!
Mobile Applica-ons
11
Connec-vity Gaming Social networking, sharing: Media rich, data heavy
On device and multiplayer games with realistic graphics
Browsing Imaging Information gateway Computational photography, search
and processing of images and video
Qualcomm Research!
Mobile Sodware Development
12
Developers
• Think in terms of services and “features” • Portable apps, performance not the main concern • Use JavaScript frameworks
DSP GPU
AXI
T SE / RAS
VBIF
CP
CPU
VSC
HLSQ
VPC
RBBM
MARB
GMEM GMEM GMEM GMEM
OXILI SS
memory
PM4 PacketBuffers
VS Buffer
Index Buffers
Vertex Objects
Texture Objects
Frame Buffer
PC
VFD
SP#3
SP#2
SP#1
SP#0
UCHE
arb
OXILI TOP AHB Slave I/F
SS (shader system)
FF (fix-function)
External - other
arb
arb
arb
RB RB RB RB
MARB
MARB
MARB
2xTPL1
2xTPL1
2xTPL1
2xTPL1
ARM926EJ-S
16k I$ 8k D$
AHB
DDR AXI
VBIF (DDR)
ahb2axi ahb2crif
ARM Peripheral
VBIF (OCMEM)
OCMEM AXI
SP
SG
AP
VS
P-V
PP
I/F
(FIF
O)
Sha
red
RA
M B
anks
4K
B x
8
DM
A
VPE MEC
L1 $ / L2 $ Ctrl
TREPPEOCMEM
Data Mover & Controller
FPB (AHB)
VPP
VSP
Shared RAM pool(FIFO / Cache / Tile Buffer / Internal RAM)
Hardware Accel. CPU(s)
CPU CPU
VeNum
CPU
L2
VeNum VeNum
VeNum
CPU
High%Power%Efficiency
Low%Power%Efficiency
Qualcomm Research!
How can we easily write parallel applications
that use all available cores in a battery-powered
device?
13
Qualcomm Research!
Our Vision of a Mobile Sodware Stack
14
Na-ve Apps
JavaScript frameworks
Browser Engine
Web Apps
MARE
Domain Specific Parallel Libraries
Programmability: MARE -‐ Parallel, heterogeneous programming made easier -‐ Power and performance op-miza-ons
Performance: Domain Specific libraries -‐ Exploit domain knowledge to provide composable libraries for all programmers -‐ Hide hardware complexity
Portability: Zoomm Web App Engine -‐ Use of concurrency to op-mize execu-on of Web Apps -‐ Hardware exploita-on through the browser
Qualcomm Research!
Web Browser Usage
Information access Web Applications
16
Source: Illuminas, 2013
89% Email 85% Browsing
Qualcomm Research!
Browser Execu-on Time Breakdown
17
19%
31%
21%
20%
5% 4%
ARM Cortex A9 ExecuDon Time
rendering javascript layout css Others parsing
Average over the top 30 Alexa sites (May 2010), WebKit browser on a 400MHz Cortex A9, Linux
Insight: There is no one dominant component – Paralleliza-on must address the en-re browser structure!
Qualcomm Research!
Zoomm: Pervasive Concurrency
18
Per-‐Page DOM Engine Per-‐Page DOM Engine
JavaScript Engine
Compila-on Execu-on
JavaScript Engine
Compila-on Execu-on
User Interface
Res. Manager
Image Decoding
CSS Parsing
Prefetching
Per-‐Page DOM Engine
Events
Timers
HTML Parsing
Styling
CSS Parsing
Rendering Engine
Render Layout
Per-‐Page JavaScript Engine
Compila-on Execu-on
URL Events
HTML code
JS code Layout Tree
Cascaval et al.: Zoomm: a parallel web browser engine for mul-core mobile devices, PPoPP 2013
Qualcomm Research!
Zoomm: Page Load Performance
0 10 20 30 40 50 60
Zoomm
WebKit
Seconds
CNN BBC Yahoo Guardian
NYT Facebook Engadget QQ
HTC Jetstream, MSM8660, 2-‐core, 1.5GHz, Aug 2012
19
Qualcomm Research!
The Smartphone of 2010
• All applica-ons run on a single-‐core Qualcomm Snapdragon processor at 1GHz
Core
App 1
App 2
App 3
App 4
21
Qualcomm Research!
The Smartphone of 2013
• Mul-ple applica-ons can simultaneously run on a quad-‐core Snapdragon at 2+GHz
Core 1
Core 2
Core 3
Core 4
App 1
App 2
App 3
App 4
App 3
22
Qualcomm Research!
What is Qualcomm MARE?
• MARE is a programming model and a run-me system that provides simple yet powerful abstrac-ons for parallel, power-‐efficient sodware – Simple C++ API allows developers to express concurrency – User-‐level library that runs on any Android device, and on Linux, Mac
OS X, and Windows plaHorms
• The goal of MARE is to reduce the effort required to write apps that fully u-lize heterogeneous SoCs
23
Qualcomm Research!
MARE Workflow
• Understand algorithms • Par--on algorithms into independent units of work • Setup task dependencies • Link with MARE Run-me
Focus on your application and not on the hardware
MARE Runtime
MARE Application Algorithms
Qualcomm Research!
Qualcomm MARE API Concepts
• Tasks are units of work that can be asynchronously executed • Groups are sets of tasks that can be canceled or waited on
25
Qualcomm Research!
Advantages of using Qualcomm MARE
Simple Produc-ve Efficient Tasks are a natural way to express parallelism
Focus on application logic, not on thread management
Task dependences allow the MARE runtime to perform more intelligent scheduling decisions
26
Qualcomm Research!
Hello World!
#include <stdio.h> #include <mare/mare.h> int main() { mare::runtime::init(); // Initialize MARE auto hello = mare::create_task([] {printf("Hello ”);}); // Create task auto world = mare::create_task([] {printf(“World!”);}); // Create task hello >> world; // Set dependency mare::launch(hello); // Launch hello task mare::launch(world); // Launch world task mare::wait_for(world); // Wait to complete runtime::shutdown(); // Shutdown MARE return 0; }
27
Qualcomm Research!
Parallel Programming PaLerns in MARE
• MARE offers several commonly used parallel paLerns: – mare::pfor_each processes the elements of a collec-on – mare::pscan performs an in-‐place parallel prefix opera-on for the
elements of a collec-on. – mare::transform performs a map opera-on on the elements of a
collec-on
• Programmers can also create their own paLerns using the MARE API
• Parallel data structures: concurrent queues, hash tables, etc.
28
Qualcomm Research!
Bullet Physics Paralleliza-on using MARE
• Three major components: about 75% of total execu-on -me
• Constraint Solver – Finds non-‐interac-ng groups of objects and solves as independent tasks – Coarse-‐grained task-‐paralleliza-on
• Rendering – Very coarse-‐grained task-‐paralleliza-on: offload to a separate thread to allow for
con-nuous simula-on
• Narrow Phase Collision Detec-on – Fine-‐grained data-‐paralleliza-on
• About 2x FPS improvement on typical mobile devices
– Drama-cally improves the smoothness of rendering and overall user experience
29
Qualcomm Research!
Bullet Physics Paralleliza-on
0
10
20
30
40
50
60
70
80
0 200 400 600 800 1000 1200
Fram
es Per Secon
d
Frame Number
MARE Serial
30
Qualcomm Research!
How can You try MARE?
• Stay tuned! • Planning preview availability on Qualcomm Developer Network
– Fall 2013 Beta release for the mul-core version • We Want Your Feedback!
– What scenarios will you use it for? – Usability and features of the library
31
Qualcomm Research!
Acknowledgments
• Manticore team – Pablo Montesinos Ortego, Michael Weber, Wayne Piekarski, Behnam
Robatmili, Dario Suarez-Gracia, Vrajesh Bhavsar, Jimi Xenidis, Han Zhao, Kishore Puskuri
• Interns – Madhukar Kedlaya, Christian Delozier, Freark Van Der Berg,
Christoph Kershbaumer • Former members
– Mehrdad Reshadi, Seth Fowler, Alex Shye, Dilma DaSilva • Qualcomm
– Nayeem Islam, Mark Bapst, Charles Bergan, Matt Grob, Ben Gaster • MulticoreWare
– Wen-Mei Hwu, Mark Allender, Kevin Wu
32
Qualcomm Research!
Challenges
33
Power and performance Programmability
Custom hardware for power efficiency. Heterogeneity is the game. Expertly designed and optimized frameworks that hide hardware complexity
Tools and languages to enable programmers to understand and express concurrency and application semantics Composable building blocks
Mobile has exci-ng applica-ons and a huge impact
Qualcomm Research!
Legal Disclaimers
All opinions expressed in this presentation are mine and may not necessarily reflect those of my employer. Qualcomm and Snapdragon are trademarks of Qualcomm Incorporated, registered in the United States and other countries. All Qualcomm Incorporated trademarks are used with permission. Other products and brand names may be trademarks or registered trademarks of their respective owners.
34