Parallel Programming for Mobile Devices

Calin Cascaval, Sept. 10, 2013

Source: conferences.inf.ed.ac.uk/pact2013/slides/PACT2013Keynote.pdf

© 2013 Qualcomm Technologies, Inc.

Qualcomm Research

Outline

• Mobile computing market: what do users care about?

• Mobile computing landscape: what does the platform look like?

• Applications and developers: who is writing code for mobile?

• Browser wars revisited: why does it matter?

• Parallel heterogeneous programming made easier using MARE

Mobile Market

Mobile Market Size

[Figure: bar chart of market size in billions of US $ (axis 0 to 900) for Mobile, Desktop, Servers, and HPC in 2010, 2012, and 2017 (forecast). Chart annotation: > 20x. Source: IDC; HPC forecast from Intersect360.]

Size of mobile market roughly doubling every two years

What do users care about?

McKinsey’s Consumer and Shopping Insights

What about next morning?

McKinsey’s Consumer and Shopping Insights

The Internet of Everything

• Connectivity: An estimated 25B devices connected by 2020, talking to each other, making choices, and taking decisions
• Context: Use context awareness to filter the data that reaches every person: time, location, taste
• Control: Allow users to control and customize experience through new, mobile-centric interfaces

Enabling whole new classes of applications, where power is even more critical

What’s in a smartphone?

Anatomy of a Mobile SoC: Qualcomm Snapdragon 800

Source: http://www.qualcomm.com/snapdragon/processors/800

Mobile Constraints

• Energy
• Thermal

Mobile Applications

• Connectivity: Social networking, sharing: Media rich, data heavy
• Gaming: On device and multiplayer games with realistic graphics
• Browsing: Information gateway
• Imaging: Computational photography, search and processing of images and video

Mobile Software Development

Developers

• Think in terms of services and “features”
• Portable apps, performance not the main concern
• Use JavaScript frameworks

[Figure: block diagrams of the SoC compute units (DSP, GPU, hardware accelerators, and CPUs with VeNum units and an L2), placed along an axis from high power efficiency to low power efficiency.]

How can we easily write parallel applications that use all available cores in a battery-powered device?

Our Vision of a Mobile Software Stack

[Figure: software stack with the layers Native Apps, Web Apps, JavaScript frameworks, Browser Engine, Domain Specific Parallel Libraries, and MARE.]

• Programmability: MARE
  – Parallel, heterogeneous programming made easier
  – Power and performance optimizations

• Performance: Domain Specific libraries
  – Exploit domain knowledge to provide composable libraries for all programmers
  – Hide hardware complexity

• Portability: Zoomm Web App Engine
  – Use of concurrency to optimize execution of Web Apps
  – Hardware exploitation through the browser

Browsers: why do we care?

Web Browser Usage

• Information access
• Web Applications

89% Email, 85% Browsing

Source: Illuminas, 2013

Browser Execution Time Breakdown

[Figure: pie chart of ARM Cortex A9 execution time. Legend: rendering, javascript, layout, css, others, parsing. Slice labels as extracted: 19%, 31%, 21%, 20%, 5%, 4%.]

Average over the top 30 Alexa sites (May 2010), WebKit browser on a 400MHz Cortex A9, Linux

Insight: There is no one dominant component. Parallelization must address the entire browser structure!
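The insight can be made quantitative with Amdahl's law, a standard result that is not stated on the slide. As a rough sketch, let $p$ be the fraction of execution time spent in one component and $s$ the speedup achieved on that component alone. The value $p = 0.31$ below is the largest slice in the breakdown; applying the law to this workload is an assumption for illustration.

```latex
S(p, s) = \frac{1}{(1 - p) + \dfrac{p}{s}},
\qquad
\lim_{s \to \infty} S(0.31, s) = \frac{1}{1 - 0.31} = \frac{1}{0.69} \approx 1.45
```

Even an infinitely fast version of the single largest component would make the whole page load less than 1.5x faster, which is why no single component is worth parallelizing in isolation.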

Zoomm: Pervasive Concurrency

[Figure: Zoomm architecture diagram. Component labels: User Interface; Res. Manager; Image Decoding; CSS Parsing; Prefetching; Per-Page DOM Engine; Events; Timers; HTML Parsing; Styling; Rendering Engine (Render, Layout); Per-Page JavaScript Engine (Compilation, Execution). Edge labels: URL, Events, HTML code, JS code, Layout Tree.]

Cascaval et al.: Zoomm: a parallel web browser engine for multicore mobile devices, PPoPP 2013

Zoomm: Page Load Performance

[Figure: bar chart of page load time in seconds (axis 0 to 60), Zoomm vs. WebKit, for CNN, BBC, Yahoo, Guardian, NYT, Facebook, Engadget, and QQ.]

HTC Jetstream, MSM8660, 2-core, 1.5GHz, Aug 2012

Qualcomm MARE

Multicore Asynchronous Runtime Environment

The Smartphone of 2010

• All applications run on a single-core Qualcomm Snapdragon processor at 1GHz

[Figure: App 1 through App 4 all mapped onto a single core.]

The Smartphone of 2013

• Multiple applications can simultaneously run on a quad-core Snapdragon at 2+GHz

[Figure: App 1 through App 4 mapped onto Core 1 through Core 4, with App 3 appearing twice.]

What is Qualcomm MARE?

• MARE is a programming model and a runtime system that provides simple yet powerful abstractions for parallel, power-efficient software
  – Simple C++ API allows developers to express concurrency
  – User-level library that runs on any Android device, and on Linux, Mac OS X, and Windows platforms

• The goal of MARE is to reduce the effort required to write apps that fully utilize heterogeneous SoCs

MARE Workflow

• Understand algorithms
• Partition algorithms into independent units of work
• Set up task dependencies
• Link with MARE Runtime

Focus on your application and not on the hardware

[Figure labels as extracted: MARE Runtime; MARE Application Algorithms.]

Qualcomm MARE API Concepts

• Tasks are units of work that can be asynchronously executed
• Groups are sets of tasks that can be canceled or waited on
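The two concepts can be illustrated without MARE itself. The following is a minimal sketch in standard C++ only: a task is modeled with `std::async`, and a group is modeled as a vector of futures that is waited on as a whole. The names `TaskGroup`, `launch_chunk_sums`, and `wait_for_group` are hypothetical and are not part of the MARE API; cancellation is not modeled.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a group: one future per asynchronously launched task.
using TaskGroup = std::vector<std::future<long>>;

// Launch one task per chunk; each task sums its own slice of the input.
// The caller must keep `data` alive until the group has been waited on.
TaskGroup launch_chunk_sums(const std::vector<int>& data, std::size_t chunks) {
  if (chunks == 0) chunks = 1;
  TaskGroup group;
  const std::size_t step = (data.size() + chunks - 1) / chunks;
  for (std::size_t begin = 0; begin < data.size(); begin += step) {
    const std::size_t end = std::min(begin + step, data.size());
    group.push_back(std::async(std::launch::async, [&data, begin, end] {
      return std::accumulate(data.begin() + begin, data.begin() + end, 0L);
    }));
  }
  return group;
}

// Waiting on the group means waiting on every task in it.
long wait_for_group(TaskGroup& group) {
  long total = 0;
  for (auto& task : group) total += task.get();
  return total;
}
```

The point of the group abstraction is that the caller handles many tasks through a single handle instead of tracking each one.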

Advantages of using Qualcomm MARE

• Simple: Tasks are a natural way to express parallelism
• Productive: Focus on application logic, not on thread management
• Efficient: Task dependencies allow the MARE runtime to make more intelligent scheduling decisions

Hello World!

    #include <stdio.h>
    #include <mare/mare.h>

    int main() {
      mare::runtime::init();                                      // Initialize MARE
      auto hello = mare::create_task([] { printf("Hello "); });   // Create task
      auto world = mare::create_task([] { printf("World!"); });   // Create task
      hello >> world;                                             // Set dependency
      mare::launch(hello);                                        // Launch hello task
      mare::launch(world);                                        // Launch world task
      mare::wait_for(world);                                      // Wait to complete
      mare::runtime::shutdown();                                  // Shutdown MARE
      return 0;
    }
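For readers without access to MARE, the dependency between the two tasks can be approximated in standard C++. This is only a sketch of the ordering guarantee, not of how the MARE runtime schedules tasks: the second task blocks on the first task's future. The function name `hello_world` is hypothetical.

```cpp
#include <future>
#include <string>

// Standard-C++ analogue of a dependency edge between two tasks: the second
// task waits on the first task's future, so "Hello " is always appended
// before "World!".
std::string hello_world() {
  std::string out;
  std::future<void> hello =
      std::async(std::launch::async, [&out] { out += "Hello "; });
  std::future<void> world = std::async(std::launch::async, [&out, &hello] {
    hello.wait();      // dependency edge: run only after hello has finished
    out += "World!";
  });
  world.wait();        // wait for the final task, which implies hello is done
  return out;
}
```

Unlike a runtime that knows about dependencies, this version occupies a thread while the second task waits.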

Parallel Programming Patterns in MARE

• MARE offers several commonly used parallel patterns:
  – mare::pfor_each processes the elements of a collection
  – mare::pscan performs an in-place parallel prefix operation for the elements of a collection
  – mare::transform performs a map operation on the elements of a collection

• Programmers can also create their own patterns using the MARE API

• Parallel data structures: concurrent queues, hash tables, etc.
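The semantics of these patterns can be sketched in standard C++. The example below is an illustration under assumptions, not the MARE implementation or its signatures: `chunked_for_each` is a hypothetical helper that applies a function to every element using one thread per contiguous chunk, and `inclusive_prefix_sum` is a sequential reference showing what an in-place prefix operation computes.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper (not the MARE API): apply `fn` to every element,
// splitting the range into contiguous chunks that run on separate threads.
template <typename T, typename Fn>
void chunked_for_each(std::vector<T>& items, std::size_t num_threads, Fn fn) {
  if (num_threads == 0) num_threads = 1;
  const std::size_t step = (items.size() + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (std::size_t begin = 0; begin < items.size(); begin += step) {
    const std::size_t end = std::min(begin + step, items.size());
    workers.emplace_back([&items, begin, end, fn] {
      for (std::size_t i = begin; i < end; ++i) fn(items[i]);
    });
  }
  for (auto& w : workers) w.join();  // all chunks finish before returning
}

// Sequential reference for an in-place inclusive prefix sum: element i becomes
// the sum of elements 0..i. A parallel scan must produce the same result.
inline void inclusive_prefix_sum(std::vector<int>& items) {
  std::partial_sum(items.begin(), items.end(), items.begin());
}
```

For example, doubling each element of {1, 2, 3, 4, 5} with `chunked_for_each` and then calling `inclusive_prefix_sum` leaves {2, 6, 12, 20, 30} in the vector.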

Bullet Physics Parallelization using MARE

• Three major components: about 75% of total execution time

• Constraint Solver
  – Finds non-interacting groups of objects and solves them as independent tasks
  – Coarse-grained task parallelization

• Rendering
  – Very coarse-grained task parallelization: offload to a separate thread to allow for continuous simulation

• Narrow Phase Collision Detection
  – Fine-grained data parallelization

• About 2x FPS improvement on typical mobile devices
  – Dramatically improves the smoothness of rendering and overall user experience
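One common way to find non-interacting groups of objects is to treat contacts as edges of a graph and compute its connected components with a union-find structure. The sketch below assumes that approach for illustration; the slides do not say how the groups were actually computed, and the names `DisjointSets` and `find_islands` are hypothetical.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Minimal union-find over object indices.
class DisjointSets {
 public:
  explicit DisjointSets(std::size_t n) : parent_(n) {
    std::iota(parent_.begin(), parent_.end(), std::size_t{0});
  }
  std::size_t find(std::size_t x) {
    while (parent_[x] != x) {
      parent_[x] = parent_[parent_[x]];  // path halving keeps trees shallow
      x = parent_[x];
    }
    return x;
  }
  void unite(std::size_t a, std::size_t b) {
    const std::size_t root_a = find(a);
    const std::size_t root_b = find(b);
    parent_[root_a] = root_b;
  }

 private:
  std::vector<std::size_t> parent_;
};

// Group objects into islands: two objects share an island if a chain of
// contacts connects them. Islands share no contacts, so each one can be
// handed to the solver as an independent task.
std::vector<std::vector<std::size_t>> find_islands(
    std::size_t num_objects,
    const std::vector<std::pair<std::size_t, std::size_t>>& contacts) {
  DisjointSets sets(num_objects);
  for (const auto& c : contacts) sets.unite(c.first, c.second);

  std::vector<std::vector<std::size_t>> islands;
  // num_objects is used as the "no island assigned yet" sentinel.
  std::vector<std::size_t> island_of_root(num_objects, num_objects);
  for (std::size_t obj = 0; obj < num_objects; ++obj) {
    const std::size_t root = sets.find(obj);
    if (island_of_root[root] == num_objects) {
      island_of_root[root] = islands.size();
      islands.emplace_back();
    }
    islands[island_of_root[root]].push_back(obj);
  }
  return islands;
}
```

With five objects and contacts (0,1), (1,2), and (3,4), this yields two islands, {0, 1, 2} and {3, 4}, which could be solved concurrently.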

Bullet Physics Parallelization

[Figure: line chart of frames per second (axis 0 to 80) against frame number (0 to 1200), MARE vs. Serial.]

How can you try MARE?

• Stay tuned!
• Planning preview availability on Qualcomm Developer Network
  – Fall 2013 Beta release for the multicore version
• We want your feedback!
  – What scenarios will you use it for?
  – Usability and features of the library

Acknowledgments

• Manticore team
  – Pablo Montesinos Ortego, Michael Weber, Wayne Piekarski, Behnam Robatmili, Dario Suarez-Gracia, Vrajesh Bhavsar, Jimi Xenidis, Han Zhao, Kishore Puskuri
• Interns
  – Madhukar Kedlaya, Christian Delozier, Freark Van Der Berg, Christoph Kershbaumer
• Former members
  – Mehrdad Reshadi, Seth Fowler, Alex Shye, Dilma DaSilva
• Qualcomm
  – Nayeem Islam, Mark Bapst, Charles Bergan, Matt Grob, Ben Gaster
• MulticoreWare
  – Wen-Mei Hwu, Mark Allender, Kevin Wu

Challenges

• Power and performance
  – Custom hardware for power efficiency. Heterogeneity is the game.
  – Expertly designed and optimized frameworks that hide hardware complexity

• Programmability
  – Tools and languages to enable programmers to understand and express concurrency and application semantics
  – Composable building blocks

Mobile has exciting applications and a huge impact

Legal Disclaimers

All opinions expressed in this presentation are mine and may not necessarily reflect those of my employer. Qualcomm and Snapdragon are trademarks of Qualcomm Incorporated, registered in the United States and other countries. All Qualcomm Incorporated trademarks are used with permission. Other products and brand names may be trademarks or registered trademarks of their respective owners.
