Cut Power Consumption by 5x
Without Losing Performance
A big.LITTLE Software Strategy
Klaas van Gend
FAE, Trainer & Consultant
2 | October 10, 2014 LINUXCON EUROPE 2014
The mandatory Klaas-in-a-Plane picture
Quad Core vs. Dual Core: Why isn’t it Twice as Fast?
The GHz race
Why GHz++ costs power^2
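The slide does not spell out the reasoning. A common first-order model for the dynamic power of CMOS logic, given here as background rather than as the speaker's own derivation, is:

```latex
P_{\text{dyn}} = \alpha \, C \, V^{2} \, f
```

Here \alpha is the switching activity factor, C the switched capacitance, V the supply voltage and f the clock frequency. Power depends on the square of the voltage, and a higher clock generally needs a higher supply voltage, so power rises faster than linearly with GHz.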
ARM big.LITTLE
“OK, heavy work costs power.
Let’s not waste power on light work…”
ARM playing it cool: big.LITTLE
Source: http://community.arm.com/groups/processors/blog/2013/06/18/ten-things-to-know-about-biglittle/
A7 vs A15
Cortex A7:
• Less silicon area
• Less work done per cycle
• Fewer cycles/second
• More power efficient
How to use big.LITTLE today
Source: http://community.arm.com/groups/processors/blog/2013/06/18/ten-things-to-know-about-biglittle/
Some available big.LITTLE hardware
• AllWinner A80
• Renesas automotive silicon
• Samsung Galaxy S4 for South-Korean market
• Samsung Galaxy S5 for South-Korean market
• Hardkernel ODROID-XU board (Exynos5), with built-in power measurement
Use Case: Chromium
Chrome / Chromium / ChromeShell
• Chromium: open source browser based on Blink (previously WebKit, originally KHTML)
• Google Chrome: closed-source browser based on Chromium
• ChromeShell: open source Chromium “browser” for Android
• Chrome for Android: closed-source browser for Android
Chromium workload visualized
Stages shown: Loading, Parsing, Layouting/Rendering, Painting, JavaScript, Canvas
HTML5 Canvas: “graphics device for JavaScript”
Parallelizing Canvas: “not as easy as it looks”
Canvas Parallelization - Performance Results

Benchmark on quad-core   Standard Blink   Parallelized Blink   Performance improvement
Flashcanvas perf         1.69 score       2.44 score           44%
Fc perf w/ alpha         1.04 score       1.52 score           50%
Guimark2 Vector          9.5 fps          13.3 fps             40%
Canvasmark ‘13           3475 score       4116 score           53%
Average improvement                                            47%
With parallelism you can improve performance of
even the most complex applications!
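As a cross-check, the improvement column can be recomputed from the printed scores. This is a minimal sketch: the function name `improvement_pct` and the dictionary layout are illustrative, not from the talk. Recomputed values need not match the slide's percentages exactly. For the Canvasmark row in particular, the printed scores give roughly 18%, so one of the printed figures there may have been garbled.

```python
# Scores as printed on the slide: (standard Blink, parallelized Blink).
results = {
    "Flashcanvas perf": (1.69, 2.44),
    "Fc perf w/ alpha": (1.04, 1.52),
    "Guimark2 Vector": (9.5, 13.3),
    "Canvasmark '13": (3475, 4116),
}


def improvement_pct(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


for name, (before, after) in results.items():
    print(f"{name}: {improvement_pct(before, after):.0f}%")
```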
Google Chrome
Google Chrome on Odroid-XU+E
Using Google’s Chrome (version 33 for Android)
• 2 cores active: 54% and 84%
• Power use A15+A7 cores: 2.374 Watts
• Test average: 9.44 fps
Our ChromeShell on Odroid-XU+E
Using our optimized ChromeShell:
• 3 A15 cores active: 59%, 63% and 38%
• Power use A15+A7 cores: 3.116 Watts
• Test average: around 14 fps
Canvas Parallelization - works even on ‘normal’ silicon like Qualcomm Snapdragon 800
• LG’s NEXUS 5 phone
• Quad core Qualcomm Snapdragon 800
• Phone heating up similarly in both cases
Default Chrome
Average: 7.12 fps
“Our” ChromeShell:
Average: 14.48 fps
Canvas Parallelization - Power Consumption on “Flashcanvas perf”

Benchmark            Standard Blink on A15+GPU   Parallelized Blink on quad-A7   Difference
No optimization      29 fps                      17 fps                          -40%
Performance          29 fps                      26 fps                          -10%
Power consumption    2.2 W                       0.4 W                           550%
Performance / Watt   13                          65                              490%
With parallelism and right chip choices
• you can get 5x power savings
• without losing performance!
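The performance-per-watt row follows directly from the fps and power rows of the table. The sketch below redoes that arithmetic; the helper name `perf_per_watt` is illustrative and the inputs are the slide's own figures.

```python
def perf_per_watt(fps, watts):
    """Frames per second delivered per watt of power drawn."""
    return fps / watts


standard = perf_per_watt(fps=29, watts=2.2)      # Standard Blink on A15+GPU
parallelized = perf_per_watt(fps=26, watts=0.4)  # Parallelized Blink on quad-A7

power_ratio = 2.2 / 0.4                     # how much less power the A7s draw
efficiency_ratio = parallelized / standard  # gain in fps per watt

print(f"standard:     {standard:.1f} fps/W")
print(f"parallelized: {parallelized:.1f} fps/W")
print(f"power ratio:  {power_ratio:.1f}x, efficiency ratio: {efficiency_ratio:.1f}x")
```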
Comparing performance / watt:
Using Google’s Chrome (version 33 for Android)
• 2 cores active: 54% and 84%
• Power use A15+A7 cores: 2.374 Watts
• Test average: 9.44 fps
Using our optimized ChromeShell:
• 3 A7 cores active: 73%, 80% and 44%
• Power use A15+A7 cores: 0.472 Watts
• Test average: 10.04 fps
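Another way to read these two measurements is energy per rendered frame, which is power divided by frame rate. This is a small sketch using only the figures above; the name `joules_per_frame` is illustrative.

```python
def joules_per_frame(watts, fps):
    """Energy spent per frame: watts are joules per second, so W / fps = J/frame."""
    return watts / fps


chrome = joules_per_frame(watts=2.374, fps=9.44)        # Google Chrome, A15 cores
chromeshell = joules_per_frame(watts=0.472, fps=10.04)  # optimized ChromeShell, A7 cores
saving = chrome / chromeshell

print(f"Chrome:      {chrome:.3f} J/frame")
print(f"ChromeShell: {chromeshell:.3f} J/frame")
print(f"Energy per frame reduced by a factor of {saving:.1f}")
```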
1x A15 or 4x A7?
1x A15 < 4x A7!
20000 MIPS
• More than twice the performance
• Less W
Back to big.LITTLE
Making these results work outside a lab
State of big.LITTLE in Linux - 1: What’s in the kernel today?
State of big.LITTLE in Linux - 2: What else is relevant?
• IKS (In-Kernel Switcher)
– First available in Linaro kernel trees
– Merged in 3.11 kernel
• Qualcomm / LG / etc. power daemons
– Throttle performance if cores overheat
– Usually “secret”
• Not-in-mainline schedulers:
– Linaro’s GTS (Global Task Scheduler),
– a.k.a. HMP (Heterogeneous Multi-Processing)
• Kernel Summit 2014 “Energy-Aware Scheduling Workshop”
Feedback loop
• We know when we want to have 4xA7 or 1xA15
• If we can tell the kernel, it can anticipate
– instead of noticing an increase in workload
– and by accident turning on the A15s
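One blunt way an application can already tell the kernel where to run is CPU affinity. The sketch below uses Python's `os.sched_setaffinity` (Linux only). The CPU numbering in `LITTLE_CPUS` and `BIG_CPUS` is a hypothetical assumption and differs per board and kernel. Affinity is a hard constraint rather than a workload hint, so this only approximates the kind of API the slide asks for.

```python
import os

# ASSUMPTION: hypothetical CPU numbering. Check the real board before use;
# under IKS only one set of logical CPUs may be visible at all.
LITTLE_CPUS = {0, 1, 2, 3}  # assumed: the four A7 cores
BIG_CPUS = {4}              # assumed: one A15 core


def pin_to_cluster(cluster_cpus):
    """Restrict this process to the requested CPUs, as far as they are available.

    Returns the set of CPUs the process is allowed to run on afterwards.
    """
    allowed = os.sched_getaffinity(0)  # pid 0 means the calling process
    target = allowed & set(cluster_cpus)
    if not target:
        return allowed  # none of the requested CPUs is usable; change nothing
    os.sched_setaffinity(0, target)
    return target


# Before starting the parallel canvas work, move to the LITTLE cluster.
applied = pin_to_cluster(LITTLE_CPUS)
print("running on CPUs:", sorted(applied))
```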
(Diagram: feedback loop with a setpoint)
Where to go?
• Qualcomm MARE: “feedback loop” in user space
– Research project
– Framework to aid parallelization
– Should assist kernel in scheduling/cpufreq
• Deadline scheduler: “feedback loop” in kernel space
– Merged in Linux 3.14
– Application sets SCHED_DEADLINE
– Application sets scheduling attributes
– Task repetition in microseconds
– Task start within repetition
– Task completion deadline within repetition
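To make the scheduling attributes concrete, the sketch below builds the `struct sched_attr` that an application would hand to the `sched_setattr()` system call. In mainline Linux the three time fields are runtime, deadline and period, expressed in nanoseconds, and the kernel requires runtime <= deadline <= period. The class name `DeadlineParams` and the 60 fps numbers are illustrative assumptions. The system call itself is left out because its number is architecture specific.

```python
import struct
from dataclasses import dataclass

SCHED_DEADLINE = 6  # policy number from <linux/sched.h>

# struct sched_attr, 48 bytes:
#   u32 size, u32 sched_policy, u64 sched_flags, s32 sched_nice,
#   u32 sched_priority, u64 sched_runtime, u64 sched_deadline, u64 sched_period
_SCHED_ATTR = struct.Struct("=IIQiIQQQ")


@dataclass(frozen=True)
class DeadlineParams:
    runtime_ns: int   # CPU time the task needs in each period
    deadline_ns: int  # that work must be done this long after the period starts
    period_ns: int    # how often the task repeats

    def validate(self):
        if not (0 < self.runtime_ns <= self.deadline_ns <= self.period_ns):
            raise ValueError("need 0 < runtime <= deadline <= period")

    def pack(self):
        """Return the bytes of a struct sched_attr requesting SCHED_DEADLINE."""
        self.validate()
        return _SCHED_ATTR.pack(
            _SCHED_ATTR.size, SCHED_DEADLINE, 0, 0, 0,
            self.runtime_ns, self.deadline_ns, self.period_ns,
        )


# Hypothetical renderer: 5 ms of work, due within 10 ms, repeating at 60 Hz.
frame = DeadlineParams(runtime_ns=5_000_000,
                       deadline_ns=10_000_000,
                       period_ns=16_666_667)
attr = frame.pack()
print(len(attr), "bytes ready for sched_setattr()")
```

Validating the parameters in user space mirrors the check the kernel performs, so a bad combination fails early with a readable error.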
Is parallelism going to stay?
Actually, is big.LITTLE going to stay???
• The GHz race has come to an end
– Now also for ARM
• The speed of light limits “clock domain size”
• Thus many clock islands on a die
– Multicore is just an “easy” way to improve performance
– At the cost of the programmer
– Who needs extra training
• ARM big.LITTLE
– Is a mechanism to skip heavy power consumption
– At the cost of more mm² silicon
– Is it worth it???
My ideal ARM-based design:
big: 1x A57
LITTLE: 4x A53
Why is no-one designing this chip?
Conclusions
Conclusion
• big.LITTLE works
• IFF
– Short bursts can be handled by one ‘big’ core
– Heavier workloads are parallelizable and run on clusters of LITTLEs
– APIs become available: programs must indicate what the workload will be
BTW: Chromium is parallelizable – we did it.
Vector Fabrics – the Company
• Founded February 2007
• Founding team
– Strong in SoC design and multi-core software
– Currently 15 FTE: 6 PhD, 7 MSc
• Protected technology
– 3 patents filed in US & Europe
• Recognition
– “Hot Startup” in EE Times Silicon 60, since 2011
– Selected by Gartner as “Cool vendor in Embedded Systems & Software” 2013
– Global Semiconductors Alliance award, March 2013
Contact Information
• Web:
www.vectorfabrics.com
• Email:
• Tel:
+31 40 8200960
• Address:
Vector Fabrics B.V.
Vonderweg 22
5616RM Eindhoven
The Netherlands
Thank You!
(drop your business card if you want the slides and the to-be-released whitepaper)
Klaas van Gend
FAE, Trainer & Consultant