Ultra Android: High Energy Efficiency Parallel Java Objects Processing via Object Request Broker on Heterogeneous Multi-core Processor
Takeshi Ohkawa* **, Yukoh Matsumoto**, Kenji Toda*
* National Institute of Advanced Industrial Science and Technology (AIST)
** TOPS Systems Corp.
2009/4/15-17 @ Yokohama
2009/4/16 TOPS&AIST
Acknowledgement
Joint Research Project: TOPS and AIST
Portions of this study were supported by Industrial Technology Research Grant Program in 2007 from New Energy and Industrial Technology Development Organization (NEDO) of Japan.
Object-Oriented Embedded Software Platform for Heterogeneous Multi-core Processor
Low-power Object Request Broker (ORB) engine for embedded systems
CORBA on FPGA
Proposal
ANDROID is:
◦ A software platform for mobile phones
◦ Proposed by the Open Handset Alliance in 2007
◦ Target platform: ARM and x86 (in 2009)
Ultra-ANDROID (Our Proposal) is:
◦ A technology to reduce power consumption (to 1/10) or enhance performance (x10) by employing a Heterogeneous Multi-core Processor and Distributed-Object technology
◦ Runs ANDROID applications as they are
Portions of this page are reproduced from or modifications based on work created and shared by the Android Open Source Project and used according to terms described in the Creative Commons 2.5 Attribution License.
Ultra-ANDROID
Android Software Architecture
Portions of this page are reproduced from work created and shared by the Android Open Source Project and used according to terms described in the Creative Commons 2.5 Attribution License.
[Figure: Android software stack, bottom to top: LINUX, Dalvik Java VM, Java API, Java App]
Why choose ANDROID?
Target for Object-Oriented Software Platform for Heterogeneous Multi-core Processor
◦ To ease the DIFFICULT multi-core programming
◦ Programming Model = Object-Oriented (not C)
ANDROID applications are written in Java
◦ There are many Java developers in the world
What happens if the Java code runs very fast on hetero multi-core without modifying the code?
◦ Independent from Instruction Set Architecture
   A novel microprocessor architecture/instruction set would be accepted -> BIG Impact
Object distribution on cores: Communication via ORB Engine
[Figure: CONCEPT proposal. Two panels, Single Core and Heterogeneous Multicore (function level distribution). Java objects (WebPage, Image, Letter, Display, DCT) are distributed over cores and communicate via the ORB Engine.]
Each core has a different Instruction Set Architecture and data structure to optimize processing
Use a common representation of data to communicate between cores
Key = ORB Engine + TOPSTREAM
“ORB Engine” is:
◦ A light-weight CORBA implementation in C (NEDO)
   Minimum functionality, small (12KB) and stand-alone
◦ CORBA is:
   Common Object Request Broker Architecture
   Remote method call between Objects via Message
◦ In the context of Object-Oriented Software
   Can connect Java/C/C++/Python/... anything
“TOPSTREAM” is:
◦ A Heterogeneous Multi-core Processor
   Which operates at very low frequency (typ. 50MHz) for high energy efficiency
   Rich inter-core communication resources
◦ Concurrent Communication/Processing by multi-bank register
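To make "remote method call between Objects via Message" concrete, the following is a minimal sketch of the idea only. It is not the ORB Engine and does not use the CORBA wire format; the class `ToyBroker`, its methods, and the target name `dct.scale` are all hypothetical. The fixed big-endian byte layout stands in for the "common representation of data" between cores.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Hypothetical toy broker (assumption): illustrates message-based method calls,
// not the ORB Engine and not the CORBA GIOP protocol.
class ToyBroker {
    // Maps "object.method" names to implementations living on the callee side.
    private final Map<String, IntUnaryOperator> servants = new HashMap<>();

    void register(String target, IntUnaryOperator impl) {
        servants.put(target, impl);
    }

    // Caller side: encode target name and argument into a byte message.
    // ByteBuffer is big-endian by default, so both sides agree on the layout.
    static byte[] marshal(String target, int arg) {
        byte[] name = target.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + name.length + 4);
        buf.putInt(name.length).put(name).putInt(arg);
        return buf.array();
    }

    // Callee side: decode the message, invoke the target, encode the reply.
    byte[] dispatch(byte[] request) {
        ByteBuffer buf = ByteBuffer.wrap(request);
        byte[] name = new byte[buf.getInt()];
        buf.get(name);
        int arg = buf.getInt();
        IntUnaryOperator impl = servants.get(new String(name, StandardCharsets.UTF_8));
        if (impl == null) {
            throw new IllegalArgumentException("unknown target");
        }
        return ByteBuffer.allocate(4).putInt(impl.applyAsInt(arg)).array();
    }

    // Caller side: decode the reply message.
    static int unmarshalReply(byte[] reply) {
        return ByteBuffer.wrap(reply).getInt();
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        broker.register("dct.scale", x -> x * 2);
        byte[] reply = broker.dispatch(marshal("dct.scale", 21));
        System.out.println(unmarshalReply(reply));
    }
}
```

Because caller and callee only share the byte layout, each side could in principle be written in a different language or run on a different instruction set, which is the property the slides rely on.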
Proposal of OLP: “Object Level Parallelism”
Another buzzword
OLP = DLP + TLP
◦ DLP: Data Level Parallelism
◦ TLP: Thread Level Parallelism
Because Object = Data + Action (Thread = Execution)
Measurement summary of various methods of inter-object call on the Android platform (ARM11 400MHz, Linux 2.6.25)
Call method               | Min. latency | Latency (1K data) | Note
Local call via interface  | 0.02 ms      | 0.2 ms            | For reference
AIDL                      | 2 ms         | 12 ms             | Android IDL
CORBA, UDP/IP with PC     | 0.3 ms       | -                 | C lang., original ORB Engine
CORBA, protocol only      | 0.05 ms      | -                 | C lang., original ORB Engine
Simple Scenario
[Figure: four core configurations.
 Single: Core 1 only.
 Homo4: Core 1 to Core 4 connected by a SwitchBox.
 Homo8: Core 1 to Core 8 connected by a SwitchBox.
 Hetero4: Core 1, Core 2 (specialized for B), Core 3 (specialized for B), Core 4 (specialized for C), connected by a SwitchBox.]
Example Pseudo Java Code
Data processing(Data first) {
    Data second = taskA.setup(first);       // task A
    Data third = taskB.heavyCalc(second);   // task B
    return taskC.summarize(third);          // task C
}
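A runnable version of the pseudo code might look like the sketch below. The `Data` class, the three task interfaces, and the lambda bodies in `main` are placeholders invented for illustration; only the method names `setup`, `heavyCalc`, and `summarize` and the A, B, C call order come from the slide.

```java
// Sketch only: types and task bodies are hypothetical stand-ins.
class ProcessingExample {
    // Minimal immutable payload passed between tasks (assumption).
    static final class Data {
        final long value;
        Data(long value) { this.value = value; }
    }

    // One interface per task object, so each could be placed on a different core.
    interface TaskA { Data setup(Data in); }
    interface TaskB { Data heavyCalc(Data in); }
    interface TaskC { Data summarize(Data in); }

    private final TaskA taskA;
    private final TaskB taskB;
    private final TaskC taskC;

    ProcessingExample(TaskA a, TaskB b, TaskC c) {
        this.taskA = a;
        this.taskB = b;
        this.taskC = c;
    }

    // Same call chain as the slide: A, then B, then C.
    Data processing(Data first) {
        Data second = taskA.setup(first);
        Data third = taskB.heavyCalc(second);
        return taskC.summarize(third);
    }

    public static void main(String[] args) {
        ProcessingExample p = new ProcessingExample(
            in -> new Data(in.value + 1),          // placeholder for setup
            in -> new Data(in.value * in.value),   // placeholder for the heavy part
            in -> new Data(in.value % 1000));      // placeholder for summarize
        System.out.println(p.processing(new Data(41)).value);
    }
}
```

The caller sees only interfaces, so whether `taskB` is a local object or a proxy that forwards the call to another core does not change `processing`.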
Inter-core object call model: Data Parallel, Homogeneous multicore
[Figure: timeline (Time axis) of tasks A, B1 to B4, and C on Core 1 to Core 4, shown for two successive runs; annotated Latency = Throughput.]
Observation: parallelization works fine sometimes
Inter-core object call model: Data Parallel, Homogeneous multicore
[Figure: timeline (Time axis) of tasks A, B1 to B8, and C on Core 1 to Core 8; annotated Latency = Throughput.]
Observation: Many-core causes a communication bottleneck
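One way to see why adding homogeneous cores eventually stops helping is a toy latency model. This is a sketch under assumptions that the slides do not state: core 1 issues the calls to the B sub-tasks one after another, each call costs a fixed `commPerCall` cycles (a hypothetical parameter), and B divides evenly over the cores. The task sizes 100, 2000, and 200 cycles are the ones used in this deck's estimation.

```java
// Toy model (assumption): not the estimation method used by the authors.
class HomoLatencyModel {
    static final double A = 100;   // sequential task A, cycles
    static final double B = 2000;  // parallel task B, cycles
    static final double C = 200;   // task C, cycles

    // Latency of one A -> B -> C run when B is split over `cores` cores and
    // every sub-task call costs commPerCall cycles, issued serially.
    static double latency(int cores, double commPerCall) {
        double communication = cores * commPerCall;
        double parallelPart = B / cores;
        return A + communication + parallelPart + C;
    }

    public static void main(String[] args) {
        for (int cores : new int[] {4, 8, 16}) {
            System.out.println(cores + " cores: " + latency(cores, 50.0) + " cycles");
        }
    }
}
```

With the hypothetical value `commPerCall = 50`, the model gives 1000 cycles for 4 cores, 950 for 8, and 1225 for 16: the shrinking B term is overtaken by the growing communication term.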
Heterogeneous Multi-core Setting
Configuration of the parallel object tasks used in the estimation of parallel efficiency:
Task  | Cycles      | Parallelism
A     | 100         | sequential
B     | 2000 (87%)  | parallel
C     | 200         | mixed
Total | 2300        |
Configuration of the heterogeneous four cores used in the estimation of parallel efficiency:
Core# | Specialized for | Operations per cycle (speedup)
1     | (generic)       | 1
2     | Task B          | 10x only for task B
3     | Task B          | 10x only for task B
4     | Task C          | 2x only for task C
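The two tables allow a back-of-the-envelope cycle count for the heterogeneous configuration. The sketch below assumes that B is split evenly between cores 2 and 3 and that communication costs nothing; both are simplifications made here, not figures from the slides. The class and method names are hypothetical.

```java
// Rough estimate (assumption): even split of B, zero communication cost.
class HeteroEstimate {
    static final double A = 100, B = 2000, C = 200;  // cycles from the task table

    // All three tasks back to back on one generic core.
    static double singleCore() {
        return A + B + C;
    }

    // Hetero4: A on core 1, halves of B on cores 2 and 3, C on core 4.
    static double hetero4() {
        double a = A / 1.0;           // generic core, 1 operation per cycle
        double b = (B / 2.0) / 10.0;  // each half of B runs 10x faster
        double c = C / 2.0;           // C runs 2x faster
        return a + b + c;
    }

    public static void main(String[] args) {
        System.out.println("single core: " + singleCore() + " cycles");
        System.out.println("hetero4:     " + hetero4() + " cycles");
        System.out.println("ratio:       " + singleCore() / hetero4());
    }
}
```

Under these assumptions the run takes 300 cycles instead of 2300, roughly a 7.7x reduction; any real communication cost would lower that ratio.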
Inter-core object call model: Task Parallel, Heterogeneous
[Figure: timeline (Time axis) of tasks A, B1, B2, and C on Core 1, Core 2 (for B), Core 3 (for B), and Core 4 (for C); annotated Latency = Throughput.]
Observation: Heterogeneous reduces computation cycles
Inter-core object call model with Object Migration: Heterogeneous
[Figure: timeline (Time axis) of tasks A, B1, B2, and C on Core 1, Core 2 (for B), Core 3 (for B), and Core 4 (for C); annotated Latency = Throughput.]
Observation: Object Migration reduces communication
Inter-core object call model with Object Migration: Heterogeneous + Pipelining
[Figure: timeline (Time axis) of two overlapping runs of tasks A, B1, B2, and C on Core 1, Core 2 (for B), Core 3 (for B), and Core 4 (for C); annotated Latency.]
Observation: OM enables pipelining without modifying code
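The benefit of pipelining can be illustrated with the usual pipeline timing formula: the first result appears after the sum of all stage times, and each further result after the longest stage time. In the sketch below the three stage times of 100 cycles each are an assumption derived from this deck's task sizes and core speedups (A = 100 on a generic core, B = 2000 split over two 10x cores, C = 200 on a 2x core) with communication ignored; the class name is hypothetical.

```java
// Pipeline timing sketch (assumption): communication cost is ignored.
class PipelineEstimate {
    // Total cycles to push `items` inputs through stages that overlap in time.
    static long pipelined(long[] stageCycles, long items) {
        long latency = 0;   // time for one item to pass all stages
        long interval = 0;  // time between results, set by the slowest stage
        for (long s : stageCycles) {
            latency += s;
            interval = Math.max(interval, s);
        }
        return latency + (items - 1) * interval;
    }

    // Total cycles when each item must finish before the next one starts.
    static long sequential(long[] stageCycles, long items) {
        long latency = 0;
        for (long s : stageCycles) {
            latency += s;
        }
        return latency * items;
    }

    public static void main(String[] args) {
        long[] stages = {100, 100, 100};  // A, B (two cores in parallel), C
        System.out.println("sequential: " + sequential(stages, 10) + " cycles");
        System.out.println("pipelined:  " + pipelined(stages, 10) + " cycles");
    }
}
```

For 10 inputs this gives 3000 cycles without overlap and 1200 cycles with it. Per-item latency stays at 300 cycles while throughput approaches one result per 100 cycles, which is why the pipelined slide no longer shows Latency = Throughput.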
Homogeneous vs. Heterogeneous Multi-core Processors
[Figure: two charts, Throughput and Latency; axis: Speedup.]
Conditions for both charts: total computation cycles = 2300; task size = 100, 2000 (500, 250), 200; communication = 1/10 of computation
Observation: Homogeneous many-core causes a communication bottleneck
Pipeline Processing by Object Migration
[Figure: two charts, Effective Latency and Speedup.]
Conditions for both charts: total computation cycles = 2300; task size = 100, 2000 (500), 200; communication = 1/10 of computation
Observation: Pipelining gives maximum performance without modifying code
Conclusion
1. The method call latency dominates the results. Increasing the number of cores causes an increase in communication delay
2. IPC has a greater impact than the number of cores
◦ IPC: Instructions Per Cycle
◦ Heterogeneous multi-core: Specialized Instruction Set
3. Pipeline parallelization is effective and object migration enables pipelining
Ultra-ANDROID technology enables 10x higher energy efficiency than ANDROID by using a Heterogeneous Multi-core Processor and Distributed Java Objects