Copyright 2013, Toshiba Corporation.
An Evaluation of
an Energy Efficient Many-Core SoC
with Parallelized Face Detection
Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and
Takashi Miyamori
Toshiba Corporation, Kawasaki, Japan
2
Executive Summary
• Future architecture will have many cores
• A key challenge : How to efficiently use them?
• We evaluated techniques to accelerate one
type of important application (face detection)
• Performance scales up to 64 cores
• Energy efficiency is 20x better than desktop
CPU
3
Outline
• Introduction
• Face Detection using Joint Haar-Like Features
• Architecture of Energy Efficient Many-Core SoC
• Issues in Implementing Parallelized Face Detection
• Implementation and Evaluation of Parallelized Face
Detection
– On the Single Cluster
– On the Dual Cluster
• Conclusion
5
Two Key Trends in Embedded Systems
• Trend 1 : New applications (e.g. image recognition)
need more computing power while keeping low power
• Trend 2 : New architecture can enable much more
parallelism than before
– Now : 500GOPS, Heterogeneous Multi-Core (CPUs and Accelerators), Visconti™2 [ISSCC’12]
– Future : More than 1TOPS, Heterogeneous Many-Core (many CPUs and Accelerators), Toshiba Many-Core [VLSI Sympo. ’12]
Result : A need for efficient and scalable application performance on many-core
9
Power and Performance Target of our Many-Core
[Figure: Power Consumption [W] (0.1 to 1000) versus Performance [GOPS] (10 to 1000)]
• High Performance Multi & Many-Cores for HPC (GHz cores) : Tilera® Tile64, Intel® 80-Tile, Cell Broadband Engine™, Intel® SCC
• Less than 3W is needed for embedded applications
• Energy Efficient Embedded Multi-Cores : Toshiba Multi-Core (Visconti™2), ARM® Cortex™-A5, Renesas RP-X
• Our Target Area : Toshiba Energy Efficient Many-Core (64 Core)
15
Many-Core Scalability
[Figure: Performance versus The Number of Cores, showing the Ideal line and the Actual(?) curve]
Can we achieve good performance scaling-up on face detection?
20
Face Detection
[Figure sequence: a 25 pixels x 25 pixels ROI scans the image, and each ROI is checked for whether a face exists or not]
• 1st ROI : 25 pixels x 25 pixels
• 2nd ROI : shifted by 2 pixels
• 3rd ROI : shifted by 4 pixels
• Nth ROI
• (N+1)th ROI : shifted by 2 pixels
• Target ROI
ROI : Region of Interest
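The scan above can be sketched in a few lines. This is only an illustration, assuming raster order, a single scale, and a 2-pixel step both horizontally and vertically; the function name `roi_positions` and the tiny 31x29 image are hypothetical.

```python
from typing import Iterator, Tuple

ROI_SIZE = 25  # ROI edge length in pixels (from the slides)
STEP = 2       # shift between neighbouring ROIs (from the slides)

def roi_positions(width: int, height: int,
                  size: int = ROI_SIZE, step: int = STEP) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) of every ROI in raster order."""
    for y in range(0, height - size + 1, step):
        for x in range(0, width - size + 1, step):
            yield (x, y)

# Tiny example image so the whole list stays readable.
positions = list(roi_positions(31, 29))
print(positions[:3])   # first three ROIs along the top row
print(len(positions))  # number of ROIs in the 31x29 image
```

Each position is independent of the others, which is what makes the per-ROI check easy to distribute over cores.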
27
Joint Haar-Like Features [ICCV ‘05]
• Extension to the widely used Viola and Jones method [CVPR ‘01] (using Haar-like features)
• Haar-like feature : Difference of image intensities between blue and red rectangles
– Example : Eye is darker than cheek
• Each Haar-like feature in the ROI (Region Of Interest) is compared to its threshold
– If greater than : 1, otherwise : 0
• The resulting bits (e.g. 1 1 0) form one Joint Haar-Like Feature
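A minimal sketch of how such a joint feature could be computed, assuming an integral image (summed-area table) is used for the rectangle sums, as is common with Haar-like features. The rectangle names, thresholds and the toy ROI are hypothetical; the slides do not give the real dictionary values.

```python
def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_bit(ii, rect_a, rect_b, threshold):
    """1 if the intensity difference exceeds the threshold, otherwise 0."""
    diff = rect_sum(ii, *rect_a) - rect_sum(ii, *rect_b)
    return 1 if diff > threshold else 0

def joint_feature(ii, features):
    """Pack the bits of several Haar-like features into one integer."""
    index = 0
    for rect_a, rect_b, threshold in features:
        index = (index << 1) | haar_bit(ii, rect_a, rect_b, threshold)
    return index

# Toy 25x25 ROI: dark background with one bright "cheek" patch.
roi = [[10] * 25 for _ in range(25)]
for y in range(15, 20):
    for x in range(5, 10):
        roi[y][x] = 200

cheek, eye, brow = (5, 15, 5, 5), (5, 5, 5, 5), (15, 5, 5, 5)
features = [(cheek, eye, 1000), (cheek, brow, 1000), (eye, brow, 0)]
joint = joint_feature(integral_image(roi), features)
print(format(joint, "03b"))  # prints "110"
```

With the integral image, every rectangle sum costs four lookups regardless of rectangle size, so the per-ROI work is dominated by how many features are evaluated.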
29
Classifier using Joint Haar-Like Features
[Figure: each Joint Haar-Like Feature indexes a Table, the table outputs (Face?) are accumulated, and the sum decides Face or Not Face]
• Table : Possibility of face or not face
• Each table output is multiplied by the Weight of the feature
• Positions of features and tables are learned in advance and stored in the dictionary
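The table lookup and accumulation can be sketched as follows. This is an assumption-laden illustration: it treats each table entry as a +1 (face) or -1 (not face) vote and sums the weighted votes against a threshold of zero. The tables, weights and the function name `classify_roi` are hypothetical, not the learned dictionary.

```python
def classify_roi(joint_indices, tables, weights, threshold=0.0):
    """Accumulate weighted table outputs; a total above threshold means 'face'."""
    score = 0.0
    for index, table, weight in zip(joint_indices, tables, weights):
        score += weight * table[index]
    return score > threshold, score

# Two hypothetical 3-bit joint features, so each table has 8 entries.
tables = [
    [-1, -1, -1, -1, -1, -1, +1, +1],
    [+1, -1, +1, -1, +1, -1, +1, -1],
]
weights = [0.75, 0.25]

is_face, score = classify_roi([6, 1], tables, weights)
print(is_face, score)
```

Because the score is a running sum, an implementation may stop early for ROIs that are clearly not faces, which is one plausible source of the uneven per-ROI workload discussed later in the deck.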
30
Characteristics of Face Detection
• Face detection for each ROI can be executed in parallel
• There are a lot of ROIs in an image
– 3M ROIs when image size is 4000x3200
• A lot of coarse-grain thread parallelism based on ROIs
– Overhead of thread scheduling can be minimized
Many-core is good for face detection!
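The 3M figure can be checked with simple arithmetic, assuming a 25x25 ROI moved in 2-pixel steps in both directions at a single scale (the helper `roi_count` is hypothetical):

```python
def roi_count(width, height, size=25, step=2):
    """Number of ROI positions for a sliding window of the given size and step."""
    cols = (width - size) // step + 1
    rows = (height - size) // step + 1
    return cols * rows

print(roi_count(4000, 3200))  # 3156944, i.e. about 3M ROIs
```

Under these assumptions a 4000x3200 image yields 1988 x 1588 positions, which is in line with the number on the slide.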
32
Chip Micrograph and Features
Technology : 40nm LP Process
Interconnect : 8 metal (Cu)
Chip Size : 15.0mm x 14.0mm
Cluster Size : 7.4mm x 5.7mm
Transistors : 87.5Million
Cluster Frequency : 333MHz, 1.1V
Package : 1369-pin FCBGA
[Chip micrograph: 15.0mm x 14.0mm die with Cluster 0 and Cluster 1 (7.4mm x 5.7mm each, containing Cores and L2 Cache Bank0 to Bank3, 2MB L2 Cache per cluster), DDR3 I/F blocks and Reconfigurable Engines]
34
Structure of Many-Core Cluster
• Tree-based NoC
– Leaf nodes: Core
– Root nodes: L2 cache banks
• Four L2 Cache Banks (Bank0 to Bank3, 512 KB each)
• Cluster Control Module
– Interrupt Controller
– Hardware Semaphores
[Figure: 32 cores drawn in groups of four; routers labeled 1, 2 and 3 form the tree levels between the cores (leaves) and the four L2 cache banks (roots)]
35
Many-Core SoC Architecture
[Block diagram]
• Many-Core Cluster 0 : Core x 32, L2 Cache 2MB
• Many-Core Cluster 1 : Core x 32, L2 Cache 2MB
• ARM Cortex-A9 x 2
• Reconfigurable Engine x 2
• Image Recognition & Processing Accelerators
• SRAM 512KB x 4
• DDR3 Controller x 2 (10.7GB/s)
• External Interface : PCIe x 1
• Video In x 4, Video Out
• Peripherals
36
Core : Media Processing Block (MPB)
• 3-Way VLIW Processor
– 32b RISC Core
– 2-way SIMD Coprocessor
• L1 Instruction Cache: 32KB
• L1 Data Cache: 16KB
• 333 MHz
Exploits multi-grain parallelism
• Thread level by many cores
• Instruction level by VLIW architecture
• Data level by SIMD instructions
[Block diagram: RISC Core (Dec./RF, ALU, I$, D$), SIMD Co-Processor (Dec., Cop. RF, ALU x 2, MUL, Acc. x 2, 64b paths), Debug Module, NoC I/F, Address Protect Unit, Address Translate Unit]
38
Issues in Implementing Parallelized Face Detection
• High coarse-grain parallelism: Good for Parallelization
– There are enough ROIs to keep many cores busy
• Imbalanced workload: Bad for Processor Utilization
– The workload of an ROI where a face exists is higher than that of an ROI without a face
Implementation of parallelized face detection
– Minimize the number of threads in order to reduce synchronization cost
• Allocate one thread to one core
– Find a thread partitioning that balances the workload across threads
– Reduce data bandwidth (L1$-L2$ and L2$-DDR3)
40
Implementation on the Single Cluster
• We implemented the face detection with two methods of allocating the image to cores
– Allocating Cyclically
– Splitting Equally
41
(1) Allocating Cyclically
[Figure: ROI lines of the Image are assigned in turn to Core0, Core1, Core2, ..., Core31]
This method allocates lines to the cores cyclically
Effective in balancing workload
42
(2) Splitting Equally
[Figure: the Image (Height) is cut into 32 bands of Height/32, one each for Core0, Core1, Core2, ..., Core31]
This method divides the image evenly
Effective in reducing the data size read by each core
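The two allocation methods can be sketched as index arithmetic over ROI lines. This is a minimal illustration, not the authors' code; the function names are hypothetical, and a "line" here means one row of ROIs.

```python
def allocate_cyclically(num_lines, num_cores):
    """Line i goes to core i % num_cores."""
    return [list(range(core, num_lines, num_cores)) for core in range(num_cores)]

def split_equally(num_lines, num_cores):
    """Core c gets one contiguous band of about num_lines / num_cores lines."""
    bounds = [num_lines * c // num_cores for c in range(num_cores + 1)]
    return [list(range(bounds[c], bounds[c + 1])) for c in range(num_cores)]

# Small example: 8 lines on 4 cores.
print(allocate_cyclically(8, 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(split_equally(8, 4))        # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With cyclic allocation, neighbouring lines (which tend to contain the same face) land on different cores; with equal splitting, each core touches only its own band of the image.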
43
Images for Evaluation
• High Resolution Images (5.76-12.7Mp) including many faces
No.  Resolution  Number of Faces
0    4000x1440   30
1    3000x4082   37
2    4083x3062   78
3    4094x3107   148
4    3568x2568   9
5    3568x2568   10
[Figure: Face Detection Result of Image 4]
44
Evaluation Board
Many-Core SoC
(Fan-less Cooling)
I/O and switches for
evaluation
45
Relative Performance on Single Cluster
[Figure: Relative Performance (0 to 35) versus Number of Cores (1, 2, 4, 8, 16, 32) for img.0 to img.5 and the ideal line, one panel per allocation method]
• Allocating Cyclically : 15.5x (16 cores), 30x (32 cores)
• Splitting Equally : 11x (16 cores), 21x (32 cores)
46
Average Relative Performance on Single Cluster
[Figure: Average Relative Performance (0 to 35) versus The Number of Cores (0 to 40) for Allocating Cyclically, Splitting Equally and the Ideal line]
• Allocating Cyclically : 15.5x (16 cores), 30x (32 cores)
• Splitting Equally : 11x (16 cores), 21x (32 cores)
With Allocating Cyclically, performance scales up to 32 cores
47
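Execution Time of the Fastest and Slowest Cores
[Figure: Time (sec) (0 to 45) of the FastestCore and SlowestCore versus Image Number (0 to 5), one panel per allocation method]
• Allocating Cyclically : 1.1x difference between the slowest and fastest cores
• Splitting Equally : 11x difference between the slowest and fastest cores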
48
Processor Utilization
• Allocating Cyclically : 90 ~ 95%
• Splitting Equally : 55 ~ 75%
Low processor utilization degrades the performance of Splitting Equally
[Figure: Processor Utilization (%) (0 to 100) versus Image Number (0 to 5) for Allocating Cyclically and Splitting Equally]
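A toy model shows why the partitioning changes utilization. It assumes utilization means the average busy time of all cores divided by the time of the slowest core, which is one common definition and not necessarily the one used for the measurements. The per-line costs are invented for illustration.

```python
def core_loads(line_cost, assignment):
    """Total cost handled by each core for a given line assignment."""
    return [sum(line_cost[i] for i in lines) for lines in assignment]

def utilization(loads):
    """Average busy time divided by the slowest core's time."""
    return sum(loads) / (len(loads) * max(loads))

NUM_CORES = 4
# Hypothetical per-line cost: lines 4..7 contain faces and cost 5x more.
line_cost = [1, 1, 1, 1, 5, 5, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1]

cyclic = [list(range(c, 16, NUM_CORES)) for c in range(NUM_CORES)]
blocks = [list(range(4 * c, 4 * c + 4)) for c in range(NUM_CORES)]

print(utilization(core_loads(line_cost, cyclic)))  # 1.0
print(utilization(core_loads(line_cost, blocks)))  # 0.4
```

In this toy case the expensive lines all fall into one band, so equal splitting leaves three cores idle most of the time while cyclic allocation spreads the face lines over every core.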
49
Bandwidth of L2 Cache and DDR3
• L1-L2 bandwidth is nearly the same for both methods
– L1 cache is too small to hold an ROI line
• For L2-DDR3, Allocating Cyclically is better
– All cores access the same small area at the same time
[Figure: Relative Amount of Transferred Data (0 to 1.2) for L1-L2 and L2-DDR3, comparing Allocating Cyclically and Splitting Equally]
51
Implementation on Dual Cluster
• Each cluster has its own L2 cache, and both clusters share DDR3
– Because DDR3 bandwidth is narrower than that of the L1 and L2 caches, reducing traffic between L2 cache and DDR3 is important
• We implemented two methods
– Allocating Cyclically
– Bisection
[Figure: two clusters, each with MPB x 32 (one L1 Cache per MPB) and its own L2 cache, both connected to the shared DDR3]
52
(1) Allocating Cyclically
[Figure: ROI lines of the Image are assigned in turn to Core0, Core1, Core2, ..., Core31 of Cluster0 and Core0, Core1, Core2, ..., Core31 of Cluster1]
This method is the same as that used on a single cluster
Effective in balancing workload
53
(2) Bisection
[Figure: the Image is split at Height/2 into two blocks, one for Cluster0 and one for Cluster1; within each block, ROI lines are assigned in turn to Core0, Core1, Core2, ..., Core31]
This method divides the image into two blocks, and each cluster processes one block
Effective in reducing the data size read by each cluster
In each block, Allocating Cyclically is used
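Bisection can be sketched as a two-level mapping: first pick the cluster by image half, then assign lines cyclically inside that half. The function name and the choice to split exactly at the middle line are assumptions for illustration.

```python
def bisection(num_lines, cores_per_cluster=32):
    """Map (cluster, core) to the list of line indices it processes."""
    half = num_lines // 2
    blocks = [range(0, half), range(half, num_lines)]
    return {
        (cluster, core): list(block[core::cores_per_cluster])
        for cluster, block in enumerate(blocks)
        for core in range(cores_per_cluster)
    }

# Small example: 8 lines, 2 clusters with 2 cores each.
mapping = bisection(8, cores_per_cluster=2)
print(mapping[(0, 0)], mapping[(0, 1)])  # [0, 2] [1, 3]
print(mapping[(1, 0)], mapping[(1, 1)])  # [4, 6] [5, 7]
```

Each cluster then reads only its own half of the image from DDR3, at the cost of being sensitive to faces being concentrated in one half.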
55
Performance of Dual Cluster (64 Cores)
[Figure: Relative Performance (0 to 70) versus Image Number (0 to 5) for Allocating Cyclically and Bisection, with the Ideal (64x) line]
• Allocating Cyclically : 61x
• Bisection : 42x
By Allocating Cyclically, performance scales up to 64 cores
57
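Execution Time of the Fastest and Slowest Cores
[Figure: Time (sec) (0 to 18) of the FastestCore and SlowestCore versus Image Number (0 to 5), one panel per allocation method]
• Allocating Cyclically : 1.3x difference between the slowest and fastest cores
• Bisection : 2.8x difference between the slowest and fastest cores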
58
Processor Utilization
• Allocating Cyclically : 87 ~ 95%
• Bisection : 66 ~ 91%
Low processor utilization degrades the performance of Bisection
[Figure: Processor Utilization (%) (0 to 100) versus Image Number (0 to 5) for Allocating Cyclically and Bisection]
59
DDR3 Bandwidth
• Utilized bandwidth is 750MB/s (only 7% of the maximum, 10.7GB/s)
• Memory bandwidth is not a bottleneck even when two clusters operate
[Figure: DDR3 Bandwidth (MB/s) (0 to 800) versus Image Number, Bandwidth in Allocating Cyclically]
60
Power Consumption
[Figure: Power (mW) (0 to 2500) versus The Number of Cores (1, 2, 4, 8, 16, 32, 64), broken down into Clusters, Bus, DDR3, IO and Other]
• Typical Process, Room Temperature, using Allocating Cyclically
• 2 Clusters : 1.18W
• SoC : 2.21W
Our many-core SoC achieves less than 3W
61
Comparison with Desk-Top CPU
• Compared with Desk-Top CPU (Core™-i7-3820: 3.6GHz, 4 Cores, 8 Threads)
– TDP of Core™-i7-3820 (130W) is used for calculating energy
• Energy : 16.1 J (Many-core) versus 443.5 J (Core™-i7-3820), 20x better
[Figure: Relative Performance (0.0 to 1.2) and Energy (J) (0 to 500) for Many-core and Core™-i7-3820]
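The energy comparison follows from energy = power x time. The sketch below only reuses numbers printed on the slide (130W TDP, 443.5 J, 16.1 J); the run time it derives for the desktop CPU is implied by those numbers, not a separately reported measurement.

```python
def energy_joules(power_watts, seconds):
    """Energy consumed at constant power over a time interval."""
    return power_watts * seconds

I7_TDP_W = 130.0          # TDP used on the slide for the desktop CPU
I7_ENERGY_J = 443.5       # bar value from the slide
MANYCORE_ENERGY_J = 16.1  # bar value from the slide

i7_seconds = I7_ENERGY_J / I7_TDP_W       # run time implied by TDP and energy
ratio = I7_ENERGY_J / MANYCORE_ENERGY_J   # energy advantage of the many-core

print(f"implied desktop run time: {i7_seconds:.2f} s")
print(f"energy ratio: {ratio:.1f}x")
```

The ratio of the two bar values as printed comes out somewhat above the rounded 20x shown on the slide.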
62
Conclusion
• Future architecture will have many cores
– A key challenge : How to efficiently use them?
• We evaluated the many-core SoC with parallelized face
detection
– Many-core is suited for the face detection because it
exploits ROI based coarse-grained parallelism efficiently
• Performance scales up by 30x (32 cores) and 60x (64 cores)
• Balancing workload is important
• Power consumption is only 2.21W under actual
workload : enables fan-less cooling
– Our many-core SoC is remarkably energy efficient in
image recognition applications
• 20x better than the desk-top CPU
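As a rough check of the scaling claim, parallel efficiency is speedup divided by the number of cores. The sketch uses only the 30x and 60x figures quoted above; the helper name is hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / cores

for speedup, cores in [(30, 32), (60, 64)]:
    print(f"{cores} cores: {parallel_efficiency(speedup, cores):.1%}")
```

Both cases work out to 93.8% of ideal scaling.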
63
Thank you!