An Evaluation of an Energy Efficient Many-Core SoC with ... · 4 Outline •Introduction •Face Detection using Joint Haar-Like Features •Architecture of Energy Efficient Many-Core

Copyright 2013, Toshiba Corporation.

An Evaluation of

an Energy Efficient Many-Core SoC

with Parallelized Face Detection

Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and

Takashi Miyamori

Toshiba Corporation, Kawasaki, Japan

2

Executive Summary

• Future architecture will have many cores

• A key challenge : How to efficiently use them?

• We evaluated techniques to accelerate one

type of important application (face detection)

• Performance scales up to 64 cores

• Energy efficiency is 20x better than desktop

CPU

3

Outline

• Introduction

• Face Detection using Joint Haar-Like Features

• Architecture of Energy Efficient Many-Core SoC

• Issues in Implementing Parallelized Face Detection

• Implementation and Evaluation of Parallelized Face

Detection

– On the Single Cluster

– On the Dual Cluster

• Conclusion

4

Outline

• Introduction





Detection



• Conclusion

5

Two Key Trends in Embedded Systems

• Trend 1 : New applications (e.g. image recognition)

need more computing power while keeping low power

• Trend 2 : New architecture can enable much more

parallelism than before

6





parallelism than before Now : 500GOPS

Heterogeneous Multi-Core

Accelerator

ViscontiTM2

[ISSCC’12]

Accelerator

CPU CPU

CPU CPU

7







Accelerator

ViscontiTM2

[ISSCC’12]

Accelerator

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

Accelerator

Accelerator

Accelerator

Accelerator

Heterogeneous Many-Core Future : More than 1TOPS

Toshiba Many-Core

[VLSI Sympo. ’12]

8







Accelerator

ViscontiTM2

[ISSCC’12]

Accelerator

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

CPU CPU

Accelerator

Accelerator

Accelerator

Accelerator

Heterogeneous Many-Core Future : More than 1TOPS

Toshiba Many-Core

[VLSI Sympo. ’12]

Result : A need for efficient and scalable application

performance on many-core

9

Power and Performance Target of our Many-Core P

ow

er

Co

nsu

mptio

n [W

]

0.1

1

10

100

1000

10 100 1000

Performance [GOPS]

10


ow

er

Co

nsu

mptio

n [W

]

0.1

1

10

100

1000

10 100 1000

Performance [GOPS]

GHz cores

High Performance Multi & Many-Cores for HPC

Tilera ® Tile64

Intel® 80-Tile

Cell Broadband EngineTM

Intel ® SCC

11


ow

er

Co

nsu

mptio

n [W

]

0.1

1

10

100

1000

10 100 1000

Performance [GOPS]

GHz cores


Tilera ® Tile64

Intel® 80-Tile


Intel ® SCC

Less than 3W is needed

for embedded applications

3

12


ow

er

Co

nsu

mptio

n [W

]

0.1

1

10

100

1000

10 100 1000

Performance [GOPS]

GHz cores


Tilera ® Tile64

Intel® 80-Tile


Intel ® SCC



3

Energy Efficient Embedded Multi-Cores

Toshiba Multi-Core (ViscontiTM2)

ARM ® CortexTM-A5

Renesas RP-X

13


ow

er

Co

nsu

mptio

n [W

]

0.1

1

10

100

1000

10 100 1000

Performance [GOPS]

GHz cores


Tilera ® Tile64

Intel® 80-Tile


Intel ® SCC



3



ARM ® CortexTM-A5

Renesas RP-X

Our Target Area

14


ow

er

Co

nsu

mptio

n [W

]

0.1

1

10

100

1000

10 100 1000

Performance [GOPS]

GHz cores


Tilera ® Tile64

Intel® 80-Tile


Intel ® SCC



3



ARM ® CortexTM-A5

Renesas RP-X

Our Target Area

Toshiba Energy Efficient

Many-Core (64Core)

15

Many-Core Scalability

The Number of Cores

Perf

orm

an

ce

16


The Number of Cores

Perf

orm

an

ce

Ideal

17


The Number of Cores

Perf

orm

an

ce

Ideal

Actual?

18


The Number of Cores

Perf

orm

an

ce

Ideal

Actual?

Can we achieve good performance scaling-up

on face detection?

19

Outline

• Introduction





Detection



• Conclusion

20

Face Detection

21

Face Detection

1st

ROI

Check if a face

exists or not

25 pixels

25 pixels

ROI : Region of Interest

22

Face Detection

2nd

ROI

Check if a face

exists or not

25 pixels

25 pixels

2 pixels


23

Face Detection

3rd

ROI

Check if a face

exists or not

25 pixels

25 pixels

4 pixels


24

Face Detection

Nth

ROI

Check if a face

exists or not


25

Face Detection

Check if a face

exists or not

25 pixels

25 pixels

(N+1)th

ROI

2 pixels ROI : Region of Interest

26

Face Detection

Target

ROI


27

ROI

(Region Of Interest)

Joint Haar-Like Features [ICCV ‘05]

Compared to each threshold

If greater than : 1, otherwise : 0

1 1 0 Joint Haar-Like

Feature

• Extension to widely-used Viola and John’s Method

[CVPR ‘01] (using Haar-like features)

Haar-like feature : Difference of image intensities between

blue and red rectangles.

28

ROI

(Region Of Interest)

Joint Haar-Like Features [ICCV ‘05]

Compared to each threshold

If greater than : 1, otherwise : 0

1 1 0 Joint Haar-Like

Feature

• Extension to widely-used Viola and John’s Method

[CVPR ‘01] (using Haar-like features)

Haar-like feature : Difference of image intensities between

blue and red rectangles.

Eye is darker

than cheek

29

Classifier using Joint Haar-Like Features

Table

Table

Face?

Face?

Accumulate

Face or Not Face

Positions of features and tables are learned in advance

and stored in the dictionary

Possibility of face or not face

Weight of the feature Joint Haar-Like Features

30

Characteristics of Face Detection

• Face detection for each

ROI can be executed in

parallel

• There are a lot of ROIs

in an image

– 3M ROIs when image size

is 4000x3200

• A lot of coarse grain thread parallelism based

on ROIs

– Overhead of thread scheduling can be minimized

Many-core is good for face detection !

31

Outline

• Introduction





Detection



• Conclusion

32

Chip Micrograph and Features

Technology 40nm LP Process

Interconnect 8 metal (Cu)

Chip Size 15.0mm x 14.0mm

Cluster Size 7.4mm x 5.7mm

Transistors 87.5Million

Cluster

Frequency

333MHz, 1.1V

Package 1369-pin FCBGA

Cluster 0

DD

R3 I/F

D

DR

3 I/F

Cluster 1

15.0mm

14.0

mm

L2Cache Bank0

5.7

mm

7.4mm

L2Cache Bank1

L2 Cache Bank2

L2 Cache Bank3

Core

Reconfigurable

Engines

2MB L2 Cache

2MB L2 Cache

33

Chip Micrograph and Features

Technology 40nm LP Process

Interconnect 8 metal (Cu)

Chip Size 15.0mm x 14.0mm

Cluster Size 7.4mm x 5.7mm

Transistors 87.5Million

Cluster

Frequency

333MHz, 1.1V

Package 1369-pin FCBGA

Cluster 0

DD

R3 I/F

D

DR

3 I/F

Cluster 1

15.0mm

14.0

mm

L2Cache Bank0

5.7

mm

7.4mm

L2Cache Bank1

L2 Cache Bank2

L2 Cache Bank3

Core

Reconfigurable

Engines

2MB L2 Cache

2MB L2 Cache

34

Structure of Many-Core Cluster

• Tree-based NoC – Leaf nodes: Core

– Root nodes: L2 cache

banks

L2$（Root）

Core（Leaf）

Core（Leaf）

Router L2 Cache

512 KB

Bank0

L2 Cache

512 KB

Bank1

L2 Cache

512 KB

Bank2

L2 Cache

512 KB

Bank3

Core Core

Core Core

1

Core Core

Core Core

1

2

2

Core Core

Core Core

1

Core Core

Core Core

1

2

2

- Interrupt Controller

- Hardware Semaphores

Cluster Control Module

3 3 3 3

Core Core

Core Core

1

Core Core

Core Core

1

2

2

Core Core

Core Core

1

Core Core

Core Core

1

2

2

Four L2 Cache

Banks

35

Many-Core SoC Architecture

Video

In x 4 Peripherals

L2 Cache

2MB

Core x 32

Many-Core Cluster 0

Video

Out

SRAM 512KB

Reconfigurable

Engine x 2

External

Interface PCIe

X 1

Image

Recognition &

Processing

Accelerators

SRAM 512KB x 4

ARM

Cortex-A9

X 2

PCIe

X 1

10.7GB/s

Reconfigurable

Engine x 2

ARM

Cortex-A9

X 2

DDR3

Controller

X 2

L2 Cache

2MB

Core x 32

Many-Core Cluster 1

36

Core : Media Processing Block (MPB)

• 3-Way VLIW Processor

• L1 Instruction Cache: 32KB

• L1 Data Cache: 16KB

• 333 MHz

3-Way VLIW Processor (MPB)

Dec./RF

ALU

D$

Dec.

Cop. RF

ALU ALU

MUL

Acc. Acc.

I$

RF

32b RISC Core

2-way SIMD Coprocessor

RISC Core

SIMD Co-Processor

L1 Inst

Cache

32KB

L1 Data

Cache

16KB

Debug

Module

NoC I/F

Address Protect

Unit

Address Translate

Unit

Exploits multi-grain parallelism • Thread level by many cores

• Instruction level by VLIW architecture

• Data level by SIMD instructions

64b 64b

37

Outline

• Introduction





Detection



• Conclusion

38

Issues in implementing parallelized face detection

• High coarse-grain parallelism: Good for Parallelization

– There are enough ROIs to exploit by many cores

• Imbalanced workload: Bad for Processor Utilization

– The workload of an ROI where a face exists is higher than that of an

ROI without a face

Implementation of parallelized face-detection

– Minimize the number of threads in order to reduce synchronization cost

• Allocate one thread to one core

– Find a good thread partitioning with balancing workload of threads

– Reduce data bandwidth (L1$-L2$ and L2$-DDR3)

39

Outline

• Introduction





Detection



• Conclusion

40

Implementation on the Single Cluster

• We implemented the face detection with two

methods to allocate image to cores

– Allocating Cyclically

– Splitting Equally

41

(1) Allocating Cyclically

ROI Core0

Core1

Core31

Core2

Image

This way allocates lines to each core cyclically

Effective in balancing workload

42

(2) Splitting Equally

Core0

Core1

Core31

Core2

Image

Height

Height/32

This way divides the image evenly

Effective to reduce data size read by each core

43

Images for Evaluation • High Resolution Images (5.76-12.7Mp) including many faces

No. Resolution Number of Faces

0 4000x1440 30

1 3000x4082 37

2 4083x3062 78

3 4094x3107 148

4 3568x2568 9

5 3568x2568 10

Face Detection Result of Image 4

44

Evaluation Board

Many-Core SoC

(Fan-less Cooling)

I/O and switches for

evaluation

45

0

5

10

15

20

25

30

35

1 2 4 8 16 32 1 2 4 8 16 32

Rela

tive

Pe

rfo

rma

nce

Number of Cores

img.0

img.1

img.2

img.3

img.4

img.5

ideal

Relative Performance on Single Cluster

15.5x

30x

11x

21x

Allocating

Cyclically

Splitting

Equally

46

0

5

10

15

20

25

30

35

0 10 20 30 40

Ave

rag

e R

ela

tive

P

erf

orm

an

ce

The Number of Cores

AllocatingCyclically

SplittingEqually

ideal

Average Relative Performance on Single Cluster

11x

21x 15.5x

30x

With Allocating Cyclically,

performance scales up to 32 cores

Ideal

47

0

5

10

15

20

25

30

35

40

45

0 1 2 3 4 5 0 1 2 3 4 5

Tim

e (

se

c)

Image Number

FastestCore

SlowestCore

Execution Time of the Fastest and Slowest Cores

11x

1.1x

Allocating

Cyclically

Splitting

Equally

48

Processor Utilization

• Allocating Cyclically : 90 ~ 95%

• Splitting Equally : 55 ~ 75%

Low processor utilization deteriorates

the performance of Splitting Equally

0

20

40

60

80

100

0 1 2 3 4 5

Pro

ce

ss

or

Uti

liza

tio

n(%

)

Image Number


SplittingEqually

49

Bandwidth of L2 Cache and DDR3

• L1-L2 bandwidth is nearly the same

– L1 cache is not enough to store ROI line

• About L2-DDR3, Allocating Cyclically is better

– All cores access the small area at the same

0

0.2

0.4

0.6

0.8

1

1.2

L1-L2 L2-DDR3

Rela

tive A

mo

un

t o

f Tra

ns

ferr

ed

Data


SplittingEqually

50

Outline

• Introduction





Detection



• Conclusion

51

Implementation on Dual Cluster

• Each cluster has its own L2 cache and shares DDR3

– Because bandwidth is narrower than L1 and L2 cache, reducing

bandwidth between L2 cache and DDR3 is important

• We implemented the two ways

– Allocating Cyclically

– Bisection

MPB

L2 cache

L1Cache MPBx32 …

DDR3

MPB L1Cache

MPB

L2 cache

L1Cache MPBx32 …

MPB L1Cache

52

(1) Allocating Cyclically

Image

Core0

Core1

Core31

Core2

Cluster0

Core0

Core1

Core31

Core2

Cluster1

ROI

This way is the same as that of a single cluster

Effective in balancing workload

53

(2) Bisection

Image

Cluster0 Cluster1

Height/2

This way divides the image into two blocks

Each cluster processes each block

54

(2) Bisection

Image

Core0

Core1

Core31

Core2

Cluster0

Core0

Core1

Core31

Core2

Cluster1

ROI

ROI

This way divides the image into two blocks

Effective to reduce data size read by each cluster

In each block, Allocating Cyclically is used

55

0

10

20

30

40

50

60

70

0 1 2 3 4 5

Rela

tive

Pe

rfo

rma

nce

Image Number


Bisection

Performance of Dual Cluster (64 Cores)

Ideal (64x)

61x

42x

56

0

10

20

30

40

50

60

70

0 1 2 3 4 5

Rela

tive

Pe

rfo

rma

nce

Image Number


Bisection

Performance of Dual Cluster (64 Cores)

Ideal (64x)

61x

42x

By Allocating Cyclically,

performance scales up to 64 cores

57

0

2

4

6

8

10

12

14

16

18

0 1 2 3 4 5 0 1 2 3 4 5

Tim

e(s

ec)

Image Number

FastestCore

SlowestCore

Allocating

Cyclically Bisection

1.3x

2.8x

Execution Time of the Fastest and Slowest Cores

58

Processor Utilization

• Allocating Cyclically : 87 ~ 95%

• Bisection : 66 ~ 91%

Low processor utilization deteriorates

the performance of Bisection

0

20

40

60

80

100

0 1 2 3 4 5

Pro

ce

ss

or

Uti

liza

tio

n

(%)

Image Number


Bisection

59

DDR3 Bandwidth

Utilized bandwidth is 750MB/s ( only 7% of maximum (10.7GB/s ) )

Memory bandwidth is not bottleneck even when two clusters operate.

0

100

200

300

400

500

600

700

800

1 2 3 4 5 6

DD

R3 B

an

dw

idth

(M

B/s

)

Image Number

Bandwidth in Allocating Cyclically

60

0

500

1000

1500

2000

2500

1 2 4 8 16 32 64

Po

we

r (m

W)

The Number of Cores

Clusters

Bus

DDR3

IO

Other

Power Consumption

Typical Process, Room Temperature, using Allocating Cyclically

2Clusers

1.18W

SoC : 2.21W

Our many-core SoC achieves less than 3W

61

16.1

443.5

0

50

100

150

200

250

300

350

400

450

500

Many-core Core™-i7-3820

En

erg

y (

J)

Comparison with Desk-Top CPU

• Compared with Desk-Top CPU

(Core™-i7-3820: 3.6GHz, 4 Cores, 8 Threads)

TDP of Core™-i7-3820 (130W) is used for calculating energy

20x

better

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Many-core Core™-i7-3820

Rela

tive P

erf

orm

an

ce

Performance Energy

62

Conclusion

• Future architecture will have many cores

– A key challenge : How to efficiently use them?

• We evaluated the many-core SoC with parallelized face

detection

– Many-core is suited for the face detection because it

exploits ROI based coarse-grained parallelism efficiently

• Scale up by 30x (32 cores) to 60x (64 cores)

• Balancing workload is important

• Power consumption is only 2.21W under actual

workload : enables fan-less cooling

– Our many-core SoC is remarkably energy efficient in

image recognition applications

• 20x better than the desk-top CPU

63

Thank you!

Documents

An Evaluation of an Energy Efficient Many-Core SoC with ... · 4 Outline •Introduction •Face Detection using Joint Haar-Like Features •Architecture of Energy Efficient Many-Core