13
CONFIDENTIAL 1 Cortex-A72: Current State-of-the-Art Processor Aniket M. Saha Senior Product Manager, ARM

Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 2: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 2

Compelling single-threaded performance Large performance increase across all workloads including

integer, memory-intensive, crypto, floating point, etc.

Baseline microarchitecture similar to Cortex-A57

Significant advancements in power efficiency Re-optimized every logical block from Cortex-A57

Power reduction enables sustained operation at Fmax

Area reduction lowers costs and static power

Feature support for enterprise and mobile SoCs

Cortex-A72: State of the Art Processor

Page 3: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 3

1.9

2.6

Cortex-A72: ARM’s Highest Performance Processor

2016

Premium

2014

2015

x

x

Increase in sustained performance within

smartphone power budget 3.5x

Cortex-A15

28nm

1.6 GHz

Cortex-A57

20nm

2.0 GHz

Cortex-A57

14/16nm

2.3 GHz

Cortex-A72

14/16nm

2.5 GHz

Page 4: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 4

Cortex-A72: Next-Generation Performance

1

1

1

1

1

1.16

1.26

1.50

1.38

1.16

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6

Integer Compute

Floating Point

Memory

Crypto

Analytics Cortex-A72 Cortex-A57

Performance per cycle

(Relative )

Workloads include: SPECint06, SPECfp06, Stream, LMbench, Geekbench, Antutu, Minebench, AES/SHA/CRC kernels, and other targeted kernels

Page 5: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 5

Next generation solutions using ARM Cortex-A72

Page 6: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 6

Enabling Scalable Portfolio of Solutions

Cortex-A7

Cortex-A53

CCI-400

CCI-500

CCN-502

Cost-Efficient Power-Optimized

CCI-500

CCN-502

CCN-504

Cortex-A53

Cortex-A57

Mid-range Performance

CCN-508

CCN-512

Cortex-A53

Cortex-A57

Cortex-A72

High Performance Networking and Server

ARM Architecture

Page 7: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 7

DSPDSP

ACE

Network Interconnect

NIC-400

Flash

NIC-400

USB

Memory

Controller

DMC-520

x72

DDR4-3200

AHB

Snoop Filter1-32MB L3 cache

PCIe

10-40

GbE

DPI Crypto

CoreLink™ CCN-512 Cache Coherent Network

DSP SATA

Memory

Controller

DMC-520

x72

DDR4-3200

Cortex-A72

Memory

Controller

DMC-520

x72

DDR4-3200

Memory

Controller

DMC-520

x72

DDR4-3200

PCIe

DPI

I/O Virtualisation CoreLink MMU-500

SRAM

Network Interconnect

NIC-400

GPIO PCIe

GIC-500

Cortex CPU

or CHI

master

Cortex-A53

Cortex-A72

Cortex-A53

Cortex-A72

Cortex-A53

Cortex-A72

Cortex-A53

Cortex CPU

or CHI

master

Cortex CPU

or CHI

master

Cortex CPU

or CHI

master

®

Extensible Architecture for Heterogeneous Multi-core Solutions

Up to 4

cores per

cluster

Up to 12

coherent

clusters

Integrated

L3 cache

Up to 24 I/O

coherent

interfaces for

accelerators

and I/O

Peripheral address space

Heterogeneous processors – CPU, GPU, DSP and

accelerators Virtualized Interrupts

Up to Quad

channel

DDR3/4 x72

Page 8: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 8

Efficient Interconnect for Compelling Scalable Solutions

CCN-508

Syst

em

Perf

orm

ance

16 48+ 8 2 32 4

High-end Mid-range Cost-efficient

Approximate core count

CCN-504

CCN- 502

CCI-500

CCI-400

CCN-512

Level-3 Cache Size 0MB 32MB

DDR Bandwidth 20 GB/s 100 GB/s

Coherent ports 2 24

On-chip bandwidth 0.2 Tb/s 1.8 Tb/s

AMBA 5 CHI

AMBA 4 ACE

Page 9: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 9

Enterprise Compute Requirements

Specialised Processing

L1, Content Delivery, Security

Diverse requirements

Trend: Advanced modulation schemes

Need: DSPs, Accelerators

Data Plane Processing

Throughput driven, IO intensive

Deterministic performance

Trend: Higher packet rates

Need: Small Cores at Maximum Efficiency

Control Plane Processing

Fast Event Processing

Complex signalling

Trend: Evolving Software

Need: Efficient, High Compute Performance

MAC Scheduling

Real Time, Latency Driven

Multiple core processing

Trend: More Complexity (LTE-A, 5G)

Need: High Compute, Low Latency Performance

High Bandwidth, Low Latency Interconnect

Wide Range of Implementations from Few to Many Coherent Devices

Page 10: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 10

Cortex-A72: Compelling performance and throughput

0

0.2

0.4

0.6

0.8

1

1.2

Xeon-E5 2650 V3 Cortex-A57 Cortex-A72 Xeon-E5 2660 V3

20 Thread Workload 2.3

GH

z

2.7

GH

z

2.6

G

Hz

Rela

tive

perf

orm

ance

(Sp

ec2

K6 r

ate)

ARM Cortex-A57, Cortex-A72 deliver:

Competitive performance per thread

Similar overall performance throughput

At Much Lower Power

2.5

GH

z

Comparison for equivalent number of threads Platforms used:

Xeon-E5 2660 10C20T platform (measured) Xeon-E5 2650 10C20T platform (measured) Gcc compiler v4.9 with –o3 flag

Estimated result on example 20C ARM Cortex platforms with CCN-508, 28MB total L2+L3 cache

per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled to 20T based on modelled and empirical results Power estimated in 16nm based on ARM internal implementations for entire CPU+interconnect complex

(10 cores 20 threads) (20 cores 20 threads) (20 cores 20 threads) (10 cores 20 threads)

Page 11: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 11

Maximizing Throughput Density: per mm2, per Watt

0

0.2

0.4

0.6

0.8

1

1.2

Xeon-E5 2650 V3 Cortex-A57 Cortex-A72 Xeon-E5 2660 V3

20 Thread Workload 2.3

GH

z

2.7

GH

z

2.6

GH

z

Rela

tive

perf

orm

ance

(Sp

ec2

K6 r

ate)

Comparison for equivalent number of threads Platforms used:

Xeon-E5 2660 10C20T platform (measured) Xeon-E5 2650 10C20T platform (measured) Gcc compiler v4.9 with –o3 flag

Estimated result on example 20C ARM Cortex platforms with CCN-508, 28MB total L2+L3 cache

per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled to 20T based on modelled and empirical results Power estimated in 16nm based on ARM internal implementations for entire CPU+ interconnect

2.5

GH

z

105W

105W <30W

<30W

ARM Solution Benefits:

Less than 1/3rd the power for equivalent

performance

Allows more specialized computing or

significantly greater thread density in

the same power budget

(10 cores 20 threads) (20 cores 20 threads) (20 cores 20 threads) (10 cores 20 threads)

Page 12: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 12

Cortex-A72: Ideal for dense compute environments

Cortex-A72 is <20 % size

Single Broadwell CPU + 256K1 L2

~8mm2

Cortex-A72 MP4 + 2MB L23

~8mm2

Single Cortex-A72 core 2

~1.15mm2

A quad core Cortex-A72 with 8x L2 cache RAM is

the same size

1Source: Estimated from die-shot image provided by Intel at IDF 2014. 2/3Source: ARM trial implementations on TSMC 16FF+, using ARM Artisan libraries

Core

Page 13: Cortex-A72: Current State-of-the-Art Processor...with CCN-508, 28MB total L2+L3 cache per-core measurements on RTL with relevant memory system Gcc compiler v4.9 with –o3 flag Scaled

CONFIDENTIAL 13

Compelling performance and efficiency

Enterprise class scalable solutions

Enterprise ready feature set and ecosystem

Cortex-A72: Highest Performance ARM Cortex Processor