
Filippo Spiga, Arm Research

15/09/2019

Arm Neoverse

© 2019 Arm Limited

An exciting road ahead

1991 → 2017: 26 years, 100 billion chips shipped.
2017 → 2021: 4 years, another 100 billion chips shipped.

The Next 100 Billion in 4 Years

Another 100 billion chips are projected to ship from 2017 – 2021.

Forecast mix of Arm chips (Classic, Cortex-A, R, M):
• Mobile and Consumer Electronics: 45%
• Embedded and Automotive: 40%
• Infrastructure: 15%

Distributing Intelligence from Edge to Cloud

• On-device learning for enhanced user privacy
• Compute performance to deliver a high-fidelity world
• Real-time inference for autonomous systems
• Security and privacy for your data
• 4K, HDR and 5G for more human-like interfaces

Today’s compute model

[Diagram: central data centers create and distribute media content over the 4G network; billions of people consume it]

Data consumption is driving future designs

[Diagram: trillions of devices generate massive amounts of data over 5G; edge nodes filter and react, making local decisions; cloud data centers analyze and store critical data]

Transforming infrastructure

[Diagram: 5G edge access → edge cloud (gateways, uCPE) → cloud data center]

• Edge access (5G network): value-added microservices; data privacy; real-time decisions made at the source.
• Edge cloud (5G network): content cache; cloud application deployment to meet latency targets; network analytics and management.
• Core cloud: analytics and storage of critical data.

Motivating the Edge

• Latency
• Bandwidth constraints
• Security
• Privacy

The Cloud-to-Edge Infrastructure Foundation for a World of 1T Intelligent Devices

• High performance, secure IP and architectures
• Diverse solutions and ecosystem
• Scalable from hyperscale to the edge

Each generation, faster performance

~30% faster performance and new features per generation:
• Cosmos platform: 16nm (A72, A75)
• Ares platform: 7nm (N1, E1), today
• Zeus platform: 7+nm
• Poseidon platform: 5nm

Scalable system solutions from cloud to edge

                  Cloud data center             Edge
Arm CPUs:         128 big + 256 data plane      4
Bandwidth:        1 TB/s                        20 GB/s
System cache:     128 MB                        0 MB
HBM stacks:       8                             0
DDR channels:     8                             1
I/O:              8ch DDR, PCIe, 100GbE, CCIX   1ch DDR, 10G radio

[Diagram: edge nodes connected over 5G to cloud data centers]

SVE and beyond

Addressing Moore’s Law

• Can we use the additional transistors to unlock more CPU performance?
  • By processing more instructions or more data in parallel per cycle.
  • New microarchitecture can extract more Instruction-Level Parallelism (ILP).
  • But there are limits to this hardware magic.
• Could new architecture allow us to express greater parallelism in our code?
  • Tackling the unbending nature of Amdahl’s Law requires far more parallelisation.
  • But without needing to rewrite the world’s software.

Scalable Vector Extension (SVE), recap
A vector extension to the Armv8-A architecture with some major new features:

• Gather-load and scatter-store: load a single register from several non-contiguous memory locations.
• Per-lane predication: operations work on individual lanes under control of a predicate register.
• Predicate-driven loop control and management: eliminate scalar loop heads and tails by processing partial vectors.
• Vector partitioning and software-managed speculation: first-faulting load instructions allow memory accesses to cross into invalid pages.
• Extended floating-point horizontal reductions: in-order and tree-based reductions trade off performance and repeatability.

[Diagrams: a predicated vector add where only lanes enabled by the predicate register are updated; in-order versus tree-based reduction of 1 + 2 + 3 + 4; and predicate generation for a "for (i = 0; i < n; ++i)" loop using INDEX and CMPLT to mask off lanes at and beyond n]

Arm Instruction Emulator (ArmIE)
Develop your user-space applications for future hardware today

• Start porting and tuning for future architectures early
  • Reduce time to market
  • Save development and debug time with Arm support
• Run 64-bit user-space Linux code that uses new hardware features on current Arm hardware
  • SVE support available now
  • Tested with the Arm Architecture Verification Suite (AVS)
• Runs at close to native speed, with commercial support
  • Integrates with DynamoRIO, allowing arbitrary instrumentation extensions
  • Emulates only unsupported instructions
  • Integrated with other commercial Arm tools, including the compiler and profiler
  • Maintained and supported by Arm for a wide range of Arm-based SoCs


Applying ArmIE methodology to workloads

• Compile application with SVE-capable compiler and run it through ArmIE:

$ armie -msve-vector-bits=512 -i libinscount_emulated.so -- ./sve_app

• (Optional) Use Region-of-Interest (RoI) markers in the code to delimit regions of interest

• Select between several SVE-ready instrumentation clients

• Successfully applied to various popular mini-apps and benchmarks (e.g. HPCG)

Arm Community Blog: “Emulating SVE on existing Armv8-A hardware using DynamoRIO and ArmIE”, Miguel Tairum -- http://bit.ly/2wN4P6M

Arm Community Blog: “Parallelizing HPCG's main kernels”, Daniel Ruiz -- http://bit.ly/2ZtstSb
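End to end, the methodology above might look like the following. The `armie` invocation is taken from the slide; the `armclang` line is an assumed build step (any SVE-capable compiler works).

```shell
# Build for SVE with an SVE-capable compiler (assumed step).
armclang -O3 -march=armv8-a+sve -o sve_app sve_app.c

# Run under ArmIE on existing Armv8-A hardware, emulating 512-bit
# vectors and counting emulated SVE instructions (from the slide).
armie -msve-vector-bits=512 -i libinscount_emulated.so -- ./sve_app
```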

Fugaku (RIKEN R-CCS, Japan)


Ookami (Stony Brook, USA)

"Three perspectives on message passing" - Robert Harrison, Director of the Institute of Advanced Computational Science (IACS) and Brookhaven Computational Science Center (CSC) -- https://www.youtube.com/watch?v=WkepRUw0ri0

Scalable Vector Extension v2 (SVE2)
Scalable Data-Level Parallelism (DLP) for more applications

Built on the SVE foundation:
• Scalable vectors with hardware choice from 128 to 2048 bits.
• Vector-length agnostic programming for “write once, run anywhere”.
• Predication and gather/scatter allow more code to be vectorized.
• Tackles some obstacles to compiler auto-vectorization.

Scaling single-thread performance to exploit long vectors:
• SVE2 adds NEON™-style fixed-point DSP/multimedia plus other new features.
• Performance parity and beyond with classic NEON DSP/media SIMD.
• Tackles further obstacles to compiler auto-vectorization.

Enables vectorization of a wider range of applications than SVE:
• Multiple use cases in Client, Edge, Server and HPC: DSP, codecs/filters, computer vision, photography, game physics, AR/VR, networking, baseband, database, cryptography, genomics, web serving.
• Improves competitiveness of Arm-based CPUs vs proprietary solutions.
• Reduces software development time and effort.


Announced by Nigel Stephens, Arm Fellow, at Linaro Connect BKK, April 2019

SVE2 enhancements

▪ NEON-style “DSP” instructions

• Traditional NEON fixed-point, widening, narrowing & pairwise ops

• Fixed-point complex dot product, etc. (LTE)

• Interleaved add w/ carry (wide multiply, BigNums)

• Multi-register table lookup (LTE, CV, shuffle)

• Enhanced vector extract (FIR, FFT)

▪ Cross-lane match detect / count

• In-memory histograms (CV, HPC, sorting)

• In-register histograms (CV, G/S pointer de-alias)

• Multi-character search (parsers, packet inspection)

▪ Non-temporal Gather / Scatter

• Explicit cache segregation (CV, HPC, sorting)

▪ Bitwise operations

• PMULL32→64, EORBT, EORTB (CRC, ECC, etc.)

• BCAX, BSL, EOR3, XAR (ternary logic + rotate)

▪ Bit shuffle

• BDEP, BEXT, BGRP (LTE, compression, genomics)

▪ Cryptography

• AES, SM4, SHA3, PMULL64→128

▪ Miscellaneous vectorisation

• WHILEGE/GT/HI/HS (down-counting loops)

• WHILEWR/RW (contiguous pointer de-alias)

• FLOGB (other vector trig)

▪ Kernel support: only ID-register changes needed in an SVE-aware Linux kernel

Transactional Memory Extension (TME)
Scalable Thread-Level Parallelism (TLP) for multi-threaded applications

Hardware Transactional Memory (HTM) for the Arm architecture:
• Improved competitiveness with other architectures that support HTM.
• Strong isolation between threads.
• Failure atomicity.

Scaling multi-thread performance to exploit many-core designs:
• Database.
• Network dataplane.
• Dynamic web serving.

Simplifies software design for massively multi-threaded code:
• Supports Transactional Lock Elision (TLE) for existing locking code.
• Low-level concurrent access to shared data is easier to write and debug.


Announced by Nigel Stephens, Arm Fellow, at Linaro Connect BKK, April 2019

Seriously committed to High Performance

• Arm is committed to enabling innovation across the whole compute continuum (including HPC!).
• New ways to scale performance and exploit additional transistors as Moore’s Law slows.
  • Extracting more parallelism from existing software.
• SVE2: improved auto-vectorization, with support for DSP/media hand-coded SIMD.
  • Scalable vectorization for increased fine-grain Data-Level Parallelism (DLP).
  • More work done per instruction.
• TME: easier lock-free programming for lightly-contended shared data structures.
  • Scalable concurrency to increase coarse-grain Thread-Level Parallelism (TLP).
  • More work done per thread.
• SVE2 and TME may combine for even greater performance scaling.
  • Tackling Amdahl’s Law on multiple fronts with a mix of DLP and TLP in multi-threaded applications.

These new technologies are not yet part of any announced product roadmap.

What we are up to in Research / SLSS

• Collaborative HPC activities with the newly established Centre of Excellence (Filippo Spiga)
  • Workload evaluation continues with gem5/ArmIE while awaiting real SVE hardware
  • Exploring other computational areas (e.g. genomics)
• Applied Analytics and Irregular Applications (Doug Joseph)
  • Funded projects around unsupervised and semi-supervised learning
  • Graph analytics
• High Performance Networking and Direct I/O Compute (Pavel Shamis)
  • Co-design to tighten the coupling between off-chip interconnect and compute elements
• Reliability for large-scale Arm-based systems (Reiley Jeyapaul)
  • Involvement in MontBlanc2020
• Edge Computing, Edge-to-Cloud and Smarter Cities (Eric Van Hensbergen)
• Enable scalable Deep Learning training on Arm using SVE (Filippo Spiga)
  • Enablement across the full stack, from libraries to automatic generation of optimized operators
  • Is DL/AI a killer app for SVE?

Thank You · Danke · Merci · 谢谢 · ありがとう · Gracias · Kiitos · 감사합니다 · धन्यवाद · شكرًا · תודה

© 2019 Arm Limited