
PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

Presented by Nils Weller

Hardware Acceleration for Data Processing Seminar, Fall 2017

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

Purpose:

- Processing-in-Memory (PIM) architecture to accelerate Convolutional Neural Networks (CNNs)
- Based on novel resistive memory (ReRAM) technology
- Incremental improvement on prior works

Background: CNNs

Goal: Classify image contents

Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Main layer type: Convolution

Not shown: Nonlinear activation function after convolution

Convolution operation

Image: Burger, W. (2016): Digital Image Processing. An Algorithmic Introduction Using Java.

Figure labels: input image, filter matrix, dot product, output feature map

Traditional: Fixed kernel - e.g. vertical Sobel

CNNs: Learned weights for the kernel (see the sketch below)
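To make the fixed-vs-learned distinction concrete, here is a minimal NumPy sketch of the convolution operation (technically cross-correlation, as in most CNN frameworks). The vertical Sobel kernel is the fixed, hand-designed case; a CNN treats the kernel entries as trainable parameters. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide the kernel over the image and
    take the dot product at each position (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)

# Traditional, fixed filter: vertical Sobel (hand-designed edge detector).
sobel_v = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# CNN: same operation, but the kernel entries are learned parameters.
learned_kernel = np.random.randn(3, 3) * 0.1  # initialized here, updated by training

edges = conv2d(image, sobel_v)               # fixed behavior
feature_map = conv2d(image, learned_kernel)  # behavior emerges from training
```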

Background: CNNs

Goal: Classify image contents

Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Two phases:

1. Training
2. Testing (= the forward pass only, i.e. the first half of a training iteration)

Background: CNNs

Phase 1: Training

Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

- Process the image (label: boat) with the current weights
- True value (label): dog (0), cat (0), boat (1), bird (0)
- Compute the error E(output) against the label
- Backpropagate the error (gradient descent method):
  - Calculate the error contribution of each layer
  - Update the weights to reduce the error
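As a reference for the training phase sketched above, here is a minimal one-iteration gradient-descent sketch for a tiny classifier head (softmax + cross-entropy). The layer sizes, learning rate, and the single-layer simplification are illustrative assumptions, not the network from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in for the classifier head: 4 classes (dog, cat, boat, bird).
x = rng.random(64)                 # flattened features of the "boat" image
y_true = np.array([0, 0, 1, 0])    # one-hot label: boat
W = rng.normal(scale=0.01, size=(4, 64))
lr = 0.1

# Forward pass (the "testing" half): scores -> softmax probabilities.
scores = W @ x
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Error E(output): cross-entropy against the label.
loss = -np.log(probs[y_true.argmax()])

# Backpropagation: gradient of the loss w.r.t. W, then gradient descent.
dscores = probs - y_true           # dE/dscores for softmax + cross-entropy
dW = np.outer(dscores, x)          # error contribution of this layer's weights
W -= lr * dW                       # update the weights to reduce the error
```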

Background: CNNs

Summary:

- Large amounts of data
  - Acceleration desirable, particularly for training
- Simple core operations (matrix / dot product)
  - Opportunities for parallelization (single- or multi-image)
- Non-trivial training process
  - Error computations
  - Dependencies on intermediate results

Background: Resistive RAM (ReRAM)

Electrical network theory (image: Wikipedia)

1971: Theory of the "Fourth Fundamental Circuit Element" (Leon Chua)

Resistor, Capacitor, Inductor, Memristor = Memory + Resistance:

- Passive element
- Resistance depends on the charge passed through it
- Enables inherent computational capabilities
  → No separate processing units

2008: Strukov et al. (HP Labs): The missing memristor found. In: Nature

Discovery in molecular electronics:
- Memristor-like behavior through metal-oxide structures
- Enabled through the flow of oxygen atoms

Since then:
- Resistive memory designs and prototypes
- Research in Processing-in-Memory with resistive memories

Background: Resistive RAM (ReRAM)

Hu et al. (2016): Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication

- Accumulation of the per-cell currents on each column (Kirchhoff's law)
- The resistance (conductance) of the memristors acts as the weight
- Parallel processing!

Figure labels: feedback resistance, conductance matrix

The naive mapping:
- Assumes linear memristor conductance
- Ignores circuit parasitics

→ More things to consider, but the basic idea is sound
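A minimal numerical sketch of the dot-product-engine idea under the naive linear model from the slide: weights are stored as memristor conductances G, inputs are applied as voltages, each cell contributes a current I = G·V, and the column wires sum those currents, so the column currents equal a matrix-vector product in one analog step. The scale factors and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

W = rng.random((4, 3))          # weight matrix to realize (4 inputs x 3 outputs)
x = rng.random(4)               # input vector

# Naive linear mapping: conductance proportional to weight (ignores
# nonlinearity, parasitics, and the need to split signed weights).
g_max = 1e-4                    # max conductance in siemens (illustrative)
G = W * g_max

v = x * 0.2                     # inputs as voltages (illustrative 0-0.2 V range)

# Each cell contributes I = G_ij * V_i; each column wire sums its cells'
# currents (Kirchhoff's current law) -> one analog matrix-vector product.
I_col = G.T @ v

# The digital result it approximates, up to the fixed scale factors:
ref = W.T @ x
assert np.allclose(I_col / (g_max * 0.2), ref)
```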

ReRAM-based PIM architecture

Building a complete ReRAM system from building blocks:

- HW structures for real CNN processing
- Programmable for different CNNs
- Process real benchmarks

Limitations of the prior designs:

- No training support - doesn't do CNNs
- Claim: pipeline design not suitable for training due to stalls
- Claim: ADC/DAC overhead could be improved

Side note

Full CNN processing introduces further practical issues:

1. Computations are analog – errors will occur
   - Empirical results: NNs are resilient to errors
2. Some CNN layers cannot be computed with ReRAM
   - Example: the contrast normalization layers of AlexNet (2012)
   - 2015: CNNs without LCN shown to work just as well

PipeLayer: Architecture

Main considerations:

1. Training support
2. Intra-layer parallelism
3. Inter-layer parallelism

PipeLayer: Architecture - 1. Training support

Figure 3: PipeLayer configured for training

Figure labels:
- Intermediate memory (memory subarray)
- Computation and weight storage (morphable subarray)
- Training label
- Partial derivative for weight (averaged)

Concept of batching:
- Process a batch of images with fixed weights
- Update the weights after the batch
→ Reduces update overhead (see the sketch below)

Walkthrough (ignoring parallelism, batch size 2):
1. Process image 1 of the batch
2. Process image 2 of the batch
3. Batch complete - weight update

The figure is unclear:
- The weight update path is not shown
- The text references nonexistent "b" derivatives
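A minimal sketch of the batching concept, assuming plain minibatch gradient descent on a single linear layer: the weights stay fixed while each image in the batch is processed, the per-image partial derivatives are accumulated and averaged (mirroring the "partial derivative for weight (averaged)" label in Figure 3), and a single weight update is applied at the end of the batch. All sizes and the loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

W = rng.normal(scale=0.01, size=(4, 16))    # weights, fixed during the batch
batch = [rng.random(16) for _ in range(2)]  # batch of 2 "images" (flattened)
labels = [np.eye(4)[2], np.eye(4)[0]]       # one-hot targets
lr = 0.1

grad_sum = np.zeros_like(W)
for x, y in zip(batch, labels):
    # Forward pass with the *fixed* weights.
    scores = W @ x
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Backward pass: accumulate this image's partial derivative for W.
    grad_sum += np.outer(probs - y, x)

# Batch complete: average the partial derivatives, then one weight update.
W -= lr * (grad_sum / len(batch))
```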

PipeLayer: Architecture - 2. Intra-layer parallelism

Without parallelism: the basic crossbar array matrix-vector computation scheme.

Added complexity:
- Process a batch of images in one go
- Use multiple kernels

With parallelism:
- Duplicate the processing structure for parallelism
- Break up the computation arrays due to HW size constraints (tiling sketch below)
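A minimal sketch of the "break up the computation arrays" point, assuming a fixed maximum crossbar size: a weight matrix larger than one crossbar is split into tiles, each tile computes a partial matrix-vector product, and the partial results are summed along the split input dimension and concatenated along the output dimension. The tile size and matrix shapes are illustrative, not PipeLayer's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(3)

XBAR = 128                       # assumed max crossbar rows/columns
W = rng.random((300, 200))       # weight matrix larger than one crossbar
x = rng.random(300)

def tiled_mvm(W, x, size=XBAR):
    """Compute W.T @ x using size x size tiles, as separate crossbars would."""
    n_in, n_out = W.shape
    y = np.zeros(n_out)
    for i in range(0, n_in, size):          # split the input (row) dimension
        for j in range(0, n_out, size):     # split the output (column) dimension
            tile = W[i:i + size, j:j + size]         # one crossbar's worth of weights
            y[j:j + size] += tile.T @ x[i:i + size]  # partial results accumulate
    return y

assert np.allclose(tiled_mvm(W, x), W.T @ x)
```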

PipeLayer: Architecture - 3. Inter-layer parallelism

Conceptually: the layers form a pipeline - while img1 is in a later layer, img2 and img3 are already being processed in earlier layers, and img4 enters next.

Implications:
- Need to buffer multiple intermediate results for later use
- Weight update requires a pipeline flush (does it really?)

The paper seems to agree on the flush/stall:
- The last image before an update leaves a gap of 2L+1 cycles
- The update looks larger in the figure, but is only 1 cycle

… but: how is this pipeline design superior to ISAAC's?
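To make the stall discussion concrete, here is a small back-of-the-envelope sketch under simple assumptions: with L layers, one training pass of an image occupies 2L+1 pipeline cycles (L forward, L backward, plus one), a new image can enter every cycle while the weights are fixed, and the pipeline drains before each 1-cycle weight update. This is only my reading of the slide's "2L+1" figure, not timing data from the paper.

```python
def training_cycles(n_images, batch, L, pipelined=True):
    """Rough cycle count for training n_images with a weight update every
    `batch` images, assuming each image needs 2L+1 cycles end to end."""
    per_image = 2 * L + 1
    n_batches = n_images // batch
    if not pipelined:
        return n_images * per_image + n_batches   # +1 update cycle per batch
    # Pipelined: within a batch, images enter back to back (1 cycle apart),
    # but the pipeline drains (flush) before the 1-cycle weight update.
    per_batch = per_image + (batch - 1) + 1
    return n_batches * per_batch

L, batch, n = 5, 32, 320
print("sequential:", training_cycles(n, batch, L, pipelined=False))
print("pipelined: ", training_cycles(n, batch, L, pipelined=True))
```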

PipeLayer: Implementation

- Activation function component
- Typical division into memory-only and memory/computation areas
- Spike coding driver (for energy/area reduction): converts inputs to weighted spikes
- Spike coding: analog input to a "digital" spike sequence without an ADC; the output spike count = accumulated input * weight
- … details like error propagation are not visualized
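A minimal sketch of the spike-coding idea as described on the slide, assuming a simple rate code: an analog input is converted into a number of spikes proportional to its value, each spike injects the weight once into an accumulator, and the final accumulated value approximates input * weight without a conventional ADC. The number of time slots is an illustrative parameter, not the driver design from the paper.

```python
import numpy as np

def to_spikes(value, n_slots=16):
    """Rate-code a normalized analog value (0..1) as spikes in n_slots time slots."""
    n_spikes = int(round(value * n_slots))
    return np.array([1] * n_spikes + [0] * (n_slots - n_spikes))

def spike_weighted_sum(value, weight, n_slots=16):
    """Accumulate `weight` once per input spike; result ~ value * weight (scaled)."""
    acc = 0.0
    for spike in to_spikes(value, n_slots):
        if spike:
            acc += weight
    return acc

value, weight = 0.75, 0.4
approx = spike_weighted_sum(value, weight)   # 12 spikes * 0.4 = 4.8
exact = value * weight * 16                  # same scale: 4.8
print(approx, exact)
```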

PipeLayer: Discussion

- Limited ReRAM precision
- Previous works showed that NNs tolerate such errors well
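A minimal sketch of what "limited ReRAM precision" means for the stored weights, assuming each cell can hold only a small number of conductance levels: the weights are quantized to a handful of evenly spaced levels before use, and the network then operates on the quantized values. The 4-bit choice is an illustrative assumption, not the cell precision reported in the paper.

```python
import numpy as np

def quantize(W, bits=4):
    """Snap weights to 2**bits evenly spaced levels spanning their range,
    mimicking a cell that can only hold that many conductance states."""
    levels = 2 ** bits
    lo, hi = W.min(), W.max()
    step = (hi - lo) / (levels - 1)
    return lo + np.round((W - lo) / step) * step

rng = np.random.default_rng(4)
W = rng.normal(size=(4, 8))
Wq = quantize(W, bits=4)
print("max quantization error:", np.abs(W - Wq).max())
```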

PipeLayer: Evaluation

- Large improvements vs. the reference GPU
- The architecture is simulated (could the results be impaired?)

Summary

The work:
- Successful design of a ReRAM-based memory architecture for PIM
- Good improvements in the test setup
- Support for training is new (but not a groundbreaking idea)

The paper:
- Sensibly structured
- Appropriate drawings
- Many implicit assumptions; reasoning for claims often missing
- Many grammatical errors

Take-aways

1. The work is made possible by progress in an interesting combination of fields, converging in ReRAM-based CNN accelerators:
   - 1971: Memristor
   - 1990s: Initial PIM concepts
   - 2008: Molecular electronics
   - 2012: AlexNet CNN
   - 2015: Good CNNs without the contrast normalization layer

2. Various optimization techniques mentioned in this seminar are used:
   - Hardware acceleration / PIM
   - Various layers of parallelism
   - Precision-speed trade-offs

Thanks for your time!

Questions?
