
PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

Presented by Nils Weller

Hardware Acceleration for Data Processing Seminar, Fall 2017

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

Purpose:

- Processing-in-Memory (PIM) architecture to accelerate Convolutional Neural Networks (CNNs)
- Based on novel resistive memory (ReRAM) technology
- Incremental improvement on prior works

Background: CNNs

Goal: Classify image contents

Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Main layer type: Convolution

Not shown: Nonlinear activation function after convolution

Convolution operation

Image: Burger, W. (2016): Digital Image Processing. An Algorithmic Introduction Using Java.

Figure labels: input image, filter matrix, dot product, output feature map

Traditional: Fixed kernel - e.g. vertical Sobel

CNNs: Learned weights for the kernel (see the sketch below)
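To make the fixed-vs-learned distinction concrete, here is a minimal NumPy sketch of the convolution operation (technically cross-correlation, as in most CNN frameworks). The vertical Sobel kernel is the fixed, hand-designed case; a CNN treats the kernel entries as trainable parameters. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide the kernel over the image and
    take the dot product at each position (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)

# Traditional, fixed filter: vertical Sobel (hand-designed edge detector).
sobel_v = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# CNN: same operation, but the kernel entries are learned parameters.
learned_kernel = np.random.randn(3, 3) * 0.1  # initialized here, updated by training

edges = conv2d(image, sobel_v)               # fixed behavior
feature_map = conv2d(image, learned_kernel)  # behavior emerges from training
```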

Background: CNNs

Goal: Classify image contents

Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Two phases:

1. Training
2. Testing (= the forward pass only, i.e. the first half of a training iteration)

Background: CNNs

Phase 1: Training

Image: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

- Process the image (label: boat) with the current weights
- True value (label): dog (0), cat (0), boat (1), bird (0)
- Compute the error E(output) against the label
- Backpropagate the error (gradient descent method):
  - Calculate the error contribution of each layer
  - Update the weights to reduce the error
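As a reference for the training phase sketched above, here is a minimal one-iteration gradient-descent sketch for a tiny classifier head (softmax + cross-entropy). The layer sizes, learning rate, and the single-layer simplification are illustrative assumptions, not the network from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in for the classifier head: 4 classes (dog, cat, boat, bird).
x = rng.random(64)                 # flattened features of the "boat" image
y_true = np.array([0, 0, 1, 0])    # one-hot label: boat
W = rng.normal(scale=0.01, size=(4, 64))
lr = 0.1

# Forward pass (the "testing" half): scores -> softmax probabilities.
scores = W @ x
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Error E(output): cross-entropy against the label.
loss = -np.log(probs[y_true.argmax()])

# Backpropagation: gradient of the loss w.r.t. W, then gradient descent.
dscores = probs - y_true           # dE/dscores for softmax + cross-entropy
dW = np.outer(dscores, x)          # error contribution of this layer's weights
W -= lr * dW                       # update the weights to reduce the error
```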

Background: CNNs

Summary:

- Large amounts of data
  - Acceleration desirable, particularly for training
- Simple core operations (matrix / dot product)
  - Opportunities for parallelization (single- or multi-image)
- Non-trivial training process
  - Error computations
  - Dependencies on intermediate results

Background: Resistive RAM (ReRAM)

Electrical network theory (image: Wikipedia)

1971: Theory of the "Fourth Fundamental Circuit Element" (Leon Chua)

Resistor, Capacitor, Inductor, Memristor = Memory + Resistance:

- Passive element
- Resistance depends on the charge passed through it
- Enables inherent computational capabilities
  → No separate processing units

2008: Strukov et al. (HP Labs): The missing memristor found. In: Nature

Discovery in molecular electronics:
- Memristor-like behavior through metal-oxide structures
- Enabled through the flow of oxygen atoms

Since then:
- Resistive memory designs and prototypes
- Research in Processing-in-Memory with resistive memories

Background: Resistive RAM (ReRAM)

Hu et al. (2016): Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication

- Accumulation of the per-cell currents on each column (Kirchhoff's law)
- The resistance (conductance) of the memristors acts as the weight
- Parallel processing!

Figure labels: feedback resistance, conductance matrix

The naive mapping:
- Assumes linear memristor conductance
- Ignores circuit parasitics

→ More things to consider, but the basic idea is sound
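A minimal numerical sketch of the dot-product-engine idea under the naive linear model from the slide: weights are stored as memristor conductances G, inputs are applied as voltages, each cell contributes a current I = G·V, and the column wires sum those currents, so the column currents equal a matrix-vector product in one analog step. The scale factors and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

W = rng.random((4, 3))          # weight matrix to realize (4 inputs x 3 outputs)
x = rng.random(4)               # input vector

# Naive linear mapping: conductance proportional to weight (ignores
# nonlinearity, parasitics, and the need to split signed weights).
g_max = 1e-4                    # max conductance in siemens (illustrative)
G = W * g_max

v = x * 0.2                     # inputs as voltages (illustrative 0-0.2 V range)

# Each cell contributes I = G_ij * V_i; each column wire sums its cells'
# currents (Kirchhoff's current law) -> one analog matrix-vector product.
I_col = G.T @ v

# The digital result it approximates, up to the fixed scale factors:
ref = W.T @ x
assert np.allclose(I_col / (g_max * 0.2), ref)
```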

ReRAM-based PIM architecture

Building a complete ReRAM system from building blocks:

- HW structures for real CNN processing
- Programmable for different CNNs
- Process real benchmarks

Limitations of the prior designs:

- No training support - doesn't do CNNs
- Claim: pipeline design not suitable for training due to stalls
- Claim: ADC/DAC overhead could be improved

Side note

Full CNN processing introduces further practical issues:

1. Computations are analog – errors will occur
   - Empirical results: NNs are resilient to errors
2. Some CNN layers cannot be computed with ReRAM
   - Example: the contrast normalization layers of AlexNet (2012)
   - 2015: CNNs without LCN shown to work just as well

PipeLayer: Architecture

Main considerations:

1. Training support
2. Intra-layer parallelism
3. Inter-layer parallelism

PipeLayer: Architecture - 1. Training support

Figure 3: PipeLayer configured for training

Figure labels:
- Intermediate memory (memory subarray)
- Computation and weight storage (morphable subarray)
- Training label
- Partial derivative for weight (averaged)

Concept of batching:
- Process a batch of images with fixed weights
- Update the weights after the batch
→ Reduces update overhead (see the sketch below)

Walkthrough (ignoring parallelism, batch size 2):
1. Process image 1 of the batch
2. Process image 2 of the batch
3. Batch complete - weight update

The figure is unclear:
- The weight update path is not shown
- The text references nonexistent "b" derivatives
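A minimal sketch of the batching concept, assuming plain minibatch gradient descent on a single linear layer: the weights stay fixed while each image in the batch is processed, the per-image partial derivatives are accumulated and averaged (mirroring the "partial derivative for weight (averaged)" label in Figure 3), and a single weight update is applied at the end of the batch. All sizes and the loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

W = rng.normal(scale=0.01, size=(4, 16))    # weights, fixed during the batch
batch = [rng.random(16) for _ in range(2)]  # batch of 2 "images" (flattened)
labels = [np.eye(4)[2], np.eye(4)[0]]       # one-hot targets
lr = 0.1

grad_sum = np.zeros_like(W)
for x, y in zip(batch, labels):
    # Forward pass with the *fixed* weights.
    scores = W @ x
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Backward pass: accumulate this image's partial derivative for W.
    grad_sum += np.outer(probs - y, x)

# Batch complete: average the partial derivatives, then one weight update.
W -= lr * (grad_sum / len(batch))
```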

PipeLayer: Architecture - 2. Intra-layer parallelism

Without parallelism: the basic crossbar array matrix-vector computation scheme.

Added complexity:
- Process a batch of images in one go
- Use multiple kernels

With parallelism:
- Duplicate the processing structure for parallelism
- Break up the computation arrays due to HW size constraints (tiling sketch below)
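A minimal sketch of the "break up the computation arrays" point, assuming a fixed maximum crossbar size: a weight matrix larger than one crossbar is split into tiles, each tile computes a partial matrix-vector product, and the partial results are summed along the split input dimension and concatenated along the output dimension. The tile size and matrix shapes are illustrative, not PipeLayer's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(3)

XBAR = 128                       # assumed max crossbar rows/columns
W = rng.random((300, 200))       # weight matrix larger than one crossbar
x = rng.random(300)

def tiled_mvm(W, x, size=XBAR):
    """Compute W.T @ x using size x size tiles, as separate crossbars would."""
    n_in, n_out = W.shape
    y = np.zeros(n_out)
    for i in range(0, n_in, size):          # split the input (row) dimension
        for j in range(0, n_out, size):     # split the output (column) dimension
            tile = W[i:i + size, j:j + size]         # one crossbar's worth of weights
            y[j:j + size] += tile.T @ x[i:i + size]  # partial results accumulate
    return y

assert np.allclose(tiled_mvm(W, x), W.T @ x)
```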

PipeLayer: Architecture - 3. Inter-layer parallelism

Conceptually: the layers form a pipeline - while img1 is in a later layer, img2 and img3 are already being processed in earlier layers, and img4 enters next.

Implications:
- Need to buffer multiple intermediate results for later use
- Weight update requires a pipeline flush (does it really?)

The paper seems to agree on the flush/stall:
- The last image before an update leaves a gap of 2L+1 cycles
- The update looks larger in the figure, but is only 1 cycle

… but: how is this pipeline design superior to ISAAC's?
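To make the stall discussion concrete, here is a small back-of-the-envelope sketch under simple assumptions: with L layers, one training pass of an image occupies 2L+1 pipeline cycles (L forward, L backward, plus one), a new image can enter every cycle while the weights are fixed, and the pipeline drains before each 1-cycle weight update. This is only my reading of the slide's "2L+1" figure, not timing data from the paper.

```python
def training_cycles(n_images, batch, L, pipelined=True):
    """Rough cycle count for training n_images with a weight update every
    `batch` images, assuming each image needs 2L+1 cycles end to end."""
    per_image = 2 * L + 1
    n_batches = n_images // batch
    if not pipelined:
        return n_images * per_image + n_batches   # +1 update cycle per batch
    # Pipelined: within a batch, images enter back to back (1 cycle apart),
    # but the pipeline drains (flush) before the 1-cycle weight update.
    per_batch = per_image + (batch - 1) + 1
    return n_batches * per_batch

L, batch, n = 5, 32, 320
print("sequential:", training_cycles(n, batch, L, pipelined=False))
print("pipelined: ", training_cycles(n, batch, L, pipelined=True))
```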

PipeLayer: Implementation

- Activation function component
- Typical division into memory-only and memory/computation areas
- Spike coding driver (for energy/area reduction): converts inputs to weighted spikes
- Spike coding: analog input to a "digital" spike sequence without an ADC; the output spike count = accumulated input * weight
- … details like error propagation are not visualized
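A minimal sketch of the spike-coding idea as described on the slide, assuming a simple rate code: an analog input is converted into a number of spikes proportional to its value, each spike injects the weight once into an accumulator, and the final accumulated value approximates input * weight without a conventional ADC. The number of time slots is an illustrative parameter, not the driver design from the paper.

```python
import numpy as np

def to_spikes(value, n_slots=16):
    """Rate-code a normalized analog value (0..1) as spikes in n_slots time slots."""
    n_spikes = int(round(value * n_slots))
    return np.array([1] * n_spikes + [0] * (n_slots - n_spikes))

def spike_weighted_sum(value, weight, n_slots=16):
    """Accumulate `weight` once per input spike; result ~ value * weight (scaled)."""
    acc = 0.0
    for spike in to_spikes(value, n_slots):
        if spike:
            acc += weight
    return acc

value, weight = 0.75, 0.4
approx = spike_weighted_sum(value, weight)   # 12 spikes * 0.4 = 4.8
exact = value * weight * 16                  # same scale: 4.8
print(approx, exact)
```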

PipeLayer: Discussion

- Limited ReRAM precision
- Previous works showed that NNs tolerate such errors well
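A minimal sketch of what "limited ReRAM precision" means for the stored weights, assuming each cell can hold only a small number of conductance levels: the weights are quantized to a handful of evenly spaced levels before use, and the network then operates on the quantized values. The 4-bit choice is an illustrative assumption, not the cell precision reported in the paper.

```python
import numpy as np

def quantize(W, bits=4):
    """Snap weights to 2**bits evenly spaced levels spanning their range,
    mimicking a cell that can only hold that many conductance states."""
    levels = 2 ** bits
    lo, hi = W.min(), W.max()
    step = (hi - lo) / (levels - 1)
    return lo + np.round((W - lo) / step) * step

rng = np.random.default_rng(4)
W = rng.normal(size=(4, 8))
Wq = quantize(W, bits=4)
print("max quantization error:", np.abs(W - Wq).max())
```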

PipeLayer: Evaluation

- Large improvements vs. the reference GPU
- The architecture is simulated (could the results be impaired?)

Summary

The work:
- Successful design of a ReRAM-based memory architecture for PIM
- Good improvements in the test setup
- Support for training is new (but not a groundbreaking idea)

The paper:
- Sensibly structured
- Appropriate drawings
- Many implicit assumptions; reasoning for claims often missing
- Many grammatical errors

Take-aways

1. The work is made possible by progress in an interesting combination of fields, converging in ReRAM-based CNN accelerators:
   - 1971: Memristor
   - 1990s: Initial PIM concepts
   - 2008: Molecular electronics
   - 2012: AlexNet CNN
   - 2015: Good CNNs without the contrast normalization layer

2. Various optimization techniques mentioned in this seminar are used:
   - Hardware acceleration / PIM
   - Various layers of parallelism
   - Precision-speed trade-offs

Thanks for your time!

Questions?
