Lecture 9: Memory Optimization
CSE599W: Spring 2018
Where are we
[Figure: the DL system stack. High level Packages and the User API sit on top of the Programming API; the System Components are Gradient Calculation (Differentiation API), Computational Graph Optimization and Execution, and Runtime Parallel Scheduling; the Architecture layer holds GPU Kernels, Optimizing Device Code, and Accelerators and Hardware.]
Recap: Computation Graph
[Figure: training graph for softmax classification. Forward path: x, W → matmult → softmax → log → mul (with label y_) → mean → cross_entropy, where y labels the softmax output. Gradient path: log-grad → softmax-grad → mul by 1/batch_size → matmult-transpose → W_grad. Update path: sub(W, mul(learning_rate, W_grad)) → assign W.]
Recap: Automatic Differentiation
[Figure: two approaches on the same softmax training graph. "Backprop in Graph": gradient operators (log-grad, softmax-grad, mul by 1/batch_size, matmult-transpose) run inside the backward pass of the original graph. "Autodiff by Extending the Graph" (assignment 1): the forward graph x, W → matmult → softmax → log → mul → mean → cross_entropy is extended with explicit gradient nodes.]
Questions for this Lecture
[Figure: the autodiff-extended graph from the previous slide]
Why do we need automatic differentiation that extends the graph, instead of backprop in the graph?
Memory Problem of Deep Nets
[Figure: LeNet vs. Inception architecture diagrams]
Deep nets are becoming deeper.
State-of-Art Models can be Resource Bound
● Examples of recent state of art neural nets ○ Convnet: ResNet-1k on CIFAR-10, ResNet-200 on ImageNet○ Recurrent models: LSTM on long sequences like speech
● The maximum size of the model we can try is bounded by total RAM available of a Titan X card (12G)
We need to be frugal
Q: How to Build an Executor for a Given Graph?
[Figure: computational graph for exp(a * b + 3): inputs a and b feed mul, whose result feeds add-const (constant 3), whose result feeds exp]
Build an Executor for a Given Graph
1. Allocate temp memory for intermediate computation.
[Figure: the same exp(a * b + 3) graph with a buffer attached to each intermediate result; same color represents the same piece of memory. Example inputs: a = 4, b = 8.]
Build an Executor for a Given Graph
1. Allocate temp memory for intermediate computation.
2. Traverse and execute the graph in topological order.
[Figure: with a = 4 and b = 8, the intermediate buffers fill with 32, 35, and exp(35) in turn.]
Temporary space is linear in the number of ops. (A minimal executor sketch follows.)
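Below is a minimal sketch of such a naive executor, not from the lecture: it assumes hypothetical Node objects carrying `inputs`, `shape`, and an `op.compute(args, out)` method, and gives every op its own freshly allocated buffer, so temporary space grows linearly with the number of ops.

import numpy as np

def topo_sort(output_node):
    # Post-order DFS: every node appears after all of its inputs.
    order, visited = [], set()
    def visit(node):
        if node not in visited:
            visited.add(node)
            for inp in node.inputs:
                visit(inp)
            order.append(node)
    visit(output_node)
    return order

def run(output_node, feed):
    values = dict(feed)                          # leaf nodes -> user arrays
    for node in topo_sort(output_node):
        if node in values:                       # leaf node, already fed
            continue
        args = [values[i] for i in node.inputs]
        values[node] = np.empty(node.shape)      # 1. fresh temp buffer per op
        node.op.compute(args, out=values[node])  # 2. execute in topo order
    return values[output_node]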
Dynamic Memory Allocation
1. Allocate when needed.
2. Recycle memory when it is no longer needed.
3. Useful for both declarative and imperative executions.
[Figure: executing exp(a * b + 3) with a = 4, b = 8 against a memory pool; as 32, 35, and exp(35) are produced, buffers whose values are no longer needed are returned to the pool and reused. A sketch of such a pool follows.]
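A minimal sketch of this scheme, not from the lecture: a free-list pool plus per-node reference counts of pending consumers. The Node fields (`inputs`, `size`, `op.compute`) are illustrative assumptions.

from collections import defaultdict

class MemoryPool:
    # Free lists keyed by buffer size; allocation reuses a free buffer
    # when one of the right size exists.
    def __init__(self):
        self.free = defaultdict(list)

    def alloc(self, size):
        return self.free[size].pop() if self.free[size] else bytearray(size)

    def recycle(self, buf):
        self.free[len(buf)].append(buf)

def run_with_pool(nodes, feed, num_consumers, pool):
    values = dict(feed)                        # leaf nodes -> user buffers
    refcount = dict(num_consumers)             # node -> consumers not yet run
    for node in nodes:                         # non-leaf nodes in topo order
        values[node] = pool.alloc(node.size)   # 1. allocate when needed
        node.op.compute([values[i] for i in node.inputs], out=values[node])
        for inp in node.inputs:
            refcount[inp] -= 1
            if refcount[inp] == 0 and inp not in feed:
                pool.recycle(values.pop(inp))  # 2. recycle when unneeded
    return values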
Static Memory Planning
1. Plan for reuse ahead of time.
2. Analogy: the register allocation algorithm in a compiler.
[Figure: exp(a * b + 3) with statically assigned buffers; same color represents the same piece of memory.]
Common Patterns of Memory Planning
● Inplace: store the result in the input.
● Normal sharing: reuse memory that is no longer needed.
Inplace Optimization
[Figure: computational graph for exp(a * b + 3), with each op writing its result into its input's buffer]
● Store the result in the input.
● Works if we only care about the final result.
● Question: which operations cannot be done inplace?
Inplace Pitfalls
[Figure: exp(a * b + 3) where the add-const result also feeds a log node, so exp must not overwrite it]
We can only do inplace if the result op is the only consumer of the current value.
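The rule is simple enough to state as code; this one-liner is an illustrative sketch, with `consumers` assumed to map each value to the list of ops that still read it.

def can_inplace(node, inp, consumers):
    # `node` may overwrite `inp` only if it is the sole remaining
    # consumer of inp's current value. In the pitfall above, add-const's
    # output has two consumers (exp and log), so neither may go inplace.
    return consumers[inp] == [node]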
Normal Memory Sharing
[Figure: the same graph; once every consumer of a value has run, its buffer can be reused by a later op]
Recycle memory that is no longer needed.
Memory Planning Algorithm
[Figure: the allocation algorithm simulated on the graph A → B = sigmoid(A), B → C = sigmoid(B), C → E = Pooling(C), B → F = Pooling(B), G = E * F. Legend: boxes are internal arrays, same color indicates shared memory; solid arrows are data dependencies whose operation has completed, dashed arrows ones not yet completed; each array carries a tag, used to indicate memory sharing, and a count, the ref counter of dependent operations yet to be fulfilled; a box holds the free tags.]
Initial state: every array's ref counter equals its number of consumers; the box of free tags is empty.
Step 1: allocate a tag for B.
Step 2: allocate a tag for C; we cannot do inplace because B is still alive.
Step 3: allocate a tag for F; release the space of B.
Step 4: reuse the tag in the box for E.
Step 5: reuse the tag of E for G; this is an inplace optimization.
The result is the final memory plan: arrays with the same tag share memory. (A sketch of the algorithm follows.)
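A sketch of this tag-based planner, under illustrative assumptions: `nodes` arrive in topological order, `num_consumers` maps each node to its reader count, and each node carries a hypothetical `can_inplace` flag saying whether its op may overwrite an input.

def plan_memory(nodes, num_consumers):
    tag_of, free_tags, next_tag = {}, [], 0
    refcount = dict(num_consumers)
    for node in nodes:
        # Inplace: adopt an input's tag if the op allows it and this
        # node is the last pending consumer of that input's value.
        shared = next((i for i in node.inputs
                       if node.can_inplace and refcount[i] == 1
                       and i in tag_of), None)
        if shared is not None:
            tag_of[node] = tag_of[shared]
            refcount[shared] = 0             # value is consumed by the write
        elif free_tags:
            tag_of[node] = free_tags.pop()   # normal sharing from the box
        else:
            tag_of[node] = next_tag          # no reusable memory: new tag
            next_tag += 1
        for inp in node.inputs:
            if refcount[inp] > 0:
                refcount[inp] -= 1
                if refcount[inp] == 0 and inp in tag_of:
                    free_tags.append(tag_of[inp])  # back into the box
    return tag_of

On the graph above this reproduces steps 1-5: B and C get fresh tags, F gets a fresh tag while releasing B's, E then picks B's tag out of the box, and G reuses E's tag inplace.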
Concurrency vs. Memory Optimization
[Figure: two memory plans for the same graph
A[2] = conv(A[1]); A[3] = pool(A[2]); A[4] = conv(A[3])
A[5] = pool(A[1]); A[6] = conv(A[5]); A[7] = pool(A[6])
A[8] = concat(A[4], A[7])
Same color indicates shared memory for the result arrays; dashed edges are implicit dependencies introduced by the allocation. Left: aggressively sharing memory across the two branches introduces such a dependency, so they cannot run in parallel. Right: keeping the branches in separate memory enables two parallel paths.]
Concurrency-Aware Heuristics
[Figure: path coloring on the dependency graph]
● First find the longest path and color its nodes.
● Reset the reward of visited nodes to 0, then find the next longest path; repeating this yields the final node coloring.
● Restrict memory reuse to nodes on the same colored path.
Memory Allocation and Scheduling
Memory sharing introduces implicit control-flow dependencies between ops. Solutions:
● Explicitly add the control-flow dependencies
○ Needed in TensorFlow
● Enable mutation in the scheduler, so no extra job is needed
○ Both operations "mutate" the same memory
○ Supported in MXNet
[Figure: the A[1]-A[8] graph from the previous slide, with the implicit dependency drawn explicitly]
(A sketch of the first solution follows.)
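A sketch of the first solution under illustrative assumptions (`nodes` in planned execution order, `tag_of` from the planner above): every write into a recycled buffer gets an explicit control edge from the buffer's earlier readers, so a dependency-driven scheduler cannot reorder them.

def add_control_edges(nodes, tag_of):
    edges, last_readers = [], {}           # tag -> ops reading that buffer
    for node in nodes:
        for inp in node.inputs:
            last_readers.setdefault(tag_of[inp], []).append(node)
        # node overwrites buffer tag_of[node]: all earlier readers of that
        # buffer must finish first (a write-after-read dependency).
        for reader in last_readers.pop(tag_of[node], []):
            if reader is not node:
                edges.append((reader, node))  # reader runs before node
    return edges

On the example above, if A[5] reuses A[2]'s buffer this emits the edge (A[3], A[5]): A[5] may not overwrite A[2] until pool(A[2]) has read it.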
Memory Plan with Gradient Calculation
[Figure: exp(a * b + 3) extended with gradient nodes (out_grad → mul → a_grad); planning covers the forward and backward buffers together]
Back to the question: why do we need automatic differentiation that extends the graph, instead of backprop in the graph? Because the gradient calculation is then just more nodes in the same graph, the memory planner can see, and optimize, the forward and backward passes together.
Memory Optimization on a Two Layer MLP
Impact of Memory Optimization in MXNet
We Are Still Starved
● For training, memory cost is still linear in the number of layers.
● We need to book-keep intermediate results for the gradient calculation.
Trade Computation for Memory
● Only store a few of the intermediate results (checkpoints).
● Recompute the other values needed during gradient calculation.
[Figure: a Forward pass followed by Backward1 and Backward2 sub-passes; checkpointed data is kept for backprop, the rest is dropped and recomputed.]
Computation Graph View of the Algorithm
Sublinear Memory Complexity
● If we checkpoint every K steps on an N-layer network,
● the memory cost = O(K) (cost per segment) + O(N/K) (cost to store checkpointed results).
● We can get a sqrt(N) memory-cost plan,
● with one additional forward pass (about 25% overhead). (A sketch follows.)
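A sketch of the checkpointing scheme, not the lecture's implementation: store only every K-th activation in the forward pass, then recompute each segment's dropped activations from its checkpoint during the backward pass. The layer objects with `forward`/`backward` methods are illustrative assumptions.

def backward_with_checkpoints(layers, x, grad_out, K):
    # Forward: keep only every K-th activation (plus the input).
    checkpoints, h = [x], x
    for i, layer in enumerate(layers):
        h = layer.forward(h)
        if (i + 1) % K == 0:
            checkpoints.append(h)
    # Backward: replay each segment from its checkpoint (this is the
    # extra forward pass), then backprop through the segment. Peak
    # memory ~ O(N/K) checkpoints + O(K) per-segment temporaries,
    # minimized at K = sqrt(N).
    g = grad_out
    for s in reversed(range(0, len(layers), K)):
        acts, h = [], checkpoints[s // K]
        for layer in layers[s:s + K]:
            acts.append(h)                  # recompute dropped inputs
            h = layer.forward(h)
        for layer, a in zip(reversed(layers[s:s + K]), reversed(acts)):
            g = layer.backward(a, g)        # grad w.r.t. the layer's input
    return g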
Alternative View: Recursion
More memory can be saved with multiple levels of recursion.
Comparison of Allocation Algorithms on ResNet
(Chen et al., 2016)
Comparison of Allocation Algorithms on LSTM
(Chen et al., 2016)
Execution Overhead
Take-aways
● Computation graph is a useful tool for tracking dependencies
● Memory allocation affects concurrency
● We can trade computation for memory to get sublinear memory plan
Assignment 2
● Assignment 1 implements computation graph and autodiff● Assignment 2 implements the rest of DL system stack (Graph Executor) to
run on hardware○ Shape Inference○ Memory management○ TVM-based operator implementation
● Deadline in two weeks: 5/8/2018 ● Post questions to #dlsys slack channel so course staff can help