
TensorBoard and Applications

Jun 12, 2019

ddebarr@uw.edu

http://cross-entropy.net/ML410/Deep_Learning_9.pdf

Agenda

• TensorBoard

• Speech Recognition Homework: Spatial to Frequency Representation

• Single Shot MultiBox Detector (SSD)

• Bi-directional Encoder Representations from Transformers (BERT)

• GANs

• Datasets Review

The Loop of Progress

TensorBoard

Text-Classification Model to Use with TensorBoard
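The slide's code is not reproduced in this extract; below is a minimal Keras text-classification model in the spirit of this slide (the dataset, layer choices, and hyperparameters are illustrative, not the exact model shown in class):

import keras
from keras import layers
from keras.preprocessing import sequence

max_features = 2000   # number of words to consider as features
max_len = 500         # cut reviews off after this many words

(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)

model = keras.models.Sequential([
    layers.Embedding(max_features, 128, input_length=max_len, name='embed'),
    layers.Conv1D(32, 7, activation='relu'),
    layers.MaxPooling1D(5),
    layers.Conv1D(32, 7, activation='relu'),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])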

TensorBoard

Training the Model with a TensorBoard Callback
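A sketch of fitting with a TensorBoard callback that writes logs to my_log_dir (assumes the model and data defined above; the epoch count and batch size are illustrative):

callbacks = [
    keras.callbacks.TensorBoard(
        log_dir='my_log_dir',   # directory passed to tensorboard --logdir below
        histogram_freq=1,       # record activation/weight histograms every epoch
        embeddings_freq=1,      # record embedding data (may require embeddings_data in some Keras versions)
    )
]
history = model.fit(x_train, y_train,
                    epochs=20, batch_size=128,
                    validation_split=0.2, callbacks=callbacks)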

$ tensorboard --logdir=my_log_dir

TensorBoard

TensorBoard: Metrics Monitoring

TensorBoard

TensorBoard: Activation Histograms

TensorBoard

TensorBoard: Interactive 3D Word-Embedding Visualization

TensorBoard

TensorBoard: TensorFlow Graph Visualization

TensorBoard

keras.utils.plot_model()
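A brief usage example (plot_model writes the model graph to an image file; it requires the pydot and graphviz packages to be installed, and model is the network defined earlier):

from keras.utils import plot_model

plot_model(model, show_shapes=True, to_file='model.png')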

TensorBoard

Audio: Spatial to Frequency Representation

https://www.ritchievink.com/blog/2017/04/23/understanding-the-fourier-transform-by-example/

Audio Features

Discrete Fourier Transform

https://www.ritchievink.com/blog/2017/04/23/understanding-the-fourier-transform-by-example/

Audio Features

Discrete Fourier Transform Implementation

https://www.ritchievink.com/blog/2017/04/23/understanding-the-fourier-transform-by-example/

[ (k * n) produces an NxN matrix ]
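A sketch of the vectorized DFT implementation the bracketed note describes, following the approach in the referenced blog post (variable names are mine):

import numpy as np

def dft(x):
    x = np.asarray(x, dtype=float)
    N = x.shape[0]
    n = np.arange(N)
    k = n.reshape((N, 1))
    M = np.exp(-2j * np.pi * k * n / N)   # (k * n) produces an NxN matrix of exponents
    return M.dot(x)                       # X[k] = sum_n x[n] * exp(-2*pi*i*k*n/N)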

Audio Features

Single Shot MultiBox Detector (SSD)

• Feature Maps

• SSD Architecture

• Matching Ground Truth Boxes to Prediction

• SSD Loss Function

• VOC 2007 Performance

• Traffic Prediction

• Inference Example

SSD

Feature Maps

https://www.cs.unc.edu/~wliu/papers/ssd.pdf

SSD

Single Shot MultiBox Detector (SSD) Architecture

8,732 boxes:

38 x 38 x 4 + 19 x 19 x 6 + 10 x 10 x 6 + 5 x 5 x 6 + 3 x 3 x 4 + 1 x 1 x 4

aspect ratios include: { 1, 1, 1/2, 2, 1/3, 3 }
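A one-line check that the feature map sizes and boxes-per-cell counts above do sum to 8,732:

feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]   # (grid size, boxes per cell)
print(sum(size * size * boxes for size, boxes in feature_maps))      # 8732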

SSD

Matching Ground Truth Boxes to Prediction Boxes

Match: Intersection over Union (IoU) > 0.5

Hard Negative Mining:

Keep the positive-to-negative ratio fixed at 1:3, using the worst misclassified negatives (those with the highest confidence loss)
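A minimal sketch of the IoU computation used for matching, with boxes given as (xmin, ymin, xmax, ymax) corner coordinates (not the exact code from any particular SSD implementation):

def iou(box_a, box_b):
    # intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# a ground truth box is matched to a default box when iou(ground_truth, default_box) > 0.5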

SSD

Loss Function

• N: the number of default boxes matched

• alpha: set to 1 [by cross validation]

• conf: class confidence loss

• loc: localization loss
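For reference, the combined objective these terms form (as given in the SSD paper) is

L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g)\right)

with the loss set to 0 when N = 0.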

SSD

Localization Loss

SSD

Smooth L1 Loss
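The smooth L1 penalty applied to each localization offset is the standard definition:

\text{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}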

SSD

Confidence Loss
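The confidence loss is the softmax loss over the class confidences, as defined in the SSD paper:

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}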

SSD

PASCAL Visual Object Classes (VOC)

• PASCAL: Pattern Analysis, Statistical modeling, and Computational Learning

• 20 classes
• Person: person

• Animal: bird, cat, cow, dog, horse, sheep

• Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train

• Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

http://host.robots.ox.ac.uk/pascal/VOC/

SSD

Improvement Via Augmentation

SSD300: an SSD model using 300x300 pixel input images

SSD

Comparing Results on the VOC 2007 Data

SSD

Traffic Prediction

SSD

Inference Example

pip install opencv-python

git clone https://github.com/pierluigiferrari/ssd_keras.git

jupyter-notebook

# open ssd300_inference.ipynb (interactive python notebook)

# https://drive.google.com/open?id=121-kCXaOHOkJE_Kf5lKcJvC_5q1fYb_q

weights_path = 'C:/Temp/VGG_VOC0712_SSD_300x300_iter_120000.h5'

SSD

Question Answering using Bidirectional Encoder Representations from Transformers (BERT)

• SQuAD
• Stats

• Measuring Syntactic Divergence

• Logistic Regression versus Human Performance

• BERT
• Base versus Large

• Transformer Execution

• Fine-Tuning BERT for SQuAD

• References

BERT

Stanford Question Answering Dataset (SQuAD)

• Extracted paragraphs from a sample of “top” Wikipedia articles

• Questions asked and answered using Amazon’s Mechanical Turk

BERT

Wikipedia Article for Precipitation

First 2 sentences and last 2 sentences of first paragraph, without references:

https://en.wikipedia.org/wiki/Precipitation

Paragraph: 3 Questions with 1 Answer for Each:

BERT

Counts for SQuAD

BERT

Survey of Question Answering Datasets

• Small: difficult for statistical models

• Cloze datasets: Children’s Book Test (CBT) and CNN/DailyMail

• Cloze data: predict the missing word/entity [“performance is almost saturated”]

• SQuAD: answers often include non-entities and can be much longer

The word cloze is derived from closure in Gestalt theory

BERT

SQuAD Answer Types

BERT

SQuAD Question Answering Categories [based on manual review of 192 examples]

BERT

Measuring Syntactic Divergence: Example

Edit distance between dependency tree paths for the question and answer sentences; e.g. delete “amod” (adjectival modifier), substitute “xcomp” for “nmod”, and insert “det” (determiner)

https://nlp.stanford.edu/software/dependencies_manual.pdf

BERT

Measuring Syntactic Divergence: Distribution

BERT

Features for Logistic Regression Baseline: Part 1 of 3 [180 million features]

AllenNLP.org: A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads.

BERT

Features for Logistic Regression Baseline: Part 2 of 3 [180 million features]

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

AllenNLP.org: A constituency parse tree breaks a text into sub-phrases, or constituents. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence.

BERT

Features for Logistic Regression Baseline: Part 3 of 3 [180 million features]

BERT

Text Normalization for Scoring

import re
import string

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))

BERT

Evaluation Metrics: Exact Match (EM)

def exact_match_score(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)

# overall EM: average exact_match_score over all questions
mean(exact_match_score)

BERT

Evaluation Metrics: F1[the harmonic mean of precision and recall]

from collections import Counter

def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

# overall F1: average the best score against any of the ground truth answers
mean(metric_max_over_ground_truths(f1_score, prediction, ground_truths))

BERT

Human Perf

To evaluate human performance, they treat the second answer to each question as the human prediction, and keep the other answers as ground truth answers

BERT

Performance of Logistic Regression Baseline

Non-human candidates restricted to spans which are constituents in the constituency parse generated by Stanford CoreNLP

Baselines included the sliding window and distance-based algorithms by Matt Richardson et al.

BERT

Ablation Study for Logistic Regression Perf

BERT

Performance Stratified by Answer Types

BERT

Performance Stratified by Syntactic Divergence

BERT

BERT

• Bidirectional Encoder Representations from Transformers

• Forward: elements that precede the current element are useful for understanding the current element

• Backward: elements that succeed the current element are useful for understanding the current element

• “Self-Supervised” Pre-Training Tasks
• Corpora: Books Corpus and … wait for it … English Wikipedia [3.3 billion words]

• Masked Language Model [MLM]: predict the missing words

• Next Sentence: predict whether sentence B follows sentence A

BERT

BERT Model Architecture

Model consists of …
• Embeddings block: embeddings for “token”, position, and segment (added), with H=768 (or 1024) floats per embedding
• Encoder stack: L=12 (or 24) transformer blocks, each consisting of 6 weight matrices:
• Multi-head self-attention (number of heads is H / 64)

• Query: H x H

• Key: H x H [softmax( (Input * Query) * (Input * Key)’ / sqrt(64) ) to average values]

• Value: H x H

• Projection: H x H

• Intermediate (H x 4H)

• Output (4H x H)

• Pooling layer: H x H

Pre-Trained BERT Models

Base: 109,482,240 parameters

• Embeddings: 768 floats

• 12 Transformer Layers
• Self-Attention
• 12 heads x 64 dimensions [768]
• Projection (Convolution)
• Layer Normalization

• 2 Convolution Layers (expand/contract)

• Layer Normalization

• Pooler (Convolution)

Large: 335,141,888 parameters

• Embeddings: 1,024 floats

• 24 Transformer Layers
• Self-Attention
• 16 heads x 64 dimensions [1024]
• Projection (Convolution)
• Layer Normalization

• 2 Convolution Layers (expand/contract)

• Layer Normalization

• Pooler (Convolution)

All Convolution based on filter width of one (word piece) token

30,522 tokens in the vocabulary for both models
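A quick arithmetic check (my own, not from the slides) that these configuration values reproduce the stated parameter count; the same sum with H=1024, L=24, I=4096 gives the Large count of 335,141,888:

V, P, T = 30522, 512, 2                 # vocabulary, position, and segment ("token type") sizes
H, L, I = 768, 12, 3072                 # hidden size, layers, intermediate (4H) size

embeddings = (V + P + T) * H + 2 * H    # token + position + segment tables, plus LayerNorm gamma/beta
attention  = 4 * (H * H + H) + 2 * H    # Query, Key, Value, Projection (weights + biases), plus LayerNorm
ffn        = (H * I + I) + (I * H + H) + 2 * H   # Intermediate and Output "convolutions", plus LayerNorm
pooler     = H * H + H

print(embeddings + L * (attention + ffn) + pooler)   # 109482240 for BERT Base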

BERT

Parameter Details: Embeddings

Replace 768 with 1024 for BERT Large

BERT

Parameter Details: Transformer Layer

Replace 768 with 1024 and 3072 with 4096 for BERT Large

BERT

Parameter Details: Pooling and Classifier Layers

Replace 768 with 1024 for BERT Large

BERT

BERT Base Transformer Layer Execution

1. Compute Self-Attention Output Matrix [next 3 slides]

2. Convolve the Self-Attention Output with the Intermediate Filters (length 1 filters): (512 x 768) x (768 x 3072) = (512 x 3072)

3. Convolve the Intermediate Filter Output with the Output Filters (length 1 filters): (512 x 3072) x (3072 x 768) = (512 x 768)

4. Normalize: multiply feature values by feature-specific gamma (scale parameter) and add feature-specific beta (location parameter)

BERT

Note the residual connections to the Norm

BERT Base Self-Attention: Part 1 of 3

• Sequence length: s = 512
• Number of heads: h = 12
• Number of dimensions for Key and Value: d = 64
• Number of dimensions for Model: m = h * d = 12 * 64 = 768

• Input Matrix: (s x m)
• Query Weight Matrix: (m x d) # one for each head
• Key Weight Matrix: (m x d) # one for each head
• Value Weight Matrix: (m x d) # one for each head
• Projection Matrix: (m x m) # only one

BERT

BERT Base Self-Attention: Part 2 of 3

1. For each head …
   a. Multiply Input Matrix by Query Weight Matrix to produce Query Matrix: (512x768) x (768x64) = (512x64) [“broadcast” (add) bias vector to rows]
   b. Multiply Input Matrix by Key Weight Matrix to produce Key Matrix: (512x768) x (768x64) = (512x64) [“broadcast” (add) bias vector to rows]
   c. Multiply Input Matrix by Value Weight Matrix to produce Value Matrix: (512x768) x (768x64) = (512x64) [“broadcast” (add) bias vector to rows]
   d. Multiply Query Matrix by transposed Key Matrix to produce the Attention Matrix: (512x64) x (64x512) = (512x512) [also normalize (divide) by the square root of the number of dimensions for Key]
   e. SoftMax across the rows of the Attention Matrix: still (512x512)
   f. Multiply SoftMax Matrix by the Value Matrix to produce the Attended Output: (512x512) x (512x64) = (512x64); cell i,j contains a weighted sum of Value[,j] for all elements in the sequence, based on weights specific to element i

BERT

BERT Base Self-Attention: Part 3 of 3

2. Concatenate the Attended Output matrices, then multiply by the Projection Matrix to produce the Projected Output Matrix: (512x768) x (768x768) = (512x768)

3. Normalize: multiply feature values by feature-specific gamma (scale parameter) and add feature-specific beta (location parameter)

http://jalammar.github.io/illustrated-transformer/
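A NumPy sketch of parts 1 through 3 above (random matrices stand in for learned weights; bias vectors and layer normalization are omitted):

import numpy as np

s, m, d, h = 512, 768, 64, 12          # sequence length, model dims, dims per head, heads
np.random.seed(0)
X = np.random.randn(s, m)              # Input Matrix

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(h):
    Wq, Wk, Wv = np.random.randn(m, d), np.random.randn(m, d), np.random.randn(m, d)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv   # steps a-c: (512x768) x (768x64) = (512x64)
    A = softmax(Q @ K.T / np.sqrt(d))  # steps d-e: (512x512) attention weights
    heads.append(A @ V)                # step f: (512x64) attended output

Wp = np.random.randn(m, m)             # Projection Matrix
out = np.concatenate(heads, axis=1) @ Wp   # step 2: (512x768) x (768x768) = (512x768)
print(out.shape)                       # (512, 768)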

BERT

Fine-Tuning BERT for SQuAD:Fetch Stuff

git clone https://github.com/google-research/bert.git

cd bert

wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip

unzip uncased_L-12_H-768_A-12.zip

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

wget https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py

BERT

Fine-Tuning BERT for SQuAD: Train and Predict

python run_squad.py \

> --vocab_file=${HOME}/bert/uncased_L-12_H-768_A-12/vocab.txt \

> --bert_config_file=${HOME}/bert/uncased_L-12_H-768_A-12/bert_config.json \

> --init_checkpoint=${HOME}/bert/uncased_L-12_H-768_A-12/bert_model.ckpt \

> --do_train=True \

> --train_file=${HOME}/bert/train-v1.1.json \

> --do_predict=True \

> --predict_file=${HOME}/bert/dev-v1.1.json \

> --train_batch_size=16 \

> --learning_rate=3e-5 \

> --num_train_epochs=2.0 \

> --max_seq_length=384 \

> --doc_stride=128 \

> --output_dir=/tmp/squad/ \

> --save_checkpoints_steps 20000

BERT

Nvidia System Management Interface Output

Power Usage is 263 Watts [fluctuates over time]; 62° Celsius is 144° Fahrenheit (gets warmer after a while)

BERT

Fine-Tuning BERT for SQuAD: Evaluate (dev)

• python evaluate-v1.1.py \

dev-v1.1.json \

/tmp/squad_large/predictions.json

• Platform: Amazon Web Services (AWS) Elastic Compute Cloud (EC2) p3.2xlarge instance [Nvidia GV100 (Volta) GPU with 16 GB memory]

• Base
• Batch size: 16

• 97 minutes (1.6 hours) to train and make predictions for the dev set

• {"exact_match": 80.96499526963103, "f1": 88.3375015101166}

• Console output: https://www.cross-entropy.net/ML410/bert-console.txt

BERT

References

• Stanford Question Answering Dataset (SQuAD): 100,000+ Questions for Machine Comprehension of Text
• SQuAD 1: https://arxiv.org/abs/1606.05250

• Bidirectional Encoder Representations from Transformers
• https://arxiv.org/pdf/1810.04805.pdf

• Know What You Don’t Know: Unanswerable Questions for SQuAD
• SQuAD 2: https://arxiv.org/abs/1806.03822

• Added questions without answers

BERT

GANs

• Motivation

• Example of Deep Convolutional Generative Adversarial Network

• Evaluation Metrics

• Training Nvidia’s Progressive Growing of GANs

• Evaluating Google’s BigGAN

• Trouble with Using Discriminator to Evaluate Quality

GANs

Inspiration from News Coverage: https://thispersondoesnotexist.com/

https://www.theverge.com/2019/3/3/18244984/ai-generated-fake-which-face-is-real-test-stylegan

GANs

Generative Adversarial Network (GAN): Training Two Networks

• Discriminator: a binary classifier that predicts whether an image is “real” (e.g. a photo) or “fake” (an image produced by transforming random noise)

• Generator: a model that takes random numbers as input and produces an “image” as output

• Updates are alternately applied to train the two models
• Update the discriminator: use a batch consisting of “real” images from a data set as well as “fake” images produced by the “generator”; weight updates are based on the derivative of the cross entropy (real/fake) loss with respect to each weight [back propagated across operations of the discriminator]
• Update the generator: use a batch consisting of only “fake” images produced by the generator; weight updates are still based on the derivative of the cross entropy (real/fake) loss with respect to each weight [back propagated across operations of *both* the discriminator and the generator]
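A minimal sketch of these alternating updates in Keras style. It assumes three pre-built models (generator, discriminator, and gan, the discriminator stacked on top of the generator with the discriminator frozen) plus a real_images array; all names and settings here are illustrative:

import numpy as np

batch_size, latent_dim, num_steps = 128, 100, 10000

for step in range(num_steps):
    # 1) update the discriminator on a half-real, half-fake batch
    noise = np.random.normal(size=(batch_size, latent_dim))
    fake = generator.predict(noise)
    idx = np.random.randint(0, len(real_images), batch_size)
    x = np.concatenate([real_images[idx], fake])
    y = np.concatenate([np.ones((batch_size, 1)), np.zeros((batch_size, 1))])
    discriminator.train_on_batch(x, y)

    # 2) update the generator: label fakes as "real" so the error is
    #    back propagated through both the discriminator and the generator
    noise = np.random.normal(size=(batch_size, latent_dim))
    gan.train_on_batch(noise, np.ones((batch_size, 1)))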

GANs

Example Generator (4 conv ops)

Hands-On Unsupervised Learning Using Python

GANs

Example Discriminator (4 conv ops)

Hands-On Unsupervised Learning Using Python

GANs

How to Evaluate a GAN?

• Inception Score
• Use the generator to create images
• Use an inception network to predict class probabilities for each image
• Measure the Kullback-Leibler divergence between the predicted class probabilities and the mean predicted class probabilities for the set of images
• Larger values are better (predicted class probabilities differ from expectations)

• Frechet Inception Distance
• Use the generator to create images
• Use the predictions of the last pooling layer of an inception network to compare the distribution of predictions for “real” images to the distribution of predictions for “fake” images
• Smaller values are better (squared distances between the means of the distributions and the covariance matrices of the distributions are smaller)
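A sketch of the Frechet Inception Distance computation just described, assuming act_real and act_fake are pooling-layer activations already extracted from an inception network (shapes and names are illustrative):

import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(act_real, act_fake):
    mu_r, mu_f = act_real.mean(axis=0), act_fake.mean(axis=0)
    cov_r = np.cov(act_real, rowvar=False)
    cov_f = np.cov(act_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    return np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean)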

How Good is My GAN?

GANs

Training Last Year’s State of the Art: Nvidia’s Progressive Growing of GANs

• Canadian Institute For Advanced Research’s 10 class data [CIFAR10]

• Generator: 20,719,628 parameters

• Discriminator: 20,726,785 parameters

• Total training time: 29 hours, 54 minutes [using Nvidia Titan V GPU]

• Reported Inception Score: max = 8.80; avg = 8.56 [previous state of the art for “unsupervised” (class not specified as input) GAN is 7.90]

• Inception Score for our newly trained model: 8.08

• Inception Score for real images: 11.22

• Frechet Inception Distance for our newly trained model: 15.66

• Frechet Inception Distance for real images: 0.00 [by definition]

https://github.com/tkarras/progressive_growing_of_gans: change tf.squeeze(pool3) to tf.squeeze(pool3, [1, 2]) for scoring

GANs

Initial versus Final Generator Outputs

GANs

Evaluation Metrics While Training

[Two line plots: Inception Score (y-axis 0 to 9) versus training time in hours, and Frechet Inception Distance (y-axis 0 to 350) versus training time in hours, over roughly 30 hours of training]

GANs

Evaluating This Year’s State of the Art: Google’s BigGAN

• “We demonstrate that GANs benefit dramatically from scaling, and train models with two to four times as many parameters and eight times the batch size compared to prior art”
• Class-conditional image synthesis (desired class provided as input)
• We train on a Google Tensor Processing Unit (TPU) v3 Pod, with the number of cores proportional to the resolution
• 128 for 128×128

• 256 for 256×256

• 512 for 512×512

• Training takes between 24 and 48 hours for most models

https://github.com/huggingface/pytorch-pretrained-BigGAN

GANs

BigGAN Evaluation Metrics

• CIFAR-10
• 50,000 color images for training; 32 x 32

• 10 classes (5,000 images for each class)

• Inception Score: 9.22

• Frechet Inception Distance: 14.73

• Image-net Large Scale Visual Recognition Challenge (ILSVRC) 2012
• 1,281,167 color images for training; various sizes

• 1,000 WordNet classes (min = 732 images; max = 1,300 images)

• Inception Score for 256x256 model: 232.5

• Frechet Inception Distance for 256x256 model: 8.1

GANs

Example BigGAN Output

Noise truncated to 0.1 (controls variance)

GANs

BigGAN Examples 1 of 2

GANs

BigGAN Examples 2 of 2

GANs

Generating BigGAN Output Examples

# adjust for your version of nvidia’s common unified device architecture (cuda) toolkit:

# https://pytorch.org/get-started/locally/

wget https://repo.anaconda.com/archive/Anaconda3-2018.12-Linux-x86_64.sh

bash Anaconda3-2018.12-Linux-x86_64.sh

conda install pytorch torchvision cudatoolkit=9.0 -c pytorch

git clone https://github.com/huggingface/pytorch-pretrained-BigGAN.git

cd pytorch-pretrained-BigGAN

pip install -r full_requirements.txt

pip install -r requirements.txt

python

import nltk

nltk.download("wordnet")

exit()

https://github.com/huggingface/pytorch-pretrained-BigGAN

# search for "import torch"

# copy all of the code in the box to a file named sample.py, except the "display_in_terminal(output)" line
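A sketch of what sample.py might contain, following the example in the pytorch-pretrained-BigGAN README (the helper functions and model name come from that README; the chosen classes and truncation value are illustrative):

import torch
from pytorch_pretrained_biggan import (BigGAN, one_hot_from_names,
                                       truncated_noise_sample, save_as_images)

model = BigGAN.from_pretrained('biggan-deep-256')

truncation = 0.4
class_vector = one_hot_from_names(['soap bubble', 'coffee', 'mushroom'], batch_size=3)
noise_vector = truncated_noise_sample(truncation=truncation, batch_size=3)

class_vector = torch.from_numpy(class_vector)
noise_vector = torch.from_numpy(noise_vector)

# move everything to the GPU if one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
class_vector = class_vector.to(device)
noise_vector = noise_vector.to(device)

with torch.no_grad():
    output = model(noise_vector, class_vector, truncation)

# save the generated images as png files (the display_in_terminal call is omitted, per the note above)
save_as_images(output.to('cpu'))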

python sample.py

Potential Issues Encountered for GANs

• Network “collapse” is a frequently lamented problem in this space: the discriminator essentially memorizes the “real” images of the training data and starts rejecting all other images as “fake”

Validation data must be used to detect this condition

• BigGAN definitely seems better at generating some of the classes compared to others [BigGAN’s output for “Band Aid” still haunts me]

• Existing evaluation measures are not particularly helpful for identifying whether an output is “good enough” to fool a human

GANs

Datasets Review

• Images (PIL)
• MNIST Digit Classification
• Fashion Accessory Classification
• CIFAR10 Image Classification
• Tiny ImageNet Classification (MAP)

• Text (spaCy, NLTK)
• Newsgroups Classification
• Reuters MultiLabel Classification (Macro Averaged ROC AUC)
• Penn TreeBank Language Modeling (Perplexity)
• IMDB Review Sentiment Classification

• Speech (libROSA)
• Google Commands

Review
