TensorBoard and Applications Jun 12, 2019 [email protected] http://cross-entropy.net/ML410/Deep_Learning_9.pdf


Page 1:

TensorBoard and Applications

Jun 12, 2019

[email protected]

http://cross-entropy.net/ML410/Deep_Learning_9.pdf

Page 2:

Agenda

• TensorBoard

• Speech Recognition Homework: Spatial to Frequency Representation

• Single Shot MultiBox Detector (SSD)

• Bi-directional Encoder Representations from Transformers (BERT)

• GANs

• Datasets Review

Page 3:

The Loop of Progress

TensorBoard

Page 4:

Text-Classification Model to Use with TensorBoard

TensorBoard

Page 5:

Training the Model with a TensorBoard Callback

$ tensorboard --logdir=my_log_dir

TensorBoard

Page 6:

TensorBoard: Metrics Monitoring

TensorBoard

Page 7:

TensorBoard: Activation Histograms

TensorBoard

Page 8:

TensorBoard: Interactive 3D Word-Embedding Visualization

TensorBoard

Page 9:

TensorBoard: TensorFlow Graph Visualization

TensorBoard

Page 10:

keras.utils.plot_model()

TensorBoard

Page 11:

Audio: Spatial to Frequency Representation

https://www.ritchievink.com/blog/2017/04/23/understanding-the-fourier-transform-by-example/

Audio Features

Page 12:

Discrete Fourier Transform

https://www.ritchievink.com/blog/2017/04/23/understanding-the-fourier-transform-by-example/

Audio Features

Page 13:

Discrete Fourier Transform Implementation

https://www.ritchievink.com/blog/2017/04/23/understanding-the-fourier-transform-by-example/

[ (k * n) produces an NxN matrix ]
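The linked implementation fits in a few lines of NumPy; here is a minimal sketch of the same idea, where the outer product (k * n) supplies the N x N matrix of exponents:

import numpy as np

def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    x = np.asarray(x, dtype=complex)
    N = x.shape[0]
    n = np.arange(N)
    k = n.reshape((N, 1))
    M = np.exp(-2j * np.pi * k * n / N)  # (k * n) produces an NxN matrix
    return M @ x

# sanity check against NumPy's built-in FFT
x = np.random.random(256)
assert np.allclose(dft(x), np.fft.fft(x))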

Audio Features

Page 14:

Single Shot MultiBox Detector (SSD)

• Feature Maps

• SSD Architecture

• Matching Ground Truth Boxes to Prediction Boxes

• SSD Loss Function

• VOC 2007 Performance

• Traffic Prediction

• Inference Example

SSD

Page 15:

Feature Maps

https://www.cs.unc.edu/~wliu/papers/ssd.pdf

SSD

Page 16:

Single Shot MultiBox Detector (SSD) Architecture

8,732 boxes:

38 x 38 x 4 + 19 x 19 x 6 + 10 x 10 x 6 + 5 x 5 x 6 + 3 x 3 x 4 + 1 x 1 x 4

aspect ratios include: { 1, 1, 1/2, 2, 1/3, 3 } [aspect ratio 1 appears twice because an extra box at a larger scale is added for it]
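As a quick sanity check on the arithmetic above, a minimal sketch (the (grid size, boxes per cell) pairs are read off the architecture):

# default boxes across SSD300's six feature maps: (grid size, boxes per cell)
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
print(sum(s * s * b for s, b in feature_maps))  # 8732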

SSD

Page 17:

Matching Ground Truth Boxes to Prediction Boxes

Match: Intersection over Union (IoU) > 0.5

Hard Negative Mining: keep the TP:FP ratio fixed (1:3); use the worst misclassified FPs
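For reference, a minimal sketch of the IoU computation used in the matching step (boxes assumed to be in (xmin, ymin, xmax, ymax) corner format):

def iou(box_a, box_b):
    """Intersection over Union for boxes in (xmin, ymin, xmax, ymax) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)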

SSD

Page 18:

Loss Function

• N: the number of default boxes matched

• alpha: set to 1 [by cross validation]

• conf: class confidence loss

• loc: localization loss
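For reference, the overall objective from the SSD paper, combining the terms above (the loss is set to zero when N = 0):

L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right)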

SSD

Page 19:

Localization Loss
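From the SSD paper, the localization loss is a Smooth L1 penalty between the predicted box l and the ground truth box g, with targets encoded relative to the matched default box d:

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, \mathrm{smooth}_{L1}\left( l_i^m - \hat{g}_j^m \right)

\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w} \qquad \hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h} \qquad \hat{g}_j^{w} = \log(g_j^{w} / d_i^{w}) \qquad \hat{g}_j^{h} = \log(g_j^{h} / d_i^{h})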

SSD

Page 20:

Smooth L1 Loss
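The Smooth L1 (Huber-style) loss is quadratic near zero and linear elsewhere, so large localization errors do not dominate the gradient:

\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}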

SSD

Page 21:

Confidence Loss
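From the SSD paper, the confidence loss is a softmax cross entropy over class confidences c, summed over positive matches and the selected hard negatives:

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}) \quad \text{where} \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}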

SSD

Page 22:

PASCAL Visual Object Classes (VOC)

• PASCAL: Pattern Analysis, Statistical modeling, and Computational Learning

• 20 classes

  • Person: person

  • Animal: bird, cat, cow, dog, horse, sheep

  • Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train

  • Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

http://host.robots.ox.ac.uk/pascal/VOC/

SSD

Page 23:

Improvement Via Augmentation

SSD300: an SSD model using 300x300 pixel input images

SSD

Page 24:

Comparing Results on the VOC 2007 Data

SSD

Page 25:

Traffic Prediction

SSD

Page 26:

Inference Example

pip install opencv-python

git clone https://github.com/pierluigiferrari/ssd_keras.git

jupyter-notebook

# open ssd300_inference.ipynb (interactive python notebook)

# https://drive.google.com/open?id=121-kCXaOHOkJE_Kf5lKcJvC_5q1fYb_q

weights_path = 'C:/Temp/VGG_VOC0712_SSD_300x300_iter_120000.h5'

SSD

Page 27:

Question Answering using Bidirectional Encoder Representations from Transformers (BERT)

• SQuAD

  • Stats

  • Measuring Syntactic Divergence

  • Logistic Regression versus Human Performance

• BERT

  • Base versus Large

  • Transformer Execution

  • Fine-Tuning BERT for SQuAD

• References

BERT

Page 28:

Stanford Question Answering Dataset (SQuAD)

• Extracted paragraphs from a sample of “top” Wikipedia articles

• Questions asked and answered using Amazon’s Mechanical Turk

BERT

Page 29:

Wikipedia Article for Precipitation

First 2 sentences and last 2 sentences of first paragraph, without references:

https://en.wikipedia.org/wiki/Precipitation

Paragraph: 3 Questions with 1 Answer for Each:

BERT

Page 30:

Counts for SQuAD

BERT

Page 31:

Survey of Question Answering Datasets

• Small datasets: difficult for statistical models

• Cloze datasets: Children’s Book Test (CBT) and CNN/DailyMail

  • Cloze task: predict the missing word/entity [“performance is almost saturated”]

• SQuAD: answers often include non-entities and can be much longer

The word cloze is derived from closure in Gestalt theory

BERT

Page 32:

SQuAD Answer Types

BERT

Page 33:

SQuAD Question Answering Categories [based on manual review of 192 examples]

BERT

Page 34:

Measuring Syntactic Divergence: Example

Edit distance between dependency tree paths for the question and answer sentences; e.g. delete “amod” (adjectival modifier), substitute “xcomp” for “nmod”, and insert “det” (determiner)

https://nlp.stanford.edu/software/dependencies_manual.pdf

BERT

Page 35:

Measuring Syntactic Divergence: Distribution

BERT

Page 36:

Features for Logistic Regression Baseline: Part 1 of 3 [180 million features]

AllenNLP.org: A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads.

BERT

Page 37:

Features for Logistic Regression Baseline: Part 2 of 3 [180 million features]

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

AllenNLP.org: A constituency parse tree breaks a text into sub-phrases, or constituents. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence.

BERT

Page 38:

Features for Logistic Regression Baseline: Part 3 of 3 [180 million features]

BERT

Page 39:

Text Normalization for Scoring

import re
import string

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))

BERT

Page 40:

Evaluation Metrics: Exact Match (EM)

def exact_match_score(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)

mean(exact_match_score)

BERT

Page 41:

Evaluation Metrics: F1 [the harmonic mean of precision and recall]

from collections import Counter

def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

mean(metric_max_over_ground_truths(f1_score, prediction, ground_truths))
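The metric_max_over_ground_truths helper used above simply takes the best score over the available reference answers:

def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    return max(metric_fn(prediction, ground_truth) for ground_truth in ground_truths)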

BERT

Page 42:

Human Performance

To evaluate human performance, they treat the second answer to each question as the human prediction, and keep the other answers as ground truth answers

BERT

Page 43:

Performance of Logistic Regression Baseline

Non-human candidates restricted to spans which are constituents in the constituency parse generated by Stanford CoreNLP

Baselines included the sliding window and distance-based algorithms by Matt Richardson et al.

BERT

Page 44:

Ablation Study for Logistic Regression Performance

BERT

Page 45:

Performance Stratified by Answer Types

BERT

Page 46:

Performance Stratified by Syntactic Divergence

BERT

Page 47:

BERT

• Bidirectional Encoder Representations from Transformers

• Forward: elements that precede the current element are useful for understanding the current element

• Backward: elements that succeed the current element are useful for understanding the current element

• “Self-Supervised” Pre-Training Tasks

  • Corpora: Books Corpus and … wait for it … English Wikipedia [3.3 billion words]

  • Masked Language Model [MLM]: predict the missing words

  • Next Sentence: predict whether sentence B follows sentence A

BERT

Page 48:

BERT Model Architecture

Model consists of …

• Embeddings block: embeddings for “token”, position, and segment (added), with H=768 (or 1024) floats per embedding

• Encoder stack: L=12 (or 24) transformer blocks, each consisting of 6 weight matrices:

  • Multi-head self-attention (number of heads is H / 64)

    • Query: H x H

    • Key: H x H [softmax( (Input * Query) * (Input * Key)’ / sqrt(64) ) to average values]

    • Value: H x H

    • Projection: H x H

  • Intermediate (H x 4H)

  • Output (4H x H)

• Pooling layer: H x H

Page 49:

Pre-Trained BERT Models

Base: 109,482,240 parameters

• Embeddings: 768 floats

• 12 Transformer Layers

  • Self-Attention

    • 12 heads x 64 dimensions [768]

    • Projection (Convolution)

    • Layer Normalization

  • 2 Convolution Layers (expand/contract)

  • Layer Normalization

• Pooler (Convolution)

Large: 335,141,888 parameters

• Embeddings: 1,024 floats

• 24 Transformer Layers

  • Self-Attention

    • 16 heads x 64 dimensions [1024]

    • Projection (Convolution)

    • Layer Normalization

  • 2 Convolution Layers (expand/contract)

  • Layer Normalization

• Pooler (Convolution)

All convolutions use a filter width of one (WordPiece) token. 30,522 tokens in the vocabulary for both models.
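Both totals can be reproduced from the shapes above; a minimal sketch (each weight matrix carries a bias vector, and each layer normalization carries gamma and beta vectors):

def bert_params(H, L, V=30522, P=512, S=2):
    embeddings = (V + P + S) * H + 2 * H              # token/position/segment tables + LayerNorm
    attention = 4 * (H * H + H)                       # query, key, value, projection
    ffn = (H * 4 * H + 4 * H) + (4 * H * H + H)       # intermediate (expand) + output (contract)
    layer = attention + ffn + 2 * (2 * H)             # two LayerNorms per transformer layer
    pooler = H * H + H
    return embeddings + L * layer + pooler

print(bert_params(H=768, L=12))   # 109482240 (Base)
print(bert_params(H=1024, L=24))  # 335141888 (Large)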

BERT

Page 50:

Parameter Details: Embeddings

Replace 768 with 1024 for BERT Large

BERT

Page 51:

Parameter Details: Transformer Layer

Replace 768 with 1024 and 3072 with 4096 for BERT Large

BERT

Page 52:

Parameter Details: Pooling and Classifier Layers

Replace 768 with 1024 for BERT Large

BERT

Page 53:

BERT Base Transformer Layer Execution

1. Compute Self-Attention Output Matrix [next 3 slides]

2. Convolve the Self-Attention Output with the Intermediate Filters (length 1 filters): (512 x 768) x (768 x 3072) = (512 x 3072)

3. Convolve the Intermediate Filter Output with the Output Filters (length 1 filters): (512 x 3072) x (3072 x 768) = (512 x 768)

4. Normalize: multiply feature values by feature-specific gamma (scale parameter) and add feature-specific beta (location parameter)

BERT

Note the residual connections to the Norm

Page 54:

BERT Base Self-Attention: Part 1 of 3

• Sequence length: s = 512

• Number of heads: h = 12

• Number of dimensions for Key and Value: d = 64

• Number of dimensions for Model: m = h * d = 12 * 64 = 768

• Input Matrix: (s x m)

• Query Weight Matrix: (m x d) # one for each head

• Key Weight Matrix: (m x d) # one for each head

• Value Weight Matrix: (m x d) # one for each head

• Projection Matrix: (m x m) # only one

BERT

Page 55:

BERT Base Self-Attention: Part 2 of 3

1. For each head …

  a. Multiply Input Matrix by Query Weight Matrix to produce Query Matrix: (512x768) x (768x64) = (512x64) ["broadcast" (add) bias vector to rows]

  b. Multiply Input Matrix by Key Weight Matrix to produce Key Matrix: (512x768) x (768x64) = (512x64) ["broadcast" (add) bias vector to rows]

  c. Multiply Input Matrix by Value Weight Matrix to produce Value Matrix: (512x768) x (768x64) = (512x64) ["broadcast" (add) bias vector to rows]

  d. Multiply Query Matrix by transposed Key Matrix to produce the Attention Matrix: (512x64) x (64x512) = (512x512) [also normalize (divide) by the square root of the number of dimensions for the key]

  e. SoftMax across the rows of the Attention Matrix: still (512x512)

  f. Multiply SoftMax Matrix by the Value Matrix to produce the Attended Output: (512x512) x (512x64) = (512x64)

Cell i,j contains a weighted sum of Value[,j] for all elements in the sequence, based on weights specific to element i

BERT

Page 56:

BERT Base Self-Attention: Part 3 of 3

2. Concatenate the Attended Output matrices, then multiply by the Projection Matrix to produce the Projected Output Matrix: (512x768) x (768x768) = (512x768)

3. Normalize: multiply feature values by feature-specific gamma (scale parameter) and add feature-specific beta (location parameter)

http://jalammar.github.io/illustrated-transformer/
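A minimal NumPy sketch of the computation above, with random weights and the bias vectors omitted for brevity:

import numpy as np

def attention_head(X, Wq, Wk, Wv, d=64):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # (512x768) x (768x64) = (512x64) each
    A = Q @ K.T / np.sqrt(d)             # (512x512) attention matrix
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)    # softmax across the rows
    return A @ V                         # (512x512) x (512x64) = (512x64)

s, m, h, d = 512, 768, 12, 64
rng = np.random.default_rng(0)
X = rng.normal(size=(s, m))
heads = [attention_head(X, rng.normal(size=(m, d)), rng.normal(size=(m, d)),
                        rng.normal(size=(m, d))) for _ in range(h)]
projected = np.concatenate(heads, axis=1) @ rng.normal(size=(m, m))  # (512x768)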

BERT

Page 57:

Fine-Tuning BERT for SQuAD: Fetch Stuff

git clone https://github.com/google-research/bert.git

cd bert

wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip

unzip uncased_L-12_H-768_A-12.zip

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

wget https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py

BERT

Page 58:

Fine-Tuning BERT for SQuAD: Train and Predict

python run_squad.py \

> --vocab_file=${HOME}/bert/uncased_L-12_H-768_A-12/vocab.txt \

> --bert_config_file=${HOME}/bert/uncased_L-12_H-768_A-12/bert_config.json \

> --init_checkpoint=${HOME}/bert/uncased_L-12_H-768_A-12/bert_model.ckpt \

> --do_train=True \

> --train_file=${HOME}/bert/train-v1.1.json \

> --do_predict=True \

> --predict_file=${HOME}/bert/dev-v1.1.json \

> --train_batch_size=16 \

> --learning_rate=3e-5 \

> --num_train_epochs=2.0 \

> --max_seq_length=384 \

> --doc_stride=128 \

> --output_dir=/tmp/squad/ \

> --save_checkpoints_steps 20000

BERT

Page 59:

Nvidia System Management Interface Output

Power Usage is 263 Watts [fluctuates over time]; 62° Celsius is 144° Fahrenheit (gets warmer after a while)

BERT

Page 60:

Fine-Tuning BERT for SQuAD: Evaluate (dev)

• python evaluate-v1.1.py \
    dev-v1.1.json \
    /tmp/squad_large/predictions.json

• Platform: Amazon Web Services (AWS) Elastic Compute Cloud (EC2) p3.2xlarge instance [Nvidia GV100 (Volta) GPU with 16 GB memory]

• Base

  • Batch size: 16

  • 97 minutes (1.6 hours) to train and make predictions for the dev set

  • {"exact_match": 80.96499526963103, "f1": 88.3375015101166}

• Console output: https://www.cross-entropy.net/ML410/bert-console.txt

BERT

Page 61:

References

• SQuAD: 100,000+ Questions for Machine Comprehension of Text

  • SQuAD 1: https://arxiv.org/abs/1606.05250

• BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • https://arxiv.org/pdf/1810.04805.pdf

• Know What You Don’t Know: Unanswerable Questions for SQuAD

  • SQuAD 2: https://arxiv.org/abs/1806.03822

  • Added questions without answers

BERT

Page 62:

GANs

• Motivation

• Example of Deep Convolutional Generative Adversarial Network

• Evaluation Metrics

• Training Nvidia’s Progressive Growing of GANs

• Evaluating Google’s BigGAN

• Trouble with Using Discriminator to Evaluate Quality

GANs

Page 63:

Inspiration from News Coverage: https://thispersondoesnotexist.com/

https://www.theverge.com/2019/3/3/18244984/ai-generated-fake-which-face-is-real-test-stylegan

GANs

Page 64:

Generative Adversarial Network (GAN): Training Two Networks

• Discriminator: a binary classifier that predicts whether an image is “real” (e.g. a photo) or “fake” (an image produced by transforming random noise)

• Generator: a model that takes random numbers as input and produces an “image” as output

• Updates are alternately applied to train the two models [see the sketch below]

  • Update the discriminator: use a batch consisting of “real” images from a data set as well as “fake” images produced by the generator; weight updates are based on the derivative of cross entropy (real/fake) with respect to the weight [back propagated across operations of the discriminator]

  • Update the generator: use a batch consisting of only “fake” images produced by the generator; weight updates are still based on the derivative of cross entropy (real/fake) with respect to the weight [back propagated across operations of *both* the discriminator and the generator]
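A minimal Keras sketch of this alternating scheme (the dense generator and discriminator here are placeholder architectures, not the convolutional ones shown on the next two slides):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 100

generator = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(28 * 28, activation="tanh")])
discriminator = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(28 * 28,)),
    layers.Dense(1, activation="sigmoid")])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# stacked model used only for the generator update: freeze the discriminator first
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_images, batch_size=64):
    noise = np.random.normal(size=(batch_size, latent_dim))
    fake_images = generator.predict(noise)
    # update the discriminator: real batch labeled 1, fake batch labeled 0
    discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))
    discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))
    # update the generator: gradients flow back through the frozen discriminator
    gan.train_on_batch(noise, np.ones((batch_size, 1)))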

GANs

Page 65:

Example Generator (4 conv ops)

Hands-On Unsupervised Learning Using Python

GANs

Page 66:

Example Discriminator (4 conv ops)

Hands-On Unsupervised Learning Using Python

GANs

Page 67:

How to Evaluate a GAN?

• Inception Score

  • Use the generator to create images

  • Use an inception network to predict class probabilities for each image

  • Measure the Kullback-Leibler divergence between the predicted class probabilities and the mean predicted class probabilities for the set of images

  • Larger values are better (predicted class probabilities differ from expectations)

• Frechet Inception Distance

  • Use the generator to create images

  • Use the predictions of the last pooling layer of an inception network to compare the distribution of predictions for “real” images to the distribution of predictions for “fake” images

  • Smaller values are better (squared distances between the means of the distributions and the covariance matrices of the distributions are smaller)

How Good is My GAN?
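The Inception Score can be computed directly from the predicted class probabilities; a minimal sketch (probs is an images x classes array; the common practice of averaging over splits is omitted):

import numpy as np

def inception_score(probs, eps=1e-12):
    """exp(mean KL(p(y|x) || p(y))); larger is better."""
    marginal = probs.mean(axis=0, keepdims=True)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))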

GANs

Page 68:

Training Last Year’s State of the Art: Nvidia’s Progressive Growing of GANs

• Canadian Institute For Advanced Research’s 10 class data [CIFAR10]

• Generator: 20,719,628 parameters

• Discriminator: 20,726,785 parameters

• Total training time: 29 hours, 54 minutes [using Nvidia Titan V GPU]

• Reported Inception Score: max = 8.80; avg = 8.56 [previous state of the art for “unsupervised” (class not specified as input) GAN is 7.90]

• Inception Score for our newly trained model: 8.08

• Inception Score for real images: 11.22

• Frechet Inception Distance for our newly trained model: 15.66

• Frechet Inception Distance for real images: 0.00 [by definition]

https://github.com/tkarras/progressive_growing_of_gans: change tf.squeeze(pool3) to tf.squeeze(pool3, [1, 2]) for scoring

GANs

Page 69:

Initial versus Final Generator Outputs

GANs

Page 70:

Evaluation Metrics While Training

[Two charts: Inception Score vs. training time in hours (y-axis 0.0–9.0, x-axis 0–30) and Frechet Inception Distance vs. training time in hours (y-axis 0–350, x-axis 0–30)]

GANs

Page 71:

Evaluating This Year’s State of the Art: Google’s BigGAN

• “We demonstrate that GANs benefit dramatically from scaling, and train models with two to four times as many parameters and eight times the batch size compared to prior art”

• Class-conditional image synthesis (desired class provided as input)

• Trained on a Google Tensor Processing Unit (TPU) v3 Pod, with the number of cores proportional to the resolution

  • 128 for 128×128

  • 256 for 256×256

  • 512 for 512×512

• Training takes between 24 and 48 hours for most models

https://github.com/huggingface/pytorch-pretrained-BigGAN

GANs

Page 72:

BigGAN Evaluation Metrics

• CIFAR-10

  • 50,000 color images for training; 32 x 32

  • 10 classes (5,000 images for each class)

  • Inception Score: 9.22

  • Frechet Inception Distance: 14.73

• ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012

  • 1,281,167 color images for training; various sizes

  • 1,000 WordNet classes (min = 732 images; max = 1,300 images)

  • Inception Score for the 256x256 model: 232.5

  • Frechet Inception Distance for the 256x256 model: 8.1

GANs

Page 73:

Example BigGAN Output

Noise truncated to 0.1 (controls variance)

GANs

Page 74:

BigGAN Examples 1 of 2

GANs

Page 75:

BigGAN Examples 2 of 2

GANs

Page 76:

Generating BigGAN Output Examples

# adjust for your version of Nvidia’s Compute Unified Device Architecture (CUDA) toolkit:

# https://pytorch.org/get-started/locally/

wget https://repo.anaconda.com/archive/Anaconda3-2018.12-Linux-x86_64.sh

bash Anaconda3-2018.12-Linux-x86_64.sh

conda install pytorch torchvision cudatoolkit=9.0 -c pytorch

git clone https://github.com/huggingface/pytorch-pretrained-BigGAN.git

cd pytorch-pretrained-BigGAN

pip install -r full_requirements.txt

pip install -r requirements.txt

python

import nltk

nltk.download("wordnet")

exit()

https://github.com/huggingface/pytorch-pretrained-BigGAN

# search for "import torch"

# copy all of the code in the box to a file named sample.py, except the "display_in_terminal(output)" line

python sample.py

Page 77:

Potential Issues Encountered for GANs

• Network “collapse” is a frequently lamented problem in this space: the discriminator essentially memorizes the “real” images of the training data and starts rejecting all other images as “fake”

  • Validation data must be used to detect this condition

• BigGAN definitely seems better at generating some of the classes compared to others [BigGAN’s output for “Band Aid” still haunts me]

• Existing evaluation measures are not particularly helpful for identifying whether an output is “good enough” to fool a human

GANs

Page 78:

Datasets Review

• Images (PIL)

  • MNIST Digit Classification

  • Fashion Accessory Classification

  • CIFAR10 Image Classification

  • Tiny ImageNet Classification (MAP)

• Text (spaCy, NLTK)

  • Newsgroups Classification

  • Reuters MultiLabel Classification (Macro Averaged ROC AUC)

  • Penn TreeBank Language Modeling (Perplexity)

  • IMDB Review Sentiment Classification

• Speech (libROSA)

  • Google Commands

Review