
TensorBoard and Applications

Jun 12, 2019

ddebarr@uw.edu

http://cross-entropy.net/ML410/Deep_Learning_9.pdf

Agenda

• TensorBoard

• Speech Recognition Homework: Spatial to Frequency Representation

• Single Shot MultiBox Detector (SSD)

• Bi-directional Encoder Representations from Transformers (BERT)

• GANs

• Datasets Review

The Loop of Progress

TensorBoard

Text-Classification Model to Use with TensorBoard
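The slide's code is not reproduced in this extract; below is a minimal Keras text-classification model in the spirit of this slide (the dataset, layer choices, and hyperparameters are illustrative, not the exact model shown in class):

import keras
from keras import layers
from keras.preprocessing import sequence

max_features = 2000   # number of words to consider as features
max_len = 500         # cut reviews off after this many words

(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)

model = keras.models.Sequential([
    layers.Embedding(max_features, 128, input_length=max_len, name='embed'),
    layers.Conv1D(32, 7, activation='relu'),
    layers.MaxPooling1D(5),
    layers.Conv1D(32, 7, activation='relu'),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])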

TensorBoard

Training the Model with a TensorBoard Callback
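A sketch of fitting with a TensorBoard callback that writes logs to my_log_dir (assumes the model and data defined above; the epoch count and batch size are illustrative):

callbacks = [
    keras.callbacks.TensorBoard(
        log_dir='my_log_dir',   # directory passed to tensorboard --logdir below
        histogram_freq=1,       # record activation/weight histograms every epoch
        embeddings_freq=1,      # record embedding data (may require embeddings_data in some Keras versions)
    )
]
history = model.fit(x_train, y_train,
                    epochs=20, batch_size=128,
                    validation_split=0.2, callbacks=callbacks)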

$ tensorboard --logdir=my_log_dir

TensorBoard

TensorBoard: Metrics Monitoring

TensorBoard

TensorBoard: Activation Histograms

TensorBoard

TensorBoard: Interactive 3D Word-Embedding Visualization

TensorBoard

TensorBoard: TensorFlow Graph Visualization

TensorBoard

keras.utils.plot_model()
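A brief usage example (plot_model writes the model graph to an image file; it requires the pydot and graphviz packages to be installed, and model is the network defined earlier):

from keras.utils import plot_model

plot_model(model, show_shapes=True, to_file='model.png')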

TensorBoard

Audio: Spatial to Frequency Representation

https://www.ritchievink.com/blog/2017/04/23/understanding-the-fourier-transform-by-example/

Audio Features

Discrete Fourier Transform

https://www.ritchievink.com/blog/2017/04/23/understanding-the-fourier-transform-by-example/

Audio Features

Discrete Fourier Transform Implementation

https://www.ritchievink.com/blog/2017/04/23/understanding-the-fourier-transform-by-example/

[ (k * n) produces an NxN matrix ]
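A sketch of the vectorized DFT implementation the bracketed note describes, following the approach in the referenced blog post (variable names are mine):

import numpy as np

def dft(x):
    x = np.asarray(x, dtype=float)
    N = x.shape[0]
    n = np.arange(N)
    k = n.reshape((N, 1))
    M = np.exp(-2j * np.pi * k * n / N)   # (k * n) produces an NxN matrix of exponents
    return M.dot(x)                       # X[k] = sum_n x[n] * exp(-2*pi*i*k*n/N)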

Audio Features

Single Shot MultiBox Detector (SSD)

• Feature Maps

• SSD Architecture

• Matching Ground Truth Boxes to Prediction

• SSD Loss Function

• VOC 2007 Performance

• Traffic Prediction

• Inference Example

SSD

Feature Maps

https://www.cs.unc.edu/~wliu/papers/ssd.pdf

SSD

Single Shot MultiBox Detector (SSD) Architecture

8,732 boxes:

38 x 38 x 4 + 19 x 19 x 6 + 10 x 10 x 6 + 5 x 5 x 6 + 3 x 3 x 4 + 1 x 1 x 4

aspect ratios include: { 1, 1, 1/2, 2, 1/3, 3 }
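A one-line check that the feature map sizes and boxes-per-cell counts above do sum to 8,732:

feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]   # (grid size, boxes per cell)
print(sum(size * size * boxes for size, boxes in feature_maps))      # 8732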

SSD

Matching Ground Truth Boxes to Prediction Boxes

Match: Intersection over Union (IoU) > 0.5

Hard Negative Mining:

Keep the positive-to-negative ratio fixed at 1:3, using the worst misclassified negatives (those with the highest confidence loss)
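A minimal sketch of the IoU computation used for matching, with boxes given as (xmin, ymin, xmax, ymax) corner coordinates (not the exact code from any particular SSD implementation):

def iou(box_a, box_b):
    # intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# a ground truth box is matched to a default box when iou(ground_truth, default_box) > 0.5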

SSD

Loss Function

• N: the number of default boxes matched

• alpha: set to 1 [by cross validation]

• conf: class confidence loss

• loc: localization loss
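For reference, the combined objective these terms form (as given in the SSD paper) is

L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g)\right)

with the loss set to 0 when N = 0.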

SSD

Localization Loss

SSD

Smooth L1 Loss
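The smooth L1 penalty applied to each localization offset is the standard definition:

\text{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}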

SSD

Confidence Loss
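The confidence loss is the softmax loss over the class confidences, as defined in the SSD paper:

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}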

SSD

PASCAL Visual Object Classes (VOC)

• PASCAL: Pattern Analysis, Statistical modeling, and Computational Learning

• 20 classes
• Person: person

• Animal: bird, cat, cow, dog, horse, sheep

• Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train

• Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

http://host.robots.ox.ac.uk/pascal/VOC/

SSD

Improvement Via Augmentation

SSD300: an SSD model using 300x300 pixel input images

SSD

Comparing Results on the VOC 2007 Data

SSD

Traffic Prediction

SSD

Inference Example

pip install opencv-python

git clone https://github.com/pierluigiferrari/ssd_keras.git

jupyter-notebook

# open ssd300_inference.ipynb (interactive python notebook)

# https://drive.google.com/open?id=121-kCXaOHOkJE_Kf5lKcJvC_5q1fYb_q

weights_path = 'C:/Temp/VGG_VOC0712_SSD_300x300_iter_120000.h5'

SSD

Question Answering using Bidirectional Encoder Representations from Transformers (BERT)

• SQuAD
• Stats

• Measuring Syntactic Divergence

• Logistic Regression versus Human Performance

• BERT
• Base versus Large

• Transformer Execution

• Fine-Tuning BERT for SQuAD

• References

BERT

Stanford Question Answering Dataset (SQuAD)

• Extracted paragraphs from a sample of “top” Wikipedia articles

• Questions asked and answered using Amazon’s Mechanical Turk

BERT

Wikipedia Article for Precipitation

First 2 sentences and last 2 sentences of first paragraph, without references:

https://en.wikipedia.org/wiki/Precipitation

Paragraph: 3 Questions with 1 Answer for Each:

BERT

Counts for SQuAD

BERT

Survey of Question Answering Datasets

• Small: difficult for statistical models

• Cloze datasets: Children’s Book Test (CBT) and CNN/DailyMail

• Cloze data: predict the missing word/entity [“performance is almost saturated”]

• SQuAD: answers often include non-entities and can be much longer

The word cloze is derived from closure in Gestalt theory

BERT

SQuAD Answer Types

BERT

SQuAD Question Answering Categories [based on manual review of 192 examples]

BERT

Measuring Syntactic Divergence: Example

Edit distance between dependency tree paths for the question and answer sentences; e.g. delete “amod” (adjectival modifier), substitute “xcomp” for “nmod”, and insert “det” (determiner)

https://nlp.stanford.edu/software/dependencies_manual.pdf

BERT

Measuring Syntactic Divergence: Distribution

BERT

Features for Logistic Regression Baseline: Part 1 of 3 [180 million features]

AllenNLP.org: A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads.

BERT

Features for Logistic Regression Baseline: Part 2 of 3 [180 million features]

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

AllenNLP.org: A constituency parse tree breaks a text into sub-phrases, or constituents. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence.

BERT

Features for Logistic Regression Baseline: Part 3 of 3 [180 million features]

BERT

Text Normalization for Scoring

import re
import string

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))

BERT

Evaluation Metrics: Exact Match (EM)

def exact_match_score(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)

# overall EM: average exact_match_score over all questions
mean(exact_match_score)

BERT

Evaluation Metrics: F1[the harmonic mean of precision and recall]

from collections import Counter

def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

# overall F1: average the best score against any of the ground truth answers
mean(metric_max_over_ground_truths(f1_score, prediction, ground_truths))

BERT

Human Perf

To evaluate human performance, they treat the second answer to each question as the human prediction, and keep the other answers as ground truth answers

BERT

Performance of Logistic Regression Baseline

Non-human candidates restricted to spans which are constituents in the constituency parse generated by Stanford CoreNLP

Baselines included the sliding window and distance-based algorithms by Matt Richardson et al.

BERT

Ablation Study for Logistic Regression Perf

BERT

Performance Stratified by Answer Types

BERT

Performance Stratified by Syntactic Divergence

BERT

BERT

• Bidirectional Encoder Representations from Transformers

• Forward: elements that precede the current element are useful for understanding the current element

• Backward: elements that succeed the current element are useful for understanding the current element

• “Self-Supervised” Pre-Training Tasks
• Corpora: Books Corpus and … wait for it … English Wikipedia [3.3 billion words]

• Masked Language Model [MLM]: predict the missing words

• Next Sentence: predict whether sentence B follows sentence A

BERT

BERT Model Architecture

Model consists of …
• Embeddings block: embeddings for “token”, position, and segment (added), with H=768 (or 1024) floats per embedding
• Encoder stack: L=12 (or 24) transformer blocks, each consisting of 6 weight matrices:
• Multi-head self-attention (number of heads is H / 64)

• Query: H x H

• Key: H x H [softmax( (Input * Query) * (Input * Key)’ / sqrt(64) ) to average values]

• Value: H x H

• Projection: H x H

• Intermediate (H x 4H)

• Output (4H x H)

• Pooling layer: H x H

Pre-Trained BERT Models

Base: 109,482,240 parameters

• Embeddings: 768 floats

• 12 Transformer Layers
• Self-Attention
• 12 heads x 64 dimensions [768]
• Projection (Convolution)
• Layer Normalization

• 2 Convolution Layers (expand/contract)

• Layer Normalization

• Pooler (Convolution)

Large: 335,141,888 parameters

• Embeddings: 1,024 floats

• 24 Transformer Layers
• Self-Attention
• 16 heads x 64 dimensions [1024]
• Projection (Convolution)
• Layer Normalization

• 2 Convolution Layers (expand/contract)

• Layer Normalization

• Pooler (Convolution)

All Convolution based on filter width of one (word piece) token

30,522 tokens in the vocabulary for both models
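A quick arithmetic check (my own, not from the slides) that these configuration values reproduce the stated parameter count; the same sum with H=1024, L=24, I=4096 gives the Large count of 335,141,888:

V, P, T = 30522, 512, 2                 # vocabulary, position, and segment ("token type") sizes
H, L, I = 768, 12, 3072                 # hidden size, layers, intermediate (4H) size

embeddings = (V + P + T) * H + 2 * H    # token + position + segment tables, plus LayerNorm gamma/beta
attention  = 4 * (H * H + H) + 2 * H    # Query, Key, Value, Projection (weights + biases), plus LayerNorm
ffn        = (H * I + I) + (I * H + H) + 2 * H   # Intermediate and Output "convolutions", plus LayerNorm
pooler     = H * H + H

print(embeddings + L * (attention + ffn) + pooler)   # 109482240 for BERT Base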

BERT

Parameter Details: Embeddings

Replace 768 with 1024 for BERT Large

BERT

Parameter Details: Transformer Layer

Replace 768 with 1024 and 3072 with 4096 for BERT Large

BERT

Parameter Details: Pooling and Classifier Layers

Replace 768 with 1024 for BERT Large

BERT

BERT Base Transformer Layer Execution

1. Compute Self-Attention Output Matrix [next 3 slides]

2. Convolve the Self-Attention Output with the Intermediate Filters (length 1 filters): (512 x 768) x (768 x 3072) = (512 x 3072)

3. Convolve the Intermediate Filter Output with the Output Filters (length 1 filters): (512 x 3072) x (3072 x 768) = (512 x 768)

4. Normalize: multiply feature values by feature-specific gamma (scale parameter) and add feature-specific beta (location parameter)

BERT

Note the residual connections to the Norm

BERT Base Self-Attention: Part 1 of 3

• Sequence length: s = 512
• Number of heads: h = 12
• Number of dimensions for Key and Value: d = 64
• Number of dimensions for Model: m = h * d = 12 * 64 = 768

• Input Matrix: (s x m)
• Query Weight Matrix: (m x d) # one for each head
• Key Weight Matrix: (m x d) # one for each head
• Value Weight Matrix: (m x d) # one for each head
• Projection Matrix: (m x m) # only one

BERT

BERT Base Self-Attention: Part 2 of 3

1. For each head …
   a. Multiply Input Matrix by Query Weight Matrix to produce Query Matrix: (512x768) x (768x64) = (512x64) [“broadcast” (add) bias vector to rows]
   b. Multiply Input Matrix by Key Weight Matrix to produce Key Matrix: (512x768) x (768x64) = (512x64) [“broadcast” (add) bias vector to rows]
   c. Multiply Input Matrix by Value Weight Matrix to produce Value Matrix: (512x768) x (768x64) = (512x64) [“broadcast” (add) bias vector to rows]
   d. Multiply Query Matrix by transposed Key Matrix to produce the Attention Matrix: (512x64) x (64x512) = (512x512) [also normalize (divide) by the square root of the number of dimensions for Key]
   e. SoftMax across the rows of the Attention Matrix: still (512x512)
   f. Multiply SoftMax Matrix by the Value Matrix to produce the Attended Output: (512x512) x (512x64) = (512x64); cell i,j contains a weighted sum of Value[,j] for all elements in the sequence, based on weights specific to element i

BERT

BERT Base Self-Attention: Part 3 of 3

2. Concatenate the Attended Output matrices, then multiply by the Projection Matrix to produce the Projected Output Matrix: (512x768) x (768x768) = (512x768)

3. Normalize: multiply feature values by feature-specific gamma (scale parameter) and add feature-specific beta (location parameter)

http://jalammar.github.io/illustrated-transformer/
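A NumPy sketch of parts 1 through 3 above (random matrices stand in for learned weights; bias vectors and layer normalization are omitted):

import numpy as np

s, m, d, h = 512, 768, 64, 12          # sequence length, model dims, dims per head, heads
np.random.seed(0)
X = np.random.randn(s, m)              # Input Matrix

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(h):
    Wq, Wk, Wv = np.random.randn(m, d), np.random.randn(m, d), np.random.randn(m, d)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv   # steps a-c: (512x768) x (768x64) = (512x64)
    A = softmax(Q @ K.T / np.sqrt(d))  # steps d-e: (512x512) attention weights
    heads.append(A @ V)                # step f: (512x64) attended output

Wp = np.random.randn(m, m)             # Projection Matrix
out = np.concatenate(heads, axis=1) @ Wp   # step 2: (512x768) x (768x768) = (512x768)
print(out.shape)                       # (512, 768)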

BERT

Fine-Tuning BERT for SQuAD:Fetch Stuff

git clone https://github.com/google-research/bert.git

cd bert

wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip

unzip uncased_L-12_H-768_A-12.zip

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

wget https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py

BERT

Fine-Tuning BERT for SQuAD: Train and Predict

python run_squad.py \

> --vocab_file=${HOME}/bert/uncased_L-12_H-768_A-12/vocab.txt \

> --bert_config_file=${HOME}/bert/uncased_L-12_H-768_A-12/bert_config.json \

> --init_checkpoint=${HOME}/bert/uncased_L-12_H-768_A-12/bert_model.ckpt \

> --do_train=True \

> --train_file=${HOME}/bert/train-v1.1.json \

> --do_predict=True \

> --predict_file=${HOME}/bert/dev-v1.1.json \

> --train_batch_size=16 \

> --learning_rate=3e-5 \

> --num_train_epochs=2.0 \

> --max_seq_length=384 \

> --doc_stride=128 \

> --output_dir=/tmp/squad/ \

> --save_checkpoints_steps 20000

BERT

Nvidia System Management Interface Output

Power Usage is 263 Watts [fluctuates over time]; 62° Celsius is 144° Fahrenheit (gets warmer after a while)

BERT

Fine-Tuning BERT for SQuAD: Evaluate (dev)

• python evaluate-v1.1.py \

dev-v1.1.json \

/tmp/squad_large/predictions.json

• Platform: Amazon Web Services (AWS) Elastic Compute Cloud (EC2) p3.2xlarge instance [Nvidia GV100 (Volta) GPU with 16 GB memory]

• Base
• Batch size: 16

• 97 minutes (1.6 hours) to train and make predictions for the dev set

• {"exact_match": 80.96499526963103, "f1": 88.3375015101166}

• Console output: https://www.cross-entropy.net/ML410/bert-console.txt

BERT

References

• Stanford Question Answering Dataset (SQuAD): 100,000+ Questions for Machine Comprehension of Text
• SQuAD 1: https://arxiv.org/abs/1606.05250

• Bidirectional Encoder Representations from Transformers
• https://arxiv.org/pdf/1810.04805.pdf

• Know What You Don’t Know: Unanswerable Questions for SQuAD
• SQuAD 2: https://arxiv.org/abs/1806.03822

• Added questions without answers

BERT

GANs

• Motivation

• Example of Deep Convolutional Generative Adversarial Network

• Evaluation Metrics

• Training Nvidia’s Progressive Growing of GANs

• Evaluating Google’s BigGAN

• Trouble with Using Discriminator to Evaluate Quality

GANs

Inspiration from News Coverage: https://thispersondoesnotexist.com/

https://www.theverge.com/2019/3/3/18244984/ai-generated-fake-which-face-is-real-test-stylegan

GANs

Generative Adversarial Network (GAN): Training Two Networks

• Discriminator: a binary classifier that predicts whether an image is “real” (e.g. a photo) or “fake” (an image produced by transforming random noise)

• Generator: a model that takes random numbers as input and produces an “image” as output

• Updates are alternately applied to train the two models
• Update the discriminator: use a batch consisting of “real” images from a data set as well as “fake” images produced by the “generator”; weight updates are based on the derivative of the cross entropy (real/fake) loss with respect to each weight [back propagated across operations of the discriminator]
• Update the generator: use a batch consisting of only “fake” images produced by the generator; weight updates are still based on the derivative of the cross entropy (real/fake) loss with respect to each weight [back propagated across operations of *both* the discriminator and the generator]
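A minimal sketch of these alternating updates in Keras style. It assumes three pre-built models (generator, discriminator, and gan, the discriminator stacked on top of the generator with the discriminator frozen) plus a real_images array; all names and settings here are illustrative:

import numpy as np

batch_size, latent_dim, num_steps = 128, 100, 10000

for step in range(num_steps):
    # 1) update the discriminator on a half-real, half-fake batch
    noise = np.random.normal(size=(batch_size, latent_dim))
    fake = generator.predict(noise)
    idx = np.random.randint(0, len(real_images), batch_size)
    x = np.concatenate([real_images[idx], fake])
    y = np.concatenate([np.ones((batch_size, 1)), np.zeros((batch_size, 1))])
    discriminator.train_on_batch(x, y)

    # 2) update the generator: label fakes as "real" so the error is
    #    back propagated through both the discriminator and the generator
    noise = np.random.normal(size=(batch_size, latent_dim))
    gan.train_on_batch(noise, np.ones((batch_size, 1)))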

GANs

Example Generator (4 conv ops)

Hands-On Unsupervised Learning Using Python

GANs

Example Discriminator (4 conv ops)

Hands-On Unsupervised Learning Using Python

GANs

How to Evaluate a GAN?

• Inception Score
• Use the generator to create images
• Use an inception network to predict class probabilities for each image
• Measure the Kullback-Leibler divergence between the predicted class probabilities and the mean predicted class probabilities for the set of images
• Larger values are better (predicted class probabilities differ from expectations)

• Frechet Inception Distance
• Use the generator to create images
• Use the predictions of the last pooling layer of an inception network to compare the distribution of predictions for “real” images to the distribution of predictions for “fake” images
• Smaller values are better (squared distances between the means of the distributions and the covariance matrices of the distributions are smaller)
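A sketch of the Frechet Inception Distance computation just described, assuming act_real and act_fake are pooling-layer activations already extracted from an inception network (shapes and names are illustrative):

import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(act_real, act_fake):
    mu_r, mu_f = act_real.mean(axis=0), act_fake.mean(axis=0)
    cov_r = np.cov(act_real, rowvar=False)
    cov_f = np.cov(act_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    return np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean)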

How Good is My GAN?

GANs

Training Last Year’s State of the Art: Nvidia’s Progressive Growing of GANs

• Canadian Institute For Advanced Research’s 10 class data [CIFAR10]

• Generator: 20,719,628 parameters

• Discriminator: 20,726,785 parameters

• Total training time: 29 hours, 54 minutes [using Nvidia Titan V GPU]

• Reported Inception Score: max = 8.80; avg = 8.56 [previous state of the art for “unsupervised” (class not specified as input) GAN is 7.90]

• Inception Score for our newly trained model: 8.08

• Inception Score for real images: 11.22

• Frechet Inception Distance for our newly trained model: 15.66

• Frechet Inception Distance for real images: 0.00 [by definition]

https://github.com/tkarras/progressive_growing_of_gans: change tf.squeeze(pool3) to tf.squeeze(pool3, [1, 2]) for scoring

GANs

Initial versus Final Generator Outputs

GANs

Evaluation Metrics While Training

[Two line plots: Inception Score (y-axis 0 to 9) versus training time in hours, and Frechet Inception Distance (y-axis 0 to 350) versus training time in hours, over roughly 30 hours of training]

GANs

Evaluating This Year’s State of the Art: Google’s BigGAN

• “We demonstrate that GANs benefit dramatically from scaling, and train models with two to four times as many parameters and eight times the batch size compared to prior art”
• Class-conditional image synthesis (desired class provided as input)
• We train on a Google Tensor Processing Unit (TPU) v3 Pod, with the number of cores proportional to the resolution
• 128 for 128×128

• 256 for 256×256

• 512 for 512×512

• Training takes between 24 and 48 hours for most models

https://github.com/huggingface/pytorch-pretrained-BigGAN

GANs

BigGAN Evaluation Metrics

• CIFAR-10
• 50,000 color images for training; 32 x 32

• 10 classes (5,000 images for each class)

• Inception Score: 9.22

• Frechet Inception Distance: 14.73

• Image-net Large Scale Visual Recognition Challenge (ILSVRC) 2012
• 1,281,167 color images for training; various sizes

• 1,000 WordNet classes (min = 732 images; max = 1,300 images)

• Inception Score for 256x256 model: 232.5

• Frechet Inception Distance for 256x256 model: 8.1

GANs

Example BigGAN Output

Noise truncated to 0.1 (controls variance)

GANs

BigGAN Examples 1 of 2

GANs

BigGAN Examples 2 of 2

GANs

Generating BigGAN Output Examples

# adjust for your version of nvidia’s common unified device architecture (cuda) toolkit:

# https://pytorch.org/get-started/locally/

wget https://repo.anaconda.com/archive/Anaconda3-2018.12-Linux-x86_64.sh

bash Anaconda3-2018.12-Linux-x86_64.sh

conda install pytorch torchvision cudatoolkit=9.0 -c pytorch

git clone https://github.com/huggingface/pytorch-pretrained-BigGAN.git

cd pytorch-pretrained-BigGAN

pip install -r full_requirements.txt

pip install -r requirements.txt

python

import nltk

nltk.download("wordnet")

exit()

https://github.com/huggingface/pytorch-pretrained-BigGAN

# search for "import torch"

# copy all of the code in the box to a file named sample.py, except the "display_in_terminal(output)" line
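A sketch of what sample.py might contain, following the example in the pytorch-pretrained-BigGAN README (the helper functions and model name come from that README; the chosen classes and truncation value are illustrative):

import torch
from pytorch_pretrained_biggan import (BigGAN, one_hot_from_names,
                                       truncated_noise_sample, save_as_images)

model = BigGAN.from_pretrained('biggan-deep-256')

truncation = 0.4
class_vector = one_hot_from_names(['soap bubble', 'coffee', 'mushroom'], batch_size=3)
noise_vector = truncated_noise_sample(truncation=truncation, batch_size=3)

class_vector = torch.from_numpy(class_vector)
noise_vector = torch.from_numpy(noise_vector)

# move everything to the GPU if one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
class_vector = class_vector.to(device)
noise_vector = noise_vector.to(device)

with torch.no_grad():
    output = model(noise_vector, class_vector, truncation)

# save the generated images as png files (the display_in_terminal call is omitted, per the note above)
save_as_images(output.to('cpu'))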

python sample.py

Potential Issues Encountered for GANs

• Network “collapse” is a frequently lamented problem in this space: the discriminator essentially memorizes the “real” images of the training data and starts rejecting all other images as “fake”

Validation data must be used to detect this condition

• BigGAN definitely seems better at generating some of the classes compared to others [BigGAN’s output for “Band Aid” still haunts me]

• Existing evaluation measures are not particularly helpful for identifying whether an output is “good enough” to fool a human

GANs

Datasets Review

• Images (PIL)
• MNIST Digit Classification
• Fashion Accessory Classification
• CIFAR10 Image Classification
• Tiny ImageNet Classification (MAP)

• Text (spaCy, NLTK)
• Newsgroups Classification
• Reuters MultiLabel Classification (Macro Averaged ROC AUC)
• Penn TreeBank Language Modeling (Perplexity)
• IMDB Review Sentiment Classification

• Speech (libROSA)
• Google Commands

Review
