“Attention” and “Transformer” Architectures

James Hays

Recap – Semantic Segmentation

Outline

• Context and Receptive Field

• Going Beyond Convolutions in…

  • Text

  • Point Clouds

  • Images

Language understanding

… serve …

… great serve from Djokovic …

… be right back after I serve these salads …

So how do we fix these problems?

Receptive Field (Slide Credit: Frank Dellaert, https://dellaert.github.io/19F-4476/resources/receptiveField.pdf)
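
For intuition, here is a hedged back-of-the-envelope sketch (not from the linked slides) of why context is limited: with stride 1, each extra 3x3 convolution adds only 2 pixels to the receptive field, so context grows linearly with depth.

```python
# Hedged sketch: receptive field of a stack of stride-1 3x3 convolutions.
def receptive_field(n_layers, kernel=3):
    rf = 1
    for _ in range(n_layers):
        rf += kernel - 1          # each stride-1 layer widens the window by (k - 1)
    return rf

for n in (1, 2, 5, 10):
    rf = receptive_field(n)
    print(f"{n:2d} stacked 3x3 convs -> {rf}x{rf} receptive field")
# 1 -> 3x3, 2 -> 5x5, 5 -> 11x11, 10 -> 21x21
```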

Dilated Convolution

Figure source: https://github.com/vdumoulin/conv_arithmetic
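
The snippet below is a hedged PyTorch sketch of the idea in the figure (PyTorch is an assumption; the slides show no code): a dilation-2 kernel keeps the same nine weights as a dense 3x3 kernel but samples a 5x5 window, enlarging the receptive field at no extra parameter cost.

```python
# Hedged sketch: dilation grows the receptive field without adding weights.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)                          # (batch, channels, H, W)
conv_dense   = nn.Conv2d(1, 1, kernel_size=3, dilation=1, padding=1)
conv_dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)

# Both layers keep the spatial size and have 3*3 = 9 weights (+ 1 bias),
# but the dilated kernel covers a 5x5 input window instead of 3x3.
print(conv_dense(x).shape, conv_dilated(x).shape)      # both (1, 1, 32, 32)
print(sum(p.numel() for p in conv_dilated.parameters()))  # 10
```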

Sequence-to-sequence models in language

Source: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

Outline

• Context and Receptive Field

• Going Beyond Convolutions in…

  • Text

  • Point Clouds

  • Images

Self-Attention (from https://medium.com/lsc-psd/introduction-of-self-attention-layer-in-transformer-fc7bff63f3bc)

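Below is a hedged, minimal NumPy sketch of the scaled dot-product self-attention described at the link above (not code from the slides; dimensions are illustrative): every token's query is compared with every token's key, and the softmax-normalized scores mix the value vectors, so any token can influence any other regardless of distance.

```python
# Hedged sketch of single-head scaled dot-product self-attention:
# softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n_tokens, d_model); Wq, Wk, Wv: (d_model, d_k) learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # (n_tokens, n_tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # each output mixes all tokens

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8
X = rng.standard_normal((n_tokens, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)               # (5, 8)
```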

Transformer Architecture
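
As a rough, hedged sketch of one encoder block from the architecture (not the slides' code; sizes are arbitrary and this is the pre-norm variant for brevity): multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection with layer normalization.

```python
# Hedged sketch of a single Transformer encoder block in PyTorch.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                  # x: (batch, tokens, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention + residual
        x = x + self.ff(self.norm2(x))                     # feed-forward + residual
        return x

tokens = torch.randn(2, 10, 256)                           # 2 sequences of 10 tokens
print(EncoderBlock()(tokens).shape)                        # torch.Size([2, 10, 256])
```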

Outline

• Context and Receptive Field

• Going Beyond Convolutions in…

  • Text

  • Point Clouds

  • Images

Point Transformer. Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, Vladlen Koltun
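
A heavily simplified, hedged sketch of the core idea (not the authors' implementation, which uses learned MLPs for vector attention and positional encoding): each point attends only to its k nearest neighbors, so attention follows the local geometry of the point cloud rather than a fixed grid.

```python
# Hedged sketch: softmax attention over each point's k nearest neighbors.
import numpy as np

def knn_attention(points, feats, k=8):
    """points: (N, 3) coordinates; feats: (N, C) per-point features."""
    out = np.zeros_like(feats)
    for i in range(points.shape[0]):
        dists = np.linalg.norm(points - points[i], axis=1)
        nbrs = np.argsort(dists)[:k]                     # k nearest points (self included)
        scores = feats[nbrs] @ feats[i]                  # similarity to the query point
        w = np.exp(scores - scores.max())
        w /= w.sum()                                     # softmax over the neighborhood
        out[i] = w @ feats[nbrs]                         # weighted mix of neighbor features
    return out

rng = np.random.default_rng(0)
pts, f = rng.standard_normal((100, 3)), rng.standard_normal((100, 32))
print(knn_attention(pts, f).shape)                       # (100, 32)
```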

Outline

• Context and Receptive Field

• Going Beyond Convolutions in…

  • Text

  • Point Clouds

  • Images

When trained on mid-sized datasets such as ImageNet, such models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.

However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias.

Dosovitskiy et al.
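
For context, the Vision Transformer (ViT) of Dosovitskiy et al. feeds a Transformer encoder a sequence of image patch tokens rather than words. A hedged sketch of that tokenization step (sizes follow the common ViT-Base configuration and are assumptions, not the paper's code):

```python
# Hedged sketch of ViT-style patch embedding in PyTorch.
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                  # (batch, channels, H, W)
patch_size, d_model = 16, 768

# A convolution with kernel = stride = patch size is equivalent to flattening
# each 16x16x3 patch and applying one shared linear projection.
to_tokens = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
tokens = to_tokens(img).flatten(2).transpose(1, 2)  # (1, 196, 768)
print(tokens.shape)                                 # 14 x 14 = 196 patch tokens
```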

Summary

• “Attention” models outperform recurrent and convolutional models for sequence processing and point processing. They allow long-range interactions.

• Surprisingly, they also seem to outperform convolutional networks for image processing tasks. Again, long-range interactions might be more important than we realized.

• Naïve attention mechanisms have quadratic complexity in the number of input tokens (see the sketch below), but there are often workarounds for this.
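
As a back-of-the-envelope illustration of that quadratic cost (a sketch, not from the slides): the attention matrix holds one score per pair of tokens, so doubling the sequence length quadruples its size.

```python
# Hedged sketch: the attention matrix is n_tokens x n_tokens per head.
for n_tokens in (512, 1024, 2048):
    print(f"{n_tokens:5d} tokens -> {n_tokens * n_tokens:>10,} attention scores per head")
# 512 -> 262,144; 1,024 -> 1,048,576; 2,048 -> 4,194,304
```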

Reminder

• This is the final lecture. We won’t use the reading period or the final exam slot.

• Project 5 is out and due Friday

• Project 6 is optional. It is due May 5th.

• The problem set will go out this week.

Thank you for making this semester work!