Multimodality Learning from Text, Speech, and Vision CMU 11-4/611 Natural Language Processing Lecture 28 April 14, 2020 Shruti Palaskar

Slides: demo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdf


Multimodality: Learning from Text, Speech, and Vision

CMU 11-4/611 Natural Language Processing

Lecture 28, April 14, 2020

Shruti Palaskar


Outline

I. What is multimodality?

II. Types of modalities

III. Commonly used Models

IV. Multimodal Fusion and Representation Learning

V. Multimodal Tasks: Use Cases


I. What is Multimodality?


Human Interaction is Inherently Multimodal


How We Perceive


The Dream: Sci-Fi Movies

JARVIS, The Matrix

Reality?


Give a caption.


Human: A small dog's ears stick up as it runs in the grass.

Model: A black and white dog is running on grass with a frisbee in its mouth.


Single sentence image description -> Captioning


Give a caption.


Human: A young girl in a white dress standing in front of a fence and fountain.

Model: Two men are standing in front of a fountain.


Watch the video and answer questions.


Q. is there only one person?
A. there is only one person and a dog.

Q. does she walk in with a towel around her neck?
A. she walks in from outside with the towel around her neck.

Q. does she interact with the dog?
A. she does not interact with the dog.

Q. does she drop the towel on the floor?
A. she dropped the towel on the floor at the end of the video.


Simple questions, simple answers -> Video Question Answering


Reality? Baby Steps. Still a long way to go.


...Challenges

Common challenges based on the tasks we just saw

- Training dataset bias
- Very complicated tasks
- Lack of common sense reasoning within models
- No access to the world knowledge that humans have
  - Physics, Nature, Memory, Experience

How do we teach machines to perceive?


II. Types of modalities


Types of Modalities


IMAGE/VIDEO

SPEECH/AUDIO

TEXT

EMOTION/AFFECT/SENTIMENT


Example Dataset: ImageNet

● Object Recognition
● Image Tagging/Categorization
● ~14M images
● Knowledge Ontology
● Hierarchical Tags
  ○ Mammal -> Placental -> Carnivore -> Canine -> Dog -> Working Dog -> Husky

Deng et al. 2009


Example Dataset: How2 Dataset

● Speech
● Video
● English Transcript
● Portuguese Transcript
● Summary

Sanabria et al. 2018


Example Dataset: Open Pose

● Action Recognition
● Pose Estimation
● Human Dynamics
● Body Dynamics

Wei et al. 2016


III. Commonly Used Models


Multilayer Perceptrons


Single Perceptron


Multilayer Perceptrons: Uses in Multimedia


Multilayer Perceptrons: Limitations


Limitation #1

A very large number of input values (x_i) requires a gigantic number of model parameters.
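To make this limitation concrete, here is a rough sketch that counts the weights and biases of one fully connected layer. The image size (224 x 224 RGB) and the hidden width (1000 units) are illustrative assumptions, not numbers from the lecture, and `dense_layer_params` is a hypothetical helper.

```python
def dense_layer_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


# Assumed example: a 224 x 224 RGB image flattened into one vector.
n_inputs = 224 * 224 * 3
n_hidden = 1000  # assumed hidden-layer width

print(n_inputs)                                # 150528 input values
print(dense_layer_params(n_inputs, n_hidden))  # 150529000 parameters
```

Under these assumptions a single hidden layer already needs about 150 million parameters, before any further layers are added.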


Convolutional Neural Networks (CNNs)


Translation invariance: we can use the same parameters to capture a specific “feature” in any area of the image, and different sets of parameters to capture different features.

These operations are equivalent to performing convolutions with different filters.
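A minimal sketch of the parameter-sharing idea, in plain Python so it runs without extra libraries. The 2x2 filter, the 5x5 toy images, and the helper names are invented for illustration; real systems would use an optimized library routine instead.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over the image (no padding, stride 1).

    Strictly this is cross-correlation, which is what most deep learning
    libraries implement under the name "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
    return out


# Hypothetical "feature": two bright pixels along a diagonal.
kernel = [[1, 0],
          [0, 1]]


def image_with_pattern(row, col, size=5):
    """Blank image with the diagonal pattern placed at (row, col)."""
    img = [[0] * size for _ in range(size)]
    img[row][col] = 1
    img[row + 1][col + 1] = 1
    return img


top_left = conv2d_valid(image_with_pattern(0, 0), kernel)
bottom_right = conv2d_valid(image_with_pattern(3, 3), kernel)

# The same four weights give the peak response wherever the pattern sits.
print(top_left[0][0], bottom_right[3][3])  # 2 2
```

Only four weights are learned for this feature, regardless of image size; a second filter with different weights would capture a different feature.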


Convolutional Neural Networks (CNNs)

LeCun et al. 1998

Convolutional Neural Networks (CNNs) for Image Encoding

Krizhevsky et al. 2012

Multilayer Perceptrons: Limitations


Limitation #1

A very large number of input values (x_i) requires a gigantic number of model parameters.

Limitation #2

Does not naturally handle input data of variable dimension (e.g. audio, video, or word sequences).


Recurrent Neural Networks


Build specific connections capturing the temporal evolution

→ Shared weights in time
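The idea of weights shared across time can be sketched with a vanilla (Elman-style) recurrent update, written here in plain Python. The dimensions, the random initialization, and the name `rnn_encode` are assumptions for illustration; the lecture does not prescribe a specific cell.

```python
import math
import random


def rnn_encode(sequence, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return the last hidden state.

    The same W_xh, W_hh and b_h are reused at every time step.
    """
    hidden_size = len(b_h)
    h = [0.0] * hidden_size
    for x in sequence:  # x is one input vector per time step
        h = [
            math.tanh(
                sum(W_xh[k][i] * x[i] for i in range(len(x)))
                + sum(W_hh[k][j] * h[j] for j in range(hidden_size))
                + b_h[k]
            )
            for k in range(hidden_size)
        ]
    return h


random.seed(0)
input_size, hidden_size = 3, 4  # assumed toy dimensions
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(input_size)]
        for _ in range(hidden_size)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        for _ in range(hidden_size)]
b_h = [0.0] * hidden_size

short_seq = [[1.0, 0.0, 0.0]] * 2
long_seq = [[0.0, 1.0, 0.0]] * 7

# Sequences of different lengths map to hidden states of the same size.
print(len(rnn_encode(short_seq, W_xh, W_hh, b_h)))  # 4
print(len(rnn_encode(long_seq, W_xh, W_hh, b_h)))   # 4
```

Because the parameter count does not depend on sequence length, this also addresses the variable-dimension limitation of multilayer perceptrons.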


Recurrent Neural Networks for Video Encoding


Combination is commonly implemented as a small NN on top of a pooling operation (e.g. max, sum, average).
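A minimal sketch of this pooling-then-small-network pattern, assuming each video frame has already been turned into a feature vector (for example by a CNN). The feature values, weights, and function names are made up for illustration.

```python
def pool_frames(frame_features, mode="average"):
    """Combine per-frame feature vectors into one fixed-size video vector."""
    columns = list(zip(*frame_features))  # group values by feature dimension
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "average":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


def small_nn(x, W, b):
    """One linear layer with ReLU, standing in for the 'small NN'."""
    return [
        max(0.0, sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_k)
        for row, b_k in zip(W, b)
    ]


# Hypothetical features for 3 frames, 2 dimensions each.
frames = [[1.0, 4.0],
          [3.0, 0.0],
          [2.0, 2.0]]

print(pool_frames(frames, "max"))      # [3.0, 4.0]
print(pool_frames(frames, "sum"))      # [6.0, 6.0]
print(pool_frames(frames, "average"))  # [2.0, 2.0]

W = [[1.0, -1.0]]  # assumed toy weights: 2 inputs -> 1 output
b = [0.5]
print(small_nn(pool_frames(frames, "max"), W, b))  # [0.0]
```

Pooling discards frame order, which is one reason a recurrent encoder is often preferred when temporal structure matters.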

Recurrent Neural Networks are well suited for processing sequences.

Donahue et al. 2015


Attention Mechanism

Olah and Carter 2016

Loss Function: Softmax

Slide by LP Morency
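As a reminder of what this loss computes, here is a minimal sketch of softmax followed by cross-entropy (the negative log-probability of the correct class). The three class scores are assumed values chosen for illustration.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(scores)[target_index])


scores = [2.0, 1.0, 0.1]  # assumed scores for 3 classes
probs = softmax(scores)
print([round(p, 3) for p in probs])        # [0.659, 0.242, 0.099]
print(round(cross_entropy(scores, 0), 3))  # 0.417
```

The loss is small when the correct class already receives most of the probability mass and grows as that probability shrinks.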


IV. Multimodal Fusion & Representation Learning


Fusion: Model Agnostic

Slide by LP Morency
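Model-agnostic fusion is commonly split into early fusion (combine the features, then apply one model) and late fusion (one model per modality, then combine the decisions). A minimal sketch of both, assuming that split is what is meant here; the stand-in scoring function and the feature values are hypothetical.

```python
def early_fusion(audio_feat, visual_feat, model):
    """Concatenate the features first, then run a single model."""
    return model(audio_feat + visual_feat)


def late_fusion(audio_feat, visual_feat, audio_model, visual_model):
    """Run one model per modality, then average their scores."""
    return (audio_model(audio_feat) + visual_model(visual_feat)) / 2


def mean_score(features):
    """Hypothetical stand-in for a trained model: average the features."""
    return sum(features) / len(features)


audio = [0.2, 0.4]         # made-up audio features
visual = [0.9, 0.7, 0.8]   # made-up visual features

print(round(early_fusion(audio, visual, mean_score), 2))             # 0.6
print(round(late_fusion(audio, visual, mean_score, mean_score), 2))  # 0.55
```

The two strategies can disagree, as they do here, because late fusion weights each modality equally while early fusion lets the larger feature vector dominate.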


Fusion: Model Based

Slide by LP Morency

Representation Learning: Encoder-Decoder


Representation Learning


Word2Vec

Mikolov et al. 2013


Representation Learning: RNNs

Cho et al. 2014

Representation Learning: Self-Supervised


Representation Learning: Transfer Learning


Representation Learning: Joint Learning


Representation Learning: Joint Learning (Similarity)
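One common way to use a jointly learned space is to score pairs from different modalities with a similarity measure such as cosine similarity. The sketch below shows only that scoring step; the embeddings and captions are invented for illustration, and no training is performed.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, assumed to live in one shared space already.
image_embedding = [0.9, 0.1, 0.0]
captions = {
    "a dog running on grass": [0.8, 0.2, 0.1],
    "a bowl of soup": [0.0, 0.1, 0.9],
}

best = max(
    captions,
    key=lambda text: cosine_similarity(image_embedding, captions[text]),
)
print(best)  # a dog running on grass
```

In a trained system the two encoders would be optimized so that matching image and text pairs end up close under this measure.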


V. Common Tasks, Use Cases


V. Common Tasks

1. Vision and Language
2. Speech, Vision and Language
3. Multimedia
4. Emotion and Affect

● Image/Video Captioning
● Visual Question Answering
● Visual Dialog
● Video Summarization
● Lip Reading
● Audio Visual Speech Recognition
● Visual Speech Synthesis
● …


1. Vision and Language Common Tasks


Image Captioning

Vinyals et al. 2015

Image Captioning


Karpathy et al. 2015

Slides by Marc Bolaños


Image Captioning: Show, Attend and Tell

Xu et al. 2015

Image Captioning and Detection

Johnson et al. 2016

Video Captioning

Donahue et al. 2015

Video Captioning


Pan et al. 2016

Slides by Marc Bolaños


Visual Question Answering


Video Summarization

~1.5 minutes of audio and video

Transcript (290 words on avg):

on behalf of expert village my name is lizbeth muller and today we are going to show you how to make spanish omelet . i 'm going to dice a little bit of peppers here . i 'm not going to use a lot , i 'm going to use very very little . a little bit more then this maybe . you can use red peppers if you like to get a little bit color in your omelet . some people do and some people do n't . but i find that some of the people that are mexicans who are friends of mine that have a mexican she like to put red peppers and green peppers and yellow peppers in hers and with a lot of onions . that is the way they make there spanish omelets that is what she says . i loved it , it actually tasted really good . you are going to take the onion also and dice it really small . you do n't want big chunks of onion in there cause it is just pops out of the omelet . so we are going to dice the up also very very small . so we have small pieces of onions and peppers ready to go .

“Teaser” (33 words on avg):

how to cut peppers to make a spanish omelette ; get expert tips and advice on making cuban breakfast recipes in this free cooking video .


Video Summarization: Hierarchical Model

Palaskar et al. 2019


Action Recognition


2. Speech, Vision and Language Common Tasks


Audio Visual Speech Recognition: Lip Reading


Assael et al. 2016


Lip Reading: Watch, Listen, Attend and Spell


Chung et al. 2017


3. Multimedia Common Tasks


Multimedia Retrieval


Multimedia Retrieval: Shared Multimodal Representation


4. Emotion and Affect


Affect Recognition: Emotion, Sentiment, Persuasion, Personality


Takeaways

● Lots of multimodal data generated every day
● Need automatic ways to understand it
  ○ Privacy
  ○ Security
  ○ Regulation
  ○ Storage
● Different models used for different downstream tasks
  ○ Highly open-ended research!
● Try it out for fun on Kaggle!

Thank you!