34
University of Amsterdam / Efstratios Gavves 1 THE MACHINE LEARNING OF TIME IN VIDEOS EFSTRATIOS GAVVES UNIVERSITY OF AMSTERDAM

T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves 1

THE MACHINE LEARNING OF TIME

IN VIDEOS

EFSTRATIOS GAVVES

UNIVERSITY OF AMSTERDAM

Page 2: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

THE PROBLEM

Page 3: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

THE ROLE OF TIME IN SPATIOTEMPORAL SEQUENCES ?

What is difference between time (video) and space (image)?

Image models work well for two reasons

− Convolutions → Images are spatially stationary signals

− Backprop → Full supervision → millions of labels

𝑃𝑥+𝛿 = 𝑃𝑥

Bounded

Bo

un

ded

Bo

un

ded

Bounded

Page 4: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

THE ROLE OF TIME IN SPATIOTEMPORAL SEQUENCES ?

What is difference between time (video) and space (image)?

Image models work well for two reasons

− Convolutions → Images are spatially stationary signals

− Backprop → Full supervision → easy to get millions of labels

Temporal stationarity hard to guarantee in long & complex videos

(without unlimited data)

− (Short) Convolutions are not enough

− Modelling temporal transitions directly instead

− Quite hard for technical and conceptual reasons

4

𝑃𝑥+𝛿𝑥 = 𝑃𝑥𝑃𝑥+𝛿𝑥 = 𝑃𝑥

4

𝑃𝑥+𝛿𝑥 = 𝑃𝑥

𝑃𝑡+δδδδδ … δ ≠ 𝑃𝑡

𝑃𝑥+𝛿𝑥 = 𝑃𝑥

𝑃𝑡+δ ≈ 𝑃𝑡

𝑃𝑥+𝛿 = 𝑃𝑥

Bounded

Bo

un

ded

Bo

un

ded

Bounded

Unbounded

Page 5: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

VIDEOGRAPH … OR TIME AS A GRAPH

ONGOING

A. SmeuldersE. GavvesN. Hussein

Page 6: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

COMPLEX ACTIONS

6

Preparing Breakfast Stirring Food

Long & Complex Action One-action

Page 7: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

COMPLEX ACTIONS

7

get

cook

put

Cooking a Meal

wash

• •

•2. EXTENT

3. DEPENDENCY

Complex actions of Charades are30 sec, compared to 5 sec of Kinetics.

The one-actions differ in their temporal extents.

Temporal dependency , albeit weak, between the one-actions.

1. LONG-RANGE

Page 8: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

A YEAR AGO: TIMECEPTION

8

Reduce the complexity of 3D conv by factorizingIt into depth-wise separable 1D conv

2. TEMPORAL-ONLY CONV• • •

𝐶

𝑇

𝑘𝑡

3. MULTI-SCALE KERNELSUsing different kernel sizes, k, or dilation rates to account for verities in temporal extents of one-actions.

𝑘 = {1, 3, 5, 7}𝑑 = {1, 2, 3}

1. EFFICIENT MODULAR LAYERGrouped conv and concat+shuffle to reduce the computational cost of typical 3d conv.

Grouped Conv

Channel Shuffle

Page 9: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

TIMECEPTION ON CHARADES

9

10

TC

R2D

3025 35 40

3

Performance mAP %

Tim

este

ps

2

1

10

1020M 30M 40M

TC TC

I3DR3D

NL

Results: Processing 1K frames → better accuracies

Insight: longer range temporal relations are important

Question: how to go longer range?

10x longer videos possible

Page 10: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

STRETCHING TIME: TIME AS A GRAPH

Nodes: represent key one-actions in the activity

Edges: encode the temporal relationship between one-actions

Sublinear temporal representation

− Nodes are temporal concept-clusters

− Similar/irrelevant frames discarded

− Scales up to hours long videos (hopefully)

Increased interpretability

− Graphs are discrete

− Straightforward to reason

10

Graph-based Representation

Page 11: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

VIDEOGRAPH

Video

𝑣𝑠1

𝑠2

𝑠𝑛

Tim

e

3D CNN

𝑥1𝑠1

Page 12: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

VIDEOGRAPH

Video

𝑣𝑠1

𝑠2

𝑠𝑛

Tim

e

3D CNN

𝑥1𝑠1

‧‧‧

Concepts

𝑥1

𝐶

𝛼NodeAttention

Page 13: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

VIDEOGRAPH

Video

𝑣𝑠1

𝑠2

𝑠𝑛

Tim

e

3D CNN

𝑥1𝑠1

‧‧‧

Concepts

𝑥1

𝐶

𝛼0.4 0.70.10.00.0

Transition

NodeAttention

Page 14: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

VIDEOGRAPH

14

Video

𝑣𝑠1

𝑠2

𝑠𝑛

Tim

e

3D CNN

𝑥1𝑠1

‧‧‧

Concepts

𝑥1

𝐶

𝛼0.4 0.70.10.00.0

MLP

‧‧‧

Predictions

GraphEmbedding

GraphEmbedding

NodeAttention

Page 15: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

VIDEOGRAPH: GRAPH-EMBEDDING

15

0.4 0.70.10.00.0

GraphEmbedding

GraphEmbedding

𝒀

TimeConv1D

N × T × HW× C

NodeConv1D

ChannelConv3D

MaxPool

N

4×T

4× HW × C

Learn set of latent concepts representation Ƹ𝑐 from randomly initialized 𝑐

Dot product ⊙ for similarity between a feature 𝑥 and all latent concepts Ƹ𝑐, followed by activation 𝜎.

Reweight all learned concepts 𝑐 using the activation values 𝛼

Page 16: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

EXPERIMENTS

16

BreakfastCharades

Page 17: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

TIME-ALIGNED NETS… OR NEURAL LAYERS AS TIME STEPS

VIDEO TIME: ENCODERS PROPERTIES EVALUATION, BMVC 2018

C. SnoekE. GavvesA. Ghodrati

Page 18: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

WHAT ARE (SOME) PROPERTIES OF TIME?

18

Temporal

Asymmetry

Temporal

Continuity

Temporal

Causality

Temporal

Redundancy

Page 19: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

DOMINANT APPROACHES FOR MODELLING THESE PROPERTIES?

19

LSTMs learn transitions between subsequent states 3D convolutions learn spatiotemporal correlations

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997Ji et al. 3d convolutional neural networks for human action recognition. PAMI, 2013

Tran et al., Learning Spatiotemporal Features with 3D Convolutional Networks, ICCV 2015

Carreira and Zisserman, Quo vadis action recognition a new model and the kinetics

dataset? CVPR, 2017

Page 20: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

LSTM AND C3D: ARROW OF TIME?

20

LSTM

C3D

Page 21: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

REVISITING RECURRENT NEURAL NETWORKS

Recurrent Nets are highly sensitive dynamical systems (Pascanu and Bengio, 2013)

− Even with highly discriminative symbolic (one-hot vector) inputs

− Gradients very sensitive to initialization → Poor learning! → No generalization

Visual features over time -even the best ones- are:

− much noisier

− much less discriminative

− much more redundant

Learning LSTM on videos is orders of magnitude harder

− Chaotic regime → no useful gradients → no learning

− Forward and Backward (or even shuffling) LSTM score the same accuracy on arrow of time

21

Basically, with high-dim noisy inputs LSTMs do not do sequence modelling but some weird entangled pooling

Page 22: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

PROPOSAL: TIME-ALIGNED NEURAL NETS

ConvNets are much better with vanishing and exploding gradients, noisy and redundant inputs

No parameter sharing → no iterated maps → no chaotic regime

Moreover, the premise of LSTM parameter sharing is infinite Markov chains

In practice, however, we chop it off at T steps → like a ConvNet with T layers

Idea: Why not flip the ConvNet to align the layers with time steps?

22

ConvNets can handle vanishing/exploding/noisy/redundant because they do not share parameters.

Hypothesis

Page 23: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

PROPOSAL: TIME-ALIGNED NEURAL NETS

Idea: Why not flip the ConvNet to align the layers with time steps?

No vanishing/exploding gradients, no problems with noisy and redundant inputs

23

Page 24: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

RECHECKING ARROW OF TIME

Time-Aligned DenseNet gives much cleaner temporal clusters

24

Conclusion: Poor temporal modelling is likely due to hard –and thus unsuccessful- optimization

Page 25: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

SPIKING NEURAL NETS … OR EVENT-DRIVEN TIME

ICLR 2017, 2018, 2019

M. WellingE. GavvesP. O’Connor

Page 26: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

MOTIVATION

Most sensory data is temporally redundant

For today’s models temporal redundancy is a bug

For spiking models temporal redundancy is a feature

Great promise of human brain-like efficiency

Unfortunately spiking networks are still in the pre-alpha release stage :P

− How to backprop through spikes is not clear

− Also, in theory one must backprop to infinity which clearly creates lots of problems

26

Page 27: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

AND IF YOU DON’T BELIEVE IT: YOU ONLY NEED MNIST EXPERIMENTS TO PUBLISH

27

Page 28: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

WHAT IS MISSING & CHALLENGES?

Page 29: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

WHAT IS MISSING & CHALLENGES?

Too-much image (space-only) focused

29

Image problem space“coordinate system”

Images

Page 30: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

WHAT IS MISSING & CHALLENGES?

Too-much image (space-only) focused

30

Images

Videos

Curse of problem dimensionality

Image problem space“coordinate system”

Page 31: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

WHAT IS MISSING & CHALLENGES?

Too-much image (space-only) focused models

#1. Transition to models focused on the spatiotemporal

This requires a deeper understanding of: The Role of Time in Spatiotemporal Sequences ?

Role of Time == Long spatiotemporal sequences == temporal non-stationarity/no-labels

Maybe time integrated in neural network structure, not neural network input/output

31

Images

VideosSpatiotemporal problem space

“coordinate system”

Deep Learning

Machine Learning

Computer Vision

Temporal Learning& Dynamics

Page 32: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

WHAT IS MISSING & CHALLENGES?

Too-much image (space-only) focused models

#1. Transition to models focused on the spatiotemporal

This requires a deeper understanding of: The Role of Time in Spatiotemporal Sequences ?

Role of Time == Long spatiotemporal sequences == temporal non-stationarity/no-labels

Maybe time integrated in neural network structure, not neural network input/output

#2. Tighter community and better communication channels

This initiative is a great step forward

32

Images

VideosSpatiotemporal problem space

“coordinate system”

Deep Learning

Machine Learning

Computer Vision

Temporal Learning& Dynamics

Page 33: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves

WHAT IS MISSING & CHALLENGES?

Too-much image (space-only) focused models

#1. Transition to models focused on the spatiotemporal

This requires a deeper understanding of: The Role of Time in Spatiotemporal Sequences ?

Role of Time == Long spatiotemporal sequences == temporal non-stationarity/no-labels

Maybe time integrated in neural network structure, not neural network input/output

#2. Tighter community and better communication channels

This initiative is a great step forward

#3. Balanced datasets? Video is not only Action. Trade-off between

(i) larger datasets → better performance

(ii) larger datasets → hard to run and compare equally

Maybe having a “temporal property”spatiotemporal decathlon?

33

Images

VideosSpatiotemporal problem space

“coordinate system”

Deep Learning

Machine Learning

Computer Vision

Temporal Learning& Dynamics

Page 34: T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves 34

THANK [email protected]

UNIVERSITY OF AMSTERDAM