T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding

University of Amsterdam / Efstratios Gavves 1

THE MACHINE LEARNING OF TIME

IN VIDEOS

EFSTRATIOS GAVVES

UNIVERSITY OF AMSTERDAM

THE PROBLEM

University of Amsterdam / Efstratios Gavves

THE ROLE OF TIME IN SPATIOTEMPORAL SEQUENCES ?

What is difference between time (video) and space (image)?

Image models work well for two reasons

− Convolutions → Images are spatially stationary signals

− Backprop → Full supervision → millions of labels

𝑃𝑥+𝛿 = 𝑃𝑥

Bounded

Bo

un

ded

Bo

un

ded

Bounded


THE ROLE OF TIME IN SPATIOTEMPORAL SEQUENCES ?

What is difference between time (video) and space (image)?

Image models work well for two reasons

− Convolutions → Images are spatially stationary signals

− Backprop → Full supervision → easy to get millions of labels

Temporal stationarity hard to guarantee in long & complex videos

(without unlimited data)

− (Short) Convolutions are not enough

− Modelling temporal transitions directly instead

− Quite hard for technical and conceptual reasons

4

𝑃𝑥+𝛿𝑥 = 𝑃𝑥𝑃𝑥+𝛿𝑥 = 𝑃𝑥

4

𝑃𝑥+𝛿𝑥 = 𝑃𝑥

𝑃𝑡+δδδδδ … δ ≠ 𝑃𝑡

𝑃𝑥+𝛿𝑥 = 𝑃𝑥

𝑃𝑡+δ ≈ 𝑃𝑡

𝑃𝑥+𝛿 = 𝑃𝑥

Bounded

Bo

un

ded

Bo

un

ded

Bounded

Unbounded

VIDEOGRAPH … OR TIME AS A GRAPH

ONGOING

A. SmeuldersE. GavvesN. Hussein


COMPLEX ACTIONS

6

Preparing Breakfast Stirring Food

Long & Complex Action One-action


COMPLEX ACTIONS

7

get

cook

put

Cooking a Meal

wash

• •

•2. EXTENT

3. DEPENDENCY

Complex actions of Charades are30 sec, compared to 5 sec of Kinetics.

The one-actions differ in their temporal extents.

Temporal dependency , albeit weak, between the one-actions.

1. LONG-RANGE


A YEAR AGO: TIMECEPTION

8

Reduce the complexity of 3D conv by factorizingIt into depth-wise separable 1D conv

2. TEMPORAL-ONLY CONV• • •

𝐶

𝑇

𝑘𝑡

3. MULTI-SCALE KERNELSUsing different kernel sizes, k, or dilation rates to account for verities in temporal extents of one-actions.

𝑘 = {1, 3, 5, 7}𝑑 = {1, 2, 3}

1. EFFICIENT MODULAR LAYERGrouped conv and concat+shuffle to reduce the computational cost of typical 3d conv.

Grouped Conv

Channel Shuffle


TIMECEPTION ON CHARADES

9

10

TC

R2D

3025 35 40

3

Performance mAP %

Tim

este

ps

2

1

10

1020M 30M 40M

TC TC

I3DR3D

NL

Results: Processing 1K frames → better accuracies

Insight: longer range temporal relations are important

Question: how to go longer range?

10x longer videos possible


STRETCHING TIME: TIME AS A GRAPH

Nodes: represent key one-actions in the activity

Edges: encode the temporal relationship between one-actions

Sublinear temporal representation

− Nodes are temporal concept-clusters

− Similar/irrelevant frames discarded

− Scales up to hours long videos (hopefully)

Increased interpretability

− Graphs are discrete

− Straightforward to reason

10

Graph-based Representation


VIDEOGRAPH

Video

𝑣𝑠1

𝑠2

𝑠𝑛

…

Tim

e

3D CNN

𝑥1𝑠1


VIDEOGRAPH

Video

𝑣𝑠1

𝑠2

𝑠𝑛

…

Tim

e

3D CNN

𝑥1𝑠1

‧‧‧

Concepts

𝑥1

𝐶

𝛼NodeAttention


VIDEOGRAPH

Video

𝑣𝑠1

𝑠2

𝑠𝑛

…

Tim

e

3D CNN

𝑥1𝑠1

‧‧‧

Concepts

𝑥1

𝐶

𝛼0.4 0.70.10.00.0

…

Transition

NodeAttention


VIDEOGRAPH

14

Video

𝑣𝑠1

𝑠2

𝑠𝑛

…

Tim

e

3D CNN

𝑥1𝑠1

‧‧‧

Concepts

𝑥1

𝐶

𝛼0.4 0.70.10.00.0

…

MLP

‧‧‧

Predictions

GraphEmbedding

GraphEmbedding

NodeAttention


VIDEOGRAPH: GRAPH-EMBEDDING

15

0.4 0.70.10.00.0

…

GraphEmbedding

GraphEmbedding

𝒀

TimeConv1D

N × T × HW× C

NodeConv1D

ChannelConv3D

MaxPool

N

4×T

4× HW × C

Learn set of latent concepts representation Ƹ𝑐 from randomly initialized 𝑐

Dot product ⊙ for similarity between a feature 𝑥 and all latent concepts Ƹ𝑐, followed by activation 𝜎.

Reweight all learned concepts 𝑐 using the activation values 𝛼


EXPERIMENTS

16

BreakfastCharades

TIME-ALIGNED NETS… OR NEURAL LAYERS AS TIME STEPS

VIDEO TIME: ENCODERS PROPERTIES EVALUATION, BMVC 2018

C. SnoekE. GavvesA. Ghodrati


WHAT ARE (SOME) PROPERTIES OF TIME?

18

Temporal

Asymmetry

Temporal

Continuity

Temporal

Causality

Temporal

Redundancy


DOMINANT APPROACHES FOR MODELLING THESE PROPERTIES?

19

LSTMs learn transitions between subsequent states 3D convolutions learn spatiotemporal correlations

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997Ji et al. 3d convolutional neural networks for human action recognition. PAMI, 2013

Tran et al., Learning Spatiotemporal Features with 3D Convolutional Networks, ICCV 2015

Carreira and Zisserman, Quo vadis action recognition a new model and the kinetics

dataset? CVPR, 2017


LSTM AND C3D: ARROW OF TIME?

20

LSTM

C3D


REVISITING RECURRENT NEURAL NETWORKS

Recurrent Nets are highly sensitive dynamical systems (Pascanu and Bengio, 2013)

− Even with highly discriminative symbolic (one-hot vector) inputs

− Gradients very sensitive to initialization → Poor learning! → No generalization

Visual features over time -even the best ones- are:

− much noisier

− much less discriminative

− much more redundant

Learning LSTM on videos is orders of magnitude harder

− Chaotic regime → no useful gradients → no learning

− Forward and Backward (or even shuffling) LSTM score the same accuracy on arrow of time

21

Basically, with high-dim noisy inputs LSTMs do not do sequence modelling but some weird entangled pooling


PROPOSAL: TIME-ALIGNED NEURAL NETS

ConvNets are much better with vanishing and exploding gradients, noisy and redundant inputs

No parameter sharing → no iterated maps → no chaotic regime

Moreover, the premise of LSTM parameter sharing is infinite Markov chains

In practice, however, we chop it off at T steps → like a ConvNet with T layers

Idea: Why not flip the ConvNet to align the layers with time steps?

22

ConvNets can handle vanishing/exploding/noisy/redundant because they do not share parameters.

Hypothesis


PROPOSAL: TIME-ALIGNED NEURAL NETS

Idea: Why not flip the ConvNet to align the layers with time steps?

No vanishing/exploding gradients, no problems with noisy and redundant inputs

23


RECHECKING ARROW OF TIME

Time-Aligned DenseNet gives much cleaner temporal clusters

24

Conclusion: Poor temporal modelling is likely due to hard –and thus unsuccessful- optimization

SPIKING NEURAL NETS … OR EVENT-DRIVEN TIME

ICLR 2017, 2018, 2019

M. WellingE. GavvesP. O’Connor


MOTIVATION

Most sensory data is temporally redundant

For today’s models temporal redundancy is a bug

For spiking models temporal redundancy is a feature

Great promise of human brain-like efficiency

Unfortunately spiking networks are still in the pre-alpha release stage :P

− How to backprop through spikes is not clear

− Also, in theory one must backprop to infinity which clearly creates lots of problems

26


AND IF YOU DON’T BELIEVE IT: YOU ONLY NEED MNIST EXPERIMENTS TO PUBLISH

27

WHAT IS MISSING & CHALLENGES?



Too-much image (space-only) focused

29

Image problem space“coordinate system”

Images



Too-much image (space-only) focused

30

Images

Videos

Curse of problem dimensionality

Image problem space“coordinate system”



Too-much image (space-only) focused models

#1. Transition to models focused on the spatiotemporal

This requires a deeper understanding of: The Role of Time in Spatiotemporal Sequences ?

Role of Time == Long spatiotemporal sequences == temporal non-stationarity/no-labels

Maybe time integrated in neural network structure, not neural network input/output

31

Images

VideosSpatiotemporal problem space

“coordinate system”

Deep Learning

Machine Learning

Computer Vision

Temporal Learning& Dynamics








#2. Tighter community and better communication channels

This initiative is a great step forward

32

Images



Deep Learning

Machine Learning

Computer Vision









#2. Tighter community and better communication channels

This initiative is a great step forward

#3. Balanced datasets? Video is not only Action. Trade-off between

(i) larger datasets → better performance

(ii) larger datasets → hard to run and compare equally

Maybe having a “temporal property”spatiotemporal decathlon?

33

Images



Deep Learning

Machine Learning

Computer Vision


University of Amsterdam / Efstratios Gavves 34

THANK [email protected]

UNIVERSITY OF AMSTERDAM

Documents

T MACHINE LEARNING OF T VIDEOS - Efstratios Gavves · University of Amsterdam / Efstratios Gavves VIDEOGRAPH: GRAPH-EMBEDDING 15 r.0 r. v r.0 r.1 r.7 … Graph Embedding Graph Embedding