Large-Scale Video Understanding: YouTube and Beyond Rahul Sukthankar Machine Perception, Google Research https://research.google.com/teams/perception/ AI Frontiers Conference - Nov. 3, 2017

Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond


Page 1: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Large-Scale Video Understanding: YouTube and Beyond

Rahul Sukthankar

Machine Perception, Google Research

https://research.google.com/teams/perception/

AI Frontiers Conference - Nov. 3, 2017

Page 2: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Machine Perception Really Works!

(better than I expected)

Page 3: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Sample of Perception tech in products

Signals for Image Search ranking, related images, search-by-image, etc.

Page 4: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Sample of Perception tech in products

Cloud Video API Cloud Vision API

Page 5: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Sample of Perception tech in products

(Seth LaForge, Nexus 5X)

HDR+ in Android Camera Mobile Vision API

Page 6: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Sample of Perception tech in products

Organizing Photos image & video collections and making them searchable by content

Microvideo tech in Photos & Motion Stills

De-reflection & tracking in Photo Scanner

Page 7: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Sample of Perception tech in products

Personalized sticker packs in Allo

On-device handwriting input & recognition

OCR for lots of languages

Page 8: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Sample of Perception tech in products

Visual & auditory annotation & signals on YouTube

Thumbnail/preview selection & optimization for YouTube

Non-speech sound captions on YouTube

Page 9: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Sample of Perception tech in products

Region tracking for custom blurring tool on YouTube

Mobile creative effects on YouTube

Page 10: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Useful Applications for Video Technology

watch, listen, understand

capture a moment

improve & manipulate

Help users create, enhance, organize, and discover videos.

Page 11: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Privacy Region Tracking & Blurring for YouTube

Page 12: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Fun Effects from Tracking (on Mobile) for YouTube

Page 13: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Large-Scale Video Annotation for YouTube

Page 14: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Large-Scale Video Annotation for YouTube

Video understanding pipeline as of ~5 years ago

pixels & sound samples

extract features (hand-designed descriptors)

frame features

quantize & aggregate (codebook histogram)

video features

train model (e.g., AdaBoost) on training data

“Roller-blading”
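The "quantize & aggregate" stage of the older pipeline can be illustrated with a minimal bag-of-words sketch. The slides give no implementation details, so everything here is an assumption for illustration: the function names, the toy 2-d descriptors, and the two-entry codebook are hypothetical. Each frame descriptor is assigned to its nearest codeword, and the normalized counts form the codebook histogram used as the video feature.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(descriptor: Sequence[float],
                     codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook entry closest to the descriptor (quantize)."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(descriptor, codebook[k]))


def codebook_histogram(descriptors: Sequence[Sequence[float]],
                       codebook: Sequence[Sequence[float]]) -> List[float]:
    """Normalized count of descriptors per codeword (aggregate)."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Hypothetical toy data: a 2-entry codebook and four frame descriptors.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frame_descriptors = [[0.1, 0.0], [0.9, 1.2], [1.0, 0.8], [0.8, 0.9]]

video_feature = codebook_histogram(frame_descriptors, codebook)
print(video_feature)  # [0.25, 0.75]
```

In a real system the codebook would be learned (for example by clustering descriptors), and the histogram would feed the classifier named on the slide.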

Page 15: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Large-Scale Video Annotation for YouTube

extract features

training data

Modern video understanding pipeline

“Roller-blading”

pixels & sound samples

Magic box containing many convolutional, deep, end-to-end buzzwords :-)

Page 16: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Deep-learned visual features

Inception model trained on noisy data (images)

Bottleneck embedding layer (1000-d)

Videos with noisy labels

Frame-level Video-level

- Max pooling
- Avg pooling
- VLAD pooling
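The step from frame-level to video-level features can be sketched for the two simplest pooling options listed above. This is a minimal illustration under stated assumptions, not the production method: the function names and the toy 3-d frame features are hypothetical, and VLAD pooling is not shown.

```python
from typing import List, Sequence


def max_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over all frame-level feature vectors."""
    return [max(column) for column in zip(*frames)]


def avg_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean over all frame-level feature vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


# Hypothetical frame-level features: 3 frames, 3 dimensions each.
frame_features = [
    [0.0, 2.0, 1.0],
    [4.0, 0.0, 1.0],
    [2.0, 1.0, 1.0],
]

video_max = max_pool(frame_features)
video_avg = avg_pool(frame_features)
print(video_max)  # [4.0, 2.0, 1.0]
print(video_avg)  # [2.0, 1.0, 1.0]
```

Both functions return one fixed-length vector regardless of how many frames the video has, which is what lets a single video-level classifier handle videos of different durations.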

Page 17: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Deep-learned vs. handcrafted features

[Plot: mean average precision (y-axis, 0 to 0.40) vs. feature dimensionality (x-axis)]

Deep-learned visual features, VLAD coding: 1024-d, 0.272 MAP

Handcrafted audio-visual features: ~40K-d, 0.153 MAP

+80% mean avg. precision

40x more compact features
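The two headline figures follow from the numbers reported on this slide. A quick arithmetic check, assuming "~40K-d" means roughly 40,000 dimensions (the exact dimensionality is not given):

```python
# Values reported on the slide.
deep_map, deep_dim = 0.272, 1024
handcrafted_map = 0.153
handcrafted_dim = 40_000  # assumption: "~40K-d" taken as 40,000

relative_gain = deep_map / handcrafted_map - 1.0
compression = handcrafted_dim / deep_dim

print(f"relative MAP gain: {relative_gain:.1%}")  # 77.8%
print(f"compression: {compression:.1f}x")         # 39.1x
```

The slide rounds these to "+80%" and "40x".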

Page 18: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Personal video search in Google Photos

Lots of videos

Almost no metadata

Page 19: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

“Dancing” on the web

Page 20: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

“Dancing” in home videos

Page 21: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Domain adaptation: Finding home videos on YouTube

By capture device

By video frame rate

By video orientation

Page 22: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

The technology behind personal video search

Video

Trained on web images

Image / photo annotation model

1

Page 23: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

The technology behind personal video search

Video

Trained on web images

Image / photo annotation model

YouTube frame annotation model

Trained on video thumbnails

Domain-adapted frame-level vision model

1

2

Page 24: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

YouTube video annotation model

Trained on YouTube videos

The technology behind personal video search

Video

Trained on web images

Image / photo annotation model

YouTube frame annotation model

Trained on video thumbnails

Domain-adapted frame-level vision model

Domain-adapted video-level vision model

1

2

3

Page 25: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

YouTube video annotation model

Trained on YouTube videos

The technology behind personal video search

Video

Audio

Trained on web images

Image / photo annotation model

Trained on YouTube videos

YouTube audio annotation model

YouTube frame annotation model

Trained on video thumbnails

Domain-adapted frame-level vision model

Domain-adapted video-level vision model

Domain-adapted audio model

1

2

3

4

Page 26: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

YouTube video annotation model

Trained on YouTube videos

toddler

dancing

The technology behind personal video search

Video

Audio

Trained on web images

Image / photo annotation model

Trained on YouTube videos

YouTube audio annotation model

YouTube frame annotation model

Trained on video thumbnails

Domain-adapted frame-level vision model

Domain-adapted video-level vision model

Domain-adapted audio model

1

2

3

4

Fusion & calibration

5

Trained on home videos

Domain-adapted personal video model
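The "Fusion & calibration" stage combines the outputs of the frame-level, video-level, and audio models into one score per label. The slides do not say how this is done, so the following is only a sketch of one common approach (logistic calibration followed by a weighted average), with every name, parameter, and number hypothetical.

```python
import math
from typing import Dict


def calibrate(raw_score: float, slope: float, offset: float) -> float:
    """Map a raw model score to (0, 1) with a logistic function."""
    return 1.0 / (1.0 + math.exp(-(slope * raw_score + offset)))


def fuse(calibrated: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of calibrated scores across models."""
    total_weight = sum(weights[name] for name in calibrated)
    weighted_sum = sum(weights[name] * p for name, p in calibrated.items())
    return weighted_sum / total_weight


# Hypothetical raw scores for the label "dancing" from three models.
raw_scores = {"frame_model": 1.2, "video_model": 2.0, "audio_model": -0.3}

# Hypothetical per-model calibration parameters (slope, offset) and weights;
# in practice these would be fit on held-out home videos.
calibration = {
    "frame_model": (1.0, 0.0),
    "video_model": (0.8, -0.2),
    "audio_model": (1.5, 0.1),
}
weights = {"frame_model": 1.0, "video_model": 2.0, "audio_model": 0.5}

calibrated = {
    name: calibrate(score, *calibration[name])
    for name, score in raw_scores.items()
}
fused_score = fuse(calibrated, weights)
print(f"fused score for 'dancing': {fused_score:.3f}")
```

Calibrating before fusing matters because models trained on different data (web images, thumbnails, YouTube videos, audio) do not produce scores on a comparable scale.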

Page 27: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Evolution of personal video annotation models

1234

Page 28: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Evolution of personal video annotation models

1234

Photo annotation model applied on video frames

Page 29: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Evolution of personal video annotation models

Domain adaptation + fusion across frames

1234

Photo annotation model applied on video frames

Page 30: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Evolution of personal video annotation models

Fusion across multiple vision models

Domain adaptation + fusion across frames

1234

Photo annotation model applied on video frames

Page 31: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Evolution of personal video annotation models

Fusion across multiple audio-visual models

Fusion across multiple vision models

Photo annotation model applied on video frames

Domain adaptation + fusion across frames

1234

Page 32: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Evolution of personal video annotation models

1234

> 2x recall gain

Page 33: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Learning aesthetics: YouTube Thumbnails

Page 34: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Learning aesthetics: YouTube Thumbnails

YouTube thumbnail quality model

Page 35: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Learning aesthetics: YouTube Thumbnails

Page 36: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Learning aesthetics: YouTube Thumbnails

Improving YouTube video thumbnails with deep neural nets, Google Research Blog, Oct. 2015

Page 37: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Video retargeting (spatial)

Original video. Reframed for a banner aspect ratio.

Page 38: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Video retargeting (temporal)

Video preview:

(duration: 6 secs)

Page 39: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Motion Stabilization

Page 40: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Motion Stills app

Stream One-Up

Page 41: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Motion Stills examples: cinemagraphs

Page 42: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Motion Stills examples: gifs / memes

Page 43: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Motion Stills examples: timelapse

Page 44: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Promising Directions for Future Research:

Learning from Video

Page 45: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Sermanet, Self-Supervised Imitation, Google Brain BAVM 2017

Self-Supervised Imitation

Pierre Sermanet*, Corey Lynch*, Yevgen Chebotar*, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine

Google Brain + University of Southern California

* equal contribution

Page 46: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond


Multi-view capture


Page 47: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond


Time-Contrastive Networks (TCN)

(source: [Rippel et al 2015])

arxiv.org/abs/1704.06888v2

sermanet.github.io/imitate
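The core training signal of a time-contrastive network can be sketched as a triplet loss, following the formulation in the paper linked above: the anchor and positive are frames captured at the same moment from different viewpoints, and the negative is a frame from the anchor's own video at a different time. The embeddings, the margin value, and the function names below are hypothetical, and the embedding network itself is omitted.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def time_contrastive_loss(anchor: Sequence[float],
                          positive: Sequence[float],
                          negative: Sequence[float],
                          margin: float = 0.2) -> float:
    """Triplet hinge loss: pull the co-occurring view closer than the
    temporally distant frame by at least `margin` (hypothetical value)."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)


# Hypothetical 2-d embeddings.
anchor = [0.0, 0.0]         # view 1 at time t
positive = [0.1, 0.0]       # view 2 at the same time t
far_negative = [1.0, 0.0]   # view 1 at a different time, well separated
near_negative = [0.2, 0.0]  # view 1 at a different time, too close

satisfied = time_contrastive_loss(anchor, positive, far_negative)
violated = time_contrastive_loss(anchor, positive, near_negative)
print(satisfied)           # 0.0
print(round(violated, 2))  # 0.17
```

The loss is zero once the negative is farther from the anchor than the positive by the margin, so training only updates on triplets that still violate that ordering.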

Page 48: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond


Approach (pouring, real)

* RL used: Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning, Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., Levine, S. [ICML 17]

Page 49: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond


Resulting policies

Page 50: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond


Pose imitation (real robot)

Page 51: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Useful Datasets for Video Understanding

● Large-scale video annotation
  ○ Sports-1M: > 1M videos from ~500 classes [with Stanford]
  ○ YouTube-8M: ~8M videos from ~4800 classes
● Action recognition in video
  ○ THUMOS: Temporal localization in untrimmed videos [with UCF, INRIA]
  ○ Kinetics: 400+ short clips for each of 400 actions [with DeepMind]
  ○ AVA: Spatially localized atomic actions [with Berkeley, INRIA]
● Object recognition
  ○ YouTube-BB: Spatially localized objects in video (80 classes)
  ○ Open Images: Spatially localized objects in images (600 classes)

Page 52: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Sports-1M: 1.1M videos from 487 sports classes (video classification)

Page 53: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

YouTube-8M Video Research Dataset

research.google.com/youtube8m/

Page 54: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

THUMOS Challenge Series: Temporal Localization in Untrimmed Videos

Page 55: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

YouTube Bounding Boxes: Spatial localization of one object through time

Page 56: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

AVA: Spatial localization of an actor performing atomic actions

Atomic action: “Paint”

Page 57: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Open Images v3 - detailed spatial annotations in images

Example validation images

Page 58: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Open Images v3 - detailed spatial annotations in images

Example validation images

Page 59: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond

Conclusion

● Significant progress in large-scale video annotation for YouTube
● Video understanding has many applications beyond YouTube
● We encourage others to work on video through public datasets
● Many exciting research problems ahead, particularly in learning from video

(I think there’s a lot more progress to be made in video understanding)