Large-Scale Video Understanding: YouTube and Beyond
Rahul Sukthankar
Machine Perception, Google Research
https://research.google.com/teams/perception/
AI Frontiers Conference - Nov. 3, 2017
Machine Perception Really Works!
(better than I expected)
Sample of Perception tech in products
Signals for Image Search ranking, related images, search-by-image, etc.
Cloud Video API Cloud Vision API
(Seth LaForge, Nexus 5X)
HDR+ in Android Camera Mobile Vision API
Organizing image & video collections in Photos and making them searchable by content
Microvideo tech in Photos & Motion Stills
De-reflection & tracking in Photo Scanner
Personalized sticker packs in Allo
On-device handwriting input & recognition
OCR for lots of languages
Visual & auditory annotation & signals on YouTube
Thumbnail/preview selection & optimization for YouTube
Non-speech sound captions on YouTube
Region tracking for custom blurring tool on YouTube
Mobile creative effects on YouTube
watch, listen, understand
capture a moment
improve & manipulate
Useful Applications for Video Technology
Help users create, enhance, organize, and discover videos.
Privacy Region Tracking & Blurring for YouTube
Fun Effects from Tracking (on Mobile) for YouTube
Large-Scale Video Annotation for YouTube
Video understanding pipeline as of ~5 years ago:
pixels & sound samples → extract features (hand-designed descriptors) → frame features → quantize & aggregate (codebook histogram) → video features → train model (e.g., AdaBoost) on training data → label, e.g., "Roller-blading"
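The classic pipeline can be sketched end to end. This is a toy illustration, not the talk's actual system: random vectors stand in for hand-designed descriptors, and logistic regression stands in for the AdaBoost stage mentioned on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for hand-designed frame descriptors (e.g., HOG-style features);
# a real pipeline would extract these from pixels & sound samples.
def extract_descriptors(num_frames, dim=16):
    return rng.normal(size=(num_frames, dim))

# 1. Build a codebook by clustering descriptors pooled across training videos.
train_videos = [extract_descriptors(30) for _ in range(20)]
codebook = KMeans(n_clusters=8, n_init=10, random_state=0).fit(
    np.vstack(train_videos))

# 2. Quantize & aggregate: each video becomes a histogram over codewords.
def video_histogram(frames):
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=8).astype(float)
    return hist / hist.sum()

X = np.array([video_histogram(v) for v in train_videos])
y = np.arange(len(train_videos)) % 2  # toy binary labels

# 3. Train a classifier on video-level histograms (AdaBoost in the talk;
#    logistic regression here for brevity).
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:3]))
```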
Modern video understanding pipeline:
pixels & sound samples → magic box containing many convolutional, deep, end-to-end buzzwords :-) (trained on training data) → label, e.g., "Roller-blading"
Deep-learned visual features:
- Inception model trained on noisy data (images), with a bottleneck embedding layer (1000-d) providing frame-level features
- Trained further on videos with noisy labels
- Frame-level features aggregated to video-level via max pooling, avg pooling, or VLAD pooling
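The frame-to-video aggregation step above can be sketched in a few lines of numpy. All data here is synthetic, and the embedding width is shrunk from the slide's 1000-d bottleneck to 8-d for readability; the VLAD variant is a minimal residual-sum version, not the exact coding used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-frame bottleneck embeddings from an Inception-style model.
frame_embeddings = rng.normal(size=(120, 8))  # 120 frames, 8-d each

# Two of the pooling options from the slide:
video_avg = frame_embeddings.mean(axis=0)  # average pooling -> (8,)
video_max = frame_embeddings.max(axis=0)   # max pooling     -> (8,)

# Minimal VLAD pooling sketch: assign each frame to its nearest of K
# centers, then sum residuals per center and concatenate.
centers = rng.normal(size=(4, 8))
assign = np.argmin(
    ((frame_embeddings[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
vlad = np.zeros((4, 8))
for k in range(4):
    members = frame_embeddings[assign == k]
    if len(members):
        vlad[k] = (members - centers[k]).sum(axis=0)
vlad = vlad.ravel()  # 32-d video-level descriptor

print(video_avg.shape, video_max.shape, vlad.shape)
```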
Deep-learned vs. handcrafted features (chart: Mean Average Precision vs. dimensionality):
- Deep-learned visual features, VLAD coding: 1024-d, 0.272 MAP
- Handcrafted audio-visual features: ~40K-d, 0.153 MAP
- Net result: +80% mean avg. precision with 40x more compact features
Personal video search in Google Photos
Lots of videos. Almost no metadata.
“Dancing” on the web
“Dancing” in home videos
Domain adaptation: finding home videos on YouTube
- By capture device
- By video frame rate
- By video orientation
The technology behind personal video search
A video (with its audio track) is analyzed by a cascade of models:
1. Image/photo annotation model, trained on web images
2. Domain-adapted frame-level vision model: YouTube frame annotation model, trained on video thumbnails
3. Domain-adapted video-level vision model: YouTube video annotation model, trained on YouTube videos
4. Domain-adapted audio model: YouTube audio annotation model, trained on YouTube videos
5. Fusion & calibration: domain-adapted personal video model, trained on home videos
Output: labels such as "toddler", "dancing"
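The final "fusion & calibration" stage can be sketched as late fusion over the per-model scores. Everything below is a toy stand-in: synthetic scores for a single label, with logistic regression as one plausible way to learn fusion weights and calibrated probabilities from home-video labels (the talk does not specify the fusion model).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical per-video scores for one label (e.g., "dancing") from the
# three domain-adapted models: frame-level vision, video-level vision, audio.
n = 200
scores = rng.uniform(size=(n, 3))  # columns: frame, video, audio
labels = (scores.mean(axis=1) + 0.2 * rng.normal(size=n) > 0.5).astype(int)

# Fusion & calibration sketched as logistic regression over model scores,
# trained on (toy) home-video labels.
fusion = LogisticRegression().fit(scores, labels)
calibrated = fusion.predict_proba(scores)[:, 1]  # calibrated label probability
print(calibrated[:3])
```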
Evolution of personal video annotation models
1. Photo annotation model applied on video frames
2. Domain adaptation + fusion across frames
3. Fusion across multiple vision models
4. Fusion across multiple audio-visual models
Net result: > 2x recall gain
Learning aesthetics: YouTube Thumbnails
YouTube thumbnail quality model
Improving YouTube video thumbnails with deep neural nets, Google Research Blog, Oct. 2015
Video retargeting (spatial)
Original video. Reframed for a banner aspect ratio.
Video retargeting (temporal)
Video preview:
(duration: 6 secs)
Motion Stabilization
Motion Stills app
Stream One-Up
Motion Stills examples: cinemagraphs
Motion Stills examples: gifs / memes
Motion Stills examples: timelapse
Promising Directions for Future Research:
Learning from Video
Sermanet, Self-Supervised Imitation, Google Brain BAVM 2017
Self-Supervised Imitation
Pierre Sermanet*, Corey Lynch*, Yevgen Chebotar*, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine
Google Brain + University of Southern California (* equal contribution)
Multi-view capture
Time-Contrastive Networks (TCN)
(source: [Rippel et al 2015])
arxiv.org/abs/1704.06888v2
sermanet.github.io/imitate
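The TCN training signal is time-contrastive: embeddings of the same moment seen from two camera views are pulled together, while a temporally distant frame from the same view is pushed away. A minimal numpy sketch of the triplet-margin form of that loss (toy 2-d embeddings, not the paper's network):

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Time-contrastive triplet loss: anchor and positive are embeddings of
    the same moment from two views; the negative is a temporally distant
    frame from the anchor's own view."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: co-occurring views are close, the distant frame is not,
# so this triplet is already satisfied and incurs zero loss.
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # same moment, other view
negative = np.array([-1.0, 0.5])  # same view, different time
print(triplet_margin_loss(anchor, positive, negative))  # -> 0.0
```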
Approach (pouring, real)
* RL used: Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning, Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., Levine, S. [ICML 17]
Resulting policies
Pose imitation (real robot)
Useful Datasets for Video Understanding
● Large-scale video annotation
○ Sports-1M: >1M videos from ~500 classes [with Stanford]
○ YouTube-8M: ~8M videos from ~4800 classes
● Action recognition in video
○ THUMOS: temporal localization in untrimmed videos [with UCF, INRIA]
○ Kinetics: 400+ short clips for 400 actions [with DeepMind]
○ AVA: spatially localized atomic actions [with Berkeley, INRIA]
● Object recognition
○ YouTube-BB: spatially localized objects in video (80 classes)
○ Open Images: spatially localized objects in images (600 classes)
Sports-1M: 1.1M videos from 487 sports classes (video classification)
YouTube-8M Video Research Dataset
research.google.com/youtube8m/
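YouTube-8M's headline metric is Global Average Precision: (confidence, correct?) pairs for each video's top predictions are pooled across all videos, sorted by confidence, and scored as a single average precision. A hedged sketch of that computation (the exact benchmark definition, including the per-video top-k cutoff, lives in the dataset's documentation):

```python
import numpy as np

def global_average_precision(confidences, is_correct):
    """Average precision over a pooled, confidence-ranked prediction list,
    in the spirit of the YouTube-8M GAP metric."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    correct = np.asarray(is_correct, dtype=float)[order]
    total_positives = correct.sum()
    if total_positives == 0:
        return 0.0
    # Precision at each rank, averaged over the positions of true labels.
    precisions = np.cumsum(correct) / (np.arange(len(correct)) + 1)
    return float((precisions * correct).sum() / total_positives)

# Toy pooled predictions across a few videos:
conf = [0.9, 0.8, 0.7, 0.6]
hit = [1, 0, 1, 1]
print(round(global_average_precision(conf, hit), 3))  # -> 0.806
```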
THUMOS Challenge Series: Temporal Localization in Untrimmed Videos
YouTube Bounding Boxes: Spatial localization of one object through time
AVA: Spatial localization of an actor performing atomic actions
Atomic action: “Paint”
Open Images v3 - detailed spatial annotations in images
Example validation images
Conclusion
● Significant progress in large-scale video annotation for YouTube
● Video understanding has many applications beyond YouTube
● We encourage others to work on video through public datasets
● Many exciting research problems ahead, particularly in learning from video
(I think there's a lot more progress to be made in video understanding)