Large-Scale Video Understanding: YouTube and Beyond
Rahul Sukthankar
Machine Perception, Google Research
https://research.google.com/teams/perception/
AI Frontiers Conference - Nov. 3, 2017
Machine Perception Really Works!
(better than I expected)
Sample of Perception tech in products
Signals for Image Search ranking, related images, search-by-image, etc.
Cloud Video API Cloud Vision API
(Seth LaForge, Nexus 5X)
HDR+ in Android Camera Mobile Vision API
Organizing image & video collections in Photos and making them searchable by content
Microvideo tech in Photos & Motion Stills
De-reflection & tracking in Photo Scanner
Personalized sticker packs in Allo
On-device handwriting input & recognition
OCR for lots of languages
Visual & auditory annotation & signals on YouTube
Thumbnail/preview selection & optimization for YouTube
Non-speech sound captions on YouTube
Region tracking for custom blurring tool on YouTube
Mobile creative effects on YouTube
watch, listen, understand
capture a moment
improve & manipulate
Useful Applications for Video Technology
Help users create, enhance, organize, and discover videos.
Privacy Region Tracking & Blurring for YouTube
Fun Effects from Tracking (on Mobile) for YouTube
Large-Scale Video Annotation for YouTube
Video understanding pipeline as of ~5 years ago:
pixels & sound samples → extract features (hand-designed descriptors) → frame features → quantize & aggregate (codebook histogram) → video features → train model (e.g., AdaBoost) on training data → label, e.g., "Roller-blading"
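The classic pipeline can be sketched end to end. This is a toy illustration, not the talk's actual system: random vectors stand in for hand-designed descriptors, and logistic regression stands in for the AdaBoost stage mentioned on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for hand-designed frame descriptors (e.g., HOG-style features);
# a real pipeline would extract these from pixels & sound samples.
def extract_descriptors(num_frames, dim=16):
    return rng.normal(size=(num_frames, dim))

# 1. Build a codebook by clustering descriptors pooled across training videos.
train_videos = [extract_descriptors(30) for _ in range(20)]
codebook = KMeans(n_clusters=8, n_init=10, random_state=0).fit(
    np.vstack(train_videos))

# 2. Quantize & aggregate: each video becomes a histogram over codewords.
def video_histogram(frames):
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=8).astype(float)
    return hist / hist.sum()

X = np.array([video_histogram(v) for v in train_videos])
y = np.arange(len(train_videos)) % 2  # toy binary labels

# 3. Train a classifier on video-level histograms (AdaBoost in the talk;
#    logistic regression here for brevity).
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:3]))
```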
Modern video understanding pipeline:
pixels & sound samples → magic box containing many convolutional, deep, end-to-end buzzwords :-) (trained on training data) → label, e.g., "Roller-blading"
Deep-learned visual features:
- Inception model trained on noisy data (images), with a bottleneck embedding layer (1000-d) providing frame-level features
- Trained further on videos with noisy labels
- Frame-level features aggregated to video-level via max pooling, avg pooling, or VLAD pooling
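The frame-to-video aggregation step above can be sketched in a few lines of numpy. All data here is synthetic, and the embedding width is shrunk from the slide's 1000-d bottleneck to 8-d for readability; the VLAD variant is a minimal residual-sum version, not the exact coding used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-frame bottleneck embeddings from an Inception-style model.
frame_embeddings = rng.normal(size=(120, 8))  # 120 frames, 8-d each

# Two of the pooling options from the slide:
video_avg = frame_embeddings.mean(axis=0)  # average pooling -> (8,)
video_max = frame_embeddings.max(axis=0)   # max pooling     -> (8,)

# Minimal VLAD pooling sketch: assign each frame to its nearest of K
# centers, then sum residuals per center and concatenate.
centers = rng.normal(size=(4, 8))
assign = np.argmin(
    ((frame_embeddings[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
vlad = np.zeros((4, 8))
for k in range(4):
    members = frame_embeddings[assign == k]
    if len(members):
        vlad[k] = (members - centers[k]).sum(axis=0)
vlad = vlad.ravel()  # 32-d video-level descriptor

print(video_avg.shape, video_max.shape, vlad.shape)
```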
Deep-learned vs. handcrafted features (chart: Mean Average Precision vs. dimensionality):
- Deep-learned visual features, VLAD coding: 1024-d, 0.272 MAP
- Handcrafted audio-visual features: ~40K-d, 0.153 MAP
- Net result: +80% mean avg. precision with 40x more compact features
Personal video search in Google Photos
Lots of videos. Almost no metadata.
“Dancing” on the web
“Dancing” in home videos
Domain adaptation: finding home videos on YouTube
- By capture device
- By video frame rate
- By video orientation
The technology behind personal video search
A video (with its audio track) is analyzed by a cascade of models:
1. Image/photo annotation model, trained on web images
2. Domain-adapted frame-level vision model: YouTube frame annotation model, trained on video thumbnails
3. Domain-adapted video-level vision model: YouTube video annotation model, trained on YouTube videos
4. Domain-adapted audio model: YouTube audio annotation model, trained on YouTube videos
5. Fusion & calibration: domain-adapted personal video model, trained on home videos
Output: labels such as "toddler", "dancing"
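The final "fusion & calibration" stage can be sketched as late fusion over the per-model scores. Everything below is a toy stand-in: synthetic scores for a single label, with logistic regression as one plausible way to learn fusion weights and calibrated probabilities from home-video labels (the talk does not specify the fusion model).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical per-video scores for one label (e.g., "dancing") from the
# three domain-adapted models: frame-level vision, video-level vision, audio.
n = 200
scores = rng.uniform(size=(n, 3))  # columns: frame, video, audio
labels = (scores.mean(axis=1) + 0.2 * rng.normal(size=n) > 0.5).astype(int)

# Fusion & calibration sketched as logistic regression over model scores,
# trained on (toy) home-video labels.
fusion = LogisticRegression().fit(scores, labels)
calibrated = fusion.predict_proba(scores)[:, 1]  # calibrated label probability
print(calibrated[:3])
```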
Evolution of personal video annotation models
1. Photo annotation model applied on video frames
2. Domain adaptation + fusion across frames
3. Fusion across multiple vision models
4. Fusion across multiple audio-visual models
Net result: > 2x recall gain
Learning aesthetics: YouTube Thumbnails
YouTube thumbnail quality model
Improving YouTube video thumbnails with deep neural nets, Google Research Blog, Oct. 2015
Video retargeting (spatial)
Original video. Reframed for a banner aspect ratio.
Video retargeting (temporal)
Video preview:
(duration: 6 secs)
Motion Stabilization
Motion Stills app
Stream One-Up
Motion Stills examples: cinemagraphs
Motion Stills examples: gifs / memes
Motion Stills examples: timelapse
Promising Directions for Future Research:
Learning from Video
Sermanet, Self-Supervised Imitation, Google Brain BAVM 2017
Self-Supervised Imitation
Pierre Sermanet*, Corey Lynch*, Yevgen Chebotar*, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine
Google Brain + University of Southern California (* equal contribution)
Multi-view capture
Time-Contrastive Networks (TCN)
(source: [Rippel et al 2015])
arxiv.org/abs/1704.06888v2
sermanet.github.io/imitate
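The TCN training signal is time-contrastive: embeddings of the same moment seen from two camera views are pulled together, while a temporally distant frame from the same view is pushed away. A minimal numpy sketch of the triplet-margin form of that loss (toy 2-d embeddings, not the paper's network):

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Time-contrastive triplet loss: anchor and positive are embeddings of
    the same moment from two views; the negative is a temporally distant
    frame from the anchor's own view."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: co-occurring views are close, the distant frame is not,
# so this triplet is already satisfied and incurs zero loss.
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # same moment, other view
negative = np.array([-1.0, 0.5])  # same view, different time
print(triplet_margin_loss(anchor, positive, negative))  # -> 0.0
```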
Approach (pouring, real)
* RL used: Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning, Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., Levine, S. [ICML 17]
Resulting policies
Pose imitation (real robot)
Useful Datasets for Video Understanding
● Large-scale video annotation
○ Sports-1M: >1M videos from ~500 classes [with Stanford]
○ YouTube-8M: ~8M videos from ~4800 classes
● Action recognition in video
○ THUMOS: temporal localization in untrimmed videos [with UCF, INRIA]
○ Kinetics: 400+ short clips for 400 actions [with DeepMind]
○ AVA: spatially localized atomic actions [with Berkeley, INRIA]
● Object recognition
○ YouTube-BB: spatially localized objects in video (80 classes)
○ Open Images: spatially localized objects in images (600 classes)
Sports-1M: 1.1M videos from 487 sports classes (video classification)
YouTube-8M Video Research Dataset
research.google.com/youtube8m/
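YouTube-8M's headline metric is Global Average Precision: (confidence, correct?) pairs for each video's top predictions are pooled across all videos, sorted by confidence, and scored as a single average precision. A hedged sketch of that computation (the exact benchmark definition, including the per-video top-k cutoff, lives in the dataset's documentation):

```python
import numpy as np

def global_average_precision(confidences, is_correct):
    """Average precision over a pooled, confidence-ranked prediction list,
    in the spirit of the YouTube-8M GAP metric."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    correct = np.asarray(is_correct, dtype=float)[order]
    total_positives = correct.sum()
    if total_positives == 0:
        return 0.0
    # Precision at each rank, averaged over the positions of true labels.
    precisions = np.cumsum(correct) / (np.arange(len(correct)) + 1)
    return float((precisions * correct).sum() / total_positives)

# Toy pooled predictions across a few videos:
conf = [0.9, 0.8, 0.7, 0.6]
hit = [1, 0, 1, 1]
print(round(global_average_precision(conf, hit), 3))  # -> 0.806
```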
THUMOS Challenge Series: Temporal Localization in Untrimmed Videos
YouTube Bounding Boxes: Spatial localization of one object through time
AVA: Spatial localization of an actor performing atomic actions
Atomic action: “Paint”
Open Images v3 - detailed spatial annotations in images
Example validation images
Conclusion
● Significant progress in large-scale video annotation for YouTube
● Video understanding has many applications beyond YouTube
● We encourage others to work on video through public datasets
● Many exciting research problems ahead, particularly in learning from video
(I think there's a lot more progress to be made in video understanding)