Deep Learning for Video Understanding
Wanli Ouyang
Department of Electronic Engineering, The Chinese University of Hong Kong
Outline
• Deep learning for image window
Tracking
Outline
• Deep learning for image window
• Deep learning for multiple images
Action recognition (examples: track cycling, heptathlon, longboarding)
Deep learning for image window
Deep learning tracker
• Tracking by classification
  Foreground: positive
  Background: negative
• Classification with deep model
Deep classifier
[Babenko et al. TPAMI11]
Deep classifier
[Wang & Yeung NIPS13]
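Tracking by classification needs labelled patches in every frame. The sketch below shows one common way to get them: boxes that overlap the current target strongly become positives (foreground) and boxes that barely overlap it become negatives (background). The function names, the IoU thresholds and the uniform random sampling are assumptions made for illustration; they are not taken from the cited trackers.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def sample_training_boxes(target, frame_size, n=200,
                          pos_iou=0.7, neg_iou=0.3, seed=0):
    """Draw random boxes of the target's size and label them by overlap.

    pos_iou and neg_iou are hypothetical thresholds; boxes whose overlap
    falls between them are treated as ambiguous and discarded.
    """
    rng = random.Random(seed)
    frame_w, frame_h = frame_size
    _, _, w, h = target
    positives, negatives = [], []
    for _ in range(n):
        box = (rng.randint(0, frame_w - w), rng.randint(0, frame_h - h), w, h)
        overlap = iou(box, target)
        if overlap >= pos_iou:
            positives.append(box)   # foreground sample
        elif overlap <= neg_iou:
            negatives.append(box)   # background sample
    positives.append(target)        # the target itself is always a positive
    return positives, negatives
```

The patches cropped from these boxes would then be resized and fed to the classifier; the highest-scoring candidate in the next frame becomes the new target.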
Deep learning tracker
• Pretrain by stacked autoencoder
• Use a deep model with 4 fully connected layers to learn the classifier from a 32x32 input patch.
classification
Deep classifier
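A minimal NumPy sketch of the two steps above: greedy layer-wise autoencoder pretraining, then a fully connected classifier applied to a flattened 32x32 patch (1024 inputs). The hidden sizes, the untied decoder weights, the squared-error loss and plain gradient descent are all assumptions chosen to keep the example short; this is not the configuration used in [Wang&Yeung NIPS13].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, epochs=20, lr=0.1, rng=None):
    """Train one autoencoder layer to reconstruct X.

    Returns the encoder parameters (W, b) and the reconstruction loss per epoch.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    W1 = rng.normal(0, 0.01, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.01, (n_hidden, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)           # encode
        R = sigmoid(H @ W2 + b2)           # decode (reconstruction)
        losses.append(0.5 * np.sum((R - X) ** 2) / n)
        # Backpropagate the squared reconstruction error.
        dZ2 = (R - X) * R * (1 - R) / n
        dZ1 = (dZ2 @ W2.T) * H * (1 - H)
        W2 -= lr * (H.T @ dZ2); b2 -= lr * dZ2.sum(axis=0)
        W1 -= lr * (X.T @ dZ1); b1 -= lr * dZ1.sum(axis=0)
    return (W1, b1), losses

def pretrain_stack(X, hidden_sizes, **kwargs):
    """Greedy pretraining: each layer is trained on the codes of the previous one."""
    layers, A = [], X
    for size in hidden_sizes:
        (W, b), _ = pretrain_layer(A, size, **kwargs)
        layers.append((W, b))
        A = sigmoid(A @ W + b)
    return layers

def classify(X, layers, w_out, b_out):
    """Forward pass: pretrained hidden layers, then a sigmoid output unit."""
    A = X
    for W, b in layers:
        A = sigmoid(A @ W + b)
    return sigmoid(A @ w_out + b_out)      # foreground probability per patch
```

With three pretrained hidden layers plus the output unit, the classifier has four fully connected layers; in practice the whole stack would be fine-tuned on labelled foreground and background patches.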
Deep learning for multiple images
Deep learning for multiple images
• Consider K images as 3K channels in the input data.
• Apply 3D CNN for extracting features
1 image: 3 channels; K images: 3K channels
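Both bullets can be illustrated with a few lines of NumPy. `stack_frames` turns K RGB frames into one 3K-channel input, and `conv3d_valid` is a deliberately naive 3D convolution (implemented as cross-correlation, without padding or stride) over a single-channel frame volume. The function names and array layouts are assumptions for this sketch only.

```python
import numpy as np

def stack_frames(frames):
    """Stack K RGB frames of shape (H, W, 3) into one (H, W, 3K) input."""
    return np.concatenate(frames, axis=-1)

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution of a (T, H, W) volume with a (t, h, w) kernel.

    The kernel slides over time as well as space, so each output value
    summarises a small spatio-temporal cube of the input.
    """
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                cube = volume[i:i + t, j:j + h, k:k + w]
                out[i, j, k] = np.sum(cube * kernel)
    return out
```

For example, five 8x8 RGB frames stack into an 8x8x15 input, and a 3x3x3 kernel applied to a 7-frame 10x10 volume yields a 5x8x8 feature volume.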
3D CNN for action recognition [Ji et al. TPAMI13]
• CNN channels can be hard‐wired, e.g. gray pixel values, gradient‐x/y, optical flow‐x/y.
• Learned weights at other layers
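A hard-wired layer applies fixed, hand-chosen operations instead of learned filters. The sketch below computes three of the channel types listed above (gray, gradient-x, gradient-y) for each frame. Optical flow is left out because it needs a flow estimator; the luminance weights and the use of `np.gradient` are assumptions for this example, not necessarily what [Ji et al. TPAMI13] used.

```python
import numpy as np

def hardwired_channels(frames):
    """Compute fixed (non-learned) channels from K RGB frames.

    frames: float array of shape (K, H, W, 3).
    Returns an array of shape (3K, H, W): K gray channels, then
    K gradient-x channels, then K gradient-y channels.
    """
    # Assumed luminance weights for the RGB-to-gray conversion.
    gray = frames @ np.array([0.299, 0.587, 0.114])    # (K, H, W)
    # Central differences along rows (y) and columns (x).
    grad_y, grad_x = np.gradient(gray, axis=(1, 2))
    return np.concatenate([gray, grad_x, grad_y], axis=0)
```

The resulting channels form the input to the layers whose weights are learned.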
3D CNN for action recognition
• Encourage the output to be close to high‐level features (bag‐of‐words, motion edge history image).
[Figure: 3D CNN predicting the action class, with auxiliary motion features from auxiliary feature extractors]
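One way to read "encourage the output to be close to high-level features" is as an extra regression term in the training loss. The sketch below adds a squared-error penalty between an auxiliary network output and precomputed auxiliary features to the usual classification loss. The exact loss form and the weight `aux_weight` are assumptions for illustration and are not taken from the cited paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def joint_loss(class_logits, labels, aux_pred, aux_target, aux_weight=0.1):
    """Cross-entropy on the action class plus a weighted auxiliary regression term.

    class_logits: (N, C) scores, labels: (N,) integer classes,
    aux_pred / aux_target: (N, D) predicted and precomputed auxiliary features.
    """
    probs = softmax(class_logits)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
    aux = np.mean(np.sum((aux_pred - aux_target) ** 2, axis=1))
    return ce + aux_weight * aux
```

Setting `aux_weight` to zero recovers plain classification training, so the auxiliary term acts as a regulariser whose strength can be tuned.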
Action recognition results
[Figure: example results for the actions cell to ear, object put, pointing]
Large‐scale Video Classification with CNN [Karpathy et al. CVPR 2014]
• Multi‐resolution
Temporal Fusion
Experimental results
• Randomly sample 20 clips from a video and average the outputs of these clip predictions.
[Figure: example predictions; labels: track cycling, cycling, longboarding, aggressive inline skating]
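The video-level prediction described above can be sketched as follows. `predict_clip` stands in for the trained network and is a hypothetical callable that returns class probabilities for the frames in a given range; the uniform sampling of clip start positions is also an assumption.

```python
import random

def predict_video(num_frames, clip_len, predict_clip, n_clips=20, seed=0):
    """Average clip-level class probabilities over randomly sampled clips.

    predict_clip(start, end) is a placeholder for the network: it must
    return a list of class probabilities for frames [start, end).
    """
    rng = random.Random(seed)
    last_start = max(0, num_frames - clip_len)
    totals = None
    for _ in range(n_clips):
        start = rng.randint(0, last_start)
        probs = predict_clip(start, start + clip_len)
        if totals is None:
            totals = list(probs)
        else:
            totals = [a + b for a, b in zip(totals, probs)]
    return [t / n_clips for t in totals]
```

Averaging smooths out clips that happen to land on uninformative parts of the video, at the cost of 20 forward passes per video.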
Conclusion: 2 "How to"s
• How to effectively train a deep model
  Data augmentation
  Label more data
  Pre‐train on large‐scale related data (RCNN)
  Layerwise pre‐training + fine tuning (Multi‐stage)
• How to formulate a vision problem with deep learning?
  Tune hyper‐parameters, e.g. number of hidden nodes, number of layers, activation function, dropout.
  Make use of experience and insights obtained in CV research
  Sequential design/learning vs joint learning
  Contextual information (Multi‐stage, face, human pose)
  Background clutter removal (SDN)
  Short and long range temporal relationship (Action recognition)
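As a concrete instance of the data augmentation item, the sketch below produces a randomly cropped and randomly mirrored copy of an image each time it is called. The crop size and the 50% flip probability are assumptions for illustration; the slides do not specify which augmentations were used.

```python
import numpy as np

def augment(image, crop_size, rng):
    """Random crop followed by a random horizontal flip of an (H, W, C) image."""
    height, width = image.shape[:2]
    crop_h, crop_w = crop_size
    top = rng.integers(0, height - crop_h + 1)    # upper bound is exclusive
    left = rng.integers(0, width - crop_w + 1)
    patch = image[top:top + crop_h, left:left + crop_w]
    if rng.random() < 0.5:                        # assumed flip probability
        patch = patch[:, ::-1]                    # mirror left-right
    return patch
```

Calling `augment(image, (24, 24), np.random.default_rng(0))` on a 32x32 image returns one 24x24 variant; drawing a fresh variant every epoch enlarges the effective training set without new labels.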
Reference
• Babenko, Boris, Ming‐Hsuan Yang, and Serge Belongie. "Robust object tracking with online multiple instance learning." Pattern Analysis and Machine Intelligence, IEEE Transactions on 33.8 (2011): 1619‐1632.
• Wang, N., & Yeung, D. Y. "Learning a deep compact image representation for visual tracking." NIPS, 2013.
• Karpathy, Andrej, et al. "Large‐scale video classification with convolutional neural networks." CVPR, 2014.
• Ji, Shuiwang, et al. "3D convolutional neural networks for human action recognition." Pattern Analysis and Machine Intelligence, IEEE Transactions on 35.1 (2013): 221‐231.
Thank you!
mmlab.ie.cuhk.edu.hk/ www.ee.cuhk.edu.hk/~xgwang/ www.ee.cuhk.edu.hk/~wlouyang/