Deep Learning for videoxgwang/video.pdf · 2014. 7. 12. · Reference • Babenko, Boris, Ming‐Hsuan Yang, and Serge Belongie. "Robust object tracking with online multiple instance

Deep Learning for video understanding

Wanli Ouyang
Department of Electronic Engineering, The Chinese University of Hong Kong

Outline

• Deep learning for image window

Tracking

Outline

• Deep learning for image window
• Deep learning for multiple images

Action recognition: Track cycling, Heptathlon, Longboarding

Deep learning for image window

Deep learning tracker

• Tracking by classification
  Foreground: positive
  Background: negative

• Classification with deep model
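Tracking by classification can be sketched as follows: windows near the current target location are labeled positive (foreground), windows farther away are labeled negative (background), and the next target location is the candidate window that the classifier scores highest. This is only a minimal illustration. The function names, the sampling radii, and the sample counts are assumptions made for the example and do not come from the slides.

```python
import numpy as np

def sample_centers(center, n, min_dist, max_dist, rng):
    """Draw n points whose distance from `center` lies in [min_dist, max_dist)."""
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    radii = rng.uniform(min_dist, max_dist, size=n)
    offsets = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    return np.asarray(center, dtype=float) + offsets

def make_training_set(target_center, rng, n_pos=10, n_neg=30,
                      pos_radius=4.0, neg_inner=8.0, neg_outer=30.0):
    """Window centers near the target are positives (1), distant ones negatives (0)."""
    pos = sample_centers(target_center, n_pos, 0.0, pos_radius, rng)
    neg = sample_centers(target_center, n_neg, neg_inner, neg_outer, rng)
    centers = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    return centers, labels

def track_step(candidate_centers, scores):
    """Pick the candidate window that the classifier scores as most foreground-like."""
    return candidate_centers[int(np.argmax(scores))]

rng = np.random.default_rng(0)
centers, labels = make_training_set((100.0, 80.0), rng)
```

In a full tracker the patches cut out at these centers would be used to update the classifier after every frame.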

Deep classifier

[Babenko et al. TPAMI11]

Deep classifier

[Wang & Yeung NIPS13]

Deep learning tracker

• Pretrain by stacked auto-encoder
• Use a deep model with 4 fully connected layers to learn the classifier from a 32x32 input patch.
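The two steps above can be sketched in NumPy: greedy layer-wise pretraining of a stack of autoencoders, followed by a forward pass through the pretrained layers and a logistic output unit. This is a rough sketch under stated assumptions, not the tracker of [Wang&Yeung NIPS13]. The layer widths, learning rate, epoch count, plain (non-denoising) autoencoders, and random stand-in patches are all chosen for illustration, and the supervised fine-tuning step is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(x, n_hidden, rng, lr=0.005, epochs=30):
    """Train one sigmoid-encoder / linear-decoder autoencoder by gradient descent.

    Returns the encoder weights, encoder bias, and per-epoch reconstruction losses.
    """
    n, d = x.shape
    w_enc = rng.normal(0.0, 0.01, size=(d, n_hidden))
    b_enc = np.zeros(n_hidden)
    w_dec = rng.normal(0.0, 0.01, size=(n_hidden, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = sigmoid(x @ w_enc + b_enc)           # encode
        err = h @ w_dec + b_dec - x              # reconstruction error
        losses.append(0.5 * np.sum(err ** 2) / n)
        d_rec = err / n                          # gradient w.r.t. the reconstruction
        d_z = (d_rec @ w_dec.T) * h * (1.0 - h)  # back through the sigmoid
        w_dec -= lr * (h.T @ d_rec)
        b_dec -= lr * d_rec.sum(axis=0)
        w_enc -= lr * (x.T @ d_z)
        b_enc -= lr * d_z.sum(axis=0)
    return w_enc, b_enc, losses

def pretrain_stack(patches, layer_sizes, rng):
    """Greedy layer-wise pretraining: each layer is trained on the previous layer's codes."""
    layers, codes = [], patches
    for n_hidden in layer_sizes:
        w, b, _ = train_autoencoder(codes, n_hidden, rng)
        layers.append((w, b))
        codes = sigmoid(codes @ w + b)
    return layers

def foreground_probability(patches, layers, w_out, b_out):
    """Forward pass: pretrained fully connected layers, then a logistic output unit."""
    h = patches
    for w, b in layers:
        h = sigmoid(h @ w + b)
    return sigmoid(h @ w_out + b_out)

rng = np.random.default_rng(0)
patches = rng.random((50, 32 * 32))   # stand-in for flattened 32x32 patches
layers = pretrain_stack(patches, [256, 128, 64, 32], rng)   # hypothetical widths
w_out = rng.normal(0.0, 0.01, size=(32, 1))
probs = foreground_probability(patches, layers, w_out, np.zeros(1))
```

After pretraining, the whole stack and the output unit would normally be fine-tuned on the labeled foreground and background patches.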

classification

Deep classifier

Deep learning for multiple images

Deep learning for multiple images

• Consider K images as 3K channels in the input data.

• Apply 3D CNN for extracting features

1 image, 3 channels; K images, 3K channels
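Both ideas can be illustrated in a few lines of NumPy: stacking K RGB frames into one 3K-channel input, and a naive 3D convolution that slides a kernel over time as well as space. This is a sketch for intuition only. The array layout (K, H, W, 3), the function names, and the loop-based convolution are assumptions for the example, and a real network would use an optimized library.

```python
import numpy as np

def stack_frames(frames):
    """Turn K RGB frames of shape (K, H, W, 3) into one (H, W, 3K) input."""
    k, h, w, c = frames.shape
    # Move the frame axis next to the channel axis, then merge the two.
    return frames.transpose(1, 2, 0, 3).reshape(h, w, k * c)

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution (cross-correlation, as in CNNs) over (T, H, W)."""
    t, h, w = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for m in range(out.shape[2]):
                out[i, j, m] = np.sum(volume[i:i + kt, j:j + kh, m:m + kw] * kernel)
    return out

frames = np.random.default_rng(0).random((7, 12, 10, 3))   # K = 7 frames
stacked = stack_frames(frames)                              # shape (12, 10, 21)
features = conv3d_valid(np.ones((7, 12, 10)), np.ones((3, 3, 3)))
```

With the stacked layout, channels 3i to 3i+2 hold frame i, so an ordinary 2D CNN sees all K frames at once, while the 3D convolution keeps a separate temporal axis in its output.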

3D CNN for action recognition [Ji et al. TPAMI13]

• CNN channels can be hard wired, e.g. gray pixel values, gradient‐x/y, optical flow‐x/y.
• Learned weights at other layers
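A hard-wired layer of this kind computes fixed, non-learned channels from each frame. The sketch below produces the gray and gradient‐x/y channels with NumPy. It is an illustration under assumptions: gray is taken as the plain mean of the RGB values, gradients use `np.gradient`, and the optical flow channels are omitted because they need a separate flow estimator.

```python
import numpy as np

def hardwired_channels(rgb):
    """Compute fixed channels for one frame of shape (H, W, 3).

    Returns an (H, W, 3) array holding gray, gradient-x and gradient-y.
    """
    gray = rgb.mean(axis=2)                  # assumption: unweighted RGB mean
    grad_y, grad_x = np.gradient(gray)       # derivatives along rows, then columns
    return np.stack([gray, grad_x, grad_y], axis=2)

# A horizontal ramp: intensity grows by 1 per pixel along x and is constant along y.
ramp = np.tile(np.arange(6, dtype=float), (5, 1))
channels = hardwired_channels(np.stack([ramp, ramp, ramp], axis=2))
```

For the ramp image the gradient‐x channel is 1 everywhere and the gradient‐y channel is 0, which is a quick sanity check of the axis order.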

3D CNN for action recognition

• Encourage the output to be close to high‐level features (bag‐of‐words, motion edge history image).
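One common way to express this kind of regularization is to add an auxiliary regression term to the classification loss, so that an extra output of the network is pulled toward the precomputed high-level features. The sketch below assumes a squared-error auxiliary term and a hypothetical weight `aux_weight`. The exact loss used in [Ji et al. TPAMI13] is not given on the slide.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` for the integer class `label`."""
    shifted = logits - logits.max()          # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def loss_with_auxiliary_output(logits, label, aux_pred, aux_target, aux_weight=0.1):
    """Classification loss plus a penalty pulling aux_pred toward aux_target."""
    classification = softmax_cross_entropy(logits, label)
    auxiliary = 0.5 * np.sum((aux_pred - aux_target) ** 2)
    return classification + aux_weight * auxiliary

loss = loss_with_auxiliary_output(
    logits=np.zeros(3), label=0,
    aux_pred=np.array([1.0, 0.0]), aux_target=np.array([0.0, 0.0]),
)
```

Here `aux_target` stands for a feature vector such as a bag-of-words histogram computed by the auxiliary feature extractors.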

[Diagram labels: auxiliary feature extractors, auxiliary motion, 3D CNN, motion features, action class]

Action recognition results

Cell to ear

Object put

Pointing

Large‐scale Video Classification with CNN [Karpathy et al. CVPR 2014]

• Multi‐resolution

[Figure: multi‐resolution architecture, with inputs labeled 89]
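A multi-resolution input can be sketched as two streams of equal size: a context stream that sees the whole frame at half resolution, and a fovea stream that sees the center of the frame at full resolution. The sketch below assumes a 178x178 frame, so that both streams are 89x89. That frame size and the use of 2x2 average pooling for downsampling are assumptions for the example and are not stated on the slide.

```python
import numpy as np

def multires_streams(frame):
    """Split a frame (H, W, C) with even H and W into (context, fovea) streams.

    Both outputs have shape (H/2, W/2, C).
    """
    h, w, c = frame.shape
    # Context stream: whole frame, downsampled by 2x2 average pooling.
    context = frame.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
    # Fovea stream: center crop at the original resolution.
    top, left = h // 4, w // 4
    fovea = frame[top:top + h // 2, left:left + w // 2]
    return context, fovea

frame = np.random.default_rng(0).random((178, 178, 3))
context, fovea = multires_streams(frame)
```

Each stream has a quarter of the pixels of the full frame, which is where the saving in computation comes from.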

Temporal Fusion

Experimental results

• Randomly sample 20 clips of a video and average the outputs of these clip predictions.
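The video-level prediction described above can be sketched as follows. `clip_model` is a hypothetical callable that maps a clip to a vector of class probabilities, and the clip length of 10 frames is an assumption made for the example.

```python
import numpy as np

def predict_video(frames, clip_model, rng, n_clips=20, clip_len=10):
    """Average the class probabilities of randomly sampled clips of one video."""
    n_frames = len(frames)
    # Random start indices such that every clip fits inside the video.
    starts = rng.integers(0, n_frames - clip_len + 1, size=n_clips)
    clip_probs = [clip_model(frames[s:s + clip_len]) for s in starts]
    return np.mean(clip_probs, axis=0)

def dummy_clip_model(clip):
    """Stand-in for a trained network: returns fixed two-class probabilities."""
    return np.array([0.25, 0.75])

video = np.zeros((50, 4, 4, 3))   # 50 small dummy frames
video_probs = predict_video(video, dummy_clip_model, np.random.default_rng(0))
```

Averaging over clips smooths out clips that happen to be uninformative, at the cost of 20 forward passes per video.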

[Example video labels: Track cycling, Longboarding, Cycling, Aggressive inline skating]

Conclusion ‐ 2 “How to”s

• How to effectively train a deep model
  Data augmentation
  Label more data
  Pre‐train on large‐scale related data (RCNN)
  Layerwise pre‐training + fine tuning (Multi‐stage)

• How to formulate a vision problem with deep learning?
  Tune hyper‐parameters, e.g. number of hidden nodes, number of layers, activation function, dropout.
  Make use of experience and insights obtained in CV research
    Sequential design/learning vs joint learning
    Contextual information (Multi‐stage, face, human pose)
    Background clutter removal (SDN)
    Short and long range temporal relationship (Action recognition)
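As one concrete instance of the data augmentation item, the sketch below draws a random crop and applies a random horizontal flip, so that each training pass sees a slightly different version of the same image. The crop size and the flip probability of 0.5 are assumptions for the example, and other transformations are equally possible.

```python
import numpy as np

def augment(image, crop_size, rng):
    """Return a random square crop of `image` (H, W, C), flipped left-right half the time."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]   # horizontal flip
    return crop

rng = np.random.default_rng(0)
image = np.arange(40 * 40 * 3, dtype=float).reshape(40, 40, 3)
samples = [augment(image, 32, rng) for _ in range(8)]
```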

Reference

• Babenko, Boris, Ming‐Hsuan Yang, and Serge Belongie. "Robust object tracking with online multiple instance learning." Pattern Analysis and Machine Intelligence, IEEE Transactions on 33.8 (2011): 1619‐1632.

• Wang, N., & Yeung, D. Y. "Learning a deep compact image representation for visual tracking." NIPS, 2013.

• Karpathy, Andrej, et al. "Large‐scale video classification with convolutional neural networks." CVPR, 2014.

• Ji, Shuiwang, et al. "3D convolutional neural networks for human action recognition." Pattern Analysis and Machine Intelligence, IEEE Transactions on 35.1 (2013): 221‐231.

Thank you!

mmlab.ie.cuhk.edu.hk/ www.ee.cuhk.edu.hk/~xgwang/ www.ee.cuhk.edu.hk/~wlouyang/
