Upload
nguyentuyen
View
231
Download
3
Embed Size (px)
Citation preview
Lecture8:ConvolutionalNeuralNetworksonVideos
Bohyung [email protected]
CSED703R:DeepLearningforVisualRecognition(2017F)
CNNsonVideos
• ChallengesinvideoprocessingusingCNNs
§ Alargenumberofframes:highcomputationalcomplexity
§ Variablelengths
§ Temporaldependencyofdata
• Relevantproblems
§ Actiondetectionandrecognition
§ Objectdetectionandrecognitioninvideos
§ Videosegmentation
§ Scenerecognitioninvideos
§ Visualtracking
§ …
2
ActionDetectionandRecognition
• Classifyingactions
§ Fromimagesorvideos
§ Usingdeeplearningand/orshallowlearning
• Problemdefinition
§ Actionrecognition:pervideoinferencetypicallyusingtrimmedvideos
§ Actiondetection:temporalorspatio-temporallocalization
• Deeplearningforactiondetectionandrecognition
§ Notyetmature
§ Basedoneither CNNsorRNNs
§ Algorithmsbasedondeeplearningstartedtooutperformthemethods
basedonhandcraftedfeatures.
3
Datasets
• UCF-101
§ 101classes:approximately13,320realisticvideos,collectedfromYouTube§ 5types:human-objectinteraction,bodymotiononly,human-human
interaction,playingmusicalinstruments,sports
§ Threetraining/testingsplits
4http://crcv.ucf.edu/data/UCF101.php
Datasets
• Sports-1M
§ 487classes,1000-3000videosperclass
§ 1MYouTubevideos
§ Theclassesarearrangedinamanually-curatedtaxonomy.
§ Noisylabelsduetoautomatedannotationprocess
5http://cs.stanford.edu/people/karpathy/deepvideo/
Datasets
• HMDB-51
§ Alargehumanmotiondatabase
§ 7000clipsfor51classes
§ Originalandstabilizedversionsareavailable.
§ STIP(SpaceTimeInterestPoint)featuresareavailable.
6http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/
Datasets
• AtomicVisualActions(AVA)dataset
§ 44classeswithatleast25testexamples
§ From192moviesand57.6Kvideosegments
§ 96Kboundingboxes
§ 210Kactionannotations
7
3secs
Left:Kneel,Talkto
Right:Stand,Listen,Shoot
https://research.google.com/ava/
Datasets
• Kinetics
§ Humanactionclassification(10sclips)
§ 400humanactionclasses
§ Morethan400videosperclass:300ktotal,allfromuniquevideos
§ Videosfromyoutube searches
8https://deepmind.com/research/open-source/open-source-datasets/kinetics/
PlayingtrumpetDunkingbasketball
BraidinghairShakinghands
Datasets
• ActivityNet 200
§ 200activityclasseswithtaxonomy
§ 10,024trainingvideos(15,410instances)
§ 4,926validationvideos(7,654instances)
§ 5,044testingvideos(labelswithheld)
§ Untrimmedvideos
9http://activity-net.org/download.html
ActionRecognitionResults
10
Title Venue AlgorithmAccuracy
Sports-1M
UCF101
ActionRecognition withImprovedTrajectories
ICCV 13 Improved dense
trajectory(iDT)
85.9
BagofVisualWordsandFusionMethods
forActionRecognition:Comprehensive
StudyandGoodPractice
arXiv 14 iDT withhigher
dimensionalencoding
87.9
Large-scaleVideo ClassificationwithCNNs CVPR 14 Spatial-temporal CNN 60.9 65.4
Two-Stream CNNsforActionRecognitioninVideos
NIPS14 Two-streamCNN
fusedby SVM
88.0
Long-TermRecurrentConvolutionalNetworks
forVisualRecognitionandDescription
CVPR15 CNN +LSTM 82.9
BeyondShortSnippets: DeepNetworksfor
VideoClassification
CVPR15 Two-stream+temporal
featurepooling
72.4 88.6
LearningSpatiotemporalFeatureswith3DConvolutionalNetworks
ICCV 15 Spatiotemporal 3D-
CNN+iDT
61.1 90.4
ActionRecognitionwithTrajectory-PooledDeep-ConvolutionalDescriptors
CVPR15 Two-streammodel+
iDT
91.5
Actions Transformations CVPR16 92.4
ImprovedDenseTrajectory(iDT)
• Mainidea
§ Globalmotioncompensation:
removingtrajectoriesfrom
cameramotion
§ Outlierrejection:removing
matchesfromhumanregions
• Featureencoding
§ Extractsfeaturessuchas
trajectory,HOG,HOF,andMBH
§ Encodesfeaturesusing
• Bagoftrajectoryfeatures
• Fishervector+PCA
11[Wang13]H.Wang,C.Schmid:ActionRecognitionwithImprovedTrajectories.ICCV2013
Itdemonstratesverygoodaccuracyevencomparedwithdeeplearningapproaches.
Large-ScaleVideoClassificationbyCNN
• Architectureswithpoolingvariations
§ Regardsvideosasacollectionofframes
§ AppliesCNNsforasingleormultipleframes
12
TwostreamswithTframegap 3Dconvolutions 3Dconvolutionswith
progressivefusion
2frames Multipleframes Multipleframes
[Kapathy14]A.Kapathy,G.Toderici,S.Shetty,T.Leung,R.Sukthankar,L.Fei-Fei:Large-scaleVideoClassificationwithConvolutionalNeuralNetworks.CVPR2014
Singleframe Latefusion Earlyfusion Slowfusion
Large-ScaleVideoClassificationbyCNN
• Multi-resolutionapproach
§ Foveastream:centercropofeachframe
§ Contextstream:entireframe
13
[Kapathy14]A.Kapathy,G.Toderici,S.Shetty,T.Leung,R.Sukthankar,L.Fei-Fei:Large-scaleVideoClassificationwithConvolutionalNeuralNetworks.CVPR2014
Large-ScaleVideoClassificationbyCNN
• Contributions
§ Constructionalarge-scaledataset,Sports-1M
§ Providingquantitativeaccuraciesofseveralbaselinealgorithms
• Cons
§ StillworsethaniDT algorithmwithlargemargin
§ Stillnosophisticatedideaofvideolevelprediction
14
[Kapathy14]A.Kapathy,G.Toderici,S.Shetty,T.Leung,R.Sukthankar,L.Fei-Fei:Large-scaleVideoClassificationwithConvolutionalNeuralNetworks.CVPR2014
TwoStreamCNNs
• Combinationoftwocomplementaryinformation
§ Spatialstream:CNNonasingleimage(2Dconvolution)
§ Temporalstream:CNNonmulti-frameopticalflow(2Dconvolution)
§ Fusion:SVMwithsoftmax scoresfromtwostreams
15
[Simonyan14]K.Simonyan,A.Zissermann:Two-StreamConvolutionalNetworksforActionRecognitioninVideos.NIPS2014
Pretrained onImageNet
Trainingfromscratch
TwoStreamCNNs
• Temporalstream
§ Flowdatageneration:!" ∈ ℝ%×'×()(* frameswith2channels)
§ Networkinput:224×224×2* volumesubsampledfrom!"• Multi-tasklearning
§ Totraintemporalstreamfromscratchusingasmallnumberofdata
§ TrainedonUCF-101(9.5Kvideos)andHMDB-51(3.5Kvideos)
• Featureaggregationovermultipleframes
§ Opticalflowstacking:stackflowchannels-./ and-.0of* consecutive
frames
§ Trajectorystacking:stackcumulativemotionin1 and2 direction
§ Bidirectionalopticalflow:stackingbothforwardandbackwardstreams
§ Meanflowsubtraction
16
[Simonyan14]K.Simonyan,A.Zissermann:Two-StreamConvolutionalNetworksforActionRecognitioninVideos.NIPS2014
TwoStreamCNNs
• Results
§ NotparticularlygoodcomparedtoIDT
17
Meanaccuracy(overthreesplits)onUCF-101andHMDB-51
[Simonyan14]K.Simonyan,A.Zissermann:Two-StreamConvolutionalNetworksforActionRecognitioninVideos.NIPS2014
iDT +CNN
• Mainidea
§ Sharesmeritsofhandcraftedfeaturesanddeeplylearnedfeatures
§ Shallowfeatures:iDT
§ Deepfeatures:aslightvariationoftwo-streamCNNs
18
[Wang15]L.Wang,Y.Qiao,X.Tang:ActionRecognitionwithTrajectory-PooledDeep-ConvolutionalDescriptors.CVPR2015
+
Extracting Trajectories Extracting Feature Maps
Trajectory-Pooled Deep-Convolutional Descriptors (TDDs)
input video tracking in a single scale trajectories spatial & temporal nets frame & flow pyramidfeature pyramid
…
…
convolutional feature map feature map normalization trajectory-constrained pooling
TD
D
X
Y
Z
spatiotemporal normalization channel normalization
…
…
+
+
+
iDT +CNN
• Trajectory-PooledDeep-ConvolutionalDescriptors(TDDs)
§ Localtrajectory-aligneddescriptorcomputedina3Dvolume
§ Sum-poolingofthenormalizedfeaturemaps
• ClassificationbyalinearSVMusingFishervectorencoding
19
3 45, 789: = <789: 1=5, 2=5, >=5
=@
45:trajectory,789: :normalizedfeaturemap, A ∈ B, C , D5 = 1=5, 2=5, >=5
[Wang15]L.Wang,Y.Qiao,X.Tang:ActionRecognitionwithTrajectory-PooledDeep-ConvolutionalDescriptors.CVPR2015
RGB Flow-x Flow-y S-conv4 S-conv5 T-conv3 T-conv4
ActionRecognitionby3DCNNs
• Generalizedversionof2D
convolution
§ Convolutioninspatio-temporal
domain
§ Moreparametersfor
convolutions
20
[Ji10]S.Ji,W.Xu,M.Yang,K.Yu:3DConvolutionalNeuralNetworksforHumanActionRecognition.ICML
2010
[Tran15]D.Tran,L.Bourdev,R.Fergus,L.Torresani,M.Paluri:LearningSpatiotemporalFeatureswith3DConvolutionalNetworks.ICCV2015
2Dconvolution3Dconvolution
3DConvolutions
• Basicarchitecture[JiICML10]
§ 1hardwiredconvolutionallayer
§ 3additionalconvolutionallayers
§ 2poolinglayers
§ 1fullyconnectedlayer
21
H1: 33@60x40 C2:
23*2@54x34
7x7x3 3D convolution
2x2 subsampling
S3: 23*2@27x17
7x6x3 3D convolution
C4: 13*6@21x12
3x3 subsampling
S5: 13*6@7x4
7x4 convolution
C6: 128@1x1
full connnection
hardwired
input: 7@60x40
[Ji10]S.Ji,W.Xu,M.Yang,K.Yu:3DConvolutionalNeuralNetworksforHumanActionRecognition.ICML
2010
3DConvolutions:C3D
• Architecture[TranICCV15]
§ 8convolutionslayers:3x3x3with
stride1x1x1
§ 5max-poolinglayers:2x2x2with
stride2x2x2(1x2x2withstride1x2x2
forpool1)
§ 2fullyconnectedlayers
§ Softmax layer
22
Conv
1a6
4
Conv
2a1
28
Pool
1
Conv
3a2
56
Pool
2
Conv
3b2
56
Pool
3
Conv
4a5
12
Conv
4b5
12
Pool
4
Conv
5a5
12
Conv
5b5
12
Pool
5
FC4
096
FC4
096
Softm
ax
[Tran15]D.Tran,L.Bourdev,R.Fergus,L.Torresani,M.Paluri:LearningSpatiotemporalFeatureswith3DConvolutionalNetworks.ICCV2015
DeCAFDeCAF
AccuracywithlinearSVM
3DConvolutions:C3D
• Featureembedding
23
DeCAF C3D
[Tran15]D.Tran,L.Bourdev,R.Fergus,L.Torresani,M.Paluri:LearningSpatiotemporalFeatureswith3DConvolutionalNetworks.ICCV2015
I3D
• TwoStream Inflated3DConvNets (I3D)
§ Inflating2DConvNets into3D
• Thepre-trainedweightsinCNNsfor
imagesprovideavaluableinitialization.
§ Two3Dstreams:RGBandflow
§ 64framesasaninput
§ 25Mparameters
24
[Carreira17]J.Carreira,A.Zisserman:QuoVadis,ActionRecognition?ANewModelandtheKineticsDataset.CVPR2017
ActionDetection
• Spatio-temporalactionlocalization
25[Gkioxari15]G.Gkioxari,J.Malik:FindingActionTubes.CVPR2015
Action-specificclassifier
Spatial-CNN
Motion-CNN
Objectproposals
Linkingactiondetections
ActionDetection
• Actionspecificclassifier
• Linkingactiondetections
26
BE F., F.GH = IEJK F. + IE
JK F.GH + M ⋅ IoU F., F.GH
FRE∗ = argmaxYR
14<BE F., F.GH
[\H
.]H
OptimizedbyViterbialgorithm
IEJ:learnedmodelfromlinearSVM
K F. :learnedfeaturefromCNN
[Gkioxari15]G.Gkioxari,J.Malik:FindingActionTubes.CVPR2015
ActionDetection
• ResultsonUCFsportsdataset
27[Gkioxari15]G.Gkioxari,J.Malik:FindingActionTubes.CVPR2015
ObjectDetectioninVideos
• ImageNet-VIDdataset
§ 30categories
• Taskdefinition
§ Recognizingandlocalizingobjectsinvideos
§ Aimingspatio-temporalcoherency
§ Sameevaluationmetricwithobjectdetectioninimages
28
CNNTublets
• Spatio-temporaltubelet proposalmodule
§ Combinesstill-imageobjectdetectionandgenericobjecttracking
§ Proposalextraction,proposalscoring,high-confidenceproposaltracking
• Tubelet classificationandrescoring
§ Tubelet box(spatial)perturbationandmax-pooling
§ Temporalconvolutionandrescoring
29
[Kang16]K.Kang,W.Ouyang,H.Li,X.Wang:ObjectDetectionfromVideoTubelets withConvolutionalNeuralNetworks.CVPR2016
ObjectSegmentationinVideos
• DAVIS(Densely-AnnotatedVIdeo Segmentation)dataset
§ http://davischallenge.org/index.html
30
31