Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions
Sven Bambach, Stefan Lee, David Crandall, School of Informatics and Computing, Indiana University
Chen Yu, Department of Psychological and Brain Sciences, Indiana University
Acknowledgments: This work was supported by the National Science Foundation (CAREER IIS-1253549, CNS-0521433), the National Institutes of Health (R01 HD074601, R21 EY017843), Google, and the Indiana University Office of the Vice President for Research through the Indiana University Collaborative Research Grant and Faculty Research Support Program. It used compute facilities provided by NVidia, the Lilly Endowment through its support of the IU Pervasive Technology Institute, and the Indiana METACyt Initiative. SB was supported by a Paul Purdom Fellowship.
Introduction
§ Hands appear very often in egocentric video and can give important cues about what people are doing and what they are paying attention to
§ We aim to robustly detect and segment hands in dynamic first-person interactions, distinguish hands from one another, and use hands as cues for activity recognition

Our contributions include:
1. Deep models for hand detection/classification in first-person video, including fast region proposals
2. Pixelwise hand segmentations from detections
3. A quantitative analysis of the power of hand location and pose in recognizing activities
4. A large dataset of egocentric interactions, including fine-grained ground truth for hands
EgoHands: A Large-scale Hand Dataset
§ We recorded first-person video from two interacting subjects, using two Google Glass devices
§ 4 actors x 4 activities x 3 locations = 48 unique videos
§ Pixel-level ground truth annotations for 15,053 hands in 4,800 frames allow training of data-driven models
Conclusion and Future Work
§ We showed how to accurately detect and distinguish hands in first-person video and explored the potential of segmented hand poses as cues for activity recognition
§ We plan to extend our activity recognition approach to finer-grained actions and more diverse social interactions
Hand Type Detection
§ A lightweight sampling approach proposes a set of image regions that are likely to contain hands (a)
§ A CNN classifies each region for the presence of a specific hand type (b)
(a) Generating Region Proposals Efficiently
§ Regions are sampled from learned distributions over region size and position, biased by skin-color probabilities (a minimal sketch follows below)
§ This yields higher coverage than standard methods at a fraction of the computational cost

[Figure: Hand coverage versus number of proposals per frame (500-2,500) for Direct Sampling, Direct Sampling (only spatial), Sampling from Selective Search, Selective Search, Sampling from Objectness, and Objectness.]
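The poster does not spell out the sampling procedure, so the following is only a minimal NumPy sketch of the idea under stated assumptions: `skin_prob` is a precomputed per-pixel skin-probability map, and the learned size distribution is approximated here by a hypothetical log-normal with illustrative parameters `size_mean` and `size_std`.

```python
import numpy as np

def sample_proposals(skin_prob, size_mean=120.0, size_std=0.4, n=1500, rng=None):
    """Sample square window proposals biased by a skin-probability map.

    skin_prob: (H, W) array of per-pixel skin-color probabilities.
    size_mean, size_std: illustrative log-normal window-size parameters
    (the actual size/position distributions are learned from data).
    """
    rng = rng or np.random.default_rng()
    H, W = skin_prob.shape
    # Draw window centers with probability proportional to skin likelihood.
    p = skin_prob.ravel() / skin_prob.sum()
    idx = rng.choice(H * W, size=n, p=p)
    cy, cx = np.unravel_index(idx, (H, W))
    # Draw window side lengths from the assumed size distribution.
    side = np.exp(rng.normal(np.log(size_mean), size_std, size=n))
    # Clip boxes (x1, y1, x2, y2) to the image bounds.
    x1 = np.clip(cx - side / 2, 0, W - 1)
    y1 = np.clip(cy - side / 2, 0, H - 1)
    x2 = np.clip(cx + side / 2, 0, W - 1)
    y2 = np.clip(cy + side / 2, 0, H - 1)
    return np.stack([x1, y1, x2, y2], axis=1)
```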
(b) Hand Type Classification
§ A CNN is trained to distinguish regions between hand types and background (defined based on IoU overlap with ground truth), then non-maximum suppression produces detections
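The poster gives no implementation details for the labeling or suppression steps. As an illustrative sketch (the 0.3 overlap threshold is an assumption, not a value from the paper), standard IoU and greedy non-maximum suppression look like:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, thresh=0.3):
    """Greedy non-maximum suppression: keep the best box, drop heavy overlaps."""
    order = np.argsort(scores)[::-1]   # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < thresh]
    return keep
```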
Average Precision for hand detection across dataset splits:

              Main Split   Actor CrossVal   Activity CrossVal   Location CrossVal
All Hands        .807           .807              .772                .737
Own Left         .640           .700              .675                .639
Own Right        .727           .782              .768                .748
Other Left       .813           .776              .699                .698
Other Right      .781           .756              .686                .678
Performance on the main dataset split. Left) Precision-Recall for hand detection using different proposal methods (ours: AP 0.807, AUC 0.842; Selective Search: AP 0.763, AUC 0.794; Objectness: AP 0.568, AUC 0.591); our proposed approach outperforms the baselines. Center) Precision-Recall for hand detection and disambiguation (own left: AP 0.64, AUC 0.666; own right: AP 0.727, AUC 0.742; other left: AP 0.813, AUC 0.878; other right: AP 0.781, AUC 0.809), with comparison to Lee et al. [1] on all 4 hand types. "Own hand" detection is more difficult due to extreme occlusions. Right) Some detection examples at 0.7 recall.
§ We see similar performance and trends for cross-validation experiments with respect to actors, activities, and locations
Hand Segmentation
§ We use our strong detections to initialize GrabCut, modified to use local color models for hands and background, yielding state-of-the-art results for hand segmentation (ours: 0.556 average IoU; Li et al. [2]: 0.478 average IoU)
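As a rough illustration of the initialization step only (not the authors' local-color-model modification), stock OpenCV GrabCut can be seeded with a detection box like this:

```python
import cv2
import numpy as np

def segment_hand(frame_bgr, box, iters=5):
    """Run plain GrabCut seeded with a hand-detection box (x, y, w, h).

    The paper's variant additionally uses local color models for hands
    and background; this sketch is only the stock-OpenCV starting point.
    """
    mask = np.zeros(frame_bgr.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)   # background GMM state
    fgd = np.zeros((1, 65), np.float64)   # foreground GMM state
    cv2.grabCut(frame_bgr, mask, box, bgd, fgd, iters, cv2.GC_INIT_WITH_RECT)
    # Definite or probable foreground pixels form the hand mask.
    hand = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return hand.astype(np.uint8)
```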
Hand-based Activity Recognition
§ We hypothesize that first-person hand pose can reveal significant information about the interaction
§ To test this, we build CNNs to classify activities based on frames with non-hand pixels masked out (a minimal sketch follows below)
§ With ground truth hand segmentations: 66.4% accuracy
§ With our hand segmentations: 50.9% accuracy
§ Aggregating frames across time further increases accuracy
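A minimal sketch of this masked-input representation and of one simple way to aggregate per-frame predictions over time (the poster only states that accuracy increases with more frames; averaging class probabilities is an assumption here, and `model` stands in for any hypothetical frame classifier returning class probabilities):

```python
import numpy as np

def mask_non_hands(frame, hand_mask):
    """Zero out every pixel that is not part of a segmented hand."""
    return frame * hand_mask[..., None]   # hand_mask: (H, W) in {0, 1}

def classify_activity(frames, masks, model):
    """Aggregate over time by averaging per-frame class probabilities."""
    probs = [model(mask_non_hands(f, m)) for f, m in zip(frames, masks)]
    return int(np.argmax(np.mean(probs, axis=0)))
```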
Left) Examples of masked hand frames that are used as input to the CNN. Right) Classification accuracy increases as more frames are considered. Here, we sample ~1 frame per second, such that 10 frames span roughly 10 seconds of video.
Left) Some sample frames from the dataset with ground truth hand masks superimposed in different colors (indicating different hand types). Each column shows one activity: Jenga, jigsaw puzzle, cards, and chess. Right) Random samples of ground truth segmented hands.
The EgoHands dataset is publicly available: http://vision.indiana.edu/egohands/
More activity recognition evaluation can be found in our ACM ICMI 2015 paper: http://vision.soic.indiana.edu/hand-interactions/
[Figure: Classification accuracy versus number of frames (0-50), for our segmentations and for ground truth segmentations.]
[1] S. Lee, S. Bambach, D. Crandall, J. Franchak, C. Yu, "This hand is my hand: A probabilistic approach to hand disambiguation in egocentric video", CVPR Workshops 2014.
[2] C. Li, K. Kitani, "Model recommendation with virtual probes for egocentric hand detection", ICCV 2013.