Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions
Sven Bambach, Stefan Lee, David Crandall, School of Informatics and Computing, Indiana University
Chen Yu, Department of Psychological and Brain Sciences, Indiana University



Acknowledgments: This work was supported by the National Science Foundation (CAREER IIS-1253549, CNS-0521433), the National Institutes of Health (R01 HD074601, R21 EY017843), Google, and the Indiana University Office of the Vice President for Research through the Indiana University Collaborative Research Grant and the Faculty Research Support Program. It used compute facilities provided by NVidia, the Lilly Endowment through its support of the IU Pervasive Technology Institute, and the Indiana METACyt Initiative. SB was supported by a Paul Purdom Fellowship.

Introduction

§ Hands appear very often in egocentric video and can give important cues about what people are doing and what they are paying attention to
§ We aim to robustly detect and segment hands in dynamic first-person interactions, distinguish hands from one another, and use hands as cues for activity recognition

Our contributions include:

1. Deep models for hand detection/classification in first-person video, including fast region proposals
2. Pixelwise hand segmentations from detections
3. A quantitative analysis of the power of hand location and pose in recognizing activities
4. A large dataset of egocentric interactions, including fine-grained ground truth for hands

EgoHands: A Large-scale Hand Dataset

§ We recorded first-person video from two interacting subjects, using two Google Glass devices
§ 4 actors x 4 activities x 3 locations = 48 unique videos
§ Pixel-level ground truth annotations for 15,053 hands in 4,800 frames allow training of data-driven models (see the rasterization sketch below)
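As a minimal sketch of how such pixel-level annotations can be turned into training masks, assuming each hand is stored as a polygon of (x, y) contour points (the exact file layout of the released dataset may differ), one could rasterize them with OpenCV:

```python
import numpy as np
import cv2

def polygon_to_mask(polygon, height, width):
    """Rasterize one hand annotation (a list of (x, y) contour points)
    into a binary mask the size of the video frame."""
    mask = np.zeros((height, width), dtype=np.uint8)
    points = np.asarray(polygon, dtype=np.int32).reshape(-1, 1, 2)
    cv2.fillPoly(mask, [points], 255)
    return mask

# Hypothetical example: a triangular "hand" region in a 720p frame.
hand_polygon = [(100, 200), (180, 210), (140, 300)]
mask = polygon_to_mask(hand_polygon, height=720, width=1280)
print(mask.sum() // 255, "hand pixels")
```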

Conclusion and Future Work

§ We showed how to accurately detect and distinguish hands in first-person video and explored the potential of segmented hand poses as cues for activity recognition
§ We plan to extend our activity recognition approach to finer-grained actions and more diverse social interactions

Hand Type Detection

§ A lightweight sampling approach proposes a set of image regions that are likely to contain hands (a)
§ A CNN classifies each region for the presence of a specific hand type (b)

(a) Generating Region Proposals Efficiently

§ Regions are sampled from learned distributions over region size and position, biased by skin color probabilities (see the sampling sketch below)
§ Yields higher coverage than standard methods at a fraction of the computational cost

[Figure: Hand coverage versus number of proposals per frame (500 to 2500) for various proposal methods: Direct Sampling, Direct Sampling (only spatial), Sampling from Selective Search, Selective Search, Sampling from Objectness, and Objectness.]
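A minimal sketch of this direct-sampling idea. It assumes a precomputed per-pixel skin probability map and the set of (width, height) box sizes observed in the training annotations; for simplicity the skin map alone stands in for the learned position distribution, so this illustrates the bias rather than reproducing the authors' exact model:

```python
import numpy as np

def sample_proposals(skin_prob, train_sizes, n=1500, rng=np.random):
    """Draw window proposals: centers are sampled with probability
    proportional to per-pixel skin color likelihood, and (w, h) pairs
    are resampled from box sizes seen in the training annotations."""
    h, w = skin_prob.shape
    p = skin_prob.ravel().astype(np.float64)
    p /= p.sum()
    idx = rng.choice(h * w, size=n, p=p)          # skin-biased center draw
    cy, cx = np.unravel_index(idx, (h, w))
    wh = train_sizes[rng.randint(len(train_sizes), size=n)]
    return np.stack([
        np.clip(cx - wh[:, 0] / 2, 0, w - 1),     # x1
        np.clip(cy - wh[:, 1] / 2, 0, h - 1),     # y1
        np.clip(cx + wh[:, 0] / 2, 0, w - 1),     # x2
        np.clip(cy + wh[:, 1] / 2, 0, h - 1),     # y2
    ], axis=1)

# Hypothetical inputs: a skin probability map for one 720p frame and the
# (width, height) of every ground truth hand box in the training set.
skin_prob = np.random.rand(720, 1280)
train_sizes = np.array([[120, 140], [90, 100], [200, 180]], dtype=np.float64)
proposals = sample_proposals(skin_prob, train_sizes, n=1500)
```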

(b) Hand Type Classification

§ A CNN is trained to distinguish regions between hand types and background (defined based on IoU overlap with ground truth), then non-max suppression produces detections (see the labeling and NMS sketch below)
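The poster does not state the exact overlap thresholds, so the 0.5 positive cutoff and 0.3 suppression threshold below are placeholder values; the sketch shows the two standard ingredients the bullet names, IoU-based labeling of proposals and greedy non-max suppression:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def label_proposals(proposals, gt_boxes, pos_thresh=0.5):
    """Label each proposal with the hand type of its best-overlapping ground
    truth box if IoU >= pos_thresh, otherwise as background (-1).
    gt_boxes rows are [x1, y1, x2, y2, hand_type_id]."""
    labels = np.full(len(proposals), -1)
    for i, p in enumerate(proposals):
        overlaps = iou(p, gt_boxes[:, :4])
        best = overlaps.argmax()
        if overlaps[best] >= pos_thresh:
            labels[i] = gt_boxes[best, 4]
    return labels

def nms(boxes, scores, thresh=0.3):
    """Greedy non-max suppression: keep the highest-scoring box, drop any
    remaining box overlapping it by more than `thresh`, repeat."""
    order = scores.argsort()[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < thresh]
    return keep
```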



[Chart: Average Precision for hand detection, by hand type and evaluation split]

              Main Split   Actor Cross-Val   Activity Cross-Val   Location Cross-Val
All Hands       .807           .807               .772                 .737
Own Left        .640           .700               .675                 .639
Own Right       .727           .782               .768                 .748
Other Left      .813           .776               .699                 .698
Other Right     .781           .756               .686                 .678

[Figure: Left) Precision-recall for hand detection: ours AP 0.807 / AUC 0.842, Selective Search (SS) AP 0.763 / AUC 0.794, Objectness (obj.) AP 0.568 / AUC 0.591. Center) Precision-recall per hand type, compared with Lee et al. [1] on all 4 hand types: own left AP 0.640 / AUC 0.666, own right AP 0.727 / AUC 0.742, other left AP 0.813 / AUC 0.878, other right AP 0.781 / AUC 0.809.]

Performance on the main dataset split. Left) Precision-recall plot for hand detection using different proposal methods. Our proposed approach outperforms the baselines. Center) Precision-recall for hand detection and disambiguation with comparison to Lee et al. [1]. "Own hand" detection is more difficult due to extreme occlusions. Right) Some detection examples at 0.7 recall.

§ We see similar performance and trends for cross-validation experiments with respect to actors, activities, and locations

Hand Segmentation

§ We use our strong detections to initialize GrabCut, modified to use local color models for hands and background, yielding state-of-the-art results for hand segmentation (Ours: 0.556 average IoU; Li et al. [2]: 0.478 average IoU); see the GrabCut sketch below
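A sketch of the detection-to-segmentation step using stock OpenCV. The poster's modification, local rather than global color models, is not exposed by cv2.grabCut, so this only approximates the described method:

```python
import numpy as np
import cv2

def segment_hand(frame_bgr, det_box, iters=5):
    """Initialize GrabCut from a hand detection box and return a binary mask.
    Note: stock cv2.grabCut fits *global* foreground/background color models,
    whereas the poster's modified version uses local ones."""
    x1, y1, x2, y2 = [int(v) for v in det_box]
    mask = np.zeros(frame_bgr.shape[:2], dtype=np.uint8)
    bgd_model = np.zeros((1, 65), dtype=np.float64)   # internal GMM state
    fgd_model = np.zeros((1, 65), dtype=np.float64)
    rect = (x1, y1, x2 - x1, y2 - y1)                 # (x, y, w, h)
    cv2.grabCut(frame_bgr, mask, rect, bgd_model, fgd_model,
                iters, cv2.GC_INIT_WITH_RECT)
    # Definite and probable foreground pixels form the hand mask.
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```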

Hand-based Activity Recognition

§ We hypothesize that first-person hand pose can reveal significant information about the interaction
§ To test this, we build CNNs to classify activities based on frames with non-hands masked out (see the sketch below):
§ With ground truth hand segmentations: 66.4% accuracy
§ With our hand segmentations: 50.9% accuracy
§ Aggregating frames across time further increases accuracy
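A sketch of this evaluation idea, assuming `cnn` is any image classifier returning softmax scores over the four activities; the poster does not specify the aggregation rule, so averaging per-frame probabilities is one plausible choice:

```python
import numpy as np

def mask_non_hands(frame, hand_mask):
    """Zero out every pixel that is not part of a segmented hand, so the
    activity CNN sees only hand shape, pose, and location."""
    return frame * (hand_mask[..., None] > 0)

def aggregate_predictions(frame_probs):
    """Combine per-frame class probabilities over time by averaging and
    predict the activity as the argmax of the mean distribution."""
    return np.mean(frame_probs, axis=0).argmax()

# Hypothetical usage, with `cnn`, `frames`, and `masks` supplied elsewhere:
#   probs = np.stack([cnn(mask_non_hands(f, m))
#                     for f, m in zip(frames, masks)])
#   activity = aggregate_predictions(probs)
```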

Left) Examples of masked hand frames that are used as input to the CNN. Right) Classification accuracy increases as more frames are considered. Here, we sample ~1 frame per second, such that 10 frames span roughly 10 seconds of video.

Left) Some sample frames from the data with ground truth hand masks superimposed in different colors (indicating different hand types). Each column shows one activity: Jenga, jigsaw puzzle, cards, and chess. Right) Random samples of ground truth segmented hands.

The EgoHands dataset is publicly available: http://vision.indiana.edu/egohands/

More activity recognition evaluation can be found in our ACM ICMI 2015 paper: http://vision.soic.indiana.edu/hand-interactions/

[Figure: Activity classification accuracy versus number of frames aggregated (0 to 50), using our segmentations and using ground truth segmentations.]

[1] S. Lee, S. Bambach, D. Crandall, J. Franchak, C. Yu, "This hand is my hand: A probabilistic approach to hand disambiguation in egocentric video", CVPR Workshops 2014.
[2] C. Li, K. Kitani, "Model recommendation with virtual probes for egocentric hand detection", ICCV 2013.
