Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions

Sven Bambach, Stefan Lee, David Crandall, School of Informatics and Computing, Indiana University
Chen Yu, Department of Psychological and Brain Sciences, Indiana University

Acknowledgments: This work was supported by the National Science Foundation (CAREER IIS-1253549, CNS-0521433), the National Institutes of Health (R01 HD074601, R21 EY017843), Google, and the Indiana University Office of the Vice President for Research through the Indiana University Collaborative Research Grant and Faculty Research Support Program. It used compute facilities provided by NVidia, the Lilly Endowment through its support of the IU Pervasive Technology Institute, and the Indiana METACyt Initiative. SB was supported by a Paul Purdom Fellowship.

Introduction

§ Hands appear very often in egocentric video and can give important cues about what people are doing and what they are paying attention to
§ We aim to robustly detect and segment hands in dynamic first-person interactions, distinguish hands from one another, and use hands as cues for activity recognition

Our contributions include:
1. Deep models for hand detection/classification in first-person video, including fast region proposals
2. Pixelwise hand segmentations from detections
3. A quantitative analysis of the power of hand location and pose in recognizing activities
4. A large dataset of egocentric interactions, including fine-grained ground truth for hands

EgoHands: A Large-scale Hand Dataset

§ We recorded first-person video from two interacting subjects, using two Google Glass devices
§ 4 actors x 4 activities x 3 locations = 48 unique videos (enumerated in the sketch below)
§ Pixel-level ground truth annotations for 15,053 hands in 4,800 frames allow training of data-driven models

Left) Some sample frames from the data with ground truth hand masks superimposed in different colors (indicating different hand types). Each column shows one activity: Jenga, jigsaw puzzle, cards, and chess. Right) Random samples of ground truth segmented hands.

The EgoHands dataset is publicly available: http://vision.indiana.edu/egohands/
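To make the dataset's composition concrete, the sketch below enumerates the 4 x 4 x 3 = 48 video combinations. Only the four activity names come from the poster; the actor and location identifiers are hypothetical placeholders, not the dataset's actual naming scheme.

```python
from itertools import product

# Activities are named on the poster; actor/location identifiers below are
# hypothetical placeholders, not the dataset's real naming scheme.
actors = ["actor1", "actor2", "actor3", "actor4"]
activities = ["cards", "chess", "jenga", "puzzle"]
locations = ["location1", "location2", "location3"]

videos = ["_".join(combo) for combo in product(actors, activities, locations)]
assert len(videos) == 4 * 4 * 3 == 48  # one unique video per combination
```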

Hand Type Detection

§ A lightweight sampling approach proposes a set of image regions that are likely to contain hands (a)
§ A CNN classifies each region for the presence of a specific hand type (b)

(a) Generating Region Proposals Efficiently

§ Regions are sampled from learned distributions over region size and position, biased by skin color probabilities
§ Yields higher coverage than standard methods at a fraction of the computational cost

Hand coverage versus number of proposals per frame (500-2500) for various proposal methods: Direct Sampling, Direct Sampling (only spatial), Sampling from Selective Search, Selective Search, Sampling from Objectness, Objectness.
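A minimal sketch of this kind of biased window sampling, assuming a per-pixel skin-probability map is available; the size-distribution parameters below are illustrative stand-ins for the learned distributions, not the authors' actual values.

```python
import numpy as np

def sample_proposals(skin_prob, n=1500, seed=0):
    """Sample (x, y, w, h) window proposals biased by a skin-probability map.

    skin_prob: (H, W) array in [0, 1], e.g. from a simple skin-color model.
    """
    rng = np.random.default_rng(seed)
    H, W = skin_prob.shape
    # Window centers are drawn in proportion to skin-color probability.
    p = skin_prob.ravel() / skin_prob.sum()
    cy, cx = np.unravel_index(rng.choice(H * W, size=n, p=p), (H, W))
    # Window sizes come from an assumed log-normal distribution, standing in
    # for the learned size/position distributions described on the poster.
    w = np.exp(rng.normal(np.log(0.25 * W), 0.4, size=n)).clip(16, W)
    h = np.exp(rng.normal(np.log(0.30 * H), 0.4, size=n)).clip(16, H)
    x = np.clip(cx - w / 2, 0, W - 1)
    y = np.clip(cy - h / 2, 0, H - 1)
    return np.stack([x, y, w, h], axis=1)
```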

(b) Hand Type Classification

§ A CNN is trained to classify regions into hand types and background (defined based on IoU overlap with ground truth); non-maximum suppression then produces detections
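For reference, the sketch below shows a generic greedy non-maximum suppression over scored boxes, run once per hand type; the overlap threshold is an assumption, not a value reported on the poster.

```python
import numpy as np

def iou_one_vs_many(box, boxes):
    """IoU between one (x0, y0, x1, y1) box and an (N, 4) array of boxes."""
    x0 = np.maximum(box[0], boxes[:, 0]); y0 = np.maximum(box[1], boxes[:, 1])
    x1 = np.minimum(box[2], boxes[:, 2]); y1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area_box + areas - inter)

def nms(boxes, scores, overlap_thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou_one_vs_many(boxes[best], boxes[rest]) < overlap_thresh]
    return keep
```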

Average precision (AP) for each hand type across dataset splits:

Hand type     Main Split   Actor CrossVal   Activity CrossVal   Location CrossVal
All Hands     .807         .807             .772                .737
Own Left      .640         .700             .675                .639
Own Right     .727         .782             .768                .748
Other Left    .813         .776             .699                .698
Other Right   .781         .756             .686                .678

Performance on the main dataset split. Left) Precision-recall plot for hand detection using different proposal methods (ours: AP 0.807, AUC 0.842; Selective Search: AP 0.763, AUC 0.794; Objectness: AP 0.568, AUC 0.591); our proposed approach outperforms the baselines. Center) Precision-recall for hand detection and disambiguation, with comparison to Lee et al. [1] on all 4 hand types (own left: AP 0.64, AUC 0.666; own right: AP 0.727, AUC 0.742; other left: AP 0.813, AUC 0.878; other right: AP 0.781, AUC 0.809); "own hand" detection is more difficult due to extreme occlusions. Right) Some detection examples at 0.7 recall.

§ We see similar performance and trends for cross-validation experiments with respect to actors, activities, and locations
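The AP and precision-recall numbers above follow the usual detection evaluation recipe; a minimal sketch, assuming a single image and greedy matching of detections to ground truth at an IoU threshold of 0.5 (the threshold is an assumption, not a value stated on the poster):

```python
import numpy as np

def box_iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(detections, gt_boxes, iou_thresh=0.5):
    """detections: list of (score, box); gt_boxes: list of ground-truth boxes."""
    detections = sorted(detections, key=lambda d: -d[0])      # high score first
    matched = [False] * len(gt_boxes)
    tp = np.zeros(len(detections))
    fp = np.zeros(len(detections))
    for i, (_, box) in enumerate(detections):
        ious = [box_iou(box, g) for g in gt_boxes]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= iou_thresh and not matched[j]:
            tp[i], matched[j] = 1, True                       # correct, unmatched hand
        else:
            fp[i] = 1                                         # duplicate or false alarm
    recall = np.cumsum(tp) / max(len(gt_boxes), 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    # Step-wise area under the precision-recall curve.
    return float(np.sum(np.diff(recall, prepend=0.0) * precision))
```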

Hand Segmentation

§ We use our strong detections to initialize GrabCut, modified to use local color models for hands and background, yielding state-of-the-art results for hand segmentation (ours: 0.556 average IoU; Li et al. [2]: 0.478 average IoU)
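A minimal OpenCV sketch of this step, initializing standard GrabCut from a detection box; the poster's local-color-model modification is not shown, and the iteration count is an arbitrary choice.

```python
import cv2
import numpy as np

def segment_hand(image_bgr, det_box, iters=5):
    """Refine a hand detection box (x, y, w, h) into a pixel mask with GrabCut."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    # Rectangle initialization: pixels inside the box start as probable
    # foreground, everything outside as definite background.
    cv2.grabCut(image_bgr, mask, tuple(int(v) for v in det_box),
                bgd_model, fgd_model, iters, cv2.GC_INIT_WITH_RECT)
    # Keep pixels labeled as (probable) foreground.
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```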

Hand-based Activity Recognition

§ We hypothesize that first-person hand pose can reveal significant information about the interaction
§ To test this, we build CNNs to classify activities based on frames with non-hands masked out:
  § With ground truth hand segmentations, 66.4% accuracy
  § With our hand segmentations, 50.9% accuracy
§ Aggregating frames across time further increases accuracy (a simple scheme is sketched below)

Left) Examples of masked hand frames that are used as input to the CNN. Right) Classification accuracy (using our segmentations vs. using ground truth segmentations) increases as more frames are considered. Here, we sample ~1 frame per second such that 10 frames span roughly 10 seconds of video.
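A minimal sketch of the masking and temporal aggregation idea; the per-frame CNN is abstracted as a function returning activity probabilities, and averaging those probabilities over the sampled frames is one simple aggregation scheme, assumed here rather than taken from the poster.

```python
import numpy as np

ACTIVITIES = ["cards", "chess", "jenga", "puzzle"]

def mask_non_hands(frame, hand_mask):
    """Zero out every pixel outside the hand segmentation."""
    return frame * hand_mask[..., None]        # (H, W, 3) * (H, W, 1)

def classify_clip(frames, hand_masks, frame_cnn):
    """Average per-frame activity probabilities over ~1 fps sampled frames."""
    probs = [frame_cnn(mask_non_hands(f, m)) for f, m in zip(frames, hand_masks)]
    return ACTIVITIES[int(np.argmax(np.mean(probs, axis=0)))]
```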

More activity recognition evaluation can be found in our ACM ICMI 2015 paper: http://vision.soic.indiana.edu/hand-interactions/

Conclusion and Future Work

§ We showed how to accurately detect and distinguish hands in first-person video and explored the potential of segmented hand poses as cues for activity recognition
§ We plan to extend our activity recognition approach to finer-grained actions and more diverse social interactions


[1] S. Lee, S. Bambach, D. Crandall, J. Franchak, C. Yu, "This hand is my hand: A probabilistic approach to hand disambiguation in egocentric video", CVPR Workshops 2014.

[2] C. Li, K. Kitani, "Model recommendation with virtual probes for egocentric hand detection", ICCV 2013.