Towards Visual SLAM in Dynamic Environments


Contents

1 Introduction
2 Sensor Error Treatment and Data Association in SIFT SLAM
  2.1 SIFT SLAM Strongly Trusts Sensor Data
  2.2 Predictions for SIFT SLAM in Dynamic Environments
3 Techniques for Improving Visual SLAM Performance
  3.1 Tuning Thresholds to Minimize Uncertainty
  3.2 Identifying Outliers
  3.3 Identifying Motion through Optical Flow
4 Optical Flow for Place Recognition
  4.1 Common Infrastructure
  4.2 Simple Demonstration of Keypoint Rejection
  4.3 Use of Optical Flow in Place Recognition
  4.4 Choice of Optical Flow Scheme
5 Conclusions


    1 Introduction

To build autonomous robots capable of operating wherever humans do, we must develop localization and mapping strategies that can handle changing environments. Visual strategies are especially desirable, as they have the potential to produce substantially more useful maps than other approaches (e.g. with denser features at multiple height levels). Recent work by Se, Lowe and Little has shown that current machine vision technology makes visual SLAM feasible in a potentially wide range of static environments. However, their system treats vision system errors in a manner which imposes a fundamental limit on its ability to handle environments with motion.

In this paper, I first identify limitations of Se, Lowe and Little's error handling approach common to virtually all SLAM systems and, based on this analysis, propose two techniques for overcoming these limitations. I also present a third technique, based on optical flow, which specifically addresses the issue of visual SLAM in dynamic environments. Finally, I discuss a proof-of-concept system I built which explores the usefulness of this third technique in the context of place recognition.

It is worth noting that the original focus of my course project was the use of optical flow to enable visual SLAM in dynamic environments. While working with this idea, I made the observations about Se, Lowe and Little's error treatment and possible improvements that I outlined above; I believe these observations constitute the meat of my project.

    2 Sensor Error Treatment and Data Association in SIFT SLAM

Virtually all metric SLAM systems operate by predicting relative landmark positions based on relatively coarse motion estimates, interpreting sensor data to identify measured landmark positions based on those estimates, and revising position accordingly. Predicted landmark positions are a key element of the data association process in virtually all metric SLAM systems, as follows. If SLAM landmarks cannot be distinguished by a sensor system other than on the basis of their position (as with the flat region landmarks typically produced by sonar or laser range analysis), they must be chosen very carefully to avoid confusion with other landmarks. In practice, this has resulted in the use of a validation gate, a set of thresholds governing how close, in position, a measured landmark must be to an expected landmark to be considered a match.
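To make the gate concrete, a membership test of this kind might be sketched as follows. This is a minimal sketch: the 10x10 pixel window matches the value reported for SIFT SLAM, but the disparity tolerance and the function interface are my own illustrative assumptions.

```python
def within_validation_gate(predicted_px, measured_px, gate_size=10.0,
                           predicted_disparity=None, measured_disparity=None,
                           disparity_tol=2.0):
    """Return True if a measured landmark falls inside the validation gate
    centered on the predicted landmark position.

    gate_size matches the 10x10 pixel window from Se, Lowe and Little's
    paper; disparity_tol is an illustrative placeholder, not their value."""
    dr = abs(predicted_px[0] - measured_px[0])
    dc = abs(predicted_px[1] - measured_px[1])
    # Position component of the gate: a window centered on the prediction.
    if dr > gate_size / 2 or dc > gate_size / 2:
        return False
    # Optional disparity component, analogous to the depth-derived checks.
    if predicted_disparity is not None and measured_disparity is not None:
        if abs(predicted_disparity - measured_disparity) > disparity_tol:
            return False
    return True
```

A measured landmark 3 pixels off in each axis passes, while one 10 pixels off in a single axis is rejected, as is a positional match whose disparity disagrees too strongly.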

In SIFT SLAM, the validation gate takes the form of a pixel window (10x10 in Se, Lowe and Little's paper) centered on the predicted position in which landmarks must lie to get matched, as well as similar constraints on disparity and SIFT scale (derived from depth). Although SIFT SLAM includes a matching threshold for the SIFT orientation component of a landmark, a measure of its actual image content, this threshold is not part of the validation gate, as its value has nothing to do with the position of the landmark. Note that as SIFT feature matching is a visual process, the difference between matched and expected SIFT feature position can be viewed as an estimate for the optical flow or image motion at the location of the SIFT feature being matched.1

1It might perhaps be better to term the difference as measured motion rather than optical flow, as sparse optical flow is somewhat of an oxymoron, and overloads the technical definition of optical flow - as opposed to motion estimates - as apparent


The parameters for this gate impose hard limits on the performance of the SLAM system in the following ways:

1. They imply a bound on the maximum odometry error that the system can handle, as if odometry readings are sufficiently far off, no landmarks previously known about by the sensor system will fall within the validation gate boundaries.

2. Perhaps more intuitively, they imply a maximum sensor error in locating landmarks, as again if the sensor's reading for a given landmark is sufficiently far off, it will not be counted.

3. If the sensor system cannot distinguish between landmarks at all other than by their position - that is, if a landmark is simply a signal that is stably extractable from the world at a given position - the validation gate imposes a limit on the closeness of landmarks. This is because if two landmarks were chosen with overlapping validation regions, they could not be distinguished by the system. Note that in SIFT SLAM, the vision system can distinguish between landmarks on a basis other than their position; I will discuss the implications of this later.

4. They imply a limit on the motion resolution of the SLAM system. That is, if the SLAM system attempted to notice landmark motion (either while moving, in which case the detected landmark motion would first be computed relative to the robot's motion, or while stationary), it would only be able to do so for landmarks moving faster per sensor processing cycle than the validation gate allows. I attempt to illustrate this in the figure below.

In this figure, the robot is represented by Mario and a slowly moving adversary by Wario. As long as Wario's motion is within the grey region, corresponding to the validation gate, Mario will think his position estimate is off and adjust it accordingly. While other landmarks not exhibiting those errors might help to reduce the impact of Wario's motion, its effect will be nonzero, and possibly quite significant.

motion of the brightness pattern in an image.


2.1 SIFT SLAM Strongly Trusts Sensor Data

The overarching conclusion of many of David Lowe's evaluations of SIFT is that it is remarkably reliable; two SIFT features are relatively rarely matched when the real-world objects on which those features are located do not look the same to people. The stability results in his paper indicate that the measured parameters of SIFT features (their scale and orientation) are relatively stable at least to small displacements.

SIFT SLAM exploits this fact in its position estimation scheme by essentially circumventing the bulk of traditional EKF SLAM. It first treats the odometry reading as both the measurement and prediction for the robot pose, as opposed to treating the odometry reading as the prediction and the individual position observations for landmarks as the measurements. The H matrix for the Kalman filter SIFT SLAM uses is simply the identity.

It then incorporates landmark positions by computing a least-squares best-fit position minimizing errors between predicted landmark pixel coordinates in its view and the observed ones. Even this process throws away information, as the measured depths of the landmarks are used only indirectly. The error terms being minimized in SIFT SLAM are below, taken from Se, Lowe and Little's paper, where r_i is the predicted pixel row position of matched landmark i and r_mi is its detected pixel position:

    e_ri = r_i - r_mi    (1)
    e_ci = c_i - c_mi    (2)

Although the depth of a landmark is used to predict these values, SIFT scale error - a measurement analogous to pixel position produced by SIFT - is not directly minimized. Actually, SIFT SLAM computes two least-squares fits, using the first one to identify outliers (according to a standard residual measure, with a threshold of 2 pixels) and the second one to produce an estimate using only the non-outliers.
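The two-pass procedure can be illustrated with a simplified stand-in that fits a pure image-plane offset rather than the full pose; the actual SIFT SLAM fit solves for camera motion, and only the 2-pixel residual threshold here is taken from the paper.

```python
import numpy as np

def two_pass_fit(predicted, measured, residual_threshold=2.0):
    """Estimate a 2D image-plane offset by least squares, reject outliers
    whose residual exceeds the threshold (2 pixels, as in SIFT SLAM), and
    refit using only the inliers.

    A simplified stand-in for the full pose fit: for a pure translation,
    the least-squares offset is just the mean error vector. Assumes at
    least one inlier survives the first pass."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    # First pass: fit on all correspondences.
    offset = (measured - predicted).mean(axis=0)
    # Residuals against the first-pass fit identify outliers.
    residuals = np.linalg.norm(measured - (predicted + offset), axis=1)
    inliers = residuals <= residual_threshold
    # Second pass: refit on the non-outliers only.
    offset = (measured[inliers] - predicted[inliers]).mean(axis=0)
    return offset, inliers
```

With five correspondences shifted by one pixel and a single grossly displaced one, the first pass absorbs some of the outlier's error, but the residual test removes it and the second pass recovers the clean offset.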

SIFT SLAM then apparently computes a measure of the quality of the least-squares fit and treats it as a kind of measurement uncertainty, adding it to the position uncertainty predicted from its motion model. It then feeds this total uncertainty estimate into a Kalman-filter-like procedure, whose actual equivalence to an EKF I have not been able to verify, to weight the least-squares position estimate and the odometry estimate. The output is a position estimate along with a position uncertainty which is later fed into independent EKFs for each tracked landmark.

This structure reflects a strong trust in sensor readings in two main ways. First, its definition of an outlier in the least-squares procedure is quite stringent. This means all the offsets in measured positions must be quite coherent, reflecting primarily motion and not sensor error. This is a somewhat subtle point, which I have not had the opportunity to verify in simulation, but worth investigating further. Second, it does not apply a full EKF to the SLAM problem; instead, it decouples it into pose estimation and landmark estimation, restricting landmark uncertainty to be independent (a la FastSLAM). The choice to throw away these sources of information reflects the confidence the SIFT SLAM designers had in the quality of SIFT matching. It also raises the issue that the uncertainty estimates produced by SIFT SLAM do not mean the same things as the uncertainty estimates produced by many other metric SLAM algorithms.

    It would be interesting to check if the position estimates produced by SIFT SLAM vary significantly


from the least-squares estimates produced by its vision system; if so, SIFT SLAM would be trusting its sensors almost absolutely. It is also worth noting that the sensor error model used by SIFT SLAM in the per-landmark EKFs uses an arbitrarily chosen variance of 1 pixel; this value is quite low, another reflection of confidence. It is not as compelling, however, because SIFT SLAM does not appear to use its landmark EKFs for much, given its decoupling of the state estimation.

2.2 Predictions for SIFT SLAM in Dynamic Environments

The specific choices made in SIFT SLAM suggest it will actually perform reasonably well in dynamic environments, as long as motion is fast enough that it doesn't introduce the validation gate errors I discussed earlier. This is because if a landmark is not matched (when expected) enough times, SIFT SLAM throws it away; landmarks on a moving person, then, are likely to get thrown out if the person moves too much. If a person moves slowly for a while, it could conceivably corrupt the robot's position estimate a bit, but - combined with the possibility of rejection as an outlier in the least-squares estimate - it seems unlikely that the majority of moving bodies will confuse SIFT SLAM.

An interesting exercise would be to predict the motion thresholds implied by SIFT SLAM's error handling choices to determine what amount of motion in a scene is likely to fool it, or whether or not sufficiently slow motion is likely to be a problem in any real SLAM scenarios.

    3 Techniques for Improving Visual SLAM Performance

The above analysis suggests several possible improvements to SIFT SLAM. Some of these may even be usable in other SLAM systems.

3.1 Tuning Thresholds to Minimize Uncertainty

The default threshold values provided in SIFT SLAM were apparently set by Se, Lowe and Little based on their experience with the SIFT system. It is not at all clear that they were optimally chosen. Given the uncertainty measurements produced by virtually all metric SLAM algorithms, however, it should be possible to search for better threshold values, perhaps on a per-environment basis.

A naive scheme for doing so would involve running SIFT SLAM (or perhaps a very similar SIFT-based SLAM setup using FastSLAM at its core) and measuring position or total uncertainty (e.g. covariance magnitude or particle entropy) for various choices of validation gate values, matching thresholds, and landmark tracking miss count thresholds. The thresholds which produced the best output at the end of a run could then be used in future SLAM scenarios in a similar environment. Alternately, if the cost of evaluating multiple thresholds is small, the thresholds could be set adaptively (perhaps every few frames) over the course of a given SLAM task. This might help improve SLAM performance in tasks where SIFT performance varies, such as transitions from regions with identifying marks on walls to fairly homogeneous ones.
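The naive scheme amounts to a grid search. A sketch, assuming a hypothetical run_slam callable that executes one SLAM run with the given thresholds and returns a scalar uncertainty measure (such as final covariance magnitude):

```python
import itertools

def tune_thresholds(run_slam, gate_sizes, match_thresholds, miss_limits):
    """Naive grid search over validation-gate and tracking thresholds.

    run_slam(gate, match_t, miss_t) is a hypothetical interface, not part
    of SIFT SLAM; it is assumed to return a scalar uncertainty measure.
    Returns (best_uncertainty, gate, match_t, miss_t)."""
    best = None
    for gate, match_t, miss_t in itertools.product(
            gate_sizes, match_thresholds, miss_limits):
        uncertainty = run_slam(gate, match_t, miss_t)
        # Keep the threshold combination with the lowest uncertainty.
        if best is None or uncertainty < best[0]:
            best = (uncertainty, gate, match_t, miss_t)
    return best
```

The per-environment variant simply re-runs this search on runs recorded in each environment; the adaptive variant would re-evaluate a few candidate settings every few frames instead of over whole runs.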

Of course, it is not clear that minimizing an uncertainty measure will actually improve SLAM accuracy, as the world might conspire to produce sensor readings with lower uncertainty in an incorrect location. In fact, the motion/validation gate problem I discuss above is an instance of this, where minimizing uncertainty


via standard SLAM approaches naively results in erroneous position estimates. However, it is not clear if this will be a problem in real world applications, and in the worst case tuning thresholds during training SLAM runs where absolute position can be tracked might still improve SLAM performance.

3.2 Identifying Outliers

The central idea behind most metric SLAM systems is that motion errors should be consistently reflected in deviations between predicted and measured landmark positions (modulo sensor uncertainty). Therefore, even in the presence of slowly moving objects, the vast majority of the landmarks in a scene should deviate from predicted values in a consistent manner. Relative motion of some landmarks should show up as a cluster of landmarks with a different (but internally consistent) kind of deviation from motion error.
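A crude separation of this kind might be sketched as follows, with a median-based split standing in for a proper clustering method and an illustrative tolerance value:

```python
import numpy as np

def split_deviation_clusters(deviations, tol=2.0):
    """Separate landmark deviations (measured minus predicted positions)
    into a dominant cluster, assumed to reflect robot motion error, and
    the remainder, assumed to reflect real landmark motion.

    The component-wise median is a robust stand-in for a proper
    clustering step; tol is an illustrative threshold, not a SIFT SLAM
    parameter. Returns boolean masks (static, moving)."""
    deviations = np.asarray(deviations, dtype=float)
    # The median resists the influence of a minority of moving landmarks.
    center = np.median(deviations, axis=0)
    dist = np.linalg.norm(deviations - center, axis=1)
    static = dist <= tol
    return static, ~static
```

Three landmarks sharing a common one-pixel deviation and two deviating by six pixels split cleanly into a "motion error" cluster and a "moving landmark" cluster.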

If these clusters could be programmatically detected, they could be used to rule out deviations from prediction which are more likely to correspond to landmark motion than robot motion error. Identifying these clusters via PCA or even some kind of regression might be possible. One interesting question is whether or not the least-squares fit with rejection of outliers in SIFT SLAM actually produces this effect to some degree.

3.3 Identifying Motion through Optical Flow

A central problem for SLAM in dynamic environments is the rejection of landmarks which actually lie on moving objects. The rationale is simple: if landmarks seem to move, any SLAM scheme will interpret this motion as an error in its position estimates, and erroneously revise them.

The problem, then, is to avoid selecting landmarks on things which can move. A reasonable first step towards this goal, reachable given current machine vision technology, is to avoid selecting landmarks on objects which are moving. As outlined above, this problem is currently intractable when the robot performing SLAM is moving, as it cannot distinguish between small errors in its sensor apparatus or motion model and small motions. A workable solution, however, may be to only assign new landmarks when the robot is stationary, and sensor uncertainty is therefore small, so all measured motion of landmark candidates is likely to correspond to real landmark motion.

The measurement of landmark motion could proceed in at least two ways. First, note that the latest incarnation of SIFT includes keypoint descriptors, a better measure of feature content than the SIFT orientations used in SIFT SLAM. If keypoint descriptors can be matched reliably, it would be interesting to see how SLAM performance is affected by only using matching thresholds on SIFT properties, not positional properties. If this did not result in large numbers of false matches, the SIFT system could then use landmark matching to estimate landmark motion when the robot is stationary, and reject those landmarks which moved more than some very small amount. This would essentially amount to setting the validation gate to some very low value as long as the robot is stationary, driving the speed needed to confuse the system to a value low enough so as to be irrelevant.
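As a sketch, the stationary-robot rejection rule might look like this; the gate values and the match representation are illustrative assumptions of mine, not taken from SIFT SLAM:

```python
def reject_moving_landmarks(matches, moving, gate=10.0, stationary_gate=0.5):
    """Filter landmark matches under the stationary-robot rule.

    matches is a list of (landmark_id, displacement_px) pairs giving each
    landmark's measured displacement; while the robot is stationary, a
    much tighter gate is used, so nearly any measurable displacement
    marks the landmark as moving and removes it. Gate values are
    illustrative, not SIFT SLAM's."""
    limit = gate if moving else stationary_gate
    return [lid for lid, disp in matches if disp <= limit]
```

While moving, a 3-pixel displacement is tolerated as possible odometry error; while stationary, the same displacement rejects the landmark outright.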

As long as the system did not acquire new landmarks while moving (stopping to acquire new landmarks whenever necessary), the system would be much less foolable by motion in the environment. Given the density of landmarks extracted via SIFT (as demonstrated in Se, Lowe and Little's paper), this should not


The system executable is called simple. It accepts two command-line arguments. The first is the prefix for the filename of all the frames, and the second is the number of frames. Frame filenames must be of the form <prefix><frame number>.pgm; frame numbers must start at 0. For example, to run the system on the first 40 frames of a movie where the first frame is foo000000.pgm, the command line is simple foo 40.

The matching criteria used by the system3 are as follows (with all counts initially 0):

1. If a feature is matched at an image point with flow below the current threshold, set its counts to 0.

2. If a feature is matched at an image point with flow above the current threshold, increment its moved count. If this exceeds the movement threshold, discard it.

3. If a feature is not matched at all, increment its missed count. If this exceeds the miss threshold, discard it.

Note that these thresholds are analogous to those in the SIFT landmark tracking system. The results of running this system on frames 500-550 of the Pool Shark entry to the IRTC - extracted from the included file poolshrk.mpg - for the first 39 frames are shown below:

    Courtesy of Neil Alexander. Used with permission.

The image on the left shows all keypoints found in the first frame. The middle image shows those keypoints which survived given a flow threshold of 30 (essentially arbitrarily large flow), a moved threshold of 5 frames and a miss threshold of 10 frames. The image on the right shows those keypoints which survived given a flow threshold of 0.1 (essentially no flow). All landmarks on the moving person have been pruned.
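The matching criteria amount to simple per-feature bookkeeping, which might be sketched as follows; the threshold values are taken from this demonstration (flow 0.1, moved 5, miss 10), and the class interface is my own:

```python
FLOW_THRESHOLD = 0.1   # flow below this counts as stationary (right image)
MOVED_LIMIT = 5        # frames of motion before a feature is discarded
MISS_LIMIT = 10        # frames unmatched before a feature is discarded

class TrackedFeature:
    """Bookkeeping for one keypoint under the three matching criteria."""

    def __init__(self):
        self.moved = 0
        self.missed = 0
        self.alive = True

    def update(self, matched, flow=None):
        if not self.alive:
            return
        if not matched:
            # Criterion 3: unmatched features accumulate misses.
            self.missed += 1
            if self.missed > MISS_LIMIT:
                self.alive = False
        elif flow >= FLOW_THRESHOLD:
            # Criterion 2: matched but moving features accumulate moves.
            self.moved += 1
            if self.moved > MOVED_LIMIT:
                self.alive = False
        else:
            # Criterion 1: a stationary match resets both counts.
            self.moved = 0
            self.missed = 0
```

A feature matched with high flow for six consecutive frames, or unmatched for eleven, is discarded; stably matched stationary features survive indefinitely.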

This serves as a useful proof of concept of the usefulness of an optical flow scheme for rejecting features on moving bodies.

4.3 Use of Optical Flow in Place Recognition

The second system I built tests the usefulness of optical flow for rejecting spurious features in the context of a naive place recognition system. It should be noted that the usefulness of flow in the context of SLAM is not at all measured by these results; a detailed explanation of my rationale is included below.

The naive place recognition system, embodied in the executable placerec, operates on two input movies (specified as for the simple optical flow demo above). The first is the training movie, containing an image

3In traditional bad-quality research coding style, these are set in the code; the current build of simple is set to an extremely forgiving optical flow threshold essentially corresponding to no use of flow. C'est la vie.


sequence from which features will be extracted defining the place. The second is the test movie which will be compared against the training sequence on a frame-by-frame basis.

The place recognition system is extremely naive. It builds up a place definition by performing the same process as the simple demo above - that is, it defines a place to be a collection of keypoints extracted from the first place frame which were stably matched (and stationary) over the entire training sequence. For each test frame, it applies the following procedure:

1. Extract the keypoints from the frame.

2. For each keypoint, determine if it lies in a region with flow below a certain threshold. If so, keep it.

3. Store the number of kept keypoints.

4. For each kept keypoint, check if it matches a keypoint in the place definition. If so, increment the number of matches.

5. If the percent of matches - that is, the percent of kept keypoints which matched a keypoint in the place definition - exceeds a threshold, declare the frame to be from the training place.

The idea is straightforward enough: determine which keypoints correspond to environmental rather than transient features, and check how many of them match. It is naive chiefly in that it does not attempt to weight features according to how informative they are, and that it has a single simple threshold for declaring a match.
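The per-frame procedure might be sketched as follows. The flow_at lookup and the equality-based keypoint comparison are assumed stand-ins for the real optical flow computation and SIFT matching; the default flow threshold of 1.0 and match threshold of 20% mirror values used in the experiments.

```python
def recognize_frame(frame_keypoints, place_keypoints, flow_at,
                    flow_threshold=1.0, match_threshold=0.20):
    """Naive per-frame place recognition following steps 1-5 above.

    flow_at(kp) is an assumed interface returning the optical flow
    magnitude at a keypoint's location; keypoints are compared by simple
    equality as a stand-in for SIFT descriptor matching."""
    # Step 2: keep only keypoints in low-flow (non-transient) regions.
    kept = [kp for kp in frame_keypoints if flow_at(kp) < flow_threshold]
    if not kept:
        return False
    # Step 4: count kept keypoints matching the place definition.
    matches = sum(1 for kp in kept if kp in place_keypoints)
    # Step 5: declare a match if the percentage exceeds the threshold.
    return matches / len(kept) >= match_threshold
```

With a high-flow keypoint pruned, two of the three kept keypoints matching the place definition comfortably clears a 20% threshold but not a 90% one.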

Snapshots of the features selected by the system from a movie taken in an apartment in Somerville are shown below. The image on the top left shows the keypoints extracted from the first 30 frames of the training movie (dh-still). The image on the top right shows the features identified in frame 20 without optical flow pruning; note the features on the interloper. The image on the bottom left shows the optical flow generated by the interloper's distracting dance moves, and the image on the bottom right shows the features identified given an optical flow threshold of 1.0.


Given a matching threshold of 20%, optical flow pruning improved recognition performance by roughly 13% (from a 43% frame recognition rate to a 56% frame recognition rate on a sequence of 30 frames). The average percent of matches increased by roughly 2%.

Optical flow only resulted in a small increase in performance primarily because the real-world test movies did not include enough motion distractors. A secondary factor was, again, the naive nature of the place recognition scheme. An HMM approach without flow would either be confused by features on moving objects (not present in all footage of a given place) or would assign those features low probability; optical flow could potentially prune those features entirely.

The system was also tested on video footage from MIT's Lobby 7 as well as other sequences from the same apartment in Somerville. For the majority of these scenes, place recognition performance in general was poor enough (that is, few enough keypoints matched unambiguously) that the system couldn't afford to discard keypoints based on optical flow; the few points discarded due to errors in the optical flow computation were too costly4.

This poor performance was primarily due to two factors:

1. The resolution of the movies produced by the Quickcam was 160x128; the SIFT feature count was fairly low to begin with.

2. The camera placement compounded the low feature problem by including large homogeneous regions. These regions result in ambiguous matches (in the absence of position information as provided by stereopsis) which are rejected by the keypoint matching procedure I outlined above.

4.4 Choice of Optical Flow Scheme

It should be noted that it would have also been possible to use SIFT feature matching to approximate image motion instead of appealing to Horn-Schunck. This was not done initially as the idea did not occur to me. However, it is not clear that SIFT matching would have performed better.

4Detailed results on these test movies were not tabulated, as they were not relevant to the main thrust of the project, and would not be too informative in any case given the simplicity of the place recognition approach.


The strength in SIFT lies in its feature density, permitting substantial matching errors such as omissions or mismatches to be dwarfed by the number of correct matches in recognition tasks. The precise location of individual SIFT features, however, is not terribly stable; changes in the overall image content such as the addition of new objects or small fluctuations in brightness can cause individual features to jump several pixels or vanish entirely. These errors are evident in the feature location movies I provided.

In SIFT SLAM, the errors introduced by these jumps are minimized due to the use of 3D feature position in all matching tasks along with moderately stringent rejection thresholds. In a typical place recognition scenario, however, the camera position is not known; the only constraint is that the view from which a place is to be recognized overlaps substantially with training views. Geometric information is therefore unavailable and cannot be used to aid in the matching task.

While I could have designed a special-purpose SIFT matching approach incorporating the particular constraints of optical flow in the place-recognition context, this did not seem particularly relevant to my overall goals. Horn-Schunck's optical flow scheme, on the other hand, is formulated to incorporate substantial geometric information.

5 Conclusions

In this paper, I argued that validation gates impose fundamental limits on SLAM accuracy, especially in the presence of motion. I suggested several techniques for addressing this issue as well as other error handling issues in SLAM, such as tuning thresholds to minimize uncertainty, detecting outlying displacements, and pruning landmarks based on optical flow. Finally, I built and described two systems using optical flow to prune landmarks and tested them in the context of a naive approach to place recognition.

The conjectures in this paper cannot be tested without a reimplementation of the SIFT SLAM system. A successful reimplementation which was capable of navigating in noisy environments, such as Lobby 7 during lunchtime, would constitute a significant contribution in robotics. I have confidence that some combination of the ideas in this paper (or perhaps even an only slightly modified version of SIFT SLAM itself) would be able to achieve that goal, and I hope to have the chance to investigate this in the future.
