Learning to Navigate Through Crowded Environments

Peter Henry¹, Christian Vollmer², Brian Ferris¹, Dieter Fox¹
Tuesday, May 4, 2010
¹University of Washington, Seattle, USA
²Ilmenau University of Technology, Germany
The Goal

Enable robot navigation within crowded environments
Motivation

Robots should move naturally and predictably within crowded environments
- Move amongst people in a socially transparent way
- More efficient and safer motion

Humans trade off various factors
- Move with the flow
- Avoid high-density areas
- Walk on the left/right side
- Reach the goal
Challenge

Humans naturally balance various factors
- It is relatively easy to list the factors
- But people cannot specify how they make the tradeoff

Previous work typically uses heuristics with hand-tuned parameters
- Shortest path with collision avoidance [Burgard et al., AI 1999]
- Track and follow a single person [Kirby et al., HRI 2007]
- Follow people moving in the same direction [Mueller et al., CogSys 2008]
Contribution

- Learn how humans trade off various factors
- A framework for learning to navigate as humans do within crowded environments
- Extension of Maximum Entropy Inverse Reinforcement Learning [Ziebart et al., AAAI 2008] to incorporate:
  - Limited locally observable area
  - Dynamic crowd flow features
Markov Decision Processes

- States
- Actions
- Rewards / Costs
- (Transition Probabilities)
- (Discount Factor)

(Diagram: example states S0 through S3 leading to a Goal state)
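As a concrete illustration, the MDP ingredients listed above can be written down as plain data structures. This is a minimal sketch; the class and field names are hypothetical, not from the paper:

```python
# Minimal sketch of the navigation MDP's components (illustrative names).
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    cell: tuple       # grid cell, e.g. (row, col)
    orientation: int  # one of 8 discrete headings

@dataclass(frozen=True)
class Action:
    src: State        # s_i
    dst: State        # s_j, an adjacent cell

# A two-state fragment of the chain S0 -> S1 from the diagram.
s0 = State(cell=(0, 0), orientation=0)
s1 = State(cell=(0, 1), orientation=0)
a01 = Action(src=s0, dst=s1)
print(a01.dst.cell)  # → (0, 1)
```

Transition probabilities and a discount factor appear in parentheses on the slide because this formulation treats navigation as deterministic shortest-path planning over costs.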
Navigating in a Crowd as an MDP

- States s_i: in the crowd scenario, a grid cell plus orientation
- Actions a_{i,j} from s_i to s_j: in the crowd scenario, a move to an adjacent cell
- Cost: an unknown linear combination of action features
  - Cost weights to be learned: θ
  - Path: τ
  - Features: f_τ

cost(τ|θ) = θ·f_τ = Σ_{a_{i,j} ∈ τ} θ·f_{a_{i,j}}
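The linear cost above is just a dot product summed along the path. A minimal sketch, with illustrative feature vectors and weights:

```python
# Sketch of the linear path cost: cost(τ|θ) = Σ_{a∈τ} θ·f_a.
def action_cost(theta, f_a):
    """Dot product θ·f_a for a single action's feature vector."""
    return sum(t * f for t, f in zip(theta, f_a))

def path_cost(theta, path_features):
    """Total cost of a path τ given per-action feature vectors."""
    return sum(action_cost(theta, f_a) for f_a in path_features)

theta = [1.0, 0.5, 2.0]      # weights for, say, distance, flow, density
tau = [[1.0, 0.2, 0.0],      # features for each action along the path
       [1.0, 0.0, 0.3]]
print(path_cost(theta, tau)) # ≈ 2.7
```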
Inverse Reinforcement Learning

Inverse Reinforcement Learning (IRL):
- Given: the MDP structure and a set of example paths
- Find: the reward function resulting in the same behavior
- (Also called "Inverse Optimal Control")

Has been previously applied with success:
- Lane changing [Abbeel, ICML 2004]
- Parking lot navigation [Abbeel, IROS 2008]
- Driving route choice and prediction [Ziebart, AAAI 2008]
- Pedestrian route prediction [Ziebart, IROS 2009]
Maximum Entropy IRL

Exponential distribution over paths:

P(τ|θ) = e^{−θ·f_τ} / Σ_{τ′} e^{−θ·f_{τ′}}

Learning:

θ* = argmax_θ Σ_{τ∈T} log P(τ|θ)

Gradient: match observed and expected feature counts:

∇F = f̃ − Σ_{a_{i,j}} D_{a_{i,j}} f_{a_{i,j}}
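These formulas can be sanity-checked on a tiny example by enumerating candidate paths explicitly. Real MaxEnt IRL computes the expected feature counts with dynamic programming over the MDP (via the action visitation frequencies D_{a_{i,j}}) rather than by enumeration; the values here are illustrative:

```python
# Toy MaxEnt IRL: P(τ|θ) ∝ exp(−θ·f_τ) over an enumerable path set,
# and the gradient = observed minus expected feature counts.
import math

def path_probs(theta, paths):
    """paths: list of path feature vectors f_τ; returns P(τ|θ) for each."""
    weights = [math.exp(-sum(t * f for t, f in zip(theta, f_tau)))
               for f_tau in paths]
    Z = sum(weights)  # partition function
    return [w / Z for w in weights]

def gradient(theta, paths, observed):
    """∇F = f̃ − Σ_τ P(τ|θ)·f_τ, the feature-matching gradient."""
    probs = path_probs(theta, paths)
    expected = [sum(p * f_tau[k] for p, f_tau in zip(probs, paths))
                for k in range(len(theta))]
    return [o - e for o, e in zip(observed, expected)]

paths = [[1.0, 0.0], [0.5, 0.8]]  # f_τ for two candidate paths
theta = [0.0, 0.0]                # zero weights → uniform distribution
print(path_probs(theta, paths))   # → [0.5, 0.5]
print(gradient(theta, paths, observed=[1.0, 0.0]))
```

At the optimum the gradient vanishes: the expert's observed feature counts equal the expected feature counts under the learned distribution.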
Locally Observable Features

It is unrealistic to assume the agent has global knowledge of the crowd
- Contrast: the Continuum Crowds simulator explicitly finds a global solution for the entire crowd
- We do assume knowledge of the map itself

Training: only provide flow features within a small radius around the current position
- Assumes these are the features available to the "expert"
- A single demonstration path becomes many small demonstrations of locally motivated paths
Locally Observable Dynamic Features

Crowd flow changes as the agent moves

Locally observable dynamic feature training:
1. Update flow features within the local horizon
2. Compute the feature gradient within the grid
3. Perform a stochastic update of the weights
4. Take the next step of the observed path
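The four steps above amount to a stochastic-gradient loop over a demonstration trace. The helpers below are trivial stand-ins (hypothetical, not the paper's implementation) for flow-feature extraction and the per-step gradient:

```python
# Sketch of the training loop; helpers are illustrative placeholders.

def update_local_flow(state, horizon):
    # Stand-in: return fixed flow features near `state` (step 1).
    return [1.0, 0.5]

def feature_gradient(theta, features, observed):
    # Stand-in for the per-step gradient: observed minus current features.
    return [o - f for o, f in zip(observed, features)]

def train_on_trace(theta, trace, horizon, lr=0.1):
    """One pass over a demonstration path, following steps 1-4."""
    for state, observed in trace:
        features = update_local_flow(state, horizon)        # step 1
        grad = feature_gradient(theta, features, observed)  # step 2
        theta = [t + lr * g for t, g in zip(theta, grad)]   # step 3
        # step 4: the loop advances to the next observed step
    return theta

trace = [((0, 0), [1.2, 0.4]), ((0, 1), [1.1, 0.6])]
print(train_on_trace([0.0, 0.0], trace, horizon=3))
```

The key point is that the weights are updated after every observed step, using only the features visible within the local horizon at that moment.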
Locally Observable Dynamic IRL

The path probability decomposes into many short paths over the current features in the locally observable horizon:

P(τ|θ) = (1/Z(θ)) · e^{−θ · Σ_t Σ_{0≤h<H} f^t_{a_{t+h}}}

(The outer sum decomposes the path over timesteps t; H is the local horizon; f^t_{a_{t+h}} are the features for actions within the horizon at time t.)
Locally Observable Dynamic Gradient

- Uses the current estimate of the features at time t
- Computes the gradient only within the local horizon H

∇F^t = f̃^t − Σ_{a_{i,j}∈H} D^t_{a_{i,j}} f^t_{a_{i,j}}

(f̃^t: observed features within H; the sum: expected features for actions within H.)
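A minimal sketch of this horizon-restricted gradient, with the visitation frequencies D^t_a supplied as precomputed illustrative values (in practice they come from forward inference over the MDP within the horizon):

```python
# ∇F^t = f̃^t − Σ_{a∈H} D^t_a · f^t_a, restricted to actions in horizon H.
def local_gradient(observed_f, visitation, features):
    """observed_f: f̃^t; visitation: {action: D^t_a}; features: {action: f^t_a}."""
    dim = len(observed_f)
    expected = [0.0] * dim
    for a, D in visitation.items():          # only actions within H
        for k in range(dim):
            expected[k] += D * features[a][k]
    return [o - e for o, e in zip(observed_f, expected)]

visitation = {"a01": 0.7, "a02": 0.3}                  # D^t_a within H
features = {"a01": [1.0, 0.2], "a02": [1.0, 0.8]}      # f^t_a within H
print(local_gradient([1.0, 0.3], visitation, features))
```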
Map and Features

Each grid cell encompasses 8 oriented states
- Allows for flow features relative to orientation

Features:
- Distance
- Crowd flow speed and direction
- Crowd density
- (many others possible…)

Chosen as being reasonable to obtain from current sensors
Crowd Simulator
[Continuum Crowds, Treuille et al., SIGGRAPH 2006]
Experimental Setup

We used ROS [Willow Garage] to integrate the crowd simulator with our IRL learner and planner:
1. Extract individual crowd traces and observable features
2. Learn feature weights with our IRL algorithm
3. Use the weights for a simulated robot in test scenarios
- Planning is A* search
- Re-planning occurs at every grid cell with updated features
- The robot is represented to the crowd simulator as just another person, for realistic reactions from the crowd
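For reference, a minimal A* grid search of the kind the planner uses. Uniform step costs and the Manhattan heuristic are illustrative; the learned planner would instead score each action by θ·f with the current local features:

```python
# Minimal A* over a 4-connected grid (illustrative costs and heuristic).
import heapq

def astar(start, goal, blocked, width, height):
    def h(c):  # admissible Manhattan-distance heuristic
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])
    frontier = [(h(start), 0, start, [start])]  # (f, g, cell, path)
    seen = set()
    while frontier:
        _, g, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        if cell in seen:
            continue
        seen.add(cell)
        x, y = cell
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in blocked:
                heapq.heappush(frontier,
                               (g + 1 + h((nx, ny)), g + 1, (nx, ny), path + [(nx, ny)]))
    return None  # goal unreachable

print(astar((0, 0), (2, 2), blocked={(1, 1)}, width=3, height=3))
```

Re-planning at every grid cell simply means calling the search again from the current cell with freshly updated crowd-flow features.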
Quantitative Results

Measure similarity to the "human" path:
- Shortest Path (baseline): ignores the crowd
- Learned Path: the path from our learned planner

Mean / Maximum Difference: over all path cells, the difference to the closest "human" path cell

                     Shortest Path   Learned Path   Improvement
Mean Difference      1.4             0.9            35%
Maximum Difference   3.3             2.3            30%

(The difference is significant at the p = 0.05 level.)
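A sketch of how such a metric can be computed: for each cell of the evaluated path, take the distance to the closest cell of the "human" path, then report the mean and maximum. The exact distance convention in the paper is not spelled out on the slide; Euclidean distance in grid cells is assumed here:

```python
# Mean/maximum per-cell difference between a path and a reference path.
def path_difference(path, human_path):
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    per_cell = [min(dist(c, h) for h in human_path) for c in path]
    return sum(per_cell) / len(per_cell), max(per_cell)

human = [(0, 0), (1, 0), (2, 0)]
robot = [(0, 1), (1, 1), (2, 0)]
mean_d, max_d = path_difference(robot, human)
print(mean_d, max_d)  # mean ≈ 0.67, max = 1.0
```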
Future Work

Train on real crowd data
- Overhead video + tracking?
- Wearable sensors to mimic robot sensor input?

Implement on an actual robot
- Is the method effective for raw sensor data?
- Which are the most useful features?

Pedestrian prediction
- Compare with / incorporate other recent work [Ziebart, IROS 2009]
Conclusion

- We have presented a framework for learning to imitate human behavior from example traces
- We learn weights that produce paths matching observed behavior from whatever features are made available
- Our inverse reinforcement learning algorithm handles locally observable dynamic features
- The resulting paths are more similar to observed human paths