Stochastic Tracking of Humans
Michael J. Black
http://www.cs.brown.edu/~black
Department of Computer Science, Brown University
Collaborators
Hedvig Sidenbladh, Royal Institute of Technology (KTH), Sweden
Dirk Ormoneit and Trevor Hastie, Dept. of Statistics, Stanford University
David Fleet, Xerox PARC and Queen’s University
Allan Jepson, University of Toronto
Goal: 3D Human Motion
* 3D articulated model
* Perspective projection
* Monocular sequence
* Unknown, cluttered environment
* Infer 3D human motion from 2D image motion.
Overview
* Why is 3D human motion important?
* Why is recovering it hard?
* A Bayesian approach
 - generative model
 - robust likelihood function
 - temporal prior model (learning)
 - stochastic search (particle filtering)
* Where are we going?
 - Recent advances & state of the art.
 - What remains to be done?
Why is it Important?
Applications
• Human-Computer Interaction
• Surveillance
• Motion capture (games and animation)
• Video search/annotation
• Work practice analysis
Social display of puzzlement
* detect moving regions
* estimate motion
* model articulated objects
* model temporal patterns of activity
* interpret the motion
Why is it Hard?
The appearance of people can vary dramatically.
Bones and joints are unobservable (muscle, skin, clothing hide the underlying structure).
(inference)
Why is it hard?
People can appear in arbitrary poses.
They can deform in complex ways.
Occlusion results in ambiguities and multiple interpretations.
Why is it hard?
Geometrically under-constrained.
Other Problems
* non-linear dynamics of limbs
* similarity of appearance of different limbs (matching ambiguities)
* image noise
* outliers
Our models are approximations. Image changes that are not modeled (e.g. clothing deformation) will be outliers.
Common Assumptions (to be avoided)
* Multiple Cameras (additional constraints, occlusion)
* Color Images (locate face and hands)
* Known Background (background subtraction to locate person)
* Batch process an entire sequence.
* Known Initialization
Requirements
1. Represent uncertainty and multiple hypotheses.
2. Model non-linear dynamics of the body.
3. Exploit image cues in a robust fashion.
4. Integrate information over time.
5. Combine multiple image cues.
Bayesian Inference
Build models of human form and motion. Learn priors over model parameters:
p(model)
Exploit cues in the images. Model robust likelihoods:
p(image cue | model)
Represent the posterior distribution:
p(model | cue) ∝ p(cue | model) p(model)
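The proportionality on this slide can be illustrated on a small discrete set of candidate models. The following is a minimal sketch; the three hypotheses and their prior and likelihood values are made-up numbers, not values from the talk.

```python
import numpy as np

def posterior(prior, likelihood):
    """Return p(model | cue) for a discrete set of candidate models."""
    unnormalized = np.asarray(likelihood, dtype=float) * np.asarray(prior, dtype=float)
    return unnormalized / unnormalized.sum()

prior = np.array([0.5, 0.3, 0.2])        # hypothetical p(model)
likelihood = np.array([0.1, 0.4, 0.4])   # hypothetical p(cue | model)
post = posterior(prior, likelihood)      # normalized product of the two
```

The most probable prior hypothesis ends up least probable here because the cue disagrees with it, which is the behaviour the tracking framework relies on.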
Problems
A simple articulated human model may have 30+ parameters (e.g. joint angles; 60+ with velocities).
Models of human action are non-linear and likelihood models will be multi-modal.
Key challenges (common to other domains):
• representation,
• learning, and
• search
in high dimensional spaces.
Bayesian Formulation
Represent a distribution over 3D poses.
* define generative model of image appearance
* multi-modal posterior over model parameters
 - sampled representation
 - particle filtering approach
* focus on image motion as a cue (adding edges, …)
Generative Model: Shape
* 3D Articulated Body Model
* pinhole camera
* parameter vector $\phi$
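Generative Model: Motion
Here $\phi_t$ are the shape parameters, $R_t$ is the appearance (texture) on the model, and $I_t$ is the image.
Frame t-1: $R_t = M^{-1}(I_{t-1}, \phi_{t-1})$
Projection of image texture onto the 3D model
Frame t: $I_t = M(R_t, \phi_t)$
Projection of model appearance into image coordinates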
Appearance Model
Could be many things:
* template (Cham & Rehg ‘99)
* eigen-model (Sidenbladh et al. ‘00)
* texture model
* filter responses (edges, ridges, …)
* learned over time
Simple probabilistic model (Markov assumption):
$p(R_t \mid I_{t-1}, R_{t-1}, \phi_{t-1}) = p(R_t \mid I_{t-1}, \phi_{t-1}) = \delta(R_t - M^{-1}(I_{t-1}, \phi_{t-1}))$
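Noise Model
Generative model:
$I_t(\mathbf{x}_j) = M(R_t, \phi_t, \mathbf{x}_j) + \eta$
Mixture of Gaussian and uniform outlier distribution:
$p(\eta; \sigma) = (1 - \epsilon)\, G(\sigma) + \epsilon c$, with $c = 1/256$ and $0 \le \epsilon \le 1$
$\sigma$ is a function of surface orientation.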
Generative Model: Temporal
First order Markov assumption on angles, $\phi$, and angular velocity, $V$:
$p(\phi_t \mid \phi_{t-1}, V_{t-1})$
$p(V_t \mid \phi_{t-1}, V_{t-1}) = p(V_t \mid V_{t-1})$
Explore two models of human motion:
* general smooth motion or,
* action-specific motion (walking)
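Bayesian Formulation
Posterior over shape, velocity, and appearance given an image sequence:
$p(\phi_t, V_t, R_t \mid \vec{I}_t) = \kappa\, p(I_t \mid \phi_t, R_t) \int \big[ p(\phi_t \mid \phi_{t-1}, V_{t-1})\, p(V_t \mid V_{t-1})\, p(R_t \mid I_{t-1}, \phi_{t-1})\, p(\phi_{t-1}, V_{t-1}, R_{t-1} \mid \vec{I}_{t-1}) \big]\, d\phi_{t-1}\, dV_{t-1}\, dR_{t-1}$
* $p(I_t \mid \phi_t, R_t)$: likelihood of observing the image given the shape and appearance parameters
* $p(\phi_t \mid \phi_{t-1}, V_{t-1})\, p(V_t \mid V_{t-1})\, p(R_t \mid I_{t-1}, \phi_{t-1})$: temporal model
* $p(\phi_{t-1}, V_{t-1}, R_{t-1} \mid \vec{I}_{t-1})$: posterior from previous time instant
Here $\vec{I}_t$ denotes the image sequence up to time $t$ and $\kappa$ is a normalizing constant.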
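Robust Likelihood
For $n$ random pixels $\mathbf{x}_{j,i}$ from limb $j$ compute:
$p_{image,j} = \frac{\epsilon}{256} + \frac{1 - \epsilon}{\sqrt{2\pi}\,\sigma(\alpha_j)} \exp\left(-\sum_{i=1}^{n} \frac{(I_t(\mathbf{x}_{j,i}) - \hat{I}_t(\mathbf{x}_{j,i}))^2}{2\sigma^2(\alpha_j)}\right)$
where $\hat{I}_t(\mathbf{x}_{j,i}) = M(R_t, \phi_t; \mathbf{x}_{j,i})$
$p(I_t \mid \phi_t, R_t) = \prod_j p_{image,j}$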
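The per-limb robust likelihood mixes a Gaussian on the brightness residuals with a uniform outlier term over 256 gray levels. A minimal sketch of that computation follows; the function name, the default outlier probability `eps`, and treating `sigma` as a plain number (rather than a function of surface orientation) are assumptions made for illustration.

```python
import numpy as np

def limb_likelihood(observed, predicted, sigma, eps=0.1):
    """Robust likelihood of one limb from n sampled pixel intensities.

    observed, predicted: gray values in [0, 255] at the sampled pixels.
    sigma: Gaussian standard deviation (a plain number in this sketch).
    eps:   outlier probability (hypothetical default).
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sq_err = np.sum((observed - predicted) ** 2)
    gauss = np.exp(-sq_err / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    # The uniform term keeps the likelihood from collapsing to zero on outliers.
    return eps / 256.0 + (1.0 - eps) * gauss

good = limb_likelihood([100, 120], [101, 119], sigma=10.0)
bad = limb_likelihood([100, 120], [200, 20], sigma=10.0)
```

A badly predicted limb bottoms out at `eps / 256` instead of zero, so a single unmodeled image change cannot veto an otherwise good pose hypothesis.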
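Temporal Model: Smooth Motion
$p(\phi_{i,t} \mid \phi_{i,t-1}, V_{i,t-1}) = G(\phi_{i,t} - (\phi_{i,t-1} + V_{i,t-1}),\, \sigma_i^{\phi})$ if $\phi_{i,t} \in [\phi_{i,min}, \phi_{i,max}]$, and $0$ otherwise
$p(V_{i,t} \mid V_{i,t-1}) = G(V_{i,t} - V_{i,t-1},\, \sigma_i^{V})$
where $G(x, \sigma)$ is a zero-mean Gaussian with standard deviation $\sigma$ evaluated at $x$.
* individual angles and velocities assumed independent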
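Sampling from this smooth-motion prior for a single joint can be sketched as below. This is an illustrative sketch only: rejection sampling is used to respect the joint limits (which renormalizes the truncated Gaussian), and the function name, the `max_tries` cap, and the clamping fallback are assumptions, not part of the talk.

```python
import numpy as np

def sample_smooth_motion(phi_prev, v_prev, sigma_phi, sigma_v,
                         phi_min, phi_max, rng, max_tries=100):
    """Draw (phi_t, V_t) for one joint angle under the smooth-motion prior."""
    v_new = rng.normal(v_prev, sigma_v)       # velocity follows a Gaussian random walk
    mean = phi_prev + v_prev                  # constant-velocity prediction of the angle
    for _ in range(max_tries):
        phi_new = rng.normal(mean, sigma_phi)
        if phi_min <= phi_new <= phi_max:     # zero probability outside the joint limits
            return phi_new, v_new
    # Fallback (an assumption of this sketch): clamp if rejection keeps failing.
    return float(np.clip(mean, phi_min, phi_max)), v_new

rng = np.random.default_rng(0)
draws = [sample_smooth_motion(0.0, 0.1, 0.5, 0.05, -0.2, 0.3, rng) for _ in range(200)]
```

Because angles and velocities are assumed independent across joints, a full-body sample would simply call this once per joint.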
What does the posterior look like?
Shoulder: 3 dof (x, y, z)
Elbow: 1 dof
Elbow bends
Particle Filtering
* large literature (Gordon et al. ‘93, Isard & Blake ‘96, …)
* non-Gaussian posterior approximated by N discrete samples
* explicitly represent the ambiguities
* exploit stochastic sampling for tracking
$s_t^{(n)} = [\phi_t^{(n)}, V_t^{(n)}, R_t^{(n)}]$, $n = 1, \ldots, N$ ($N \approx 10^4$)
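Particle Filter
Posterior at $t-1$: $p(\phi_{t-1}, V_{t-1} \mid \vec{I}_{t-1})$
sample
Temporal dynamics: $p(\phi_t, V_t \mid \phi_{t-1}, V_{t-1})$
sample
Likelihood: $p(I_t \mid \phi_t)$
normalize
Posterior at $t$: $p(\phi_t, V_t \mid \vec{I}_t)$
($\vec{I}_t$: images up to time $t$)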
Arm Tracking: Smooth motion prior
Particle filter
* represents ambiguity
* propagates information over time
Display: expected value of joint angles (plot axes: x, y, z).
Full-Body Tracking
* parameter space too large
* constrain posterior to valid 3D human motions.
* learn generative models automatically from training data.
3D motion-capture data (joint angles over time; from M. Gleicher):
* segment into “movemes” (Bregler)
* train probabilistic model.
Learning Temporal Models
* Motion capture data is noisy, data is missing, activities are performed differently.
* For cyclic motion (important but special class):
 1. Detect cycles and segment
 2. Account for missing data
 3. Preserve continuity of cycles
 4. Statistical model of variation
* Approaches should generalize to non-cyclic motion.
(Dirk Ormoneit & Trevor Hastie)
Detecting Cycles
Automatically detect length of cycles.
Automatically segment and align cycles.
Modeling Cyclic Motion
Automatically align 3D data with a reference curve represented using periodically constrained regression splines.
Modeling Cyclic Motion
* Segment into cycles, compute mean curve and represent variation by performing PCA on data.
* SVD must enforce periodicity and cope with missing data.
* Iterative SVD method (from gene expression work):
 - compute SVD in Fourier domain
 - construct a rank-q approximation and take inverse Fourier transform
 - impute missing data from the approximation
 - repeat until convergence.
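The impute-and-refit loop of the iterative SVD method can be sketched as follows. This is a simplified illustration: it works directly in the data domain rather than in the Fourier domain, so it does not enforce periodicity, and the function name, the column-mean initialization, and the stopping rule are assumptions.

```python
import numpy as np

def iterative_svd_impute(data, rank, n_iter=100, tol=1e-8):
    """Fill NaN entries of `data` with values from a rank-`rank` SVD approximation."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # Start from the column means of the observed entries.
    col_means = np.nanmean(data, axis=0)
    filled = np.where(missing, col_means, data)
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank]   # low-rank reconstruction
        updated = np.where(missing, approx, data)       # observed entries stay fixed
        converged = np.max(np.abs(updated - filled)) < tol
        filled = updated
        if converged:
            break
    return filled

full = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])   # exactly rank 1
holed = full.copy()
holed[0, 0] = np.nan                                     # pretend one marker dropped out
recovered = iterative_svd_impute(holed, rank=1)
```

On this toy rank-1 matrix the missing entry is recovered from the structure of the observed ones, which is the same idea used to fill gaps in motion-capture cycles.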
Action-Specific Model
$\phi_t = \tilde{\mu}(\psi_t) + \sum_{k=1}^{q} c_{t,k}\, v_k(\psi_t)$
Mean curve: $\tilde{\mu}$. Basis curves: $v_k$.
The joint angles at time $t$ are a linear combination of the basis motions evaluated at phase $\psi_t$.
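Evaluating joint angles as a mean curve plus a weighted sum of basis curves at a given phase can be sketched like this. The array layout (curves sampled at `T` evenly spaced phases), the nearest-lower-sample lookup, and all names are assumptions made for illustration.

```python
import numpy as np

def joint_angles(phase, mean_curve, basis_curves, coeffs):
    """Mean curve plus linear combination of basis curves at a phase in [0, 1).

    mean_curve:   (T, D) samples of the mean cycle (T phases, D joint angles).
    basis_curves: (q, T, D) samples of the q basis cycles.
    coeffs:       (q,) linear coefficients.
    """
    mean_curve = np.asarray(mean_curve, dtype=float)
    basis_curves = np.asarray(basis_curves, dtype=float)
    n_phases = mean_curve.shape[0]
    # Wrap the phase so the motion is cyclic, then pick the sample at or below it.
    idx = int(np.floor((phase % 1.0) * n_phases)) % n_phases
    return mean_curve[idx] + np.tensordot(coeffs, basis_curves[:, idx], axes=1)

mean = np.zeros((4, 2))
mean[1] = [1.0, 2.0]
basis = np.ones((1, 4, 2))
pose = joint_angles(0.25, mean, basis, np.array([0.5]))
```

Setting all coefficients to zero reproduces the mean walker; larger coefficients move the pose away from it along the learned basis motions.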
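Temporal Model: Walking
Parameters of the generative model are now
$s_t = [c_t, \psi_t, \nu_t, \tau_t^g, \theta_t^g]$
(coefficients $c_t$, phase $\psi_t$, speed $\nu_t$, global translation $\tau_t^g$, global rotation $\theta_t^g$)
Probabilistic model for $p(s_t \mid s_{t-1})$:
$p(c_{t,k} \mid c_{t-1,k}) = G(c_{t,k} - c_{t-1,k},\, \sigma^c \lambda_k)$
$p(\psi_t \mid \psi_{t-1}) = G(\psi_t - (\psi_{t-1} + \nu_{t-1}),\, \sigma^\psi)$
$p(\nu_t \mid \nu_{t-1}) = G(\nu_t - \nu_{t-1},\, \sigma^\nu)$
$p(\tau_t^g \mid \tau_{t-1}^g, \theta_{t-1}^g, \nu_{t-1}) = G([\tau_t^g, 1]^T - T_{t-1} [\nu_{t-1}\ 0\ 0\ 1]^T,\, \sigma^\tau)$
$p(\theta_t^g \mid \theta_{t-1}^g) = G(\theta_t^g - \theta_{t-1}^g,\, \sigma^\theta)$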
Learned Walking Model
* mean walker
* sample with small
* sample with moderate
* sample with very large
Stochastic 3D Tracking
Stochastic 3D tracking (manual initialization)
Use motion information to update and track distribution over time
Stochastic 3D Tracking
* significant changes in view and depth.
* template-based methods will fail.
No likelihood
* how strong is the walking prior? (or is our likelihood doing anything?)
Issues
* Large parameter space
 - approx. 10000 samples
 - sparsely represented
 - not real time
* Flow-based models can drift
* Requires initialization
Lessons Learned
* Probabilistic (Bayesian) framework allows
 - integration of information over time
 - modeling of priors
 - explicit generative image model
* Particle filtering allows
 - multi-modal distributions
 - tracking with ambiguities and non-linear models
* Weak image cues necessitate strong priors and many samples.
Work to be done
* better appearance model
 - other cues (color, edges, appearance, …)
* automatic initialization using 2D models
* learn more general models of motion
* better occlusion model (new)
* model of the background motion (new)
* better representations of the posterior (Fleet & Chou)
* better sampling methods (Fleet & Ormoneit)
* adapt shape of limbs
Very preliminary work…
The Statistics of People in Images and Video
How do people appear in natural scenes?
Want a general model.
Edge Filters
Ridge Filters
Statistics of Images
Ruderman. Lee, Mumford, Huang. Portilla and Simoncelli. Olshausen & Field. Zhu, Wu, & Mumford. …
Learning $p_{on}$ and $p_{off}$ for edge detection and road following:
Geman and Jedynak; Konishi, Yuille, and Coughlan
Example Training Images
Distribution of Filter Responses
Ratios for different limbs
Local Contrast Normalization
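Likelihood
$p(I \mid \phi_f, \phi_b) = \prod_{\mathbf{x}_b} p(I(\mathbf{x}_b) \mid \phi_b) \prod_{\mathbf{x}_f} p(I(\mathbf{x}_f) \mid \phi_f)$
$\quad = \frac{\prod_{\mathbf{x}} p(I(\mathbf{x}) \mid \phi_b) \prod_{\mathbf{x}_f} p(I(\mathbf{x}_f) \mid \phi_f)}{\prod_{\mathbf{x}_f} p(I(\mathbf{x}_f) \mid \phi_b)}$
$\quad = C\, \frac{\prod_{\mathbf{x}_f} p(I(\mathbf{x}_f) \mid \phi_f)}{\prod_{\mathbf{x}_f} p(I(\mathbf{x}_f) \mid \phi_b)}$
where $\mathbf{x}_f$ ranges over foreground pixels, $\mathbf{x}_b$ over background pixels, and $\mathbf{x}$ over all pixels.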
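Because the product over all pixels under the background model is a constant, only the ratio of foreground to background likelihoods over the foreground pixels needs to be evaluated per hypothesis. A minimal sketch in the log domain follows, assuming the two distributions are stored as histograms over filter responses; the names `p_on`, `p_off`, `bin_edges` and the toy histogram values are illustrative.

```python
import numpy as np

def log_likelihood_ratio(responses, p_on, p_off, bin_edges):
    """Sum of log(p_on / p_off) over the foreground pixels' filter responses."""
    responses = np.asarray(responses, dtype=float)
    # Map each response to a histogram bin, clipped to the valid range.
    bins = np.clip(np.digitize(responses, bin_edges) - 1, 0, len(p_on) - 1)
    eps = 1e-12  # avoids log(0) for empty histogram bins
    return float(np.sum(np.log(p_on[bins] + eps) - np.log(p_off[bins] + eps)))

bin_edges = np.array([0.0, 0.5, 1.0])
p_on = np.array([0.2, 0.8])    # hypothetical response histogram on limbs
p_off = np.array([0.8, 0.2])   # hypothetical response histogram on background
on_limb = log_likelihood_ratio([0.9, 0.7], p_on, p_off, bin_edges)
off_limb = log_likelihood_ratio([0.1, 0.2], p_on, p_off, bin_edges)
```

Hypotheses whose projected limbs land on limb-like responses get a positive score, and those landing on background-like responses get a negative one.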
Benefits
• Generic model of appearance.
• Principled way to choose filters.
• Model of foreground and background is incorporated into the tracking framework.
 - exploits the ratio between foreground and background likelihoods.
 - improves tracking.
• The same has been done for ridges and motion.
Outlook
5 years:
- Relatively reliable people tracking in monocular video.
- Path is pretty clear.
… solve the vision problem.
Next step: Beyond person-centric
- people interacting with object/world
Beyond that: Recognizing action
- goals, intentions, ...
… solve the AI problem.
Some Related Work
* Bregler & Malik: image motion, single hypothesis, full-body required multiple cameras, scaled ortho.
* Ju, Black, Yacoob: cardboard person model, image motion, 2D
* Deutscher et al: Condensation, edge cues, background subtraction.
* Cham & Rehg: known templates, 2D (SPM), particle filter.
* Wachter & Nagel: nicely combines motion and edges, single hypothesis (Kalman filter).
* Leventon & Freeman: assumes 2D tracking, probabilistic formulation, learned temporal model
(full body, monocular, articulated)
Conclusions
Bayesian formulation for tracking 3D human figures using monocular image information.
* Generative model of image appearance.
* Non-linear model represents ambiguities, singularities, occlusion, etc.
 - sampled representation of posterior.
* Particle filtering for incremental estimation.
* Automatic learning of cyclic motion prior.
Rich framework for modeling the complexity of human motion.
Initialization Using 2D Model
* Full-body walking model.
* Constructed from 3D mocap data.
* 2D, view-based (every 30 degrees)
* 4 subjects, 14 cycles
2D, View-Based Walker
* Construct linear optical flow basis
* Use similar Bayesian framework for tracking (Black CVPR’99)
* Coarse estimate of 3D parameters
* Automatic initialization
Example bases (views at 0 degrees and 90 degrees)
Recent Results
* Box indicates mean position and scale.
* Recovers distribution over phase and 3D scale.
Contrast Normalization
Locally weight image derivatives by
$w = \frac{\tanh(S \cdot \text{contrast} - O) + 1}{2 \cdot \text{contrast}}$
Global contrast normalization (Lee, Mumford & Huang):
$I_{norm} = \log(I / \hat{I})$
Optimizing the Filters
• Choose contrast normalization to maximize detection accuracy
• ROC curve
• Bhattacharyya: $B(p_{on}, p_{off}) = -\log \int \sqrt{p_{on}(y)\, p_{off}(y)}\, dy$
• Kullback-Leibler: $KL(p_{on}, p_{off}) = \int p_{on}(y) \log \frac{p_{on}(y)}{p_{off}(y)}\, dy$
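For histogram estimates of the on-limb and off-limb response distributions, the Bhattacharyya and Kullback-Leibler measures reduce to sums over bins. The sketch below uses the standard discrete definitions; the function names, the `eps` guard, and the example histograms are assumptions for illustration.

```python
import numpy as np

def bhattacharyya(p_on, p_off):
    """Negative log of the summed geometric mean of two discrete histograms."""
    p_on = np.asarray(p_on, dtype=float)
    p_off = np.asarray(p_off, dtype=float)
    return float(-np.log(np.sum(np.sqrt(p_on * p_off))))

def kl_divergence(p_on, p_off, eps=1e-12):
    """Sum over bins of p_on * log(p_on / p_off); eps guards empty bins."""
    p_on = np.asarray(p_on, dtype=float)
    p_off = np.asarray(p_off, dtype=float)
    return float(np.sum(p_on * (np.log(p_on + eps) - np.log(p_off + eps))))

p_on = np.array([0.9, 0.1])    # hypothetical on-limb histogram
p_off = np.array([0.1, 0.9])   # hypothetical off-limb histogram
b_dist = bhattacharyya(p_on, p_off)
kl_dist = kl_divergence(p_on, p_off)
```

Both measures are zero for identical distributions and grow as the two become easier to tell apart, so a filter setting that increases them is a better detector by this criterion.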
Local Contrast Normalization
Representing the Posterior
$p(\phi_t, V_t, R_t \mid \vec{I}_t)$ is represented by a discrete set of $N$ samples $(s_t^{(n)}, \pi_t^{(n)})$.
Normalized likelihood:
$\pi_t^{(n)} = \frac{p(I_t \mid s_t^{(n)})}{\sum_{i=1}^{N} p(I_t \mid s_t^{(i)})}$
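In practice the per-sample likelihoods are products over many pixels and limbs and can underflow, so the normalization is usually done from log-likelihoods. A minimal sketch, assuming the log-likelihoods have already been computed (the function name is illustrative):

```python
import numpy as np

def normalize_weights(log_likelihoods):
    """Turn per-sample log-likelihoods into weights that sum to one."""
    log_l = np.asarray(log_likelihoods, dtype=float)
    shifted = np.exp(log_l - np.max(log_l))   # subtract the max to avoid underflow
    return shifted / shifted.sum()

weights = normalize_weights([-1000.0, -1001.0, -1002.0])
```

Exponentiating these values directly would give zeros for every sample; subtracting the maximum first leaves the ratios between weights unchanged.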
Condensation
1. Selection: sample from the posterior at $t-1$. Most probable states are selected most often.
2. Prediction
3. Updating

Condensation
1. Selection
2. Prediction/Diffusion: sample from $p(s_t \mid s_{t-1})$, i.e. from the temporal prior
$p(\phi_t \mid \phi_{t-1}, V_{t-1})\, p(V_t \mid V_{t-1})\, p(R_t \mid I_{t-1}, \phi_{t-1})$:
 1. Compute $R_t$ from $p(R_t \mid I_{t-1}, \phi_{t-1})$
 2. Sample from $p(V_t \mid V_{t-1})$
 3. Sample from $p(\phi_t \mid \phi_{t-1}, V_{t-1})$
3. Updating
(Figure: $p$ over states, showing $s_{t-1}$ and $s_t$.)

Condensation
1. Selection
2. Prediction/Diffusion: sample from $p(s_t \mid s_{t-1})$. Models the dynamics.
3. Updating

Condensation
1. Selection
2. Prediction
3. Updating (the distribution): evaluate the new likelihood $p(I_t \mid s_t) = p(I_t \mid \phi_t, R_t)$.
Repeat until $N$ new samples have been generated.
Compute the normalized probability distribution.
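The three Condensation steps can be sketched as one generic function and exercised on a toy problem. This is an illustrative sketch, not the tracker from the talk: the `predict` and `log_likelihood` callables, the 1D random-walk dynamics, and the scalar measurement `z` are all hypothetical stand-ins for the temporal prior and image likelihood.

```python
import numpy as np

def condensation_step(samples, weights, predict, log_likelihood, rng):
    """One selection / prediction / updating cycle.

    samples: (N, D) states from time t-1; weights: (N,) normalized weights.
    predict(state, rng) -> state drawn from the temporal prior (hypothetical).
    log_likelihood(state) -> log-likelihood of the new image (hypothetical).
    """
    n = len(samples)
    # 1. Selection: resample indices in proportion to the old weights.
    chosen = rng.choice(n, size=n, p=weights)
    # 2. Prediction: push each selected state through the temporal prior.
    predicted = np.array([predict(samples[i], rng) for i in chosen])
    # 3. Updating: weight by the new likelihood and normalize.
    log_l = np.array([log_likelihood(s) for s in predicted])
    new_w = np.exp(log_l - log_l.max())
    return predicted, new_w / new_w.sum()

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=(500, 1))
weights = np.full(500, 1.0 / 500)
z = 2.0  # hypothetical scalar measurement, observed at every step
for _ in range(10):
    samples, weights = condensation_step(
        samples, weights,
        predict=lambda s, r: s + r.normal(0.0, 0.3, size=s.shape),
        log_likelihood=lambda s: -0.5 * ((s[0] - z) / 0.5) ** 2,
        rng=rng)
estimate = float(np.sum(weights * samples[:, 0]))   # expected value of the state
```

The sample set starts centred on zero and migrates toward the measurement over a few iterations, while the spread of the samples continues to represent the remaining uncertainty.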
Visualizing Results
Expected value of state parameter $f(s_t)$:
$E[f(s_t) \mid \vec{I}_t] = \sum_{n=1}^{N} \pi_t^{(n)} f(s_t^{(n)})$
Likelihood
* To cope with occluded limbs or those viewed at narrow angles, we introduce a probability of occlusion, $q$.
* likelihood of observing limb $j$ is then
$p_j = (1 - q)\, p_{image,j} + q\, p_{occluded}$
* likelihood of the model is product of limb likelihoods
$p(I_t \mid \phi_t, R_t) = \prod_j p_j$
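Combining the per-limb image likelihoods with an occlusion term and multiplying over limbs is conveniently done in the log domain. A minimal sketch, assuming one occlusion probability per limb and a single constant occluded-limb likelihood (all names and numbers are illustrative):

```python
import numpy as np

def model_log_likelihood(p_image, p_occluded, q):
    """Log of the product over limbs of (1 - q_j) * p_image_j + q_j * p_occluded."""
    p_image = np.asarray(p_image, dtype=float)
    q = np.asarray(q, dtype=float)
    p_limb = (1.0 - q) * p_image + q * p_occluded   # mixture per limb
    return float(np.sum(np.log(p_limb)))            # product over limbs, in logs

# Two limbs: the first fully visible, the second fully occluded.
log_l = model_log_likelihood([0.5, 0.2], p_occluded=0.1, q=[0.0, 1.0])
```

A limb judged fully occluded contributes only the constant term, so its (unreliable) image evidence does not affect the ranking of hypotheses.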
Indexing/Search
The crux of the problem.
• The parameter space is huge.
• Brute force search is infeasible (ditto discretely sampling the space).
• Need to index into correct part of the space.
• Use a hierarchy of models of increasing complexity
Images → Generic Models (expansion, rotation, …) → Coarse Object Models (EigenPeople) → Detailed Models (shape & activity)
Index between levels; compute likelihood.
w/ Jepson & Fleet.
Initialization
* new spatially constrained mixture model
* find appropriate mid-level representations
* initialize high level models using mid-level cues
Digital Video Analysis
Social display of puzzlement
To automatically analyze such a sequence we must
* detect moving regions
* estimate and interpret the motion
* model complex articulated objects such as humans
* model temporal patterns of activity
Tracking Moving Structure
Next steps
* split/merge/kill/initialize/grow/shrink operations
* probabilistic search for best interpretation of the scene
* detect more complex structures (articulation)
Generative Model
Mouth Training Data
* 3000 image training sequence
* motion estimated between pairs of frames
* utterances: “center”, “print”, “track”, “release”
Learned Spatial Model
* 3 basis flow fields account for 85% of variance.
* fewer needed for recognition than for accurate estimation.
Mouth Temporal Models
Mouth Results
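Bayesian Formulation
Let $z_t$ be the image measurements at time $t$.
Let $\vec{z}_t$ be a sequence of measurements from 0 to $t$.
Let $\mathbf{s}_t$ be a state (a tuple of model parameters).
We want
$p(\mathbf{s}_t \mid \vec{z}_t) = k\, p(z_t \mid \mathbf{s}_t)\, p(\mathbf{s}_t \mid \vec{z}_{t-1})$
* $p(\mathbf{s}_t \mid \vec{z}_t)$: not Gaussian.
* $p(z_t \mid \mathbf{s}_t)$: measurement likelihood. Cannot be represented in closed form.
* $p(\mathbf{s}_t \mid \vec{z}_{t-1})$: temporal prior. Can be sampled.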
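Generative Model (Brightness Constancy)
$I(\mathbf{x}(\mathbf{s}_t), t) = I(\mathbf{x}(\mathbf{s}_t) - \mathbf{u}(\mathbf{x}; \mathbf{s}_t), t-1) + \eta(0, \sigma)$
$\mathbf{x}(\mathbf{s}_t)$ is the image location $\mathbf{x}$ transformed by the state, relative to the position $\mathbf{p}$.
Optical flow:
$\mathbf{u}(\mathbf{x}; \mathbf{s}_t) = \sum_{i=1}^{n} a_{i,t}\, \mathbf{b}_i(\mathbf{x}(\mathbf{s}_t) - \mathbf{p})$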
Representing the Posterior
$p(\mathbf{s}_t \mid \vec{z}_t)$ is represented by a discrete set of $S$ samples $\{(\mathbf{s}_t^{(n)}, \pi_t^{(n)}),\ n = 1, \ldots, S\}$, with
$\pi_t^{(n)} = \frac{p(z_t \mid \mathbf{s}_t^{(n)})}{\sum_{i=1}^{S} p(z_t \mid \mathbf{s}_t^{(i)})}$
Stochastic Search
* Particle filtering (Condensation):
 1. Sample from posterior at time t-1.
 2. Predict using temporal prior.
 3. Evaluate likelihood.
* Predict non-Gaussian distribution over time.
* Update posterior with new measurements.
* Allocate computational resources to effectively explore the space.
Generative Model: Motion
Frame t-1: $P(x_{t-1}, y_{t-1})$
Frame t: $P(x_t, y_t)$
Image motion $\mathbf{u}$ between the projections at $t-1$ and $t$.
Learned Walking Model
* sample with large