Auditory and Visual Spatial Sensing
Stan Birchfield
Department of Electrical and Computer Engineering
Clemson University
Human Spatial Sensing
The five senses:
Hearing
Taste
Touch
Smell
Seeing
f(t)    f(x,y,,t)
Visual and Auditory Pathways
Two Problems in Spatial Sensing
Stereo Vision Acoustic Localization
Clemson Vision Laboratory
head tracking
root detection reconstruction
highway monitoring
motion segmentation
Clemson Vision Lab (cont.)
microphone position calibration
speaker localization
Stereo Vision
INPUT
OUTPUT
Left Right
Disparity map Depth discontinuities
epipolar constraint
Epipolar Constraint
Left camera Right camera
world point
center of projection
epipolar plane
epipolar line
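The epipolar constraint turns the 2D search for a match into a 1D search along a line. A minimal sketch, assuming the camera pair is described by a fundamental matrix F (not named on the slide) and using the ideal rectified case as the example; the function names are hypothetical:

```python
def epipolar_line(fundamental, point):
    """Return (a, b, c) with a*x + b*y + c = 0, the epipolar line F @ (x, y, 1)."""
    x, y = point
    homogeneous = (x, y, 1)
    return tuple(
        sum(row[k] * homogeneous[k] for k in range(3)) for row in fundamental
    )


def lies_on_line(line, point, tolerance=1e-9):
    """True if the point satisfies the line equation within the tolerance."""
    a, b, c = line
    x, y = point
    return abs(a * x + b * y + c) <= tolerance


# Assumption: an ideal rectified pair (pure horizontal camera shift),
# for which F is the cross-product matrix of the translation (1, 0, 0).
F_RECTIFIED = ((0, 0, 0), (0, 0, -1), (0, 1, 0))

line = epipolar_line(F_RECTIFIED, (40, 17))
print(line)  # (0, -1, 17), i.e. the scanline y = 17
```

For a rectified pair the epipolar line is the same scanline in the other image, which is why the matching methods on the following slides search one row at a time.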
Energy Minimization
Left
Right
intensity
occluded pixels
minimize: $E_{\text{data}} + E_{\text{smoothness}} = \sum_{x_L} d(x_L, x_L - \delta) + \sum_i u(l_i)$
$E_{\text{data}}$: dissimilarity (underconstrained)
$E_{\text{smoothness}}$: discontinuity penalty (constraint)
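To make the two terms concrete, here is a minimal sketch that scores one candidate disparity assignment on a single scanline. It assumes absolute intensity difference as the dissimilarity and a constant penalty per disparity change; both are stand-ins chosen for illustration, not the talk's exact definitions, and occluded pixels are not modeled.

```python
def scanline_energy(left, right, disparity, penalty=2.0):
    """Data term plus smoothness term for one scanline.

    left, right: lists of intensities; disparity[x] matches left[x] to
    right[x - disparity[x]]. The penalty is charged once per disparity change.
    """
    data = 0.0
    for x, d in enumerate(disparity):
        match = x - d
        if not 0 <= match < len(right):
            raise ValueError(f"pixel {x} with disparity {d} falls outside the image")
        data += abs(left[x] - right[match])  # assumed dissimilarity
    changes = sum(1 for a, b in zip(disparity, disparity[1:]) if a != b)
    return data + penalty * changes


left = [0, 10, 50, 50, 90]
right = [10, 50, 50, 90, 0]
print(scanline_energy(left, right, [0, 1, 1, 1, 1]))  # 12.0
print(scanline_energy(left, right, [0, 0, 0, 0, 0]))  # 180.0
```

The data term alone is underconstrained (many assignments fit the intensities equally well); the smoothness term is what favors the piecewise-constant solution.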
History of Stereo Correspondence
DYNAMIC PROGRAMMING (1D)
Birchfield & Tomasi 1998
Geiger et al. 1995
Intille & Bobick 1994
Belhumeur & Mumford 1992
Ohta & Kanade 1985
Baker & Binford 1981
MULTIWAY-CUT (2D)
Kolmogorov & Zabih 2001, 2002
Lin & Tomasi 2002
Birchfield & Tomasi 1999
Boykov, Veksler, and Zabih 1998
Roy & Cox 1998
Dynamic Programming: 1D Search
Disparity map
occlusion
depth discontinuity
RIGHT
LEFT
string editing:
c a t
c a r t
penalties: mismatch = 1, insertion = 1, deletion = 1

         c  a  r  t
   t  3  2  1  1  1
   a  2  1  0  1  2
   c  1  0  1  2  3
      0  1  2  3  4

stereo matching:
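The string-editing table can be reproduced with the standard dynamic-programming recurrence. This is a minimal sketch using the slide's penalties; the function name is hypothetical. Scanline stereo matching follows the same pattern, with pixels in place of characters and occlusions in place of insertions and deletions.

```python
def edit_distance_table(a, b, mismatch=1, insertion=1, deletion=1):
    """table[i][j] = cheapest way to edit a[:i] into b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * deletion
    for j in range(1, cols):
        table[0][j] = j * insertion
    for i in range(1, rows):
        for j in range(1, cols):
            substitute = 0 if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = min(
                table[i - 1][j - 1] + substitute,  # match or mismatch
                table[i - 1][j] + deletion,        # drop a[i-1]
                table[i][j - 1] + insertion,       # insert b[j-1]
            )
    return table


table = edit_distance_table("cat", "cart")
for row in reversed(table):  # print with the origin at the bottom-left
    print(row)
# [3, 2, 1, 1, 1]
# [2, 1, 0, 1, 2]
# [1, 0, 1, 2, 3]
# [0, 1, 2, 3, 4]
```

The top-right entry, 1, is the cost of editing "cat" into "cart": a single insertion.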
Multiway-Cut: 2D Search
pixels
labels
pixels
labels
[Boykov, Veksler, Zabih 1998]
Multiway-Cut Algorithm
Edge weights: $u(x,x')$ and $g(x, f(x))$
minimum cut
Minimizes $\sum_{x} g(x, f(x)) + \sum_{(x,x')} u(x,x')\,[f(x) \neq f(x')]$
source label
sink label
pixels
(cost of label discontinuity)
(cost of assigning label to pixel)
pixels
labels
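With only two labels (a source label and a sink label), the labeling energy can be minimized exactly by a single minimum cut. The sketch below illustrates that special case only; it is not the multiway-cut algorithm itself, which handles more than two labels. It assumes a constant discontinuity cost and uses a plain Edmonds-Karp max-flow; all names are hypothetical.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp. Returns (flow value, set of nodes on the source side)."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)  # reverse edges
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow, set(parent)  # nodes still reachable from the source
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def two_label_cut(data_cost, neighbors, penalty):
    """data_cost[p] = (cost of label 0, cost of label 1); pixels must not be named 's' or 't'."""
    capacity = {"s": {}, "t": {}}
    for p, (cost0, cost1) in data_cost.items():
        capacity["s"][p] = cost1                 # cut if p takes label 1
        capacity.setdefault(p, {})["t"] = cost0  # cut if p takes label 0
    for p, q in neighbors:
        capacity[p][q] = penalty                 # cut if the labels differ
        capacity[q][p] = penalty
    flow, source_side = max_flow(capacity, "s", "t")
    return flow, {p: 0 if p in source_side else 1 for p in data_cost}


def labeling_energy(data_cost, neighbors, penalty, labels):
    """Assignment costs plus one penalty per neighboring pair with different labels."""
    data = sum(data_cost[p][labels[p]] for p in data_cost)
    return data + penalty * sum(1 for p, q in neighbors if labels[p] != labels[q])


data_cost = {0: (1, 5), 1: (1, 5), 2: (4, 3), 3: (6, 1)}
neighbors = [(0, 1), (1, 2), (2, 3)]
energy, labels = two_label_cut(data_cost, neighbors, penalty=2)
print(energy, labels)  # 8 {0: 0, 1: 0, 2: 1, 3: 1}
```

The value of the cut equals the energy of the labeling it induces, so the minimum cut is the minimum-energy labeling.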
Sampling-Insensitive Pixel Dissimilarity
Our dissimilarity measure: d(xL,xR) = min{ d̄(xL,xR), d̄(xR,xL) }
[Birchfield & Tomasi 1998]
[Figure: intensity profiles IL and IR around xL and xR, illustrating d̄(xL,xR)]
Dissimilarity Measure Theorems
Given: An interval A such that [xL – ½ , xL + ½] ⊆ A and [xR – ½ , xR + ½] ⊆ A
Theorem 1: If | xL – xR | ≤ ½, then d(xL,xR) = 0 (when A is convex or concave)
Theorem 2: | xL – xR | ≤ ½ iff d(xL,xR) = 0 (when A is linear)
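A minimal sketch of the sampling-insensitive measure, following the description in Birchfield & Tomasi 1998: each one-sided term compares one pixel against the other scanline linearly interpolated over the half-pixel interval around its candidate match, and the final measure takes the smaller of the two directions. Clamping at the image border and the function names are assumptions made for this example.

```python
def one_sided(x_a, x_b, scan_a, scan_b):
    """Distance from scan_a[x_a] to the interpolated scan_b over [x_b - 1/2, x_b + 1/2]."""
    center = scan_b[x_b]
    previous = scan_b[max(x_b - 1, 0)]               # assumed border handling
    following = scan_b[min(x_b + 1, len(scan_b) - 1)]
    half_left = (center + previous) / 2              # interpolated value at x_b - 1/2
    half_right = (center + following) / 2            # interpolated value at x_b + 1/2
    lowest = min(center, half_left, half_right)
    highest = max(center, half_left, half_right)
    value = scan_a[x_a]
    return max(0, value - highest, lowest - value)


def dissimilarity(x_left, x_right, left, right):
    """Symmetric measure: the smaller of the two one-sided distances."""
    return min(
        one_sided(x_left, x_right, left, right),
        one_sided(x_right, x_left, right, left),
    )


# A linear intensity ramp sampled by two cameras half a pixel apart.
left = [0, 10, 20, 30, 40]
right = [5, 15, 25, 35, 45]
print(abs(left[2] - right[2]))           # 5: plain absolute difference is fooled
print(dissimilarity(2, 2, left, right))  # 0: the half-pixel shift is absorbed
```

The example is the linear case of the theorems: a sampling shift of at most half a pixel gives zero dissimilarity, while a true mismatch such as `dissimilarity(2, 3, left, right)` stays positive.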
Correspondence as Segmentation
• Problem: disparities (fronto-parallel) O(), surfaces (slanted) O( 2 n) => computationally intractable!
• Solution: iteratively determine which labels to use
label pixels: multiway-cut (Expectation)
find affine parameters of regions: Newton-Raphson (Maximization)
Stereo Results (Dynamic Programming)
Stereo Results (Multiway-Cut)
Stereo Results on Middlebury Database
[Figure rows: image; Birchfield & Tomasi 1999; Hong-Chen 2004]
Multiway-Cut Challenges
Multiway-cut    Dynamic programming
Acoustic Localization
Problem: Use microphone signals to determine sound source location
Traditional solutions:
1. Delay-and-sum beamforming ✓
2. Time-delay estimation (TDE) ✓
Recent solutions:
3. Hemisphere sampling ✓✓
4. Accumulated correlation ✓✓
5. Bayesian ✓
6. Zero-energy ✓
compact
distributed
Legend: ✓ efficient, ✓ accurate
Localization Geometry
t1
t2
t2 – t1 = Δt
(one-half hyperboloid)
microphones
sound source
time
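The geometry can be checked numerically: every source position that produces the same arrival-time difference at a microphone pair lies on the same half hyperboloid. A minimal sketch, assuming a speed of sound of 343 m/s and free-field propagation; the function name is hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def time_difference(source, mic1, mic2, speed=SPEED_OF_SOUND):
    """Arrival time at mic2 minus arrival time at mic1, in seconds."""
    return (math.dist(source, mic2) - math.dist(source, mic1)) / speed


mic1, mic2 = (0.0, 0.0, 0.0), (0.15, 0.0, 0.0)

# A source on the perpendicular bisector reaches both microphones together.
broadside = time_difference((0.075, 5.0, 0.0), mic1, mic2)

# A source on the array axis gives the largest possible difference, d / c.
endfire = time_difference((10.0, 0.0, 0.0), mic1, mic2)
print(broadside, endfire)
```

A single microphone pair therefore constrains the source to a surface, not a point, which is why several pairs have to be combined.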
Principle of Least Commitment
“Delay decisions as long as possible”
Example:
[Marr 1982; Russell & Norvig 1995]
Localization by Beamforming
mic 1 signal → prefilter → delay
mic 2 signal → prefilter → delay
mic 3 signal → prefilter → delay
mic 4 signal → prefilter → delay
all delayed signals → sum → energy → find peak
[Silverman & Kirtman 1992; Duraiswami et al. 2001; Ward & Williamson 2002]
✓ accurate, NOT efficient
• makes decision late in pipeline (“principle of least commitment”)
• delays (shifts) each signal for each candidate location
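A minimal sketch of the delay-and-sum search: for every candidate location, shift each signal by the delay that location implies, sum, and keep the location with the most energy. It assumes whole-sample shifts, no prefilter, and a speed of sound of 343 m/s; the function names are hypothetical.

```python
import math


def steered_energy(signals, shifts):
    """Energy of the sum of the signals after advancing each by its shift (in samples)."""
    length = min(len(s) - k for s, k in zip(signals, shifts))
    energy = 0.0
    for n in range(length):
        summed = sum(s[n + k] for s, k in zip(signals, shifts))
        energy += summed * summed
    return energy


def beamform(signals, mics, candidates, sample_rate, speed=343.0):
    """Return (best candidate, its energy). Every signal is re-shifted per candidate."""
    best_location, best_energy = None, float("-inf")
    for location in candidates:
        distances = [math.dist(location, m) for m in mics]
        nearest = min(distances)
        shifts = [round((d - nearest) / speed * sample_rate) for d in distances]
        energy = steered_energy(signals, shifts)
        if energy > best_energy:
            best_location, best_energy = location, energy
    return best_location, best_energy


# Synthetic example: an impulse from (14, 0) reaches the far microphone 4 samples late.
mics = [(0.0, 0.0), (4.0, 0.0)]
far = [1.0 if n == 24 else 0.0 for n in range(40)]
near = [1.0 if n == 20 else 0.0 for n in range(40)]
candidates = [(-10.0, 0.0), (2.0, 5.0), (14.0, 0.0)]
print(beamform([far, near], mics, candidates, sample_rate=343.0))
# ((14.0, 0.0), 4.0)
```

The inner loop over samples runs once per candidate, which is the source of the cost noted on the slide.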
Localization by Time-Delay Estimation (TDE)
mic 1 signal → prefilter, mic 2 signal → prefilter → correlate → find peak
mic 3 signal → prefilter, mic 4 signal → prefilter → correlate → find peak
both peaks → intersect (may be no intersection)
[Brandstein et al. 1995; Brandstein & Silverman 1997; Wang & Chu 1997]
✓ efficient, NOT accurate
• decision is made early
• cross-correlation computed once for each microphone pair
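The per-pair step of TDE can be sketched as a cross-correlation followed by a peak search. This is a minimal time-domain version without a prefilter; the function names and the lag convention (positive lag means the second signal arrives later) are choices made for this example.

```python
def cross_correlation(a, b, max_lag):
    """Return {lag: sum over n of a[n] * b[n + lag]} for |lag| <= max_lag."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for n, value in enumerate(a):
            m = n + lag
            if 0 <= m < len(b):
                total += value * b[m]
        result[lag] = total
    return result


def estimate_delay(a, b, max_lag):
    """Commit to the single lag with the highest correlation (the early decision)."""
    correlation = cross_correlation(a, b, max_lag)
    return max(correlation, key=correlation.get)


first = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
second = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]  # same pulse, 3 samples later
print(estimate_delay(first, second, max_lag=5))  # 3
```

Only the peak survives this step; the rest of the correlation function is discarded, so a wrong peak in one pair cannot be corrected later.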
Localization by Hemisphere Sampling
mic 1 signal → prefilter
mic 2 signal → prefilter
mic 3 signal → prefilter
mic 4 signal → prefilter
each microphone pair → correlate → map to common coordinate system → sampled locus
all sampled loci → sum → temporal smoothing → final sampled locus → find peak
[Birchfield & Gillmor 2001]
✓ efficient ✓ accurate (but restricted to compact arrays)
Localization by Accumulated Correlation
mic 1 signal → prefilter
mic 2 signal → prefilter
mic 3 signal → prefilter
mic 4 signal → prefilter
each microphone pair → correlate → map to common coordinate system → sampled locus
all sampled loci → sum → temporal smoothing → final sampled locus → find peak
[Birchfield & Gillmor 2002]
✓ efficient ✓ accurate
Accumulated Correlation Algorithm
microphone
candidate location
pair 1 + pair 2 + ... = likelihood
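A minimal sketch of the accumulation step: each pair's cross-correlation is computed once, and every candidate location is scored by summing, over all pairs, the correlation value at the lag that location implies. It assumes whole-sample lags, no prefilter, no temporal smoothing, and a speed of sound of 343 m/s; the names are hypothetical and this is an illustration of the idea, not the published implementation.

```python
import math
from itertools import combinations


def correlate_at(a, b, lag):
    """Sum over n of a[n] * b[n + lag], ignoring samples that fall outside b."""
    return sum(a[n] * b[n + lag] for n in range(len(a)) if 0 <= n + lag < len(b))


def accumulated_correlation(signals, mics, candidates, sample_rate, max_lag, speed=343.0):
    """Return {candidate: summed correlation over all microphone pairs}."""
    pairs = list(combinations(range(len(signals)), 2))
    # One correlation table per pair, computed once and reused for every candidate.
    tables = {
        (i, j): {
            lag: correlate_at(signals[i], signals[j], lag)
            for lag in range(-max_lag, max_lag + 1)
        }
        for i, j in pairs
    }
    scores = {}
    for location in candidates:
        total = 0.0
        for i, j in pairs:
            extra = math.dist(location, mics[j]) - math.dist(location, mics[i])
            lag = round(extra / speed * sample_rate)
            total += tables[(i, j)].get(lag, 0.0)  # lags beyond max_lag add nothing
        scores[location] = total
    return scores


# Synthetic example: an impulse from (20, 0) arrives at samples 18, 14 and 10.
mics = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0)]
signals = [[1.0 if n == k else 0.0 for n in range(40)] for k in (18, 14, 10)]
candidates = [(20.0, 0.0), (-12.0, 0.0), (4.0, 9.0)]
scores = accumulated_correlation(signals, mics, candidates, sample_rate=343.0, max_lag=8)
print(max(scores, key=scores.get))  # (20.0, 0.0)
```

The decision is still made late, over the summed scores, but the expensive correlation is done per pair and not per candidate.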
Comparison
Bayesian:
Zero energy:
Acc corr:
Hem samp:
TDE:
similarity energy
efficient
accurate
Beamforming:
Unifying framework
efficient
accurate
Integration limits
Beamforming, Bayesian, Zero energy
Accumulated correlation, Hemisphere sampling, Time-delay estimation
Compact Microphone Array
microphone
d = 15 cm
sampled hemisphere
Results on compact array
pan
tilt
without PHAT prefilter with PHAT prefilter
More Comparison
Hemisphere Sampling [Birchfield & Gillmor 2001]
Beamforming
Accumulated Correlation [Birchfield & Gillmor 2002]
Results on distributed array
Computational efficiency
[Bar chart: computing time per window (ms), axis 0 to 8000, comparing Beamforming and Accumulated correlation on the Compact and Distributed arrays. Accumulated correlation is 600x faster (compact) and 50x faster (distributed).]
Simultaneous Speakers
+ =
Detecting Noise Sources
background noise source
Connection with Stereo
[Okutomi & Kanade 1993]
“Multi-baseline stereo”
Conclusion
• Spatial sensing achieved by arrays of visual and auditory sensors
• Stereo vision
  – match visual signals from multiple cameras
  – recent breakthrough: multiway-cut
  – limitations of multiway-cut
• Acoustic localization
  – match acoustic signals from multiple microphones
  – recent breakthrough: accumulated correlation
  – connection with multi-baseline stereo