

A Dynamic Conditional Random Field Model for Joint Labeling of Object and Scene Classes
Christian Wojek, Bernt Schiele

Computer Science Department, TU Darmstadt, Germany
{wojek, schiele}@cs.tu-darmstadt.de

Objective

Pixel-wise labeling of object and scene classes in a Dynamic Conditional Random Field framework [1]

• Exploit a powerful object detector in the CRF framework to improve pixel-wise labeling of object classes

• Leverage temporal information

• Joint inference for objects and scene

• New dataset with pixel-wise labels for highly dynamic scenes

Plain CRF formulation

$$\log P_{\mathrm{pCRF}}(\mathbf{y}^t \mid \mathbf{x}^t, N_1, \Theta) = \sum_i \Phi(y_i^t, \mathbf{x}^t; \Theta_\Phi) + \sum_{(i,j) \in N_1} \Psi(y_i^t, y_j^t, \mathbf{x}^t; \Theta_\Psi) - \log Z^t$$
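For intuition, the (unnormalized) log-score of a labeling under this model is just a sum of unary log-potentials over pixels plus pairwise log-potentials over neighboring pairs in N_1. A minimal Python sketch with toy potential tables (an illustration, not the authors' implementation; the `unary` and `pairwise` arrays are hypothetical stand-ins for the learnt potentials):

```python
import numpy as np

def crf_log_score(labels, unary, pairwise, edges):
    """Unnormalized log-score of a labeling: sum of unary log-potentials
    over nodes plus pairwise log-potentials over neighborhood edges.
    labels: (num_nodes,) ints; unary: (num_nodes, num_classes);
    pairwise: (num_classes, num_classes); edges: list of (i, j) pairs."""
    score = unary[np.arange(len(labels)), labels].sum()
    for i, j in edges:
        score += pairwise[labels[i], labels[j]]
    return score  # equals log P up to the constant -log Z

# Toy example: 3 nodes in a chain, 2 classes.
unary = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]))
pairwise = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))  # favors smooth labelings
edges = [(0, 1), (1, 2)]
smooth = crf_log_score(np.array([0, 0, 0]), unary, pairwise, edges)
noisy = crf_log_score(np.array([0, 1, 0]), unary, pairwise, edges)
```

With these toy tables the smooth labeling outscores the one with a flipped middle label, which is exactly the smoothing effect the pairwise term contributes.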

• Seven class labels: Sky, Road, Lane marking, Trees & bushes, Grass, Building, Void

• Joint boosting [2] to obtain unary potentials; softmax transform to obtain a pseudo-probability:

$$\Phi(y_i^t = k, \mathbf{x}^t; \Theta_\Phi) = \log \frac{\exp H(k, f(x_i^t); \Theta_\Phi)}{\sum_c \exp H(c, f(x_i^t); \Theta_\Phi)}$$
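This transform can be sketched in a few lines. The Python below is illustrative only; `scores` stands in for the per-class joint-boosting outputs H(c, f(x_i)), and the max-subtraction is a standard numerical-stability trick, not something the poster specifies:

```python
import numpy as np

def unary_log_potential(scores):
    """Log-softmax over per-class boosting scores H(c, f(x_i)):
    turns raw classifier outputs into log pseudo-probabilities."""
    s = scores - scores.max()           # stabilize before exponentiating
    return s - np.log(np.exp(s).sum())  # log softmax

scores = np.array([2.0, 0.5, -1.0])     # toy boosting scores for 3 classes
log_p = unary_log_potential(scores)
```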

• Pairwise potentials with logistic classifiers (learnt with gradient descent) [3]

$$\Psi(y_i^t, y_j^t, \mathbf{x}^t; \Theta_\Psi) = \sum_{(k,l) \in C} \mathbf{w}^T \begin{pmatrix} 1 \\ \mathbf{d}_{ij}^t \end{pmatrix} \delta(y_i^t = k)\,\delta(y_j^t = l)$$
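The pairwise term is linear in a bias and the feature difference d_ij between the two neighboring sites; the sum over label pairs (k, l) suggests one weight vector per pair. A hedged sketch (the weight tensor `W` and random features are illustrative assumptions, not the learnt parameters):

```python
import numpy as np

def pairwise_log_potential(d_ij, W):
    """Linear pairwise log-potential: for each label pair (k, l) the
    potential is w_{kl}^T (1, d_ij), i.e. a bias plus a weighted
    feature difference between neighboring sites.
    d_ij: (D,) feature difference; W: (K, K, D+1) weights."""
    x = np.concatenate(([1.0], d_ij))  # prepend the bias term
    return W @ x                       # (K, K) table of log-potentials

K, D = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(K, K, D + 1))
table = pairwise_log_potential(rng.normal(size=D), W)
```

During inference, the delta functions in the formula simply pick the entry `table[k, l]` for the labels the two sites actually take.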

• Piecewise training of unary and pairwise potentials

• Distinguish east-west and north-south pairwise relations

• No dynamic information encoded

• Object classes suffer from too-short-range interactions

Object CRF formulation

$$\log P_{\mathrm{oCRF}}(\mathbf{y}^t, \mathbf{o}^t \mid \mathbf{x}^t, \Theta) = \log P_{\mathrm{pCRF}}(\mathbf{y}^t \mid \mathbf{x}^t, N_2, \Theta) + \sum_n \Omega(o_n^t, \mathbf{x}^t; \Theta_\Omega) + \sum_{(i,j,n) \in N_3} \Lambda(y_i^t, y_j^t, o_n^t, \mathbf{x}^t; \Theta_\Lambda)$$

• Enrich the plain CRF model with additional longer-range dependency information for object classes

• Additional nodes for object hypotheses instantiated by the object detector

– Underlying pairwise cliques are extended by an object node to form cliques of three

– Object layout is learnt in discretized scale space

$$\Lambda(y_i^t, y_j^t, o_n^t, \mathbf{x}^t; \Theta_\Lambda) = \sum_{(k,l) \in C;\, m \in O} \mathbf{u}^T \begin{pmatrix} 1 \\ \mathbf{d}_{ij}^t \end{pmatrix} \delta(y_i^t = k)\,\delta(y_j^t = l)\,\delta(o_n^t = m)$$

• Platt’s method to obtain a pseudo-probability for the unary object potential:

$$\Omega(o_n^t, \mathbf{x}^t; \Theta_\Omega) = \log \frac{1}{1 + \exp\!\big(s_1 \cdot (\mathbf{v}^T \cdot g(\{\mathbf{x}^t\}_{o_n^t}) + b) + s_2\big)}$$
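Platt’s method fits a sigmoid that maps a raw detector margin to a probability. A minimal sketch (the parameter values are illustrative, not fitted; `score` stands in for the detector response v^T g(·)):

```python
import numpy as np

def platt_log_prob(score, s1, s2, b=0.0):
    """Platt-style sigmoid: log 1 / (1 + exp(s1 * (score + b) + s2)).
    s1 and s2 would be fit on held-out detections; with s1 < 0 a
    larger detector score yields a higher pseudo-probability."""
    return -np.log1p(np.exp(s1 * (score + b) + s2))

p_strong = np.exp(platt_log_prob(3.0, s1=-1.5, s2=0.2))   # confident detection
p_weak = np.exp(platt_log_prob(-1.0, s1=-1.5, s2=0.2))    # weak detection
```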

Dynamic CRF formulation

[Figure: two-time-slice graphical model of the dynamic CRF — object nodes o^t, o^{t+1}, label fields y^t, y^{t+1}, observations x^t, x^{t+1}; potentials Φ, Ω, Λ within each slice; object dynamics κ^t (Kalman filter) and scene dynamics Δ^t (homography) link the slices]

• Independently model scene and object motion
• Extended Kalman filter in a 3D coordinate system for object classes
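The object dynamics follow a standard predict/update cycle. The sketch below uses a plain linear constant-velocity Kalman filter in one ground-plane coordinate as a simplified stand-in for the extended Kalman filter in 3D; all parameter values and the toy measurement sequence are illustrative:

```python
import numpy as np

def kalman_step(x, P, z, dt=1.0, q=0.1, r=0.5):
    """One predict/update cycle of a constant-velocity Kalman filter.
    State x = [position, velocity]; z is a noisy position measurement;
    q and r are illustrative process/measurement noise levels."""
    F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity motion model
    H = np.array([[1.0, 0.0]])             # only position is observed
    # Predict
    x = F @ x
    P = F @ P @ F.T + q * np.eye(2)
    # Update with measurement z
    S = H @ P @ H.T + r                    # innovation covariance
    K = P @ H.T / S                        # Kalman gain
    x = x + (K * (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.zeros(2), np.eye(2)
for z in [1.0, 2.1, 2.9, 4.2]:             # roughly unit velocity
    x, P = kalman_step(x, P, z)
```

After a few measurements the velocity estimate settles near the true motion, which is what lets the model predict where an object node should appear in the next frame.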

$$\log P_{\mathrm{tCRF}}(\mathbf{y}^t, \mathbf{o}^t \mid \mathbf{x}^t, \Theta) = \log P_{\mathrm{pCRF}}(\mathbf{y}^t \mid \mathbf{x}^t, N_2, \Theta) + \sum_n \kappa^t(o_n^t, \mathbf{o}^{t-1}, \mathbf{x}^t; \Theta_\kappa) + \sum_{(i,j,n) \in N_3} \Lambda(y_i^t, y_j^t, o_n^t, \mathbf{x}^t; \Theta_\Lambda)$$

• For scene classes, propagate the CRF posterior as a prior to the next time step

$$\Delta^t(y_i^t, \mathbf{y}^{t-1}; \Theta_{\Delta^t}) = \log P_{\mathrm{tCRF}}\big(y^{t-1}_{Q^{-1}(i)} \mid \Theta\big)$$

where Q^{-1}(i) maps pixel i back to the previous frame through the estimated scene homography.

$$\log P_{\mathrm{dCRF}}(\mathbf{y}^t, \mathbf{o}^t, \mathbf{x}^t \mid \mathbf{y}^{t-1}, \mathbf{o}^{t-1}, \Theta) = \log P_{\mathrm{tCRF}}(\mathbf{y}^t, \mathbf{o}^t \mid \mathbf{x}^t, \Theta) + \sum_i \Delta^t(y_i^t, \mathbf{y}^{t-1}; \Theta_{\Delta^t})$$
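Propagating the previous posterior as a per-pixel log-prior can be pictured as warping the posterior map into the current frame. In the sketch below a simple integer row shift (`np.roll`) stands in for the homography warp Q^{-1}, and the toy posterior values are illustrative:

```python
import numpy as np

def temporal_prior(prev_posterior, shift):
    """Propagate the previous frame's per-pixel class posterior as a
    log-prior for the current frame. A crude integer shift stands in
    for mapping each pixel through the inverse scene homography."""
    warped = np.roll(prev_posterior, shift, axis=0)  # warp placeholder
    return np.log(np.clip(warped, 1e-6, 1.0))        # clipped log-prior

prev = np.full((4, 4, 3), 1.0 / 3)  # uniform posterior over 3 classes
prev[0, :, 0] = 0.9                 # top row confident in class 0
prev[0, :, 1:] = 0.05
prior = temporal_prior(prev, shift=1)
```

After the warp, the confidence that sat in the top row now acts as a prior one row lower, mirroring how scene evidence follows the camera motion between frames.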

Experiments on TUD Dynamic Scenes Dataset

• New dataset containing dynamic scenes

– 176 sequences of 11 successive frames (88 sequences for training and 88 for testing)

– Last frame of each sequence with pixel-wise labels; bounding box labels for the object class car

• Publicly available from http://www.mis.informatik.tu-darmstadt.de

• Unary classification performance (pixel-wise accuracy):

                     Normalization on              Normalization off
                 multi-scale   single-scale    multi-scale   single-scale
  Location on       82.2%         81.1%          79.7%         79.7%
  Location off      69.1%         64.1%          62.3%         62.3%

Features

• For unary and interaction potentials:

– Gray-world normalization of input images

– Mean and variance of the first 16 Walsh-Hadamard transform coefficients of the CIE L, a and b channels, extracted at multiple scales (8, 16 and 32 pixel windows)

– Node coordinates in the regular 2D lattice

• HOG features [4] for object node unary potentials
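The Walsh-Hadamard texture features above can be sketched compactly. The code below is a simplified illustration: it works on a single channel rather than CIE L, a, b, and takes the first coefficients in natural (raveled) order, whereas a sequency ordering may differ:

```python
import numpy as np

def hadamard(n):
    """Hadamard matrix of order n (a power of two) via Sylvester's
    construction; its rows form the Walsh basis."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def wh_features(patch, k=16):
    """Mean/variance descriptor from the first k coefficients of the
    2D Walsh-Hadamard transform of a square patch (side a power of 2)."""
    H = hadamard(patch.shape[0])
    coeffs = (H @ patch @ H.T).ravel()[:k]  # 2D transform, first k coeffs
    return np.array([coeffs.mean(), coeffs.var()])

patch = np.arange(64, dtype=float).reshape(8, 8)  # toy 8x8 "window"
feat = wh_features(patch)
```

Extracting this pair at 8-, 16- and 32-pixel windows per channel yields the multi-scale texture part of the node feature vector.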

Experiments on Sowerby Dataset

• Evaluation of the plain CRF (static images only), pixel-wise accuracy:

                        Unary classification   plain CRF model
  He et al. [5]               82.4%                89.5%
  Kumar & Hebert [3]          85.4%                89.3%
  Shotton et al. [6]          85.6%                88.6%
  This paper                  84.5%                91.1%

[Figure: Sowerby sample results, two examples — input image, unary classification, plain CRF result. Legend: Sky, Street object, Road surface, Building, Vegetation, Car, Road marking]

• Sample segmentations and detections

[Figure: rows (a)–(f) — input image, ground truth, unary classification, plain CRF, object CRF, dynamic CRF. Legend: Void, Sky, Road, Lane marking, Trees & bushes, Grass, Building, Car]

• Pixel-wise evaluation of object class car:

                   No objects               With object layer        Including object dynamics
             Recall  Precision  Acc.    Recall  Precision  Acc.    Recall  Precision  Acc.
  CRF        50.1%   57.7%      88.3%   62.9%   52.3%      88.6%   70.4%   57.8%      88.7%
  dyn. CRF   25.5%   44.8%      86.5%   75.7%   50.8%      87.1%   78.0%   51.0%      88.1%

• Confusion matrix for all classes (rows: true class; columns: inferred class, in %):

  True class       Fraction    Sky   Road   Lane marking   Trees & bushes   Grass   Building   Void    Car
  Sky                10.4%    91.0    0.0        0.0            7.7           0.5       0.4     0.3    0.1
  Road               42.1%     0.0   95.7        1.0            0.3           1.1       0.1     0.5    1.3
  Lane marking        1.9%     0.0   36.3       56.4            0.8           2.9       0.2     1.8    1.6
  Trees & bushes     29.2%     1.5    0.2        0.0           91.5           5.0       0.2     1.1    0.4
  Grass              12.1%     0.4    5.7        0.5           13.4          75.3       0.3     3.5    0.9
  Building            0.3%     1.6    0.2        0.1           37.8           4.4      48.4     6.3    1.2
  Void                2.7%     6.4   15.9        4.1           27.7          29.1       1.4    10.6    4.8
  Car                 1.3%     0.3    3.9        0.2            8.2           4.9       2.1     2.4   78.0

References

[1] Andrew McCallum, Khashayar Rohanimanesh, and Charles Sutton. Dynamic conditional random fields for jointly labeling multiple sequences. In NIPS* Workshop on Syntax, Semantics, 2003.

[2] Antonio Torralba, Kevin P. Murphy, and William T. Freeman. Sharing features: Efficient boosting procedures for multiclass object detection. In CVPR, 2004.

[3] Sanjiv Kumar and Martial Hebert. A hierarchical field framework for unified context-based classification. In ICCV, 2005.

[4] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[5] Xuming He, Richard S. Zemel, and Miguel Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, 2004.

[6] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV, 2006.

Acknowledgements: This work has been funded, in part, by Continental Teves AG. Furthermore, we thank Joris Mooij for publicly releasing libDAI and BAE Systems for the Sowerby dataset.