
Optimal Visual Sensor Network Configuration

Jian Zhao^a, Sen-ching S. Cheung^a, Thinh Nguyen^b

^a Center for Visualization and Virtual Environments, University of Kentucky
^b School of Electrical Engineering and Computer Science, Oregon State University

Abstract

Wide-area visual sensor networks are becoming increasingly common. They have a wide range of commercial and military applications, from video surveillance to smart homes and from traffic monitoring to anti-terrorism. The design of such a

visual sensor network is a challenging problem due to the complexity of the environ-

ment, self and mutual occlusion of moving objects, diverse sensor properties and a

myriad of performance metrics for different applications. As such, there is a need to

develop a flexible sensor-planning framework that can incorporate all the aforemen-

tioned modeling details, and derive the sensor configuration that simultaneously

optimizes the target performance and minimizes the cost. In this chapter, we tackle

this optimal sensor problem by developing a general visibility model for visual sen-

sor networks and solving the optimization problem via Binary Integer Programming

(BIP). Our proposed visibility model supports arbitrary-shaped 3D environments

and incorporates realistic camera models, occupant traffic models, self occlusion and

mutual occlusion. Using this visibility model, two novel BIP algorithms are proposed

to find the optimal camera placement for tracking visual tags in multiple cameras.

Furthermore, a greedy implementation is proposed to cope with the complexity of 


BIP. Extensive performance analysis is performed using Monte-Carlo simulations,

virtual environment simulations and real-world experiments.

Key words: sensor placement, smart camera network, visual tags, binary integer

programming

1 Introduction

In recent years we have seen widespread deployment of smart camera net-

works for a variety of applications. Proper placement of cameras in such a

distributed environment is an important design problem. Not only does it de-

termine the coverage of the surveillance, it also has a direct impact on the

appearance of objects in the cameras which dictates the performance of all

subsequent computer vision tasks. For instance, one of the most important tasks in a distributed camera network is to identify and track common objects across disparate camera views. This is a difficult problem because image features such as corners, scale-invariant feature transform (SIFT) descriptors, contours, or color histograms may vary significantly between different camera views due to disparity, occlusion and variation in illumination. One possible solution is to utilize

semantically rich visual features based either on intrinsic characteristics such

as faces or gaits, or artificial marks like jersey numbers or special-colored tags.

We call the problem of identifying distinctive visual features on an object the

“Visual Tagging” problem.

An early version of this work has appeared in IEEE Journal of Selected Topics in Signal Processing, Special Issue on Distributed Processing in Vision Networks, Volume 2, Number 4, under the title "Optimal Camera Network Configurations for Visual Tagging" by the same authors.


To properly design a camera network that can accurately identify and un-

derstand visual tags, one needs a visual sensor planning tool – a tool that

analyzes the physical environment and determines the optimal configuration

for the visual sensors so as to achieve specific objectives under a given set

of resource constraints. Determining the optimal sensor configuration for a large-scale visual sensor network is technically a very challenging problem. First, visual line-of-sight sensors are susceptible to occlusion by both static and dynamic objects. This is particularly problematic as these networks are typically deployed in urban or indoor environments characterized by complicated topologies, stringent placement constraints and a constant flux of occupant or vehicular traffic. Second, from infra-red to range sensing, from static to pan-

tilt-zoom or even robotic cameras, there are a myriad of visual sensors and

many of them have overlapping capabilities. Given a fixed budget with limited

power and network connectivity, the choice and placement of sensors become

critical to the continuous operation of the visual sensor network. Third, the

performance of the network depends highly on the nature of the specific tasks

in the application. For example, biometric and object recognition require the

objects to be captured at a specific pose; triangulation requires visibility of 

the same object from multiple sensors; object tracking can tolerate a certain degree of occlusion using a probabilistic tracker.

As such, there is a need to develop a flexible sensor-planning framework that

can incorporate all the aforementioned modeling details, and derive the sensor

configuration that simultaneously optimizes the target performance and mini-

mizes the cost. Such a tool can allow us to scientifically determine the number

of sensors, their positions, their orientations, and the expected outcome before

embarking on the actual construction of a costly visual sensor network project.


In this chapter, we propose a novel integer-programming based framework for

determining the optimal visual sensor configuration for 3D environments. Our

primary focus will be on optimizing the performance of the network in visual

tagging. To allow maximum flexibility, we do not impose a particular method

for tag detection and simply model it as a generic visual detector. Furthermore,

our framework allows users the flexibility to determine the number of views in which the tag needs to be observed, so that a wide variety of applications can be simulated.

This chapter is organized as follows. After reviewing the state-of-the-art vi-

sual sensor placement techniques in Section 2, we discuss in Section 3 how the

performance of a sensor configuration can be measured using a general visi-

bility model. In Section 4, we adapt the general model to the “visual tagging”

problem using the probability of observing a tag from multiple visual sensors.

Using this refined model, we formulate in Section 5 the search of the optimal

sensor placements as two Binary Integer Programming (BIP) problems – the

first formulation, MIN CAM, focuses on minimizing the number of sensors

for a target performance level and the second one, FIX CAM, maximizes the

performance for a fixed number of sensors. Due to the computational com-

plexity of BIP, we will also present a greedy approximation algorithm called

GREEDY. Experimental results on these algorithms using both simulations

and camera network experiments are presented in Section 6. We conclude the chapter by discussing future work in Section 7.


2 Related work

The problem of finding the optimal camera placement has been studied for

a long time. The earliest investigation can be traced back to the “art gallery

problem” in computational geometry. This problem is the theoretical study on

how to place cameras in an arbitrary-shaped polygon so as to cover the entire

area [1–3]. Although Chvatal has shown in [4] that ⌊n/3⌋ cameras always suffice for a polygon with n vertices, determining the minimum number of cameras turns out to be an NP-complete problem [5]. While the theoretical difficulties

of the camera placement problem are well understood and many approximate

solutions have been proposed, few of them can be directly applied to realistic

computer vision problems. Camera placement has also been studied in the field

of photogrammetry for building accurate 3D models. Various metrics such as

visual hull [6] and viewpoint entropy [7] have been developed, and the optimization is realized by various types of ad-hoc search and heuristics [8]. These techniques assume very dense placement of cameras and are not applicable to wide-area, wide-baseline camera networks.

Recently, Ramakrishnan et al. propose a framework to study the performance

of sensor coverage in wide-area sensor networks [9]. Unlike previous techniques,

their approach takes into account the orientation of the object. They develop

a metric to compute the probability of observing an object of random orienta-

tion from one sensor, and use that to recursively compute the performance for

multiple sensors. While their approach can be used to study the performance of a fixed number of cameras, it is not obvious how to extend their scheme to find the optimal number of cameras, or how to incorporate other constraints such as visibility from more than one camera. More sophisticated


models pertinent to visual sensor networks have recently been proposed in [10–12].

The sophistication in their visibility models comes at a high computational

cost for the optimization. For example, the simulated annealing scheme used

in [11] takes several hours to find the optimal placements of four cameras in a

room. Other optimization schemes such as hill climbing [10], semi-definite programming [12] and evolutionary approaches [13] all prove to be computationally intensive and prone to local minima.

Alternatively, the optimization can be tackled in the discrete domain. Horster and Lienhart develop a flexible camera placement model by discretizing the space into a grid and denoting the possible placement of a camera as a binary variable at each grid point [14]. The optimal camera configuration is formulated

as an integer linear programming problem which can incorporate different

constraints and cost functions pertinent to a particular application. Similar

ideas were also proposed in [15–17]. While our approach follows a similar op-

timization strategy, we develop a more realistic visibility model to capture the

uncertainty of object orientation and mutual occlusion in 3D environments.

Unlike [14] in which the field of view of a camera is modeled as a 2-D fixed-size

triangle, ours is based on measuring the image size of the object as observed

by a pinhole camera with arbitrary 3-D location and pose. Our motivation is

based on the fact that the image size of the object is the key to the success

of any appearance-based object identification scheme. While the optimization

scheme described in [14] can theoretically be used for triangulating objects,

their results as well as others are limited to maximizing sensor coverage. Our

result, on the other hand, directly tackles the problem of visual tagging in

which each object needs to be visible by two or more cameras. Furthermore,

while the BIP formulation can avoid the local minima problem, its complexity


remains NP-complete [18, ch. 8]. As a consequence, these schemes again have

difficulties in scaling up to large sensor networks.

3 General Visibility Model

Consider the 3-D environment as depicted in Figure 1. Our goal in this section

is to develop a general model to compute the visibility of a single tag centered

at P  in such an environment. We assume that the 3-D environment has vertical

walls with piecewise linear contours. Obstacles are modeled as columns of finite

height and polyhedral cross sections. Whether the actual tag is the face of a

subject or an artificial object, it is reasonable to model each tag as a small flat

surface perpendicular to the ground plane. We further assume that all the tags

are of the same square shape with known edge length 2w. Without any specific

knowledge of the height of individuals, we assume that the centers of all the

tags lie on the same plane Γ parallel to the ground plane. This assumption does not hold in the real world, as individuals are of different heights. Nevertheless, as we will demonstrate in Section 6.1, such height variation does not significantly affect the overall visibility measurements, while the assumption greatly reduces the complexity of our model. While our model restricts the tags to be on the same plane, we place

no restriction on the 3-D positions, yaw and pitch angles of the cameras in

the visual sensor network.

Given the number of cameras and their placement in the environment, we

define the visibility V  of a tag using an aggregate measure of the projected size

of a tag on the image planes of different cameras. The projected size of the

tag is very important as the image of the tag has to be large enough to be

automatically identified at each camera view. Due to the camera projection


the center of the tag on the plane Γ. vP is the pose vector of the tag. As we

assume the tag is perpendicular to the ground plane, the pose vector vP lies

on the plane Γ and has a single degree of freedom – the orientation angle θ

with respect to a reference direction. Note that the dependency of V on vP allows us to model self-occlusion – the tag being occluded by the person who is

wearing it. The tag will not be visible to a camera if the pose vector is pointing

away from the camera.

While self occlusion can be succinctly captured by a single pose vector, the

precise modeling of mutual occlusion can be very complicated – it involves the number of neighboring objects, their distances to the tag, and the positions and orientations of the cameras. In our model, we choose a worst-case approach by considering a fixed-size occlusion angle β at a random position measured from the center of the tag on the Γ plane.

Mutual occlusion is said to occur if the projection of the line of sight on

the Γ plane falls within the range of the occlusion angle. In other words, we

model the occlusion as a cylindrical wall of infinite height around the tag, partially blocking a fixed visibility angle of β at a random starting position βs. w is half of the edge length of the tag, which is a known parameter. The shape of the environment is encapsulated in the fixed parameter set K, which contains a list of oriented vertical planes that describe the boundary walls and obstacles of finite height. It is straightforward to use K to compute whether there is a direct line of sight between an arbitrary point in the environment and a camera. The specific visibility function suitable for visual tagging will

be described in Section 4.

To correctly identify and track any visual tag, a typical classification algorithm

would require the tag size on the image to be larger than a certain minimum


size, though a larger projected size usually does not make much difference.

For example, a color-tag detector needs a threshold to differentiate the tag from noise, and a face detector needs a face image large enough to observe the facial features. On the other hand, the information gain does not increase as the projected object size grows beyond a certain value. Therefore, a thresholded visibility represents our problem better than the absolute image size. Assuming that this minimum threshold on image size is T pixels,

this requirement can be modeled by binarizing the visibility function as follows:

$$V_b(P, v_P, \beta_s \mid w, K, T) = \begin{cases} 1 & \text{if } V(P, v_P, \beta_s \mid w, K) > T \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

Finally, we define η, the mean visibility, to be the metric for measuring the average visibility of P over the entire parameter space:

$$\eta = \int V_b(P, v_P, \beta_s \mid w, K, T) \cdot f(P, v_P, \beta_s) \, dP \, dv_P \, d\beta_s \qquad (2)$$

where f (P, vP, β s) is the prior distribution that can incorporate prior knowl-

edge about the environment – for example, if an application is interested in

locating faces, the likelihood of the head positions and poses are affected by

furnishings and attractions such as television sets and paintings. Except for the

most straightforward environment such as a single camera in a convex environ-

ment discussed in [19], Equation (2) does not admit a closed-form solution.

Nevertheless, it can be estimated by using standard Monte-Carlo sampling

and its many variants.
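Since this estimation step underlies all the experiments that follow, a minimal Python sketch of the Monte-Carlo estimate of η is given below. The function names, the uniform prior and the 10 m × 10 m sampling region are illustrative assumptions of this sketch, not part of the chapter's formulation.

import numpy as np

def estimate_mean_visibility(visibility_b, sampler, num_samples=100000, seed=0):
    # Monte-Carlo estimate of eta in Equation (2) for a fixed camera configuration.
    # visibility_b: callable (P, v_P, beta_s) -> 0 or 1, the thresholded visibility V_b.
    # sampler: callable (rng) -> (P, v_P, beta_s), draws one sample from the prior f.
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(num_samples):
        P, v_P, beta_s = sampler(rng)
        hits += visibility_b(P, v_P, beta_s)
    return hits / num_samples

def uniform_sampler(rng):
    # Uniform prior over a 10 m x 10 m floor; the pose v_P is represented here
    # by its 1-D orientation angle, and beta_s is the occlusion starting position.
    P = rng.uniform(0.0, 10.0, size=2)
    v_P = rng.uniform(0.0, 2.0 * np.pi)
    beta_s = rng.uniform(0.0, 2.0 * np.pi)
    return P, v_P, beta_s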


4 Visibility model for visual tagging

In this section, we present a visibility model for the visual tagging problem.

This model is a specialization of the general model in Section 3. The goal is

to design a visibility function V (P, vP, β s|w, K ) that can measure the perfor-

mance of a camera network in capturing a tag in multiple camera views. We

will first present the geometry for the visibility from one camera and then

show a simple extension to create V (P, vP, β s|w, K ) for arbitrary number of 

cameras.

Given a single camera with the camera center at C , it is straightforward to

see that a tag at P  is visible at C  if and only if the following four conditions

hold:

(1) The tag is not occluded by any obstacle or wall. (Environmental Occlusion)
(2) The tag is within the camera’s field of view. (Field of View)

(3) The tag is not occluded by the person wearing it. (Self-Occlusion)

(4) The tag is not occluded by other moving objects. (Mutual Occlusion)

Thus, we define the visibility function for one camera to be the projected length ||l′|| on the image plane of the line segment l across the tag if the above conditions are satisfied, and zero otherwise.

Figure 2 shows the projection of l, delimited by P_l1 and P_l2, onto the image plane Π. Based on the assumptions that all the tag centers have the same elevation and all tag planes are vertical, we can analytically derive the formulae

11

Page 12: Camera Network Chapter

8/3/2019 Camera Network Chapter

http://slidepdf.com/reader/full/camera-network-chapter 12/41

for P′_l1, P′_l2 as

$$P'_{li} = C + \frac{\langle v_C,\, O - C \rangle}{\langle v_C,\, P_{li} - C \rangle}\,(P_{li} - C) \qquad (3)$$

where ⟨·, ·⟩ denotes the inner product, C is the camera center, O is the center of the image plane Π, and v_C is the camera pose vector. The projected length ||l′|| is simply ||P′_l1 − P′_l2||.
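A direct NumPy transcription of Equation (3) is sketched below, assuming C, O, v_C and the tag endpoints are given as 3-D NumPy arrays; it illustrates the formula and is not the authors' implementation.

import numpy as np

def project_point(P, C, v_C, O):
    # Project world point P through the camera centre C onto the image plane
    # with centre O and normal v_C (Equation (3)).
    t = np.dot(v_C, O - C) / np.dot(v_C, P - C)
    return C + t * (P - C)

def projected_tag_length(P_l1, P_l2, C, v_C, O):
    # Length of the projected segment l' joining P'_l1 and P'_l2.
    return np.linalg.norm(project_point(P_l1, C, v_C, O) -
                          project_point(P_l2, C, v_C, O))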

Fig. 2. Projection of a single tag onto a camera.

After computing the projected length of the tag, we proceed to check the four

visibility conditions as follows:

(1) Environmental Occlusion: We assume that environmental occlusion occurs if the line segment connecting the camera center C with the tag center P intersects some obstacle (a geometric sketch of this test is given after this list). While such an assumption does not take into account partial occlusion, it is adequate for most visual tagging applications where the tag is much smaller than its distance from the


camera. We represent this requirement as the following binary function:

$$\text{chkObstacle}(P, C, K) = \begin{cases} 1 & \text{if no obstacle intersects the line segment } PC \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

Specifically, the obstacles are recorded in K  as a set of oriented verti-

cal planes that describe the boundary wall and obstacles of finite height.

Intersection between the line of sight P C  and each element in K  is com-

puted. If there is no intersection within the confined environment or the

points of intersection are higher than the height of the obstacle, no occlusion occurs due to the environment.

(2) Field of View: Similar to determining environmental occlusion, we declare the tag to be in the field of view if the image P′ of the tag center is within the finite image plane Π. Using a similar derivation as in (3), the image P′ is computed as follows:

$$P' = C + \frac{\langle v_C,\, O - C \rangle}{\langle v_C,\, P - C \rangle}\,(P - C) \qquad (5)$$

We then convert P′ to local image coordinates to determine whether P′ is indeed within Π. We encapsulate this condition using the binary function chkFOV(P, C, v_C, Π, O), which takes the camera intrinsic parameters, tag location and pose vector as input, and returns a binary value indicating whether the center of the tag is within the camera’s field of view.

(3) Self Occlusion: As illustrated in Figure 2, the tag is self-occluded if the angle α between the line of sight to the camera, C − P, and the tag pose vP exceeds π/2. We can represent this condition as a step function U(π/2 − |α|).

(4) Mutual Occlusion: In Section 3, we model the worst-case occlusion using an angle β. As illustrated in Figure 2, mutual occlusion occurs when the tag center or half of the line segment l is occluded. The angle β is subtended at P on the Γ plane. Thus, occlusion occurs if the projection of the line of sight C − P on the Γ plane at P falls within the range [βs, βs + β). We represent this condition using the binary function chkOcclusion(P, C, vP, βs), which returns one for no occlusion and zero otherwise.
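The following is a minimal geometric sketch of the environmental-occlusion test in item (1), assuming each element of K is encoded as a vertical wall segment ((x1, y1), (x2, y2), height); a sight line is treated as blocked only if it crosses a wall at or below the wall's height. This encoding and the helper names are assumptions of the sketch.

import numpy as np

def chk_obstacle(P, C, walls):
    # 1 if the segment from tag centre P to camera centre C (both 3-D NumPy
    # arrays) is not blocked by any vertical wall in `walls`; 0 otherwise.
    for (a, b, height) in walls:
        t = _segment_intersection_2d(P[:2], C[:2],
                                     np.asarray(a, float), np.asarray(b, float))
        if t is None:
            continue                      # no crossing in the ground plane
        z = P[2] + t * (C[2] - P[2])      # elevation of the sight line at the crossing
        if z <= height:
            return 0                      # the wall is tall enough to block the view
    return 1

def _segment_intersection_2d(p0, p1, q0, q1):
    # Parameter t in [0, 1] along p0->p1 where it crosses q0->q1, or None.
    d1, d2, diff = p1 - p0, q1 - q0, q0 - p0
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        return None                       # parallel (or degenerate) segments
    t = (diff[0] * d2[1] - diff[1] * d2[0]) / denom
    u = (diff[0] * d1[1] - diff[1] * d1[0]) / denom
    return t if 0.0 <= t <= 1.0 and 0.0 <= u <= 1.0 else None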

Combining ||l′|| and the four visibility conditions, we define the projected length of an oriented tag with respect to camera Υ as I(P, vP, βs | w, K, Υ), as follows:

$$I(P, v_P, \beta_s \mid w, K, \Upsilon) = \|l'\| \cdot \text{chkObstacle}(P, C, K) \cdot \text{chkFOV}(P, C, v_C, \Pi, O) \cdot U\!\left(\tfrac{\pi}{2} - |\alpha|\right) \cdot \text{chkOcclusion}(P, C, v_P, \beta_s) \qquad (6)$$

where Υ denotes the set of all camera parameters, including Π, O and C. As stated in

Section 3, a threshold version is usually more convenient:

$$I_b(P, v_P, \beta_s \mid w, K, \Upsilon, T) = \begin{cases} 1 & \text{if } I(P, v_P, \beta_s \mid w, K, \Upsilon) > T \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

To extend the single-camera case to multiple cameras, we note that the visibil-

ity of the tag from one camera does not affect the other and thus, each camera

can be treated independently. Assume that the specific application requires a tag to be visible by H or more cameras. The tag at a particular location and orientation is visible if the sum of the I_b() values from all the cameras is at least H at that location. In other words, given N cameras Υ1, Υ2, . . . , ΥN, we define


the threshold visibility function V_b(P, vP, βs | w, K, T) as

$$V_b(P, v_P, \beta_s \mid w, K, T) = \begin{cases} 1 & \text{if } \sum_{i=1}^{N} I_b(P, v_P, \beta_s \mid w, K, \Upsilon_i) \ge H \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

Using this definition of visibility function, we can then compute the mean

visibility η as defined in (2) as a measure of the average likelihood of a random

tag being observed by H  or more cameras. While the specific value of  H 

depends on the application, we will use H  = 2 without loss of generality in

the sequel for concreteness.
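Putting Equations (6)–(8) together, a minimal Python sketch of the per-camera indicator and the multi-camera visibility is shown below. The callables projected_length, chk_obstacle, chk_fov and chk_occlusion stand for the quantities defined above, and the camera dictionary layout is an assumption of this sketch.

import numpy as np

def angle_between(u, v):
    # Unsigned angle between two vectors (used for the self-occlusion test).
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def single_camera_visible(P, v_P, beta_s, cam, projected_length,
                          chk_obstacle, chk_fov, chk_occlusion, T):
    # Thresholded per-camera indicator I_b of Equation (7).
    alpha = angle_between(cam["C"] - P, v_P)
    I = (projected_length(P, cam)
         * chk_obstacle(P, cam["C"])
         * chk_fov(P, cam)
         * (1.0 if alpha <= np.pi / 2 else 0.0)   # U(pi/2 - |alpha|)
         * chk_occlusion(P, cam["C"], v_P, beta_s))
    return 1 if I > T else 0

def multi_camera_visible(P, v_P, beta_s, cameras, H=2, **checks):
    # Multi-camera visibility V_b of Equation (8): 1 if at least H cameras see the tag.
    hits = sum(single_camera_visible(P, v_P, beta_s, cam, **checks) for cam in cameras)
    return 1 if hits >= H else 0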

5 Optimal Camera Placement

The goal of an optimal camera placement is to identify, among all possible

camera network configurations, the one that maximizes the visibility function

given by Equation (8). As Equation (8) does not possess an analytic form, it

is very difficult to apply conventional continuous optimization strategies such

as variational techniques or convex programming. As such, we follow a similar

approach as in [14] by finding an approximate solution over a discretization

of two spaces – the space of possible camera configurations and the space

of tag location and orientation. Section 5.1 describes the discretization of our

parameter spaces. In Sections 5.2 and 5.3, we introduce two BIP formulations, targeting different cost functions, for computing optimal configurations over the discrete environment. A computationally efficient algorithm for solving BIP based on a greedy approach is presented in Section 5.4.


5.1 Discretization of Camera and Tag Spaces

The design parameters for a camera network include the number of cameras,

their 3-D locations, as well as their yaw and pitch angles. The number of 

cameras is either an output discrete variable or a constraint in our formulation.

The elevation of the cameras is usually constrained by the environment. As

such, our optimization does not search for the optimal elevation but rather has the user input it as a fixed value. For simplicity, we assume that all cameras have the same elevation, but it is a simple change in our code to allow different elevation constraints to be used in different parts of the environment. The remaining 4-D camera space – the 2-D location, yaw and pitch angles – is discretized into a uniform lattice gridC of N_c camera grid points, denoted as {Υi : i = 1, 2, . . . , N_c}.
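For concreteness, a minimal Python sketch of how such a camera lattice might be enumerated is shown below; the spacings, angle counts and elevation value are illustrative assumptions, not values taken from the chapter.

import itertools
import numpy as np

def build_camera_grid(xs, ys, yaws, pitches, elevation):
    # Enumerate gridC: every combination of 2-D position (at the fixed,
    # user-supplied elevation), yaw angle and pitch angle is one candidate.
    return [{"x": x, "y": y, "z": elevation, "yaw": yaw, "pitch": pitch}
            for x, y, yaw, pitch in itertools.product(xs, ys, yaws, pitches)]

# Illustrative example: positions on a 1 m grid over a 10 m x 10 m room,
# 8 yaw angles, 3 pitch angles, cameras 2 m above the tag plane.
grid_c = build_camera_grid(np.arange(0.0, 10.1, 1.0),
                           np.arange(0.0, 10.1, 1.0),
                           np.linspace(0.0, 2 * np.pi, 8, endpoint=False),
                           [0.0, -0.2, -0.4],
                           elevation=2.0)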

The unknown parameters about the tag in computing the visibility function

(8) include the location of the tag center P , the pose of the tag vP and the

starting position of the worst-case occlusion angle βs. Our assumptions stated in Section 3 restrict the tag center to a 2-D plane and the pose to a 1-D angle with respect to a reference direction. As for occlusion, our goal is to perform a worst-case analysis so that, as long as the occlusion angle is less than a given β as defined in Section 3, our solution is guaranteed to work no matter where the occlusion is. As such, a straightforward quantization of the starting position βs of the occlusion angle will not work – an occlusion angle of β starting anywhere between grid points would occlude views not accounted for at the grid points. To

simultaneously discretize the space and maintain the guarantee, we select a

larger occlusion angle β m > β  and quantize the starting position of the occlu-

sion angle using a step-size of  β Δ = β m − β . The occlusion angles considered


under this discretization will then be {[iβΔ, iβΔ + βm) : i = 0, . . . , Nβ − 1}, where Nβ = (π − βm)/βΔ. This guarantees that any occlusion angle less than or equal to β will be covered by one of the occlusion angles. Figure 3 shows an example with β = βΔ = π/4 and βm = π/2. Combining these three quantities

Fig. 3. Discretization guaranteeing that an occlusion of less than β = π/4 at any position is covered by one of the three cases: [0, π/2), [π/4, 3π/4) and [π/2, π).

together, we discretize the 4-D tag space into a uniform lattice gridP with N_p tag grid points {Λi : i = 1, 2, . . . , N_p}.
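As an illustration of this discretization, a small Python sketch that enumerates the occlusion intervals is given below. The span over which the starting positions are tiled (here [0, π), matching the three-interval example of Figure 3) is an assumption of this sketch.

import math

def occlusion_intervals(beta, beta_m, span=math.pi):
    # Enumerate the intervals [i*beta_delta, i*beta_delta + beta_m) used to
    # discretise the occlusion starting position; any occlusion of width <= beta,
    # wherever it starts within `span`, falls inside at least one interval.
    beta_delta = beta_m - beta
    intervals, start = [], 0.0
    while start + beta_m <= span + 1e-12:
        intervals.append((start, start + beta_m))
        start += beta_delta
    return intervals

# beta = beta_delta = pi/4 and beta_m = pi/2 reproduce the three cases of
# Figure 3: [0, pi/2), [pi/4, 3*pi/4) and [pi/2, pi).
print(occlusion_intervals(math.pi / 4, math.pi / 2))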

Given a camera grid point Υi and a tag grid point Λj, we can explicitly evaluate the threshold single-camera visibility function (7), which we now rename as I_b(Λj | w, T, K, Υi), with Λj representing the grid point for the space of P, vP and βs; w the size of the tag; T the visibility threshold; K the environmental parameters; and Υi the camera grid point. The numerical values of I_b(Λj | w, T, K, Υi) will then be used in formulating the cost functions and constraints in our optimal camera placement algorithms.

5.2 MIN CAM: Minimizing the number of cameras for a target visibility 

MIN CAM estimates the minimum number of cameras which can provide a

mean visibility η equal to or higher than a given threshold ηt. There are two


This constraint represents the requirement of visual tagging that all tags must

be visible at two or more cameras. As defined in Equation (7), I b(Λ j|w,T,K, Υi)

measures the visibility of tag Λj with respect to a camera at Υi. Second, for each camera location (x, y), we have

$$\sum_{\text{all } \Upsilon_i \text{ at } (x,y)} b_i \le 1 \qquad (12)$$

These are a set of inequalities guaranteeing that only one camera is placed at any spatial location. The optimization problem in (10) with constraints (11) and (12) forms a standard BIP problem.
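The explicit objective (10) and visibility constraint (11) appear on a page not reproduced above; the sketch below reconstructs the MIN CAM program from the surrounding description – binary placement variables b_i, every tag grid point visible from at least two cameras, at most one camera per spatial location, and a minimal camera count – and uses the PuLP modelling library purely as an illustrative stand-in for the authors' lp_solve setup.

import pulp

def min_cam(I_b, location_groups):
    # I_b[j][i] = 1 if tag grid point j is visible (per Equation (7)) from a
    # camera placed at camera grid point i; location_groups lists, for every
    # spatial (x, y) cell, the indices of camera grid points sharing that cell.
    N_p, N_c = len(I_b), len(I_b[0])
    prob = pulp.LpProblem("MIN_CAM", pulp.LpMinimize)
    b = [pulp.LpVariable(f"b_{i}", cat="Binary") for i in range(N_c)]
    prob += pulp.lpSum(b)                                        # minimise the camera count
    for j in range(N_p):                                         # every tag seen by >= 2 cameras
        prob += pulp.lpSum(I_b[j][i] * b[i] for i in range(N_c)) >= 2
    for group in location_groups:                                # constraint (12)
        prob += pulp.lpSum(b[i] for i in group) <= 1
    prob.solve()
    if pulp.LpStatus[prob.status] != "Optimal":
        return None                                              # infeasible for this tag grid
    return [i for i in range(N_c) if pulp.value(b[i]) > 0.5]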

The solution to the above BIP problem obviously depends on the selection of 

grid points in gridP  and gridC . While gridC  is usually predefined according

to the constraint of the environment, there is no guarantee that a tag at a

random location can be visible by two cameras even if there is a camera at

every camera grid point. Thus, tag grid points must be placed intelligently –

tag grid points away from obstacles and walls are usually easier to observe. On

the other hand, focusing only on areas away from the obstacles may produce

a subpar result when measured over the entire environment. To balance the

two considerations, we solve the BIP repeatedly over a progressively refined

gridP  over the spatial dimensions until the target ηt, measured over the entire

continuous environment, is satisfied. One possible refinement strategy is to have gridP start from a single grid point at the middle of the environment and grow uniformly in density within the interior of the environment, while remaining at least one interval away from the boundary. If the BIP fails to

return a solution, the algorithm will randomly remove half of the newly added

tag grid points. The iteration terminates when the target ηt is achieved or all

the newly added grid points are removed. The above process is summarized


in Algorithm 1.

Input: initial grid points for cameras gridC and tags gridP, target ηt, maximum grid density maxDensity
Output: camera placement camPlace
Set η = 0, newP = ∅;
while η ≤ ηt do
    foreach Υi in gridC do
        foreach Λj in gridP ∪ newP do
            Calculate I_b(Λj | w, T, K, Υi);
        end
    end
    Solve newCamPlace = BIP_solver(gridC, gridP ∪ newP, I_b);
    if newCamPlace == ∅ then
        if |newP| == 1 then
            break and return failure;
        end
        Randomly remove half of the elements from newP;
    else
        camPlace = newCamPlace;
        gridP = gridP ∪ newP;
        newP = new grid points created by halving the spatial separation of gridP;
        newP = newP \ gridP;
        Calculate η for camPlace by Monte-Carlo sampling;
    end
end
Algorithm 1: MIN CAM algorithm


5.3 FIX CAM: Maximizing the visibility for a given number of cameras

A drawback of MIN CAM is that it may need a large number of cameras in

order to satisfy the visibility of all tag grid points. If the goal is to maximize

the average visibility, a sensible way to reduce the number of cameras is to allow a small portion of the tag grid points not to be observed by two or more cameras. The selection of these tag grid points should be dictated by the

distribution of the occupant traffic f (P, vP, β s) used in computing the average

visibility as described in Equation (2). FIX CAM is the algorithm that does

precisely that.

We first define a set of binary variables {x_j : j = 1, . . . , N_p} on the tag grid, indicating whether a tag at the jth tag grid point in gridP is visible at two or more cameras. We also assume a prior distribution {ρ_j : j = 1, . . . , N_p, Σ_j ρ_j = 1} that describes the probability of having a person at each tag grid point. The cost function, defined to be the average visibility over the discrete space, is given as follows:

$$\max_{b_i} \sum_{j=1}^{N_p} \rho_j x_j \qquad (13)$$

The relationship between the camera placement variables b_i as defined in (9) and the visibility performance variables x_j can be described by the following constraints. For each tag grid point Λ_j, we have

$$\sum_{i=1}^{N_c} b_i I_b(\Lambda_j \mid w, T, K, \Upsilon_i) - (N_c + 1)x_j \le 1 \qquad (14)$$

$$\sum_{i=1}^{N_c} b_i I_b(\Lambda_j \mid w, T, K, \Upsilon_i) - 2x_j \ge 0 \qquad (15)$$

These two constraints effectively define the binary variable x j : if  x j = 1,


Inequality (15) becomes

$$\sum_{i=1}^{N_c} b_i I_b(\Lambda_j \mid w, T, K, \Upsilon_i) \ge 2,$$

which means that a feasible solution of the b_i's must have the tag visible at two or more cameras. Inequality (14) becomes

$$\sum_{i=1}^{N_c} b_i I_b(\Lambda_j \mid w, T, K, \Upsilon_i) \le N_c + 2,$$

which is always satisfied – the largest possible value of the left-hand side is N_c, corresponding to the case when there is a camera at every grid point and every tag point is observable by two or more cameras. If x_j = 0, Inequality (14) becomes

$$\sum_{i=1}^{N_c} b_i I_b(\Lambda_j \mid w, T, K, \Upsilon_i) \le 1,$$

which implies that the tag is not visible by two or more cameras. Inequality (15) is always satisfied as it becomes

$$\sum_{i=1}^{N_c} b_i I_b(\Lambda_j \mid w, T, K, \Upsilon_i) \ge 0.$$

Two additional constraints are needed to complete the formulation. First, as the cost function focuses only on visibility, we need to constrain the number of cameras to be no more than a maximum number m:

$$\sum_{j=1}^{N_c} b_j \le m \qquad (16)$$

We also keep the constraint in (12) to ensure only one camera is used at each

spatial location.
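A sketch of the full FIX CAM program in the same style as before follows; PuLP is again used here purely as an illustrative stand-in for the lp_solve interface used by the authors, and the matrix/list encodings are assumptions of the sketch.

import pulp

def fix_cam(I_b, rho, location_groups, m):
    # Maximise the traffic-weighted coverage (13) subject to the linking
    # constraints (14)-(15), the camera budget (16) and constraint (12).
    N_p, N_c = len(I_b), len(I_b[0])
    prob = pulp.LpProblem("FIX_CAM", pulp.LpMaximize)
    b = [pulp.LpVariable(f"b_{i}", cat="Binary") for i in range(N_c)]
    x = [pulp.LpVariable(f"x_{j}", cat="Binary") for j in range(N_p)]
    prob += pulp.lpSum(rho[j] * x[j] for j in range(N_p))        # objective (13)
    for j in range(N_p):
        cover = pulp.lpSum(I_b[j][i] * b[i] for i in range(N_c))
        prob += cover - (N_c + 1) * x[j] <= 1                    # (14)
        prob += cover - 2 * x[j] >= 0                            # (15)
    prob += pulp.lpSum(b) <= m                                   # (16)
    for group in location_groups:                                # (12)
        prob += pulp.lpSum(b[i] for i in group) <= 1
    prob.solve()
    cameras = [i for i in range(N_c) if pulp.value(b[i]) > 0.5]
    return cameras, pulp.value(prob.objective)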


5.4 GREEDY: Greedy Algorithm to speed up BIP 

BIP is a well-studied NP-hard combinatorial problem, and schemes such as branch-and-bound are already implemented in software libraries such as lp_solve [20]. However, even these algorithms can be computationally intensive if the search space is large. In this section, we introduce a simple greedy algorithm, GREEDY, that can be used for both MIN CAM and FIX CAM.

Besides experimentally showing the effectiveness of GREEDY, we believe that

the greedy approach is an appropriate approximation strategy due to the

similarity of our problem to the set cover problem.

In the set cover problem, items can belong to multiple sets and the optimiza-

tion goal is to minimize the number of sets needed to cover all the items. While finding the optimal solution to set covering is an NP-hard problem [21], it has been shown that the greedy approach is essentially the best one can do to obtain an approximate solution [22]. We can draw a parallel between our problem and the set cover problem by considering each tag grid point as an item

“belonging” to a camera grid point if the tag is visible at that camera. The set

cover problem then minimizes the number of cameras needed, which is almost

identical to MIN CAM except for the fact that visual tagging requires each tag 

to be visible by two or more cameras. The FIX CAM algorithm further allows

some of the tag points not to be covered at all. While it is still an open problem whether these properties can be incorporated into the framework of set covering, our experimental results demonstrate that the greedy approach is a reasonable solution to our problem. The GREEDY algorithm is described in

Algorithm 2.


Input: initial grid points for cameras gridC and tags gridP, target mean visibility ηt and the maximum number of cameras m
Output: camera placement camPlace
Set U = gridC, V = ∅, W = gridP, camPlace = ∅;
while |V| < ηt · |gridP| and |camPlace| < m do
    c = the camera grid point in U that maximizes the number of visible tag grid points in W;
    camPlace = camPlace ∪ {c};
    S = the subset of gridP that is visible by two or more cameras in camPlace;
    V = V ∪ S;
    W = W \ S;
    Remove c and all camera grid points in U that share the same spatial location as c;
    if U == ∅ then
        camPlace = ∅;
        return;
    end
end
Output camPlace
Algorithm 2: GREEDY: greedy search camera placement algorithm

In each round of the GREEDY algorithm, the camera grid point that sees the largest number of tag grid points is selected, and all the tag grid points visible at two or more cameras are removed. When using GREEDY to approximate MIN CAM, we no longer need to refine the tag grid to keep the computation manageable. We can start with a fairly dense tag grid and set

the camera bound m to infinity. The GREEDY algorithm will terminate if 

the estimated mean visibility reaches the target ηt. When using GREEDY to


approximate FIX CAM, ηt will be set to one and the GREEDY algorithm

will terminate when the number of cameras reaches the upper bound m as

required by FIX CAM.
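The following is a compact Python rendering of Algorithm 2. The data layout – a 0/1 visibility matrix over the discrete grids and a per-candidate spatial-location label – is an assumed encoding of the quantities computed in Section 5.1.

def greedy_placement(I_b, location_of, eta_t=1.0, m=float("inf"), H=2):
    # I_b[j][i]   : 0/1 visibility of tag grid point j from camera grid point i
    # location_of : location_of[i] identifies the spatial (x, y) cell of candidate i
    N_p, N_c = len(I_b), len(I_b[0])
    available = set(range(N_c))      # U: usable camera grid points
    uncovered = set(range(N_p))      # W: tag points not yet seen by H cameras
    seen = [0] * N_p                 # number of chosen cameras seeing each tag point
    placement = []                   # camPlace
    while len(uncovered) > (1.0 - eta_t) * N_p and len(placement) < m:
        if not available:
            return []                # no feasible placement with the remaining candidates
        best = max(available, key=lambda i: sum(I_b[j][i] for j in uncovered))
        placement.append(best)
        for j in range(N_p):
            seen[j] += I_b[j][best]
        uncovered = {j for j in uncovered if seen[j] < H}
        available = {i for i in available if location_of[i] != location_of[best]}
    return placement

Setting m to infinity and eta_t to the target visibility approximates MIN CAM, while setting eta_t to one and m to the camera budget approximates FIX CAM, as described above.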

6 Experimental Results

In this section, we present both simulation and realistic camera network re-

sults to demonstrate the proposed algorithms. In Section 6.1, we show various

properties of MIN CAM, FIX CAM and GREEDY by varying different model

parameters. In Section 6.2, we compare the optimal camera configurations

computed by our techniques with other camera configurations.

6.1 Optimal camera placement simulation experiments

All the simulations in this section assume a room of dimension 10 m × 10 m with a single obstacle, and a square tag with edge length w = 20 cm. For the camera and lens models, we assume a pixel width of 5.6 μm, a focal length of 8 mm and a field of view of 60 degrees. These parameters closely resemble the real cameras that we use in the real-life experiments. The threshold T for visibility is set to five pixels, which we find to be an adequate threshold for our color-tag detector.
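To relate these numbers, a back-of-the-envelope pinhole calculation is sketched below; the 8 mm focal length follows the correction above, and the viewing distances are illustrative values only.

def tag_size_pixels(tag_edge_m, distance_m, focal_mm=8.0, pixel_um=5.6):
    # Approximate image size, in pixels, of a frontally viewed tag edge (pinhole model).
    return (focal_mm * 1e-3) * tag_edge_m / (distance_m * pixel_um * 1e-6)

# A 0.2 m tag edge seen from 5 m projects to roughly 57 pixels, and from 10 m
# (the far side of the room) to roughly 29 pixels -- well above the threshold T = 5.
print(tag_size_pixels(0.2, 5.0), tag_size_pixels(0.2, 10.0))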

6.1.1 Performance of MIN CAM 

We first study how MIN CAM estimates the minimum number of cameras

for a target mean visibility ηt through tag grid refinement. For simplicity, we


Fig. 4. Four iterations of MIN CAM: (a)–(d) iterations 1 to 4, with corresponding (e)–(h) η = 0.4743, 0.7776, 0 and 0.9107.

6.1.2 FIX CAM versus MIN CAM 

In the second experiment, we demonstrate the difference between FIX CAM

and MIN CAM. Using the same environment as in Figure 4(c), we run FIX CAM

to maximize the performance with eleven cameras. The traffic model ρ j is set

to be uniform. MIN CAM fails to return a solution under this dense grid and

after randomly discarding some of the tag grid points, outputs η = 0.9107

using eleven cameras. On the other hand, without any random tuning of the

tag grid, FIX CAM returns a solution of η = 0.9205 and the results are shown

in Figures 5(a) and 5(b). When we reduce the number of cameras to ten and

rerun FIX CAM, we manage to produce η = 0.9170 which still exceeds the

results from MIN CAM. This demonstrates that we can use FIX CAM to

fine-tune the approximate result obtained by MIN CAM. The camera con-

figuration and the visibility distribution using ten cameras are shown in Figures 5(c) and 5(d), respectively.


Fig. 5. Figures 5(a) to 5(d) show the results of using FIX CAM (FC): 11 cameras with η = 0.9205, and 10 cameras with η = 0.9170. Figures 5(e) to 5(h) show the same set of experiments using GREEDY (G) as an approximation to FIX CAM: 11 cameras with η = 0.9245, and 10 cameras with η = 0.9199.

6.1.3 GREEDY Implementation of FIX CAM 

Using the same setup, we repeat our FIX CAM experiments using the GREEDY

implementation. Our algorithm is implemented in MATLAB version 7.0 on a Xeon 2.1 GHz machine with 4 gigabytes of memory. The BIP solver inside the FIX CAM algorithm is based on lp_solve [20]. We have tested both algorithms with the maximum number of cameras set to eleven, ten, nine and eight. While changing the number of cameras does not change the number of constraints, the

search space becomes more restrictive as we reduce the number of cameras.

As such, it is progressively more difficult to prune the search space, making

the solver resemble that of an exhaustive search. The results are summarized

in Table 1. For each run, three numerical values are reported: the fraction of tag points visible to two or more cameras, which is the cost function being optimized; the running time; and the mean visibility estimated by Monte


Carlo simulations. At eight cameras, GREEDY is 30,000 times faster than lp_solve while covering only 3% fewer visible tag points than the exact answer. It is also worthwhile to point out that lp_solve fails to terminate when we refine the tag grid by halving the step size in each dimension, while GREEDY uses

essentially the same amount of time. The placement and visibility maps of 

the GREEDY algorithm that mirror those from FIX CAM are shown in the

second row of Figure 5.

Table 1
Comparison between lp_solve and GREEDY

No. cameras | lp_solve: Visible Tags / Time (s) / η | GREEDY: Visible Tags / Time (s) / η
Eleven      | 0.99 / 1.20 / 0.9205                  | 0.98 / 0.01 / 0.9245
Ten         | 0.98 / 46.36 / 0.9170                 | 0.98 / 0.01 / 0.9199
Nine        | 0.97 / 113.01 / 0.9029                | 0.97 / 0.01 / 0.8956
Eight       | 0.96 / 382.72 / 0.8981                | 0.94 / 0.01 / 0.8761

6.1.4 Elevation of tags and cameras

Armed with an efficient greedy algorithm, we can explore various modeling

parameters in our framework. An assumption we made in the visibility model

is that all the tag centers are in the same horizontal plane. This does not reflect the real world because of the differing heights of individuals. In the following

experiment, we examine the impact of the variation in height on the perfor-

mance of a camera placement. Using the camera placement in Figure 5(g),


we simulate five different scenarios: the height of each person is 10 cm or 20

cm taller/shorter than the assumed height, as well as heights randomly drawn

from a bi-normal distribution based on U.S. census data [23]. The changes in

the average visibility are shown in Table 2. They range from −3.8% to −1.2%, which indicates that our assumption does not have a significant impact on the measured visibility.

Table 2
Effect of height variation on η

Height model (cm) | +20   | −20   | +10   | −10   | Random
Change in η       | −3.8% | −3.3% | −1.2% | −1.5% | −1.3%

Next, we consider the elevation of the cameras. In typical camera networks,

cameras are usually installed at elevated positions to mitigate occlusion. The

drawback of elevation is that the camera has a smaller effective field of view than when it is at the same elevation as the tags. By adjusting the pitch angle of an elevated camera, we can selectively move the field of view to various parts of the environment. As we now add one additional dimension, the pitch angle, the optimization becomes significantly more difficult and the GREEDY algorithm must be used. Figure 6 shows the result for m = 10

cameras with three different elevations above the Γ plane on which the centers

of all the tags are located. As expected, the mean visibility decreases as we raise the cameras. The visibility maps in Figures 6(d), 6(e) and 6(f) show that as

the cameras are elevated, the coverage near the boundary drops but the center

remains well-covered as the algorithm adjusts the pitch angles of the cameras.


Fig. 6. Camera planning and Monte-Carlo simulation results when the cameras are elevated 0.4, 0.8 and 1.2 m above the tags, with η = 0.9019, 0.8714 and 0.8427, respectively.

6.1.5 Mutual Occlusion 

We present simulation results to show how our framework deals with mutual

occlusion. Recall that we model occlusion as an occlusion angle of  β  at the

tag. Similar to the experiments on camera elevation, our occlusion model adds

an additional dimension to the tag grid and thus we have to resort to the

GREEDY algorithm. We would like to investigate how occlusion affects the

number of cameras and the camera positions of the output configuration. As

such, we use GREEDY to approximate MIN CAM by identifying the minimum

number of cameras to achieve a target level of visibility. We use a denser tag

grid than before to minimize the difference between the actual mean visibility

and that estimated by GREEDY over the discrete tag grid. The tag grid we

use is 16 × 16 spatially with 16 different orientations. We set the target to be ηt = 0.8 and test occlusion angles β of 0°, 22.5° and 45°. As explained


earlier in Section 5.1, our discretization uses a slightly larger occlusion angle to guarantee the worst-case analysis – we use βm = 32.5° for β = 22.5° and βm = 65° for β = 45°. In the Monte-Carlo simulation, we put the occlusion angle at a random position for each sample point. The results are shown in Figure

7. We can see that even with the number of cameras increasing from six to eight to twelve, the resulting mean visibility still drops slightly as the occlusion angle increases. Another interesting observation from the visibility maps in Figures 7(d), 7(e) and 7(f) is that the region with perfect visibility, indicated by the white pixels, dwindles as occlusion increases. This is reasonable because it is difficult for a tag to be visible at all orientations in the presence of occlusion.

Fig. 7. As the occlusion angle increases from 0° in Figure 7(a) to 22.5° in Figure 7(b) and 45° in Figure 7(c), the required number of cameras increases from 6 to 8 to 12 when using GREEDY to achieve a target performance of ηt = 0.8. Figures 7(d) to 7(f) are the corresponding visibility maps, with η = 0.8006, 0.7877 and 0.7526, respectively.


6.1.6 Realistic Occupant Traffic Distribution 

In this last experiment, we show how one can incorporate realistic occupant

traffic patterns into the FIX CAM algorithm. All experiments thus far assume

a uniform traffic distribution over the entire tag space – it is equally likely to

find a person at each spatial location and at each orientation. This model does

not reflect many real-life scenarios. For example, consider a hallway inside a

shopping mall: while there are people browsing at the window display, most of 

the traffic flows from one end of the hallway to the other end. By incorporating

an appropriate traffic model, the performance should be improved under the

same resource constraint. In the FIX CAM framework, a traffic model can

be incorporated into the optimization by using non-uniform weights ρ j in the

cost function (13).

In order to use a reasonable traffic distribution, we employ a simple random

walk model to simulate a hallway environment. We imagine that there are

openings on either side of the top portion of the environment. At each of

the tag grid point, which is characterized by both the orientation and the po-

sition of a walker, we impose the following transitional probabilities: a walker

has a 50% chance of moving to the next spatial grid point following the cur-

rent orientation unless it is obstructed by an obstacle, and has a 50% chance

of changing orientation. In the case of changing orientation, there is a 99%

chance of choosing the orientation to face the tag grid point closest to the

nearest opening while the rest of the orientations share the remaining 1%. At

those tag grid points closest to the openings, we create a virtual grid point

to represent the event of a walker exiting the environment. The transitional

probabilities from the virtual grid point back to the real tag points near the

openings are all equal. The stationary distribution ρ j is then computed by


finding the eigenvector of the transitional probability matrix of the entire environment with eigenvalue equal to one [24, ch. 11.3].
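As an illustration of this last step, a small NumPy sketch of extracting the stationary distribution from a given transition matrix is shown below; the construction of the matrix itself (including the virtual exit state) is application-specific and omitted, and the weights ρ_j are then the entries of this vector at the real (non-virtual) tag grid points, renormalised to sum to one.

import numpy as np

def stationary_distribution(T):
    # T[a, b] = probability of moving from state a to state b (rows sum to one).
    # Returns the left eigenvector of T with eigenvalue 1, normalised to sum to one.
    w, v = np.linalg.eig(T.T)              # right eigenvectors of T^T = left eigenvectors of T
    k = int(np.argmin(np.abs(w - 1.0)))    # eigenvalue closest to one
    pi = np.abs(np.real(v[:, k]))
    return pi / pi.sum()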

Figure 8(a) shows this hallway environment. The four hollow circles indicate

the tag grid points closest to the openings. The result of the optimization

under the constraint of using four cameras is shown in Figure 8(b). Clearly

the optimal configuration favors the heavy traffic hallway area. If the uniform

distribution is used instead, we obtain the configuration in Figure 8(c) and the

visibility map in Figure 8(d). The average visibility drops from 0.8395 to 0.7538 because of the mismatch with the actual traffic pattern.

Fig. 8. Figures 8(a) and 8(b) use the random-walk traffic distribution for optimization and obtain a higher η (0.8395) compared to using a uniform distribution in Figures 8(c) and 8(d) (η = 0.7538).

6.2 Comparison with other camera placement strategies

In this section, we compare our optimal camera placements with two differ-

ent placement strategies. The first one is uniform placement – assuming that

the cameras are restricted along the boundary of the environment, the most

intuitive scheme is to place them at regular intervals on the boundary, each

pointing towards the center of the room. The second one is based on the


optimal strategy proposed in [14].

To test the differences in visibility models, it would be unfair to use Monte-Carlo simulations, which use the same model as the optimization. As a result, we resort to virtual environment simulations by creating a virtual 3-D environment that mimics the actual 10 m × 10 m room used in Section 6.1. We then

insert a random-walking humanoid wearing a red tag. The results are based

on the visibility of the tag in two or more cameras. The cameras are set at

the same height as the tag and no mutual occlusion modeling is used. The

optimization is performed with respect to a fixed number of cameras. To be

fair to the scheme in [14], we run their optimization formulation to maximize

the visibility from two cameras. The measurements of  η for the three schemes

with the number of cameras varied from five to eight are shown in Table 3.

Our proposed FIX CAM performs the best, followed by the uniform placement. The scheme in [14] does not perform well as it does not take into account the orientation of the tag. As such, the cameras do not compensate for each other when the tag is in different orientations.

Table 3
η measurements among the three schemes using virtual simulations

Number of cameras | FIX CAM       | [14]          | Uniform Placement
5                 | 0.614 ± 0.011 | 0.352 ± 0.010 | 0.522 ± 0.011
6                 | 0.720 ± 0.009 | 0.356 ± 0.010 | 0.612 ± 0.011
7                 | 0.726 ± 0.009 | 0.500 ± 0.011 | 0.656 ± 0.010
8                 | 0.766 ± 0.008 | 0.508 ± 0.011 | 0.700 ± 0.009


We are, however, surprised by how close uniform placement is to our optimal

scheme. Thus, we further test the difference between the two with a real-

life experiment that incorporates mutual occlusion. We conduct our real-life experiments indoors in a room 7.6 meters long, 3.7 meters wide and 2.5 meters high. There are two desks and a shelf along three of the four walls. Seven Unibrain Fire-i400 cameras at an elevation of 1.5 meters with Tokina varifocal TVR0614 lenses are used. Since they are variable focal-length lenses, we have set them at a focal length of 8 mm with a vertical field of view of 45° and a horizontal field of view of 60°. As the elevation of the cameras is roughly level

with the position of the tags, we have chosen a fairly large occlusion angle of β m = 65◦ in deriving our optimal placement. Monte-Carlo results between the

uniform placement and the optimal placement are shown in Figure 9. For the

virtual environment simulation, we insert three randomly walking humanoids

and capture 250 frames for measurement. For the real-life experiments, we

capture about two minutes of video from the seven cameras, again with three

persons walking in the environment. Figures 10 and 11 show the seven real-life and virtual camera views for the uniform placement and the optimal placement, respectively. As shown in Table 4, the optimal camera placement is

better than the uniform camera placement in all three evaluation approaches.

The three measured η’s for the optimal placement are consistent. The results

of the uniform placement have higher variation, most likely because the excessive amount of occlusion makes detection of the color tags less reliable.


Table 4
η measurements between uniform and optimal camera placements

Method  | MC Simulations | Virtual Simulation | Real-life Experiments
Uniform | 0.3801         | 0.4104 ± 0.0153    | 0.2335 ± 0.0112
Optimal | 0.5325         | 0.5618 ± 0.0156    | 0.5617 ± 0.0121

Fig. 9. Camera placement in a real camera network: (a), (c) uniform placement with η = 0.3801; (b), (d) optimal placement with η = 0.5325.

Fig. 10. Seven camera views from uniform camera placement

7 Conclusions and Future work

In this chapter, we have proposed a framework for modeling, measuring and optimizing the placement of multiple cameras. By using a camera placement metric that captures both self and mutual occlusion in 3-D environments, we


Fig. 11. Seven camera views from optimal camera placement

have proposed two optimal camera placement strategies that complement each other using grid-based binary integer programming. To deal with the computa-

tional complexity of BIP, we have also developed a greedy strategy to approx-

imate both of our optimization algorithms. Experimental results have been

presented to verify our model and to show the effectiveness of our approaches.

There are many interesting issues in our proposed framework and visual tag-

ging in general that deserve further investigation. The incorporation of models

for different visual sensors such as omnidirectional and PTZ cameras or even

non-visual sensors and other output devices such as projectors is certainly a

very interesting topic. The optimality of our greedy approach would benefit from a detailed theoretical study. Last but not least, the use of visual tagging

in other application domains such as immersive environments and surveillance

visualization should be further explored.

References

[1] J. O’Rourke, Art Gallery Theorems and Algorithms, Oxford University Press,

1987.

[2] J. Urrutia, Art Gallery And Illumination problems, Amsterdam: Elsevier

Science, 1997.


[3] T. Shermer, Recent results in art galleries, Proceedings of the IEEE 80 (9)

(1992) 1384–1399.

[4] V. Chvatal, A combinatorial theorem in plane geometry, Journal of 

Combinatorial Theory Series B 18 (1975) 39–41.

[5] D. Lee, A. Lin, Computational complexity of art gallery problems, IEEE

Transactions on Information Theory 32 (1986) 276–282.

[6] D. Yang, J. Shin, A. Ercan, L. Guibas, Sensor tasking for occupancy reasoning in

a camera network, in: 1st Workshop on Broadband Advanced Sensor Networks

(BASENETS), IEEE/ICST, 2004.

[7] P.-P. Vazquez, M. Feixas, M. Sbert, W. Heidrich, Viewpoint selection using

viewpoint entropy, in: Proceedings of the Vision Modeling and Visualization

Conference (VMV01), IOS Press, Amsterdam, 2001, pp. 273–280.

[8] J. Williams, W.-S. Lee, Interactive virtual simulation for multiple camera

placement, in: International Workshop on Haptic Audio Visual Environments

and their Applications, IEEE, 2006, pp. 124–129.

[9] S. Ram, K. R. Ramakrishnan, P. K. Atrey, V. K. Singh, M. S. Kankanhalli,

A design methodology for selection and placement of sensors in multimedia

surveillance systems, in: VSSN ’06: Proceedings of the 4th ACM international

workshop on Video surveillance and sensor networks, ACM Press, New York,

NY, USA, 2006, pp. 121–130.

[10] R. Bodor, A. Drenner, P. Schrater, N. Papanikolopoulos, Optimal camera

placement for automated surveillance tasks, Journal of Intelligent and Robotic

Systems 50 (2007) 257–295.

[11] A. Mittal, L. S. Davis, A general method for sensor planning in multi-sensor

systems: Extension to random occlusion, International Journal of Computer

Vision 76 (1) (2008) 31–52.


[12] A. Ercan, D. Yang, A. E. Gamal, L. Guibas, Optimal placement and selection

of camera network nodes for target localization, in: International Conference on

Distributed Computing in Sensor System, Vol. 4026, IEEE, 2006, pp. 389–404.

[13] E. Dunn, G. Olague, Pareto optimal camera placement for automated visual

inspection, in: International Conference on Intelligent Robots and Systems,

IEEE/RSJ, 2005, pp. 3821–3826.

[14] E. Horster, R. Lienhart, On the optimal placement of multiple visual sensors,

in: VSSN ’06: Proceedings of the 4th ACM international workshop on Video

surveillance and sensor networks, ACM Press, New York, NY, USA, 2006, pp.

111–120.

[15] M. A. Hasan, K. K. Ramachandran, J. E. Mitchell, Optimal placement of stereo

sensors, Optimization Letters 2 (2008) 99–111.

[16] U. M. Erdem, S. Sclaroff, Optimal placement of cameras in floorplans to satisfy

task-specific and floor plan-specific coverage requirements, Computer Vision

and Image Understanding 103 (3) (2006) 156–169.

[17] K. Chakrabarty, S. Iyengar, H. Qi, E. Cho, Grid coverage of surveillance and

target location in distributed sensor networks, IEEE Transaction on Computers

51(12) (2002) 1448–1453.

[18] G. Sierksma, Linear and Integer Programming: Theory and Practice, Marcel

Dekker Inc., 2002.

[19] J. Zhao, S.-C. Cheung, Multi-camera surveillance with visual tagging and

generic camera placement, in: International Conference on Distributed Smart

Cameras, ICDSC07, ACM/IEEE, 2007.

[20] Introduction to lp_solve 5.5.0.10, http://lpsolve.sourceforge.net/5.5/.

[21] R. M. Karp, Reducibility among combinatorial problems, Complexity of 

Computer Computations (1972) 85–103.
