A Mixed-Reality System for Broadcasting Sports Video to Mobile Devices

Jungong Han, Dirk Farin, and Peter H.N. de With
University of Technology Eindhoven

Transmitting camera parameters and additional information enables the generation of a mixed-reality presentation of sports on mobile devices.
Watching sports, such as tennis and soccer, remains popular for a broad class of consumers. However, audiences might not be able to enjoy their favorite games on a television or PC when they are traveling, so mobile devices are increasingly used for watching sports video. Unfortunately, watching sports on a mobile device is not as simple as watching sports on TV. Bandwidth limitations on wireless networks prevent high bit-rate video transmission. In addition, small displays lose visual details of the sports event. Bandwidth limitations occur primarily when multiple users want to stream video over the same wireless link; these bottlenecks are likely to occur because of the popularity of an event. See the "Other Approaches" sidebar for examples of existing systems.
This article describes a camera modeling-based, mixed-reality system concept. The idea is to build a 3D model for the sports video, where all parameters of this model can be obtained by analyzing the broadcast video. Instead of sending the original images to the mobile device, the system only sends the parameters of the 3D model and the information about the players and ball, which significantly saves transmission bandwidth. Additionally, because we have full-quality information about the camera modeling and about the players and ball, the mobile client is able to recover the important information without loss of visual detail. Furthermore, we can generate virtual scenes for less important areas, such as the playing field and the commercial billboards, without changing the major story of the sports game.
Moreover, by changing the parameters of the original camera, a variety of mixed-reality scenes can be synthesized to better visualize a scene on the mobile device. For example, in a tennis video captured by a long-shot camera, the important objects (for example, players) might not be clearly visible on the small LCD panel, but a mixed-reality presentation of a zoomed version of the original scene might provide a better visualization. The concept presented here fully relies on an accurate 3D modeling of the scene. For this reason, we also contribute techniques for precisely extracting camera parameters. In our system, a probabilistic method based on the Expectation-Maximization (EM) algorithm finds the optimal feature points, thereby enabling the automatic acquisition of the camera parameters from the sports video (tennis, badminton, and volleyball) with sufficiently high accuracy.

Other Approaches

Most existing systems assume that video transmission and display are two different topics, so they are treated independently. Chang, Zhong, and Kumar developed an adaptive streaming system for sports videos over resource-limited networks or devices.1 The rate adaptation is content-based and varies dynamically according to the event structure and the video content. For example, in wireless streaming of baseball video, such a system can send full-quality frames showing pitching and important follow-up activities, but change to a low-bandwidth mode, using only keyframes, during unimportant video segments.

Knoche, McCarthy, and Sasse aim to couple the applied image resolution to the display size of mobile devices.2 This inevitably leads to a reduced resolution, which in turn yields a loss of visual details. This work concluded that field sports, such as soccer and tennis, suffer more clearly in the user experience from the reduced resolution than music, news, and animation. In Seo et al., an intelligent display of the soccer game on a mobile device is presented, where the region of interest is extracted and magnified for display.3 This gives viewers a more comfortable experience in understanding what is happening in the scene.

The newer video-compression standards, such as MPEG-4, suggest considering the video content when encoding the video sequence. More specifically, MPEG-4 proposes an object-based presentation, which allocates more bits to encode moving objects. In this concept, the moving object is assumed to be more important in a scene. Given the limited bandwidth, these standards might lead to a quality enhancement of the reconstructed video at the decoder, because moving objects (the important parts of a scene) will be clearly recovered, with visual losses only in the background parts.

Apparently, this idea is rather generic and can be used for many different applications. However, it's only a conceptual solution, which can't be easily realized due to the difficulty of designing a generic object-extraction algorithm. Additionally, neither 3D camera-modeling techniques nor mixed-reality techniques are taken into account in this framework. The combination of these two techniques can provide the 3D position of the moving object, and it also allows users to create virtual views and backgrounds in terms of their own interests.

The current status of mixed-reality techniques for sports video can be divided into two categories. The research in the first category focuses on generating virtual scenes by means of multiple, synchronized video sequences of a given sports game.4,5 However, it's difficult and expensive to apply such systems for TV broadcasting in the current broadcasting framework, because only single-viewpoint video is available to the viewers at any time. The second category aims at synthesizing virtual sports scenes from a single-view, TV-broadcast video, which is the focus of the main article text.

In Matsui et al., the proposed system performs a camera-calibration algorithm to establish a mapping between the soccer playing field in the image and that of the virtual scene.6 The player's posture is selected from three basic choices (stop, walk, or run) using the player's motion direction and speed. Finally, computer-graphics techniques (that is, OpenGL) are employed to generate an animated scene from the viewpoint of any player.

The work reported in Liang et al.7 is an improved version of Matsui et al.,6 where a more advanced tracking approach for the players and ball is realized. Such systems still suffer from two problems. First, the so-called camera calibration only builds a 2D homography mapping without providing the exact camera parameters. As a result, the virtual scene generated by this technique might be quite different from the original scene. Second, the graphics-based animation is unfortunately not very realistic, because the texture and motion of the player are lost completely.

In our previous work,8 we discussed the generation of virtual scenes from broadcast sports video by using a real camera-calibration technique, where all the camera parameters can be obtained. The current article explores this previous work to solve the problems of bandwidth limitation and small display size for sports video on mobile devices. As far as we know, there is no system so far that uses a (camera) modeling-based, mixed-reality concept to facilitate mobile applications of sports video.

Several publications have been devoted to camera calibration for sports video.4,9-11 Liu et al.9 propose a self-calibration method to extract 3D information from broadcast soccer video. Their work is based on Zhang's method,12 where the camera is calibrated from two homography mappings without the need for 3D geometry knowledge of the scene. Zhang's technique assumes that the camera's intrinsic parameters are fixed during the calibration process, which cannot be applied to running broadcast videos where, for example, the focal length changes frequently during video capture (this explains the reported error in Liu et al.,9 which is around 25 percent).

In Yu et al., the authors present a novel method for calibrating tennis video using six point correspondences and design different methods to refine the clip- and frame-varying camera parameters.10 We proposed a system for calibrating sports video on the basis of randomly selected points from the net line to indicate the scene height.11 The difference with Yu et al.10 is that their method relies on the detection of the top points of two net posts, which is not robust, as the net posts might not be visible in the image. Our method proved to be more generally applicable (we only need a part of the net) to badminton, tennis, and volleyball. Subsequently, our method8 was successfully integrated into a semantic-level sports-video analyzer,13 leading to a wide range of analysis results at different levels.

Sidebar references

1. S. Chang, D. Zhong, and R. Kumar, "Real-Time Content-Based Adaptive Streaming of Sports Videos," Proc. IEEE Workshop Content-Based Access of Image and Video Libraries, IEEE Press, 2001, pp. 139-143.
2. H. Knoche, J. McCarthy, and M. Sasse, "Can Small Be Beautiful? Assessing Image Resolution Requirements for Mobile TV," Proc. ACM Multimedia, ACM Press, 2005, pp. 829-838.
3. K. Seo et al., "An Intelligent Display Scheme of Soccer Video on Mobile Devices," IEEE Trans. Circuits and Systems for Video Technology, vol. 17, no. 10, 2007, pp. 1395-1401.
4. T. Bebie and H. Bieri, "A Video-Based 3D-Reconstruction of Soccer Games," Eurographics, vol. 19, no. 3, 2000, pp. 391-400.
5. N. Inamoto and H. Saito, "Virtual Viewpoint Replay for a Soccer Match by View Interpolation from Multiple Cameras," IEEE Trans. Multimedia, vol. 9, Oct. 2007, pp. 1155-1166.
6. K. Matsui et al., "Soccer Image Sequence Computed by a Virtual Camera," Proc. Computer Vision and Pattern Recognition, IEEE Press, 1998, pp. 860-865.
7. D. Liang et al., "Video2Cartoon: A System for Converting Broadcast Soccer Video into 3D Cartoon Animation," IEEE Trans. Consumer Electronics, vol. 53, Aug. 2007, pp. 1138-1146.
8. J. Han, D. Farin, and P.H.N. de With, "A Real-Time Augmented Reality System for Sports Broadcast Video Enhancement," Proc. ACM Multimedia, ACM Press, 2007, pp. 337-340.
9. Y. Liu et al., "Extracting 3D Information from Broadcast Soccer Video," Image and Vision Computing, vol. 24, 2006, pp. 1146-1162.
10. X. Yu et al., "Inserting 3D Projected Virtual Content into Broadcast Tennis Video," Proc. ACM Multimedia, ACM Press, 2006, pp. 619-622.
11. J. Han, D. Farin, and P.H.N. de With, "Generic 3-D Modelling for Content Analysis of Court-Net Sports Sequences," Proc. Int'l Conf. Multimedia Modeling, Springer, 2007, pp. 279-288.
12. Z. Zhang, "A Flexible New Technique for Camera Calibration," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 11, 2000, pp. 1330-1334.
13. J. Han, D. Farin, and P.H.N. de With, "Broadcast Court-Net Sports Video Analysis Using Fast 3-D Camera Modeling," IEEE Trans. Circuits and Systems for Video Technology, vol. 18, no. 11, 2008, pp. 1628-1638.
System architecture

The architecture of our proposed system is composed of several interacting, but clearly separated, modules. Figure 1 depicts the system architecture with its major functional units and the data flow.

[Figure 1. Architecture of the complete system, which is constructed from camera calibration; player and ball information extraction; encoding and decoding; mixed-reality scene generation; and intelligent display.]

The most important modules are as follows:
• Camera calibration. To generate mixed-reality sports scenes, the original camera parameters have to be calculated using the broadcast sports video as input. The goal of this module is to compute a camera projection matrix from the input video and to further decompose it into camera intrinsic and extrinsic parameters.

• Player and ball information extraction. To preserve the visual nature of the original human motion, the information concerning the players and ball, such as position, shape, and texture, must be extracted from the real video and texture-mapped onto the virtual video. To this end, we use a player-segmentation algorithm, which is discussed in our previous work.1,2 Our approach is based on the background-subtraction technique, incorporating a shadow detector. Our player-segmentation algorithm considers the relation between the player position and the court-line location, so it's easy to reject other moving objects, like the ball boy. Meanwhile, a ball-tracking algorithm that employs graph theory is implemented.3
• Information encoding and decoding. In our system, we only need to transmit the camera parameters and the information about the players and the ball to the decoder. A simple lossless compression scheme is adopted to encode all the required information.

• Virtual camera creation and intelligent display. Once the original camera parameters at each frame are available, it's possible to construct mixed-reality scenes of the sports video on a smart mobile device. Our idea is to precisely recover the important information, such as the court lines, players, and ball, but to generate virtual scenes for less important areas (from the normal user's point of view), such as the ground field and commercial billboards. If the mobile device screen is too small, we can magnify the ball and the court-net lines. Moreover, a virtual camera derived from the original camera helps to synthesize a variety of virtual scenes, such as the scene from the player's viewpoint.
Camera calibration
The camera-calibration task provides a geometric transformation that maps points in real-world coordinates to the image domain. Because the court model is a 3D scene but the displayed image is a 2D plane, this mapping can be written as a 3 × 4 projection matrix $M$, which transforms a point $p = (x, y, z, 1)^T$ in real-world coordinates to image coordinates $p' = (u, v, w)^T$ by $p' = Mp$, which is equivalent to

$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}$$
Because $M$ is scale invariant, 11 free parameters have to be determined. They can be calculated from six points whose positions are known in both the 3D world coordinates and the image. Matrix $M$ can be further decomposed into camera intrinsic and extrinsic parameters, described by

$$M \cong K \left[\, R \mid -Rt \,\right] \qquad (1)$$

where

$$K = \begin{pmatrix} f_x & s & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{pmatrix}, \quad R = \begin{pmatrix} i^T \\ j^T \\ k^T \end{pmatrix}, \quad t = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix} \qquad (2)$$
Matrix $M$ in Equation 1 represents the perspective camera model, which is formed by the camera intrinsic parameters $K$ and the camera extrinsic parameters $[R \mid -Rt]$. The intrinsic parameters describe properties of the camera, such as its focal length and the image geometry, while the extrinsic parameters describe the camera placement and orientation in the 3D world. The upper-triangular calibration matrix $K$ encodes the intrinsic parameters of the camera: $f_x$ and $f_y$ represent the focal length, $(u_0, v_0)$ is the principal point, and $s$ is a skew parameter. Matrix $R$ is the rotation matrix, with $i$, $j$, and $k$ denoting the rotation axes, and vector $t$ is the translation vector. Together, $R$ and $t$ form the camera extrinsic parameters.
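As an illustration of this decomposition step, the following sketch (Python with NumPy/SciPy; a minimal example of the standard RQ-based factorization, not the authors' implementation) recovers K, R, and t from a given M:

```python
import numpy as np
from scipy.linalg import rq

def decompose_projection(M):
    """Split a 3x4 projection matrix M ~ K [R | -Rt] into K, R, t.

    Minimal sketch: RQ-decompose the left 3x3 block into an upper-
    triangular K and an orthogonal R. Assumes the block is full rank
    and ignores the det(R) = -1 corner case.
    """
    A = M[:, :3]                    # left 3x3 block, equal to K R up to scale
    K, R = rq(A)                    # A = K R, with K upper triangular
    S = np.diag(np.sign(np.diag(K)))
    K, R = K @ S, S @ R             # force a positive diagonal in K (S S = I)
    # M[:, 3] ~ -K R t, so t = -A^{-1} M[:, 3], independent of the scale.
    t = -np.linalg.solve(A, M[:, 3])
    return K / K[2, 2], R, t        # normalize K so that K[2, 2] = 1
```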
In this article, we adopt the basic concept of Han, Farin, and de With4 that takes at least six points arranged in two perpendicular planes to compute M. The court and net lines characterize these two planes. Additionally, we explore the EM approach to select feature points, instead of using a random selection,4 thereby improving the accuracy of the projection matrix M. Our aim is to make matrix M accurate enough to enable a decomposition into camera
intrinsic and extrinsic parameters with the purpose of deriving a virtual camera view.

For the selection of EM-based feature points, we assume that enough court lines have been detected in the image using our previous technique.4 The remaining task is to find six point correspondences from these lines to compute M. In our approach, we select four points from the ground plane and two points from the net plane. In the ground plane, the intersections of the court lines establish four point correspondences (see Figure 2 for an example). Note that, in the sequel, we explain the 3D modeling for a tennis court, but this can be equally applied to volleyball and badminton.

[Figure 2. (a) The lines and points selected in the image, and (b) their correspondences in the standard model. Six points are used for calibration.]

The way to extract feature points from the net line is actually more complex. In our previous work,4 we assumed that a change in object height only affects the vertical coordinate of the image. This implicitly assumes that we can neglect the perspective foreshortening in the z-direction, because the object heights are relatively small compared to the whole field of view. In other words, any vertical projection
line onto the ground plane in the 3D domain remains vertical in the image domain. In this way, two arbitrary points P5' and P6' on the net line in the 3D model (see Figure 2) correspond to points T5 and T6 in the image. However, because the broadcast video is normally captured from a top view, this assumption doesn't hold in many practical cases. It appears that there is a slight slope difference between the 3D projection line and the visible line in the image, which corresponds to the viewing angle of the camera onto the scene. This angle increases with a higher camera position. Although this phenomenon doesn't change the projection matrix significantly, it has a profound influence on an accurate decomposition of the matrix. Depending on the position of the camera and the distance between points in the image, the projection matrix might not be Euclidean by nature, so the decomposition into camera parameters gives large errors. Our strategy is to select the feature points on the net line more carefully so that they yield a better Euclidean setting of the projection problem. Consequently, the decomposition of the projection matrix can become sufficiently accurate.

Using our previous method4 to extract two initial net-line points, we can find many candidate points around these two initial points (as Figure 2 shows). Employing the EM-based method, we classify these candidates into two categories: acceptable points (AP) and rejected points (RP). From the set of APs, we choose the best point through maximum-likelihood inference.
Suppose that the computed M can be decomposed into camera parameters as described by Equation 1. Ideally, the camera's principal point (u0, v0) should be at the image's center. Due to the presence of noise, the principal point might not be at the exact image center, but at least it should be close. In other words, the distance between the computed principal point and the image center can be used to evaluate the quality of the matrix decomposition. On the basis of this distance, called dk, a candidate point is classified into an AP or an RP, using the
iterative EM procedure. Assuming $N$ candidate points indexed by $k$, we have a two-class problem (AP $= w_1$, RP $= w_2$) based on the distance $d_k$. More specifically, we need to estimate the posterior $p(w_i \mid d_k)$ for each point. By Bayes' rule, this posterior equals

$$p(w_i \mid d_k) = \frac{p(d_k \mid w_i, \mu_i, \sigma_i)\, p(w_i)}{p(d_k)}$$

Here, $p(d_k) = \sum_{i=1}^{2} p(w_i)\, p(d_k \mid w_i, \mu_i, \sigma_i)$, which is represented by a Gaussian mixture model. In addition,

$$p(d_k \mid w_i, \mu_i, \sigma_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, \exp\!\left(-\frac{(d_k - \mu_i)^2}{2\sigma_i^2}\right)$$

Now, the problem reduces to estimating $p(w_i)$, $\mu_i$, and $\sigma_i$, which can be done iteratively using the EM update equations.

The EM process is initialized by choosing class posterior labels on the basis of the observed distance; the shorter the distance of a point, the greater its initial posterior probability of being an AP, so that

$$p^{(0)}(w_2 \mid d_k) = \min\!\left(1.0,\; d_k \Big/ \sqrt{c_x^2 + c_y^2}\right), \qquad p^{(0)}(w_1 \mid d_k) = 1 - p^{(0)}(w_2 \mid d_k)$$

Here, $(c_x, c_y)$ denotes the image center.
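The following sketch (Python/NumPy; our own illustrative code, with a fixed iteration count as a simple stopping criterion) runs this two-class EM on the distances d_k:

```python
import numpy as np

def em_classify(d, cx, cy, iters=20):
    """Classify candidate points into AP (w1) / RP (w2) from distances d_k."""
    d = np.asarray(d, dtype=float)
    # Initialization: shorter distance -> higher probability of being an AP.
    p2 = np.minimum(1.0, d / np.hypot(cx, cy))   # p^(0)(w2 | d_k)
    post = np.stack([1.0 - p2, p2], axis=1)      # N x 2 posterior matrix
    for _ in range(iters):
        # M-step: update priors, means, and deviations of both classes.
        pw = post.mean(axis=0)                                # p(w_i)
        mu = (post * d[:, None]).sum(0) / post.sum(0)         # mu_i
        var = (post * (d[:, None] - mu) ** 2).sum(0) / post.sum(0)
        sigma = np.sqrt(var) + 1e-9                           # sigma_i
        # E-step: recompute posteriors from the Gaussian likelihoods.
        lik = np.exp(-(d[:, None] - mu) ** 2 / (2 * sigma**2)) \
              / (np.sqrt(2 * np.pi) * sigma)
        post = lik * pw
        post /= post.sum(axis=1, keepdims=True)
    return post[:, 0] > post[:, 1]               # True where the point is an AP
```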
Having acquired the APs, the next step is to find the best point among them. The basic idea is to measure the distance between a virtual court configuration and the detected court configuration in the picture. Assuming that we have in total k APs from the previous step, we obtain k camera matrices accordingly, where the ith camera matrix is denoted as Mi. For each AP, a virtual court configuration can be generated by projecting the 3D real-world court-net configuration (derived from the model) back to the image on the basis of Mi. Among the k virtual court configurations, the configuration minimizing a matching error can be identified as the
best, final solution. The matching error is formulated as

$$E_i = \sum_{j=1}^{m} \left\| L_{ij},\, L_{vj}(M_i) \right\|$$

Line $L_{ij}$ is the $j$th detected line of the court configuration formed by $m$ lines in the picture, and $L_{vj}(M_i)$ denotes the corresponding line in the virtual configuration. The metric $\|\cdot, \cdot\|$ denotes the distance between two lines. The matrix $M_i$ giving the minimum $E_i$ is selected as the best one. The last step is to decompose matrix $M_i$ into camera intrinsic and extrinsic parameters (see Equation 1).
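As an illustration of this selection step, here is a hypothetical Python fragment; the article relies on a dedicated line-distance metric, which we replace with a crude endpoint-based distance, and it assumes the detected and model segments are supplied in corresponding order:

```python
import numpy as np

def project_point(M, X):
    """Project a homogeneous 3D point X = (x, y, z, 1) with a 3x4 matrix M."""
    u, v, w = M @ X
    return np.array([u / w, v / w])

def segment_distance(seg_a, seg_b):
    """Stand-in line distance: mean distance between matched endpoints."""
    (a0, a1), (b0, b1) = seg_a, seg_b
    d1 = np.linalg.norm(a0 - b0) + np.linalg.norm(a1 - b1)
    d2 = np.linalg.norm(a0 - b1) + np.linalg.norm(a1 - b0)
    return 0.5 * min(d1, d2)

def best_matrix(candidate_Ms, detected_segs, model_segs):
    """Return the candidate M_i minimizing E_i = sum_j ||L_ij, L_vj(M_i)||."""
    def error(M):
        return sum(segment_distance(det, (project_point(M, p0),
                                          project_point(M, p1)))
                   for det, (p0, p1) in zip(detected_segs, model_segs))
    return min(candidate_Ms, key=error)
```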
Player and ball information extraction

As mentioned previously, we want to recover the information of the players and ball, including position, shape, and texture, in the mixed-reality scene, so this kind of information needs to be extracted from the input video. In this system, our previous technique1 is adopted to segment moving players. Tennis-ball tracking is a more challenging problem, as the ball is a small object (with a diameter of approximately 6.5 centimeters) traveling at speeds of up to 150 miles per hour. We base our ball-tracking algorithm on graph theory, which was initially used by Yu et al.5 and further improved by Yan, Christmas, and Kittler.3 To reduce the computational load of the algorithm used in these works, we assume that

• the initial ball position is known in the image (manually labeled), and

• the ball only appears in the court area, with a small border extension (of about 2 meters) around the court (90.2 percent of the cases in our database satisfy this condition).
First, we implement ball-candidate detection. With our previous work,1 a binary map indicating moving objects is obtained by employing a background-subtraction technique. Here, we use a connected-component labeling algorithm to isolate each object in the binary map. We suppose that a tennis ball in the image can't be larger than a predefined square box with a size of r × r samples; therefore, all larger objects are removed.
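A minimal sketch of this filtering step (Python with SciPy; the function name is ours, and the authors' implementation may differ) labels the connected components in the binary foreground map and keeps only blobs that fit in an r × r box:

```python
import numpy as np
from scipy import ndimage

def ball_candidates(binary_map, r):
    """Return (position, size) pairs for ball-sized blobs in the binary map."""
    labels, n = ndimage.label(binary_map)          # connected components
    candidates = []
    for sl in ndimage.find_objects(labels):        # bounding-box slices
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        if h <= r and w <= r:                      # drop everything larger
            cy = (sl[0].start + sl[0].stop) / 2.0
            cx = (sl[1].start + sl[1].stop) / 2.0
            candidates.append(((cx, cy), (w, h)))
    return candidates
```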
Next, after candidate detection over Q consecutive frames, a ball-trajectory graph is established. Each node in this graph is assigned a node weight, representing its resemblance to a ball. Meanwhile, each edge is associated with an edge weight, referring to the likelihood that two nodes in the same graph correspond to each other when considering the motion and the direction of a ball. More specifically, the node weight is the probability of a candidate Oi being a ball, given the color and size information. Using Bayes' rule, we can express this in terms of the distribution of the ball's color and size data:
$$N_i^t = P(O_i^t \to \mathrm{ball} \mid \mathrm{color}, \mathrm{size}) \propto P(\mathrm{color}, \mathrm{size} \mid O_i^t \to \mathrm{ball})\, P(O_i^t \to \mathrm{ball})$$

where superscript $t$ denotes the index of the current frame, and subscript $i$ identifies the $i$th ball candidate in frame $t$. We consider the color and size information of the ball independently, so that $P(\mathrm{color}, \mathrm{size} \mid O_i^t \to \mathrm{ball}) = P(\mathrm{color} \mid O_i^t \to \mathrm{ball})\, P(\mathrm{size} \mid O_i^t \to \mathrm{ball})$. We assume that the color of the ball has a Gaussian distribution, whose $\mu$ and $\sigma$ can be computed from several tennis-ball samples, taking varying outdoor lighting conditions into account. The size likelihood is $P(\mathrm{size} \mid O_i^t \to \mathrm{ball}) = w \cdot h / (r \cdot r)$, where $w$ and $h$ represent the width and height of the candidate; if the size of a candidate is more similar to that of the predefined ball, that candidate has a larger probability of being a ball.
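The node weight could be computed along these lines (a hypothetical fragment; the Gaussian here is unnormalized, the prior is taken as uniform, and mu_c, sigma_c stand for the offline-estimated ball-color statistics mentioned above):

```python
import numpy as np

def node_weight(mean_color, w, h, mu_c, sigma_c, r):
    """Unnormalized node weight: Gaussian color term times the size term."""
    color_lik = np.exp(-np.sum((np.asarray(mean_color) - mu_c) ** 2
                               / (2 * np.asarray(sigma_c) ** 2)))
    size_lik = min(1.0, (w * h) / float(r * r))   # P(size | candidate -> ball)
    return color_lik * size_lik
```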
An edge $E_{i,j}$ of the graph connects nodes in frame $t$ with nodes in frame $t+1$ or $t+2$. Such a connection is based on the observation that the ball positions in any two adjacent frames are usually quite close. As a consequence, the edge weight of the graph can be formulated as

$$E_{i,j} = K\!\left(\frac{x_j^{t+n} - \left(x_i^t + n \cdot d_s\right)}{k_w}\right) \qquad (3)$$

Here, $K(\cdot)$ is the Epanechnikov kernel function, which is specified by

$$K(y) = \begin{cases} 1 - \|y\|^2 & \text{for } \|y\|^2 \le 1 \\ 0 & \text{otherwise} \end{cases}$$

In Equation 3, $x_j^{t+n}$ and $x_i^t$ represent the position of the $j$th node in the $(t+n)$th frame (with $n = 1$ or $2$) and the position of the $i$th node in frame $t$, respectively. Parameter $d_s$ refers to a standard displacement of a moving ball over two consecutive frames, such as its average speed, and $k_w$ specifies the width of the kernel function. These last two parameters can be set manually, depending on the image resolution.
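Assuming 2D pixel coordinates and treating d_s as a displacement vector, Equation 3 translates into a few lines (illustrative names, not the authors' code):

```python
import numpy as np

def epanechnikov(y):
    """K(y) = 1 - ||y||^2 if ||y||^2 <= 1, else 0."""
    q = np.dot(y, y)
    return 1.0 - q if q <= 1.0 else 0.0

def edge_weight(x_i, x_j, n, d_s, k_w):
    """E_ij = K((x_j^{t+n} - (x_i^t + n * d_s)) / k_w), per Equation 3."""
    y = (np.asarray(x_j) - (np.asarray(x_i) + n * np.asarray(d_s))) / k_w
    return epanechnikov(y)
```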
Finding the optimal path of a graph is a typical dynamic-programming problem. Because there are many possible paths from the first to the last frame (for example, from frame t to t+2 in Figure 3), we need to identify the path having the maximum weight, which indicates the most likely ball trajectory.

[Figure 3. Illustration of the ball-path extraction based on a weighted graph, where the optimal path is marked with a red line.]
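A sketch of the corresponding dynamic program follows (it reuses edge_weight from the preceding fragment; allowing a path to start at any frame is our simplification). nodes[t] is a list of (position, node_weight) pairs for frame t; edges go to frames t+1 and t+2, which lets the path survive a single missed detection:

```python
def best_ball_path(nodes, d_s, k_w):
    """Return the maximum-weight path as a list of (frame, node) indices."""
    score = [[w for _, w in frame] for frame in nodes]      # best path weight
    back = [[None] * len(frame) for frame in nodes]         # back pointers
    for t in range(len(nodes)):                             # frames in order
        for i, (x_i, _) in enumerate(nodes[t]):
            for n in (1, 2):                                # edges to t+n
                if t + n >= len(nodes):
                    continue
                for j, (x_j, w_j) in enumerate(nodes[t + n]):
                    s = score[t][i] + edge_weight(x_i, x_j, n, d_s, k_w) + w_j
                    if s > score[t + n][j]:
                        score[t + n][j] = s
                        back[t + n][j] = (t, i)
    # Pick the highest-scoring end node and trace the path back.
    t, i = max(((t, i) for t, frame in enumerate(nodes)
                for i in range(len(frame))),
               key=lambda ti: score[ti[0]][ti[1]])
    path = [(t, i)]
    while back[t][i] is not None:
        t, i = back[t][i]
        path.append((t, i))
    return path[::-1]
```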
Parameter data compression

In our system, we need to transmit three different types of information to mobile devices: camera parameters, multiple-player information, and ball information. A lossless, fixed-length compression scheme encodes all three information types:
• Camera-parameter compression. Eleven camera parameters (see Equation 2) require compression. We first clip the value of each parameter to a six-digit number with two digits behind the decimal point. Then, four bits are assigned to encode each digit, and three bits are used to indicate the position of the decimal point, giving 27 bits per parameter. Therefore, we need, in total, 297 bits to encode all camera parameters (see the encoding sketch after this list).
• Player-information compression. Instead of encoding the absolute position of each pixel belonging to a player, we encode the bounding box's location and the moving player's binary map, where the bounding-box location is described by the coordinates of its top-left point and its width and height. The binary map can be transmitted directly without compression; in it, 0 represents a background pixel and 1 a foreground pixel. The combination of the bounding-box location and the binary map allows the reconstruction of the absolute position of each pixel of a moving player at the decoder. Additionally, we also need to encode the RGB value of each foreground pixel. As with the camera-parameter compression, we allocate a fixed number of bits per pixel.
• Ball-information compression. We only encode the position and the radius of the detected ball. The ball color is predetermined by the user or set to red by default at the decoder, to make the ball easily noticeable on the small screen.
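For concreteness, here is a minimal sketch of the fixed-length camera-parameter encoding described in the first item above (Python; the function names are ours, and sign handling is omitted because the article doesn't specify it):

```python
def encode_parameter(value, n_digits=6, n_frac=2):
    """Encode one parameter: 6 BCD digits (4 bits each) + 3-bit point position."""
    scaled = round(abs(value) * 10**n_frac) % 10**n_digits  # clip to 6 digits
    bits = ""
    for d in f"{scaled:0{n_digits}d}":
        bits += f"{int(d):04b}"          # 4 bits per decimal digit
    bits += f"{n_frac:03b}"              # 3 bits for the decimal-point position
    return bits                          # 27 bits per parameter

def encode_camera(params):
    """Encode the 11 camera parameters: 11 * 27 = 297 bits in total."""
    assert len(params) == 11
    return "".join(encode_parameter(p) for p in params)
```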
Mixed-reality scene generation and intelligent display

Once we have obtained the required information at the decoder, the next step is to synthesize the mixed-reality scene. Normally, our mixed-reality scene is formed by the court lines, the ground field, commercial billboards and logos, and the players and ball.
• Court-lines generation. Because we have acquired the camera parameters and we also know the physical position of each court line in the real-world domain, it's easy to reconstruct the lines in the mixed-reality scene. Our basic idea is to transform the intersection points of the lines from the 3D domain to the image domain using Equation 2; connecting the corresponding points according to the layout of a standard court model then recovers the court lines in the mixed-reality scene (see the projection sketch after this list). If the mobile screen is too small, we can magnify those lines.
• Ground-field generation. Compared with the court lines, the ground field is less important, because viewers usually don't care about the type of the ground field. In our system, we have several court fields with different colors, which can be chosen randomly or by the viewers themselves.
• Commercial billboard and logo generation. The insertion of virtual advertising into a sports video is an interesting topic, because it enhances the commercial opportunities for advertisers, content providers, and broadcasters. Unlike conventional methods that
insert virtual 2D advertisements into the image,6 our algorithm is capable of inserting 3D scenes into the image by means of 3D camera modeling. Therefore, both advertisements that should be projected onto the ground plane and those that are perpendicular to the ground plane can be successfully inserted into the image. Figure 4a shows an example, where an institute logo is projected twice in the correct perspective.
• Players and ball generation. Recovering the players and ball is relatively easy, because they are all available at the decoder. If users feel that the screen is too small, they may increase the ball size.
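The projection sketch referenced in the court-lines item could look as follows (hypothetical Python; COURT_SEGMENTS stands for the standard court model's 3D line segments, which the article derives from the known court dimensions):

```python
import numpy as np

def project(M, X):
    """Project a 3D point X = (x, y, z) with the 3x4 projection matrix M."""
    u, v, w = M @ np.append(X, 1.0)      # homogeneous 3D -> image coordinates
    return np.array([u / w, v / w])

def court_lines_in_image(M, court_segments):
    """Map each 3D model segment (P0, P1) to a 2D image segment for drawing."""
    return [(project(M, P0), project(M, P1)) for P0, P1 in court_segments]
```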
As mentioned, our system facilitates an intelligent visualization, where viewers are able to control a virtual camera derived from the original camera. We have realized three different ways of generating the view from a virtual camera. First, users can obtain their preferred viewing angle by changing the camera focal length (zooming in or out) or the camera rotation angle. Second, users can control the camera so that it concentrates on one specific player and tracks that player in the middle of the image. This feature is realized by minimizing the following function:
the following function:
D ¼ w
2� PxðK;R
_
; t;p��� ���
where PxðK;R_
; t;pÞ is the x-coordinate of the
projection point p of w in the image. p is the
real-world position of the target player, and
w is the width of the image. Figures 4b
and 4c demonstrate that we can virtually ro-
tate the camera and track a player in the mid-
dle of the image. Black lines indicate the
camera rotation angle. A third option for the
user is to locate the virtual camera on top of
a player’s head when the player is close to a
real camera position, enabling the viewer to
watch the match as a highly realistic experience.
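A minimal sketch of the player-centering control described above (Python/NumPy; the coarse angle scan and the rotation about the vertical axis are our simplifications, since the article only specifies the objective D):

```python
import numpy as np

def pan_to_center(K, R, t, p_world, img_width):
    """Find the pan angle minimizing D = |w/2 - P_x(K, R_hat, t, p)|."""
    def proj_x(pan):
        c, s = np.cos(pan), np.sin(pan)
        R_pan = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # rotate about y
        R_hat = R_pan @ R
        M = K @ np.hstack([R_hat, (-R_hat @ t).reshape(3, 1)])  # K [R|-Rt]
        u, v, w = M @ np.append(p_world, 1.0)
        return u / w
    pans = np.linspace(-np.pi / 4, np.pi / 4, 721)   # coarse 1-D scan
    return min(pans, key=lambda a: abs(img_width / 2 - proj_x(a)))
```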
Experimental results

To evaluate the operation and efficiency of the system, we tested it on more than 1 hour of video sequences, which were recorded from regular TV broadcasts containing tennis, badminton, and volleyball. In our test set, the video sequences have two resolutions: 720 × 576 and 320 × 240. Our test platform is a Pentium 4, 3-GHz computer with 512 Mbytes of RAM, programmed in C under Linux, because the combination of C and Linux is a popular framework for developing applications for smartphones.
System performance
In an objective evaluation, we tested our camera-calibration algorithm on six video clips. Three of them were tennis games on different court types, two were badminton games, and one was a volleyball game. Figure 5 shows sample pictures of the court detection, where three difficult scenes are selected. Evidently, the method of Yu et al.7 fails here, as one net post is not visible. Table 1 shows the evaluation results of our algorithm, which indicate that the detection of the court lines is correct for more than 90 percent of the sequences on average.

[Figure 5. Detection of the court lines and net line for different court types and viewing angles.]

Table 1. Court and net detection and camera calibration (the error is the average distance in pixels per point, measured over 30 frames).

Type         Court (%)   Net (%)   Method           Error   Principal point
Badminton    98.7        96.2      Other method4    1.6     (388, -121)
                                   Our method       1.1     (372, 256)
Tennis       96.1        95.8      Other method4    2.5     (399, -232)
                                   Our method       1.3     (395, 249)
Volleyball   95.4        91.8      Other method4    3.2     (320, 98)
                                   Our method       2.1     (340, 220)
Moreover, we compared our algorithm with our earlier work,4 and related both to a ground truth based on a manual selection of a few intersections in the 3D domain. We transformed those 3D reference points to the image domain using the transformation matrix (see Equation 2), and we measured
the distance between the projected 3D reference intersections and the ground truth in the image domain. Table 1 shows that, for a badminton clip, the average distance obtained by the approach of Han, Farin, and de With4 is 1.6 pixels per point, whereas our new algorithm has an error of only 1.1 pixels per point. We also give the coordinates of the camera principal points computed by both methods. Again, our principal point is much closer to the image center, which equals (360, 288). It should be noted that it's difficult to define objectively how much camera-parameter accuracy we need for this application. Instead, we turned to subjective investigations, which reveal that the user will clearly recognize projection errors if the distance between the projected 3D intersections and the ground truth is more than 15 pixels per point.
We implemented our proposed simple information-compression scheme and compared it with another low-complexity scheme that uses JPEG to encode each original video frame, where we assumed that the transmission bandwidth is fixed. To compare the two schemes fairly, neither of them considers motion estimation across video frames. We found that our proposal needs 6.35 Kbytes on average to encode one frame. We then set the JPEG quantization factor so that it used an equivalent amount of bits for compressing the video frames. This gives a visual-quality comparison of the reconstructed image at more or less the same bit rate.
Figure 6a shows some examples, where we can see that the JPEG image has visible blocking artifacts at this low bit rate, thereby blurring essential visual information, such as the players, court lines, and ball. This phenomenon doesn't occur in the image reconstructed using our compression scheme (see Figure 6b), in which the ball and the court lines are even more recognizable on the small screen. Meanwhile, we have noticed that the wireless bandwidth is always dynamic in a real application. Therefore, we also encoded the video at different bit rates and compared the quality of the reconstructed images. The conclusion is that images reconstructed using our scheme are visually similar at different bit rates, on the condition that the channel bandwidth is sufficient to transmit the camera parameters and the player and ball information. If this condition doesn't hold, the important parts, such as the player or ball, might not be completely reconstructed at the decoder. Similarly, we also compared the image quality of the JPEG scheme when the channel bandwidth increases from 6.4 Kbytes to 8.5 Kbytes (see Figure 6c). Here,
the players, court-net lines, and ball are still not clearly visible in the reconstructed image, even with the slightly increased bandwidth. Unfortunately, an objective quality metric such as peak signal-to-noise ratio can't be given, because the data is partly virtually generated in our scheme. The quality difference between JPEG and our mixed-reality system is explained by the fact that more bits are used to encode the important objects, like the players and ball, instead of compressing everything through averaging.

[Figure 6. Visual results obtained after compression: column (a) shows JPEG coding results at 6.4 Kbytes, column (b) shows our fixed-length coding scheme at 6.35 Kbytes, and column (c) shows JPEG coding results at 8.5 Kbytes.]
Let us now look at the results and possibilities when virtual reality is mixed with the actual sports scene. Figure 7 demonstrates some virtual scenes generated by our system for two badminton games (more demo movies can be found at http://vca.ele.tue.nl/demos/tennisanalysis/index.html). We generated two different virtual scenes for each game. For the badminton singles match, we translated the camera so that one player is always located at the midline of the captured image, and the viewer watches the game from an arbitrary viewing angle. For the badminton doubles match, we demonstrate cases with increased camera height and modified focal length. Our virtual scenes prove to be realistic, because we preserve the shape and motion of the player, which is better than the animation-based systems.8,9

[Figure 7. Illustration of virtual-scene generation by changing the original camera setting: row (a) shows the original images, row (b) shows camera rotation and focal-length modification, and row (c) shows a changed viewing angle and zooming.]
To further investigate the performance of our mixed-reality system, we conducted a subjective evaluation, where we used four demos together with the original videos (two tennis videos and two badminton videos). We invited six subjects to watch the demos and original videos on a laptop. The participants were all young researchers ranging in age from 23 to 34. They were all sports fans and usually watch tennis and badminton games. We asked them to compare the mixed-reality videos with the original videos, and also to answer the following questions:

• Q1: Are the court lines and ball clearer in the mixed-reality video than in the original video?

• Q2: Does the inserted virtual scene move with the camera?

• Q3: Does the synthesized scene visualize the sports game more effectively?

• Q4: Is the virtual scene realistic?

• Q5: Is the whole mixed-reality scene realistic?
Instead of only answering yes or no, subjects chose a score between 1 and 5, in which 5 corresponds to strongly accept, 4 to accept, 3 to marginally accept, 2 to reject, and 1 to strongly reject. The average scores from all subjects are given in Figure 8. While users gave positive answers to the questions about the system, we paid special attention to Q5, because it received the lowest average score. We found that the major reason for the lower score is that users weren't satisfied with the result of the player segmentation. In some video frames, the contour of the player, in particular the player farthest from the camera, was not completely extracted, because the player's suit color was similar to the background color.

[Figure 8. Subjective evaluation of our mixed-reality system, where the height of the bar indicates the average score for the corresponding question.]
System efficiency

In addition to showing system performance, we also want to demonstrate its efficiency. In principle, the efficiency of our camera-calibration technique mainly depends on the image resolution, and it is only slightly influenced by the content complexity of the picture. To verify this, we measured the per-frame execution time of our camera-calibration technique on a tennis video clip (320 × 240) and a badminton video clip (720 × 576). The results are given in Figure 9a. For the tennis video, the execution time for the initialization step (first frame) was 93 milliseconds (ms), and the execution times for the other frames were between 27 and 32 ms, depending on the frame's complexity.

For the badminton video, the initialization step required 457 ms, and the average execution time per frame was 131 ms. It's clear
that the execution time varies from lower-resolution to higher-resolution video, because the size (width and length) of a court line changes under different resolutions. We conclude that our proposal, when executed on a PC, is a near real-time 3D camera-calibration technique and is able to support real-time content-analysis applications.
We applied the complete system to a single tennis video (720 × 576), performing camera calibration, player and ball extraction, and data compression. The result is shown in Figure 9b. We found that the average execution time per frame was around 503.4 ms and that the player and ball extraction component consumed the most computation.

[Figure 9. System efficiency: (a) the execution time of the camera calibration on a 3-GHz PC for the tennis (320 × 240) and badminton (720 × 576) clips; (b) the share of each processing component in the entire compression system on a single tennis match: camera calibration 31 percent, player and ball extraction 66 percent, and data compression 3 percent (average execution time per frame is 503.4 ms).]
It's important to test the execution time and memory usage of our virtual-scene-generation algorithm, because it will be the heaviest part running on the smartphone. We evaluated this algorithm in our PC-based environment using a single badminton video (720 × 576), and found that the execution time per frame is around 45.7 ms and the memory usage is about 22.6 Mbytes. This result shows that it's potentially possible to run our algorithm on a smartphone in real time.
Conclusions and discussion

Because of the accuracy of the 3D camera modeling, our system might be suited for other professional applications besides the enhanced viewing experience of sports video on mobile devices. For example, our modeling technique could be an efficient replacement for manually controlling the video camera to concentrate on and track an object of interest in the middle of an image. Moreover, our system could serve as a camera simulator, providing virtual videos with different camera settings and finding the perfect camera position and ideal camera setting through simulation.
Our entire system has not yet been commercialized. However, it is conceptually close to an object-based video-compression scheme, such as MPEG-4. The recent achievements of commercial products based on MPEG-4 clearly show the possibility of commercializing our system in the near future. For example, existing commercial MPEG-4 products offer a broader and more sophisticated class of fully reactive and interactive content, including video-object segmentation of arbitrary shapes and user-interaction features integrated within the content. Some products, such as Thomson's set-top box, even enable users to create and render 2D and 3D graphics. Solutions such as our proposed algorithm for object rendering provide good references for further commercialization. In fact, an improved version of our object-segmentation technique, the key component of our system, has been transferred to industry (see http://www.vinotion.nl), resulting in commercial software.
The current system can still be improved in some aspects. For example, the entire system can't yet achieve real-time operation when executed on a single-CPU platform. Fortunately, modern smartphones increasingly deploy more than one CPU core. Our algorithm can be nicely split into two
parts after the decoder obtains the camera parameters: the first part performs lossless decoding of the player and ball information, and the second part generates the virtual scene according to the user's preference. If these parts can be executed in parallel on two CPUs, we estimate that the entire system can achieve real-time execution.
References

1. J. Han, D. Farin, and P.H.N. de With, "Broadcast Court-Net Sports Video Analysis Using Fast 3-D Camera Modeling," IEEE Trans. Circuits and Systems for Video Technology, vol. 18, no. 11, 2008, pp. 1628-1638.
2. J. Han and P.H.N. de With, "Real-Time Multiple People Tracking for Automatic Group-Behavior Evaluation in Delivery Simulation Training," Multimedia Tools and Applications, vol. 51, no. 3, 2011, pp. 913-933.
3. F. Yan, W. Christmas, and J. Kittler, "Layered Data Association Using Graph-Theoretic Formulation with Applications to Tennis Ball Tracking in Monocular Sequences," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 10, 2008, pp. 1814-1830.
4. J. Han, D. Farin, and P.H.N. de With, "Generic 3-D Modelling for Content Analysis of Court-Net Sports Sequences," Proc. Int'l Conf. Multimedia Modeling, Springer, 2007, pp. 279-288.
5. X. Yu et al., "A Trajectory-Based Ball Detection and Tracking Algorithm in Broadcast Tennis Video," Proc. IEEE Int'l Conf. Image Processing, IEEE Press, 2004, pp. 1049-1052.
6. K. Wan and X. Yan, "Advertising Insertion in Sports Webcasts," IEEE MultiMedia, vol. 14, no. 2, 2007, pp. 78-82.
7. X. Yu et al., "Inserting 3D Projected Virtual Content into Broadcast Tennis Video," Proc. ACM Multimedia, ACM Press, 2006, pp. 619-622.
8. K. Matsui et al., "Soccer Image Sequence Computed by a Virtual Camera," Proc. Computer Vision and Pattern Recognition, IEEE Press, 1998, pp. 860-865.
9. D. Liang et al., "Video2Cartoon: A System for Converting Broadcast Soccer Video into 3D Cartoon Animation," IEEE Trans. Consumer Electronics, vol. 53, Aug. 2007, pp. 1138-1146.
Jungong Han is a researcher at the Centre for Mathematics and Computer Science (CWI) in Amsterdam, and was previously a researcher on video content analysis in the Signal Processing Systems department at the University of Technology Eindhoven, Netherlands. His research interests include content-based video analysis and video compression. Han has a PhD in communication and information engineering from Xidian University. Contact him at [email protected].

Dirk Farin is a senior researcher at Robert Bosch, Corporate Research, in Hildesheim, Germany. His research interests include object classification, 3D reconstruction, and video compression. Farin has a PhD in electrical engineering from the University of Technology Eindhoven, Netherlands. In 2008, he received three best student paper awards for his work on video coding. Contact him at [email protected].

Peter H.N. de With is a professor at the University of Technology Eindhoven, Netherlands, where he leads a chair on video coding and architectures. His research interests include video coding, architectures, and their realization. De With has a PhD in electrical engineering from the University of Technology Delft, Netherlands. He is an IEEE Fellow. Contact him at [email protected].