A Mixed-Reality System for Broadcasting Sports Video to Mobile Devices

Jungong Han, Dirk Farin, and Peter H.N. de With
University of Technology Eindhoven

Transmitting camera parameters and additional information enables the generation of a mixed-reality presentation of sports on mobile devices.
Watching sports, such as tennis and soccer, remains popular for a broad class of consumers. However, audiences might not be able to enjoy their favorite games on a television or PC when they are traveling, so mobile devices are increasingly used for watching sports video. Unfortunately, watching sports on a mobile device is not as simple as watching sports on TV. Bandwidth limitations on wireless networks prevent high bit-rate video transmission. In addition, small displays lose visual details of the sports event. Bandwidth limitations occur primarily when multiple users want to stream video over the same wireless link; these bottlenecks are likely to occur because of the popularity of an event. See the "Other Approaches" sidebar for examples of existing systems.
This article describes a camera modeling-based, mixed-reality system concept. The idea is to build a 3D model for the sports video, where all parameters of this model can be obtained by analyzing the broadcast video. Instead of sending the original images to the mobile device, the system only sends the parameters of the 3D model and the information about the players and ball, which significantly saves transmission bandwidth. Additionally, because we have full-quality information about the camera modeling and about the players and ball, the mobile client is able to recover the important information without loss of visual detail. Furthermore, we can generate virtual scenes for less important areas, such as the playing field and the commercial billboards, without changing the major story of the sports game.
Moreover, by changing the parameters of the original camera, a variety of mixed-reality scenes can be synthesized to better visualize a scene on the mobile device. For example, in a tennis video captured by a long-shot camera, the important objects (for example, players) might not be clearly visible on the small LCD panel, but a mixed-reality presentation of a zoomed version of the original scene might provide a better visualization. The concept presented here fully relies on an accurate 3D modeling of the scene. For this reason, we also contribute techniques for precisely extracting camera parameters. In our system, a probabilistic method based on the Expectation-Maximization (EM) algorithm finds the optimal feature points, thereby enabling the automatic acquisition of the camera parameters from the sports video (tennis, badminton, and volleyball) with sufficiently high accuracy.

Other Approaches

Most existing systems assume that video transmission and display are two different topics, so they are treated independently. Chang, Zhong, and Kumar developed an adaptive streaming system for sports videos over resource-limited networks or devices.1 The rate adaptation is content-based and varies dynamically according to the event structure and the video content. For example, in wireless streaming of baseball video, such a system can send full-quality frames showing pitching and important follow-up activities, but change to a low-bandwidth mode, using only keyframes, during unimportant video segments.

Knoche, McCarthy, and Sasse aim to couple the applied image resolution to the display size of mobile devices.2 This inevitably leads to a reduced resolution, which in turn yields a loss of visual details. This work concluded that field sports, such as soccer and tennis, suffer more clearly in the user experience from the reduced resolution than music, news, and animation. In Seo et al., an intelligent display of the soccer game on a mobile device is presented, where the region of interest is extracted and magnified for display.3 This gives viewers a more comfortable experience in understanding what is happening in the scene.

The newer video-compression standards, such as MPEG-4, suggest considering the video content when encoding the video sequence. More specifically, MPEG-4 proposes an object-based presentation, which allocates more bits to encode moving objects. In this concept, the moving object is assumed to be more important in a scene. Given the limited bandwidth, these standards might lead to a quality enhancement of the reconstructed video at the decoder, because moving objects (the important parts of a scene) will be clearly recovered, with visual losses only in the background parts.

Apparently, this idea is rather generic and can be used for many different applications. However, it's only a conceptual solution, which can't be easily realized due to the difficulty of designing a generic object-extraction algorithm. Additionally, neither 3D camera-modeling techniques nor mixed-reality techniques are taken into account in this framework. The combination of these two techniques can provide the 3D position of the moving object, and it also allows users to create virtual views and backgrounds in terms of their own interests.

The current status of mixed-reality techniques for sports video can be divided into two categories. The research in the first category focuses on generating virtual scenes by means of multiple, synchronized video sequences of a given sports game.4,5 However, it's difficult and expensive to apply such systems for TV broadcasting in the current broadcasting framework, because only single-viewpoint video is available to the viewers at any time. The second category aims at synthesizing virtual sports scenes from a single-view, TV-broadcast video, which is the focus of the main article text.

In Matsui et al., the proposed system performs a camera-calibration algorithm to establish a mapping between the soccer playing field in the image and that of the virtual scene.6 The player's posture is selected from three basic choices (stop, walk, or run) using the player's motion direction and speed. Finally, computer-graphics techniques (that is, OpenGL) are employed to generate an animated scene from the viewpoint of any player.

The work reported in Liang et al.7 is an improved version of Matsui et al.,6 where a more advanced tracking approach for the players and ball is realized. Such systems still suffer from two problems. First, the so-called camera calibration only builds a 2D homography mapping without providing the exact camera parameters. As a result, the virtual scene generated by this technique might be quite different from the original scene. Second, the graphics-based animation is unfortunately not very realistic, because the texture and motion of the player are lost completely.

In our previous work,8 we discussed the generation of virtual scenes from broadcast sports video by using a real camera-calibration technique, where all the camera parameters can be obtained. The current article explores this previous work to solve the problems of bandwidth limitation and small display size for sports video on mobile devices. As far as we know, there is no system so far that uses a (camera) modeling-based, mixed-reality concept to facilitate mobile applications of sports video.

Several publications have been devoted to camera calibration for sports video.4,9-11 Liu et al.9 propose a self-calibration method to extract 3D information from broadcast soccer video. Their work is based on Zhang's method,12 where the camera is calibrated from two homography mappings without the need for 3D geometry knowledge of the scene. Zhang's technique assumes that the camera's intrinsic parameters are fixed during the calibration process, which cannot be applied to running broadcast videos where, for example, the focal length changes frequently during video capture (this explains the reported error in Liu et al.,9 which is around 25 percent).

In Yu et al., the authors present a novel method for calibrating tennis video using six point correspondences and design different methods to refine the clip- and frame-varying camera parameters.10 We proposed a system for calibrating sports video on the basis of randomly selected points from the net line to indicate the scene height.11 The difference with Yu et al.10 is that their method relies on the detection of the top points of two net posts, which is not robust, as the net posts might not be visible in the image. Our method proved to be more generally applicable (we only need a part of the net) to badminton, tennis, and volleyball. Subsequently, our method8 was successfully integrated into a semantic-level sports-video analyzer,13 leading to a wide range of analysis results at different levels.

Sidebar references

1. S. Chang, D. Zhong, and R. Kumar, "Real-Time Content-Based Adaptive Streaming of Sports Videos," Proc. IEEE Workshop Content-Based Access of Image and Video Libraries, IEEE Press, 2001, pp. 139-143.
2. H. Knoche, J. McCarthy, and M. Sasse, "Can Small Be Beautiful? Assessing Image Resolution Requirements for Mobile TV," Proc. ACM Multimedia, ACM Press, 2005, pp. 829-838.
3. K. Seo et al., "An Intelligent Display Scheme of Soccer Video on Mobile Devices," IEEE Trans. Circuits and Systems for Video Technology, vol. 17, no. 10, 2007, pp. 1395-1401.
4. T. Bebie and H. Bieri, "A Video-Based 3D-Reconstruction of Soccer Games," Eurographics, vol. 19, no. 3, 2000, pp. 391-400.
5. N. Inamoto and H. Saito, "Virtual Viewpoint Replay for a Soccer Match by View Interpolation from Multiple Cameras," IEEE Trans. Multimedia, vol. 9, Oct. 2007, pp. 1155-1166.
6. K. Matsui et al., "Soccer Image Sequence Computed by a Virtual Camera," Proc. Computer Vision and Pattern Recognition, IEEE Press, 1998, pp. 860-865.
7. D. Liang et al., "Video2Cartoon: A System for Converting Broadcast Soccer Video into 3D Cartoon Animation," IEEE Trans. Consumer Electronics, vol. 53, Aug. 2007, pp. 1138-1146.
8. J. Han, D. Farin, and P.H.N. de With, "A Real-Time Augmented Reality System for Sports Broadcast Video Enhancement," Proc. ACM Multimedia, ACM Press, 2007, pp. 337-340.
9. Y. Liu et al., "Extracting 3D Information from Broadcast Soccer Video," Image and Vision Computing, vol. 24, 2006, pp. 1146-1162.
10. X. Yu et al., "Inserting 3D Projected Virtual Content into Broadcast Tennis Video," Proc. ACM Multimedia, ACM Press, 2006, pp. 619-622.
11. J. Han, D. Farin, and P.H.N. de With, "Generic 3-D Modelling for Content Analysis of Court-Net Sports Sequences," Proc. Int'l Conf. Multimedia Modeling, Springer, 2007, pp. 279-288.
12. Z. Zhang, "A Flexible New Technique for Camera Calibration," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 11, 2000, pp. 1330-1334.
13. J. Han, D. Farin, and P.H.N. de With, "Broadcast Court-Net Sports Video Analysis Using Fast 3-D Camera Modeling," IEEE Trans. Circuits and Systems for Video Technology, vol. 18, no. 11, 2008, pp. 1628-1638.
System architecture

The architecture of our proposed system is composed of several interacting, but clearly separated, modules. Figure 1 depicts the system architecture with its major functional units and the data flow.

[Figure 1. Architecture of the complete system, which is constructed from camera calibration; player and ball information extraction; encoding and decoding; mixed-reality scene generation; and intelligent display.]

The most important modules are as follows:
• Camera calibration. To generate mixed-reality sports scenes, the original camera parameters have to be calculated using the broadcast sports video as input. The goal of this module is to compute a camera projection matrix from the input video and to further decompose it into camera intrinsic and extrinsic parameters.

• Player and ball information extraction. To preserve the visual nature of the original human motion, the information concerning the players and ball, such as position, shape, and texture, must be extracted from the real video and texture-mapped onto the virtual video. To this end, we use a player-segmentation algorithm, which is discussed in our previous work.1,2 Our approach is based on the background-subtraction technique, incorporating a shadow detector. Our player-segmentation algorithm considers the relation between the player position and the court-line location, so it's easy to reject other moving objects, like the ball boy. Meanwhile, a ball-tracking algorithm that employs graph theory is implemented.3
• Information encoding and decoding. In our system, we only need to transmit the camera parameters and the information about the players and the ball to the decoder. A simple lossless compression scheme is adopted to encode all the required information.

• Virtual camera creation and intelligent display. Once the original camera parameters at each frame are available, it's possible to construct mixed-reality scenes of the sports video on a smart mobile device. Our idea is to precisely recover the important information, such as the court lines, players, and ball, but to generate virtual scenes for less important areas (from the normal user's point of view), such as the ground field and commercial billboards. If the mobile device screen is too small, we can magnify the ball and the court-net lines. Moreover, a virtual camera derived from the original camera helps to synthesize a variety of virtual scenes, such as the scene from the player's viewpoint.
Camera calibration
The camera-calibration task provides a geometric transformation that maps points in real-world coordinates to the image domain. Because the court model is a 3D scene but the displayed image is a 2D plane, this mapping can be written as a 3 × 4 projection matrix $M$, which transforms a point $p = (x, y, z, 1)^T$ in real-world coordinates to image coordinates $p' = (u, v, w)^T$ by $p' = Mp$, which is equivalent to

$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}$$
Because $M$ is scale invariant, 11 free parameters have to be determined. They can be calculated from six points whose positions are known in both the 3D world coordinates and the image. Matrix $M$ can be further decomposed into camera intrinsic and extrinsic parameters, described by

$$M \cong K \left[\, R \mid -Rt \,\right] \qquad (1)$$

where

$$K = \begin{pmatrix} f_x & s & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{pmatrix}, \quad R = \begin{pmatrix} i^T \\ j^T \\ k^T \end{pmatrix}, \quad t = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix} \qquad (2)$$
Matrix $M$ in Equation 1 represents the perspective camera model, which is formed by the camera intrinsic parameters $K$ and the camera extrinsic parameters $[R \mid -Rt]$. The intrinsic parameters describe properties of the camera, such as its focal length and the image geometry, while the extrinsic parameters describe the camera placement and orientation in the 3D world. The upper-triangular calibration matrix $K$ encodes the intrinsic parameters of the camera: $f_x$ and $f_y$ represent the focal length, $(u_0, v_0)$ is the principal point, and $s$ is a skew parameter. Matrix $R$ is the rotation matrix, with $i$, $j$, and $k$ denoting the rotation axes, and vector $t$ is the translation vector. Together, $R$ and $t$ form the camera extrinsic parameters.
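As an illustration of this decomposition step, the following sketch (Python with NumPy/SciPy; a minimal example of the standard RQ-based factorization, not the authors' implementation) recovers K, R, and t from a given M:

```python
import numpy as np
from scipy.linalg import rq

def decompose_projection(M):
    """Split a 3x4 projection matrix M ~ K [R | -Rt] into K, R, t.

    Minimal sketch: RQ-decompose the left 3x3 block into an upper-
    triangular K and an orthogonal R. Assumes the block is full rank
    and ignores the det(R) = -1 corner case.
    """
    A = M[:, :3]                    # left 3x3 block, equal to K R up to scale
    K, R = rq(A)                    # A = K R, with K upper triangular
    S = np.diag(np.sign(np.diag(K)))
    K, R = K @ S, S @ R             # force a positive diagonal in K (S S = I)
    # M[:, 3] ~ -K R t, so t = -A^{-1} M[:, 3], independent of the scale.
    t = -np.linalg.solve(A, M[:, 3])
    return K / K[2, 2], R, t        # normalize K so that K[2, 2] = 1
```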
In this article, we adopt the basic concept of Han, Farin, and de With4 that takes at least six points arranged in two perpendicular planes to compute M. The court and net lines characterize these two planes. Additionally, we explore the EM approach to select feature points, instead of using a random selection,4 thereby improving the accuracy of the projection matrix M. Our aim is to make matrix M accurate enough to enable a decomposition into camera
intrinsic and extrinsic parameters with the purpose of deriving a virtual camera view.

For the selection of EM-based feature points, we assume that enough court lines have been detected in the image using our previous technique.4 The remaining task is to find six point correspondences from these lines to compute M. In our approach, we select four points from the ground plane and two points from the net plane. In the ground plane, the intersections of the court lines establish four point correspondences (see Figure 2 for an example). Note that, in the sequel, we explain the 3D modeling for a tennis court, but this can be equally applied to volleyball and badminton.

[Figure 2. (a) The lines and points selected in the image, and (b) their correspondences in the standard model. Six points are used for calibration.]

The way to extract feature points from the net line is actually more complex. In our previous work,4 we assumed that a change in object height only affects the vertical coordinate of the image. This implicitly assumes that we can neglect the perspective foreshortening in the z-direction, because the object heights are relatively small compared to the whole field of view. In other words, any vertical projection
line onto the ground plane in the 3D domain remains vertical in the image domain. In this way, two arbitrary points P5' and P6' on the net line in the 3D model (see Figure 2) correspond to points T5 and T6 in the image. However, because the broadcast video is normally captured from a top view, this assumption doesn't hold in many practical cases. It appears that there is a slight slope difference between the 3D projection line and the visible line in the image, which corresponds to the viewing angle of the camera onto the scene. This angle increases with a higher camera position. Although this phenomenon doesn't change the projection matrix significantly, it has a profound influence on an accurate decomposition of the matrix. Depending on the position of the camera and the distance between points in the image, the projection matrix might not be Euclidean by nature, so the decomposition into camera parameters gives large errors. Our strategy is to select the feature points on the net line more carefully so that they yield a better Euclidean setting of the projection problem. Consequently, the decomposition of the projection matrix can become sufficiently accurate.

Using our previous method4 to extract two initial net-line points, we can find many candidate points around these two initial points (as Figure 2 shows). Employing the EM-based method, we classify these candidates into two categories: acceptable points (AP) and rejected points (RP). From the set of APs, we choose the best point through maximum-likelihood inference.
Suppose that the computed M can be decomposed into camera parameters as described by Equation 1. Ideally, the camera's principal point (u0, v0) should be at the image's center. Due to the presence of noise, the principal point might not be at the exact image center, but at least it should be close. In other words, the distance between the computed principal point and the image center can be used to evaluate the quality of the matrix decomposition. On the basis of this distance, called dk, a candidate point is classified into an AP or an RP, using the
iterative EM procedure. Assuming $N$ candidate points indexed by $k$, we have a two-class problem (AP $= w_1$, RP $= w_2$) based on the distance $d_k$. More specifically, we need to estimate the posterior $p(w_i \mid d_k)$ for each point. By Bayes' rule, this posterior equals

$$p(w_i \mid d_k) = \frac{p(d_k \mid w_i, \mu_i, \sigma_i)\, p(w_i)}{p(d_k)}$$

Here, $p(d_k) = \sum_{i=1}^{2} p(w_i)\, p(d_k \mid w_i, \mu_i, \sigma_i)$, which is represented by a Gaussian mixture model. In addition,

$$p(d_k \mid w_i, \mu_i, \sigma_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, \exp\!\left(-\frac{(d_k - \mu_i)^2}{2\sigma_i^2}\right)$$

Now, the problem reduces to estimating $p(w_i)$, $\mu_i$, and $\sigma_i$, which can be done iteratively using the EM update equations.

The EM process is initialized by choosing class posterior labels on the basis of the observed distance; the shorter the distance of a point, the greater its initial posterior probability of being an AP, so that

$$p^{(0)}(w_2 \mid d_k) = \min\!\left(1.0,\; d_k \Big/ \sqrt{c_x^2 + c_y^2}\right), \qquad p^{(0)}(w_1 \mid d_k) = 1 - p^{(0)}(w_2 \mid d_k)$$

Here, $(c_x, c_y)$ denotes the image center.
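The following sketch (Python/NumPy; our own illustrative code, with a fixed iteration count as a simple stopping criterion) runs this two-class EM on the distances d_k:

```python
import numpy as np

def em_classify(d, cx, cy, iters=20):
    """Classify candidate points into AP (w1) / RP (w2) from distances d_k."""
    d = np.asarray(d, dtype=float)
    # Initialization: shorter distance -> higher probability of being an AP.
    p2 = np.minimum(1.0, d / np.hypot(cx, cy))   # p^(0)(w2 | d_k)
    post = np.stack([1.0 - p2, p2], axis=1)      # N x 2 posterior matrix
    for _ in range(iters):
        # M-step: update priors, means, and deviations of both classes.
        pw = post.mean(axis=0)                                # p(w_i)
        mu = (post * d[:, None]).sum(0) / post.sum(0)         # mu_i
        var = (post * (d[:, None] - mu) ** 2).sum(0) / post.sum(0)
        sigma = np.sqrt(var) + 1e-9                           # sigma_i
        # E-step: recompute posteriors from the Gaussian likelihoods.
        lik = np.exp(-(d[:, None] - mu) ** 2 / (2 * sigma**2)) \
              / (np.sqrt(2 * np.pi) * sigma)
        post = lik * pw
        post /= post.sum(axis=1, keepdims=True)
    return post[:, 0] > post[:, 1]               # True where the point is an AP
```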
Having acquired the APs, the next step is to find the best point among them. The basic idea is to measure the distance between a virtual court configuration and the detected court configuration in the picture. Assuming that we have in total k APs from the previous step, we obtain k camera matrices accordingly, where the ith camera matrix is denoted as Mi. For each AP, a virtual court configuration can be generated by projecting the 3D real-world court-net configuration (derived from the model) back to the image on the basis of Mi. Among the k virtual court configurations, the configuration minimizing a matching error can be identified as the
best, final solution. The matching error is formulated as

$$E_i = \sum_{j=1}^{m} \left\| L_{ij},\, L_{vj}(M_i) \right\|$$

Line $L_{ij}$ is the $j$th detected line of the court configuration formed by $m$ lines in the picture, and $L_{vj}(M_i)$ denotes the corresponding line in the virtual configuration. The metric $\|\cdot, \cdot\|$ denotes the distance between two lines. The matrix $M_i$ giving the minimum $E_i$ is selected as the best one. The last step is to decompose matrix $M_i$ into camera intrinsic and extrinsic parameters (see Equation 1).
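As an illustration of this selection step, here is a hypothetical Python fragment; the article relies on a dedicated line-distance metric, which we replace with a crude endpoint-based distance, and it assumes the detected and model segments are supplied in corresponding order:

```python
import numpy as np

def project_point(M, X):
    """Project a homogeneous 3D point X = (x, y, z, 1) with a 3x4 matrix M."""
    u, v, w = M @ X
    return np.array([u / w, v / w])

def segment_distance(seg_a, seg_b):
    """Stand-in line distance: mean distance between matched endpoints."""
    (a0, a1), (b0, b1) = seg_a, seg_b
    d1 = np.linalg.norm(a0 - b0) + np.linalg.norm(a1 - b1)
    d2 = np.linalg.norm(a0 - b1) + np.linalg.norm(a1 - b0)
    return 0.5 * min(d1, d2)

def best_matrix(candidate_Ms, detected_segs, model_segs):
    """Return the candidate M_i minimizing E_i = sum_j ||L_ij, L_vj(M_i)||."""
    def error(M):
        return sum(segment_distance(det, (project_point(M, p0),
                                          project_point(M, p1)))
                   for det, (p0, p1) in zip(detected_segs, model_segs))
    return min(candidate_Ms, key=error)
```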
Player and ball information extraction

As mentioned previously, we want to recover the information of the players and ball, including position, shape, and texture, in the mixed-reality scene, so this kind of information needs to be extracted from the input video. In this system, our previous technique1 is adopted to segment moving players. Tennis-ball tracking is a more challenging problem, as the ball is a small object (with a diameter of approximately 6.5 centimeters) traveling at speeds of up to 150 miles per hour. We base our ball-tracking algorithm on graph theory, which was initially used by Yu et al.5 and further improved by Yan, Christmas, and Kittler.3 To reduce the computational load of the algorithm used in these works, we assume that

• the initial ball position is known in the image (manually labeled), and

• the ball only appears in the court area, with a small border extension (of about 2 meters) around the court (90.2 percent of the cases in our database satisfy this condition).
First, we implement ball-candidate detection. With our previous work,1 a binary map indicating moving objects is obtained by employing a background-subtraction technique. Here, we use a connected-component labeling algorithm to isolate each object in the binary map. We suppose that a tennis ball in the image can't be larger than a predefined square box with a size of r × r samples; therefore, all larger objects are removed.
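A minimal sketch of this filtering step (Python with SciPy; the function name is ours, and the authors' implementation may differ) labels the connected components in the binary foreground map and keeps only blobs that fit in an r × r box:

```python
import numpy as np
from scipy import ndimage

def ball_candidates(binary_map, r):
    """Return (position, size) pairs for ball-sized blobs in the binary map."""
    labels, n = ndimage.label(binary_map)          # connected components
    candidates = []
    for sl in ndimage.find_objects(labels):        # bounding-box slices
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        if h <= r and w <= r:                      # drop everything larger
            cy = (sl[0].start + sl[0].stop) / 2.0
            cx = (sl[1].start + sl[1].stop) / 2.0
            candidates.append(((cx, cy), (w, h)))
    return candidates
```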
Next, after candidate detection over Q consecutive frames, a ball-trajectory graph is established. Each node in this graph is assigned a node weight, representing its resemblance to a ball. Meanwhile, each edge is associated with an edge weight, referring to the likelihood that two nodes in the same graph correspond to each other when considering the motion and the direction of a ball. More specifically, the node weight is the probability of a candidate Oi being a ball, given the color and size information. Using Bayes' rule, we can express this in terms of the distribution of the ball's color and size data:
$$N_i^t = P(O_i^t \to \mathrm{ball} \mid \mathrm{color}, \mathrm{size}) \propto P(\mathrm{color}, \mathrm{size} \mid O_i^t \to \mathrm{ball})\, P(O_i^t \to \mathrm{ball})$$

where superscript $t$ denotes the index of the current frame, and subscript $i$ identifies the $i$th ball candidate in frame $t$. We consider the color and size information of the ball independently, so that $P(\mathrm{color}, \mathrm{size} \mid O_i^t \to \mathrm{ball}) = P(\mathrm{color} \mid O_i^t \to \mathrm{ball})\, P(\mathrm{size} \mid O_i^t \to \mathrm{ball})$. We assume that the color of the ball has a Gaussian distribution, whose $\mu$ and $\sigma$ can be computed from several tennis-ball samples, taking varying outdoor lighting conditions into account. The size likelihood is $P(\mathrm{size} \mid O_i^t \to \mathrm{ball}) = w \cdot h / (r \cdot r)$, where $w$ and $h$ represent the width and height of the candidate; if the size of a candidate is more similar to that of the predefined ball, that candidate has a larger probability of being a ball.
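The node weight could be computed along these lines (a hypothetical fragment; the Gaussian here is unnormalized, the prior is taken as uniform, and mu_c, sigma_c stand for the offline-estimated ball-color statistics mentioned above):

```python
import numpy as np

def node_weight(mean_color, w, h, mu_c, sigma_c, r):
    """Unnormalized node weight: Gaussian color term times the size term."""
    color_lik = np.exp(-np.sum((np.asarray(mean_color) - mu_c) ** 2
                               / (2 * np.asarray(sigma_c) ** 2)))
    size_lik = min(1.0, (w * h) / float(r * r))   # P(size | candidate -> ball)
    return color_lik * size_lik
```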
An edge $E_{i,j}$ of the graph connects nodes in frame $t$ with nodes in frame $t+1$ or $t+2$. Such a connection is based on the observation that the ball positions in any two adjacent frames are usually quite close. As a consequence, the edge weight of the graph can be formulated as

$$E_{i,j} = K\!\left(\frac{x_j^{t+n} - \left(x_i^t + n \cdot d_s\right)}{k_w}\right) \qquad (3)$$

Here, $K(\cdot)$ is the Epanechnikov kernel function, which is specified by

$$K(y) = \begin{cases} 1 - \|y\|^2 & \text{for } \|y\|^2 \le 1 \\ 0 & \text{otherwise} \end{cases}$$

In Equation 3, $x_j^{t+n}$ and $x_i^t$ represent the position of the $j$th node in the $(t+n)$th frame (with $n = 1$ or $2$) and the position of the $i$th node in frame $t$, respectively. Parameter $d_s$ refers to a standard displacement of a moving ball over two consecutive frames, such as its average speed, and $k_w$ specifies the width of the kernel function. These last two parameters can be set manually, depending on the image resolution.
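Assuming 2D pixel coordinates and treating d_s as a displacement vector, Equation 3 translates into a few lines (illustrative names, not the authors' code):

```python
import numpy as np

def epanechnikov(y):
    """K(y) = 1 - ||y||^2 if ||y||^2 <= 1, else 0."""
    q = np.dot(y, y)
    return 1.0 - q if q <= 1.0 else 0.0

def edge_weight(x_i, x_j, n, d_s, k_w):
    """E_ij = K((x_j^{t+n} - (x_i^t + n * d_s)) / k_w), per Equation 3."""
    y = (np.asarray(x_j) - (np.asarray(x_i) + n * np.asarray(d_s))) / k_w
    return epanechnikov(y)
```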
Finding the optimal path of a graph is a typical dynamic-programming problem. Because there are many possible paths from the first to the last frame (for example, from frame t to t+2 in Figure 3), we need to identify the path having the maximum weight, which indicates the most likely ball trajectory.

[Figure 3. Illustration of the ball-path extraction based on a weighted graph, where the optimal path is marked with a red line.]
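A sketch of the corresponding dynamic program follows (it reuses edge_weight from the preceding fragment; allowing a path to start at any frame is our simplification). nodes[t] is a list of (position, node_weight) pairs for frame t; edges go to frames t+1 and t+2, which lets the path survive a single missed detection:

```python
def best_ball_path(nodes, d_s, k_w):
    """Return the maximum-weight path as a list of (frame, node) indices."""
    score = [[w for _, w in frame] for frame in nodes]      # best path weight
    back = [[None] * len(frame) for frame in nodes]         # back pointers
    for t in range(len(nodes)):                             # frames in order
        for i, (x_i, _) in enumerate(nodes[t]):
            for n in (1, 2):                                # edges to t+n
                if t + n >= len(nodes):
                    continue
                for j, (x_j, w_j) in enumerate(nodes[t + n]):
                    s = score[t][i] + edge_weight(x_i, x_j, n, d_s, k_w) + w_j
                    if s > score[t + n][j]:
                        score[t + n][j] = s
                        back[t + n][j] = (t, i)
    # Pick the highest-scoring end node and trace the path back.
    t, i = max(((t, i) for t, frame in enumerate(nodes)
                for i in range(len(frame))),
               key=lambda ti: score[ti[0]][ti[1]])
    path = [(t, i)]
    while back[t][i] is not None:
        t, i = back[t][i]
        path.append((t, i))
    return path[::-1]
```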
Parameter data compression

In our system, we need to transmit three different types of information to mobile devices: camera parameters, multiple-player information, and ball information. A lossless, fixed-length compression scheme encodes all three information types:
• Camera-parameter compression. Eleven camera parameters (see Equation 2) require compression. We first clip the value of each parameter to a six-digit number with two digits behind the decimal point. Then, four bits are assigned to encode each digit, and three bits are used to indicate the position of the decimal point, giving 27 bits per parameter. Therefore, we need, in total, 297 bits to encode all camera parameters (see the encoding sketch after this list).
• Player-information compression. Instead of encoding the absolute position of each pixel belonging to a player, we encode the bounding box's location and the moving player's binary map, where the bounding-box location is described by the coordinates of its top-left point and its width and height. The binary map can be transmitted directly without compression; in it, 0 represents a background pixel and 1 a foreground pixel. The combination of the bounding-box location and the binary map allows the reconstruction of the absolute position of each pixel of a moving player at the decoder. Additionally, we also need to encode the RGB value of each foreground pixel. As with the camera-parameter compression, we allocate a fixed number of bits per pixel.
• Ball-information compression. We only encode the position and the radius of the detected ball. The ball color is predetermined by the user or set to red by default at the decoder, to make the ball easily noticeable on the small screen.
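For concreteness, here is a minimal sketch of the fixed-length camera-parameter encoding described in the first item above (Python; the function names are ours, and sign handling is omitted because the article doesn't specify it):

```python
def encode_parameter(value, n_digits=6, n_frac=2):
    """Encode one parameter: 6 BCD digits (4 bits each) + 3-bit point position."""
    scaled = round(abs(value) * 10**n_frac) % 10**n_digits  # clip to 6 digits
    bits = ""
    for d in f"{scaled:0{n_digits}d}":
        bits += f"{int(d):04b}"          # 4 bits per decimal digit
    bits += f"{n_frac:03b}"              # 3 bits for the decimal-point position
    return bits                          # 27 bits per parameter

def encode_camera(params):
    """Encode the 11 camera parameters: 11 * 27 = 297 bits in total."""
    assert len(params) == 11
    return "".join(encode_parameter(p) for p in params)
```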
Mixed-reality scene generation and intelligent display

Once we have obtained the required information at the decoder, the next step is to synthesize the mixed-reality scene. Normally, our mixed-reality scene is formed by the court lines, the ground field, commercial billboards and logos, and the players and ball.
• Court-lines generation. Because we have acquired the camera parameters and we also know the physical position of each court line in the real-world domain, it's easy to reconstruct the lines in the mixed-reality scene. Our basic idea is to transform the intersection points of the lines from the 3D domain to the image domain using Equation 2; connecting the corresponding points according to the layout of a standard court model then recovers the court lines in the mixed-reality scene (see the projection sketch after this list). If the mobile screen is too small, we can magnify those lines.
• Ground-field generation. Compared with the court lines, the ground field is less important, because viewers usually don't care about the type of the ground field. In our system, we have several court fields with different colors, which can be chosen randomly or by the viewers themselves.
• Commercial billboard and logo generation. The insertion of virtual advertising into a sports video is an interesting topic, because it enhances the commercial opportunities for advertisers, content providers, and broadcasters. Unlike conventional methods that
insert virtual 2D advertisements into the image,6 our algorithm is capable of inserting 3D scenes into the image by means of 3D camera modeling. Therefore, both advertisements that should be projected onto the ground plane and those that are perpendicular to the ground plane can be successfully inserted into the image. Figure 4a shows an example, where an institute logo is projected twice in the correct perspective.
• Players and ball generation. Recovering the players and ball is relatively easy, because they are all available at the decoder. If users feel that the screen is too small, they may increase the ball size.
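The projection sketch referenced in the court-lines item could look as follows (hypothetical Python; COURT_SEGMENTS stands for the standard court model's 3D line segments, which the article derives from the known court dimensions):

```python
import numpy as np

def project(M, X):
    """Project a 3D point X = (x, y, z) with the 3x4 projection matrix M."""
    u, v, w = M @ np.append(X, 1.0)      # homogeneous 3D -> image coordinates
    return np.array([u / w, v / w])

def court_lines_in_image(M, court_segments):
    """Map each 3D model segment (P0, P1) to a 2D image segment for drawing."""
    return [(project(M, P0), project(M, P1)) for P0, P1 in court_segments]
```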
As mentioned, our system facilitates an intelligent visualization, where viewers are able to control a virtual camera derived from the original camera. We have realized three different ways of generating the view from a virtual camera. First, users can obtain their preferred viewing angle by changing the camera focal length (zooming in or out) or the camera rotation angle. Second, users can control the camera so that it concentrates on one specific player and tracks that player in the middle of the image. This feature is realized by minimizing the following function:
the following function:
D ¼ w
2� PxðK;R
_
; t;p��� ���
where PxðK;R_
; t;pÞ is the x-coordinate of the
projection point p of w in the image. p is the
real-world position of the target player, and
w is the width of the image. Figures 4b
and 4c demonstrate that we can virtually ro-
tate the camera and track a player in the mid-
dle of the image. Black lines indicate the
camera rotation angle. A third option for the
user is to locate the virtual camera on top of
a player’s head when the player is close to a
real camera position, enabling the viewer to
watch the match as a highly realistic experience.
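A minimal sketch of the player-centering control described above (Python/NumPy; the coarse angle scan and the rotation about the vertical axis are our simplifications, since the article only specifies the objective D):

```python
import numpy as np

def pan_to_center(K, R, t, p_world, img_width):
    """Find the pan angle minimizing D = |w/2 - P_x(K, R_hat, t, p)|."""
    def proj_x(pan):
        c, s = np.cos(pan), np.sin(pan)
        R_pan = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # rotate about y
        R_hat = R_pan @ R
        M = K @ np.hstack([R_hat, (-R_hat @ t).reshape(3, 1)])  # K [R|-Rt]
        u, v, w = M @ np.append(p_world, 1.0)
        return u / w
    pans = np.linspace(-np.pi / 4, np.pi / 4, 721)   # coarse 1-D scan
    return min(pans, key=lambda a: abs(img_width / 2 - proj_x(a)))
```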
Experimental results

To evaluate the operation and efficiency of the system, we tested it on more than 1 hour of video sequences, which were recorded from regular TV broadcasts containing tennis, badminton, and volleyball. In our test set, the video sequences have two resolutions: 720 × 576 and 320 × 240. Our test platform is a Pentium 4, 3-GHz computer with 512 Mbytes of RAM, programmed in C under Linux, because the combination of C and Linux is a popular framework for developing applications for smartphones.
System performance
In an objective evaluation, we tested our camera-calibration algorithm on six video clips. Three of them were tennis games on different court types, two were badminton games, and one was a volleyball game. Figure 5 shows sample pictures of the court detection, where three difficult scenes are selected. Evidently, the method of Yu et al.7 fails here, as one net post is not visible. Table 1 shows the evaluation results of our algorithm, which indicate that the detection of the court lines is correct for more than 90 percent of the sequences on average.

[Figure 5. Detection of the court lines and net line for different court types and viewing angles.]

Table 1. Court and net detection and camera calibration (the error is the average distance in pixels per point, measured over 30 frames).

Type         Court (%)   Net (%)   Method           Error   Principal point
Badminton    98.7        96.2      Other method4    1.6     (388, -121)
                                   Our method       1.1     (372, 256)
Tennis       96.1        95.8      Other method4    2.5     (399, -232)
                                   Our method       1.3     (395, 249)
Volleyball   95.4        91.8      Other method4    3.2     (320, 98)
                                   Our method       2.1     (340, 220)
Moreover, we compared our algorithm with our earlier work,4 and related both to a ground truth based on a manual selection of a few intersections in the 3D domain. We transformed those 3D reference points to the image domain using the transformation matrix (see Equation 2), and we measured
the distance between the projected 3D reference intersections and the ground truth in the image domain. Table 1 shows that, for a badminton clip, the average distance obtained by the approach of Han, Farin, and de With4 is 1.6 pixels per point, whereas our new algorithm has an error of only 1.1 pixels per point. We also give the coordinates of the camera principal points computed by both methods. Again, our principal point is much closer to the image center, which equals (360, 288). It should be noted that it's difficult to define objectively how much camera-parameter accuracy we need for this application. Instead, we turned to subjective investigations, which reveal that the user will clearly recognize projection errors if the distance between the projected 3D intersections and the ground truth is more than 15 pixels per point.
We implemented our proposed simple information-compression scheme and compared it with another low-complexity scheme that uses JPEG to encode each original video frame, where we assumed that the transmission bandwidth is fixed. To compare the two schemes fairly, neither of them considers motion estimation across video frames. We found that our proposal needs 6.35 Kbytes on average to encode one frame. We then set the JPEG quantization factor so that it used an equivalent amount of bits for compressing the video frames. This gives a visual-quality comparison of the reconstructed image at more or less the same bit rate.
Figure 6a shows some examples, where we can see that the JPEG image has visible blocking artifacts at this low bit rate, thereby blurring essential visual information, such as the players, court lines, and ball. This phenomenon doesn't occur in the image reconstructed using our compression scheme (see Figure 6b), in which the ball and the court lines are even more recognizable on the small screen. Meanwhile, we have noticed that the wireless bandwidth is always dynamic in a real application. Therefore, we also encoded the video at different bit rates and compared the quality of the reconstructed images. The conclusion is that images reconstructed using our scheme are visually similar at different bit rates, on the condition that the channel bandwidth is sufficient to transmit the camera parameters and the player and ball information. If this condition doesn't hold, the important parts, such as the player or ball, might not be completely reconstructed at the decoder. Similarly, we also compared the image quality of the JPEG scheme when the channel bandwidth increases from 6.4 Kbytes to 8.5 Kbytes (see Figure 6c). Here,
the players, court-net lines, and ball are still not clearly visible in the reconstructed image, even with the slightly increased bandwidth. Unfortunately, an objective quality metric such as peak signal-to-noise ratio can't be given, because the data is partly virtually generated in our scheme. The quality difference between JPEG and our mixed-reality system is explained by the fact that more bits are used to encode the important objects, like the players and ball, instead of compressing everything through averaging.

[Figure 6. Visual results obtained after compression: column (a) shows JPEG coding results at 6.4 Kbytes, column (b) shows our fixed-length coding scheme at 6.35 Kbytes, and column (c) shows JPEG coding results at 8.5 Kbytes.]
Let us now look at the results and possibilities when virtual reality is mixed with the actual sports scene. Figure 7 demonstrates some virtual scenes generated by our system for two badminton games (more demo movies can be found at http://vca.ele.tue.nl/demos/tennisanalysis/index.html). We generated two different virtual scenes for each game. For the badminton singles match, we translated the camera so that one player is always located at the midline of the captured image, and the viewer watches the game from an arbitrary viewing angle. For the badminton doubles match, we demonstrate cases with increased camera height and modified focal length. Our virtual scenes prove to be realistic, because we preserve the shape and motion of the player, which is better than the animation-based systems.8,9

[Figure 7. Illustration of virtual-scene generation by changing the original camera setting: row (a) shows the original images, row (b) shows camera rotation and focal-length modification, and row (c) shows a changed viewing angle and zooming.]
To further investigate the performance of our mixed-reality system, we conducted a subjective evaluation, where we used four demos together with the original videos (two tennis videos and two badminton videos). We invited six subjects to watch the demos and original videos on a laptop. The participants were all young researchers ranging in age from 23 to 34. They were all sports fans and usually watch tennis and badminton games. We asked them to compare the mixed-reality videos with the original videos, and also to answer the following questions:

• Q1: Are the court lines and ball clearer in the mixed-reality video than in the original video?

• Q2: Does the inserted virtual scene move with the camera?

• Q3: Does the synthesized scene visualize the sports game more effectively?

• Q4: Is the virtual scene realistic?

• Q5: Is the whole mixed-reality scene realistic?
Instead of only answering yes or no, subjects chose a score between 1 and 5, in which 5 corresponds to strongly accept, 4 to accept, 3 to marginally accept, 2 to reject, and 1 to strongly reject. The average scores from all subjects are given in Figure 8. While users gave positive answers to the questions about the system, we paid special attention to Q5, because it received the lowest average score. We found that the major reason for the lower score is that users weren't satisfied with the result of the player segmentation. In some video frames, the contour of the player, in particular the player farthest from the camera, was not completely extracted, because the player's suit color was similar to the background color.

[Figure 8. Subjective evaluation of our mixed-reality system, where the height of the bar indicates the average score for the corresponding question.]
System efficiency

In addition to showing system performance, we also want to demonstrate its efficiency. In principle, the efficiency of our camera-calibration technique mainly depends on the image resolution, and it is only slightly influenced by the content complexity of the picture. To verify this, we measured the per-frame execution time of our camera-calibration technique on a tennis video clip (320 × 240) and a badminton video clip (720 × 576). The results are given in Figure 9a. For the tennis video, the execution time for the initialization step (first frame) was 93 milliseconds (ms), and the execution times for the other frames were between 27 and 32 ms, depending on the frame's complexity.

For the badminton video, the initialization step required 457 ms, and the average execution time per frame was 131 ms. It's clear
that the execution time varies from lower-resolution to higher-resolution video, because the size (width and length) of a court line changes under different resolutions. We conclude that our proposal, when executed on a PC, is a near real-time 3D camera-calibration technique and is able to support real-time content-analysis applications.
We applied the complete system to a single tennis video (720 × 576), performing camera calibration, player and ball extraction, and data compression. The result is shown in Figure 9b. We found that the average execution time per frame was around 503.4 ms and that the player and ball extraction component consumed the most computation.

[Figure 9. System efficiency: (a) the execution time of the camera calibration on a 3-GHz PC for the tennis (320 × 240) and badminton (720 × 576) clips; (b) the share of each processing component in the entire compression system on a single tennis match: camera calibration 31 percent, player and ball extraction 66 percent, and data compression 3 percent (average execution time per frame is 503.4 ms).]
It's important to test the execution time and memory usage of our virtual-scene-generation algorithm, because it will be the heaviest part running on the smartphone. We evaluated this algorithm in our PC-based environment using a single badminton video (720 × 576), and found that the execution time per frame is around 45.7 ms and the memory usage is about 22.6 Mbytes. This result shows that it's potentially possible to run our algorithm on a smartphone in real time.
Conclusions and discussion

Because of the accuracy of the 3D camera modeling, our system might be suited for other professional applications besides the enhanced viewing experience of sports video on mobile devices. For example, our modeling technique could be an efficient replacement for manually controlling the video camera to concentrate on and track an object of interest in the middle of an image. Moreover, our system could serve as a camera simulator, providing virtual videos with different camera settings and finding the perfect camera position and ideal camera setting through simulation.
Our entire system has not yet been commercialized. However, it is conceptually close to an object-based video-compression scheme, such as MPEG-4. The recent achievements of commercial products based on MPEG-4 clearly show the possibility of commercializing our system in the near future. For example, existing commercial MPEG-4 products offer a broader and more sophisticated class of fully reactive and interactive content, including video-object segmentation of arbitrary shapes and user-interaction features integrated within the content. Some products, such as Thomson's set-top box, even enable users to create and render 2D and 3D graphics. Solutions such as our proposed algorithm for object rendering provide good references for further commercialization. In fact, an improved version of our object-segmentation technique, the key component of our system, has been transferred to industry (see http://www.vinotion.nl), resulting in commercial software.
The current system can still be improved in some aspects. For example, the entire system can't yet achieve real-time operation when executed on a single-CPU platform. Fortunately, modern smartphones increasingly deploy more than one CPU core. Our algorithm can be nicely split into two
parts after the decoder obtains the camera parameters: the first part performs lossless decoding of the player and ball information, and the second part generates the virtual scene according to the user's preference. If these parts can be executed in parallel on two CPUs, we estimate that the entire system can achieve real-time execution.
References

1. J. Han, D. Farin, and P.H.N. de With, "Broadcast Court-Net Sports Video Analysis Using Fast 3-D Camera Modeling," IEEE Trans. Circuits and Systems for Video Technology, vol. 18, no. 11, 2008, pp. 1628-1638.
2. J. Han and P.H.N. de With, "Real-Time Multiple People Tracking for Automatic Group-Behavior Evaluation in Delivery Simulation Training," Multimedia Tools and Applications, vol. 51, no. 3, 2011, pp. 913-933.
3. F. Yan, W. Christmas, and J. Kittler, "Layered Data Association Using Graph-Theoretic Formulation with Applications to Tennis Ball Tracking in Monocular Sequences," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 10, 2008, pp. 1814-1830.
4. J. Han, D. Farin, and P.H.N. de With, "Generic 3-D Modelling for Content Analysis of Court-Net Sports Sequences," Proc. Int'l Conf. Multimedia Modeling, Springer, 2007, pp. 279-288.
5. X. Yu et al., "A Trajectory-Based Ball Detection and Tracking Algorithm in Broadcast Tennis Video," Proc. IEEE Int'l Conf. Image Processing, IEEE Press, 2004, pp. 1049-1052.
6. K. Wan and X. Yan, "Advertising Insertion in Sports Webcasts," IEEE MultiMedia, vol. 14, no. 2, 2007, pp. 78-82.
7. X. Yu et al., "Inserting 3D Projected Virtual Content into Broadcast Tennis Video," Proc. ACM Multimedia, ACM Press, 2006, pp. 619-622.
8. K. Matsui et al., "Soccer Image Sequence Computed by a Virtual Camera," Proc. Computer Vision and Pattern Recognition, IEEE Press, 1998, pp. 860-865.
9. D. Liang et al., "Video2Cartoon: A System for Converting Broadcast Soccer Video into 3D Cartoon Animation," IEEE Trans. Consumer Electronics, vol. 53, Aug. 2007, pp. 1138-1146.
Jungong Han is a researcher at the Centre for Mathematics and Computer Science (CWI) in Amsterdam, and was previously a researcher on video content analysis in the Signal Processing Systems department at the University of Technology Eindhoven, Netherlands. His research interests include content-based video analysis and video compression. Han has a PhD in communication and information engineering from Xidian University. Contact him at [email protected].

Dirk Farin is a senior researcher at Robert Bosch, Corporate Research, in Hildesheim, Germany. His research interests include object classification, 3D reconstruction, and video compression. Farin has a PhD in electrical engineering from the University of Technology Eindhoven, Netherlands. In 2008, he received three best student paper awards for his work on video coding. Contact him at [email protected].

Peter H.N. de With is a professor at the University of Technology Eindhoven, Netherlands, where he leads a chair on video coding and architectures. His research interests include video coding, architectures, and their realization. De With has a PhD in electrical engineering from the University of Technology Delft, Netherlands. He is an IEEE Fellow. Contact him at [email protected].