Ph.D. Thesis
VoIP Streaming Over Packet-Based Networks
Mirko Luca Lobina
Advisor: Prof. Luigi Atzori
Department of Electrical and Electronic Engineering (DIEE), University of Cagliari
Contents
My Thanks, Introduction and Sketch of the Thesis
I. The Playout Buffering
II. Previous Works
III. The ITU-T E-Model
IV. The eEM Playout Strategy
i. Analysis of the Packet Loss Burstiness
j. Extension of the E-Model
k. Playout Buffering by Quality Maximization (eEM)
l. Analysis of the Computational Complexity of the eEM Strategy
V. Experiments
VI. Conclusions
Appendix I: The ITU-T H.323
Appendix II: Intrusive and Non-Intrusive Evaluation of Speech Quality
Appendix III: The ITU-T G.729A-VAD, a low bit-rate speech codec
Appendix IV: Other Research Fields and Articles during the Ph.D. period
i) The EM Playout Strategy
ii) Audio Watermarking using Psychoacoustic Model II
Index of Figures and Tables
Figure 1, pp. 5: Playout buffering to compensate the transmission jitter.
Figure 2, pp. 19: Sketch of the proposed playout buffering strategy.
Figure 3, pp. 23: 4-state Markov model for the packet loss process.
Figure 4, pp. 27: Effects of the playout buffer setting on the packet loss burstiness.
Figure 5, pp. 28: Predicted and measured functions of g, b, g_e, b_e for talkspurt 25 in Trace 4: the continuous and dashed lines have been drawn from measured and predicted values, respectively.
Figure 6, pp. 30: Average R Factor versus N for the 8 test traces.
Figure 7, pp. 33: Comparison of the proposed algorithm with the competing ones for the first talkspurts of Trace 8.
Figure 8, pp. 34: Overall R Factor for Traces 7 and 8.
Figure 9, pp. 40: H.323 Terminals on a Packet Network.
Figure 10, pp. 43: An H.323 Zone.
Figure 11, pp. 51: H.323 Call Establishment.
Figure 12, pp. 52: IP Telephony: H.323 Interworking with SCN.
Figure 13, pp. 52: The PESQ Strategy.
Figure 14, pp. 57: The ITU-T R Factor.
Figure 15, pp. 58: A non-intrusive strategy for measuring objective voice quality (Psytechnics).
Figure 16, pp. 73: Comparison of the proposed EMv1 algorithm with the Linear Filter and Concord algorithms for the third experiment.
Figure 17, pp. 75: Distribution of the mean difference of the samples in un-watermarked and watermarked signals.
Figure 18, pp. 77: Steps (1-4) of the Patchwork shaping algorithm.
Figure 19, pp. 78: Probability density function of detection for the random variable z, varying the dimension of the patch with SNR = 26.
Table I, pp. 21: Voice traces used during experiments.
Table II, pp. 22: Burstiness analysis results.
Table III, pp. 30: Settings for the eEM algorithm parameters used in the experiments.
Table IV, pp. 31: Comparison of eEM with other strategies for Traces 1-4. In the last two columns
Table V, pp. 32: Comparison of eEM with other strategies for Traces 5-8.
Table VI, pp. 32: Results of the eEM algorithm for Traces 1-4 when assuming loss randomness: the equipment impairment factor has been evaluated by means of (5).
Table VII, pp. 73: P_χ2, P_u, and P_c for four of the performed experiments after applying the EMv1
Table VIII, pp. 73: Results in terms of d_e2e and the R Factor for the three proposed algorithms (EMv1, EMv2, and EMv3) and the two comparing approaches (Linear Filter and Concord). The potential maximum R Factor is also presented.
Table IX, pp. 80: Error Probabilities for lossy compression at different rates.
My Thanks: this Thesis is the result of the professional collaboration with Dr. Luigi Atzori. My thanks go to Dr. Atzori mainly for his skilled guidance and steady support. The algorithms and reflections proposed in this work cover the three years spent as a Ph.D. student at MCLab (DIEE), the CNIT Multimedia Communication Lab at the University of Cagliari.
Introduction to the Thesis: the core of this work is the study of real-time transmission and fruition
mechanisms of speech contents over an Internet protocol (IP) network (i.e., Internet telephony).
Internet telephony refers to communications services (voice, facsimile, and/or voice-messaging applications) that are transported via the Internet rather than the public switched telephone network
(PSTN). The basic steps involved in originating an Internet telephone call are conversion of the
analog voice signal to digital format and compression/translation of the signal into IP packets for
transmission over the Internet; the process is reversed at the receiving end. The possibility of voice
communications traveling over the Internet, rather than the PSTN, first became a reality in February
1995 when Vocaltec, Inc. introduced its Internet Phone software. Designed to run on a 486/33-MHz
(or higher) personal computer (PC) equipped with a sound card, speakers, microphone, and modem,
the software compresses the voice signal and translates it into IP packets for transmission over the
Internet. This PC-to-PC Internet telephony works, however, only if both parties are using Internet
Phone software. In the relatively short period of time since then, Internet telephony has advanced
rapidly. Many software developers now offer PC telephony software but, more importantly, gateway
servers are emerging to act as an interface between the Internet and the PSTN. Equipped with voice-
processing cards, these gateway servers enable users to communicate via standard telephones. A call
goes over the local PSTN network to the nearest gateway server, which digitizes the analog voice
signal, compresses it into IP packets, and moves it onto the Internet for transport to a gateway at the
receiving end. With its support for computer-to-telephone calls, telephone-to-computer calls and
telephone-to-telephone calls, Internet telephony represents a significant step toward the integration of
voice and data networks. A complete description of the main scenarios and standards for Internet
telephony (e.g., ITU-T H.323) has been provided in Appendix I.
Originally regarded as a novelty, Internet telephony is attracting more and more users because it
offers tremendous cost savings relative to the PSTN. Users can bypass long-distance carriers and
their per-minute usage rates and run their voice traffic over the Internet for a flat monthly Internet-
access fee.
Although progressing rapidly, Internet telephony still has some problems with reliability and quality. Reliability means ensuring that the voice quality of an Internet telephony solution meets the users' requirements and that the unified network can prioritize voice traffic and cope with high-traffic conditions. Quality means that poor speech quality will make Internet telephony solutions unpopular to use. Reliability and quality strongly depend on several factors at both the network and the application level, such as bandwidth limitations, coding strategies, channel characteristics, and application design. From this perspective, this Thesis primarily focuses on two aspects:
proposals to improve the end-user speech quality using ad hoc playout strategies (i.e., the EM and eEM strategies), considering the coding strategies, bandwidth limitations, and backbone characteristics as fixed; and measuring the speech quality using specific metrics (see Section III and Appendix II) able to translate the end-to-end Internet telephony impairments into a user-friendly score (i.e., the Mean Opinion Score (MOS)).
Obviously, the two aspects described above are not independent, but few studies proposed in the past focused directly on the maximization of the perceived quality. The EM and eEM algorithms fill this gap. To this aim, a deep study was required of: the mechanisms of coding and transmission of speech contents; the channel characteristics and their statistical characterization; and the state-of-the-art playout strategies.
As appendixes to the Thesis, two other research fields are also presented: an audio watermarking strategy based on Psychoacoustic Model 2 (i.e., the same model applied in the MP3 compression strategy) and an error concealment technique to be applied to the ITU-T G.723.1 low bit-rate codec.
Sketch of the Thesis: this Thesis is composed of seven sections and four appendixes. The sections mainly treat the playout control problem in real-time streaming applications (e.g., Internet Telephony). Two new strategies are presented: a new playout method based on the maximization of the R Factor, and its upgrade, named the eEM strategy. Several points related to these methods are discussed in the first three appendixes of the Thesis. The last appendix covers two other research fields, also related to audio processing: an audio watermarking technique based on Psychoacoustic Model 2 and a new error concealment approach for the ITU-T G.723.1 speech codec.
I. Playout buffering
IP Telephony applications have been developed over a set of protocols (RTP, UDP, and IP) that cannot natively guarantee the quality of service required by the application. In fact, different factors
deeply affect the end-user perceived quality. One of the most impairing factors is the variation of the
packet transmission delay during the streaming, named jitter, which is caused by the temporal
variability of the network conditions.
In real-time applications such as IP Telephony, every transmitted packet has an associated playout
time. If a packet arrives later than this time, it is useless and is discarded by the decoder. Otherwise, it is held in the de-jitter buffer until its playout time, so as to compensate for the transmission jitter.
Fig. 1 illustrates this operation.
Figure 1. Playout buffering to compensate the transmission jitter.
In the axis at the top of the figure, the departure instants t_D,i are drawn for every packet i (i = 0, 1, 2, …), with t_D,0 set to zero. These instants are uniformly spaced by an interval T: t_D,i = i·T. This interval depends on the number and size of the speech frames conveyed in every transport packet (e.g., for the ITU-T G.729 speech codec, T = 20 ms, corresponding to two speech frames of 10 ms). The axis in the middle of the figure is used to show the arrival instants t_A,i of the packets at the receiver. The arrivals may be out of order, since the network delay d_net,i of every packet is a random variable. Additionally, some packets can be lost during transmission due to network problems, such as node buffer congestion; this is the case of packet 3 in the figure. Note that with our notation d_net,i = t_A,i − t_D,i and t_A,i = d_net,i + i·T. The delay variability is removed by the de-jitter buffer, which introduces an additional delay d_dej,i. The intent is to obtain a sequence of uniformly spaced playout instants t_P,i, as illustrated in the axis at the bottom of the figure. Accordingly, the delay between the departure and the playout of the packets, d_e2e, is equal for every packet, so that d_e2e = d_net,i + d_dej,i. A packet is discarded at the receiver if d_net,i > d_e2e. The playout algorithm is devoted to the setting of the end-to-end delay, which can be changed occasionally during the streaming session as described in the following.
The removal of the jitter is accomplished at the receiver side by means of a playout buffer that masks
the jitter at the expense of an additional delay. Within this framework, an important task is the setting
of the total end-to-end delay, which should consider the network delay, the packet loss, and the
perceived subjective quality.
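As an illustration, the buffering and discard rule described above can be sketched as follows (a minimal sketch with hypothetical names, not the thesis implementation; times are in ms):

```python
def playout_schedule(arrival_times, T, d_e2e):
    """For each packet i (sent at t_D,i = i*T), decide whether it is
    buffered until its playout instant t_P,i = i*T + d_e2e or discarded
    because its network delay exceeded the end-to-end delay budget."""
    decisions = []
    for i, t_a in enumerate(arrival_times):
        d_net = t_a - i * T        # network delay d_net,i = t_A,i - t_D,i
        if d_net > d_e2e:          # arrived after its playout instant: useless
            decisions.append((i, "discarded", None))
        else:
            decisions.append((i, "played", i * T + d_e2e))
    return decisions

# Packets sent every T = 20 ms with a 60 ms end-to-end delay budget;
# packet 2 suffers a 90 ms network delay and is therefore discarded.
print(playout_schedule([5, 28, 130, 66, 87], T=20, d_e2e=60))
```

In practice d_e2e is not fixed but is updated by the playout algorithm, as described in the following.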
Originally, the setting of the playout buffer was purely based on the introduced additional delay and
loss performance. In recent years, a different approach has been proposed, which consists of taking
into account the effects of delay and losses on the subjective quality. Such an approach requires the
use of an appropriate tool to evaluate the combined effects of transmission impairments that affect the
conversational quality. On the basis of this tool, the playout buffering algorithm estimates the optimal
buffer configuration by weighting the contribution of delay and loss to the conversational quality. The
use of such a perceptually motivated optimality criterion allows the receiver to automatically balance
packet delay versus packet loss. Almost all the works founded on this approach, which we refer to as quality-based, make use of the ITU-T E-Model for quality evaluation [1], [2]. It is a
computational framework for the estimation of the conversational quality by means of a synthetic
index (the R Factor), which encloses the contributions of many features, presented as impairment
factors. However, an important problem limits the applicability of this model: it is valid only in case of
random packet losses, which are observed very rarely in IP Telephony. In fact, several studies have
shown the burstiness of packet losses in the Internet [3], [4]. Dealing with bursty losses as if these
were random would be a significant error. Indeed, at a given total loss ratio, the subjective impact of isolated losses with respect to grouped losses is quite different.
Based on these considerations, in this work we study the application of the quality-based approach to
more realistic models for the packet loss process. We then propose a new playout buffering algorithm
based on an extension of the ITU-T E-Model proposed within the ETSI Tiphon project to incorporate the effects
of loss burstiness on the perceived quality [5]. The resulting algorithm works during the silence
periods. It estimates the parameters of a 4-state Markov model representing the loss behavior during
the subsequent talkspurt, evaluates the expected conversation quality varying the end-to-end delay
within a certain range, and finds the optimal setting of the playout buffer. To evaluate the expected
quality, the algorithm considers packet loss correlations and takes into account important effects
recently studied, such as the recency effect [6], [7], the smoothing of the user perception with respect to sudden variations of the packet loss [7], and the temporal position of the losses in the speech stream
[8], [9].
II. Previous Works
The problem of jitter compensation has been addressed in several ways in the past. The proposed
techniques can be classified into two groups: fixed and adaptive. According to the techniques
belonging to the first group, the end-to-end delay is kept constant for the entire session. Differently, the techniques in the second group adapt this delay to the variable network conditions during the streaming: this makes it possible to avoid packet lateness without resorting to permanently high delays.
In adaptive playout buffering, the most important features are related to when the playout buffer is
adjusted and which criterion is adopted. Intra-talkspurt techniques modify the end-to-end delay during
the entire streaming independently from the silence periods, using some strategies of compression and
extension of the waveform. On the contrary, between-talkspurt methods act during the intervals of silence. The latter approach is more frequently used.
As to the used criterion, different approaches have been experimented. An autoregressive class of
between-talkspurt methods is described in [10]. These methods are mainly based on two steps:
estimation of the network delay conditions and tuning of the playout instants so as to incur only a small fraction of late packets. Denoting with d̂ and v̂ the estimates of the mean and variation of the network delay in the next talkspurt, the end-to-end delay is set as follows: d_e2e = d̂ + β·v̂, where β is usually set to 4.0. The four algorithms presented in [10] differ in the way d̂ is computed. The first algorithm estimates the average delay by means of a linear recursive filter characterized by a weighting factor α. The second algorithm presents a slight modification based on the use of two values of α: one for the increasing trend and the other for the decreasing trend of the network delay. In this way, it should be possible to follow short bursts of packets incurring long network delays. The third algorithm sets d̂ to the minimum network delay experienced in the previous talkspurt. The last one introduces the feature of detecting short-lived bursts of delay variations (spikes) and working differently depending on whether a spike is detected or not.
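The first algorithm in [10] can be sketched as follows (a sketch, not the original code; the value of α is an assumption taken from common practice, while β = 4.0 is the value stated above):

```python
def ar_playout_delay(net_delays, alpha=0.998002, beta=4.0):
    """Linear recursive estimate of the mean network delay d_hat and its
    variation v_hat; the end-to-end delay is d_e2e = d_hat + beta * v_hat."""
    d_hat = net_delays[0]
    v_hat = 0.0
    for d in net_delays[1:]:
        d_hat = alpha * d_hat + (1 - alpha) * d               # mean delay estimate
        v_hat = alpha * v_hat + (1 - alpha) * abs(d_hat - d)  # variation estimate
    return d_hat + beta * v_hat

# With a constant 100 ms network delay the estimate converges to 100 ms.
print(round(ar_playout_delay([100.0] * 50), 1))  # -> 100.0
```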
A variant of the autoregressive approach in [10] is the α -Adaptive technique [11], which generalizes
the filtering method in [10] by defining different values of α . In fact, it was found that the choice of
α greatly affects the rapidity with which the mean network delay estimate may vary with respect to sudden variations in the actual delay. Thus, the α-Adaptive technique proposes an adaptive adjustment of the
value of α after few packet arrivals. In [12], a different algorithm is proposed based on a normalized
least mean square (NLMS) active predictor. The strategy estimates the network delay for each packet
from the previous N ones, using an NLMS predictor. The computation of the end-to-end delay variance and the choice of d_e2e are performed as in the autoregressive approach.
The algorithms in [10]-[12] estimate an average network delay and use it to set d_e2e so that the fraction of late packets is kept very small. However, IP Telephony applications can tolerate or
conceal a small amount of late packets. Thus, strategies performing a controlled tradeoff between the
packet lateness and the delay may offer better results. The gap-based algorithm proposes a strategy
where the packet loss ratio is settable to reduce the playout buffer delay [13]. For all the packets in a
talkspurt, the algorithm computes the “gap”, defined as the difference between the playout and the
arrival instants. This is a measure of the performance for a certain playout buffer delay setting. The
optimal buffer setting corresponds to the minimum amount of delay to be added to each packet that
would have allowed for obtaining the packet loss ratio tolerated by the application. The operations
performed during the adjustment may vary depending on the working conditions. In fact, the gap-based algorithm includes a spike detection strategy: it works either in an "Impulsive Mode" or in a "Normal Mode", depending on whether a spike is detected or not. The Concord algorithm in [14] fixes two
thresholds for the maximum late packet percentage and maximum acceptable delay. This strategy
performs the following operations: for each packet, the network delay is stored and used to build a
histogram; from this, an approximated and sampled version of the packet delay distribution (PDD) is
computed. The PDD is weighted by applying an aging function to the collected information. By using
such a PDD, the algorithm sets d_e2e so that the two constraints on the maximum late-packet percentage and the maximum acceptable delay are satisfied.
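The Concord idea can be sketched as follows (a simplified sketch with hypothetical names; the histogram aging function is omitted, and the raw empirical delay distribution stands in for the sampled PDD):

```python
def concord_delay(net_delays, max_late_fraction, max_delay):
    """Smallest end-to-end delay (in ms) whose late-packet fraction stays
    within the threshold, capped by the maximum acceptable delay."""
    for candidate in sorted(set(net_delays)):
        # Fraction of observed delays that would make a packet late.
        late = sum(1 for d in net_delays if d > candidate) / len(net_delays)
        if late <= max_late_fraction:
            return min(candidate, max_delay)
    return max_delay

# Tolerating 20% late packets, the outlier delay of 90 ms is ignored.
print(concord_delay([10, 12, 15, 14, 90], max_late_fraction=0.2, max_delay=400))  # -> 15
```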
A step forward with respect to the algorithms presented so far consists in setting the end-to-end delay so that the perceived quality is maximized. Some works exploiting this principle have been proposed in recent years. In [15], the E-MOS strategy makes use of a statistical distribution approach to better
estimate the network delay. To this purpose, a cumulative density function (CDF) is built for the tail of
the network packet delay distribution using the Pareto function; this approach relies on the delay
analysis in [16] and [17]. Then, a mathematical relationship between the MOS (Mean Opinion Score)
and the delay and loss is extracted from the results presented in [18]. This relationship is used to find
the end-to-end delay that maximizes the expected subjective quality. The advantage of this approach is
the direct use of the MOS index to evaluate the conversation quality. However, the application of this
index introduces some issues concerning the linear combination of different impairment factors in the
MOS scale. Indeed, the authors have added the impairments related to the delay and packet losses to
obtain the final quality index. This has been proved for the ITU-T G.711 codec [19] but has not been investigated for low bit-rate codecs. Therefore, the validity of this operation for other codecs needs additional analysis.
In [20], the joint design of FEC (Forward Error Correction) and playout buffering for Internet Telephony is presented. The authors show that a real benefit in using the joint control of both playout
and FEC is obtained only when delay is critical. The ITU-T E-Model is used to derive an expression
of the application quality as a function of the encoding rate, packet loss rate, and end-to-end delay.
The source-channel coding parameters and the end-to-end delay are selected by maximizing the
quality function. The authors focus on random losses, stating that these are equivalent to the case of
bursty losses if the loss percentage is lower than 5%. Indeed, this is true for the ITU-T G.711 codec
[19] as shown in [21], but it is unlikely to happen for low bit-rate codecs. In [22], a non-linear
regression model to predict voice quality, based on the ITU-T PESQ [23], PESQ-LQ [24], and the ITU-T E-Model, is presented. Several models for different speech codecs are derived, which can be used for general QoS (Quality of Service) control purposes, including voice quality monitoring and playout buffering. Basically, this work is the natural progression of [25]. Also in this study, the packet losses are assumed to be random. A similar approach is presented in [26] for the IP Telephony applications that make use of the ITU-T G.729.
Even if not directly related to the playout buffering problem, there are other works that are worth
mentioning at this point since they make use of the ITU-T E-Model. In [27], the perceived quality of
different codecs under the same bandwidth requirements is evaluated using the MOS. The important
results of this work are: FEC cannot reduce jitter unless out-of-order packets are common in the
Internet; a robust codec with a PLC (Packet Loss Concealment) strategy is more useful for battling
jitter. In [28], a joint source-channel coding adaptation algorithm for the AMR (i.e., Adaptive Multi
Rate) speech codec is described. The paper presents an analysis of the best tradeoff between source
and channel bit rates, under constraints on packet loss, end-to-end delay, and transmission rates. The
performance is evaluated making use of the ITU-T E-Model. It is recognized that while a FEC strategy
mitigates the effects of packet loss, it also increases the end-to-end delay. These two effects act in opposition with respect to the speech quality. Assuming the losses to be random, the proposed algorithm tries to find the optimal compromise between packet loss recovery and end-to-end delay. Finally, [29]
presents a decision system to select the coding scheme and routing path to maximize the amount of
calls to be placed in a VoIP system, still guaranteeing a minimum level of speech quality.
III. The ITU-T E-Model
To evaluate the voice transmission quality, intrusive and non-intrusive methods have been proposed in
the past. The former are based on a comparison between the original and the distorted signals, while the latter compute a quality index from the analysis of the system configuration and the measurements
of transmission parameters, such as codec configuration, information loss, and transmission delay.
In this work, we focus on the ITU-T E-Model [1], which belongs to the category of non-intrusive
methods. The choice of this model arises from the need to evaluate the influence of the most important system settings on the perceived quality. In fact, the E-Model is a tool that estimates the voice quality when comparing different network equipment and designs. Its main feature is the
capacity of revealing the underlying causes of speech quality problems by means of an overall quality
index, the R Factor, which is the combination of a well-defined set of metrics linked to: low bit-rate
speech coding; delay and loss distribution; frame erasure distribution; loss concealment technique;
architectural choices such as de-jitter buffer, packet and codec frame size. The R Factor is defined as
follows:
R = 100 − I_s − I_d − I_e,eff + A . (1)
The maximum value of this index is 100. The signal-to-noise impairment factor I_s comprises the distortions introduced by the circuit-switched part of the end-to-end communication network. A set of default values for this parameter is provided in [1]. The term I_d measures the impairments associated with the mouth-to-ear delays encountered along the transmission path. I_e,eff represents the
impairments associated with the signal distortion, caused by low bit-rate codecs and packet losses. The
Expectation factor A increases the level of conversational quality when the end-user may accept some
decrease in quality for access advantage (e.g., mobility). A comprehensive description of the
Expectation factor is provided in [2], but no agreement has been reached for the value in case of IP
Telephony. For this reason, it is usually set to zero.
Despite its apparent simplicity, (1) represents a non-operative form of the E-Model since the four
factors depend on several configuration parameters. A first simplification can be obtained when using
a set of default values and operative working conditions [1]:
R = 93.2 − I_d − I_e,eff . (2)
Introducing the assumptions in [30] (i.e., no circuit-switched network interworking for the access to the IP Telephony service) and using the default values, I_d becomes a function of the average mouth-to-ear delay. Such a delay, represented with d, is defined as the sum of the end-to-end delay d_e2e and the encoding/decoding delay, which comprises both the packetization and the algorithmic components (usually neglected). Thus, d = d_codec + d_e2e, where the packetization delay d_codec is equal to 25 ms in the case of ITU-T G.729-A+VAD. Note that d_e2e, as defined in Section I, is the interval of time between the departure of a packet from the transmitter and the time its content is played out at the receiver. It is controlled by the playout buffering algorithm and affects the conversational quality. The resulting I_d for a certain range of d values has been experimentally obtained [31]. In [30], these values have been interpolated to obtain an analytical expression:

I_d = 0.024·d + 0.11·(d − 177.3)·H(d − 177.3) , (3)

where H(x) is the step function: H(x) = 0 if x < 0 and H(x) = 1 if x > 0.
I_e,eff in (2) is a function of the end-to-end packet loss ratio e_e2e and the speech codec used:

I_e,eff = I(codec, e_e2e) . (4)

e_e2e comprises the packets that have been lost during transmission, after the application of FEC loss recovery if used [20], and those that arrive correctly at the receiver but are too late to be played out. For a fixed e_e2e value, different impairment values are observed depending on the number of frames inserted in a transport packet, the distribution of the packet losses, the sensitivity of the codec used to data frame losses, and the concealment algorithm used. In [21], the I_e,eff values for some configurations are provided. For the G.729-A+VAD speech codec, which requires the transmission of two frames of 10 ms in each transport packet, the following expression has been obtained [30]:

I_e,eff = 11 + 40·log(1 + 10·e_e2e) . (5)

Equation (5) is valid in the case of random packet losses and when the standard concealment algorithm is used. The current expression of I_e,eff does not allow for considering the effects of bursty packet losses. Several
changes in this direction have been introduced in the 2003 version of ITU-T E-Model, but a complete
integration of the burstiness model has not been included yet.
Under the most common conditions, the relationship between the R Factor and the MOS has been
verified for some quality levels: users very satisfied, users satisfied, some users dissatisfied, many users dissatisfied, and nearly all users dissatisfied. For some other conditions, the E-Model is less accurate;
in particular, annex A of [1] provides the situations where the model validity has not been completely
verified, especially regarding the overall additive property of the model, which is applicable only to a
certain extent. An estimated MOS can be obtained from the R Factor using the following formulae:
MOS = 1 , if R < 0
MOS = 1 + 0.035·R + 7·10^-6·R·(R − 60)·(100 − R) , if 0 ≤ R ≤ 100
MOS = 4.5 , if R > 100 . (6)
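Equations (2), (3), (5), and (6) can be combined into a small quality calculator (a sketch under the G.729-A+VAD assumptions above, with d in ms and e_e2e as a ratio in [0, 1]; the base of the logarithm in (5) is taken here as natural, which is an assumption):

```python
import math

def r_factor(d, e_e2e):
    """R Factor per (2): R = 93.2 - I_d - I_e,eff, with I_d from (3)
    and I_e,eff from (5) for the G.729-A+VAD codec."""
    H = 1.0 if d > 177.3 else 0.0              # step function H(d - 177.3)
    i_d = 0.024 * d + 0.11 * (d - 177.3) * H   # delay impairment (3)
    i_e = 11 + 40 * math.log(1 + 10 * e_e2e)   # loss impairment (5), natural log assumed
    return 93.2 - i_d - i_e

def mos(r):
    """MOS estimate per (6)."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

# Example: 150 ms mouth-to-ear delay and 2% end-to-end packet loss.
r = r_factor(150, 0.02)
print(round(r, 1), round(mos(r), 2))
```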
IV. The eEM Playout Strategy
As discussed above, most of the existing techniques for playout buffering are based on two main
operations: prediction of delay statistics for future packets; setting of the playout buffer in accordance with a constraint on either the maximum end-to-end delay or the maximum fraction of late packets. This methodology does not guarantee the maximization of the conversational quality.
Indeed, fixing a maximum value for the total end-to-end delay does not allow for controlling packet
losses and their effects on the end-user perceived quality; on the other hand, limiting the maximum information loss does not make it possible to curb the end-to-end delay, which may heavily affect the
application interactivity. Differently from this approach, another class of playout buffering techniques
has appeared in recent years. These are based on controlling the playout buffer so as to maximize the
expected conversational quality. The aim is to jointly consider the expected delay and the information
loss making use of a perceptually motivated optimality criterion that allows the receiver to
automatically balance packet delay versus loss. The playout buffering technique presented in this work belongs to this category, making use of an extended version of the ITU-T E-Model proposed by ETSI [5].
Figure 2. Sketch of the proposed playout buffering strategy.
Fig. 2 shows the main blocks involved in the proposed approach: statistics relevant to loss and delay
are predicted by means of the previously sent packets; based on this information, the playout buffer
setting is accomplished so as to maximize the expected conversational quality during the future
conversional unit. Two main features distinguish this approach from the past ones: the prediction of
the correlation features that characterize the packet loss process; the use of a quality model that
evaluates the effects of the loss burstiness on the perceived quality. As a matter of fact, the prediction of packet loss and delay statistics is a topic frequently addressed by past works, also in similar contexts. These can also be applied to the prediction of the loss burstiness statistics with only a few changes. Differently, the use of a quality model within this framework still presents some open issues.
When using the ITU-T E-Model, these are mainly concerned with the fact that:
- This model can be applied only during stationary periods;
- This model is valid only within certain conditions.
Based on the first point, we apply the quality model to conversation units during which the playout
buffer settings are left unchanged: in fact, the impact of end-to-end delay variation on speech
transmission quality has not been included in the E-Model algorithm yet. In particular, we propose to
modify the end-to-end delay during silence periods based on the estimated quality for the next
talkspurt during which the playout buffer is kept constant.
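The between-talkspurt adjustment can be sketched as follows (hypothetical names; quality_model stands for the expected-quality evaluation for the next talkspurt, abstracted here as any callable):

```python
def choose_e2e_delay(candidates, quality_model):
    """During a silence period, evaluate the expected conversational quality
    for every candidate end-to-end delay and keep the best setting for the
    next talkspurt, during which the buffer is left unchanged."""
    best_delay, best_quality = None, float("-inf")
    for d_e2e in candidates:
        q = quality_model(d_e2e)   # expected quality (e.g., an R Factor)
        if q > best_quality:
            best_delay, best_quality = d_e2e, q
    return best_delay

# Toy model (assumption): quality peaks at a 120 ms end-to-end delay.
print(choose_e2e_delay(range(60, 201, 20), lambda d: -abs(d - 120)))  # -> 120
```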
As to the second point, it is important to note that the effects of information losses on the R Factor
have not been defined in case of bursty packet losses for some common codecs (e.g., ITU-T G.729 and
G.723). This is a crucial limitation not only for this context but for most of the E-Model uses. In fact,
several studies [3], [4] and [32] have shown the burstiness of packet losses in the Internet and more in
general in packet-based networks. Dealing with bursty losses as if these were random would be a significant error since the impact on human perception of isolated losses with respect to grouped losses is quite different. This phenomenon was demonstrated in a study presented in [33] for the ITU-T
G.729-A+VAD speech codec. It has been shown that the concealment works well with a single speech
Figure 2. Sketch of the proposed playout buffering strategy: the loss and delay information of past
packets feeds the prediction of the network behavior (packet loss process modeled with a 4-state
Markov model), which drives the computation of the buffer dimension, i.e., the end-to-end delay, by
maximizing the quality through the E-Model.
frame erasure, but not well enough with multiple frame erasures. This requires the introduction of
appropriate extensions of the E-Model to take into account the quality degradation caused by the
packet loss burstiness. Due to the importance of this point, in the following two sub-sections we
present a packet loss analysis performed on real traces over the Internet and the extension of the E-
Model proposed by ETSI in [5]. Then, based on the extended R Factor computation, in the third
sub-section we illustrate how the proposed playout algorithm works. In the fourth sub-section we
discuss the computational complexity.
I.V.i. Analysis of the packet loss burstiness
We have analyzed the burstiness of packet losses in several traffic traces, relevant to voice connections
over the Italian GARR research network and the European Tiscali ISP network with dial-up and
ADSL access lines. Each trace lasted about 10 minutes and was recorded at a different time of the day.
Two H.323 hosts were used during the experiments, employing the G.729-A+VAD codec (two 10-ms
voice frames per UDP packet and the native concealment strategy [34]). The H.323 numbering
mechanism was used to detect missing frames. The sender and receiver clocks were synchronized
through the NTP (Network Time Protocol), and all packets within the conversation were captured at
both sides.
The basic characteristics of the used traces are summarized in Table I. The first four traces were
obtained during conversations over LAN-to-ADSL connections. These traces are quite similar in terms
of network delay but not in terms of packet loss percentage. Trace 5 is the result of an intra-LAN
connection under high background traffic: it is characterized by a low average network delay with
high variability. Trace 6 is relevant to an ADSL-to-ADSL connection: it is characterized by a high
delay with low variability and null packet loss. Traces 7 and 8 were recorded during conversations
between hosts located in two different LANs and present a good behavior in terms of both loss and
delay.
Table I. Voice traces used during experiments.

Trace #   Length (sec)   Avg. network loss (%)   Avg. network delay (ms)   Network delay std (ms)
Trace 1   761.73         1.59                    85.1                      23.4
Trace 2   656.50         2.57                    82.2                      24.3
Trace 3   568.34         3.58                    79.7                      22.2
Trace 4   673.80         4.22                    84.5                      22.4
Trace 5   527.10         4.08                    77.5                      26.4
Trace 6   576.82         0.00                    288.6                     10.6
Trace 7   625.94         1.39                    43.4                      19.6
Trace 8   631.24         0.76                    39.6                      19.2
The burstiness analysis had to be conducted after the playout buffering, which means that the packet
loss had to include late packets in addition to those lost in the network, and with a "neutral" strategy
for buffer setting. Therefore, we analyzed all the traces using a fixed buffering strategy with a high
end-to-end delay of 1 s, which is almost equivalent to observing only the network losses, and with a
delay of 100 ms.
A first investigation has been performed using the statistical chi-squared test [35], by considering the
random variable K associated with the length of packet loss bursts (K-1 lost packets followed by 1
correctly received packet). In case of random packet loss, the variable K should be distributed
according to a geometric PMF (probability mass function), and the chi-squared test has been used to
verify this hypothesis (hypothesis H0). For the proposed experiments, we computed the chi-squared
value and the associated probability Pχ2 of obtaining a chi-squared value equal to or greater than the
observed one by chance only. We were then able to reject or not reject H0 depending on whether Pχ2
was smaller or not than a given significance level, usually set to 1% or 5%.
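A minimal sketch of this test (the helper names are ours; a synthetic Bernoulli loss sequence stands in for a real trace, and the histogram tail is pooled into the last bin so every expected count stays well populated):

```python
import random
from collections import Counter

def burst_lengths(loss_seq):
    """Samples of K: each correctly received packet closes a run of
    K-1 consecutive losses, yielding one observation of K."""
    ks, run = [], 0
    for lost in loss_seq:
        if lost:
            run += 1
        else:
            ks.append(run + 1)
            run = 0
    return ks

def chi2_vs_geometric(ks, bins=6):
    """Chi-squared statistic of the observed K histogram against the
    geometric PMF P(K=k) = (1-p)^(k-1) * p, with p estimated as 1/mean(K)."""
    n = len(ks)
    p = n / sum(ks)                                # MLE of the geometric parameter
    obs = Counter(min(k, bins) for k in ks)        # pool the tail into the last bin
    chi2 = 0.0
    for k in range(1, bins + 1):
        if k < bins:
            expected = n * (1 - p) ** (k - 1) * p
        else:                                      # last bin holds P(K >= bins)
            expected = n * (1 - p) ** (bins - 1)
        chi2 += (obs.get(k, 0) - expected) ** 2 / expected
    return chi2

# For memoryless (Bernoulli) losses the statistic should stay below the
# 0.99 quantile of the chi-squared distribution with the relevant degrees
# of freedom; a bursty trace would push it far above.
random.seed(1)
bernoulli = [1 if random.random() < 0.2 else 0 for _ in range(5000)]
print(round(chi2_vs_geometric(burst_lengths(bernoulli)), 2))
```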
Additionally, we applied the 2-state Gilbert model, which is frequently used to represent the temporal
correlation existing in bit-error and packet loss sequences [36], to our traces. This model is based on
the unconditional packet loss probability Pu and the conditional packet loss probability Pc, i.e., the
probability that a packet is lost given that the previous one was lost. The distance between Pu and Pc
gives an indication of the deviation of the packet loss process from a memoryless Bernoulli process,
which is characterized by having these two probabilities equal. The outcomes of these tests are
reported in Table II.
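The two Gilbert probabilities can be estimated directly from a 0/1 loss sequence, as in this sketch (the function name is ours):

```python
def gilbert_probs(loss_seq):
    """Pu = overall loss ratio; Pc = P(loss | previous packet lost),
    estimated from consecutive pairs of the 0/1 loss sequence."""
    pu = sum(loss_seq) / len(loss_seq)
    pairs = list(zip(loss_seq, loss_seq[1:]))
    after_loss = [b for a, b in pairs if a == 1]
    pc = sum(after_loss) / len(after_loss) if after_loss else 0.0
    return pu, pc

# Bursty toy sequence: Pc (0.6) clearly exceeds Pu (0.3125).
print(gilbert_probs([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]))  # → (0.3125, 0.6)
```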
Table II. Burstiness analysis results.

           d_e2e = 1 s                       d_e2e = 100 ms
Trace #    Pχ2 (%)   Pu (%)   Pc (%)        Pχ2 (%)   Pu (%)   Pc (%)
Trace 1    > 1       1.59     1.8           > 1       1.81     2.01
Trace 2    << 0.1    2.57     11.1          << 0.1    2.91     21.3
Trace 3    << 0.1    3.58     16.2          << 0.1    3.82     24.2
Trace 4    << 0.1    4.22     19.6          << 0.1    4.43     28.1
Trace 5    << 0.1    4.08     21.9          << 0.1    4.36     35.7
Trace 6    << 0.1    0.00     77.1          << 0.1    0.27     86.9
Trace 7    << 0.1    1.39     25.3          << 0.1    1.62     37.1
Trace 8    << 0.1    0.76     17.2          << 0.1    1.01     37.2
By considering H0 as the hypothesis of having random losses, we found by means of the chi-squared
test that H0 has to be rejected for all the experiments except Trace 1. In fact, for almost all the
performed trials a chi-squared value greater than the quantile of order 0.99 (with the appropriate
degrees of freedom) has been obtained. Accordingly, the deviation of the observed distribution of K
from the geometric PMF is significant, so we have to reject the hypothesis of random losses in these
traces. The chi-squared test results are in accordance with those based on the Gilbert model, as shown
in Table II. Additionally, note that the loss pattern burstiness increases as the end-to-end delay
decreases; that is, the loss correlation is higher with an end-to-end delay of 100 ms than with 1 s. This
phenomenon is shown by the resulting distances between Pu and Pc, which are significantly high for
all the experiments but higher after the playout buffering operation. This feature is a direct
consequence of the burstiness of the network delay. When the playout algorithm sets a high end-to-end
delay, the late packets are those located at the peaks of the high-delay bursts of packets. Decreasing
the end-to-end delay, more consecutive packets are discarded from each high-delay burst, increasing
the temporal correlation of losses.
I.V.j. Extension of the E-Model
At present, a standard integration of the loss burstiness effects in the E-Model is under study by ITU-T
Study Group 12. An alternative solution is proposed by the ETSI Society, through a step-by-step
extension of the equipment impairment factor effeI , [37], [5]. The first step is the modeling of the loss
23
burstiness, where a burst period is defined as an interval of time during which a high packet loss
percentage is observed. The bursts are separated by gaps, which are characterized by sporadic loss
events. Specifically, two lost packets identify a burst if between these less than gmin packets have been
correctly received. If gmin or more packets are correctly received, such sequence is regarded as being
part of a gap. According to [37], gmin is set to 16. The system is then modeled with a 4-state Markov
chain, which is drawn in Fig. 3.
Figure 3. 4-State Markov model for the packet loss process (transition probabilities P11, P13, P14,
P22, P23, P31, P32, P33, P41 among the four states).
The four states are determined by which period the system belongs to (burst or gap) and by whether
the last transmitted packet has been lost or correctly received: gap-no loss (state 1), burst-no loss
(state 2), burst-loss (state 3), and gap-loss (state 4). [5] also suggests a simple procedure to estimate
the transition probabilities from an observed sample path of the packet loss process. During the
streaming, for every transmitted packet a group of estimation counters is updated depending on the
current system state and on whether or not the packet has been correctly received in time to be played
out. The transition probabilities are then obtained from these counters at the end of the observation
period. From the estimated probabilities, a set of additional parameters characterizing the bursty loss
process is derived. These are: g and b, the durations in seconds of the gap and burst periods,
respectively; eg and eb, the average packet loss ratios during the gap and burst periods, respectively;
and y, the time interval in seconds since the last burst of packet loss. Details of the estimation
procedure can be found in [5].
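The gmin partition rule itself can be sketched as follows (an illustrative helper of ours, not the counter-based procedure of [5]; under the strict definition an isolated loss remains a sporadic gap loss):

```python
def split_bursts(loss_seq, gmin=16):
    """Return the (first, last) packet indices of each burst: two losses
    belong to the same burst if fewer than gmin packets were correctly
    received between them; isolated losses stay in the gap."""
    loss_idx = [i for i, lost in enumerate(loss_seq) if lost]
    if not loss_idx:
        return []
    groups, start, prev = [], loss_idx[0], loss_idx[0]
    for i in loss_idx[1:]:
        if i - prev - 1 < gmin:        # fewer than gmin receptions in between
            prev = i
        else:
            groups.append((start, prev))
            start = prev = i
    groups.append((start, prev))
    # a burst needs at least two losses; a single loss is a gap event
    return [(s, e) for s, e in groups if s != e]

seq = [0] * 40
seq[2] = seq[5] = seq[25] = 1
print(split_bursts(seq))   # → [(2, 5)]
```

From these intervals, b and eb follow by measuring the burst durations and the loss ratios inside them, while g and eg come from the complementary gap intervals.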
Once the entire sequence of packets is divided into gaps and bursts, the equipment impairment factor
is evaluated separately for the gap (Ieg) and the burst (Ieb) periods as if the losses were random. For
this reason, Ieg and Ieb are computed for the ITU-T G.729A-VAD codec by means of (5):

Ieg = 11 + 40 · log(1 + 10 · eg),   (7)

Ieb = 11 + 40 · log(1 + 10 · eb).   (8)
These two contributions are integrated by modeling the smoothing effect: a sudden variation in the
packet loss ratio does not necessarily result in a sudden change in the perceived quality. In fact, the
user does not perceive the start and the end of a burst instantaneously, but notices the quality change
by degrees. Several tests have shown this behavior [7]. In these tests, the packet loss ratio of a VoIP
connection was varied from 0% to 30% for periods of 15 to 30 s within a 3-minute test call, and the
listener provided feedback on the instantaneous quality during the call. The results revealed that the
time constant for the gap-to-burst quality transition was 4-5 s (named t1), while that for the
burst-to-gap transition was about 10-15 s (named t2). Let I1(t) denote the equipment impairment
factor during a burst period, with t = 0 corresponding to the instant at which the loss process changes
from a gap to a burst; and let I2(t) represent the equipment impairment factor during a gap period,
with t = 0 corresponding to the instant at which the loss process changes from a burst to a gap.
These two functions have an exponential behavior as follows:
I1(t) = Ieb − (Ieb − Ieg) · exp(−t/t1),   (9)

I2(t) = Ieg + (I1 − Ieg) · exp(−t/t2),   (10)

where I1 = I1(t = b) and I2 = I2(t = g). The overall equipment impairment factor is computed by a
temporal average:

Iav = [ b·Ieb + g·Ieg − t1·(Ieb − Ieg)·(1 − exp(−b/t1)) + t2·(I1 − Ieg)·(1 − exp(−g/t2)) ] / (b + g).   (11)
In the final expression of the impairment factor, ETSI also takes into account the recency effect,
which is due to the influence of the position of a noisy/lossy burst within a call on the subjective
evaluation of the overall quality. This effect was studied for the first time by AT&T [6]. In those
experiments, different types of impairments were introduced at different positions of the test traces.
The resulting MOS value decreased gradually from the case of impairment at the beginning of the
trace to the case of impairment at the end. For instance, in case of bursty noise impairment, the MOS
values observed with these two configurations were 3.82 and 3.18, respectively.
The recency effect can be modelled by hypothesizing an exponential behavior of the perceived quality,
which starts from the quality level of the last significant loss burst and tends asymptotically to the
average quality value of the call. In terms of equipment impairment factor, this effect results in:

Ie,eff = Iav + k · (I1* − Iav) · exp(−y/t3),   (12)

where I1* is the exit value of the equipment impairment factor from the last burst, k is a constant
equal to 0.7 as suggested in [5], and t3 is the exponential time constant, equal to 30 s, which
represents the user's memory of the last occurred event.

(12) represents the final expression for Ie,eff, to be used in (2). Considering a set of default operative
conditions and making Id explicit in (2), we then obtain the following expression for the R Factor:

R = 93.2 − 0.024·d − 0.11·(d − 177.3)·H(d − 177.3) − Ie,eff,   (13)

where H(·) is the Heaviside step function. (13) allows for the evaluation of the overall quality in case
of bursty losses for the ITU-T G.729A-VAD codec.
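The whole chain (7)-(13) can be collected in a short sketch (an illustration under stated assumptions: the natural logarithm is assumed for the "log" in (7)-(8), as in the usual G.729 impairment fit, and all names are ours):

```python
import math

T1, T2, T3 = 5.0, 15.0, 30.0   # gap->burst, burst->gap and recency time constants (s)
K_RECENCY = 0.7                # constant k suggested in [5]

def ie_random(e):
    """Equipment impairment for G.729A+VAD under a random loss ratio e (eqs. 7-8)."""
    return 11 + 40 * math.log(1 + 10 * e)

def r_factor(d, g, b, eg, eb, y):
    """Extended R Factor of (13): d = end-to-end delay (ms), g/b = gap and
    burst durations (s), eg/eb = loss ratios in gaps and bursts, y = time
    since the last loss burst (s)."""
    ieg, ieb = ie_random(eg), ie_random(eb)
    i1 = ieb - (ieb - ieg) * math.exp(-b / T1)                 # burst exit value, eq. (9)
    i_av = (b * ieb + g * ieg
            - T1 * (ieb - ieg) * (1 - math.exp(-b / T1))
            + T2 * (i1 - ieg) * (1 - math.exp(-g / T2))) / (b + g)   # eq. (11)
    ie_eff = i_av + K_RECENCY * (i1 - i_av) * math.exp(-y / T3)      # eq. (12)
    step = 1.0 if d > 177.3 else 0.0                                 # Heaviside H(d - 177.3)
    return 93.2 - 0.024 * d - 0.11 * (d - 177.3) * step - ie_eff     # eq. (13)

# Loss-free call at 100 ms: only the delay term and the intrinsic codec
# impairment (Ie = 11) remain, giving R = 93.2 - 2.4 - 11 = 79.8.
print(round(r_factor(100, g=10.0, b=1.0, eg=0.0, eb=0.0, y=30.0), 1))
```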
I.V.k. Playout buffering by quality maximization (eEM)
(9)-(13) provide an analytical expression of the quality index in terms of the end-to-end delay and of
the five parameters that characterize the packet loss process. It is important to highlight that these
parameters are themselves functions of the end-to-end delay: at increasing values of d_e2e, the burst
length and the packet loss ratios are expected to increase, while the gap length is expected to decrease.
Additionally, it is reasonable to assume that these functions are time-varying due to the temporal
variability of the network delay process.
The overall expression of the R Factor in terms of the end-to-end delay is used to evaluate the speech
quality in the proposed playout buffering algorithm, which we call eEM (extended E-Model based
playout buffering) for presentation convenience. The eEM algorithm works in a between-talkspurts
fashion, that is, the buffer is adjusted during the silence periods by maximizing the expected R Factor
for the next talkspurt. Let Ri(d_e2e) denote the E-Model quality function for talkspurt i. Our objective
consists in finding the optimal setting of the playout buffer d*_e2e,i for talkspurt i, which is defined as:

Ri(d*_e2e,i) ≥ Ri(x)  ∀ x ∈ R+.   (14)
This value represents the end-to-end delay that will be used during talkspurt i.
To find d*_e2e,i, we need to predict the relationships between the loss process parameters and d_e2e
for the next talkspurt i. In particular, the functions gi(d_e2e), bi(d_e2e), eg,i(d_e2e), eb,i(d_e2e), and
yi(d_e2e) are required to obtain the function Ri(d_e2e). For presentation convenience, these functions
are referred to with the array Ii(d_e2e). To address this issue, we rely on a numerical approach that
consists in estimating the loss process parameters over a set of past talkspurts; the results of this
estimation are then used as the prediction for talkspurt i. Let Ii,N(d_e2e) denote the result of the
estimation performed during the N talkspurts that precede the i-th one. Then, the prediction of
Ii(d_e2e) is:

Îi(d_e2e) = Ii,N(d_e2e).   (15)

Since it is not possible to obtain an analytical expression of Ii,N(d_e2e), a discrete version is extracted
by selecting a search range (d_e2e^- ÷ d_e2e^+) and dividing it into search steps of size D.
M = (d_e2e^+ − d_e2e^-)/D is the number of resulting levels d_e2e,s (d_e2e,s = d_e2e^- + s·D, with
s = 1,…,M). During the streaming session, for every transmitted packet, the counters in [5] required
for the estimation are updated for every search level s, considering the packet lost if its network delay
is greater than d_e2e,s. During the silence period, Ii,N(d_e2e,s) are estimated for every s, and are then
used as the prediction of Ii(d_e2e,s) and to find the optimal end-to-end delay value.
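The between-talkspurts maximization then reduces to a discrete argmax over the M levels, as in this sketch (names are ours; `r_factor` stands for any implementation of (7)-(13), and `predicted[s]` for the parameters estimated over the last N talkspurts at level s):

```python
def optimal_delay(levels, predicted, r_factor):
    """Return (d*, R*) maximizing the predicted R Factor over the search
    levels d_e2e,s; predicted[s] is the tuple (g, b, eg, eb, y) for level s."""
    best_d, best_r = None, float("-inf")
    for d, params in zip(levels, predicted):
        r = r_factor(d, *params)
        if r > best_r:
            best_d, best_r = d, r
    return best_d, best_r

# Toy check with a dummy quality function peaking at 180 ms.
toy_params = [(10.0, 1.0, 0.01, 0.2, 5.0)] * 3
print(optimal_delay([100, 180, 300], toy_params, lambda d, *p: -abs(d - 180)))  # → (180, 0)
```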
To show the behavior of the functions Ii(d_e2e), in Fig. 4 we present how the sequence of bursts and
gaps changes when varying the playout buffer setting. In particular, Fig. 4.a depicts the network delays
observed for a sequence of consecutive packets drawn from one of the traces of Table I. In this figure,
two lines have been drawn representing two possible values of the end-to-end delay. Each line
separates late packets, which are considered lost, from in-time packets. A different loss sequence
configuration results from each d_e2e setting, as shown in Fig. 4.b (to simplify the presentation, we
used gmin equal to 8 in this example). This figure illustrates how the durations of the gap and burst
periods, and the relevant loss percentages, vary within a trace as the buffer configuration changes.
Note that a difference of 30 ms in the end-to-end delay in this trace made the average burst length
vary from 4.5 to 41 packets.
Figure 4. Effects of the playout buffer setting on the packet loss burstiness. (a) Packet network delay
with two possible settings of the playout buffer (dashed lines at d_e2e = 110 ms and d_e2e = 80 ms);
these lines separate in-time packets from late packets. (b) Sequences of gaps and bursts when
considering only network losses and when also considering late packets with the two buffer
configurations.
The graphs in Fig. 5 show the typical behavior observed for the loss process parameters when varying
the end-to-end delay. As expected, the trend of these curves is descending at increasing d_e2e, except
for the gap duration, which has the opposite behavior. In these graphs, the continuous lines represent
the effective behaviors, Ii(d_e2e), estimated at the end of the talkspurt; the dashed lines represent the
predicted behavior, Îi(d_e2e), obtained on the basis of the previous N talkspurts (N has been set to
17). These figures show that the predicted and actual curves have a similar behavior. Note that the
curves relevant to the predicted data are smoother than those of the measured data. This is due to the
fact that the former are obtained from N talkspurts, while the latter from only one talkspurt.
Other methods may introduce some improvements in the prediction procedure. A fuzzy-logic-based
system could be used to predict Ii(d_e2e,s), with s = 1,…,M, on the basis of Ii−j(d_e2e,s), with
j = 1,…,N. In the same way, a neural network approach could be adopted. However, this would
increase the computational complexity of the playout algorithm, which should be simple enough to be
implemented in low-cost terminals.
Figure 5. Predicted and measured functions of g, b, eg, eb for talkspurt 25 in Trace 4: the continuous
and dashed lines have been drawn from measured and predicted values, respectively.
I.V.l. Analysis of the computational complexity of the eEM strategy
The computational complexity of the algorithm has to be examined separately for the operations
executed within and between talkspurts:
a) Update of the estimation counters: According to [5], the computational complexity is O(M). To
be more precise, 9M counters track the system state, and at most 10M operations are executed for
every packet. This procedure is performed during the streaming, so that 10M operations are
executed during a packet time, which is 20 ms on average for the ITU-T G.729 codec (the packet
time varies due to the jitter).
b) Computation of d*_e2e,i: Also in this case, the computational complexity is O(M). In particular,
for every search level s, the algorithm estimates Ii,N(d_e2e,s). The resulting estimates are then
used to compute the R Factor through equations (7)-(13), executing a total of about 100 operations
for every search level. At the same time, the value d*_e2e,i that yields the maximum expected
quality is found. Note that this procedure has to be executed during the silence period.
From this analysis, it follows that the computational complexity is not affected by the number of
talkspurts used for the prediction (N), but only by the resolution (M) used to find the optimal end-to-
end delay.
V. Experiments
Several tests have been performed applying the proposed playout technique and alternative ones to
traces relevant to voice connections over the Internet. In this Section, we present the results for the
eight traces introduced in Section III.A, whose main characteristics are provided in Table I. The results
have been evaluated by computing the R Factor for each talkspurt on the basis of measured burst and
gap lengths, burst and gap packet loss ratios, and introduced delay.
To evaluate the influence of the number N of talkspurts used for the prediction, we have carried out
experiments changing this parameter in the range 1÷30 for every trace. Fig. 6 shows the results. It can
be observed that for low values of N, roughly lower than 15, the algorithm performance is affected by
this parameter for almost all the traces, while this is not true for high values. Hence, there exists a
minimum number of talkspurts that have to be used to obtain good estimates of the packet loss process
parameters. Additionally, the packet loss process seems to be slowly time-varying; in fact, in the range
15÷30 the results can be considered invariant with respect to N, proving that over at least 30
talkspurts the loss process remains almost unchanged. Based on these results, in our experiments we
have set N to 17 for all the traces. Recall that the algorithm complexity is not affected by N;
Figure 6. Average R Factor versus N for the 8 test traces.
accordingly, we have not been forced to look for the lowest value that allowed for good estimation
performance. As to the other eEM algorithm settings, Table III presents the used values.
Table III. Settings for the eEM algorithm parameters used in the experiments.

Parameter            Description                                   Value
N                    Number of talkspurts for prediction           17
d_e2e^- ÷ d_e2e^+    Search range (ms)                             50÷400
D                    Search step for the optimal d_e2e (ms)        1
M                    Number of d_e2e levels in the search range    350
In the following we compare the performance of the eEM algorithm with the Linear Filter [10], the
Concord [14], and the E-MOS [15]. In particular, for the Linear Filter we have used Algorithm 1 in
[10] with a weighting factor α = 0.998002. The Concord algorithm has been tested with default
parameter values: expected d_e2e (named det in [14]) recalculated at each arriving packet, histogram
with one-millisecond bin width, aging every 1000 packets with F = 0.9, and the maximum late
packets ratio (mlp) set to 0.01. To better evaluate the obtained results, for each talkspurt we have computed
the maximum R Factor (R Max), which represents the upper bound for the achievable quality level. It
is obtained by setting the end-to-end delay within a talkspurt by means of the eEM algorithm with an
important variation: the R Factor optimization is performed using the measured Ii(d_e2e) instead of
the predicted Îi(d_e2e), therefore excluding prediction errors. Clearly, this operation is not applicable
in a real framework, since in this case the measured values would need to be available before the
beginning of the relevant talkspurt. Tables IV and V show the R Factor averaged over all the
talkspurts in a trace, together with the average d_e2e and the total loss ratio. Note that the eEM
algorithm generally outperforms the others in terms of the R Factor. This was an expected result,
since the other algorithms do not take into account a quality model, except the E-MOS, which however
makes use of a quite different mathematical representation of the conversational quality. As to the
first four traces, the proposed algorithm allows for obtaining, on average, an R Factor almost 2 points
higher than the others. Trace 5 presents high network delay variability. In this case, a significant
decrease of the R Factor can be observed (about 10 points lower than for Trace 4), mainly due to an
increase in the packet loss ratio with respect to the previous traces.
Table IV. Comparison of eEM with other strategies for Traces 1-4. The last two columns report the
average R Factor and the R Max, with the corresponding MOS values in brackets.

Trace #   Algorithm       Avg. loss (%)   Avg. d_e2e (ms)   Avg. R Factor (MOS)   Avg. R Max (MOS)
Trace 1   eEM             2.28            180.2             71.5 (3.67)           74.4 (3.80)
          Linear Filter   2.40            173.7             62.0 (3.20)
          Concord         2.43            169.0             68.1 (3.51)
          E-MOS           1.65            295.9             58.6 (3.03)
Trace 2   eEM             3.31            177.5             70.8 (3.63)           71.8 (3.68)
          Linear Filter   3.32            173.0             66.6 (3.43)
          Concord         3.51            168.9             68.7 (3.54)
          E-MOS           2.81            294.9             57.0 (2.94)
Trace 3   eEM             4.37            174.8             69.2 (3.56)           69.5 (3.57)
          Linear Filter   4.67            173.7             65.2 (3.36)
          Concord         4.42            169.0             67.8 (3.49)
          E-MOS           3.65            282.2             58.6 (3.03)
Trace 4   eEM             4.91            178.9             68.1 (3.51)           68.1 (3.51)
          Linear Filter   5.23            173.9             64.0 (3.30)
          Concord         5.31            169.2             66.1 (3.41)
          E-MOS           4.27            291.3             52.1 (2.69)
Table V. Comparison of eEM with other strategies for Traces 5-8.

Trace #   Algorithm       Avg. loss (%)   Avg. d_e2e (ms)   Avg. R Factor (MOS)   Avg. R Max (MOS)
Trace 5   eEM             5.05            176.1             58.8 (3.04)           59.0 (3.05)
          Linear Filter   5.08            173.3             56.5 (2.92)
          Concord         5.22            165.2             58.1 (3.00)
          E-MOS           4.41            305.3             53.0 (2.73)
Trace 6   eEM             0.44            348.0             75.2 (3.83)           77.3 (3.92)
          Linear Filter   0.30            399.6             67.6 (3.48)
          Concord         0.80            409.4             68.8 (3.54)
          E-MOS           1.60            373.0             65.0 (3.35)
Trace 7   eEM             1.40            157.4             77.8 (3.94)           78.4 (3.96)
          Linear Filter   1.20            140.2             68.3 (3.52)
          Concord         1.90            130.6             70.6 (3.63)
          E-MOS           1.15            220.1             60.1 (3.11)
Trace 8   eEM             1.21            135.2             76.8 (3.90)           77.5 (3.93)
          Linear Filter   1.81            129.0             73.4 (3.75)
          Concord         1.40            128.8             74.8 (3.81)
          E-MOS           0.72            295.3             60.8 (3.14)
This phenomenon has been observed in particular for the Concord, which allowed an increase in the
number of late packets in order to reduce the end-to-end delay. For Trace 6, characterized by high
network delays and low packet loss ratios, the eEM algorithm has provided high quality values with
respect to the others. In particular, the Concord showed problems in optimizing d_e2e and in keeping
the packet loss ratio low; the E-MOS is competitive, notwithstanding the high packet loss ratio it
introduced. The characteristics of this trace seem to represent good operative conditions for this
strategy. Similar results have been observed for the last two traces. To evaluate the improvement of
the proposed algorithm on a different scale, we have converted the R Factor into the MOS according
to (6). These values are shown in brackets in Tables IV and V. On this scale, the average improvement
of the eEM algorithm with respect to the others is about 0.34 points.
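Equation (6) is not reproduced in this excerpt; the bracketed values are consistent with the standard ITU-T G.107 R-to-MOS mapping, sketched below (e.g., R = 71.5 maps to MOS 3.67, as in the eEM entry for Trace 1 in Table IV):

```python
def r_to_mos(r):
    """Standard ITU-T G.107 conversion from the R Factor to the MOS scale."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

print(round(r_to_mos(71.5), 2))   # matches the eEM entry for Trace 1 in Table IV
```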
Table VI. Results of the eEM algorithm for Traces 1-4 when assuming loss randomness: the equipment
impairment factor has been evaluated by means of (5).

Trace #   Avg. d_e2e (ms)   Avg. R Factor (MOS)
Trace 1   191.2             72.6 (3.72)
Trace 2   180.5             72.9 (3.73)
Trace 3   182.8             71.4 (3.66)
Trace 4   181.9             70.2 (3.61)
In Table VI, we provide the results of the quality-based playout algorithm when using the expression
of the equipment impairment factor valid for random losses. In particular, we have applied the eEM
algorithm to the first four traces directly using (5) to evaluate the effects of the losses. We present the
resulting R Factor values and the average end-to-end delay. As expected, higher quality levels have
been obtained, due to the fact that the assumption of loss randomness increases the expected perceived
quality with respect to the case of correlated losses. This phenomenon also yielded higher optimal
end-to-end delay values, since at a given loss ratio Ie,eff decreased while Id remained unchanged.
Figure 7. Comparison of the proposed algorithm with the competing ones for the first talkspurts of
Trace 8.
To evaluate the evolution of the R Factor within a session, in Fig. 7 we provide the performance of the
playout algorithms for the first 35 talkspurts in Trace 8. The curves present a very similar behavior
and differ by a vertical shift of a few points. The eEM curve is always very close to the maximum
achievable values, in accordance with the average results presented in Table V. Note that the quality
is characterized by a high variability, oscillating in a range of about 20 points for all the algorithms.
This is due to the fluctuations of the network delay, which heavily affect the quality.
Figure 8. Overall R Factor for Traces 7 and 8.
Finally, in Fig. 8 we provide the overall call quality level instead of showing it for each talkspurt
separately. In particular, g, b, eg, eb, and y have been measured for the entire conversation and
directly used in (12) to evaluate the overall performance of the playout algorithms. This figure
presents the graphs for Traces 7 and 8, showing results similar to those obtained measuring the
performance on a talkspurt basis.
VI. Conclusions
A new algorithm for playout buffering, named eEM, has been presented for IP Telephony
applications. This algorithm exploits a quality model that allows the receiver to automatically find the
optimal end-to-end delay in terms of conversational quality. The major contribution of this work is the
adoption of a model that is able to evaluate the effects of the packet loss temporal correlation on the
end-user perceived quality. To this aim, the ETSI Tiphon study on bursty losses has been incorporated
in the proposed algorithm. Extensive experiments have been carried out to evaluate the performance
of the proposed strategy, showing that the eEM algorithm improves the conversational quality by
some points in terms of the R Factor with respect to other playout techniques. Future work will be
devoted to the investigation of alternative solutions for the prediction of the parameters of the ETSI
Tiphon model and to the extension of this algorithm to audio-video communications.
References
[1] ITU-T Recommendation G.107, “The E-Model, a computational model for use in transmission planning,”
03/2003.
[2] ITU-T Recommendation G.108, “Application of the E-model: A planning guide,” 09/1999.
[3] J-C. Bolot, “Characterizing end-to-end packet delay and loss in the Internet,” Journal of High-Speed
Networks, vol. 2, no. 3, pp. 305-323, Dec. 1993.
[4] W. Jiang and H. Schulzrinne, “Modeling of packet loss and delay and their effect on real-time multimedia
service quality,” in Proc. NOSSDAV, June 2000.
[5] ETSI T1A1.1/2001-037, “Extensions to the E Model to incorporate the effects of time varying packet loss
and recency,” Alan Clark, Telchemy Incorporated.
[6] ANSI T1A1.7/98-031: “Testing the quality of connections having time varying impairments,” AT&T.
[7] ITU-T SG12 D.139: “Study of the relationship between instantaneous and overall subjective speech quality
for time-varying quality speech sequences: influence of a recency effect,” France Telecom.
[8] L. Sun, G. Wade, B. Lines, and E. Ifeachor, “Impact of packet loss location on perceived speech quality,” in
Proc. 2nd IP Telephony Workshop, pp. 114-122, New York, April 2001.
[9] H. Sanneck, N. Le, A. Wolisz, and G. Carle, “Intra-flow loss recovery and control for VoIP,” in Proc. ACM
Multimedia 2001, Ottawa (ON), Sept. 2001.
[10] R. Ramjee, J. Kurose, D. Towsley, and H. Schulzrinne, “Adaptive playout mechanisms for packetized audio
applications in wide-area networks,” in Proc. IEEE INFOCOM, Toronto, Canada, June 1994.
[11] A. Kansar and A. Karandikar, “Jitter-free audio playout over best effort packet networks,” in ATM Forum
International Symposium, New Delhi, India, 2001.
[12] P. DeLeon and C. J. Sreenan, “An adaptive predictor for media playout buffering,” in Proc. IEEE ICASSP,
vol.6, pp. 3097–3100, March 1999.
[13] J. Pinto and K. J. Christensen, "An algorithm for playout of packet voice based on adaptive adjustment of
talkspurt silence periods,” in Proc. 24th Conference on Local Computer Networks, Lowell, Massachusetts, Oct.
1999.
[14] C.J. Sreenan, J.-C. Chen, P. Agrawal, and B. Narendran, "Delay reduction techniques for playout
buffering," IEEE Transactions on Multimedia, vol. 2, no. 2, pp. 88-100, June 2000.
[15] K. Fujimoto, S. Ata, and M. Murata, “Adaptive playout buffer algorithm for enhancing perceived quality of
streaming applications,” in Proc. IEEE GLOBECOM 2002, pp. 2463-2469, Taipei, Taiwan, Nov. 2002.
[16] K. Fujimoto, S. Ata, and M. Murata, “Playout control for streaming applications by statistical delay
analysis,” in Proc. IEEE INFOCOM, vol 8, pp. 2337-2342, June 2001.
[17] K. Fujimoto, S. Ata, and M. Murata, “Statistical analysis of packet delays in the internet and its application
to playout control for streaming applications,” IEICE Transactions on Communications, vol. E84-B, no. 6, pp.
1504, June 2001.
[18] C. Savolaine, “QoS/VoIP overview,” in Proc. IEEE Communications Quality & Reliability International
Workshop, April 2001.
[19] ITU-T Recommendation G.711, “Pulse code modulation (PCM) of voice frequencies,” 1/1988.
[20] C. Boutremans and J-Y. Le Boudec, “Adaptive playout buffer and FEC adjustment for Internet Telephony,”
in Proc. IEEE INFOCOM 2003, pp. 652-662, April 2003.
[21] ITU-T Recommendation G.113, “Transmission impairments due to speech processing,” 02/2001.
[22] L. Sun and E. Ifeachor, “New models for perceived voice quality prediction and their applications in
playout buffer optimization for VoIP networks,” in Proc. ICC 2004, June 2004.
[23] ITU-T Recommendation P.862, “Perceptual evaluation of speech quality (PESQ), an objective method for
end-to-end quality assessment of narrow-band telephone networks and speech codecs,” February 2001.
[24] A.W. Rix, “Comparison between subjective listening quality and P.862 PESQ score,” in Proc. of Online
Workshop Measurement of Speech and Audio Quality in Networks, pp. 17–25, May 2003.
[25] L. Sun and E. Ifeachor, “Prediction of perceived conversational speech quality and effects of playout buffer
algorithms,” in Proc. of IEEE ICC 2003, pp. 1–6, 2003.
[26] L. Atzori and M. L. Lobina, “Speech playout buffering based on a simplified version of the ITU-T E-
Model,” IEEE Signal Processing Letters, vol. 11, no. 3, pp. 382-385, March 2004.
[27] W. Jiang and H. Schulzrinne, “Comparisons of FEC and codec robustness on VoIP quality and
bandwidth efficiency,” in Proc. ICN, August 2001.
[28] J. Matta, C. Pepin, K. Lashkari, and R. Jain, “A source and channel rate adaptation algorithm for AMR in
VoIP using the E-model,” in Proc. NOSSDAV 2003, June 2003.
[29] M. Gardner, V. S. Frost, and D. W. Petr, “Using optimization to achieve efficient quality of service in voice
over IP networks,” in Proc. of IPCCC 2003, April 2003.
[30] R. Cole and J. Rosenbluth, “Voice over IP performance monitoring,” ACM Computer Communication
Review, vol. 31, no. 2, Apr. 2001.
[31] ITU-T Recommendation G.114, “One-way transmission time,” 05/2000.
[32] M. Yajnik, S. Moon, J. Kurose, and D. Towsley, “Measurement and modeling of the temporal dependence
in packet loss,” in Proc. IEEE INFOCOM, New York, March 1999.
[33] J. Rosenberg, “G.729 Error Recovery for Internet Telephony,” Columbia University Computer Science
Technical Report CUCS-016-01, vol. 19, Dec. 2001.
[34] ITU-T Recommendation G.729-A, “Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-
excited linear-prediction (CS-ACELP),” 03/1996.
[35] A. Mood, F. Graybill, and D. Boes, “Introduction to the theory of statistics,” McGraw-Hill.
[36] W. Jiang and H. Schulzrinne, “Perceived quality of packet audio under bursty losses,” in Proc. IEEE
INFOCOM, New York, June 2002.
[37] ETSI TIPHON TS 101329-5: “QoS measurement methodologies,” Annex E.
Appendix I: The ITU-T H.323
Document Source
The contents of this document have been obtained from the International Engineering Consortium.
The International Engineering Consortium (IEC) is a nonprofit organization dedicated to catalyzing
technology and business progress worldwide in a range of high-technology industries and their
university communities. Since 1944, the IEC has provided high-quality educational opportunities for
industry professionals, academics, and students. In conjunction with industry-leading companies, the
IEC has developed an extensive, free, on-line educational program. The IEC conducts industry-
university programs that have substantial impact on curricula. It also conducts research and develops
publications, conferences, and technological exhibits that address major opportunities and challenges
of the information age. More than 70 leading high-technology universities are IEC affiliates, and the
IEC handles the affairs of the Electrical and Computer Engineering Department Heads Association.
Definition
H.323 is a standard that specifies the components, protocols and procedures that provide multimedia
communication services (real-time audio, video, and data communications) over packet networks,
including Internet protocol (IP) based networks. H.323 is part of a family of ITU-T recommendations
called H.32x that provides multimedia communication services over a variety of networks.
Overview
This appendix discusses the H.323 protocol standard. H.323 is explained with an emphasis on
gateways and gatekeepers, which are components of an H.323 network. The call flows between
entities in an H.323 network are explained, and the interworking aspects of H.323 with H.32x family
protocols are discussed.
What is H.323?
The H.323 standard is a cornerstone technology for the transmission of real-time audio, video, and
data communications over packet-based networks. It specifies the components, protocols, and
procedures providing multimedia communication over packet-based networks (see Figure 9). Packet-
based networks include Internet Protocol (IP) based (including the Internet) or Internet packet
exchange (IPX) based local-area networks (LANs), enterprise networks (ENs), metropolitan-area
networks (MANs), and wide area networks (WANs). H.323 can be applied in a variety of
mechanisms: audio only (IP telephony); audio and video (videotelephony); audio and data; and audio,
video and data. H.323 can also be applied to multipoint-multimedia communications. H.323 provides
myriad services and, therefore, can be applied in a wide variety of areas: consumer, business, and
entertainment applications.
Figure 9. H.323 Terminals on a Packet Network
H.323 Versions
The H.323 standard is specified by the ITU-T Study Group 16. Version 1 of the H.323
recommendation - visual telephone systems and equipment for LANs that provide a nonguaranteed
quality of service (QoS) - was accepted in October 1996. As the name suggests, it was heavily
weighted towards multimedia communications in a LAN environment and did not provide guaranteed
QoS.
The emergence of Voice Over Internet Protocol (VoIP) applications and IP telephony has paved the
way for a revision of the H.323 specification. The absence of a standard for voice over IP resulted in
products that were incompatible. With the development of VoIP, new requirements emerged, such as
providing communication between a PC based phone and a phone on a traditional switched circuit
network (SCN). Such requirements forced the need for a standard for IP telephony. Version 2 of
H.323 (packet-based multimedia communications systems) was defined to accommodate these
additional requirements and was accepted in January 1998.
New features are being added to the H.323 standard, which will evolve to Version 3 shortly. The
features being added include fax-over-packet networks, gatekeeper-gatekeeper communications, and
fast-connection mechanisms.
H.323 in Relation to Other Standards of the H.32x Family
The H.323 standard is part of the H.32x family of recommendations specified by ITU-T. The other
recommendations of the family specify multimedia communication services over different networks:
o H.324 over SCN
o H.320 over integrated services digital networks (ISDN)
o H.321 and H.310 over broadband integrated services digital networks (B-ISDN)
o H.322 over LANs that provide guaranteed QoS
One of the primary goals in the development of the H.323 standard was interoperability with other
multimedia-services networks. This interoperability is achieved through the use of a gateway. A
gateway performs any network or signaling translation required for interoperability.
H.323 Components
The H.323 standard specifies four kinds of components, which, when networked together, provide the
point-to-point and point-to-multipoint multimedia-communication services:
o Terminals
o Gateways
o Gatekeepers
o Multipoint Control Units (MCUs)
Terminals
Used for real-time bidirectional multimedia communications, an H.323 terminal can be either a
personal computer (PC) or a stand-alone device running an H.323 stack and multimedia applications. It
supports audio communications and can optionally support video or data communications. Because
the basic service provided by an H.323 terminal is audio communications, an H.323 terminal plays a
key role in IP telephony services. The primary goal of H.323 is to interwork with
other multimedia terminals. H.323 terminals are compatible with H.324 terminals on SCN and
wireless networks, H.310 terminals on B-ISDN, H.320 terminals on ISDN, H.321 terminals on B-
ISDN, and H.322 terminals on guaranteed QoS LANs. H.323 terminals may be used in multipoint
conferences.
Gateways
A gateway connects two dissimilar networks. An H.323 gateway provides connectivity between an
H.323 network and a non-H.323 network. For example, a gateway can connect and provide
communication between an H.323 terminal and SCN networks (SCN networks include all switched
telephony networks, e.g., public switched telephone network [PSTN]). This connectivity of dissimilar
networks is achieved by translating protocols for call setup and release, converting media formats
between different networks, and transferring information between the networks connected by the
gateway. A gateway is not required, however, for communication between two terminals on an H.323
network.
Gatekeepers
A gatekeeper can be considered the brain of the H.323 network. It is the focal point for all calls within
the H.323 network. Although they are not required, gatekeepers provide important services such as
addressing, authorization and authentication of terminals and gateways; bandwidth management;
accounting; billing; and charging. Gatekeepers may also provide call-routing services.
Multipoint Control Units
MCUs provide support for conferences of three or more H.323 terminals. All terminals participating in
the conference establish a connection with the MCU. The MCU manages conference resources,
negotiates between terminals for the purpose of determining the audio or video coder/decoder
(CODEC) to use, and may handle the media stream. The gatekeepers, gateways, and MCUs are
logically separate components of the H.323 standard but can be implemented as a single physical
device.
H.323 Zones
An H.323 zone is a collection of all terminals, gateways, and MCUs managed by a single gatekeeper
(see Figure 10). A zone includes at least one terminal and may include gateways or MCUs. A zone has
only one gatekeeper. A zone may be independent of network topology and may be comprised of
multiple network segments that are connected using routers or other devices.
Figure 10. An H.323 Zone
Protocols Specified by H.323
The protocols specified by H.323 are listed below. H.323 is independent of the packet network and the
transport protocols over which it runs and does not specify them.
o Audio CODEC
o Video CODEC
o H.225 registration, admission, and status (RAS)
o H.225 call signaling
o H.245 control signaling
o Real-time Transfer Protocol (RTP)
o Real-time Control Protocol (RTCP)
Audio CODEC
An audio CODEC encodes the audio signal from the microphone for transmission on the transmitting
H.323 terminal and decodes the received audio code that is sent to the speaker on the receiving H.323
terminal. Because audio is the minimum service provided by the H.323 standard, all H.323 terminals
must support at least one audio CODEC, as specified in the ITU-T G.711 recommendation
(audio coding at 64 kbps). Additional audio CODEC recommendations such as G.722 (64, 56, and 48
kbps), G.723.1 (5.3 and 6.3 kbps), G.728 (16 kbps), and G.729 (8 kbps) may also be supported.
Video CODEC
A video CODEC encodes video from the camera for transmission on the transmitting H.323 terminal
and decodes the received video code that is sent to the video display on the receiving H.323 terminal.
Because H.323 specifies support of video as optional, the support of video CODECs is optional as
well. However, any H.323 terminal providing video communications must support video encoding and
decoding as specified in the ITU-T H.261 recommendation.
H.225 Registration, Admission, and Status
Registration, admission, and status (RAS) is the protocol between endpoints (terminals and gateways)
and gatekeepers. The RAS is used to perform registration, admission control, bandwidth changes,
status, and disengage procedures between endpoints and gatekeepers. An RAS channel is used to
exchange RAS messages. This signaling channel is opened between an endpoint and a gatekeeper
prior to the establishment of any other channels.
H.225 Call Signaling
The H.225 call signaling is used to establish a connection between two H.323 endpoints. This is
achieved by exchanging H.225 protocol messages on the call-signaling channel. The call-signaling
channel is opened between two H.323 endpoints or between an endpoint and the gatekeeper.
H.245 Control Signaling
H.245 control signaling is used to exchange end-to-end control messages governing the operation of
the H.323 endpoint. These control messages carry information related to the following:
o Capabilities exchange
o Opening and closing of logical channels used to carry media streams
o Flow-control messages
o General commands and indications
Real-Time Transport Protocol
Real-time transport protocol (RTP) provides end-to-end delivery services of real-time audio and video.
Whereas H.323 is used to transport data over IP based networks, RTP is typically used to transport
data via the user datagram protocol (UDP). RTP, together with UDP, provides transport-protocol
functionality. RTP provides payload-type identification, sequence numbering, timestamping, and
delivery monitoring. UDP provides multiplexing and checksum services. RTP can also be used with
other transport protocols.
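The RTP services listed above (payload-type identification, sequence numbering, timestamping) live in a fixed 12-byte header carried in the UDP payload. The following is a minimal sketch of parsing that header with only the standard library; the function name and returned dictionary are illustrative, not part of any RTP library.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed 12-byte RTP header (RFC 3550) from a raw packet.

    Returns the fields a playout buffer needs: payload type (codec),
    sequence number (loss/reordering detection) and timestamp (jitter
    estimation and playout scheduling).
    """
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,          # RTP version, should be 2
        "payload_type": b1 & 0x7F,   # e.g. 0 = G.711 mu-law, 18 = G.729
        "sequence_number": seq,      # increments by one per packet
        "timestamp": ts,             # sampling-clock units (e.g. 8 kHz)
        "ssrc": ssrc,                # synchronization source identifier
    }

# Example: version 2, payload type 0 (G.711), seq 1, timestamp 160
pkt = struct.pack("!BBHII", 0x80, 0x00, 1, 160, 0x1234)
hdr = parse_rtp_header(pkt)
```

A receiver would feed `sequence_number` and `timestamp` to its jitter estimator before scheduling the frame for playout.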
Real-Time Transport Control Protocol
Real-time transport control protocol (RTCP) is the counterpart of RTP that provides control services.
The primary function of RTCP is to provide feedback on the quality of the data distribution. Other
RTCP functions include carrying a transport-level identifier for an RTP source, called a canonical
name, which is used by receivers to synchronize audio and video.
Terminal Characteristics
H.323 terminals must support the following:
o H.245 for exchanging terminal capabilities and creation of media channels
o H.225 for call signaling and call setup
o RAS for registration and other admission control with a gatekeeper
o RTP/RTCP for sequencing audio and video packets
H.323 terminals must also support the G.711 audio CODEC. Optional components in an H.323
terminal are video CODECs, T.120 data-conferencing protocols, and MCU capabilities.
Gateway Characteristics
A gateway provides translation of protocols for call setup and release, conversion of media formats
between different networks, and the transfer of information between H.323 and non-H.323 networks.
An application of the H.323 gateway is in IP telephony, where the H.323 gateway connects an IP
network and SCN network (e.g., ISDN network).
On the H.323 side, a gateway runs H.245 control signaling for exchanging capabilities, H.225 call
signaling for call setup and release, and H.225 registration, admissions, and status (RAS) for
registration with the gatekeeper. On the SCN side, a gateway runs SCN-specific protocols (e.g., ISDN
and SS7 protocols).
Terminals communicate with gateways using the H.245 control-signaling protocol and H.225 call-
signaling protocol. The gateway translates these protocols in a transparent fashion to the respective
counterparts on the non-H.323 network and vice versa. The gateway also performs call setup and
clearing on both the H.323-network side and the non-H.323-network side. Translation between audio,
video, and data formats may also be performed by the gateway. Audio and video translation may not
be required if both terminal types find a common communications mode. For example, in the case of a
gateway to H.320 terminals on the ISDN, both terminal types require G.711 audio and H.261 video, so
a common mode always exists. The gateway has the characteristics of both an H.323 terminal on the
H.323 network and the other terminal on the non-H.323 network it connects.
Gatekeepers are aware of which endpoints are gateways because this is indicated when the terminals
and gateways register with the gatekeeper. A gateway may be able to support several simultaneous
calls between the H.323 and non-H.323 networks. In addition, a gateway may connect an H.323
network to a non-H.323 network. A gateway is a logical component of H.323 and can be implemented
as part of a gatekeeper or an MCU.
Gatekeeper Characteristics
Gatekeepers provide call-control services for H.323 endpoints, such as address translation and
bandwidth management as defined within RAS. Gatekeepers in H.323 networks are optional. If they
are present in a network, however, terminals and gateways must use their services. The H.323
standards both define mandatory services that the gatekeeper must provide and specify other optional
functionality that it can provide.
An optional feature of a gatekeeper is call-signaling routing. Endpoints send call-signaling messages
to the gatekeeper, which the gatekeeper routes to the destination endpoints. Alternately, endpoints can
send call-signaling messages directly to the peer endpoints. This feature of the gatekeeper is valuable,
as monitoring of the calls by the gatekeeper provides better control of the calls in the network. Routing
calls through gatekeepers provides better performance in the network, as the gatekeeper can make
routing decisions based on a variety of factors, for example, load balancing among gateways.
A gatekeeper is optional in an H.323 system. The services offered by a gatekeeper are defined by RAS
and include address translation, admissions control, bandwidth control, and zone management. H.323
networks that do not have gatekeepers may not have these capabilities, but H.323 networks that
contain IP-telephony gateways should also contain a gatekeeper to translate incoming E.164 telephone
addresses into transport addresses. A gatekeeper is a logical component of H.323 but can be
implemented as part of a gateway or MCU.
Mandatory Gatekeeper Functions
Address Translation
Calls originating within an H.323 network may use an alias to address the destination terminal. Calls
originating outside the H.323 network and received by a gateway may use an E.164 telephone number
(e.g., 310-442-9222) to address the destination terminal. The gatekeeper translates this E.164
telephone number or the alias into the network address (e.g., 204.252.32:456 for an IP-based network)
for the destination terminal. The destination endpoint can be reached using the network address on the
H.323 network.
Admission Control
The gatekeeper can control the admission of the endpoints into the H.323 network. It uses RAS
messages, admission request (ARQ), confirm (ACF), and reject (ARJ) to achieve this. Admissions
control may be a null function that admits all endpoints to the H.323 network.
Bandwidth Control
The gatekeeper provides support for bandwidth control by using the RAS messages, bandwidth
request (BRQ), confirm (BCF), and reject (BRJ). For instance, if a network manager has specified a
threshold for the number of simultaneous connections on the H.323 network, the gatekeeper can refuse
to make any more connections once the threshold is reached. The result is to limit the total allocated
bandwidth to some fraction of the total available, leaving the remaining bandwidth for data
applications. Bandwidth control may also be a null function that accepts all requests for bandwidth
changes.
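The thresholded admission behaviour described above can be sketched as a toy model. The class and method names below are illustrative and not part of H.225; a real gatekeeper exchanges ASN.1-encoded ARQ/ACF/ARJ and BRQ/BCF/BRJ messages over the RAS channel.

```python
class Gatekeeper:
    """Toy model of gatekeeper admission/bandwidth control.

    A request is confirmed while the number of simultaneous connections
    is below the operator-set threshold, and rejected afterwards.  A
    threshold of None models the 'null function' that admits everything.
    """

    def __init__(self, max_connections=None):
        self.max_connections = max_connections
        self.active = 0

    def admission_request(self) -> str:
        # "ACF" = admission confirm, "ARJ" = admission reject
        if self.max_connections is not None and self.active >= self.max_connections:
            return "ARJ"
        self.active += 1
        return "ACF"

gk = Gatekeeper(max_connections=2)
results = [gk.admission_request() for _ in range(3)]
```

With a threshold of two connections, the third request is rejected, leaving the remaining bandwidth available for data applications.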
Zone Management
The gatekeeper provides the above functions (address translation, admissions control, and bandwidth
control) for terminals, gateways, and MCUs located within its zone of control. An H.323 zone is
defined above.
Optional Gatekeeper Functions
Call-Control Signaling
The gatekeeper can route call-signaling messages between H.323 endpoints. In a point-to-point
conference, the gatekeeper may process H.225 call-signaling messages. Alternatively, the gatekeeper
may allow the endpoints to send H.225 call-signaling messages directly to each other.
Call Authorization
When an endpoint sends call-signaling messages to the gatekeeper, the gatekeeper may accept or
reject the call, according to the H.225 specification. The reasons for rejection may include access-
based or time-based restrictions, to and from particular terminals or gateways.
Call Management
The gatekeeper may maintain information about all active H.323 calls so that it can control its zone by
providing the maintained information to the bandwidth-management function or by rerouting the calls
to different endpoints to achieve load balancing.
H.225 Registration, Admission, and Status
The H.225 RAS is used between H.323 endpoints (terminals and gateways) and gatekeepers for the
following:
o Gatekeeper discovery (GRQ)
o Endpoint registration
o Endpoint location
o Admission control
o Access tokens
The RAS messages are carried on a RAS channel that is unreliable. Hence, RAS message exchange
may be associated with timeouts and retry counts.
Gatekeeper Discovery
The gatekeeper discovery process is used by the H.323 endpoints to determine the gatekeeper with
which the endpoint must register. The gatekeeper discovery can be done statically or dynamically. In
static discovery, the endpoint knows the transport address of its gatekeeper a priori. In the dynamic
method of gatekeeper discovery, the endpoint multicasts a GRQ message on the gatekeeper's
discovery multicast address: "Who is my gatekeeper?" One or more gatekeepers may respond with a
GCF message: "I can be your gatekeeper."
Endpoint Registration
Registration is a process used by the endpoints to join a zone and inform the gatekeeper of the zone's
transport and alias addresses. All endpoints register with a gatekeeper as part of their configuration.
Endpoint Location
Endpoint location is a process by which the transport address of an endpoint is determined, given its
alias name or E.164 address.
Other Control
The RAS channel is used for other kinds of control mechanisms, such as admission control, to restrict
the entry of an endpoint into a zone, bandwidth control, and disengagement control, where an endpoint
is disassociated from a gatekeeper and its zone.
H.225 Call Signaling
H.225 call signaling is used to set up connections between H.323 endpoints (terminals and gateways),
over which the real-time data can be transported. Call signaling involves the exchange of H.225
protocol messages over a reliable call-signaling channel. For example, H.225 protocol messages are
carried over TCP in an IP-based H.323 network.
H.225 messages are exchanged between the endpoints if there is no gatekeeper in the H.323 network.
When a gatekeeper exists in the network, the H.225 messages are exchanged either directly between
the endpoints or between the endpoints after being routed through the gatekeeper. The first case is
direct call signaling. The second case is called gatekeeper-routed call signaling. The method chosen is
decided by the gatekeeper during RAS-admission message exchange.
Gatekeeper-Routed Call Signaling
The admission messages are exchanged between endpoints and the gatekeeper on RAS channels. The
gatekeeper receives the call-signaling messages on the call-signaling channel from one endpoint and
routes them to the other endpoint on the call-signaling channel of the other endpoint.
Direct Call Signaling
During the admission confirmation, the gatekeeper indicates that the endpoints can exchange call-
signaling messages directly. The endpoints exchange the call signaling on the call-signaling channel.
H.245 Control Signaling
H.245 control signaling consists of the exchange of end-to-end H.245 messages between
communicating H.323 endpoints. The H.245 control messages are carried over H.245 control
channels. The H.245 control channel is the logical channel 0 and is permanently open, unlike the
media channels. The messages carried include messages to exchange capabilities of terminals and to
open and close logical channels.
Capabilities Exchange
Capabilities exchange is the process by which the communicating terminals exchange messages to provide
their transmit and receive capabilities to the peer endpoint. Transmit capabilities describe the
terminal's ability to transmit media streams. Receive capabilities describe a terminal's ability to receive
and process incoming media streams.
Logical Channel Signaling
A logical channel carries information from one endpoint to another endpoint (in the case of a point-to-
point conference) or multiple endpoints (in the case of a point-to-multipoint conference). H.245
provides messages to open or close a logical channel; a logical channel is unidirectional.
Connection Procedures
This section describes the steps involved in creating an H.323 call, establishing media
communication, and releasing the call. The example network contains two H.323 terminals (T1 and
T2) connected to a gatekeeper. Direct call signaling is assumed. It is also assumed that the media
stream uses RTP encapsulation. Figure 11 illustrates H.323 call establishment.
Figure 11. H.323 Call Establishment
1. T1 sends the RAS ARQ message on the RAS channel to the gatekeeper for registration. T1 requests the use
of direct call signaling.
2. The gatekeeper confirms the admission of T1 by sending ACF to T1. The gatekeeper indicates in ACF that
T1 can use direct call signaling.
3. T1 sends an H.225 call signaling setup message to T2 requesting a connection.
4. T2 responds with an H.225 call proceeding message to T1.
5. Now T2 has to register with the gatekeeper. It sends an RAS ARQ message to the gatekeeper on the RAS
channel.
6. The gatekeeper confirms the registration by sending an RAS ACF message to T2.
7. T2 alerts T1 of the connection establishment by sending an H.225 alerting message.
8. Then T2 confirms the connection establishment by sending an H.225 connect message to T1, and the call is
established.
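The eight steps above can be written down as a message trace and checked mechanically. The tuple encoding below is purely illustrative; channel, sender, and message names follow the figure.

```python
# Expected message order for direct call signaling (cf. Figure 11),
# as (channel, sender, message) tuples.  "GK" denotes the gatekeeper.
CALL_SETUP_FLOW = [
    ("RAS",   "T1", "ARQ"),              # 1. T1 requests admission
    ("RAS",   "GK", "ACF"),              # 2. gatekeeper confirms; direct signaling granted
    ("H.225", "T1", "SETUP"),            # 3. T1 requests a connection from T2
    ("H.225", "T2", "CALL PROCEEDING"),  # 4. T2 acknowledges the setup
    ("RAS",   "T2", "ARQ"),              # 5. T2 requests admission
    ("RAS",   "GK", "ACF"),              # 6. gatekeeper confirms T2
    ("H.225", "T2", "ALERTING"),         # 7. T2 signals ringing
    ("H.225", "T2", "CONNECT"),          # 8. call established
]

def channels_used(flow):
    """Signaling channels touched during call establishment."""
    return sorted({channel for channel, _, _ in flow})
```

Note that only the RAS and H.225 call-signaling channels are involved; H.245 control signaling and the RTP media channels are opened after the connect.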
Interworking with Other Multimedia Networks
The H.323 protocol is specified so that it interoperates with other networks. The most popular H.323
interworking is IP telephony, when the underlying network of H.323 is an IP network and the
interoperating network is SCN (see Figure 12). SCN includes PSTN and ISDN networks.
Figure 12. IP Telephony: H.323 Interworking with SCN
Appendix II: Intrusive and Non-Intrusive Evaluation of Speech Quality
PESQ
Previous objective speech quality assessment models, such as bark spectral distortion (BSD), the
perceptual speech quality measure (PSQM), and measuring normalizing blocks (MNB), have been
found to be suitable for assessing only a limited range of distortions. A new model has therefore been
developed for use across a wider range of network conditions, including analogue connections, codecs,
packet loss and variable delay. Known as perceptual evaluation of speech quality (PESQ), it is the
result of integration of the perceptual analysis measurement system (PAMS) and PSQM99, an
enhanced version of PSQM. PESQ has since been standardized as ITU-T recommendation P.862 [23],
replacing P.861, which specified PSQM and MNB.
Figure 13. The PESQ Strategy
The model begins by level-aligning both signals to a standard listening level (see Figure 13). They are
filtered (using an FFT) with an input filter to model a standard telephone handset. The signals are
aligned in time and then processed through an auditory transform similar to that of PSQM. The
transformation also involves equalizing for linear filtering in the system and for gain variation. Two
distortion parameters are extracted from the disturbance (the difference between the transforms of the
signals), and are aggregated in frequency and time and mapped to a prediction of subjective mean
opinion score (MOS). Some details are discussed below. The time alignment of PESQ assumes that
the delay of the system is piecewise constant. This assumption appears to be valid for a wide range of
systems, including packet-based transmission such as voice over IP (VoIP). Delay changes are
allowed in silent periods (where they will normally be inaudible) and in speech (where they are
usually audible). The signals are aligned using the following steps.
• Narrowband filter applied to both signals to emphasize perceptually important parts. These filtered signals
are only used for time alignment.
• Envelope-based delay estimation.
• Division of reference signal into utterances.
• Envelope-based delay estimation for each utterance.
• Fine correlation histogram-based delay identification for each utterance.
• Utterance splitting and re-alignment to test for delay changes during speech.
These give a delay estimate for each utterance, which is used to find the frame-by-frame delay for use
in the auditory transform.
The auditory transform in PESQ is a psychoacoustic model which maps the signals into a
representation of perceived loudness in time and frequency. It includes the following stages.
Bark spectrum. An FFT with a Hamming window is used to calculate the instantaneous power
spectrum in each frame, for 50% overlapping frames of 32ms duration. This is grouped without
smearing into 42 bins, equally spaced in perceptual frequency on a modified Bark scale similar to that
of PSQM [2].
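The framing and windowing just described can be sketched as follows, assuming NumPy and 8 kHz input; the grouping of FFT bins into 42 modified-Bark bins is omitted, and the function name is illustrative.

```python
import numpy as np

def frame_power_spectra(x, fs=8000, frame_ms=32):
    """Instantaneous power spectrum per frame, as in the PESQ front end:
    Hamming-windowed FFT over 50%-overlapping frames of 32 ms
    (256 samples at 8 kHz).
    """
    n = int(fs * frame_ms / 1000)   # samples per frame
    hop = n // 2                    # 50% overlap
    win = np.hamming(n)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

fs = 8000
t = np.arange(fs) / fs
spec = frame_power_spectra(np.sin(2 * np.pi * 1000 * t), fs)
```

For a one-second 1 kHz tone this yields 61 frames of 129 bins each, with the spectral peak in bin 32 (1000 Hz at a 31.25 Hz bin spacing).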
Frequency equalization. The mean Bark spectrum for active speech frames is calculated. The ratio
between the spectra of reference and degraded gives a transfer function estimate, assuming that the
system under test has a constant frequency response. The reference is equalized to the degraded signal
using this estimate, with bounds to limit the equalization to ±20dB.
Equalization of gain variation. The ratio between the audible power of the reference and the degraded
in each frame is used to identify gain variations. This is filtered with a first-order lowpass filter, and
bounded, then the degraded signal is equalized to the reference.
Loudness mapping. The Bark spectrum is mapped to (Sone) loudness, including a frequency-
dependent threshold and exponent. This gives the perceived loudness in each time-frequency cell.
The absolute difference between the degraded and the reference signals gives a measure of audible
error. In PESQ, this is processed through several steps before a non-linear average over time and
frequency is calculated.
Deletion. A deletion (a negative delay change) leaves a section which overlaps in the degraded signal.
If the deletion is longer than half a frame, the overlapping sections are discarded.
Masking. Masking in each time-frequency cell is modeled using a simple threshold below which
disturbances are inaudible; this is set to the lesser of the loudness of the reference and degraded
signals, divided by four. The threshold is subtracted from the absolute loudness difference, and values
less than zero are set to zero. Methods for applying masking over distances larger than one time-
frequency cell were examined with earlier versions of PSQM and PSQM99, but did not improve
overall performance [14], and were not used in PESQ.
Asymmetry. Unlike P.861 PSQM [2], PESQ computes two different error averages, one without and
one with an asymmetry factor. The PESQ asymmetry factor is calculated from the ratio of the Bark
spectral density of the degraded to the reference signal in each time-frequency cell. This is
raised to the power 1.2 and is bounded with an upper limit of 12.0. Values smaller than 3.0 are set to
zero. The asymmetric weighted disturbance, obtained by multiplying by this factor, thus measures
only additive distortions.
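A cell-level sketch of this factor, using only the constants stated above (the function name is illustrative):

```python
def asymmetry_factor(bark_deg, bark_ref):
    """Asymmetry factor for one time-frequency cell:
    (degraded/reference Bark density) ** 1.2, zeroed below 3.0 and
    capped at 12.0, so that only additive distortions carry weight."""
    h = (bark_deg / max(bark_ref, 1e-12)) ** 1.2
    if h < 3.0:
        return 0.0
    return min(h, 12.0)
```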
Following the understanding that localized errors dominate perception [9], PESQ integrates
disturbance over several time-frequency scales using a method designed to take optimal account of the
distribution of error in time and amplitude. The disturbance values are aggregated using an Lp norm,
which calculates a non-linear average using the following formula:
L_p = [ (1/N) · Σ_{m=1}^{N} |disturbance[m]|^p ]^(1/p)
The disturbance is first summed across frequency using an Lp norm, giving a frame-by-frame
measure of perceived distortion. This frame disturbance is multiplied by two weightings. The first
weight is inversely proportional to the instantaneous energy of the reference, raised to the power 0.04,
giving slightly greater emphasis on sections for which the reference is quieter. This process replaces
the silent interval weighting used in P.861. After this, the frame disturbance is bounded with an upper
limit of 45. The second weight gives reduced emphasis on the start of the signal if the total length is
over 16s, modeling the effect of short-term memory in subjective listening. This multiplies the frame
disturbance at the start of the signal by a factor decreasing linearly from 1.0 (for files shorter than 16
seconds) to 0.5 (for files longer than 60 seconds). After weighting, the frame disturbance is averaged
in time over split-second intervals of 20 frames (approx. 320 ms, accounting for the overlap of frames)
using Lp norms. These intervals overlap 50%, and no window function is used. The split second
disturbance values are finally averaged over the length of the speech files, again using Lp norms. Thus
the aggregation process uses three Lp norms – in general with different values of p – to map the
disturbance to a single figure. The value of p is higher for averaging over the split second intervals to
give greatest weight to localized distortions. The symmetric and asymmetric disturbance are averaged
separately.
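The three-stage aggregation described above can be sketched as follows; the p values here are placeholders (PESQ uses different exponents per stage, with a higher p for the split-second intervals):

```python
def lp_norm(values, p):
    """Non-linear Lp average: (1/N * sum(|v|**p)) ** (1/p)."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)

def aggregate(cells, p_freq=2.0, p_split=6.0, p_time=2.0):
    """Three-stage aggregation sketch: across frequency within each frame,
    then over split-second intervals of 20 frames with 50% overlap, then
    over the whole file."""
    frames = [lp_norm(row, p_freq) for row in cells]
    splits = [lp_norm(frames[i:i + 20], p_split)
              for i in range(0, max(len(frames) - 19, 1), 10)]  # 50% overlap
    return lp_norm(splits, p_time)
```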
To train PESQ a large number of different symmetric and asymmetric disturbance parameters were
calculated by using multiple values of p for each of the three averaging stages. A linear combination of
disturbance parameters was used as a predictor of subjective MOS. A further regression is required for
each subjective test to account for context and voting preferences of different subjects, as discussed in
section 3; for calibration a linear mapping was also used at this stage. Parameter selection was
performed for all candidate sets of up to four disturbance parameters. The optimal combination –
giving the highest average correlation coefficient – was found. This enabled the best parameters to be
chosen from the full set of several hundred candidate disturbance parameters. The use of partial
compensation in PESQ, for example in equalizing for gain modulation, avoids the need for using a
large number of parameters to predict quality. A combination of only two parameters – one symmetric
disturbance (dSYM) and one asymmetric disturbance (dASYM) – gave a good balance between
accuracy of prediction and ability to generalize. However, as this low-dimension model depends on
earlier stages to incorporate complex perceptual effects, several design iterations were required.
Coefficients in the auditory transform and disturbance processing were optimized, then the optimal
parameter combination was found, and the process repeated several times. Final training was
performed on a database of 30 subjective tests, giving the following output mapping used in PESQ:
PESQMOS = 4.5 − 0.1 · dSYM − 0.0309 · dASYM

For normal subjective test material the values lie between 1.0 (bad) and 4.5 (no distortion). In cases
of extremely high distortion the PESQMOS may fall below 1.0, but this is very uncommon.
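The output mapping reported above is a simple linear combination; in code, with the coefficients from the text:

```python
def pesq_mos(d_sym, d_asym):
    """PESQ output mapping: MOS = 4.5 - 0.1*dSYM - 0.0309*dASYM."""
    return 4.5 - 0.1 * d_sym - 0.0309 * d_asym
```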
The ITU-T R-Factor (from Psytechnics Society)
The essential information about the ITU-T R Factor has been provided previously. In this part of
Appendix II, some additional general information is provided and discussed.
The E-Model is a planning tool for estimating the overall quality in a telephone network. It was first
submitted to standards bodies in 1993 although its origins date back to the models first developed in
the 1960’s by BT, Bellcore and others. The basic premise for the model is that impairments are always
psychologically additive. Simply put, if network impairments such as noise, echo, delay, codec
performance, jitter, etc. are cleverly added then an overall objective rating of quality or “caller
experience” can be estimated. The basic formula for the E-Model is below.
R = Ro − Is − Id − Ie,eff + A
• R Factor: Overall network quality rating (ranges between 0 and 100)
• Ro: Signal to noise ratio
• Is: Impairments simultaneous to voice signal transmission
• Id: Impairments delayed after voice signal transmission
• Ie,eff (or Ie): Effects of Equipment (e.g. codec)
• A: Advantage factor (attempts to account for caller expectations)
In simple terms, the overall quality (R Factor) is calculated by estimating the signal to noise ratio of a
connection (Ro) and subtracting the network impairments (Is, Id, Ie) that in turn are offset by any
expectations of quality had by the caller (A).
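The basic combination can be evaluated directly; the impairment values in this example are hypothetical, chosen only to illustrate the arithmetic:

```python
def r_factor(ro, i_s, i_d, i_e_eff, a):
    """Basic E-Model combination: R = Ro - Is - Id - Ie,eff + A."""
    return ro - i_s - i_d - i_e_eff + a

# Hypothetical narrowband example: default Ro of 93.2, modest delay
# and codec impairments, zero advantage factor.
r = r_factor(93.2, 0.0, 5.0, 11.0, 0.0)  # -> 77.2
```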
While a network is still on paper, a network planner can use the E Model to estimate its likely quality.
The engineer gathers input information from reference tables, enters it into the E-Model, and
calculates the resulting Transmission Quality Rating (R Factor). The table below shows how R Factor
values may be interpreted.
Some of the inputs to the E Model are complex mathematical formulae which consider various
impairments acting together. These formulae are visually represented below by the Calc Ro, Calc Id,
and Calc Is boxes.
Figure 14. The ITU-T R-Factor
The general approach used for live measurement situations is to measure a limited number of E-Model
parameters while making assumptions for non-measured parameters. In the example below, an
objective speech quality measurement is outputting a Mean Opinion Score (MOS) which is converted
to an Ie value.
Network Assumptions are used on some inputs while an objective speech quality measure is providing
an Ie value. Used in this capacity, an engineer may be able to compare original estimated R Ratings
with actual R ratings achieved in a live situation. Clearly, for this scenario to be beneficial the
objective speech quality measure must be accurate. The E-Model recommendation has various Ie
tables. For planning purposes these provide Ie values for codec combinations as well as recent VoIP
degradations such as packet loss. However, the voice quality of a live VoIP system can be radically
different from the numbers in the ITU tables. A live monitoring system must be able to accurately
measure speech quality rather than use IP network statistics to look up Ie in the tables.
Figure 15. A non-intrusive strategy for measuring objective voice quality (Psytechnics)
The E-Model was designed to provide estimated network quality and has been shown to be reasonably
accurate for this purpose. It has not been accepted as a valid measurement tool for live networks.
Increasingly, and against ITU recommendations, the E-Model is being marketed to the industry as a
live voice quality measurement tool. The ITU-T G.107 Recommendation states at the beginning of the
document that “Such estimates are only made for transmission planning purposes and not for actual
customer opinion prediction (for which there is no agreed-upon model recommended by the ITU-T).”
It also provides a caution with the following paragraph. “The E-Model has not been fully verified by
field surveys or laboratory tests for the very large number of possible combinations of input
parameters. For many combinations of high importance to transmission planners, the E-Model can be
used with confidence, but for other parameter combinations, E-Model predictions have been
questioned and are currently under study. Accordingly, caution must be exercised when using the E-
Model for some conditions; for example, the E-Model may give inaccurate results for combinations of
certain types of impairments. Annex A provides further information in this regard.”
Appendix III: The ITU-T G.729A-VAD, a low bit-rate speech codec (Lucent & Bell
Laboratories)
Overview
G.729, also known as CS-ACELP (Conjugate Structure Algebraic Code Excited Linear Prediction), is
specified by the ITU (International Telecommunications Union). It compresses speech from 16 bit,
8kHz samples (128 kbps) to 8 kbps, and was designed for cellular and networking applications. It
provides "toll quality" speech (that is, as good as the telephone network), works well with background
noise, and has been designed to perform well under error conditions. Together with G.723.1 (a 5.3/6.3
kbps dual-mode speech coder), it is one of the main contenders for the baseline codec for Internet telephony. G.729
fits into the general category of CELP (Code Excited Linear Prediction) speech coders. These coders
are all based on a model of the human vocal system. In that model, the throat and mouth are modeled
as a linear filter, and voice is generated by a periodic vibration of air exciting this filter. In the
frequency domain, this implies that speech looks somewhat like a smooth response (called the
envelope), modulated by a set of discrete frequency components. CELP coders all vary in the manner
in which the excitation is specified, and the way in which the coefficients of the filter are represented.
All of them generally break speech up into units called frames, which can be anywhere from 1ms to
100ms in duration. For each frame of speech, a set of parameters for the model are generated and sent
to the decoder. This implies that the frame time represents a lower bound on the system delay; the
encoder must wait for at least a frame's worth of speech before it can even begin the encode process. In
G.729, each frame is 10ms, or 80 samples, in duration. This frame is further broken into two 5ms
subframes. The filter parameters are specified just once for each frame, but each subframe has its own
excitation specified. It is also important to note that speech can generally be classified into two types:
voiced and unvoiced. Voiced sounds, such as b, d, and g, are generated from the throat, whereas
unvoiced sounds, such as th, f, and sh, are generated from the mouth. The model works better for
voiced sounds, but the excitation can be tailored for voiced or unvoiced so that it works in both cases.
The approach for finding the filter parameters and excitation is called analysis by synthesis. The
encoder searches through the parameter space, performing the decode operation in each loop of the
search. The output of the decoder (the synthesized signal), is compared with the original speech signal.
The parameters which yield the closest match are then chosen, and sent to the decoder. In this fashion,
we have analyzed the signal by repeatedly synthesizing the output of the decoder, and thus the name
analysis by synthesis. First, we discuss how G.729 computes and transmits the filter coefficients. Then
we discuss the excitation.
In G.729, the filter is a 10th order all-pole filter. Since it is used to synthesize voice, it is also called
the synthesis filter. Its inverse, the analysis filter, is an all-zero FIR filter (which we denote by A(z)).
When speech is passed through it, the result is the excitation for that speech. In fact, the analysis filter
can be thought of as a linear predictor, the output of which is the error signal in predicting the speech
from the past 10 samples. With this in mind, the problem of finding the coefficients of the analysis
filter reduces to finding the optimal 10th order linear predictor for a given signal. This problem is well
known, and the solution is a function of the correlation matrix of the speech. The correlation function,
however, is likely to vary over time. In each frame, it is re-measured over a 30ms interval. This
interval consists of 15ms from the past, 10ms from the current frame, and 5ms from the future. Of
course, this look-ahead of 5ms requires the encoder to wait an additional 5ms beyond the 10ms frame
delay. This means that the total encoding delay, also known as the algorithmic delay, is 15ms. Instead
of just computing the correlation directly from those speech samples, a window, called the LP analysis
window, is applied to the samples. The window is half of a Hamming window on one side, and a
quarter cosine cycle on the other side. The curved shape of the window helps emphasize the current
time as opposed to the future or past when computing the correlation function. One further step is
taken before the correlation coefficients are used to generate the filter coefficients. For high pitch
speech signals (such as those from females), the modulation frequency of the spectral envelope is
higher than for lower-pitch signals. Thus, the LP analysis will tend to result in filters which
underestimate the envelope at frequencies between the pitch periods. To resolve this, the correlation
coefficients are multiplied by a Gaussian function. This is equivalent to convolution in frequency of
the spectral envelope by a Gaussian. The result is a widening of the peaks of the spectral envelope,
filling in the gaps. With the correlation function r(k) computed, the 10 LP analysis filter coefficients
can be computed. The optimal coefficients ai, i = 1..10, are the solution to the Yule-Walker equations:

Σ_{j=1..10} aj · r(|i − j|) = r(i),  i = 1..10

This well-known system is readily solved with the Levinson-Durbin algorithm, which defines an iterative
approach to its solution. The next step is to quantize the filter coefficients. However, just quantizing
them directly has several drawbacks. First, it is possible that the quantization noise may move one of
the poles of the synthesis filter outside of the unit-circle, yielding an unstable filter. Secondly, since
human perception of noise is based on frequency components, it is hard to relate the quantization noise
of the coefficients to the noise that will actually be perceived. To resolve this, the coefficients are
transformed into Line Spectral Frequencies, or LSF's. This is done by defining two new polynomials:
F1(z) = A(z) + z^(−11) · A(z^(−1))
F2(z) = A(z) − z^(−11) · A(z^(−1))
The LSF's are defined as the zeroes of these polynomials. These two polynomials have several
important characteristics:
1. Their zeroes lie on the unit circle
2. Their zeroes alternate with each other
3. For any two polynomials defined as above, with their zeroes on the unit circle and alternating, the filter A(z)
is minimum phase, and therefore, its inverse, the synthesis filter, is stable.
4. A change in any LSF causes a change in the shape of the analysis filter only in a small frequency range
around the frequency of that LSF.
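These properties can be checked numerically. A minimal sketch using a hypothetical stable 10th-order A(z) (the pole locations are illustrative, not G.729 data):

```python
import numpy as np

# Hypothetical stable A(z): five conjugate pole pairs at radius 0.6.
angles = np.pi * np.array([0.1, 0.25, 0.4, 0.55, 0.7])
poles = np.concatenate([0.6 * np.exp(1j * angles), 0.6 * np.exp(-1j * angles)])
A = np.real(np.poly(poles))                 # A(z) coefficients, A[0] = 1

# F1(z) = A(z) + z^-11 A(z^-1),  F2(z) = A(z) - z^-11 A(z^-1)
A_pad = np.concatenate((A, [0.0]))
A_rev = np.concatenate(([0.0], A[::-1]))    # coefficients of z^-11 A(z^-1)
F1, F2 = A_pad + A_rev, A_pad - A_rev

z1, z2 = np.roots(F1), np.roots(F2)
assert np.allclose(np.abs(z1), 1.0, atol=1e-6)   # property 1: unit circle
assert np.allclose(np.abs(z2), 1.0, atol=1e-6)
assert abs(np.polyval(F1, -1.0)) < 1e-9          # fixed zero of F1 at z = -1
assert abs(np.polyval(F2, 1.0)) < 1e-9           # fixed zero of F2 at z = +1
```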
Because of property 3, the decoder can easily verify stability of the filter by making sure the zeroes are
on the unit circle and alternate. Property 4 allows the quantization of the LSF to relate to the frequency
response of the synthesis filter. To reduce the bandwidth, the encoder and decoder predict the values
of the LSF's via a 4th order moving average. Two predictors are possible; the encoder chooses which
one to use and indicates it with a bit in the bitstream. After prediction, the prediction error is
computed. This error is sent to the decoder by vector quantizing it. The vector quantization proceeds in
two stages. In the first, a 10-dimensional codebook (recall there are 10 coefficients) containing 128
entries is searched, and the "best" one is chosen. "Best" is defined here as the entry which results in the
minimum mean square error between the correct LSF's and their quantized versions. This 10-
dimensional vector is then subtracted from the original LSF's. The resulting 10 dimensional difference
is split into two 5 dimensional vectors. The best match for the first vector is found (best here is defined
as minimizing a weighted m.s.e) from a second codebook, and the best match for the second vector is
found from a third codebook. This second codebook is 5-dimensional, and contains 32 entries, as does
the third. This two stage structure is called a conjugate structure, and represents the CS in the codec's
name. Note that 7+5+5=18 bits are needed for the vector quantization, and another bit to specify
which moving average function is used. In the decoder, the LSF's are received and decoded. However,
for the first subframe in the frame, the LSF's are interpolated as the average of the LSF's for the
current and previous frames. The second subframe uses the LSF's received for the current frame.
The next step is to compute the excitation. This is done separately for each subframe. In each, the
excitation is represented as the sum of two components. The first is a delayed version of the excitation
used so far, and the second is a signal with four impulses at various positions. The first component is
called the adaptive codebook contribution, and it models the periodicity in the speech. Therefore, this
delay is actually the pitch delay in the speech signal. The first step in the process is to compute the
pitch delay. This is done by computing the autocorrelation of the speech (weighted to emphasize
various frequency characteristics), and finding the least maximum. Searching for the least maximum
ensures that multiples of the pitch delay are not used. This is called an open- loop pitch analysis. With
this quantity, a search is done in a region around the open loop pitch delay to find the best pitch. Best
is defined by filtering the previous excitation (delayed by the appropriate amount) through the LP
synthesis filter. The result is correlated with the actual speech signal, and divided by the magnitude of
the output of the synthesis filter (thus the gain is eliminated from the search). The delay which
maximizes this quantity is chosen. The gain is then computed directly for the optimal excitation. The
output of the synthesis filter using the optimally delayed and amplified excitation is then subtracted
from the desired speech signal, and the difference, called the target, is then used to find the second part
of the excitation. The second part of the excitation is referred to as the fixed codebook contribution.
The excitation consists of four impulses. Each impulse has an amplitude of either plus or minus one,
and can sit at a fixed set of positions (the set of positions is different for each impulse). These pulses
are then filtered through a simple harmonic filter. A search is done, first identifying the ideal
amplitudes (plus or minus one), and then the positions. As before, the search is executed by filtering
the excitation through the synthesis filter, and computing the product of the result with the target. This
is then divided by the energy in the output of the synthesis filter (again, eliminating the gain from the
search), resulting in the search metric. The set of amplitudes and positions which maximize this metric
are chosen. Finally, the gain is computed directly. For each subframe, a number of parameters have
now been computed: the pitch delay, the adaptive codebook gain, the fixed codebook excitation
(consisting of impulse positions and signs), and the fixed codebook gain. These parameters are then
quantized and sent to the decoder. The pitch delay is directly represented with 8 bits in the first
subframe. In the second subframe, the pitch delay is sent as the difference from the pitch delay in the
first subframe. This requires 5 bits. The fixed codebook contribution is also sent directly, using 4 bits
for the signs and 13 bits for the positions. What remains are the gains. The fixed codebook gain is
predicted from previous frames, and a multiplicative gain factor to compensate for the prediction error
is transmitted. The gain factor and fixed codebook gain are jointly vector quantized using a two stage
vector quantization process. The first stage consists of a 3 bit two dimensional codebook, and the
second stage consists of a 4 bit two dimensional codebook. The sum of the two codewords is used to
represent the gain factor and fixed codebook gain.
Once the decoder receives and reconstructs the speech signal, it applies post processing to clean it up.
The post processing consists of four components:
1. A long term postfilter, denoted Hp(z)
2. A short term postfilter, denoted Hf(z)
3. A Tilt compensation filter, denoted Ht(z)
4. A gain compensation factor, denoted g
The long term postfilter is constructed from the decoded gain and pitch delay parameters. Its basic
function is to emphasize the speech signal in frequency bands around multiples of the pitch period.
The filter is therefore constructed as a 1st order all-zero filter, with its peaks precisely at multiples of
the pitch period. The short term postfilter is designed to emphasize the formants, which are frequency
bands of energy present in the synthesis filter. The postfilter is therefore derived from the synthesis
filter, but with its peaks expanded to make them more prominent. In a speech signal, tilt is defined
as the general slope of the energy of the frequency domain. The tilt compensation filter attempts to
adjust for distortions in this quantity caused by the short term postfilter. Finally, the gain
compensation factor is simply the ratio of the energy in the unfiltered decoder output to the energy
in the postfiltered output. It restores the original signal strength to the speech.
The following table lists all of the bits (total: 80) which are placed in the bitstream:

Parameter Name                                 Number of Bits
Switched MA predictor of LSF                         1
First-stage LSF VQ                                   7
Second-stage VQ, first half                          5
Second-stage VQ, second half                         5
Pitch delay, first subframe                          8
Parity bit for pitch delay                           1
Fixed codebook for first subframe                   13
Signs of fixed codebook for first subframe           4
Gain codebook, stage 1, for first subframe           3
Gain codebook, stage 2, for first subframe           4
Pitch delay, second subframe                         5
Fixed codebook for second subframe                  13
Signs of fixed codebook for second subframe          4
Gain codebook, stage 1, for second subframe          3
Gain codebook, stage 2, for second subframe          4
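As a quick consistency check, the allocation above sums to the 80 bits of one 10 ms frame (i.e., 8 kbps):

```python
# The 80-bit G.729 frame layout from the table above.
G729_BITS = {
    "switched MA predictor of LSF": 1,
    "first-stage LSF VQ": 7,
    "second-stage VQ, first half": 5,
    "second-stage VQ, second half": 5,
    "pitch delay, subframe 1": 8,
    "parity bit for pitch delay": 1,
    "fixed codebook, subframe 1": 13,
    "fixed codebook signs, subframe 1": 4,
    "gain codebook stage 1, subframe 1": 3,
    "gain codebook stage 2, subframe 1": 4,
    "pitch delay, subframe 2": 5,
    "fixed codebook, subframe 2": 13,
    "fixed codebook signs, subframe 2": 4,
    "gain codebook stage 1, subframe 2": 3,
    "gain codebook stage 2, subframe 2": 4,
}
assert sum(G729_BITS.values()) == 80  # 80 bits / 10 ms frame = 8 kbps
```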
Annex B
G.729 has an optional annex, Annex B, which specifies the use of silence suppression and comfort
noise generation. In typical speech, only one person talks at a time. Therefore, speech consists of
periods of talking (called talkspurts), followed by periods of silence. Additional compression can be
achieved by discovering the silence periods. Older approaches would send either nothing for the
silence periods, or would send a simple energy value, which the decoder would use to insert white
noise. However, in environments with loud and nonstationary background noise, both approaches are
inadequate. The algorithm specified in Annex B [7] operates by first making a Voice Activity
Detection (VAD) decision in each frame. The decision is made by keeping a running average of four
quantities:
1. The LSF's during silence periods
2. The full band energy in the speech signal (computed as the logarithm of the first autocorrelation coefficient)
during silence periods.
3. The low band energy in the speech signal (computed by filtering the autocorrelation coefficients), during
silence periods.
4. The rate of zero crossing of the signal, during silence periods.
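A sketch of a decision built on these four running averages follows; the thresholds are hypothetical (the actual Annex B decision is taken over a multi-boundary region in this parameter space, not four independent thresholds):

```python
# Hypothetical decision thresholds on the four running-average differences.
THRESH = {"lsf": 0.05, "energy": 3.0, "low_energy": 2.5, "zero_cross": 0.1}

def vad_decision(params, noise_avg, beta=0.9):
    """Annex-B-style VAD sketch: flag the frame as speech when any of the
    four parameters deviates too far from its silence running average;
    otherwise update the averages toward the new silence frame."""
    active = any(abs(params[k] - noise_avg[k]) > THRESH[k] for k in THRESH)
    if not active:
        for k in noise_avg:
            noise_avg[k] = beta * noise_avg[k] + (1 - beta) * params[k]
    return active
```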
In each frame, the above parameters are extracted, and compared with the running averages.
Depending on the magnitudes of the differences for the various parameters, an activity decision is
made. Furthermore, the running averages are updated if the parameters in the current frame are less
than the running averages. The decision itself (speech or silence) is filtered, using the past two frames'
parameters and decisions as inputs. This ensures that sufficient hangover (i.e., speech transmission
just after the end of a talkspurt) is present. If the decision for the current frame is silence, the next step
is to decide whether to send a Silence Insertion Description frame (SID), or to send nothing (a null
frame). The SID frames contain a small amount of information which allow the decoder to generate
comfort noise. They consist of an excitation energy (5 bits), and the prediction error for the LSF
coefficients, as in G.729 (10 bits). The SID frames need only be sent when the parameters of the
background noise have changed since last transmitted. The decision is made by the encoder in any
way it likes, generally by comparing filter coefficient and energy changes to some thresholds. Note
that the bitstream does not contain any information about which of the three frame types are present
(speech, SID, or null). This information must either be sent out of band, or can be extracted from the
size of the frame (80, 15, or 0 bits, respectively). Since G.729 was developed for environments such as
cellular and data networks, an algorithm has been specified for concealing the loss of a frame. A frame
is lost when the network layer indicates sufficient bit errors in the frame, or when the frame never
arrives at all (due to a packet loss in the Internet, for example). When this happens, all of the
parameters in the packet are interpolated from parameters from the previous frame. In particular:
1. The LSF parameters for the current frame are repeated from the previous frame.
2. The adaptive and fixed codebook gains are taken from the previous frame, but are attenuated to gradually
reduce their impact.
3. The excitation depends on the classification of the previous frame as voiced or unvoiced. If the previous
frame was voiced, the fixed codebook contribution is set to zero, and the pitch delay is taken as the same as
the previous frame. If the previous frame was unvoiced, the adaptive codebook contribution is set to zero,
and the fixed codebook contributions are selected randomly.
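The three interpolation rules above can be sketched as follows; the field names and attenuation factor are illustrative, not the bitstream syntax:

```python
ATTEN = 0.9  # hypothetical per-frame gain attenuation factor

def conceal_frame(prev):
    """Sketch of the erased-frame interpolation described above."""
    frame = {"lsf": prev["lsf"],                              # 1. repeat LSFs
             "adaptive_gain": prev["adaptive_gain"] * ATTEN,  # 2. attenuate gains
             "fixed_gain": prev["fixed_gain"] * ATTEN,
             "voiced": prev["voiced"]}
    if prev["voiced"]:                                        # 3. voiced frame
        frame["fixed_contrib"] = 0.0
        frame["pitch_delay"] = prev["pitch_delay"]
    else:                                                     # unvoiced frame
        frame["adaptive_contrib"] = 0.0
        frame["fixed_contrib"] = "random"  # positions and signs drawn randomly
    return frame
```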
The effect of this interpolation will be to introduce errors into the decoded speech signal, both for the
frames which are erased, and for subsequent correctly received frames, due to the divergence of
encoder and decoder state. Unfortunately, quite a bit of state is maintained in the decoder, including:
1. The 4th order MA predictor filter memories for the LSF's.
2. The past excitation signal
3. The fixed codebook energies for the past four frames, which are used to predict the fixed codebook gain
4. The adaptive codebook gain from the previous frame, which is used to generate the harmonic filter used on
the fixed codebook excitation.
5. The synthesis filter memories (10th order)
Appendix IV: Other Research Fields and Articles during the Ph.D. period
“Speech Playout Buffering Based on a Simplified Version of the ITU-T E-Model”, IEEE SPL 03/2004

In multimedia real-time streaming over packet networks, the problem of transmission delay variations
between packet arrivals is frequently addressed with de-jitter buffers on the receive side. This
introduces an additional delay but it also provides a better sense of smoothness in the playout of the
output packet stream. Most of the proposed algorithms for the control of the playout buffer are based
on adaptive approaches. In particular, in [2], mean and variance values for packet delays are estimated
mainly with recursive linear filtering. The linear filter is used to adjust the total delay as a function of
the most recently observed values and works for every received packet. The buffer is then set to a size
so that only a small fraction of the arriving packets should be lost due to late arrival. In [3],
information about a consistent amount of received packets is instead used to construct a histogram to
approximate a packet delay distribution (PDD) that is dynamically updated over time. Based on that
histogram, the buffer dimension is set to the minimum value that will smooth out network jitter so that
the stream’s requirement of maximum late packets percentage and maximum acceptable delay are
satisfied, if possible.
In this letter, we focus on the problem of setting the buffer dimension that simultaneously affects the
packet end-to-end delay and the packet loss. To solve this problem we propose to use a perceptually
motivated optimality criterion that allows the receiver to automatically balance packet delay versus
loss. In the proposed approach, the de-jitter buffer size is adaptively set and the adopted criterion relies
on the use of a simplified version proposed by Cole and Rosenbluth in [1] of the conversational
quality ITU-T E-Model [4]-[5]. The computation of the optimal buffer size is presented in the next
section; section three presents the use of these results in a voice de-jitter buffering framework; and the
performance of this approach is reported in the last section.
The E-Model is a computational paradigm defined by the ITU-T to assess the combined effects of
variations in several transmission parameters that affect the conversational quality. This model was
designed for planning purposes and not for actual customer opinion prediction, though the E-Model
output, the R Factor, can be mapped to estimates of customer opinion, such as the Mean Opinion
Score (MOS) [5]. Several impairment factors are considered and weighted in order to obtain an overall
index correlated with the speech transmission quality: the simultaneous impairment factor (Is), a
function of the signal-to-noise impairments associated with Switched Circuit Network paths; the delay
impairment factor (Id), which includes all delay and echo effects; the equipment impairment factor
(Ie), which models impairments caused by low bit-rate codecs; and the expectation factor (A). The
output of the algorithm is the R Factor that is a scalar value defined as a linear combination of the
cited components [4]:
R = 100 − Is − Id − Ie + A   (1)
This model is applicable only if constant buffers are used during each conversational unit without
pauses (talkspurt). This point is quite important for the proposed algorithm.
Despite the apparent simplicity of (1), the computation of each impairment factor is very complex
making the model unusable in a practical context. Based on this, some studies have been carried out to
make the model operative. In particular, in [1] Cole and Rosenbluth introduce some simplifications of
the model when the 8Kbps ITU-T G.729-A speech codec [6] is used. In the following, we present
these simplifications and relevant equations (2)-(6) so as to better understand the assumptions under
which this model holds. Since Is is not a function of the underlying packet network and the aim of
Cole and Rosenbluth was to analyze the impairments introduced by a packet network, in [1] such a
factor was set according to the G.729-A default value, which led to setting (100 − Is) equal to 93.2. The
Id factor is a function of several average delay components within the end-to-end “signal paths”. In
VoIP connections without circuit switched networks, Id becomes a function of only the single one-way
mouth-to-ear delay measurement d2ee (in milliseconds):

Id = 0.024 · d2ee + 0.11 · (d2ee − 177.3) · F(d2ee − 177.3)   (2)
where F(x) is the step function (F(x) = 0 if x < 0, else F(x) = 1). The impairment factor Ie for
the G.729-A+VAD codec, in the case of random packet losses and using the native G.729-A packet
loss concealment algorithm, is obtained from [4] and Table I.2 in [7] as a function of the total packet
loss ratio e2ee:

Ie(G.729-A+VAD, random) = 11 + 40 · ln(1 + 10 · e2ee)   (3)
Equation (3), proposed in [1], provides very similar results to those obtainable from the new
expression of ( )eeee eII 2= in [4] and the new Appendix I/G.113 [8]. The suggested value of A for
wirebound connections is zero. The eed 2 and eee 2 components in (2) and (3) are defined as follows:
d_e2e = d_cod + d_dejitter + d_net ;   e_e2e = e_net + (1 − e_net) · e_dejitter   (4)

Equation (4) defines d_e2e and e_e2e as combinations of the delays (d) and losses (e) due to speech processing, to network impairments, and to de-jitter buffering. Among these factors, only d_dejitter and e_dejitter are directly affected by de-jitter buffer management operations. These two variables are
strongly correlated; to find this relationship, the following expression has been proposed in [1]: d_dejitter = b · g, where b represents the number of packets to be buffered to compensate for the jitter, and g represents the average packet inter-arrival time. The late packet ratio is then computed from the probability of receiving a packet after an interval l, with respect to the previous packet, greater than d_dejitter, hence e_dejitter ≅ P(l > b · g).
By applying Chebyshev's inequality to e_dejitter, Cole and Rosenbluth obtained:

e_dejitter ≅ P(l − g > b · g − g) < σ_l² / (b · g − g)² = σ_l² / (g² · (b − 1)²)   (5)
where σ_l is the standard deviation of the packet inter-arrival time. Equation (5) is an approximation that links the late packet ratio to the adopted buffer dimension. On applying these simplifications to (1), the following expression is obtained:
R = 93.2 − 0.024 · (d_cod + b · g + d_net)
    − 0.11 · (d_cod + b · g + d_net − 177.3) · F(d_cod + b · g + d_net − 177.3)
    − 11 − 40 · ln( 1 + 10 · ( e_net + (1 − e_net) · σ_l² / (g² · (b − 1)²) ) )   (6)
Equation (6) expresses the speech transmission quality as a function of the buffer dimension b. Once the network statistics g, σ_l, d_net, and e_net for future conversational units are predicted, this expression can be maximized to find the optimal buffer value b, as described in the following section.
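Equation (6) maps directly into a few lines of code. The following is an illustrative sketch only: the function name and the default codec delay d_cod = 25 ms (two bundled 10 ms G.729-A frames plus look-ahead) are our assumptions, not values fixed by the model.

```python
import math

def r_factor(b, g, sigma_l, d_net, e_net, d_cod=25.0):
    """Simplified E-Model R Factor of Eq. (6) as a function of the
    buffer dimension b (packets); delays in ms, losses as ratios."""
    d_e2e = d_cod + b * g + d_net                      # delay part of Eq. (4)
    e_dejitter = sigma_l**2 / (g**2 * (b - 1)**2)      # Chebyshev bound, Eq. (5)
    e_e2e = e_net + (1.0 - e_net) * min(e_dejitter, 1.0)
    i_d = 0.024 * d_e2e                                # Eq. (2)
    if d_e2e > 177.3:                                  # step function F(x)
        i_d += 0.11 * (d_e2e - 177.3)
    i_e = 11.0 + 40.0 * math.log(1.0 + 10.0 * e_e2e)   # Eq. (3)
    return 93.2 - i_d - i_e
```

For example, with g = 20 ms, σ_l = 5 ms, and d_net = 50 ms, the late-loss impairment shrinks with b while the delay impairment grows, which is exactly the trade-off the playout strategy exploits.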
It is important to note that maximizing the R Factor does not guarantee maximization of the user-perceived quality. However, under certain important (and the most common) conditions, the correlation of the R Factor with the MOS quality index has been verified for several quality levels: from "user very satisfied" to "nearly all users dissatisfied". Annex A of [4] lists the situations where the model validity has not been completely verified. Additionally, the assumptions used by Cole and Rosenbluth need to be taken into account when using this simplified version: no circuit-switched network interworking, no echo, randomness of packet losses, and use of the native G.729-A concealment
algorithm. Using different conditions would influence the final R index. The assessment and enhancement of the E-Model are under study by the ITU (ITU-T SG 12).
Buffer adjustment between talkspurts has the advantage of producing a smoother playout with respect to continuously updating approaches. This is the approach adopted in the proposed algorithm, where the buffer is tuned during conversational pauses by maximizing the expected future transmission quality. The following operations are performed: during a talkspurt, information about occurred packet losses, inter-arrival times, and transmission delays is stored; during the silence period, the variables g, σ_l, d_net, and e_net are estimated from the previously stored information; at the beginning of a new talkspurt, the values obtained in the previous step are introduced in (6), and this expression is maximized to obtain the optimal buffer dimension (b_opt); finally, for the new talkspurt the buffer is set in accordance with d_dejitter = b_opt · g. The proposed algorithm is independent of any specific network-statistics estimation approach, and different well-known techniques can be used for this purpose. In particular, we implemented three different E-Model based (EM) algorithms:
• EMv1: g and σ_l are set to the mean and standard deviation computed for the packets belonging to the last talkspurt. Likewise, d_net and e_net are taken equal to the mean values over this period.
• EMv2: g, σ_l, and d_net are computed using the prediction algorithm (linear filter) described in [2]. e_net is computed as in the previous algorithm.
• EMv3: this algorithm is instead based on a histogram-based probability density function (PDF) approach, as proposed in [3]. Essentially, two different PDFs are used: one for the network delay, the other for the packet inter-arrival time. One PDF is used to compute g and σ_l, while the other is used to compute d_net. e_net is set equal to the loss ratio experienced during the same period used to construct the PDFs.
EMv1 is the simplest and least accurate algorithm. Nevertheless, it has the advantage of working automatically, without any input parameter setting. This is not true for the other two algorithms, which require the selection of appropriate values for a certain set of input parameters to work correctly. The computation of b_opt is performed using a bisection strategy to guarantee fast convergence while searching for the maximum value of the R function.
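The per-talkspurt procedure above can be sketched as follows. The thesis computes b_opt with a bisection strategy; for clarity, this illustrative sketch simply scans a bounded range of integer buffer sizes (the R(b) of Eq. (6) is repeated inline for self-containment, and d_cod = 25 ms is an assumed packetization delay):

```python
import math

def r_factor(b, g, sigma_l, d_net, e_net, d_cod=25.0):
    # R(b) of Eq. (6): delay impairment grows with b, late-loss impairment shrinks.
    d_e2e = d_cod + b * g + d_net
    e_e2e = e_net + (1.0 - e_net) * min(sigma_l**2 / (g**2 * (b - 1)**2), 1.0)
    i_d = 0.024 * d_e2e + (0.11 * (d_e2e - 177.3) if d_e2e > 177.3 else 0.0)
    return 93.2 - i_d - (11.0 + 40.0 * math.log(1.0 + 10.0 * e_e2e))

def optimal_buffer(g, sigma_l, d_net, e_net, b_max=100):
    """Pick the b maximizing R(b) over 2..b_max; a bisection on the
    derivative of R, as used in the thesis, converges faster."""
    return max(range(2, b_max + 1),
               key=lambda b: r_factor(b, g, sigma_l, d_net, e_net))
```

At the start of the new talkspurt, the buffer delay would then be set to d_dejitter = b_opt · g.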
The performance of the proposed approach was analyzed by applying the devised playout algorithms to several traffic traces, each lasting an average of 10 minutes. Two H.323 hosts were used during the experiments, using the G.729-A+VAD codec with the native error concealment and sending two 10 ms voice frames in every UDP packet.
The packet losses after de-jittering were analyzed in order to verify that the probability of loss of a packet was independent of the probability of loss of any other. This analysis was performed using the statistical chi-squared test, by considering the random variable K associated with the length of packet loss bursts. In case of random packet loss, the variable K should be distributed according to a geometric PMF (Probability Mass Function), and the chi-squared test was used to verify this hypothesis (hypothesis H0). For each experiment, we computed the chi-squared value and the associated probability P_χ² of obtaining a chi-squared value equal to or greater than that value by chance alone. We were then able to reject or not reject H0 depending on whether P_χ² was smaller or not than a given significance level, usually selected equal to 1% or 5%. An additional investigation was
performed by modeling the packet loss process with a Gilbert model, as done in [9]. This model is based on the unconditional packet loss probability P_u and the conditional packet loss probability P_c. The distance between P_u and P_c indicates the deviation of the packet loss process from a memoryless Bernoulli process, which is characterized by these two probabilities being equal. We show the results of these two investigations for four of the performed experiments in Table VII. The chi-squared tests prove that H0 has to be rejected for experiments I and IV, and that it cannot be rejected for experiments II and III at a significance level of 1%. These results are also in accordance with the computed distances between P_u and P_c, which are quite low for experiments II and III and significantly high for the others. As to the other performed experiments, most of them were characterized by a bursty behavior. This result highlights the need for an expression of I_e in the case of bursty losses in order to correctly use the E-Model for the quality evaluation of IP Telephony in the Internet. Based on this analysis, the proposed algorithm can be applied only to experiments II and III, for which we present the results in this Section.
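The Gilbert-model check reduces to two counts over the post-de-jittering loss trace. A minimal sketch (the function name is ours; a 1 marks a lost packet):

```python
def gilbert_probs(loss):
    """Estimate P_u (unconditional loss probability) and P_c
    (probability of a loss given the previous packet was lost)."""
    p_u = sum(loss) / len(loss)
    # Outcomes that immediately follow a lost packet:
    after_loss = [b for a, b in zip(loss, loss[1:]) if a == 1]
    p_c = sum(after_loss) / len(after_loss) if after_loss else 0.0
    return p_u, p_c
```

For a memoryless Bernoulli process P_c ≈ P_u; a large gap between the two, as in experiments I and IV, signals bursty losses.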
The performance of the proposed playout approach is compared with that of the cited works. In
particular, for the linear filter, we used Algorithm 1 in [2] (Linear Filter, α = 0.998002). The Concord algorithm in [3] was tested with the default parameter values: expected d_e2e (i.e., the ted) recalculated at each arriving packet, a histogram with 1 ms bin width, aging every 1000 packets with F = 0.9, and the maximum late packet ratio (mlp) set to 0.01. Figure 16 compares the proposed EMv1 technique with the alternative approaches in terms of the R Factor for the third experiment. The graph shows the R Factor computed at the end of each talkspurt from the obtained packet loss and delay, using the exact R Factor formula instead of the approximation in (6). The graph shows that our strategy generally obtained higher values, despite the presence of spikes. The mean, maximum, and minimum quality values are presented in Table VIII, together with the resulting d_e2e values. The results for EMv2 and EMv3 are also presented. We can observe that the proposed algorithms EMv1 and EMv3 achieve an average R Factor almost 5 points higher than the alternative solutions, with comparable total end-to-end delays. The maximum achievable values are also provided, computed under the assumption of knowing exactly the network statistics for that talkspurt instead of predicting them.
As to the second experiment, we present the final results directly in the second part of Table VIII. This experiment was characterized by zero network packet loss and an average network delay of 138.41 ms. We obtained similar results for the Linear Filter and Concord solutions, while with the EMv1 and EMv3 approaches very good results were achieved, due to the absence of network losses. These two approaches outperformed the alternative ones by about 14 points. The obtained quality values were quite close to the maximum achievable ones.
[1] R. Cole and J. Rosenbluth, "Voice over IP performance monitoring," ACM Computer Communication Review, vol. 31, no. 2, Apr. 2001.
[2] R. Ramjee, J. Kurose, D. Towsley, and H. Schulzrinne, "Adaptive playout mechanisms for packetized audio applications in wide-area networks," in Proc. IEEE Infocom, vol. 3, June 1994.
[3] C. J. Sreenan, J.-C. Chen, P. Agrawal, and B. Narendran, "Delay reduction techniques for playout buffering," IEEE Transactions on Multimedia, vol. 2, no. 2, pp. 88-100, June 2000.
[4] ITU-T Recommendation G.107, "The E-Model, a computational model for use in transmission planning," 03/2003.
[5] ITU-T Recommendation G.108, "Application of the E-Model: A planning guide," 09/1999.
[6] ITU-T Recommendation G.729-A, "Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP)," 03/1996.
[7] ITU-T Recommendation G.113, "Transmission impairments due to speech processing," 02/2001.
[8] ITU-T Recommendation G.113 Appendix I, "Provisional planning values for the equipment impairment factor Ie and packet-loss robustness factor Bpl," 05/2002.
[9] W. Jiang and H. Schulzrinne, "Modeling of packet loss and delay and their effect on real-time multimedia service quality," in Proc. International Workshop on Network and Operating Systems Support for Digital Audio and Video, June 2002.
Figure 16: Comparison of the proposed EMv1 algorithm with the Linear Filter and Concord
algorithms for the third experiment.
Table VII: P_χ², P_u, and P_c for four of the performed experiments after applying the EMv1 algorithm.

                 P_χ² (%)   P_u (%)   P_c (%)
Experiment I       <1        2.36      20.39
Experiment II      >1        0.88       1.05
Experiment III     >5        1.88       1.90
Experiment IV      <1        2.09       5.65
Table VIII: Results in terms of d_e2e and the R Factor for the three proposed algorithms (EMv1, EMv2, and EMv3) and the two comparing approaches (Linear Filter and Concord). The potential maximum R Factor is also presented.

                            d_e2e (msec)             R Factor
Algorithm                Min     Max     Mean     Min    Max    Mean
III Experiment
  Linear Filter         190.7   318.3   250.3    50.5   80.1   57.1
  Concord               156.1   207.1   179.9    50.0   77.0   58.8
  EMv1                  129.7   277.4   212.4    52.1   80.1   63.6
  EMv2                   82.6   247.4   147.3    46.8   65.1   56.7
  EMv3                  128.8   247.2   183.3    51.1   81.2   63.2
  Max R Factor            -       -       -      54.0   82.1   66.8
II Experiment
  Linear Filter         220.1   517.8   333.9    51.3   73.2   61.2
  Concord               167.0   275.0   212.0    52.2   77.7   62.4
  EMv1                  165.5   303.9   201.5    62.5   79.0   76.3
  EMv2                  140.3   311.7   189.6    58.7   80.0   73.4
  EMv3                  137.9   277.4   182.2    65.8   80.0   77.2
  Max R Factor            -       -       -      66.7   80.9   78.2

Audio Watermarking: "An Audio Patchwork Shaping Framework with Psychoacoustic Model 2", accepted and presented at WIAMIS 2004 (Lisboa, 04/04)

Recent years have been characterized by a growing diffusion of digital audio content and, consequently, by a growing need for copyright and ownership protection. Watermarking
techniques represent a good solution to this need: a mark is suitably inserted into a host signal in such a way that its ownership can be proved. Many strategies have been presented in the recent past for this purpose. Several of these techniques inherited their core from image watermarking; in general, however, this legacy was not always viable, due to the differences in sensitivity and perception between the human ear and eye. A set of basic features for a reliable watermarking strategy was presented in [2]. Two characteristics are most significant and apparently contradictory: inaudibility and robustness to signal processing. Inaudibility means that the differences between the original and the watermarked signal should not be perceivable by the human ear. Secondly, the watermark must be robust against intentional or unintentional attacks. One of the most impairing attacks is signal processing, and specifically lossy compression. Such compression guarantees enhanced portability of digital information, but can have an undesirable effect on the embedded watermark. Our strategy was developed with constant reference to these two features. In this paper we refer to an adaptive version of the patchwork algorithm. Patchwork was originally presented in [1] for image watermarking. The original implementation of this technique presents several limitations when applied to audio samples. Quite a few adaptations have been proposed to considerably improve its performance ([2]-[5]). From these studies, the working domain and the adaptive patch shaping emerge as the key points both for applying the original strategy to audio samples and for improving it. The proposed strategy builds on an assumption introduced in [1]: treating patches of several points has the effect of shifting the noise to low frequencies, where it has a lower probability of being filtered out by lossy compression techniques. The right dimension is fixed by comparing the spectrum of the watermarked signal to the minimum masking threshold, obtained from psychoacoustic model 2 [6]. The patch shaping is performed in the Fourier domain. The proposed technique is applied to audio samples and compared with the adaptive-patchwork state of the art, referring to the framework proposed in [7]. The patchwork shaping framework shows particularly good results in terms of robustness to compression and quality. The paper is organized as follows. Section 2 presents the state of the art of adaptive patchwork algorithms. Section 3 introduces the watermark shaping with respect to the threshold of audibility. Section 4 illustrates our technique. Section 5 presents tests and results, while in Section 6 the conclusions are drawn.
The patchwork strategy is a two-set method, that is, it makes two sets drawn from a host signal differ [4]. This difference is used to verify, or not, a hypothesis H0 (e.g., the watermark is embedded).
Figure 17: Distribution of the mean difference of the samples in un-watermarked and watermarked signals.

The original strategy [1] is applied to sets with more than 5,000 elements. The samples of each subset
are considered uniformly distributed and with equal mean values. The elements are modified by adding and subtracting the same quantity d. Thus, the detection of a watermark is related to the condition:

E[A_marked − B_marked] = 2d

Several of these statements must be reassessed when working with audio samples [2]. In particular, the distribution of the sample values is assumed to be normal (see Figure 17). Recent approaches modify the original strategy to better take into account the human ear's sensitivity to noise interference. These methods can be classified into temporal and spectral approaches, depending on the domain where the watermark is embedded. In [5] a technique is proposed that is based on the transformation of time-domain data: a set of N samples, corresponding to 1 s of stereo audio signal, is modified by a watermark signal w(i). [2]-[4] propose spectral patchwork approaches. In particular, [2] works with a dataset of 2N Fourier coefficients. The relationship between d and the elements of the dataset is multiplicative. The parameter d is adaptively chosen to prevent perceptual audibility, based on the characteristics of the audio signal (i.e., the concept of a power density function in the hypothesis tests is introduced for the first time).
In [3] the patchwork algorithm is applied to the coarsest wavelet coefficients, providing fast synchronization between watermark embedding and detection. In [4], the Modified Patchwork Algorithm (MPA) is presented. This approach is very robust thanks to three attributes: the factor d is evaluated adaptively, based on the sample mean and variance; the patch size in the transformed domain is very small, which guarantees good inaudibility; finally, a sign function is used to enhance the detection rate.

These features are included in an embedding function, so that the distance between the sample means of the two sets becomes larger than a certain value d. The temporal approaches are easier to implement than the spectral ones; at the same time, they present several weaknesses against general signal processing modifications [3].
The association between a watermarking algorithm and a noisy communication system is not new [8]. In fact, a watermarking strategy adds a mark (i.e., the noise) to a host signal (i.e., the communication channel). In this sense, watermark embedding can be considered an operation of channel coding: the watermark is adapted to the characteristics of the transmission channel (i.e., the host signal in which the watermark should be embedded). In the case of audio content, what is usually considered an impairment, namely the sensitivity of the human ear, can be used as a way to spread and dimension the watermark. The human auditory system (HAS) is well characterized: it is most sensitive to specific frequencies (i.e., from 2 kHz to 4 kHz) and reacts to specific events (i.e., frequency and temporal masking). Given a signal S, it is possible to recover its minimum masking threshold. The minimum masking threshold of audibility represents the limit between audible and inaudible signals for S at different frequencies. Independently of S, it is also possible to recover the absolute threshold of hearing (ATH). This curve (referred to as the quiet curve [9]) is different from the previous one and defines the intensity, expressed in decibels (dB), required for a single sound to be heard in the absence of any other sound [10].
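The ATH admits a well-known analytic approximation (due to Terhardt, and commonly used in the psychoacoustic-modelling literature). The sketch below is a generic illustration of that approximation, not the exact curve computed by psychoacoustic model 2:

```python
import math

def ath_db(f_hz):
    """Approximate absolute threshold of hearing in dB SPL (Terhardt)."""
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f**-0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3)**2)
            + 1e-3 * f**4)
```

The curve dips to its minimum (slightly negative dB SPL) around 3-4 kHz, matching the frequency range where the HAS is most sensitive.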
Several methods outside the patchwork family have been proposed that make use of psychoacoustic models to guarantee perceptual inaudibility of the mark [11], [9], [12]. Usually, state-of-the-art methods shape the watermark referring mainly to the quiet curve: the filtered watermark signal is scaled in order to embed the watermark noise below the quiet curve [13]. In addition, other methods increase the noise energy of the watermark by referring directly to the minimum masking threshold. Such a threshold can be recovered through a well-defined psychoacoustic model.
The MPEG/audio standard provides two example implementations of the psychoacoustic model. Psychoacoustic model 1 is less complex than psychoacoustic model 2 and makes more compromises to simplify the calculations. Either model works for any of the compression layers; however, only model 2 includes specific modifications to accommodate Layer III. In this paper we refer to model 2, differently from past approaches.
Figure 18: Steps (1-4) of the Patchwork shaping algorithm.
As already stated, the proposed patchwork strategy modifies two sets of N elements/coefficients of the original signal (i.e., signalUn-Marked). The signalMarked remains closely tied to the corresponding signalUn-Marked. The core of our strategy is the shaping of the frequency response of the mark signal using psychoacoustic model 2. The algorithm proposed in this work embeds the watermark in the frequency domain, by modifying 2N Fourier coefficients. The choice of this transform domain is justified by the use of the psychoacoustic model. The embedding steps (see Figure 18) can be summarized as follows:

1. Evaluate the minimum audibility threshold for the signalUn-Marked, referring to psychoacoustic model 2.
2. Map the secret key and the watermark to the seed of a random number generator. Next, generate two N-point index sets I_N^A = {a_1, a_2, ..., a_N} and I_N^B = {b_1, b_2, ..., b_N}.
3. Let X = {X_1, X_2, ..., X_2N} be the 2N DFT coefficients of the signalUn-Marked corresponding to the index sets I_N^A and I_N^B.
4. The original amplitude of the patch and the number of re-touched coefficients, starting from the generic elements of index a_i or b_i, have the standard values (δ, θ), respectively. Such values are modified iteratively to verify that the spectrum of the watermark signal lies under the minimum audibility threshold (i.e., obtained in step 1). Iteratively means constantly referring back to the model-2 block from the shaping block (see the dotted loop in Figure 18).
5. The time-domain representation of the output signal is found by applying an inverse DFT to the signalMarked.
The phase of detection is as follows:
1. Define two test hypotheses: H0 (the watermark is not embedded) and H1 (the watermark is embedded).
2. Map the seed and the watermark to a random number generator and generate two index sets I_N^A' and I_N^B'.
3. Fix a detection threshold ∆, and evaluate the mean value (i.e., ⟨·⟩ = E(·)) of the random variable z = a'_i − b'_i, for a'_i ∈ I_N^A' and b'_i ∈ I_N^B'.
4. Decide for H0 or H1, depending on whether ⟨z⟩ < ∆ or ⟨z⟩ ≥ ∆.

We tested the proposed algorithm on 16-bit stereo audio signals, sampled at Fs = 44.1 kHz. The size of each patch (i.e., N) was fixed to 50 points, while the default values for (δ, θ) were set to (0.5, 10). Higher values of θ were also tested, but only for robustness evaluation, regardless of quality aspects.
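The embedding and detection phases described above can be sketched as follows. This is a simplified illustration only: the iterative psychoacoustic shaping loop of step 4 is omitted (δ is kept fixed instead of being iterated against the masking threshold), a plain O(N²) DFT keeps the sketch self-contained, and all function names are ours.

```python
import cmath
import random

def _dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def _index_sets(key, size, n):
    # Step 2: seed an RNG with the key and draw two disjoint index sets
    # over the positive-frequency bins (DC excluded).
    idx = random.Random(key).sample(range(1, size // 2), 2 * n)
    return idx[:n], idx[n:]

def embed(signal, key, n=8, delta=0.5):
    # Steps 3-5: push the magnitudes of the two patches apart by +/- delta.
    size = len(signal)
    spec = _dft(signal)
    set_a, set_b = _index_sets(key, size, n)
    for k in set_a:
        spec[k] *= (1.0 + delta)
        spec[size - k] = spec[k].conjugate()   # keep the output real-valued
    for k in set_b:
        spec[k] *= (1.0 - delta)
        spec[size - k] = spec[k].conjugate()
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / size)
                for k in range(size)).real / size
            for t in range(size)]              # inverse DFT (step 5)

def z_statistic(signal, key, n=8):
    # Detection: mean magnitude difference between the regenerated sets;
    # decide H1 (watermark present) when the statistic reaches a threshold.
    spec = _dft(signal)
    set_a, set_b = _index_sets(key, len(signal), n)
    return sum(abs(spec[a]) - abs(spec[b])
               for a, b in zip(set_a, set_b)) / n
```

Embedding with a given key raises the statistic by δ times the average magnitude over the two patches, so z for the marked signal always exceeds z for the unmarked one under the same key.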
The state of the art proposes a framework for the evaluation of audio watermarking techniques [7]. In this work, we referred to this framework and considered, in particular, two key factors: the quality of the watermarked signal and the robustness to MP3 compression. The evaluation of quality is an essential part of testing our strategy, since the basic idea was to guarantee the maximum degree of inaudibility of the patch. The tests were performed using a subjective score (i.e., a MOS) and the SNR of the watermarked signal versus the host signal. The robustness of the proposed strategy was tested in two steps: first, coding and decoding the watermarked signal with a commercial MP3 encoder at different rates (e.g., usually 128 kbit/s); secondly, attempting the detection of the watermark on the decompressed signal. Quality and robustness cannot be evaluated separately. These factors are strongly correlated, that is, a decrease in quality causes an increase, in most cases significant, in robustness. All the performed tests showed good results. The idea of increasing the number of points of the patches proves successful. Good subjective quality is obtained, since all the patches are below the audibility threshold for that signal (SNR ≥ 26).
Figure 19: Probability density function of detection for the random variable z, varying the dimension of the patch, with SNR = 26.

At the same time, treating more points has the effect of shifting the patchwork noise to low frequencies, where it has a lower probability of being filtered out by MP3 compression. Figure 19 shows different probability density functions (i.e., introduced as empirical PDFs in [5]) of the random variable z, as described in the detection phase. The density function of z before MP3 compression is compared with different behaviours (i.e., varying the dimension of the patch). This test clearly shows that higher values of θ lead to smaller alterations in the detection values, which results in a PDF closer to that of the uncompressed signal. We have also evaluated the error probability at different compression rates (i.e., 128, 96, and 64 kbit/s). Two kinds of errors can be identified; the state of the art refers to them in terms of Type I (rejection of H0 when H0 is true) and Type II (non-rejection of H0 when H1 is true) [4]. Type II errors are the most impairing: the watermark is inserted (i.e., quality is degraded) but the ownership cannot be proved. Table VII presents the Type II errors for a test audio signal. Clearly, the probability of correctly rejecting H0 when H1 is true decreases with the MP3 compression rate.
In this paper an audio watermarking framework has been presented, based on a patchwork approach. The core of the proposed technique is the shaping of the patchwork, performed with reference to psychoacoustic model 2. This results in higher robustness and inaudibility of the patch noise. The strategy was evaluated in terms of robustness to lossy compression and quality. Good results were obtained during the tests with 44.1 kHz audio and speech traces. The proposed strategy can be improved in the modelling of the patch.
At present this step is quite coarse. Further studies will be centred on more refined mathematical mechanisms of patch shaping (e.g., curve fitting) with respect to the minimum masking threshold.
Table VII: Error probabilities for lossy compression at different rates.

                               Type II Errors (%)
MPEG-1 Layer III (128 kbit/s)        0.1
MPEG-1 Layer III (96 kbit/s)         0.7
MPEG-1 Layer III (64 kbit/s)         1.6
References
[1] W.Bender, D.Gruhl, N.Morimoto, and A.Lu, “Techniques for data hiding,” IBM System Journal, Vol.35,
No.3-4, pp.313-335, 1996.
[2] M.Arnold, “Audio watermarking: features, applications and algorithms,” IEEE Int. Conf. Multimedia and
Expo 2000, Vol.2, pp.1013-1016, 2000.
[3] Hong Oh Kim, Bae Keun Lee, and Nam Yong Lee, "Wavelet-based audio watermarking techniques:
robustness and fast synchronization,” Research Report 01-11, Division of Applied Mathematics-Kaist.
[4] In-Kwon Yeo and Hyoung Joong Kim, "Modified patchwork algorithm: A novel audio watermarking

scheme," IEEE Trans. on Speech and Audio Processing, Vol.11, No.4, 07/2003.
[5] P.Bassia, I.Pitas, and N.Nikolaidis, "Robust audio watermarking in the time domain," IEEE Transaction on
Multimedia, Vol.3, pp.232-241, 06/2001.
[6] ISO/IEC Joint Technical Committee 1 Subcommittee 29 Working Group 11, Information Technology-
Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s, Part
3:Audio, ISO/IEC 11172-3, 1993.
[7] J. D. Gordy and L. T. Bruton, “Performance Evaluation of Digital Audio Watermarking Algorithms,”
Proceedings of the 43rd Midwest Symposium on Circuits and Systems, Lansing MI, USA, Aug. 2000.
[8] T.Muntean, E.Grivel, I.Nafornita, and M.Najim, “Audio digital watermarking for copyright protection,”
International Workshop on “Trends and Achievements in Information Technology”, 05/2002.
[9] N.Cvejic, A.Keskinarkaus, and T.Seppanen, "Audio watermarking using m-sequences and temporal
masking,” IEEE Workshops on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY,
pp.227-230, 2001.
[10] E.Zwicker and U.T.Zwicker "Audio Engineering and psychoacoustics: matching signals to the final
receiver: the Human Auditory System," Journal of Audio Engineering Society, Vol.39, No.3, pp.115-126,
03/1991.
[11] L.Boney, A.H.Tewfik, and K.N.Hamdy, "Digital watermarks for audio signals," International Conference on
Multimedia Computing and Systems, Hiroshima, Japan, pp.473-480, 1996.
[12] M.Arnold and K.Schiltz, “Quality evaluation of watermarked audio tracks,” SPIE Electronic Imaging,
Vol.4675, pp.91-101, 2002.
[13] Hyoung Joong Kim, “Audio watermarking techniques,” Pacific Rim workshop on Digital Steganography,
Kitakyushu, 07/2003.
[14] D. Pan, "A tutorial on MPEG/audio compression," IEEE MultiMedia, Vol.2, No.2, pp. 60--74, 1995.