

Ph.D. Thesis

VoIP Streaming Over Packet-Based Networks

Mirko Luca Lobina Advisor: Prof. Luigi Atzori

University of Electrical and Electronic Engineering Cagliari


Contents

My Thanks, Introduction and Sketch of the Thesis

I. The Playout Buffering

II. Previous Works

III. The ITU-T E-Model

IV. The eEM Playout Strategy

i. Analysis of the Packet Loss Burstiness

j. Extension of the E-Model

k. Playout Buffering by Quality Maximization (eEM)

l. Analysis of the Computational Complexity of the eEM Strategy

V. Experiments

VI. Conclusions

Appendix I: The ITU-T H.323

Appendix II: Intrusive and Non-Intrusive Evaluation of Speech Quality

Appendix III: The ITU-T G.729A-VAD, a low bit-rate speech codec

Appendix IV: Other Research Fields and Articles during the Ph.D. period

i) The EM Playout Strategy

ii) Audio Watermarking using Psychoacoustic Model II


Index of Figures and Tables

Figure 1, pp. 5: Playout buffering to compensate the transmission jitter.

Figure 2, pp. 19: Sketch of the proposed playout buffering strategy.

Figure 3, pp. 23: 4-State Markov model for the packet loss process.

Figure 4, pp. 27: Effects of the playout buffer setting on the packet loss burstiness.

Figure 5, pp. 28: Predicted and measured functions of g, b, g_e, b_e for talkspurt 25 in Trace 4: the continuous and dashed lines have been drawn from measured and predicted values, respectively.

Figure 6, pp. 30: Average R Factor versus N for the 8 test traces.

Figure 7, pp. 33: Comparison of the proposed algorithm with the competing ones for the first talkspurts of Trace 8.

Figure 8, pp. 34: Overall R Factor for Traces 7 and 8.

Figure 9, pp. 40: H.323 Terminals on a Packet Network.

Figure 10, pp. 43: An H.323 Zone.

Figure 11, pp. 51: H.323 Call Establishment.

Figure 12, pp. 52: IP Telephony: H.323 Interworking with SCN.

Figure 13, pp. 52: The PESQ Strategy.

Figure 14, pp. 57: The ITU-T R-Factor.

Figure 15, pp. 58: A non-intrusive strategy for measuring objective voice quality (Psytechnics).

Figure 16, pp. 73: Comparison of the proposed EMv1 algorithm with the Linear Filter and Concord algorithms for the third experiment.

Figure 17, pp. 75: Distribution of the mean difference of the samples in Un-Watermarked and Watermarked signals.

Figure 18, pp. 77: Steps (1-4) of the Patchwork shaping algorithm.

Figure 19, pp. 78: Probability density function of detection for the random variable z, varying the dimension of the patch, with SNR = 26.

Table I, pp. 21: Voice traces used during experiments.

Table II, pp. 22: Burstiness analysis results.

Table III, pp. 30: Settings for the eEM algorithm parameters used in the experiments.

Table IV, pp. 31: Comparison of eEM with other strategies for Traces 1-4. In the last two columns

Table V, pp. 32: Comparison of eEM with other strategies for Traces 5-8.

Table VI, pp. 32: Results of the eEM algorithm for Traces 1-4 when assuming loss randomness: the equipment impairment factor has been evaluated by means of (5).

Table VII, pp. 73: P_χ2, P_u, and P_c for four of the performed experiments after applying the EMv1


Table VIII, pp. 73: Results in terms of d_e2e and the R Factor for the three proposed algorithms (EMv1, EMv2, and EMv3) and the two comparing approaches (Linear Filter and Concord). The potential maximum R Factor is also presented.

Table IX, pp. 80: Error Probabilities for lossy compression at different rates.


My Thanks: this Thesis is the result of a professional collaboration with Dr. Luigi Atzori. My thanks go to Dr. Atzori above all for his expert guidance and steady support. The algorithms and reflections proposed in this work cover the three years spent as a Ph.D. student at MCLab (DIEE), the CNIT Multimedia Communication Lab at the University of Cagliari.


Introduction to the Thesis: the core of this work is the study of mechanisms for the real-time transmission and playback of speech contents over an Internet Protocol (IP) network (i.e., Internet telephony). Internet telephony refers to communication services (voice, facsimile, and/or voice messaging) that are transported via the Internet, rather than the public switched telephone network

(PSTN). The basic steps involved in originating an Internet telephone call are conversion of the

analog voice signal to digital format and compression/translation of the signal into IP packets for

transmission over the Internet; the process is reversed at the receiving end. The possibility of voice

communications traveling over the Internet, rather than the PSTN, first became a reality in February

1995 when Vocaltec, Inc. introduced its Internet Phone software. Designed to run on a 486/33-MHz

(or higher) personal computer (PC) equipped with a sound card, speakers, microphone, and modem,

the software compresses the voice signal and translates it into IP packets for transmission over the

Internet. This PC-to-PC Internet telephony works, however, only if both parties are using Internet

Phone software. In the relatively short period of time since then, Internet telephony has advanced

rapidly. Many software developers now offer PC telephony software but, more importantly, gateway

servers are emerging to act as an interface between the Internet and the PSTN. Equipped with voice-

processing cards, these gateway servers enable users to communicate via standard telephones. A call

goes over the local PSTN network to the nearest gateway server, which digitizes the analog voice

signal, compresses it into IP packets, and moves it onto the Internet for transport to a gateway at the

receiving end. With its support for computer-to-telephone calls, telephone-to-computer calls and

telephone-to-telephone calls, Internet telephony represents a significant step toward the integration of

voice and data networks. A complete description of the main scenarios and standards for Internet

telephony (e.g., ITU-T H.323) has been provided in Appendix I.

Originally regarded as a novelty, Internet telephony is attracting more and more users because it

offers tremendous cost savings relative to the PSTN. Users can bypass long-distance carriers and

their per-minute usage rates and run their voice traffic over the Internet for a flat monthly Internet-

access fee.

Although progressing rapidly, Internet telephony still has some problems with reliability and quality. Reliability means ensuring that the voice quality of an Internet telephony solution meets the users' requirements, and that the unified network can prioritize voice traffic and deal with high-traffic conditions. Quality matters because poor speech quality will make an Internet telephony solution unpopular to use. Reliability and quality strongly depend on several factors at both the network and application level, such as bandwidth limitations, coding strategies, channel characteristics and application design. From this perspective, this Thesis primarily focuses on two aspects: proposals to improve the end-user speech quality using ad hoc playout strategies (i.e., the EM and eEM strategies), considering the coding strategies, bandwidth limitations and backbone characteristics as fixed; and measuring the speech quality using specific metrics (see Section III and Appendix II) able to translate the end-to-end Internet telephony impairments into a friendly score (i.e., the Mean Opinion Score (MOS)).

Obviously, the two aspects described above are not independent, but few studies proposed in the past focused directly on the maximization of the perceived quality. The EM and eEM algorithms fill this gap. To this aim, a deep study was required of: the mechanisms of coding and transmission of speech contents; the channel characteristics and their statistical characterization; and the state-of-the-art playout strategies.

As appendixes to the Thesis, two other research fields are also presented: an audio watermarking strategy based on Psychoacoustic Model 2 (the same model applied in the MP3 compression strategy) and an error concealment technique to be applied to the ITU-T G.723.1 low bit-rate codec.


Sketch of the Thesis: this Thesis is composed of seven sections and four appendixes. The sections mainly treat the playout control problem in real-time streaming applications (e.g., Internet Telephony). Two new strategies are presented: a new playout method based on maximization of the R Factor, and its upgrade, named the eEM strategy. Several points related to these methods are discussed in the first three appendixes of the Thesis. The last appendix covers two other research fields, both related to audio processing: an audio watermarking technique based on Psychoacoustic Model 2 and a new error concealment approach for the ITU-T G.723.1 speech codec.


I. The Playout Buffering

IP Telephony applications have been developed over a set of protocols (RTP, UDP, and IP) that are not natively able to guarantee the quality of service required by the application. In fact, different factors

deeply affect the end-user perceived quality. One of the most impairing factors is the variation of the

packet transmission delay during the streaming, named jitter, which is caused by the temporal

variability of the network conditions.

In real-time applications such as IP Telephony, every transmitted packet has an associated playout

time. If a packet arrives later than this time, it is discarded by the decoder, being useless. Otherwise, it is held in the de-jitter buffer until its playout time, so as to compensate for the transmission jitter.

Fig. 1 illustrates this operation.

[Figure 1: timeline diagram with three axes (in ms) showing, for each packet i, the departure instants t_D,i at the sender, the arrival instants t_A,i at the receiver after the network delay d_net,i, and the uniformly spaced playout instants t_P,i after the de-jitter delay d_dej,i.]

Figure 1. Playout buffering to compensate the transmission jitter.

In the axis at the top of the figure, the departure instants t_D,i are drawn for every packet i (i = 0, 1, 2, ...), with t_D,0 set to zero. These instants are uniformly spaced by an interval T: t_D,i = i·T. This interval depends on the number and size of the speech frames conveyed in every transport packet (e.g., for the ITU-T G.729 speech codec, T = 20 ms, corresponding to two speech frames of 10 ms). The axis in the middle of the figure shows the arrival instants t_A,i of the packets at the receiver. The arrivals may be out of order, since the network delay d_net,i of every packet is a random


variable. Additionally, some packets can be lost during transmission due to network problems, such as buffer congestion at the nodes; this is the case of packet 3 in the figure. Note that with our notation d_net,i = t_A,i − t_D,i and t_A,i = d_net,i + i·T. The delay variability is removed by the de-jitter buffer, which introduces an additional delay d_dej,i. The intent is to obtain a sequence of playout instants t_P,i uniformly spaced, as illustrated in the axis at the bottom of the figure. Accordingly, the delay between the departure and the playout of the packets, d_e2e, is equal for every packet, so that d_e2e = d_net,i + d_dej,i. A packet is discarded at the receiver if d_net,i > d_e2e. The playout algorithm is devoted to the setting of this end-to-end delay, which can be changed occasionally during the streaming session, as described in the following.
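As an illustration of the notation above, the per-packet buffering decision can be sketched as follows (a minimal Python sketch of my own, not taken from the Thesis; function and variable names are mine):

```python
def playout_schedule(d_net, T, d_e2e):
    """Per-packet playout decision for a fixed end-to-end delay.

    d_net : list of network delays d_net,i (ms); None marks a packet
            lost in the network (e.g., packet 3 in Fig. 1).
    T     : packetization interval (ms), e.g., 20 for G.729 (two 10-ms frames).
    d_e2e : departure-to-playout delay (ms), d_e2e = d_net,i + d_dej,i.
    """
    played, discarded = [], []
    for i, d in enumerate(d_net):
        if d is None:            # lost in the network
            discarded.append(i)
            continue
        t_A = i * T + d          # arrival instant t_A,i = i*T + d_net,i
        t_P = i * T + d_e2e      # playout instant, uniformly spaced
        if t_A > t_P:            # late arrival (d_net,i > d_e2e): useless
            discarded.append(i)
        else:                    # buffered for d_dej,i = t_P - t_A ms
            played.append(i)
    return played, discarded
```

For instance, with delays [30, 45, 90, None, 40, 35] ms, T = 20 ms and d_e2e = 60 ms, packet 2 arrives after its playout instant and is discarded together with the lost packet 3.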

The removal of the jitter is accomplished at the receiving side by means of a playout buffer that masks

the jitter at the expense of an additional delay. Within this framework, an important task is the setting

of the total end-to-end delay, which should consider the network delay, the packet loss, and the

perceived subjective quality.

Originally, the setting of the playout buffer was based purely on the introduced additional delay and the loss performance. In recent years, a different approach has been proposed, which consists in taking into account the effects of delay and losses on the subjective quality. Such an approach requires the

use of an appropriate tool to evaluate the combined effects of transmission impairments that affect the

conversational quality. On the basis of this tool, the playout buffering algorithm estimates the optimal

buffer configuration by weighting the contribution of delay and loss to the conversational quality. The

use of such a perceptually motivated optimality criterion allows the receiver to automatically balance

packet delay versus packet loss. Almost all the proposed works founded on this approach, which we refer to as quality-based, make use of the ITU-T E-Model for quality evaluation [1], [2]. It is a

computational framework for the estimation of the conversational quality by means of a synthetic

index (the R Factor), which encloses the contributions of many features, presented as impairment

factors. However, an important problem limits the applicability of this model: it is valid only in case of

random packet losses, which are observed very rarely in IP Telephony. In fact, several studies have

shown the burstiness of packet losses in the Internet [3], [4]. Dealing with bursty losses as if these


were random would be a significant error. Indeed, at a given total loss ratio, the subjective impact of

isolated losses with respect to grouped losses is quite different.

Based on these considerations, in this work we study the application of the quality-based approach to

more realistic models for the packet loss process. We then propose a new playout buffering algorithm

based on an extension of the ITU-T E-Model proposed within the ETSI Tiphon project to incorporate the effects

of loss burstiness on the perceived quality [5]. The resulting algorithm works during the silence

periods. It estimates the parameters of a 4-state Markov model representing the loss behavior during

the subsequent talkspurt, evaluates the expected conversation quality varying the end-to-end delay

within a certain range, and finds the optimal setting of the playout buffer. To evaluate the expected

quality, the algorithm considers packet loss correlations and takes into account important effects

recently studied, such as the recency effect [6], [7], the smoothing of the user perception with respect to

sudden variations of the packet loss [7], and the temporal position of the losses in the speech stream

[8], [9].


II. Previous Works

The problem of jitter compensation has been addressed in several ways in the past. The proposed

techniques can be classified into two groups: fixed and adaptive. According to the techniques

belonging to the first group, the end-to-end delay is kept constant for the entire session. Differently, techniques in the second group adapt this delay to the variable network conditions during the streaming, trading potentially higher delays for fewer late packets.

In adaptive playout buffering, the most important features are related to when the playout buffer is

adjusted and which criterion is adopted. Intra-talkspurt techniques modify the end-to-end delay during the entire streaming, independently of the silence periods, using strategies of compression and extension of the waveform. On the contrary, between-talkspurt methods act during the intervals of silence. The latter approach is more frequently used.

As to the adopted criterion, different approaches have been experimented with. An autoregressive class of between-talkspurt methods is described in [10]. These methods are mainly based on two steps: estimation of the network delay conditions and tuning of the playout instants so as to incur only a small fraction of late packets. Denoting with d̂ and v̂ the estimates of the mean and variation of the network delay in the next talkspurt, the end-to-end delay is set as follows: d_e2e = d̂ + β·v̂, where β is usually set to 4.0. The four algorithms presented in [10] differ in the way d̂ is computed. The first algorithm estimates the average delay by means of a linear recursive filter characterized by a weighting factor α. The second algorithm presents a slight modification based on the use of two values of α: one for the increasing trend and the other for the decreasing trend of the network delay. In this way, it should be possible to follow short bursts of packets incurring long network delays. The third algorithm sets d̂ to the minimum network delay experienced in the previous talkspurt. The last one introduces the feature of detecting short-lived bursts of delay variations (spikes) and working differently depending on whether a spike is detected or not.
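The first of these algorithms can be sketched as follows (an illustrative Python sketch of mine; the weighting factor α = 0.998002 is the value commonly cited for this filter, and the function name is my own):

```python
def autoregressive_d_e2e(delays, alpha=0.998002, beta=4.0):
    """First algorithm of [10]: linear recursive filtering of the mean
    network delay d_hat and its variation v_hat, then
    d_e2e = d_hat + beta * v_hat."""
    d_hat, v_hat = delays[0], 0.0
    for d in delays[1:]:
        d_hat = alpha * d_hat + (1 - alpha) * d               # mean estimate
        v_hat = alpha * v_hat + (1 - alpha) * abs(d - d_hat)  # variation estimate
    return d_hat + beta * v_hat
```

With a perfectly constant delay the variation estimate stays at zero and the buffer adds no safety margin; a larger α makes the estimates react more slowly to sudden delay changes, which is exactly the sensitivity the α-Adaptive technique below addresses.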

A variant of the autoregressive approach in [10] is the α -Adaptive technique [11], which generalizes

the filtering method in [10] by defining different values of α . In fact, it was found that the choice of

α greatly affects the rapidity with which the mean network delay estimate reacts to sudden


variations in the actual delay. Thus, the α-Adaptive technique proposes an adaptive adjustment of the value of α every few packet arrivals. In [12], a different algorithm is proposed, based on a normalized least mean square (NLMS) adaptive predictor. The strategy estimates the network delay for each packet from the previous N ones, using an NLMS predictor. The computation of the end-to-end delay variance and the choice of d_e2e are performed as in the autoregressive approach.
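A minimal NLMS delay predictor of this kind might look as follows (a sketch under my own naming and default parameters, not the exact filter of [12]):

```python
def nlms_delay_predictions(delays, N=8, mu=0.5, eps=1e-6):
    """Predict the network delay of each packet from the previous N
    delays with a normalized-LMS adaptive linear filter."""
    w = [0.0] * N                 # filter taps, adapted on-line
    preds = []
    for i in range(N, len(delays)):
        x = delays[i - N:i]       # last N observed delays
        y_hat = sum(wj * xj for wj, xj in zip(w, x))  # predicted delay
        preds.append(y_hat)
        e = delays[i] - y_hat     # prediction error
        norm = sum(xj * xj for xj in x) + eps
        w = [wj + mu * e * xj / norm for wj, xj in zip(w, x)]  # normalized step
    return preds
```

The normalization by the input energy makes the adaptation step size insensitive to the absolute magnitude of the delays, so the same μ works for both lightly and heavily loaded paths.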

The algorithms in [10]-[12] estimate an average network delay and use it to set d_e2e so that the fraction of late packets is kept very small. However, IP Telephony applications can tolerate or

conceal a small amount of late packets. Thus, strategies performing a controlled tradeoff between the

packet lateness and the delay may offer better results. The gap-based algorithm proposes a strategy

where the packet loss ratio is settable to reduce the playout buffer delay [13]. For all the packets in a

talkspurt, the algorithm computes the “gap”, defined as the difference between the playout and the

arrival instants. This is a measure of the performance for a certain playout buffer delay setting. The

optimal buffer setting corresponds to the minimum amount of delay to be added to each packet that

would have allowed for obtaining the packet loss ratio tolerated by the application. The operations

performed during the adjustment may vary depending on the working conditions. In fact, the gap-based algorithm includes a spike detection strategy, which works either in an "Impulsive Mode" or in a "Normal Mode", depending on whether a spike is detected or not. The Concord algorithm in [14] fixes two

thresholds for the maximum late packet percentage and maximum acceptable delay. This strategy

performs the following operations: for each packet, the network delay is stored and used to build a

histogram; from this, an approximated and sampled version of the packet delay distribution (PDD) is

computed. The PDD is weighted by applying an aging function to the collected information. By using

such PDD, the algorithm set the eed 2 so that the two constraints on the maximum late packets

percentage and maximum acceptable delay are satisfied.
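The histogram step of this scheme can be sketched as follows (an illustrative Python sketch; the aging function is omitted for brevity, and names and default values are mine):

```python
import math

def pdd_buffer_delay(delays, bin_ms=5, max_late=0.05, max_delay=400.0):
    """Concord-style choice of d_e2e from a packet delay distribution
    (PDD) histogram: the smallest bin edge meeting the late-packet
    constraint, capped by the maximum acceptable delay."""
    nbins = int(math.ceil(max_delay / bin_ms))
    hist = [0] * (nbins + 1)
    for d in delays:
        hist[min(int(d // bin_ms), nbins)] += 1   # build the sampled PDD
    total = len(delays)
    covered = 0
    for b, count in enumerate(hist):
        covered += count                          # cumulative distribution
        if 1 - covered / total <= max_late:       # late fraction constraint met
            return min((b + 1) * bin_ms, max_delay)  # upper edge of bin b
    return max_delay                              # constraint never met: cap
```

For example, with 90% of delays near 20 ms and 10% near 100 ms, a 10% late-packet budget yields a buffer of 25 ms, while a 5% budget forces the buffer past the 100-ms cluster.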

A step forward with respect to the algorithms presented so far consists in setting the end-to-end delay so that the perceived quality is maximized. Some works exploiting this principle have been proposed in recent years. In [15], the E-MOS strategy makes use of a statistical distribution approach to better

estimate the network delay. To this purpose, a cumulative density function (CDF) is built for the tail of

the network packet delay distribution using the Pareto function; this approach relies on the delay


analysis in [16] and [17]. Then, a mathematical relationship between the MOS (Mean Opinion Score)

and the delay and loss is extracted from the results presented in [18]. This relationship is used to find

the end-to-end delay that maximizes the expected subjective quality. The advantage of this approach is

the direct use of the MOS index to evaluate the conversation quality. However, the application of this

index introduces some issues concerning the linear combination of different impairment factors in the

MOS scale. Indeed, the authors have added the impairments related to the delay and packet losses to

obtain the final quality index. This has been proved for the ITU-T G.711 codec [19] but has not been investigated for low bit-rate codecs. Thus, the validity of this operation for other codecs needs additional analysis.

In [20], the joint design of FEC (Forward Error Correction) and playout buffering for Internet Telephony is presented. The authors show that a real benefit in using the joint control of both playout

and FEC is obtained only when delay is critical. The ITU-T E-Model is used to derive an expression

of the application quality as a function of the encoding rate, packet loss rate, and end-to-end delay.

The source-channel coding parameters and the end-to-end delay are selected by maximizing the

quality function. The authors focus on random losses, stating that these are equivalent to the case of

bursty losses if the loss percentage is lower than 5%. Indeed, this is true for the ITU-T G.711 codec

[19] as shown in [21], but it is unlikely to happen for low bit-rate codecs. In [22], a non-linear

regression model to predict voice quality, based on the ITU-T PESQ [23], PESQ-LQ [24] and ITU-T

E-Model, is presented. Several models for different speech codecs are derived, which can be used for general QoS (Quality of Service) control purposes, including voice quality monitoring and playout buffering. Basically, this work is the natural progression of [25]. Also in this study, the packet losses are assumed to be random. A similar approach is presented in [26] for IP Telephony applications that make use of the ITU-T G.729.

Even if not directly related to the playout buffering problem, there are other works that are worth

mentioning at this point since they make use of the ITU-T E-Model. In [27], the perceived quality of

different codecs under the same bandwidth requirements is evaluated using the MOS. The important

results of this work are: FEC cannot reduce jitter unless out-of-order packets are common in the

Internet; a robust codec with a PLC (Packet Loss Concealment) strategy is more useful for battling


jitter. In [28], a joint source-channel coding adaptation algorithm for the AMR (i.e., Adaptive Multi

Rate) speech codec is described. The paper presents an analysis of the best tradeoff between source

and channel bit rates, under constraints on packet loss, end-to-end delay, and transmission rates. The

performance is evaluated making use of the ITU-T E-Model. It is recognized that while a FEC strategy

mitigates the effects of packet loss, it also increases the end-to-end delay. These two effects work in opposite directions with respect to the speech quality. Assuming the loss to be random, the proposed algorithm tries to

find the optimal compromise between packet loss recovery and end-to-end delay. Finally, [29]

presents a decision system to select the coding scheme and routing path so as to maximize the number of calls that can be placed in a VoIP system while still guaranteeing a minimum level of speech quality.


III. The ITU-T E-Model

To evaluate the voice transmission quality, intrusive and non-intrusive methods have been proposed in

the past. The former are based on a comparison between the original and the distorted signals, while

the latter compute a quality index from the analysis of the system configuration and the measurements

of transmission parameters, such as codec configuration, information loss, and transmission delay.

In this work, we focus on the ITU-T E-Model [1], which belongs to the category of non-intrusive

methods. The choice of this model arises from the need to evaluate the influence of the most important system settings on the perceived quality. In fact, the E-Model is a tool that estimates the voice quality when comparing different network equipment and designs. Its main feature is the

capacity of revealing the underlying causes of speech quality problems by means of an overall quality

index, the R Factor, which is the combination of a well-defined set of metrics linked to: low bit-rate

speech coding; delay and loss distribution; frame erasure distribution; loss concealment technique;

architectural choices such as de-jitter buffer, packet and codec frame size. The R Factor is defined as

follows:

R = 100 − I_s − I_d − I_e,eff + A. (1)

The maximum value of this index is 100. The signal-to-noise impairment factor I_s comprises the distortions introduced by the circuit-switched part of the end-to-end communication network. A set of default values for this parameter is provided in [1]. The term I_d measures the impairments associated with the mouth-to-ear delays encountered along the transmission path. I_e,eff represents the impairments associated with the signal distortion, caused by low bit-rate codecs and packet losses. The

Expectation factor A increases the level of conversational quality when the end-user may accept some

decrease in quality for access advantage (e.g., mobility). A comprehensive description of the

Expectation factor is provided in [2], but no agreement has been reached for the value in case of IP

Telephony. For this reason, it is usually set to zero.

Despite its apparent simplicity, (1) represents a non-operative form of the E-Model, since the four

factors depend on several configuration parameters. A first simplification can be obtained when using

a set of default values and operative working conditions [1]:


R = 93.2 − I_d − I_e,eff. (2)

Introducing the assumptions in [30] (i.e., no circuit-switched network interworking for the access to the IP Telephony service) and using the default values, I_d becomes a function of the average mouth-to-ear delay. Such a delay, denoted by d, is defined as the sum of the end-to-end delay d_e2e and the encoding/decoding delay, which comprises both the packetization and the algorithmic components (the latter usually neglected). Thus, d = d_codec + d_e2e, where the packetization delay d_codec is equal to 25 ms in the case of ITU-T G.729-A+VAD. Note that d_e2e, as defined in Section I, is the interval of time between the departure of a packet from the transmitter and the time its content is played out at the receiver. It is controlled by the playout buffering algorithm and affects the conversational quality. The resulting I_d for a certain range of d values has been experimentally obtained [31]. In [30], these values have been interpolated to obtain an analytical expression:

I_d = 0.024·d + 0.11·(d − 177.3)·H(d − 177.3), (3)

where H(x) is the step function: H(x) = 0 if x < 0 and H(x) = 1 otherwise.

I_e,eff in (2) is a function of the end-to-end packet loss ratio e_e2e and the used speech codec:

I_e,eff = I_e(codec, e_e2e). (4)

e_e2e comprises the packets that have been lost during transmission, after the application of FEC loss recovery if used [20], and those that arrive correctly at the receiver but are too late to be played out. For a fixed e_e2e value, different impairment values are observed when changing the number of frames inserted in a transport packet, the distribution of the packet losses, the sensitivity of the used codec to data frame losses, and the used concealment algorithm. In [21], the I_e,eff values for some configurations are provided. For the G.729-A+VAD speech codec, which requires the transmission of two frames of 10 ms in each transport packet, the following expression has been obtained [30]:

I_e,eff = 11 + 40·log(1 + 10·e_e2e). (5)

Equation (5) is valid in the case of random packet losses and when the standard concealment algorithm is used. The current expression of I_e,eff does not allow for considering the effects of bursty packet losses. Several


changes in this direction have been introduced in the 2003 version of ITU-T E-Model, but a complete

integration of the burstiness model has not been included yet.

Under the most common conditions, the relationship between the R Factor and the MOS has been

verified for some quality levels: users very satisfied, users satisfied, some users dissatisfied, many users dissatisfied, and nearly all users dissatisfied. For some other conditions, the E-Model is less accurate;

in particular, annex A of [1] provides the situations where the model validity has not been completely

verified, especially regarding the overall additive property of the model, which is applicable only to a

certain extent. An estimated MOS can be obtained from the R Factor using the following formulae:

MOS = 1                                               if R < 0
MOS = 1 + 0.035·R + R·(R − 60)·(100 − R)·7·10^-6      if 0 ≤ R ≤ 100
MOS = 4.5                                             if R > 100 . (6)
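Under the G.729-A+VAD assumptions above, equations (2), (3), (5) and (6) can be combined into a short computation (an illustrative Python sketch of mine; taking the logarithm in (5) as the natural one is an assumption on my part):

```python
import math

def r_factor(d_ms, e_e2e):
    """Simplified E-Model, R = 93.2 - I_d - I_e,eff, with I_d from (3)
    (d_ms is the mouth-to-ear delay d, in ms) and I_e,eff from (5) for
    G.729-A+VAD under random losses (e_e2e is the loss ratio)."""
    h = 1.0 if d_ms > 177.3 else 0.0                  # step function H
    i_d = 0.024 * d_ms + 0.11 * (d_ms - 177.3) * h    # eq. (3)
    i_e_eff = 11 + 40 * math.log(1 + 10 * e_e2e)      # eq. (5), natural log assumed
    return 93.2 - i_d - i_e_eff

def r_to_mos(r):
    """Estimated MOS from the R Factor, eq. (6)."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6
```

For example, with an 85-ms mouth-to-ear delay and no losses, R = 93.2 − 2.04 − 11 = 80.16, which maps to a MOS slightly above 4.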

IV. The eEM Playout Strategy

As discussed above, most of the existing techniques for playout buffering are based on two main

operations: prediction of delay statistics for future packets; setting of the playout buffering in

accordance with a constraint on either the maximum end-to-end delay or the maximum fraction of

late packets. This methodology does not guarantee the maximization of the conversational quality.

Indeed, fixing a maximum value for the total end-to-end delay does not allow for controlling packet losses and their effects on the end-user perceived quality; on the other hand, limiting the maximum information loss does not make it possible to curb the end-to-end delay, which may heavily affect the application interactivity. Differently from this approach, another class of playout buffering techniques has appeared in recent years. These are based on controlling the playout buffer so as to maximize the

expected conversational quality. The aim is to jointly consider the expected delay and the information

loss making use of a perceptually motivated optimality criterion that allows the receiver to

automatically balance packet delay versus loss. The playout buffering technique presented in this Thesis belongs to this category, making use of an extended version of the ITU-T E-Model proposed within the ETSI Tiphon project [5].


Figure 2. Sketch of the proposed playout buffering strategy.

Fig. 2 shows the main blocks involved in the proposed approach: statistics relevant to loss and delay

are predicted by means of the previously sent packets; based on this information, the playout buffer

setting is accomplished so as to maximize the expected conversational quality during the future conversational unit. Two main features distinguish this approach from the past ones: the prediction of the correlation features that characterize the packet loss process; and the use of a quality model that evaluates the effects of the loss burstiness on the perceived quality. The prediction of packet loss and delay statistics is a topic frequently addressed by past works, also in similar contexts, and those techniques can be applied to the prediction of the loss burstiness statistics with only a few changes. Differently, the use of a quality model within this framework still presents some open issues.

- this model can be applied only during stationary periods;

- this model is valid only under certain conditions.

Based on the first point, we apply the quality model to conversation units during which the playout

buffer settings are left unchanged: in fact, the impact of end-to-end delay variation on speech

transmission quality has not been included in the E-Model algorithm yet. In particular, we propose to

modify the end-to-end delay during silence periods based on the estimated quality for the next

talkspurt during which the playout buffer is kept constant.

As to the second point, it is important to note that the effects of information losses on the R Factor have not been defined in the case of bursty packet losses for some common codecs (e.g., ITU-T G.729 and G.723). This is a crucial limitation not only in this context but for most uses of the E-Model. In fact, several studies [3], [4], [32] have shown the burstiness of packet losses in the Internet and, more generally, in packet-based networks. Dealing with bursty losses as if they were random would be a significant error, since isolated losses and grouped losses have quite different impacts on human perception. This phenomenon was demonstrated in a study presented in [33] for the ITU-T G.729-A+VAD speech codec. It has been shown that the concealment works well with a single speech


frame erasure, but not well enough with multiple frame erasures. This requires the introduction of appropriate extensions to the E-Model to take into account the quality degradation caused by the packet loss burstiness. Due to the importance of this point, in the following two sub-sections we present a packet loss analysis performed on real traces over the Internet and the extension of the E-Model proposed by the ETSI Society in [5]. Then, based on the extended R Factor computation, in the third sub-section we illustrate how the proposed playout algorithm works. In the fourth sub-section we then discuss the computational complexity.

IV.i. Analysis of the packet loss burstiness

We have analyzed the burstiness of packet losses in several traffic traces, relevant to voice connections over the Italian GARR research network and the European Tiscali ISP network with dial-up and ADSL access lines. Each trace lasted an average of 10 minutes and was recorded at a different time of the day. Two H.323 hosts were used during the experiments, employing the G.729-A+VAD codec (two 10-ms voice frames per UDP packet and the native concealment strategy [34]). The H.323 numbering mechanism was used to detect missing frames. The sender/receiver clocks were synchronized through the NTP (Network Time Protocol) protocol, and all packets within the conversation were captured at both sides.

The basic characteristics of the traces are summarized in Table I. The first four traces were obtained during conversations over LAN-to-ADSL connections. These traces are quite similar in terms of network delay but not in terms of packet loss percentage. Trace 5 is the result of an intra-LAN connection under high background traffic conditions: it is characterized by a low average network delay with high variability. Trace 6 is relevant to an ADSL-to-ADSL connection: it is characterized by a high delay and null packet loss with low delay variability. Traces 7 and 8 were recorded during conversations between hosts located in two different LANs and exhibit good behavior in terms of both loss and delay.


Table I. Voice traces used during experiments.

Trace #   Length (sec)   Average network loss (%)   Average network delay (ms)   Network delay std (ms)
Trace 1      761.73             1.59                        85.1                      23.4
Trace 2      656.50             2.57                        82.2                      24.3
Trace 3      568.34             3.58                        79.7                      22.2
Trace 4      673.80             4.22                        84.5                      22.4
Trace 5      527.10             4.08                        77.5                      26.4
Trace 6      576.82             0.00                       288.6                      10.6
Trace 7      625.94             1.39                        43.4                      19.6
Trace 8      631.24             0.76                        39.6                      19.2

The burstiness analysis had to be conducted after the playout buffering, which means that the packet losses had to include late packets in addition to those lost in the network, and a "neutral" strategy had to be used for the buffer setting. We therefore analyzed all the traces making use of a fixed buffering strategy, both with a high end-to-end delay of 1 sec, which is almost equivalent to observing only the network losses, and with a delay of 100 ms.
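The post-buffer loss sequence used in this analysis can be sketched as follows (an illustrative fragment, not the code used in the experiments; we represent a network-lost packet with a `None` delay, which is our own convention):

```python
def post_buffer_loss(delays_ms, d_e2e_ms):
    """0/1 loss sequence after playout buffering: a packet counts as lost
    if it was dropped by the network (delay None) or if its network delay
    exceeds the fixed end-to-end delay budget (a late packet)."""
    return [1 if d is None or d > d_e2e_ms else 0 for d in delays_ms]

# With a 1-sec budget almost only network losses remain; with 100 ms the
# late packets are added to the loss pattern.
```

The same rule is applied later, per search level, when the eEM algorithm updates its estimation counters.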

A first investigation has been performed using the statistical chi-squared test [35] by considering the random variable K associated with the length of packet loss bursts (K-1 lost packets followed by 1 correctly received packet). In the case of random packet losses, the variable K should be distributed according to a geometric PMF (probability mass function), and the chi-squared test has been used to verify this hypothesis (hypothesis H0). For the proposed experiments, we computed the chi-squared value and the associated probability P_χ2 of obtaining a chi-squared value equal to or greater than the observed one by chance only. We were then able to reject or not reject H0 depending on whether P_χ2 was smaller than a given significance level, usually selected equal to 1% or 5%.
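The statistic of this test can be sketched as follows (an illustrative Python fragment under our own assumptions: the geometric parameter is fitted by the method of moments and the tail is aggregated into one bin with `num_bins`; this is not the code used in the experiments):

```python
from collections import Counter

def chi_squared_geometric(burst_lengths, num_bins=5):
    """Chi-squared statistic of observed burst lengths K against a geometric
    PMF P(K = k) = (1 - q)^(k-1) * q, the distribution expected under H0
    (random, uncorrelated losses)."""
    n = len(burst_lengths)
    q = n / sum(burst_lengths)            # method of moments: E[K] = 1/q
    counts = Counter(burst_lengths)
    chi2 = 0.0
    for k in range(1, num_bins):          # individual bins K = 1 .. num_bins-1
        expected = n * (1 - q) ** (k - 1) * q
        chi2 += (counts.get(k, 0) - expected) ** 2 / expected
    tail_expected = n * (1 - q) ** (num_bins - 1)   # aggregated bin K >= num_bins
    tail_observed = sum(c for k, c in counts.items() if k >= num_bins)
    chi2 += (tail_observed - tail_expected) ** 2 / tail_expected
    return chi2
```

The resulting statistic is then compared with the 0.99 quantile of the chi-squared distribution with the appropriate degrees of freedom; near-geometric burst lengths yield a small statistic, bursty ones a large one.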

Additionally, we applied the 2-state Gilbert model to our traces, which is frequently used to represent the temporal correlation existing in bit-error and packet loss sequences [36]. This model is based on the unconditional packet loss probability P_u and the conditional packet loss probability P_c (the probability that a packet is lost given that the previous one was lost). The distance between P_u and P_c gives an indication of the deviation of the packet loss process from a memoryless Bernoulli process, which is characterized by having these two probabilities equal. The outcomes of these tests are reported in Table II.
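The estimation of the two Gilbert probabilities from a recorded 0/1 loss sequence (1 marking a lost packet) can be sketched as follows; this is an illustrative fragment, not the code used for Table II:

```python
def gilbert_params(loss):
    """Estimate the unconditional (P_u) and conditional (P_c) loss
    probabilities of the 2-state Gilbert model from a 0/1 loss sequence."""
    p_u = sum(loss) / len(loss)
    # Outcomes observed immediately after a loss.
    after_loss = [b for a, b in zip(loss, loss[1:]) if a == 1]
    p_c = sum(after_loss) / len(after_loss) if after_loss else 0.0
    return p_u, p_c
```

For a memoryless Bernoulli process the two estimates coincide; P_c well above P_u signals temporal correlation of losses.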


Table II. Burstiness analysis results.

            ----- d_e2e of 1 sec -----      ----- d_e2e of 100 ms -----
Trace #     P_χ2 (%)   P_u (%)   P_c (%)    P_χ2 (%)   P_u (%)   P_c (%)
Trace 1     > 1         1.59      1.8       > 1         1.81      2.01
Trace 2     << 0.1      2.57     11.1       << 0.1      2.91     21.3
Trace 3     << 0.1      3.58     16.2       << 0.1      3.82     24.2
Trace 4     << 0.1      4.22     19.6       << 0.1      4.43     28.1
Trace 5     << 0.1      4.08     21.9       << 0.1      4.36     35.7
Trace 6     << 0.1      0.00     77.1       << 0.1      0.27     86.9
Trace 7     << 0.1      1.39     25.3       << 0.1      1.62     37.1
Trace 8     << 0.1      0.76     17.2       << 0.1      1.01     37.2

By considering H0 as the hypothesis of random losses, we found by means of the chi-squared test that H0 has to be rejected for all the experiments except Trace 1. In fact, for almost all the performed trials a chi-squared value greater than the quantile of order 0.99, with the appropriate degrees of freedom, has been obtained. Accordingly, the deviation of the observed distribution of K from the geometric PMF is significant, so the hypothesis of random losses has to be rejected for these traces. The chi-squared test results are in accordance with those based on the Gilbert model, as shown in Table II. Additionally, note that the loss pattern burstiness increases as the end-to-end delay decreases; that is, the loss correlation is higher with an end-to-end delay of 100 ms than with 1 sec. This phenomenon is shown by the resulting distances between P_u and P_c, which are significantly high for all the experiments but higher after the playout buffering operation. This feature is a direct consequence of the burstiness of the network delay. When the playout algorithm sets a high end-to-end delay, the late packets are those located at the peaks of the high-delay bursts of packets. Decreasing the end-to-end delay, more consecutive packets are discarded from each high-delay burst, increasing the temporal correlation of losses.

IV.j. Extension of the E-Model

At present, a standard integration of the loss burstiness effects in the E-Model is under study by ITU-T Study Group 12. An alternative solution is proposed by the ETSI Society through a step-by-step extension of the equipment impairment factor I_e,eff [37], [5]. The first step is the modeling of the loss


burstiness, where a burst period is defined as an interval of time during which a high packet loss percentage is observed. The bursts are separated by gaps, which are characterized by sporadic loss events. Specifically, two lost packets belong to the same burst if fewer than g_min packets have been correctly received between them. If g_min or more packets are correctly received, such a sequence is regarded as part of a gap. According to [37], g_min is set to 16. The system is then modeled with a 4-state Markov chain, which is drawn in Fig. 3.
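The burst/gap partitioning rule can be sketched as follows. This is an illustrative off-line fragment, not the on-line counter-based procedure of [5]; for simplicity it reports an isolated loss as a burst of length 1, whereas the definition above counts it as part of the surrounding gap:

```python
def split_bursts_gaps(loss, gmin=16):
    """Partition a 0/1 loss sequence into burst and gap periods: two losses
    belong to the same burst if fewer than gmin packets are correctly
    received between them (gmin = 16 per [37])."""
    loss_idx = [i for i, x in enumerate(loss) if x == 1]
    if not loss_idx:
        return [("gap", len(loss))]
    periods = []
    if loss_idx[0] > 0:
        periods.append(("gap", loss_idx[0]))
    start = prev = loss_idx[0]
    for i in loss_idx[1:]:
        if i - prev - 1 < gmin:      # fewer than gmin good packets: same burst
            prev = i
        else:                        # gmin or more good packets: close the burst
            periods.append(("burst", prev - start + 1))
            periods.append(("gap", i - prev - 1))
            start = prev = i
    periods.append(("burst", prev - start + 1))
    if prev < len(loss) - 1:
        periods.append(("gap", len(loss) - 1 - prev))
    return periods
```

From the resulting periods, the durations g and b and the loss ratios e_g and e_b discussed below can be measured directly.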


Figure 3. 4-State Markov model for the packet loss process.

The four states are determined by the period the system belongs to (burst or gap) and by whether the last transmitted packet has been lost or correctly received: gap-no loss (state 1), burst-no loss (state 2), burst-loss (state 3), and gap-loss (state 4). [5] also suggests a simple procedure to estimate the transition probabilities from an observed sample path of the packet loss process. During the streaming, for every transmitted packet a group of estimation counters is updated depending on the current system state and on whether or not the packet has been correctly received in time to be played out. The transition probabilities are then obtained from these counters at the end of the observation period. From the estimated probabilities, a set of additional parameters characterizing the bursty loss process is derived. These are the following: g and b, which represent the durations in seconds of the gap and burst periods, respectively; e_g and e_b, which represent the average packet loss ratios during the gap and burst periods, respectively; and y, which is the time interval in seconds since the last burst of packet losses. Details of the estimation procedure can be found in [5].

Once the entire sequence of packets has been divided into gaps and bursts, the equipment impairment factor is separately evaluated for the gap (I_eg) and burst (I_eb) periods as if the losses were random. To this aim, I_eg and I_eb are computed for the ITU-T G.729A-VAD codec by means of (5):


I_eg = 11 + 40·log(1 + 10·e_g),   (7)

I_eb = 11 + 40·log(1 + 10·e_b).   (8)

These two contributions are integrated by modeling the smoothing effect: a sudden variation in the packet loss ratio does not necessarily result in a sudden change in the perceived quality. In fact, the user does not perceive the start and the end of a burst instantaneously, but notices the quality change by degrees. Several tests have shown this behavior [7]. In these tests, the packet loss ratio of a VoIP connection was varied from 0% to 30%, for periods of time from 15 to 30 s, within a test call of 3 min, and the listener provided feedback on the instantaneous quality during the call. The results revealed that the time constant for the gap-to-burst quality change was 4-5 s (named t1), while that for the burst-to-gap change was about 10-15 s (named t2). Let I1(t) denote the equipment impairment factor during a burst period, with t = 0 corresponding to the instant at which the loss process changes from a gap to a burst; and let I2(t) represent the equipment impairment factor during a gap period, with t = 0 corresponding to the instant at which the loss process changes from a burst to a gap. These two functions have an exponential behavior as follows:

I1(t) = I_eb − (I_eb − I_g)·exp(−t/t1),   (9)

I2(t) = I_eg + (I_b − I_eg)·exp(−t/t2),   (10)

where I_b = I1(t = b) and I_g = I2(t = g). The overall equipment impairment factor is computed by a temporal average:

avI_e = [b·I_eb + g·I_eg − t1·(I_eb − I_g)·(1 − exp(−b/t1)) + t2·(I_b − I_eg)·(1 − exp(−g/t2))] / (g + b).   (11)
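Equations (7)-(11) can be sketched in a few lines of Python. This is an illustrative fragment under our own assumptions: a base-10 logarithm in (7)-(8), and a fixed-point iteration to solve (9)-(10) for I_b and I_g:

```python
import math

def avg_impairment(e_g, e_b, g, b, t1=5.0, t2=15.0, iters=50):
    """Average equipment impairment factor avI_e of (11) for G.729A+VAD.

    e_g, e_b: average loss ratios in gaps and bursts; g, b: durations (s).
    I_b = I1(t=b) and I_g = I2(t=g) are the fixed point of (9)-(10)."""
    i_eg = 11 + 40 * math.log10(1 + 10 * e_g)           # (7)
    i_eb = 11 + 40 * math.log10(1 + 10 * e_b)           # (8)
    i_g, i_b = i_eg, i_eb                               # initial guesses
    for _ in range(iters):
        i_b = i_eb - (i_eb - i_g) * math.exp(-b / t1)   # (9) at t = b
        i_g = i_eg + (i_b - i_eg) * math.exp(-g / t2)   # (10) at t = g
    num = (b * i_eb + g * i_eg
           - t1 * (i_eb - i_g) * (1 - math.exp(-b / t1))
           + t2 * (i_b - i_eg) * (1 - math.exp(-g / t2)))
    return num / (g + b)                                # (11)
```

As a sanity check, when e_g = e_b the smoothing terms vanish and avI_e reduces to the random-loss value of (7); in general avI_e lies between I_eg and I_eb.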

In the final expression of the impairment factor, the ETSI also takes into account the recency effect, which is due to the influence of the position of a noisy/lossy burst within a call on the subjective evaluation of the overall quality. This effect was studied for the first time by AT&T [6]. In those experiments, different types of impairments were introduced at different positions in the test traces. The resulting MOS value decreased gradually from the case of an impairment at the beginning of the trace to the case of an impairment at the end. For instance, in the case of a bursty noise impairment, the MOS values observed with these two configurations were 3.82 and 3.18, respectively.


The recency effect can be modelled by hypothesizing an exponential behavior of the perceived quality, which starts from the quality level of the last significant loss burst and goes asymptotically to the average quality value of the call. In terms of the equipment impairment factor, this effect results in:

I_e,eff = avI_e + k·(I1* − avI_e)·exp(−y/t3),   (12)

where I1* is the exit value of the equipment impairment factor from the last burst, k is a constant equal to 0.7 as suggested in [5], and t3 is the exponential time constant, equal to 30 s, which represents the user's memory of the last occurred event.

(12) represents the final expression for I_e,eff, to be used in (2). Considering a set of default operating conditions and making I_d explicit in (2), we then obtain the following expression for the R Factor:

R = 93.2 − 0.024·d − 0.11·(d − 177.3)·H(d − 177.3) − I_e,eff,   (13)

where H(·) is the Heaviside step function. (13) allows for the evaluation of the overall quality in the case of bursty losses for the ITU-T G.729A-VAD.
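A sketch of (12)-(13) follows (illustrative only; the Heaviside step is written out explicitly):

```python
import math

def ie_eff(av_ie, i1_star, y, k=0.7, t3=30.0):
    """(12): impairment with the recency effect; i1_star is the exit value
    of I1 from the last burst, y the time in seconds since that burst."""
    return av_ie + k * (i1_star - av_ie) * math.exp(-y / t3)

def r_factor(d_ms, ie):
    """(13): R Factor under default conditions for end-to-end delay d_ms."""
    h = 1.0 if d_ms - 177.3 >= 0 else 0.0   # Heaviside H(d - 177.3)
    return 93.2 - 0.024 * d_ms - 0.11 * (d_ms - 177.3) * h - ie
```

Note how the step function activates the additional delay penalty only beyond 177.3 ms, while the recency term decays toward avI_e as the last burst recedes into the past.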

IV.k. Playout buffering by quality maximization (eEM)

(9)-(13) provide an analytical expression of the quality index in terms of the end-to-end delay and of the five parameters that characterize the packet loss process. It is important to highlight that these parameters are themselves functions of the end-to-end delay: at increasing values of d_e2e, the burst length and the packet loss ratios are expected to increase, while the gap length is expected to decrease. Additionally, it is reasonable to assume that these functions are time-varying, due to the temporal variability of the network delay process.

The overall expression of the R Factor in terms of the end-to-end delay is used to evaluate the speech quality in the proposed playout buffering algorithm, which we call eEM (extended E-Model based playout buffering) for presentation convenience. The eEM algorithm works in a between-talkspurt fashion, that is, the buffer is adjusted during the silence periods by maximizing the expected R Factor for the next talkspurt. Let R_i(d_e2e) denote the E-Model quality function for talkspurt i. Our objective consists in finding the optimal setting of the playout buffer d*_e2e,i for talkspurt i, which is defined as:

d*_e2e,i : R_i(d*_e2e,i) ≥ R_i(x), ∀ x ∈ R+.   (14)


This value represents the end-to-end delay that will be used during talkspurt i.

To find d*_e2e,i, we need to predict the relationships between the loss process parameters and d_e2e for the next talkspurt i. In particular, the functions g_i(d_e2e), b_i(d_e2e), e_g,i(d_e2e), e_b,i(d_e2e), and y_i(d_e2e) are required to obtain the function R_i(d_e2e). For presentation convenience, these functions are referred to with the array I_i(d_e2e). To address this issue, we rely on a numerical approach that consists in estimating the loss process parameters over a set of past talkspurts; the results of this estimation are then used as the prediction for talkspurt i. Let I_i,N(d_e2e) denote the result of the estimation performed during the N talkspurts that precede the i-th one. Then, the prediction of I_i(d_e2e) is:

Î_i(d_e2e) = I_i,N(d_e2e).   (15)

Since it is not possible to obtain an analytical expression of I_i,N(d_e2e), a discrete version is extracted by selecting a search range (d_e2e^− ÷ d_e2e^+) and dividing it into search steps of size D. M = (d_e2e^+ − d_e2e^−)/D is the number of resulting levels d_e2e,s (d_e2e,s = d_e2e^− + s·D, with s = 1,…,M). During the streaming session, for every transmitted packet, the counters in [5] required for the estimation are updated for every search level s, considering the packet lost if its network delay is greater than d_e2e,s. During the silence period, I_i,N(d_e2e,s) is estimated for every s; these estimates are then used as the prediction of I_i(d_e2e,s) and to find the optimal end-to-end delay value.
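The discretized search can be sketched as follows (illustrative; `r_of_d` is a hypothetical hook standing for the predicted R Factor computed from the estimated parameters at each level):

```python
def optimal_e2e_delay(r_of_d, d_min=50, d_max=400, step=1):
    """Scan the M = (d_max - d_min) / step levels d_s = d_min + s*step
    (s = 1..M) and return the one maximizing the predicted R Factor."""
    levels = range(d_min + step, d_max + 1, step)
    return max(levels, key=r_of_d)
```

With the default settings of Table III this is a scan over 350 candidate delays, performed once per silence period.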

To show the behavior of the functions I_i(d_e2e), Fig. 4 presents how the sequence of bursts and gaps changes when varying the playout buffer setting. In particular, Fig. 4.a depicts the network delays observed for a sequence of consecutive packets drawn from one of the traces of Table I. In this figure, two lines have been drawn, representing two possible values for the end-to-end delay. Each line separates late packets, which are considered lost, from in-time packets. A different loss sequence configuration results from each d_e2e setting, as shown in Fig. 4.b (to simplify the presentation, we used g_min equal to 8 in this example). This figure illustrates how the durations of the gap and burst periods, and the relevant loss percentages, vary within a trace when varying the buffer configuration. Note that a difference of 30 ms in the end-to-end delay made the average length of the burst in this trace vary from 4.5 to 41 packets.


Figure 4. Effects of the playout buffer setting on the packet loss burstiness. (a) Packet network delay with two possible settings of the playout buffer (d_e2e = 110 ms and d_e2e = 80 ms, dashed lines); these lines separate in-time packets from late packets. (b) Sequences of gaps and bursts when considering only network losses and when also considering late packets, with the two buffer configurations (legend: valid packet, lost packet, discarded packet, gap, burst).

The graphs in Fig. 5 show the typical behavior observed for the loss process parameters when varying the end-to-end delay. As expected, the trend of these curves is descending at increasing d_e2e, except for the gap duration, which shows the opposite behavior. In these graphs, the continuous lines represent the effective behaviors, I_i(d_e2e), which have been estimated at the end of the talkspurt; the dashed lines represent the predicted behaviors, Î_i(d_e2e), obtained on the basis of the previous N talkspurts (N has been set to 17). These figures show that the predicted and actual curves have a similar behavior. Note that the curves relevant to the predicted data are smoother than those of the measured data; this is because the former are obtained from N talkspurts, while the latter from only one talkspurt.

Other methods may introduce some improvements in the prediction procedure. A fuzzy-logic-based system could be used to predict I_i(d_e2e,s), with s = 1,…,M, on the basis of I_i−j(d_e2e,s), with j = 1,…,N. In the same way, a neural-network approach could be adopted. However, this would increase the computational complexity of the playout algorithm, which has to be simple enough to be implemented in low-cost terminals.

Figure 5. Predicted and measured functions of g, b, e_g, e_b for talkspurt 25 in Trace 4: the continuous and dashed lines have been drawn from measured and predicted values, respectively.

IV.l. Analysis of the computational complexity of the eEM strategy

The computational complexity of the algorithm has to be examined separately for the operations executed within and between talkspurts:

a) Update of the estimation counters: According to [5], the computational complexity is O(M). To be more precise, 9M counters track the system state, and at most 10M operations are executed for every packet. This procedure is performed during the streaming, so that 10M operations are executed during a packet time, which is 20 ms on average for the ITU-T G.729 codec (the packet time varies because of the jitter).

b) Computation of d*_e2e,i: Also in this case, the computational complexity is O(M). In particular, for every search level s, the algorithm estimates I_i,N(d_e2e,s). The resulting estimates are then used to compute the R Factor through equations (7)-(13), executing a total of about 100 operations for every search level. At the same time, the value d*_e2e,i that yields the maximum expected quality is found. Note that this procedure has to be executed during the silence period.

From this analysis, it follows that the computational complexity is not affected by the number of talkspurts used for the prediction (N), but only by the resolution (M) used to find the optimal end-to-end delay.

V. Experiments

Several tests have been performed applying the proposed playout technique, and alternative ones, to traces relevant to voice connections over the Internet. In this Section, we present the results for the eight traces introduced in Section III.A, whose main characteristics are provided in Table I. The results have been evaluated by computing the R Factor for each talkspurt on the basis of the measured burst and gap lengths, burst and gap packet loss ratios, and introduced delay.

To evaluate the influence of the selection of the number N of talkspurts used to make the prediction, we have carried out some experiments changing this parameter in the range 1÷30 for every trace. Fig. 6 shows the results. It can be observed that for low values of N, roughly lower than 15, the algorithm performance is affected by this parameter for almost all the traces, while this is not true for high values. Hence, there exists a minimum number of talkspurts that have to be used to obtain good estimates of the packet loss process parameters. Additionally, the packet loss process seems to be slowly time-varying; in fact, in the range 15÷30, the results can be considered invariant with respect to N, proving that for at least 30 talkspurts the loss process remains almost unchanged. Based on these results, in our experiments we have set N to 17 for all the traces. Recall that the algorithm complexity is not affected by N;


Figure 6. Average R Factor versus N for the 8 test traces.

accordingly, we have not been forced to look for the lowest value that allows for good estimation performance. As to the other eEM algorithm settings, Table III presents the values used.

Table III. Settings for the eEM algorithm parameters used in the experiments.

Parameter            Description                                    Value
N                    Number of talkspurts for prediction            17
d_e2e^− ÷ d_e2e^+    Search range (ms)                              50÷400
D                    Search step for the optimal d_e2e (ms)         1
M                    Number of d_e2e levels in the search range     350

In the following we compare the performance of the eEM algorithm with the Linear Filter [10], the Concord [14], and the E-MOS [15] algorithms. In particular, for the Linear Filter we have used Algorithm 1 in [10] with a weighting factor α = 0.998002. The Concord algorithm has been tested with the default parameter values: expected d_e2e (named ted in [14]) recalculated at each arriving packet, histogram with one-millisecond bin width, aging every 1000 packets with F = 0.9, and the maximum late packets parameter (mlp) set to 0.01. To better evaluate the obtained results, for each talkspurt we have computed


the maximum R Factor (R Max), which represents the upper bound for the achievable quality level. It is obtained by setting the end-to-end delay within a talkspurt by means of the eEM algorithm with an important variation: the R Factor optimization is performed using the measured I_i(d_e2e) instead of the predicted Î_i(d_e2e), therefore excluding prediction errors. Clearly, this operation is not applicable in a real framework, since in this case the measured values would need to be available before the beginning of the relevant talkspurt. Tables IV and V show the R Factor averaged over all the talkspurts in a trace, together with the average d_e2e and the total loss ratio. Note that the eEM algorithm generally outperforms the others in terms of the R Factor. This was an expected result, since the other algorithms do not rely on a quality model, except the E-MOS, which however makes use of a quite different mathematical representation of the conversational quality. As to the first four traces, the proposed algorithm allows for obtaining, on average, an R Factor almost 2 points higher than the others. As to Trace 5, it presents high network delay variability. In this case, a significant decrease of the R Factor can be observed (about 10 points lower than Trace 4), mainly due to an increase in the packet loss ratio with respect to the previous traces.

Table IV. Comparison of eEM with other strategies for Traces 1-4; in the last two columns, the corresponding MOS values are shown in brackets. The R Max value is per trace.

Trace     Algorithm       Average loss (%)   Average d_e2e (ms)   Average R Factor (MOS)   Average R Max (MOS)
Trace 1   eEM             2.28               180.2                71.5 (3.67)              74.4 (3.80)
          Linear Filter   2.40               173.7                62.0 (3.20)
          Concord         2.43               169.0                68.1 (3.51)
          E-MOS           1.65               295.9                58.6 (3.03)
Trace 2   eEM             3.31               177.5                70.8 (3.63)              71.8 (3.68)
          Linear Filter   3.32               173.0                66.6 (3.43)
          Concord         3.51               168.9                68.7 (3.54)
          E-MOS           2.81               294.9                57.0 (2.94)
Trace 3   eEM             4.37               174.8                69.2 (3.56)              69.5 (3.57)
          Linear Filter   4.67               173.7                65.2 (3.36)
          Concord         4.42               169.0                67.8 (3.49)
          E-MOS           3.65               282.2                58.6 (3.03)
Trace 4   eEM             4.91               178.9                68.1 (3.51)              68.1 (3.51)
          Linear Filter   5.23               173.9                64.0 (3.30)
          Concord         5.31               169.2                66.1 (3.41)
          E-MOS           4.27               291.3                52.1 (2.69)


Table V. Comparison of eEM with other strategies for Traces 5-8.

Trace     Algorithm       Average loss (%)   Average d_e2e (ms)   Average R Factor (MOS)   Average R Max (MOS)
Trace 5   eEM             5.05               176.1                58.8 (3.04)              59.0 (3.05)
          Linear Filter   5.08               173.3                56.5 (2.92)
          Concord         5.22               165.2                58.1 (3.00)
          E-MOS           4.41               305.3                53.0 (2.73)
Trace 6   eEM             0.44               348.0                75.2 (3.83)              77.3 (3.92)
          Linear Filter   0.30               399.6                67.6 (3.48)
          Concord         0.80               409.4                68.8 (3.54)
          E-MOS           1.60               373.0                65.0 (3.35)
Trace 7   eEM             1.40               157.4                77.8 (3.94)              78.4 (3.96)
          Linear Filter   1.20               140.2                68.3 (3.52)
          Concord         1.90               130.6                70.6 (3.63)
          E-MOS           1.15               220.1                60.1 (3.11)
Trace 8   eEM             1.21               135.2                76.8 (3.90)              77.5 (3.93)
          Linear Filter   1.81               129.0                73.4 (3.75)
          Concord         1.40               128.8                74.8 (3.81)
          E-MOS           0.72               295.3                60.8 (3.14)

This phenomenon has been observed in particular for the Concord, which allowed for an increase in the number of late packets in order to reduce the end-to-end delay. For Trace 6, characterized by high network delays and low packet loss ratios, the eEM algorithm has provided high quality values with respect to the others. In particular, the Concord showed problems in optimizing d_e2e and in keeping the packet loss ratio low; the E-MOS is competitive, notwithstanding the high packet loss ratio it introduced. The characteristics of this trace seem to represent good operating conditions for this strategy. Similar results have been observed for the last two traces. To evaluate the improvement of the proposed algorithm on a different scale, we have converted the R Factor into the MOS according to (6). These values are shown in brackets in Tables IV and V. On this scale, the average improvement of the eEM algorithm with respect to the others is about 0.34 points.

Table VI. Results of the eEM algorithm for Traces 1-4 when assuming loss randomness: the equipment impairment factor has been evaluated by means of (5).

Trace     Average d_e2e (ms)   Average R Factor (MOS)
Trace 1   191.2                72.6 (3.72)
Trace 2   180.5                72.9 (3.73)
Trace 3   182.8                71.4 (3.66)
Trace 4   181.9                70.2 (3.61)

In Table VI, we provide the results of the quality-based playout algorithm when using the expression of the equipment impairment factor valid for random losses. In particular, we have applied the eEM algorithm to the first four traces directly using (5) to evaluate the effects of the losses. We present the resulting R Factor values and the average end-to-end delay. As expected, higher quality levels have been obtained, since the assumption of loss randomness increases the expected perceived quality with respect to the case of correlated losses. This phenomenon also led to higher optimal end-to-end delay values: at a given loss ratio, I_e,eff decreased while I_d remained unchanged.

Figure 7. Comparison of the proposed algorithm with the competing ones for the first talkspurts of Trace 8.

To evaluate the evolution of the R Factor within a session, Fig. 7 shows the performance of the playout algorithms for the first 35 talkspurts in Trace 8. The curves present a very similar behavior and differ by a vertical shift of some points. The eEM curve is always very close to the maximum achievable values, in accordance with the average results presented in Table V. Note that the quality is characterized by high variability, oscillating in a range of about 20 points for all the algorithms. This is due to the fluctuations of the network delay, which heavily affect the quality.


Figure 8. Overall R Factor for Traces 7 and 8.

Finally, in Fig. 8 we provide the overall call quality level instead of showing it for each talkspurt separately. In particular, g, b, e_g, e_b, and y have been measured for the entire conversation and directly used in (12) to evaluate the overall performance of the playout algorithms. This figure presents the graphs for Traces 7 and 8, showing results similar to those obtained when measuring the performance on a talkspurt basis.

VI. Conclusions

A new algorithm for playout buffering, named eEM, has been presented for IP Telephony applications. This algorithm exploits a quality model that allows the receiver to automatically find the end-to-end delay that is optimal in terms of conversational quality. The major contribution of this paper is the adoption of a model that is able to evaluate the effects of the packet loss temporal correlation on the end-user perceived quality. To this aim, the ETSI Tiphon study on bursty losses has been incorporated in the proposed algorithm. Extensive experiments have been carried out to evaluate the performance of the proposed strategy, showing that the eEM algorithm improves the conversational quality by some points in terms of the R Factor with respect to other playout techniques. Future work will be devoted to the investigation of alternative solutions for the prediction of the parameters of the ETSI Tiphon model and to the extension of this algorithm to audio-video communications.


References

[1] ITU-T Recommendation G.107, “The E-Model, a computational model for use in transmission planning,”

03/2003.

[2] ITU-T Recommendation G.108, “Application of the E-model: A planning guide,” 09/1999.

[3] J-C. Bolot, “Characterizing end-to-end packet delay and loss in the Internet,” Journal of High-Speed

Networks, vol. 2, no. 3, pp. 305-323, Dec. 1993.

[4] W. Jiang and H. Schulzrinne, “Modeling of packet loss and delay and their effect on real-time multimedia

service quality,” in Proc. NOSSDAV, June 2000.

[5] ETSI T1A1.1/2001-037, “Extensions to the E Model to incorporate the effects of time varying packet loss

and recency,” Alan Clark, Telchemy Incorporated.

[6] ANSI T1A1.7/98-031: “Testing the quality of connections having time varying impairments,” AT&T.

[7] ITU-T SG12 D.139: “Study of the relationship between instantaneous and overall subjective speech quality

for time-varying quality speech sequences: influence of a recency effect,” France Telecom.

[8] L. Sun, G. Wade, B. Lines, and E. Ifeachor, “Impact of packet loss location on perceived speech quality,” in

Proc. 2nd IP Telephony Workshop, pp. 114-122, New York, April 2001.

[9] H. Sanneck, N. Le, A. Wolisz, and G. Carle, “Intra-flow loss recovery and control for VoIP,” in Proc. ACM

Multimedia 2001, Ottawa (ON), Sept. 2001.

[10] R. Ramjee, J. Kurose, D. Towsley, and H. Schulzrinne, “Adaptive playout mechanisms for packetized audio

applications in wide-area networks,” in Proc. IEEE INFOCOM, Toronto, Canada, June 1994.

[11] A. Kansar and A. Karandikar, “Jitter-free audio playout over best effort packet networks,” in ATM Forum

International Symposium, New Delhi, India, 2001.

[12] P. DeLeon and C. J. Sreenan, “An adaptive predictor for media playout buffering,” in Proc. IEEE ICASSP,

vol.6, pp. 3097–3100, March 1999.

[13] J. Pinto and K. J. Christensen, “An algorithm for playout of packet voice based on adaptive adjustment of

talkspurt silence periods,” in Proc. 24th Conference on Local Computer Networks, Lowell, Massachusetts, Oct.

1999.

[14] C.J. Sreenan, J.-C. Chen, P. Agrawal, and B. Narendran, “Delay reduction techniques for playout

buffering,” IEEE Transactions on Multimedia, vol. 2, no. 2, pp. 88-100, June 2000.

[15] K. Fujimoto, S. Ata, and M. Murata, “Adaptive playout buffer algorithm for enhancing perceived quality of

streaming applications,” in Proc. IEEE GLOBECOM 2002, pp. 2463-2469, Taipei, Taiwan, Nov. 2002.


[16] K. Fujimoto, S. Ata, and M. Murata, “Playout control for streaming applications by statistical delay

analysis,” in Proc. IEEE INFOCOM, vol 8, pp. 2337-2342, June 2001.

[17] K. Fujimoto, S. Ata, and M. Murata, “Statistical analysis of packet delays in the internet and its application

to playout control for streaming applications,” IEICE Transactions on Communications, vol. E84-B, no. 6, pp.

1504, June 2001.

[18] C. Savolaine, “QoS/VoIP overview,” in Proc. IEEE Communications Quality & Reliability International

Workshop, April 2001.

[19] ITU-T Recommendation G.711, “Pulse code modulation (PCM) of voice frequencies,” 1/1988.

[20] C. Boutremans and J-Y. Le Boudec, “Adaptive playout buffer and FEC adjustment for Internet Telephony,”

in Proc. IEEE INFOCOM 2003, pp. 652-662, April 2003.

[21] ITU-T Recommendation G.113, “Transmission impairments due to speech processing,” 02/2001.

[22] L. Sun and E. Ifeachor, “New models for perceived voice quality prediction and their applications in

playout buffer optimization for VoIP networks,” in Proc. ICC 2004, June 2004.

[23] ITU-T Recommendation P.862, “Perceptual evaluation of speech quality (PESQ), an objective method for

end-to-end quality assessment of narrow-band telephone networks and speech codecs,” February 2001.

[24] A.W. Rix, “Comparison between subjective listening quality and p.862 PESQ score,” in Proc. of Online

Workshop Measurement of Speech and Audio Quality in Networks, pp. 17–25, May 2003.

[25] L. Sun and E. Ifeachor, “Prediction of perceived conversational speech quality and effects of playout buffer

algorithms,” in Proc. of IEEE ICC 2003, pp. 1–6, 2003.

[26] L. Atzori and M. L. Lobina, “Speech playout buffering based on a simplified version of the ITU-T E-Model,” IEEE Signal Processing Letters, vol. 11, no. 3, pp. 382-385, March 2004.

[27] W. Jiang and H. Schulzrinne, “Comparisons of FEC and codec robustness on VoIP quality and

bandwidth efficiency,” in Proc. ICN, August 2001.

[28] J. Matta, C. Pepin, K. Lashkari, and R. Jain, “A source and channel rate adaptation algorithm for AMR in

VoIP using the Emodel,” in Proc. NOSSDAV 2003, June 2003.

[29] M. Gardner, V. S. Frost, and D. W. Petr, “Using optimization to achieve efficient quality of service in voice

over IP networks,” in Proc. of IPCCC 2003, April 2003.

[30] R. Cole and J. Rosenbluth, “Voice over IP performance monitoring,” ACM Computer Communication

Review, vol. 31, no. 2, Apr. 2001.

[31] ITU-T Recommendation G.114, “One-way transmission time,” 05/2000.


[32] M. Yajnik, S. Moon, J. Kurose, and D. Towsley, “Measurement and modeling of the temporal dependence

in packet loss,” in Proc. IEEE INFOCOM, New York, March 1999.

[33] J. Rosenberg, “G.729 Error Recovery for Internet Telephony,” Columbia University Computer Science

Technical Report CUCS-016-01, vol. 19, Dec. 2001.

[34] ITU-T Recommendation G.729-A, “Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP),” 03/1996.

[35] A. Mood, F. Graybill, and D. Boes, “Introduction to the theory of statistics,” McGraw-Hill.

[36] W. Jiang and H. Schulzrinne, “Perceived quality of packet audio under bursty losses,” in Proc. IEEE

INFOCOM, New York, June 2002.

[37] ETSI TIPHON TS 101329-5: “QoS measurement methodologies,” Annex E.


Appendix I: The ITU-T H.323

Document Source

The contents of this document have been obtained from the International Engineering Consortium.

The International Engineering Consortium (IEC) is a nonprofit organization dedicated to catalyzing

technology and business progress worldwide in a range of high-technology industries and their

university communities. Since 1944, the IEC has provided high-quality educational opportunities for

industry professionals, academics, and students. In conjunction with industry-leading companies, the

IEC has developed an extensive, free, on-line educational program. The IEC conducts industry-

university programs that have substantial impact on curricula. It also conducts research and develops

publications, conferences, and technological exhibits that address major opportunities and challenges

of the information age. More than 70 leading high-technology universities are IEC affiliates, and the

IEC handles the affairs of the Electrical and Computer Engineering Department Heads Association.

Definition

H.323 is a standard that specifies the components, protocols and procedures that provide multimedia

communication services (real-time audio, video, and data communications) over packet networks,

including Internet protocol (IP) based networks. H.323 is part of a family of ITU-T recommendations

called H.32x that provides multimedia communication services over a variety of networks.

Overview

This appendix discusses the H.323 protocol standard. H.323 is explained with an emphasis on

gateways and gatekeepers, which are components of an H.323 network. The call flows between

entities in an H.323 network are explained, and the interworking aspects of H.323 with H.32x family

protocols are discussed.

What is H.323?

The H.323 standard is a cornerstone technology for the transmission of real-time audio, video, and

data communications over packet-based networks. It specifies the components, protocols, and

procedures providing multimedia communication over packet-based networks (see Figure 9). Packet-

based networks include Internet Protocol (IP) based (including the Internet) or Internet packet

exchange (IPX) based local-area networks (LANs), enterprise networks (ENs), metropolitan-area


networks (MANs), and wide area networks (WANs). H.323 can be applied in a variety of

mechanisms: audio only (IP telephony); audio and video (videotelephony); audio and data; and audio,

video and data. H.323 can also be applied to multipoint-multimedia communications. H.323 provides

myriad services and, therefore, can be applied in a wide variety of areas: consumer, business, and

entertainment applications.

Packet Network

H.323 terminal H.323 terminal

H.323

Figure 9. H.323 Terminals on a Packet Network

H.323 Versions

The H.323 standard is specified by the ITU-T Study Group 16. Version 1 of the H.323

recommendation - visual telephone systems and equipment for LANs that provide a nonguaranteed

quality of service (QoS) - was accepted in October 1996. It was, as the name suggests, heavily

weighted towards multimedia communications in a LAN environment. Version 1 of the H.323

standard does not provide guaranteed QoS.

The emergence of Voice Over Internet Protocol (VoIP) applications and IP telephony has paved the

way for a revision of the H.323 specification. The absence of a standard for voice over IP resulted in

products that were incompatible. With the development of VoIP, new requirements emerged, such as

providing communication between a PC based phone and a phone on a traditional switched circuit

network (SCN). Such requirements forced the need for a standard for IP telephony. Version 2 of

H.323 (packet-based multimedia communications systems) was defined to accommodate these

additional requirements and was accepted in January 1998.

New features are being added to the H.323 standard, which will evolve to Version 3 shortly. The

features being added include fax-over-packet networks, gatekeeper-gatekeeper communications, and

fast-connection mechanisms.


H.323 in Relation to Other Standards of the H.32x Family

The H.323 standard is part of the H.32x family of recommendations specified by ITU-T. The other

recommendations of the family specify multimedia communication services over different networks:

o H.324 over SCN

o H.320 over integrated services digital networks (ISDN)

o H.321 and H.310 over broadband integrated services digital networks (B-ISDN)

o H.322 over LANs that provide guaranteed QoS

One of the primary goals in the development of the H.323 standard was interoperability with other

multimedia-services networks. This interoperability is achieved through the use of a gateway. A

gateway performs any network or signaling translation required for interoperability.

H.323 Components

The H.323 standard specifies four kinds of components, which, when networked together, provide the

point-to-point and point-to-multipoint multimedia-communication services:

o Terminals

o Gateways

o Gatekeepers

o Multipoint Control Units (MCUs)

Terminals

Used for real-time bidirectional multimedia communications, an H.323 terminal can either be a personal computer (PC) or a stand-alone device, running an H.323 stack and multimedia applications. It supports audio communications and can optionally support video or data communications. Because the basic service provided by an H.323 terminal is audio communications, an H.323 terminal plays a key role in IP telephony services. The primary goal of H.323 is to interwork with

other multimedia terminals. H.323 terminals are compatible with H.324 terminals on SCN and

wireless networks, H.310 terminals on B-ISDN, H.320 terminals on ISDN, H.321 terminals on B-

ISDN, and H.322 terminals on guaranteed QoS LANs. H.323 terminals may be used in multipoint

conferences.


Gateways

A gateway connects two dissimilar networks. An H.323 gateway provides connectivity between an

H.323 network and a non-H.323 network. For example, a gateway can connect and provide

communication between an H.323 terminal and SCN networks (SCN networks include all switched

telephony networks, e.g., public switched telephone network [PSTN]). This connectivity of dissimilar

networks is achieved by translating protocols for call setup and release, converting media formats

between different networks, and transferring information between the networks connected by the

gateway. A gateway is not required, however, for communication between two terminals on an H.323

network.

Gatekeepers

A gatekeeper can be considered the brain of the H.323 network. It is the focal point for all calls within

the H.323 network. Although they are not required, gatekeepers provide important services such as

addressing, authorization and authentication of terminals and gateways; bandwidth management;

accounting; billing; and charging. Gatekeepers may also provide call-routing services.

Multipoint Control Units

MCUs provide support for conferences of three or more H.323 terminals. All terminals participating in

the conference establish a connection with the MCU. The MCU manages conference resources,

negotiates between terminals for the purpose of determining the audio or video coder/decoder

(CODEC) to use, and may handle the media stream. The gatekeepers, gateways, and MCUs are

logically separate components of the H.323 standard but can be implemented as a single physical

device.

H.323 Zone

An H.323 zone is a collection of all terminals, gateways, and MCUs managed by a single gatekeeper

(see Figure 10). A zone includes at least one terminal and may include gateways or MCUs. A zone has

only one gatekeeper. A zone may be independent of network topology and may comprise multiple network segments that are connected using routers or other devices.


Non-H.323 Network

(e.g., ISDN) terminal

Non-H.323 Network

(e.g., PSTN) terminal

MCU

Router Router

Gateway Gateway

Gatekeeper

Figure 10. An H.323 Zone


The protocols specified by H.323 are listed below. H.323 is independent of the packet network and the

transport protocols over which it runs and does not specify them.

o Audio CODEC

o Video CODEC

o H.225 registration, admission, and status (RAS)

o H.225 call signaling

o H.245 control signaling

o Real-time Transfer Protocol (RTP)

o Real-time Control Protocol (RTCP)

Audio CODEC

An audio CODEC encodes the audio signal from the microphone for transmission on the transmitting

H.323 terminal and decodes the received audio code that is sent to the speaker on the receiving H.323

terminal. Because audio is the minimum service provided by the H.323 standard, all H.323 terminals

must support at least one audio CODEC, as specified in the ITU-T G.711 recommendation

(audio coding at 64 kbps). Additional audio CODEC recommendations such as G.722 (64, 56, and 48

kbps), G.723.1 (5.3 and 6.3 kbps), G.728 (16 kbps), and G.729 (8 kbps) may also be supported.
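The bandwidth cost of these CODECs is easy to relate to packetization: payload bytes per packet equal the bit rate times the packetization interval. A quick sketch (the 20 ms interval is a typical choice for illustration, not something mandated by H.323):

```python
# Per-packet audio payload size for the H.323 audio CODECs listed above,
# as a function of bit rate and packetization interval.
def payload_bytes(bit_rate_kbps: float, frame_ms: float = 20.0) -> float:
    return bit_rate_kbps * 1000 * (frame_ms / 1000.0) / 8.0

codecs = {"G.711": 64, "G.722": 64, "G.728": 16, "G.729": 8, "G.723.1": 6.3}
for name, rate in sorted(codecs.items()):
    print(f"{name}: {payload_bytes(rate):.2f} bytes per 20 ms packet")
```

For example, G.711 at 64 kbps carries 160 payload bytes per 20 ms packet, while G.729 at 8 kbps carries only 20, which is why low bit-rate codecs dominate in bandwidth-constrained IP telephony.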


Video CODEC

A video CODEC encodes video from the camera for transmission on the transmitting H.323 terminal

and decodes the received video code that is sent to the video display on the receiving H.323 terminal.

Because H.323 specifies support of video as optional, the support of video CODECs is optional as

well. However, any H.323 terminal providing video communications must support video encoding and

decoding as specified in the ITU-T H.261 recommendation.

H.225 Registration, Admission, and Status

Registration, admission, and status (RAS) is the protocol between endpoints (terminals and gateways)

and gatekeepers. The RAS is used to perform registration, admission control, bandwidth changes,

status, and disengage procedures between endpoints and gatekeepers. An RAS channel is used to

exchange RAS messages. This signaling channel is opened between an endpoint and a gatekeeper

prior to the establishment of any other channels.

H.225 Call Signaling

The H.225 call signaling is used to establish a connection between two H.323 endpoints. This is

achieved by exchanging H.225 protocol messages on the call-signaling channel. The call-signaling

channel is opened between two H.323 endpoints or between an endpoint and the gatekeeper.

H.245 Control Signaling

H.245 control signaling is used to exchange end-to-end control messages governing the operation of

the H.323 endpoint. These control messages carry information related to the following:

o Capabilities exchange

o Opening and closing of logical channels used to carry media streams

o Flow-control messages

o General commands and indications

Real-Time Transport Protocol

Real-time transport protocol (RTP) provides end-to-end delivery services of real-time audio and video.

Whereas H.323 is used to transport data over IP based networks, RTP is typically used to transport

data via the user datagram protocol (UDP). RTP, together with UDP, provides transport-protocol

functionality. RTP provides payload-type identification, sequence numbering, timestamping, and


delivery monitoring. UDP provides multiplexing and checksum services. RTP can also be used with

other transport protocols.
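The RTP services named above (payload-type identification, sequence numbering, timestamping) live in RTP's 12-byte fixed header, defined in RFC 3550. A minimal sketch of packing one, with invented example field values:

```python
import struct

def rtp_header(payload_type: int, seq: int, timestamp: int, ssrc: int) -> bytes:
    """Pack a minimal 12-byte RTP fixed header (RFC 3550): version 2,
    no padding, no extension, no CSRCs, marker bit clear."""
    vpxcc = 0x80                   # V=2, P=0, X=0, CC=0
    m_pt = payload_type & 0x7F     # M=0, 7-bit payload type
    return struct.pack("!BBHII", vpxcc, m_pt, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

# Payload type 0 is PCMU (G.711 mu-law); at an 8 kHz clock the
# timestamp advances by 160 samples per 20 ms packet.
hdr = rtp_header(payload_type=0, seq=1, timestamp=160, ssrc=0x12345678)
print(len(hdr), hdr.hex())
```

A receiver's playout buffer uses exactly these sequence numbers and timestamps to detect losses and to reconstruct packet timing.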

Real-Time Transport Control Protocol

Real-time transport control protocol (RTCP) is the counterpart of RTP that provides control services.

The primary function of RTCP is to provide feedback on the quality of the data distribution. Other

RTCP functions include carrying a transport-level identifier for an RTP source, called a canonical

name, which is used by receivers to synchronize audio and video.

Terminal Characteristics

H.323 terminals must support the following:

o H.245 for exchanging terminal capabilities and creation of media channels

o H.225 for call signaling and call setup

o RAS for registration and other admission control with a gatekeeper

o RTP/RTCP for sequencing audio and video packets

H.323 terminals must also support the G.711 audio CODEC. Optional components in an H.323 terminal are video CODECs, T.120 data-conferencing protocols, and MCU capabilities.

Gateway Characteristics

A gateway provides translation of protocols for call setup and release, conversion of media formats

between different networks, and the transfer of information between H.323 and non-H.323 networks.

An application of the H.323 gateway is in IP telephony, where the H.323 gateway connects an IP

network and SCN network (e.g., ISDN network).

On the H.323 side, a gateway runs H.245 control signaling for exchanging capabilities, H.225 call

signaling for call setup and release, and H.225 registration, admissions, and status (RAS) for

registration with the gatekeeper. On the SCN side, a gateway runs SCN-specific protocols (e.g., ISDN

and SS7 protocols).

Terminals communicate with gateways using the H.245 control-signaling protocol and H.225 call-

signaling protocol. The gateway translates these protocols in a transparent fashion to the respective

counterparts on the non-H.323 network and vice versa. The gateway also performs call setup and


clearing on both the H.323-network side and the non-H.323-network side. Translation between audio,

video, and data formats may also be performed by the gateway. Audio and video translation may not

be required if both terminal types find a common communications mode. For example, in the case of a

gateway to H.320 terminals on the ISDN, both terminal types require G.711 audio and H.261 video, so

a common mode always exists. The gateway has the characteristics of both an H.323 terminal on the

H.323 network and the other terminal on the non-H.323 network it connects.

Gatekeepers are aware of which endpoints are gateways because this is indicated when the terminals

and gateways register with the gatekeeper. A gateway may be able to support several simultaneous

calls between the H.323 and non-H.323 networks. In addition, a gateway may connect an H.323

network to a non-H.323 network. A gateway is a logical component of H.323 and can be implemented

as part of a gatekeeper or an MCU.

Gatekeeper Characteristics

Gatekeepers provide call-control services for H.323 endpoints, such as address translation and

bandwidth management as defined within RAS. Gatekeepers in H.323 networks are optional. If they

are present in a network, however, terminals and gateways must use their services. The H.323

standard both defines mandatory services that the gatekeeper must provide and specifies other optional

functionality that it can provide.

An optional feature of a gatekeeper is call-signaling routing. Endpoints send call-signaling messages

to the gatekeeper, which the gatekeeper routes to the destination endpoints. Alternately, endpoints can

send call-signaling messages directly to the peer endpoints. This feature of the gatekeeper is valuable,

as monitoring of the calls by the gatekeeper provides better control of the calls in the network. Routing

calls through gatekeepers provides better performance in the network, as the gatekeeper can make

routing decisions based on a variety of factors, for example, load balancing among gateways.

A gatekeeper is optional in an H.323 system. The services offered by a gatekeeper are defined by RAS

and include address translation, admissions control, bandwidth control, and zone management. H.323

networks that do not have gatekeepers may not have these capabilities, but H.323 networks that

contain IP-telephony gateways should also contain a gatekeeper to translate incoming E.164 telephone


addresses into transport addresses. A gatekeeper is a logical component of H.323 but can be

implemented as part of a gateway or MCU.

Mandatory Gatekeeper Functions

Address Translation

Calls originating within an H.323 network may use an alias to address the destination terminal. Calls

originating outside the H.323 network and received by a gateway may use an E.164 telephone number

(e.g., 310-442-9222) to address the destination terminal. The gatekeeper translates this E.164

telephone number or the alias into the network address (e.g., 204.252.32:456 for an IP-based network)

for the destination terminal. The destination endpoint can be reached using the network address on the

H.323 network.
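Address translation can be pictured as a lookup table the gatekeeper maintains from registrations. The sketch below is a toy illustration with invented names and addresses, not an H.323 implementation:

```python
# Toy gatekeeper address-translation table: aliases and E.164 numbers
# both map to transport addresses on the H.323 network.
class Gatekeeper:
    def __init__(self):
        self._table = {}

    def register(self, alias: str, e164: str, transport: str):
        # A registering endpoint supplies both kinds of name.
        self._table[alias] = transport
        self._table[e164] = transport

    def translate(self, name: str) -> str:
        """Resolve an alias or an E.164 number to a transport address."""
        return self._table[name]

gk = Gatekeeper()
gk.register("alice", "3104429222", "10.0.0.5:1720")
print(gk.translate("alice"))        # lookup by alias
print(gk.translate("3104429222"))   # lookup by E.164 number
```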

Admission Control

The gatekeeper can control the admission of the endpoints into the H.323 network. It uses RAS

messages, admission request (ARQ), confirm (ACF), and reject (ARJ) to achieve this. Admissions

control may be a null function that admits all endpoints to the H.323 network.

Bandwidth Control

The gatekeeper provides support for bandwidth control by using the RAS messages, bandwidth

request (BRQ), confirm (BCF), and reject (BRJ). For instance, if a network manager has specified a

threshold for the number of simultaneous connections on the H.323 network, the gatekeeper can refuse

to make any more connections once the threshold is reached. The result is to limit the total allocated

bandwidth to some fraction of the total available, leaving the remaining bandwidth for data

applications. Bandwidth control may also be a null function that accepts all requests for bandwidth

changes.
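The threshold behavior described above can be sketched as a counter the gatekeeper consults before confirming a bandwidth request; the numbers are illustrative:

```python
# Toy bandwidth control: the gatekeeper confirms a bandwidth request
# (BCF) only while total allocated bandwidth stays under a configured
# limit; otherwise it rejects the request (BRJ).
class BandwidthManager:
    def __init__(self, limit_kbps: int):
        self.limit = limit_kbps
        self.allocated = 0

    def request(self, kbps: int) -> str:
        if self.allocated + kbps <= self.limit:
            self.allocated += kbps
            return "BCF"
        return "BRJ"

bm = BandwidthManager(limit_kbps=256)   # leave the rest for data traffic
results = [bm.request(64) for _ in range(5)]
print(results)
```

With a 256 kbps limit, four 64 kbps calls are confirmed and the fifth is rejected, which is exactly the "fraction of the total available" policy the text describes.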

Zone Management

The gatekeeper provides the above functions (address translation, admissions control, and bandwidth

control) for terminals, gateways, and MCUs located within its zone of control. An H.323 zone is defined above.


Optional Gatekeeper Functions

Call-Control Signaling

The gatekeeper can route call-signaling messages between H.323 endpoints. In a point-to-point

conference, the gatekeeper may process H.225 call-signaling messages. Alternatively, the gatekeeper

may allow the endpoints to send H.225 call-signaling messages directly to each other.

Call Authorization

When an endpoint sends call-signaling messages to the gatekeeper, the gatekeeper may accept or

reject the call, according to the H.225 specification. The reasons for rejection may include access-

based or time-based restrictions, to and from particular terminals or gateways.

Call Management

The gatekeeper may maintain information about all active H.323 calls so that it can control its zone by

providing the maintained information to the bandwidth-management function or by rerouting the calls

to different endpoints to achieve load balancing.

Gateway and Gatekeeper Characteristics

The H.225 RAS is used between H.323 endpoints (terminals and gateways) and gatekeepers for the

following:

o Gatekeeper discovery (GRQ)

o Endpoint registration

o Endpoint location

o Admission control

o Access tokens

The RAS messages are carried on a RAS channel that is unreliable. Hence, RAS message exchange

may be associated with timeouts and retry counts.
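The timeout-and-retry discipline for the unreliable RAS channel can be sketched as follows; `send` and `wait_reply` are hypothetical stand-ins for real socket operations:

```python
# Retry-with-timeout for a request on an unreliable channel, as the
# text describes for RAS message exchange.
def ras_exchange(send, wait_reply, retries: int = 3, timeout_s: float = 1.0):
    for _ in range(retries):
        send()
        reply = wait_reply(timeout_s)   # returns None on timeout
        if reply is not None:
            return reply
    raise TimeoutError("no RAS reply after %d attempts" % retries)

# Example: the first two replies are lost, the third arrives.
replies = iter([None, None, "GCF"])
result = ras_exchange(send=lambda: None,
                      wait_reply=lambda t: next(replies))
print(result)
```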

Gatekeeper Discovery

The gatekeeper discovery process is used by the H.323 endpoints to determine the gatekeeper with

which the endpoint must register. The gatekeeper discovery can be done statically or dynamically. In

static discovery, the endpoint knows the transport address of its gatekeeper a priori. In the dynamic

method of gatekeeper discovery, the endpoint multicasts a GRQ message on the gatekeeper's


discovery multicast address: "Who is my gatekeeper?" One or more gatekeepers may respond with a

GCF message: "I can be your gatekeeper."

Endpoint Registration

Registration is a process used by the endpoints to join a zone and inform the gatekeeper of their transport and alias addresses. All endpoints register with a gatekeeper as part of their configuration.

Endpoint Location

Endpoint location is a process by which the transport address of an endpoint is determined from its alias name or E.164 address.

Other Control

The RAS channel is used for other kinds of control mechanisms, such as admission control, to restrict

the entry of an endpoint into a zone, bandwidth control, and disengagement control, where an endpoint

is disassociated from a gatekeeper and its zone.


H.225 Call Signaling

H.225 call signaling is used to set up connections between H.323 endpoints (terminals and gateways),

over which the real-time data can be transported. Call signaling involves the exchange of H.225

protocol messages over a reliable call-signaling channel. For example, H.225 protocol messages are

carried over TCP in an IP-based H.323 network.

H.225 messages are exchanged between the endpoints if there is no gatekeeper in the H.323 network.

When a gatekeeper exists in the network, the H.225 messages are exchanged either directly between

the endpoints or between the endpoints after being routed through the gatekeeper. The first case is

direct call signaling. The second case is called gatekeeper-routed call signaling. The method chosen is

decided by the gatekeeper during RAS-admission message exchange.

Gatekeeper-Routed Call Signaling

The admission messages are exchanged between endpoints and the gatekeeper on RAS channels. The

gatekeeper receives the call-signaling messages on the call-signaling channel from one endpoint and

routes them to the other endpoint on the call-signaling channel of the other endpoint.


Direct Call Signaling

During the admission confirmation, the gatekeeper indicates that the endpoints can exchange call-

signaling messages directly. The endpoints exchange the call signaling on the call-signaling channel.

H.245 Control Signaling

H.245 control signaling consists of the exchange of end-to-end H.245 messages between

communicating H.323 endpoints. The H.245 control messages are carried over H.245 control

channels. The H.245 control channel is logical channel 0 and is permanently open, unlike the

media channels. The messages carried include messages to exchange capabilities of terminals and to

open and close logical channels.

Capabilities Exchange

Capabilities exchange is the process by which the communicating terminals exchange messages to provide

their transmit and receive capabilities to the peer endpoint. Transmit capabilities describe the

terminal's ability to transmit media streams. Receive capabilities describe a terminal's ability to receive

and process incoming media streams.

Logical Channel Signaling

A logical channel carries information from one endpoint to another endpoint (in the case of a point-to-

point conference) or multiple endpoints (in the case of a point-to-multipoint conference). H.245

provides messages to open or close a logical channel; a logical channel is unidirectional.

H.225 Call Signaling and H.245 Control Signaling

This module describes the steps involved in creating an H.323 call, establishing media

communication, and releasing the call. The example network contains two H.323 terminals (T1 and

T2) connected to a gatekeeper. Direct call signaling is assumed. It is also assumed that the media

stream uses RTP encapsulation. Figure 11 illustrates H.323 call establishment.


ARQ (1)

ACF (2)

SETUP (3)

CALL PROCEEDING (4)

ARQ (5)

ACF (6)

ALERTING (7)

CONNECT (8)

Figure 11. H.323 Call Establishment

1. T1 sends the RAS ARQ message on the RAS channel to the gatekeeper, requesting admission. T1 requests the use of direct call signaling.

2. The gatekeeper confirms the admission of T1 by sending ACF to T1. The gatekeeper indicates in ACF that

T1 can use direct call signaling.

3. T1 sends an H.225 call signaling setup message to T2 requesting a connection.

4. T2 responds with an H.225 call proceeding message to T1.

5. Now T2 has to request admission from the gatekeeper. It sends an RAS ARQ message to the gatekeeper on the RAS channel.

6. The gatekeeper confirms the admission by sending an RAS ACF message to T2.

7. T2 alerts T1 of the connection establishment by sending an H.225 alerting message.

8. Then T2 confirms the connection establishment by sending an H.225 connect message to T1, and the call is

established.
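The eight steps above can be written down as a message trace; the hypothetical check below merely confirms that each endpoint obtains gatekeeper admission (ACF) before it sends SETUP or CONNECT:

```python
# The call-establishment flow of Figure 11 as (sender, receiver, message)
# triples; "GK" is the gatekeeper.
FLOW = [
    ("T1", "GK", "ARQ"), ("GK", "T1", "ACF"),
    ("T1", "T2", "SETUP"), ("T2", "T1", "CALL PROCEEDING"),
    ("T2", "GK", "ARQ"), ("GK", "T2", "ACF"),
    ("T2", "T1", "ALERTING"), ("T2", "T1", "CONNECT"),
]

def admitted_before_signaling(flow, endpoint):
    # Index of the gatekeeper's ACF to this endpoint...
    acf = next(i for i, (s, d, m) in enumerate(flow)
               if s == "GK" and d == endpoint and m == "ACF")
    # ...must come before the endpoint's first SETUP or CONNECT.
    first_sig = next(i for i, (s, d, m) in enumerate(flow)
                     if s == endpoint and m in ("SETUP", "CONNECT"))
    return acf < first_sig

print(admitted_before_signaling(FLOW, "T1"))
print(admitted_before_signaling(FLOW, "T2"))
```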

Connection Procedures

The H.323 protocol is specified so that it interoperates with other networks. The most popular H.323 interworking is IP telephony, in which the underlying network of H.323 is an IP network and the interoperating network is an SCN (see Figure 12). SCNs include PSTN and ISDN networks.


H.323 Network (IP based)

SCN

terminal

Gateway Gatekeeper

phones

Figure 12. IP Telephony: H.323 Interworking with SCN

Appendix II: Intrusive and Non-Intrusive Evaluation of Speech Quality

PESQ

Previous objective speech quality assessment models, such as bark spectral distortion (BSD), the

perceptual speech quality measure (PSQM), and measuring normalizing blocks (MNB), have been

found to be suitable for assessing only a limited range of distortions. A new model has therefore been

developed for use across a wider range of network conditions, including analogue connections, codecs,

packet loss and variable delay. Known as perceptual evaluation of speech quality (PESQ), it is the

result of integration of the perceptual analysis measurement system (PAMS) and PSQM99, an

enhanced version of PSQM. PESQ is expected to become a new ITU-T recommendation P.862,

replacing P.861 which specified PSQM and MNB.

Figure 13. The PESQ Strategy

The model begins by level aligning both signals to a standard listening level (See Fig.13). They are

filtered (using an FFT) with an input filter to model a standard telephone handset. The signals are

aligned in time and then processed through an auditory transform similar to that of PSQM. The


transformation also involves equalizing for linear filtering in the system and for gain variation. Two

distortion parameters are extracted from the disturbance (the difference between the transforms of the

signals), and are aggregated in frequency and time and mapped to a prediction of subjective mean

opinion score (MOS). Some details are discussed below. The time alignment of PESQ assumes that

the delay of the system is piecewise constant. This assumption appears to be valid for a wide range of

systems, including packet-based transmission such as voice over IP (VoIP). Delay changes are

allowed in silent periods (where they will normally be inaudible) and in speech (where they are

usually audible). The signals are aligned using the following steps.

• Narrowband filter applied to both signals to emphasize perceptually important parts. These filtered signals

are only used for time alignment.

• Envelope-based delay estimation.

• Division of reference signal into utterances.

• Envelope-based delay estimation for each utterance.

• Fine correlation histogram-based delay identification for each utterance.

• Utterance splitting and re-alignment to test for delay changes during speech.

These give a delay estimate for each utterance, which is used to find the frame-by-frame delay for use

in the auditory transform.
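The envelope-based delay estimation in the steps above can be sketched as follows, assuming numpy and a crude moving-average envelope; the actual envelope computation and the histogram-based refinement are specified in ITU-T P.862, so this is only an approximation of the idea:

```python
import numpy as np

def envelope(x, win=32):
    """Crude energy envelope: moving average of |x| (an assumption;
    PESQ's actual envelope computation is specified in ITU-T P.862)."""
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

def estimate_delay(ref, deg, win=32):
    """Envelope-based delay estimate: the lag maximising the cross-correlation
    of the two envelopes (positive lag = degraded signal arrives later)."""
    c = np.correlate(envelope(deg, win), envelope(ref, win), mode="full")
    return int(np.argmax(c)) - (len(ref) - 1)
```

For a degraded signal that is simply a delayed copy of the reference, the argmax of the envelope cross-correlation recovers the delay in samples.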

The auditory transform in PESQ is a psychoacoustic model which maps the signals into a

representation of perceived loudness in time and frequency. It includes the following stages.

Bark spectrum. An FFT with a Hamming window is used to calculate the instantaneous power

spectrum in each frame, for 50% overlapping frames of 32ms duration. This is grouped without

smearing into 42 bins, equally spaced in perceptual frequency on a modified Bark scale similar to that

of PSQM [2].

Frequency equalization. The mean Bark spectrum for active speech frames is calculated. The ratio

between the spectra of reference and degraded gives a transfer function estimate, assuming that the

system under test has a constant frequency response. The reference is equalized to the degraded signal

using this estimate, with bounds to limit the equalization to ±20dB.


Equalization of gain variation. The ratio between the audible power of the reference and the degraded

in each frame is used to identify gain variations. This is filtered with a first-order lowpass filter, and

bounded, then the degraded signal is equalized to the reference.

Loudness mapping. The Bark spectrum is mapped to (Sone) loudness, including a frequency-

dependent threshold and exponent. This gives the perceived loudness in each time-frequency cell.

The absolute difference between the degraded and the reference signals gives a measure of audible

error. In PESQ, this is processed through several steps before a non-linear average over time and

frequency is calculated.

Deletion. A deletion (a negative delay change) leaves a section which overlaps in the degraded signal.

If the deletion is longer than half a frame, the overlapping sections are discarded.

Masking. Masking in each time-frequency cell is modeled using a simple threshold below which

disturbances are inaudible; this is set to the lesser of the loudness of the reference and degraded

signals, divided by four. The threshold is subtracted from the absolute loudness difference, and values

less than zero are set to zero. Methods for applying masking over distances larger than one time-

frequency cell were examined with earlier versions of PSQM and PSQM99, but did not improve

overall performance [14], and were not used in PESQ.

Asymmetry. Unlike P.861 PSQM [2], PESQ computes two different error averages, one without and

one with an asymmetry factor. The PESQ asymmetry factor is calculated from the ratio of

the Bark spectral density of the degraded to the reference signals in each time-frequency cell. This is

raised to the power 1.2 and is bounded with an upper limit of 12.0. Values smaller than 3.0 are set to

zero. The asymmetric weighted disturbance, obtained by multiplying by this factor, thus measures

only additive distortions.
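A minimal sketch of the asymmetry-factor computation just described, per time-frequency cell (P.862 adds small constants to the densities before taking the ratio; the `eps` term here stands in for those as an assumption):

```python
def asymmetry_factor(deg_density, ref_density, eps=1e-12):
    """Asymmetry factor as described above: the ratio of degraded to
    reference Bark spectral density, raised to the power 1.2, bounded by
    an upper limit of 12.0; values below 3.0 are set to zero so that only
    additive distortions contribute."""
    r = ((deg_density + eps) / (ref_density + eps)) ** 1.2
    r = min(r, 12.0)
    return 0.0 if r < 3.0 else r
```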

Following the understanding that localized errors dominate perception [9], PESQ integrates

disturbance over several time-frequency scales using a method designed to take optimal account of the

distribution of error in time and amplitude. The disturbance values are aggregated using an Lp norm,

which calculates a non-linear average using the following formula:

Lp = [ (1/N) * Σ_{m=1..N} disturbance[m]^p ]^(1/p)


The disturbance is first summed across frequency using an Lp norm, giving a frame-by-frame

measure of perceived distortion. This frame disturbance is multiplied by two weightings. The first

weight is inversely proportional to the instantaneous energy of the reference, raised to the power 0.04,

giving slightly greater emphasis on sections for which the reference is quieter. This process replaces

the silent interval weighting used in P.861. After this, the frame disturbance is bounded with an upper

limit of 45. The second weight gives reduced emphasis on the start of the signal if the total length is

over 16s, modeling the effect of short-term memory in subjective listening. This multiplies the frame

disturbance at the start of the signal by a factor decreasing linearly from 1.0 (for files shorter than 16

seconds) to 0.5 (for files longer than 60 seconds). After weighting, the frame disturbance is averaged

in time over split-second intervals of 20 frames (approximately 320 ms, accounting for the overlap of frames)

using Lp norms. These intervals overlap 50%, and no window function is used. The split-second

disturbance values are finally averaged over the length of the speech files, again using Lp norms. Thus

the aggregation process uses three Lp norms – in general with different values of p – to map the

disturbance to a single figure. The value of p is higher for averaging over the split second intervals to

give greatest weight to localized distortions. The symmetric and asymmetric disturbance are averaged

separately.
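The second (short-term memory) weighting described above can be written out directly; this is a sketch of the stated linear rule only, and the exact extent of the "start of the signal" over which it applies is left to P.862:

```python
def start_weight(total_len_s):
    """Weight applied to frame disturbances at the start of the signal:
    1.0 for files up to 16 s, decreasing linearly to 0.5 for files of
    60 s or more, modeling short-term memory in subjective listening."""
    if total_len_s <= 16.0:
        return 1.0
    if total_len_s >= 60.0:
        return 0.5
    return 1.0 - 0.5 * (total_len_s - 16.0) / (60.0 - 16.0)
```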

To train PESQ a large number of different symmetric and asymmetric disturbance parameters were

calculated by using multiple values of p for each of the three averaging stages. A linear combination of

disturbance parameters was used as a predictor of subjective MOS. A further regression is required for

each subjective test to account for context and voting preferences of different subjects, as discussed in

section 3; for calibration a linear mapping was also used at this stage. Parameter selection was

performed for all candidate sets of up to four disturbance parameters. The optimal combination –

giving the highest average correlation coefficient – was found. This enabled the best parameters to be

chosen from the full set of several hundred candidate disturbance parameters. The use of partial

compensation in PESQ, for example in equalizing for gain modulation, avoids the need for using a

large number of parameters to predict quality. A combination of only two parameters – one symmetric

disturbance (dSYM) and one asymmetric disturbance (dASYM) – gave a good balance between

accuracy of prediction and ability to generalize. However, as this low-dimension model depends on


earlier stages to incorporate complex perceptual effects, several design iterations were required.

Coefficients in the auditory transform and disturbance processing were optimized then the optimal

parameter combination was found, and the process repeated several times. Final training was

performed on a database of 30 subjective tests, giving the following output mapping used in PESQ:

PESQMOS = 4.5 - 0.1 * dSYM - 0.0309 * dASYM

For normal subjective test material the values lie between 1.0 (bad) and 4.5 (no distortion). In extremely high distortion the PESQMOS may fall below 1.0, but this is very uncommon.
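The output mapping above can be expressed as a one-line function; note that, as stated, the result is not clipped at 1.0:

```python
def pesq_mos(d_sym, d_asym):
    """PESQ output mapping: PESQMOS = 4.5 - 0.1*dSYM - 0.0309*dASYM.
    For normal material the result lies between 1.0 (bad) and 4.5
    (no distortion)."""
    return 4.5 - 0.1 * d_sym - 0.0309 * d_asym
```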

The ITU-T R-Factor (from Psytechnics Society)

The essential information about the ITU-T R Factor has been provided previously. In this part of

Appendix II, some additional general information is provided and discussed.

The E-Model is a planning tool for estimating the overall quality in a telephone network. It was first

submitted to standards bodies in 1993 although its origins date back to the models first developed in

the 1960’s by BT, Bellcore and others. The basic premise for the model is that impairments are always

psychologically additive. Simply put, if network impairments such as noise, echo, delay, codec

performance, jitter, etc. are suitably combined, then an overall objective rating of quality or “caller

experience” can be estimated. The basic formula for the E-Model is given below.

R = 100 - Is - Id - Ie,eff + A

• R Factor: Overall network quality rating (ranges between 0 and 100)

• Ro: Signal to noise ratio

• Is: Impairments simultaneous to voice signal transmission

• Id: Impairments delayed after voice signal transmission

• Ie,eff (or Ie): Effects of Equipment (e.g. codec)

• A: Advantage factor (attempts to account for caller expectations)

In simple terms, the overall quality (R Factor) is calculated by estimating the signal to noise ratio of a

connection (Ro) and subtracting the network impairments (Is, Id, Ie) that in turn are offset by any

expectations of quality had by the caller (A).
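A sketch of this simplified rating computation, using the symbols from the bullets above. The clamp to [0, 100] is an assumption reflecting the stated range of the R Factor, and the fixed value 100 for the basic term is the simplification used here (ITU-T G.107 derives Ro from noise parameters, with a default near 93, rather than fixing it):

```python
def r_factor(i_s, i_d, i_e_eff, a=0.0):
    """Simplified E-Model rating: R = 100 - Is - Id - Ie,eff + A,
    clamped to the R Factor's stated range [0, 100]."""
    return max(0.0, min(100.0, 100.0 - i_s - i_d - i_e_eff + a))
```

For example, a connection with Is = 10, Id = 15, Ie,eff = 11 and an advantage factor A = 5 would rate R = 69.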

While a network is still on paper, a network planner can use the E Model to estimate its likely quality.

The engineer gathers input information from reference tables, enters it into the E-Model, and


calculates the resulting Transmission Quality Rating (R Factor). The table below shows how R Factor

values may be interpreted.

Some of the inputs to the E Model are complex mathematical formulae which consider various

impairments acting together. These formulae are visually represented below by the Calc Ro, Calc Id,

and Calc Is boxes.

Figure 14. The ITU-T R-Factor

The general approach used for live measurement situations is to measure a limited number of E-Model

parameters while making assumptions for non-measured parameters. In the example below, an

objective speech quality measurement is outputting a Mean Opinion Score (MOS) which is converted

to an Ie value.

Network Assumptions are used on some inputs while an objective speech quality measure is providing

an Ie value. Used in this capacity, an engineer may be able to compare original estimated R Ratings


with actual R ratings achieved in a live situation. Clearly, for this scenario to be beneficial the

objective speech quality measure must be accurate. The E-Model recommendation has various Ie

tables. For planning purposes these provide Ie values for codec combinations as well as for recent VoIP

degradations such as packet loss. However, the voice quality of a live VoIP system can be radically

different from the numbers in the ITU tables. A live monitoring system must be able to accurately

measure speech quality rather than use IP network statistics to look up Ie in the tables.

Figure 15. A non-intrusive strategy for measuring objective voice quality (Psytechnics)

The E-Model was designed to provide estimated network quality and has been shown to be reasonably

accurate for this purpose. It has not been accepted as a valid measurement tool for live networks.

Increasingly, and against ITU recommendations, the E-Model is being marketed to the industry as a

live voice quality measurement tool. The ITU-T G.107 Recommendation states at the beginning of the

document that “Such estimates are only made for transmission planning purposes and not for actual

customer opinion prediction (for which there is no agreed-upon model recommended by the ITU-T).”

It also provides a caution with the following paragraph. “The E-Model has not been fully verified by

field surveys or laboratory tests for the very large number of possible combinations of input

parameters. For many combinations of high importance to transmission planners, the E-Model can be

used with confidence, but for other parameter combinations, E-Model predictions have been


questioned and are currently under study. Accordingly, caution must be exercised when using the E-

Model for some conditions; for example, the E-Model may give inaccurate results for combinations of

certain types of impairments. Annex A provides further information in this regard.”

Appendix III: The ITU-T G.729A-VAD, a low bit-rate speech codec (Lucent & Bell

Laboratories)

Overview

G.729, also known as CS-ACELP (Conjugate Structure Algebraic Code Excited Linear Prediction), is

specified by the ITU (International Telecommunications Union). It compresses speech from 16 bit,

8 kHz samples (128 kbps) to 8 kbps, and was designed for cellular and networking applications. It

provides "toll quality" speech (that is, as good as the telephone network), works well with background

noise, and has been designed to perform well under error conditions. It and G.723.1 (a 5.3/6.3 kbps

dual-mode speech coder) are the main contenders for the baseline codec for internet telephony. G.729

fits into the general category of CELP (Code Excited Linear Prediction) speech coders. These coders

are all based on a model of the human vocal system. In that model, the throat and mouth are modeled

as a linear filter, and voice is generated by a periodic vibration of air exciting this filter. In the

frequency domain, this implies that speech looks somewhat like a smooth response (called the

envelope), modulated by a set of discrete frequency components. CELP coders all vary in the manner

in which the excitation is specified, and the way in which the coefficients of the filter are represented.

All of them generally break speech up into units called frames, which can be anywhere from 1ms to

100ms in duration. For each frame of speech, a set of parameters for the model are generated and sent

to the decoder. This implies that the frame time represents a lower bound on the system delay; the

encoder must wait for at least a frame's worth of speech before it can even begin the encoding process. In

G.729, each frame is 10ms, or 80 samples, in duration. This frame is further broken into two 5ms

subframes. The filter parameters are specified just once for each frame, but each subframe has its own

excitation specified. It is also important to note that speech can generally be classified into two types:

voiced and unvoiced. Voiced sounds, such as b, d, and g, are generated from the throat, whereas

unvoiced sounds, such as th, f, and sh, are generated from the mouth. The model works better for

voiced sounds, but the excitation can be tailored for voiced or unvoiced so that it works in both cases.


The approach for finding the filter parameters and excitation is called analysis by synthesis. The

encoder searches through the parameter space, performing the decode operation in each loop of the

search. The output of the decoder (the synthesized signal), is compared with the original speech signal.

The parameters which yield the closest match are then chosen, and sent to the decoder. In this fashion,

we have analyzed the signal by repeatedly synthesizing the output of the decoder, and thus the name

analysis by synthesis. First, we discuss how G.729 computes and transmits the filter coefficients. Then

we discuss the excitation.

In G.729, the filter is a 10th order all-pole filter. Since it is used to synthesize voice, it is also called

the synthesis filter. Its inverse, the analysis filter, is an all-zero FIR filter (which we denote by A(z)).

When speech is passed through it, the result is the excitation for that speech. In fact, the analysis filter

can be thought of as a linear predictor, the output of which is the error signal in predicting the speech

from the past 10 samples. With this in mind, the problem of finding the coefficients of the analysis

filter reduces to finding the optimal 10th order linear predictor for a given signal. This problem is well

known, and the solution is a function of the correlation matrix of the speech. The correlation function,

however, is likely to vary over time. In each frame, it is re-measured over a 30ms interval. This

interval consists of 15ms from the past, 10ms from the current frame, and 5ms from the future. Of

course, this look-ahead of 5ms requires the encoder to wait an additional 5ms beyond the 10ms frame

delay. This means that the total encoding delay, also known as the algorithmic delay, is 15ms. Instead

of just computing the correlation directly from those speech samples, a window, called the LP analysis

window, is applied to the samples. The window is half of a Hamming window on one side, and a

quarter cosine cycle on the other side. The curved shape of the window helps emphasize the current

time as opposed to the future or past when computing the correlation function. One further step is

taken before the correlation coefficients are used to generate the filter coefficients. For high pitch

speech signals (such as those from females), the modulation frequency of the spectral envelope is

higher than for lower-pitch signals. Thus, the LP analysis will tend to result in filters which

underestimate the envelope at frequencies between the pitch periods. To resolve this, the correlation

coefficients are multiplied by a Gaussian function. This is equivalent to convolution in frequency of

the spectral envelope by a Gaussian. The result is a widening of the peaks of the spectral envelope,


filling in the gaps. With the correlation function r(k) computed, the 10 LP analysis filter coefficients

can be computed. The optimal coefficients ai, i = 1..10, are the solution of the well-known Yule-Walker

(normal) equations Σ_{j=1..10} aj · r(|i − j|) = r(i), i = 1..10. These are readily solved with the

Levinson-Durbin algorithm, which defines an iterative approach to their solution. The next step is to

quantize the filter coefficients. However, just quantizing

them directly has several drawbacks. First, it is possible that the quantization noise may move one of

the poles of the synthesis filter outside of the unit circle, yielding an unstable filter. Secondly, since

human perception of noise is based on frequency components, it is hard to relate the quantization noise

of the coefficients to the noise that will actually be perceived. To resolve this, the coefficients are

transformed into Line Spectral Frequencies, or LSF's. This is done by defining two new polynomials:

F1(z) = A(z) + z^(-11) · A(z^(-1))

F2(z) = A(z) − z^(-11) · A(z^(-1))

The LSF's are defined as the zeroes of these polynomials. These two polynomials have several

important characteristics:

1. Their zeroes lie on the unit circle

2. Their zeroes alternate each other

3. For any two polynomials defined as above, with their zeroes on the unit circle and alternating, the filter A(z)

is minimum phase, and therefore, its inverse, the synthesis filter, is stable.

4. A change in any LSF causes a change in the shape of the analysis filter only in a small frequency range

around the frequency of that LSF.
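The construction of F1(z) = A(z) + z^(-(p+1))·A(1/z) and F2(z) = A(z) − z^(-(p+1))·A(1/z), together with property 1 (zeroes on the unit circle), can be checked numerically. The sketch below assumes numpy and uses a toy low-order minimum-phase A(z) rather than a real 10th-order G.729 filter:

```python
import numpy as np

def lsf_polynomials(a):
    """Form F1(z) = A(z) + z^-(p+1) A(1/z) and F2(z) = A(z) - z^-(p+1) A(1/z)
    from analysis-filter coefficients a = [1, a1, ..., ap] (p = 10 in G.729;
    any order works for illustration). Coefficients are in powers of z^-1."""
    ext = np.append(a, 0.0)        # A(z), padded to degree p+1
    rev = np.append(0.0, a[::-1])  # z^-(p+1) A(1/z): reversed, shifted by one
    return ext + rev, ext - rev

# Toy minimum-phase A(z): all zeroes inside |z| = 1, so the zeroes of
# F1 and F2 land on the unit circle (their angles are the LSFs).
a = np.poly([0.5, -0.3, 0.2])
f1, f2 = lsf_polynomials(a)
```

Note that f1 is palindromic and f2 antipalindromic, which is what forces their roots onto the unit circle for a minimum-phase A(z).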

Because of property 3, the decoder can easily verify stability of the filter by making sure the zeroes are

on the unit circle and alternate. Property 4 allows the quantization of the LSF to relate to the frequency

response of the synthesis filter. To reduce the bandwidth, the encoder and decoder predict the values

of the LSF's via a 4th order moving average. Two predictors are possible; the encoder chooses which

one to use and indicates it with a bit in the bitstream. After prediction, the prediction error is

computed. This error is sent to the decoder by vector quantizing it. The vector quantization proceeds in

two stages. In the first, a 10-dimensional codebook (recall there are 10 coefficients) containing 128

entries is searched, and the "best" one is chosen. "Best" is defined here as the entry which results in the

minimum mean square error between the correct LSF's and their quantized versions. This 10-

dimensional vector is then subtracted from the original LSF's. The resulting 10 dimensional difference


is split into two 5-dimensional vectors. The best match for the first vector is found (best here is defined

as minimizing a weighted m.s.e) from a second codebook, and the best match for the second vector is

found from a third codebook. This second codebook is 5-dimensional, and contains 32 entries, as does

the third. This two stage structure is called a conjugate structure, and represents the CS in the codec's

name. Note that 7+5+5 = 17 bits are needed for the vector quantization, plus another bit to specify

which moving average function is used. In the decoder, the LSF's are received and decoded. However,

for the first subframe in the frame, the LSF's are interpolated as the average of the LSF's for the

current and previous frames. The second subframe uses the LSF's received for the current frame.
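The decoder-side interpolation just described amounts to the following (a sketch; G.729 interpolates in the LSP domain with its own quantized values):

```python
def subframe_lsfs(prev_lsf, curr_lsf):
    """Decoder LSF interpolation: subframe 1 uses the average of the
    previous and current frames' LSFs; subframe 2 uses the current
    frame's LSFs directly."""
    sub1 = [(p + c) / 2.0 for p, c in zip(prev_lsf, curr_lsf)]
    return sub1, list(curr_lsf)
```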

The next step is to compute the excitation. This is done separately for each subframe. In each, the

excitation is represented as the sum of two components. The first is a delayed version of the excitation

used so far, and the second is a signal with four impulses at various positions. The first component is

called the adaptive codebook contribution, and it models the periodicity in the speech. Therefore, this

delay is actually the pitch delay in the speech signal. The first step in the process is to compute the

pitch delay. This is done by computing the autocorrelation of the speech (weighted to emphasize

various frequency characteristics), and finding the maximum at the smallest candidate delay. Favoring

the smallest such maximum ensures that multiples of the pitch delay are not used. This is called an open-loop pitch analysis. With

this quantity, a search is done in a region around the open loop pitch delay to find the best pitch. Best

is defined by filtering the previous excitation (delayed by the appropriate amount) through the LP

synthesis filter. The result is correlated with the actual speech signal, and divided by the magnitude of

the output of the synthesis filter (thus the gain is eliminated from the search). The delay which

maximizes this quantity is chosen. The gain is then computed directly for the optimal excitation. The

output of the synthesis filter using the optimally delayed and amplified excitation is then subtracted

from the desired speech signal, and the difference, called the target, is then used to find the second part

of the excitation. The second part of the excitation is referred to as the fixed codebook contribution.

The excitation consists of four impulses. Each impulse has an amplitude of either plus or minus one,

and can sit at a fixed set of positions (the set of positions is different for each impulse). These pulses

are then filtered through a simple harmonic filter. A search is done, first identifying the ideal

amplitudes (plus or minus one), and then the positions. As before, the search is executed by filtering


the excitation through the synthesis filter, and computing the product of the result with the target. This

is then divided by the energy in the output of the synthesis filter (again, eliminating the gain from the

search), resulting in the search metric. The set of amplitudes and positions which maximize this metric

are chosen. Finally, the gain is computed directly. For each subframe, a number of parameters have

now been computed: the pitch delay, the adaptive codebook gain, the fixed codebook excitation

(consisting of impulse positions and signs), and the fixed codebook gain. These parameters are then

quantized and sent to the decoder. The pitch delay is directly represented with 8 bits in the first

subframe. In the second subframe, the pitch delay is sent as the difference from the pitch delay in the

first subframe. This requires 5 bits. The fixed codebook contribution is also sent directly, using 4 bits

for the signs and 13 bits for the positions. What remains are the gains. The fixed codebook gain is

predicted from previous frames, and a multiplicative gain factor to compensate for the prediction error

is transmitted. The gain factor and fixed codebook gain are jointly vector quantized using a two stage

vector quantization process. The first stage consists of a 3 bit two dimensional codebook, and the

second stage consists of a 4 bit two dimensional codebook. The sum of the two codewords is used to

represent the gain factor and fixed codebook gain.

Once the decoder receives and reconstructs the speech signal, it applies post processing to clean it up.

The post processing consists of four components:

1. A long term postfilter, denoted Hp(z)

2. A short term postfilter, denoted Hf(z)

3. A Tilt compensation filter, denoted Ht(z)

4. A gain compensation factor, denoted g

The long term postfilter is constructed from the decoded gain and pitch delay parameters. Its basic

function is to emphasize the speech signal in frequency bands around multiples of the pitch period.

The filter is therefore constructed as a 1st-order all-zero filter, with its peaks precisely at multiples of

the pitch period. The short term postfilter is designed to emphasize the formants, which are frequency

bands of energy present in the synthesis filter. The postfilter is therefore derived from the synthesis

filter, but with its peaks expanded to make them more predominant. In a speech signal, tilt is defined

as the general slope of the energy of the frequency domain. The tilt compensation filter attempts to

adjust for distortions in this quantity caused by the short term postfilter. Finally, the gain


compensation factor is just the quotient of the energy in the unfiltered decoder output divided by the

energy in the postfiltered output. It restores the original signal strength to the speech.

The following table lists all of the bits (i.e., total: 80) which are placed in the bitstream:

Parameter                                      Bits

Switched MA predictor of LSF                      1

First-stage LSF VQ                                7

Second-stage VQ, first half                       5

Second-stage VQ, second half                      5

Pitch delay, first subframe                       8

Parity bit for pitch delay                        1

Fixed codebook for first subframe                13

Signs of fixed codebook for first subframe        4

Gain codebook, stage 1, first subframe            3

Gain codebook, stage 2, first subframe            4

Pitch delay, second subframe                      5

Fixed codebook for second subframe               13

Signs of fixed codebook for second subframe       4

Gain codebook, stage 1, second subframe           3

Gain codebook, stage 2, second subframe           4
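The bit allocation above can be checked to sum to the 80-bit frame (the field names here are informal labels for the tabulated parameters):

```python
# Bit allocation per 10 ms G.729 frame, as tabulated above.
G729_BITS = {
    "switched MA predictor of LSF": 1,
    "first-stage LSF VQ": 7,
    "second-stage VQ, first half": 5,
    "second-stage VQ, second half": 5,
    "pitch delay, subframe 1": 8,
    "parity bit for pitch delay": 1,
    "fixed codebook, subframe 1": 13,
    "fixed codebook signs, subframe 1": 4,
    "gain codebook stage 1, subframe 1": 3,
    "gain codebook stage 2, subframe 1": 4,
    "pitch delay, subframe 2": 5,
    "fixed codebook, subframe 2": 13,
    "fixed codebook signs, subframe 2": 4,
    "gain codebook stage 1, subframe 2": 3,
    "gain codebook stage 2, subframe 2": 4,
}
# 80 bits every 10 ms is exactly the 8 kbit/s rate of the codec.
assert sum(G729_BITS.values()) == 80
```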

Annex B

G.729 has an optional annex, Annex B [7], which specifies the use of silence suppression and comfort

noise generation. In typical speech, only one person talks at a time. Therefore, speech consists of

periods of talking (called talkspurts), followed by periods of silence. Additional compression can be

achieved by discovering the silence periods. Older approaches would send either nothing for the

silence periods, or would send a simple energy value, which the decoder would use to insert white

noise. However, in environments with loud and nonstationary background noise, both approaches are

inadequate. The algorithm operates by first making a Voice Activity

Detection (VAD) decision in each frame. The decision is made by keeping a running average of four

quantities:

1. The LSF's during silence periods

2. The full band energy in the speech signal (computed as the logarithm of the first autocorrelation coefficient)

during silence periods.

3. The low band energy in the speech signal (computed by filtering the autocorrelation coefficients), during

silence periods.

4. The rate of zero crossing of the signal, during silence periods.
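A toy sketch of the comparison described above (feature names and thresholds are hypothetical; G.729 Annex B actually uses a multi-boundary decision in the space of feature differences, not independent per-feature thresholds):

```python
def vad_decision(frame_feats, noise_avg, thresholds):
    """Toy VAD: compare the current frame's features (e.g. full-band
    energy, low-band energy, zero-crossing rate) with running averages
    kept during silence; declare speech if at least two features deviate
    by more than their thresholds."""
    deviations = sum(
        abs(frame_feats[k] - noise_avg[k]) > thresholds[k] for k in frame_feats
    )
    return deviations >= 2
```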

In each frame, the above parameters are extracted, and compared with the running averages.

Depending on the magnitudes of the differences for the various parameters, an activity decision is

made. Furthermore, the running averages are updated if the parameters in the current frame are less

than the running averages. The decision itself (speech or silence) is filtered, using the past two frames'

parameters and decisions as inputs. This ensures that sufficient hangover (i.e., speech transmission

just after the end of a talkspurt) is present. If the decision for the current frame is silence, the next step

is to decide whether to send a Silence Insertion Description frame (SID), or to send nothing (a null

frame). The SID frames contain a small amount of information which allow the decoder to generate

comfort noise. They consist of an excitation energy (5 bits), and the prediction error for the LSF

coefficients, as in G.729 (10 bits). The SID frames need only be sent when the parameters of the

background noise have changed since last transmitted. The decision is made by the encoder in any

way it likes, generally by comparing filter coefficient and energy changes to some thresholds. Note

that the bitstream does not contain any information about which of the three frame types is present

(speech, SID, or null). This information must either be sent out of band, or can be extracted from the

size of the frame (80, 15, or 0 bits, respectively). Since G.729 was developed for environments such as

cellular and data networks, an algorithm has been specified for concealing the loss of a frame. A frame

is lost when the network layer indicates sufficient bit errors in the frame, or when the frame never


arrives at all (due to a packet loss in the Internet, for example). When this happens, all of the

parameters in the packet are interpolated from parameters from the previous frame. In particular:

1. The LSF parameters for the current frame are repeated from the previous frame.

2. The adaptive and fixed codebook gains are taken from the previous frame, but are attenuated to gradually

reduce their impact.

3. The excitation depends on the classification of the previous frame as voiced or unvoiced. If the previous

frame was voiced, the fixed codebook contribution is set to zero, and the pitch delay is taken as the same as

the previous frame. If the previous frame was unvoiced, the adaptive codebook contribution is set to zero,

and the fixed codebook contributions are selected randomly.
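The three concealment rules above can be sketched as follows (the attenuation factor 0.9 and the parameter names are illustrative assumptions; G.729 specifies its own per-parameter attenuation factors):

```python
def conceal_frame(prev, atten=0.9):
    """Sketch of frame-erasure concealment: repeat the previous frame's
    LSFs, attenuate both gains, and reuse either the adaptive or the
    fixed codebook contribution depending on the previous frame's
    voicing classification."""
    return {
        "lsf": prev["lsf"],                              # 1. repeat LSFs
        "adaptive_gain": prev["adaptive_gain"] * atten,  # 2. attenuate gains
        "fixed_gain": prev["fixed_gain"] * atten,
        # 3. voiced -> keep pitch contribution; unvoiced -> random fixed codebook
        "use_adaptive": prev["voiced"],
    }
```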

The effect of this interpolation will be to introduce errors into the decoded speech signal, both for the

frames which are erased, and for subsequent correctly received frames, due to the divergence of

encoder and decoder state. Unfortunately, quite a bit of state is maintained in the decoder, including:

1. The 4th-order MA predictor filter memories for the LSFs.

2. The past excitation signal

3. The fixed codebook energies for the past four frames, which are used to predict the fixed codebook gain

4. The adaptive codebook gain from the previous frame, which is used to generate the harmonic filter used on

the fixed codebook excitation.

5. The synthesis filter memories (10th order)

Appendix IV: Other Research Fields and Articles during the Ph.D. period

“Speech Playout Buffering Based on a Simplified Version of the ITU-T E-Model”, IEEE SPL 03/2004

In multimedia real-time streaming over packet networks, the problem of transmission delay variations

between packet arrivals is frequently addressed with de-jitter buffers on the receive side. This

introduces an additional delay but it also provides a better sense of smoothness in the playout of the

output packet stream. Most of the proposed algorithms for the control of the playout buffer are based

on adaptive approaches. In particular, in [2], mean and variance values for packet delays are estimated

mainly with recursive linear filtering. The linear filter is used to adjust the total delay as a function of

the most recently observed values and works for every received packet. The buffer is then set to a size

so that only a small fraction of the arriving packets should be lost due to late arrival. In [3],

Page 68: VoIP Streaming Over Packet-Based Networks

67

information about a considerable number of received packets is instead used to construct a histogram that approximates the packet delay distribution (PDD) and is dynamically updated over time. Based on that histogram, the buffer dimension is set to the minimum value that smooths out network jitter so that the stream’s requirements on the maximum late packet percentage and the maximum acceptable delay are satisfied, if possible.
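The recursive estimate of [2] can be sketched as follows; this is the commonly cited formulation of that filter, with the usual safety margin of four variation estimates:

```python
def update_playout_delay(d_prev, v_prev, n_i, alpha=0.998002):
    """One step of the recursive linear filter of [2]: d_prev is the running
    delay estimate, v_prev the running variation estimate, and n_i the
    network delay measured for the current packet."""
    d = alpha * d_prev + (1.0 - alpha) * n_i
    v = alpha * v_prev + (1.0 - alpha) * abs(d - n_i)
    # The playout delay adds a margin of four variation estimates so that
    # only a small fraction of packets arrive after their playout time.
    playout = d + 4.0 * v
    return d, v, playout
```

Run once per received packet; the buffer is then sized so the stream is played out `playout` milliseconds after capture.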

In this letter, we focus on the problem of setting the buffer dimension that simultaneously affects the

packet end-to-end delay and the packet loss. To solve this problem we propose to use a perceptually

motivated optimality criterion that allows the receiver to automatically balance packet delay versus

loss. In the proposed approach, the de-jitter buffer size is adaptively set and the adopted criterion relies

on the use of a simplified version proposed by Cole and Rosenbluth in [1] of the conversational

quality ITU-T E-Model [4]-[5]. The computation of the optimal buffer size is presented in next

section; section three presents the use of these results in a voice de-jitter buffering framework; and

performance of this approach is reported in the last section.

The E-Model is a computational paradigm defined by the ITU-T to assess the combined effects of

variations in several transmission parameters that affect the conversational quality. This model was

designed for planning purposes and not for actual customer opinion prediction, though the E-Model

output, the R Factor, can be mapped to estimates of customer opinion, such as the Mean Opinion

Score (MOS) [5]. Several impairment factors are considered and weighted in order to obtain an overall index correlated with the speech transmission quality: the simultaneous impairment factor (I_s), a function of the signal-to-noise impairments associated with Switched Circuit Network paths; the delay impairment factor (I_d), which includes all delay and echo effects; the equipment impairment factor (I_e), which models the impairments caused by low bit-rate codecs; and the expectation factor (A). The output of the algorithm is the R Factor, a scalar value defined as a linear combination of the cited components [4]:

R = 100 − I_s − I_d − I_e + A   (1)

This model is applicable only if constant buffers are used during each conversational unit without pauses (talkspurt). This point is quite important for the proposed algorithm.
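As a minimal sketch, equation (1) and the standard ITU-T G.107 mapping from the R Factor to an estimated MOS can be coded as:

```python
def r_factor(i_s, i_d, i_e, a=0.0):
    """Equation (1): the R Factor as a linear combination of the
    impairment factors and the expectation factor."""
    return 100.0 - i_s - i_d - i_e + a

def r_to_mos(r):
    """Standard ITU-T G.107 mapping from the R Factor to an estimated
    Mean Opinion Score."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6
```

For example, an R Factor around 70 maps to a MOS near 3.6, i.e. between "some users dissatisfied" and "satisfied".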

Page 69: VoIP Streaming Over Packet-Based Networks

68

Despite the apparent simplicity of (1), the computation of each impairment factor is very complex

making the model unusable in a practical context. Based on this, some studies have been carried out to

make the model operative. In particular, in [1] Cole and Rosenbluth introduce some simplifications of

the model when the 8 kbps ITU-T G.729-A speech codec [6] is used. In the following, we present these simplifications and the relevant equations (2)-(6) so as to better understand the assumptions under which this model holds. Since I_s is not a function of the underlying packet network, and the aim of Cole and Rosenbluth was to analyze the impairments introduced by a packet network, in [1] such factor was set according to the G.729-A default value, which led to setting (100 − I_s) equal to 93.2. The I_d factor is a function of several average delay components within the end-to-end “signal paths”. In VoIP connections without circuit switched networks, I_d becomes a function of only the single one-way mouth-to-ear delay measurement d_e2e (in milliseconds):

I_d = 0.024 · d_e2e + 0.11 · (d_e2e − 177.3) · F(d_e2e − 177.3)   (2)

where F(x) is the step function (F(x) = 0 if x < 0, else F(x) = 1). The impairment factor I_e for the G.729-A+VAD codec, in the case of random packet losses and using the native G.729-A packet loss concealment algorithm, is obtained from [4] and Table I.2 in [7] as a function of the total packet loss ratio e_e2e:

I_e(G.729-A+VAD, random) = 11 + 40 · ln(1 + 10 · e_e2e)   (3)

Equation (3), proposed in [1], provides very similar results to those obtainable from the new expression of I_e = I_e(e_e2e) in [4] and the new Appendix I/G.113 [8]. The suggested value of A for wirebound connections is zero. The d_e2e and e_e2e components in (2) and (3) are defined as follows:

d_e2e = d_cod + d_dejitter + d_net ;   e_e2e = e_net + (1 − e_net) · e_dejitter   (4)

Equation (4) defines d_e2e and e_e2e as a combination of delays (d) and losses (e) due to speech processing, to network impairments, and to de-jitter buffering. Among these factors, only d_dejitter and e_dejitter are directly affected by de-jitter buffer management operations. These two variables are strongly correlated and, to find this relationship, the following expression has been proposed in [1]: d_dejitter = b · g, where b represents the number of packets to be buffered to compensate for the jitter and g represents the average packet inter-arrival time. The late packet ratio is then computed as the probability of receiving a packet at an interval l from the previous one greater than d_dejitter, so that e_dejitter ≅ P(l > b · g).

By applying Chebyshev’s inequality to e_dejitter, Cole and Rosenbluth obtained:

e_dejitter ≅ P(l > b · g) = P(l − g > (b − 1) · g) < σ_l² / ((b − 1) · g)²   (5)

where σ_l is the packet inter-arrival time standard deviation. Equation (5) is an approximation that links the late packet ratio with the adopted buffer dimension. On applying these simplifications to (1), the following expression is obtained:

R = 93.2 − 0.024 · (d_cod + b · g + d_net)
    − 0.11 · (d_cod + b · g + d_net − 177.3) · F(d_cod + b · g + d_net − 177.3)
    − 11 − 40 · ln(1 + 10 · (e_net + (1 − e_net) · σ_l² / ((b − 1) · g)²))   (6)

Equation (6) is an expression of the speech transmission quality as a function of the buffer dimension b. Once the network statistics g, σ_l, d_net, and e_net for future conversational units are predicted, this expression can be maximized in order to find the optimal buffer value b, as described in the following section.
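Putting equations (2)-(6) together, the predicted R Factor as a function of the buffer dimension b can be sketched as follows; the codec delay `d_cod` is an assumed constant, and a plain exhaustive search stands in for the bisection used later:

```python
import math

def r_factor_of_b(b, g, sigma_l, d_net, e_net, d_cod=25.0):
    """Equation (6): predicted R Factor versus buffer dimension b (packets).
    g and sigma_l are the inter-arrival mean and standard deviation (ms),
    d_net/e_net the network delay and loss; d_cod is an assumed codec delay."""
    d_e2e = d_cod + b * g + d_net
    # Delay impairment, equation (2).
    i_d = 0.024 * d_e2e
    if d_e2e > 177.3:
        i_d += 0.11 * (d_e2e - 177.3)
    # Late-packet ratio bounded via Chebyshev's inequality, equation (5).
    e_dejitter = min(1.0, (sigma_l / ((b - 1) * g)) ** 2)
    e_e2e = e_net + (1.0 - e_net) * e_dejitter
    # Equipment impairment for G.729-A+VAD, equation (3).
    i_e = 11.0 + 40.0 * math.log(1.0 + 10.0 * e_e2e)
    return 93.2 - i_d - i_e

def best_buffer(g, sigma_l, d_net, e_net, b_max=50):
    """Exhaustive search for b_opt over a small range of buffer sizes."""
    return max(range(2, b_max + 1),
               key=lambda b: r_factor_of_b(b, g, sigma_l, d_net, e_net))
```

A small b keeps d_e2e low but inflates the Chebyshev loss term; a large b does the opposite, so R(b) rises and then falls, and the maximizer balances the two impairments.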

It is important to note that maximization of the R Factor does not assure maximization of the user

perceived quality. However, under certain important conditions (the most common), the correlation of

the R Factor with the MOS quality index has been verified for some quality levels: from “user very

satisfied” to “nearly all users dissatisfied”. Annex A of [4] provides the situations where the model

validity has not been completely verified. Additionally, the assumptions used by Cole and Rosenbluth

need to be taken into account when using such a simplified version: no circuit switched network

interworking, no echo, randomness of packet losses, and use of the native G.729-A concealment


algorithm. The use of different conditions would influence the final R index. The assessment and

enhancement of the E-Model is under study by ITU (ITU-T SG 12).

Buffer adjustment between talkspurts has the advantage of producing a smoother playout with respect to continuously updating approaches. This is the approach adopted in the proposed algorithm, where the buffer is tuned during conversational pauses by maximizing the expected future transmission quality. The following operations are performed: during a talkspurt, information about the occurred packet losses, inter-arrival times, and transmission delays is stored; during the silence period, the variables g, σ_l, d_net, and e_net are estimated from the previously stored information; at the beginning of a new talkspurt, the values obtained in the previous step are introduced in (6) and this expression is maximized to obtain the optimal buffer dimension (b_opt); finally, for the new talkspurt the buffer is set in accordance with d_dejitter = b_opt · g. The proposed algorithm is independent of any specific network statistics estimation approach, and different well-known techniques can be used for this purpose. In particular, we implemented three different E-Model based (EM) algorithms:

• EMv1: g and σ_l are set to the mean and standard deviation computed for the packets belonging to the last talkspurt. Also d_net and e_net are taken equal to the mean values over this period.

• EMv2: g, σ_l, and d_net are computed using the prediction algorithm (linear filter) described in [2]. e_net is computed as in the previous algorithm.

• EMv3: this algorithm is instead based on a histogram-based probability density function (PDF) approach, as proposed in [3]. Essentially, two different PDFs are used: one for the network delay, the other for the packet inter-arrival time. One PDF is used to compute g and σ_l, while the other is used to compute d_net. e_net is set equal to the loss ratio experienced during the same period used to construct the PDFs.

EMv1 is the simplest and the least accurate algorithm. Nevertheless, it has the advantage of working automatically, without any input parameter setting. This is not true for the other two algorithms, which require the selection of appropriate values for a certain set of input parameters to work correctly. The computation of b_opt is performed using a bisection strategy to guarantee fast convergence while searching for the maximum of the R function.
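The EMv1 estimation step can be sketched as follows (function and argument names are illustrative):

```python
import statistics

def emv1_estimates(interarrival_ms, net_delays_ms, losses):
    """EMv1: network statistics for the next talkspurt taken as plain
    sample statistics over the packets of the last talkspurt.
    `losses` is a list of 0/1 loss indicators for that period."""
    g = statistics.mean(interarrival_ms)        # mean inter-arrival time
    sigma_l = statistics.stdev(interarrival_ms) # its standard deviation
    d_net = statistics.mean(net_delays_ms)      # mean network delay
    e_net = sum(losses) / len(losses)           # observed loss ratio
    return g, sigma_l, d_net, e_net
```

EMv2 and EMv3 would replace these plain sample statistics with the linear-filter prediction of [2] and the histogram PDFs of [3], respectively.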


The performance of the proposed approach was analyzed by applying the devised playout algorithms to several traffic traces, each lasting an average of 10 minutes. Two H.323 hosts were used during the experiments, using the G.729-A+VAD codec with the native error concealment and sending two 10 ms voice frames per UDP packet.

The packet losses after de-jittering were analyzed in order to verify that the probability of loss of a packet was independent of the probability of loss of any other. This analysis was performed using the statistical chi-squared test by considering the random variable K associated with the length of packet loss bursts. In the case of random packet losses, the variable K should be distributed according to a geometric PMF (Probability Mass Function), and the chi-squared test has been used to verify this hypothesis (hypothesis H0). For each experiment, we computed the chi-squared value and the associated probability P_χ2 of having a chi-squared value equal to or greater than the obtained value by chance only. We were then able to reject or not reject H0 depending on whether P_χ2 was smaller or not than a given significance level, usually selected equal to 1% or 5%. An additional investigation was performed by modeling the packet loss process with a Gilbert model, as done in [9]. This model is based on the unconditional packet loss probability P_u and the conditional packet loss probability P_c. The distance between P_u and P_c gives an indication of the deviation of the packet loss process from a memoryless Bernoulli process, which is characterized by having these two probabilities equal. We show

the results of these two investigations for four of the performed experiments in Table VII. The chi-squared tests prove that H0 has to be rejected for experiments I and IV and that it cannot be rejected for experiments II and III on the basis of a significance level of 1%. These results are also in accordance with the computed distances between P_u and P_c, which are quite low for experiments II and III and significantly high for the others. As to the other performed experiments, most of them were characterized by a bursty behavior. This result highlights the need for an expression of I_e in the case of bursty losses in order to correctly use the E-Model for the quality evaluation of IP Telephony in the Internet. Based on this analysis, the proposed algorithm can be applied only to experiments II and III, for which we present the results in this Section.
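The Gilbert-model probabilities P_u and P_c used in this analysis can be estimated from a 0/1 loss sequence as in the following sketch:

```python
def gilbert_probabilities(loss_seq):
    """Estimate the unconditional loss probability Pu and the conditional
    probability Pc (loss given that the previous packet was also lost)
    from a sequence of 0/1 loss indicators."""
    p_u = sum(loss_seq) / len(loss_seq)
    # Consecutive pairs whose first packet was lost.
    after_loss = [b for a, b in zip(loss_seq, loss_seq[1:]) if a == 1]
    p_c = sum(after_loss) / len(after_loss) if after_loss else 0.0
    return p_u, p_c
```

For a memoryless Bernoulli loss process the two estimates converge to the same value; a bursty process yields P_c noticeably larger than P_u, as in experiments I and IV of Table VII.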

The performance of the proposed playout approach is compared with that of the cited works. In particular, for the linear filter, we used Algorithm 1 in [2] (Linear Filter, α = 0.998002). The Concord

algorithm in [3] was tested with default parameter values: expected d_e2e (i.e., the ted) recalculated at each arriving packet, histogram with 1 ms bin-width, aging every 1000 packets with F = 0.9, and the maximum late packets (mlp) set to 0.01. Figure 16 compares the proposed EMv1 technique with the alternative approaches in terms of the R Factor for the third experiment. The graph shows the R Factor computed at the end of each talkspurt based on the obtained packet loss and delay, using the exact R Factor formula instead of the approximation in (6). This graph shows that our strategy generally obtained higher values, in spite of the presence of spikes. The mean, maximum, and minimum quality values are presented in Table VIII, together with the resulting d_e2e values. The results for EMv2 and EMv3 are also presented. We can observe that the proposed algorithms EMv1 and EMv3 achieve an average R Factor almost 5 points higher than the alternative solutions, with comparable total end-to-end delays. The maximum achievable values are also provided, computed under the assumption of knowing exactly the network statistics for that talkspurt instead of predicting them.

As to the second experiment, we present the final results directly in the second part of Table VIII. This experiment was characterized by a null network packet loss and an average network delay of 138.41 ms. We obtained similar results for the linear filter and Concord solutions, while with the EMv1 and EMv3 approaches very good results were achieved due to the absence of network losses. These two approaches outperformed the alternative ones by about 14 points. The obtained quality values were quite close to the maximum achievable ones.

[1] R. Cole and J. Rosenbluth, “Voice over IP performance monitoring,” ACM Computer Communication Review, vol. 31, no. 2, Apr. 2001.
[2] R. Ramjee, J. Kurose, D. Towsley, and H. Schulzrinne, “Adaptive playout mechanisms for packetized audio applications in wide-area networks,” in Proc. IEEE Infocom, vol. 3, June 1994.
[3] C. J. Sreenan, J.-C. Chen, P. Agrawal, and B. Narendran, “Delay reduction techniques for playout buffering,” IEEE Transactions on Multimedia, vol. 2, no. 2, pp. 88-100, June 2000.
[4] ITU-T Recommendation G.107, “The E-Model, a computational model for use in transmission planning,” 03/2003.
[5] ITU-T Recommendation G.108, “Application of the E-Model: A planning guide,” 09/1999.
[6] ITU-T Recommendation G.729 Annex A, “Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP),” 03/1996.
[7] ITU-T Recommendation G.113, “Transmission impairments due to speech processing,” 02/2001.
[8] ITU-T Recommendation G.113 Appendix I, “Provisional planning values for the equipment impairment factor Ie and packet-loss robustness factor Bpl,” 05/2002.
[9] W. Jiang and H. Schulzrinne, “Modeling of packet loss and delay and their effect on real-time multimedia service quality,” in Proc. International Workshop on Network and Operating Systems Support for Digital Audio and Video, June 2002.


[Plot: R Factor versus talkspurt number for the Linear Filter, Concord, and EMv1 algorithms.]

Figure 16: Comparison of the proposed EMv1 algorithm with the Linear Filter and Concord

algorithms for the third experiment.

Table VII: P_χ2, P_u, and P_c for four of the performed experiments after applying the EMv1 algorithm.

Experiment       P_χ2 (%)   P_u (%)   P_c (%)
I                <1         2.36      20.39
II               >1         0.88      1.05
III              >5         1.88      1.90
IV               <1         2.09      5.65

Table VIII: Results in terms of d_e2e and the R Factor for the three proposed algorithms (EMv1, EMv2, and EMv3) and the two comparing approaches (Linear Filter and Concord). The potential maximum R Factor is also presented.

                               d_e2e (msec)              R Factor
Algorithm                  Min     Max     Mean      Min    Max    Mean
III Experiment
  Linear Filter            190.7   318.3   250.3     50.5   80.1   57.1
  Concord                  156.1   207.1   179.9     50.0   77.0   58.8
  EMv1                     129.7   277.4   212.4     52.1   80.1   63.6
  EMv2                     82.6    247.4   147.3     46.8   65.1   56.7
  EMv3                     128.8   247.2   183.3     51.1   81.2   63.2
  Max R Factor             -       -       -         54.0   82.1   66.8
II Experiment
  Linear Filter            220.1   517.8   333.9     51.3   73.2   61.2
  Concord                  167.0   275.0   212.0     52.2   77.7   62.4
  EMv1                     165.5   303.9   201.5     62.5   79.0   76.3
  EMv2                     140.3   311.7   189.6     58.7   80.0   73.4
  EMv3                     137.9   277.4   182.2     65.8   80.0   77.2
  Max R Factor             -       -       -         66.7   80.9   78.2

Audio Watermarking: “An Audio Patchwork Shaping Framework with Psychoacoustic Model 2”, accepted and presented at WIAMIS 2004 (Lisboa, 04/04)

Recent years have been characterized by a growing diffusion of digital audio contents, and consequently by a growing need for copyright and ownership protection. Watermarking


techniques represent a good solution for these needs: a mark is suitably inserted in a host signal in such a way that its ownership is provable. Many strategies have been presented in the recent past for this purpose. Several of these techniques inherited their core from image watermarking; in general, however, this legacy was not always straightforward due to the differences in sensitivity and perception between the human ear and eye. A set of basic features for a reliable watermarking strategy was presented in [2]. Two characteristics are the most significant and apparently contradictory: inaudibility and robustness to signal processing. Inaudibility means that the differences between the original and the watermarked signal should not be perceivable by the human ear. Secondly, the watermark must be robust against intentional or unintentional attacks. One of the most impairing attacks is signal processing, and specifically lossy compression. Such compression guarantees enhanced portability of digital information, but can have an undesirable effect on the embedded watermark. Our strategy was developed with constant reference to these two features. In this paper we adopt an adaptive approach to the patchwork algorithm. Patchwork was originally presented in [1] for image watermarking. The original implementation of this technique presents several limitations when applied to audio samples. Quite a few adaptations have been proposed to considerably improve its performance ([2]-[5]). From these studies, the working domain and the adaptive patch shaping appear to be the key points both for applying the original strategy to audio samples and for improving it. The proposed strategy works on an assumption introduced in [1]: treating patches of several points has the effect of shifting the noise to low frequencies, where it has a lower probability of being filtered by lossy compression techniques. The right dimension is fixed by comparing the spectrum of the watermarked signal to the minimum masking threshold, as obtained from psychoacoustic model 2 [6]. The patch shaping is performed in the Fourier domain. The proposed technique is applied to audio samples and compared with the adaptive patchwork state of the art, referring to the framework proposed in [7]. The patchwork shaping framework shows particularly good results in terms of robustness to compression and quality. The paper is organized as follows. Section 2 presents the state of the art of adaptive patchwork algorithms. Section 3 introduces the shaping of the watermark with respect to the threshold of audibility. Section 4 illustrates our technique. Section 5 presents tests and results, while in Section 6 the conclusions are drawn.


The patchwork strategy is a two-set method: it makes two sets drawn from a host signal differ from each other [4]. This difference is used to verify, or not, a hypothesis H0 (e.g., the watermark is embedded).

Figure 17: Distribution of the mean difference of the samples in un-watermarked and watermarked signals.

The original strategy [1] is applied to sets with more than 5,000 elements. The samples of each subset are considered uniformly distributed and with equal mean values. The elements are modified by adding and subtracting the same quantity d. Thus, the detection of a watermark is related to the condition:

E[A_marked − B_marked] = 2d

Several of these statements must be reassessed when working with audio samples [2]. In particular, the distribution of the sample values is assumed to be normal (see Figure 17). Recent approaches modify the

original strategy to better take into account the human ear's sensitivity to noise interference. These methods can be classified into temporal and spectral approaches, depending on the domain where the watermark is embedded. In [5] a technique is proposed that is based on the transformation of time-domain data: a set of N samples, corresponding to 1 s of stereo audio signal, is modified by a watermark signal w(i). [2]-[4] propose spectral patchwork approaches. In particular, [2] works with a dataset of 2N Fourier coefficients. The relationship between d and the elements of the dataset is multiplicative. The parameter d is adaptively chosen to prevent perceptual audibility, based on the characteristics of the audio signal (introducing for the first time the concept of a power density function in the hypothesis tests).



In [3] the patchwork algorithm is applied to the coarsest wavelet coefficients, providing a fast synchronization between watermark embedding and detection. In [4], the Modified Patchwork Algorithm (MPA) is presented. This approach is very robust due to three attributes: the factor d is evaluated adaptively, based on the sample mean and variance; the patch size in the transformed domain is very small, which guarantees good inaudibility; finally, a sign function is used to enhance the detection rate. These features are included in an embedding function so that the distance between the sample means of the two sets is bigger than a certain value d. The temporal approaches are easier to implement than the spectral ones; at the same time, they present several weaknesses against general signal processing modifications [3].

The association between a watermarking algorithm and a noisy communication system is not new [8]. Indeed, a watermarking strategy adds a mark (i.e., the noise) to a host signal (i.e., the communication channel). In this sense, watermark embedding can be considered an operation of channel coding: the watermark is adapted to the characteristics of the transmission channel (i.e., the host signal in which the watermark should be embedded). In the case of audio contents, what is usually considered an impairment, the sensitivity of the human ear, can be used as a way to spread and dimension the watermark. The human auditory system (HAS) is well characterized: it is most sensitive to specific frequencies (from 2 kHz to 4 kHz) and reacts to specific events (frequency and temporal masking). Given a signal S, it is possible to recover its minimum masking threshold. The minimum masking threshold of audibility represents the limit between audible and inaudible signals for S at different frequencies. Independently of S, it is also possible to recover the absolute threshold of hearing (ATH). This curve (referred to as the quiet curve [9]) is different from the previous one and defines the intensity, expressed in decibels (dB), required for a single sound to be heard in the absence of another sound [10].
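The quiet curve is commonly approximated by Terhardt's analytic expression; the formula below is that standard approximation, not one taken from this thesis:

```python
import math

def absolute_threshold_db(f_hz):
    """Terhardt's approximation of the absolute threshold of hearing
    (quiet curve) in dB SPL, for a frequency given in Hz."""
    f = f_hz / 1000.0  # convert to kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)
```

The curve dips around 3-4 kHz, where the ear is most sensitive, and rises steeply at both the low and high ends of the audible range.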

Several methods outside the patchwork family have been proposed that make use of psychoacoustic models to guarantee perceptual inaudibility of the mark [11], [9], [12]. Usually, the state-of-the-art methods shape the watermark referring mainly to the quiet curve: the filtered watermark signal is scaled in order to embed the watermark noise below the quiet curve [13]. In addition, other methods increase


the noise energy of the watermark, referring directly to the minimum masking threshold of audibility. Such a threshold can be recovered through a well-defined psychoacoustic model. The MPEG/audio standard provides two example implementations of the psychoacoustic model. Psychoacoustic model 1 is less complex than psychoacoustic model 2 and makes more compromises to simplify the calculations. Either model works for any of the layers of compression; however, only model 2 includes specific modifications to accommodate Layer III. In this paper, we refer to model 2, differently from past approaches.

[Block diagram: the unmarked signal feeds an FFT and psychoacoustic model 2; an RNG seeded with the secret seed and the watermark drives the patchwork shaping block, which produces the marked signal.]

Figure 18: Steps (1-4) of the Patchwork shaping algorithm.

As already stated, the proposed patchwork strategy modifies two sets of N elements/coefficients from the original signal (signal_Un-Marked). The signal_Marked is strongly tied to the corresponding signal_Un-Marked. The core of our strategy is the shaping of the frequency response of the mark signal using psychoacoustic model 2. The algorithm proposed in this work embeds the watermark in the frequency domain, by modifying 2N Fourier coefficients. The choice of this transform domain is justified by the use of the psychoacoustic model. The embedding steps (see Figure 18) can be summarized as follows:

1. Evaluate the minimum audibility threshold for the signal_Un-Marked, referring to psychoacoustic model 2.
2. Map the secret key and the watermark to the seed of a random number generator. Next, generate two N-point index sets I_N^A = {a1, a2, ..., aN} and I_N^B = {b1, b2, ..., bN}.
3. Let X = {X1, X2, ..., X2N} be the 2N DFT coefficients of the signal_Un-Marked corresponding to the index sets I_N^A and I_N^B.
4. The original amplitude of the patch and the number of re-touched coefficients, starting from the generic elements of index ai or bi, have the standard values (δ, θ), respectively. Such values are modified iteratively to verify that the spectrum of the watermark signal is under the minimum audibility threshold obtained in step 1. "Iteratively" means constantly referring back to the model 2 block from the shaping block (see the dotted loop in Figure 18).
5. The time-domain representation of the output signal is found by applying an inverse DFT to the signal_Marked.

The detection phase is as follows:

1. Define two test hypotheses: H0 (the watermark is not embedded) and H1 (the watermark is embedded).
2. Map the seed and the watermark to a random number generator and generate two sets I'_N^A and I'_N^B.
3. Fix a threshold ∆ for the detection, and evaluate the mean value z̄ = E(z) of the random variable z = a'_i − b'_i, for a'_i ∈ I'_N^A and b'_i ∈ I'_N^B.
4. Decide for H0 or H1, depending on whether z̄ < ∆ or z̄ ≥ ∆.

We tested the proposed algorithm on 16-bit stereo audio signals, sampled at Fs = 44.1 kHz. The size of each patch (N) was fixed at 50 points, while the default values for (δ, θ) were set to (0.5, 10). Higher values of θ were also tested, but only for robustness evaluation, regardless of quality aspects.
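A minimal sketch of the embedding and detection phases, omitting the psychoacoustic patch shaping of step 4 (the function names and the constant-signal example are illustrative):

```python
import random

def pick_sets(seed, length, n):
    """Map the secret seed to two disjoint N-point index sets I_A and I_B."""
    idx = random.Random(seed).sample(range(length), 2 * n)
    return idx[:n], idx[n:]

def embed_patchwork(coeffs, seed, n=50, delta=0.5):
    """Additive patchwork on the coefficient magnitudes: set A is raised
    by delta, set B lowered by delta, so E[A - B] = 2 * delta."""
    marked = list(coeffs)
    set_a, set_b = pick_sets(seed, len(coeffs), n)
    for i in set_a:
        marked[i] += delta
    for i in set_b:
        marked[i] -= delta
    return marked

def detect_patchwork(coeffs, seed, n=50, threshold=0.25):
    """Recompute the index sets from the seed and compare the mean
    difference z with the detection threshold (decide H1 if z >= threshold)."""
    set_a, set_b = pick_sets(seed, len(coeffs), n)
    z = (sum(coeffs[i] for i in set_a) - sum(coeffs[i] for i in set_b)) / n
    return z >= threshold
```

With the correct seed the expected mean difference is 2δ, well above the threshold; without the mark (or with a wrong seed), z stays near zero.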

The state of the art offers a framework for the evaluation of audio watermarking techniques [7]. In this work, we referred to this framework and considered, in particular, two key factors: the quality of the watermarked signal and the robustness to mp3 compression. The evaluation of quality is an essential part of testing our strategy, since the basic idea was to guarantee the maximum degree of inaudibility of the patch. The tests were performed using a subjective score (a MOS) and the SNR of the watermarked signal versus the host signal. The robustness of the proposed strategy was tested in two steps: first, coding and decoding the watermarked signal with a commercial MP3 encoder at different rates (usually 128 kbps); second, attempting detection of the watermark on the decompressed signal. Quality and robustness cannot be evaluated separately. These factors are strongly correlated: a decrease in quality causes an increase, in most cases significant, in robustness. All the performed tests showed good results. The idea of increasing the number of points of the patches proves successful. Good subjective quality is obtained since all the patches are below the audibility threshold for that signal (SNR ≥ 26).


Figure 19: Probability density function of detection for the random variable z, varying the dimension of the patch, with SNR = 26.

At the same time, treating more points has the effect of shifting the patchwork noise to low frequencies, where it has a lower probability of being filtered by the mp3 compression. Figure 19 shows different probability density functions (introduced as empirical Pdfs in [5]) of the random variable z, as described in the detection phase. The density function of z before the mp3 compression is compared with the different behaviours obtained by varying the dimension of the patch. This test clearly shows that higher values of θ lead to smaller alterations in the detection values, resulting in a Pdf closer to that of the uncompressed signal. We have also evaluated the error probability at different

rates of compression (128, 96, and 64 kbps). Two kinds of errors can be identified; the state of the art refers to them in terms of Type I (rejection of H0 when H0 is true) and Type II (non-rejection of H0 when H1 is true) [4]. Type II errors are the most impairing: the watermark has been inserted, with the resulting quality degradation, but the ownership cannot be proven. Table VII presents the Type II errors for a test audio signal. Clearly, the probability of detecting the watermark when H1 is true decreases with the mp3 compression rate.

In this paper an audio watermarking framework has been presented, based on a patchwork approach. The core of the proposed technique is the shaping of the patchwork, performed with reference to psychoacoustic model II. This results in higher robustness and in the inaudibility of the patch noise. The strategy was evaluated in terms of robustness to lossy compression and of quality. Good results were obtained in the tests with 44.1 kHz audio and speech traces. The proposed strategy can be improved in the modelling of the patch.



At present this step is quite coarse. Further studies will centre on more refined mathematical mechanisms of patch shaping (e.g., curve fitting) with respect to the minimum masking threshold.

Table VII: Error probabilities for lossy compression at different rates.

    Compression                    Type II Errors (%)
    MPEG I Layer III (128 Kbps)    0.1
    MPEG I Layer III (96 Kbps)     0.7
    MPEG I Layer III (64 Kbps)     1.6
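The curve-fitting idea mentioned above can be sketched as follows (hypothetical names, and a low-order polynomial fit chosen purely for illustration): fit a smooth curve to the minimum masking threshold and clip every patch component a safety margin below it.

```python
import numpy as np

def shape_patch(freqs, mask_db, patch_db, margin_db=3.0):
    """Fit a low-order polynomial to the minimum masking threshold and
    keep every patch component at least margin_db below the fitted curve."""
    coeffs = np.polyfit(freqs, mask_db, deg=3)
    fitted = np.polyval(coeffs, freqs)
    return np.minimum(patch_db, fitted - margin_db)

freqs = np.linspace(0.0, 1.0, 50)        # normalized frequency axis
mask_db = np.full(50, 40.0)              # flat 40 dB threshold (toy example)
patch_db = np.full(50, 50.0)             # patch initially above the threshold
shaped = shape_patch(freqs, mask_db, patch_db)   # clipped to about 37 dB
```

A real implementation would take the minimum masking threshold from psychoacoustic model II per analysis frame; the fit smooths it so the shaped patch does not follow every local irregularity of the threshold.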

References

[1] W. Bender, D. Gruhl, N. Morimoto, and A. Lu, "Techniques for data hiding," IBM Systems Journal, Vol. 35, No. 3-4, pp. 313-335, 1996.

[2] M. Arnold, "Audio watermarking: features, applications and algorithms," IEEE Int. Conf. on Multimedia and Expo 2000, Vol. 2, pp. 1013-1016, 2000.

[3] Hong Oh Kim, Bae Keun Lee, and Nam Yong Lee, "Wavelet-based audio watermarking techniques: robustness and fast synchronization," Research Report 01-11, Division of Applied Mathematics, KAIST.

[4] In-Kwon Yeo and Hyoung Joong Kim, "Modified patchwork algorithm: a novel audio watermarking scheme," IEEE Trans. on Speech and Audio Processing, Vol. 11, No. 4, 07/2003.

[5] P. Bassia, I. Pitas, and N. Nicholaidis, "Robust audio watermarking in the time domain," IEEE Transactions on Multimedia, Vol. 3, pp. 232-241, 06/2001.

[6] ISO/IEC Joint Technical Committee 1, Subcommittee 29, Working Group 11, Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s, Part 3: Audio, ISO/IEC 11172-3, 1993.

[7] J. D. Gordy and L. T. Bruton, "Performance evaluation of digital audio watermarking algorithms," Proceedings of the 43rd Midwest Symposium on Circuits and Systems, Lansing, MI, USA, Aug. 2000.

[8] T. Muntean, E. Grivel, I. Nafornita, and M. Najim, "Audio digital watermarking for copyright protection," International Workshop on Trends and Achievements in Information Technology, 05/2002.

[9] N. Cevjic, A. Keskinarkaus, and T. Seppanen, "Audio watermarking using m-sequences and temporal masking," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, pp. 227-230, 2001.

[10] E. Zwicker and U. T. Zwicker, "Audio engineering and psychoacoustics: matching signals to the final receiver, the human auditory system," Journal of the Audio Engineering Society, Vol. 39, No. 3, pp. 115-126, 03/1991.


[11] L. Boney, A. H. Twefik, and K. N. Hamdy, "Digital watermarks for audio signals," International Conference on Multimedia Computing and Systems, Hiroshima, Japan, pp. 473-480, 1996.

[12] M. Arnold and K. Schiltz, "Quality evaluation of watermarked audio tracks," SPIE Electronic Imaging, Vol. 4675, pp. 91-101, 2002.

[13] Hyoung Joong Kim, "Audio watermarking techniques," Pacific Rim Workshop on Digital Steganography, Kitakyushu, 07/2003.

[14] D. Pan, "A tutorial on MPEG/audio compression," IEEE MultiMedia, Vol. 2, No. 2, pp. 60-74, 1995.