
Using dynamic time warping of T0 contours in the evaluation of cycle-to-cycle Pitch Detection Algorithms


Pattern Recognition Letters 31 (2010) 517–522

Contents lists available at ScienceDirect

Pattern Recognition Letters

journal homepage: www.elsevier.com/locate/patrec


Carlos Ferrer *, Diana Torres, María E. Hernández-Díaz

Center for Studies on Electronics and Information Technologies, Central University of Las Villas, C. Camajuaní, km 5 1/2, Santa Clara, CP 54830, Cuba

Article info

Article history: Available online 5 August 2009

Keywords: Dynamic time warping; Pitch determination; Waveform matching

0167-8655/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2009.07.021

* Corresponding author. E-mail addresses: [email protected] (C. Ferrer), [email protected] (D. Torres), [email protected] (M.E. Hernández-Díaz).

Abstract

This paper addresses the comparison of Pitch Detection Algorithms working on a cycle-to-cycle basis. An alignment problem between detected and reference pitch contours is described, and a Dynamic Time Warping procedure to correct it is proposed. Results demonstrate the occurrence of shifts in practice and the usefulness of the proposed Dynamic Time Warping procedure.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Pitch (T0) determination is one of the most common and necessary tasks in speech processing. The requirements of each specific application (e.g. speech coding in communications systems, prosodic and intonation analysis in phonetics, or perturbation measurements in clinical studies) are different, which has led to the proliferation of a large number of Pitch Determination Algorithms (PDAs). The most extensive categorization and compilation of PDAs is the one by Hess (Hess, 1983; later updated in Hess, 1992). The number of available PDAs continues to grow each year (e.g. Interspeech-2007 dedicated two sessions to pitch extraction) because no definitive, reliable and accurate solution exists for the different applications. The evaluation and comparison of the performance of this large number of PDAs is therefore of paramount importance. However, there is no standard comparison procedure for this task.

This paper addresses the latter topic for the group of PDAs capable of tracking cycle-to-cycle T0 contours, specifically those oriented to pathologic voices. The remainder of this section describes some methodological aspects of cycle-to-cycle PDAs and their comparison, revealing an unreported inconsistency in some measures.

1.1. Cycle-to-cycle pitch determination

PDAs can be broadly categorized into “short-term” PDAs (working on “small” analysis frames in which the signal can be considered stationary and which contain several pitch pulses, but yielding a single estimate of T0 for the whole frame) and cycle-to-cycle PDAs. The latter work over even smaller frames and provide individual pitch markers and, hence, a more detailed T0 contour. This detailed T0 contour is required to obtain most of the time-domain perturbation measures used in Vocal Quality assessment (i.e. jitter, shimmer, and many forms of harmonics-to-noise ratios).

Any type of cycle-to-cycle PDA can provide the necessary T0 contour for the calculation of perturbation measures in clinical applications. However, there is a general consensus (Shahnaz et al., 2005; Bagshaw et al., 1993; Titze, 1995; Parsa and Jamieson, 1999; Veprek and Scordilis, 2002) that Waveform Matching (WM) techniques are the most precise T0 estimators. An exception to the reported overall superior performance of WM techniques is found in (de Chevigné and Kawahara, 2001), but for very broad search ranges (40–800 Hz) and outside the scope of perturbation measures. Such a broad search range is uncommon for cycle-to-cycle PDAs, which usually work by “refining” an existing T0 estimate or “mean” (T0E). The estimate is known in synthetic signals, and obtainable by different means in real signals (e.g. manually by visual inspection in (Milenkovick, 1987), from a low-pass filtered version of the signal in (Titze and Liang) or using a short-term PDA as in the Praat system (Boersma and Weenink, 2007)). The existence of such a T0 estimate allows the use of narrow search ranges (Sr), avoiding some of the errors most frequently reported in several studies, like doubling or halving the actual T0.

1.2. Measures of performance

The performance of the PDAs is measured according to the similarity between the detected T0 contour (Td(n)) and a reference contour (Tr(n)). When comparing short-term PDAs, “n” is the frame index, while in cycle-to-cycle PDAs it stands for the pulse index. A brief survey of representative measures of performance follows. For practical purposes, measures are grouped in two categories: “reliability” measures, which account for how frequently the PDA misses the right T0, and “accuracy” measures, accounting for the magnitude of the differences between Td(n) and Tr(n).

1.2.1. Reliability measures

1.2.1.1. Voicing determination. Subdivided into Voiced/Unvoiced (V/U) and Unvoiced/Voiced (U/V) errors (Rabiner et al., 1976; Bagshaw et al., 1993; de Chevigné and Kawahara, 2001; Veprek and Scordilis, 2002). These errors are mostly relevant for short-term PDAs and can be excluded from the analysis in perturbation measures, where voicing is a prerequisite and can be determined by other means.

1.2.1.2. Gross errors (GE). Number of times the difference between the two contours is greater than a certain threshold th. Most of the time the threshold is normalized to Tr(n) and given in percent (5% in (Nakatani and Irino, 2004), 20% in (Bagshaw et al., 1993) and (de Chevigné and Kawahara, 2001)), although absolute values have also been used (5 samples in (Wise et al., 1976), 1 ms in (Rabiner et al., 1976) and (Shahnaz et al., 2005), 10 Hz in (Moorer, 1974)). Special types of GE occur when errors are of a magnitude of one octave or more, since they either fail to report at least one pulse (detected period one octave or more longer) or report additional pulses between two real pulses (one octave or more shorter). In de Chevigné and Kawahara (2001) GE is further divided into these “too low” and “too high” errors, and the “too low” errors were linked to the detection of subharmonics. These pulse insertion or deletion errors form a different group of measures, described next.

1.2.1.3. Pitch insertions and deletions (PI/PD). In (Veprek and Scordilis, 2002) the occurrence of PI and PD was connected to the detection of multiples and submultiples, respectively, of the true T0 (Tr(n)). In (Deliyski et al., 2005) a change (increment or reduction) in the number of detected pulses is also linked to the detection of multiples and submultiples of Tr(n), whenever Td(n) falls outside the range 0.75Tr(n) < Td(n) < 1.5Tr(n). In perturbation measures, if an estimate of Tr is available (T0E), it is possible to limit Sr (the search range of Td), and this type of error would not occur. Sr is usually symmetrical around T0E and given relative to it, establishing that T0E(1 − Sr) < Td(n) < T0E(1 + Sr). Different values for Sr have been used: 10% in (Milenkovick, 1987) and (Medan et al., 1991), 5–15% in (Titze and Liang, 1993), 15% in (Deem et al., 1989), 20% in (Boersma and Weenink, 2007). The symmetrical limit for Sr that avoids the detection of multiples and submultiples of T0 is 33% (i.e. the maximum being less than twice the minimum), an alternative to the asymmetrical range in (Deliyski et al., 2005).
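The 33% figure can be checked with a short derivation. Requiring the upper end of the symmetrical search range to stay below twice its lower end gives:

$$ T0_E\,(1 + S_r) < 2\,T0_E\,(1 - S_r) \;\Rightarrow\; 1 + S_r < 2 - 2S_r \;\Rightarrow\; S_r < \tfrac{1}{3} \approx 33\% $$

With any smaller Sr, no candidate period inside the range can be twice (or half) another candidate in the same range.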

1.2.2. Accuracy measures

1.2.2.1. Intra-contour variability (Jitter). The values of jitter for the reference and detected contours can be compared, and the difference would yield an estimate of the PDA's success. There are many formulas to account for jitter (see (Baken, 1999) for some variants, with linear, squared or logarithmic terms), although a representative expression is given by (1), the first-order perturbation:

$$ a_d = \frac{1}{N-1} \sum_{n=1}^{N-1} \frac{|T_d(n+1) - T_d(n)|}{0.5\,\big(T_d(n+1) + T_d(n)\big)} \times 100 \qquad (1) $$

which is the one used in (Titze and Liang, 1993) and (Titze, 1995). The usefulness of Δa = |ad − ar| is criticized in (Parsa and Jamieson, 1999) by noting that two constant but different contours Td(n) and Tr(n) would yield the same value ad = ar = 0, thus giving a perfect coincidence even though the contours are different. The criticism can actually be extended: there are infinitely many pairs of random contours Td(n) and Tr(n) which yield coincident ad and ar of any value, not only zero. Jitter is defined as the irregularity of the T0 contour, and identical irregularities do not imply identical contours. In spite of this inconsistency, jitter measures like a form the core of the reported comparisons among the PDAs of commercial systems (Karnell et al., 1995; Bielamowicz et al., 1996; Perry et al., 2000, 2003; Deliyski et al., 2005).
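The limitation of Δa can be illustrated with a minimal Python sketch of expression (1). The function name `jitter_a` and the two contours are hypothetical, chosen only for illustration: the reference contour is the detected one scaled by 1.5, so every period differs by 50% and yet both contours have the same jitter.

```python
def jitter_a(T):
    """First-order perturbation of a period contour T, in percent (expression (1))."""
    n = len(T)
    if n < 2:
        raise ValueError("at least two periods are needed")
    total = 0.0
    for k in range(n - 1):
        # absolute difference of consecutive periods, normalized by their mean
        total += abs(T[k + 1] - T[k]) / (0.5 * (T[k + 1] + T[k]))
    return 100.0 * total / (n - 1)


# Hypothetical contours in samples: Tr is Td scaled by 1.5.
Td = [100.0, 102.0, 100.0, 102.0, 100.0]
Tr = [1.5 * t for t in Td]

a_d = jitter_a(Td)
a_r = jitter_a(Tr)
delta_a = abs(a_d - a_r)
print(f"a_d = {a_d:.4f} %, a_r = {a_r:.4f} %, delta_a = {delta_a:.2e}")
```

Because each term of (1) is a ratio of periods, any uniform scaling of the contour leaves the measure unchanged, which is one concrete family of contour pairs with coincident jitter.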

1.2.2.2. Inter-contour variability. When both contours are available for the evaluation, a direct evaluation of the differences is possible, and several approaches have been used:

• Root Mean Squared (RMS) measures of Fine Errors: Used as the complementary measure to GE, this measures the root mean squared value of the differences between Td(n) and Tr(n), when the differences are smaller than the same threshold th used to determine GE. Slight variations in the formulas exist across studies, e.g. the RMS is most often calculated using the raw differences (Wise et al., 1976; Moorer, 1974; Bagshaw et al., 1993; Shahnaz et al., 2005) in milliseconds, but also normalizing the differences by the Tr(n) value (Parsa and Jamieson, 1999) and removing the mean error (Rabiner et al., 1976).

• Average Normalized Differences: Non-RMS (i.e. linear) measures of the difference between Td(n) and Tr(n) have also been proposed (Medan et al., 1991). An expression close to the definition of “a” in (1), which can be interpreted as the “jitter” introduced by the PDA (and named “b” following the suggestion in (Parsa and Jamieson, 1999) as a necessary complement to “a”), is:

$$ b = \frac{1}{N} \sum_{n=1}^{N} \frac{|T_d(n) - T_r(n)|}{T_r(n)} \times 100 \qquad (2) $$

Only measures considering differences between Tr(n) and Td(n) (like “b” or the RMS) can reveal the accuracy of a PDA. The use of measures like Δa is justified only when Td(n) and Tr(n) are not available to the researcher.
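The inter-contour measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `compare_contours`, the relative threshold of 10% and the sample contours are hypothetical, the RMS uses raw differences (one of the variants mentioned above), and b follows expression (2) over all pulses.

```python
import math


def compare_contours(Td, Tr, th_rel=0.10):
    """Return (gross error count, RMS of fine errors, b in percent).

    th_rel is the GE threshold relative to Tr(n); the value 0.10 is an assumption.
    """
    if len(Td) != len(Tr) or not Tr:
        raise ValueError("contours must be non-empty and of equal length")
    gross = 0
    fine_squares = []
    relative_sum = 0.0
    for td, tr in zip(Td, Tr):
        diff = td - tr
        relative_sum += abs(diff) / tr          # term of expression (2)
        if abs(diff) > th_rel * tr:
            gross += 1                          # gross error
        else:
            fine_squares.append(diff * diff)    # fine error, raw difference
    rms_fine = math.sqrt(sum(fine_squares) / len(fine_squares)) if fine_squares else 0.0
    b = 100.0 * relative_sum / len(Tr)
    return gross, rms_fine, b


# Hypothetical contours in samples.
Tr = [150.0, 150.0, 150.0, 150.0]
Td = [150.0, 153.0, 147.0, 180.0]
ge, rms_fine, b = compare_contours(Td, Tr)
print(ge, rms_fine, b)
```

In this toy case only the last pulse exceeds the threshold, while the two 3-sample deviations are counted as fine errors.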

1.3. Non reported failure of existing measures of performance

The existing set of measures of performance and comparison procedures does not account for the occurrence of shifts in the Td contour. Fig. 1 shows a segment comprising pulses 25–36 of Td(n) (circles) and Tr(n) (squares) for a given signal. In terms of the difference between Td(n) and Tr(n), and using th = 10%, there are six GE in the interval (pulses 28–33). However, five out of these six “erroneous” Td(n) correspond to Tr(n + 1). In terms of perturbation measures, Td(n) follows the “true” variability of Tr(n) with the exception of two GE: the one in pulse 28 (which “shifted” the contours) and the one in pulse 33, which by coincidence restored the alignment.

It must be noted that no PD occurred in Pd(n) to cause Td(27) = Tr(27) and Td(28) = Tr(29). No PI occurred either in the realignment error in pulse 33. The number of pitch markers is not changed by these “shifting” errors, which are only perceived when analyzing the separation contours T(n). These shifting errors are a non-reported type of error, which may be included in GE because of their magnitude but, unlike “normal” GE, cause an undeserved and unpredictable deterioration in inter-contour variability measures. In the example of Fig. 1, the PDA incurred a shifting GE in the 28th separation and another shifting GE in the 33rd separation, for a total of two actual GE. However, the pulse-by-pulse comparison of both contours yields an undeserved total of six GE, since the contours differ from the first to the second error (pulses 28–33). The magnitude of this undeserved deterioration in the PDA performance is unpredictable, reaching the worst case when the PDA commits a single shifting error in the first pulse, so that the whole contour is reported as erroneous. The deterioration of the performance measures depends on the length of the shifted segments, and not on the frequency of the errors causing these shifts. To adequately appraise the performance of a particular PDA, additional steps are required to detect and correct shifts prior to the calculation of performance measures.

Fig. 1. Example of shift between Td(n) (circles) and Tr(n) (squares). (a) Glottal pulse separations T(n), in samples (horizontal axis: glottal pulse separation index; vertical axis: duration in samples). (b) Glottal pulse marker positions P(n), in samples (horizontal axis: time in samples; vertical axis: amplitude). T(n) = P(n + 1) − P(n) has been used. Detected contours in circles, reference contours in squares.

Fig. 2. Sample Td(nd) and Tr(nr) and a possible optimal path (trailing arrows in the grid). DTW consists in finding nd = f(nr) such that the path is optimal according to a defined cost function.

Fig. 3. Possible routes from the kth point in the path: R1, R2 and R3.
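The effect described for Fig. 1 can be reproduced with a small Python sketch. The contour values below are hypothetical (they are not the data of Fig. 1); the detected contour is built so that five consecutive separations equal the next reference separation and a sixth one restores the alignment, with the number of separations unchanged.

```python
def gross_errors(Td, Tr, th_rel=0.10):
    """Indexes where |Td(n) - Tr(n)| exceeds th_rel * Tr(n) (pulse-by-pulse comparison)."""
    return [n for n, (td, tr) in enumerate(zip(Td, Tr)) if abs(td - tr) > th_rel * tr]


# Hypothetical reference separations, in samples.
Tr = [150, 152, 149, 130, 160, 135, 165, 140, 170, 150, 151, 149]

# Detected contour: shifted by one index over n = 3..7, realigned at n = 8.
Td = list(Tr)
Td[3:8] = Tr[4:9]   # Td(n) = Tr(n + 1) inside the shifted segment
Td[8] = 130         # the error that restores the alignment

ge = gross_errors(Td, Tr)
follows_next = [n for n in ge if n + 1 < len(Tr) and Td[n] == Tr[n + 1]]

print("GE reported pulse by pulse:", ge)
print("of which Td(n) equals Tr(n + 1):", follows_next)
print("same total duration:", sum(Td) == sum(Tr))
```

Six gross errors are reported by the direct comparison, although five of the six detected values reproduce the reference contour one index later, which is the situation the alignment step is meant to undo.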

It must also be noted that, even with a narrow Sr avoiding the detection of multiples and submultiples, PIs and PDs can actually occur. Cumulative errors in the same direction can cause an actual difference in the number of pitch markers between the two contours. In the example in Fig. 1, the second error was in the opposite direction to the first, and produced a realignment of the Td and Tr contours. However, if it had occurred in the same direction as the first error, it would have caused the shift in separations (T contours) to be of two indexes, and a pulse marker (in P contours) would be missing. The opposite, a pulse insertion, can occur if two consecutive shifts take place in the direction opposite to the one causing the pulse deletion. These PIs and PDs are different from the ones described in (Veprek and Scordilis, 2002) and (Deliyski et al., 2005), since the newly described ones do not occur in the time span of a single pulse (which is not possible due to the Sr limit), but over several pulses. Previous works using narrow Sr have also overlooked the possibility of occurrence of these “cumulative” PIs and PDs.

In summary, there are two types of previously unreported errors that are not accounted for in existing performance measures: shifts between pulse separation contours, and the insertion or deletion of pulses when using small search ranges. The latter errors can be seen as a consequence of repeated shifts in the same direction.

Fig. 4. Flowchart of the alignment procedure devised.

Table 1. Values of the individual perturbation per level, combined in the synthetic signals.

| Level | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Jitter (%) | 3.4 | 6.8 | 10.2 | 13.6 | 17 | 20.4 | 23.8 |
| Shimmer (%) | 6.8 | 13.4 | 20.4 | 27.2 | 34 | 40.8 | 47.6 |
| SNR (dB) | 22 | 18 | 15 | 12 | 8 | 5 | 2 |

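Table 2. Results of the DTW procedure for the three PDAs tested with real signals. The accuracy measures after the correction are shown with primes (a′d and b′); here Δa = |ad − ar| and Δa′ = |a′d − ar|. Δa, Δa′, b and b′ are accuracy measures; GE, SR, SL, PI and PD are reliability measures.

| PDA | Δa | Δa′ | b | b′ | GE | SR | SL | PI | PD |
|---|---|---|---|---|---|---|---|---|---|
| Mil | 0.99 | 0.80 | 3.27 | 1.64 | 12.1 | 0.85 | 0.97 | 0.23 | 0.02 |
| Tit | 1.10 | 0.94 | 3.25 | 1.41 | 10.6 | 0.77 | 0.85 | 0.34 | 0.02 |
| Med | 1.05 | 0.90 | 2.97 | 1.26 | 9.8 | 0.77 | 0.97 | 0.34 | 0.02 |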

Table 3. Results of the DTW procedure for the three PDAs tested with synthetic signals (levels 1–3).

| Measure | PDA | 1 | 2 | 3 |
|---|---|---|---|---|
| b′ | Mil | 0.32 | 1.70 | 2.63 |
| b′ | Tit | 0.27 | 0.13 | 0.12 |
| b′ | Med | 0.06 | 0.06 | 0.06 |
| GE | Mil | 2.01 | 10.57 | 15.34 |
| GE | Tit | 1.67 | 0.90 | 0.83 |
| GE | Med | 0 | 0 | 0 |
| SL | Mil | 0.65 | 1.13 | 1.19 |
| SL | Tit | 0.55 | 0.22 | 0.15 |
| SL | Med | 0 | 0 | 0 |
| SR | Mil | 0.67 | 1.21 | 1.31 |
| SR | Tit | 0.57 | 0.22 | 0.15 |
| SR | Med | 0 | 0 | 0 |

2. Time warping of Td and Tr

To account for the errors described in Section 1.3 and allow a fair comparison of Td(n) and Tr(n), both contours should be aligned with respect to the corresponding pulse indexes. This can be performed by means of a dynamic time warping (DTW) procedure similar to the ones used in speech recognition systems (Sakoe and Chiba, 1978; Myers et al., 1980), later applied in other fields like medicine (Caiani et al., 1998), bioinformatics (Aach and Church, 2001) and data mining (Kim et al., 2004). DTW is a particular case of Dynamic Programming (Bellman, 1957), applied to sequence alignment. DTW performs the alignment by searching for an optimal route, according to a particular cost function, in a two-dimensional grid defined by the indexes of both sequences (i.e. Tr(nr) and Td(nd), with nr and nd the indexes of the reference and detected contours, respectively), as depicted in Fig. 2.

Besides the cost function, each DTW algorithm is characterized by path restrictions such as global and local constraints, monotonicity and endpoints (Sakoe and Chiba, 1978; Myers et al., 1980).

We define the time series of pitch marker positions Pd(nd) and Pr(nr) such that Td(nd) = Pd(nd + 1) − Pd(nd) and Tr(nr) = Pr(nr + 1) − Pr(nr). Knowledge of Pd(nd) and Pr(nr) allows relevant index constraints to be set in the time warping algorithm.
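As a minimal illustration of this definition, the following Python sketch converts marker positions into separations. The helper name `separations` and the marker values are hypothetical.

```python
def separations(P):
    """Pulse separations T(n) = P(n + 1) - P(n) from marker positions P, in samples."""
    return [P[n + 1] - P[n] for n in range(len(P) - 1)]


# Hypothetical reference markers, in samples.
Pr = [0, 150, 302, 451, 581]
Tr = separations(Pr)
print(Tr)
```

A contour of N separations therefore requires N + 1 markers, which is why a shift in T(n) can occur without any change in the number of markers.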

2.1. Global constraints

Global constraints define the region in the grid where the path can be located. In this particular case, two conditions were set to avoid the presence of more than two pulses between the current pulse positions (Pr(nr) and Pd(nd)):

$$ (A) = P_d(n_d - 1) < P_r(n_r + 1), \qquad (B) = P_d(n_d + 1) < P_r(n_r - 1) \qquad (3) $$

When “A” or “B” is not met, a pulse insertion (PI) or deletion (PD) is reported, respectively, and indexes nd or nr are corrected accordingly (see “Global Constraints” in the flowchart of Fig. 4). The values of PI and PD at the end of this DTW algorithm are new measures of performance, accounting for the errors of this type produced by the PDA.

2.2. Local constraints

Local constraints define the possible transitions from a given point in the grid. In this case, to account for the shifts described in Section 1.3, only three routes are needed, as depicted in Fig. 3:

R1 is taken when Td is shifted to the left with respect to Tr (an SL error is reported), R2 is taken when the shift is to the right (an SR error is reported) and R3 is the default route, taken when no shift is detected. The decision on which route is taken can be expressed as:

$$ \begin{aligned} R_1 &= E(0,0) \wedge \neg E(0,-1) \wedge \neg E(1,0) \wedge \big(P_r(n_r) \le P_d(n_d)\big) \\ R_2 &= E(0,0) \wedge \neg E(0,1) \wedge \neg E(-1,0) \wedge \big(P_r(n_r) \ge P_d(n_d)\big) \\ R_3 &= \neg R_1 \wedge \neg R_2 \end{aligned} \qquad (4) $$

where an auxiliary error-presence condition E has been used, defined as:

$$ E(a,b) = \big( |T_r(n_r + a) - T_d(n_d + b)| \ge th \big) \qquad (5) $$

E(a,b) represents the presence of a significant difference (reaching a given threshold th) between Td and Tr at a given point (nr + a, nd + b) in the grid. There are three heuristics included in the R1 and R2 decisions in (4). First, the possibility of a shift is considered only if there is an error at the current position in the grid (E(0,0) = True). Second, there must be a strong similarity between Td and Tr for two points in the grid with the purported shift (i.e. E(0,−1) = False and E(1,0) = False for the test of SL, where nr would be nd − 1). Finally, the position of the markers in samples must also support the purported shift (e.g. Pr(nr) ≤ Pd(nd) in the test of SL).

Table 3 (continued). Results of the DTW procedure for the three PDAs tested with synthetic signals (levels 4–7).

| Measure | PDA | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|
| b′ | Mil | 3.26 | 2.53 | 3.37 | 5.39 |
| b′ | Tit | 0.20 | 0.28 | 0.63 | 1.37 |
| b′ | Med | 0.05 | 0.18 | 0.56 | 0.93 |
| GE | Mil | 19.09 | 15.88 | 18.93 | 25.65 |
| GE | Tit | 1.37 | 1.70 | 2.82 | 5.46 |
| GE | Med | 0 | 0.67 | 3.05 | 3.96 |
| SL | Mil | 1.65 | 1.63 | 1.74 | 1.78 |
| SL | Tit | 0.21 | 0.23 | 0.24 | 0.37 |
| SL | Med | 0 | 0 | 0.52 | 0.52 |
| SR | Mil | 1.84 | 2.07 | 2.47 | 3.17 |
| SR | Tit | 0.27 | 0.36 | 0.45 | 0.92 |
| SR | Med | 0 | 0.10 | 0.58 | 0.62 |

The similarity check in the second heuristic was limited to two points, since the larger the number of pulses evaluated, the larger the chance that an error occurs in any of them and that an appropriate correction is discarded for that reason. An even number of pulses was preferred since there is a more straightforward definition of which pulses should be checked for those numbers. For instance, in the chosen case of checking two pulses, the two points in the grid nearest to the current point (nr + 0, nd + 0) to be checked for errors in the SR condition (nd = nr − 1) are the chosen (−1,0) and (0,1). However, when checking three points, the set of (−2,−1), (−1,0) and (0,1) is as close to the current point as the set of (−1,0), (0,1) and (1,2), making the choice of which set to use rather arbitrary. Two is the smallest even number of pulses to consider, and that was the configuration chosen for the DTW algorithm. The drawback of checking fewer pulses is that undue corrections are more probable.
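The route decision of this section can be sketched in Python as a literal transcription of the error-presence condition E(a,b) and of the R1, R2 and R3 tests. Several details are assumptions made only for this illustration: the function names, the sample contours, the treatment of out-of-range indexes as errors, and the preference for R1 if both R1 and R2 happened to hold. How the indexes advance along each route is defined by Fig. 3 and is not reproduced here.

```python
def E(Tr, Td, nr, nd, a, b, th):
    """Error presence at grid point (nr + a, nd + b): |Tr - Td| >= th."""
    i, j = nr + a, nd + b
    if not (0 <= i < len(Tr) and 0 <= j < len(Td)):
        return True  # assumption: points outside the contours cannot support a shift
    return abs(Tr[i] - Td[j]) >= th


def route(Tr, Td, Pr, Pd, nr, nd, th):
    """Return 'R1' (shift to the left), 'R2' (shift to the right) or 'R3' (default)."""
    e00 = E(Tr, Td, nr, nd, 0, 0, th)
    r1 = (e00 and not E(Tr, Td, nr, nd, 0, -1, th)
          and not E(Tr, Td, nr, nd, 1, 0, th) and Pr[nr] <= Pd[nd])
    r2 = (e00 and not E(Tr, Td, nr, nd, 0, 1, th)
          and not E(Tr, Td, nr, nd, -1, 0, th) and Pr[nr] >= Pd[nd])
    if r1:
        return "R1"  # assumption: R1 is preferred if both tests hold
    if r2:
        return "R2"
    return "R3"


def markers(T):
    """Marker positions from separations, starting at sample 0."""
    P = [0]
    for t in T:
        P.append(P[-1] + t)
    return P


# Hypothetical contours: Td(n) = Tr(n + 1) for n = 3..7, realigned at n = 8.
Tr = [150, 152, 149, 130, 160, 135, 165, 140, 170, 150, 151, 149]
Td = list(Tr)
Td[3:8] = Tr[4:9]
Td[8] = 130
Pr, Pd = markers(Tr), markers(Td)
th = 0.03 * 150  # 3% of an assumed mean period T0E = 150 samples

inside_shift = route(Tr, Td, Pr, Pd, 4, 4, th)
no_error = route(Tr, Td, Pr, Pd, 1, 1, th)
print(inside_shift, no_error)
```

At the point (4, 4) the current values differ, the two neighbouring checks agree with a one-index shift, and the marker positions satisfy the R1 inequality, so R1 is selected; at (1, 1) there is no error and the default route is kept.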

2.3. Cost function

As can be inferred from previous sections, the proposed algorithm has been devised to minimize the number of errors present on the path, and the cost function can be expressed as:

$$ GE = \sum_{k} E\big(n_d(k), n_r(k)\big) \qquad (6) $$

2.4. Flowchart

The flowchart of the proposed DTW procedure can be seen in Fig. 4.

As a result of the algorithm, an “aligned” detected contour (T′d(n)) is obtained, comparable pulse by pulse to the reference contour Tr(n). The numbers of GE, PD, PI, SL and SR (shifts to the left and right, respectively) are also reported. When the algorithm ends (by reaching the limit of pulses in either Td or Tr), the T′d contour can be compared to Tr at the same pulse indexes.

3. Experiments

The proposed comparison procedure was evaluated by testing three well-known WM variants (the one in (Milenkovick, 1987), its modification in (Titze and Liang, 1993) and the method in (Medan et al., 1991)) against synthetic and hand-marked real signals. These WM PDAs have shown better performance than other cycle-to-cycle (Titze and Liang, 1993) and short-term PDAs (Bagshaw et al., 1993; Parsa and Jamieson, 1999; Veprek and Scordilis, 2002), but a direct comparison among them has not been performed.

3.1. Real signals

To evaluate the actual occurrence in real signals of the different types of shifts described, samples were taken from the Massachusetts Eye and Ear Infirmary / Kay Elemetrics Database (Kay Elemetrics Corp, 1994). Since the database is not annotated, hand marking of individual pitch epochs was required. To save time, a screening procedure consisting of several steps was applied to select a reduced set of signals (namely a dozen) that should be representative for our purposes. The steps of the screening procedure, and their results, are the following:

1st: Evaluate only pathological samples. This reduces the number of signals from 710 to 657.

2nd: Use signals where at least one of the considered methods fails. To guarantee this failure, an unsupervised run (unchecked T0E estimate) of the three methods was performed. The median of the resulting T0 from the classic AMDF short-term PDA (Ross et al., 1974), with a search range from 80 to 600 Hz, was used as the T0E for the WM methods. Signals where non-significant differences (12%) were found among the WM methods were discarded. This reduced the number of signals to consider to 108.

3rd: Spectrographic check: The spectrograms of the 108 signals were inspected to suppress signals where the AMDF T0E estimate was wrong (15) and completely aperiodic signals (58). This step left 39 signals to consider.

4th: Oscillographic check: A final factor further reduced the number of signals when visually inspected in the time domain. Ten signals showed a time-varying pattern that made it impossible to uniquely define the pitch marker to use.

The remaining 29 signals were hand marked, following the most prominent peak in the pulse (maximum or minimum). The hand marking process produced 4695 pitch markers (average F0 = 161 Hz).

3.2. Synthetic signals

Synthetic signals were generated according to the procedure in (Medan et al., 1991; Parsa and Jamieson, 1999; Ferrer et al., 2006), where an all-pole filter with formant frequencies corresponding to a vowel “a” is excited by a train of impulses. Jitter and shimmer are added by modifying the impulse separations and amplitudes, while noise is added to meet a desired signal-to-noise ratio (SNR). Eight combinations of the three types of perturbations were generated, in ascending order of periodicity degradation. The values of the different factors for each level are given in Table 1.

The length of the synthesized signals is 2 s, with sampling frequency Fs = 22050 Hz and mean F0 = 150 Hz, giving an average of 300 pulses per signal.
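The generation scheme can be sketched in pure Python as follows. This is not the procedure of the cited works, only a rough illustration under loudly stated assumptions: the function name `synthesize`, the Gaussian jitter and shimmer models, the formant frequencies and bandwidths, and the white Gaussian noise are all hypothetical choices; only the sampling frequency, duration, mean F0 and the level 1 values of Table 1 come from the text.

```python
import math
import random


def synthesize(fs=22050, f0=150.0, dur=2.0, jitter_pct=3.4, shimmer_pct=6.8,
               snr_db=22.0,
               formants=((700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)),
               seed=0):
    """Return (signal, markers): a perturbed impulse train through an all-pole filter.

    formants holds assumed (frequency, bandwidth) pairs in Hz for a vowel "a".
    """
    rng = random.Random(seed)
    n = int(fs * dur)
    t0 = fs / f0                      # mean period in samples
    excitation = [0.0] * n
    markers = []
    pos = 0.0
    while True:
        idx = int(round(pos))
        if idx >= n:
            break
        markers.append(idx)
        # shimmer: perturbed impulse amplitude (assumed Gaussian model)
        excitation[idx] = 1.0 + rng.gauss(0.0, shimmer_pct / 100.0)
        # jitter: perturbed impulse separation (assumed Gaussian model)
        pos += t0 * max(0.1, 1.0 + rng.gauss(0.0, jitter_pct / 100.0))
    signal = excitation
    for freq, bw in formants:
        # two-pole resonator: y[k] = x[k] + c1 * y[k-1] + c2 * y[k-2]
        r = math.exp(-math.pi * bw / fs)
        c1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
        c2 = -r * r
        y1 = y2 = 0.0
        out = []
        for x in signal:
            y = x + c1 * y1 + c2 * y2
            out.append(y)
            y2, y1 = y1, y
        signal = out
    # additive white noise scaled to the requested SNR
    power = sum(s * s for s in signal) / n
    sigma = math.sqrt(power / (10.0 ** (snr_db / 10.0)))
    noisy = [s + rng.gauss(0.0, sigma) for s in signal]
    return noisy, markers


signal, markers = synthesize()
print(len(signal), len(markers))
```

The returned markers play the role of the reference positions Pr(n), so the reference contour Tr(n) is known exactly for every synthetic signal.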

4. Results

The alignment procedure was performed by setting the E(a,b) threshold to 3% of the estimated mean period (th = 0.03 T0E). This is somewhat stricter than the values previously reported (5–20%), but this threshold is also used for checking the R1 and R2 expressions, and lower values prevent unnecessary adjustments.

4.1. Real signals

In addition to the reliability measures produced by the alignment procedure (GE, PI, PD, SR and SL), the values for the accuracy measures a and b were also calculated. To better appraise the effect of the alignment procedure, a and b were obtained for the contours before and after the alignment. The PDAs are denoted “Mil” (Milenkovick, 1987), “Tit” (Titze and Liang, 1993), and “Med” (Medan et al., 1991). All results are shown in Table 2, and given in percent.

The inadequacy of jitter measures for evaluating contour similarity is confirmed by the results of the accuracy measures. The change in |ad − ar| between the results before and after the alignment is far smaller than the reduction observed in b, which is approximately halved. The reliability measures obtained, as well as the reduction in b, demonstrate the occurrence of the described shifts. The occurrence of PIs is caused by some missing pulses in the real signals, while the PDAs are forced to detect pulses in those places. The measures of reliability produced by the DTW algorithm should be seen as an initial performance measure for a given PDA under evaluation. The sum of SR, SL, PI and PD is similar for the three PDAs (2.07, 1.98 and 2.10), and it is noteworthy that this roughly 2% of errors makes b about twice as large as b′. The most adequate accuracy indicator of a PDA's performance is this b′ measure.
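The arithmetic behind these remarks can be checked directly from the values of Table 2. The dictionary layout and key names below are only an illustrative choice.

```python
# Values copied from Table 2 (percent).
table2 = {
    "Mil": {"b": 3.27, "b_aligned": 1.64, "SR": 0.85, "SL": 0.97, "PI": 0.23, "PD": 0.02},
    "Tit": {"b": 3.25, "b_aligned": 1.41, "SR": 0.77, "SL": 0.85, "PI": 0.34, "PD": 0.02},
    "Med": {"b": 2.97, "b_aligned": 1.26, "SR": 0.77, "SL": 0.97, "PI": 0.34, "PD": 0.02},
}

totals = {}
ratios = {}
for name, row in table2.items():
    # share of pulses affected by shifts, insertions and deletions
    totals[name] = row["SR"] + row["SL"] + row["PI"] + row["PD"]
    # how much larger b is before the alignment than after it
    ratios[name] = row["b"] / row["b_aligned"]
    print(f"{name}: SR+SL+PI+PD = {totals[name]:.2f} %, b / b' = {ratios[name]:.2f}")
```

The totals reproduce the 2.07, 1.98 and 2.10 quoted in the text, and the ratio of b before and after alignment is close to or above two for all three PDAs.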

Regarding the performance of the PDAs, there is no definitive superiority, although “Med” commits fewer gross errors and presents the lowest b′ (both significant at the 0.01 level when compared to “Mil” and “Tit”; one-tailed, paired-samples, nonparametric Wilcoxon signed-rank test).

4.2. Synthetic signals

Signals for each level of perturbation were generated 100 times, and the averaged results are shown in Table 3. The jitter values have been omitted, due to their demonstrated lack of relevance, and only b′ (calculated on the corrected contour) is shown, since the value obtained with the original contour is randomly inflated by shift errors.

In this test with synthetic signals the superiority of “Med” is evident, with better results in all measures of performance for all levels of perturbation. PI and PD were zero for all methods (the DTW algorithm avoided the cumulative errors that could produce insertions and deletions). There is also a slightly higher frequency of occurrence of SR than SL, which might imply that the three PDAs have an asymmetrical tendency to report longer rather than shorter pulses. A similar bias was reported by Milenkovick (1987) for his method, but it seems that the tendency is shared by the Medan et al. (1991) approach.

5. Conclusions

Two new types of PDA errors have been described, and a DTW procedure to detect and correct them was devised. The proposed DTW procedure produced an increase in contour similarity (b was reduced to half in real signals) and demonstrated the occurrence of shifts in practice. These findings are particularly important in perturbation measures, a field where PDAs are believed to be unreliable for jitter values above 6% (Titze and Liang, 1993). The reported PDA performances in terms of accuracy measures may in fact have been degraded by the presence of the unreported types of errors described and corrected here. The new measures of performance obtained can serve as an initial indicator of PDA performance, and the aligned Td contour can be used for accuracy measures.

The proposed DTW algorithm has been devised with simplicity in mind, and can be far from optimal. Other choices for the parameters of the DTW procedure (threshold values, number of pulses checked, etc.) should be evaluated in future work. The amount of data produced by such tests could provide additional findings.
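For readers unfamiliar with the technique, the general idea of warping one T0 contour onto another can be sketched with a textbook dynamic-programming DTW. This sketch is not the procedure proposed in this paper: it uses no thresholds and no limit on the number of pulses checked, and the function name `dtw_align` and the sample contours are hypothetical.

```python
def dtw_align(detected, reference):
    """Textbook DTW between two T0 contours (sequences of period lengths).

    Returns (total_cost, path), where path is a list of (i, j) pairs that
    matches detected[i] to reference[j] along the minimum-cost alignment.
    """
    n, m = len(detected), len(reference)
    inf = float("inf")
    # cost[i][j] is the minimum accumulated cost of aligning the first i
    # detected periods with the first j reference periods.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(detected[i - 1] - reference[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],  # match
                                     cost[i - 1][j],      # extra detected period
                                     cost[i][j - 1])      # extra reference period

    # Backtrack from the end, preferring the diagonal step when costs tie.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Hypothetical contours in samples; the detected one repeats its first period.
total, path = dtw_align([10, 10, 11, 12], [10, 11, 12])
```

In this toy case the repeated first period of the detected contour is absorbed by mapping two detected periods onto the same reference period, so the remaining periods are compared in their aligned positions.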

Regarding the performed comparison of the PDAs, the superresolution method proposed in (Medan et al., 1991) produced the best results among the three WM PDAs considered, in both the hand-marked real signals and the synthetic signals set. These methods had not been compared before, making these results also relevant.

References

Aach, J., Church, G., 2001. Aligning gene expression time series with time warping algorithms. Bioinformatics 17, 495–508.

Bagshaw, S., Hiller, M., Jack, M.A., 1993. Enhanced pitch tracking and the processing of F0 contours for computer aided intonation teaching. Proc. Eurospeech, 1003–1006.

Baken, R.J., 1999. Clinical Measurement of Speech and Voice. Singular, San Diego.

Bellman, R., 1957. Dynamic Programming. Princeton University Press.

Bielamowicz, S., Kreiman, J., Gerratt, B.R., Dauer, M.S., Berke, G.S., 1996. Comparison of voice analysis systems for perturbation measurement. J. Speech Hear. Res. 39, 126–134.

Boersma, P., Weenink, D., 2007. Praat: Doing phonetics by computer. <http://www.fon.hum.uva.nl/praat/>.

Caiani, E.G., Porta, A., Baselli, G., Turiel, M., Muzzupappa, S., Pieruzzi, F., Crema, C., Malliani, A., Cerutti, S., 1998. Warped-average template technique to track on a cycle-by-cycle basis the cardiac filling phases on left ventricular volume. IEEE Comput. Cardiol., 73–76.

de Chevigné, A., Kawahara, H., 2001. Comparative evaluation of F0 estimation algorithms. In: Proc. of EuroSpeech’2001, Scandinavia.

Deem, J.F., Manning, W.H., Knack, J.V., Matesich, J.S., 1989. The automatic extraction of pitch perturbation using microcomputers: Some methodological considerations. J. Speech Hear. Res. 32, 689–697.

Deliyski, D.D., Shaw, H.S., Evans, M.K., 2005. Adverse effects of environmental noise on acoustic voice quality measures. J. Voice 19, 15–28.

Ferrer, C., González, E., Hernández-Díaz, M.E., 2006. Evaluation of time and frequency domain-based methods for the estimation of harmonics-to-noise-ratios in voice signals. LNCS 4225, 406–415.

Hess, W., 1983. Pitch Determination of Speech Signal: Algorithms and Devices. Springer Verlag, Berlin.

Hess, W.J., 1992. Pitch and voicing determination. In: Furui, S., Sondhi, M.M. (Eds.), Advances in Speech Signal Processing. Marcel Dekker, New York.

Karnell, M.P., Hall, K.D., Landahl, K.L., 1995. Comparison of fundamental frequency and perturbation measurements among three analysis systems. J. Voice 9, 383–393.

Kay Elemetrics Corp., 1994. Voice disorders database, Voice and Speech Laboratory, Massachusetts Eye and Ear Infirmary, 2 BridgeWater Lane, Lincoln Park, NJ 07035, USA.

Kim, S.W., Park, S., Chu, W.W., 2004. Efficient processing of similarity search under time warping in sequence databases: an index-based approach. Inf. Syst. 29, 405–420.

Medan, Y., Yair, E., Chazan, D., 1991. Super resolution pitch determination of the speech signals. IEEE Trans. Signal Proc. 39, 40–48.

Milenkovick, P., 1987. Least mean squares measures of voice perturbation. J. Speech Hear. Res. 30, 529–538.

Moorer, J.A., 1974. The optimum comb method of pitch period analysis of continuous digitized speech. IEEE Trans. Acoust. Speech Signal Process. 22, 330–338.

Myers, C., Rabiner, L., Rosenberg, A., 1980. Performance tradeoffs in dynamic time warping algorithms for isolated word recognition. IEEE Trans. Acoust. Speech Signal Process. ASSP-28 (6), 623–635.

Nakatani, T., Irino, T., 2004. Robust and accurate fundamental frequency estimation based on dominant harmonic components. J. Acoust. Soc. Am. 116, 3690–3700.

Parsa, V., Jamieson, D.G., 1999. A comparison of high precision Fo extraction algorithms for sustained vowels. J. Speech Lang. Hear. Res. 42, 112–126.

Perry, C.K., Ingrisano, D.R., Eggleston, K.D., 2003. The effect of noise on computer-aided measures of voice: A comparison of CSpeechSP and the Multi-Dimensional Voice Program software using the CSL 4300B module and Multi-Speech for Windows. J. Voice 17, 12–20.

Perry, C.K., Ingrisano, D.R., Palmer, M.A., McDonald, E.J., 2000. Effects of environmental noise on computer-derived voice estimates from female speakers. J. Voice 14, 146–153.

Rabiner, L.R., Cheng, M.J., Rosenberg, A.E., McGonegal, C.A., 1976. A comparative study of several pitch-detection algorithms. IEEE Trans. Acoust. Speech Signal Process. 24, 399–417.

Ross, M.J., Shaffer, H.L., Cohen, A., Freudberg, R., Manley, H.J., 1974. Average magnitude difference function pitch extractor. IEEE Trans. Acoust. Speech Signal Process. 5, 353–362.

Sakoe, H., Chiba, S., 1978. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. ASSP-26, 43–49.

Shahnaz, C., Zhu, W.P., Ahmad, M.O., 2005. Robust pitch estimation at very low SNR exploiting time and frequency domain cues. ICASSP 389–392.

Titze, I.R., 1995. Summary Statement, Workshop on Acoustic Voice Analysis, National Center of Voice and Speech, Iowa.

Titze, I.R., Liang, H., 1993. Comparison of Fo extraction methods for high-precision voice perturbation measurements. J. Speech Hear. Res. 36, 1120–1133.

Veprek, P., Scordilis, M.S., 2002. Analysis, enhancement and evaluation of five pitch determination techniques. Speech Commun. 37, 249–270.

Wise, J.D., Caprio, J.R., Parks, T.W., 1976. Maximum likelihood pitch estimation. IEEE Trans. Acoust. Speech Signal Process. 24, 418–423.