
Prior Abstracts: CCRMA DSP Seminar (MUS423)

Center for Computer Research in Music and Acoustics (CCRMA) [*]

Department of Music, Stanford University
Stanford, California 94305

Abstract

The CCRMA DSP Seminar [1] is a seminar for graduate students and visiting scholars pursuing music and audio applications of signal processing. This is a collection of prior abstracts from past quarters. The purpose is to provide a broadened view of CCRMA research in recent years. For a more complete summary, see the research section of the CCRMA Overview. [2]

[*] http://ccrma.stanford.edu/
[1] http://ccrma.stanford.edu/~jos/mus423/
[2] http://ccrma.stanford.edu/CCRMA/Overview/


Spring Quarter 2003-2004

Analysis and Resynthesis of Quasi-Harmonic Sounds: An Iterative Filterbank Approach

Harvey Thornburg and Randal Leistikow <harv23,randal at ccrma> (EE)

We employ a hybrid state-space sinusoidal model for general use in analysis-synthesis based audio transformations. This model combines the advantages of a source-filter model with the time-frequency character of the sinusoidal model. We specialize the parameter identification task to a class of “quasi-harmonic” sounds. The latter represent a variety of acoustic sources in which multiple, closely spaced modes cluster about principal harmonics loosely following a harmonic structure.

To estimate the sinusoidal parameters, an iterative filterbank splits the signal into subbands, one per principal harmonic. Each filter is optimally designed by a linear programming approach to be concave in the passband, monotonic in transition regions, and to specifically null out sinusoids in other subband regions. Within each subband, instantaneous frequency estimates averaged over all modes in the subband region are used to update the filter’s center frequency for the next iteration. In this way, the filterbank progressively adapts to the specific inharmonicity structure of the source recording.
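
The following minimal sketch shows the shape of one such adaptive iteration, with a plain windowed FIR bandpass (scipy's firwin) standing in for the linear-programming design described above, and a Hilbert-transform instantaneous-frequency estimate; all parameter values are placeholders.

    import numpy as np
    from scipy.signal import firwin, hilbert, lfilter

    def update_center_freqs(x, fs, centers, bw, numtaps=255):
        """One iteration of the adaptive filterbank (illustrative only):
        bandpass each subband around its current center frequency, estimate
        the mean instantaneous frequency from the analytic signal's phase,
        and use it as the next center frequency."""
        new_centers = []
        for fc in centers:
            lo = max(fc - bw / 2, 1.0)
            hi = min(fc + bw / 2, fs / 2 - 1.0)
            h = firwin(numtaps, [lo, hi], pass_zero=False, fs=fs)
            sub = lfilter(h, 1.0, x)
            phase = np.unwrap(np.angle(hilbert(sub)))
            inst_f = np.diff(phase) * fs / (2 * np.pi)   # Hz
            new_centers.append(float(np.mean(inst_f)))
        return new_centers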

We demonstrate analysis-synthesis applications including standard time and pitch scaling, as well as new effect types exploiting the “source-filter” aspect.

Presentation Program

1. Harvey will begin by discussing the state-space sinusoidal model and Kalman-based analysis-synthesis engine, focusing on applications and rules for parameter tuning (40 min.).

2. Randal will discuss the design of the filterbank responsible for extracting sinusoidal dynamics parameters for any quasi-harmonic sound model (40 min.).

3. Harvey will introduce, very briefly, a hybrid MATLAB/C code package useful in general linear-programming based FIR filter design (20 min.).


Bayesian Identification of Closely-Spaced Chords from Single-Frame STFT Peaks

Harvey Thornburg and Randal Leistikow <harv23,randal at ccrma> (EE)

Identifying chords and related musical attributes from digital audio has proven a long-standing problem spanning many decades of research. A robust identification may facilitate automatic transcription, semantic indexing, polyphonic source separation and other emerging applications. To this end, we develop a Bayesian inference engine operating on single-frame STFT peaks. Peak likelihoods conditional on pitch component information are evaluated by an MCMC approach accounting for overlapping harmonics as well as undetected/spurious peaks, thus facilitating operation in noisy environments at very low computational cost. Our inference engine evaluates posterior probabilities of musical attributes such as root, chroma (including inversion), octave and tuning, given STFT peak frequency and amplitude observations. The resultant posteriors become highly concentrated around the correct attributes, as demonstrated using 227 ms piano recordings with -14 dB additive white Gaussian noise.
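
A toy sketch of the flavor of such an engine (a brute-force enumeration, not the authors' MCMC method): evaluate a posterior over candidate fundamentals from peak frequencies alone, explaining each observed peak either by its nearest predicted harmonic or as a spurious detection. All distributions and constants here are invented for illustration.

    import numpy as np

    def pitch_posterior(peak_freqs, candidates, n_harm=8, sigma=5.0,
                        p_spur=0.1, f_max=5000.0):
        """Posterior over candidate fundamentals given STFT peak frequencies.
        Each peak is explained by the nearest predicted harmonic (Gaussian
        error, std sigma Hz) or as spurious (uniform over [0, f_max])."""
        logpost = []
        for f0 in candidates:
            harmonics = f0 * np.arange(1, n_harm + 1)
            ll = 0.0
            for f in peak_freqs:
                d = np.min(np.abs(harmonics - f))
                like = np.exp(-0.5 * (d / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
                ll += np.log((1 - p_spur) * like + p_spur / f_max)
            logpost.append(ll)
        logpost = np.array(logpost)
        post = np.exp(logpost - logpost.max())      # uniform prior assumed
        return post / post.sum()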


Psychoacoustic Issues in Audio Watermarking

Yi-Wen Liu <jacobliu at stanford> (EE)

Outline

• Overview of watermarking: terminologies and goals

• Applications of watermarking, and competing technologies:

– Copyright protection

– Copy protection (different!)

– Piracy tracking

– Broadcast monitoring

– Tamper detection

• Methods in audio watermarking

– Spread spectrum

– Time domain processing

– Spectral manipulation using phase insensitivity

• The role of attackers

– Categories

– Game of watermarking

– Countermeasures and counter-countermeasures

• The Liu-Smith method (2003)

– Watermarking based on parametric modeling

– The Fisher information bound on data hiding rates

– Watermarking as an asymmetric game

– Implementation using the sine + noise + transient model

– Advantages, disadvantages, applicability and usefulness

• Psychoacoustics:

– How well can we hear a frequency shift?

– How well can we hear a frequency shift in a noisy environment?

– How well can we hear frequency shifts superposed?

• Sound demonstration


Sinusoidal Parameter Estimation based on Quadratic Interpolation of FFT Magnitude Peaks

Mototsugu Abe <abemoto at ccrma> (Visiting Scholar, Sony Corporation)

Among various approximate maximum likelihood (ML) estimators, quadratic interpolation of magnitude peaks in a Fast Fourier Transform (QIFFT) has been widely used due to its simplicity and accuracy, which is sufficient for most audio purposes. Indeed, it works nearly as well as an ML estimator for a single, time-invariant sinusoid (or well resolved multiple, quasi time-invariant sinusoids) when a large FFT zero-padding factor is used.

In practice, however, we are shackled with various restrictions, such as: 1) we may wish to minimize zero padding for computational efficiency, 2) we may need to deal with interference from nearby signal components, and 3) we may have to consider time-varying sinusoids. All of these requirements raise the error bias and restrict the available range of the estimator.

In this talk, we theoretically predict and numerically confirm the errors associated with various parameter choices such as window types, lengths, and zero-padding factors, and provide precise criteria for designing the estimator. Also, as extensions of the QIFFT framework, we discuss simple bias correction functions and a method for estimation of AM and FM rates.
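
For reference, the core QIFFT step is tiny: fit a parabola to the three dB-magnitude samples around a spectral peak and read off the vertex. A minimal sketch (the window and zero-padding factor here are arbitrary choices):

    import numpy as np

    def qifft_peak(db, k):
        """Quadratic interpolation around a dB-magnitude peak at bin k.
        Returns (fractional bin offset in [-0.5, 0.5], interpolated peak dB)."""
        a, b, c = db[k - 1], db[k], db[k + 1]
        p = 0.5 * (a - c) / (a - 2 * b + c)          # parabola vertex offset
        return p, b - 0.25 * (a - c) * p

    fs, N, pad = 8000.0, 1024, 4
    n = np.arange(N)
    x = np.cos(2 * np.pi * 1000.3 * n / fs) * np.hanning(N)
    db = 20 * np.log10(np.abs(np.fft.rfft(x, pad * N)) + 1e-12)
    k = int(np.argmax(db))
    p, _ = qifft_peak(db, k)
    print("estimated frequency:", (k + p) * fs / (pad * N))  # ~1000.3 Hz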


Piano Dispersion Filter Design

Julius Smith <jos at ccrma> (CCRMA)

This informal seminar presentation is devoted to the problem of designing digital filters to implement dispersion in a waveguide string model.


Winter Quarter 2003-2004

Acoustics of Percussion Instruments

Thomas Rossing <rossing at physics.niu.edu> (CCRMA Visiting Scholar; Professor of Physics, Northern Illinois University)

Most percussion instruments are impulsively excited and vibrate in a rather complex way until their oscillations die out. At low to medium amplitudes, their vibrations can be conveniently described in terms of normal modes of vibration, but at large amplitude they may show distinctly nonlinear or chaotic behavior. This talk will describe some of the experimental techniques that are used to observe both linear and nonlinear vibrations in some percussion instruments.


Recent developments in musical sound synthesis based on a physical model

Julius Smith <jos at ccrma> (CCRMA)

This talk will begin with a brief review of general finite difference methods for discrete-time simulation of acoustic systems. Next, these methods are related to faster computational forms normally associated with digital signal processing techniques. In particular, the “digital waveguide” and “wave digital” paradigms are briefly reviewed and compared. More specialized methods will be discussed, which may be applied to the simulation of nonlinearities, distributed scattering, and model-order reduction.
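
For concreteness, the digital waveguide reduces the ideal lossless string to two traveling-wave delay lines with inverting reflections at the terminations; the faster computational form amounts to shifting buffers instead of updating a finite difference grid. A minimal sketch (length, pluck, and pickup positions are arbitrary):

    import numpy as np

    def waveguide_string(delay_len=100, n_samples=2000):
        """Ideal lossless string: right- and left-going rails, inverting
        reflections at nut and bridge, output tapped mid-string."""
        right = np.zeros(delay_len)
        left = np.zeros(delay_len)
        right[delay_len // 3] = 0.5       # split a pluck between the rails
        left[delay_len // 3] = 0.5
        out = np.zeros(n_samples)
        for n in range(n_samples):
            out[n] = right[delay_len // 2] + left[delay_len // 2]
            end_r = right[-1]             # sample arriving at the bridge
            end_l = left[0]               # sample arriving at the nut
            right = np.roll(right, 1)
            left = np.roll(left, -1)
            right[0] = -end_l             # inverting reflection at the nut
            left[-1] = -end_r             # inverting reflection at the bridge
        return out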


Discrete-Time Peak and Shelf Filter Design for Analog Modeling

Jonathan Abel and David P. Berners <abel,dpberner at uaudio.com> (Universal Audio, Inc.)

A method for the design of discrete-time second-order peak and shelf filters is developed which allows the response of analog peak and shelf filters to be approximated in the digital domain. For filters whose features approach the Nyquist limit, the proposed method provides a closer approximation to the analog response than direct application of the bilinear transform. A peak filter and three types of resonant shelf filter are discussed, and design examples are presented.
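
For context, the bilinear-transform baseline that the proposed method improves upon is the familiar second-order peaking section; one standard formulation (not the authors' design) is sketched below.

    import numpy as np

    def peak_filter_bilinear(f0, gain_db, q, fs):
        """Second-order peaking filter via the bilinear transform; its
        response warps as f0 approaches Nyquist, which is the shortcoming
        addressed in the talk. Returns normalized (b, a) coefficients."""
        A = 10.0 ** (gain_db / 40.0)
        w0 = 2 * np.pi * f0 / fs
        alpha = np.sin(w0) / (2 * q)
        b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
        a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
        return b / a[0], a / a[0]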


A Filter Design Method Using Second-Order Peaking and Shelving Sections

Jonathan Abel and David P. Berners <abel,dpberner at uaudio.com> (Universal Audio, Inc.)

A method for designing audio filters is developed based on the observation that second-order peaking and shelving filters can be made nearly self-similar on a log magnitude scale with respect to peak and shelf gain changes. By cascading such second-order sections, filters are formed which may be fit to dB magnitude characteristics via linear least-squares techniques. A graphic equalizer interpolating prescribed band gains is presented, along with a filter minimizing the Bark-weighted mean square dB difference between modeled and desired transfer function magnitudes. It is noted that using second-order sections parametrized by transition frequency and gain provides a natural mechanism for slewing and interpolation between tabulated designs.
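
The self-similarity observation is what makes the fit linear: on a dB scale, each section's response is roughly proportional to its dB gain, so stacking unit-gain (1 dB) responses as basis columns turns the design into ordinary least squares. A sketch under that assumption, using a generic bilinear-transform peaking section rather than the authors' designs; all numbers are placeholders.

    import numpy as np
    from scipy.signal import freqz

    def fit_section_gains(target_db, freqs, centers, q, fs):
        """Solve for per-section dB gains so the cascade's dB response
        approximates target_db at the given analysis frequencies."""
        def section_db(f0, gain_db):
            A = 10.0 ** (gain_db / 40.0)
            w0 = 2 * np.pi * f0 / fs
            alpha = np.sin(w0) / (2 * q)
            b = [1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A]
            a = [1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A]
            _, h = freqz(b, a, worN=2 * np.pi * freqs / fs)
            return 20 * np.log10(np.abs(h) + 1e-12)
        B = np.column_stack([section_db(fc, 1.0) for fc in centers])
        g, *_ = np.linalg.lstsq(B, target_db, rcond=None)
        return g                                     # dB gain per section

    fs = 48000.0
    freqs = np.linspace(20.0, 20000.0, 400)
    centers = [100.0, 400.0, 1600.0, 6400.0]         # hypothetical sections
    target_db = 6.0 * np.exp(-0.5 * (np.log(freqs / 1000.0)) ** 2)  # toy target
    print(fit_section_gains(target_db, freqs, centers, q=1.0, fs=fs))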


A method of automatic recognition of structural boundaries in recorded musical signals

Unjung Nam <unjung at ccrma> (Music)

Machine recognition of musical features in an audio signal is an important and desirable tool both for identification and retrieval of a given musical recording and for comparisons of multiple music recordings. A robust machine recognition system is critical for successful multimedia database management and useful for interactive music systems. Furthermore, it suggests new research opportunities in music theory, analysis and musicology.

Music is structured at a variety of levels, ranging from the quasi-periodicity of frequency to the macro levels of large-scale musical forms. Intermediate levels include structures such as motives and phrases. These structures constitute the salient perceptual units that listeners use to comparatively assess music in terms of the degree of similarity either within a given piece or between pieces. While human listeners are facile at distinguishing recurrence and contrast in music, the same task has proven elusive in machine listening paradigms.

In this talk a methodology is proposed that attempts to derive salient and hierarchical musical structures from a raw audio signal by assessing the degree of novelty and redundancy throughout a musical signal.


From Spectral Pitch Estimation to Automatic Polyphonic Transcription: Recent Results

Harvey Thornburg and Randal Leistikow <harv23,randal at ccrma> (EE)

In this talk, we review our maximum likelihood spectral pitch estimator, robust to undetected harmonics, spurious/interference peaks, skewed peak estimates, and unknown inherent deviations from ideal or other assumed inharmonicity structure. Inherent to our approach is a hidden discrete-valued descriptor variable identifying spurious/undetected peaks. The likelihood evaluation, requiring a computationally unwieldy summation over all descriptor states, is successfully approximated by an MCMC traversal chiefly amongst high-probability states. For historical interest, we compare our method to the classic maximum likelihood estimator of Julius Goldstein. Our method seems to compare favorably, and becomes crucially important in the case of low-frequency interference peaks where Goldstein’s method suddenly breaks down. Furthermore, we exploit peak amplitude as well as frequency information, utilizing timbral domain knowledge where this may exist.

Recently, we have begun to extend these developments, presently pursuing a method for Bayesian spectral chord recognition, with our likelihood evaluation as the main computational engine. Octave, key, chord and tuning are abstracted, and may be jointly identified given only the peaks from an STFT frame. Time permitting, we may discuss the development and preliminary results, as well as outline a scheme for integrating framewise correspondences. The latter results in nothing more sophisticated than an HMM, where exact inference is possible (apart from the aforementioned MCMC steps in likelihood evaluation), and it is at least clear where to insert “domain knowledge” (from music theory) in the hierarchy. This provides an extremely simple and hopefully robust method for handling automatic transcription tasks, at least those involving polyphonic recordings of a single instrument.


Dynamic Range Compression based on Models of Time-Varying Loudness

Ryan Cassidy <ryanc at ieee.org> (EE)

We incorporate facts from recently proposed time-varying loudness models in the tuning of a contemporary level detection mechanism for the dynamic range compression of audio signals. An analysis of the time-varying and steady-state behaviour of such a mechanism is provided, revealing its shortcomings in light of well-known facts about steady-state loudness. To remedy this problem, the design of an equal-loudness filter is presented. The performance and implementation requirements of this filter are compared to the well-known A- and C-weighting schemes used in many sound level meters. Next, the aforementioned tuning of the level detector to match recently proposed time-varying loudness models is presented.
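
The level detection mechanism under discussion is typically a one-pole envelope follower with separate attack and release time constants; a generic sketch (the constants are placeholders, not the loudness-matched values developed in the talk):

    import numpy as np

    def level_detector(x, fs, attack_ms=5.0, release_ms=100.0):
        """Envelope follower: fast smoothing when the rectified input rises
        above the current level (attack), slow when it falls (release)."""
        a_att = np.exp(-1.0 / (fs * attack_ms / 1000.0))
        a_rel = np.exp(-1.0 / (fs * release_ms / 1000.0))
        env = np.zeros(len(x))
        level = 0.0
        for n, v in enumerate(np.abs(x)):
            a = a_att if v > level else a_rel
            level = a * level + (1 - a) * v
            env[n] = level
        return env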


Fall Quarter 2003-2004

Fundamentals of musical acoustics and mechanics of musical instruments: An overview

Antoine Chaigne <achaigne at ccrma> (CCRMA Visiting Scholar; Professor, ENSTA and Ecole Polytechnique, France)

This overview will present most of the topics which will be developed in the upcoming lectures of the seminar.

One central question, when attempting to synthesize musical sounds from physical models, is to know the degree of refinement of the model needed. Using examples from the families of stringed and percussive instruments, this presentation will focus on the necessary features that a physical model of musical instruments must contain in order to produce interesting sounds. Stringed and percussive instruments are characterized by the vibrations of at least some of their constitutive parts: in a guitar, for example, it is mandatory to take the vibrations of strings and soundboard into account. These vibrations are governed by partial differential equations complemented by initial and boundary conditions. Another crucial point is damping. Instrument makers know very well that changing the material of a top-plate has a pronounced effect on the resulting sound. Most of these effects are due to the rate of attenuation of the vibrations in the material itself. In xylophones and vibraphones, the presence of a tubular resonator under the vibrating bar modifies the sound dramatically. In this case, a good model must include the radiation of the bar, the excitation of the tube by the radiated wave and the resulting radiation of the tube. Finally, in all stringed instruments, and also in drums and timpani, one cannot reproduce the sound of the instrument accurately without considering the reaction of the radiated wave on the body itself.


Elements of numerical analysis with application to sound synthesis

Antoine Chaigne <achaigne at ccrma> (CCRMA Visiting Scholar; Professor, ENSTA and Ecole Polytechnique, France)

This presentation is essentially a tutorial on basic concepts related to the use and properties of numerical analysis methods in the context of sound synthesis. The main results are illustrated by simple examples, such as piano strings, xylophone bars, and so on.

The physical behavior of musical instruments is described by continuous equations. These equations include derivatives and integrals with respect to space and time. In order to solve these equations it is necessary to approximate the model by discrete formulations. Several strategies are possible: finite differences, finite elements, modal truncation, and others. In each case, fundamental questions arise: which spatial/time step should be selected? What are the consequences of these choices? How can the stability of the model be ensured? Is it possible to predict the accuracy of a given method? The lecture will focus on the basic problems of stability and numerical dispersion in the case of simple finite difference modeling of string and bar equations. Analogies and differences with respect to other methods will be discussed. It will be shown, in particular, that the well-known Courant-Friedrichs-Lewy (or CFL) condition for a second-order explicit finite difference scheme is equivalent to the Shannon (or Nyquist) sampling theorem in signal processing. Fruitful discussion comparing waveguide modeling and finite difference modeling will certainly be of great interest.
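
A minimal sketch of the scheme in question: the explicit second-order finite difference update for the 1D wave equation, where stability requires the CFL number lambda = c*dt/dx <= 1, and lambda = 1 additionally eliminates numerical dispersion for the ideal string. Grid sizes and the pluck are arbitrary.

    import numpy as np

    def fd_string(n_x=100, n_t=2000, courant=1.0):
        """Explicit FD scheme for y_tt = c^2 y_xx with fixed ends."""
        assert courant <= 1.0, "CFL condition violated: unstable scheme"
        lam2 = courant ** 2                   # (c*dt/dx)^2
        y = np.zeros(n_x)
        y[n_x // 3] = 1.0                     # plucked initial displacement
        y_prev = y.copy()                     # zero initial velocity
        out = np.zeros(n_t)
        for t in range(n_t):
            y_next = np.zeros(n_x)            # ends stay clamped at zero
            y_next[1:-1] = (2 * y[1:-1] - y_prev[1:-1]
                            + lam2 * (y[2:] - 2 * y[1:-1] + y[:-2]))
            y_prev, y = y, y_next
            out[t] = y[n_x // 2]
        return out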


Audio watermarking via sinusoidal modeling

Yi-Wen Liu <jacobliu at ccrma> (CCRMA EE Ph.D. student)

Audio watermarking refers to hiding bit-streams in sounds. Among existing applications such as copyright protection and broadcast monitoring, transparency and robustness are always required. Here, transparency means that watermarks should introduce no audible distortion, and robustness means that watermarks cannot be erased without degrading the perceptual quality of sounds.

Many existing audio watermarking methods utilize the masking phenomena of human hearing. Fourier-like transform coefficients of a signal are amplitude-modulated to carry information. The modulation is made small enough to introduce distortion only below the masking threshold. Our method, instead, utilizes the frequency just-noticeable difference (JND). The idea is to introduce small frequency modulation within the JND, and then design watermark encoding and decoding procedures robust to selected types of signal manipulation, such as MP3 compression.

The presentation outline goes as follows:

• Brief introduction to watermarking: definition and requirements

• Sinusoidal modeling for watermarking

– Analysis at encoder: sine + noise + transient

– Synthesis at encoder: frequency quantization

– Analysis at decoder: frequency re-quantization

• Listening tests

• Updates, discussions, and future directions


Automatic music identification

Avery Wang <avery at shazamteam.com> (Shazam Ltd)

We have developed and commercially deployed a flexible audio search engine. The algorithm is noise and distortion resistant, computationally efficient, and massively scalable, capable of quickly identifying a short segment of music captured through a cellphone microphone in the presence of foreground voices and other dominant noise, and through voice CODEC compression, out of a database of over a million tracks. The algorithm uses a combinatorially hashed time-frequency constellation analysis of the audio, yielding unusual properties such as transparency, in which multiple tracks mixed together may each be identified. Furthermore, for applications such as radio monitoring, search times on the order of a few milliseconds per query are attained, even on a massive music database.
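
A toy version of the combinatorial hashing idea, far simpler than the deployed system: pick spectrogram peaks, pair each anchor peak with a few later peaks, and pack (f1, f2, dt) into one integer key stored with the anchor's time; matching then reduces to exact hash lookups. All parameters are invented.

    import numpy as np
    from scipy.signal import stft

    def constellation_hashes(x, fs, fan_out=5):
        f, t, Z = stft(x, fs=fs, nperseg=2048)
        S = np.abs(Z)
        peaks = [(j, int(np.argmax(S[:, j])))    # crude: one peak per frame
                 for j in range(S.shape[1])]
        hashes = []
        for k, (t1, f1) in enumerate(peaks):
            for t2, f2 in peaks[k + 1:k + 1 + fan_out]:
                key = (f1 & 0x3FF) << 20 | (f2 & 0x3FF) << 10 | ((t2 - t1) & 0x3FF)
                hashes.append((key, t1))         # (hash, anchor frame)
        return hashes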


Numerical experiments for piano and xylophone

Antoine Chaigne <achaigne at ccrma> (CCRMA Visiting Scholar; Professor, ENSTA and Ecole Polytechnique, France)

One benefit of developing physical models of musical instruments is the ability to perform numerical experiments. In practice, this means that the model is used for investigating, through simulations, the influence of the constitutive parts of the instruments on the produced sounds. This strategy will be illustrated in this lecture by numerous examples related to both the piano and the xylophone. For the piano, the model yields interesting results related to the propagation of waves along the string, the hammer striking position, the hammer/string mass ratio and the pulse waveform of the impact force. Possible future developments for this instrument will be presented. For the xylophone, the model shows the influence of mallet and bar properties on the initial transient of the sound, the relevance of the undercut profile and the function of the resonator. Extensions to marimbas and vibraphones will also be presented and discussed.


Application of Loudness/Pitch/Timbre decomposition operators to auditory scene analysis

Mototsugu Abe <moto at ccrma> (CCRMA Visiting Scholar; Sony Corporation, Japan)

In this presentation, I will present my former work at the University of Tokyo, which was completed in the mid 1990s and was a part of my doctoral thesis research.

The first half of the presentation is on "Loudness/Pitch/Timbre (*) Decomposition Operators." In this research, we constructed a set of operators as general audio signal processing tools in the time-frequency domain.

More concretely, we focus on the instantaneous change of audio in the wavelet domain. The change is decomposed into three orthogonal components, and a method is given for projecting the change onto these components.

The second half is on an application of the operators to the problem of computational auditory scene analysis (*). In a (monaural) multi-stream sound (*), when frequency components of the sound change together in amplitude and frequency, they are grouped together as one auditory stream. Since the above operators provide us with a method for quantifying the instantaneous changes in amplitude and frequency, we utilize them for the initial stage. Then the estimated amplitude and frequency changes of the components are used to construct a probabilistic space in which peaks correspond to streams. The probability distribution may be updated with new data to follow streams through time.

Notes:

(*) Though "loudness", "pitch" and "timbre" are terms corresponding to human perception, they are used in our context to mean "amplitude change", "frequency change" and "the other types of change", respectively. (At the time, my adviser wanted to somehow relate this work to human perception, but we haven't done that yet.)

(*) The term "auditory scene analysis" is the title of a book written by Albert S. Bregman in 1990, in which he summarizes psychophysical characteristics of human perception in grouping/separating sounds which occur simultaneously or sequentially. In the 1990s, many researchers/engineers tried to simulate/implement them as a computational model on a computer.

(*) The term "multi-stream sound" is almost the same as "multi-source sound". However, "stream" corresponds to human perception regardless of how many sources actually exist, whereas "source" corresponds to an actual physical sound source.


Bayesian Two-Source Modeling for Separation of N Sources from Stereo Signals

Aaron Master <asmaster at ccrma> (CCRMA EE Ph.D. student)

We consider an enhancement to the DUET sound source separation system of Rickard and others, which allowed for the separation of N localized sparse sources given stereo mixture signals. Specifically, we expand the system and the related delay and scale subtraction scoring (DASSS) of Master to consider cases when two sources, rather than one, are active at the same point in STFT time-frequency space. We begin with a review of the DUET system and its sparsity and independence assumptions. We then consider how the DUET system and DASSS respond when faced with two active sources, and use this information in a Bayesian context to score the probability that two particular sources are active. We conclude with a musical example illustrating the benefit of our approach.
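
The core DUET computation is compact enough to sketch: per STFT cell, estimate an amplitude ratio and a relative delay between the two channels, then histogram those pairs; under the one-dominant-source-per-cell (sparsity) assumption, each histogram peak marks one source. The bin count and cell-selection threshold below are arbitrary.

    import numpy as np
    from scipy.signal import stft

    def duet_histogram(x1, x2, fs, nbins=50):
        f, t, X1 = stft(x1, fs=fs, nperseg=1024)
        _, _, X2 = stft(x2, fs=fs, nperseg=1024)
        eps = 1e-12
        R = (X2 + eps) / (X1 + eps)
        amp = np.abs(R)                              # per-cell amplitude ratio
        w = 2 * np.pi * f[:, None] + eps             # rad/s, broadcast over frames
        delay = -np.angle(R) / w                     # per-cell relative delay (s)
        strong = np.abs(X1) > np.percentile(np.abs(X1), 90)
        H, a_edges, d_edges = np.histogram2d(amp[strong], delay[strong], bins=nbins)
        return H, a_edges, d_edges                   # peaks of H ~ sources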


Polyphonic Instrument Identification Using Independent Subspace Analysis

Pamornpol ("Tak") Jinachitra <pj97 at stanford> (CCRMA EE Ph.D. student)

In this talk, I will present ongoing work on a system that tries to identify the musical instruments played in a polyphonic mixture. The features used in classification are derived from Independent Subspace Analysis (ISA), which somewhat decomposes each source, and the mixture, into its statistically "independent" components. Without re-grouping or actually separating the sources, these features can be used as fingerprints of each instrument, assuming the decomposed fingerprints are robust enough to the mixing process. Accuracy tests on mixtures of two tonal instruments (possibly with percussion) will be presented along with some tests on real songs. Interesting demos of the original ISA for source separation and auditory grouping will be given as an introduction.


Materials and sound properties. The role of damping.

Antoine Chaigne <achaigne at ccrma> (CCRMA Visiting Scholar; Professor, ENSTA and Ecole Polytechnique, France)

In daily life, we often knock on a table or on a wall in order to get information about its internal structure and materials. From the acoustic signal, our brain is able to recognize whether the objects we are seeing are made of glass, metal or wood, for example.

Therefore, if we want to simulate virtual structure-borne sound sources properly, the synthesis program must include specific features that allow recognition and discrimination among materials without ambiguity.

In this presentation, particular attention will be paid to damping, whose role is crucial in the recognition of materials. It will be shown that accurate modeling of three major causes of damping (thermoelasticity, viscoelasticity and radiation) yields very realistic simulations of a large variety of materials (metal, wood, glass, carbon fibers, ...). The presentation will be illustrated by comparisons between measured and simulated sounds radiated by impacted plates.

References: http://www.ensta.fr/~chaigne/articles/ (papers chaigne plate.pdf and lambourg plate.pdf)


The sounds of gongs and cymbals

Antoine Chaigne <achaigne at ccrma> (CCRMA Visiting Scholar; Professor, ENSTA and Ecole Polytechnique, France)

Gongs and cymbals belong to that category of instruments for which no physical model is yet available. There is therefore a real challenge in developing such models and implementing them in synthesizers. In order to do so, it is necessary to understand the main properties of these instruments and, in particular, the fact that the sound changes drastically depending on the strength of the impact.

The presentation will be mainly focused on experimental results which show, in particular, that the vibrations of gongs and cymbals are highly nonlinear, the magnitude of the transverse motion often being larger than the thickness of the structure. Accurate signal analysis shows that the frequencies which can be seen in sound spectra for strong impacts are governed by very simple arithmetic rules between the eigenfrequencies of the structure. These frequencies are called “combination resonances”. For very strong impacts, specific signal processing tools must be applied in order to show that the signals are chaotic and governed by only a small number of degrees of freedom. In this case, the spectra do not exhibit isolated peaks but rather a “deterministic noise” which it should be possible to reproduce with only a small number of nonlinear oscillators.

References: http://www.ensta.fr/~chaigne/articles/Percussive instruments/ (papers ismagong V1.pdf and ShellGONGFA02.pdf)


Spring Quarter 2002-2003

Estimating performance parameters from the acoustic waveform of the violin

Arvindh Krishnaswamy <arvindh at ccrma> (EE)

Friday, April 4, 3:15 PM, CCRMA Library

In this Friday's seminar I wish to talk about the following:

• My ongoing project on inferring control inputs to a musical instrument based on discriminant analysis. Please see

http://ccrma/~arvindh/pprs/icme03.pdf

for the paper that is going to be presented at IEEE ICME 03. It has an abstract and references which may be useful. I will present the ideas from this basic paper as well as discuss the numerous areas for future improvement which I have in mind.

• I will also give an overall 'research update' and talk about what research I plan to pursue on a long-term basis. Comments and suggestions will be appreciated. I am interested basically in an Analysis -> Feature extraction -> "Resynthesis from scratch" system for processing music/audio signals using musical models and acoustic signal models.

• An update on my latest thoughts on South Indian music will be given.

All of the above topics may be stretched, compressed, truncated or even changed depending on time and, most importantly, my level of preparation. ;-)


Comb Filter Modulation Effects

David Lowenfels <dfl> (CCRMA)

In this week's CCRMA DSP Seminar (Friday, 3:15pm, Ballroom), MA/MST student David Lowenfels will talk about his recent work in Virtual Analog synthesis. All are welcome. – Julius

ABSTRACT:

This is an invitation to come hear about my research into Virtual Analog synthesis. The BLIT model of Stilson and Smith has been extended with a single comb filter, allowing for effects such as waveform morphing, PWM, chorus, detune/unison, and hard sync. Forays into recursive FM (actually PM) have drastically reduced the computational complexity, superseding BLIT. I will show working demos and discuss my paper in progress.

David Lowenfels (http://ccrma/~dfl/)


Physical Modeling Synthesis of the Recorder

Hiroko Terasawa <hiroko at ccrma> (CCRMA)

Hi,

In the DSP seminar tomorrow, I will be presenting the physical modeling synthesis of recorder sound that I worked on last year, and that I will present at the ASA Nashville meeting next week. It will be a short presentation (3:30 - 3:50). Please drop in if interested. Your feedback will be greatly appreciated!

Thank you, - Hiroko

— ABSTRACT —

A time-domain simulation of the soprano baroque recorder based on the digital waveguide model (DWM) and an air reed model is introduced. The air reed model is developed upon the negative acoustic displacement model (NADM), which was proposed for organ flue-pipe simulation [Adachi, Proc. of ISMA 1997, pp. 251–260], based on the semiempirical model by Fletcher [Fletcher and Rossing, The Physics of Musical Instruments, 2nd ed. (Springer, Berlin, 2000)]. Two models are proposed to couple the DWM and NADM. The jet amplification coefficient is remodeled for the application of the NADM to the recorder, in view of recent experimental reports [Yoshikawa and Arimoto, Proc. of ISMA 2001, pp. 309–312]. The simulation results are presented in terms of the mode transient characteristics and the spectral characteristics of the synthesized sounds. They indicate that the NADM is not sufficient to describe the realistic mode transient of the recorder, while the synthesized sounds maintain a timbre closely resembling that of recorder sounds.

Today, after Hiroko presents her upcoming ASA talk, Eric Lindemann will talk about his generalized sampling synthesis based on reconstructing phrase segments. Then after that, Arvindh Krishnaswamy will talk about his recent experience with nonlinear string synthesis algorithms. – Julius


Audio Quality Analysis with PEAQ

Opticom Inc.

This Friday (5/2/03) at 3:15pm in the CCRMA Ballroom, a representative from Opticom will present software for the objective measurement of sound quality. This is a long-standing problem in audio signal processing, and one which few have addressed very thoroughly. All are welcome. – Julius

Abstract

Audio quality is one of the key issues for a range of modern electronic equipment. Whether during development of the equipment or while operating it, one is always faced with the problem of determining and optimizing quality. New evaluation techniques based on modeling human perception have been devised to tackle this issue. This presentation intends to give a comprehensive overview of the latest standards for state-of-the-art audio quality testing.

Up until now, one of the best ways to measure the sound quality of modern audio equipment has been to conduct elaborate listening tests with experienced test subjects, so-called "expert listeners". Traditional measurement technologies have proven insufficient at providing detailed information on the quality of modern digitally based circuit designs. Due to the increased use of such designs in digital broadcasting, professional and consumer electronics, the storing of audio signals, and the field of multimedia computing and the Internet, it became necessary to develop an innovative, uniform measurement method with which audio quality can be measured objectively. After thorough verification, a model was recommended by the ITU-R as a measure of perceived audio quality ("PEAQ") under recommendation BS.1387. The PEAQ measurement method is applicable to most types of audio signal processing equipment, both digital and analog.

Outline for the Presentation:

• Who is OPTICOM?

• Introduction to the problem

• Subjective Audio Quality Testing

• Objective Audio Quality Testing

• Principles of Psycho-Acoustics

• What is "PEAQ"?

• OPERA - A Comprehensive Solution


Winter Quarter 2002-2003

Prior Identification for Harmonic-Comb Piano Model

Harvey Thornburg <harv23 at ccrma> (EE)

Friday, Jan. 17, 3:15 PM, CCRMA Library

In a previous seminar, we presented a “physically-informed” Bayesian sinusoidal model for reproduction of one vibrational mode of a single piano note. This model, based on solutions to the PDE of the physical model of Bensa et al. (2002), comprises a bank of coordinated exponentially decaying sinusoidal oscillators which exist in a roughly harmonic relationship (hence the name “harmonic comb”), but with adjustments for inharmonicity and frequency-dependent losses. Tolerances for parameter variation, deviations from model structure, and recording noise are introduced in the sense that purely *functional* dependences are replaced by *statistical* dependences, i.e. conditional probability distributions.

Our statistical model is driven by the purely physical parameters: fundamental frequency, inharmonicity factor, and the frequency-dependent loss parameters, as well as the initial amplitudes and phases for each oscillator. Furthermore, in the interests of modeling multi-stage decay behavior, we must introduce *multiple* vibrational modes, each with its own set of physical parameters, initializations, etc. A single, exponentially decaying envelope for each partial does not suffice.

To identify so many input parameters from such a complex statistical model, using nothing but the time-domain audio signal as “observation”, is a daunting task. Nevertheless, we already possess a good deal of information about the input parameters and their relationships. The importance of the Bayesian methodology is that it allows prior information and domain knowledge to be introduced in the form of a prior distribution over the unknown parameters.

In this talk, we address the prior identification over all input parameters across multiple vibrational modes, using a precomputed database of partials’ fundamental frequency, decay rate, and initial amplitude information from Bensa et al. (2002). We add to the set of “physical parameters” two input parameters modeling the sequence of initial amplitudes (a consequence of the hammer strike). For a given piano note, the modeling of initial amplitude is linear on a log scale with respect to log frequency, roughly justified by Hundley et al. (1978).

Our initial attempt concerns the prior identification on a note-by-note basis. To this end, we explore two hypotheses concerning the interaction of multiple vibrational modes. One hypothesis, called the “discrete” approach, implied by numerous papers in the acoustics literature (e.g. Weinreich (1977) and Nakamura (1989), among others), posits the existence of discrete “mode classes” (i.e. “fast” and “slow” modes, where the fast modes exhibit higher initial amplitudes, etc.). Another hypothesis treats each mode as i.i.d., but attempts to discover correlations within the deviations, i.e. unusually fast modes exhibit unusually high amplitudes.


With the discrete mode-class approach, there is the difficulty of sorting the precomputed data according to mode class. A certain separability criterion is optimized; the exact optimization requires an exhaustive search. Nevertheless, we develop two methods: an approximate “Viterbi” algorithm, and a random search with simulated annealing. Both approximate algorithms demonstrate near-optimal performance on “known” sorting problems of similar size and dimension, though the “Viterbi” algorithm is considerably faster.

With the continuous approach, the main technical issue is the (suitably constrained) covariance identification. The direct maximum-likelihood approach has (at this time) proven intractable in closed form. It is also unknown whether multiple local maxima exist for any configuration of the problem. As such, we pursue an iterative optimization based on the EM algorithm, which is guaranteed to converge to a local maximum of the likelihood function. An initialization trick in practice avoids the possible problem of converging to the wrong local maximum, although a theoretical guarantee has not yet been found. Nevertheless, we attain convergence to an acceptable result in very few (3-5) iterations.

Our results show that the data fails to support the discrete mode-class approach, despite ample support in the theoretical acoustics literature. The culprit, excessive variability in the range of partials’ initial amplitudes and decay rates, induces significant overlap among the classes. On the other hand, the continuous approach finds success: the expected correlation between high initial amplitude and fast decay rate is established. As such, we extend the continuous approach to cover prior identification for the entire piano range. We introduce polynomial fits for the means of all input parameters as well as the covariance parameters. Fortunately, it seems unnecessary to pursue a “piecewise” fit as previously thought, which responds to the differing numbers of strings.


A Power-Normalized Wave Digital Piano Hammer

Julius Smith <jos at ccrma> (CCRMA)

This Friday at 3:15, in the CCRMA Library, I will present the recent ASA-02 paper by Stefan Bilbao, Julien Bensa, Richard Kronland-Martinet, and myself.

ABSTRACT

For sound synthesis purposes, the vibration of a piano string may be simply modeled using bidirectional delay lines or digital waveguides which transport traveling wave-like signals in both directions. Such a digital wave-type formulation, in addition to yielding a particularly computationally efficient simulation routine, also possesses other important advantages. In particular, it is possible to couple the delay lines to a nonlinear exciting mechanism (the hammer) without compromising stability; in fact, if the hammer and string are lossless, their digital counterparts will be exactly lossless as well. The key to this good property (which can be carried over to other nonlinear elements in musical systems) is that all operations are framed in terms of the passive scattering of discrete signals in the network, the sum-of-squares of which serves as a discrete-time Lyapunov function for the system as a whole. Simulations are presented.


Modeling vocal-tract influence in reed wind instruments

Gary Scavone <gary at ccrma> (CCRMA)

In this week's DSP seminar, I'll present some early explorations regarding the influence of upstream vocal tract resonances in reed wind instrument performance and modeling. First, I'll provide a demonstration of vocal tract effects with my saxophone and review previous research in this area. Then I'll discuss a digital waveguide simulation that I've made and demo its behavior. Finally, I'll open the floor to feedback and suggestions.


Recent Work in Self-Similarity Analysis of Audio

Jonathan Foote <foote at fxpal.com> (FX Palo Alto Laboratory, Inc.)

Matthew Cooper and Jonathan Foote, FX Palo Alto Laboratory, Inc.

We will present some new approaches and recent results from our work in the analysis of audio. Starting with a similarity matrix consisting of all pairwise measurements of spectral similarity, we review how the matrix can be used for segmentation and rhythmic analysis of music. We present more recent work on how this analysis can be used for retrieving music by rhythmic similarity, and how the matrix can be factorized to locate repeating segments, such as verses and choruses in popular music.
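
The starting point is easy to reproduce: a matrix of pairwise similarities between short-time spectral frames, here plain cosine similarity on STFT magnitudes (a minimal sketch; the actual feature and distance choices may differ). Repeated sections appear as bright off-diagonal stripes.

    import numpy as np
    from scipy.signal import stft

    def similarity_matrix(x, fs):
        _, _, Z = stft(x, fs=fs, nperseg=2048)
        V = np.abs(Z).T                              # frames x bins
        V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
        return V @ V.T                               # S[i, j] = cosine similarity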


Autumn Quarter 2002-2003

Copyright protection of synthesized objects: a geometric interpretation

Yi-Wen Liu <jacobliu at stanford> (EE)

Synthesized multimedia objects are emerging everywhere now. One can talk on the phone to a virtual representative that speaks a synthesized tongue, drink soda of synthesized taste, such as Coke, or even fall in love with Simone, a synthesized character. It becomes urgent to protect such objects as intellectual property, for their synthesis often involves a great deal of computation power and human hours. This talk presents a mathematical definition of the term synthesis, and aims to provide a framework for the design of robust watermarking algorithms to, hopefully, protect copyrights to synthesized objects. In particular, this talk proposes a strategy of watermarking before synthesis, and presents a geometric method to evaluate its robustness to random attacks.

• PART I: INTRODUCTION

– Synthesized multimedia objects: creation and distribution

– A mathematical definition of synthesis

– A mathematical formulation of the watermarking game

– A review of the duality between audio watermarking and audio coding
– Copyright protection by watermarking before synthesis (WBS)

• PART II: ANALYSIS

– A simple case study – protecting single-tone cell phone ringers

– Problem formulation – frequency estimation under Gaussian white attacks

– Solving the problem by designing a robust frequency estimator

– The Mardi Gras beads (MGB) geometric interpretation

– Calculating the Cramer-Rao bounds (CRB)

– Generalization to a K-dimensional manifold MGB interpretation

– Calculation of the variance of parameter estimation error based on the K-manifold MGB interpretation

– Comparing the MGB variance and CRB

• PART III: POSSIBLE FUTURE DIRECTIONS

– More case studies – e.g., non-stationary sinusoids, synthesized speech, new music by physical modeling

– Customizing WBS according to various types of anticipated attacks

– Vector quantization on manifolds

– Adaptive learning on manifolds


Doppler Simulation and the Leslie

Julius Smith <jos at ccrma> (CCRMA)

with Stefania Serafin (Music), and Jonathan Abel and David Berners (Universal Audio)

An efficient algorithm for simulating the Doppler effect using interpolating and de-interpolating delay lines is described. The Doppler simulator is used to simulate a rotating horn to achieve the Leslie effect. Measurements of a horn from a real Leslie are used to calibrate angle-dependent digital filters which simulate the changing, angle-dependent frequency response of the rotating horn.

The Doppler effect causes the pitch of a sound source to appear to rise or fall due to motion of the source and/or listener relative to each other. The Doppler effect has been used to enhance the realism of simulated moving sound sources for compositional purposes, and it is an important component of the “Leslie effect.”

The Leslie is a popular audio processor used with electronic organs and other instruments. It employs a rotating horn and rotating speaker port to “choralize” the sound. Since the horn rotates within a cabinet, the listener hears multiple reflections at different Doppler shifts, giving a kind of chorus effect. Additionally, the Leslie amplifier distorts at high volumes, producing a pleasing “growl” highly prized by keyboard players.

In this presentation, an efficient algorithm for digital simulation of the Doppler effect is presented, and the algorithm is applied to the problem of rotating-horn simulation for the Leslie effect.
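
A minimal sketch of the underlying mechanism: reading from a delay line whose fractional length varies over time re-pitches the signal, which is precisely a Doppler shift. Linear interpolation stands in for the interpolating/de-interpolating reads of the actual algorithm, and the rotating-horn numbers are invented.

    import numpy as np

    def variable_delay(x, delay_samps):
        """Time-varying fractional delay via linear interpolation;
        delay_samps[n] is the desired delay (samples) at output n."""
        y = np.zeros(len(x))
        for n in range(len(x)):
            pos = n - delay_samps[n]
            i = int(np.floor(pos))
            frac = pos - i
            if 0 <= i < len(x) - 1:
                y[n] = (1 - frac) * x[i] + frac * x[i + 1]
        return y

    fs = 44100.0
    n = np.arange(int(fs))
    x = np.sin(2 * np.pi * 440.0 * n / fs)
    rotor_hz, r, c = 6.0, 0.2, 343.0        # rotation rate, horn radius, speed of sound
    delay = (r / c) * fs * (1 + np.sin(2 * np.pi * rotor_hz * n / fs))
    y = variable_delay(x, delay)            # 440 Hz tone with periodic Doppler shift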


Gaussian Magic

Julius Smith <jos at ccrma> (CCRMA)

At the CCRMA DSP Seminar Friday, 10/11/02, at 3:15-4:05 in the CCRMA Library, I will present a tutorial review of the amazing Gaussian (a.k.a. “bell curve” or “normal curve”). In particular, the interesting and elegant case of exp(-pt^2) for complex p will be treated (a Gaussian-windowed chirp), and applications to sinusoidal modeling of audio will be mentioned. In addition, I will review other classic results as time permits, such as the maximum entropy property, closure under multiplication and convolution, properties of moments, appearances in physics, the central limit theorem, and so on. – jos
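
A quick sketch of the central identity: for Re(p) > 0,

    \int_{-\infty}^{\infty} e^{-pt^2} e^{-j\omega t}\,dt = \sqrt{\pi/p}\; e^{-\omega^2/(4p)} ,

so the Fourier transform of a Gaussian is again a Gaussian. Writing p = \alpha - j\beta with \alpha > 0, the signal e^{-pt^2} = e^{-\alpha t^2} e^{j\beta t^2} is a linear chirp under a Gaussian envelope, and

    \ln|X(\omega)| = \tfrac{1}{2}\ln|\pi/p| - \mathrm{Re}\!\left(1/(4p)\right)\omega^2

is exactly quadratic in \omega; this is why quadratic interpolation of log-magnitude spectra is exact for Gaussian-windowed chirps.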


More Gaussian magic!

Aaron Master <asmaster at ccrma> (EE)

At the CCRMA DSP Seminar this Friday, 3:15-4:05 in the CCRMA Library (or Ballroom... we'll see) I will present even more on sinusoidal / chirp modeling. This session will be very much related to Julius's presentation last week and will include:

• A review of my summer research in chirp modeling

• A quick summary of Yi-Wen’s conclusions on chirp modeling

• Applications of chirp parameter estimation techniques by Yi-Wen, Julius, and myself

• An answer to the oft-raised question from the last seminar: "What happens when we have affine amplitude modulation of the chirp?"

• "Competitive" results of the different algorithms under various SNR conditions, phase offsets, and amplitude modulation. (Let me preview the answer: Gaussians are amazing.)

Hope to see everyone there! —Aaron


3D Sound Project

Stephanie Salaun <salauns at ccrma> (CCRMA Visiting Researcher)

Hi everybody!

On Friday, October 25th, at 3:15 in the CCRMA Library, I am presenting part of my 3D sound project (it's not quite finished). Here is the abstract for those who are interested:

In everyday life, sounds are all around us in the horizontal and vertical planes. One of the methods for reproducing virtual sound is to binaurally record a sound and play it back through headphones or through loudspeakers. When using headphones, the sound is played back with only a small transmission loss, so the signal is almost reproduced correctly at each ear.

However, playing back the sound through loudspeakers creates crosstalk between them, an unwanted effect that may be removed by signal processing (a crosstalk cancellation algorithm).

When trying to produce 3D sound through loudspeakers, we have to consider the loudspeakers' setup, the crosstalk cancellation algorithm, the room, and other factors. This method will reproduce the 3D sound accurately at only one head position, known as the “sweet spot”.

My system has four loudspeakers above the listener, splitting the signal into two frequency bands. The crosstalk cancellation is achieved using adaptive filtering.

Currently, I am having some difficulties with the implementation of my algorithms and I hope to get some feedback during the seminar.

Stephanie


Constrained EM Estimates for Harmonic Source Separation

Pamornpol Jinachitra (“Tak”) <pj97 at stanford> (EE)

A constrained iterative method for harmonic source parameter estimation is proposed based on an EM algorithm by Feder & Weinstein (1988), with a view toward harmonic source separation. The algorithm is attractive for its guarantee of convergence to a local maximum of the likelihood and its ability to deal with each partial separately. The problem of coinciding partials, and interference among them in general, is mitigated by constraints tying the “weak” partials to the stronger ones of the same harmonic source. A useful scheme to determine the weakness of a partial is proposed. The constrained iteration is shown to give highly accurate estimates of the sinusoidal parameters, which results in good source separation in most cases of highly overlapping spectra. However, the encountered problem of inaccurate amplitude estimates for coinciding partials has not yet been addressed.


Sinusoidal Peak Estimators

Julius Smith <jos at ccrma> (CCRMA)

I will present some Matlab simulation results comparing several peak-frequency estimators, specifically Kay's method, Lank's method, and our old friend, the quadratically interpolated FFT dB magnitude peak. The estimators are compared to each other and to the Cramer-Rao lower bound.
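
For a feel for such a comparison, here is one of the simpler estimators, the Lank-Reed-Pollon angle of the averaged lag-1 autocorrelation, checked against the Cramer-Rao bound in a small Monte Carlo experiment (SNR, length, and frequency are arbitrary; this is a sketch, not the talk's Matlab code):

    import numpy as np

    def lank_estimate(x):
        """Angle of the averaged lag-1 autocorrelation, in cycles/sample."""
        return np.angle(np.mean(x[1:] * np.conj(x[:-1]))) / (2 * np.pi)

    def crb_freq(snr, N):
        """Cramer-Rao bound on frequency variance, (cycles/sample)^2,
        for one complex sinusoid in white Gaussian noise."""
        return 12.0 / ((2 * np.pi) ** 2 * snr * N * (N ** 2 - 1))

    rng = np.random.default_rng(0)
    N, f0, snr = 256, 0.123, 10.0           # snr = 10.0 is 10 dB
    n = np.arange(N)
    errs = [lank_estimate(np.exp(2j * np.pi * f0 * n)
                          + (rng.standard_normal(N) + 1j * rng.standard_normal(N))
                          / np.sqrt(2 * snr)) - f0
            for _ in range(2000)]
    print("empirical var:", np.var(errs), "  CRB:", crb_freq(snr, N))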


Gaussian Chirp Rate Estimation

Aaron Master <asmaster at ccrma> (EE)

I will show the mathematical equivalence between a modified version of the Smith estimator and the Marques+Almeida estimator for the frequency chirp parameter in a Gaussian-windowed signal. I will also explain the iterative aspect of the Marques and Almeida estimator, as well as their magnitude and phase estimation technique. Results comparing one iteration of the Smith and M+A methods will be presented. The various impacts of parabolic curve fitting, single-point second-order differencing, and averaged second-order differencing will be considered for each method.


Watermarking Parametric Representations for Synthetic Audio: a Mid-Term Research Update

Yi-Wen Liu <jacobliu at stanford> (EE)

I will propose a system to watermark parametric representations for synthetic audio. The system combines quantization index modulation at the encoder with maximum likelihood parameter estimation at the decoder. To guarantee error-free data hiding under expected types of attacks, knowledge of Fisher information and Cramer-Rao bounds is applied to the system design. Experiments show that, merely by quantizing the frequencies of sinusoidal tones, one can achieve 50 b/s of data hiding that is robust to perceptually shaped additive attacks such as MP3 compression.
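
A minimal sketch of scalar quantization index modulation applied to a sinusoid's frequency, the embedding strategy named above (the step size and function names are illustrative assumptions):

```python
import numpy as np

def qim_embed(freq_hz, bit, step=2.0):
    """Hide one bit by snapping the frequency to one of two
    interleaved quantizer lattices, offset by half a step."""
    offset = 0.5 * step * bit
    return step * np.round((freq_hz - offset) / step) + offset

def qim_decode(freq_hz, step=2.0):
    """Recover the bit by finding the nearer lattice."""
    d0 = abs(freq_hz - qim_embed(freq_hz, 0, step))
    d1 = abs(freq_hz - qim_embed(freq_hz, 1, step))
    return int(d1 < d0)
```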


Winter Quarter 2001-2002

Pitch Detection for Automatic Transcription

Ryan Hinton <rhinton at stanford> (EE)

The goal is to produce a general transcription tool with output appropriate for music notation software. The tool should accept a CD-quality WAV file of music based on the Western tempered scale. The source of the music should not be constrained: the tool should be able to process pitches from instruments, human voices, electronic synthesizers, or any other source. Furthermore, the tool should accept reasonably many voices. The tool should record to a file the pitches in the music along with their start and stop times. One possible output file format is a MIDI file: several popular notation packages notate music from MIDI files. Other output file formats may be more appropriate depending upon file-format complexity and the eventual tool feature set.
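
For the MIDI output path, the mapping from a detected fundamental frequency to the nearest note of the Western tempered scale is simple (a sketch, assuming A4 = 440 Hz = MIDI note 69):

```python
import numpy as np

def freq_to_midi(f_hz, a4_hz=440.0):
    """Nearest MIDI note number for a detected fundamental frequency."""
    return int(round(69 + 12 * np.log2(f_hz / a4_hz)))
```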

Future algorithms may include determining which key the music is in; how far the music is off from the true key (e.g., a violin tuned a little off from a true "A"); whether a particular voice is sharp or flat; and the tempo and meter of the music. The tool could even attempt to identify particular instruments.

The tool may also be enhanced to accept different sample rates; different scales (e.g., American vs. international basis frequency, and non-Western music); and different input and output file formats.

I will put together a summary of the goal and the challenges I know of, along with algorithms I have considered.


Excitation models in self-sustained oscillators with a focus on the bowed string

Stefania Serafin <serafin at ccrma> (Music)

Talk 1:

The goal of this talk is to analyze different kinds of excitation models, with a focus on friction-driven self-sustained oscillators like the bowed string. Different kinds of models have been proposed to reproduce the action of a bow exciting a violin string. Nowadays, the availability of fast processors allows implementation of highly refined models in real time. In this talk, I will present the evolution of friction models from the very simple one proposed by Coulomb to the latest discoveries in bowed-string modeling.
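
As a minimal illustration of the kind of friction curve such models use (the constants here are hypothetical, not taken from the talk): a high static coefficient near zero relative velocity, decaying toward a lower dynamic value.

```python
import numpy as np

def bow_friction(dv, mu_s=0.8, mu_d=0.3, v0=0.2):
    """Illustrative friction coefficient vs. bow-string relative
    velocity dv: static value mu_s at dv = 0, decaying toward the
    dynamic value mu_d for large |dv|."""
    return mu_d + (mu_s - mu_d) * v0 / (v0 + np.abs(dv))
```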

Talk 2: Randal Leistikow will present his latest results constructing a "piano filter bank" which sets a passband around each piano key and rejects all other keys.


Applications of Probabilistic Monitoring to the Stabilization of the Joint Segmenter/Rhythm Tracker

Harvey Thornburg <harv23 at ccrma> (EE)

CCRMA DSP Seminar

Friday, Feb. 8 2002

An unfortunate property of the joint segmenter/rhythm tracker, where signal and event processing are decoupled, is that bursts of segmentation errors (misses, false alarms) lead to adjustments in the rhythm tracker's belief state which reinforce future segmentation errors of the same type. Though the joint engine is still locally stable, too large a disruption (involving many missing segments) leads to a global instability which manifests as tempo period doubling. An interesting point for discussion is whether this instability is an inherent property of the decoupling or whether some structural adjustments might eliminate it. The former appears most likely to be true.

A variety of simple approaches can mitigate this instability, such as tuning the rhythm tracker to respond more slowly to tempo variations, or artificially increasing the entropy of the switching transition matrix. However, these measures, being unnecessarily conservative, tend to bias the overall estimation. A better approach is to exploit additional, independent information, such as the fact that the tempo usually lies in a certain range (say, 50-200 bpm). Since this information is actually "true", in theory we should not experience adverse effects from bias.
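
A minimal sketch of such "independent information" as a bounded tempo prior over a discretized tempo grid (the grid and bounds are illustrative):

```python
import numpy as np

def tempo_prior(bpm_grid, lo=50.0, hi=200.0):
    """Uniform prior over a plausible tempo range, zero elsewhere."""
    p = ((bpm_grid >= lo) & (bpm_grid <= hi)).astype(float)
    return p / p.sum()
```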

In this talk, we will formalize the concept of "independent information" and show how it leads to a "feedback control" interpretation in the context of segmentation/rhythm tracking, thus providing a natural guard against instability. This approach can also be referred to as "probabilistic monitoring". Additionally, we discuss substantial pitfalls induced by computationally necessary approximations. So far, results show only mixed success in the application to joint segmentation/rhythm tracking, and some improvements are desired.


Perceptually similar orthogonal sounds and applications to multichannel acoustic echo cancellation

Yi-Wen Liu <jacobliu at stanford> (EE)

With the availability of increased communication bandwidth in recent years, people have become interested in full-duplex, multichannel sound for telepresence services because of its potential for providing a much better hearing experience. Nevertheless, in any full-duplex audio network connection, the problem of acoustic echoes arises due to the coupling between loudspeaker(s) and microphone(s) placed in the same room.

Moreover, it is known that the cancellation of multichannel acoustic echoes is a mathematically ill-conditioned problem: due to the high correlation between the signals in the multiple channels, an adaptive echo canceler tends to converge to a degenerate solution and fails to find the true coupling paths.

Although several types of algorithms for decorrelating the channels have been proposed to regularize the problem in the context of speech teleconferencing, these algorithms have all been developed under the criterion that any sound other than the speech signals should not be heard. However, this is not necessarily so in applications such as video games and the performing arts, where background sound effects and background music are common and desired practices.

In this talk, I propose utilizing background sounds for multichannel echo canceling.

Methods are developed to generate arbitrarily many orthogonal yet perceptually similar sounds from a mono source; these sounds are fed into a multichannel echo canceler so that it can better identify the echo paths. Two-channel and five-channel simulations will be demonstrated.
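
For orientation, a single-channel normalized-LMS echo-path identifier of the kind that would sit inside such a canceler (a generic sketch; the talk's multichannel canceler and its inputs are not specified here):

```python
import numpy as np

def nlms_echo_canceller(x, d, order=256, mu=0.5, eps=1e-8):
    """Identify the echo path from loudspeaker signal x to microphone
    signal d with normalized LMS; return the echo-cancelled error."""
    w = np.zeros(order)
    e = np.zeros(len(d))
    for n in range(order, len(d)):
        u = x[n - order:n][::-1]             # most recent samples first
        e[n] = d[n] - w @ u                  # residual after echo estimate
        w += mu * e[n] * u / (u @ u + eps)   # normalized LMS update
    return e
```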


Design of MLS Measuring System for Objective Room Acoustical Parameters

Stephanie Salaun <salauns at ccrma> (CCRMA)

A room can be characterized by different acoustical parameters (decay curve, reverberation time, early decay time, clarity, ...). In order to measure these parameters, different techniques can be used, such as periodic pulse testing, time-delay spectrometry (TDS), and maximum-length sequences (MLS).

This project deals with a system for measuring room acoustical parameters using the maximum-length sequence (MLS) method. The system is based on both a DSP and a PC: the DSP generates the MLS and extracts the room impulse response, while the PC calculates the parameters and displays them in a graphical user interface (GUI), which also controls the measurement.
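
A minimal sketch of MLS generation with a linear feedback shift register (the 15-bit register and the (15, 14) tap pair are one standard primitive-polynomial choice, assumed here for illustration):

```python
import numpy as np

def mls(n_bits=15, taps=(15, 14)):
    """One period (2^n_bits - 1 samples) of a +/-1-valued
    maximum-length sequence from a Fibonacci LFSR."""
    state = np.ones(n_bits, dtype=int)
    seq = np.empty(2 ** n_bits - 1)
    for i in range(len(seq)):
        seq[i] = 2.0 * state[-1] - 1.0                # output oldest bit
        fb = state[taps[0] - 1] ^ state[taps[1] - 1]  # feedback bit
        state[1:] = state[:-1]
        state[0] = fb
    return seq
```

The room impulse response is then recovered (up to a scale factor) by circular cross-correlation of the recorded response with the sequence.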


Introduction to wavelet filter banks

Julius Smith <jos at ccrma> (CCRMA)

Subject: DSP Seminar, CCRMA Library, Friday 3:15

This week, Harvey Thornburg will present on Bayesian transient estimation in the first half, and I will present an introduction to wavelet filter banks in the second half.


Peak-Adaptive Phase Vocoder

Aaron Master <asmaster at ccrma> (EE)

In this talk, based on my ICASSP 2002 presentation, I will describe a new method for refined estimation of amplitude and frequency trajectories in sinusoidal modeling. The method consists of performing phase-vocoder-style modeling of individual peaks in the short-time Fourier transform (STFT). When a peak corresponds to a true sinusoid for the duration of the analysis window, the instantaneous amplitude and frequency reduce to constants. More generally, however, narrow-band amplitude and frequency modulations can be accurately recovered over the duration of the analysis frame. The refined estimates can replace the usual piecewise-linear amplitude and frequency trajectories in sinusoidal modeling, allowing for more accurate modeling of musical attacks and pulsed waveforms such as voice. Instantaneous phase may optionally be tracked as well. In summary, the peak-adaptive phase vocoder can be viewed as a phase vocoder whose analysis channels are constructed adaptively about time-varying peaks in the STFT.
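
The per-bin phase-vocoder frequency estimate underlying such methods can be sketched as follows (assuming rfft frames and a known hop size; this is the standard estimate, not the peak-adaptive channel construction itself):

```python
import numpy as np

def pv_inst_freq(X_prev, X_cur, k, hop, fs):
    """Instantaneous frequency of STFT bin k from the phase advance
    between two successive frames (standard phase-vocoder estimate)."""
    n_fft = 2 * (len(X_cur) - 1)                      # rfft frame length
    omega_k = 2 * np.pi * k / n_fft                   # bin center (rad/sample)
    dphi = np.angle(X_cur[k]) - np.angle(X_prev[k]) - omega_k * hop
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi       # wrap to [-pi, pi)
    return (omega_k + dphi / hop) * fs / (2 * np.pi)  # Hz
```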


Autumn Quarter 2001-2002

Polyphonic Music Transcription

Clint Martin <cmartin 121 at yahoo.com> (EE)

In the DSP seminar on Friday, October 5, I will be discussing polyphonic music transcription. A music-transcription system takes a recorded piece of music, tries to accurately determine what was played, and outputs some sort of high-level musical notation; the output could be a MIDI file or sheet music. Part of the presentation will briefly cover different approaches people are taking to the polyphonic music-transcription problem and will demonstrate the results obtained by two recent music-transcription software packages. The majority of the presentation will be spent discussing my research on the topic, which includes an original signal segmentation scheme and a note-finding algorithm, based on the research of Andranick Tanguiane, that takes the partials present in the music and tries to determine the notes that were played. The theory and performance of the note-finding algorithm will be discussed in depth, and sound files will be played that demonstrate the system's strengths and weaknesses.
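
As a toy illustration of note-finding from a measured partial list (not Tanguiane's algorithm or the system described above), one can score candidate fundamentals by how many observed partials land near their harmonics:

```python
import numpy as np

def score_f0_candidates(partials_hz, f0_grid, tol_cents=30.0):
    """Count, for each candidate fundamental, the observed partials
    lying within tol_cents of one of its harmonics."""
    scores = []
    for f0 in f0_grid:
        harm = np.maximum(np.round(partials_hz / f0), 1.0)
        cents = 1200.0 * np.abs(np.log2(partials_hz / (harm * f0)))
        scores.append(int(np.sum(cents < tol_cents)))
    return np.array(scores)
```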


Bayesian Segmentation and Rhythm Tracking, Part I: Constructing and Identifying Probabilistic Models of Rhythm

Harvey Thornburg <harv23 at ccrma> (EE)

DSP Seminar

Friday, October 19 2001

Segmentation of polyphonic music is quite difficult, especially in the context of automatic transcription, where one task is to find the change times which correspond explicitly to note onsets. Thanks to rhythmic structure, the pattern of onsets is highly regular. The pattern of onsets admits a probabilistic structure (a parameterized joint distribution over all onsets) known as the rhythm model. One goal is to exploit this "regularity" to improve the overall segmentation performance and to better adapt segmentation to the problem of onset detection, via Bayesian segmentation: the rhythm model serves as a machine for generating priors about future/past onset locations, used subsequently in segmentation. In a practical situation the rhythm model is incompletely specified; even basic attributes such as tempo and meter must be identified as well: any stream of onsets serves as data with which to learn and identify parameters in the rhythm model. Hence, a fundamental task in automatic transcription is joint segmentation and rhythm modeling.

In this talk (the first in a two-part series) I will focus on rhythm modeling and the identification of rhythm models from onset data. The sequel will focus on Bayesian segmentation and the issues of interfacing segmentation and rhythm modeling such that both operate in tandem on an audio signal. Since the focus is on modeling "events", this talk may be less "DSP-centric" and more "machine learning" than the usual seminar, but these aspects are very important to a problem which is fundamental to signal processing (wherever there exists any kind of regular structure in the evolution of a signal model's parameters).

To begin, I introduce the rhythm model's representation. This has two components: I first review the "metronome" model of Cemgil et al. for a constant-interval sequence with unknown tempo. Next, I introduce a simple Markovian model for the sequence of note intervals and metric positions. The meter can be obtained as a function of the parameters (transition probabilities) of this model. The key challenge is obtaining a common sparse representation encompassing a variety of common meters. Finally, to conclude the representation part, I show how everything may be integrated in the framework of a linear switching state-space model, where the metronome state captures the evolution of tempo/event position, and the switching state captures the evolution of rhythmic interval/metric position. The issue of linearization will also be discussed here.

The remainder of the talk focuses on inference and learning in switching state-space models. The inference step involves tracking "state" quantities like tempo, the expected time location of a change, the current interval, metric position, and so on. One obtains the conditional distribution of the states at all event locations given all observed onset times (smoothing) or just past and present observations (filtering). Since any particular event location is a function of the metronome state, the inference step supplies the appropriate prior distributions to use in segmentation.

Exact inference in switching state-space models is intractable; as we will see, the representation of the conditional distribution involves an exponentially growing mixture of Gaussians. (This is in contrast to a hidden Markov model, in which the only hidden state is discrete: we could have an exponentially growing mixture of scalar probabilities, but we don't incur this as a cost because a mixture of scalars is a scalar.) In this talk I compare two approximate methods: Viterbi inference and a direct junction-tree-based approach with distributional approximations. The junction-tree approach is based on a general method for organizing inference steps in probabilistic networks called "variable elimination" or the "junction tree algorithm".

The learning step, by contrast, involves identifying parameters specifying "dynamic" quantities which describe the evolution of the state (in this case, the Markov transition probabilities, because everything else is fixed). Parameters are chosen to maximize the likelihood of the observations; the resulting optimization is solved via the EM algorithm. The likelihood computation is handled by the inference step, which highlights the need for accurate local approximations across groups of consecutive switching states (and thus the need to develop the junction tree algorithm).
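
A heavily simplified, non-switching version of the metronome state tracking mentioned above can be sketched as a two-state Kalman filter over onset times (state = next onset time and tempo period; the noise levels are illustrative assumptions):

```python
import numpy as np

def metronome_kalman(onsets, q=1e-4, r=1e-3):
    """Track the tempo period from onset times with a linear Kalman
    filter; the switching (rhythmic-interval) layer is omitted."""
    x = np.array([onsets[1], onsets[1] - onsets[0]])  # [onset, period]
    P = np.eye(2)
    A = np.array([[1.0, 1.0], [0.0, 1.0]])            # next onset += period
    H = np.array([[1.0, 0.0]])                        # onsets are observed
    periods = []
    for t_obs in onsets[2:]:
        x = A @ x                                     # predict
        P = A @ P @ A.T + q * np.eye(2)
        S = H @ P @ H.T + r                           # innovation variance
        K = P @ H.T / S                               # Kalman gain
        x = x + (K * (t_obs - H @ x)).ravel()         # update
        P = (np.eye(2) - K @ H) @ P
        periods.append(x[1])
    return np.array(periods)
```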


Making teleconferencing a better experience: stereophonic solutions

Yi-Wen Liu <jacobliu at stanford> (EE)

This talk will be a summary of my work on a stereophonic teleconferencing project at the department of broadband communication services of AT&T Research Labs during the past summer. We believed that, although there have been many recent efforts to improve video, the audio quality of commercially available mono-sound teleconferencing systems is not very good. To build a sound system devoted to teleconferences, in which many people often need to be able to argue at the same time, we claim that a stereo-sound, full-duplex system is a must.

It turns out that the cancellation of the echoes that arise due to the coupling between multiple loudspeakers and microphones is a mathematically ill-conditioned, and hence still unsolved, problem. Perceptually, these echoes "enlarge" the conference room and may seriously degrade speech clarity. We propose a new idea: digitally watermark the sounds from different loudspeakers so that the coupling paths from multiple loudspeakers to each of the microphones can be distinguished, reducing the stereophonic echo-cancelling problem to parallel monophonic echo-cancelling problems.

A preliminary example of echo cancellation using digital watermarks will be demonstrated, and its performance will be evaluated.


Bayesian Segmentation and Rhythm Tracking, Part I: Constructing and Identifying Probabilistic Models of Rhythm

Harvey Thornburg <harv23 at ccrma> (EE)

DSP Seminar

Friday, October 19 2001

This seminar will begin as an informal continuation of my previous seminar on the use and identification of probabilistic models for rhythm. I will conclude this part by giving more examples and proofs concerning the junction tree algorithm, as well as focusing on the interpretation of the resulting recursions as a bank of extended Kalman filters which expands during a switching state update and contracts upon the distributional approximation. What's really important for understanding the recursions is that they can be verified directly, using conditional independence relations. The fact that the recursions "work" doesn't rest on the abstract graph-theoretical concepts of the junction tree algorithm, which makes the whole process a bit easier to understand. The junction tree algorithm can be thought of as a template for organizing the computations one needs to develop these recursions from scratch. Had it been around in the 1960s, we would have immediately discovered all smoothing approaches to the Kalman filter, as well as all useful algorithms for HMMs and multilayer HMMs!

I will also show results of tracking and EM in the cases of accelerando, decelerando, etc., because I didn't get to this point either, due to difficulties with the EM and with the use of noninformative priors (singular Fisher information matrices), which have now been resolved.

Next I will begin the dual set of topics: what to do when you have a prior distribution over a segment time and wish to perform segmentation on the actual signal. I will introduce the basic problem, then discuss the various representations of distributions over segment times. Finally, I hope to complete coverage of the online (real-time) sequential test, as well as show some interesting results and specializations of this test. However, I could certainly postpone discussion of these issues if there's anything else someone in the group wants to present.

What remains to be covered later, in a third seminar, are strategies for integration, such that we can really do joint segmentation and rhythm modeling using the actual signal. We can also discuss Bayesian offline segmentation methods. A byproduct of the Bayesian offline methods is that we now have a useful nonparametric method for representing the posterior probability of change anywhere in the signal, based on "information in local windows". The original purpose was to obtain a general "empirical Bayes" prior for offline segmentation based on quasi-perceptual criteria. I think this "changeogram" method relates to some of the issues in Clint's seminar.

If anyone is interested, Chapter 2 of the report covers all of these topics except the integration.


Also, please feel free to ask questions at any time during the presentation.

–Harvey


Two upcoming ASA talks on woodwind modeling

Gary Scavone <gary at ccrma> (CCRMA)

Tonehole radiation directivity measurements
Gary P. Scavone and Matti Karjalainen

Abstract: Measurements have been conducted in an anechoic chamber for comparison with current acoustic theory regarding radiation directivity from a cylindrical pipe with toneholes. Several difficulties arise in measurements of this sort, including (1) the generation of sufficient driving-signal strength at the pipe input for pickup by an external microphone; (2) external source-to-pickup isolation; and (3) measurement contamination due to nonlinear driver distortion. Time-delay spectrometry using an exponentially swept sine signal was employed to determine impulse responses at points external to the experimental air column. This technique is effective in clearly isolating nonlinear artifacts from the desired linear system response along the time axis, thus allowing the use of a strong driving signal without fear of nonlinear distortion. The experimental air column was positioned through a wall conduit into the anechoic chamber such that the driver and pipe input were located outside the chamber while the open pipe end and toneholes were inside the chamber, effectively isolating the source from the pickup. Measured results are compared to both transmission-line frequency-domain simulations and time-domain digital waveguide calculations.
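
The exponentially swept sine used in such measurements can be generated as follows (a standard construction; the parameters are illustrative):

```python
import numpy as np

def exp_sweep(f1, f2, dur, fs):
    """Exponential sine sweep from f1 to f2 Hz over dur seconds; with
    this excitation, nonlinear distortion products separate from the
    linear response along the time axis after deconvolution."""
    t = np.arange(int(dur * fs)) / fs
    L = dur / np.log(f2 / f1)
    return np.sin(2 * np.pi * f1 * L * (np.exp(t / L) - 1.0))
```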

Time-domain synthesis of conical bore instruments
Gary P. Scavone

Abstract: A series of approaches is presented for discrete time-domain synthesis of conical-bore instrument sounds. This study uses a simple clarinet-like system, involving a memoryless nonlinear "reed" function and a distributed cylindrical air-column model, as a point of departure for subsequent development. The generation of steady, self-sustained oscillations in such a system is complicated by properties inherent to truncated conical waveguides. Alternative methods include a structure equivalent to Benade's "cylindrical saxophone," with two separate parametrization schemes. Finally, a more general "virtual" model is presented that is capable of synthesizing both half-wave and quarter-wave resonators. This structure provides a rich variety of possible sounds and offers an interesting perspective on conical waveguides.

Spring Quarter 2000-2001

3D Positional Audio


Rachel Wilkinson <rwilkinson at ausim3d.com> (Ausim)

The first DSP Seminar this quarter will be this Friday, April 6th, at 3:15 in the CCRMA Ballroom.

When we hear a sound, how do we tell where it's coming from? We often instinctively turn to pinpoint the origin of a new sound, but how do we know which way to turn? Moreover, why is the capability of determining source direction diminished or lost when we listen to audio over headphones? What's missing from the headphone signal?

The answers lie in how our brains process the sound waves caught by our ears to determine where the sounds we hear originate. Studies of these processes have revealed certain distinct features, or cues, that the brain has learned to use to localize sounds. By understanding and intelligently recreating these cues, technologists can now synthesize surprisingly realistic virtual aural environments.
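
One of the best-known such cues is the interaural time difference; a minimal sketch using Woodworth's spherical-head approximation (the head radius is an assumed typical value, not a figure from the talk):

```python
import numpy as np

def itd_seconds(azimuth_deg, head_radius=0.0875, c=343.0):
    """Woodworth's approximation of the interaural time difference
    for a distant source at the given azimuth (0 = straight ahead)."""
    th = np.radians(azimuth_deg)
    return (head_radius / c) * (th + np.sin(th))
```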

Physical and empirical modeling are technologies that can be used as bases for simulating natural sounds. With a good mathematical model of the physical or acoustic characteristics of the sound source, the environment through which it propagates, and the listener, flat sound sources can be altered to sound as if they had actually propagated through and interacted with a physical environment.

This presentation will explore new research into ways of creating realistic 3D sound simulations. We will discuss processing and delivery strategies as well as psychoacoustic factors. Ideas will be generated for future applications and areas of further research.

E. Rachel Wilkinson
Audio and Acoustics Applications Specialist
AuSIM, Inc.
http://www.audiosimulation.com


Introduction to Multirate, Polyphase, and Wavelet Filter Banks

Julius Smith <jos at ccrma> (CCRMA)

This two-hour talk contains tutorial introductions to

• multirate digital systems,

• perfect reconstruction filter banks,

• quadrature mirror filters,

• polyphase filter bank analysis,

• multi-input, multi-output allpass filters,

• paraunitary filter banks, and

• wavelet filter banks (particularly the dyadic wavelet filter bank).

Additionally, it is demonstrated how polyphase filter-bank analysis naturally converts the "filter bank summation" (FBS) representation of the STFT to an "overlap-add" (OLA) representation, while simultaneously showing that the STFT implements a perfect reconstruction filter bank (usually oversampled).
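
The simplest perfect-reconstruction two-band example is the Haar (length-2 QMF) pair; a minimal sketch:

```python
import numpy as np

def haar_analysis(x):
    """Split x into half-rate lowpass and highpass subbands."""
    x = x[: len(x) // 2 * 2]
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)
    return lo, hi

def haar_synthesis(lo, hi):
    """Interleave the subbands to reconstruct x exactly."""
    x = np.empty(2 * len(lo))
    x[0::2] = (lo + hi) / np.sqrt(2)
    x[1::2] = (lo - hi) / np.sqrt(2)
    return x
```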

This talk is a synthesis of Music 420 lecture overheads with contributions from past teaching assistants Scott Levine and Harvey Thornburg. They can be perused (on campus) at http://ccrma/˜jos/JFB/.


Linear prediction analysis of voice in the presence of sinusoidal interference

Aaron Hipple, Yi-Wen Liu, and Kyungsuk Pyun <hipple—jacobliu—kspyun at stanford> (EE)

We are interested in tackling the single-channel sound source separation problem for voice and non-voice signals. An interesting task would be to separate singing from instrumental accompaniment, pianos or guitars for example. In that case, it is crucial to estimate the glottal source of the voice part in the presence of interfering sinusoids.

The focus of our ongoing research is to study the linear prediction analysis of voice and to try to come up with new methods to separate voice and non-voice from a single-channel mixture.

In particular, we have worked on an adaptive linear prediction (LP) analysis framework based on the LMS algorithm. The adaptive algorithm is causal and has the potential to follow the statistics of the voice more closely. However, the estimate of the LP coefficients fluctuates around the optimal solution due to the nature of the LMS algorithm. (A minimal sketch of such a predictor follows the outline below.)

The outline of the talk is as follows:

• Introduction to the sound source separation problem: can it ever be done? Does it need to be done? What are the benefits of solving it?

• Literature review of the methods: computational auditory scene analysis (CASA) versus blind source separation (BSS) and independent component analysis (ICA)

• Brief review of linear prediction analysis of speech/voice

• Experimental results of LP analysis of voice in the presence of sinusoidal interference

• Possible directions for sound source separation: statistical methods and heuristic methods
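
A minimal sketch of the LMS-adaptive linear predictor mentioned above (the order and step size are illustrative):

```python
import numpy as np

def lms_linear_predictor(x, order=12, mu=0.01):
    """Predict x[n] from the previous `order` samples; update the
    coefficients from the prediction error at every sample."""
    w = np.zeros(order)
    e = np.zeros(len(x))
    for n in range(order, len(x)):
        u = x[n - order:n][::-1]       # most recent samples first
        e[n] = x[n] - w @ u            # prediction error (residual)
        w += 2 * mu * e[n] * u         # LMS coefficient update
    return w, e
```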


The Bayesian Approach to Segmentation and Rhythm Tracking

Harvey Thornburg <harv23 at ccrma> (EE)

Segmentation refers to the detection and estimation of abrupt change points. Rhythm tracking refers ultimately to the identification of tempo and meter, but in this context it refers more generally to the identification of any higher-level "structure" or "pattern" among change points. Therefore the name "rhythm tracking", despite being catchy, is somewhat over-optimistic as to the goals of this project. Nevertheless, it serves until a better one can be found.

A naive approach to rhythm tracking is to use the partial output of a segmentation algorithm (say, the list of change *times*) to feed a pattern recognition algorithm which learns the structure of this event stream. What "learning" means depends on how the data is processed. In the online setting, which is easiest to develop, learning means to predict, i.e., to develop posterior distributions over future change times given those already observed.

There are a number of problems with the naive approach. First, it is open loop. We could, and should, somehow "close the loop", i.e., use the tracker's output posterior distributions as priors for the segmentation. This requires a Bayesian framework for the segmentation itself, which will be the *specific* focus of this talk. There are really two motivations for developing Bayesian techniques:

1. For the specific problem of rhythm tracking, only a fraction of the detected change times are likely to convey rhythmic information. As well as note onsets, there could be expressive timbral and/or pitch fluctuations, interfering signals, and so on, which either do not correspond to rhythmic information or do so less reliably than the onset information. Hence, it is desired that the underlying segmentation method learn to recognize specifically the changes corresponding to the desired structure, a phenomenon known as "stream filtering".

2. Independent of rhythm tracking, there is still the concern that we make efficient use of large data samples. When change times exhibit a regular pattern (i.e., anything that differs from Poisson), events in a certain location will give information about those far away. Since any practical (i.e., linear-time) approximate change detection scheme uses only short windows with local information in the signal about a given change point, it is desirable to communicate more global kinds of information without resorting to large windows.

Second, structures based only on the change times do not seem to cope well with rests, subdivisions, and similar phenomena. For instance, it is unlikely that a given list of interarrival times will simply repeat, so immediately we must cope with the problem of variable-length cycles. Hence, lists of groups of rhythmic intervals are not a good intermediate representation. We have to be a bit more clever. A proposed solution, easily adapted to general frameworks for change detection, is to propagate information about the "strength" of each change as well as the change time. Formally, we define a change point as the pair (T_k, E_k), where T_k is the time of a possible change, and E_k is an indicator that the change actually occurred. Right now the "strength" is defined as the posterior likelihood P(E_k | T_k, E_{k-1}, T_{k-1}, E_{k+1}, T_{k+1}, E_{k-2}, T_{k-2}, E_{k+2}, T_{k+2}, ...). This definition probably needs refinement. This likelihood may be computed using familiar methods (i.e., subspace tests). Intuitively, P(E_k | T_k, ...) relates to information about dynamics, and also encodes indications of "rests" (P(E_k | T_k, ...) << 1). In this representation, encoding of temporal information becomes very easy (the "metronome" model), and all the crucial information gets encoded in the patterns of strengths.

During the talk, I will focus mostly on the Bayesian framework for *sequential* change detection when the model after the change is unknown. Here the appropriate prior concerns the *location* of the next change point, as well as the prior probability that the change has occurred before time zero (I find one can just set this to zero, but it's needed to calculate expected cost). Our stopping rule minimizes an expected cost function involving the event of a false alarm, the total number of missed alarms, and the number of samples observed. Minimizing the latter also minimizes detection delay. Practically, we must also know something a priori concerning the distribution of models after the change. There seem to be two promising approaches: the marginalized approach, where we average over the entire space of candidate models (we must have a prior distribution for these models), and the multimodel approach, where we sample the alternative model space to get some "representative" set. The way I do this using the fewest models possible is to sample at the "poles" of significance shells. I have found multimodel sampling approaches to work better in practice, but a thorough analysis is still being worked out.

It seems the "strength" prior cannot be integrated here, but we can use it in subsequent "backtracking" using the Bayesian subspace test. That part needs more refinement and discussion.


Linear Prediction of Voice in the Presence of Sinusoidal Interference

Yi-Wen Liu <jacobliu at stanford> (EE)

This presentation is about ongoing research by Aaron Hipple, Kyungsuk Pyun, and Yi-Wen Liu for Stanford's EE373 adaptive signal processing series, taught by Professor B. Widrow of the EE department.

We are interested in the sound source separation problem of voice and non-voice signals from single-microphone mixtures. An interesting task would be to separate singing from instrumental accompaniment such as guitar or piano sounds.

In particular, we have worked on an adaptive LP framework based on Widrow's LMS algorithm. The adaptive algorithm is causal and has the potential to follow the statistics of the voice more closely. However, the estimate of the LP coefficients fluctuates around the optimal solution due to the nature of the LMS algorithm.

The outline of the talk is as follows:

• Motivations: commercial and academic

• What is voice? What is the "voiceness" that makes a voice sound like a voice?

• Brief review of the adaptive linear combiner/predictor

• Brief review of linear prediction analysis of speech/voice

• Comparison between LMS-based adaptive LP and windowing-based framewise LP

• Toward sound source separation: an LP front end plus an ICA back end; simple experiments and results


The advantages of wavelets over the FFT for audio analysis, and how to evaluate and improve an octave-band filter bank

Kyungsuk Pyun <kspyun at stanford> (EE)

This will be about a 60-75 minute talk, followed by Harvey Thornburg's 30-45 minute talk.

As an extension of my CCRMA Open House lecture last Thursday on "A fast and efficient octave-band filter bank for audio analysis", I would like to talk about the advantages of the wavelet approach for audio analysis compared to FFT-based approaches, and about how to evaluate and improve the performance of an octave-band filter bank.

This talk will have three sub-parts:

1. The advantages of wavelets over FFT for audio signal processing

In image processing, wavelets have recently come to dominate the FFT, to the point that JPEG 2000 chose wavelets as a standard. But in audio processing, this has not happened yet. I would like to address this issue, discuss the advantages of wavelets over the FFT, and consider the possibility of using them in audio signal processing, especially when fast signal processing is needed in a low-power wireless device like a portable MP3 player. Then I will talk about other researchers' work using wavelet methods.

Another important question is which wavelet to use for audio signals. I would like to discuss why the Gabor wavelet has a definite advantage for audio signals.

2. How to measure the performance of the filter bank

To get a feature vector, the filter bank output needs to go through amplitude compression and a DCT (discrete cosine transform) to decorrelate the output. After front-end processing, a popular distortion measure like the L2 norm or LDA (linear discriminant analysis), as used in Dr. Yoon Kim's thesis on NLP cepstrum, needs to be applied. This measure will be compared to one from an FFT-based filter bank.

3. How to improve the performance

In my octave-band filter bank, the binary tree is expanded only toward the low-frequency band. If the energy of a music signal is concentrated mainly in the high-frequency band (e.g., the signal of an instrument like a piccolo), then the binary tree needs to be expanded toward the high-frequency band.

One way to do this is to expand the tree in full and prune it adaptively depending on the input frequency characteristics. Dr. Naoki Saito proposed the LDB (local discriminant bases) method to do this in his Yale Ph.D. thesis, "Local feature extraction and its applications using a library of bases" (December 1994). I will explain how the proposed algorithm works. Basically, after full expansion of the tree, K principal subband components are selected, where K < M (M is the number of subbands when the tree is fully grown). The core of the algorithm is to use an additive measure like minimum entropy to prune the tree, which makes the algorithm run in O(n log n) time, the same order of speed as the FFT algorithm. This approach reuses the same prototype lowpass and highpass filters for efficiency.
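
A minimal sketch of the dyadic (octave-band) tree this refers to, using Haar filters as stand-in prototypes (the actual filters in this work are small-integer FIR designs):

```python
import numpy as np

def octave_band_split(x, levels=4):
    """Dyadic wavelet decomposition: keep the highpass band at each
    level and recurse on the lowpass half-band."""
    bands = []
    for _ in range(levels):
        x = x[: len(x) // 2 * 2]
        lo = (x[0::2] + x[1::2]) / np.sqrt(2)
        hi = (x[0::2] - x[1::2]) / np.sqrt(2)
        bands.append(hi)               # one octave band per level
        x = lo
    bands.append(x)                    # final lowpass residual
    return bands
```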


Thank you and I hope to see you all,

kyungsuk (peter)


Automatic music style classification: towards the detection of perceptually similar music

Unjung Nam <unjung at ccrma> (Music)

This is a preliminary study of a music classification system that tries to cluster two pieces of music with similar musical content and classifies them into two different genres according to their timbral information.

I will present an overview of work in the field of audio/music information retrieval and the general procedure for building a music classification system, and I will discuss an experiment I conducted on my system.
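
As one concrete example of timbral information, the spectral centroid is a commonly used feature (a generic sketch, not necessarily part of the feature set used in this study):

```python
import numpy as np

def spectral_centroid(x, fs):
    """Amplitude-weighted mean frequency of the magnitude spectrum."""
    X = np.abs(np.fft.rfft(x))
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    return float((f * X).sum() / (X.sum() + 1e-12))
```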


Bandlimited Interpolation and Virtual Analog Synthesis

Julius Smith <jos at ccrma> (CCRMA)

At tomorrow's DSP seminar (3:15, Thursday, CCRMA Ballroom), I will present a tutorial introduction to bandlimited interpolation and, time permitting, virtual analog synthesis. The lecture overheads can be perused online at

http://ccrma/˜jos/Interpolation/

and

http://ccrma/˜jos/VirtualAnalog/.
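
A minimal sketch of windowed-sinc bandlimited interpolation at an arbitrary time instant (the kernel half-width and Hann taper are illustrative choices, not details from the lecture):

```python
import numpy as np

def sinc_interpolate(x, t, fs, half_width=16):
    """Evaluate a uniformly sampled signal x at time t (seconds)
    using a Hann-windowed sinc kernel."""
    n0 = int(np.floor(t * fs))
    n = np.arange(max(0, n0 - half_width), min(len(x), n0 + half_width + 1))
    arg = t * fs - n                                        # distance in samples
    w = 0.5 + 0.5 * np.cos(np.pi * arg / (half_width + 1))  # Hann taper
    return float(np.sum(x[n] * np.sinc(arg) * w))
```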


In search of golden HRTFs

Klaus A. J. Riederer <Klaus.Riederer at hut.fi> (HUT)

(This talk was ultimately moved to the Hearing Seminar.)

For commercial applications, head-related transfer functions (HRTFs) are typically greatly simplified for real-time performance with less signal-processing power. However, by definition an HRTF is a measured response from a point in free-field space to a point in the ear canal. Hence, it contains all the monaural and binaural cues needed for spatial hearing (for the particular sound incidence and person). The essence is to understand to what extent these auditory cues are (perceptually) important, although the underlying matters are complex. The author's wide-scale analyses of carefully measured HRTFs show excellent system quality, high HRTF repeatability, and the effects of various factors on HRTFs, including ear-plug type, the experimenter's experience/carefulness, hairstyle, and headpieces. The vast inter-person variance makes it difficult to find robust common trends or groups in HRTFs. A mathematical index of deviance helps to segregate atypical responses, hinting at possible artifacts and/or deviating anatomical structures. This measure also relates to quantitative analysis toward a generic HRTF model, allowing a deeper understanding of (auditory) spatial hearing. Unfortunately, HRTFs are not sufficient to understand true multisensory spatial hearing. Their multisensory interactions with vision, motion, bone conduction, and tactile and vestibular sensing, with their cognitive impacts, also need to be considered in order to reveal the true enigma of spatial hearing. Elaborate multimodal perceptual experiments are thus being devised. Brain mechanisms are investigated by magnetoencephalography (MEG) measurements based on individual HRTFs. Results show differences in binaural neural processing between azimuth and elevation angles of 3-D sounds. Related publications are available at www.lce.hut.fi/˜kar/publications.html

Klaus A. J. Riederer
Helsinki University of Technology
Laboratory of Computational Engineering, Cognitive Science, and Technology
Laboratory of Acoustics and Audio Signal Processing
P.O. Box 9400, FIN-02015 HUT, Finland
E-mail: , URL: http://www.lce.hut.fi/˜kar


Winter Quarter 2000-2001

Parameter Engine for the Synthesis Tool Kit (STK)

Bryan Cook <bacook at ccrma> (EE)

The first DSP Seminar of this quarter will be this Thursday, Jan 11th, at 3:15 in the CCRMA Ballroom.

I will be giving an overview and demo of the software infrastructure I've been developing for the Synthesis Toolkit (STK), which I have dubbed the Parameter Engine.

Music synthesis algorithms, physical modeling algorithms in particular, are controlled by a large parameter space. Managing and exploring these parameters can often be tricky.

The Parameter Engine tries to address these problems by pursuing the following goals (a small illustrative sketch follows the list):

1. Easy registration of synthesis parameters

2. Easy mapping of external controls to internal parameters

3. Easy mapping of parameter ranges

4. Easy parameter editing, loading and saving.
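
A hypothetical sketch of the registration/mapping idea (this is not the STK Parameter Engine API; the names and the 0-127 control range are illustrative assumptions):

```python
class ParameterRegistry:
    """Toy parameter registry: synthesis parameters register with a
    range, and external controls map linearly onto that range."""

    def __init__(self):
        self.params = {}

    def register(self, name, lo, hi, default):
        self.params[name] = {"lo": lo, "hi": hi, "value": default}

    def set_from_control(self, name, control, control_max=127):
        """Map an external control value (e.g. a MIDI CC) to the range."""
        p = self.params[name]
        p["value"] = p["lo"] + (p["hi"] - p["lo"]) * control / control_max
        return p["value"]
```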

In addition to the Parameter Engine demo, I will also discuss interfacing STK to LabVIEW and give an overview of UIUC's Vanilla Sound Server (VSS) platform, which already has STK integrated.

-Bryan Cook
bacook at ccrma


An efficient octave-band wavelet filter bank for voice analysis

Kyungsuk Pyun <kspyun at stanford> (EE)

For speech recognition systems, information lost at the front-end stage is not recoverable, which makes front-end processing crucial to the whole system. In this talk, a new speech front-end feature extraction system using a 12-band bandpass filter bank will be proposed for use in low-power small computers.

The need to perform signal processing at extremely low power, as in wearable computers, motivates the study of FIR filters having few taps with small integer coefficients. For example, digital watches do not have floating-point operations.

The proposed speech front-end wavelet filter bank consists of 12 bandpass filters covering a frequency range from 1 Hz to 4 kHz. There are two sets of filter banks. The first is a regular wavelet-expansion octave-band filter bank. The second set of filters is implemented using the same prototype filters, except downsampled by 2/3 at the very beginning of the tree structure; anti-aliasing filters are used before processing. Because of this downsampling factor of 3, the sampling rate is 16128 Hz, which is divisible by 3 and close to, but not exactly equal to, 16 kHz. The signal is stored in a one-dimensional buffer before processing, just enough to be processed by the lowpass and highpass filters, which makes both filters symmetric noncausal filters. The buffer index is divided as follows: at first, every other sample is taken (this is the downsampled-by-2 version); after eliminating these samples, every other sample is again taken among the remainders; and the process continues in this way. One way to implement a bandpass filter is as a sinc multiplied by a sinusoid. During upsampling of the bandpass filter, the frequency of the sinusoid goes down and the width of the sinc function increases. In the frequency domain, this is equivalent to both the center frequency and the bandwidth of the bandpass filter going down; that is, the filter becomes closer to a lowpass filter with smaller bandwidth. What is unique in the proposed approach is the organization of the C code, namely one loop going through the samples, performing just a few assignment statements per sample. Our algorithm runs on the order of N. Also, the filter coefficients are chosen so that only fast computer arithmetic operations like addition and shifting are used. The multiresolution approach of the wavelet transform gives a fixed resolution for each band; in this approach it is about 6 samples/cycle at the maximum sensitivity of each filter. In this talk, the frequency responses of the 12 filters will be shown, along with the filter bank output for a chirp signal and several speech signals.
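
A toy example of the small-integer-coefficient idea (not one of the 12 filters above): a 3-tap lowpass computed with additions and bit shifts only.

```python
def integer_lowpass(x):
    """y[n] = (x[n-1] + 2*x[n] + x[n+1]) / 4 on integer samples,
    using only adds and shifts (suitable for integer-only hardware)."""
    y = [0] * len(x)
    for n in range(1, len(x) - 1):
        y[n] = (x[n - 1] + (x[n] << 1) + x[n + 1]) >> 2
    return y
```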


Segmentation of Audio Based on Stationarity Measures

Harvey Thornburg <harv23 at ccrma> (EE)

Time: Thursday 1/18 3:15 pm

In this DSP seminar I will give a research update. As the focus last time was on transient modeling, this time the focus will be segmentation. Time permitting, I will present several topics loosely connected to the problem of *maximum likelihood estimation* of segment locations.

The first possible topic is to outline the steps in an exact, rigorous justification of the segment-wise least-squares criterion as an asymptotic maximum likelihood estimator (MLE) for the Gaussian AR model, in the case where the innovations variance is unknown and not assumed to be constant. Though the results are what one would expect using vague, intuitive arguments about initial conditions "washing out", it is nice at some point to have a proof. I follow a presentation indicated in Kay's "recursive MLE" paper, but, instead of leaving the "dirty work" up to a frustrating, hundreds-of-pages presentation in Grenander and Szego, I derive and use the Gohberg-Semencul formula for the inverse of a finite Toeplitz matrix, followed by a straightforward use of Gray's theory of asymptotically equivalent matrices. Besides clarifying the presentation, the Gohberg-Semencul approach seems to indicate systematic estimation biases for finite data, and suggests how to correct them.

The next topic is to formulate the idea of "subspace tests", used to test whether a parameter lies in a linear subspace, based on the MLE. Unlike conventional order-selection criteria, it is possible to specify a confidence level (probability of false alarm) for these tests, which becomes useful in adjusting the "sensitivity" of the segmentation. There is a straightforward application to change detection, which to my knowledge appears here for the first time. A "nested subspace test" is presented that recursively computes N subspace tests in an N-dimensional parameter space using only N multiplies and divides, involving no explicit matrix inversions. All of this leads to a new segmentation algorithm, which is much simpler and more accurate than those previously presented. I will discuss this and show results on speech.

Finally, I will review the O(N^2) exact dynamic programming solution for the MLE of segment locations, then discuss something I am currently working on: how to adapt Blake's "weak string" approach to get an approximate solution in much less time. The exact MLE is far too slow to use in practice, so this investigation is important. Basically, the MLE objective is relaxed by permitting, but heavily penalizing, parameter variations within a segment. The resulting cost objective may be transformed, by a change of variable, into a sum of quasiconvex functions. If the quasiconvex functions were convex, the overall objective would be convex, and then we could use gradient descent. The idea, which Blake calls GNC ("graduated nonconvexity"), is to design related convex functions and morph them over time into the original functions, meanwhile running gradient descent. The discussion concerns exactly how GNC can be adapted for the MLE cost objective at hand.
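
For reference, the exact dynamic program has this shape (a generic sketch with a user-supplied per-segment cost; the AR-model cost and the weak-string relaxation above are not implemented here):

```python
import numpy as np

def optimal_segmentation(cost, n, k):
    """Least-cost placement of k segments over samples 0..n, where
    cost(i, j) is the fit cost of one segment spanning [i, j).
    Runs in O(k n^2) evaluations of cost."""
    best = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for seg in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(j):
                c = best[seg - 1, i] + cost(i, j)
                if c < best[seg, j]:
                    best[seg, j] = c
                    back[seg, j] = i
    bounds, j = [], n                    # backtrack segment start points
    for seg in range(k, 0, -1):
        j = back[seg, j]
        bounds.append(j)
    return bounds[::-1][1:], best[k, n]  # internal boundaries, total cost
```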


In addition, tomorrow night I will give a concise overview of the entire transient modeling project at the Motorola/AES meeting. I hope to see everyone at both.

Thanks,
–Harvey


Multimodel Coding

Julius Smith <jos at ccrma> (CCRMA)

At tomorrow's DSP seminar (3:15, Thursday, CCRMA Ballroom), I will present overheads from my talk last week at the Institute for Mathematics and its Applications (IMA). This was a workshop consisting primarily of image compression researchers, so my remarks were aimed in that direction.

Abstract: Musical Signal Models for Audio Rendering

This talk will summarize several lines of research going on in the field of music/audio signal processing that are applicable to audio compression and data reduction. While the techniques were motivated originally by the desire for realistic "virtual musical instruments" (including the human voice), the resulting rendering models may be efficiently transmitted to a receiver as a "specialized decoder" in software form, which is then "played" by a very sparse data stream. In most cases, there is also a straightforward tradeoff between rendering quality and computational expense at the receiver. Since all models are built from well-behaved audio signal processing components, the distortion at low complexity levels tends to be of a high-level character, sounding more like a different instrument or performance than a distorted waveform.


Natural Voice Synthesis

Vicky Lu <vickylu at ccrma> (EE)

In previous quarters, I have developed a singing synthesis model for non-nasal voiced sounds. The associated analysis/re-synthesis procedure has shown that this model can generate natural-sounding voices with different voice qualities, ranging from pressed and normal to breathy modes. However, in addition to the analysis/resynthesis ability, a good synthesizer should provide easy control of the model parameters. By exploring the correlation between the model parameters, we could reduce the complexity of the control interface.

To generate sustained voiced sound via the proposed synthesis model, we need to specify a pitch contour, a glottal excitation strength (Ee) contour, a glottal shape (Rd) contour, and a vocal tract filter. For the breathy mode, we also need to specify model parameters for pitch-synchronous Gaussian amplitude-modulated noise. These parameters are often correlated with each other.

In the non-breathy phonation mode, it turns out that Rd and Ee are highly correlated. We can induce the Rd parameter from Ee; hence, we only need to specify the Ee contour, which is more intuitive. In the breathy phonation mode, a high-passed NHR (noise-to-harmonics ratio) is introduced to represent the aspiration strength. NHR is also highly correlated with Rd. Therefore, NHR is used to control the degree of breathiness in the breathy mode.

For sustained voiced-sound synthesis, the dynamic fluctuations of the model parameters are the key factors for generating human-like sound. In this talk, I will also discuss methods for describing these fluctuations for the purpose of synthesis.


Pitch Tracking Applied to South Indian Classical Music

Arvindh Krishnaswamy <arvindh at stanford> (EE)

I will first give an introduction to South Indian classical music, also known as "Carnatic music". I will briefly cover fundamental concepts like 'ragas', 'talas', and 'gamakas', and how they are used in practice.

I will also talk about the pitch intervals used in Carnatic music, and about some experiments and measurements I made myself with my violin and electronic tambura. I will present my pitch tracking scheme and explain some differences from traditional methods.
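
Measured pitch intervals of this kind are conveniently expressed in cents (a small helper, assuming the standard 1200 cents per octave):

```python
import numpy as np

def interval_cents(f1_hz, f2_hz):
    """Size of the interval between two pitches in cents."""
    return 1200.0 * np.log2(f2_hz / f1_hz)
```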

I will also present my ideas on how to approach the swara/gamaka pattern identification problem, and on how to use the concept of ragas to classify pieces of music.


Pattern recognition approaches to invert a violin physical model

Stefania Serafin <serafin at ccrma> (Music)

In this talk I will introduce different pattern-recognition techniques for estimating the input parameters of a bowed-string physical model. The aim is to invert the model, which means obtaining the correct input parameters corresponding to the right hand of the performer, i.e., the bow force, bow velocity, and bow position, from the output of a synthetic model and from recordings made on real instruments.

The ultimate goal is to be able to extract meaningful parameters from recordings made on real instruments and map those parameters to synthetic models, in order to obtain expressive synthesis.

All the approaches consist of assigning an a priori probability to each set of data and calculating the likelihood of each target spectrum using a set of training data for different values of bow velocity, bow position, and bow force.
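
A toy version of that scheme, assuming a unit-variance Gaussian likelihood on a precomputed spectral feature vector (all names and modeling choices here are illustrative, not the project's actual models):

```python
import numpy as np

def map_bow_params(target_feat, train_feats, train_params, prior):
    """Score each training example (features measured at known bow
    force/velocity/position) by Gaussian likelihood times prior, and
    return the parameters of the best match."""
    d2 = np.sum((train_feats - target_feat) ** 2, axis=1)
    log_post = -0.5 * d2 + np.log(prior)
    return train_params[np.argmax(log_post)]
```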

I will present the current results of this ongoing research, conducted in collaboration with Julius Smith and Harvey Thornburg.


Autumn Quarter 2000-2001

Adaptive Additive Synthesis for Nonstationary Sounds

Dr. Axel Robel <roebel at ccrma> (Berlin Technical University)

I will present a new approach to additive synthesis which is intended to model transient and nonstationary sounds by means of a superposition of partials with segment-wise linear amplitude and frequency evolution. Due to the nonstationarity of the partials, STFT analysis is not applicable (there is current research on estimating slope parameters from the STFT, but I do not rely on it); instead, the parameters are optimized by an adaptive approach. The higher flexibility of the model requires regularization to favor reasonable (desired) solutions.
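
The model's building block, a partial with segment-wise linear amplitude and frequency, can be synthesized directly (a minimal sketch):

```python
import numpy as np

def linear_partial(a0, a1, f0, f1, dur, fs):
    """One partial whose amplitude ramps a0 -> a1 and frequency ramps
    f0 -> f1 linearly over the segment."""
    t = np.arange(int(dur * fs)) / fs
    amp = a0 + (a1 - a0) * t / dur
    freq = f0 + (f1 - f0) * t / dur
    phase = 2 * np.pi * np.cumsum(freq) / fs   # integrate frequency
    return amp * np.sin(phase)
```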

The talk will have the following sections:

• Description of the problem, the model, and partial tracking

• Optimization objective

• Analytical investigation of the tracking performance in the case of simple partials and block-wise analysis

• Regularization terms that are used to prevent non-unique and undesired solutions

• Sound examples

• Description of planned work at ccrma:

– physical interpretation of the results obtained so far

– improve frequency resolution without decreasing parameter flexibility

– regularization for physically meaningful results

– handling of non-physical partials

Seminar overheads are available on the Web at

http://ccrma/~roebel/addsyn/


Time-variant modal decomposition and uses in transient modeling

Harvey Thornburg <harv23 at ccrma> (EE)

My talk focuses on a “piecewise-slowly-nonstationary” parametric model for tracking multiple sinusoids. The ultimate goal of the model is an analysis-synthesis method for transient sounds, where the assumption of local stationarity (so important to frequency-domain methods) fails to hold.

In the piecewise model, we assume sinusoidal parameters vary smoothly except on a finite set of points where abrupt changes occur. After we segment according to the boundaries of abrupt change, we constrain the AR parameters to vary according to a basis in which the number of functions indicates the degree of smoothness. Order-selection criteria may then be applied to choose both the number of components and their degree of smoothness. To extract sinusoidal parameters from this “AR” representation, we use Kamen’s time-variant cascade decomposition, and show exactly how this can be used for the modal decomposition. The modal decomposition is followed by state estimation for amplitude/phase information.
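A minimal sketch of the basis-constrained AR fit, assuming a polynomial basis for the coefficient trajectories (the abstract does not specify the basis, and Kamen’s cascade decomposition is not reproduced here):

    import numpy as np

    def tvar_fit(x, order, degree):
        """Least-squares fit of a time-varying AR model whose coefficients
        follow a polynomial basis: a_k(n) = sum_j c[k, j] * n**j, so that
        x[n] ~ -sum_k a_k(n) x[n - k]. Returns c, shape (order, degree+1)."""
        N = len(x)
        n = np.arange(N) / N                      # normalized time in [0, 1)
        cols = []
        for k in range(1, order + 1):
            for j in range(degree + 1):
                cols.append((n[order:] ** j) * x[order - k:N - k])
        A = np.stack(cols, axis=1)
        c, *_ = np.linalg.lstsq(A, -x[order:], rcond=None)
        return c.reshape(order, degree + 1)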

However, the modal decomposition does not guarantee that “smooth” or “slow” AR parameter variations result in the same behavior for the sinusoids. Fortunately, I can show that, in the case of convergent AR parameters, the corresponding sinusoidal parameters will have limit cycles which trace out a convex path, thus enabling a feedback stabilization of Kamen’s recursions which yields an at least asymptotically smooth trajectory for the sinusoids. The stabilization fails, however, in the case of real-valued modes.

In the talk, I review the entire system (both the segmentation and modeling tasks), then focus on the specifics of time-variant mode extraction as it relates to the problem of smoothness.


A statistical pattern recognition approach to speech endpoint detection

Yi-Wen Liu <jacobliu at stanford> (EE)

ABSTRACT: The talk will summarize what I did during the past summer as an intern at a startup company called VerbalTek, which is devoted to speech recognition applications on smaller platforms such as cellphones and PDAs. The problem we were facing was the noise robustness of speech recognition, and it turned out that the accuracy of speech endpoint detection (EPD) was what really mattered. We therefore developed a new EPD algorithm based on statistical pattern recognition. The key idea of our algorithm is to apply “N-weatherperson theory” to the speech/non-speech classification of each frame. The performance of this algorithm was shown to be better than that of conventional heuristic methods under various types of noisy environments. (A sketch of the frame-classification step follows the outline below.)

OUTLINE

Motivation: noise-robustness of EPD

Method and Theory

• Feature selection

• Modeling the feature statistics

• Application of “N-weatherperson theory” to speech/non-speech classification

• EPD based on classification

Experiments and results

• EPD under realistic noisy environments

• Recognition rates at various levels of SNR
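A minimal sketch of the per-frame speech/non-speech decision, using a generic diagonal-Gaussian likelihood ratio; the “N-weatherperson” combination rule itself is not reproduced, and all names are illustrative:

    import numpy as np

    def classify_frames(frames, m_sp, v_sp, m_ns, v_ns, prior_speech=0.5):
        """Per-frame speech/non-speech decision from diagonal-Gaussian
        models of a feature vector (e.g., log energy plus spectral features)."""
        def log_gauss(x, m, v):
            return -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v,
                                 axis=-1)
        ll_sp = log_gauss(frames, m_sp, v_sp) + np.log(prior_speech)
        ll_ns = log_gauss(frames, m_ns, v_ns) + np.log(1 - prior_speech)
        return ll_sp > ll_ns                       # boolean speech mask

    def endpoints(mask):
        """First and last speech-frame indices, or None if no speech found."""
        idx = np.flatnonzero(mask)
        return (idx[0], idx[-1]) if idx.size else None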


Music DSP in the Ivy League

Perry Cook <prc at cs.princeton.edu> (Princeton CS/Music)

Perry Cook slipped silently away from Stanford CCRMA in December of 1995. Since then he’s reappeared occasionally like a bad penny, but mostly he’s been roaming the woods of Central New Jersey, walking the same paths that were trod by Albert Einstein, John von Neumann, Alan Turing, Brooke Shields, and countless Lyme-tick-infested deer. This talk will be a shocking exposé of what Perry has been doing these last five years. Much of it will even have to do with DSP. Topics will include: performance interfaces for computer music, analysis and synthesis of sound effects, Music Information Retrieval tools and systems, and banded waveguides (for stiff structures).


Turbulence noise modeling for breathy singing voices

Vicky Lu <vickylu at ccrma> (CCRMA/EE)

The focus of my research is to improve the naturalness of the vowel tone quality via glottal excitation modeling. Based on a source-filter synthesis model, I have proposed to use the LF model for the glottal wave shape in conjunction with pitch-synchronous, amplitude-modulated Gaussian noise, which adds the flow-induced turbulence component to the glottal excitation. The turbulence noise model is especially important for perceiving breathy sound quality. Recently, I have been seeking a turbulence model with a better physical explanation.

In this talk, I will first review the acoustic theory of turbulence noise in the human voice, and summarize the existing turbulence models in the speech literature. Based on these studies, a series pressure source is adopted to model the turbulence source. I will derive the pressure source from the estimated glottal excitation and vocal-tract shape parameters, which were obtained via the SUMT algorithm in my previous study. The magnitude and the spectrum of the pressure source are then estimated. Since the area of the glottis varies while the turbulence is generated, the turbulence noise is nonstationary; hence its spectrum is time-varying. There are two alternatives for modeling the spectrum of the pressure source: one is a fixed spectrum-shaping filter representing the average spectrum; the other is a time-varying ERB model. I will describe both models in this talk.
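A minimal sketch of the pitch-synchronous, amplitude-modulated, high-passed Gaussian noise component mentioned above, assuming a constant f0 and an illustrative raised-cosine envelope (the envelope phase would in practice be aligned to the glottal closure instants):

    import numpy as np
    from scipy.signal import butter, lfilter

    def aspiration_noise(f0, fs, dur, depth=0.8, hp_hz=2000.0):
        """Gaussian noise, high-passed and amplitude-modulated once per
        glottal cycle (illustrative envelope shape and alignment)."""
        n = np.arange(int(dur * fs))
        noise = np.random.randn(len(n))
        b, a = butter(2, hp_hz / (fs / 2), btype='high')
        noise = lfilter(b, a, noise)
        # Raised-cosine envelope locked to the (assumed constant) cycle.
        env = 1.0 - depth * 0.5 * (1.0 + np.cos(2 * np.pi * f0 * n / fs))
        return env * noise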


Abstracts: Spring Quarter 1999-2000

Detection and Modeling of Transients in Analysis-Synthesis Frameworks

Harvey Thornburg <harv23 at ccrma> (EE)

My talk for the CCRMA DSP seminar (4/7 at 3:15 pm) consists of an update on my research on the detection and modeling of transients in analysis-synthesis frameworks. In contrast to some previous approaches, the focus is on parametric models and information-theoretic detection criteria. There are essentially six problem areas:

1. Develop model structure(s) for extensions to rapid decay and fast local variations

2. Detect abrupt changes, estimate locations.

3. Identify appropriate model structures for each region

4. Specify family of transformations

5. Stitch regions back together (in resynthesis)

6. Residual processing (nonparametric, allow artifacts)

For this talk, my focus is #2, the joint detection and temporal estimation of abrupt change points (segment boundaries). I will discuss the current state of online and offline methods, then detail a new solution which gives an efficient computational structure integrating features of both methods.

–Harvey


Glottal Excitation Modeling for the Singing Voice

Vicky Lu <vickylu at ccrma> (CCRMA/EE)

Naturalness of sound quality is essential for singing synthesis. Since 95% of singing is voiced sound, the focus of this study is to improve the naturalness of the vowel tone quality via glottal excitation modeling.

In addition to flexible pitch and volume control, the desired excitation model is expected to be capable of changing the voice quality, so that it can be modified from laryngealized (pressed) to normal to breathy phonation.

To trade off the complexity of the model against that of the analysis procedure needed to acquire its parameters, we propose to use a source-filter synthesis model based on a simplified human voice production system. The source-filter model decomposes the human voice production system into three linear systems: glottal source, vocal tract, and radiation. The radiation is simplified as a differencing filter. The vocal-tract filter is assumed all-pole for non-nasal sounds. The glottal source and the radiation are then combined as the derivative glottal wave, which we shall call the glottal excitation.

The effort is then to estimate the vocal-tract filter parameters and the glottal excitation that mimic the desired singing vowels. The deconvolution of the vocal-tract filter and the glottal excitation was developed via convex optimization. Through this deconvolution, one obtains the vocal-tract filter parameters and the glottal excitation waveform.

The next step is to build the glottal excitation synthesis model after the vocal-tract filter has been found. Since both the wave shape of the glottal excitation and the aspiration noise are important factors in the breathiness of the sound quality, the glottal excitation is considered in two parts: the smoothed quasi-periodic derivative glottal wave, and the glottis noise (turbulence noise). These two components are separated via wavelet decomposition of the glottal excitation waveform obtained from the deconvolution. The coarse wave shape of the smoothed derivative wave is to be modeled via the LF model. The noise part is then roughly modeled as pitch-synchronous, amplitude-modulated Gaussian noise with larger power around the glottal closure instants. Due to model mismatch and source-tract interaction, a model for the residual fine structure of the smoothed derivative glottal wave becomes necessary. (This part is still under investigation; suggestions are very welcome.)
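A minimal sketch of one LF-model cycle of the derivative glottal wave, simplified in that the open-phase growth constant is taken as a free parameter rather than solved from the full LF zero-net-flow constraint; parameter values are illustrative:

    import numpy as np

    def lf_pulse(fs, T0=0.008, Tp=0.0045, Te=0.006, Ta=0.0003,
                 Ee=1.0, alpha=60.0):
        """One cycle of an LF-style derivative glottal wave (simplified:
        alpha is free here instead of being fixed by zero net flow)."""
        t = np.arange(int(T0 * fs)) / fs
        # Return-phase constant from eps*Ta = 1 - exp(-eps*(T0 - Te)).
        eps = 1.0 / Ta
        for _ in range(20):
            eps = (1.0 - np.exp(-eps * (T0 - Te))) / Ta
        # Scale the open phase so the wave reaches -Ee at t = Te.
        E0 = -Ee / (np.exp(alpha * Te) * np.sin(np.pi * Te / Tp))
        open_phase = E0 * np.exp(alpha * t) * np.sin(np.pi * t / Tp)
        ret = -(Ee / (eps * Ta)) * (np.exp(-eps * (t - Te))
                                    - np.exp(-eps * (T0 - Te)))
        return np.where(t < Te, open_phase, ret)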

In this talk, the deconvolution and glottal excitation modeling results for both synthetic data and real baritone recordings will be shown. I will focus on the synthetic-data simulation results and discuss the impact of aspiration noise, GCI detection error, and source-filter interaction.


Wave Digital Filters (WDFs) Applied to Physical Modeling Problems

Stefan Bilbao <bilbao at ccrma> (EE)

Hi everyone,

I’ll be presenting material this week on wave digital filters (WDFs) applied to physical modelling problems. There will be two installments:

Tuesday, 3:15, CCRMA Ballroom: Quick recap of wave digital filtering principles (i.e., classical network-theoretic principles, wave variables, scattering junctions, the bilinear transform, etc.), then the beginning of an overview of the generalization of WDFs to the multi-D case, where they can be applied to the numerical integration of systems of partial differential equations.

Friday, 3:15, CCRMA Ballroom: Sequel of the overview, then applications of the technique to simulating the vibration of stiff beams, plates, and cylindrical shells, as well as some Matlab simulations.

Stefan


Impact of String Stiffness on Virtual Bowed Strings

Stefania Serafin <serafin at ccrma> (Music)

Recent work in the field of bowed-string synthesis has produced a real-time instrument which, despite its simplicity, is able to reproduce most of the phenomena that appear in real instruments. Current research consists of improving this model, including refinements made possible by improvements in hardware technology and the development of efficient digital signal processing algorithms.

In particular, I will focus on a technique used to model string stiffness, whose main effect is to disperse the sharp corners that characterize the ideal Helmholtz motion.

Another improvement concerns modeling the high-frequency violin body resonances using a 3D waveguide mesh. The playability of all the different features of the model will be examined.


Application of Wavelet and Other Orthonormal Transforms to a Perceptual Audio Codec with Switching Windows

Yi-Wen Liu <jacobliu at stanford> (EE)

Perceptual audio codecs using long windows produce an artifact called “pre-echo” when coding at 64 kbps or lower bitrates. To eliminate pre-echoes, we need to switch to shorter windows whenever a sudden increase in signal intensity is detected. However, there are some difficulties in doing perceptual coding when we switch to shorter windows:

• at lower bit rates (≤64 kbps), the bit reservoir is not big enough even to code the side information

• due to the time-frequency duality of the Fourier transform, frequency resolution is low when we use short windows; hence it doesn’t fully make sense to apply frequency-domain masking.

I will present a hybrid audio coder that does perceptual coding on long windows and wavelet coding on short windows. The coder aims to provide better sound quality than a plain perceptual coder does at 64 kbps. The outline of my talk follows (with a sketch of a simple attack detector after it):

• brief review about the following keywords: perceptual audio coding, window switch-ing, wavelet decomposition, etc.

• pre-echoes and motivation of window-switching

• attack detection algorithm

• the criteria for perfect reconstruction

• wavelet approximation as a problem of minimizing the L2 norm of the error

• experimental results: comparison of the error energy using different wavelet bases

• discussion about future work
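A minimal sketch of an attack detector of the kind the outline mentions, using a short-time energy-ratio test; the frame size and threshold are illustrative assumptions, not the coder’s actual algorithm:

    import numpy as np

    def attack_frames(x, frame=256, threshold=4.0):
        """Flag frames whose short-time energy jumps by more than
        `threshold` relative to the previous frame, signaling a
        window switch to avoid pre-echo."""
        nf = len(x) // frame
        e = np.array([np.sum(x[i*frame:(i+1)*frame] ** 2)
                      for i in range(nf)])
        ratio = e[1:] / (e[:-1] + 1e-12)
        return np.flatnonzero(ratio > threshold) + 1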


Abstracts: Spring Quarter 1998-1999

Speech Feature Based on Bark Frequency Warping: the Non-Uniform Linear Prediction (NLP) Cepstrum

Yoon Kim <yoonie at ccrma> (CCRMA/EE)

For this week’s CCRMA DSP seminar:

I’ll be presenting some recent results on the NLP technique for speech/speaker recognition. For those of you who missed my talk in the winter quarter, the NLP cepstrum is a speech feature based on Bark frequency warping and multiple all-pole (LP) modeling. It has been shown to effectively suppress speaker-dependent information while preserving the linguistic component of speech segments.

Also, I’ll be talking about solving spectral estimation and filter design problems using convex optimization. The above promises to be a good basis for achieving speaker normalization, i.e., a process of eliminating the speaker-dependent characteristics of speech uttered by multiple speakers.

Hope to see you!

Yoon

Time: 3:15pm Thursday, Apr 8. Place: CCRMA Ballroom


Joint Estimation of Glottal Source and Vocal Tract Filter from Speech Signal, and Some Fundamental Frequency Detection Methods

Vicky Lu <vickylu at ccrma> (CCRMA/EE)

For this week’s seminar, I will start my talk at 3:15 and Prof. Risset will give his talk at 4:30.

I will give an update on the joint estimation of glottal source parameters and the vocal tract filter from the speech pressure signal that I talked about last quarter. I will present some sound examples reconstructed from the estimated parameters. Finally, I will summarize what I learned from these simulations for voice synthesis.

Hope to see you there and give me suggestions!

Thanks!!

Vicky


Multiscale Modeling of Audio Textures

Harvey Thornburg <harv23 at ccrma> (EE)

Today I will present a joint project with Caroline Traube concerning multiscale modeling of audio textures. The ultimate goal is to obtain a generalization of 1/f processes rich enough to handle highly complex, nonstationary sounds such as rainfall, crackling fire, and the outputs of turbulent/chaotic systems. An important aspect of such a generalization is that models be easily adapted to the hypothesized correlation structure of real signals, such that new classes of transformations may be developed. The model should tell us exactly *to what extent* and *how* this scale-correlation structure exists.

For today, the pertinent framework is that of compression. Based on the work of Albert Benveniste, jointly with Michele Basseville, Alan Willsky, and many other researchers, we have developed a compression algorithm designed to measure and capture the scale-correlation structure of 1-D signals. The algorithm is analogous to LPC for time series, using the more general Levinson-Schur (or direct-PARCOR) decomposition, since it is impractical to set up Wiener-Hopf equations. I tried (I think successfully) to redevelop the method for time series, which I will present carefully as part of the background.

Two important developments beyond the previously published work, which are *vital* in the transition from theory to implementation, are (1) direct elimination of redundancy structure from the multiscale Levinson recursions, and (2) a correction for finite data length. Up to half the residuals get lost when assuming infinite data, independent of the data size. Point (2) is especially important because it makes the technique implementable on real signals, and it allows the kind of windowing one might use to adapt to changing scale-correlation structure in real signals.

Time permitting, I will present a brief history of the field.

Hope to see everyone there, and apologies for the lateness of the notice.

–Harvey


Impact of Torsion Waves and Friction Characteristics on the Playability of Virtual Bowed Strings

Stefania Serafin and Julius Smith <serafin at ccrma and jos at ccrma> (CCRMA)

Tomorrow’s DSP seminar will be presented by Julius and me. This week we will talk about physical modeling of bowed-string instruments. In particular, we focus on their “playability,” examining in which zones of a multidimensional parameter space “good tone” is produced. We focus on the influence of the torsional waves and of the shape of the friction curve. The aim is to analyze which elements of bowed-string instruments are fundamental for bowed-string synthesizers, and which can be neglected to reduce computational cost. An application that will be shown is an implementation in Max/MSP of a bowed-string model that can be controlled in real time using a graphical tablet. The playability region obtained is mapped to the parameters given by the tablet, to make it easier to “play” the model.

Hope to see you there.

Stefania

Time: 3:15pm Thursday, May 6. Place: CCRMA Ballroom


How to Calculate Constant Q Profiles and the Derivation of Toroidal Models of Inter-Key Relations

Hendrik Purwins <purwins at ccrma> (Berlin Technical University)

This talk will be a more technical follow-up to the overview at the CCRMA Colloquium two weeks ago. I will show three different derivations of toroidal models of inter-key relations (ToMIR):

• geometric explanation,

• emergence in the Self-Organizing Feature Map (SOM, Kohonen 82) trained by Shepard cadences previously processed by an auditory model,

• emergence in a SOM trained by averaged constant Q (cq-) profiles of Chopin’s preludes recorded by Cortot (1933/34).

In method (3), the cq-profiles are 12-dimensional vectors, each component referring to a pitch class; they can be employed to represent keys. Cq-profiles are calculated with the constant Q filter bank (Brown & Puckette 92), which gives equal resolution in all regions of the logarithmic frequency domain. The cq-profiles are also used for key recognition, and for investigating pitch use in Bach, Chopin, and Alkan.

IN MORE DETAIL:

The constant Q transform is employed for deriving the ToMIR in method (3), and for key analysis. The algorithm is efficiently implemented by calculating the kernel of the constant Q filters in the frequency domain in advance and exploiting the sparsity of that kernel. The 12-dimensional cq-profile is the concentrated spectral information supplied by the constant Q transform: each component of the profile indicates the strength of one pitch class, and a cq-profile is calculated by summing all values (bins) of the constant Q transform that belong to the same pitch class. Cq-profiles have the following advantages: (a) the 12 components correspond to values in probe-tone ratings (in a psychological probe-tone experiment, a quantitative description of a key is derived from rating the different pitch classes within the tonal context of that key (Krumhansl & Kessler 82)); (b) calculation is possible in real time; (c) stability is obtained with respect to sound quality; (d) cq-profiles are transposable.
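A minimal sketch of the folding step, assuming a semitone-spaced constant Q magnitude vector whose lowest bin’s pitch class is known; names are illustrative:

    import numpy as np

    def cq_profile(cq_mag, pc_of_bin0=0):
        """Fold a semitone-spaced constant Q magnitude spectrum into a
        12-dimensional cq-profile by summing bins of equal pitch class."""
        profile = np.zeros(12)
        for k, mag in enumerate(cq_mag):
            profile[(pc_of_bin0 + k) % 12] += mag
        return profile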

A cq-hierarchy is a sequence of 24 cq-profiles, each representing a different major or minor key. Cq-hierarchies enable the study of pitch use and automatic key recognition. We derive cq-hierarchies from cq-profiles of cadential chord progressions played on the piano, or from averaged cq-profiles of Prelude [& Fugue] cycles in all keys by Bach, Chopin, Alkan, and Scriabin. We investigate to what degree cq-hierarchies depend on (1) the musical interpretation, (2) the recorded instrument (piano or harpsichord), and (3) the piece of music. Results given by different interpretations by Cortot and Pogorelich (Chopin op. 28), and Gould and Feinberg (Bach ‘Well-Tempered Clavier’, Book I), and by different instruments, indicate a negligible influence of (1) and (2) on cq-hierarchies. Piano cq-profiles can be directly compared to the sum of a probe-tone rating and its transposition at the fifth, weighted by the strength of the third partial of piano tones. Cq-hierarchies display small but significant differences between Bach (WTC I + II) and romantic music (Chopin op. 28, Alkan op. 31) in the use of perfect fifths, major sixths, and major sevenths in major, and fourths, major sixths, and major and minor sevenths in minor.

In order to apply cq-hierarchies to key recognition, a special distance measure is introduced: the ‘fuzzy distance’ is a modified Euclidean distance which weights each component according to its inverse variance. In calculating key distances, we account for the fact that the components of the cq-profile which represent the key are consistently stable from one piece to another, whereas the strength of the minor sixths and of the major and minor sevenths in minor varies according to the use of the natural, melodic, and harmonic minor scales. The fuzzy distance between keys thus emphasizes stable values in the cq-profiles, while varying components (e.g., minor sixths, major and minor sevenths in minor) are de-emphasized. There are no errors in key assignment for any of the pieces of Bach’s WTC I.
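A minimal sketch of the inverse-variance-weighted distance as described, with illustrative function names:

    import numpy as np

    def fuzzy_distance(profile, key_mean, key_var):
        """Euclidean distance with each component weighted by its inverse
        variance, so stable profile components dominate the comparison."""
        return np.sqrt(np.sum((profile - key_mean) ** 2 / key_var))

    def recognize_key(profile, means, variances):
        """Index of the nearest of the 24 key cq-hierarchies."""
        d = [fuzzy_distance(profile, m, v)
             for m, v in zip(means, variances)]
        return int(np.argmin(d))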


Real-Time Chord Recognition of Musical Sound: A System UsingCommon Lisp Music

Takuya Fujishima <fujishim at ccrma> (CCRMA)

I describe a real-time software system which recognizes musical chords from an input signal of musical sounds. I designed an algorithm and implemented it in the Common Lisp Music real-time environment. The system runs on the Silicon Graphics O2 and, partly, on Linux. Experiments verified that the system succeeded in correctly identifying chords even in orchestral sounds. Details follow.

Traditionally, musical chord recognition is approached as a combination of polyphonic transcription, to identify the individual notes, followed by symbolic inference to determine the chord. This approach often suffers from recognition errors at the first stage. They result from various kinds of noise and from overlaps of the harmonic components of individual notes in the spectrum, and are difficult to avoid.

My chord detection algorithm also has two stages, but each of them differs from the traditional one. The first stage does not identify the individual notes. Instead, it generates a “pitch class profile (PCP)”: a set of intensities for the twelve chromatic pitch classes. It automatically unifies the various voicings of a single chord class. At the second stage, my algorithm does not do symbolic, rule-based inference. Instead, it does numerical processing on the PCP to find the most likely root and chord type. Here I have tried two numerical methods for comparison. I have also introduced a couple of heuristics to the system to improve the accuracy.
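A minimal sketch of a PCP computed from one FFT frame, plus a nearest-template decision; this is one plausible reading of the two-stage scheme, not the system’s actual code, and the reference frequency and template set are assumptions:

    import numpy as np

    def pcp(frame, fs, fref=130.81):
        """Pitch class profile: map each FFT bin to the nearest of the 12
        chromatic pitch classes and accumulate its power."""
        spec = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
        profile = np.zeros(12)
        for f, p in zip(freqs[1:], spec[1:]):      # skip the DC bin
            pc = int(round(12 * np.log2(f / fref))) % 12
            profile[pc] += p
        return profile

    def match_chord(profile, templates):
        """Nearest-template chord decision by normalized correlation
        (one possible 'numerical method' on the PCP)."""
        scores = [np.dot(profile, t) /
                  (np.linalg.norm(profile) * np.linalg.norm(t) + 1e-12)
                  for t in templates]
        return int(np.argmax(scores))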

I have implemented my algorithm using the Common Lisp Music real-time extension (CLM/RT). CLM is a flexible set of Common-Lisp-based synthesis and signal processing tools, and CLM/RT is its real-time extension: it compiles the Lisp code to binary executables to realize real-time sound processing and the graphical user interface. Using CLM/RT, my system takes in an audio signal from either a microphone or a sound file, does chord recognition, and displays the result in real time. The system runs on the Silicon Graphics O2.

I used various sound sources, including electronic keyboards and recordings on CDs. As for CDs, I chose tonal harmonic music from the classical and popular repertoires. In the experiments, I input the sound to the system through a line-in or a sound file, and the system displayed the recognition results. I then compared them with the chord names which I had assigned to the music materials in advance, to evaluate the accuracy of the results.

The results of my experiments showed that my system could recognize triadic harmonic events, and to some extent more complex chords such as sevenths and ninths, at the signal level. The system assigned correct chord names even to complicated orchestral sounds. Each numerical method tried seemed to have its advantages and disadvantages in terms of computational cost and accuracy. The heuristics introduced did improve the overall accuracy.


A Boundary Element Model for the Cylindrical Acoustic Tube

Professor Shyh-Kang Jeng <skjeng at ccrma> (Taiwan National University)

For next week’s DSP seminar, starting at 3:15 PM, I will give a talk about the propagation and scattering of acoustic waves in a cylindrical tube, with an emphasis on the higher-order modes. I will show two or three short movies of the propagation of the fundamental and higher-order modes, as well as the (colored) field distributions of the waveguide modes. The excitation and the equivalent filters of the higher modes will be addressed. The conventional concept of impedance discontinuity will also be shown to be an approximation. Some recent results of the BEM analysis of horn-like transitions between two cylindrical tubes will be given, one of which verifies the calculations made by David Berners using a simpler but less general approach. Possible applications of these higher-order mode analyses to improving the digital waveguide model will also be proposed.
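As background, the cut-on frequencies of these higher-order modes follow from the zeros of the Bessel-function derivatives; a minimal sketch (standard duct-acoustics result, not part of the BEM analysis itself):

    import numpy as np
    from scipy.special import jnp_zeros

    def mode_cutoffs(radius, c=343.0, m_max=2, n_modes=2):
        """Cut-on frequencies (Hz) of higher-order modes in a rigid-walled
        cylindrical tube: f = c * x / (2 pi a), where x ranges over the
        zeros of the Bessel derivative Jm'."""
        return {m: c * jnp_zeros(m, n_modes) / (2 * np.pi * radius)
                for m in range(m_max + 1)}

    # For a 1 cm radius tube, the first non-planar (m = 1) mode cuts on
    # near 1.8412 * 343 / (2 * pi * 0.01), about 10 kHz; below that only
    # the plane wave propagates, the usual digital-waveguide assumption.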

Please come and give me suggestions.

Sincerely,

Shyh-Kang Jeng


A Robust Constant Q Spectrum for Polyphonic Pitch Tracking

Professor Benjamin Blankertz <blanker at uni-muenster.de> (Institute for Mathematical LogicUniversity of Muenster, Germany)

In this talk, a method for improving the accuracy of frequency determination using the phase vocoder technique is presented. This makes it possible to calculate a reliable constant Q spectrum even at high time resolution, which is required, e.g., for polyphonic pitch tracking. Employing the constant Q spectrum reduces such tasks to a pattern recognition problem.

The derivation of an explicit form of the error function of the phase vocoder, in terms of the frequency and phase of the sinusoidal components of the input signal, allows the calculation of parameters for a 3-term Blackman window that minimize the expected error in the vicinity of a given frequency. This allows a robust and precise frequency determination.
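A minimal sketch of the underlying phase-vocoder frequency estimator, assuming the standard 3-term Blackman window (not the optimized one derived in the talk) and a hop small enough to avoid phase ambiguity:

    import numpy as np

    def bin_frequencies(x, frame, hop, fs):
        """Refine each bin's frequency from the phase advance between two
        successive STFT frames (standard phase-vocoder estimator);
        x must contain at least hop + frame samples."""
        w = np.blackman(frame)
        X0 = np.fft.rfft(w * x[:frame])
        X1 = np.fft.rfft(w * x[hop:hop + frame])
        k = np.arange(len(X0))
        expected = 2 * np.pi * k * hop / frame      # nominal bin advance
        dphi = np.angle(X1) - np.angle(X0) - expected
        dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi   # wrap to [-pi, pi)
        return (k + dphi * frame / (2 * np.pi * hop)) * fs / frame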


A Hybrid Waveguide Model of the Transverse Flute

Mark Bartsch <bartscma at flyernet.udayton.edu> (Visitor, University of Dayton / EE)

Recent years have seen substantial improvements in the modeling and synthesis of jet-reed instruments such as flutes and organ pipes. Developments in the modeling of woodwinds using so-called digital waveguides have allowed efficient acoustical modeling of the main body and toneholes of the instrument. Further, a new and more complete model of the jet’s behavior in the presence of the edge and the resonator has been formulated for recorder-like instruments in [M. P. Verge, A. Hirschberg, and R. Causse, J. Acoust. Soc. Am. 101, 2925–2939 (1997)]. A new simulation model of the transverse flute is presented which combines the contributions of digital waveguide modeling and the new source model. This new model is defined almost entirely by physical parameters (such as the dimensions of the instrument) rather than by the arbitrary adjustment parameters often employed. The model is evaluated by comparing the effects of certain parameters on the model’s operation with their effects on the performance of an actual flute. The model is further evaluated for its tuning characteristics by comparing its frequencies of oscillation with the sounding frequencies of a simple flute with matched physical parameters. [Work supported by the University of Dayton Honors Program.]


Abstracts: Winter Quarter 1998-1999

Joint Estimation of Glottal Source and Vocal Tract Filter from Speech Signal, and Some Fundamental Frequency Detection Methods

Vicky Lu <vickylu at ccrma> (CCRMA/EE)

For this week’s seminar, I will talk about the joint estimation of the glottal source and vocal tract filter from the speech signal, and about some fundamental frequency detection methods.

The goal is to build a voice model which is flexible enough for voice modification and interpolation between different states, yet simple enough that we can estimate the model parameters to reproduce a known speech signal. To ease the estimation, a source-filter voice model is chosen. In this talk, I will review the source-filter speech model and some methods for estimating the glottal source and vocal tract filter from the speech signal. A primitive model and its associated estimation procedure (not complete yet...) will be introduced.

Fundamental frequency estimation is often necessary for pitch-synchronous joint estimation of the glottal source and vocal tract filter; the fundamental period is defined here as one glottal cycle. I will talk about two frequency-domain methods (the cepstrum-based method and the harmonic product spectrum) and two time-domain methods (AMDF and SRFD).
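Minimal sketches of two of the mentioned estimators, with illustrative search ranges:

    import numpy as np

    def amdf_f0(x, fs, fmin=60.0, fmax=500.0):
        """Average magnitude difference function: the lag minimizing the
        mean |x[n] - x[n-lag]| estimates one glottal cycle."""
        lags = np.arange(int(fs / fmax), int(fs / fmin))
        d = [np.mean(np.abs(x[lag:] - x[:-lag])) for lag in lags]
        return fs / lags[int(np.argmin(d))]

    def hps_f0(x, fs, nharm=4):
        """Harmonic product spectrum: multiply the magnitude spectrum
        downsampled by 1..nharm and pick the peak."""
        spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
        n = len(spec) // nharm
        prod = np.ones(n)
        for h in range(1, nharm + 1):
            prod *= spec[::h][:n]
        return (np.argmax(prod[1:]) + 1) * fs / len(x)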

See you there!

vicky


Speech Feature Based on Bark Frequency Warping: the Non-Uniform Linear Prediction (NLP) Cepstrum

Yoon Kim <yoonie at ccrma> (CCRMA/EE)

In statistically based speech recognition systems, choosing a feature that captures the essential linguistic properties of speech while suppressing other acoustic details is crucial. This is underscored by the fact that the performance of the recognition system is bounded by the amount of linguistically relevant information extracted from the raw speech waveform: information lost at the feature extraction stage can never be recovered during the recognition process.

Some researchers have tried to convey perceptual importance in such features by warping the spectrum to resemble the auditory spectrum. One example is the Perceptual Linear Prediction (PLP) method proposed by Hermansky, where a perceptually motivated filterbank is used to warp the spectrum, followed by scaling and compression of the spectrum. While PLP provides a good representation of the speech waveform, its ability to model the peaks of the speech spectrum (the formants) depends on the fine structure of the FFT spectrum. This can, for instance, hinder the modeling of formants in female speech through filterbank analysis, since there are fewer harmonic peaks under a formant region than in the male case. Also, the various processing schemes used require memory, table-lookup procedures, and/or interpolation, which can be computationally inefficient.

I will talk about a new method of obtaining parameters from speech that is based on frequency warping of the vocal-tract spectrum. The warping is achieved by applying the Bark bilinear transform (BBT) to a uniform frequency grid to generate a grid that incorporates the non-uniform resolution properties of the human ear. Experimental results showing the superior performance of the NLP cepstrum with respect to conventional speech features will be presented.
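A minimal sketch of the frequency mapping induced by a first-order allpass substitution, with one published closed-form fit (Smith and Abel) for the Bark-warping coefficient; treated here as an illustration, not the talk’s exact procedure:

    import numpy as np

    def bark_warp_pole(fs):
        """Closed-form fit (Smith & Abel) for the allpass coefficient that
        approximately maps the unit circle onto the Bark scale; fs in Hz.
        E.g., bark_warp_pole(44100.0) is about 0.756."""
        return 1.0674 * np.sqrt((2.0 / np.pi)
                                * np.arctan(0.06583 * fs / 1000.0)) - 0.1916

    def warped_frequency(w, rho):
        """Warped radian frequency of the first-order allpass substitution
        z^-1 -> (z^-1 - rho) / (1 - rho * z^-1)."""
        return w + 2.0 * np.arctan2(rho * np.sin(w), 1.0 - rho * np.cos(w))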


Transaural Stereo and HRTF Modeling Review

Julius Smith <jos at ccrma> (CCRMA)

I will present some interesting reading I did recently on the subject of 3D sound and “transaural stereo”. Consider the following questions:

• What is transaural stereo exactly? Does it really work? How well?

• What is the Cooper-Bauck “shuffle structure” for cross-talk cancellation? How does it improve on the original Schroeder-Atal structure?

• What is “diffuse field equalization” exactly?

• How long is an HRTF in time, and how much can it be shortened?

• How are HRTF filters designed these days?

• Is it worthwhile to implement FIR HRTFs using the FFT, or is time-domain convolution faster?

• What are some of the main patents in 3D auditory display and transaural stereo?

If you don’t know the answers to any of the above questions, you might be interested in coming!

Julius


A Perceptually Based Audio Signal Model with Application to Scalable Compression

Tony Verma <verma at furthur.stanford> (EE)

This week I’ll be giving a first run of my orals talk. As such, comments, suggestions, and questions will be greatly appreciated.

A Perceptually Based Audio Signal Model with Application to Scalable Compression

Abstract (which is a bit long...)

Audio delivery in network environments such as the Internet, where bandwidth is not guaranteed, packet loss is common, and users connect to the network at various data rates, demands scalable compression techniques. Scalability allows each user to receive the best possible audio quality given the current network condition. In addition, because the separation principle for source and channel coding does not apply to lossy packet networks, an audio source coding technique that explicitly considers channel characteristics is desirable. These goals can be achieved by using a higher-level description of audio than the actual waveform. This talk will focus on a method for extracting meaningful parameters from general digital audio signals that takes into account the way humans perceive sound; moreover, the application of this parametric model to scalable audio compression will be discussed.

The model consists of three major components: sines, transients, and noise. These underlying signal components are found during the analysis stage of the model. Quantizing and compressing the resulting model parameters allows for efficient storage and transmission of the original audio signal. The talk will cover enhancements made to current sine models; these enhancements allow explicit perceptual information to be included in the sinusoidal model. In addition, a novel transient modeling technique will be covered.

The three-part model provides an efficient, flexible, and perceptually accurate representation for audio signals. It is therefore appropriate for scalable compression over lossy packet networks. The efficiency of the model ensures high compression ratios. Flexibility simultaneously allows scalability and robustness to channel characteristics such as packet losses, because subsets of the model parameters represent the original signal with varying degrees of fidelity. Perceptual accuracy ensures that parameter subsets reasonably represent the original signal, while the complete parameter set represents the original exactly in a perceptual sense.

Most current techniques for audio compression (e.g., MPEG audio layer 3 and AAC, Real Audio’s G2, etc.) use a subband decomposition in conjunction with psychoacoustic models to compress the actual audio waveform itself; no model of the signal is assumed. These compression techniques have been very successful at targeted fixed bit rates; however, they cannot be scaled in large steps without severe loss in quality. This is evident in the case of Real Audio, where a database will store many versions of an audio signal at various bitrates (e.g., 92 kbps, 64 kbps, 32 kbps, 20 kbps, and 16 kbps) and quality levels. Because using an underlying model allows meaningful subsets of parameters to describe the original signal, one compressed bitstream (e.g., 96 kbps) can be stored; embedded within this bitstream are lower-bitrate versions (e.g., 64 kbps, 32 kbps, 20 kbps, and 16 kbps) that can be easily extracted. Sound demos of the audio compression scheme will be played.

Hope to see you there!

-Tony

Musical instrument mechanics

Jim Woodhouse <jw12 at eng.cam.ac.uk> (University of Cambridge)

An informal overview of interesting mechanical vibration problems associated with musical instruments. The talk will range over tuned percussion, the musical saw, and the design of the violin body.

Bowed-string modeling and simulation

Jim Woodhouse <jw12 at eng.cam.ac.uk> (University of Cambridge)

Following on from the previous colloquium, this talk will present a more detailed investigation of modeling the physical action of a bowed string, including the effects of the string’s torsional properties and bending stiffness and of the finite width of the ribbon of bow hair. Recent work will also be described in which the physics of rosin has been investigated. In terms of detailed physics, although not necessarily in terms of effectiveness for musical simulation, the model for rosin friction which has always been used in the past will be shown to be quite wrong.

Abstracts: Autumn Quarter 1998-1999

1. 09/24: Jyri Huopaniemi (Visiting Researcher): HRTF filter design

Objective and Subjective Evaluation of Head-Related Transfer Function Filter Design

(to be presented at the AES 105th Convention, San Francisco, Sept. 26-30, 1998)

Jyri Huopaniemi, Nokia Research Center, Helsinki, Finland
<jyri.huopaniemi at research.nokia.com>

(work carried out while the author was a visiting researcher at CCRMA, Stanford University)


In this presentation, the problem of modeling head-related transfer functions (HRTFs) is addressed. Traditionally, HRTFs are approximated in real-time applications using minimum-phase reconstruction and various digital filter design techniques, yielding FIR or IIR structures. In this work, binaural auditory modeling has been applied to the analysis of HRTF filter design, and design methods have been compared from the point of view of auditory perception. This paper presents applicable, perceptually valid smoothing and filter design techniques, and discusses listening-test results for localization and timbre degradation using individualized HRTFs.
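The minimum-phase reconstruction mentioned above is commonly done via the real cepstrum; a minimal sketch of that standard step (not the paper’s specific design method):

    import numpy as np

    def minimum_phase(h, nfft=1024):
        """Minimum-phase version of an impulse response via the real
        cepstrum: fold the anticausal cepstrum onto the causal side."""
        H = np.fft.fft(h, nfft)
        c = np.fft.ifft(np.log(np.abs(H) + 1e-12)).real
        fold = np.zeros(nfft)
        fold[0] = c[0]
        fold[1:nfft // 2] = 2 * c[1:nfft // 2]
        fold[nfft // 2] = c[nfft // 2]
        return np.fft.ifft(np.exp(np.fft.fft(fold))).real[:len(h)]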

2. 10/08: Scott Levine (EE) thesis defense preview (real thing 10/16)

Audio Representations for Data Compression and Compressed Domain Processing

(practice run for Scott’s PhD/EE thesis defense)

Scott Levine <scottl at ccrma> (CCRMA/EE)

In the world of digital audio processing, one usually has the choice of performing modifications on the raw audio signal or of data-compressing the audio signal; performing modifications on a data-compressed audio signal has proved difficult in the past. This thesis provides a new representation of audio signals that allows both very low bit-rate audio data compression and high-quality compressed-domain processing and modifications. In this context, the processing possibilities are time-scale and pitch-scale modifications.

This new audio representation segments the audio into separate sinusoidal, transient, and noise signals. During detected attack-transient regions, the audio is modeled by well-established transform coding techniques. During the remaining non-transient regions of the input, the audio is modeled by a mixture of multiresolution sinusoidal modeling and noise modeling. Careful phase-locking techniques at the time boundaries between the sines and transients allow for seamless transitions between representations. By separating the audio into three individual representations, each can be efficiently and perceptually quantized. Demos of audio compression to 32 kbps (22:1) and time-scale modification between 50% and 200% will be played.

3. 10/15

Acoustic Stability of a Cylinder with Conical Cap

Julius Smith <jos at ccrma> (CCRMA)

This talk presents a proof of stability for a cylindrical acoustic tube terminated by a conical cap. The reason this takes some work is that, as is well known, the reflection and transmission transfer functions at the cylinder-cone junction are all unstable one-pole filters. Another strange manifestation of this acoustic configuration is that the reflection transfer functions for pressure waves at the junction converge to -1 (the transmittances converge to zero), while from basic physical reasoning the reflectance of the conical cap as a whole must converge to +1. It is shown that this “paradox” is associated with two pole-zero cancellations at dc in the conical cap reflectance, as seen from the cylinder.

4. 10/22

Estimation of model parameters for the two-mass glottal model

Vicky Lu <vickylu at ccrma> (CCRMA/EE)

The objective of this work is to build a better glottal source model for the singing synthesizer. To obtain a higher quality of voiced sound, glottal source modeling is very promising. In the speech literature, a two-mass model has been used for quite some time. One advantage of the two-mass model is that it accounts for the source-tract interaction automatically, which is important for the high-pitched voice. Moreover, the model is not too complicated computationally. Hence, a two-mass model was chosen for my singing synthesizer.

However, choosing proper system parameters of the two-mass model for different pitches is not trivial. This is the problem I addressed during my stay at NTT in Japan (Sep. 2nd to Oct. 13th). I assume that I can obtain the glottal flow data from measurements of oral flow. Using this glottal flow data, the system states (displacements and velocities of the two masses, damping of the masses, and relative coupling stiffness) are then obtained using the Extended Kalman Filter (EKF).

In my talk, I will introduce the two-mass model I used. I will also show some preliminary simulation results and discuss the estimation formulation. A videotape introducing the experimental procedure will be shown.


5. 10/29

Investigation of Perceptually Relevant Features for Speech Processing

Yoon Kim <yoonie at ccrma> (CCRMA/EE)

Choosing an optimal feature set for speech is essential in analyzing, recognizing, and synthesizing speech. Also, in the problem of speaker normalization, the choice of parameters for the analysis model becomes crucial. Traditionally, researchers in the speech recognition field have used the weighted cepstrum as the feature describing the acoustic aspects of speech. The cepstrum is often derived from LPC coefficients by a simple transformation formula.

Some researchers have tried to convey perceptual importance in such features by warping the spectrum to resemble the auditory spectrum. They found that the formant structure of the speech spectrum could be captured well using significantly fewer parameters than normal LPC analysis requires. Also, Hermansky showed that the warped spectrum agrees well with results in psychoacoustics, such as the effective second formant and the 3.5-bark spectral-peak integration theory.

Smith and Abel introduced methods to derive the optimal conformal mapping for warping the spectrum to resemble the Bark spectrum (the Bark bilinear transform). The transformation itself is an allpass transfer function, thus mapping the unit circle onto itself. For a rational transfer function, the Bark bilinear transform (BBT) preserves the order of the polynomials in the numerator and the denominator. Thus, we can obtain a warped cepstrum by taking the BBT of the LPC predictor polynomial, then converting it into the cepstrum (the LPC-BBT cepstrum).

In this talk, which serves as a report on research in progress, I’ll compare the performance, in terms of vowel separability, of the LPC-BBT cepstrum and conventional speech features. I’ll also address the issue of the computational complexity of obtaining these features. Finally, I will talk about future plans and applicable problems.


6. 11/05: Peter Lindener (VR): TI 326x00 DSP applications

7. 11/12

Sampling Rate Estimation Techniques in the Discretization of Nonlinear Systems

Harvey Thornburg <harv23 at ccrma> (CCRMA/EE)

Nonlinearities are prevalent in acoustic and circuit modeling. Since nonlinear elements tend to expand the bandwidth of their input, direct implementation in a digital system results in aliasing. A multirate system can be used to reduce aliasing at additional computational cost. The problem then becomes one of choosing an appropriate sampling rate at which to run the nonlinear element.
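A minimal sketch of the multirate idea, using a tanh saturator and an assumed oversampling factor (the talk’s contribution is estimating what that factor should be):

    import numpy as np
    from scipy.signal import resample_poly

    def saturate_oversampled(x, factor=4):
        """Run a memoryless saturator at `factor` times the base rate so
        that the bandwidth expansion it causes aliases less after the
        band-limiting decimation."""
        up = resample_poly(x, factor, 1)    # interpolate
        y = np.tanh(up)                     # bandwidth-expanding element
        return resample_poly(y, 1, factor)  # band-limit and decimate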

To this end, one must first be able to estimate the output spectrum given certain “worst-case” statistics of the input (i.e., power and bandwidth). Many useful nonlinearities are memoryless; they are often hard (nonanalytic) or defined by an implicit relation. A technique based on iterated convolutions and Taylor series is extensible to general Volterra systems; however, it cannot handle hard elements, and much complexity is required for the implicit cases. I will present a new technique for the memoryless case based on decomposition into a set of FM operators, and show how the frequency-domain computations are simplified when brought into the “amplitude domain”. This technique is valid for any bounded nonlinear map.

Areas of application include soft saturation (sigmoid) elements and the quantization of narrowband signals. I will show a completely general method for variably soft sigmoid construction which supersedes methods previously discussed (such as p-circle methods), and present sampling-rate estimates for different measures of softness over a range of overdrive settings.

Classical theories model quantization effects as additive broadband noise. If this “noise” were truly uncorrelated with itself and with the input signal, aliasing of the quantization error would be imperceptible. I will attempt to show, using spectral estimation techniques, at what bit depths the classical assumptions break down, and to what degree quantization and aliasing couple in the digitization of a narrowband signal.

8. 11/19: Craig Sapp (CCRMA): Optical Gesture Recognition


9. 12/03

Sinusoidal and Transient Modeling Using Frame-Based Perceptually Weighted Matching Pursuits

Tony Verma <verma at furthur.stanford> (CIS/EE)

The talk will focus on a well-known subject at CCRMA: sinusoidal modeling. I’ll first talk about a method of sinusoidal modeling that explicitly takes into account the psychoacoustics of human hearing, using extensions to an analysis-by-synthesis technique known as matching pursuits. An efficient transient model that uses sinusoidal modeling in a rotated space will then be discussed. Current noise models fit well with the sine and transient models, forming a flexible, compact model for many audio signals. The three-part model coherently captures the salient features of an audio signal, which is important for meaningful modifications such as time-scaling and pitch-shifting, examples of which will be played.
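A minimal, unweighted matching pursuit over a cosine/sine dictionary for one frame; the perceptual weighting that is the talk’s focus is omitted, the candidate frequency grid is an assumption, and the cosine/sine projections are treated as approximately orthogonal:

    import numpy as np

    def matching_pursuit(x, freqs, fs, n_iter=10):
        """Greedily pick, n_iter times, the frequency whose quadrature
        pair best matches the residual; returns (f, a, b) atoms."""
        t = np.arange(len(x)) / fs
        residual = x.astype(float).copy()
        atoms = []
        for _ in range(n_iter):
            best = None
            for f in freqs:
                c = np.cos(2 * np.pi * f * t)
                s = np.sin(2 * np.pi * f * t)
                a = (residual @ c) / (c @ c)
                b = (residual @ s) / (s @ s)
                # Approximate energy removed (assumes c, s ~ orthogonal).
                gain = a ** 2 * (c @ c) + b ** 2 * (s @ s)
                if best is None or gain > best[0]:
                    best = (gain, f, a, b, c, s)
            _, f, a, b, c, s = best
            residual -= a * c + b * s
            atoms.append((f, a, b))
        return atoms, residual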
