
Algorithms in Signal Processors

Audio and Video Applications

2010

DSP Project Course using

Texas Instruments TMS320C6713 DSK and TMS320DM6437

Dept. of Electrical and Information Technology, Lund University, Sweden


Contents

I Speech Recognition (Divyesh V. Vaghani, Farhan Ahmed Khan)

1 Voice Generation in Human Body
2 Introduction
3 How To Recognize The Speech
   3.1 Voice Detector
   3.2 Autocorrelation
   3.3 Pitch Detector
4 FLOW CHART
5 DESIGN COMPONENTS

II Speech Recognition (Dan Liu, Hongwan Qin, Ziyang Li, Zhonghua Wang)

1 Introduction
2 The algorithm
   2.1 Speech Recognition System
      2.1.1 Preprocessing
      2.1.2 Feature extraction
      2.1.3 Reference creation
   2.2 Speech Recognition algorithm
      2.2.1 Partition and Pre-emphasis
      2.2.2 Detector
      2.2.3 Autocorrelation
      2.2.4 Schur recursion
      2.2.5 Comparison
3 How to realize the whole speech recognition
   3.1 MatLab implementation
   3.2 DSP implementation
4 The problem and solution
5 The conclusions

III MIDI Synthesizer (Nauman Hafeez, Waqas Shafiq)

1 Introduction
2 Description
   2.1 MIDI Standard
   2.2 Envelope
      2.2.1 ADSR
3 Synthesis Techniques
   3.1 FM Synthesis
   3.2 Additive Synthesis
   3.3 Subtractive Synthesis
   3.4 Wavetable Synthesis
4 Implementation
5 Conclusion

IV Pitch Estimation (Aravind K Annavaram V, Mohammed Azher Ali, Mirza Jameel Baig, Surendra Reddy U)

1 Introduction
2 Different methods of pitch estimation
   2.1 Time-domain approaches
   2.2 Frequency-domain approaches
3 Cepstrum analysis
4 The Cepstrum algorithm implementation
   4.1 The MatLab algorithm
   4.2 Observations
   4.3 Algorithm for the DSP
   4.4 FFT
   4.5 Absolute and Logarithm Function
   4.6 IFFT
   4.7 Maximum
   4.8 Pitch Estimation
5 Simulation Results
6 Problems encountered
7 Conclusion

V Video Processing (Bilgin Can, Ma Ling, Lu Fei)

1 Introduction
2 Object detection
   2.1 Moving object location
      2.1.1 Image differential
      2.1.2 Noise reduction
      2.1.3 Other optimizations
   2.2 Edge detection
      2.2.1 Canny edge detection
      2.2.2 Canny implementation in a real-time system
   2.3 Summary for object detection
3 Object identification
   3.1 The three categories' features
   3.2 Classification implementation in a real-time system
   3.3 Summary for object identification
4 Result assessment
   4.1 Evaluation for object detection and classification
   4.2 Performance analysis for the real-time system
   4.3 Assessment summary
5 Conclusion and further work

VI Video Processing - Light Saber (Kashif, Raheleh, Shakir, Vineel)

1 Introduction
   1.1 Development Kit
   1.2 MATLAB
   1.3 Code Composer Studio
2 Background
   2.1 Chroma Key
   2.2 Filtering
   2.3 PAL System
   2.4 Color space representation
3 Implementation
   3.1 MATLAB
   3.2 Code Composer Studio
4 Simulation results
5 Conclusion and Future work

VII Reverberation (Syed Zaki Uddin, Farooq Anwar, Mohammed Abdul Aziz, Kazi Asifuzzaman)

1 Introduction
   1.1 Introduction to Reverberation
   1.2 Characteristics of Reverberation
   1.3 Simulation of Reverberation
2 Reverberator types
   2.1 Reverberation Chamber
   2.2 Plate Reverberator
   2.3 Spring Reverberator
   2.4 Digital Reverberator
3 Reverberation Modeling
   3.1 Impulse Response
   3.2 Early Reflection Modeling
   3.3 Late Reflection Modeling
   3.4 Reverberation Algorithms
      3.4.1 Schroeder Algorithm
      3.4.2 Moorer Algorithm
      3.4.3 Gardner Algorithm
      3.4.4 Dattorro Algorithm
      3.4.5 Jot Algorithm
4 Reverberator Design
   4.1 Jot Reverberation Algorithm
   4.2 Coefficients calculation
      4.2.1 Modal Density (Dm(f))
      4.2.2 Frequency and Time Density
      4.2.3 Echo Density (De)
      4.2.4 Energy Decay Curve (EDC)
      4.2.5 Reverberation Time (Tr)
      4.2.6 Energy Decay Relief (EDR)
   4.3 Designing the Jot Reverberator
      4.3.1 Delay Length (M)
      4.3.2 First-order Low-pass Filter H(i)
      4.3.3 Effects of Matrices B and C
      4.3.4 Tonal Correction Filter
5 Real-time Reverberation Implementation
   5.1 Implementation Model
   5.2 CBUFF
   5.3 H(z)
   5.4 A Matrix
   5.5 T(z)
   5.6 conv
6 Results

VIII Pitch Estimation / Singstar (Anil kumar Metla, Anusha Gundarapu, Yaoyi Lin)

1 Introduction
   1.1 Pitch
   1.2 Introduction to the project
2 Pitch Estimation Algorithms in Theory
   2.1 Introduction
      2.1.1 Time-domain Fundamental Period Pitch Detection
      2.1.2 Auto-correlation Pitch Detection
      2.1.3 Adaptive Filter Pitch Detectors
      2.1.4 Frequency-domain Pitch Detectors
      2.1.5 Pitch Detection Based on Models of the Ear
   2.2 Cepstrum
   2.3 Auto Correlation
3 Project Implementation
   3.1 Implementation in Matlab
   3.2 Implementation on C6713
   3.3 Singstar Extension
4 Results
5 Conclusion


Part I

Speech Recognition

Divyesh V. Vaghani, Farhan Ahmed Khan

Abstract

This project report discusses one possible solution for a speech recognition algorithm. Using a filtering process and autocorrelation, coefficients are computed for a spoken word; by comparing these with the coefficients of a small set of reference words, the algorithm decides which word the input most resembles. The algorithm was first verified with the MATLAB tool; after a successful result, the whole code was written in C for the DSP implementation.


Figure 1: Block diagram of the system

1 Voice Generation in Human Body

The human voice is generated in the human body by the vocal cords for talking, singing, laughing, crying, screaming, etc. The human voice is specifically that part of human sound production in which the vocal cords are the primary sound source. Generally speaking, the mechanism for generating the human voice can be subdivided into three parts: the lungs, the vocal cords within the larynx, and the articulators. The lungs, which work as a pump, must produce adequate airflow and air pressure to vibrate the vocal cords (this air pressure works as the fuel of the voice). The vocal cords are a vibrating valve that chops up the airflow from the lungs into audible pulses that form the laryngeal sound source. The muscles of the larynx adjust the length and tension of the vocal folds to control pitch and tone. The articulators, the parts of the vocal tract above the larynx (tongue, palate, cheeks, lips, etc.), articulate and filter the sound emanating from the larynx, and can to some degree interact with the laryngeal airflow to strengthen or weaken it as a sound source. The vocal cords, in combination with the articulators, are capable of producing highly intricate arrays of sound.

Adult male voices are usually lower-pitched and have larger cords. The male vocal cords are between 17 mm and 25 mm in length, while the female vocal cords are between 12.5 mm and 17.5 mm in length. This difference in vocal cord size between men and women means that they have differently pitched voices.

2 Introduction

Speech recognition is the process of converting an acoustic signal, captured by a microphone, to a set of words. These recognized words can be the final result, as in applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding. In the speech recognition process, several parameters are used to characterize the capability of speech recognition systems. [4]

The requirement of this project was originally to recognize different words spoken by different people. However, we were more interested in the speech recognition part and chose to work on that instead. The aim of our project is to implement a speech recognition algorithm on a DSP. As the project name, Speech Recognition Algorithm Implementation on DSP, suggests, we present a method for determining which word, from a specific small set of words, is being spoken.

To accomplish this goal, we first capture a signal through the headset and preprocess it with a voice detection technique that decides whether it contains sound or silence. The signal is then passed through a band-pass filter. In the case of a sound, the signal is processed by autocorrelation and then by pitch detection; after the pitch detection, the code is implemented on the DSP [3].

3 How To Recognize The Speech

3.1 Voice Detector

The goal of the detector is to detect when a word starts and when it ends. To know that, we check the data going through the buffer at each sample time. First of all, we have to define what a word is in order to distinguish it from noise. We use the principle that if the value read from the buffer is higher than a certain level, the value corresponds to a sound; if not, we consider it to be noise. Because we are working in terms of frames (chunks), a frame is useful if it contains a sample with an amplitude higher than the chosen level; if not, the frame is considered noise and we do not take it into account. We therefore check whether the current frame contains a sample with sound, and if so, the autocorrelation algorithm is performed on that frame. If the current frame contains a sound and the previous one did not, we know that it is the beginning of a word. But how do we know when we are at the end of a word rather than just in a pause within it? This proves to be a bit more difficult: for instance, when we say the word "mama" there is involuntarily a small silence between the "m" and the "a", and since these silences are not detected as sound, the detector would think that the word has ended. To avoid this problem and to know when a word really ends, we decided to check the next bunch of frames: if there is no sound in that bunch of frames, we say that the word really ended; if not, we are still inside the word and we keep performing the autocorrelation. [5]
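As a concrete illustration, here is a minimal C sketch of this frame-based decision. The frame length, amplitude level and the number of silent frames tolerated inside a word are hypothetical values chosen for the example; the report does not specify them.

```c
#include <stdlib.h>

#define FRAME_LEN   128   /* samples per frame (chunk), illustrative */
#define LEVEL       2000  /* amplitude separating sound from noise   */
#define HANG_FRAMES 8     /* silent frames tolerated inside a word   */

/* A frame is useful if any sample exceeds the chosen level. */
static int frame_has_sound(const short *frame)
{
    for (int i = 0; i < FRAME_LEN; i++)
        if (abs(frame[i]) > LEVEL)
            return 1;
    return 0;
}

/* Call once per frame; returns 1 while a word is in progress.
   Short pauses (as inside "mama") are bridged by HANG_FRAMES. */
static int word_in_progress(const short *frame)
{
    static int in_word = 0, silent = 0;

    if (frame_has_sound(frame)) {
        in_word = 1;            /* start or continue the word */
        silent  = 0;
    } else if (in_word && ++silent >= HANG_FRAMES) {
        in_word = 0;            /* long silence: the word ended */
        silent  = 0;
    }
    return in_word;
}
```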

3.2 Autocorrelation

The basic definition of autocorrelation is the correlation of a signal with itself; if the signals are different, it is instead called cross-correlation. Autocorrelation is also a mathematical tool frequently used in signal processing for analyzing functions or series of values, such as time-domain signals. [1]

Autocorrelation is useful for finding repeating patterns in a signal, such as determining the presence of a periodic signal which has been buried under noise, or identifying the fundamental frequency of a signal which does not actually contain that frequency component but implies it with many harmonic frequencies. [1]
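For reference, a direct C sketch of the computation is shown below. It follows the textbook definition exactly; a real-time implementation would more likely use an FFT-based method, and the function name is ours.

```c
/* r[k] = sum_{n=k}^{N-1} x[n] * x[n-k], for k = 0..maxlag. */
static void autocorr(const float *x, int N, float *r, int maxlag)
{
    for (int k = 0; k <= maxlag; k++) {
        float sum = 0.0f;
        for (int n = k; n < N; n++)
            sum += x[n] * x[n - k];
        r[k] = sum;
    }
}
```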

3.3 Pitch Detector

In simple words, pitch corresponds to the perceived fundamental frequency of a sound. Pitch determination is very important for many speech processing algorithms; for example, concatenative speech synthesis methods require pitch tracking on the desired speech segments if prosody modification is to be done.

Within the area of audio communication there is agreement on the term pitch: it basically refers to the tonal height of a sound object, e.g. a musical tone or the human voice. The use of the term is often inconsistent, in that it denotes both a stimulus parameter (i.e., synonymous with frequency) and an auditory sensation. People concerned with speech processing mostly use pitch in the former sense, meaning the fundamental frequency of the glottal oscillation (the vibration of the vocal cords). The ANSI definition of psychoacoustical terminology says that pitch is that auditory attribute of sound according to which sounds can be ordered on a scale from low to high.

Strictly speaking, under the ANSI definition there is no such thing as "the" pitch of a sound. Practically any real-life sound, including the tones of musical instruments and in particular the harmonic complex tones they produce, can evoke several pitches, of which the most prominent one is then said to be the pitch. Moreover, pitch is roughly related to the logarithm of the frequency. [2]

4 FLOW CHART

First we initialize the necessary objects used in the program and then wait for data. We continuously take 16 chunks of 128 samples each, and a band-pass filter is applied to each sample. We then count how many samples in the last 16 chunks carry sound; if there are more than 1500 such samples out of 2048, we consider it a sound. We normalize the sound data and take its autocorrelation; the autocorrelation coefficients are (nx+nr) in number. We take the right half of these coefficients and send them to the pitch estimation stage, which estimates the pitch from the distance to the highest peak: dividing the sampling frequency by this distance gives the estimated pitch. If the pitch lies between 50 Hz and 500 Hz it is detected; otherwise it cannot be detected and a "Pitch out of Bound" message is displayed.

Figure 2: Matlab Result of Autocorrelation

Figure 3: Flow Chart of the Complete System

5 DESIGN COMPONENTS

The first step is to initialize the DSP kit objects and the other variables we need. Data is taken continuously through the line-in, where the DSP 6713 A/D converter digitizes the sound. Every sample taken from the A/D converter is filtered through a band-pass filter (BPF) with a passband of 60 Hz to 7200 Hz. If at least 1500 out of 2048 samples contain continuous data, the 2048 samples are handed over to the autocorrelation block, which computes the (nx+nr) autocorrelation coefficients.

These (nx+nr) autocorrelation coefficients are then sent to the pitch estimation block, which detects the maximum peak after the first 16 samples. The distance of the maximum peak from the midpoint of the autocorrelation sequence is measured, and the sampling frequency Fs is divided by this distance. The result of this division is the required pitch, which is then displayed with LOG_printf.
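A hedged sketch of this peak-picking step is given below. The sampling frequency is an assumption for the example; the report only states that the peak is sought after the first 16 samples and that valid pitches lie between 50 Hz and 500 Hz.

```c
#define FS      8000.0f  /* assumed sampling frequency          */
#define MIN_LAG 16       /* skip the dominant peak around lag 0 */

/* r[] holds the right half of the autocorrelation (lag 0 at index 0).
   Returns the pitch in Hz, or -1 if it is out of bounds. */
static float estimate_pitch(const float *r, int nlags)
{
    int best = MIN_LAG;
    for (int k = MIN_LAG + 1; k < nlags; k++)
        if (r[k] > r[best])
            best = k;

    float pitch = FS / (float)best;          /* pitch = Fs / lag */
    return (pitch >= 50.0f && pitch <= 500.0f) ? pitch : -1.0f;
}
```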


Figure 4: The Design Components


References

[1] Introduction to Autocorrelation, http://note.sonots.com/SciSoftware/Pitch.html

[2] Definition of Pitch, http://www.mmk.e-technik.tu-muenchen.de/persons/ter/top/defpitch.html

[3] P. Cook, Real Sound Synthesis for Interactive Applications, A K Peters, 2002

[4] L. Rabiner, "On the Use of Autocorrelation Analysis for Pitch Detection," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-25, no. 1, February 1977

[5] L. Rabiner et al., "A Comparative Performance Study of Several Pitch Detection Algorithms," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-24, no. 5, October 1976


Part II

Speech Recognition

Dan Liu, Hongwan Qin, Ziyang Li, Zhonghua Wang

Abstract

In this report, the traditional algorithms for speech recognition are introduced first, and then the design scheme on the TMS320C6713 DSK is provided. We adopt classical short-time speech recognition methods to extract word features and then use some new approaches to increase the detection accuracy of the beginning and end points of a word (this detection accuracy is the bottleneck in speech recognition). The result of our design shows the effectiveness of these algorithms for speech recognition and the good performance of the TMS320C6713 in audio signal processing.


1 Introduction

The project requirement of this course is to recognize a number of isolated words on the DSP C6713 platform in the Code Composer Studio (CCS) software environment. Under the instruction of two teachers, we chose the traditional linear prediction coefficients (LPC) algorithm [1] for speech recognition.

After learning the relevant algorithms for extracting word features, we designed the system architecture, which includes high-pass filtering, voice point detection [2], LPC coefficient generation using autocorrelation [3] and the Schur recursion, and coefficient comparison with the reference database [4]. A reference model was then built in MatLab to make sure that our understanding and structure were correct and could achieve the basic speech recognition function. With this preparation, we started to transfer the code from MatLab to C. The last, and also very difficult, step was implementing the whole scenario on the DSP board, which took most of our time; fortunately, we finally reached the target.

In the following parts, we introduce the detailed algorithm of each part of the system and describe how to realize the whole speech recognition project. We then present the problems and solutions encountered during system implementation; the last part gives the conclusions of this course.

2 The algorithm

2.1 Speech Recognition System

Before explaining the detailed algorithm, let us get a rough picture of speech recognition. A speech recognition system mainly includes two steps: reference creation and recognition; both require preprocessing of the original sound and feature extraction. Fig. 1 shows the fundamental speech recognition elements [4]. We introduce the function details of each block below.

2.1.1 Preprocessing

The original voice contains the speech data, but at the same time background noise is input to the DSP system via the microphone. The noise will interfere with the recognition process, so the speech engine must handle (and possibly even adapt to) the environment within which the words are spoken. The preprocessing should complete the following tasks: partition the input signals (including real valid voice and noise) into frames, filter the noise as much as possible, detect where the real voice signal begins and ends, and store these detected valid voice signals for later processing.

2.1.2 Feature extraction

The responsibility of this step is to calculate and extract sound characteristics so that the key parameters of the voice signal, the reflection coefficients, can be obtained. These important data help the system with the subsequent processing. In this project we use the LPC method to extract the voice characteristic parameters.

2.1.3 Reference creation

Speech recognition systems require creating a reference system. This is necessary because different people pronounce the same word with different characteristics, so we need to collect multiple pronunciations of certain words by different speakers and use them to create the reference database from which the system makes its decision. In addition, each person's pronunciation may vary between different occasions, so it is also necessary to train on everyone's pronunciation. In our design, the detailed block model of the system is shown in Fig. 2; the corresponding algorithms of the important parts are presented in the following sections.

2.2 Speech Recognition algorithm

2.2.1 Partition and Pre-emphasis

In order to deal with the continuous input signal we have to partition the serial samples into separate frames. These original signals may be valid input data or just noise, so it is better to filter them. The noise is normally concentrated at low frequencies, so we should use a high-pass filter to suppress it. What is more, high-frequency voice components tend to be attenuated by the mouth and nose, so a high-pass filter is also needed to increase the amplitude of these components. This process is called pre-emphasis. The filtering could be moved to a later position in the flow chart to save some computation, but in our design we place it before the detector to obtain better anti-noise performance and thus a more accurate detection of the beginning of the voice input.
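A common realization of such a pre-emphasis stage is a one-tap high-pass filter. The sketch below uses the conventional coefficient 0.97, which is our assumption, since the report does not state the filter it used.

```c
/* Pre-emphasis: y[n] = x[n] - a * x[n-1], with a close to 1. */
static void preemphasis(const float *x, float *y, int n, float a)
{
    float prev = 0.0f;
    for (int i = 0; i < n; i++) {
        y[i] = x[i] - a * prev;   /* attenuates low frequencies */
        prev = x[i];
    }
}
```

Calling preemphasis(x, y, n, 0.97f) on each frame before the detector matches the placement described above.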

2.2.2 Detector

The system continuously receives signals from the buffer after the A/D converter, so the first thing to be done is to tell the system when to start extracting the characteristics. In other words, we need to tell the system which signals are valid. This job is done by finding out where the input voice begins and where it ends. A normal method to detect the beginning is to check whether there are samples whose amplitudes are higher than a reference level. When there are 15 consecutive frames whose amplitudes are all lower than the reference level, the system cuts off these 15 frames and considers the preceding frame to be the end point.

2.2.3 Autocorrelation

Speech recognition modeling requires an algorithm for solving a general set of symmetric linear equations; in this project we use LPC analysis, which is a very useful method for handling speech. An important part of speech recognition is finding the LPC coefficients. The first step is to compute the autocorrelation, which expresses the similarity of a signal to lagged versions of itself; it helps to process the short-time speech signal and is then used to create the linear prediction coefficients. The autocorrelation function is defined as

Rx(k) = Σ_{n=k}^{N} X(n) X(n−k),   k = 0, 1, ..., M    (1)

Here M is the number of reflection coefficients of each frame signal. We use the autocorrelation both for pitch estimation and as a means to express the similarity of a signal to lagged versions of itself; it is then used to find the linear prediction coefficients.

2.2.4 Schur recursion

In this section we introduce another algorithm for solving the normal equations, known as the Schur recursion. Unlike the Levinson-Durbin recursion, which produces the filter coefficients ap(k) in addition to the reflection coefficients Γj, the Schur recursion generates only the reflection coefficients. A sequence of autocorrelation values rx(k) is therefore given to the Schur recursion, which yields the corresponding sequence of reflection coefficients:

[rx(0), rx(1), ..., rx(p)] → [Γ_1, Γ_2, ..., Γ_p, ε_p]    (2)

The following steps illustrate how the reflection coefficients are obtained, given the autocorrelation sequence rx. The algorithm begins by initializing g_0(k) and g_0^R(k) with rx(k); the first reflection coefficient is then Γ_1 = −g_0(1)/g_0^R(0) = −rx(1)/rx(0). Taking Γ_1 as the multiplier, a lattice filter generates the sequences g_1(k) and g_1^R(k); as before, the second reflection coefficient is obtained from the ratio Γ_2 = −g_1(2)/g_1^R(1). The computation continues in this way until all p coefficients are generated:

1. Set g_0(k) = g_0^R(k) = rx(k) for k = 0, 1, ..., p
2. For j = 0, 1, ..., p−1:
   (a) Set Γ_{j+1} = −g_j(j+1) / g_j^R(j)
   (b) For k = j+2, ..., p: g_{j+1}(k) = g_j(k) + Γ_{j+1} g_j^R(k−1)
   (c) For k = j+1, ..., p: g_{j+1}^R(k) = g_j^R(k−1) + Γ_{j+1} g_j(k)
3. ε_p = g_p^R(p)
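The recursion translates almost line by line into C. The sketch below assumes a fixed model order P and single-precision arithmetic; it illustrates the steps above and is not the project's actual code.

```c
#define P 10   /* assumed model order */

/* Given autocorrelation r[0..P], fill gamma[0..P-1] with the
   reflection coefficients and return the modeling error eps_p. */
static float schur(const float *r, float *gamma)
{
    float g[P + 1], gR[P + 1];

    for (int k = 0; k <= P; k++)         /* step 1: initialize */
        g[k] = gR[k] = r[k];

    for (int j = 0; j < P; j++) {        /* step 2 */
        float G = -g[j + 1] / gR[j];     /* (a) */
        gamma[j] = G;
        /* update from high k downward so old values are still there */
        for (int k = P; k >= j + 1; k--) {
            float gk  = g[k];            /* old g_j(k)       */
            float gRk = gR[k - 1];       /* old g_j^R(k - 1) */
            if (k >= j + 2)
                g[k] = gk + G * gRk;     /* (b) */
            gR[k] = gRk + G * gk;        /* (c) */
        }
    }
    return gR[P];                        /* step 3: eps_p */
}
```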

2.2.5 Comparison

Once the autocorrelation coefficients of each frame have been converted to linear prediction coefficients, we use a comparison algorithm to compare the generated prediction coefficients with the reference coefficients. As mentioned before, the reference database coefficients are created in the same way as the reflection coefficients of the input voice.

Here we introduce the concept of the Euclidean distance, which is the real distance between two points. This distance is the square root of the squared difference between a reference coefficient and the corresponding generated prediction coefficient: d = √((x1 − x2)²). In practice the square root costs considerable computation, and the square of the difference serves the comparison purpose just as well, so we simply use the squared difference to implement the comparison. For each frame we get M distances, which we sum to obtain a total distance representing that frame.

After the coefficients of all frames have been compared, we have many of these total distances. We sum them to get a final distance between the input's prediction coefficients and each stored set of reference coefficients. So if there are 10 groups of reference coefficients, we get 10 final distances. We then compare these 10 final distances and find the minimum one; the group of reference coefficients that produced this minimum distance represents the recognized word. Since a mapping between reference coefficients and a specific word was established in advance, we can now find out which word is the right one.
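The comparison then reduces to a few nested loops. In the sketch below the number of frames, the coefficient order M and the number of reference words are illustrative sizes, not values from the report.

```c
#define NFRAMES 32   /* frames per word, illustrative        */
#define M       10   /* coefficients per frame, illustrative */
#define NWORDS  10   /* stored reference words               */

/* Returns the index of the reference word closest to the input. */
static int best_match(const float c[NFRAMES][M],
                      const float ref[NWORDS][NFRAMES][M])
{
    int   best  = 0;
    float bestd = 1e30f;

    for (int w = 0; w < NWORDS; w++) {
        float d = 0.0f;                  /* final distance for word w */
        for (int f = 0; f < NFRAMES; f++)
            for (int m = 0; m < M; m++) {
                float e = c[f][m] - ref[w][f][m];
                d += e * e;              /* squared difference, no sqrt */
            }
        if (d < bestd) { bestd = d; best = w; }
    }
    return best;
}
```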

3 How to realize the whole speech recognition

In this section we present how we implemented our design. First we describe the MatLab reference model for the whole structure; after this step we converted it to C to run on the DSP board.


3.1 MatLab Implementation

The following flow chart is our MatLab reference structure for verifying our design before transferring it to C. As seen in Fig. 3, there are six modules in the whole speech recognition process:

Generator: the top level of the system; it includes all function parts except the compare module and makes each part operate in order.
Detector: checks the start point and end point of the voice signal.
High Pass Filter: boosts the high-frequency content of the voice and discards the unwanted low-frequency samples, which are noise and indistinct sound.
Finding Autocorrelation: calculates the autocorrelation.
Schur Recursion: computes the reflection coefficients.
Compare: finds the reference word with the minimum difference in the reference database and outputs the result.

3.2 DSP implementation

When we transferred the MatLab model into C we made some revisions to suit the actual operation on the DSP [5]:
1. We moved the high-pass filter in front of the detector; the role of this change is to enhance the anti-noise performance.
2. We modified the start point and end point detection. In the previous model, a frame was considered the start frame and kept as soon as a single signal in it was higher than the reference level; for the end point we used the rule that 15 consecutive frames all below the reference level mark the end point, and those 15 frames are discarded. When we used these methods on the DSP, the recognition rate was very low, so we had to add more judgment conditions to enhance the reliability of the detection.
Our new model is shown in Figure 4; the other modules are identical to those in MatLab.

4 The problem and solution

At the beginning, the noise entering the original model was quite large, so it was very difficult to find the start point and end point precisely. Our analysis was that the difference between the noise level after the codec and the amplitude of the valid voice signals was not very pronounced. The noise mainly concentrates at low frequencies, so we used a high-pass filter before the detector to remove this low-frequency noise and obtain a better signal-to-noise ratio. We also set the value of "microphone-boost" to 0 in the file dsk6713_codec_devParams.c to decrease the distortion coming from the amplifier in the codec. Using LOG_printf we could see that these two measures effectively increase the signal-to-noise ratio.

Our previous method of detecting the start point and end point treated a frame as valid whenever it contained a single valid sample; however, this admits more interfering signals, since a stochastic impulse in one frame is enough to mark the frame as valid. To reduce the possibility of misjudgment caused by interference, we divide each frame into four sections, and only if there are valid signals in at least two sections does the system consider the frame valid. In addition, for more rigorous conditions we detect the start point with the following method: examine five consecutive frames, and only if at least three of them are valid does the system consider there to be a start point in these five frames, storing signals from then on. Similarly, for end point detection, only if there are no more than two valid frames in 15 consecutive frames does the system consider the end point found. The result shows that the length of the collected data is moderate; a sketch of the voting scheme is shown below.
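In the sketch the frame size and amplitude level are assumptions, while the 2-of-4 section rule and the 3-of-5 frame rule follow the description above.

```c
#include <stdlib.h>

#define FRAME 256    /* samples per frame, illustrative   */
#define SECT  (FRAME / 4)
#define LEVEL 1500   /* reference amplitude, illustrative */

/* A frame is valid if at least 2 of its 4 sections contain a sample
   above the reference level. */
static int frame_valid(const short *f)
{
    int hot = 0;
    for (int s = 0; s < 4; s++)
        for (int i = s * SECT; i < (s + 1) * SECT; i++)
            if (abs(f[i]) > LEVEL) { hot++; break; }
    return hot >= 2;
}

/* Sliding window over the last 5 frames; a start point is declared
   when at least 3 of them are valid. */
static int start_detected(int valid_now)
{
    static int hist[5], idx = 0, count = 0;
    count += valid_now - hist[idx];
    hist[idx] = valid_now;
    idx = (idx + 1) % 5;
    return count >= 3;
}
```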

Different people pronouncing one word generate different voice wave shapes, representing different formant frequencies; that is, different pronunciations of the same word produce different prediction coefficients. We therefore collected the speech signals of one word pronounced by the four group members several times and then averaged the generated prediction coefficients. The averages stored as the reference coefficients proved satisfyingly reliable.

5 The conclusions

We implemented the LPC algorithms on the DSP, and the running result is as expected; that is to say, our system can correctly recognize our spoken words. Our design uses some effective anti-noise measures and shows fairly robust performance.

Actually, it is very difficult to identify a word in the real world, which is full of noise, especially when trying to identify different pronunciations by different people. The recognition process is really more complicated than expected. Because the time for this project was limited, it was impossible for us to realize more powerful algorithms. Sometimes we had to train our own speaking so that we could pronounce the words consistently to guarantee a fine recognition effect.


References

[1] Monson H. Hayes, Statistical Digital Signal Processing and Modeling, 1996

[2] L. R. Rabiner, S. E. Levinson, A. E. Rosenberg, J. G. Wilpon, "Speaker-Independent Recognition of Isolated Words Using Clustering Techniques," 1979

[3] http://en.wikipedia.org/wiki/Autocorrelation

[4] http://en.wikipedia.org/wiki/Speech_recognition

[5] Sanhengxing, TMS320C6713 DSP Theory and Application, ISBN 9787121085642, April 2009


Figure 1: Block of Speech Recognition modeling

Figure 2: Block of Speech Recognition process

Figure 3: Block of MatLab Implementation


Figure 4: DSP Block of the system


Part III

MIDI Synthesizer

Nauman Hafeez, Waqas Shafiq

Abstract

The aim of the project is to implement a MIDI synthesizer with the FM synthesis technique on the Texas Instruments TMS320C6713 DSP processor kit. The song being synthesized is "I Will Survive" by Gloria Gaynor. The MIDI synthesizer can play 4 channels, one of them having the capability to play polyphonic notes. The report discusses how the MIDI format is built up, describes how different synthesis techniques work, and gives an overview of how to make a sound analysis of different musical instruments. Finally, our experiences and problems are discussed.


1 Introduction

Yamaha introduced FM synthesis in its DX-7 synthesizer during the 80's. It came to revolutionize the world of synthesizers: with the help of the new FM synthesis, manufacturers could create sounds that were more similar to real instruments. Yamaha based their synthesizer on the MIDI format, a format still living today. Even though MIDI is an old technique, it is used in cell phones, computers, sound cards and of course synthesizers. The MIDI format is a hexadecimal language with information about the instruments, rate, pace of the melody and amplitude of every tone that is played. To play MIDI sounds, a MIDI player is used to decode the information.

2 Description

2.1 MIDI Standard

MIDI is an acronym for "Musical Instrument Digital Interface" and is a hardware and software specification that allows communication between electronic musical instruments. The hardware specification is an RS232-like serial interface, and the software side of the specification is the actual language of the data being sent: its structure, order and other characteristics. An essential aspect of MIDI data is that it does not contain sampled sound; instead it is a series of commands or instructions which represent musical performance-related events. The data is stored in the form of events which are recorded when an event happens on any instrument. There are three forms of MIDI files, called format 0, format 1 and format 2. Format 0 files contain data from a single track, whereas formats 1 and 2 can store data from multiple tracks. The MIDI synthesizer that we have designed can play format 0 files.

MIDI (Musical Instrument Digital Interface) is an industry-standard protocol defined in 1982 that enables electronic musical instruments such as keyboard controllers, computers, and other electronic equipment to communicate, control, and synchronize with each other. MIDI allows computers, synthesizers, MIDI controllers, sound cards, samplers and drum machines to control one another and to exchange system data. MIDI does not transmit an audio signal or media; it transmits "event messages" such as the pitch and intensity of musical notes, vibrato and panning, cues, and clock signals to set the tempo. As an electronic protocol, it is notable for its widespread adoption throughout the industry. MIDI composition and arrangement takes advantage of MIDI 1.0 and General MIDI (GM) technology to allow musical data files to be shared among various electronic instruments by using a standard, portable set of commands and parameters. The data can be saved as a Standard MIDI File (SMF), digitally distributed, and then reproduced by any computer or electronic instrument that also adheres to the same MIDI, GM, and SMF standards. MIDI messages (along with timing information) can be collected and stored in a computer file system, in what is commonly called a MIDI file, or more formally, a Standard MIDI File (SMF).

All MIDI files begin with a header chunk. The chunk starts with the 4-byte character identifier "MThd". It contains data about the type of the file (format 0, 1 or 2), the number of tracks in the file, and the time division used to record the MIDI file. The next chunk is the track chunk. It starts with the character identifier "MTrk" and contains the length of the track. After this, all the events relating to the musical data are stored as packets of information. The MIDI events are classified into several categories, each corresponding to a different kind of event, such as a note on, a note off, or a change in the controller of an instrument. The end of the MIDI file contains an event of type 0x2F. Each MIDI event starts with a variable-length field (up to 4 bytes) which indicates the amount of time in ticks until the next MIDI event should be read. The next byte contains the type of the MIDI event (e.g. note on, note off) and the channel on which it occurs. The next byte contains the note number which is played at that instant, and the last byte indicates the velocity with which the note was played.
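As an illustration of this layout, the sketch below reads one note event from a track buffer. It is deliberately simplified: it assumes a plain note-on/note-off stream and ignores running status and meta events, which a complete parser would have to handle.

```c
#include <stdint.h>

typedef struct {
    uint32_t delta;     /* ticks until this event            */
    uint8_t  type;      /* high nibble: 0x9 note on, 0x8 off */
    uint8_t  channel;   /* low nibble of the status byte     */
    uint8_t  note;
    uint8_t  velocity;
} MidiEvent;

/* Read one event at p, fill *ev, return a pointer past the event. */
static const uint8_t *read_event(const uint8_t *p, MidiEvent *ev)
{
    uint32_t t = 0;
    do {                              /* variable-length delta time */
        t = (t << 7) | (*p & 0x7F);
    } while (*p++ & 0x80);

    ev->delta    = t;
    ev->type     = *p >> 4;
    ev->channel  = *p++ & 0x0F;
    ev->note     = *p++;
    ev->velocity = *p++;
    return p;
}
```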

2.2 Envelope

The purpose of the envelope is to avoid a sudden start or end of the sound, i.e. to give the sound signal a smooth start and ending.

2.2.1 ADSR

ADSR stands for "attack-decay-sustain-release". The notes produced by a musical instrument do not have constant amplitude over time: every note goes through an "attack" stage, then a "decay" stage, followed by a fairly long "sustain" stage, and finally a "release" stage. The MIDI file contains data only about when a note was created and with what amplitude (velocity); it also contains "note off" events, but no information about the attack-decay-sustain-release characteristics of the note. So it is the synthesizer's responsibility to emulate this ADSR property. We have implemented an ADSR function that returns values based on the note sample index, determining whether the current note is in its beginning stages or in its sustain phase. The function returns fractional values depending on the sample index, and these values are used to modulate the amplitude of the produced sound. A well-defined ADSR makes the notes sound better and more realistic. A sample ADSR envelope is shown in Figure 1.


Figure 1: ADSR Envelope
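A minimal linear ADSR function might look as follows; the stage lengths and sustain level are illustrative numbers, since the report does not quote the values it used.

```c
#define A_LEN 800     /* attack length in samples, illustrative  */
#define D_LEN 1600    /* decay length in samples, illustrative   */
#define S_LVL 0.7f    /* sustain level, illustrative             */
#define R_LEN 2400    /* release length in samples, illustrative */

/* n: samples since note-on; nrel: samples since note-off,
   or -1 while the note is still held. Returns a gain in [0,1]. */
static float adsr(int n, int nrel)
{
    if (nrel >= 0)    /* release: fade from the sustain level to 0 */
        return nrel < R_LEN ? S_LVL * (1.0f - (float)nrel / R_LEN) : 0.0f;
    if (n < A_LEN)    /* attack: ramp from 0 to 1 */
        return (float)n / A_LEN;
    if (n < A_LEN + D_LEN)   /* decay: fall from 1 to S_LVL */
        return 1.0f - (1.0f - S_LVL) * (float)(n - A_LEN) / D_LEN;
    return S_LVL;     /* sustain */
}
```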


We have simulated various instruments in LABVIEW; examples are shown in Figures 2 to 7.

3 Synthesis Techniques

There are different sound synthesis techniques, each with its own strengths and weaknesses. We studied four of them.

3.1 FM Synthesis

FM synthesis is a way of generating musically interesting sounds by rapidly changing the basic frequency of a sound in a repetitious way. The pattern of change comes from another waveform with a frequency in the hearing range. There is a lot to say about which waveforms and which frequency inputs to use. The basic equation for FM (frequency modulation) is described below:

y(t) = w1 · sin(2π·fc·t + w2 · sin(2π·H·fm·t + θ) + φ)    (1)

In this equation the frequency of the output signal varies sinusoidally around the carrier frequency fc, and the frequency variation is controlled by fm. w1 is the envelope of the output signal, which can be constant or varied according to the desired sound, whereas w2 is the modulation index, a key factor in generating proper musical sounds; H = fc:fm. For the different musical instruments the ratio fc:fm is very important: without knowing the exact frequency ratio of fc and fm, it is very difficult for a designer to generate musical sounds through FM synthesis. Although FM synthesis looks simple, there is no mathematical rule relating instruments to ratios, so the designer has to resort to trial and error to find the exact frequency ratio for each musical instrument. The table below gives this ratio for several instruments:

Musical Instrument   H (fc:fm)
Bell                 1.40
Wood-Drum            0.688
Brass                1
Clarinet             0.667
Bassoon              0.2

But the story of FM does not end there: one more very important factor needs to be considered in FM synthesis, namely the modulation index. Each of the musical instruments in the table above has a unique time-varying modulation index which controls the harmonic content of its sound.
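Putting Eq. (1) into code, one output sample could be generated as below. This is a sketch under two assumptions: the modulator frequency is taken as fm = fc/H (from H = fc:fm), and the phase offsets θ and φ are set to zero; the envelope env (w1) and the modulation index (w2) would in practice follow time-varying curves such as those in Figures 2 to 7.

```c
#include <math.h>

#define PI 3.14159265358979f
#define FS 48000.0f   /* assumed sample rate */

/* One FM sample at sample index n, per Eq. (1) with theta = phi = 0. */
static float fm_sample(float fc, float H, float env, float index, int n)
{
    float t  = (float)n / FS;
    float fm = fc / H;               /* from H = fc : fm */
    return env * sinf(2.0f * PI * fc * t
                      + index * sinf(2.0f * PI * fm * t));
}
```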


Figure 2: Envelope (a(t)) and modulation index (i(t)) waveform for Drum


Figure 3: Output Spectrum of Drum


Figure 4: Envelope (a(t)) and modulation index (i(t)) waveform for Brass


Figure 5: Output Spectrum of Brass


Figure 6: Envelope (a(t)) and modulation index (i(t)) waveform for Bell


Figure 7: Output Spectrum of Bell


3.2 Additive Synthesis

Additive synthesis is a technique of audio synthesis which creates musical timbre. The timbre of an instrument is composed of multiple harmonics or partials, in different quantities, that change over time. Additive synthesis emulates such timbres by combining numerous waveforms pitched to different harmonics, with a different amplitude envelope on each, along with inharmonic artefacts. Usually this involves a bank of oscillators tuned to multiples of the base frequency. Often each oscillator has its own customizable volume envelope, creating a realistic, dynamic sound that changes over time.
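A bare-bones oscillator bank illustrates the idea; the number of harmonics and the constant amplitudes are placeholders for the per-harmonic envelopes described above.

```c
#include <math.h>

#define PI    3.14159265358979f
#define FS    48000.0f   /* assumed sample rate */
#define NHARM 8          /* illustrative number of harmonics */

/* Sum NHARM harmonics of f0, each with its own amplitude. */
static float additive_sample(float f0, const float amps[NHARM], int n)
{
    float t = (float)n / FS;
    float y = 0.0f;
    for (int h = 1; h <= NHARM; h++)
        y += amps[h - 1] * sinf(2.0f * PI * h * f0 * t);
    return y;
}
```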

3.3 Subtractive Synthesis

Subtractive synthesis is often referred to as analogue synthesis because most analogue synthesizers use this method of generating sounds. In its most basic form, subtractive synthesis is a very simple process: an oscillator generates a suitably bright sound, which is routed through a filter that cuts off or cuts down the brightness to something more suitable. The resulting sound is routed to an amplifier, which controls the loudness of the sound over time so as to emulate a natural instrument. In short, subtractive synthesis starts with a harmonically rich waveform and filters out unwanted spectral components.

3.4 Wavetable Synthesis

In wavetable synthesis the desired waveform is sampled with a given resolution during an integer number of periods. This sample is then re-sampled and looped at playback to reconstruct the sound. To achieve more varied sounds, different waveforms can be used during different phases of the sound, for instance during the attack, sustain and decay parts. The wavetable is stored in memory, and an increment of the reading pointer gives access to the required sample at each moment in time. If we want to change the pitch of the stored signal, we have to change the table look-up speed.
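The look-up mechanism can be sketched as follows; the table length and sample rate are illustrative, and a real implementation would typically interpolate between adjacent table entries instead of rounding.

```c
#define TABLE_LEN 1024     /* illustrative table size */
#define FS        48000.0f /* assumed sample rate     */

static float table[TABLE_LEN];   /* holds one period of the waveform */

/* Read one sample and advance the pointer; a higher freq means a
   larger increment, i.e. a faster table look-up and a higher pitch. */
static float wavetable_sample(float freq, float *phase)
{
    float y = table[(int)*phase];
    *phase += freq * TABLE_LEN / FS;
    if (*phase >= TABLE_LEN)
        *phase -= TABLE_LEN;     /* loop the table */
    return y;
}
```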

4 Implementation

The source code written in this (FM synthesis) project consists mainly of two parts. The first part is the process function defined in pip_audio.c. This function is invoked when the program is ready to copy data from the input buffer to the output buffer. Before copying, the process function processes the input data (e.g. synthesizes the tones given by the MIDI commands in the input data) and then copies the processed data to the output buffer. The second part of the source code is the instrument functions defined in instruments.c. When process is invoked, it reads a MIDI command from the input buffer by calling MIDI runCommand. The interpretation of the commands given by the input data is done in midi.c. The envelope calculations are performed using an envelope counter and the velocity parameter Mvel. The calculations result in an amplitude used as an input parameter to the instruments, along with the envelope counter used as the time and the note to play (a set of notes, ordered by rising frequency, is stored in the vector notes). The instrument functions are responsible for synthesizing the output values, which are then written to the output buffer. We are playing channels 1, 3, 5 and 10. On the polyphonic channel we play three notes at a time. For this we need six data structures to store note information and three variables to keep track of the earliest note; of the six data structures, three are "current" and three are "next". When a note arrives, its information is stored in the first current data structure and the variable corresponding to it is set to 0. When a new note on the same channel arrives, its information is stored in the second current data structure, whose variable is set to 0, while the variable of the first is set to 1. If another note arrives, it is assigned to the third current data structure and its variable set to 0; the variable of the second becomes 1 and that of the first becomes 2. If yet another note arrives, the variable holding the value 2 identifies the earliest note, which has to be killed; that note is killed and the new note is assigned to the next data structure corresponding to the killed current one, after which the variables are updated accordingly. A compact sketch of this age-based voice allocation is given below.
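In the sketch the Voice fields and the helper function are illustrative, not the project's actual data structures; only the oldest-note (age 2) stealing rule is taken from the description above.

```c
typedef struct { int note, vel, active, age; } Voice;

#define NVOICES 3
static Voice voices[NVOICES];

/* Assign a new note to a free voice, or steal the oldest one. */
static void note_on(int note, int vel)
{
    int slot = -1;
    for (int i = 0; i < NVOICES; i++)
        if (!voices[i].active) { slot = i; break; }
    if (slot < 0)                        /* all busy: steal age 2 */
        for (int i = 0; i < NVOICES; i++)
            if (voices[i].age == NVOICES - 1) { slot = i; break; }

    for (int i = 0; i < NVOICES; i++)    /* the other voices age */
        if (voices[i].active && i != slot)
            voices[i].age++;
    voices[slot] = (Voice){ note, vel, 1, 0 };
}
```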

5 Conclusion

We have successfully implemented a MIDI synthesizer with 4 channels, one of them polyphonic. We had a lot of problems understanding the MIDI standard, for instance how MIDI actually works; we later implemented it easily in MATLAB and LABVIEW, but faced problems when implementing it on the DSP kit because we were working with it for the very first time. Because of this we implemented our code using only FM synthesis, although the drums really ought to be implemented with wavetable synthesis, since they do not sound very good when implemented with FM synthesis.




Part IV

Pitch Estimation

Aravind K Annavaram V, Mohammed Azher Ali, Mirza Jameel Baig, Surendra Reddy U

Abstract

This document describes a method for estimating the pitch of an audio signal in real time. The objective of the project is to implement an algorithm on the TMS320C6713 floating-point Digital Signal Processor (DSP) that handles audio input in real time, using the Code Composer Studio (CCS) development tool [6]. Pitch estimation is an algorithm designed to estimate the pitch or fundamental frequency of a quasi-periodic or virtually periodic signal [1], usually a digital recording of speech or a musical note or tone. This can be done in the time domain or in the frequency domain. In our implementation we use cepstrum analysis for pitch estimation, which is one of the frequency-domain techniques.


1 Introduction

Pitch estimation is an essential task in a variety of speech processing applications. Although many pitch detection algorithms (PDAs), both in the time and frequency domains, have been proposed in the literature [2], accurate and robust voicing detection and pitch frequency determination remain an open problem. The difficulty involved in pitch detection stems from the non-stationarity and quasi-periodicity of the speech signal, as well as the interaction between the glottal excitation and the vocal tract [3].

A number of prevailing pitch estimation techniques use a variety of features such as autocorrelation, cepstrum, average magnitude difference function (AMDF), and so on. In this report, we introduce a cepstrum pitch estimation approach based on the Fourier transform of the log of the magnitude spectrum of the input waveform. Using the log spectrum makes the nonlinear (inharmonic) system more linear [4]. The algorithm is first tried out in the MatLab environment because of its simplicity and familiarity. Once confirmed functional, it is converted into C code and implemented on a Texas Instruments DSK 6713.

2 Different methods of pitch estimation

Pitch estimation methods are classified into (1) time-domain approaches and (2) frequency-domain approaches.

2.1 Time-domain approaches:

In this approach we measure the zero-crossing points of the signal (i.e., the zero-crossing rate). However, if the signal consists of multiple sine waves with different periods, this approach does not work well [1]. Other approaches compare segments of a signal with offset versions of other segments to find a match on a trial basis. The average squared mean difference function, the average magnitude difference function and a few other algorithms, such as autocorrelation, work this way. For signals with high periodicity, these algorithms have proved to be efficient [1]. However, they are prone to octave errors, and polyphonic tones cannot be estimated.
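For illustration, a zero-crossing-rate pitch estimate can be sketched as follows (our own minimal version; it assumes a near-sinusoidal input in a float buffer):

/* Estimate pitch from the zero-crossing rate of a mono buffer.
 * Works only for near-sinusoidal signals: a sine of frequency f
 * crosses zero 2*f times per second. */
float zcr_pitch(const float *x, int n, float fs)
{
    int i, crossings = 0;
    for (i = 1; i < n; i++)
        if ((x[i - 1] < 0.0f) != (x[i] < 0.0f))
            crossings++;
    return 0.5f * crossings * fs / (float)(n - 1);
}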


2.2 Frequency-domain approaches:

In the frequency-domain approaches, polyphonic tones can be detected by using the Fast Fourier Transform (FFT) to convert the signal from the time domain to the frequency domain. This approach requires more processing power as the desired accuracy increases, but the well-known efficiency of FFT algorithms makes it suitable for many purposes [1].

Popular frequency-domain algorithms include the harmonic product spectrum, cepstral analysis and maximum likelihood. Frequency-domain pitch detectors apply short-time Fourier transforms to successive segments of the input signal; after finding the peaks, the detector must decide which frequencies are fundamental frequencies and which are merely harmonics. In real time, these frequency-domain pitch detectors select the strongest frequency as the pitch; the fundamental frequency may not be the strongest component, but it is the most prominent pitch due to the reinforcement of multiple harmonics. This is a problem when using the short-time Fourier transform in a pitch detector, because it limits the number of analysis channels at the low end of the spectrum [5]. The fundamental frequency of speech can vary from 40 Hz for low-pitched male voices to 600 Hz for children or high-pitched female voices [1].

3 Cepstrum analysis

The name cepstrum comes from reversing the first four letters of the word spectrum. The cepstrum technique is commonly used in frequency-domain pitch detection. It tends to separate strongly pitched components from the rest of the spectrum, and is therefore considered a good model for vocal and instrumental sounds, whose spectra can be seen as a sum of excitation and resonances [5]. The result of the cepstrum is a time sequence, like the input signal itself. If the input signal has a strong fundamental pitch period, this shows up as a peak in the cepstrum. By measuring the time distance from time 0 to the time of the peak, one finds the fundamental period of this pitch.

The cepstrum algorithm flow chart is shown in figure 1. The input signal is fed to the FFT block, where the FFT of the input signal is computed, i.e., the signal is converted from the time domain to a frequency-domain spectrum. The signal is then passed through the absolute-value module to get the magnitude spectrum. The magnitude spectrum is converted to a logarithmic scale by the log module following the absolute module in the flow chart. Finally, the signal is converted back to the time domain by the IFFT block. The maximum peak in the output waveform gives the fundamental pitch period, and its inverse gives the pitch frequency of the original signal. This is how the cepstrum algorithm works.
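A compact offline sketch of this chain is shown below. For clarity it uses naive O(N^2) DFTs instead of the optimized FFT used on the DSP, and all names are ours:

#include <math.h>

#define N 128   /* analysis length, matching the 128-point FFT used later */

/* Compute the real cepstrum of x[0..N-1] into c[0..N-1]:
 * c = IDFT( log|DFT(x)| ).  For illustration only. */
void cepstrum(const float x[N], float c[N])
{
    static const float PI = 3.14159265f;
    float logmag[N];
    int k, n;

    for (k = 0; k < N; k++) {               /* DFT, then log magnitude */
        float re = 0.0f, im = 0.0f;
        for (n = 0; n < N; n++) {
            float w = -2.0f * PI * (float)(k * n) / N;
            re += x[n] * cosf(w);
            im += x[n] * sinf(w);
        }
        logmag[k] = logf(sqrtf(re * re + im * im) + 1e-12f);
    }
    for (n = 0; n < N; n++) {               /* inverse DFT (real part) */
        float acc = 0.0f;
        for (k = 0; k < N; k++)
            acc += logmag[k] * cosf(2.0f * PI * (float)(k * n) / N);
        c[n] = acc / N;
    }
}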


Figure 1: cepstrum


4 The Cepstrum algorithm Implementation

Here the cepstrum algorithm is implemented and analyzed in two different tools, MatLab and Code Composer Studio (CCS). First the code is written in MatLab and tested for different tones; then the MatLab algorithm is developed into C code in CCS.

4.1 The MatLab algorithm

The flow chart shown in figure 1 describes the cepstrum algorithm. A piano tone (.wav file, mono, i.e., consisting of a single channel) is read into matlab and the FFT function is applied to it. A common use of Fourier transforms is to find the frequency components of a signal buried in a noisy time-domain signal. The frequency-domain signal is then passed through the LOG function. The log function operates element-wise on arrays, and its output can include complex and negative numbers. These complex values are converted to magnitudes by the ABS function: abs(X) returns an array Y such that each element of Y is the absolute value of the corresponding element of X.

The signal is now converted back to the time domain by applying the IFFT function. Zeros are introduced at the start and end of the signal to cut down the noise components caused by window effects.

A peak is then detected in this signal by applying the MAX function, which gives the maximum amplitude of the signal and its position on the time axis; the inverse of the time period at which the maximum peak occurs gives the pitch of the signal. The use of these functions is shown in figure 2.

4.2 Observations

The matlab implementation of the cepstrum shown in figure 2 was tested using sine waves with frequencies ranging from 250 Hz to 900 Hz and piano tones from B3 (246 Hz) to A5 (880 Hz).

4.3 Algorithm for the DSP

The cepstrum algorithm that was tested in matlab is implemented in real time on the TMS320C6713 floating-point Digital Signal Processor (DSP) using the Code Composer Studio (CCS) development tool and the C language. The Lab 2 source code is used as a starting point for the implementation.

In our project we use two pip buffers, PIP-TX and PIP-RX. The input signal is taken from the line-in port, and the data from line-in is stored in the PIP-RX buffer; in the C code the data is transferred from PIP-RX to PIP-TX, a buffer defined in the code, and all the processing is done on this


Figure 2: matlab results

data. The functions used to achieve the desired algorithm are described in the sections below.

4.4 FFT

In our code we use a complex FFT function (128 points), which is described in the DSP libraries from Texas Instruments for the TMS320C6713 [8]. To use this function, the input values on which the FFT is to be computed are stored in a buffer, the twiddle factors are generated manually in the C code, and the FFT size specifies the n-point FFT to be computed. First we generate the coefficients for the FFT function, which are called twiddle factors. The output of the FFT function is in bit-reversed order and in the frequency domain.
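Twiddle-factor generation might look as follows. This is a sketch only; the interleaving and sign convention must match whatever DSPLIB FFT routine is actually linked in:

#include <math.h>

/* Generate N/2 complex twiddle factors w[k] = e^{-j*2*pi*k/N},
 * stored interleaved as {re, im}, for an N-point radix-2 FFT. */
void gen_twiddle(float *w, int n)
{
    const float pi = 3.14159265358979f;
    int k;
    for (k = 0; k < n / 2; k++) {
        w[2 * k]     =  cosf(2.0f * pi * k / n);
        w[2 * k + 1] = -sinf(2.0f * pi * k / n);
    }
}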

4.5 Absolute And Logarithm Function

The FFT output thus obtained is in complex form; we find its absolute value using a function implemented manually in C. The absolute values obtained are converted to a logarithmic scale using the LOG function, by including the math.h library in the C code.


4.6 IFFT

The signal thus obtained is converted back to the time domain by applying the IFFT function, which is described in the DSP libraries from Texas Instruments for the TMS320C6713 [8]. The IFFT output is then scaled.

4.7 Maximum

The maximum value in the output buffer is found using the maximum function described in the DSP libraries from Texas Instruments for the TMS320C6713. This function finds the maximum number in an array and returns the index corresponding to the maximum number.

Before using this function to find the maximum index of the buffer obtained from the IFFT function, we have to set some samples at the beginning to zero, because at the start there are noise components, caused by the window effect, with high amplitude values. There is therefore a risk of detecting the wrong peak and estimating an incorrect pitch. This is the reason why we zero out some samples at the beginning: to remove the noise components and detect the correct peak.

4.8 Pitch Estimation

The pitch frequency corresponding to the maximum value is found by dividing the sampling frequency (i.e., the frequency at which the DSK 6713 samples the input signal) by the index corresponding to the maximum number in the array. This pitch frequency is displayed in the MESSAGE LOG window, using the LOG-printf function in C.
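Putting the zeroing and the peak search together, a sketch of the final two steps could be (names and the LOW_CUT value are ours):

#define LOW_CUT 16          /* skip window-effect noise near quefrency 0 */

/* Find the cepstral peak above LOW_CUT and convert it to a pitch
 * in Hz, given the sampling frequency fs. */
float estimate_pitch(float *cep, int n, float fs)
{
    int i, max_i = LOW_CUT;
    for (i = 0; i < LOW_CUT; i++)       /* zero out early samples */
        cep[i] = 0.0f;
    for (i = LOW_CUT + 1; i < n; i++)
        if (cep[i] > cep[max_i])
            max_i = i;
    return fs / (float)max_i;           /* period in samples -> Hz */
}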

5 Simulation Results

Different piano tones with various frequencies are given as input, and the corresponding fundamental frequency (pitch) of each tone is estimated, as can be seen in the simulation result figures. A piano tone of 246 Hz (B3) is given to the line-in port of the DSK6713 kit. The pitch of this tone is detected and printed in the log window, as shown in figure 3.

A piano tone of 523 Hz (C5) is given to the line-in port of the DSK6713 kit. The pitch of this tone is detected and printed in the log window, as shown in figure 4.

A piano tone of 885 Hz (A5) is given to the line-in port of the DSK6713 kit. The pitch of this tone is detected and printed in the log window, as shown in figure 5.


Figure 3: simulation result for (B3) 246Hz

Figure 4: simulation result for (C5) 523Hz

Figure 5: simulation result for (A5) 885Hz


6 Problems encountered

Texas Instruments provides DSPLIB, an optimized floating-point DSP function library. It includes C-callable, assembly-optimized general-purpose signal-processing routines for optimal execution speed. However, there were problems getting the DSP library to work, as it was compiled for a newer version of CCS, and we had to use plain C code instead of the library routines. Another main problem we faced is that the input signal is in the form of unsigned short, but the input to the FFT function should be in floating-point format. We therefore first converted the input signal from unsigned short to short and then to float; converting directly to floating point does not work.
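The working two-step conversion can be sketched as:

/* Convert a raw codec sample to float.  The direct cast
 * (float)raw treats the bit pattern as unsigned and gives
 * wrong values for negative samples. */
float sample_to_float(unsigned short raw)
{
    short s = (short)raw;   /* reinterpret as signed 16-bit */
    return (float)s;
}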

7 Conclusion

The above algorithm has been verified for various piano tones ranging from B3 = 246 Hz to A5 = 880 Hz and for sine waves with frequencies ranging from 240 Hz to 900 Hz. The cepstrum algorithm that has been implemented is well suited to pitch detection at low frequencies.

Finally, the project was implemented and verified for different piano tones with a sampling frequency of 8 kHz, so the accuracy of the function is somewhat poor when estimating frequencies below 240 Hz and above 900 Hz.

References

[1] Wikipedia

[2] W. Hess, Pitch Determination of Speech Signals. Berlin, Germany: Springer-Verlag, 1983.

[3] S. Ahmadi and A. S. Spanias, "Cepstrum-Based Pitch Detection Using a New Statistical V/UV Classification Algorithm," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, May 1999.

[4] Lu Meng, March 3, 2004. Based on "Pitch Extraction and Fundamental Frequency: History and Current Techniques" by Dr. David Gerhard, University of Regina.

[5] Curtis Roads, The Computer Music Tutorial, 1996, part 4: Sound Analysis.

[6] www.eit.lth.se

[7] www.mathtools.net/MATLAB/Signal processing/DSP/index.html


[8] http://focus.ti.com/general/docs/ TMS320C67x DSP Library Programmer's Reference Guide (Rev. C)


Part V

Video processing

Bilgin Can, Ma Ling, Lu Fei

Abstract

In this project, object detection and identification in video processing is implemented on the TMS320DM6437 Evaluation Module. The aim is to detect a moving object while keeping the background fixed. First, we locate the moving object. Then Canny edge detection is employed to extract the moving object's shape. The moving object's edges are shown on the monitor after extraction.

Furthermore, object identification is carried out after the basic requirement is fulfilled. Based on the moving object's edge, the shape is classified into one of three categories: rectangular, circular and others. Different background colors are used to indicate the shape of the moving object: blue for rectangular, green for circular, and black for others. Finally, the object's colored edges are shown on the monitor.


1 Introduction

Real-time video and image processing is applied in a wide variety of applications, such as video surveillance, traffic management and medical imaging [1][2]. For different applications, video processing can be divided into different branches. For example, in video surveillance, an important task is moving object detection. Traditionally, the critical task of safety monitoring is performed by human eyes, which is hard work for watchmen. Automatic detection of moving objects helps reduce the human workload in the monitoring system, although it cannot completely replace the human presence. Automatic detection means that, while video frames are captured by the camera, the shape of a moving object in a static environment is detected automatically and displayed on the monitor. After obtaining the moving object's information, the following task is to identify the object within a video frame. The detected object can be classified into different categories, which helps watchmen focus on the category of interest. The category information can be marked on the monitor by introducing different colors, figures or texts. In our project, the above two tasks, object detection and identification, are implemented on the TMS320DM6437 Evaluation Module. The overall flow chart is shown below.

In Fig 1, when the input frames start arriving from the video camera, the first step is to use the first three frames to create a clean background. Then object detection and identification are carried out for the following frames, as steps 2 to 5. Steps 1 and 2 are introduced in chapter 2.1. Step 3 is described in chapter 2.2. The object features for step 4 are presented in chapter 3.1, and step 5 is introduced in chapter 3.2. After implementing these steps on the TMS320DM6437 module, we use several objects to evaluate the performance of the object detection and identification, see chapter 4.1. In addition, we also observe the CPU load, which is recorded in chapter 4.2.

2 Object detection

The DM643x platform offers an interface in the framework through which we can access the input video stream frame by frame, with the gray-scale and chroma signals of each pixel. Thus, in detail, what we should do is process the image signals in each frame and send them to the output buffer.

Since we want to detect the shape of a moving object in a static environment and display it on the monitor, an efficient algorithm is needed. This algorithm includes two steps. First, we locate the position of the moving object in the current frame. Then an edge detection method is applied at that position to find the object's shape.


2.1 Moving object location

2.1.1 Image differential

Moving object detection is a complicated task in the video processing field. Here we use a simple method to identify the moving object. In this solution, the static background is recorded at the initialization of the program. During video processing, each frame in the buffer cache is compared to the original background image, pixel by pixel; all locations where the difference between the current pixel and the original pixel is greater than a default threshold are taken as the area of the moving object. The camera must therefore not be moved while the program is running, otherwise errors are introduced into the image differential result.

2.1.2 Noise reducing

However, in a real-time implementation, some pixels in a complex background are not stable but flicker between frames, especially pixels at high-contrast edges, which introduces unexpected areas into the differential result. Because the final function, object shape identification, needs the object edge image with a background as clean as possible, we need to reduce the noise points at the image differential step.

We found two ways to deal with the problem. First, at initialization, we save three different background image samples instead of a single one. Only when the current pixel differs from the pixels of all three images at the same location is it classified as part of the moving object. Second, we compare the mean value of the current pixel and its eight neighbors to the corresponding average at the same point of the original image, which reduces the error from single flickering pixels.
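A sketch of the resulting per-pixel test against the three stored backgrounds (function names are ours; the threshold of 15 levels is the one mentioned in the next section):

#define THRESH 15   /* difference threshold, in 0..255 levels */

/* Classify one pixel as moving (1) or background (0): it must
 * differ from all three stored background samples. */
static int is_moving(unsigned char cur, const unsigned char bg[3])
{
    int i;
    for (i = 0; i < 3; i++) {
        int d = (int)cur - (int)bg[i];
        if (d < 0) d = -d;
        if (d <= THRESH)
            return 0;       /* matches one background sample */
    }
    return 1;               /* differs from all three */
}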

The result is very good when the two methods are combined but, considering the amount of computation in the Canny edge detection step and the global CPU load, we finally skipped the averaging part to achieve a more acceptable speed.

2.1.3 Other optimizations

On the DM643x devices, we can use the FVID API provided by the device driver to access the frame pixels in an interleaved YCbCr 4:2:2 stream. For the edge detection and shape identification functions, we only need the gray-scale signal. In this step, however, with the purpose of achieving higher detection accuracy, we use both the gray and chroma signals for the image differential. With this, the threshold value in the differential can be lowered to 15 (in 256 levels), which is much more sensitive than the value of 80 used at the beginning, with hardly any extra noise involved. Furthermore, because what we need is just the shape of the moving object rather than the image in detail, we set the values of all object pixels to 255 and the background pixels to 0, which efficiently decreases the amount of computation in the edge detection.


The complete procedure of the moving object location is shown in figure 2.

2.2 Edge detection

After identifying the moving object's location, we need an edge detection algorithm to extract the object's shape. Many edge detection methods are available, such as Sobel, Roberts cross, Prewitt, and Canny detection. These methods share a similar principle: the image is convolved with a 2-D filter that detects large gradients, since zero is returned in uniform regions. Among these methods, Canny edge detection is the optimal one. As an example, we compare the Sobel and Canny detectors in a matlab simulation.

The original image is shown in Fig 3 (a). Comparing Fig 3 (b) and Fig 3 (c), which are the results of Sobel and Canny detection respectively, Canny provides a better shape and its edges are thinner. Although Sobel runs faster than Canny due to fewer convolutions, we still chose Canny for our object edge detection.

2.2.1 Canny edge detection

The Canny edge detection algorithm was proposed by John F. Canny in 1986 [3]. It involves several steps, as shown in Fig 4.

1) Image smoothing [4]

A 2-D Gaussian filter is applied to reduce noise and unwanted details and textures in the original image. Assume that f(m,n) is the input pixel, g(m,n) the output of the filter, and $G_\sigma(m,n)$ the 2-D Gaussian filter:

$g(m,n) = G_\sigma(m,n) * f(m,n)$ (1)

where

$G_\sigma(m,n) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{m^2+n^2}{2\sigma^2}\right]$ (2)

2) Gradient calculation

The gradient of g(m,n) is computed as

$M(m,n) = \sqrt{g_x^2(m,n) + g_y^2(m,n)}$ (3)

$\theta(m,n) = \arctan\left[g_y(m,n)/g_x(m,n)\right]$ (4)

Here $\theta(m,n)$ indicates the gradient direction.

3) Noise suppression

$M_T(m,n) = \begin{cases} M(m,n) & \text{if } M(m,n) > T \\ 0 & \text{otherwise} \end{cases}$ (5)


where T is chosen to make sure all edge elements are kept while most of the noise is suppressed.

4) Non-maximum suppression

Because the edges might be broadened in step 1), an additional suppression is executed to thin the edges. Each non-zero $M_T(m,n)$ is checked against its two neighbors along the gradient direction $\theta(m,n)$. If it is larger than both, $M_T(m,n)$ is kept unchanged; otherwise, it is set to zero.

5) Hysteresis thresholding

Hysteresis thresholding is employed to eliminate the breaking up of the object's shape caused by the output fluctuating around a single threshold. Two different thresholds $\tau_1$ and $\tau_2$ ($\tau_1 < \tau_2$) are used to filter $M_T(m,n)$ and obtain two binary images T1 and T2.

6) Edged image

This step produces continuous edges. First, each segment in T2 is traced to its end. Then its neighbors in T1 are searched for any edge segment that can bridge the gap, until another edge segment in T2 is reached.

2.2.2 Canny implementation in real time system

To facilitate the implementation of the Canny detection algorithm in the real-time system, several parts were modified, as listed below.

1) For the implementation of the 2-D Gaussian filter, a discrete approximation of the Gaussian filter is employed, giving a simple convolution mask like Fig 5. This mask is smaller than the original image, so it is slid over the image, manipulating a square of pixels at a time.

2) For the gradient calculation, a pair of 3x3 convolution masks is used. One estimates the gradient in the x-direction (columns) and the other estimates the gradient in the y-direction (rows). The two masks are shown in Fig 6.

After computing Gx and Gy, the gradient magnitude would normally be calculated using a square root. Because the square root operation takes a lot of computation time and reduces the system's performance, the magnitude is approximated by |G| = |Gx| + |Gy|.

3) For the noise suppression, the threshold T is not fixed but adapts to the maximum and minimum gradients according to T = α(Gmax − Gmin) + Gmin, where α is set to 0.1.

4) Considering the real-time constraints, we did not execute the final two steps, hysteresis thresholding and the edged-image step. A sketch of the modified gradient calculation and adaptive threshold follows.
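A sketch of items 2) and 3) combined, for one interior pixel of a gray-scale image (our own names; the masks are the common Sobel pair):

/* Approximate gradient magnitude at interior pixel (x,y), with
 * 1 <= x < width-1 and 1 <= y < height-1, using 3x3 Sobel masks
 * and |G| = |Gx| + |Gy| instead of a square root. */
int grad_mag(const unsigned char *img, int width, int x, int y)
{
    const unsigned char *p = img + y * width + x;
    int gx = -p[-width - 1] + p[-width + 1]
             - 2 * p[-1]    + 2 * p[1]
             - p[width - 1] + p[width + 1];
    int gy = -p[-width - 1] - 2 * p[-width] - p[-width + 1]
             + p[width - 1] + 2 * p[width]  + p[width + 1];
    if (gx < 0) gx = -gx;
    if (gy < 0) gy = -gy;
    return gx + gy;
}

/* Adaptive threshold from item 3): T = alpha*(Gmax-Gmin) + Gmin. */
int adapt_thresh(int gmax, int gmin)
{
    return (int)(0.1f * (gmax - gmin)) + gmin;
}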


2.3 Summary for object detection

In this chapter, section 2.1 describes how to locate the moving object's position. We then introduce the Canny edge algorithm, which is suitable for detecting the moving object's shape, in section 2.2.1. In addition, several parts of the Canny detection had to be adjusted to fit the real-time implementation; these modified parts are presented in section 2.2.2.

3 Object identification

For object identification, we first define three categories: rectangular, circular and others. Based on the object shape information extracted during object detection, the object is classified into one of these categories. Here we employ a simple classification method based on the features of rectangles and circles.

3.1 The three categories’ features

Considering the rectangular and circular shapes shown in Fig 7, they can be distinguished by their special properties.

For a rectangle, the four edges have constant slopes, and neighboring edges are perpendicular to each other. If the object consists of four edges that meet these criteria, we classify it as rectangular. For a circle, the radius is constant. The circle's center can be determined from the uppermost, lowermost, leftmost and rightmost points, marked as four black points in Fig 7 (b). Note that the upper and lower points should lie on the same vertical line, and the left and right points on the same horizontal line. The distance between the upper and lower points equals the distance between the left and right points, which is twice the radius. If all points on the edge are at the same distance from the center point, equal to the radius, the object is classified as circular. When the object has neither the rectangle nor the circle features, it is assigned to the third category, others.

3.2 Classification implementation in real time system

Because the above features are ideal and assume a complete, clean object shape, and we cannot obtain a perfect object shape from the preceding edge detection, some trade-offs have to be made when carrying out the classification in the real-time system. They are listed as follows.

1) Depending on the ambient light, the object's shadow or mirror image is sometimes also detected as part of the object's edges, which changes the object's shape. To avoid the shadow's effect, we derive the object's features only from the upper part. For example, if the upper two edges of


the object have constant slopes and are perpendicular to each other, the object is recognized as rectangular. For a circle-like object, the center is determined only from the upper, left and right points. If the points on the upper edges are at a constant distance equal to the radius, the object is identified as circular.

2) Because the points on the edge may fluctuate around the true shape, we set a tolerance range when judging constant slope and radius. For example, the slope between the upper and left points is taken as the slope of the left edge, Slope_left. We then calculate the slope between the upper point and the other points on this edge. If there are more than 10 points whose slope is in the range 0.95 · Slope_left to 1.05 · Slope_left, we declare that this edge has a constant slope. A sketch of this test is given below.
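A minimal version of the slope test (the array names and point representation are ours):

/* Test whether an edge has approximately constant slope: count
 * edge points (ex[i],ey[i]) whose slope to the top point (tx,ty)
 * lies within +/-5% of the reference slope slope_ref. */
int constant_slope(const int *ex, const int *ey, int n,
                   int tx, int ty, float slope_ref)
{
    int i, hits = 0;
    for (i = 0; i < n; i++) {
        float r;
        if (ex[i] == tx || slope_ref == 0.0f)
            continue;                /* avoid division by zero */
        r = ((float)(ey[i] - ty) / (float)(ex[i] - tx)) / slope_ref;
        if (r > 0.95f && r < 1.05f)
            hits++;
    }
    return hits > 10;                /* more than 10 matching points */
}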

In addition, to mark the category information on the monitor, different background colors indicate the different categories: blue is used for rectangular, green for circular, and black for others.

3.3 Summary for object identification

In this chapter, the first section describes the features of the three categories: rectangular, circular, and others. In the second section, we introduce some trade-offs made when implementing the object classification in the real-time system. Moreover, we use color as the category indicator when the object shape is displayed on the monitor.

4 Result assessment

After implementing the above algorithms on the TMS320DM6437 Evaluation Module, we use several objects to test them and observe the output on the monitor to assess the performance. At the same time, we also check the CPU load and execution time, since these algorithms are executed in a real-time system.

4.1 Evaluation for object detection and classification

To assess the performance of object detection and classification, we use the following objects: a mobile phone (black & white), a wallet (red), a ping-pong ball (white) and a lock (yellow). The mobile phone and the wallet should be classified as "rectangular", the ping-pong ball belongs to "circular", and the lock is in "others".

1) Object detection

When an object moves in front of the video camera, its shape is shown on the TV at the same time. In the assessment, we note that when the background is blue, the black mobile phone's shape is not complete, while the other objects have clean edges describing their shapes.


If the background is white, the ping-pong ball cannot be shown clearly on the TV. So the object detection is affected by the background color.

2) Object classification

After the moving object's shape is obtained, the classification is executed immediately. We can see the category's color on the TV almost at the same time as the object is displayed. If there are no noise points around the top part of these objects, they are identified correctly; otherwise they are sometimes assigned to a wrong category. So the classification is influenced by noise points from the previous stage.

4.2 Performance analysis for real time system

Before the detection and identification functions are started, the CPU load peaks at around 1.4%. After starting the two functions, no curve is shown in the CPU load graph, while many frames are missing on the output monitor. We therefore believe that the CPU load has reached 100%.

4.3 Assessment summary

In this chapter, we use several objects to test the detection and identification methods. The results show that the background color can affect the detection and that noise points may disturb the identification. Apart from these effects, the two functions work very well. In addition, when the system runs the above methods, the CPU load is full.

5 Conclusion and further work

In this project, we implement moving object detection and identification. For object detection, the Canny algorithm is executed, which outputs thin edges describing the object's shape. For identification, we define three categories: rectangular, circular and others, and the object is classified based on the categories' features.

Several objects were then used to evaluate the functions' performance. The moving object can be detected quickly and classified correctly, if one ignores the effects of background color and noise points. However, the CPU load is full when the system is running. As further work, the detection and identification functions can be optimized to reduce resource consumption. Moreover, the two functions can be improved to eliminate environmental effects, for example by introducing low-pass filtering or averaging, and by executing steps 5 and 6 of the Canny algorithm. In addition, special identification methods could be implemented for more categories, although how to handle the high CPU load remains a problem.


References

[1] I. Haritaoglu, D. Harwood and L. S. Davis, "Real-time surveillance of people and their activities," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, 2000, pp. 809-830.

[2] J.-B. Kim et al., "Wavelet-based vehicle tracking for automatic traffic surveillance," in Proc. IEEE TENCON, vol. 1, 2001, pp. 313-316.

[3] J. Canny, "A Computational Approach to Edge Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, 1986, pp. 679-714.

[4] http://en.wikipedia.org/wiki/Canny_edge_detector


Figure 1: Overall flow chart

Figure 2: images in each step of the moving object location


Figure 3: Comparison of Sobel and Canny detection

Figure 4: Steps of Canny edge detector

Figure 5: Discrete approximation of 2-D Gaussian filter (σ=1.4)


Figure 6: two masks for gradient calculation

Figure 7: Rectangular and circular


Part VI

Video Processing - Light Saber

Kashif, Raheleh, Shakir, Vineel

Abstract

The main aim of the project is to develop a framework to implement a light saber in video processing. The process is carried out in several steps, which include identifying a painted (dark-colored, e.g., green or blue) wooden stick in the video and adding red color to it. The added color is filtered, so that the video is free from noise. The project is implemented in two stages. First, still images are processed in MATLAB. Second, real-time video captured from the camera is processed on the DM6437 EVM, based on the TMS320DM6437 processor, using the Code Composer Studio 3.3 development tool. The processed video is then displayed on a TV which follows the PAL system.


1 Introduction

1.1 Development Kit

The DM6437 EVM is a PCI-based or standalone development platform that enables users to evaluate and develop applications for the TI DaVinci processor family. It is designed to work with TI's Code Composer Studio development environment. Code Composer communicates with the board through the embedded emulator or an external JTAG emulator.

1.2 MATLAB

MATLAB is a numerical computing environment which allows matrix manipulation, implementation of algorithms and plotting of functional data. It has optional toolboxes for specific applications. We have used features of the Image Processing Toolbox throughout the project.

1.3 Code Composer Studio

Code Composer Studio is an integrated development environment for developing DSP applications for the TMS320 DSP processor families. It includes a mini operating system called DSP/BIOS. CCS 3.3 supports components for the DaVinci platform. DaVinci processors are capable of processing multimedia applications, especially digital audio and video.

2 Background

2.1 Chroma Key

Chroma keying is a technique used in video and still photography to replace a portion of an image with a new image. It comes into play when a colored background is replaced by another. The technique is also referred to as color keying or color-separation overlay. The most commonly used colors in chroma keying are blue and green, since they have little effect on the foreground image. In the digital world, however, green has become the favored color because digital cameras retain more detail in the green channel and it requires less light than blue. Green not only has a higher luminance value than blue, but in early digital formats the green channel was also sampled twice as often as the blue, making it easier to work with. Although blue and green are the most common, any color can be used. Red is usually avoided due to its prevalence in normal human skin pigments.

2.2 Filtering

The image is processed to remove noise coming from the environment. This is performed using the morphological operators erosion


and dilation. Dilation, in general, causes objects to dilate or grow in size; erosion causes objects to shrink. The amount and the way that they grow or shrink depend on the choice of the structuring element. Dilating or eroding without specifying the structuring element makes no more sense than trying to lowpass filter an image without specifying the filter. The two most common structuring elements on a Cartesian grid are the 4-connected and 8-connected sets, N4 and N8.
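As an illustration, binary erosion with the 8-connected 3x3 structuring element can be sketched as follows (our own minimal version, ignoring the image borders):

/* Erode a binary image (0/255) with a 3x3 structuring element:
 * an output pixel stays 255 only if all 9 neighbors are 255. */
void erode3x3(const unsigned char *in, unsigned char *out,
              int w, int h)
{
    int x, y, dx, dy;
    for (y = 1; y < h - 1; y++)
        for (x = 1; x < w - 1; x++) {
            unsigned char v = 255;
            for (dy = -1; dy <= 1; dy++)
                for (dx = -1; dx <= 1; dx++)
                    if (in[(y + dy) * w + (x + dx)] == 0)
                        v = 0;
            out[y * w + x] = v;
        }
}

Dilation is the dual operation: a pixel becomes 255 if any neighbor is 255.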

2.3 PAL System

PAL, phase alternating line, is an analogue television encoding system, generally referred to as the 625-line/50 Hz system with 25 frames per second. Luminance Y is derived from the red, green, and blue (R'G'B') signals as

Y = 0.299R′ + 0.587G′ + 0.114B′ (1)

CB = −0.168736R′ − 0.331264G′ + 0.5B′ (2)

CR = 0.5R′ − 0.418688G′ − 0.081312B′ (3)

2.4 Color space representation

YCbCr is a family of color spaces used as a part of the color image pipeline in video and digital photography systems. Y is the luma component, whereas CB and CR are the blue-difference and red-difference chroma components. YCbCr is not an absolute color space; it is a way of encoding RGB information. The actual color displayed depends on the actual RGB colorants used to display the signal. Therefore, a value expressed as YCbCr is only predictable if standard RGB colorants or an ICC profile are used.

3 Implementation

The ideas in the earlier stages of the project come from color keying. We first implemented the processing of still photographs with a colored background in MATLAB, and secondly optimized the code for the real-time video preview. The output of the camera is given to the EVM DM6437 and serves as the real-time input signal for the video processing; the processed video is then displayed on a TV which follows the PAL system, as shown in figure 1.

3.1 MATLAB

The proof of concept for the whole project was done in Matlab by capturing real images from the camera, converting the images to YCbCr format (using the equations mentioned above) and then assessing the data to construct thresholds on Y, Cb and Cr.


Figure 1: Block diagram of the system


Figure 2: Original image

The effect of the morphological operators was also tested and, based on the obtained results, the work flow for the implementation in Code Composer was designed.

3.2 Code Composer Studio

The tested design was implemented in Code Composer Studio. The framework for accessing pixels was provided by the instructors, and some bugs in the pixel access framework were fixed. Currently, green and blue colors are detected and replaced by red. Morphological operators are applied to provide a glow effect for the sword. Due to the huge computational power required for the glow effect, we had to constrain ourselves to small filters, as the CPU load otherwise increased manyfold.
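A sketch of the per-pixel detect-and-replace step in YCbCr space is shown below. The threshold values are illustrative placeholders, not the tuned values used in the project:

/* Replace green/blue-ish pixels with red, in YCbCr space.
 * Blue backgrounds have high Cb and low Cr; green has low Cb
 * and low Cr.  Thresholds below are placeholders. */
typedef struct { unsigned char y, cb, cr; } ycbcr_t;

void key_to_red(ycbcr_t *p)
{
    int is_blue  = p->cb > 150 && p->cr < 120;
    int is_green = p->cb < 110 && p->cr < 110;
    if (is_blue || is_green) {
        /* pure red in 8-bit BT.601 YCbCr: Y=81, Cb=90, Cr=240 */
        p->y  = 81;
        p->cb = 90;
        p->cr = 240;
    }
}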

4 Simulation results

The efforts carried out in implementing the key concepts in the earlier stages gave the results shown in figures 2 to 5.

5 Conclusion and Future work

There are many possible future improvements to our design. The biggest one would be to incorporate sound, and perhaps to play different sounds when the swords are swirling, striking together, etc.


Figure 3: The image with thresholds (R, G and B channels before and after thresholding)

Figure 4: Image after erosion


Figure 5: Final image



Part VII

Reverberation

Syed Zaki Uddin, Farooq Anwar, Mohammed Abdul Aziz, Kazi Asifuzzaman

Abstract

Reverberation is an audio effect that retains the original sound as echoes after the original sound has ended, creating a smooth, soft sound instead of a dull, dry one. Reverberation is a natural process which creates thousands of echoes of the original sound with decreasing amplitude as it reflects from various surfaces, creating a spacious sound instead of a confined one. In most audio recording studios, artificial reverberation is created to achieve high-quality reverberation effects. Our project deals with creating an artificial reverberation effect to smooth a dry sound in real time. To create the artificial reverberation, we have chosen the Jot reverberation algorithm, which uses a Feedback Delay Network (FDN) to create a high-quality reverberation effect. We have designed our reference model in Matlab and implemented it in the C language for real-time data processing. All digital signal processing has been carried out on a Texas Instruments TMS320C6713 Digital Signal Processing Kit (DSK). Along with a good reverberation algorithm, there is a great need for calculating correct coefficient values to achieve an effective reverberation effect, as these coefficients control the impulse response. The quality of our reverberator is very good, and we have successfully simulated our design for different room sizes. This project is a study of the Jot reverberator, and it provides good knowledge of real-time digital signal processing.


1 Introduction

1.1 Introduction to Reverberation

Reverberation (reverb) is a physical phenomenon, clearly noticeable in an enclosed space, where echoes with decaying amplitude can be perceived after the occurrence of the original sound. Reverberation is a set of reflections with decaying amplitude, together with the original sound. These reflections are produced until the reflection amplitude decays to zero, i.e., no sound can be heard.

1.2 Characteristics of Reverberation

A reverberated signal consists of three different parts: the direct signal, the early reflections and the late reflections. Reverberation can be characterized by many parameters, such as [2]:

• Envelopment and presence (ratio between direct sound, early reflections and late reflections)

• Brilliance, warmth and source proximity (energy and spectrum of direct sound and early reflections)

• Running reverberance (decay of early reflections)

• Late reverberance (decay of late reflections)

• Heaviness and liveness (variation of decay time with frequency)

• Clarity (ratio of impulse-response energy in early reflections to that in the late reverb).

In order to design a good reverb effect, certain parameters need to be considered [1]:

• Source distance

• Reverberation Time t60

• Room Proportions

• Wall reflectivity

• Air Absorption

• Wall diffusion


1.3 Simulation of Reverberation

A reverberated signal consists of three different parts: the direct signal, the early reflections and the late reflections. The sound that reaches the destination directly from the source is called the direct signal. The sound that reflects from a surface and reaches the destination after a few milliseconds (approximately 60-100 ms) is called an early reflection. After the early reflections, a lot of echoes arrive at the destination at an exponentially increasing rate; these echoes are called the late reverberation. Late reverberation is a statistical process rather than echoes produced at specific time delays. Simulating reverberation in real time without approximating the system is quite difficult, since it would require thousands of poles and zeros to be realized, which is impossible within the limited processing time available for the algorithm. The reverberation system is therefore approximated in order to simulate a real-time reverberation effect. Early reflections are simulated using a finite impulse response (FIR) filter, and late reflections using first-order low-pass infinite impulse response (IIR) filters. The late-reverberation impulse response resembles Gaussian white noise with an exponentially decaying envelope; thus a synthesized impulse response for efficient reverberation can be obtained by convolving the dry signal (original sound) with such a noise signal, creating the reverberated, or wet, signal.

2 Reverberator types

Many reverberators exist for obtaining the reverberation effect. The four famous reverberator types used throughout the history of reverberation modeling are discussed here: the reverberation chamber, the plate reverberator, the spring reverberator and the digital reverberator.

2.1 Reverberation Chamber

The reverberation chamber was the prominent technique for producing the reverberation effect before other techniques were developed. Such a chamber is used in studios to produce reverberation from the surfaces of the chamber, which add the reverberation effect to the dry signal.

2.2 Plate Reverberator

The plate reverberator originated in 1954 and was the first electromechanical reverberator. Reverberation is obtained by driving the plate with the dry signal through a generator, and the reverberated signal is collected using a transducer.


2.3 Spring Reverberator

This type of reverberator is simple and most commonly used in guitar amplifiers because of its low cost. Reverberation is created with a coil and two transducers, one at each end of the coil. The length and tension of the coil affect the reverberation time.

2.4 Digital Reverberator

The digital reverberator is the most common reverberator used nowadays, as it exhibits several advantages: smaller size, lower cost and more flexibility. Digital reverberators are discussed in more detail hereafter.

3 Reverberation Modeling

Reverberation is modeled separately for the direct signal, the early reflections and the late reflections. It is modeled according to the physical impulse response of a room, i.e., all design parameters of the reverberator depend on the room's physical impulse response.

3.1 Impulse Response

The impulse response of a real room can be measured using two famous methods: the image-source method and the ray-tracing method. These methods give the impulse response of the room, which usually consists of early reflections and late reflections, modeled accordingly.


Figure 1: Impulse Response


3.2 Early Reflection Modeling

Early reflections are modeled using a finite impulse response filter together with a set of delay lines, which is the most adequate way to model them. Clusters are used to model the reflections between the early and late reflections. In the Jot reverberator, a cluster is a multiplication matrix with a group of delays designed to model these intermediate reflections.

3.3 Late Reflection Modeling

The late reverberation is the tougher part to model. It is modeled using first-order low-pass infinite impulse response filters H(i). The impulse response of the late reverberation resembles Gaussian white noise with an exponentially decaying envelope [1]; thus low-pass filters with decaying gain are used to produce an exponentially decaying, noise-like impulse response.

Transfer function of the low-pass filter:

$H_i(z) = g_i \, \frac{1 - a_i}{1 - a_i z^{-1}}$ (1)

3.4 Reverberation Algorithms

Reverberation algorithms are the basic techniques invented to attain the desired reverberation effect.

3.4.1 Schroeder Algorithm

Schroeder was one of the early pioneers of reverberation algorithms. Even though his theories have become a little dated, most modern reverberation theories have received at least a partial contribution from his work. The algorithm Schroeder suggested has four parallel comb filters followed by two allpass filters in series [1]. Schroeder's algorithm was a considerable achievement in the development of reverberation algorithms, but it now falls behind in comparison with modern reverberation algorithms.

3.4.2 Moorer Algorithm

Moorer further improved Schroeder's algorithm by substituting the feedback gains with first-order recursive lowpass filters that simulate air absorption. This improvement made the sound much more realistic than before [1].


3.4.3 Gardner Algorithm

To improve the analytical part of the design, Gardner used a lowpass filter in a feedback loop to account for the frequency-dependent behavior of a real room. The filters are implemented as second-order recursive filters to provide a variable cut-off frequency [1].

3.4.4 Dattorro Algorithm

Dattorro introduced the idea of constructing the major part of the reverberator from allpass filters, which can be combined with time-varying delay lines so that delay can be imposed on the output when needed [1].

3.4.5 Jot Algorithm

The Jot reverberation algorithm is one of the best reverberation algorithms to date. Since we have implemented the Jot reverberator, we discuss it in more detail below.

4 Reverberator Design

4.1 Jot Reverberation Algorithm

The Jot reverberation algorithm was introduced by Jean-Marc Jot in the early nineties. He used feedback delay networks (FDNs) to create the reverberation effect, and it is considered one of the most efficient reverberation algorithms. Jot makes use of a single feedback matrix A, digital delay lines, first-order low-pass infinite impulse response (IIR) filters H(i) and a tonal correction filter t(z) to produce an excellent reverberation effect. The feedback matrix A is used to spread the energy among the N delay lines and is always a unitary matrix, to ensure stability. A unitary matrix A satisfies the condition that the transpose of A (A') equals the inverse of A, inv(A). A unitary matrix is a lossless feedback matrix with all of its poles on the unit circle; it thus provides stability and produces maximum echo density if there are no null coefficients. Many matrix designs satisfy the unitarity condition; we have implemented the unitary matrix as a Householder matrix to produce a lossless feedback network. In the Jot reverberation algorithm, the input samples are fed to the feedback delay network in series with the low-pass IIR filters; the outputs of the low-pass filters are passed through the correction filter and finally summed together with the original sound to produce the reverberated sound.
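One sample of such an FDN with a 4x4 Householder feedback matrix, A = I − (2/N)·11ᵀ, can be sketched as follows. The delay lengths and per-line gains below are illustrative; in the full Jot structure the scalar gains are replaced by the low-pass filters H(i), and the output passes through the tonal correction filter t(z):

#define N_LINES 4

static float dline[N_LINES][2048];                      /* delay memories   */
static const int   dlen[N_LINES] = {1031, 1327, 1523, 1871}; /* co-prime    */
static int         dpos[N_LINES];
static const float g[N_LINES] = {0.85f, 0.83f, 0.81f, 0.79f}; /* decay gains */

/* Process one input sample through the FDN; returns the wet output. */
float fdn_tick(float in)
{
    float s[N_LINES];
    float out = 0.0f, sum = 0.0f;
    int i;

    for (i = 0; i < N_LINES; i++) {     /* read the delay-line outputs */
        s[i] = dline[i][dpos[i]];
        sum += s[i];
        out += s[i];
    }
    sum *= 2.0f / N_LINES;              /* Householder: (A*s)_i = s_i - (2/N)*sum */
    for (i = 0; i < N_LINES; i++) {
        dline[i][dpos[i]] = in + g[i] * (s[i] - sum);
        dpos[i] = (dpos[i] + 1) % dlen[i];
    }
    return out;
}

Because A is unitary, the feedback loop is lossless on its own; all decay is supplied by the gains (or, in the real design, the low-pass filters), which keeps the structure stable by construction.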

4.2 Coefficients calculation

Along with a good reverberation algorithm, correct coefficient calculation plays a major role in designing a good reverberator.



Figure 2: Jot Reverberator Block Diagram


The desired impulse response for the reverberation effect depends on the following parameters:

4.2.1 Modal Density (Dm(f))

Modal density is defined as the number of modes per hertz, which can be calculated as [1]:

$D_m(f) = \frac{4\pi V f^2}{c^3}$ modes/Hz (2)

where a mode is an exponentially decaying sinusoid, f is the frequency of the mode, V is the volume of the room and c is the speed of sound. Modal density is generally needed when describing late reverberation.
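As a quick worked example (the values are chosen purely for illustration), a room of volume V = 100 m³, evaluated at f = 1 kHz with c = 343 m/s, gives

$$D_m(1000) = \frac{4\pi \cdot 100 \cdot 1000^2}{343^3} \approx 31\ \text{modes/Hz}$$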

4.2.2 Frequency and Time Density

Frequency and time density define the quality of the perceived reverberation effect in the frequency and time domains, respectively.

4.2.3 Echo Density (De)

Echo density is the average number of pulses per second; it is independent of the size of the room. Echo density can be calculated as

$$D_e = \frac{4\pi c^3 t^2}{V}\ \text{reflections/s} \qquad (3)$$

4.2.4 Energy Decay Curve(EDC)

The energy decay curve is obtained by integrating the squared impulse response from a given time to infinity. Many perceptual quantities used in reverberator design can be derived from this curve.

4.2.5 Reverberation Time (Tr)

Reverberation time is the most important parameter when designing a reverberator. It defines how long the original sound remains audible, i.e., the time it takes the sound to decay to a certain limit. Theoretically, the reverberation time is defined as the time required for the impulse response (equivalently, the sound intensity) to attenuate by 60 dB. The reverberation time depends on factors such as the volume of the room (V), the absorption coefficient (s) and the surface area (A). It can be approximated by the Sabine formula as

$$T_r = \frac{0.163\,V}{sA}\ \text{s} \qquad (4)$$
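For example (illustrative numbers, not taken from the project): a room with V = 100 m³, surface area A = 90 m² and average absorption coefficient s = 0.3 gives

$$T_r = \frac{0.163 \cdot 100}{0.3 \cdot 90} \approx 0.60\ \text{s}$$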


The Sabine formula is, however, not accurate at high frequencies, so C. F. Eyring improved it for higher frequencies as

$$T_r = \frac{0.163\,V}{-A\ln(1-s) + 4\delta V}\ \text{s} \qquad (5)$$

where δ is the attenuation constant of air.

Reverberation Time can also be found as a function of attenuation:

$$T_r = \frac{-3T}{\log\gamma}\ \text{s} \qquad (6)$$

where γ is the attenuation per sample period, i.e., the attenuation applied every sample, fs is the sampling frequency and T = 1/fs.

The reverberation time can also be determined as the time required for the EDC curve to decay by 60 dB.

4.2.6 Energy Decay Relief (EDR)

The energy decay relief curve is calculated as an inverse-time integration of a squared time-frequency distribution. The EDR curve displays the distribution of the energy components of the impulse response.

4.3 Designing Jot Reverberator

Designing a Jot reverberator is simple, but the parameter calculations are harder and require correct values to obtain a good reverberation effect. These parameter calculations and their effects on the reverberation impulse response are discussed hereafter.

4.3.1 Delay Length (M)

The delay length (M) is defined as the total delay required to obtain the estimated reverberation. These delays are components of the FDN and produce the desired latency. The delay length M can be found as

$$M = 0.15\, f_s\, t_{60} \qquad (7)$$

where fs is the sampling frequency and t60 is the reverberation time. In order to design an efficient reverberator, the total delay length M should never exceed the sum of the individual delay lengths, i.e.,

$$M \le \sum_{i=1}^{N} M_i \qquad (8)$$


The individual delay lengths should always be mutually co-prime in order to avoid ringing. If a common divisor exists, energy piles up at a particular frequency, which results in uneven noise.

Delay lengths play an important part in the feedback delay network and in modelling the early reflections; the delay length provides the desired echo time.
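As a minimal sketch (our own helper, not part of the project code), mutual co-primality of candidate delay lengths can be verified with a gcd test. The example values assume fs = 48 kHz and t60 = 0.6 s, giving M = 0.15 · 48000 · 0.6 = 4320 samples:

    /* Check that a set of FDN delay lengths is pairwise co-prime,
     * so that no two delay lines share a common period. */
    static unsigned gcd(unsigned a, unsigned b)
    {
        while (b != 0) {
            unsigned t = b;
            b = a % b;
            a = t;
        }
        return a;
    }

    static int mutually_coprime(const unsigned *m, int n)
    {
        int i, j;
        for (i = 0; i < n; i++)
            for (j = i + 1; j < n; j++)
                if (gcd(m[i], m[j]) != 1)
                    return 0;   /* two lines share a common period */
        return 1;
    }

    /* Five primes near M/N = 4320/5; their sum 4387 >= M = 4320,
     * and primes are trivially pairwise co-prime. */
    static const unsigned Mi[5] = { 743, 809, 877, 937, 1021 };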

4.3.2 First order low pass Filter H(i)

A first-order lowpass Infinite Impulse Response (IIR) filter is used in the Jot reverberation algorithm to model the late reflections. The filter absorbs high frequencies, and its gain describes the attenuation of the impulse response. The transfer function of the lowpass filter is given as

$$H_i(z) = g_i\,\frac{1 - a_i}{1 - a_i z^{-1}} \qquad (9)$$

where

$$g_i = 10^{-3 M_i T / t_{60}(0)} \qquad (10)$$

$$a_i = \frac{\ln(10)}{4}\,\log_{10}(g_i)\left(1 - \frac{1}{\alpha^2}\right) \qquad (11)$$

where

$$\alpha = \frac{t_{60}(\pi/T)}{t_{60}(0)} \qquad (12)$$
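A small sketch of how equations (10)-(12) might be turned into code (the function and variable names are ours): t60_dc is the reverberation time at DC, t60_ny the reverberation time at the Nyquist frequency π/T, Mi the delay length of line i in samples, and T = 1/fs.

    #include <math.h>

    static void lowpass_coeffs(float Mi, float T,
                               float t60_dc, float t60_ny,
                               float *gi, float *ai)
    {
        float alpha = t60_ny / t60_dc;                  /* eq. (12) */

        *gi = powf(10.0f, -3.0f * Mi * T / t60_dc);     /* eq. (10) */
        *ai = (logf(10.0f) / 4.0f) * log10f(*gi)
              * (1.0f - 1.0f / (alpha * alpha));        /* eq. (11) */
    }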

4.3.3 Effects of Matrix B and C

Here, matrix B is used to attain the desired amplitude of the early reflections, and matrix C decides the desired magnitude of the output signal. The B coefficients are chosen as

$$b_i = 1 - g_i^2 \qquad (13)$$

and the C matrix coefficients can be chosen as

$$c_i = 1 - 2^{-N} \qquad (14)$$

where N is the word length of the processor.


4.3.4 Tonal Correction Filter

The tonal correction filter E(z) equalizes the mode energy independently of the reverberation time [1].

The transfer function of the tonal correction filter E(z) is

$$E(z) = \frac{1 - b z^{-1}}{1 - b} \qquad (15)$$

where

$$b = \frac{1 - \alpha}{1 + \alpha} \qquad (16)$$

and

$$\alpha = \frac{t_{60}(\pi/T)}{t_{60}(0)} \qquad (17)$$
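A corresponding sketch for one step of the tonal correction filter, again with names of our own choosing:

    /* Sketch of equations (15)-(16): one step of
     * E(z) = (1 - b z^-1) / (1 - b), with b = (1 - alpha)/(1 + alpha)
     * and alpha = t60(pi/T) / t60(0) as in eq. (17). */
    static float tonal_correction(float x, float *x_prev, float alpha)
    {
        float b = (1.0f - alpha) / (1.0f + alpha);   /* eq. (16) */
        float y = (x - b * (*x_prev)) / (1.0f - b);  /* eq. (15) */
        *x_prev = x;
        return y;
    }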

5 Realtime Reverberation Implementation

The exact method to implement real-time reverberation is to take the impulse response of the room or environment and convolve it with the original signal. This approach is computationally hungry and takes a lot of processor power. Different techniques have therefore been developed to synthesize the environment with efficient use of the hardware, as described in the previous sections. The Jot model is one of the most appropriate techniques and is therefore used for this implementation. The Jot model not only simulates a synthetic environment efficiently but also gives flexibility in the environment setup, since there are a lot of parameters that can be given different values.

The Jot model is easy to implement, but the parameter calculation is rather complex. Therefore, the parameter calculation for this implementation was not done on the DSP.

5.1 Implementation Model

The reverberator model implemented on the DSP is equivalent to the traditional Jot model but contains three networks. The networks in the traditional Jot model are

1. Feed-back delay network: to synthesize the late reverberation.

2. Feed-forward network: to render the early reflections.

The feed-forward network in this model does not implement an exact synthetic model of the environment, so a third network was added that consists of a convolution model for the early reflections. The convolution model


gives a good and clear rendering of the environment, depending on the number of coefficients. The equivalent Jot implementation consists of the following blocks:

• CBUFF: circular buffer blocks that add the appropriate delay.

• H(z): one-pole IIR filter.

• A: 5x5 A matrix.

• T(z): tonal correction FIR filter.

• Conv: FIR-based convolution module for the early reflections.

The block diagram can be seen in Figure 3.


Figure 3: DSP Implementation of Real-Time Reverberation

5.2 CBUFF

The circular buffer is one of the important parts of the implementation. The traditional Jot reverberator uses shift registers; using circular buffers instead gave the flexibility to implement the delays on the DSP with less complexity.

The reverberator runs at a sampling frequency of 48 kHz, so the delays are large and can range up to 20000 samples for a single buffer. The CBUFF was therefore implemented in external memory using _EXTERNALHEAP, which made it possible to implement many large delays without any problem.

The implementation of the CBUFF block is shown in Figure 4.


Figure 4: Circular buffer implementation

The circular buffer block above gives a sequential implementation of the iterative delay modules of the Jot reference model.
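A minimal circular-delay-line sketch (a hypothetical stand-in for the project's CBUFF block, assuming the buffer has already been allocated from the external heap):

    typedef struct {
        float   *buf;    /* storage, allocated from _EXTERNALHEAP   */
        unsigned len;    /* delay length in samples (up to ~20000)  */
        unsigned idx;    /* current read/write position             */
    } cbuff_t;

    static float cbuff_step(cbuff_t *c, float in)
    {
        float out = c->buf[c->idx];   /* oldest sample = delayed output */
        c->buf[c->idx] = in;          /* overwrite with the new sample  */
        if (++c->idx == c->len)       /* wrap instead of shifting data  */
            c->idx = 0;
        return out;
    }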

5.3 H(z)

The one-pole IIR filter is implemented using just one multiplier and one subtractor, making it one of the cheapest blocks in the implementation. The implemented equation is shown below:

sum = gain * (1 - a) * x + a * y_prev;  (18)

The parameter calculation for this block is very important: since this is a single-pole filter, it can easily become unstable if the pole moves outside the unit circle. This filter is also important in the sense that it defines the reverberation time for a single delay element, i.e., for the data coming from CBUFF into the H(z) block.
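Equation (18) translates directly into a one-liner; the sketch below is ours and simply restates the formula:

    /* y[n] = gain * (1 - a) * x[n] + a * y[n-1]; keep 0 <= a < 1,
     * otherwise the single pole leaves the unit circle. */
    static float onepole_lowpass(float x, float *y_prev, float gain, float a)
    {
        float y = gain * (1.0f - a) * x + a * (*y_prev);
        *y_prev = y;
        return y;
    }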

5.4 A Matrix

One of the most important parts of the Jot system is the A matrix. It is implemented using straightforward matrix multiplication. The size and parameters of the A matrix decide the performance of the system. The A matrix in the real-time Jot implementation is a 5x5 matrix; this is the maximum matrix size that could be run recursively on the DSP without optimization. The A matrix is given below.

static float A[5][5] = { 0.2818, -0.2818, -0.2818, -0.2818, -0.2818,
                        -0.3103,  0.3103, -0.3103, -0.3103, -0.3103,
                        -0.3374, -0.3374,  0.3374, -0.3374, -0.3374,
                        -0.3475, -0.3475, -0.3475,  0.3475, -0.3475,
                        -0.3825, -0.3825, -0.3825, -0.3825,  0.3825};

In the optimized implementation, the matrix multiplication was improved by splitting A into two matrices: a 5x5 matrix of positive and negative ones, and a 5x1 matrix holding the floating-point coefficient of each row, as shown below.

static short A[5][5] = { 1, -1, -1, -1, -1,
                        -1,  1, -1, -1, -1,
                        -1, -1,  1, -1, -1,
                        -1, -1, -1,  1, -1,
                        -1, -1, -1, -1,  1};

static float Acoef[5] = {0.2818, 0.3103, 0.3374, 0.3475, 0.3825};

The complexity of this module was thereby reduced to a single floating-point multiplication per row: depending on the sign in the A matrix, the input is simply added or subtracted. The computation is shown below.

$$y_i = A_{\text{coef}}(i)\sum_{j=1}^{N} A_i(j)\,X(j), \qquad i = 1,\ldots,5$$
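A sketch of the optimized multiply (our own code, following the description above): the ±1 pattern is applied with additions and subtractions, and only one floating-point multiplication per row remains.

    static void a_matrix_mul(const short A[5][5], const float Acoef[5],
                             const float x[5], float y[5])
    {
        int i, j;
        for (i = 0; i < 5; i++) {
            float acc = 0.0f;
            for (j = 0; j < 5; j++)
                acc += (A[i][j] > 0) ? x[j] : -x[j];  /* +/-1 entries   */
            y[i] = Acoef[i] * acc;                    /* one multiply   */
        }
    }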

5.5 T(z)

This is a simple FIR filter with one coefficient. The implementation consists of two multipliers and one adder and is given below:

sum = gain * (1 - bcor) * (xnewcor - bcor * xprevcor);  (19)


5.6 Conv

This is an FIR filter consisting of 20 coefficients, where the coefficients are the impulse response of the environment. This module is iterative rather than sequential, so it is computationally complex and takes more processor time. The implementation of this block is shown in Figure 5.
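A direct-form sketch of such a 20-tap convolution (illustrative code; the real module would use the measured impulse response of the environment as h[]):

    #define CONV_TAPS 20

    static float conv_step(const float h[CONV_TAPS],
                           float state[CONV_TAPS], float in)
    {
        float acc = 0.0f;
        int i;

        for (i = CONV_TAPS - 1; i > 0; i--)   /* shift the delay line */
            state[i] = state[i - 1];
        state[0] = in;

        for (i = 0; i < CONV_TAPS; i++)       /* multiply-accumulate  */
            acc += h[i] * state[i];
        return acc;
    }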

Figure 5: Convolution Module

6 Results

The DSP implementation gave a good reverberation time and, after optimization, used 50 percent of the CPU power, as shown in Figure 8. The impulse response of the reverberator is given in Figure 6 and the frequency response in Figure 7.


Figure 6: Impulse Response

Figure 7: Frequency Response (single-sided amplitude spectrum of y(t); |Y(f)| vs. frequency in Hz)


Figure 8: CPU Load graph from Code Composer Studio


References

[1] O. Lilja, Algorithm for Reverberation - Theory and Implementation, Master's thesis, LTH, 2002.

[2] J.-M. Jot, O. Warusfel, "A real time spatial sound processor for music and virtual reality applications", Proc. 1995 ICMC, 1995.

[3] Reverberation, http://en.wikipedia.org/wiki/Reverberation


Part VIII

Pitch Estimation/Singstar

Anil kumar Metla, Anusha Gundarapu, Yaoyi Lin

Abstract

A pitch estimator is a software algorithm or hardware device that takes a sound signal as its input and attempts to determine the fundamental pitch period of that signal. Our project goal is to implement an algorithm in C code that estimates the pitch of an audio input signal. The C code is loaded onto a DSP board that runs the pitch estimation algorithm in real time and outputs notes to a file. We use the cepstrum algorithm for the frequency estimation to obtain a robust pitch estimate. The pitch estimation can be used for various musical applications.


1 Introduction

The project Pitch Estimation/Singstar involves the implementation of a pitch estimation algorithm for a song or a specific instrument on the Texas Instruments DSK C6713 development platform. A pitch estimation algorithm continuously calculates the pitch of the input signal. Singstar is an application of pitch estimation in which the pitch of the original song and of the same song sung by a person are estimated and compared.

Section 1.1 introduces pitch and the applications of pitch estimation. Section 1.2 introduces the project.

1.1 Pitch

By definition, pitch [1] is the perceived fundamental frequency of a sound. It is one of the four major auditory attributes of sound, along with loudness, timbre and sound source location. A pitch estimator attempts to find the frequency that a human listener would agree has the same pitch as the input signal. Pitch estimators can be successful only on a limited corpus of sounds; it makes no sense to attempt to find "the pitch" of a noisy percussive sound such as a cymbal crash, brief impulses, low rumblings or complex sound masses.

Pitch estimation has several applications [3]. One common application is speech analysis. In many musical applications, such as live performance, the job of a pitch detector is to ignore small variations in the performed pitch and locate the central pitch. Another application is in sound transformation: sound-editing programs often include pitch estimation routines that are used as a guide for pitch-shifting and time-scaling operations. Another studio-based application is in transcribing a solo played on an acoustic instrument such as a piano. Advanced procedures, like the separation of two simultaneous voices, start with pitch detection. There are various pitch estimation algorithms using different methods.

1.2 Introduction to the project

The project involves pitch estimation and the extension of the concept to the Singstar application. Fig. 1 gives the overall block diagram of the project.

The input sound signal is fed to the C6713 running the pitch estimation algorithm, and the output of the DSP is sent to a PC running a graphical tool that takes the pitch of the original song and of the singing person and plots them on a continuous graph.

In order to accomplish the goal of the project, a project plan was made; it is divided into 4 phases and is shown in Fig. 2.

Chapter 2 gives an introduction to various pitch estimation algorithms and explains the algorithms selected for implementation in this project in detail.


Figure 1: Block Diagram of project


Figure 2: Project flow graph

Chapter 3 explains the implementation details of the selected algorithms and the Singstar application extension, and Chapter 4 gives the results of the project.

2 Pitch Estimation Algorithms in Theory

2.1 Introduction

Pitch estimation is a non-trivial problem, and many algorithms, both simple and complex, have been developed. The methods for pitch detection can be classified into five general categories [2]:

1. Time domain

2. Auto-correlation

3. Adaptive filter

4. Frequency domain

5. Models of the human ear.

Each category is explained briefly below. The choices for implementation in the current project are cepstrum and autocorrelation, which are explained in detail in Sections 2.2 and 2.3.

2.1.1 Time domain Fundamental Period Pitch Detection :

Fundamental period methods look at the input signal as fluctuating amplitude in the time domain and try to find repeating patterns in the waveform that give clues to its periodicity. One way of finding the periodicity is by looking for repeating zero crossings. A zero crossing is a point where the waveform's amplitude goes from positive to negative or vice versa. In general, these time domain algorithms are easy to implement and inexpensive, but they are also less accurate.


2.1.2 Auto-correlation Pitch Detection :

Correlation functions compare two signals; the goal of a correlation routine is to find the similarity between them. Autocorrelation methods compare a signal with versions of itself delayed by successive intervals to find repeating patterns. Autocorrelation is accurate at mid-to-low frequencies and is relatively easy to implement, which made it a choice for implementation in our project. Autocorrelation is explained in detail in Section 2.3.

2.1.3 Adaptive filter Pitch Detectors :

An adaptive filter tunes itself depending on the input signal. One pitch detection strategy using adaptive filters involves sending a signal into a narrow band-pass filter. Both the unfiltered input signal and the filtered signal are routed to a difference detector circuit, whose output is fed back to control the band-pass filter's center frequency. This control forces the band-pass filter to converge to the frequency of the input signal. The convergence test measures the difference between the filter output y(n) and the filter input x(n); when the difference is close to zero, the system makes a pitch decision. There are other adaptive filter techniques, such as the optimum comb method. Adaptive filter techniques are computationally expensive and take time to converge, for which reason they were not considered for implementation in this project.

2.1.4 Frequency Domain Pitch Detectors :

Frequency domain pitch detection methods dissect the input signal into the frequencies that make up the overall spectrum. The spectrum shows the strength of the various frequency components contained in the signal, from which the pitch can be isolated. There are various frequency domain methods, of which the short-time Fourier transform and the cepstrum are the most popular, though other methods such as tracking phase vocoder (TPV) analysis also exist. Cepstrum analysis can, in simple words, be described as the inverse Fourier transform of the logarithm of the absolute value of the Fourier transform of a signal; the pitch is found as a peak in the graph of the cepstrum. The cepstrum is explained in detail in Section 2.2.

2.1.5 Pitch Detection based on models of the ear :

These models combine algorithms based on perception theories with modelsof known mechanisms of the human auditory system. These Pitch detectionmethods are complex to implement and are out of scope for this project.


2.2 Cepstrum

A common frequency-domain pitch detection method in speech research is the cepstrum technique. The term cepstrum was formed by reversing the first four letters of spectrum. The cepstrum tends to separate a strong pitched component from the rest of the spectrum. Technically, the cepstrum is the inverse Fourier transform of the log magnitude Fourier spectrum, i.e., of the logarithm of the absolute value of the output of the discrete Fourier transform. The result of the cepstrum computation is a time sequence, like the input signal itself. If the input signal has a strong fundamental pitch period, it shows up as a peak in the cepstrum. By measuring the time distance from time 0 to the time of the peak, the fundamental period of the pitch is found. Fig. 3 shows the sequence of the cepstrum computation.

What the cepstrum does is separate a strong pitched signal from the rest of the spectrum. For speech, the analysis works as follows. The cepstrum serves to separate two superimposed spectra: the glottal pulse (vocal cord) excitation and the vocal tract resonance. The excitation can be viewed as a sequence of quasi-periodic pulses. The Fourier transform of these pulses is a line spectrum where the lines are spaced at harmonics of the original frequency; taking the log magnitude does not affect the general form of this spectrum, so the inverse Fourier transform yields another quasi-periodic waveform of pulses. By contrast, the spectrum of the response of the vocal tract is a slowly varying function of frequency, and applying the log magnitude and inverse Fourier transform yields a waveform that has significant amplitude only for a few samples, generally fewer than the fundamental pitch period. Thus the cepstrum clusters the pitch into a series of peaks at the period of the fundamental frequency. An example of the cepstrum for a piano tone is shown in Fig. 5, along with the spectrum and the input waveform.
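The whole chain can be summarized in a few lines of C. The sketch below is illustrative only: fft() and ifft() stand in for whatever radix-2 complex FFT routines are available, and the data is assumed to be interleaved real/imaginary of power-of-two length n.

    #include <math.h>

    extern void fft(float *x, int n);   /* assumed FFT helper     */
    extern void ifft(float *x, int n);  /* assumed inverse helper */

    static int cepstrum_peak(float *x, int n, int min_lag)
    {
        int i, peak = min_lag;

        fft(x, n);                           /* X(k), interleaved re/im  */
        for (i = 0; i < n; i++) {            /* log|X(k)|, imag part = 0 */
            float re = x[2 * i], im = x[2 * i + 1];
            x[2 * i]     = logf(sqrtf(re * re + im * im) + 1e-12f);
            x[2 * i + 1] = 0.0f;
        }
        ifft(x, n);                          /* real cepstrum            */

        for (i = min_lag; i < n / 2; i++)    /* search past noise margin */
            if (x[2 * i] > x[2 * peak])
                peak = i;
        return peak;                         /* lag of the pitch peak    */
    }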

2.3 Auto Correlation

Correlation functions compare two signals. Autocorrelation methods compare a signal with versions of itself delayed by successive intervals; the point of comparing delayed versions of a signal is to find its repeating patterns. Autocorrelation pitch detectors hold part of the input signal in a buffer. As more of the same signal comes in, the detector tries to match a pattern in the incoming waveform with part of the stored waveform. A typical autocorrelation function looks like this:

$$\text{autocorrelation}[\text{lag}] = \sum_{n=0}^{N} \text{signal}[n]\cdot\text{signal}[n+\text{lag}] \qquad (1)$$

where n is the input sample index and N is the window length. The degree to which the values of the signal at different times n match the values of the same signal delayed by lag samples determines the magnitude of autocorrelation[lag].


Figure 3: Block diagram of Cepstrum


Figure 4: Block diagram of Autocorrelation


If the lag is 0, the output of the normalized autocorrelation is 1, which means that the signals are identical.

Pitch detection by autocorrelation is most efficient at mid and low frequencies. It has been popular in speech recognition applications, where the pitch range is limited.
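A direct implementation of equation (1) as a pitch detector might look like the following sketch (our own code; min_lag and max_lag bound the expected pitch range):

    static int autocorr_pitch_lag(const float *x, int n,
                                  int min_lag, int max_lag)
    {
        int lag, k, best = min_lag;
        float best_val = -1e30f;

        for (lag = min_lag; lag <= max_lag; lag++) {
            float r = 0.0f;
            for (k = 0; k + lag < n; k++)   /* equation (1) */
                r += x[k] * x[k + lag];
            if (r > best_val) {
                best_val = r;
                best = lag;
            }
        }
        return best;    /* the pitch in Hz would be fs / best */
    }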

3 Project Implementation

This chapter explains the implementation details of the chosen algorithms, both on the C6713 (using C and assembly language) and in MATLAB. It also gives the results of the performance analysis of cepstrum vs. autocorrelation carried out as part of the project. Section 3.1 explains the implementation details in MATLAB, Section 3.2 explains the implementation details on the C6713, and Section 3.3 explains the Singstar extension of the project.

3.1 Implementation in Matlab

As part of the project, both the cepstrum and the autocorrelation were implemented in MATLAB. The MATLAB model was used to verify the proposed algorithms. The MATLAB implementation was done in two parts: first the algorithm was implemented, and then a tone of known frequency was given as input for verification.

The implementation of the cepstrum in MATLAB is straightforward, since MATLAB has built-in support for reading wave files as well as the fft, log, abs and ifft functions. This made the implementation easy, and the result was plotted in a graph. The wave file given as input to the function was read continuously and the pitch was plotted. As the wave file was a tone of a single frequency, the graph was expected to be a straight line at the frequency of the tone, but due to rounding errors the estimated frequency is not constant for all values and varies within a tolerable margin, as shown in Fig. 5. Here the input was a 440 Hz A4 piano tone.

The autocorrelation was also implemented in MATLAB, and the output waveform and pitch period were plotted. With the same A4 piano tone as input, the pitch was calculated as 446 Hz. The autocorrelation MATLAB output for the A4 piano tone is shown in Fig. 6.

3.2 Implementation on C6713

The cepstrum implementation on the C6713 uses several functions of the DSP/BIOS operating system and the Texas Instruments DSP library source code for the FFT functions. The overall operation of the cepstrum implementation is shown in Fig. 7.


Figure 5: Matlab Implementation of Cepstrum (panels: speech waveform, energy, zero-crossing rate)


Figure 6: Matlab implementation of Autocorrelation


Figure 7: C6713 Implementation in Cepstrum


The project implementation has two important files: pip_audio.c, where the main function and the swiEcho function are written, and Pitch_Estimator.c, where the cepstrum is performed. There is also a pitch_estimator.h header containing the data required by Pitch_Estimator.c. In the main function, two pipe objects are created, Pip_Rx and Pip_Tx, for reading data from the line-in and writing data to the headphones respectively. A pipe object is a circular buffer in memory; in our project it consists of two frames, so that data can be read from one frame for processing while the other frame is being filled with new data. A pipe object signals when entire frames have been written or read. In our project the function swiEcho() is called when Pip_Rx is full. The PIO driver takes care of writing the data from the line-in to the pipes, and the pipes then invoke the swiEcho function via a software interrupt (SWI). Inside the swiEcho() function the cepstrum analysis is done and the pitch is calculated. As the input data through the line-in is stereo, only one channel is read and the other channel is discarded; this is done in the Process() function called by swiEcho.

The following functions are important in our project and are explained in detail below:

Main(): This is the main function; the PIO driver and the pipes are initialized here, and the memory allocation for the pipes is also done here.

SwiEcho(): This function is called whenever a new buffer is available. It gets the pointer to the new data and the size of the data available in words, and passes this information to the Process() function.

Process(): This function reads the input data from one channel and discards the other. It also organizes the input data into complex form with alternating real and imaginary parts, and pads the buffer with zeros in order to align the data length to a power of 2, since the FFT routine operates only on complex data of power-of-two length. It then calls the Pitch_Estimator() function in the Pitch_Estimator.c file, which calculates the pitch.

Pitch_Estimator(): This function does the actual pitch estimation. First it calls gen_w_r2(), which generates the Fourier coefficients (the twiddle factors) required to perform the FFT; the twiddle factors are then bit-reversed as required. The twiddle factors, the input data formatted into complex form, and the FFT window size are given as arguments to DSPF_sp_cfftr2_dit(), which performs the FFT. The FFT output is passed to calc_log_abs(), which calculates the logarithm of the absolute value of the FFT data. The result is given to DSPF_sp_icfftr2_dif(), which performs the IFFT. Finally, the maximum value of the IFFT data and its index are found using the find_max() function.


Figure 8: Singstar Application

Once the index of the maximum value is found, the pitch is calculated via the following equation. The sampling rate is set in the generated pip_audiocfg file, and the initial window is the number of samples reserved as a noise margin; it usually depends on the sampling rate and is set to 8 in our project.

Pitch = 2 * Sampling Rate / (Initial Window + Index)  (2)

The pitch thus calculated is displayed using the tracing function.

3.3 Singstar Extension

The Singstar extension of the project involves a couple more blocks and desktop software that plots graphs in real time. The updated block diagram is shown in Fig. 8.

The original song is fed into one channel of the stereo line-in and the singing data from the user into the other channel. For the conversion from stereo to mono and to dual-mono stereo, new hardware adapters were tailor-made. The stereo data containing the two different versions of the song is sent to the cepstrum analysis and the pitch of each channel is calculated individually. Using HST objects, the data is then written to a file on the PC, and the plotting tool continuously scans the file for new data and updates the graph whenever new data is available.

4 Results

The implementation of the cepstrum was successful both in MATLAB and on the C6713. The pitch was calculated approximately and was verified successfully for different frequencies. The Singstar part was only partially verified, calculating the pitch on two channels of data at the time of writing; the application is still being modified to plot the data in real time. The autocorrelation was implemented successfully in MATLAB but was not implemented on the C6713 due to lack of time.

As part of the project analysis, a thorough performance comparison of the cepstrum and the autocorrelation was also performed in MATLAB for several sampling frequencies.


Figure 9: Matlab results for 8KHz sampling rate


Figure 10: Matlab results for 16KHz sampling rate (legend: real, cepstrum, auto-correlation)

The results for 8 kHz, 16 kHz and 44.1 kHz are shown in Fig. 9, Fig. 10 and Fig. 11 respectively. Surprisingly, the autocorrelation outperformed the cepstrum at the 8 kHz and 16 kHz sampling rates.

5 Conclusion

This project course has been a great learning experience, and we are very happy to have worked on such a project. The learning was not only in terms of technical aspects but also in terms of time management and project planning. The project helped us understand our strengths, our weaknesses and the areas in which to improve.

On the technical side, we learnt about pitch estimation algorithms and how they work. We referred to various articles to make our selection of pitch estimation algorithms, which served as experience in using journal articles to help solve problems. As


Figure 11: Matlab results for 44.1KHz sampling rate


part of the development in MATLAB, we learnt how to use MATLAB efficiently for the development and verification of algorithms and as a reference for later work. As part of the implementation on the C6713, we learnt how to work with Code Composer Studio and how to develop applications on the C6713. We came to understand the various services and features of the BIOS operating system and how to use them, and we learnt how to debug using Code Composer Studio. We also learnt how to write documents with LaTeX.

On the other hand, we learnt how to make a project plan and how to implement a project in stages. We also learnt how to manage time, though we failed at times. The major setback caused by bad time management is that we were not able to verify the Singstar application completely, although we verified the concept individually. Our initial plan was to implement both the autocorrelation and cepstrum algorithms, but we ended up implementing just the cepstrum. As a result of bad planning, we did the performance analysis of the algorithms at a late stage of the project when time was very limited; when the results came out in favor of the autocorrelation, we had little time left to implement it. We learned how to work as a team, how to share tasks among us and how to synchronize within the team.

Finally, to conclude, we are happy with our learning experience and have learned some valuable lessons that may come in handy in our future undertakings. We also learned how to implement an algorithm from scratch on a DSP, which is very useful for our technical understanding of digital signal processors and DSP algorithms.

References

[1] C. Roads, The Computer Music Tutorial, MIT Press, 1996.

[2] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg, C. A. McGonegal, "A Comparative Performance Study of Several Pitch Detection Algorithms", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-24, No. 5, October 1976.

[3] J. G. Proakis, D. G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, Prentice Hall.

[4] http://en.wikipedia.org/wiki/Pitch_%28music%29

[5] http://en.wikipedia.org/wiki/Cepstrum

[6] Texas Instruments reference manuals for the DSK C6713.

