SYSTEM FOR CHARACTERISATION AND RECOGNITION OF ARABIC ... · SYSTEM FOR CHARACTERISATION AND RECOGNITION OF ARABIC ... akustik dan fonetik adalah sangat pesat untuk ... Rujukan adalah

SYSTEM FOR CHARACTERISATION AND RECOGNITION OF ARABIC

PHONEMES AMONG MALAYSIAN CHILDREN USING FEED-FORWARD

NEURAL NETWORKS

NURUL ASHIKIN BINTI ABDUL KADIR

A thesis submitted in fulfilment of the

requirements for the award of the degree of

Master of Engineering (Electrical)

Faculty of Electrical Engineering

Universiti Teknologi Malaysia

JULY 2012

iii

To my beloved

Embut Embong, Abdul Kadir A Rahman,

Nor Amin Rohimi

and

Rubita Sudirman (Dr.)

iv

ACKNOWLEDGEMENT

Thank you Allah for giving me chance and good health experienced these

precious moments doing my master study.

My greatest appreciations and thank you to my supervisor, Dr Rubita binti

Sudirman for guiding and supervising me to complete this Master study successfully.

My sincere appreciation to Ministry of Science, Technology & Innovation, Malaysia

(MOSTI) and Universiti Teknologi Malaysia (UTM) for funding my Master study. I

am also indebted to children from Sekolah Agama UTM and Umi Kalsom’s Tuition

Centre for being the research samples.

My sincere gratitude goes to the wonderful person in my life, Nor Amin

Rohimi for being my helping hands and providing lots of assistance along this study

completion. Also to Dr. Norlaili Mat Safri, Mr. Tan Chin Luh, Mrs. Husnaini Azmy,

Mr. Afzan Othman, Dr Puspa Inayat Khalid who provide assistance along this study

completion. Special thanks to Ms. Nor Hekmat Yahaya and Ms. Carolyn Foot for

proofreading my abstract. Thanks for the valuable tips indeed. Also technicians from

Medical Electronics Lab, Ms. Wan Nor Afiza whom always bothered by me, thank

you so much.

My mom, dad, fiancé, sister, brothers and aunt who always support me along

my way, thank you very much for their prayers, kind, love, care and patience.

v

ABSTRACT

There has been limited study and research in Arabic phoneme among

Malaysians, hence making references to the work and research difficult. Although

there have been significant acoustic and phonetic studies on languages such as

English, French and Mandarin, to date there are no guidelines or significant findings

on Malay language. This study discusses the correct use of Arabic phonemes

pronunciation in Malay accent in the simplest form. The International Phonetic

Alphabet (IPA) of Arabic chart is used as reference of every recorded speech sample

using Malaysian children for sound localisation of every phoneme. Results from

Maahad Tahfiz School teachers were used to identify the most suitable samples

among the recorded samples. The samples were analysed to determine the origin of

each phoneme data by measuring its formant frequencies. Consonants of Standard

Arabic (SA) phonemes were studied and an appropriate place of articulation of every

phoneme was measured through its formant. Seven out of 25 consonants of SA

phonemes of the children’s samples did not give the appropriate formants value were

identified. The formant frequency values obtained were used as reference for the

database for the proposed recognition system which was developed using Matlab

software. All selected samples were randomly divided into 10 disjoint sets of equal

size for validation, namely 10-fold cross validation, to estimate the performance of a

predictive model. The mean square error (MSE) observed was 0.03, with the speech

recognition using a developed neural network (NN) system. The results indicated that

the highest training recognition rate obtained for multi-layer and cascade-layer

network were 98.8 % and 95.2 % respectively, while the highest testing recognition

rate achieved was 92.9 % for both networks and the MSE is 0.04.

vi

ABSTRAK

Kajian mengenai penggunaan fonem bahasa Arab di kalangan rakyat Malaysia

adalah sangat terhad dan ini menyukarkan rujukan dan kajian. Walaupun kajian

akustik dan fonetik adalah sangat pesat untuk bahasa asing seperti Bahasa Inggeris,

Bahasa Perancis dan Bahasa Mandarin namun hingga kini tiada panduan khusus

didapati atau temuan yang penting dalam bahasa Melayu. Kajian ini dilaksanakan

untuk mengkaji kesesuaian penyebutan fonem Arab dalam dialek Bahasa Melayu.

Rujukan adalah bersandarkan carta Fonetik Huruf Antarabangsa (IPA) Bahasa Arab

bagi memastikan ketepatan fonem yang direkod terhadap penyebutan setiap fonem.

Keputusan daripada guru-guru sekolah Maahad Tahfiz digunakan dalam

mengenalpasti sampel yang paling sesuai dari sampel yang telah direkodkan. Sampel-

sampel tersebut dianalisa untuk menentukan titik asal setiap fonem dengan mengukur

frekuensi-frekuensi formannya. Konsonan-konsonan fonem bahasa Arab Standard

(SA) dikaji dan titik artikulasi sesuai setiap fonem diukur melalui formannya. Tujuh

daripada 25 konsonan fonem SA bagi sampel kanak-kanak tidak memberikan nilai

forman yang bersesuaian telah dikenalpasti. Nilai-nilai frekuensi forman yang

diperoleh digunakan sebagai rujukan untuk pangkalan data bagi sistem pengecaman

yang dicadangkan untuk dibina menggunakan perisian Matlab. Kesemua sampel

terpilih dibahagi secara rawak kepada 10 set tidak berturutan yang sama saiz untuk

pengesahan, iaitu 10-lipatan pengesahan bersilang, untuk anggaran prestasi model

cadangan. Ralat min kuasa dua (MSE) diperoleh adalah 0.03, dengan pengecaman

pertuturan menggunakan sistem jaringan saraf tiruan (NN) yang direka. Keputusan

menunjukkan kadar latihan pengecaman paling tinggi diperoleh untuk jaringan

lapisan pelbagai dan lapisan lata masing-masing adalah 98.8 % dan 95.2 %, sementara

kadar ujian pengecaman diperoleh adalah 92.9 % untuk kedua-dua jaringan dan MSE

ialah 0.04.

vii

TABLE OF CONTENTS CHAPTER TITLE PAGE

DECLARATION ii

DEDICATION iii

ACKNOWLEDGEMENT iv

ABSTRACT v

ABSTRAK vi

TABLE OF CONTENTS vii

LIST OF TABLES xi

LIST OF FIGURES xiii

LIST OF APPENDICES xvi

1 INTRODUCTION

1.1 Introduction 1

1.2 Overview of Speech Technology Research History 1

1.3 Problem Statement 4

1.4 Objectives of the Study 4

1.5 Scope of the Study 5

1.6 Significance of the Study 6

1.7 Organization of Thesis 6

2 LITERATURE REVIEW

2.1 Introduction 7

2.2 Speech Sound 8

2.2.1 Organ to Pronounce Speech 8

2.2.2 Articulation Places 9

2.2.3 Voicing 10

2.2.4 Type of Sound 11

viii

2.2.5 Manner of Production 12

2.2.5.1 Fricative 12

2.2.5.2 Plosive 12

2.2.5.3 Nasal 13

2.2.5.4 Approximant 13

2.2.5.5 Trill 13

2.2.5.6 Lateral 13

2.2.6 Malay Language Consonants of Speech 15

2.3 The Speech Signal 15

2.3.1 Speech Signal Representation 16

2.3.2 Speech Features 16

2.3.2.1 Pitch and Energy 16

2.3.2.2 Formants 17

2.3.2.3 Features Prosody 18

2.4 Speech Characteristics through Graph Analysis 19

2.4.1 Spectrogram 19

2.4.1.1 Spectrogram based Previous Research 21

2.4.2 Spectral Image 21

2.4.2.1 Spectral Analysis based Previous Research 23

2.5 k-fold Cross Validation Technique 24

2.5.1 k-fold Cross Validation based Previous Research 24

2.6 Speech Recognition 25

2.6.1 Neural Networks for Speech Recognition 26

2.6.2 Neural Networks based Previous Research 28

3 METHODOLOGY

3.1 Introduction 33

3.1.1 Study Flow Charts 34

3.1.2 Overall Study Design 36

3.2 Speech Signal Collection 37

3.3 Speech Signals Selection and Evaluation 38

3.3.1 Spectrogram 39

3.3.2 Spectral Image 40

3.3.3 k-fold Cross Validation 42

ix

3.4 From Speech to Vector Representation 42

3.4.1 Preprocessing 43

3.4.1.1 Edge Detection 43

3.4.1.2 Pre-Emphasis 44

3.4.2 Segmentation 44

3.4.2.1 Frame Blocking 45

3.4.2.2 Windowing 47

3.4.2.3 Continuous Filtering 48

3.4.3 Auto-Correlation Analysis 49

3.4.4 Feature Extraction 49

3.4.4.1 Linear Prediction 49

3.5 Pattern Recognition using Neural Networks 50

3.5.1 Neural Networks System Architectures 51

3.5.2 Neural Networks Speech Recognition Training

Process 53

3.5.3 Neural Networks Speech Recognition Testing

Process 56

3.6 Experimental Setup 56

4 RESULTS AND DISCUSSIONS

4.1 Characteristics of Samples Collection 59

4.1.1 Signal Observations through Spectrogram 59

4.1.1.1 Nasal 60

4.1.1.2 Lateral 63

4.1.1.3 Trill 65

4.1.1.4 Fricative 67

4.1.1.5 Plosive 79

4.1.2 Signals Observations through Spectral Image 81

4.1.2.1 Alveolar /s/, [س], and Post-alveolar /∫/, [ش] 81

4.1.2.2 Labiodental /f/, [ف], Dental /θ/, [ث], /ð/, [ذ],

Alveolar /z/, [ز], and Palatal /ð’/, [ظ] 83

4.1.3 Comparisons of Speech Signals Characteristics 85

4.2 k-fold Cross Validation Error Estimation 88

4.3 Phonemes Classification using Neural Networks 89

x

4.3.1 Classification of 10 Phonemes 90

4.3.2 Classification of Phonemes by Manner of

Articulation 94

4.3.2.1 LPC Order of 8 94

i. Group 1 - Nasal, Lateral and Trill

Manners of Articulation 94

ii. Group 2 - Fricative Manner of Articulation 96

4.3.3 Comparison between LPC Order 99

4.3.3.1 Group 1 - Nasal, Lateral and

Trill Consonants 99

4.3.4 System Performance 100

4.3.4.1 Group 1 - Nasal, Lateral and Trill

Manners of Articulation 101

4.3.5 Comparison between Previous Research Findings 104

4.4 Demo Prototype – Graphical User Interface Design 106

5 CONCLUSION AND RECOMMENDATIONS

5.1 Conclusion 109

5.2 Recommendations and Future Works 110

REFERENCES 111

Appendices A-F 122 - 168

xi

LIST OF TABLES TABLE NO. TITLE PAGE

2.1 Distribution of Standard Arabic consonants 10

2.2 Distribution of Malay Language consonants 10

2.3 Symbolisation of vowels in Standard Arabic for alphabet [ف] 11

2.4 List of Standard Arabic phonemes 14

2.5 List of Malay Language consonants 15

2.6 Previous research based on spectrogram 21

2.7 Previous research findings based on spectral analysis 24

2.8 Example of neural networks transfer functions 27

2.9 Previous study of neural networks in speech applications 30



3.1 The Standard Arabic fricative consonants 41

3.2 The Malay Language fricative consonants 41

3.3 Experimental setup 57

4.1 The formants averages for nasal phonemes 63

4.2 The formants averages for lateral phoneme 65

4.3 The formants averages for trill phoneme 66

4.4 The formants averages for fricative phonemes 79

4.5 The formants averages for plosive phonemes 80

4.6 The spectrum frequency of fricative consonants 82

4.7 The spectrum frequency of fricative consonants 83

4.8 Comparisons towards nasal previous research findings 85

4.9 Comparisons towards lateral previous research findings 85

4.10 Comparisons towards trill previous research findings 85

4.11 Comparisons towards fricatives previous research findings 86

4.12 Comparisons towards plosive previous research findings 86

xii

4.13 Previous research findings of fricative consonants 87

4.14 MSE yields from 10-fold cross validation 88

4.15 Single-layer (SL) networks performance 91

4.16 Single-layer (SL) confusion matrix 91

4.17 Multi-layer (ML) networks performance 92

4.18 Multi-layer (ML) confusion matrix 92

4.19 Cascade-layer (CL) networks performance 93

4.20 Cascade-layer (CL) confusion matrix 93

4.21 Single-layer (SL) networks performance for /m/, [م], /n/, [ن],

/l/, [ل], /r/, [ر] 95

4.22 Multi-layer (ML) networks performance for /m/, [م], /n/, [ن],

/l/, [ل], /r/, [ر] 95

4.23 Cascade-layer (CL) networks performance for /m/, [م], /n/, [ن],

/l/, [ل], /r/, [ر] 95

4.24 Single-layer (SL) networks performance for fricative consonants 97

4.25 Multi-layer (ML) networks performance for fricative consonants 97

4.26 Cascade-layer (CL) networks performance for fricative

consonants 97

4.27 Recognition rate using combinations of multi-layer (ML) hidden

neurons architecture 101

4.28 Recognition rate using combinations of cascade-layer (CL) hidden

neurons architecture 102

4.29 Multi-layer (ML) networks performance 103

4.30 Cascade-layer (CL) networks performance 103

4.31 Comparison of different neural network types system performance 105

xiii

LIST OF FIGURES FIGURE NO. TITLE PAGE

2.1 Human speech mechanisms; Lungs, larynx, epiglottis, vocal tract 8

2.2 Pronunciation organ 9

2.3 Speech signal and spectrogram for utterance “wao ha” 11

2.4 Speech representations of utterance “shin”. Up: speech signal,

down: speech spectrogram 16

2.5 The first, second and third formants frequencies (F1, F2 and F3) 18

2.6 (a) Gray spectrogram, (b) Colour spectrogram 20

2.7 (a) The spectrogram and formants, (b) The spectra of vertical

line of (a) (shows by arrow) 22

2.8 Recognition process 26

2.9 (a) Neuron model (artificial neural network) and

(b) biological neuron 28

3.1 Steps to speech sample selection 35

3.2 Steps to speech recognition 35

3.3 Overall system design 36

3.4 Speech collection and graph analysis 38

3.5 Steps to achieve formants 39

3.6 The workflow to obtain the average value of spectral image

reading 40

3.7 Place of articulation: 1-2) Labiodental, 2) Dental, 3) Alveolar,

4) Post-alveolar, 5) Palatal, 6) Velar, 7) Pharyngeal, 8) Glottis 41

3.8 Data partitioning using k-fold cross validation 42

3.9 Preprocessing technique to speech recognition system 43

3.10 (a) Normalised speech signal, (b) The speech signal energy level 44

3.11 Speech sample through finite impulse response (FIR) digital filter 44

3.12 The pre-emphasis signal having blocked into frames 46

xiv

3.13 Frame blocking of N samples 46

3.14 Samples with greater frame blocking size 46

3.15 Process of windowing the sample 47

3.16 Signal multiplications with Hamming window 47

3.17 Signal after windowing process 48

3.18 Neural networks architectures 1, 2 and 3: (a) Single-layer network

(Architecture 1), (b) Multi-layer network (Architecture 2),

(c) Cascade-layer networks (Architecture 3) 52

3.19 System architecture 52

3.20 Neural network algorithm for single-layer network 53

3.21 Neural network algorithm for multi-layer network 54

3.22 Neural network algorithm for cascade-layer network 55

3.23 Network testing 56

4.1 (a) Speech signal, (b) The speech spectrogram of bilabial

nasal /m/, [م] 60

4.2 Formants distribution of bilabial /m/, [م] 61

4.3 (a) Speech signal, (b) The speech spectrogram of dental

nasal /n/, [ن] 62

4.4 Formants distribution of dental /n/, [ن] 62

4.5 (a) Speech signal, (b) The speech spectrogram of post-alveolar

lateral /l/, [ل] 64

4.6 Formants distribution of /l/, [ل] 64

4.7 (a) Speech signal, (b) The speech spectrogram of alveolar

trill /r/, [ر] 66

4.8 Formants distribution of alveolar /r/, [ر] 66

4.9 (a) Speech signal, (b) The speech spectrogram of dental

fricative /θ/, [ث] 67

4.10 Formants distribution of dental /θ/, [ث] 68

4.11 Formants distribution of labiodental /f/, [ف] 69

4.12 Formants distribution of dental /ð/, [ذ] 70

4.13 Formants distribution of alveolar /z/, ]ز[ 70

4.14 Formants distribution of alveolar /s/, [س] 71

4.15 Formants distribution of post-alveolar /∫/, [ش] 72

xv

4.16 Formants distribution of coronal /s'/, [ص] 73

4.17 Formants distribution of palatal /ð'/, [ظ] 74

4.18 Formants distribution of velar /x/, [خ] 74

4.19 Formants distribution of velar /ɣ/, [غ] 76

4.20 Formants distribution of pharyngeal /ħ/, [ح] 76

4.21 Formants distribution of pharyngeal /ʕ/, [ع] 77

4.22 Formants distribution of glottis /h/, [ه] 78

4.23 (a) Speech signal, (b) The speech spectrogram of bilabial

plosive /b/, [ب] 80

4.24 Spectral image of /s/, [س] 81

4.25 Spectrum peaks for labiodental /f/, [ف] dental /θ/, [ث], /ð/, [ذ],

alveolar /z/ , [ز] and palatal /ð’/, [ظ] 84

4.26 Trough and bandwidth for labiodental /f/, [ف], dental /θ/, [ث],

/ð/, [ذ], alveolar /z/, [ز], and palatal /ð’/, [ظ] 84

4.27 The combinations of nasal, lateral and trill networks

training recognition rate 96

4.28 The combinations of nasal, lateral and trill networks testing

recognition rate 96

4.29 The fricative networks training recognition rate 98

4.30 The fricative networks testing recognition rate 99

4.31 The networks training recognition rate 100

4.32 The networks testing recognition rate 100

4.33 The multi-layer (ML) and cascade-layer (CL) networks

performance 104

4.34 The graphical user interface design (GUI) design 106

4.35 The pop-up window to select sample to be tested 107

4.36 The “Result” push-button 107

4.37 The output of GUI displayed at Matlab’s Command Window 108

4.38 A child testing her pronunciation using developed GUI 108

xvi

LIST OF APPENDICES APPENDIX TITLE PAGE

A Makhraj Point 122

B Maahad Correctness List 127

C Spectrogram Formants 153

D Spectral Results 154

E Matlab Coding 155

F Publications 167

CHAPTER 1

INTRODUCTION 1.1 Introduction

There is not much research done on Arabic speech recognition which makes

fewer references for the study compared to English. Furthermore, the researches

done were on Arabian, Pakistan, Jordanian, Palestinian and other Arabian countries.

Therefore, the purpose of this study is to develop an Arabic phonemes database and

later will be used to develop software to show sound localization of Arabic letters

recitation. This chapter tells about the history of speech recognition technology in

brief, together with the problem statement of the study. Besides that, this chapter

also discusses about the objectives of the study and the significance of the study.

1.2 Overview of Speech Technology Research History

The review of speech technology shows that the development of speech

technology are developing and a lot of remarkable results have been produced as

well as new challenges in the research fields. Nowadays, the application of speech

processing technology (or known as Natural Language Processing Technology, NLP)

have led to a number of successful applications available in software which suitable

for personal-computer applications and hand-phones. NLP used via hand-phone,

called smart-phone are attracting user interest as the intelligent software assistant

2

with user-interface makes the technology such as Apple’s and Google’s on debut

recently in the telecommunications world. NLP which manipulates the voice

explicitly to produce signals that control computers or software or communication

devices are user interest and research awareness.

As early as 50’s, a system for isolated digit recognition for a single speaker

was researched by Davis, Biddulph and Balashek (1952). Later, Olson and Belar

(1956) designed a system to recognise 10-syllable for a single speaker. While in

1959, a system of recognising four vowels and nine consonants was built by Denes

(1959) using spectral information in extracting voice features. By applying the same

method, Forgie and Forgie (1959) built a speaker independent 10 English vowels

recogniser in isolated words.

Throughout the 1960s, vowel recogniser, phoneme recogniser and digit

recogniser developed. These researches were among Japanese, e.g. Suzuki and

Nakata (1961), Sakai and Doshita (1962) and Nagata, Kato and Chiba (1964).

Afterward in year 1970s, most research focused on implementing techniques

to improve the recognition technology. Sakoe and Chiba (1978) used warping

function for spoken word normalization in spoken word recognition. Digits and

names are considered in the recognition system. Itakura (1975) used Linear

Predictive Coding (LPC) extraction method in speech recognition. While Rabiner et

al. (1979) applied clustering techniques in the speaker independent recognition for

isolated words.

Speech recognition research field in 1980’s mostly employed of statistical

modeling methods such as Neural Networks (NNs) and Hidden Markov Models

(HMMs). The suitability of NN for pattern recognition was discussed by Lipmann

(1987). Lippmann (1989) overviewed of clustering technique by using Multi-Layer

Feed-Forward (MLFF) NN. The networks used Back-Propagation (BP) training rule

effectively. As described by Juang and Furui (2000), statistical methods were

3

designed to allow machine to learn the structure regularities in the speech signal

directly from the data for the purpose of automatic speech recognition (ASR).

Later in year 1990’s, shows a lot of additional improvement from the results

developed years previously. Lee (1990) developed a continuous speech recognition

speaker independent by using HMMs methods in context-dependent modeling. Liu

et al., (1992) implemented Dynamic Time-Warping (DTW) with NN. Large

vocabulary of continuous speech recognition was built using HMM toolkit, HTK by

Woodland et al., (1994). HTK is a portable software toolkit developed by

Cambridge University Speech Group. It uses continuous density HMMs for building

speech recognition systems. Matsuura, Miyazawa and Skinner (1994) developed a

speaker-independent isolated-word recognition system using NN to identify the

fundamental sequence of phonemes and a DTW technique to time-align phonemes

with equivalent lexical sequences of phonemes.

During 2000’s, a broader research done in the field of speech recognition.

Sutat and Chularat (2000) had developed Thai text-dependent speaker identification

by using Multi-Layer Perceptron (MLP) trained with Back-Propagation (BP)

algorithm. The used of linear interpolation and time-normalization are applied to the

digits 0 to 9 in Thai language. Hong, Salleh and Sha’ameri (2002) developed speech

recognition software for the purpose of teaching Linear Predictive Coding (LPC) and

LPC-derived cepstrum (LPCC) in education environment. It provided with DTW,

HMM and Vector-Quantization (VQ) techniques. Ting (2002) developed a system

which concerned on the improvement over the limitations and problems of traditional

speech therapy in Malay language. The speech classification used is multi-layer NN.

Al-Sayegh and AbedEl-Kader (2004) developed an Arabic phoneme recogniser

which used MLP NN method. While Pisarn and Theeramunkong (2007) developed a

Thai spelling speech recogniser with HMM used in the system training phase. The

Thai phonetic characteristics, alphabet system and spelling methods have been

analysed. Grewal and Kumar (2010) designed an isolated word recognition system

for English Language. Al-azzawi and Daqrouq (2011) developed Arabic vowel

recognition based on wavelet and LPC by using Feed-Forward Back-Propagation

Neural Networks (FFBPNN).

4

In spite of all of the progressions in the speech recognition field, a number of

excellent commercial products are available, such as products that recognise human

voice. IBM introduced their speech recognition system, Dragon (Pallet, 2002).

Research describing about these technologies were published by Baker (1975), Radin

(1983) and Atkinson and McCreight (1987). Also available through Google search

engine, known as Google’s Voice Recognition and available via Apple’s smart-

phone, IPhone, named as Siri.

1.3 Problem Statement

In Malaysia, still lack of Arabic speech recognition using neural networks.

Most of the research focused on Malay language instead of Arabic language as the

non-native language in their study. At present, a lot of Arabic software has been

developed. Unfortunately, less research focused on children speech sounds as the

children speech is more dynamic and inconsistent if compared to adult’s speech

(Ting and Yunus, 2004). Furthermore, children have shorter vocal tracts and smaller

vocal cords thus having higher formant frequencies than adults (Boё et al., 2006).

Most speech recognition systems target to adult users and had experience severe

speech recognition system performance degradation when used by children users.

Zue et al. (2000) found that children’s word error rate was almost twice adult users.

In Malaysia’s primary education, children were taught Arabic language as their third

language after Malay language and English. Therefore this study focused on the

characterisation, identification and classification of the children’s Arabic speech

sounds. Moreover, sound production point will be visualised using Matlab graphical

user interface (GUI) for education purposes especially for children.

1.4 Objectives of the Study

There are four objectives for this study. The first objective is to identify the

characteristics of Standard Arabic (SA) consonants. The characteristics of every SA

5

consonants can be characterised by implementing Fast-Fourier Transform (FFT) and

finding the formant frequencies from the signals.

The second objective is to employ k-fold cross validation technique to

estimate the performance of the NN system recognition as the small network tend to

learn the most gross behaviour of training data and ignore subtleties. k-fold cross

validation is a reliable method on choosing the best NN after elimination of data by

using formant frequencies pattern.

The third objective is to develop a system that can recognise Arabic

phonemes sound pronunciation by using neural networks for pattern recognition and

classification.

The forth objective is to design learning software for education purposes. A

learning system is built by using the appropriate Arabic phonemes database and NN

systems through visualising the sound production point at graphical user interface

(GUI).

1.5 Scope of the Study

The limitation of this study is using children’s voice as the database. The

subjects are limited to healthy children aged 7 to 11 years. The formant frequencies

of Standard Arabic (SA) consonants are studied as every phoneme has different

formants pattern and characteristics.

This study focuses on SA consonants, which consist of 25 consonants. There

are five manners of articulation for consonants which are fricative, plosive, trill,

lateral and nasal are investigated. Their unique characteristics are identified as well,

based on their formant frequencies.

6

1.6 Significance of the Study

Every Arabic phoneme possess different positions of the sound production

organs of speech which comprise the tongue, mouth roof, teeth and the vocal cord

which form the makhraj points to produce its pronunciation (refer to APPENDIX A

for the Arabic letters sound production point). Despite that, the characteristics of

Arabic phonemes among Malaysians children are identified based on their manners

and places of articulation. Furthermore, the Arabic phonemes recognition system

among Malaysian children is developed using neural networks. By visualising the

makhraj point and produce a sample sound of each letters through graphical user

interface (GUI), it can help children to get better understanding in their learning

process of pronouncing the Arabic letters.

1.7 Organization of Thesis This thesis is divided into five chapters. Following this introduction chapter

is Chapter 2, which presents some background information of the study and literature

review of related research of speech characteristics, speech recognition and neural

networks.

In Chapter 3, we described briefly about methodology used in this study.

Overall system design and implementation are described.

Chapter 4 discusses the findings during experiments and simulation of neural

networks. Formant frequencies for every 25 consonants of Standard Arabic are

determined. Further discussions about the best network selection were explained in

this chapter.

Chapter 5 concludes research findings and recommendations of some ideas

for future work.

111

REFERENCES Ahmad A. M. and Ismail S. Recurrent Neural Network with Backpropagation

Through Time For Arabic Recognition. Germany: SCS Europe. 13-16 June

2004.

Ahmad, A. M., Ismail, S. and Samaon, D.F. Recurrent Neural Network with

Backpropagation through Time for Speech Recognition. IEEE International

Symposium on Communications and Information Technology, 2004. ISCIT

2004. 2004. 1: 98-102.

Al-azzawi, K.Y. and Daqrouq, K. Feed Forward Back Propagation Neural Network

Method for Arabic Vowel Recognition Based on Wavelet Linear Prediction

Coding. International Journal of Advances in Engineering and Technology.

2011. 1(4): 62-72.

Ali, A. M. A., Van der Spiegel, J. and Mueller, P. Acoustic-Phonetic Features for

the Automatic Classification of Stop Consonants. IEEE Transactions on Speech

and Audio Processing. 2001. 9(8): 833-841.

Al-Sayegh, S. W. and AbedEl-Kader, A. F. Arabic Phoneme Recognizer based on

Neural Networks. Proceedings of International Conference on Intelligent

Knowledge Systems, IKS2004. Aug 16-20, 2004. Troy, Turkey. 2004.

Al-Shargabi, B., Al-Rominah, W. and Olayah, F. A Comparative Study for Arabic

Text Classification Algorithms based on Stop Words Elimination. ISWSA’11.

April 18-20, 2011. Amman, Jordan: ACM. 2011. Article 11.

Atkinson, R. R. and McCreight, E. M. The Dragon Processor. ACM Sigarch

Computer Architecture News. 1987. 15(5): 65-69.

112

Awadalla, M., Abou Chadi, F. E. Z. and Soliman, H. H. Development of an

Arabic Speech Database. Enabling Technologies for the New Knowledge

Society: ITI 3rd International Conference on Information and Communications

Technology, 2005. Dec 5-6, 2005. Cairo: IEEE. 2005. 89-100.

Awais, M. M., Masud, S., Akhtar, J. and Shamail, S. Arabic Phoneme

Identification Using Conventional and Concurrent Neural Networks in Non

Native Speakers. Lecture Notes in Computer Science. 2007. 4681: 897-905.

Awais, M. M., WaqasAhmad, Masud, S. and Shamail, S. Continuous Arabic

Speech Segmentation using FFT Spectrogram. Innovations in Information

Technology 2006. Nov 19-21, 2006. Dubai: IEEE. 2006. 1-6.

Bahi, H. and Sellami, M. Neural Expert Model Applied to Phonemes Recognition.

In: Perner P. and Imiya A. Lecture Notes in Computer Science Volume

3587/2005. Germany: Springer-Verlag Berlin Heidelberg. 507-515; 2005.

Baker, J. K. The Dragon System – An Overview. IEEE Transactions on Acoustics,

Speech and Signal Processing. 1975. ASSP-23: 24-29.

Beautemps, D., Badin, P. and Laboissière, R. Deriving Vocal-Tract Area Functions

from Midsagittal Profiles and Formant Frequencies: A New Model for Vowels

and Fricative Consonants based on Experimental Data. Speech Communication.

1995. 16(1): 27-47.

Boё, L-J., et al. Skull and Vocal Tract Growth from Newborn to Adult.

Proceedings of the 7th International of Seminar on Speech Production,

ISSP2006. Dec 13-15, 2006. Ubatuba, Brazil: CEFALA. 2006. 75-82.

Bouckaert, R. R. Choosing between Two Learning Algorithms based on Calibrated

Tests. Proceedings of the Twentieth International Conference on Machine

Learning, ICML 2003. August 21-24, 2003. Washington DC: AAAI Press.

2003. 51-58.

Callan, R. The Essence of Neural Networks. Maylands Avenue: Prentice Hall.

1999.

113

Chandra, B. and Varghese, P. P. Applications of Cascade Correlation Neural

Networks for Cipher System Identification. World Academy of Science,

Engineering and Technology 26. Dec 14-16, 2007. Bangkok, Thailand:

WASET. 2007. 312-314.

Cutler, A., Dahan, D. and Donselaar, W. Prosody in the Comprehension of Spoken

Language: A Literature Review. Language and Speech April/June 1997. 1997.

40(2): 141-201.

Daniel, R. P. and Marcos, F. Z. Speaker Recognition with a MLP Classifier and

LPCC Codebook. IEEE International Conference on Acoustics, Speech &

Signal Processing 1999. 1999. 2: 1005-1008.

Davis, K. H., Biddulph, R. and Balashek, S. Automatic Recognition of Spoken

Digits. Journal of the Acoustical Society of America. 1952. 24(6): 637-642.

Debyeche, M., Houacine, A. and Haton, J. P. A Knowledge-Based Approach for

Arabic Amphatic Consonant Identification Based on Speech Spectrogram

Reading. Proceedings of the Thirtieth Southeastern Symposium System Theory,

SSST 1998. March 8-10, 1998. Morgantown, West Virginia: IEEE. 1998. 325-

328.

Denes, P. The Design and Operation of the Mechanical Speech Recognizer at

University College London. Journal of the British Institution of Radio

Engineers. 1959. 19(4): 219-229.

Dietterich, T. G. Approximate Statistical Tests for Comparing Supervised

Classification Learning Algortihms. Neural Comput. 1998. 10(7): 1895-1923.

Fackrell, J. W. A., Vereecken, H., Martens, J-P., and Coile, B. V. Multilingual

Prosody Modelling using Cascades Regression Trees and Neural Networks.

Sixth European Conference on Speech Communication and Technology

EUROSPEECH 1999. Sept 5-9, 1999. Budapest, Hungary: EUROSPEECH.

1999. 1835-1838.

Fahlman, S. E. An Empirical Study of Learning Speed in Back-Propagation

Networks. Technical Report # CMU-CS-88-162. Carnegie Mellon University,

Pittsburgh. September 1988.

114

Fahlman, S. E. and Lebiere, C. The Cascade-Correlation Learning Architecture.

Technical Report # CMU-CS-90-100. Carnegie Mellon University, Pittsburgh.

August 1991.

Fant, G. Acoustic Theory of Speech Production with Calculations Based on X-Ray

Studies of Russian Articulations. Netherlands: Mouton and Co, The Hague.

1960.

Forgie, J. W. and Forgie, C. D. Results Obtained from a Vowel Recognition

Computer Program. Journal of the Acoustical Society of America. 1959.

31(11): 1480-1489.

Gordon, M., Barthmaier, P. and Sands, K. A Cross-Linguistic Acoustic Study of

Voiceless Fricatives. Journal of the International Phonetic Association. 2002.

32(2): 141-174.

Grewal, S. S. and Kumar, D. Isolated Word Recognition System for English

Language. International Journal of Information Technology and Knowledge

Management. 2010. 2(2): 447-45.

Hassan, A. Linguistik Am Untuk Guru Bahasa Malaysia. 5th ed. Selangor,

Malaysia: Fajar Bakti. 1984.

Heinzel, G., Rudiger, A. and Schilling, R. Spectrum and Spectral Density

Estimation by the DFT Including A Comprehensive List of Window Functions

and Some New Flat Top windows. Internal Report, Max-Planck-Institut für

Gravitationsphysik, Hannover. Feb 15, 2002.

Hong, K. S., Salleh, S-H. and Sha’ameri, A. Z. A New Developed Speech

Recognition Education Software in Teaching LPC and LPC-derived Cepstrum.

Proceeding of ICSP ’02. 2002. 2: 1701-1704.

Iqbal, H. R., Awais, M. M., Masud, S. and Shamail, S. On Vowels Segmentation

and Identification Using Formant Transitions in Continuous Recitation of

Quranic Arabic. In: Nguyen, N. T. and Katarzyniak, R. New Challenges in

Applied Intelligence Technologie Vol. 134/2008. Chennai, India: Springer-

Verlag Berlin Heidelberg. 155-162; 2008.

115

Itakura, F. Minimum Prediction Residual Applied to Speech Recognition. IEEE

Transactions on Acoustics, Speech, and Signal Processing. 1975. ASSP-23:

67-72.

Jackson, P. J. B. Acoustic Cues of Voiced and Voiceless Plosives for Determining

Place of Articulation. Proc. Workshop on Consistent and Reliable Acoustic

Cues for sound analysis, CRAC 2001. September, 2001. Aalborg, Denmark:

Eurospeech. 2001. 19-22.

Jackson, P. J. B. Characterisation of Plosive, Fricative and Aspiration Components

in Speech Production. Ph.D. Thesis. Universiti of Southampton, UK; 2000.

Jassem, Acoustical Description of Voiceless Fricatives in terms of Spectral

Parameters. Speech Analysis and Synthesis 1. 1967. 1: 189-206.

Jesus, L. M. T. and Jackson, P. J. B. Frication and Voicing Classification.

Computational Processing of the Portuguese Language, Lecture Notes in

Computer Science. 2008. 5190/2008: 11-20.

Jesus, L. M. T. and Shadle, C. H. Acoustic Analysis of European Portuguese

Uvular [χ,ʁ] and Voiceless Tapped Alveolar [ɾ] Fricatives. Journal of the

International Phonetic Association. 2005. 35(1): 27-44.

Johnson, D. Modeling the Speech Signal. Version 2.28. 2009.

http://cnx.org/content/m0049/2.28/

Johnson, K. Acoustic & Auditory Phonetics. 2nd ed. United Kingdom: Blackwell

Publishing. 2003.

Juang, B. H. and Furui, S. Automatic Recognition and Understanding of Spoken

Language – A First Step Toward Natural Human-Machine Communication. In

the Proceedings of IEEE. 2000. 88(8):1142-1165.

Juang, B. H. Maximum Likelihood Estimation for Mixture Multi-Variate Stochastic

Observations of Markov Chains. AT&T Technical Journal. 1985. 3(3): 239-

251.

Kirschning, I., Tomabechi, H. and Aoe, J-I. A Parallel Recurrent Cascade-

Correlation Neural Network with Natural Connectionist Glue. Proceedings of

116

the 1995 IEEE International Conference on Neural Networks, ICNN. 1995. 2:

953-956.

Kirschning, I., Tomabechi, H. and Aoe, J-I. Recent Advances in Continuous Speech

Recognition Using the Time-Sliced Paradigm. International Workshop on Soft

Computing in Industry 1996, IWSCI'96. April 1996. Muroran, Japan. 1996.

Kohavi, R. Study of Cross-Validation and Bootstrap Accuracy Estimation and

Model Selection. International Joint Conference on Artificial Intelligence.

1995. 2: 1137-1143.

Ladefodged, P. Phonetic data analysis: An Introduction to Fieldwork and

Instrumental Techniques. 2nd Edn. United Kingdom: Blackwell Publishing.

2003.

Lee, K. F. Context-Dependent Phonetic Hidden Markov Models for Speaker-

Independent Continuous Speech Recognition. IEEE Transactions on Acoustics,

Speech, and Signal Processing. 1990. 38(4): 599-609.

Lee, T., Ching, P. C. and Chan, L. W. Isolated Word Recognition using Modular

Recurrent Neural Networks. Pattern Recognition. 1998. 31(6): 751-760.

Lippmann, R. P. An Introduction to Computing with Neural Nets. IEEE ASSP

Magazine. 1987. 4(2): 4-22.

Lippmann, R. P. Pattern Classification using Neural Networks. IEEE

Communications Magazine. 1989. 27(11): 47-50, 59-64.

Li, F. Contrast and Covert Contrast: The Phonetic Development of Voiceless

Sibilant Fricative in English and Japanese Toddlers. Journal of Phonetics.

2009. 37: 111-124.

Liu, Y., Lee, Y. C., Chen, H. H. and Sun, G. Z. Speech Recognition using

Dynamic Time Warping with Neural Network Trained Templates. International

Joint Conference on Neural Networks. 1992. 2: 326-331.

Looney, C. G. Pattern Recognition using Neural Networks – Theory and

Algorithms for Engineers and Scientists. New York: Oxford University Press.

1997.

117

Maaly, I. A. and El-Obaid, M. Speech Recognition using Artificial Neural

Networks. 2nd Information and Communication Technologies, ICTTA ’06.

2006. 1: 1246-1247.

Macias-Guarasa, J. Montero, J. M., Ferreiros, J., Cordoba, R., San-Segundo, R.,

Gutierrez-Arriola, J., D’Haro, L. F., Fernandez, F., Barra, R. and Pardo, J. M.

Novel Applications of Neural Networks in Speech Technology Systems: Search

Space Reduction and Prosodic Modeling. Intelligent Automation and Soft

Computing. 2009. 15(4): 631-646.

Matsuura, Y., Miyazawa, H. and Skinner, T. E. Word Recognition using a Neural

Network and a Phonetically based DTW. Proceedings of the 1994 IEEE

Workshop on Neural Networks for Signal Processing [1994] 4. Sept 6-8, 1994.

Ermioni, Greece: IEEE. 1994. 329-334.

McLoughlin, I. Applied Speech and Audio Processing with Matlab Examples. New

York: Cambridge University Press. 2009.

Meyer, A. Estimation of Auditory Spectro-Temporal Receptive Fields using

Statistical Learning Methods. Master Thesis. University Oldenburg; 2009.

Moller, M. F. A Scaled Conjugate Gradient Algorithm for Fast Supervised

Learning. Neural Networks. 1993. 6: 525-533.

Mitra, S. K. Digital Signal Processing: A Computer-based Approach. 3rd Edn.

Singapore: Mc Graw Hill. 2006.

Nagata, K., Kato, Y. and Chiba, S. Spoken Digit Recognizer for Japanese

Language. Journal of the Audio Engineering Society, JAES. 12(4): 1964.

Nilsson, J. J. Learning Machines. New York: McGraw Hill. 1965.

Olson, H. F. and Belar, H. Phonetic Typewriter. Journal of the Acoustical Society

of America. 1956. 28(6): 1072-1081.

Pallet, D. S. The Role of the National Institute of Standards and Technology in

DARPA’s Broadcast News Continuous Speech Recognition Research Program.

Speech Communication. 2002. 37(1-2): 3-14.

118

Patterson, R. D., Smith, D. R. R., Dinther, R. V. and Walters, T. C. Size

Information in the Production and Perception of Communication Sounds. In:

Yost, W. A., Popper, A. N. and Fay, R. R. Auditory Perception of Sound

Sources. New York: Springer Science + Business Media, LLC. 43-75; 2008.

Pisarn, C. and Theeramunkong, T. Thai Spelling System for Automatic Spelling

Speech Recognition. An International Journal on Information Sciences. 2007.

178: 122-136.

Priddy, K. L. and Keller, P. E. Artificial Neural Networks – An Introduction.

India: SPIE. 2007.

Rabiner, L. and Juang, B. H. Fundamentals of Speech Recognition. New Jersey:

Prentice Hall, Englewood Cliffs. 1993.

Rabiner, L. R., Levinson, S. E., Rosenberg, A. E. and Wilpon, J. G. Speaker

Independent Recognition of Isolated Words using Clustering Techniques. IEEE

Transactions Acoustics, Speech, and Signal Processing. 1979. ASSP-27: 336-

349.

Radin, G. The 801 Minicomputer. IBM Journal of Research and Development.

1983. 27(3): 237-246.

Refaeilzadeh, P., Tang, L. and Liu, H. Encyclopedia of Database Systems - Cross

Validation. New York: Springer Science+Business Media. 2009.

Rumelhart, D. E. and McClelland, J. L. Parallel Distributed Processing:

Explorations in the Microstructure of Cognition. Cambridge, MA: M.I.T.

Press. 1986.

Sakai, T. and Doshita, S. The Phonetic Typewriter. Proceedings of International

Federation of Information Processing, IFIP Congress. Aug 27 – Sept 1, 1962.

Munich, Germany: ACM New York. 1962. 445-450.

Sakoe, H and Chiba, S. Dynamic Programming Algorithm Optimization for Spoken

Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal

Processing. 1978. ASSP-26(1): 43-49.

119

Selouani, S-A. and Caelen, J. Recognition of Arabic Phonetic Features Using

Neural Networks and Knowledge-Based System: a Comparative Study.

Proceedings of IEEE International Joint Symposia on Intelligence and Systems

1998. May 1998. Rockville, Maryland: IEEE Xplore Press. 1998. 404-411.

Shadle, C. H. and Mair, S. J. Quantifying Spectral Characteristics of Fricatives.

Proceedings of ICSLP ’96. 1996. 3: 1521-1524.

Shadle, C. H., Mair, S. J. Carter J. N. and Millner, N. The Effect of Vowel

Context on Acoustic Characteristics of [ç, x]. Proceedings of the 13th

International Congress of Phonetic Sciences, ICPhS ‘95. 1995. 1: 66-69.

Sima, M., Croitoru, V. and Burileanu, D. Performance Analysis on Speech

Recognition using Neural Networks. Proceeding of the International

Conference on Development & Application Systems. May 1998. Suceava,

Romania: University of Suceava. 1998, 259-266.

Smith, S.W. The Scientist and Engineer's Guide to Digital Signal Processing.

California: California Technical Publishing. 1997.

St George, B. A., Wooten, E. C. and Sellami, L. Speech Coding and Phoneme

Classification using Matlab and Neural Networks. Information Sciences

Elsevier, 1997. 90: 109-119.

Steinberg, R. and O'Shaughnessy, D. Segmentation of a Speech Spectrogram Using

Mathematical Morphology. IEEE International Conference on Acoustics,

Speech and Signal Processing, ICASSP 2008. March 31-April 4, 2008. Las

Vegas, Nevada, USA: IEEE. 2008. 1637 – 1640.

Sutat, S. T. and Chularat, T. Feature Windowing-Based for Thai Text-Dependent

Speaker Identification using MLP with Backpropagation Algorithm.

Proceedings of the 2000 IEEE International Symposium on Circuits and

Systems, ISCAS2000. 2000. 3(3): 579-582.

Suzuki, J. and Nakata, K. Recognition of Japanese Vowels – Preliminary to the

Recognition of Speech. Journal of the Radio Research Laboratories. 1961.

37(8): 193-212.

120

Tan, C.L. and Jantan, A. Digit Recognition using Neural Networks. Malaysian

Journal of Computer Science. 2004. 17(2): 40-54.

Ting, H. N. and Yunus, J. Speaker-Independent Malay Vowel Recognition of

Children using Multi-Layer Perceptron. TENCON 2004. 2004. A: 68-71 vol 1.

Ting Hua Nong. Speech Analysis and Classification using Neural Networks for

Computer-Based Malay Speech Therapy. Master Thesis. Universiti Teknologi

Malaysia; 2002.

Watrous, R. L. Speaker Normalization and Adaptation using Second-Order

Connectionist Networks. IEEE Transactions on Neural Networks. 1993. 4(1):

21-30.

Weinberg, B. and Horii, Y. Acoustic Features of Pharyngeal /s/ Fricatives Produced

by Speakers with Cleft Palate. Cleft Palate Journal 12. 1975. 12(1): 12-16.

Woodland, P. C., Odell, J. J., Valtchev, V., and Young, S. J. Large Vocabulary

Continuous Speech Recognition using HTK. Proceedings of ICASSP ’94.

1994. 1: 251-254.

Wright, J. L. Neural Network Architecture Selection Analysis with Application to

Cryptography Location. WCCI 2010 IEEE World Congress on Computational

Intelligence. July 18-23, 2010. Barcelona, Spain: IEEE. 2010. 2941-2946.

Yeou, M. and Maeda, S. Pharyngeal and Uvular Consonants are Approximants: An

Acoustic Modeling Study. Volume 2 of Proceedings of the XIIIth Intemational

Congress of Phonetic Sciences, ICPhS 95. August 13-19, 1995. Stockholm,

Sweden: Kungliga Tekniska Högskolan. 1995. 586-589.

Yusof, Z. M. and Ahmed, M. Malay Phoneme Classification using Perceptual

Linear Prediction (PLP) algorithm. 2nd International Conference on

Engineering Technology 2009. 8 – 10 Dec 2009. Kuala Lumpur, Malaysia:

UniKL BMI. 2009, 1-5.

Yoshua, B. Neural Networks for Speech and Sequence Recognition. London, UK:

International Thomson Computer Press. 1996.

121

Zue, V. W. and Cole, R. A. Experiments on Spectrogram Reading. IEEE

International Conference on Acoustic, Speech, and Signal Processing, ICASSP

’79. 1979. 4: 116-119.

Zue, V. W. et al. Jupiter: A Telephone-based Conversational Interface for Weather

Information. IEEE Trans. Speech Audio Processing. 2000. 8: 85-96.

Documents

SYSTEM FOR CHARACTERISATION AND RECOGNITION OF ARABIC ... · SYSTEM FOR CHARACTERISATION AND RECOGNITION OF ARABIC ... akustik dan fonetik adalah sangat pesat untuk ... Rujukan adalah