
On the Variability of TEOAE Human Identification and Verification System

by

Jin Sung Kang

A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science

The Edward S. Rogers Sr. Department of Electrical & Computer Engineering
University of Toronto

© Copyright 2018 by Jin Sung Kang

Abstract

On the Variability of TEOAE Human Identification and Verification System

Jin Sung Kang
Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical & Computer Engineering
University of Toronto
2018

This study presents a deep neural network architecture that achieves state-of-the-art multi-session verification and identification performance for a Transient Evoked Otoacoustic Emission (TEOAE) biometric system. A TEOAE is a roughly 20 ms long response generated by the ear that is naturally strong against falsification and replay attacks. It can be measured using a device with a speaker and multiple microphones. Previous TEOAE authentication methods focused on single-session or mixed-session performance; our method focuses on multi-session authentication performance. We train a neural network model with the triplet loss function to generate TEOAE embeddings that are separable in Euclidean space. These embeddings are used to create identity templates, which are in turn used to authenticate the user. Averaged across all tests, our method shows a 7.56% performance increase in identification scenarios and a 13.3% performance increase in verification scenarios over previous methods.

Acknowledgements

I would like to thank Professors Dimitrios Hatzinakos and Yuri Lawryshyn for guiding me with their mentorship and support. Their expertise has helped me become a better researcher and a better person.

I would like to thank my family. Without the love and support of my mom, my dad, and my sister, I would not be here. They have stuck by me every step of the way, through all my highs and lows, and were ever patient. Mom and dad, "you da real MVP".

I really want to thank Yanshuai Cao for his expertise and encouragement. I do not know how I would have finished this work without your help. Every time I struggled, you were there to bail me out. I would like to thank Foteini Agrafioti for her constant support and motivation to help me finish this work. To Joey Bose, for accompanying me on the much-needed caffeine walks and for his positivity during the stressful times. I would also like to thank everyone at Borealis AI for their support. Every time I was stuck and needed to vent, they complained about how hard their schooling was in a genuine effort to make me feel better. I would like to thank my labmates Mahjid Komeli, Sherif Seha, and Umang Yadav, who helped me get through the courses and provided me with their expertise in biometric security research.

I would like to thank Andrew Persaud, Sohaib Qureshi, Kevin Lee, Bhavik Vyas, Alan Li, Andre Yang, Dominic Cheng, Fortunato Guanlao, George Jose, Henry Liu, Jawad Ateeq, Rahjee Martanda, Rahul Udasi, Sam Haruna, SeungWan Choi, Shehzad Akbar, and Van Nguyen for their support. There were definitely times when I was losing my mind, and you were there to check up on me and help me stay sane. Thank you all for sending your positive vibes my way.

I would like to thank Vivosonic and the Royal Bank of Canada for their funding.

To my parents, who will always be my heroes


Contents

Acknowledgements

Dedication

List of Tables

List of Figures

List of Acronyms

1 Introduction
  1.1 Research Goals and Contributions
  1.2 Thesis Organization

2 Literature Review
  2.1 TEOAE Biometrics Literature Review
  2.2 Neural Networks
    2.2.1 Deep Neural Networks
    2.2.2 Biometrics using Neural Networks

3 Methodology
  3.1 Dataset
  3.2 Pre-processing
  3.3 Components of Neural Network
    3.3.1 1D Convolutional Layer
    3.3.2 Rectified Linear Unit
    3.3.3 Layer Normalization
    3.3.4 Max pooling
    3.3.5 Dropout
    3.3.6 Triplet Loss
    3.3.7 Triplet Mining
  3.4 Network Architecture
    3.4.1 Convolution Block
    3.4.2 Embedding Block
    3.4.3 Parameter Sharing
    3.4.4 Full Architecture
    3.4.5 Fusion Architecture
  3.5 Training
    3.5.1 Graphics Processing Unit
    3.5.2 Hyperparameter Explanation
    3.5.3 Hyperparameter Selection
    3.5.4 Weight Initialization
    3.5.5 Mini-batch Sampling
    3.5.6 Training Procedure

4 Templates and Comparison Metrics
  4.1 Identity Templates
    4.1.1 Mean Template
    4.1.2 SVM Template
    4.1.3 Comparison
  4.2 Comparison Metrics
    4.2.1 Euclidean Distance
    4.2.2 Cosine Distance
    4.2.3 Pearson Distance
  4.3 System Architecture
    4.3.1 Enrollment
    4.3.2 Verification
    4.3.3 Identification

5 Experiments
  5.1 Experimental Setting
    5.1.1 54 subject test
    5.1.2 24 subject test
    5.1.3 10 subject test
    5.1.4 Dataset Generalization
    5.1.5 Neural Network Generalization
    5.1.6 Tested Methods
  5.2 Metrics
    5.2.1 Verification
    5.2.2 Identification
  5.3 Single Ear Results
    5.3.1 Verification
    5.3.2 Identification
  5.4 Both ears
    5.4.1 Verification
    5.4.2 Identification
  5.5 Training Time Comparison
  5.6 Inference Time Comparison

6 Conclusions

Bibliography

A Performance
  A.1 File Size Comparison

B Failed Experiments
  B.1 Random Forest
  B.2 Simple Convnet
  B.3 One Network
  B.4 Auto Encoder
  B.5 CWT and Neural Networks
  B.6 Quadruplet Loss

C Data Splits

List of Tables

3.1 TEOAE recording protocol [17]
3.2 Network hyperparameters. Size in and size out are shown as features × channels. The kernel for a convolution block is shown as kernel size × channels, max pooling is shown with kernel size k and stride s, and the embedding block is described as kernel size × channels, embedding size.
3.3 List of training hyperparameters and their values
5.1 Number of responses in the training data set, averaged across 20 different data splits. The training set of the 24 subject test includes responses from 30 subjects, and the training set of the 10 subject test includes responses from 40 subjects. The data splits and the number of responses in the training set for each split are shown in Table C.2 and Table C.1.
5.2 Verification performance of different methods for the 54 subject test
5.3 Verification performance of different methods for the 24 subject test
5.4 Verification performance of different methods for the 10 subject test
5.5 Identification performance of different methods for the 54 subject test
5.6 Identification performance of different methods for the 24 subject test
5.7 Identification performance of different methods for the 10 subject test
5.8 Verification performance of different methods for the fusion of ears scenario
5.9 Identification performance of different methods for the fusion of ears scenario
5.10 Training time for neural networks with different data sizes
5.11 Training time for the CWT/LDA method with different data sizes
5.12 Inference time comparison between the neural network and CWT/LDA
A.1 PyTorch model file sizes for different hyperparameters
C.1 Random seed and test subject distribution for the 10 subject test
C.2 Random seed and test subject distribution for the 24 subject test

List of Figures

1.1 Diagram showing how the OAE response acquisition device operates. The speaker and microphones are on the earpiece. [36]
3.1 The distributions of the data in the TEOAE database
3.2 Location of the first and last 10 responses relative to the recording session. The length of each recording session differs between subjects, ranging from 23 to 336 responses.
3.3 First 10 TEOAE responses for subject 1
3.4 Last 10 TEOAE responses for subject 1
3.5 Last 10 averaged TEOAE responses for multiple subjects
3.6 Proposed network architecture for single ear authentication. The number beside each convolution block denotes the dimension of the convolutional filter; K is the kernel size, and S is the stride of the max pooling layer.
3.7 The components of the convolution and the embedding block. K represents the kernel size, and P represents the padding size. "/2" for the 1D convolution means that the number of channel dimensions is reduced by half.
3.8 Example of a 1D convolution. This diagram shows a 1D convolution layer with a kernel size of 3, a stride of 1, and padding of 1. The input to the layer is shown in grey: the original input in dark grey and the added padding in light grey. The kernel slides over the input, and the inner product between the input and the kernel is the output of the layer. The equation for calculating the output $y_2$ is $y_2 = w_1 x_1 + w_2 x_2 + w_3 x_3$.
3.9 Graph of the Rectified Linear Unit (ReLU) function
3.10 Example of a layer normalization layer. Layer norm calculates the mean and the variance along the feature dimension. The total number of features learned is 2 × mini-batch size.
3.11 Example of a max pooling layer. This diagram shows a max pooling layer with a kernel size of 3, a stride of 1, and a padding of 1. The input to the layer is shown in grey: the original input in dark grey and the added padding in light grey. The kernel slides over the input, and the maximum value within the kernel is produced as the output.
3.12 How a standard neural network differs from a neural network with dropout [59]
3.13 Diagram for triplet loss. Error is incurred when the negative is closer to the anchor than the positive. The end goal of training is the configuration on the right.
3.14 Proposed network architecture for the fusion of both ears. The number beside each conv block denotes the dimension of the convolutional filter; K is the kernel size, and S is the stride of the max pooling layer.
4.1 System diagram for enrolling a new user. SVM template enrollment is shown on the top, and enrollment using the mean template is shown on the bottom.
4.2 System diagram for verification of a probe sample. Verifying using the SVM template is shown at the top, and the mean template at the bottom.
4.3 System diagram for identifying a probe sample. Identification using the SVM template is shown at the top, and the mean template at the bottom.
5.1 Verification scenario EER graphs for subject 0
5.2 CMC curve for the single ear identification scenario
5.3 CMC curve for the fusion ear identification scenario
B.1 Simple architecture used to perform classification
B.2 Architecture diagram for training the convolutional neural network without shared parameters
B.3 The final objective of quadruplet loss

List of Acronyms

BEMD   Modified Bivariate Empirical Mode Decomposition
BioSec Biometrics Security Lab
CMC    Cumulative Match Characteristic
CPU    Central Processing Unit
CWT    Continuous Wavelet Transform
EER    Equal Error Rate
FAR    False Acceptance Rate
FRR    False Rejection Rate
GB     Gigabyte
GPU    Graphics Processing Unit
HBM    High Bandwidth Memory
IoT    Internet of Things
LDA    Linear Discriminant Analysis
MLE    Maximum Likelihood Estimation
PCIe   Peripheral Component Interconnect Express
PDF    Probability Density Function
ReLU   Rectified Linear Unit
ROC    Receiver Operating Characteristic
SGD    Stochastic Gradient Descent
STFT   Short-Time Fourier Transform
SVM    Support Vector Machine
TEOAE  Transient Evoked Otoacoustic Emission
WWR    Whole Wave Reproducibility

Chapter 1

Introduction

Smartphones are increasingly storing more of our personal information; our communication with one another, the photos we capture, the locations we visit, our banking information, and other sensitive data can all be accessed through a smartphone. With more devices connecting to the internet due to the growth of Internet of Things (IoT) devices and continual improvements to 5G networks, the risk of security failure has never been higher. The latest smartphones implement biometric authentication systems using fingerprint, face, and iris. These technologies have seen significant improvements over the years, largely owing to machine learning and deep neural networks. As these security systems advance, forgery and spoofing techniques also evolve. For example, generating a realistic video using limited photos is becoming simpler [62], images can be manipulated in a manner unnoticeable to classification [20, 41, 53], and Bose et al. [7] demonstrated that advanced face detectors can be fooled. With the advent of social media, images of faces can be easily obtained and misused to attack facial biometric authentication systems.

Additionally, fingerprints can be left on glass surfaces and used to create moulds to bypass security. Beyond these weaknesses, there is a more fundamental problem with these biometric modalities: the biometric information is compromised forever once it is stolen. Fingerprint, face, and iris information are very difficult to replace.

Transient Evoked Otoacoustic Emission (TEOAE) is a response generated by the ear after applying a low-level transient click stimulus, and it is present in most individuals' ears (99%+) [23]. It is mainly used to diagnose hearing loss in infants and the elderly. TEOAE responses depend on factors such as ear structure and genetics [29]. The cochlea generates a response that quickly dissipates after the stimulus is applied. The average length of the response is about 20 ms, and it is measured using an earphone-like device inserted into the ear. The measuring device contains a speaker and multiple microphones. The speaker produces a stimulus signal, and the microphones record the responses in quick succession. The diagram for the device operation is shown in Figure 1.1.

Figure 1.1: Diagram showing how the OAE response acquisition device operates. The speaker and microphones are on the earpiece. [36]

Unlike the biometric modalities available on smartphones, TEOAE is naturally strong against replay and falsification attacks because it requires the attacker to have a complete copy of the inner ear. Everything in the ear, including the membranes and the fluid consistency, has to be matched to generate the same response. TEOAE is known to produce responses under different stimulus signals [29], and further studies are required to prove that the underlying structure of TEOAE responses produced with different stimulus signals is also different. If such studies prove that authentication can be done using a response from a different stimulus, TEOAE may mitigate some of the risks associated with compromised biometric information: when TEOAE biometric information is stolen, compromised individuals could re-register using a TEOAE produced with a different stimulus signal to restore their credentials.

Currently, specialized equipment is needed to capture TEOAE responses. The smallest TEOAE measurement device available is approximately the size of a smartphone [64]. Although there are current technological limitations to integrating TEOAE measurement into smartphones, it is not hard to imagine integrating a TEOAE capturing device into headphones. More than 350 million earphones are sold every year [60]. An audio device with TEOAE could continuously authenticate users, providing a different dimension to human-computer interaction and enabling new forms of entertainment. Tracking immunization and hospital records for babies without a proper form of identification is a complicated process in developing countries; newborns already receive hearing tests using a TEOAE device, and TEOAE biometric authentication could be integrated into these devices to keep track of health records. Search and rescue missions, where rescuers must stay on the move and keep their hands free, could use TEOAE authentication because fingerprint and face recognition would be impractical. Any other application where the face has to be covered for protection or privacy can apply TEOAE authentication.

1.1 Research Goals and Contributions

The primary focus of this work is to test the viability of multi-session identification and verification. Data measured in one sitting is defined as one session, and a multi-session problem uses multiple sessions. Previous methods have not focused on the viability of TEOAE biometrics on multi-session data. We work on increasing authentication performance compared to previous methods, which have focused on single or mixed sessions. Our work contributes the following:

• We designed a deep neural network model that can learn the characteristics of a TEOAE response. This model generalizes across multiple sessions and outperforms previous methods in identification and verification scenarios.

• We designed a neural network architecture that does not require retraining when a new subject is registered. Neural networks are challenging to train, and conventional neural network identification methods require retraining a model to incorporate newly registered identities. We present an architecture that can register new identities without retraining the network.

• We propose neural network training techniques, incorporating the latest advancements in neural networks, to train the network faster and generalize better. Architecture designs from multi-task learning were used to reduce the number of parameters, which enables faster processing and better generalization.

• We simplified the feature extraction step required in previous methods. The neural network was designed to extract features directly from a normalized response, while previous methods used the Continuous Wavelet Transform (CWT) to extract features. Two hyperparameters are required to extract features using the CWT: the mother wavelet and the scale. These hyperparameters had to be tuned separately for the left ear, the right ear, and the dataset to get the best result, and combining multiple CWT scales results in worse performance than a system that uses a single tuned scale. We designed a model that works without ear- or dataset-specific hyperparameters.

• We tested the effectiveness of our method using the TEOAE biometric database. Performance in identification and verification scenarios was tested for a single ear TEOAE and extended to a fusion of the left and right ear TEOAEs. The TEOAE database was collected by the Biometrics Security Laboratory at the University of Toronto under protocol reference # 23018.

1.2 Thesis Organization

The remainder of the thesis is organized as follows:

Chapter 2 presents the background for TEOAE biometric authentication using neural networks. We discuss previous research that shows the TEOAE response to be a viable biometric authentication modality. We provide background information on neural networks and deep learning and examine applications of deep neural networks in other biometric systems that contribute to our work.

Chapter 3 describes the methodology. First, the TEOAE dataset and the TEOAE responses from different subjects are presented, along with the collection process and the protocol used for collecting the dataset. The chapter also provides background information on neural network layers and discusses the architecture of our proposed neural network model, including the reasoning behind choosing each layer and how the layers fit into the final architecture. Finally, the training methods and algorithms implemented to train the neural network are presented.

Chapter 4 introduces various identity templating methods and discusses the pros and cons of each. Distance metrics used to compare a probe sample to a template are examined. This chapter also explains the system architecture required to build an authentication system.

Chapter 5 presents the experimental settings and various tests, along with the metrics used to compare them. The results and discussion for both verification and identification scenarios are presented to show the effectiveness of our approach. The computational efficiency of previous methods and the neural network model is also discussed.

Chapter 6 concludes our work and provides directions for future studies.

Chapter 2

Literature Review

The TEOAE has been investigated before to assess its effectiveness as a biometric modality for authentication. The TEOAE biometric modality is not heavily researched, and the University of Toronto Biometric Security group is one of the few institutions continuing research in TEOAE. In this section, previous research on TEOAE biometric authentication systems is presented, along with a short background on neural networks and a discussion of advancements in other areas of machine learning that contribute to our work.

2.1 TEOAE Biometrics Literature Review

The original study by Swabey et al. [63] showed that the TEOAE response is a viable biometric modality for an authentication system. The paper visually investigated the inter-class and intra-class differences in TEOAE responses using a dataset with hundreds of subjects spanning a six-month period. The methods in this study mathematically modelled the responses in the time domain without any transformation. The inter-class and intra-class distances were estimated by Maximum Likelihood Estimation (MLE) to approximate the Probability Density Function (PDF), and the distance between two responses was calculated using the Euclidean distance. The study concluded that TEOAE responses were not only different among individuals but also repeatable with a high degree of reliability, making them well suited for a biometric authentication system.

Agrafioti et al. [18] worked on a model that applied a Modified Bivariate Empirical Mode Decomposition (BEMD) to build an identification system. BEMD with an auditory model was applied to decompose a response into multi-level local oscillation components. These components were then used to calculate matching scores. Their work was the first to explore the possibility of fusing results from both ears to increase identification performance.

Liu and Hatzinakos continued the work by applying a neural network autoencoder using CWT features [37]. Their work reduced the dimensions of a TEOAE response using a neural network to generate an embedding, and these embeddings were compared using the Euclidean distance metric. The Equal Error Rate (EER) is a metric used to compare different biometric systems: it measures the error rate at the operating point where the proportions of false matches and false non-matches are the same for a given test set. This method had a high system-level EER but a low individual-level EER. The results showed that the embeddings were separable individually, but the variability among individuals was very large. This work did not explore identification scenarios.
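To make the metric concrete, here is a minimal NumPy sketch of how an EER can be computed from genuine (same-identity) and impostor (different-identity) score distributions; the function and variable names are ours, not from the cited works:

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Approximate the EER from similarity scores.

    genuine:  scores for same-identity comparisons (higher = more similar)
    impostor: scores for different-identity comparisons
    """
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False match rate: proportion of impostors accepted at each threshold.
    fmr = np.array([(impostor >= t).mean() for t in thresholds])
    # False non-match rate: proportion of genuine comparisons rejected.
    fnmr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(fmr - fnmr))  # threshold where the two rates cross
    return float((fmr[idx] + fnmr[idx]) / 2.0)
```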

Another work by Liu and Hatzinakos used CWT and Linear Discriminant Analysis (LDA) [36] to achieve state-of-the-art performance in verification and identification scenarios. This paper used the CWT as a feature extractor and trained an LDA model to reduce the dimensionality of the features. The Pearson correlation distance was used to determine the similarity between two responses. This work showed great promise for single-session authentication scenarios, but it did not explore multi-session identification scenarios. The identification method presented in this work could only be applied to a test set where all individuals are registered on the system: even when the response of an unregistered individual was presented, the model always predicted a registered individual. This made the identification method difficult to apply in real-world systems.

Komeili et al. [30] presented their work to reduce TEOAE acquisition time, generalize across multiple sessions, and reduce computational complexity for verification scenarios. Ten responses were randomly picked from the first quarter of a recording session to reduce acquisition time, an improvement over previous methods that picked the last ten samples from a recording session. This work also aimed to generalize verification scenarios to multi-session data. The algorithm proposed in this work learns to select, from a list of pre-determined features, a subset of features that maximizes verification performance. However, the pre-processing step required to generate the list of features was time-consuming and computationally burdensome, and this paper did not explore identification scenarios.

For our work, we attempt to overcome the shortcomings of previous methods. Most works have not tested their methods against both verification and identification scenarios, did not test their algorithms on multi-session authentication, had a feature extractor that limited real-world application, and had to be retrained for every new registrant. In this thesis, we focus on building a multi-session authentication system, building an effective TEOAE feature extractor, and reducing the number of retraining steps when registering a new identity.

2.2 Neural Networks

Neural networks were originally designed as an attempt to mimic the brain. A neural network contains multiple neurons that are connected to each other. The connections between neurons have weights and biases that are learned from examples and data. Neural network models are trained using the backpropagation algorithm [52]. The objective function calculates the error of a neural network when it produces the wrong output, and the gradients of this error with respect to the model weights are used by the backpropagation algorithm to update the weights. Since gradients have to be calculated for backpropagation to work, it is important that neural networks be differentiable.

Neural networks require a large training set to be useful, but because of computational constraints, the full training set often cannot fit into memory. A training set is therefore divided into smaller sets known as mini-batches. Neural networks cannot compute full gradient information from a mini-batch because a mini-batch is only a subset of the training set; instead, an approximate gradient from each mini-batch is used to train the network. The optimization algorithm that finds the optimal neural network weights in this way is called Stochastic Gradient Descent (SGD). SGD is known to converge to a global minimum under relaxed constraints when the objective function is convex, and to a local minimum otherwise [8]. Neural networks are in general non-convex and non-linear, but SGD works well in practice.

Mini-batches are often randomly sampled without replacement [21, 38]. Smaller mini-batches can be computationally efficient but may not converge due to high variance per mini-batch; bigger mini-batches are computationally inefficient and may get stuck in a local minimum due to low variance. Tuning the mini-batch size is an essential part of the neural network training process.
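As a concrete illustration, the following PyTorch sketch runs one epoch of mini-batch SGD. The linear model and the random data are placeholders for illustration only, not the architecture used in this thesis:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for a real training set.
X, y = torch.randn(1000, 660), torch.randint(0, 54, (1000,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(660, 54)           # placeholder model
loss_fn = torch.nn.CrossEntropyLoss()      # objective function
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for xb, yb in loader:                      # one epoch over shuffled mini-batches
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)          # error on this mini-batch
    loss.backward()                        # backpropagation: gradients w.r.t. weights
    opt.step()                             # SGD step with the approximate gradient
```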

Neural networks have two operation modes: the training mode and the inference mode. In training mode, network weights are updated using the backpropagation algorithm; in inference mode, the network makes predictions based on input data. Depending on the type of layer used in the architecture, a neural network is computed differently in the training and inference modes [26]. Both modes perform forward propagation to calculate the output of the network, but the training step has an additional backpropagation step to update the network [52]. For efficient training, a Graphics Processing Unit (GPU) is required [31]. The inference mode is faster because it does not require the backpropagation step, and depending on the size of the network, inferences can be made using a Central Processing Unit (CPU) in a reasonable time.

The configuration of a network depends on the type of problem and the amount of training data. Neural networks contain many hyperparameters, and some guidelines exist to determine their ranges, but these vary with the problem and the dataset. Bergstra and Bengio [5] have shown that random search is exceptionally effective for neural network hyperparameter search.

2.2.1 Deep Neural Networks

The current boom in artificial intelligence is due to deep neural networks [31]. With sufficient data and a deeper neural network, problems that were difficult to solve with traditional machine learning algorithms can be solved. Deep neural networks are constructed by stacking multiple layers of neural network components on top of each other.

2.2.2 Biometrics using Neural Networks

Deep neural networks have been applied to vision-based authentication systems and have been shown to increase the performance of these systems. Fingerprint [35] and signature recognition [10] systems have applied neural networks since the early 1990s.

In face re-identification tasks, the Siamese network architecture proposed by Chopra et al. [14] has been commonly used. A Siamese network produces embeddings that are closer together when two samples are from the same identity, and further apart when the data points are from different identities. The contrastive loss function [22] is used to train a Siamese network. Further work was done by Koch et al. [28] to enable one-shot learning for face re-identification tasks. One-shot learning focuses on training a new class using one or a small number of samples, which allows neural networks to be trained even when training data is not plentiful. One-shot learning is useful in a biometric authentication system because registration needs to be quick and the amount of data is limited. In this thesis, we apply the techniques of one-shot learning and parameter sharing to allow registration of new users with a limited number of samples.

Parameter sharing has been used extensively for multi-task learning [51]. This technique is used when multiple tasks are related to each other. An example would be classifying types of clothes: a T-shirt, a dress, and a buttoned-up shirt share similar attributes, but classifying a specific type of T-shirt or dress may be too difficult for a single neural network. Instead of learning the different types using multiple networks, an architecture can be designed to learn common attributes using shared parameters and distinct attributes with individual parameters. Li et al. [65] applied parameter sharing to their network model to help it learn both the global and the local perspectives for person re-identification tasks. We apply a similar concept to our network architecture to guide our model to learn the attributes shared between the left and right ear TEOAEs, while using individual weights to learn the differences.

Schroff et al. [55] used triplet loss networks [25] to achieve state-of-the-art results for face re-identification. Triplet loss networks are similar to Siamese networks: the network learns to keep data points with the same label close together and those with different labels far apart. The best method for choosing triplets to reduce training time and increase accuracy was also explored; the paper proposes choosing hard triplets that maximize the loss function. Further discussion of triplet loss networks is presented in the next chapter.

Outside of image-based biometric systems, time-signal based biometric systems have also been using neural networks. Biometric modalities such as ECG [46] and EEG [3, 54, 56, 68] have been modelled with neural networks to perform identification and verification. Continuing this trend, we use neural networks to learn TEOAE response characteristics.

Chapter 3

Methodology

3.1 Dataset

The TEOAE dataset was collected at the University of Toronto by the Biometrics Security Lab (BioSec) [6]. The Vivosonic Integrity V500 system was used to record this dataset in an office environment. During the collection process, ambient noise and conversations were not controlled, in an effort to mimic real-life scenarios. The protocol used for collecting TEOAE responses is given in Table 3.1. There were no post-processing steps after the data was collected. Whole Wave Reproducibility (WWR) was used as the metric to ensure the quality of the TEOAE dataset; WWR measures how closely two responses are correlated. The number of responses differs between ears, sessions, and subjects because the recording was stopped when the response reached a steady state (WWR > 90%). As shown in Figure 3.1, the length of each session varies from 23 to 336 responses. Medical journals [29] show that the time taken to reach a steady state is highly dependent on the individual. The last ten responses are the steadiest responses because the response has reached a steady state by then.

Two sessions per individual were collected, with a minimum of one week between the two. Signals from both the left and right ear were collected for all participants. TEOAE signals were measured with two microphones per ear in short succession, and these values were saved into two buffers. In total, TEOAE responses from 54 individuals were collected in the database. The BioSec TEOAE database is the only TEOAE database collected for biometric security. The distribution of the data is presented in Figure 3.1.

The location of the first ten and the last ten responses in a single recording session is shown in Figure 3.2. Figure 3.3 shows the first ten responses, and Figure 3.4 shows the last ten responses. As can be seen in the figures, the first ten responses are very noisy compared to the last ten. Comparing Figure 3.3.a and Figure 3.3.c, the values of the same response in buffers A and B are slightly different due to the different sensor positions in the ear and the physical properties of the microphones. To mitigate the positioning issue and to reduce noise, we average the two buffers and use the mean response for training and testing. The averaged responses of the A and B buffers are shown in Figures 3.4.e and 3.4.f. Figures 3.3.a and 3.3.b show that the responses of the left ear and right ear are different. Responses from different subjects can be seen in Figure 3.5.

Table 3.1: TEOAE recording protocol [17]

Stimulus Parameters
    STI-Mode                        Non-Linear
    Click Interval                  21.12 ms
    Click Duration                  80 µs
    Sound Level                     80 dB peSPL

Test Control
    Record Window                   20 ms
    Low Pass Cut-Off                6000 Hz
    High Pass Cut-Off               750 Hz
    Artifact Rejection Threshold    55 dB SPL

Figure 3.1: The distributions of the data in the TEOAE database

Figure 3.2: Location of the first and last 10 responses relative to the recording session. The length of each recording session differs between subjects, ranging from 23 to 336 responses.

Figure 3.3: First 10 TEOAE responses for subject 1. (a) Left Ear Buffer A; (b) Right Ear Buffer A; (c) Left Ear Buffer B; (d) Right Ear Buffer B; (e) Left Ear Averaged; (f) Right Ear Averaged.

Figure 3.4: Last 10 TEOAE responses for subject 1. (a) Left Ear Buffer A; (b) Right Ear Buffer A; (c) Left Ear Buffer B; (d) Right Ear Buffer B; (e) Left Ear Averaged; (f) Right Ear Averaged.

Figure 3.5: Last 10 averaged TEOAE responses for multiple subjects. (a) Subject 40 Left Ear; (b) Subject 40 Right Ear; (c) Subject 24 Left Ear; (d) Subject 24 Right Ear; (e) Subject 26 Left Ear; (f) Subject 26 Right Ear.


3.2 Pre-processing

The majority of the signal pre-processing is done on the Vivosonic Integrity V500 sensor as it captures the data. The sensor removes the noise and the stimulus signal from the output. The TEOAE is recorded using two microphones and saved into two separate buffers, and the sensor outputs a vector of size 660. The responses in the two buffers are averaged to reduce noise, and we pre-process by normalizing the averaged response. The normalization is done as follows:

$$x_i \in X, \qquad \mu = \frac{1}{D}\sum_{i=1}^{D} x_i, \qquad \sigma = \sqrt{\frac{1}{D}\sum_{i=1}^{D}\left(x_i - \mu\right)^2}, \qquad Y = \frac{X - \mu}{\sigma} \tag{3.1}$$

where $X \in \mathbb{R}^{660}$ is a raw TEOAE response and $D$ is the size of the input vector.

Normalization removes inconsistencies in the data, stabilizes neural networks, and speeds up training. From a feature perspective, it prevents the model from learning the strength of a response. The strength changes over time [29], and removing it helps the model generalize better to multi-session data. Also, a TEOAE response differs between the left and right ear; without the normalization step, we would not be able to train a model using data from both ears, because the data distributions would be vastly different.
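A minimal NumPy sketch of this pre-processing step, assuming the two sensor buffers are available as length-660 arrays (the function and variable names here are ours, for illustration):

```python
import numpy as np

def preprocess(buffer_a: np.ndarray, buffer_b: np.ndarray) -> np.ndarray:
    """Average the two microphone buffers and apply the normalization of Eq. 3.1."""
    x = (buffer_a + buffer_b) / 2.0   # average the buffers to reduce noise
    mu = x.mean()                     # per-response mean
    sigma = x.std()                   # per-response standard deviation
    return (x - mu) / sigma           # normalized response Y
```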

3.3 Components of Neural Network

We present the architecture diagram for single ear authentication in Figure 3.6, and the building blocks of the architecture in Figure 3.7. We first present the components of our neural network and the reasoning behind using them.

Figure 3.6: Proposed network architecture for single ear authentication. The number beside each convolution block denotes the dimension of the convolutional filter; K is the kernel size, and S is the stride of the max pooling layer.

Figure 3.7: The components of the convolution and the embedding block. K represents the kernel size, and P represents the padding size. "/2" for the 1D convolution means that the number of channel dimensions is reduced by half.

3.3.1 1D Convolutional Layer

A TEOAE response is a one-dimensional time signal, so our network uses one-dimensional convolutional layers. LeCun and Bengio were the first to use a convolutional neural network for image, speech, and time series data [32].

A convolutional layer imposes a structure of local connectivity and assumes that a local group of neurons matters more than neurons further away. Making this assumption allows us to structure neural networks in a way that reduces the number of trainable parameters. For time series data, it forces the network to look at data closer together in time rather than data further apart, and the convolutional neural network learns location-invariant features. The convolutional layer operation is as follows:

$$h^k_{ij} = (W^k * x)_{ij} + b^k, \qquad x_{ij} \in X_j, \qquad k = \{1 \ldots K\}, \; j = \{1 \ldots J\} \tag{3.2}$$

where $X_j$ is the input vector, $J$ is the number of training examples, and $K$ is the output channel dimension. $W \in \mathbb{R}^{K \times F_w}$ holds the convolutional filter weights, where $F_w$ is the filter size, and $b \in \mathbb{R}^K$ is the bias vector. The output of the convolutional layer is $H \in \mathbb{R}^{J \times D \times K}$. Figure 3.8 illustrates the convolution operation.

Figure 3.8: Example of a 1D convolution. This diagram shows a 1D convolution layer with a kernel size of 3, a stride of 1, and padding of 1. The input to the layer is shown in grey: the original input in dark grey and the added padding in light grey. The kernel slides over the input, and the inner product between the input and the kernel is the output of the layer. The equation for calculating the output $y_2$ is $y_2 = w_1 x_1 + w_2 x_2 + w_3 x_3$.
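A minimal PyTorch sketch of the 1D convolution in Figure 3.8 (kernel size 3, stride 1, padding 1). The channel counts are illustrative; the actual filter dimensions are listed in Figure 3.6 and Table 3.2:

```python
import torch
import torch.nn as nn

# Kernel size 3, stride 1, padding 1, so the output length equals the input length.
conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3, stride=1, padding=1)

x = torch.randn(4, 1, 660)  # a mini-batch of 4 normalized TEOAE responses
h = conv(x)                 # shape: (4, 8, 660) - one feature map per output channel
print(h.shape)
```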

3.3.2 Rectified Linear Unit

The Rectified Linear Unit (ReLU) [19] is the most commonly used non-linear activation function in neural network design [33, 49]. A non-linear activation function allows neural networks to learn a non-linear input-output relationship. The ReLU function is better than the tanh or sigmoid functions at reducing the vanishing gradient problem [40], which can be problematic when training large networks. The ReLU activation function is as follows:

$$h^l_{ij} = \max\left(h^{l-1}_{ij},\, 0\right), \qquad h^{l-1}_{ij} \in H^{l-1}_j \tag{3.3}$$

where $H^{l-1}_j$ is the output from the previous layer, and $h^l_{ij}$ and $h^{l-1}_{ij}$ are scalar values. $l$ denotes the current layer, and $l-1$ denotes the previous layer. The graph in Figure 3.9 shows the output of the ReLU function for an input $x$.

Figure 3.9: Graph of the Rectified Linear Unit (ReLU) function

3.3.3 Layer Normalization

Similar to the normalization done in the pre-processing step, the output of each neural network layer is also normalized. As neural networks train, the distribution of outputs from each layer shifts considerably, causing a problem known as covariate shift [57]. Layer normalization [34] is a technique to normalize the output before the next neural network layer; it calculates the mean and the variance on a per-sample basis. The normalization helps networks converge faster and achieve better performance. The mean and the variance of the layer outputs are calculated as follows:

$$\mu_j = \frac{1}{D}\sum_{i=1}^{D} h^{l-1}_{ij} \tag{3.4}$$

$$\sigma_j = \sqrt{\frac{1}{D}\sum_{i=1}^{D}\left(h^{l-1}_{ij} - \mu_j\right)^2}, \qquad j = \{1 \ldots J\}, \; i = \{1 \ldots I\} \tag{3.5}$$

where $h^{l-1}_{ij}$ is the output from the previous layer, $D$ is the input vector size, $J$ is the size of a mini-batch, and $I$ is the number of features. The output of the layer is calculated by:

$$h^l_{ij} = \gamma_j \, \frac{h^{l-1}_{ij} - \mu_j}{\sqrt{\sigma^2_j + \varepsilon}} + \beta_j, \qquad j = \{1 \ldots J\} \tag{3.6}$$

where $\gamma_j$ and $\beta_j$ are parameters learned through training that scale and shift the normalization, $\mu_j$ is calculated using Eq. 3.4, and $\sigma_j$ is calculated using Eq. 3.5. The layer normalization step is illustrated in Figure 3.10.

Figure 3.10: Example of a layer normalization layer. Layer norm calculates the mean and the variance along the feature dimension. The total number of features learned is 2 × mini-batch size.
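A minimal PyTorch sketch of layer normalization over the feature dimension (Eqs. 3.4-3.6); the shapes here are illustrative, not those of our network:

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(normalized_shape=660, eps=1e-5)  # eps is the small constant in Eq. 3.6

h = torch.randn(4, 660)  # outputs of a previous layer, one row per sample
out = ln(h)              # each row is normalized to zero mean and unit variance,
                         # then scaled and shifted by the learned gamma and beta
```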

3.3.4 Max pooling

The number of parameters in a neural network needs to be reduced for faster computation, faster convergence, and smaller network size. A max pooling layer reduces the number of parameters by selecting the most important features. A filter of size $w$ is moved along the input by shifting its index by a stride $s$, and the maximum input value within the shifting filter is chosen as the output. Max pooling reduces the total output dimension of a layer when $s > 1$ and makes neural networks locally scale invariant [39]. The equation for a max pooling layer is given by:

$$Y = \max\left(h^{l-1}_{s \cdot k}, \ldots, h^{l-1}_{s \cdot k + w}\right), \qquad k = \{1 \ldots K\} \tag{3.7}$$

$$K = \left\lfloor \frac{L - w}{s} + 1 \right\rfloor \tag{3.8}$$

where $L$ is the output dimension of $h^{l-1}$. An example of a max pooling layer is shown in Figure 3.11.

Figure 3.11: Example of a max pooling layer. This diagram shows a max pooling layer with a kernel size of 3, a stride of 1, and a padding of 1. The input to the layer is shown in grey: the original input in dark grey and the added padding in light grey. The kernel slides over the input, and the maximum value within the kernel is produced as the output.
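A minimal PyTorch sketch of the max pooling layer in Figure 3.11; with a stride greater than 1 the layer downsamples according to Eq. 3.8:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool1d(kernel_size=3, stride=1, padding=1)  # as in Figure 3.11

x = torch.randn(4, 8, 660)
y = pool(x)                                     # length preserved because stride is 1
y2 = nn.MaxPool1d(kernel_size=2, stride=2)(x)   # stride 2 halves the length to 330
```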


3.3.5 Dropout

Dropout is a technique pioneered by Srivastava et al. [59] to prevent overfitting in a neural network. A dropout layer randomly sets some of the neurons to zero with probability $p$. When neural networks train, each neuron becomes highly dependent on other neurons to generate useful features; dropout randomly breaks these dependencies and forces each neuron to become a better feature extractor itself. Dropping random neurons can also be seen as training the neural network in multiple configurations. To get the best inference performance, the outputs of all configurations should be averaged, but keeping track of all possible configurations is infeasible and computationally costly. The solution is to approximate this average by scaling the output of the trained neural network, without applying dropout, by a factor of $p$.

Figure 3.12: How a standard neural network differs from a neural network with dropout [59]
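A minimal NumPy sketch of the two phases, with p taken as the drop probability as defined above (an illustration under that assumption, not the reference implementation of [59]):

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_train(h, p=0.5):
        # Training: each neuron is zeroed independently with probability p
        mask = rng.random(h.shape) >= p
        return h * mask

    def dropout_infer(h, p=0.5):
        # Inference: keep every neuron but scale by the keep probability,
        # approximating the average over all dropout configurations
        return h * (1 - p)

    h = np.ones((2, 4))
    print(dropout_train(h))      # roughly half the entries zeroed
    print(dropout_infer(h))      # every entry scaled to 0.5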

3.3.6 Triplet Loss

The triplet loss [66] is commonly used in face and person recognition tasks [13, 55] and speaker identification tasks [9, 67]. It is used to train a neural network with weights w as an embedding function f_w : R^{D_in} → R^{D_embed}, where D_in is the input vector dimension, and D_embed is the embedding vector dimension. A triplet consists of samples x^a, x^p, x^n, called the anchor, the positive, and the negative. The anchor and the positive are from the same class, but the negative is from a different class. A minimum


of two separate classes is needed to generate a triplet pair. The triplet loss is used to train a neural network model that learns to optimize the embedding space such that the Euclidean distance between the anchor (e^a) and the positive (e^p) is smaller than the distance between the anchor (e^a) and the negative (e^n). The embeddings e^a, e^p, e^n are computed by e = f_w(x), where e is the embedding and x is the data sample. The Euclidean distance function is defined as g. The loss is calculated by:

L_{triplet} = \sum_{i=1}^{N} \left[ g(e_i^a, e_i^p)^2 - g(e_i^a, e_i^n)^2 + \alpha \right]_+, \qquad y_i^a = y_i^p, \; y_i^a \neq y_i^n \qquad (3.9)

where y is the label for each embedding e, and α is the threshold for the distance margin.

Figure 3.13 illustrates the triplet loss objective. The triplet configuration on the left shows

the negative closer to the anchor than the positive. The loss of this configuration would be a

positive value. The configuration on the right shows the positive closer than the negative. As

long as g(e^a, e^p) + α is smaller than g(e^a, e^n), the loss will be zero, as the triplet has achieved the goal of increasing the inter-class distance.
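A minimal NumPy sketch of Eq. 3.9 for a batch of pre-formed triplets (array and function names are illustrative):

    import numpy as np

    def triplet_loss(e_a, e_p, e_n, alpha=0.2):
        # Squared Euclidean distances g(.,.)^2 for each triplet in the batch
        d_pos = ((e_a - e_p) ** 2).sum(axis=1)
        d_neg = ((e_a - e_n) ** 2).sum(axis=1)
        return np.maximum(d_pos - d_neg + alpha, 0.0).sum()   # [.]_+ then sum over i

    e_a, e_p, e_n = (np.random.randn(16, 128) for _ in range(3))
    print(triplet_loss(e_a, e_p, e_n))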

One of the problems with using the triplet loss objective is its large intra-class variance.

The loss function only requires the inter-class distance to be maximized, but it does

not explicitly set requirements for intra-class distances. Intra-class distance requirements are

somewhat implicitly enforced as an embedding with a large intra-class distance will also fail to

maximize inter-class distances. The triplet loss will ignore intra-class distances as long as it

can maximize inter-class distances. For a biometric security system, small intra-class distance

is essential, as it helps a network generalize better to data points that it has not seen before.

Since our model will operate on data that it has not been trained on, generalization is essential.

To improve the intra-class variance problem, we tried using a quadruplet loss [12] objective,

but it resulted in a lower authentication performance. More discussion about the quadruplet

loss is given in Appendix B.6.


Figure 3.13: Diagram for the triplet loss. The error is calculated when the negative is closer to the anchor than the positive. The end goal of training is the configuration shown on the right.

3.3.7 Triplet Mining

Like most data sampling techniques used in machine learning practice, triplets can be picked at random. However, the number of possible triplets grows cubically with the number of samples, and computing all of them is computationally infeasible. As training time increases, more triplets become perfectly separable and result in zero loss. These triplets are not helpful for neural network training as there is nothing new to learn. Triplets that produce zero loss in Eq. 3.9 are called easy triplets, and triplets that produce a large loss are called hard triplets. Training using easy triplets results in a network with slow convergence and lower accuracy, as it cannot learn distinguishing features. To maximize training speed and achieve better results, picking hard triplets is essential.

Two strategies to pick the ideal triplet are the online and the offline strategies. The offline

strategy uses the current state of the network to pick the next hard sample. It selects hard

triplets by calculating the Euclidean distance of the embeddings in the dataset. These triplets

are used to calculate the loss and update the neural network weights. The online strategy


randomly samples mini-batches from the dataset ahead of training, and constructs the hard

triplet from each mini-batch during training. The offline strategy can pick harder negatives

than the online strategy because it looks at a larger pool of data. However, picking the hardest

global triplet may have adverse effects on training due to the possibility of mislabeled data.

Picking the triplets inside a mini-batch can mitigate the problem.

In this thesis, the online batch hard mining strategy is used for selecting triplet pairs. This

selection method has been used to improve the convergence rate and accuracy in face re-

identification tasks [55]. A mini-batch is constructed by sampling N responses from each

of the M identities. Additionally, N samples from the identities that are not part of the M

identities are randomly sampled and added to the mini-batch. The final sample size of the

mini-batch is (M + 1)×N samples. Firstly, the embeddings for all samples in a mini-batch are

computed. Secondly, the Euclidean distance between all anchor and positive pairs, and anchor

and negative pairs are computed. Thirdly, all combinations of (ea, en) and (ea, ep) pairs that

result in zero loss are removed, and the loss is calculated. The algorithm for choosing hard

negatives is given in Algorithm 1.

We have experimented with a selection method which also picked the hardest (e^a, e^p) pair in a mini-batch and combined it with the hardest (e^a, e^n) pair, but it did not yield any improvements.

Enforcing another rule for picking the triplet pair limits the number of samples used to calculate

the loss. The limited training samples could explain the lack of improvements for this selection

method.
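For illustration, a vectorized PyTorch sketch in the spirit of Algorithm 1 is given below: every anchor-positive pair is scored against the anchor's hardest in-batch negative, and zero-loss (easy) triplets are discarded. The function name and the exact reduction are assumptions, not the code used in this work.

    import torch

    def hard_triplet_loss(emb, labels, alpha=0.2):
        dist = torch.cdist(emb, emb)                   # pairwise Euclidean distances
        same = labels[:, None] == labels[None, :]      # same-identity mask
        # The hardest (closest) negative per anchor maximizes the triplet loss
        hard_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
        # Loss of every (anchor, positive) pair against that hardest negative
        pos_mask = same & ~torch.eye(len(labels), dtype=torch.bool)
        losses = (dist - hard_neg[:, None] + alpha)[pos_mask].clamp(min=0)
        active = losses[losses > 0]                    # drop easy (zero-loss) triplets
        return active.mean() if len(active) > 0 else losses.sum()

    emb = torch.randn(30, 128)                         # (M + 1) x N = 30 embeddings
    labels = torch.arange(6).repeat_interleave(5)      # 6 identities, 5 responses each
    print(hard_triplet_loss(emb, labels))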

3.4 Network Architecture

We build a convolutional block with the components discussed in previous sections and stack

them to build our neural network architecture.

3.4.1 Convolution Block

The convolution block is designed by stacking a 1D-convolution, a ReLU, and a layer normal-

ization layer. Figure 3.7 shows the convolution block architecture. The output from each layer

is passed on to the next layer. The input to the block is of size R^{660×C_i}, and the output of the block is of size R^{660×C_o}, where C_i is the number of input channels, and C_o is the number of filters used in the 1D-convolutional layer.


Algorithm 1: Online batch hard triplet selection

Input: TEOAE embeddings e, labels Y, margin α
Output: triplets (e^a, e^p, e^n)

initialize triplets = []
for label ∈ set(Y) do
    pos_pair ← all anchor-positive pair combinations for this label
    neg_pair ← all anchor-negative pair combinations for this label
    dist_pos ← anchor-positive pair distances
    dist_neg ← anchor-negative pair distances
    for pos_idx ∈ pos_pair do
        losses = []
        for neg_idx ∈ neg_pair do
            losses.append(dist_pos[pos_idx] − dist_neg[neg_idx] + α)
        end
        index ← argmax(losses)
        if losses[index] > 0 then
            anchor ← pos_pair[pos_idx].anchor
            positive ← pos_pair[pos_idx].positive
            negative ← neg_pair[index].negative
            triplets.append((anchor, positive, negative))
        end
    end
end
return triplets



The order of the layers inside the block was chosen based on experiments. The input

is passed through a convolutional layer and a non-linearity. The non-linearity modifies the

output values and causes the mean and variance to shift. Intuitively it makes sense to apply

normalization as the last step before passing it to the next layer, but based on experiments,

the order of a non-linearity and a normalization layer does not seem to matter.
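A minimal PyTorch sketch of such a convolution block; the class name and the fixed input length of 660 are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ConvBlock(nn.Module):
        def __init__(self, c_in, c_out, kernel_size=3):
            super().__init__()
            padding = (kernel_size - 1) // 2          # Eq. 3.10: keeps the length at 660
            self.conv = nn.Conv1d(c_in, c_out, kernel_size, stride=1, padding=padding)
            self.relu = nn.ReLU()
            self.norm = nn.LayerNorm(660)             # normalizes along the feature axis

        def forward(self, x):                          # x: (batch, c_in, 660)
            return self.norm(self.relu(self.conv(x)))  # -> (batch, c_out, 660)

    block = ConvBlock(1, 64)
    print(block(torch.randn(8, 1, 660)).shape)         # torch.Size([8, 64, 660])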

3.4.2 Embedding Block

The embedding block is designed to generate an embedding of size R^{D_embed} from an input of size R^{660×C_in}. A max pooling layer is first used to reduce the number of features by half, and a

1D-convolution layer is used to reduce the number of channels by half. Reducing the number

of neurons in the layer before the fully connected layer reduces the size of a model because

a fully connected layer has weights connecting every input to every output. The output of a

1D-convolution layer is then passed through a ReLU, a layer normalization layer, and a fully connected layer to generate an embedding. Figure 3.7 shows the structure of an embedding

block. The dimension of the embedding block output D_embed is a hyperparameter that requires

tuning.
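A hypothetical PyTorch sketch of the embedding block, using the sizes listed in Table 3.2 (illustrative, not the code used in this work):

    import torch
    import torch.nn as nn

    class EmbeddingBlock(nn.Module):
        def __init__(self, c_in=32, d_embed=128):
            super().__init__()
            self.pool = nn.MaxPool1d(kernel_size=3, stride=2)     # 660 -> 329 features
            self.conv = nn.Conv1d(c_in, c_in // 2, 3, padding=1)  # halves the channels
            self.relu = nn.ReLU()
            self.norm = nn.LayerNorm(329)
            self.fc = nn.Linear(329 * (c_in // 2), d_embed)       # final embedding

        def forward(self, x):                  # x: (batch, c_in, 660)
            h = self.norm(self.relu(self.conv(self.pool(x))))
            return self.fc(h.flatten(1))       # -> (batch, d_embed)

    print(EmbeddingBlock()(torch.randn(8, 32, 660)).shape)        # torch.Size([8, 128])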

3.4.3 Parameter Sharing

The TEOAE dataset contains responses from both the left and right ears. Medical studies

show that the responses from both ears are different even when measured within the same

session [29]. However, a structural similarity of the ears suggests that there may be similarities

between responses. These similarities allow neural network architectures to be optimized.

Previous methods were effective because CWT was an excellent feature extractor for TEOAE

responses. The newly designed feature extractor needs to be more powerful than CWT to

increase the performance of the biometric system. For feature extraction using CWT, a CWT

scale and a mother wavelet had to be tuned. Instead of choosing a CWT scale, multiple CWT

scales can be combined to eliminate tuning, but the experiments show that combining multiple

scales reduces the performance due to extra noise in the features. To increase generalization,


a TEOAE response feature extractor that does not need to be tuned based on the dataset is

required. In theory, neural network feature extractors should have better performance than

CWT because the neural network is optimized to extract features from the TEOAE responses.

A TEOAE feature extractor can be trained using an encoder-decoder scheme. However, this

scheme is not ideal as it requires multiple training steps. Firstly, an encoder-decoder network

needs to be trained. Secondly, the encoder needs to be separated and placed on top of another

network that can generate TEOAE embeddings. Because a single network cannot effectively

learn the TEOAE structure of both ears, two embedding networks need to be trained from

the encoder output. In total, this scheme would require training three different networks. To

reduce the number of training steps, we design our architecture so it can be trained in one step.

Following the multi-task learning strategies discussed in the literature review, the proposed

network architecture has two sections: the common feature section, and the individual feature

section. The common feature section is a TEOAE feature extractor, and it is common for both

ears and shares the weights. This section learns features that are common to both ears and

replaces the CWT feature extractor that was used in previous methods. The individual feature

section has separate sets of weights and is designed to learn features unique to each ear.

The common feature section reduces the number of parameters in a neural network by

sharing the parameters. It allows a network to be trained using data from both ears. Training

with more data increases the accuracy and generalization as shown by Sun et al. [61]. In total,

the number of parameters is reduced from 2N_common + 2N_ind to N_common + 2N_ind, where

Ncommon is the number of trainable parameters in the common feature section, and Nind is the

number of trainable parameters in the individual ear section.

3.4.4 Full Architecture

The diagram for the architecture is given in Figure 3.6. The arrows show the direction of the

output. A normalized TEOAE response is used as an input to the network. The first three

blocks are part of the common feature section and the two blocks after the division are part of

the individual feature section.

After the common feature section computes the input, the network has two pathways to

compute the embeddings. The pathway is chosen depending on which ear a TEOAE response


is collected from. This double-pathway architecture forces the network to learn the separate distri-

butions for each ear. The two convolutional blocks that come after the split are part of the

individual feature section. After the individual feature section, the embedding block is used to

compute the final embedding. The output of the convolution layer in the embedding block is of dimension R^{329×C_o}. This output is then flattened into a one-dimensional vector of size 329 × C_o and passed to a fully connected layer.
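Putting the blocks together, a hypothetical end-to-end PyTorch sketch of the dual-pathway design is shown below. The block counts and channel sizes follow Table 3.2; all names are illustrative.

    import torch
    import torch.nn as nn

    def conv_block(c_in, c_out, k=3):
        # Conv1d -> ReLU -> LayerNorm, length kept at 660 (Section 3.4.1)
        return nn.Sequential(nn.Conv1d(c_in, c_out, k, padding=(k - 1) // 2),
                             nn.ReLU(), nn.LayerNorm(660))

    def embedding_block(c_in=32, d_embed=128):
        # MaxPool -> Conv -> ReLU -> LayerNorm -> Flatten -> FC (Section 3.4.2)
        return nn.Sequential(nn.MaxPool1d(3, stride=2),
                             nn.Conv1d(c_in, c_in // 2, 3, padding=1), nn.ReLU(),
                             nn.LayerNorm(329), nn.Flatten(),
                             nn.Linear(329 * (c_in // 2), d_embed))

    class TEOAENet(nn.Module):
        def __init__(self, d_embed=128):
            super().__init__()
            # Common feature section: one set of weights shared by both ears
            self.common = nn.Sequential(conv_block(1, 64), conv_block(64, 64),
                                        conv_block(64, 64), conv_block(64, 64))
            # Individual feature sections: separate weights per ear
            make_path = lambda: nn.Sequential(conv_block(64, 32), conv_block(32, 32),
                                              conv_block(32, 32),
                                              embedding_block(32, d_embed))
            self.left_path, self.right_path = make_path(), make_path()

        def forward(self, x, ear):             # x: (batch, 1, 660)
            h = self.common(x)
            return self.left_path(h) if ear == "left" else self.right_path(h)

    net = TEOAENet()
    print(net(torch.randn(8, 1, 660), ear="left").shape)    # torch.Size([8, 128])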

3.4.5 Fusion Architecture

The diagram for the fusion architecture is shown in Figure 3.14. The fusion architecture combines the

left and right ear pathway outputs from the neural network and concatenates them to create

a one-dimensional vector. The embedding block produces a vector of size R^{D_embed}, and the concatenated output is of size R^{2·D_embed}. We choose this architecture because training a single undivided network on both ears is adverse to the authentication performance. The concatenated one-dimensional vector is reduced to the final output of dimension R^{D_embed} using a fully

connected layer.


Figure 3.14: Proposed network architecture for the fusion of both ears. The numbers beside the conv blocks denote the dimensions of the convolutional filters, K is the kernel size, and S is the stride of the max pooling layer.
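A minimal sketch of the fusion step, assuming the per-ear embeddings have already been computed (names are illustrative):

    import torch
    import torch.nn as nn

    d_embed = 128
    fusion_fc = nn.Linear(2 * d_embed, d_embed)       # reduces 2*D_embed back to D_embed

    e_left, e_right = torch.randn(8, d_embed), torch.randn(8, d_embed)
    e_fused = fusion_fc(torch.cat([e_left, e_right], dim=1))
    print(e_fused.shape)                              # torch.Size([8, 128])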


3.5 Training

This section discusses the training process of our neural network model. We discuss the charac-

teristics of a GPU, the hyperparameters of the model, the neural network weight initialization

technique, and the mini-batch sampling process.

3.5.1 Graphics Processing Unit

Training neural networks requires a GPU to perform the computation efficiently. The explosion of deep neural networks is due to the rapid growth in computing power brought by GPUs. A single computation core on a GPU is slower than a CPU core, but the number of cores on a GPU is much larger. The increased number of cores allows parallelization, which speeds up neural network

training.

The biggest bottleneck for using a GPU is the data transfer time between a CPU and a GPU.

To reduce this transfer time, GPUs have separate onboard memory. The amount of memory

on a GPU is far smaller than the amount on a CPU. A typical GPU would have between 8

Gigabyte (GB) to 32GB of internal memory whereas CPU memory would be in the hundreds

of GB. The GPU memory and its cores are connected with a High Bandwidth Memory (HBM)

Bus to achieve maximum throughput. The data connection between a CPU and a GPU uses

a Peripheral Component Interconnect Express (PCIe) bus. The current generation PCIe bus

is about 30 times slower than the current generation HBM2 memory bus. Tuning the data

transfer between CPU memory and GPU memory is essential for maximum performance.

When the computational complexity is low, transferring data in small amounts can starve

a GPU of data and reduce its throughput. The main hyperparameter that controls the data

transfer between the GPU and CPU is the mini-batch size. Transferring larger data can be

more efficient than transferring smaller data due to the reduced overhead required to facilitate

the data transfer. Multiple mini-batches can be combined into a bigger batch and transferred

to a GPU to increase performance.

When training neural networks, GPU memory is predominantly used by data, models, and

gradients. Gradient information for every operation of a neural network is saved and used by

the backpropagation algorithm to update the weights. The memory impact increases when the


size of a mini-batch increases due to more data being stored on a GPU. A large mini-batch size

increases the amount of computation, and as a by-product increases the amount of gradient

information saved on a GPU. The size of a model also similarly increases memory consumption.

Firstly, models need to be saved on the GPU. Secondly, bigger models require more operations

which produce more gradient information.

3.5.2 Hyperparameter Explanation

There are numerous hyperparameters in our neural network as discussed in the above sections.

The dataset hyperparameters and the network hyperparameters have to be tuned to achieve

generalization and authentication performance.

Dataset Hyperparameters

The size of a mini-batch is an important dataset hyperparameter as it determines the number of

triplet pairs a network will process at once. Values for M identities and N samples are chosen

to optimize training time and authentication performance. M and N have to be large

enough that sufficient hard triplet pairs can be selected, but also small enough to fit into GPU

memory. When choosing these values, the amount of memory available on a GPU determines

the upper bound, and the quality of hard triplet pairs in a mini-batch determines the lower

bound.

Convolution Block Hyperparameters

The majority of the hyperparameters in the convolution block are used to tune the performance of

the 1D-convolution layer. The kernel size, stride, padding, and the number of output channels

are the hyperparameters to be optimized. These values are chosen to reduce the effort required

to stack multiple blocks. As discussed above, deep neural networks gain performance by stacking

more blocks. Keeping the input channels and output channels constant makes stacking easier.

The padding is calculated by:

P = \frac{K - 1}{2} \qquad (3.10)


where P is the amount of padding, and K is the kernel size. We keep the stride at one so that the convolution filter slides over the whole input vector. With these constraints, only the kernel size and the output channel size are left to be tuned. These constraints are common in the latest

deep neural network architectures such as the VGG network [58] and ResNet [24]. As discussed

in the literature review, random hyperparameter search is used to find the kernel size and the

output channel size.

Embedding Block Hyperparameters

The embedding block is tuned similarly to the convolution block but has the embedding size

as an additional hyperparameter. The neural network embeds a TEOAE response vector of

size 660 to a smaller dimensional space. The embedding space needs to be big enough to separate the classes, but small enough to allow for efficient computation. It is the most crucial hyperparameter for a neural network because it impacts the class separability in the embedding space [55]. The 1D-convolution layer in

the embedding block is tuned similarly to the convolution block. Kernel size is chosen based

on a random search, but the output channel size is chosen to be half the size of the number of

input channels to reduce the number of trainable parameters.

Training Hyperparameters

The number of blocks for the common feature section and the individual feature section needs

to be tuned to determine the final architecture. Choosing a small number for each section

results in a model that does not learn, and choosing a large number results in a model that

overfits.

The number of epochs to train the network is another hyperparameter that needs to be

tuned along with the learning rate of the network. Learning too slowly will result in the

network getting stuck in local minima, but a big learning rate might cause the network to

bounce around the loss surface without converging.


3.5.3 Hyperparameter Selection

Each hyperparameter configuration was tested at least five times to compute the mean and

the variance of the results. The number of blocks for the common feature section and the

individual feature section, the output channels for the one-dimensional convolutional layer, and

the embedding size of the network were tested in great depth. The parameters were randomly

searched and fine-tuned by hand. Table 3.2 shows the number of trainable parameters for each

block of the network and the hyperparameters chosen for each layer. Table 3.3 shows the chosen

hyperparameters.

Table 3.2: Network hyperparameters. Size In and Size Out are shown as features × channels. The kernel for a convolution block is shown as kernel size × channels, max pooling is shown with kernel size k and stride s, and the embedding block is described as kernel size × channels, embedding size.

Layer Name     Section             Size In     Size Out    Kernel         # of Params
Conv Block 1   Common Feature      660 × 1     660 × 64    3 × 64         8,448
Conv Block 2   Common Feature      660 × 64    660 × 64    3 × 64         20,544
Conv Block 3   Common Feature      660 × 64    660 × 64    3 × 64         20,544
Conv Block 4   Common Feature      660 × 64    660 × 64    3 × 64         20,544
Conv Block 5   Individual Feature  660 × 64    660 × 32    3 × 32         10,272
Conv Block 6   Individual Feature  660 × 32    660 × 32    3 × 32         7,200
Conv Block 7   Individual Feature  660 × 32    660 × 32    3 × 32         7,200
Max Pooling    Individual Feature  660 × 32    329 × 32    k = 3, s = 2   0
Embed Block    Embedding           329 × 32    128         3 × 16, 128    677,392

Total                                                                     1,484,208

Table 3.3: List of training hyperparameters and their values

Hyperparameter                   Value
Optimizer                        AMSGrad/ADAM
Learning Rate                    0.001
MiniBatch Size                   125
MiniBatch (Number of classes)    25
MiniBatch (Samples per class)    5
Epochs                           30


3.5.4 Weight Initialization

The filter weights are randomly initialized as the optimal filter weights are not known. The

training problems that are caused by bad weight initialization tend to be in two categories:

vanishing or exploding gradients [4]. The vanishing gradient problem is caused when a network

does not have enough gradient information to update its weights. It happens when the initialized weights are close to zero. The exploding gradient problem is caused when the gradients are so large that the weight updates destabilize training instead of making meaningful changes. This problem occurs when the initial

weights of a network are large. Weights are randomly sampled from a normal distribution with zero mean and a standard deviation calculated by:

\sigma = \sqrt{\frac{2}{N_{param}}} \qquad (3.11)

where N_param is the number of trainable parameters in a layer. The initialization step stabilizes network training and allows a network to generalize across different random seeds and different dataset splits. This initialization scheme has been used by Schroff et al. to initialize a neural network [24].
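A hedged PyTorch sketch of this scheme, interpreting n as the layer's fan-in in the style of He et al.; the helper name is an assumption.

    import torch.nn as nn

    def init_weights(module):
        # Zero-mean normal with std sqrt(2 / n), applied to conv and linear layers
        if isinstance(module, (nn.Conv1d, nn.Linear)):
            n = module.weight[0].numel()                   # fan-in of the layer
            nn.init.normal_(module.weight, mean=0.0, std=(2.0 / n) ** 0.5)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

    model = nn.Sequential(nn.Conv1d(1, 64, 3, padding=1), nn.ReLU(),
                          nn.Flatten(), nn.Linear(64 * 660, 128))
    model.apply(init_weights)                              # recursively initializes layers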

3.5.5 Mini-batch Sampling

In this section, we discuss the implementation details for selecting the mini-batches for single

ear and fusion of both ears scenario.

Single Ear

For a single ear scenario, the training dataset is first divided into a left and a right dataset.

The mini-batches are constructed for each ear using the separate dataset. For every mini-batch,

5 TEOAE responses from 25 individuals are randomly sampled. 5 additional responses are

randomly sampled from the individuals that are not part of the previously chosen 25 individuals

to increase variability in the mini-batch. Since the mini-batches for the left and right are

separately constructed with random sampling, there is no guarantee that all identities in the

left mini-batches would also be in the right mini-batches.

As was mentioned in Section 3.1, the number of samples for every identity is different due


to the collection process. The training dataset is unbalanced with each class having a different

number of responses. Two schemes for constructing mini-batches were tested. The first scheme

samples the mini-batch without replacement. We disregard the class size difference and pick

every sample in a class until there are no samples left. The next scheme involves sampling every

identity with replacement until the number of samples matches that of the identity with the

most samples. The results of the two schemes were similar, but the first scheme had

lower computational overhead than the second scheme. Therefore, the first scheme was chosen.

The total number of samples in the left and right datasets was also different. The

difference in the number of samples resulted in a small difference between the number of mini-

batches for the left and right. The performance impact for the difference in mini-batches is

minimal, as all samples would be used in the training process with a sufficient number of

iterations and random permutations.
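A minimal Python sketch of this sampling scheme, assuming (hypothetically) that the dataset is a mapping from identity to its list of responses:

    import random

    def sample_mini_batch(dataset, m=25, n=5):
        # n responses from each of m identities ...
        ids = random.sample(list(dataset), m)
        batch = [(i, r) for i in ids for r in random.sample(dataset[i], n)]
        # ... plus n responses from identities outside the chosen m
        outside = [i for i in dataset if i not in ids]
        batch += [(i, random.choice(dataset[i])) for i in random.choices(outside, k=n)]
        return batch                               # (m + 1) * n samples in total

    dataset = {k: [f"resp_{k}_{j}" for j in range(30)] for k in range(54)}
    print(len(sample_mini_batch(dataset)))         # 130 = (25 + 1) * 5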

Fusion of Both Ears

Fusion of both ears requires a different method for generating mini-batches because a single

training sample requires both the left and right ear TEOAE response. Firstly, the training

datasets for the left and right ears were combined into a joint dataset. Secondly, 5 responses from

25 individuals were randomly sampled from the joint dataset. The sampled dataset contains

a mix of left and right TEOAE responses. Lastly, for each response that was sampled,

a response from the opposite ear of the same individual was randomly sampled to create a

mini-batch with responses from both ears.

The test dataset was constructed by concatenating the last ten responses from the second

session for both ears. Since the left and right ear responses were collected at different times,

there is no time overlap between the two ears. We test the fusion scenario by combining the

steadiest ten samples from both ears.

3.5.6 Training Procedure

The network is trained in two stages. Firstly, the loss is computed using left channel mini-

batches. Secondly, the right channel loss is computed using the right channel mini-batches. The

losses from these two channels are summed and backpropagated to update the weights of the


network. Reversing the order of the two steps does not make a difference because the backpropagation

happens at the same time for both channels.
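A minimal sketch of one such training step; the model and loss below are deliberately simplified stand-ins for the dual-pathway network and the batch-hard triplet loss described earlier.

    import torch
    import torch.nn as nn

    # Placeholders so the snippet runs on its own
    model = nn.ModuleDict({"left": nn.Linear(660, 128), "right": nn.Linear(660, 128)})
    loss_fn = lambda e: e.pow(2).mean()             # stand-in for the triplet loss
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    x_left, x_right = torch.randn(125, 660), torch.randn(125, 660)

    optimizer.zero_grad()
    loss = loss_fn(model["left"](x_left)) + loss_fn(model["right"](x_right))
    loss.backward()     # one backward pass updates shared and per-ear weights together
    optimizer.step()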

We also tested training the left channel for a full iteration and then training the right channel for a full iteration. The network went through what is known as catastrophic forgetting [50]: after one channel's iteration was complete, the common feature section had forgotten what it learned for the other channel. The solution to this problem is still being investigated [27].

In this chapter, we discussed the dataset and the neural network implementation details. In

the next chapter, we present the various templating methods, comparison metrics, and biometric

security system architecture.


Chapter 4

Templates and Comparison Metrics

In this chapter, we discuss different methods to create identity templates and the distance met-

rics used to accept or reject an identity. We also discuss how the verification and identification

system is implemented.

4.1 Identity Templates

Identity templates are the user information that is created to register a user on a biometric

security system. The identity templates are used to determine the identity of a probe-sample.

A probe-sample is a sample that is presented to the biometric system for authentication. The

distance or the probability is measured between a probe-sample and a template to decide whether to

accept or reject the user from the system. In this section, we discuss the pros and cons of two

templating methods: the Mean template and the SVM template.

4.1.1 Mean Template

The Mean template is the easiest templating method to implement, and it is created by averaging

TEOAE embeddings generated by f_w as discussed in Section 3.3.6. By calculating the average, we are

locating the centroid of the embeddings. The mean template is generated by:

T = \frac{1}{N} \sum_{i=1}^{N} e_i \qquad (4.1)


where N is the number of samples used for the enrollment session. This templating method

only requires data from one identity. No information from other identities or extra training is

required.
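A one-line NumPy sketch of Eq. 4.1 (names are illustrative):

    import numpy as np

    def mean_template(embeddings):
        # Eq. 4.1: the template is the centroid of the enrollment embeddings
        return np.mean(embeddings, axis=0)

    enrollment = np.random.randn(10, 128)    # 10 enrollment responses, D_embed = 128
    template = mean_template(enrollment)     # shape (128,)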

4.1.2 SVM Template

Support vector machine (SVM) [15] is a machine learning technique that is commonly used for

classification or regression. The goal of the method is to find a linear boundary that maximizes the margin between two classes of multi-dimensional data. The separating hyperplane is

described as wx + b, where x is the input, b is the bias, and w is the vector of learned weights. When wx + b ≥ 0 the sample is labeled as the +1 class, and when wx + b < 0 the model labels it as the −1 class. The hyperplane is found by solving the following constrained optimization problem:

\min_{w, b, \xi} \; \|w\|^2 + C \sum_{i=1}^{N} \xi_i

\text{subject to} \; y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0

where ξ_i is a slack variable that relaxes the constraint because the data may not be completely separable by the hyperplane, x_i is a training sample (in our case, an embedding e_i generated by f_w for response i), y_i is the training label, N is the number of samples, and the parameter C controls the trade-off between maximizing the margin and minimizing the training loss.

Since SVM separates the classes using a hyperplane, it is not useful when dealing with data

that is not linearly separable. A specialized SVM kernel can be used to project the data to a

higher dimensional space. When the data is in a higher dimensional space, it may be possible

to separate the data with a hyperplane. SVMs are used for binary classification but can be

extended to multi-class classification by training multiple one-vs-one or one-vs-all classifiers.

The SVM template is created by training a one-vs-all SVM classifier. Linear SVM classifiers

are known to improve verification performance in biometric systems [16,30,47,48]. We train an

SVM classifier for every identity against all other identities. When registering a new individual,

data for all individuals previously registered in the system is required for training. As the

number of registered identities grows, so does the amount of data required to train a template.
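A hedged sketch of SVM-template enrollment using scikit-learn's LinearSVC; the text does not name an SVM implementation, so this library choice and the function name are assumptions.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_svm_template(new_emb, enrolled_emb):
        # One-vs-all: the new identity is the positive class, every previously
        # enrolled embedding is the negative class
        X = np.vstack([new_emb, enrolled_emb])
        y = np.concatenate([np.ones(len(new_emb)), -np.ones(len(enrolled_emb))])
        return LinearSVC(C=1.0).fit(X, y)

    template = train_svm_template(np.random.randn(10, 128), np.random.randn(200, 128))
    print(template.decision_function(np.random.randn(1, 128)))   # signed distance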


4.1.3 Comparison

The SVM template has the potential to produce better authentication results compared to the

Mean template because all the identities are known to the system. The SVM template can

increase the inter-class separation between registered individuals. However, as the number of registered

individuals grows, the authentication performance of the SVM template might suffer because the

data points may not be linearly separable.

The main problem with the SVM template is that it needs to be retrained when a new identity is added. When the number of identities registered in the system is low, the SVM template can be useful: retraining templates is quick, and an SVM model may be able to separate the data efficiently. As the number of identities grows, the dataset size and the training time become significant problems. The Mean template does not require retraining every time a

new identity is registered.

An SVM template cannot be trained with only one identity because SVM models are designed

to separate a binary class with a maximum distance. Without the negative identities, the

boundary cannot exist. A predefined negative dataset needs to be used to train a model to

develop a single-user verification system. On the other hand, the Mean template can set a distance

threshold to reject probe-samples that are far from the registered template.

The Mean template method depends on the embedding function f_w to generate an embed-

ding that is separable by Euclidean distance. The database does not need to save previous

registration data because training a mean template does not require data points from other

identities. An embedding function that accounts for various intra-class differences is a prereq-

uisite for mean template method. Training a better embedding function requires more data,

and when the training dataset is small, the mean template method might not be an option.

4.2 Comparison Metrics

This section describes the distance functions used to calculate the similarities between a tem-

plate and a probe-sample. These functions are used to calculate the distance for the CWT/LDA

method and the mean template method. The closer a probe-sample and a template are, the

higher the probability that a probe-sample is from the same identity as the template.


4.2.1 Euclidean Distance

Euclidean distance is the most common distance metric, and it is used by the triplet loss to

optimize the embedding space. It calculates the straight line distance between two points. The

distance is calculated by:

d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (4.2)

where X and Y are the vectors being compared and n is the dimension of the comparison

vectors.

4.2.2 Cosine Distance

Cosine similarity is commonly used when comparing similarities between embeddings. It computes the cosine of the angle between two vectors. When the orientation of the two vectors is the same, the resulting value is 1, and 0 when they are orthogonal. We calculate the cosine similarity by:

\text{similarity} = \frac{X \cdot Y}{\|X\| \cdot \|Y\|} \qquad (4.3)

where X and Y are the two vectors that are being compared. We then change the similarity

into a distance by computing

d(X, Y) = 1 - \text{similarity} \qquad (4.4)

4.2.3 Pearson Distance

Pearson correlation distance measures the linear correlation between the two vectors. The

output of the function is 0 when the two vectors are not correlated, −1 when the vectors

are negatively correlated, and 1 when the vectors are positively correlated. The correlation

coefficient is calculated by:

\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \qquad (4.5)

where X and Y are the two vectors being compared, and σ_X and σ_Y are the standard deviations of X and Y. The Pearson distance is calculated by:

d = 1 - \rho_{X,Y} \qquad (4.6)


where ρ_{X,Y} is the Pearson correlation.
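A minimal NumPy sketch of the three distance functions in Eqs. 4.2-4.6 (function names are illustrative):

    import numpy as np

    def euclidean(x, y):                                   # Eq. 4.2
        return np.sqrt(((x - y) ** 2).sum())

    def cosine_distance(x, y):                             # Eqs. 4.3-4.4
        similarity = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
        return 1.0 - similarity

    def pearson_distance(x, y):                            # Eqs. 4.5-4.6
        rho = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
        return 1.0 - rho

    x, y = np.random.randn(128), np.random.randn(128)
    print(euclidean(x, y), cosine_distance(x, y), pearson_distance(x, y))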

4.3 System Architecture

There are three main modes to a biometric security system: enrollment, identification, and

verification. In this section, we discuss the three modes.

4.3.1 Enrollment

The enrollment mode registers an individual into the system. The goal of the system is to

recognize the individuals who are registered in the system and reject imposters. The enrollment process

requires the TEOAE responses and the identity of the user registering onto the system.

Multiple responses are collected during registration to create an accurate template.

Figure 4.1 shows the enrollment process for both the SVM template and the Mean template.

For both methods, responses are first acquired using the sensor and pre-processed by normaliz-

ing them. The embedding e is computed from the normalized response using the neural network

which acts as the embedding function f_w.

The SVM template uses both the embeddings from the new responses and the embeddings

of the identities stored in the enrollment database. The trained template is saved in the template

database with the identity of the individual as the primary key. The new responses used during

registration are saved into the enrollment database. For the Mean template, the average of the

embeddings is computed and then saved to the Template Database.

4.3.2 Verification

The verification mode is a one-to-one matching mode. An individual first claims a certain identity, and then the system checks the presented probe-sample against a template

saved in the system. This mode is commonly used on smartphones. When someone presents

a fingerprint on a smartphone, the system assumes that the owner of the phone is attempting

to gain access. The fingerprint security system checks the probe-sample against the fingerprint

template of the owner and accepts or rejects a probe-sample based on a decision threshold. The

decision threshold can be set per identity or for the overall system.


Determining the threshold requires a careful inspection of the EER graph and the Receiver Operating Characteristic (ROC) curve. The two metrics are explained in Section 5.2.

Figure 4.2 shows the operation of a verification system. First, a claimed identity and

TEOAE responses are collected from an individual. The identity template is retrieved from the

database, and the embedding generated using the response is compared against the template.

The system decides to accept or reject the identity by using the comparison metrics discussed

in the sections above or by using the SVM class probability.

4.3.3 Identification

The identification mode is a classification mode. The system is given a probe-sample with an

unknown identity, and the system has to find the identity if the identity is registered in the

system. The system also has to reject a probe-sample if it is not registered. Testing a system

using a dataset with only the individuals known to the system is called a closed-set problem.

Testing a system using a dataset containing individuals that are unknown to the system is called

an open-set problem. Biometric identification is commonly seen in crime scene investigation,

where the investigators match a fingerprint found in a crime scene to one registered in the

database. The system checks a probe-sample against every registered template to find the best

match. Identification mode is more time-consuming than the verification mode because it has

to make N comparisons, where N is the number of templates registered in the system.

Figure 4.3 shows the identification system. A TEOAE response with an unknown identity is

presented to the system. The system pre-processes the response and computes an embedding.

The template that is the closest to the embedding is chosen as the identity for the given response.
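A minimal sketch of the two decision rules on top of Mean templates; the threshold value and all names are illustrative assumptions.

    import numpy as np

    def euclidean(x, y):
        return np.linalg.norm(x - y)

    def verify(probe, template, threshold):
        # One-to-one: accept if the probe is close enough to the claimed template
        return euclidean(probe, template) <= threshold

    def identify(probe, templates):
        # One-to-N: return the registered identity whose template is closest
        return min(templates, key=lambda name: euclidean(probe, templates[name]))

    templates = {"alice": np.zeros(128), "bob": np.ones(128)}
    probe = np.full(128, 0.1)
    print(verify(probe, templates["alice"], threshold=2.0))   # True
    print(identify(probe, templates))                          # alice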


Figure 4.1: System diagram for enrolling a new user. SVM template enrollment is shown on the top, and enrollment using the Mean template is shown on the bottom.


Figure 4.2: System diagram for verification of a probe-sample. Verifying using the SVM template is shown at the top, and the Mean template is shown at the bottom.

Figure 4.3: System diagram for identifying a probe-sample. Identification using the SVM template is shown at the top, and the Mean template is shown at the bottom.


Chapter 5

Experiments

In this chapter, we discuss the tests used to compare the different methods and present the results for identification and verification, for a single ear and for the fusion of both ears.

5.1 Experimental Setting

There are a total of 54 subjects and two sessions for each subject in the TEOAE dataset. The

dataset is divided into three parts: a training set, a template set, and a test set. A training

set is used to train the neural network, and it is not used in the CWT-based system because the

CWT feature extractor does not require training. A template set is used to build templates for

each subject, and a test set is used as probe samples. We test the system for verification and

identification scenarios. Within each scenario, three different training, testing, and template

sets were used to test different aspects of our neural network model. These tests are the 54

subject test, the 24 subject test, and the 10 subject test. This section describes each test.

5.1.1 54 subject test

Liu and Hatzinakos [36] used the 54 subject test to evaluate their biometric authentication

system. The training set was constructed using first session responses from all subjects after

removing the last ten responses from each subject. The last ten responses were removed from

each subject to ensure that the responses used in the training set are not used in the template

set. The template set was created using the last ten responses from each subject in the first


session. The last ten responses in the second session were used as the test set. This test includes

the same subjects in both the training set and the test set. We use this test to see how well

our model can separate the classes that the model has seen before.

5.1.2 24 subject test

The 24 subject test has been used by Liu et al. [37] and Majid et al. [30]. This test divides the

subjects rather than the sessions into a training set and the test set. The training set includes

30 subjects, and the test set includes the other 24 subjects. The training set includes responses

for both sessions from the 30 subjects. The template set and the test set are created using the

24 subjects. The template set includes the last ten responses from the first session, and the

test set includes the last ten responses from the second session. The number of samples for this

test and the 54 subject test are roughly equal, as shown in Table 5.1. We use the 24 subject

test to see how the methods perform for the subjects not in the training set.

5.1.3 10 subject test

The 10 subject test is similar to the 24 subject test but with an increased number of subjects

in the training set. The training set includes responses from both sessions for 44 subjects. The

template set and the test set includes the last ten responses from both sessions for 10 subjects.

The last ten responses from the first session are used as the template set, and responses in the

second session are used as test set. This test has more training data and subjects compared to

the 24 subject test. By comparing the results of the 24 subject test and this test, the performance

difference of each method when it is trained using more data can be evaluated. We also compare

the results between CWT and mean template method for this test to determine which method

is better at extracting TEOAE features for authentication.

5.1.4 Dataset Generalization

The 24 subject test and the 10 subject test can have different training and testing data depending

on how the subjects are split. One train and test split cannot test the variability of our neural

network model performance. Our neural network is tested across multiple dataset splits to

capture the variability in a dataset.


Table 5.1: Number of responses in the training set averaged across 20 different data splits. The training set of the 24 subject test includes responses from 30 subjects, and the training set of the 10 subject test includes responses from 44 subjects. The data splits and the number of responses in the training set for each split are shown in Table C.2 and Table C.1.

Test               Dataset Size
54 Subject test    25,602
24 Subject test    27,528.7
10 Subject test    40,793.7

We test the methods using 20 different dataset splits. Firstly, 20 integers between one

and one hundred were picked to be used as random seeds for initializing the neural network.

Secondly, the dataset is split into a test set and training set based on random values generated

by a random number generator seeded with the given random seed. These splits are used to test all

other methods. The dataset splits with random seeds are shown in Table C.2 for the 24 subject

test, and Table C.1 for the 10 subject test. The dataset for the 54 subject test is only tested

using one split because the dataset does not vary.

5.1.5 Neural Network Generalization

The weights of the neural network are randomly sampled as described in the sections above.

The stochastic nature of neural networks needs to be tested to prove the generalization. We test

neural network generalization by initializing the neural network using multiple random seeds.

The CWT and CWT/LDA methods are not stochastic, so we only test them once for the 54 subject test.

5.1.6 Tested Methods

We test our method against three other methods. Liu and Hatzinakos [36] developed two

methods: CWT and CWT/LDA. In verification mode, both CWT and CWT/LDA methods

use CWT to generate TEOAE features from the responses. For the CWT/LDA method, an LDA

model is trained on top of the CWT features. Both methods calculate the Pearson correlation

distance to accept or reject an identity. In the identification method, instead of the Pearson

correlation distance, a multinomial logistic regression model is trained to identify individuals. The

logistic regression model outputs the probabilities of each identity registered in the system, and


the identity with the highest probability is chosen as the output of the model.

The third method is called the No-FS method. We use this method as a baseline for

verification scenarios. This method combines features that have been shown to work for time-signal-based biometric modalities and trains a linear SVM model in a one-vs-all configuration. The

features used in this method are CWT, Short-Time Fourier Transform (STFT), autocorrelation,

maximum standard deviation, kurtosis, skewness, and cepstrum features. These features have

been recommended by [1, 2, 11, 30, 42, 44]. Work by Majid et al. [30] has used this method as a baseline. The size of the feature vector is 7522.

5.2 Metrics

In this section, we discuss the performance metrics used to evaluate the verification and identification modes.

5.2.1 Verification

We use EER as a metric for the verification mode. EER is the rate at which False Acceptance

Rate (FAR) and False Rejection Rate (FRR) are equal. The FAR is calculated by:

FAR = \frac{\text{False Acceptance}}{\text{False Acceptance} + \text{True Rejection}} \qquad (5.1)

where False Acceptance is the number of negative samples that were classified as positive, and True

Rejection is the number of negative samples that were classified as negative. Therefore, FAR is

the ratio between the falsely classified negative samples and the total negative samples. FRR

is calculated by:

FRR = \frac{\text{False Rejection}}{\text{False Rejection} + \text{True Acceptance}} \qquad (5.2)

where False Rejection is the number of positive samples that were classified as negative, and True Acceptance is the number of positive samples classified as positive. FRR is the number of falsely rejected

positive samples out of all the positive samples.

The system accepts or rejects a sample based on a probability threshold. We set this

threshold so that samples with probability above this threshold are accepted, and the samples


below it are rejected. When the threshold is at zero, we accept all samples. FAR would be

at 100% because we accepted all negative samples, and the FRR is at 0% because no positive samples were rejected. As we increase the threshold, the FAR decreases while the FRR increases. The FAR approaches 0% as the threshold is increased because we no longer accept any samples, and the FRR approaches 100% because all positive samples are rejected. The FAR and the FRR meet at some threshold, and their common value is the EER. We show

the EER graphs for the experiments in Figure 5.1. The system with a lower EER is considered

better.

The EER test set contains P × N responses from the second session, where P is the number of identities, and N is the number of responses per identity. For each of the P identities, we calculate the distances against (P − 1) × N negative responses and N positive responses. We combine these P × P × N distances and calculate the EER.
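A minimal NumPy sketch of this computation, phrased in terms of distances where smaller means more similar (the function name is illustrative):

    import numpy as np

    def compute_eer(pos_dist, neg_dist):
        # Sweep a distance threshold and find where FAR and FRR cross
        thresholds = np.sort(np.concatenate([pos_dist, neg_dist]))
        far = np.array([(neg_dist <= t).mean() for t in thresholds])   # Eq. 5.1
        frr = np.array([(pos_dist > t).mean() for t in thresholds])    # Eq. 5.2
        idx = np.argmin(np.abs(far - frr))
        return (far[idx] + frr[idx]) / 2

    pos = np.abs(np.random.randn(100))          # genuine distances (smaller)
    neg = np.abs(np.random.randn(100)) + 1.0    # imposter distances (larger)
    print(compute_eer(pos, neg))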

5.2.2 Identification

For identification, accuracy is the most commonly used metric to compare authentication per-

formances [45]. We present the Cumulative Match Characteristic (CMC) curve to evaluate the

methods based on a multi-rank prediction. The accuracy test set contains P × N responses.

The template with the smallest distance to the probe response is considered to be the identity of

the response. We then calculate accuracy based on how many of these identities were correctly

identified. The system with higher accuracy is considered better.

5.3 Single Ear Results

We present the results for verification mode and identification mode for the single ear.

5.3.1 Verification

We present our results for verification scenarios in Tables 5.2-5.4. For CWT/LDA [36] and

CWT [36] methods, we chose two CWT scales based on their performance: the CWT scale

that achieves minimum left EER, and the CWT scale that achieves minimum right EER.

When a single scale achieves the minimum EER for both the left and right ear, only one scale is


presented. A combination of CWT scales from one to ten (Multi-Scale) is presented for the CWT

and CWT/LDA methods. The No-FS method is presented as a baseline. A subset of the

graphs used for the EER calculation is shown in Figure 5.1.

The results show that our method outperforms previous methods in some tests, especially

for the left ear. Table 5.2 presents the results for the 54 subject test. For the left ear, the Mean

template method performs better than previous methods with EER of 4.28%. For the right

ear, the CWT/LDA method performed the best. Results for the 24 subject test are presented

in Table 5.3. The CWT/LDA(multi-scale) outperformed all other methods for both the left

ear and the right ear. Finally, Table 5.4 shows the results for the 10 subject test. Our SVM

template method outperformed other methods for the left and right ears. We can see that in a

test with a smaller number of subjects, the SVM template outperforms the Mean template, but with a larger number of registered subjects, the trend reverses.

In previous studies [30, 36], it was shown that the EER of the left ear was much lower than that of

the right ear. Our neural network based method shows smaller performance differences between

the left and the right. The smaller difference is due to the optimization criterion for our neural

network model. When our model is training, we stop the process when the sum of the left and

right ear training loss reaches a minimum. Since we optimize the model for both ears rather

than individually, the performance difference between the two is lower.

The results between the mean template and CWT methods for the 24 subject test and the 10

subject test show that the neural network based TEOAE feature extractor performs better than the

generic CWT feature extractor. Another limitation of CWT as a feature extractor is shown in

the results; the scale which produces the best result changes for every test. Combining multiple

CWT scales shows lower performance compared to an optimized scale.


Table 5.2: Verification performance of different methods for the 54 subject test

Method                        Left EER (mean ± std)    Right EER (mean ± std)
Ours (SVM template)           5.63% ± 1.21%            2.96% ± 1.03%
Ours (Mean template)          4.28% ± 1.19%            2.38% ± 0.796%
CWT/LDA (Multi-scale) [36]    7.47%                    3.70%
CWT/LDA (scale=7) [36]        4.44%                    1.85%
CWT (Multi-scale) [36]        9.92%                    8.89%
CWT (scale=8) [36]            9.26%                    9.44%
CWT (scale=9) [36]            10.30%                   6.85%
No-FS                         38.3% ± 1.04%            41.7% ± 0.78%

Table 5.3: Verification performance of different methods for the 24 subject test

Method                        Left EER (mean ± std)    Right EER (mean ± std)
Ours (SVM template)           5.60% ± 3.04%            4.16% ± 2.16%
Ours (Mean template)          7.44% ± 3.28%            4.82% ± 1.96%
CWT/LDA (Multi-scale) [36]    4.20% ± 2.95%            2.92% ± 1.84%
CWT/LDA (scale=6) [36]        4.89% ± 2.47%            4.42% ± 2.49%
CWT (Multi-scale) [36]        9.47% ± 3.26%            8.57% ± 2.32%
CWT (scale=8) [36]            8.70% ± 2.74%            8.84% ± 2.48%
No-FS                         38.7% ± 2.91%            41.2% ± 3.22%



Table 5.4: Verification performance of different methods for 10 subject test

Method                        Left EER (mean ± std)    Right EER (mean ± std)
Ours (SVM template)           2.89% ± 4.87%            3.63% ± 4.41%
Ours (Mean template)          5.85% ± 5.38%            4.87% ± 3.84%
CWT/LDA (Multi-scale) [36]    8.81% ± 6.49%            4.61% ± 4.38%
CWT/LDA (scale=8) [36]        6.08% ± 4.74%            4.37% ± 3.97%
CWT (Multi-scale) [36]        10.5% ± 8.03%            10.8% ± 7.50%
CWT (scale=8) [36]            8.72% ± 5.55%            8.89% ± 5.44%
CWT (scale=9) [36]            10.5% ± 8.03%            7.88% ± 5.44%
No-FS                         39.16% ± 5.21%           39.6% ± 6.38%



Figure 5.1: Verification scenario EER graphs for subject 0. Panels: (a) left ear, 54 subject test; (b) right ear, 54 subject test; (c) left ear, 24 subject test; (d) right ear, 24 subject test; (e) left ear, 10 subject test; (f) right ear, 10 subject test.



5.3.2 Identification

We show our experimental results for all identification tests in Tables 5.5–5.7. Similar to

the verification results, we present the two CWT scales for CWT/LDA [36] and CWT [36] that

achieve the highest left-ear and right-ear accuracy. Only one scale is shown if the same CWT

scale achieved the highest accuracy for both ears. CWT/LDA (Multi-scale) is also added for

comparison.

Our proposed methods outperform the CWT/LDA and CWT methods in all identification

scenarios. For the left ear, the accuracy was 91.4% for the 54 subject test, 89.7% for the 24

subject test, and 94.7% for the 10 subject test. For the right ear, the accuracy was 96.3% for the

54 subject test, 93.2% for the 24 subject test, and 96.6% for the 10 subject test. The CWT/LDA

method had accuracies in the low-to-mid 80% range. The CMC curves are presented in Figure 5.2.

Our methods achieve better results than the CWT/LDA method for rank-1 prediction. The

CWT/LDA method has low rank-1 accuracy but shows similar performance at rank 3.

The CWT method does not perform well.
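The rank-k accuracies behind a CMC curve can be computed from a probe-by-template distance matrix, as in the sketch below; the function name and the assumption that a smaller distance means a better match are ours.

    import numpy as np

    def rank_k_accuracies(distances, true_ids, max_rank=3):
        # distances[i, j]: distance of probe i to identity template j
        # true_ids[i]:     column index of probe i's real identity
        order = np.argsort(distances, axis=1)            # best match first
        hit = order == np.asarray(true_ids)[:, None]     # where the true id sits
        rank = np.argmax(hit, axis=1)                    # 0-based rank of true id
        return [(rank < k).mean() for k in range(1, max_rank + 1)]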

We compare the results of the Mean template and CWT methods for the 24 subject test

and the 10 subject test. We can draw the same conclusion as in the verification results regarding

the performance of the neural network as a TEOAE feature extractor: the feature extractor

trained using a neural network is better than CWT. We also observe that the scale producing

the best result changes for every test. Our neural network method generalizes well across

different training set sizes and shows smaller variance than previous methods. The results

suggest that our method is more robust against dataset splits than the CWT-based feature

extraction method.



Table 5.5: Identification performance of different methods for 54 subject test

Method                        Left Accuracy (mean ± std)    Right Accuracy (mean ± std)
Ours (SVM)                    91.4% ± 2.22%                 96.3% ± 1.36%
Ours (Mean)                   89.8% ± 2.81%                 94.5% ± 1.61%
CWT/LDA (Multi-Scale) [36]    78.5%                         80.7%
CWT/LDA (Scale=6) [36]        85.2%                         81.5%
CWT/LDA (Scale=10) [36]       83.3%                         88.5%
CWT (Multi-Scale) [36]        58.3%                         53.0%
CWT (Scale=10) [36]           58.3%                         59.6%

Table 5.6: Identification performance of different methods for 24 subject test

Method                        Left Accuracy (mean ± std)    Right Accuracy (mean ± std)
Ours (SVM)                    89.7% ± 6.84%                 92.3% ± 3.80%
Ours (Mean)                   87.6% ± 7.99%                 93.2% ± 4.51%
CWT/LDA (Multi-Scale) [36]    86.6% ± 5.84%                 92.0% ± 5.19%
CWT/LDA (Scale=6) [36]        84.3% ± 6.67%                 84.3% ± 7.07%
CWT/LDA (Scale=7) [36]        83.5% ± 5.64%                 85.6% ± 6.82%
CWT (Multi-scale) [36]        83.5% ± 5.64%                 85.6% ± 6.82%
CWT (Scale=10) [36]           68.8% ± 5.69%                 75.4% ± 5.24%

Table 5.7: Identification performance of different methods for 10 subject test

Method                        Left Accuracy (mean ± std)    Right Accuracy (mean ± std)
Ours (SVM)                    93.8% ± 11.4%                 94.7% ± 10.1%
Ours (Mean)                   94.7% ± 10.1%                 96.6% ± 4.44%
CWT/LDA (Multi-Scale) [36]    83.9% ± 11.7%                 88.2% ± 8.55%
CWT/LDA (Scale=9) [36]        84.6% ± 11.5%                 89.9% ± 8.14%
CWT (Multi-scale) [36]        78.25% ± 14.9%                82.7% ± 12.2%
CWT (Scale=10) [36]           51.8% ± 17.0%                 60.3% ± 14.8%



Figure 5.2: CMC curve for single ear identification scenario. Panels: (a) left ear, 54 subject test; (b) right ear, 54 subject test; (c) left ear, 24 subject test; (d) right ear, 24 subject test; (e) left ear, 10 subject test; (f) right ear, 10 subject test.



5.4 Both ears

We present the results for verification and identification using both the left and right ear TEOAE

responses. We test our method against the CWT/LDA (Mul-score) and CWT (Mul-score) methods

proposed by Liu and Hatzinakos [36]. For both methods, the CWT scales proposed by Liu and

Hatzinakos were used.
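This section does not restate the exact fusion rule, so the sketch below illustrates one plausible score-level fusion in which the left- and right-ear embedding distances are simply summed before thresholding; treat the rule and the names as assumptions rather than the thesis's implementation.

    import numpy as np

    def fused_distance(probe_l, template_l, probe_r, template_r):
        # Sum the per-ear embedding distances into one match score;
        # a single threshold (or argmin over identities) is then applied.
        return (np.linalg.norm(probe_l - template_l)
                + np.linalg.norm(probe_r - template_r))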

5.4.1 Verification

We present our verification results in Table 5.8. Both of our methods performed better than the

CWT/LDA method in the 54 subject test and the 10 subject test. The Mean template method

achieved an EER of 0.187% for the 54 subject test, and the SVM template method achieved

an EER of 1.18% for the 10 subject test. The CWT/LDA method outperformed our method

in the 24 subject test.

As with the single ear results, the performance of the fusion-of-ears scenario increases with the

number of subjects and the number of data points used for training. The increase in subjects

appears to be the more significant factor in improving the final performance. The relationship

between the number of subjects and performance is also visible in the CWT/LDA and

CWT methods proposed by Liu and Hatzinakos [36].

The same trend discussed for the single ear verification results can be seen for the SVM template

and Mean template methods: the SVM template outperforms the Mean template in tests with

few subjects but is worse with a larger number of registered subjects.

5.4.2 Identification

We present our identification results in Table 5.9. Our method performs better for the 54 subject

test and the 24 subject test, achieving 99.3% for the 54 subject test and 94.9% for the 24 subject

test, with higher accuracy and lower variance. For the 10 subject test, our method has

considerably lower variance, showing that it is more robust to changes in the TEOAE dataset and

its distribution. The CMC graphs are shown in Figure 5.3.



Table 5.8: Verification performance of different methods for fusion-of-ears scenario

Method          Subjects    EER (mean ± std)
Ours (SVM)      54          0.414% ± 0.564%
Ours (Mean)     54          0.187% ± 0.146%
CWT/LDA [36]    54          0.604%
CWT [36]        54          7.22%
Ours (SVM)      24          2.64% ± 1.67%
Ours (Mean)     24          3.99% ± 1.71%
CWT/LDA [36]    24          1.27% ± 1.07%
CWT [36]        24          6.11% ± 3.18%
Ours (SVM)      10          1.18% ± 2.50%
Ours (Mean)     10          3.71% ± 3.91%
CWT/LDA [36]    10          2.15% ± 2.15%
CWT [36]        10          6.42% ± 6.09%

Table 5.9: Identification performance of different methods for fusion-of-ears scenario

Method          Subjects    Accuracy (mean ± std)
Ours (SVM)      54          99.3% ± 1.04%
Ours (Mean)     54          98.6% ± 1.07%
CWT/LDA [36]    54          92.8%
CWT [36]        54          63.1%
Ours (SVM)      24          94.6% ± 3.34%
Ours (Mean)     24          95.0% ± 4.40%
CWT/LDA [36]    24          91.1% ± 4.88%
CWT [36]        24          69.4% ± 4.88%
Ours (SVM)      10          95.4% ± 6.81%
Ours (Mean)     10          95.2% ± 5.97%
CWT/LDA [36]    10          97.7% ± 10.2%
CWT [36]        10          78.2% ± 10.3%



Figure 5.3: CMC curve for fusion ear identification scenario. Panels: (a) 54 subject test; (b) 24 subject test; (c) 10 subject test.



5.5 Training Time Comparison

We compare the training time of our neural network method and the CWT/LDA method.

For the neural network method, we measured the time taken to train the model for 30 epochs.

When training the model, the lowest training error occurs before the 30th epoch, but the

specific epoch is not known before training; in most experiments, 30 epochs were enough for

the model to reach its lowest error. The computation was done on a machine with an IBM Power8 CPU and

an NVIDIA P100 GPU. The training time for each test and the training data size are

shown in Table 5.10.

For the CWT/LDA method, only the time to train an LDA model was measured; the training

time for the logistic regression used in the identification scenario was negligible. The CWT/LDA

method has multiple retraining steps because a new model has to be trained every time a new

subject is registered to the system: registering N subjects requires N − 1 training steps.

The time to train the N − 1 LDA models was measured on a Power8 CPU with 96 threads. The

training time is presented in Table 5.11.
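The sketch below illustrates how these N − 1 retraining steps can be timed: the LDA model is refit each time a new subject is enrolled, starting once two subjects are registered. The function name and the data layout are assumptions.

    import time
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def total_enrollment_time(features, labels, subject_ids):
        # Refit LDA each time a new subject is enrolled (N - 1 fits in all,
        # starting once two subjects are registered) and sum the fit times.
        total = 0.0
        for n in range(2, len(subject_ids) + 1):
            mask = np.isin(labels, subject_ids[:n])   # data of enrolled users
            start = time.perf_counter()
            LinearDiscriminantAnalysis().fit(features[mask], labels[mask])
            total += time.perf_counter() - start
        return total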

We can see that the CWT/LDA method trains much faster than the neural network due to its

smaller training set and lower computational cost. The training time increases with the size

of the data for both methods.

Test Name          Data Size    Training Time (s)
10 Subject Test    40,793.7     1126.67
24 Subject Test    27,528.7     762.26
54 Subject Test    25,602       834.41

Table 5.10: Training time for neural networks with different data sizes

Test Name          Data Size    Training Time (s)
10 Subject Test    100          0.30
24 Subject Test    240          2.75
54 Subject Test    540          16.24

Table 5.11: Training time for CWT/LDA method with different data sizes



5.6 Inference Time Comparison

Inference computation time was measured on a machine with an IBM Power8 CPU and an

NVIDIA P100 GPU. The CWT/LDA approach depends on CPU processing power, while

the neural network approach depends mainly on GPU processing power. For both

methods, the time was measured from pre-processing to producing the probability or distance

metric.

For the neural network, we used a combination of the NumPy and PyTorch libraries to perform

pre-processing and inference. The time for the CWT/LDA method covers the CWT, the LDA transform,

and the logistic regression inference for one CWT scale. We used the pywt Python library

for the continuous wavelet transform and scikit-learn for LDA and logistic regression. The pywt

library does not provide the Daubechies 5 mother wavelet for the CWT, so the Gauss 3 mother wavelet was

used to approximate the CWT computing time. The results are shown in Table 5.12.
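As an illustration of the measured pipeline stage, the snippet below times a single-scale CWT with the gaus3 mother wavelet using pywt; the signal length and the scale are placeholders, not the thesis's exact settings.

    import time
    import numpy as np
    import pywt

    x = np.random.randn(1024)                # stand-in for one TEOAE response
    start = time.perf_counter()
    coeffs, freqs = pywt.cwt(x, scales=[8], wavelet="gaus3")
    elapsed_us = (time.perf_counter() - start) * 1e6
    print(f"CWT time: {elapsed_us:.1f} us")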

The inference time for the neural network is longer than that of the CWT/LDA method. The

biggest bottleneck for the neural network method is the inference itself; for CWT/LDA it is

the pre-processing and feature extraction time.

Method       Pre-Processing (µs)    Inference (µs)    Total (µs)
Ours         211.2                  1312.6            1523.8
CWT/LDA      657.0                  4.169             661.2

Table 5.12: Inference time comparison between neural network and CWT/LDA


Chapter 6

Conclusions

In this thesis, we focused on building a multi-session TEOAE biometric identification and

verification system. We first introduced the data collection process and the attributes of the

TEOAE response, discussed previous research in TEOAE biometric security, and presented

the current state of the art in other biometric systems using neural networks.

Previous methods used CWT to extract features from the data and used these features to

create a model. The hyperparameters required for CWT feature extraction are a mother

wavelet and a scale. The CWT method overfits to the dataset, and the authors of the previous

works recommended further research into a method better than CWT. Our work

focused on removing the dependency on CWT feature extraction.

We also focused on reducing the number of parameters to reduce the size of the model. Past

work in one-shot learning, Siamese networks, and multi-task learning was applied to design an

efficient neural network. The implementation took advantage of the common structure between

the TEOAE responses of the left and right ear: parts of the network were designed with shared

parameters, which learned the commonality, while other parts of the network were designed to

learn the differing distributions of the left and right ear.

For a biometric system, it is essential to allow registration of a new identity with ease.

The registration process for a biometric security system can be challenging due to the limited

number of biometric samples that can be collected, and difficult registration processes frustrate

users. Neural networks designed for classification must be retrained whenever a new identity is

added, which makes them slow and difficult to train and therefore unfit for biometric systems.

Instead, we designed a neural network architecture




that produces an embedding of a TEOAE response. The similarity, or distance, between

two embeddings is used to identify and verify individuals. To train the model, we used the

triplet loss objective, which penalizes the model when a negative sample lies closer to the anchor

than a positive sample in the embedding space. Training a triplet loss network requires an

algorithm that picks hard triplets to improve performance and convergence rate. We used an

online batch-hard negative mining strategy to select only the triplets that maximize the loss.
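A minimal PyTorch sketch of this batch-hard mining strategy, in the spirit of Hermans et al. [25], is given below; the margin value is illustrative rather than the thesis setting, and the code assumes every anchor in the batch has at least one positive.

    import torch

    def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
        # Pairwise L2 distances between all embeddings in the batch.
        dist = torch.cdist(embeddings, embeddings, p=2)
        same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-identity mask
        eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)
        # Hardest positive: farthest same-identity sample (diagonal excluded).
        hardest_pos = (dist * (same & ~eye).float()).max(dim=1).values
        # Hardest negative: closest different-identity sample.
        hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
        return torch.relu(hardest_pos - hardest_neg + margin).mean()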

We also discussed different templating methods for registering a user, presenting the pros

and cons of each strategy along with different distance functions that calculate the similarity

between the templates and the probe samples.

Combining all of these ideas, we built the neural network model, trained the network, and

tested it using the TEOAE responses collected at the University of Toronto by the Biometric

Security Lab. We presented the results for verification and identification scenarios, with three

different tests for each scenario; these tests were designed to probe model generalization.

We also compared the results of using a response from one ear against using responses from

both ears.

Our method outperforms previous results in both single ear and fusion-of-ears identification

scenarios. It also produces comparable or slightly better results in verification scenarios. The test

results indicate that there is potential for improvement given more training samples per identity

and a larger number of identities.

Future work should focus on increasing the size of the dataset and reducing the TEOAE response

acquisition time for the system. Increasing generalization through more data and reducing the

acquisition time would be a massive step towards making the system more effective. Testing the

TEOAE authentication method under different stimulus signals, to verify that it works under

different conditions, would help make the system more robust. Designing a system that works

across different stimulus signals, to mitigate the risk of stolen biometric information, would also be

significant. Further work should also go into reducing the size of the neural network

model so that it is computationally more efficient.


Bibliography

[1] F. Agrafioti and D. Hatzinakos. ECG based recognition using second order statistics. In 6th Annual Communication Networks and Services Research Conference (CNSR 2008), pages 82–87, May 2008.

[2] N. Armanfard, M. Komeili, J. P. Reilly, and L. Pino. Vigilance lapse identification using sparse EEG electrode arrays. In 2016 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), pages 1–4, May 2016.

[3] Pouya Bashivan, Irina Rish, Mohammed Yeasin, and Noel Codella. Learning representations from EEG with deep recurrent-convolutional neural networks. CoRR, abs/1511.06448, 2015.

[4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.

[5] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, February 2012.

[6] BioSec. Medical biometric databases, 2015.

[7] Avishek Joey Bose and Parham Aarabi. Adversarial attacks on face detectors using neural net based constrained optimization. CoRR, abs/1805.12302, 2018.

[8] Leon Bottou. Online algorithms and stochastic approximations. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998. Revised October 2012.




[9] H. Bredin. TristouNet: Triplet loss for speaker turn embedding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5430–5434, March 2017.

[10] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS'93, pages 737–744, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.

[11] Paul Chambers, Neil J. Grabham, Matthew A. Swabey, Mark Lutman, Neil White, John Chad, and Stephen Beeby. A comparison of verification in the temporal and cepstrum-transformed domains of transient evoked otoacoustic emissions for biometric identification. Int. J. Biometrics, 3:246–264, June 2011.

[12] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: A deep quadruplet network for person re-identification. ArXiv e-prints, April 2017.

[13] De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[14] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 539–546, June 2005.

[15] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, September 1995.

[16] H. P. da Silva, A. Fred, A. Lourenço, and A. K. Jain. Finger ECG signal for user authentication: Usability and performance. In 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 1–8, September 2013.

[17] J. Gao, F. Agrafioti, S. Wang, and D. Hatzinakos. Transient otoacoustic emissions for biometric recognition. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2249–2252, March 2012.



[18] Jiexin Gao. Towards a unified signal representation via empirical mode decomposition, November 2012.

[19] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 April 2011. PMLR.

[20] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. ArXiv e-prints, December 2014.

[21] P. Goyal, P. Dollar, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. ArXiv e-prints, June 2017.

[22] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735–1742, June 2006.

[23] J. W. Hall. Handbook of Otoacoustic Emissions. Singular Thomson Learning, 2000.

[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[25] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. CoRR, abs/1703.07737, 2017.

[26] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ArXiv e-prints, February 2015.

[27] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming catastrophic forgetting in neural networks. ArXiv e-prints, December 2016.



[28] G. R. Koch. Siamese neural networks for one-shot image recognition. 2015.

[29] Krzysztof M. Kochanek, Lech K. Śliwa, Klaudia Puchacz, and Adam Piłka, 2015.

[30] M. Komeili, W. Louis, N. Armanfard, and D. Hatzinakos. Feature selection for nonstationary data: Application to human recognition using medical biometrics. IEEE Transactions on Cybernetics, 48(5):1446–1459, May 2018.

[31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[32] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time-series. MIT Press, 1995.

[33] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[34] J. Lei Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. ArXiv e-prints, July 2016.

[35] W. F. Leung, S. H. Leung, W. H. Lau, and A. Luk. Fingerprint recognition using neural network. In Neural Networks for Signal Processing: Proceedings of the 1991 IEEE Workshop, pages 226–235, September 1991.

[36] Y. Liu and D. Hatzinakos. Earprint: Transient evoked otoacoustic emission for biometrics. IEEE Transactions on Information Forensics and Security, 9(12):2291–2301, December 2014.

[37] Y. Liu and D. Hatzinakos. Human acoustic fingerprints: A novel biometric modality for mobile security. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3784–3788, May 2014.

[38] D. Masters and C. Luschi. Revisiting small batch training for deep neural networks. ArXiv e-prints, April 2018.

[39] J. Nagi, F. Ducatelle, G. A. Di Caro, D. Cireşan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber, and L. M. Gambardella. Max-pooling convolutional neural networks for vision-based hand gesture recognition. In 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), pages 342–347, November 2011.

[40] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pages 807–814, USA, 2010. Omnipress.

[41] Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. CoRR, abs/1412.1897, 2014.

[42] I. Odinaka, P. H. Lai, A. D. Kaplan, J. A. O'Sullivan, E. J. Sirevaag, S. D. Kristjansson, A. K. Sheffield, and J. W. Rohrbaugh. ECG biometrics: A robust short-time frequency analysis. In 2010 IEEE International Workshop on Information Forensics and Security, pages 1–6, December 2010.

[43] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[44] S. Pathoumvanh, S. Airphaiboon, B. Prapochanung, and T. Leauhatong. ECG analysis for person identification. In The 6th 2013 Biomedical Engineering International Conference, pages 1–4, October 2013.

[45] P. Jonathon Phillips, Alvin F. Martin, Charles L. Wilson, and Mark A. Przybocki. An introduction to evaluating biometric systems. IEEE Computer, 33:56–63, 2000.

[46] M. M. Al Rahhal, Yakoub Bazi, Haikel AlHichri, Naif Alajlan, Farid Melgani, and R. R. Yager. Deep learning approach for active classification of electrocardiogram signals. Information Sciences, 345:340–354, 2016.

[47] P. S. Raj and D. Hatzinakos. Feasibility of single-arm single-lead ECG biometrics. In 2014 22nd European Signal Processing Conference (EUSIPCO), pages 2525–2529, September 2014.

[48] P. S. Raj, S. Sonowal, and D. Hatzinakos. Non-negative sparse coding based scalable access control using fingertip ECG. In IEEE International Joint Conference on Biometrics, pages 1–6, September 2014.



[49] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. ArXiv e-prints, October 2017.

[50] Roger Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285–308, 1990.

[51] Sebastian Ruder. An overview of multi-task learning in deep neural networks. CoRR, abs/1706.05098, 2017.

[52] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.

[53] Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J. Fleet. Adversarial manipulation of deep representations. In International Conference on Learning Representations (ICLR), 2016.

[54] Robin Tibor Schirrmeister, Jost Tobias Springenberg, Lukas Dominique Josef Fiederer, Martin Glasstetter, Katharina Eggensperger, Michael Tangermann, Frank Hutter, Wolfram Burgard, and Tonio Ball. Deep learning with convolutional neural networks for brain mapping and decoding of movement-related information from the human EEG. CoRR, abs/1703.05051, 2017.

[55] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. CoRR, abs/1503.03832, 2015.

[56] S. N. A. Seha and D. Hatzinakos. Human recognition using transient auditory evoked potentials: A preliminary study. IET Biometrics, 7(3):242–250, 2018.

[57] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227–244, October 2000.

[58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.



[59] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[60] Statista. Global unit sales of headphones and headsets from 2013 to 2017 (in millions), 2018.

[61] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. CoRR, abs/1707.02968, 2017.

[62] Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics, 36(4):95:1–95:13, July 2017.

[63] Matthew A. Swabey, Paul Chambers, Mark E. Lutman, Neil M. White, John E. Chad, Andrew D. Brown, and Stephen P. Beeby. The biometric potential of transient otoacoustic emissions. Int. J. Biometrics, 1(3):349–364, March 2009.

[64] MedLife Technologies. ascreen tiny oae device, 2018.

[65] F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learning of single-image and cross-image representations for person re-identification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1288–1296, June 2016.

[66] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS. MIT Press, 2006.

[67] Chunlei Zhang and Kazuhito Koishida. End-to-end text-independent speaker verification with triplet loss on short utterances. In Proc. Interspeech 2017, pages 1487–1491, 2017.

[68] W. L. Zheng, J. Y. Zhu, Y. Peng, and B. L. Lu. EEG-based emotion classification using deep belief networks. In 2014 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, July 2014.


Appendix A

Performance

A.1 File Size Comparison

A small model size is essential for running the neural network on a mobile phone. We used the PyTorch

neural network framework [43] to compute and save the model; file sizes are based on PyTorch's

native save format. The file sizes of networks trained with different hyperparameters are

shown in Table A.1. As more blocks and channels are used, the size of the network grows. The

recommended size for mobile phones according to iOS specifications is around 10MB.
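Checking the serialized size of a candidate architecture is straightforward with PyTorch, as in the sketch below; the toy module is illustrative only.

    import os
    import torch
    import torch.nn as nn

    # Illustrative stand-in for one of the candidate architectures.
    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 128))
    torch.save(model.state_dict(), "model.pt")   # PyTorch's save format
    print(f"{os.path.getsize('model.pt') / 1e6:.2f} MB on disk")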




Table A.1: PyTorch model file sizes for different hyperparameters

Common Section    Individual Section    Common Section    Individual Section    Embedding    File
# of Channels     # of Channels         # of Blocks       # of Blocks           Size         Size
8                 4                     3                 3                     128          913K
8                 4                     2                 2                     128          913K
8                 4                     1                 1                     128          826K
8                 4                     4                 3                     64           712K
16                8                     4                 3                     64           1.4M
32                16                    2                 2                     128          3.5M
32                16                    4                 3                     64           2.8M
32                16                    2                 2                     256          6.1M
64                32                    4                 3                     64           5.6M
64                32                    2                 2                     54           4.1M
64                32                    2                 2                     256          13M
64                32                    2                 2                     128          7.1M
256               128                   3                 4                     64           33M


Appendix B

Failed Experiments

In this section, we discuss some of the failed experiments that made sense at the outset but

failed to produce state-of-the-art results. We leave some brief notes on the reasons for each

approach and why it might have failed. Research is an iterative process, and this section does

not contain all the results, because some ideas were abandoned halfway.

B.1 Random Forest

We first tried training a random forest model using CWT features. The random forest

model performed somewhat better, but not significantly so. It still had the problem of having

to choose a CWT scale and a mother wavelet. The gaus3 mother wavelet was tested, but it

did not seem to work as well as Daubechies 5.

In testing, the random forest did not perform well. The accuracy for the left ear on the

best scale was 60.3%. It did not improve on the previous state of the art and was

abandoned.

B.2 Simple Convnet

We first designed a convolutional or fully connected network that kept reducing the number

of parameters until it reached the desired encoding dimension. There was a limitation on

stacking the network: the number of parameters in the layers above had to be bigger than the

final encoding, which restricted the number of layers we could stack. The results were better

than the random forest. The response was processed using CWT, multiple scales were stacked

to form an image, and this image was processed using 2D convolution. The network is

presented in Figure B.1.

Figure B.1: Simple architecture used to perform classification



B.3 One Network

When one network was trained to predict both the left and the right ears, the accuracy for

the left ear suffered greatly, while the accuracy of right ear identification did not seem to be

affected. This architecture is presented in Figure B.2.

Figure B.2: Architecture diagram for training the convolutional neural network without shared parameters



B.4 Auto Encoder

We trained an autoencoder with four layers, using CWT features, and the network trained

well with low loss. We abandoned the idea due to the multiple training steps required

to build this network.

B.5 CWT and Neural Networks

Originally, a 2D convolution was performed on an image formed by stacking the CWT scales;

we used up to 100 scales to maximize performance. After testing the network with

multiple random seeds, it was clear that the network was not generalizing: the variance

of the network was far too large even after weight initialization. We decided to remove

CWT from pre-processing.

B.6 Quadruplet Loss

We tried using the quadruplet loss for the network, but it did not perform better than the

triplet loss. When using the quadruplet loss, another negative had to be chosen; this negative

was picked randomly rather than choosing the best negative as in the triplet selection, which

might be the reason for the lower performance. Even though the accuracy and the EER were

worse, they were off by only a couple of percentage points. The quadruplet loss is visualized in Figure B.3.



Figure B.3: The final objective of the quadruplet loss
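For reference, the quadruplet objective of Chen et al. [12] has the following form, where f is the embedding network, (a, p) an anchor-positive pair, n1 and n2 negatives from two further distinct identities, and alpha_1, alpha_2 the two margins; this is a sketch of the published loss, not necessarily the exact variant we implemented:

    \mathcal{L}_{\text{quad}} =
        \sum \big[\, \|f(a)-f(p)\|_2^2 - \|f(a)-f(n_1)\|_2^2 + \alpha_1 \,\big]_+
      + \sum \big[\, \|f(a)-f(p)\|_2^2 - \|f(n_1)-f(n_2)\|_2^2 + \alpha_2 \,\big]_+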


Appendix C

Data Splits

Table C.1: Random seed and Test Subject distribution for 10 subject test

Random Seed    Test Subjects                     Number of Training Responses
72             4,6,9,13,23,27,35,39,42,53        39,780
5              6,17,21,23,26,31,40,46,51,53      40,406
98             0,5,13,16,19,32,33,34,35,49       39,590
23             14,16,17,18,22,24,28,33,36,50     42,960
83             1,12,17,26,28,29,33,34,45,46      41,554
59             1,6,16,17,18,27,39,46,48,51       41,570
4              4,6,12,17,20,22,28,31,37,52       41,606
33             1,5,15,21,30,32,40,41,44,50       42,808
27             6,9,11,17,18,20,36,39,40,46       40,688
85             1,5,7,8,20,30,39,44,45,47         39,962
45             2,5,7,16,19,26,41,44,50,51        41,982
41             8,9,10,14,15,36,37,40,49,53       40,974
18             12,16,26,30,31,33,37,38,44,45     41,600
35             7,14,17,34,35,36,45,49,50,52      41,040
57             9,10,20,27,29,30,39,43,49,50      39,780
42             3,5,12,17,19,32,44,48,49,52       40,380
26             2,5,9,10,20,36,37,39,47,50        41,242
62             4,5,15,27,28,32,35,44,47,49       40,620
40             4,16,20,21,26,34,38,39,49,53      35,238
24             2,5,8,13,22,24,27,41,46,53        42,094




Table C.2: Random seed and Test Subject distribution for 24 subject test

Random Seed    Test Subjects                                                             Number of Training Responses
72             4,6,9,11,13,14,17,18,22,23,25,27,31,33,34,35,36,39,42,43,47,49,52,53      26,530
5              0,2,3,4,5,6,17,19,21,23,24,26,31,32,33,34,40,41,45,46,50,51,52,53         29,276
98             0,5,8,10,11,12,13,14,16,17,19,24,29,31,32,33,34,35,36,43,46,47,49,50      28,340
23             3,8,10,13,14,16,17,18,20,22,23,24,28,29,33,34,35,36,41,44,47,48,50,52     26,796
83             1,2,7,10,11,12,17,19,21,24,26,28,29,31,33,34,35,37,38,41,44,45,46,50      31,114
59             1,5,6,7,16,17,18,21,22,27,28,32,34,35,39,41,42,45,46,48,50,51,52,53       28,388
4              4,6,7,11,12,14,15,16,17,18,20,22,24,25,27,28,29,31,32,34,37,42,43,52      29,346
33             0,1,4,5,6,10,15,17,21,25,26,27,28,30,32,36,37,39,40,41,44,46,49,50        28,612
27             0,1,5,6,9,11,14,16,17,18,20,27,28,30,36,39,40,41,43,44,46,47,49,50        28,422
85             0,1,5,7,8,14,16,17,18,20,23,26,30,35,38,39,40,41,44,45,46,47,49,53        26,448
45             0,2,5,7,9,13,16,18,19,20,26,27,28,29,36,37,40,41,44,48,49,50,51,53        26,370
41             5,7,8,9,10,14,15,18,27,29,30,31,33,36,37,39,40,41,43,46,47,49,51,53       28,952
18             7,12,15,16,20,22,23,25,26,27,30,31,32,33,37,38,39,40,41,44,45,48,50,52    28,968
35             2,4,7,13,14,17,18,20,22,23,26,31,32,34,35,36,38,39,42,43,45,49,50,52      26,758
57             2,3,7,9,10,16,19,20,27,29,30,31,32,33,35,36,38,39,42,43,44,45,49,50       25,916
42             3,4,5,6,8,9,12,13,15,16,17,19,24,26,32,33,34,37,44,45,48,49,50,52         27,134
26             0,2,3,5,7,8,9,10,11,14,20,22,29,35,36,37,39,40,42,44,45,47,50,52          27,982
62             2,4,5,6,7,15,16,20,22,23,25,27,28,31,32,34,35,41,43,44,47,49,52,53        25,044
40             0,2,4,5,11,16,18,20,21,24,25,26,29,33,34,35,36,38,39,40,48,49,50,53       22,520
24             2,5,8,10,13,14,16,19,20,21,22,24,26,27,32,37,38,39,40,41,44,46,52,53      27,658