
The Pennsylvania State University

The Graduate School

Eberly College of Science

STATISTICAL INFERENCE OF SYNTAX FROM VOCAL

SEQUENCES AND IMPLICATIONS FOR NEURAL MECHANISMS

A Dissertation in

Physics

by

Sumithra Surendralal

© 2016 Sumithra Surendralal

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

August 2016

Page 2: STATISTICAL INFERENCE OF SYNTAX FROM VOCAL SEQUENCES …

The dissertation of Sumithra Surendralal was reviewed and approved∗ by the following:

Dezhe Z. Jin

Associate Professor of Physics

Dissertation Advisor, Chair of Committee

Patrick J. Drew

Assistant Professor of Engineering Science and Mechanics

Assistant Professor of Neurosurgery

Assistant Professor of Biomedical Engineering

John C. Collins

Distinguished Professor of Physics

Lu Bai

Assistant Professor of Biochemistry and Molecular Biology

Assistant Professor of Physics

Nitin Samarth

Professor of Physics

George A. and Margaret M. Downsbrough Department Head

∗Signatures are on file in the Graduate School.


Abstract

Learned vocalization in animals is a fascinating natural behavior, the neural and peripheral mechanisms behind which are not completely known. In vocal sequences, acoustic units called syllables are produced following certain learned rules, or syntax. In order to understand the putative neural correlates of this behavior, we first need a quantitative description of the behavior itself. A vocal sequence, say AAABCCCDD, can be parsed into two structures - a non-repeat structure ABCD, on which is imposed a repeat structure A(3)B(1)C(3)D(2). In this dissertation we develop statistical methods to infer concise, finite-state characterizations of both these aspects of the syntax from observed vocal sequences. The need to exercise caution in assigning vocal sequences to syntactic categories based on small sample sizes is emphasized by designing measures that place bounds on model categories that can be inferred from observations. In particular, we focus on the Partially Observable Markov Model (POMM) - a model with a Markov chain of abstract, hidden states that have a many-to-one mapping to observed syllables - to characterize the non-repeat structure. Through careful quantitative analysis of observed data, we show that the normal song syntax of the Bengalese finch is consistent with the features expected from a POMM. The songs of deafened birds show a deviation from this normal structure. In a statistically significant number of cases among the birds studied, the loss of auditory feedback results in a loss of the many-to-one mappings. The observations suggest that auditory feedback can induce complexity in the Bengalese finch song syntax, but is not sufficient to explain complexity entirely. We suggest that the Bengalese finch song syntax is encoded in the interplay between auditory feedback and the intrinsic song-generating circuitry. Finally, in canary and swamp sparrow songs we show that there is an exact inverse relationship between syllable duration and the most probable number of repetitions of the syllable. Such a precise relationship indicates the existence of fundamental biological constraints on the performance of syllable repeats.


Table of Contents

List of Figures
List of Tables
List of Symbols and Abbreviations
Acknowledgments

Chapter 1. Introduction
1.1 Animal vocal sequences
  1.1.1 Songbirds
1.2 Neurobiology of sequence generation
  1.2.1 The song production system in the avian brain
  1.2.2 Neural models of sequence generation involving the HVC
1.3 Models of syntax for vocal sequences
  1.3.1 The Chomsky hierarchy
1.4 Structure of the dissertation

Chapter 2. Partially Observable Markov Models of Sequence Syntax
2.1 Introduction
2.2 Model representation - Finite state machines
  2.2.1 Markov models
  2.2.2 Hidden Markov Model (HMM)
  2.2.3 Partially Observable Markov Model (POMM)
2.3 Model evaluation
  2.3.1 Measures of model fit
    2.3.1.1 Sequence completeness
    2.3.1.2 Log-likelihood
  2.3.2 Measures of sequence similarity
    2.3.2.1 Repeat distributions
    2.3.2.2 n-gram distributions
    2.3.2.3 Step distributions
2.4 Model inference - inference of a POMM from data
  2.4.1 Expectation maximization - Baum-Welch algorithm
  2.4.2 Grid search for an optimal model
  2.4.3 Establishing error bounds
  2.4.4 Grid search stopping criterion
  2.4.5 Finding the optimal state vector
  2.4.6 Reduced representation by filtering non-dominant transitions
  2.4.7 Checking for equivalent POMMs - state-merging
  2.4.8 Demonstration of grid search with a toy model

Chapter 3. Comparison of Syntactic Structures
3.1 The Bengalese finch
  3.1.1 Description of data
  3.1.2 Identification of songs from recording transcriptions
3.2 Syntax of Bengalese finch song
  3.2.1 Changes in syntax caused by deafening
  3.2.2 Persistence of dominant transitions after deafening
3.3 Humpback whale
  3.3.1 Description of data
  3.3.2 Challenges in inferring the syntax of humpback whale song
  3.3.3 Markov model of humpback whale themes

Chapter 4. Repeat Structure in Vocal Sequences
4.1 Syllable repetitions in multiple species
4.2 Distribution of the number of syllable repetitions
  4.2.1 Sigmoidal model of adaptation
4.3 Evidence of inverse relationship between syllable duration and most probable repeat number
  4.3.1 Swamp sparrow
  4.3.2 Canary
  4.3.3 Bengalese finch
4.4 Other calculations
  4.4.1 Distribution of phrase duration and repeat number
  4.4.2 Exponential distributions lead to inverse relationship
4.5 Mechanisms of repeat generation
  4.5.1 Auditory feedback could regulate repetition
  4.5.2 Constrained phrase duration

Chapter 5. Semi-automated Classification of Song Syllables
5.1 Morphology of a song
5.2 Identification of song syllables
5.3 Semi-automated classification of song syllables
  5.3.1 Support Vector Machines
  5.3.2 Syllable features for classification
    5.3.2.1 Duration
    5.3.2.2 Wiener entropy
    5.3.2.3 Hough transform
  5.3.3 SVM ensembles
  5.3.4 Transcription of a song

Chapter 6. Conclusion
6.1 Partially Observable Markov Model - inference and evaluation
6.2 Comparison of syntactic structures
6.3 Statistics of syllable repetitions

Appendix A. Baum-Welch Algorithm for estimation of POMM parameters
A.1 Estimating forward probabilities
A.2 Estimating backward probabilities
A.3 New transition matrix T

Appendix B. Confidence Intervals for Entropy of Sequences
B.1 Subsampling

Appendix C. Finding Dominant Transitions in a POMM
C.1 Random assignment of transition probabilities from a uniform distribution
C.2 Assignment of significance levels

Bibliography

List of Figures

1.1 The song system in the songbird brain
1.2 Branching synfire chain in HVC for songs with probabilistic transitions between syllables
1.3 Chomsky hierarchy of languages
2.1 State transition diagram for a Finite State Machine
2.2 State transition diagram for a simple Markov process
2.3 Example of an HMM
2.4 Example of a POMM
2.5 Empirical probability distribution over observed sequences
2.6 Sequence completeness distributions under Markov models
2.7 Sequence completeness depends on the number of sequences available for model inference
2.8 Sequence completeness depends on the repeat probabilities of symbols in the Markov model
2.9 Sequence completeness depends on the sparsity of the transition matrix measured by the state transition entropy
2.10 Schematic of search on a grid for an optimal model
2.11 Scaling of true and sample standard deviations as a function of the fraction of sub-sampled sequences
2.12 Construction of a sequence completeness distribution from the data
2.13 Inference of a POMM for a toy model
3.1 Song syntax of Bengalese finch, Bird 1, before and after deafening
3.2 Song syntax of Bengalese finch, Bird 2, before and after deafening
3.3 Song syntax of Bengalese finch, Bird 3, before and after deafening
3.4 Song syntax of Bengalese finch, Bird 4, before and after deafening
3.5 Song syntax of Bengalese finch, Bird 5, before and after deafening
3.6 Song syntax of Bengalese finch, Bird 6, before and after deafening
3.7 On average the state transition entropy increases and sequence length decreases after deafening
3.8 Sequence completeness of Bengalese finch song sequences under Markov models before and after deafening
3.9 p-values of sequence completeness of Bengalese finch song sequences under Markov models before and after deafening
3.10 Sequence completeness and corresponding p-values under syntax models with only the dominant transitions retained at significance levels α
3.11 Dominant transitions in the syntax after deafening
3.12 Transcription of part of a humpback whale song
3.13 Dependence of sequence entropy on the number of unique sub-sequences obtained using different segmentation lengths
3.14 Dependence of sequence entropy on segmentation length and system size using randomly generated sequences
3.15 Entropy of a segmented humpback whale sequence depends on the segmentation length k as well as the total system size S
3.16 Transcription of part of a humpback whale song with repeat units and themes highlighted
3.17 Markov model of themes in the songs of a population of humpback whales and n-gram distribution matches
4.1 Sample repeat distributions of an individual canary's syllables and fits based on the sigmoidal model of adaptation
4.2 Inverse relationship between syllable duration and most probable number of repetitions for swamp sparrows
4.3 Inverse relationship between syllable duration and most probable repeat number for six individual canaries
4.4 Mean and modal number of repetitions show the inverse relationship with syllable duration for the canary population
4.5 Inverse relationship is not exact between syllable duration and most probable number of repetitions for Bengalese finches
5.1 Waveform of a Bengalese finch song with the corresponding spectrogram
5.2 Syllable types in the vocal repertoire of a canary
5.3 Linear boundary surface in an SVM separating labelled data
5.4 Non-linear SVM separating data points using a kernel function
5.5 Summary of the Hough transform
5.6 Separation of syllables in feature space based on duration, and the two Hough transform coordinates ρ and θ
5.7 Transcription of a Bengalese finch song
A.1 Calculation of forward probabilities in the trellis of the observation sequence y1, y2, y2 for a 4-state POMM
B.1 Scaling of sample and true means and standard deviations
C.1 The probability density function for a variable taking a value based on random assignment from a uniform distribution

List of Tables

3.1 Bengalese finch song statistics
3.2 Sequence completeness and p-values for Bengalese finch song sequences under Markov models
3.3 Sequence completeness and p-values for pre-deafening sequences based on syntax post-deafening
5.1 Duration of syllables in a Bengalese finch song


List of Symbols and Abbreviations

FSM Finite State Machine

POMM Partially Observable Markov Model

HMM Hidden Markov Model

T Transition Matrix

E Emission Matrix

SVM Support Vector Machine

Pc Sequence completeness

L Log-likelihood


Acknowledgments

I have a feeling I will look back on my years in graduate school as being enormously influential in changing my views about a lot of things in life - the most important of them being my ideas about what doing research in the sciences really involves - aspects I enjoy, and those I do not. I would like to thank Dezhe Jin, my adviser, for his support over the years and for giving me the opportunity to learn about one of the most fascinating areas of science - the generation of learned vocalizations in the animal kingdom. Dezhe is an excellent teacher - his theoretical mechanics class is perhaps the best physics class I have ever taken - and I aspire to his teaching standards. My deep gratitude to Jefferey Markowitz and Timothy Gardner, Boston University; Dana Moseley and Jeffrey Podos, University of Massachusetts, Amherst; Michael Noad, Cetacean Ecology and Acoustics Laboratory, The University of Queensland; Luca Lamoni and Luke Rendell, University of St. Andrews; and Kristofer Bouchard and Michael Brainard, University of California, San Francisco, for generously sharing audio recordings and transcriptions without which this dissertation would not have been possible. Thanks to Phillip Schafer, Jason Wittenbach, Eugene Tupikov, and Leo Tavares, for all the discussions we have had, not just about work. Many thanks to Patrick Drew for checking in on how I was doing once in a while. It was immensely helpful to be able to knock at his office door seeking advice on several occasions. I am greatly indebted to Clare for listening to me. I would also like to thank the physics department for the opportunity to gain valuable teaching experience over the years. Thank you to the Center for Neural Engineering for the office space with windows (short-lived though that was) and for the many grad club discussions and pizza. Thanks to my friends Ila, Lakshmy, Leo, Nithin, Bruce, Yisi, Riddhi, Neha, Dolon, Salini, Agnetha, Ramya, Kundan, Ganesh, and Latha for being there for me through many fun and not-so-fun times. Jaya aunty and Rajan uncle, thank you for putting up with extended periods of absolutely no word from me while I figured things out. Amma, Achen, and Chandu, thank you for always cheering me on - even when I've been annoyingly difficult to deal with. And Sreejith - you know all that I want to say without me having to actually say it.


Dedication

To Valsu and Lal


Chapter 1
Introduction

Sequences in biological systems, such as the locomotive actions of a worm, the arrangement of nucleotides in a genome, and the waggle dance of the honey bee, are fascinating natural phenomena whose analysis has the potential to give valuable insights into the inner workings of biological systems. Much like a sequence of numbers in a geometric progression, biological sequences are concatenations of elements that are realized one after another following some pattern. In order to understand a biological sequence, two components are necessary - a knowledge of the rules that should be followed to create the sequence, and the machinery to generate the sequence. The brain is responsible for both of these components in the case of motor sequences in higher organisms. Specialized networks of neurons in the brain store the rules, access them when necessary, and facilitate the associated motor behavior. However, to date there is no complete understanding of all the brain regions and neural mechanisms involved in either the learning or the generation of a large number of motor sequences. Animal vocal sequences are examples of such poorly understood motor sequences. The focus of this dissertation is on identifying statistical regularities in animal vocal sequences, inferring their generative rules - what can broadly be called the syntax of these sequences - and establishing comparative measures that distinguish different categories of sequences, with the aim of aiding an understanding of the neurobiology of vocal sequence generation.

Syntax, as defined by linguists, refers to the ordering of words within the sentences of a language [1]. This ordering preserves associations between word categories such as nouns, verbs, and prepositions. It is an association independent of meaning or semantics, and of the morphology of the words themselves. In the context of animal vocal sequences, the syntax we refer to in this dissertation is phonetic syntax - patterns of sounds that do not individually or collectively have any referential meaning [2]. The main focus of the dissertation is on the syntax of songbird songs. Neurobiological studies of sequence learning in songbirds began in parallel with the investigations into abstract syntactic structures in linguistics led by Noam Chomsky [3]. This led to an emphasis on abstract representations of song sequences such as finite state machines [4-6], in the spirit of the models of computation that were developed in the field of computational linguistics as a consequence of Chomsky's inquiry [7].

In this dissertation, we are particularly interested in a finite state representation of song syntax called the Partially Observable Markov Model [8], which has been shown in one study to be a good characterization of the song syntax of the Bengalese finch, a songbird [6]. Is there a prototypical syntax for the songs of a particular species? What are the factors that influence syntactic structure? These are both interesting questions. We are specifically interested in understanding the role of auditory feedback in regulating the syntax. The songs of three songbird species - Bengalese finches, canaries, and swamp sparrows - are used in the research. A short analysis of humpback whale songs is included to demonstrate that the methods developed are generalizable to the vocal sequences of other animals.

The data analyzed in this dissertation were obtained from several other research groups. These include laboratory recordings and transcriptions of Bengalese finch songs from Kristofer Bouchard and Michael Brainard, University of California, San Francisco; similar data for canary songs from Jefferey Markowitz and Timothy Gardner, Boston University; laboratory recordings of swamp sparrow songs from Dana L. Moseley and Jeffrey Podos, University of Massachusetts, Amherst; recordings of humpback whale songs collected in Eastern Australia by Michael Noad, Cetacean Ecology and Acoustics Laboratory, The University of Queensland, Australia; and transcriptions of these songs from Luca Lamoni and Luke Rendell, University of St. Andrews, Scotland. These datasets were analyzed using tools developed based on ideas motivated by statistical inference. The results will be presented in the context of implications for neural models.


1.1 Animal vocal sequences

Vocalizations in the animal kingdom can be broadly categorized into those that are learned and those that are innate. Vocal learners are able to listen to the vocalizations of another member of their species, or sometimes another species, and hone their own vocalizations by trial and error to match that of a tutor [9]. Humans are examples of such vocal learners. Vocal learners may also have some innate vocalizations. Laughter in humans, for example, is a vocalization that is innate [10], while everyday speech is learned. This co-existence of two modes of vocalization is not limited to humans. Among other members of the animal kingdom, there is evidence that three groups of birds - parrots, hummingbirds, and songbirds [11] - as well as cetaceans such as dolphins [12], humpback whales [13], and killer whales [14], pinnipeds such as seals [15], non-human primates such as marmosets [16], and other mammals such as bats [17], and more recently, elephants [18] and mice [19], are vocal learners.

Each vocal sequence is composed of basic acoustic elements - the alphabet of the vocal sequence, if you will - that have variously been referred to as syllables, notes, and units. We will use the term syllables. Some of these sequences are more complex than others, with complexity being a loosely defined property that could refer to large vocal repertoires composed of many syllables, syllables with a large range of temporal and spectral modulations, highly non-random, long-range, and complex-correlated sequencing of the syllables, the ability to vocalize for long durations of time, or the number of unique song sequences, to name a few. One could ideally imagine defining a scale using some such definition of complexity and seeking to arrange vocal learners on this scale. In what sense are the vocalizations of a human more complex than those of a canary, if that is indeed the case? For the purposes of this dissertation, we choose to define complexity on the basis of the structure of syllable transitions in a finite state representation of sequence syntax. We assign probabilities to different syllable transitions and can therefore resort to characterizations of the syntax in terms of standard information-theoretic measures such as entropy, and some others of our own design.


1.1.1 Songbirds

Oscines, or songbirds, are an avian group with about 4500 species (half of all avian species) that are known for their learned vocalizations, called song [20]. Songs are distinguished from what are known as calls - innate vocalizations that are produced by most members of the animal kingdom [21]. Just as in human infants, there is a sensitive period of development during which juvenile songbirds learn their vocalizations by listening to an adult tutor [22]. The first songbird whose songs were established to be learned was the chaffinch, following the work of William Thorpe in 1954 [23]. The research done by Thorpe, Marler [24, 25], and others in the 1950s and 1960s further set the stage for the songbird to be considered a model system in the study of learned behavior. However, it was the work of Fernando Nottebohm in the 1970s that truly made the songbird a relevant model system. In his experiments on canaries, Nottebohm found that the male canary had a dedicated network of brain nuclei linked to the production of song [26]. With this discovery began the investigations into the neural mechanisms and functions involved in helping the songbird learn and generate song sequences. The songbird has since been used as a model system to study questions not just in ethology, but also in neurobiology and neurolinguistics. It is an ideal model system since songbirds can breed under laboratory conditions and many are spontaneous singers, making possible the collection of a large number of samples of this highly stereotyped learned behavior under experimentally controlled conditions.

Of the 4500 species of songbirds, very few have been studied in the context of understanding the neural basis of learned vocalizations. White-crowned sparrows [27], canaries [26, 28], chaffinches [23, 29], zebra finches [30, 31], and Bengalese finches have been the typical subjects of most research. Even within this small group, there are fundamental differences in the learning and production of song. Zebra finches and Bengalese finches are close-ended or age-limited learners - birds for which there is a time window after which the bird does not learn any new song. Canaries, however, are open-ended learners - learning new songs throughout their adult lives. But as far as song production is concerned, the deterministic song of the zebra finch (fixed transitions between syllables) contrasts with the more variable songs of Bengalese finches and canaries (probabilistic transitions between syllables). Seeking features of neural control systems that could lead to both inter-species similarities and contrasts such as these is important to further our understanding of the mechanisms behind learned vocalizations.

1.2 Neurobiology of sequence generation

The generation of vocalizations is a biological feat involving the precise coordination of oral, vocal, and respiratory muscles that must be orchestrated by neural control. The neural circuitry involved in vocalizations has been studied most extensively in humans and songbirds. While innate vocalizations have been linked to the midbrain [32, 33], it is hypothesized that a direct cortical pathway to the motor nuclei involved in vocalization, as seen in songbirds and humans, is necessary for complex vocal learning [10, 32, 34]. Such a connection is yet to be seen in any non-learner. In general, in mammals, it is argued that there are two separate neural pathways for the production of innate and learned vocalizations [35]. Many brain areas in birds are hypothesized to be homologous - similar in position, structure, and evolutionary origin, but not necessarily in function - to those of mammals [36], although this is still an active area of research. To add to the list of similarities, the mechanism of generating vocal fold movements has recently been shown to be the same in the syrinx of songbirds and the larynx of mammals [37]. Despite these fundamental similarities between generative neural pathways and peripheral mechanisms, there are still many differences that need to be explained to understand the diversity of vocalizations in the animal kingdom.

1.2.1 The song production system in the avian brain

A collection of brain areas in oscines specialized for song is referred to as the song system or the song circuit [26]. Several forebrain nuclei are involved in the song system of oscine songbirds - a feature that is conspicuous by its absence in suboscine birds. There are two main forebrain pathways in the song system - a posterior forebrain pathway, also referred to as the descending motor pathway, and an anterior forebrain pathway, as shown in Fig. 1.1. Both these pathways start with a cortical-like region called the HVC (proper name). In the anterior forebrain pathway, involved in the learning of song, the HVC connects to Area X, a homologue of the mammalian basal ganglia, which connects to the dorso-lateral division of the medial thalamus (DLM), which in turn connects to the lateral part of the magnocellular nucleus of the anterior nidopallium (LMAN). In the posterior forebrain pathway, involved in the production of song, neurons in the HVC project to a brain nucleus called the robust nucleus of the arcopallium (RA). RA neurons then project downstream to the motor neurons involved in respiration and vocalization. It is this direct cortical pathway that is hypothesized to be unique to vocal learners. The vocal production and vocal learning pathways interact through connections from LMAN in the anterior forebrain pathway to RA in the posterior forebrain pathway. Learning in the songbird is facilitated by access to auditory information - both from a tutor as well as from the bird's own song. The songbird brain also has a set of nuclei that are part of the auditory pathway. The auditory pathway feeds into the HVC through three connections - through the nucleus Uvaformis (UVA), through the Nucleus Interfacialis (NIf), and via a direct connection to HVC.

Figure 1.1. The song system in the songbird brain. The descending motor pathway associated with the production of song, the anterior forebrain pathway required for the learning and maintenance of song, and auditory input into the song system are highlighted.

Of the many connections in the songbird brain, a few are central to our discussion - internal connections within the HVC, HVC-RA projections, and auditory feedback connections. There is much speculation about the HVC being the seat of syntax. The study of syntactic structure via statistical models - both of form, and of any change due to disruptions - is necessary to further our understanding of the involvement of these connections.

1.2.2 Neural models of sequence generation involving the HVC

HVC and RA together form the premotor circuit in the song production pathway. In zebra finches, RA activity during singing is characterized by trains of short bursts of spikes bounded by periods of inhibition, with each burst being associated with a unique subsyllabic acoustic event [38]. The pattern of activity in RA, immediately upstream from the motor neurons, is therefore considered to be responsible for sound production. In later experiments, also in awake zebra finches, it was found that an HVC-RA projection neuron (referred to as HVC_RA) emits a single burst of spikes (≈ 6 ms) during a song motif (≈ 1 s) [39, 40]. Different HVC_RA neurons fire at different times during the motif. There are two possible hypotheses about the role of the HVC that could explain these observations [41]. One is that the HVC_RA neurons form a representation of temporal order by producing a continuous stream of activity on a 10-millisecond timescale [40]. The other, based more recently on modeling song production in terms of dynamical systems, is that bursts encode transitions between different elemental 'gestures' of the song - periods of time when the model parameters representing pressure in the bird's air sac and the spring-like tension on a vibrating membrane controlled by the muscles surrounding the syrinx are either unchanged or strictly increasing or decreasing [42]. The two hypotheses are not mutually exclusive, since it is possible that while bursting activity in HVC_RA neurons aligns with transitions between gestures, enough HVC_RA neurons are active throughout each gesture to account for temporal ordering. We will therefore assume that the HVC is responsible for temporal order in a sequence.

Branching chain model of stochastic sequence generation in HVC

Several neural models consider the topological connectivity of neurons (the graph formed by neurons physically connected via synapses) in the HVC to take the form of synfire chains [43-46]. In a synfire chain, neurons are ordered into groups that are connected in a feed-forward fashion. All connections are excitatory. Activity propagates synchronously from group to group in the synfire chain. In one model, HVC-RA neurons are modeled as synfire chains, with each neuron having an intrinsic bursting property based on dendritic calcium spikes [47]. Global inhibition through HVC interneurons is included to regulate activity. Working within this model of the neural circuitry, a neural sequence in the HVC corresponding to the deterministic transitions between syllables in the song of a zebra finch can be set up as activity propagation through a set of chains, each connected to just one other in a feed-forward manner. However, to account for probabilistic transitions between syllables in the song of a Bengalese finch, chains are connected in a branching manner, as seen in Fig. 1.2 [45].

Figure 1.2. Branching synfire chain in HVC for songs with probabilistic transitions between syllables, such as those of the Bengalese finch. Each chain corresponds to a syllable. Syllable A could transition to syllable B or C, depending on whether activity propagation in synfire chain B wins over the activity in synfire chain C or vice versa.

To date there has been no observation of the exact organization of neurons in the HVC. The only clue that speaks to it is the observation that HVC-RA neurons that are activated at the same time are not located next to each other. This suggests that a group in the synfire chain is an abstract entity defined by the co-activation of neurons. When we refer to 'states' in models of syntax later on, we will roughly be thinking of a one-to-one mapping between the states and these groups in the synfire chain model.

1.3 Models of syntax for vocal sequences

Any sequence can be described by considering the statistical patterns that it presents. If we note down the numbers that show up in a sequence of rolls of a fair die, for example, we can see that there is no discernible structure to the number sequence. This is because these numbers result from independent trials, with each throw of the die having no influence on any of the others. In a paper from 1907, the Russian mathematician Andrei Andreevich Markov considered instead the possibility of a chain of dependent variables y1, y2, ..., yn for which y_{k+n} is only dependent on y_k for any k [48]. The simplest case is n = 1, where dependence is limited to the immediate predecessor of a variable in the chain. This is called a first-order dependence. Consider, for example, the board game Snakes and Ladders. Every new position of your marker on the board is influenced only by your previous position on the board. All moves before that do not matter at all. Such chains, be they of first or higher orders, came to be called Markov chains in the popular literature. Markov chains are fully specified by stating the set of unique elements the sequence is made of, and the probabilities of an element in the set being followed by any element in the set, called transition probabilities. The first references to the syntax of birdsong assumed that song sequences were Markov chains [49, 50]. However, in a Markov chain, if an element repeats with probability p, then the probability of the element repeating n times before transitioning to a different element is given by the geometric distribution

P(n) = p^{n-1}(1 - p)    (1.1)

This is a monotonically decreasing function of n. However, it has been observed that in the songs of some songbirds, the distribution of syllable repeats may not be monotonic [6, 51], indicating that the song sequences are non-Markovian.
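To make this diagnostic concrete, the sketch below (illustrative code, not from the dissertation; the sample songs are invented) compares the geometric repeat distribution of Eq. 1.1 with an empirical repeat histogram. An interior peak in the empirical histogram, which Eq. 1.1 can never produce, signals non-Markovian structure.

```python
import numpy as np

def markov_repeat_dist(p, n_max):
    """Eq. 1.1: P(n) = p^(n-1) * (1 - p), monotonically decreasing in n."""
    n = np.arange(1, n_max + 1)
    return p ** (n - 1) * (1 - p)

def empirical_repeat_dist(songs, syllable, n_max):
    """Normalized histogram of run lengths of `syllable` in observed songs."""
    counts = np.zeros(n_max)
    for song in songs:
        run = 0
        for s in song + '#':                 # sentinel flushes the final run
            if s == syllable:
                run += 1
            elif run > 0:
                counts[min(run, n_max) - 1] += 1
                run = 0
    return counts / counts.sum()

# Runs of 'A' peak at n = 3 here; no choice of p can fit a peak away from n = 1.
songs = ['AAABCC', 'AAAB', 'AABCC', 'AAAACB']
print(empirical_repeat_dist(songs, 'A', 6))  # [0. 0.25 0.5 0.25 0. 0.]
print(markov_repeat_dist(0.7, 6))            # monotonically decreasing
```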

Also, even though the use of the word 'sequence' might draw to mind a linear organization of elements, it is possible that the relationship between elements in a sequence could be highly non-linear. An example that makes this clear is the following English sentence: If it is good, then it can be published. If and then are not adjacent to each other in the sentence, yet the presence of if necessitates that of then later in the sentence. This sort of 'non-local' or long-range dependence requires that the syntax be non-linear, while the production of the elements can be linear in time. Such dependencies are not limited to natural languages, but could be seen in motor sequences such as animal vocalizations as well. Hence, descriptions of sequence syntax must go beyond Markov models. In this dissertation, we specifically study a more complex model called the Partially Observable Markov Model (POMM).

1.3.1 The Chomsky hierarchy

Languages¹ are typically classified into four major classes according to the Chomsky hierarchy shown in Fig. 1.3. In order of increasing complexity, these are regular languages, context-free languages, context-sensitive languages, and recursively enumerable languages. The distinction between these languages is made based on the computational machines that can be used to generate or recognize them. Regular languages are those which can be recognized using a machine with a finite number of states. Animal vocalizations are thought to belong to the class of regular languages, and more specifically to the subclass of finite languages, while natural languages are classified as mildly context-sensitive. However, these are broad categories. All animal vocalizations, for example, do not seem equal. We are interested in devising finer divisions within the hierarchy. A question that exemplifies our effort is: how different from a Bengalese finch's song is the song of a canary, and how different from that is the song of a humpback whale? We attempt to make comparisons of this nature in this dissertation.

Figure 1.3. Chomsky hierarchy of languages. (Nested classes, from outermost to innermost: recursively enumerable, context-sensitive, mildly context-sensitive, context-free, and regular. Songbird songs fall within the regular class; natural languages are mildly context-sensitive.)

¹ A language here is simply a finite or infinite set of strings, each finite in length and composed of a finite number of elements [52].


1.4 Structure of the dissertation

This dissertation consists of six chapters, including the introductory chapter of which this section is a part.

Chapter 2 introduces finite state representations of various syntax models for symbol sequences, starting with the simple Markov model and considering various higher-order models. We focus specifically on the Partially Observable Markov Model (POMM). We study model inference in detail and discuss some new measures to evaluate the model. The inference of the POMM is limited by data size. The dependence of all the measures discussed on data size is carefully analyzed, especially since recording limitations often lead to small sample sizes for animal vocalizations. We derive bounds on all sequence statistics that are discussed in this context. The importance of considering sample size as a prime factor in inferring the category of models to which the song syntax of a species is assigned is also emphasized.

In Chapter 3 we apply the methods developed in Chapter 2 to the songs of Bengalese finches. First, we show that the syntax of the Bengalese finch is a Partially Observable Markov Model - there is a many-to-one mapping between syllables in song and abstract states of the model, which are hypothesized to be chain networks of neurons in the songbird brain. Changes in the syntax of Bengalese finch song after the removal of auditory feedback by deafening are studied. We find that for four of the six birds studied, removal of auditory feedback leads to the disappearance of the many-to-one mapping. However, the absence of this observation in the remaining two birds suggests that auditory feedback is not solely responsible for the regulation of the many-to-one mapping. This chapter also includes a short analysis of humpback whale songs to demonstrate the generalizability of these methods, as well as their limitations.

In the previous chapter, the POMM models of Bengalese finch and humpback whale songs were inferred from sequences in which all repetitions of syllables were disregarded. In Chapter 4 we focus solely on the features of syllable repetitions in song. We show an inverse relationship between the duration of a syllable and its most probable number of repetitions for the songs of canaries and swamp sparrows. These are two species of songbirds whose songs are predominantly composed of syllable repetitions - single occurrences of syllables are rare, if any. We also show that this relationship is not exact for the songs of a Bengalese finch. We speculate about possible neural and peripheral mechanisms behind the generation of syllable repetitions that could result in adherence to, or deviation from, such a relationship.

Chapter 5 is a stand-alone portion of the dissertation. The data for all analyses in the previous chapters were symbol sequences. However, the mapping of an audio recording into a symbol sequence was not discussed. Field or laboratory recordings of vocalizations must first go through several stages of pre-processing. We discuss methods of identifying the time intervals in the processed recording during which vocalizations are present, by distinguishing them from silence. We develop a semi-automated method of classifying the identified syllables into types or categories using a supervised learning technique - the Support Vector Machine (SVM). The use of image-based features to distinguish syllables is advocated, in comparison with the predominant use of sound-based features in this field of research.

Chapter 6 concludes the dissertation. We discuss the role of syntax models in informing research on the neurobiology of sequence generation and point out possible extensions of the work presented in this dissertation.

Chapter 2
Partially Observable Markov Models of Sequence Syntax

It is often difficult to distinguish patterns in sequences originating from a rule from statistical coincidences. While statistical coincidences average out in large enough sets of data, we need to be careful with this distinction when working with small data sets. Nevertheless, the rule can still be inferred using tools that analyze the statistics in the data, with a level of confidence that can be quantified in the language of probabilities. The task of modeling the syntax of vocal sequences is therefore one of statistical inference. In this chapter, we discuss various models of syntax, with a focus on the Partially Observable Markov Model (POMM). We develop methods of inference that allow identifying and encoding these rules into a POMM. We also address questions about the performance of the model, as well as of the inference scheme, in the case of finite data sets. Finally, we discuss some model evaluation measures.

2.1 Introduction

Let Y1 = y1, Y2 = y2, ... denote a sequence of observations of the random variable Y. In the case of animal vocal sequences, the random variable is a syllable type, i.e., if there are m syllable types in an animal's vocal repertoire, then each observation in a sequence could be one of m possibilities. The simplest scenario is one in which the syllables appear with probabilities that are independent of preceding syllables. Such sequences of syllables can be produced without a memory of what transpired before.

More interesting sequences contain complex patterns which manifest as correlations between the syllables observed at different times. Such patterns could imply that the 'machine' that generated the sequence possesses some memory (in some physical form), based on which the rules governing the choice of syllable are tweaked. Thus the patterns require a complex computing process to generate them. Analysis of patterns, and of the complexity of the rules that can generate them, opens a window into the complexity of the machine which, in the case of our interest, is the brain.

One approach to grading the complexity of patterns is to identify the simplest computing scheme that can reproduce the patterns, borrowing notions from mathematical models of computing. This procedure has several aspects, namely: (i) identifying the level of complexity of the model, (ii) using the available data to identify the specific model within the given class of complexity, and (iii) evaluating the ability of the identified model to reproduce the patterns. All these tasks are made difficult by various limitations posed by the finiteness of the available sequence data, which can result in an insufficient representation of characteristic correlations and the generation of spurious correlations arising from statistical coincidences. For the sequences in our study, we are interested in models with a finite amount of memory - namely, Markov models and Partially Observable Markov Models (POMMs). We discuss methods for training the models with data, model validation, and checks on overfitting.

2.2 Model representation - Finite state machines

A Finite State Machine (FSM) is an abstract representation of regular languages (see Sec. 1.3.1) that can also be thought of as an abstract apparatus that performs computations. FSMs are systems that can be described by a 'current state' x(t), which at any time takes a value from a finite set of possible states. A finite set of inputs can trigger a transition in the state. The transition depends on the current state and the input trigger through a transition function T. The system can produce an output symbol y(t), taking values from a finite set of symbols, either during or between transitions. The output can depend on the current and previous state through an emission function E [53]. The state is an abstract structure that can be chosen to be something appropriate based on the process being represented. FSMs allow the patterns contained in most simple sequences to be represented through an appropriate choice of states, symbols, transitions, and emissions. While FSMs are defined to have deterministic transitions and emissions, they can be generalized to a class of stochastic FSMs, the simplest of which are Markov processes and Hidden Markov Models.

FSMs can be visually represented using state transition diagrams. To illustrate these notions and their mapping to the diagram, let us invent a one-player, luck-based challenge that combines coin flips and die throws. The goal is to stay in the game for as long as possible. The game always begins with a coin toss. A player who gets heads on a toss throws the die; on tails, the coin is flipped again. If a 1 shows up on the die, the coin should be flipped again. If a 2, 3, 4, or 5 falls, the player throws the die again. If a 6 shows up, the player is out. The mechanics and rules of this game can be represented by the simple FSM shown in Fig. 2.1.

Figure 2.1. An illustration of a Finite State Machine represented by a state transition diagram. (States: Start, Coin, Die, End. A tail loops back to Coin; a head leads to Die; a roll of 1 returns to Coin; rolls of 2-5 repeat Die; a roll of 6 leads to End.)

In the diagram shown, the bubbles represent states, and the arrows represent state transitions. The arrow labels indicate the output/input symbol corresponding to a transition. Note that in this FSM the output from one state is the input to the next state.¹ The FSM shown has four states - Start, Coin, Die, and End. There are two symbols, H and T, associated with the state Coin, while there are six symbols, 1, 2, 3, 4, 5, and 6, associated with the state Die. The Start and End states are each associated with a null output {}. Game progress happens in discrete time steps.

¹ This need not always be the case. In general, Finite State Machines can have different sets of input and output symbols.
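As a concrete sketch of how such an FSM executes - illustrative code only, not part of the original text, with hypothetical names - the game can be simulated directly from its transition rules:

```python
import random

def play_game(seed=0):
    """Simulate the coin-and-die FSM of Fig. 2.1; each state's emitted
    symbol selects the next state, and the game ends when a 6 is rolled."""
    rng = random.Random(seed)
    state, history = 'Start', []
    while state != 'End':
        if state == 'Start':
            state = 'Coin'
        elif state == 'Coin':
            flip = rng.choice(['H', 'T'])        # emit H or T
            history.append(flip)
            state = 'Die' if flip == 'H' else 'Coin'
        else:                                    # state == 'Die'
            roll = rng.randint(1, 6)             # emit 1..6
            history.append(roll)
            state = 'Coin' if roll == 1 else 'End' if roll == 6 else 'Die'
    return history

print(play_game())  # e.g. ['H', 3, 5, 6]
```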

The FSM considered is an example of a Moore machine, where the output depends only on the current state - H or T depends only on the state Coin; it does not matter whether this state was entered after a throw of 2 or 5 on the die. The same FSM can also be represented as an equivalent Mealy machine, where the transition between states is based on the input symbol. This distinction is made here because Finite State Machines are represented as either Moore or Mealy machines in different research articles about the syntax of sequences, and we would like to emphasize that the two forms are equivalent. Any discussion that follows is applicable to either form.

Finally, a finite state model is an abstract representation and does not necessarily need to have a one-to-one correspondence with the physical process it describes. However, the hope is to find some mapping between the machine and the process. In the case of models of vocal sequence syntax, we seek a mapping to neural models of sequence generation.

2.2.1 Markov models

When the occurrence of an observation in a sequence depends only on an observation that occurred before it at a particular position in the sequence, the sequence is said to be Markovian, or generated by a Markov source. An example of such a Markovian source is shown in Fig. 2.2, where the observations can be one of two symbols {y1, y2}, with probabilities assigned to the transitions between the symbols. Markov sequences can be thought of as originating from stochastic FSMs with state symbols but no output symbols [54]. The Markov model can also be represented by a transition matrix T, which specifies the transition probabilities between two symbols. The occurrence of a symbol in position i depends only on the symbol in position i − 1. This is a first-order Markov model. If it depended on the symbol in position i − 2, it would be a second-order Markov model. In general, if the occurrence of a symbol in position i in the sequence depends on the symbol in position i − n, we have an nth-order Markov model. In this dissertation, 'Markov model' refers to the first-order Markov model unless stated otherwise.

T = ( 0.8  0.2
      0.6  0.4 )

Figure 2.2. An example of a first-order Markov model represented using the state transition diagram in a manner similar to an FSM, and the corresponding transition matrix. The numbers next to the arrows show the transition probabilities (y1 → y1 with probability 0.8, y1 → y2 with 0.2, y2 → y1 with 0.6, and y2 → y2 with 0.4).
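A minimal sketch (illustrative code, not from the dissertation) of generating symbol sequences from this transition matrix:

```python
import numpy as np

# Transition matrix from Fig. 2.2; row i holds P(next symbol | current symbol i).
T = np.array([[0.8, 0.2],
              [0.6, 0.4]])

def sample_markov(T, length, start=0, seed=0):
    """Sample a first-order Markov sequence of symbol indices (0 = y1, 1 = y2)."""
    rng = np.random.default_rng(seed)
    seq = [start]
    for _ in range(length - 1):
        seq.append(int(rng.choice(len(T), p=T[seq[-1]])))
    return seq

print(sample_markov(T, 15))
```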

2.2.2 Hidden Markov Model (HMM)

A Hidden Markov Model (HMM) is used to model stochastic sequences more complex than a Markov chain. The HMM is a standard statistical model with a wide range of applications in time-series modeling [55]. In an HMM, an observed symbol sequence is also associated with a hidden state sequence. The hidden state sequence is a Markov chain. For a symbol sequence consisting of m unique symbols, an HMM is represented by the set {T, E, k}, where k is the number of states, T is a k × k transition matrix specifying the transition probabilities between the states, and E is a k × m emission matrix specifying the probability of each state emitting each of the symbols. In an HMM, each state has an emission distribution over all symbols, such that the same state could emit more than one symbol. An example of a 4-state HMM is shown in Fig. 2.3. An HMM can be thought of as a stochastic FSM [54].

Figure 2.3. An HMM with k = 4 states which can emit m = 2 discrete symbols, y1 or y2. T_ij is the probability of transitioning from state s_i to state s_j. E_j(y_k) is the probability of emitting symbol y_k in state s_j. In this particular HMM, states can only reach neighboring states.

2.2.3 Partially Observable Markov Model (POMM)

A Partially Observable Markov Model2 is a special case of the Hidden Markov

Model [8]. In the HMM, each state is associated with a probability distribution over

all m possible symbols that can appear in the sequence. This means that a single

state can could potentially emit any of them symbols, each time a transition to that

state occurs. By contrast, in a Partially Observable Markov Model (POMM) [56],

each state is associated with only one symbol, i.e., a state emits one of the m

2Not to be confused with the Partially Observable Markov Decision Process which has addi-tional structure called a decision maker which can influence state transitions.

17

Page 32: STATISTICAL INFERENCE OF SYNTAX FROM VOCAL SEQUENCES …

s1 s2

T12

s3 s4

y1 y2

E4(y2)

Figure 2.3. An HMM with k = 4 states which can emit m = 2 discrete symbols y1or y2. Tij is the probability of transitioning from state si to state sj . Ej(yk) is theprobability of emitting symbol yk in state sj . In this particular HMM, states can onlyreach neighboring states.

symbols with probability 1. Multiple states could however emit the same symbol

making the POMM a many-to-one mapping scheme as shown in Fig. 2.4. The use

s1 s2

T12

s3 s4

y1 y2

E4(y2) = 1

Figure 2.4. A POMM with k = 4 states which can emit m = 2 discrete symbols y1or y2. Tij is the probability of transitioning from state si to state sj . Ej(yk) is theprobability of emitting symbol yk in state sj . In a POMM Ej(yk) = 1.

of the POMM rather than the full HMM is motivated by computational models

to understand the neural basis of birdsong [45], where each state is considered to

be a chain network of neurons in the songbird HVC (see Sec. 1.2.2.). The neural

sequences arising from a chain network are Markovian, but the resulting syllable

sequences need not be if each chain represents a state in the POMM rather than

a syllable directly. A POMM can model all distributions that can be modeled by

an HMM - i.e., every HMM can be represented by an equivalent POMM [8]. It

has been shown in one study that for Bengalese finches (two birds), the syntax is

consistent with a POMM [6]. In the following sections, we discuss model evaluation

and inference mainly for the POMM, since the model plausibly represents the syntax of the vocal sequences of other species as well.
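To make the many-to-one mapping concrete, here is a minimal sketch (an assumed toy representation in Python, not the dissertation's code) of a POMM in which two hidden states emit the same syllable; by convention here, state 0 serves as the start/end state.

```python
import random

# Each state emits exactly one symbol; states 1 and 2 both emit syllable A.
emits = {1: "A", 2: "A", 3: "B"}
trans = {0: {1: 1.0},            # start -> state 1
         1: {3: 1.0},            # first A-state -> B-state
         3: {2: 0.5, 0: 0.5},    # B-state -> second A-state, or end
         2: {0: 1.0}}            # second A-state -> end

def sample_pomm(trans, emits):
    state, seq = 0, []
    while True:
        nxt = trans[state]
        state = random.choices(list(nxt), weights=list(nxt.values()))[0]
        if state == 0:           # returning to the start/end state ends the song
            return "".join(seq)
        seq.append(emits[state])

print(sample_pomm(trans, emits))  # "AB" or "ABA", each with probability 0.5
```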


Figure 2.5. The empirical probability distribution over observed sequences is an approximation to the true probability distribution. A model defines a distribution over sequences which is an approximation to the empirical distribution.

2.3 Model evaluation

The construction of a stochastic model of the syntax defines a probability distribu-

tion over the set of all sequences. One optimizes the parameters of the model such

that the probability distribution defined by the model is a good approximation of

the true probability distribution defined by the process generating the observed

sequences. In the absence of direct and/or complete knowledge of the process gen-

erating the data, one has to infer the match between the distributions purely from

available data (Fig. 2.5). In evaluating the quality of this approximation, we need

to empirically identify the relevant statistical features that have to be reproduced

by the model and accordingly define model performance. Given enough parameters, a sufficiently complex model can always be found to match any distribution. In order to avoid such overfitting and the capture of spurious patterns, we need

to place limits on model performance. In the following sections, we discuss these

issues in the context of POMM inference.

2.3.1 Measures of model fit

One approach to evaluating a model is to define metrics to ‘score’ the inferred

model. A high score reflects a good model. We consider the use of two such


metrics - sequence completeness and log-likelihood.

2.3.1.1 Sequence completeness

Consider a model that generates sequences with a distribution P , and let Y be a

set of observed sequences (generated by another process, say Q, possibly the same

as P ). Sequence completeness measures the total probability under P of finding

each of the unique sequences in Y .

$$P_c = \sum_{y} P_y \qquad (2.1)$$

where the sum runs over the unique sequences $y$ in $Y$.

If all sequences present within the observed set can be generated by the model

and these are the only sequences that the model can generate, then Pc = 1. In

this sense it can be thought of as a measure of the similarity between the two

distributions. However, it is a weak measure of similarity in that we do not require $P(y) = Q(y)$ for any sequence $y$.
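A minimal sketch of this measure follows (assuming, for illustration, that the model is summarized by a dictionary of sequence probabilities, as when the empirical distribution of a fit set is used as the model):

```python
from collections import Counter

def sequence_completeness(model_prob, observed):
    """Sum of model probabilities over the *unique* sequences in `observed`."""
    return sum(model_prob.get(y, 0.0) for y in set(observed))

# Example: the empirical distribution over a fit set serves as the model P.
fit = ["AB", "AB", "ABA", "AB"]
P_fit = {y: k / len(fit) for y, k in Counter(fit).items()}
print(sequence_completeness(P_fit, ["AB", "ABB"]))  # 0.75: only 'AB' is covered
```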

Factors that affect sequence completeness

We use completeness as a tool to check whether a model inferred from the data is

capable of reproducing the unique sequences found in the data. More precisely, we

use part of the data to learn a model and measure the completeness of the model

against the remaining part of the data. Completeness values depend strongly on the amount of available data, i.e., the number of song sequences and the number of syllables, and on the nature of the model that generates the data.

Effect of finiteness on sequence completeness

The number of sequences available for the inference of a finite state model could

play a significant role in the reliability of the model fit. We test how using different

numbers of sequences in the inference of the model affects the sequence complete-

ness. We want to quantify the effect of the finite size of the available dataset

with simple first-order Markov models (a first-order Markov model is a POMM in

which the number of states is the same as the number of syllables). We construct

such models using different numbers of symbols with the transition probabilities

out of each symbol/state drawn from a uniform distribution. We use m=2,3,4


Figure 2.6. Sequence completeness distributions under Markov models for N sequences and m symbols. There is a large variation in possible distributions for the same N and m. The average sequence completeness distribution in each case is shown with a thick line. [Four histogram panels of counts vs. sequence completeness: m = 3, N = 200; m = 3, N = 500; m = 5, N = 200; m = 5, N = 500.]

and 5 symbols (excluding the silent start/end symbol). N sequences are generated

from each model with N ranging from 100 to 12800. Of these N sequences, half

are used as fit sequences representing a model. The other half are chosen as test

sequences. In order to calculate the completeness of the test sequences using the

fit set, we first calculate the probability distribution over sequences in the fit set.

We then find unique sequences in the test set and sum their probabilities based on

the distribution over the fit set to get the sequence completeness. We then obtain

a distribution of the sequence completeness thus calculated for 1000 random splits

of the N sequences. In Fig. 2.6 we see the distributions of sequence completeness so obtained for m = 3 and m = 5 symbols, and for different numbers of sequences, N = 200 and N = 500.

In Fig. 2.7, the markers represent the mean completeness obtained from these


Figure 2.7. Sequence completeness depends on the number of sequences available for model inference. The mean sequence completeness increases with an increase in the number of sequences N, with and without repeats allowed in the sequences. [Two panels of mean sequence completeness vs. number of sequences: without repeats (left) and with repeats (right), for models on 2-5 symbols with fixed and varying T.]

distributions and the error bars represent the standard deviation. Since our con-

struction of a POMM is limited to the non-repeat structure of sequences, we first

track the change in sequence completeness with number of sequences for Markov

models in which self-transitions are not allowed (left panel of Fig. 2.7). The cal-

culations are exactly the same as the ones described above. The mean sequence

completeness increases with an increase in the number of sequences N .

We now include the possibility of a symbol repeating, i.e., self-transitions are

allowed. In this case we know that the number of repeats of a symbol in a sequence

is variable. This would mean that there would be fewer sequences in the fit set

identical to those in the test set and we expect the completeness to be lower on

average. This is indeed the case (right panel of Fig. 2.7). Even with repetitions

allowed (right panel), the mean sequence completeness increases with the number

of sequences for Markov models on m = 2, 3, 4, 5 symbols. For the same number

of sequences, the sequence completeness is most likely to be greater for smaller numbers of symbols, though a smaller alphabet does not guarantee higher completeness in every case. Overall, however, there is still an increase in completeness with the number of sequences.


Variability in sequences

Sequence completeness represents a match of sequences in two sets. It can be

argued that, in general, a model that can generate a broad distribution of possible sequences will result in a small sequence completeness, as the total number of sequences found in a finite set of test sequences will form only a small part of the possibilities. In other words, a model with higher variability will lead to lower

values of sequence completeness. One possibility is that for the same number of

sequences, if the number of possible combinations of symbols is large, for example if

the number of possible symbols in the alphabet is large, the sequence completeness

would be lower. There are multiple other features of the syntax that can lead to

increased variability and lower sequence completeness -

1. Presence of cycles or repeat structures in the syntax - When a se-

quence is long, the number of possible combinations of symbols that could

lead to a sequence of any given length is large. This means that the prob-

ability of finding an exact match in another set would be low. We could

hypothesize that the average length of the sequences could also affect com-

pleteness. In fact we can imagine that any factor that increases the average

length of sequences could result in a low value of sequence completeness.

This could happen in several ways - if the probability of most symbols tran-

sitioning to the start symbol is low, then the average length of sequences

would be high. This could also happen if there are cycles in the syntax.

The simplest cycle is the zeroth order cycle which is a self-transition. The

greater the repeat probability of a symbol, the higher the possibility of a

sequence containing that symbol being longer than average. To study this,

we constrained the repeat probabilities (diagonal of transition matrix) to all

be the same value for each model picked. We generated 100 such models,

generated 1000 sequences from each, constructed the distribution of sequence

completeness for each, and recorded the mean sequence completeness. We

then tuned the repeat probability over a range of values and repeated the

procedure. As seen in Fig. 2.8, as the repeat probability increases, the mean

sequence completeness decreases on average. Also, we see that it is possible

that a model based on a larger set of symbols with small repeat probabilities

could sometimes lead to a greater sequence completeness than one based on


fewer symbols but with high repeat probabilities.

2. Sparsity of transition matrix - When the number of possible transitions

out of the states in a POMM is small (sparse transition matrix), the number

of unique sequences that can be generated is also small. This would mean

that it is more likely for sequences in the fit set to also be found in the

test set - i.e., we expect the sequence completeness to be high. We need a

single measure for the sparsity of the matrix, so that we can study sequence

completeness as a function of this quantity. The simplest possibility would be

to count the total number of zeros in the matrix (extremely low probabilities,

smaller than a threshold can also be considered to be zero). However, this

would not take into account the actual values of the non-zero probabilities.

Another possibility is to define the entropy of state transitions - every row $i$ of the transition matrix $T$ represents the transition probabilities out of state $i$, and for each row the entropy is $H_i = -\sum_j T_{ij} \log T_{ij}$. We then sum this quantity across the different states. In Fig. 2.9, sequence completeness is studied as a function of the state transition entropy (a small computational sketch follows this list).
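The following is a minimal sketch of this sparsity measure, assuming a row-stochastic NumPy matrix:

```python
import numpy as np

def transition_entropy(T, eps=1e-12):
    """Summed row entropies -sum_j T_ij log T_ij; zero entries contribute ~0."""
    T = np.asarray(T, dtype=float)
    return float(-np.sum(T * np.log(T + eps)))

T = np.array([[0.8, 0.2],
              [0.6, 0.4]])
print(transition_entropy(T))  # lower values indicate a sparser matrix
```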

2.3.1.2 Log-likelihood

The likelihood of a set of sequences is defined as the probability of all observed sequences given a model. For a set of sequences $Y = \{y_1, y_2, \ldots, y_n\}$, the likelihood is $L = P(Y) = \prod_i P_{y_i}$. However, many of the $n$ sequences could be identical. If there are $k_y$ sequences of type $y$, then

$$L(P_y) = \prod_{y'} P_{y'}^{\,k_{y'}} \qquad (2.2)$$

The log-likelihood is obtained by taking the logarithm of Eq. 2.2:

$$\mathcal{L}(P_y) = \sum_{y'} k_{y'} \log P_{y'} \qquad (2.3)$$

A high log-likelihood indicates a good model.
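A minimal sketch of Eq. 2.3 follows (assuming, for illustration, that the model is summarized by a dictionary of sequence probabilities):

```python
import math
from collections import Counter

def log_likelihood(model_prob, observed):
    counts = Counter(observed)  # k_y, the multiplicity of each unique sequence y
    return sum(k * math.log(model_prob[y]) for y, k in counts.items())

P = {"AB": 0.75, "CBA": 0.25}
print(log_likelihood(P, ["AB", "AB", "AB", "CBA"]))  # 3*log(0.75) + log(0.25)
```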


Figure 2.8. Sequence completeness depends on the repeat probabilities of symbols in the Markov model. Each circle in the plot represents the mean sequence completeness over N = 1000 sequences for a single model. Higher repeat probabilities increase the average length of sequences that can be generated by the model. [Mean sequence completeness vs. repeat probability, for 3-symbol and 5-symbol models.]

An upper bound on log-likelihood

When we consider methods of model inference in later sections, we keep track of the

log-likelihood as a means of evaluating improvements in the estimation of model

parameters. It is a natural question to ask if there is a bound on the log-likelihood

for a given dataset that we should match the performance of the model against. To

find this bound, we seek the set of $P_y$ that maximize $\mathcal{L}(P_y)$. This is a constrained maximization problem: since the $P_y$ are probabilities, they must obey the constraint $\sum_{y'} P_{y'} = 1$. Hence, using a Lagrange multiplier $\lambda$, the function we need to maximize is

$$\mathcal{L}(\lambda, P_y) = \mathcal{L}(P_y) - \lambda\Big(\sum_{y'} P_{y'} - 1\Big).$$

Maximizing with respect to $P_y$ and $\lambda$, i.e., setting

$$\frac{\partial \mathcal{L}(\lambda, P_y)}{\partial P_y} = 0 \quad \text{and} \quad \frac{\partial \mathcal{L}(\lambda, P_y)}{\partial \lambda} = 0,$$


Figure 2.9. Sequence completeness depends on the sparsity of the transition matrix, measured by the state transition entropy. [Mean sequence completeness vs. state transition entropy; the plotted fit has R² = 0.9.]

we get

$$P_y = k_y/\lambda \quad \text{and} \quad \lambda = N.$$

Hence the maximum log-likelihood is achieved by the empirical, frequentist approximation to the distribution, $P_y \propto k_y$ (specifically, $P_y = k_y/N$). Moreover, the upper bound is

$$\mathcal{L}_{\max} = N \sum_{y'} P_{y'} \log P_{y'}, \quad \text{where } P_{y'} = k_{y'}/N,$$

which is $-N$ times the Shannon entropy of this empirical distribution. We have therefore shown that the upper bound on the log-likelihood achievable by a model is set precisely by the entropy of the empirical distribution of the observed sequences.
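A small numerical check of this bound (a sketch; the bound equals the log-likelihood obtained when $P_y = k_y/N$):

```python
import math
from collections import Counter

def max_log_likelihood(observed):
    """N * sum_y p_y log p_y with p_y = k_y / N, i.e. sum_y k_y log(k_y / N)."""
    N = len(observed)
    return sum(k * math.log(k / N) for k in Counter(observed).values())

data = ["AB"] * 3 + ["CBA"]
print(max_log_likelihood(data))  # no model can exceed this value on `data`
```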

2.3.2 Measures of sequence similarity

Another approach to evaluating a model is by using a set of sequences generated

from the model. The statistics of this set can be compared with the statistics of


observed sequences. If the model is a good fit to the observed sequences, then we can demand that a chosen set of key statistics of the data agree within the limits of estimation error due to the finite size of the dataset.

2.3.2.1 Repeat distributions

Symbols can be repeated multiple times in a sequence. Since the transitions be-

tween symbols, including self-transitions, are stochastic, the number of times a

symbol is repeated varies. The distribution of the number of repetitions in all

appearances of the symbol in the observed set of sequences is called the repeat

distribution.

2.3.2.2 n-gram distributions

A sequence can be segmented into multiple sub-sequences. A good model of the

sequence syntax should be able to replicate the statistics of these sub-sequences as

well. A sub-sequence of length $n$ is called an n-gram. For example, for the sequence ABCD: AB, BC, and CD are 2-grams, while ABC and BCD are 3-grams. A distribution

can be defined over all n-grams in the observed set of sequences for each n.

2.3.2.3 Step distributions

Some symbols appear more frequently at the beginning of a sequence, others at the end, and so on. In general, the location of a symbol in a sequence is a statistic that must also be replicated by a good model of the sequence syntax. The distribution of the positions at which a symbol is found - the number of steps from the beginning of a sequence - computed over the entire set of sequences, is defined to be the step distribution of that symbol.

The step distributions for all m symbols should be matched by a good model.
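A minimal sketch computing all three similarity statistics follows, assuming for brevity that syllables are single characters:

```python
from collections import Counter
from itertools import groupby

def repeat_distribution(seqs, symbol):
    """Counts of run lengths of `symbol` over all its appearances."""
    return Counter(len(list(g)) for s in seqs
                   for c, g in groupby(s) if c == symbol)

def ngram_distribution(seqs, n):
    return Counter(s[i:i + n] for s in seqs for i in range(len(s) - n + 1))

def step_distribution(seqs, symbol):
    """Counts of the number of steps from the sequence start to `symbol`."""
    return Counter(i for s in seqs for i, c in enumerate(s) if c == symbol)

songs = ["AAABCC", "ABCC"]
print(repeat_distribution(songs, "A"))  # Counter({3: 1, 1: 1})
print(ngram_distribution(songs, 2))     # Counter({'AA': 2, 'AB': 2, 'BC': 2, 'CC': 2})
print(step_distribution(songs, "B"))    # Counter({3: 1, 1: 1})
```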

2.4 Model inference - inference of a POMM from data

A model is defined both by its structure and by its parameters. Given a set

of sequences that the model is to be inferred from, some inference algorithms are


used to estimate both the structure and the parameters, while others are used to

estimate just the parameters given the model structure.

2.4.1 Expectation maximization - Baum-Welch algorithm

A partially observable Markov model contains a many-to-one mapping from states

to symbols. The inference of the parameters of the POMM (and HMMs in gen-

eral) from the observed data is difficult compared to a Markov model because

of the many possible state sequences that could result in the same observed se-

quence. Expectation-maximization is a method that is suited to such classes of

problems [57]. In this section we describe the Baum-Welch algorithm that imple-

ments expectation-maximization.

Let the observed probability over the syllable sequences be Pobs(Y) and the

probability of the sequences Y under a POMM with transition matrix T and

emission matrix E, be P (Y;T,E). For the cases of interest here, the emission

matrix is taken to be fixed, and so it won’t be explicitly mentioned in the notations

used below.

By tuning the transition matrix T we hope to find the distribution P (Y;T )

that best approximates the true observed distribution Pobs(Y). A measure of the

dissimilarity of the two distributions is the Kullback-Leibler divergence

$$D = \sum_Y P_{\rm obs}(Y)\,\log\frac{P_{\rm obs}(Y)}{P(Y|T)} = \sum_Y P_{\rm obs}(Y)\left[\log P_{\rm obs}(Y) - \log P(Y|T)\right],$$

and this quantity is to be minimized. Since the true distribution $P_{\rm obs}(Y)$ is independent of $T$, minimization of $D$ by tuning $T$ is equivalent to maximization of the log-likelihood

$$\sum_Y P_{\rm obs}(Y)\,\log P(Y|T).$$

$P_{\rm obs}(Y)$ is the distribution over syllable sequences inferred from the observed sequences:

$$P_{\rm obs}(Y) = \frac{1}{n}\sum_{i=1}^{n} \delta_{Y,\,y^{(i)}},$$

where y(i) is the ith out of the n observed sequences (allowing repetitions). We can


replace the above quantity by the average over observed sequences:

$$\mathcal{L}(T) = \frac{1}{n}\sum_{i=1}^{n} \log P\big(y^{(i)}\big|T\big) = \big\langle \log P(Y|T) \big\rangle.$$

We will use the angle brackets to denote an average over the observed sequences in what follows. Ideally, we want to identify the maxima of the above quantity in the very high-dimensional parameter space of $T$. In general this is an impractical task; however, the Baum-Welch algorithm achieves a simpler one. Given a model $T_{\rm in}$, the algorithm produces a new model $T_{\rm out}$ which gives a higher log-likelihood for the data. The algorithm can therefore be used to iteratively improve the model. Starting from any initial guess, it thus leads us to a local maximum in the parameter space.

Overview

In order to explain and justify the algorithm, we define the following functional $F$ and quote a few of its properties. For a distribution $Q(S|Y)$ over the state sequences, and a transition matrix $T$, the functional $F(Q, T)$ is defined as follows:

$$F(Q, T) = \left\langle \sum_S Q(S|Y)\,\log\frac{P(S, Y|T)}{Q(S|Y)} \right\rangle.$$

For well-definedness, Q is restricted to be among those distributions such that

Q (S|Y ) is zero whenever the state sequence S cannot emit the sequence Y under

the fixed emission matrix E. The functional F satisfies the following properties

1. For a given $T$, $F(Q, T)$ gives a lower bound on the log-likelihood $\mathcal{L}(T)$, i.e., for any choice of $Q(S|Y)$,

$$\mathcal{L}(T) \geq F(Q, T).$$

This is a consequence of Jensen's inequality for expectation values of concave functions of real-valued random variables.

2. The transition matrix T and the observed sequences Y together define a prob-

ability distribution over possible state sequences, which we denote P (S|Y, T )

(discussed later). It can be seen that for given T and the observed sequences


$Y$, $F(Q, T)$ is maximized when $Q = P(S|Y, T)$. The maximum is $\mathcal{L}(T)$. This can be seen from the definition of $F$ above.

3. For a choice of Q, and given the observed data, F (Q, T ) can be maximized

by tuning T .

Each iteration of the Baum-Welch algorithm takes in an initial $T_{\rm in}$. From property #2 above, we have that

$$\mathcal{L}(T_{\rm in}) = F(Q, T_{\rm in}) \quad \text{where } Q(S|Y) = P(S|Y, T_{\rm in}).$$

From property #3, we have that a new $T_{\rm out}$ can be inferred by maximizing with respect to $T$, such that

$$F(Q, T_{\rm out}) \geq F(Q, T_{\rm in}).$$

From property #1, we have that $F(Q, T_{\rm out})$ forms a lower bound on $\mathcal{L}(T_{\rm out})$, i.e.,

$$\mathcal{L}(T_{\rm out}) \geq F(Q, T_{\rm out}).$$

Combining these inequalities, we see that

$$\mathcal{L}(T_{\rm out}) \geq \mathcal{L}(T_{\rm in}).$$

Thus we have inferred a transition matrix $T_{\rm out}$ that improves the log-likelihood compared to $T_{\rm in}$.

Optimization steps

The previous section explains why the Baum-Welch algorithm succeeds in improv-

ing the log-likelihood. In practice the algorithm is useful as the optimization steps

described in properties #2 and #3 above are feasible using the forward-backward algorithm.

The maximization of $F(Q, T)$ with respect to $T$ to obtain $T_{\rm out}$, described in property #3, is also tractable. To see this, we first note that the maximization of $F$ with respect to $T$ is equivalent to the maximization of

$$F_Q = \left\langle \sum_S Q(S|Y)\,\log P(S, Y|T) \right\rangle.$$


Labelling the matrix elements of $T$ as $t_{ij}$, the joint distribution of $S, Y$ can be written as

$$P(S, Y|T) = \delta_{Y, O(S)}\,P(S|T) = \delta_{Y, O(S)} \prod_{s, s' \in \text{states}} t_{ss'}^{\,k_{ss'}},$$

where $O(S)$ is the observed sequence resulting from a state sequence $S$; this is known since the emission matrix of the POMM is assumed to be known. Here $\delta_{a,b}$ is 1 if $a = b$ and 0 otherwise, and $k_{ss'}$ is the number of occurrences of the transition $s \to s'$ in a state sequence $S$. Incorporating the constraint that the transition

matrix rows need to be normalized, we have to maximize the following quantity with respect to $t_{ss'}$ and the Lagrange multipliers $\lambda_s$:

$$\left\langle \sum_S Q(S|Y)\,\log \prod_{s, s' \in \text{states}} t_{ss'}^{\,k_{ss'}} \right\rangle - \sum_s \lambda_s \Big( \sum_{s'} t_{ss'} - 1 \Big).$$

From straightforward maximization we find that

$$t_{ss'} = \frac{1}{\lambda_s} \left\langle \sum_S Q(S|Y)\,k_{ss'} \right\rangle,$$

where the $\lambda_s$ are determined by normalization. The Baum-Welch algorithm calculates this efficiently by employing the forward-backward procedure described below. We write ($\tau$ being the length of $S$ and $S(n)$ representing the state at location $n$ in the sequence $S$)

$$k_{ss'} = \sum_{n=1}^{\tau-1} \delta_{s, S(n)}\,\delta_{s', S(n+1)}.$$

Plugging into the above equation, we have

$$t_{ss'} = \frac{1}{\lambda_s} \left\langle \sum_S \sum_{n=1}^{\tau-1} Q(S|Y)\,\delta_{s, S(n)}\,\delta_{s', S(n+1)} \right\rangle.$$


Note that, from the discussion in the previous section, $Q(S|Y) = P(S|Y, T^{\rm in}) = \frac{P(S, Y|T^{\rm in})}{P(Y|T^{\rm in})}$. The above quantity can then be written as

$$\frac{1}{\lambda_s} \left\langle \frac{1}{A_\tau(0|Y)} \sum_{n=1}^{\tau-1} A_n\big(s, Y_{1\to n}|T^{\rm in}\big)\; t^{\rm in}_{ss'}\; B_{n+1}\big(s', Y_{n+1\to\tau}|T^{\rm in}\big) \right\rangle,$$

where $A_n(s, Y_{1\to n}|T^{\rm in})$ is the probability of finding the state to be $s$ at step $n$, starting from the beginning of the sequence:

$$A_n\big(s, Y_{1\to n}|T^{\rm in}\big) = \sum_{(S_1, S_2, \ldots, S_n)} \delta_{S_n, s} \prod_{i=1}^{n} E_{S_i, Y_i}\, t^{\rm in}_{S_{i-1} S_i}.$$

Here $E$ is the emission matrix, and $E_{S_k, Y_k}$ is 1 if $S_k$ can emit $Y_k$ and zero otherwise. $S_0$ is set to be a dormant start/end state denoted 0. $A_\tau(0, Y|T^{\rm in})$ gives the probability of returning to the dormant state after $\tau$ steps, and therefore gives the probability $P(Y|T^{\rm in})$. $B_n(s, Y_{n+1\to\tau}|T^{\rm in})$ is the probability of finding the state to be $s$ at step $n+1$, given the observed subsequence $Y_{n+1\to\tau}$:

$$B_n\big(s, Y_{n+1\to\tau}|T^{\rm in}\big) = \sum_{(S_{n+1}, S_{n+2}, \ldots, S_\tau)} \delta_{S_{n+1}, s} \prod_{i=n+1}^{\tau} E_{S_i, Y_i}\, t^{\rm in}_{S_{i-1} S_i}.$$

The above expressions for $t_{ss'}$, $A$, and $B$ can be computationally evaluated given the observed sequences and an input $T^{\rm in}$. The start and end states and syllables need

to be carefully treated in the actual implementation. The details of our compu-

tational implementation of the Baum-Welch and the forward-backward algorithm

can be found in Appendix A. To summarize, for a specified POMM structure

(number of states and assignment of symbols to states), the Baum-Welch algo-

rithm estimates the transition probabilities. However, we would also like to find

an efficient way of finding the optimal POMM structure.
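As an illustration of how the quantities $A_n$ can be computed recursively, here is a minimal sketch of the forward pass in Python. The conventions are assumptions made for the sketch (states 1..k each emit one symbol; row and column 0 of T belong to the start/end state); it is illustrative, not the implementation of Appendix A.

```python
import numpy as np

def forward_prob(seq, T, emits):
    """P(Y|T): probability that the POMM generates the symbol sequence seq."""
    k = len(emits)                     # emitting states are labeled 1..k
    alpha = T[0, 1:] * np.array([emits[s + 1] == seq[0] for s in range(k)])
    for y in seq[1:]:                  # A_n(s) = [emit(s)=y_n] * sum_s' A_{n-1}(s') T[s', s]
        alpha = (alpha @ T[1:, 1:]) * np.array([emits[s + 1] == y for s in range(k)])
    return float(alpha @ T[1:, 0])     # close the sequence by returning to state 0

# Toy POMM generating AB (p=0.75) and CBA (p=0.25), with two states each for A and B
emits = {1: "A", 2: "B", 3: "C", 4: "A", 5: "B"}
T = np.zeros((6, 6))
T[0, 1], T[0, 3] = 0.75, 0.25                  # start -> A1 or C
T[1, 2] = 1.0; T[2, 0] = 1.0                   # A1 -> B1 -> end
T[3, 5] = 1.0; T[5, 4] = 1.0; T[4, 0] = 1.0    # C -> B2 -> A2 -> end
print(forward_prob("AB", T, emits), forward_prob("CBA", T, emits))  # 0.75 0.25
```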

2.4.2 Grid search for an optimal model

We can consider the problem of finding the optimal POMM structure as a param-

eter search on a grid. One of the main components of the model structure is the

number of states associated with each symbol. The combinatorial challenge of de-

termining the optimal number of states in a POMM increases in difficulty with the


Figure 2.10. Schematic of search on a grid for an optimal model. The grid is a discrete lattice of dimensions equal to the number of unique syllables. Each node on the grid corresponds to a specific state vector. The node shown in blue represents the mapping of every syllable to one state - a Markov model. At each node of the grid, an optimal model is found by using the Baum-Welch algorithm.

number of syllables in a bird’s repertoire. The trial and error method of finding the

number of states associated with each syllable becomes almost impossible for birds

like the nightingale and the warbler that sing more than 100 types of syllables.

We can define a state vector associated with every model. The $i$th element of the state vector $S = (s_1, s_2, \ldots, s_m)$ represents the number of states that are assigned to symbol $i$ in the model. If all elements of $S$ are 1, then the state vector represents a Markov model. If any $s_i \neq 1$, then the state vector represents a POMM. The grid is a discrete lattice with the number of dimensions equal to the number of unique symbols in the set of sequences, and each node represents a unique state vector.

The left panel of Fig. 2.10 shows an example of a three-dimensional grid. The

parameter search starts with the assignment of one state to every syllable repre-

sented by the blue node on the grid in Fig. 2.10. This corresponds to a Markov

model. At every node on the grid, the optimal transition matrix corresponding to

that particular state vector is obtained by using the Baum-Welch algorithm. This

is represented by the panel on the right in Fig. 2.10.

We then consider the addition of a single state to each of the syllables in

turn. For each addition considered, the Baum-Welch algorithm is again used to

obtain the optimal model corresponding to that state vector. The addition with

the highest log-likelihood is retained, while other additions are discarded. We


now consider another round of state additions to the model that we decided to retain in the previous step. Although there is no guarantee that the path through the grid is globally optimal, local log-likelihood maximization at each grid point is guaranteed. There is, however, an upper bound on the log-likelihood for a given dataset, dependent on the finiteness of the dataset.
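The greedy walk can be sketched as follows. This is pseudocode-level scaffolding, not the dissertation's code: `baum_welch(state_vec, seqs)` and `loglik(model, seqs)` are hypothetical stand-ins for the routines of Sec. 2.4.1, and `accept(model)` stands in for the stopping criterion of Sec. 2.4.4; all three are supplied by the caller.

```python
def grid_search(train_seqs, symbols, baum_welch, loglik, accept, max_states=30):
    """Greedy walk on the grid of state vectors, starting from the Markov node."""
    state_vec = {s: 1 for s in symbols}      # one state per symbol: a Markov model
    model = baum_welch(state_vec, train_seqs)
    while not accept(model) and sum(state_vec.values()) < max_states:
        best = None
        for s in symbols:                    # try adding one state to each symbol
            trial = dict(state_vec)
            trial[s] += 1
            m = baum_welch(trial, train_seqs)
            if best is None or loglik(m, train_seqs) > loglik(best[1], train_seqs):
                best = (trial, m)
        state_vec, model = best              # retain only the best addition
    return state_vec, model
```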

2.4.3 Establishing error bounds

The entropy of a distribution ideally should not depend on the number of sample sequences available. Hence, to find the error bounds on the log-likelihood, we should first find the error bounds on the entropy and then multiply by the total number of observed sequences. The basic assumption is that if we know the subsampled mean

and standard deviation obtained from a data sample (set of song sequences in our

case), we can infer the true mean and standard deviation. Why is there a true

standard deviation at all? If we had an infinite set of sequences, the standard

deviation would be zero. But we should account for the fact that we only have

finite data. The procedure to obtain the true mean and standard deviation and the

sample bootstrapped mean and standard deviation is as follows. We first pick any

POMM and treat this as our true source or generator. For example, in the analysis

that follows we consider 10-state and 17-state POMMs built on N = 845 songs

of Bengalese Finch, Bird 2. We then create 1000 sets of N sequences generated

from the POMM (generator), calculate the entropy for each set, and find the mean

and standard deviation of the distribution so obtained. This is the true mean

and standard deviation. We then consider a single set of N sequences as our

sample data. We consider a fraction α of these sequences, pick α × N sequences

by bootstrapping without replacement (1000 times), and calculate the mean and

standard deviation of the entropy distribution obtained. These are considered to

be the sample mean and sample standard deviation. We can now compare these

to the true values. If we find a functional relationship between the two, we can

use it in the choice of error bounds for the data at hand.

2.4.4 Grid search stopping criterion

Any good model should be sufficiently generalizable. This means that the param-

eters of the model should not be fine-tuned to match the specifics of the observed


Figure 2.11. The scaling of the sample standard deviation to the true standard deviation, shown as a function of the fraction α of sequences chosen for sub-sampling. The blue and red markers represent samples from two different models. The scaling is model-independent. [Plot of σ_sample/σ_true vs. α.]

sequence set precisely. It should instead be representative of the general rules un-

derlying these sequences such that any other sequence produced using the same

rule could also be produced by the model. In order to ensure that the POMMs

we infer are sufficiently generalizable, we split the dataset into two - a training or

fitting set, and a testing or validation set. Consider the grid in Fig. 2.10. Using

the sequences in the training set as input, the Baum-Welch algorithm is first used

to construct a POMM which, for an advance of a single state on the grid, gives

the maximum possible log-likelihood of the training sequences. At this point we

consider cross-validation to evaluate the POMM. We consider the sequence completeness $P_c^{\rm test}$ of the testing sequences based on the POMM constructed using the training sequences.

Sec. 2.3.1.1 describes the determination of the distribution of sequence completeness $D_{P_c}$ for a finite set of sequences, based on multiple random splits of the dataset into training and testing sequences. Very briefly, given a set of sequences, we construct the distribution of sequence completeness by randomly splitting the data in half multiple times and calculating the completeness of sequences in one


Figure 2.12. Construction of a sequence completeness distribution from the data. The set of sequences is randomly split into two sets - a test set and a fit set - multiple times. For each split, the sequence completeness of the test sequences is calculated using the distribution over sequences defined by the fit set. A syntax model of the sequences is accepted if the sequence completeness under the model falls within 5% of the sequence completeness distribution so constructed. [Flowchart: data → random partition into fit and test sets → infer model from fit data → calculate completeness of test data → repeat to build the distribution.]

half against the other half (see Fig. 2.12).

Now, the fit and test sets $Y_{\rm Fit}$ and $Y_{\rm Test}$ represent a single such split. Once the sequence completeness $P_c^{\rm test}$ of the test sequences is determined, a p-value of $P_c^{\rm test}$ is computed based on the distribution $D_{P_c}$. The p-value is chosen to be the probability, under the distribution, of finding sequence completeness values below $P_c^{\rm test}$; this is computed assuming that the distribution is normal. If the POMM is representative of the syntax of all sequences in the dataset, then we expect $P_c^{\rm test}$ to fall within the range of values represented by the distribution $D_{P_c}$, or at the very least not within its lowest values. We choose this cutoff to be the lowest 5% of values in the distribution. This means that if $p > 0.05$, the POMM is accepted.
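A minimal sketch of this acceptance test, using the stated normal approximation (standard library only; `split_completeness` holds the completeness values from the repeated random splits):

```python
import math
import statistics

def completeness_pvalue(pc_test, split_completeness):
    """P(Pc < pc_test) under a normal fit to the split distribution D_Pc."""
    mu = statistics.mean(split_completeness)
    sd = statistics.stdev(split_completeness)
    z = (pc_test - mu) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

d_pc = [0.80, 0.78, 0.83, 0.81, 0.79, 0.82]
print(completeness_pvalue(0.795, d_pc) >= 0.05)  # True: the POMM is accepted
```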

2.4.5 Finding the optimal state vector

A model is accepted based on the criterion in Sec. 2.4.4 using a single split of the

dataset. However, it is possible that a different split could lead to very different

results. Hence we repeat the splitting of the data and inference of the optimal

state vector multiple times - typically 20 times. We then assign the state vector

that is inferred most frequently to be the optimal state vector.

Finally, we use the full set of observed sequences and run the Baum-Welch


algorithm to retrain a POMM with the structure specified by the optimal state

vector.

2.4.6 Reduced representation by filtering non-dominant transitions

Depending on whether the objective is to construct a model that is compact and

gives us a sense of the most important features of the syntax, or to construct

a model that captures every detail of the observed set of sequences, we could

choose to prune the POMM further or not. Transitions of very low probability

could potentially be ignored without losing the key features of the syntax. Such

a reduction would help highlight key features of the syntax without taking away

from the accuracy of the model. We follow a method of backbone extraction

for complex weighted networks, where dominant weighted edges in a graph are

identified by comparing the normalized weight of each edge against the probability

that an edge of that weight could have occurred just by chance [58]. Although the

networks considered in the study were large-scale networks such as the US airport

network and the Florida Bay food web, the methods developed are relevant for

networks of any size. The null hypothesis is that the edge weights associated with

a node in the network, the transition probabilities into or out of a state in our case,

are assigned at random from a uniform distribution. We only retain

transitions that reject this null hypothesis. The details of the calculations can be

found in Appendix C.

2.4.7 Checking for equivalent POMMs - state-merging

Another useful procedure to ensure that we obtain the most compact POMM

structure is to allow for the merging of states since there could be equivalent

POMMs of different sizes. We consider the possibility of merging states associated

with the same syllable based on a concatenated observation sequence $O$ as follows. We use the forward-backward algorithm to find the probability of each state at every step of the concatenated sequence. Consider two states $1_A$ and $1_B$ associated with syllable 1. We find $p_{iA} = p(S_i = 1_A|O, T, E)$ and $p_{iB} = p(S_i = 1_B|O, T, E)$, where $S = (S_1, S_2, \ldots)$ is the state sequence. If $p_{iA}$ and $p_{iB}$ are nonzero at every step where $O_i = 1$ ($i$ is the step count), then we can say that they are good candidates for merging. If $p_{iA} = 0$ when $p_{iB} \neq 0$, and vice versa,


most of the time, then they are less mergeable. Using this idea we can calculate a

‘merge score’, which we define as

$$M = \frac{\sum_i' \left(p_{iA} + p_{iB}\right)}{\sum_i \left(p_{iA} + p_{iB}\right)} \qquad (2.4)$$

where the primed sum in the numerator runs only over those time steps where both $p_{iA}$ and $p_{iB}$ are non-zero. Note that $M = 0$ if the states emit different syllables. Once $M$ is

calculated for every pair of states, we can decide on the ideal choice for merging

by starting with the pair with the largest M and moving on to other pairs in

decreasing order of their merge score. For every merger, the log-likelihood of the

merged model is calculated and the model with the maximum log-likelihood is

chosen. We then move on to considering the merges possible with the new model. For the algorithm, this means that we approach state-merging in two stages:

1. Merge using maximum merge scores and log-likelihoods

2. In the model generated by stage 1, look for multiple states associated with

the same syllable and check for all-common parents. If this is the case, merge

the states.
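To make Eq. 2.4 concrete, the following is a minimal sketch; the posterior probabilities $p_{iA}$ and $p_{iB}$ are assumed to have been computed beforehand with the forward-backward algorithm.

```python
def merge_score(p_a, p_b):
    """Eq. 2.4: fraction of posterior mass at steps where both states are possible."""
    num = sum(pa + pb for pa, pb in zip(p_a, p_b) if pa > 0 and pb > 0)
    den = sum(pa + pb for pa, pb in zip(p_a, p_b))
    return num / den if den > 0 else 0.0

# Step 1 contributes only to the denominator, since p_a vanishes there.
print(merge_score([0.6, 0.0, 0.5], [0.4, 0.2, 0.5]))  # 2.0 / 2.2 ≈ 0.909
```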

2.4.8 Demonstration of grid search with a toy model

We first consider the inference of a POMM given an artificially generated dataset

that only contains sequences AB and CBA in the ratio 3 : 1. Fig. 2.13 details

the inference procedure. The iterative inference procedure starts with a Markov

model on the grid (1,1,1) with one state assigned to each of the syllables A, B and

C. The change in log-likelihood and sequence completeness with the addition of

a state to each of the syllables in turn can be seen in the last row of the figure.

Only the addition that maximizes the log-likelihood and sequence completeness

is retained. For this small artificially generated set of sequences, with the total

addition of two states to the Markov model for syllables A and B, the true model

is inferred exactly (Pc = 1). In the case of more complex datasets, grid search is

continued until the stopping criteria outlined in Sec. 2.4.4 are met.


Figure 2.13. Inference of a POMM given the artificially generated set of sequences [AB (75%), CBA (25%)]. The increases in log-likelihood and sequence completeness, as well as the model obtained at every node on the grid, are shown. [Panels trace the grid search from the Markov node (NA=1, NB=1, NC=1) through (NA=2, NB=1, NC=1) to (NA=2, NB=2, NC=1), plotting log-likelihood and sequence completeness against the number of states, including the start state S.]


Chapter 3

Comparison of Syntactic Structures

Probabilistic models of song syntax are compact representations of rules for pos-

sible arrangements of vocal elements. An accurate model of the syntax is useful

on several fronts. Many neural and behavioral studies use tutoring paradigms in

which young birds are exposed to artificial songs in order to understand how se-

quences are learned. Having a good model of the syntax of different birds, both

conspecific and heterospecific, provides greater control over the selection and al-

teration of elements to be tutored. This is true for studies of mate preferences in

songbirds as well. Changes in syntax as a result of manipulations such as neu-

ral lesions or altered auditory feedback can also be tracked to identify the role

of different brain regions or peripheral mechanisms involved in the learning and

generation of sequences. Finally, if there exists a mapping between statistical and

neural models, then the topology of the syntax model could be used to infer the

topology of the underlying neural network.

We test if the method of inferring the POMM described in Chapter 2 leads to

a good stochastic model for the syntax of a songbird, the Bengalese finch, and a

cetacean, the humpback whale. In the case of the Bengalese finch we have access

to song sequences from birds both before and after the disruption of auditory

feedback. This allows us to study changes in syntax caused by the disruption. It

seems plausible that the removal of an important feedback signal would lead to a loss of structure in the songs of the bird. We ask if this is seen as a

change in the syntactic category - POMM to Markov.


3.1 The Bengalese finch

The Bengalese or society finch (Lonchura striata domestica) has long been the

subject of investigations into learning and song generation in songbirds. These

birds are age-limited or close-ended learners - i.e., there exists a developmental time

window during which the acquisition of song by learning occurs. Experimental

studies have shown that the adult Bengalese finch relies on auditory feedback

for the maintenance of its song [59, 60]. Supporting this idea, a recent study in

these birds revealed stimulus-specific adaptation of auditory responses in HVC - a key sensory-motor nucleus implicated in song control - as well as

an immediate decrease in syllable repetitions after deafening [51]. The analysis

of Bengalese finch songs in this study focused on syllable repetitions alone. It

has also been shown in a previous study that the song of the Bengalese finch is a

POMM [6]. We construct the syntax for the songs of six Bengalese finches both

before and after deafening. We first confirm earlier results that the syntax of

Bengalese finch song is consistent with a POMM. We then compare the syntax

before and after deafening. We find that the many-to-one mapping between states

and syllables in the POMM is dependent on the presence of auditory feedback for

some of the Bengalese finches studied, while for others this is not the case. The

results suggest a more complex relationship between many-to-one mapping and

auditory feedback than an on/off dependence.

3.1.1 Description of data

The data used in this analysis is part of a larger dataset collected for another

study [51]. The authors have made the data freely available in the public domain.¹ The data that we consider from this set are transcriptions of song sequences from six male adult Bengalese finches - labeled bfa14, bfa16, bfa19, bfa7, o10bk90,

and o46bk78. We will refer to these birds by labels Bird 1, Bird 2, ..., Bird 6

henceforth. Song transcriptions are available for the six birds before, and soon

after deafening (2-3 days post-deafening) by bilateral cochlear removal. Each of

the birds sings songs with about 7-9 unique syllables.

¹As of May 2016, the data is freely available and can be downloaded from http://users.phys.psu.edu/djin/SharedData/KrisBouchard/


Table 3.1. Bengalese finch song statistics

Bird ID    No. of songs                      No. of syllables
           pre-deafening   post-deafening    pre-deafening   post-deafening
Bird 1     160             271               687             744
Bird 2     57              330               578             1053
Bird 3     20              30                391             358
Bird 4     108             272               653             892
Bird 5     59              143               354             423
Bird 6     69              83                660             841

3.1.2 Identification of songs from recording transcriptions

In the dataset, syllables are labelled a through l, and x through z, as well as

with symbols 0 and - for unknown syllable identities. A Bengalese Finch song

bout typically begins with short vocalizations called introductory notes. The role

of introductory notes has long been open to speculation. In zebra finches it has

been observed that the number of introductory notes before a bout correlates with

the time since the previous bout suggesting that introductory notes reflect neural

preparation before initiation of a learned motor sequence [61]. Introductory notes

are therefore not considered part of a song sequence. In the dataset, these notes are

i, j, k, and l. We define songs in the transcribed sequences as segments that appear

between introductory notes. In keeping with convention, introductory notes are

not retained as part of the song. However, it must be pointed out that other studies

consider isolated introductory notes that appear between song bouts to be part

of the song [59]. Transcriptions that contain the unknown symbols 0 and - are

not considered for the analysis. With the removal of these syllables, the number

of sequences obtained for each individual before and after deafening is shown in Table 3.1. As seen in Table 3.1, the number of sequences obtained pre-deafening

in all cases is much smaller than the number available post-deafening. Since we

choose to segment sequences on the basis of the appearance of introductory notes,

it is possible that there were simply more introductory notes in the post-deafening

songs, leading to a larger number of segments. Table 3.1 also lists the total number

of syllables available from each recording. If the numbers pre- and post-deafening

are comparable, then the larger number of sequences obtained post-deafening could


be the result of more introductory notes. However, given the numbers, it is hard to

make this inference. The difference in number of samples before and after deafening

will influence any comparison of sequences if not explicitly accounted for. We must therefore take care to account for this difference in our calculations.

Figure 3.1. Song syntax of Bengalese finch, Bird 1, before and after deafening. [State transition diagrams: (a) pre-deafening; (b) post-deafening.]

3.2 Syntax of Bengalese finch song

Songs of the Bengalese finch consist of variable sequences of discrete syllables.

The variability could be in the number of syllable types sung in each sequence,

the length of the sequence, the number of times a syllable is repeated, and so on.

The syllable sequences follow probabilistic rules, and have been shown to be well-

described by the Partially Observable Markov Model with Adaptation (POMMA)

[6]. However, the inference of the model in that study involved a considerable

amount of manual fine-tuning. We use the methods developed in Chapter 2 to

derive the song syntax of the non-repeat structure of six Bengalese finches before


Figure 3.2. Song syntax of Bengalese finch, Bird 2, before and after deafening. [State transition diagrams: (a) pre-deafening; (b) post-deafening.]

and after deafening. The repeats in the songs of these six birds were the sole focus

of the main study [51] from which this data is taken, and are consistent with a

model of repeat adaptation which will be visited in Chapter 4 when we exclusively

discuss the repeat structure of syllables in various songbird songs. References to syntax in the rest of this chapter therefore concern the non-repeat structure of song sequences.

We confirm that the POMM is a good model of the non-repeat structure of

Bengalese finch songs based on the syntax inferred for the six birds before deafening

using grid search in combination with the Baum-Welch algorithm - Figs. 3.1(a) -

3.6(a). In the state transition diagrams for the syntax, the pink bubble is the


Figure 3.3. Song syntax of Bengalese finch, Bird 3, before and after deafening. [State transition diagrams: (a) pre-deafening; (b) post-deafening.]

start/end state, while all blue bubbles represent states that have transitions to the

end state. Each state is indexed by a number shown in the bubble and is associated

with the syllable indicated in brackets after the number. As far as the many-to-one

mapping is concerned, five of the six birds (all except Bird 4) have at least one syllable that is assigned to multiple states, indicating deviation from a Markov model, i.e.,

Bengalese finch song is more complex than what we would expect from a Markov

model. The largest deviation from a Markov model is through the addition of

states to three of the syllables (Bird 6). We also find diverse state transition

structures. Some are simple, mainly composed of deterministic transitions with

occasional stochastic branches (Bird 3, for example), while others are composed of

branched transitions from most states (Bird 1, for example). Given this assortment

of structures, we can conclude that there is no prototypical syntax for the songs


Figure 3.4. Song syntax of Bengalese finch, Bird 4, before and after deafening. [State transition diagrams: (a) pre-deafening; (b) post-deafening.]

of all Bengalese finches.

3.2.1 Changes in syntax caused by deafening

Experimental studies have shown that the Bengalese finch relies on auditory feed-

back for the maintenance of its song [59,60]. Disruption in auditory feedback does

not have the same effect on the songs of different species of songbirds. In zebra

finches, which are close-ended learners with fixed song, there is no immediate deterioration in the song [62, 63]: with bilateral cochlear removal, the crystallized song of the adult zebra finch deteriorated to retain only about 36% of the pre-surgery song syllables, and only after 16 weeks post-surgery. However, in

Bengalese finches that are also close-ended or age-limited learners, but with vari-


Figure 3.5. Song syntax of Bengalese finch, Bird 5, before and after deafening. [State transition diagrams: (a) pre-deafening; (b) post-deafening.]

able or stochastic song sequences, the change is immediate [59,60]. As far as neural

responses are concerned, when a singing Bengalese finch is presented with tran-

sient perturbations to auditory feedback, reliable decreases in HVC activity were

observed for short latencies [64]. This means that auditory signals have real-time

access to the song-generating circuitry. The abstract states in a POMM must have

neural correlates. One possibility is that each state is a chain network of neurons

as in the branching synfire chain model discussed in Sec. 1.2.2, and a many-to-one

mapping means that the same syllable is redundantly encoded by multiple chains. The

other possibility is that multiple states associated with the same syllable represent


Figure 3.6. Song syntax of Bengalese finch, Bird 6, before and after deafening. [State transition diagrams: (a) pre-deafening; (b) post-deafening.]

different activity patterns in the same chain network. The differences in activity

could be driven by differing feedback, including auditory feedback. It is therefore

reasonable to speculate that auditory feedback may be necessary for the many-to-

one mapping in the syntax of the Bengalese finch and that the removal of auditory

feedback will drive changes in the structure of the POMM.

On inferring the syntax of song sequences after deafening for the same six

birds, we notice several significant changes (syntax seen in Figs. 3.1(b) - 3.6(b)).


Figure 3.7. On average, the state transition entropy increases and the sequence length decreases after deafening. [(a) State transition entropy post-deafening vs. pre-deafening; (b) mean sequence length post-deafening vs. pre-deafening.]

Firstly, there is an increase in the total number of possible transitions, and the

transitions appear to be more random. We can test this by comparing the state

transition entropy, as defined in Chapter 2, before and after deafening. Larger entropy values

after deafening indicate more random transitions. Secondly, the average length

of sequences is smaller after deafening. Both these can be seen in Fig. 3.7. One

possible cause for this could be an increase in how frequently an introductory note

is produced, since song sequences are defined to be the segments between two of

these notes.

Finally, some of the many-to-one mappings between the states and the sylla-

bles vanish after deafening. This is an indication of the syntax becoming more

Markovian. We would like to compare how Markovian the syntax is before and

after deafening. This can be done by computing sequence completeness under a

Markov model pre- and post-deafening - i.e., we infer Markovian transition prob-

abilities from the data and calculate the probabilities of all unique sequences in

the dataset based on these probabilities. The values obtained are displayed in Fig.

3.8. However, this is where we must use caution. A relatively higher sequence

completeness value does not necessarily indicate a more Markovian model. This

is based on our discussion in Chapter 2 where we determined that the sequence

completeness values obtained using models inferred from different numbers of se-

quences are not directly comparable. We must instead compare the p-values of the

sequence completeness obtained under these Markov models. To obtain p-values,


Figure 3.8. Sequence completeness of Bengalese finch song sequences under Markov models before and after deafening. Since the number of sequences is different pre- and post-deafening, direct comparisons of sequence completeness values should not be made. [Bar plot of sequence completeness for Birds 1-6, pre- and post-deafening.]

we first construct sequence completeness distributions using multiple splits of the

sequence set into fit and test sets and finding the completeness of the test set using

the probability distribution over sequences defined by the fit set as described in

Sec. 2.3.1.1. The sequence completeness under the Markov model should not fall below the lowest 5% of values of the sequence completeness distribution - or, equivalently, the p-value should be greater than or equal to 0.05 - if the syntax is to be considered Markovian. Based on this criterion, four out of the six birds studied have a syntax that is more Markovian after deafening (see Fig. 3.9 and Table 3.2).

3.2.2 Persistence of dominant transitions after deafening

As described in Sec. 2.4.6, the dominant transitions in any probabilistic finite state

model can be found by testing if each transition probability is greater than or equal

to the value we would expect if the values were drawn from a uniform distribution.

Since our goal is to have a concise representation of the syntax that also results in

a sequence completeness with p-value greater than 0.05, once we find the dominant

transitions in the syntax at different significance levels α, we select the syntax


Figure 3.9. p-values of sequence completeness under Markov models before and after deafening, based on the distributions generated from the data. A low p-value (p < 0.05) indicates that a Markov model is not a good representation of the sequence syntax. Each circle represents a single bird, and the lines connect p-values for the same bird pre- and post-deafening.

Table 3.2. Sequence completeness and p-values under Markov models

Bird ID    Before deafening                      After deafening
           Sequence completeness   p-value       Sequence completeness   p-value
Bird 1     0.83                    0.03          0.81                    0.95
Bird 2     0.36                    0.001         0.90                    0.003
Bird 3     0.33                    0.002         0.24                    0.24
Bird 4     0.67                    0.04          0.79                    0.84
Bird 5     0.40                    0             0.82                    0.003
Bird 6     0.23                    0             0.35                    0.64

structure at the α for which the p-value condition is satisfied. As an illustration of this, Fig. 3.11 shows the dominant transitions for Bird 1 after deafening at significance levels α ranging from 0.1 to 1. As seen from the p-values of sequence completeness displayed in Fig. 3.10b, the smallest α for which the p-value ≥ 0.05 is α = 0.7. Extracting this structure gives us a way of visualizing whether the most dominant transitions remain unchanged as a result of deafening, as seen


[Figure 3.10 here: (a) sequence completeness and (b) p-value vs. significance level α, pre- and post-deafening.]
Figure 3.10. Sequence completeness and corresponding p-values under syntax models with only the dominant transitions retained at significance levels α.

Table 3.3. Sequence completeness and p-values for pre-deafening sequences based on the syntax post-deafening

Bird ID | Sequence completeness | p-value
Bird 1  | 0.19 | 0
Bird 2  | 0.88 | 0.99
Bird 3  | 0.06 | 0
Bird 4  | 0.38 | 0
Bird 6  | 0.11 | 0

in Fig. 3.11 for the syntax of Bird 1 before and after deafening.
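The selection procedure can be sketched as follows, reusing the completeness helpers from the earlier sketch. The text above does not spell out the statistical test, so the one-sided binomial test against the uniform expectation below is one plausible reading of the criterion of Sec. 2.4.6 (at α = 1 every observed transition is retained, matching the range of α used above); scipy is an assumed dependency.

```python
from collections import Counter
from scipy.stats import binomtest

def dominant_transitions(trans_counts, alpha):
    """Retain transition a->b when its count is significantly high relative to a
    uniform choice among state a's outgoing transitions (one-sided binomial test
    at level alpha). alpha = 1 retains every observed transition."""
    totals, out_degree = Counter(), Counter()
    for (a, _), k in trans_counts.items():
        totals[a] += k
        out_degree[a] += 1
    kept = {
        (a, b): k for (a, b), k in trans_counts.items()
        if binomtest(k, totals[a], 1.0 / out_degree[a],
                     alternative='greater').pvalue <= alpha
    }
    kept_totals = Counter()
    for (a, _), k in kept.items():
        kept_totals[a] += k
    return {(a, b): k / kept_totals[a] for (a, b), k in kept.items()}

def select_syntax(trans_counts, sequences,
                  alphas=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Smallest alpha whose pruned syntax keeps a completeness p-value >= 0.05."""
    for alpha in alphas:
        pruned = dominant_transitions(trans_counts, alpha)
        if completeness_pvalue(sequences, completeness(sequences, pruned)) >= 0.05:
            return alpha, pruned
    return 1.0, dominant_transitions(trans_counts, 1.0)
```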

Based on the diversity of inferred syntax structures, we conclude that normal

Bengalese finch song syntax with its many-to-one mappings between states and

syllables is more complex than a Markov model. Deafening reduced this complexity

in most of the birds studied, while the complexity persisted in others. These

observations suggest that auditory feedback can induce complexity in the Bengalese

finch song syntax, but is not sufficient to explain the complexity entirely; there must also be contributions from other intrinsic factors. We suggest that the

Bengalese finch song syntax is encoded in the interplay between auditory feedback

and the intrinsic song-generating circuitry.


[Figure 3.11 here: state transition diagrams for Bird 1, pre-deafening (left, nine states including two states mapping to syllable C) and post-deafening (right, eight states), with transition probabilities on the edges and S the start state.]
Figure 3.11. Dominant transitions in the syntax after deafening compared with the syntax before deafening for Bird 1. The most frequent sequences are retained with lower probabilities after deafening, but many new transitions are introduced. The many-to-one mapping for syllable C is also lost after deafening.

3.3 Humpback whale

Among cetaceans, the humpback whale (Megaptera novaeangliae), found in all oceans of the world, is capable of complex vocalizations called whale song [65,66]. Whale song is composed of an arrangement of mostly low-frequency vocalizations (10 Hz-2 kHz) sung continuously for as long as 30 minutes [65], with a typical length of 10-20 minutes. The longest songs recorded in the animal kingdom have been those of the humpback whale. As far as is known, just as in songbirds,


[Figure 3.12 here.]
Figure 3.12. Transcription of part of a humpback whale song. The units in the song are ‘BA’, ‘BW’, ‘GR’, ‘HMM’, ‘HSH’, ‘HSQ’, ‘LBA’, ‘LGO’, ‘MM’, ‘RA’.

the singers are always males. The exact purpose of song in these cetaceans is not agreed upon, but most songs have been recorded during the breeding season [67].

Humpback whales are migratory, travelling annually between high-latitude waters in summer and low-latitude waters in winter. Whales within a population sing roughly the same song (a similar arrangement of units) [65,66]. The song that the population sings can evolve from year to year, but again all members of the population sing it. Transmission of a song type between different populations in the western and central South Pacific over an 11-year period has been demonstrated [68]. This is evidence of vocal learning among humpback whales. There is definite

structure to the arrangement of acoustic ‘units’ (syllables in the songbird literature)

in whale song. It has been demonstrated that independent, identically distributed (iid) and first-order Markov models fail to capture the full structure of humpback

whale songs [69]. This analysis was from an information theoretic perspective with

the rejection of simple models based on discrepancies between expected and observed entropy values. If the syntax is not Markovian, we would like to test whether it is a POMM. However, the methods we have developed are not suitable for the kind

of song sequence data available for humpback whales. The following sections give

a description of the available datasets and explain the need for developing better

methods of syntax inference than what we have for birdsong. Some preliminary


analyses of alternatives are also described.

3.3.1 Description of data

The dataset consists of transcriptions from recordings of a population of 11 humpback whales recorded off the coast of Eastern Australia in 2003 (recording by Michael Noad, Cetacean Ecology and Acoustics Laboratory, The University of Queensland; transcription by Luca Lamoni and Luke Rendell, University of St Andrews). There is one continuous recording available for each whale, lasting between 15 and 30 minutes. In the transcriptions, the basic element of the sequence is the unit or syllable. In terms of the number of units, the longest song has 2167 units. There are 10 kinds of units in the song: ‘BA’, ‘BW’, ‘GR’, ‘HMM’, ‘HSH’, ‘HSQ’, ‘LBA’, ‘LGO’, ‘MM’, ‘RA’. An example of the transcription for part of one whale’s song is shown in Fig. 3.12.

3.3.2 Challenges in inferring the syntax of humpback whale song

With animal vocal sequences that are hard to record or very long - such as hump-

back whale songs - the dataset often contains only one or a few sequences. If we only

have one long sequence to train the POMM on, the grid search method we dis-

cussed in Chapter 2 cannot be used since it depends on the calculation of bounds

on log-likelihood and sequence completeness using multiple observed sequences for

the same individual. Could we use sub-sequences generated from the long sequence

to construct bounds on the maximum possible log-likelihood of the long sequence

given a syntax model? Is there a choice of segmentation length for which we could

consider the corresponding sub-sequences to be independent? Such a possibility

would be a starting point in the syntax analysis of long sequences.

Entropy calculations for long sequences. Our goal is to relate the entropy of

sub-sequences obtained from the long sequence to the log-likelihood of the long

sequence. To do so we first need to understand entropy calculations using data

with correlations, and how the choice of length of the sub-sequence affects these

calculations. We first chose a humpback whale song (without syllable repeats), and

calculated the entropy of sets of sub-sequences. Each set contained sub-sequences

of a particular length. The entropy shows an initial increase with segmentation

length, peaks, and then begins to decrease. We can understand this in terms of the


[Figure 3.13 here: number of unique sub-sequences and entropy of sub-sequences vs. segmentation length for whale 031022, system size = 332.]
Figure 3.13. Dependence of sequence entropy on the number of unique sub-sequences obtained using different segmentation lengths for a 300-unit long humpback whale song.

number of unique sub-sequences that the segmentation results in. As can be seen

in Fig. 3.13, the entropy values follow the increase and decrease in the number of

unique sub-sequences resulting from different segmentation lengths.
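A minimal sketch of this calculation (the function name is ours; overlapping segmentation is assumed, as in Fig. 3.15):

```python
import math
from collections import Counter

def subsequence_entropy(seq, k, overlapping=True):
    """Shannon entropy (nats) of the length-k sub-sequences of a symbol sequence."""
    step = 1 if overlapping else k
    subs = [tuple(seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]
    counts = Counter(subs)
    n = len(subs)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Sweep segmentation lengths to reproduce the rise-and-fall seen in Fig. 3.13:
# entropies = [subsequence_entropy(song, k) for k in range(1, 51)]
```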

However, the curves seen in the figure could be dependent on the length of

the long sequence, i.e., the system size S. If S is large, does this shift the location

of the peak? We would expect that the number of unique sequences and thereby

the entropy would begin to decrease only at larger segmentation lengths, i.e., the peak

would occur at larger segmentation lengths. Before we use observed data, we would

like to first understand this dependence using randomly generated sequences of

different lengths. The variation in entropy with segmentation length k is studied.

Considering a randomly generated sequence of size $S$: for small $k$, $n_k \ll S$, where $n_k$ is the number of possible sub-sequences of length $k$. Since the sub-sequences are independent, the probability of occurrence of each is simply $1/n_k$. Hence the entropy of the sub-sequences is $E = -\sum \frac{1}{n_k} \log\left(\frac{1}{n_k}\right)$, with $n_k$ terms in the sum, therefore $E = \log(n_k)$. Since the number of possible combinations of length $k$ is much smaller than the system size, we can get a good estimate of the distribution of sub-sequences of length $k$, and thereby the entropy, for small $k$. This is seen as the agreement with the lower asymptote in Fig. 3.14. For


[Figure 3.14 here: entropy vs. segmentation length for randomly generated sequences of system size S = 1000, 5000 and 10000, with upper and lower asymptotes.]
Figure 3.14. Dependence of sequence entropy on segmentation length and system size using randomly generated sequences. The asymptotes for large and small segmentation lengths are displayed.

large $k$, $n_k \gg S$, i.e., the number of possible sub-sequences of length $k$ is much greater than the system size. Hence we expect to see each combination that does occur only once. The number of sub-sequences that can occur is approximately $S/k$, so the entropy in this limit is $E = \log(S/k)$, the upper asymptote drawn for different system sizes in Fig. 3.14. The crossover between the two asymptotes occurs at a segmentation length $k$ such that $n_k \approx S/k$. It is now reasonable to ask how the entropy of the actual humpback whale song sub-sequences compares to these asymptotes. We repeated the analysis above for real song sequences using both non-overlapping and overlapping segments of length $k$. Results are shown in Fig. 3.15. As we expect, the sequences are far from random, as seen by the deviation from the lower asymptote. Since for large $k$ the bottleneck is the system size, we see good agreement with the upper asymptote - i.e., the expectation for a random sequence. This is because at large segmentation lengths the number of unique sub-sequences is so small that the segments are effectively indistinguishable from random ones.
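The asymptotes and their crossover can be evaluated directly, assuming, as above, that $n_k = A^k$ for an alphabet of $A$ units and that the upper asymptote is $\log(S/k)$:

```python
import math

def entropy_asymptotes(alphabet_size, system_size, k):
    """Lower asymptote log(n_k) = k*log(A) and upper asymptote log(S/k)."""
    return k * math.log(alphabet_size), math.log(system_size / k)

def crossover_length(alphabet_size, system_size):
    """Smallest k with n_k >= S/k, approximately locating the entropy peak."""
    k = 1
    while alphabet_size ** k < system_size / k:
        k += 1
    return k

print(crossover_length(10, 332))  # e.g. the whale song of Fig. 3.13
```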

3.3.3 Markov model of humpback whale themes

Humpback whale sequences are often referred to as possessing a hierarchical struc-

ture [65]. Units are arranged into phrases, phrases into themes, and themes into

songs. The analysis of syntax in humpback whale songs in studies [68, 69] assumes

Markovian transitions between themes with a theme defined to be a segment of


[Figure 3.15 here: entropy vs. segmentation length for whale 031022 (system size = 332), non-overlapping and overlapping segments, with lower and upper asymptotes.]
Figure 3.15. Entropy of a segmented humpback whale sequence depends on the segmentation length k as well as the total system size S. For small k, the entropy calculations do not agree with those expected from random sequences, demonstrating that the segments have non-random structure. However, for large k, the agreement with the upper asymptote implies that they cannot be used to infer the structure of humpback whale songs. This is due to large segmentation lengths leading to very few unique sub-sequences.

the song in which specific units repeat in combination until a new combination of

units is sung.

We suggest redefining the basic element of the humpback whale song to be a

’repeat unit’ instead of a unit. A repeat unit is a note or combination of notes

that repeat as a whole in the song. For example, in the sequence MM BA MM BA

MM LBA MM LBA MM RA RA RA RA, the repeat units are MMBA, MMLBA,

MM and RA. The number of consecutive repeats of a repeat unit is variable at

different points in the song. This number is not the same even when the repeat unit

appears in the same context at different points in the song. Hence it is reasonable to assume that an exact number of consecutive repeats of the repeat unit is not the elementary unit of the song that is learned and whose form is conserved.

The transcription in Fig. 3.12 is reorganized in terms of repeat units in Fig. 3.16

with what would be the themes in the song highlighted with different colors.
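A greedy parse along these lines can be sketched as follows; we do not specify a parsing algorithm above, so the shortest-immediately-repeating-block rule and the max_len cap below are our own heuristics:

```python
def parse_repeat_units(units, max_len=4):
    """Parse a unit sequence into (repeat unit, repeat count) pairs, taking the
    shortest block that immediately repeats as the repeat unit at each position."""
    parsed, i = [], 0
    while i < len(units):
        for size in range(1, max_len + 1):
            block = tuple(units[i:i + size])
            if len(block) == size and tuple(units[i + size:i + 2 * size]) == block:
                break
        else:
            block, size = tuple(units[i:i + 1]), 1
        count = 0
        while tuple(units[i:i + size]) == block:
            count += 1
            i += size
        parsed.append((block, count))
    return parsed

song = "MM BA MM BA MM LBA MM LBA MM RA RA RA RA".split()
print(parse_repeat_units(song))
# [(('MM','BA'), 2), (('MM','LBA'), 2), (('MM',), 1), (('RA',), 4)]
```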

We can now consider each theme to be an independent sequence and consider

the construction of finite state models for the themes. Although this does not


[Figure 3.16 here.]
Figure 3.16. Transcription of part of a humpback whale song with repeat units and themes highlighted.

serve the purpose of obtaining the full syntax for humpback whale songs, it is a

good segue into more complicated models. The structure of the Markov model for

themes and the match of n-gram distributions for the observed sequences and those

generated from the model are shown in Fig. 3.17. Preliminarily, the n-gram distribution comparisons suggest that the syntax at the level of themes is more complex than a Markov model.


[Figure 3.17 here: (a) Markov model of themes, with states S and 1(BA)-10(RA) and transition probabilities on the edges; (b) 2- to 7-gram probability distributions for observed and model-generated sequences.]
Figure 3.17. (a) Markov model of themes in the songs of a population of humpback whales. (b) n-gram distribution matches between observed sequences and sequences generated using the Markov model.


Chapter 4
Repeat Structure in Vocal Sequences

Actions or gestures in behavioral sequences are often repeated several times. Movements such as walking, clapping, breathing and blinking are all examples of such behavior, in which one can identify a basic motor gesture that is stereotypically and repetitively generated to form a sequence. Repetitive behavior can also be associated with certain pathological conditions. In speech disorders like stuttering,

palilalia, echolalia and verbigeration, syllables, words, or sometimes whole phrases

are repeated [70]. Tourette’s syndrome is characterized by vocal tics including

repetitive throat-clearing, sniffing, and grunting, as well as motor tics such as

repetitive head-bobbing, shoulder-jerking and blinking [71]. Repetitions provide

a good system to understand features of a behavior since it becomes possible to

compare an action or gesture with what are essentially copies of itself.

Many syllables in vocal sequences are repeated multiple times before a new

kind of syllable is vocalized. We refer to a sequence of repetitions of the same

kind of syllable as a phrase or a trill. The number of times a syllable is repeated

varies from rendition to rendition of the phrase. In this chapter we discuss the

statistics of syllable repetitions in the songs of canaries, swamp sparrows and Ben-

galese finches. In canaries and sparrows, various studies have shown that there is

an innate preference for a trilled syntax with each song sequence composed only

of repeats of a single syllable type [72–75]. In Bengalese finches however, song

sequences contain a mix of repeats and single occurrences of different kinds of

61

Page 76: STATISTICAL INFERENCE OF SYNTAX FROM VOCAL SEQUENCES …

syllables. We study the dependence of repetition statistics on the different time

scales involved in the production of vocalizations. Every instance of a syllable, for

example, shows remarkably little variation in duration. Timing therefore has to be critical in the production of stereotyped motor commands. Constraints on duration

may be imposed by the specifics of neural control, or by peripheral mechanisms,

or by both. We study the relationship between syllable duration and the most

probable number of repetitions of the syllable and show that there is a precise

inverse relationship between the two for canaries and swamp sparrows, while this

is not the case for Bengalese finches.

4.1 Syllable repetitions in multiple species

Syllable repetitions or trills have been studied previously, although the term ‘trills’

may specifically refer to note/syllable repetition at rapid rates. A phrase of sylla-

ble repetitions has also been called a ‘tour’ in reference to green finch and canary

songs [76]. Trilled songs have been previously studied in the context of perfor-

mance tradeoffs between maximal frequency bandwidths and syllable repetition

rates in 34 species of the Emberizid family [77] including birds such as the swamp

sparrow, song sparrow, dark-eyed junco, canary and northern cardinal. High rep-

etition rates are associated with syllables that have narrow frequency bandwidths.

More recently it has been shown that the performance quality of trills can reflect

the age of the singer in the nightingale [78], with older males rendering trills closer

to the performance limit. The songs of multiple songbirds have been observed to

consist of variable syllable repetitions - the Black-capped Chickadee [79], Mountain

Chickadee [80] and Mexican Chickadee [81] vary the number of syllable repeats in

their song. This has been observed in a suboscine passerine as well - the Flam-

mulated Attila (Attila flammulatus), a Neo-tropical tyrant flycatcher [82]. It is

believed that suboscine songs are not learned and do not require auditory feed-

back for normal vocal behavior [33]. The presence of variable repeats in suboscine

song is therefore particularly interesting.

In zebra finches, it has been reported that syllable repetitions can be present in

the songs of non-tutored or isolate birds [83]. This seems to suggest that repetitions

may be a feature of the song that is not learned in the same manner as other

features of the song. Further, in a separate experiment that was part of the same

62

Page 77: STATISTICAL INFERENCE OF SYNTAX FROM VOCAL SEQUENCES …

study, when the offspring of birds with songs that contained no syllable repetitions

were tutored by birds with repeats in their song, many developed repeats in their

song as well. However, the repeated syllables were not necessarily the ones repeated

by the tutors. The researchers concluded that the learned aspect was the tendency

to repeat syllables and not the exact syllable repetitions.

4.2 Distribution of the number of syllable repetitions

The number of repetitions of a syllable varies from rendition to rendition of a

phrase. The distribution of the number of repeats for a syllable could be monoton-

ically decreasing or peaked at repeat numbers greater than 1. If the probability of the self-transition - a syllable transitioning into itself - is a constant, and the generation of repeats is assumed to be Markovian, then the distribution of repeats should decrease monotonically [6]. However,

canaries are non-monotonic, mostly unimodal, and sometimes bimodal [6, 51]. As

discussed in the text accompanying Eq. 1.1 in Sec. 1.3, a peaked repeat distribution

is indicative of a non-Markovian song syntax. Both monotonic and peaked repeat

distributions can be explained by considering the repeat probability of a syllable -

the probability of a syllable transitioning into itself - to be adaptive, i.e., changing

as a function of the number of repetitions. A POMM for the non-repeat structure of sequences combined with adaptation for the repeat structure is referred to as a POMMA [6]. Typically, for any sequence with repetitions of syllables, a POMM is constructed using sequences with the repeats removed. So for a sequence AAABBCDDD, we consider the non-repeat sequence ABCD to construct the POMM.

The repeat distributions for each of the syllables are then considered separately

and fitted with a model of adaptation.
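This decomposition is straightforward to express in code (a minimal sketch; the function names are ours):

```python
from collections import defaultdict
from itertools import groupby

def split_repeat_structure(sequence):
    """Split 'AAABBCDDD' into the non-repeat sequence 'ABCD' and per-run
    repeat counts, following the decomposition described above."""
    runs = [(syll, len(list(group))) for syll, group in groupby(sequence)]
    non_repeat = "".join(syll for syll, _ in runs)
    return non_repeat, runs

def repeat_distributions(sequences):
    """Collect the observed repeat-count samples for each syllable type."""
    counts = defaultdict(list)
    for seq in sequences:
        for syll, n in split_repeat_structure(seq)[1]:
            counts[syll].append(n)
    return counts

print(split_repeat_structure("AAABBCDDD"))
# ('ABCD', [('A', 3), ('B', 2), ('C', 1), ('D', 3)])
```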

4.2.1 Sigmoidal model of adaptation

A recent computational model suggests that long syllable repetitions are sustained

by an excitatory input to the neurons encoding the repeating syllables [51]. The

input decays over time, with a time constant that is independent of the syllable

duration. In this model, the initial probability of the syllable repeating is high due

to the strong input. However, with continued repetition, this probability decreases


[Figure 4.1 here: repeat distributions P(n|T) vs. number of repeats n for five syllables (A-E) of one canary, with model fits.]
Figure 4.1. Sample repeat distributions of an individual canary’s syllables and fits based on the sigmoidal model of adaptation.

as the input decays. This process produces peaked repeat number distributions

with long tails that are typically observed for long repeating syllables. The central

assumption in this model is that given that a syllable has repeated n successive

times (n ≥ 1), the probability that the syllable occurs an (n + 1)th time is given

by

$$p(n) = 1 - \frac{c}{1 + a b^{n}} \qquad (4.1)$$

The likelihood of such an additional repeat decreases with $n$ in a sigmoidal manner. The parameters $a$ and $b$ quantify the location of the drop of the sigmoid and its shape. The model builds in the effect of the syllable duration $T$ on the location of the drop in the sigmoid by setting

$$b = \exp\left(-\frac{\nu T}{\tau}\right) \qquad (4.2)$$

where $\nu$ and $\tau$ are model parameters. The drop in the sigmoidal curve occurs at $n_{\mathrm{threshold}} \sim \frac{\tau}{\nu T}$. The plateaus of the sigmoid at $n \lessgtr n_{\mathrm{threshold}}$ are determined by the parameters $a$ and $c$. The probability of a syllable of duration $T$ occurring exactly $n$ times is given by

$$P(n|T) = \bigl(1 - p(n)\bigr) \prod_{s=1}^{n-1} p(s) = \frac{c}{1 + a \exp\left(-\frac{\nu n T}{\tau}\right)} \prod_{s=1}^{n-1} \left[1 - \frac{c}{1 + a \exp\left(-\frac{\nu s T}{\tau}\right)}\right]$$

In the study, the adapting excitatory input was considered to be auditory feedback.

However, the mathematical model is relevant in the case of any other kind of

adapting feedback as well. Fig. 4.1 shows the fits of this model to the repeat

distributions for a few of an individual canary’s syllables. One of the predictions of


the model is that the most probable number of repeats, or the peak repeat number $n_p$, of a syllable should be inversely proportional to the duration $T$ of the syllable:

$$n_p = c \left(\frac{1}{T}\right) \qquad (4.3)$$
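The model is compact enough to evaluate directly. The sketch below, with purely hypothetical parameter values, illustrates the prediction: the peak of P(n|T) moves to smaller n as T grows.

```python
import math

def p_repeat(n, T, a, c, nu, tau):
    """Adaptive self-transition probability p(n) = 1 - c/(1 + a*exp(-nu*n*T/tau))."""
    return 1.0 - c / (1.0 + a * math.exp(-nu * n * T / tau))

def repeat_number_pmf(T, a, c, nu, tau, n_max=200):
    """P(n|T): probability of exactly n repeats under the sigmoidal model."""
    pmf, survive = [], 1.0  # survive = probability of the first n-1 repeats
    for n in range(1, n_max + 1):
        p = p_repeat(n, T, a, c, nu, tau)
        pmf.append(survive * (1.0 - p))  # stop after the nth rendition
        survive *= p
    return pmf

# Hypothetical parameters; syllable durations in seconds.
for T in (0.05, 0.10, 0.20):
    pmf = repeat_number_pmf(T, a=20.0, c=0.9, nu=1.0, tau=0.5)
    print(T, 1 + pmf.index(max(pmf)))  # mode decreases as T increases
```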

4.3 Evidence of inverse relationship between syllable

duration and most probable repeat number

The prediction of an inverse relationship between the syllable duration and the most probable number of repeats of a syllable is a strong one. If true, it would imply the existence of fundamental biological constraints on the production of syllable repetitions. Further, if it holds across species, then we have identified a behavior that is likely to be conserved across multiple species and must be studied further experimentally, since it is possible that the neural mechanism behind it is conserved across these species too. We test whether the inverse relationship is observed for the songs of the canary, the swamp sparrow and the Bengalese finch. We do so by considering the fit

$$n_p = c\, T^{\alpha} \qquad (4.4)$$

where $\alpha \approx -1$ signifies an inverse relationship.
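Such a fit is a linear regression on the log-log scale. A minimal sketch, with fabricated illustrative numbers rather than data from this study:

```python
import math

def fit_power_law(durations, peak_repeats):
    """Least-squares fit of log(n_p) = log(c) + alpha*log(T); returns (c, alpha)."""
    xs = [math.log(T) for T in durations]
    ys = [math.log(n) for n in peak_repeats]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    c = math.exp(my - alpha * mx)
    return c, alpha

# Points lying exactly on n_p = 1.8 / T recover alpha = -1.
print(fit_power_law([0.05, 0.1, 0.2, 0.4], [36, 18, 9, 4.5]))
```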

4.3.1 Swamp sparrow

Swamp sparrow recordings of six birds were obtained from Dana Moseley and Jeffrey Podos, University of Massachusetts, Amherst. All songs were manually transcribed by us. Since the analysis is based on constructing distributions of the number of repetitions of a syllable over different renditions, we require a large number of instances of trills of the same syllable. Hence, only syllables appearing in at least 50 trills of length equal to the peak repeat number were considered (Male 02 - syllable 1 & syllable 2; Male 06 - syllable 2 & syllable 9; Male 21 - syllable 7 & syllable 8). The data from the six birds were combined for the analysis. Syllable duration is calculated by dividing the total duration of the trill by the number of syllables in the trill. The most probable number of repetitions for each syllable is calculated from the repeat distributions, and the average durations of syllables are


[Figure 4.2 here: log-log plot of the most probable number of repeats vs. syllable duration (ms), with fit α = −0.97 ± 0.06, c = 1.83 s, alongside a linear-scale plot.]
Figure 4.2. The most probable or modal number of repeats as a function of syllable duration for six syllables from a population of six swamp sparrows. Song recordings from Dana Moseley and Jeffrey Podos, University of Massachusetts, Amherst.

calculated using 20 instances of each syllable. A linear fit on the logarithmic scale

is considered. An inverse relationship (α = −0.97 ± 0.06) does exist between syllable duration and the most probable repeat number for the swamp sparrow (population of six birds), as seen in Fig. 4.2.

4.3.2 Canary

Canary songs are composed of repetitions of syllables separated by brief silent intervals. A set of repetitions of a syllable is usually referred to as a phrase or a tour. Every phrase in a canary’s song contains only repetitions of the same syllable. Canary recordings and transcriptions for six birds were obtained from Jeffrey Markowitz and Timothy Gardner, Boston University. Each canary sings between 20 and 25 syllable types, which means that we can test the inverse relationship for each individual bird. The relationship is remarkably exact for each of the six birds studied, as seen in Fig. 4.3.

For the sake of completeness, we would like to check if the inverse relationship

is exact at the population level, i.e., if we consider the combined data of all six

birds. However, since the value of c is different for different birds, we can imagine

that the data from different birds might possibly lie on parallel lines, leading to a

spread in the data that would not be ideal for a good linear fit. One way around

the problem is to collapse the data.


[Figure 4.3 here: most probable repeat number vs. syllable duration (ms) for each bird, with fits: Bird 1, c = 1.535 s, α = −1.1 ± 0.14; Bird 2, c = 1.354 s, α = −1.1 ± 0.13; Bird 3, c = 1.410 s, α = −1.01 ± 0.12; Bird 4, c = 1.357 s, α = −1.01 ± 0.15; Bird 5, c = 1.563 s, α = −1.0 ± 0.15; Bird 6, c = 1.743 s, α = −1.03 ± 0.19.]
Figure 4.3. Inverse relationship between syllable duration and most probable repeat number for six individual canaries. Error bars show standard error. Canary recordings were obtained from Timothy Gardner and Jeffrey Markowitz, Boston University.


[Figure 4.4 here: data collapse of y − ⟨y⟩ vs. x − ⟨x⟩, with linear fits: mode, α = −1.07; mean, α = −1.13.]
Figure 4.4. The mean and modal numbers of repetitions show the inverse relationship with syllable duration for the canary population (all six birds). Here x = ln T and y = ln n_mode or y = ln n_mean.

We are trying to find the exponent and intercept of the linear fit of $\ln n_{\mathrm{mode}}$ or $\ln n_{\mathrm{mean}}$ against $\ln T$, where $T$ is the average syllable duration, $n_{\mathrm{mode}}$ is the most probable repeat number and $n_{\mathrm{mean}}$ is the mean repeat number. Consider $x = \ln T$ and $y = \ln n_{\mathrm{mode}}$ or $y = \ln n_{\mathrm{mean}}$. We can subtract the means from these variables to obtain the data collapse.

Plotting the data for both $n_{\mathrm{mode}}$ and $n_{\mathrm{mean}}$ using the transformed variables $y - \langle y \rangle$ and $x - \langle x \rangle$ in Fig. 4.4, we can see that the exponents of the linear fit are almost the same in both cases. Both the most probable number

and mean repeat number show an inverse relationship with syllable duration. We

also show that the difference between the two is not significant when analyzed

statistically. The null hypothesis for the analysis is that the slopes of the two

regression lines are equal. We obtain p=0.5465 (ANCOVA) and cannot reject the

null hypothesis. The slopes are therefore not significantly different. The differ-

ence between the slopes is 0.0565, and this lies in the 95% confidence interval [−0.1269, 0.2400]. If, for the sigmoidal model, the most probable and mean repeat numbers are proportional, then we should not in fact expect the slopes to be different. Even though an analytical expression for the mean is hard to obtain, we can numerically simulate repeat distributions following the sigmoidal model for


different parameter values and check whether the two are indeed proportional, and more strongly, almost equal. We found that this is indeed the case.
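Reusing repeat_number_pmf from the sketch in Sec. 4.2.1 (parameter values again hypothetical), the comparison takes a few lines:

```python
def mode_and_mean(pmf):
    """Modal and mean repeat numbers of a distribution over n = 1, 2, ..."""
    mode = 1 + pmf.index(max(pmf))
    mean = sum(n * p for n, p in enumerate(pmf, start=1)) / sum(pmf)
    return mode, mean

for T in (0.05, 0.1, 0.2, 0.4):
    pmf = repeat_number_pmf(T, a=20.0, c=0.9, nu=1.0, tau=0.5)
    print(T, mode_and_mean(pmf))  # the two track each other closely
```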

4.3.3 Bengalese finch

Although the prediction of an inverse relationship between syllable duration and

the most probable repeat number follows from a model used to study syllable

repetitions in Bengalese finches [51], the relationship was never tested in the study

for these birds. Statistics of syllable repeats from the study for 32 birds are freely

available in the public domain. Using this data, we test if the inverse relationship

holds for Bengalese finches. It turns out that this is not the case. As seen in Fig. 4.5, the decrease in the most probable number of repeats with syllable duration is much faster than $c/T$ for Bengalese finches ($\alpha = -1.7$).

[Figure 4.5 here: log-log plot of the most probable number of repeats vs. syllable duration for 32 Bengalese finches.]
Figure 4.5. The inverse relationship between syllable duration and the most probable number of repetitions is not exact for Bengalese finches. Data collected by Kristofer Bouchard and Michael Brainard, University of California, San Francisco; available online at http://users.phys.psu.edu/~djin/SharedData/KrisBouchard/


4.4 Other calculations

4.4.1 Distribution of phrase duration and repeat number

The data presented in the above sections suggest a simple relation $\langle n \rangle \langle T \rangle = d$ connecting the mean repeat number, the mean syllable duration and the phrase duration $d$. In this section we try to gain an understanding of the distributions of these variables based on a few simple assumptions and the results of previous studies on these quantities.

As mentioned earlier, the distribution of the repeat number for the syllables can be modeled using the sigmoidal model. The probability of a syllable of duration $T$ occurring exactly $n$ times is given by

$$P(n|T) = \bigl(1 - p(n)\bigr) \prod_{s=1}^{n-1} p(s) = \frac{c}{1 + a \exp\left(-\frac{\nu n T}{\tau}\right)} \prod_{s=1}^{n-1} \left[1 - \frac{c}{1 + a \exp\left(-\frac{\nu s T}{\tau}\right)}\right]$$

For a distribution of syllable durations (for a specific syllable) given by some $\mathcal{P}(T)$, the distribution of the phrase duration $d$, defined as $nT$, is given by

$$P(d) = \sum_n \int dT\, \mathcal{P}(T)\, P(n|T)\, \delta(d - nT) \qquad (4.5)$$

Note that the distribution $\mathcal{P}(T)$ models the variability in the syllable durations between different phrases, and not within a single phrase. For the case of $\mathcal{P}(T)$ being the normal distribution with mean $\mu = \langle T \rangle$ and variance $\sigma^2$, the distribution $P(d)$ is given by

$$P(d) = \frac{1}{\sigma\sqrt{2\pi}}\, \frac{c}{1 + a \exp\left(-\frac{\nu d}{\tau}\right)} \sum_n \exp\left(-\frac{(d/n - \mu)^2}{2\sigma^2}\right) \prod_{i=1}^{n-1} \left(1 - \frac{c}{1 + a \exp\left(-\frac{\nu i d}{\tau n}\right)}\right) \qquad (4.6)$$

This can be computed numerically to get the distribution P (d) of phrase du-

rations, and can be compared to observed phrase duration distributions if there is

enough data to construct these distributions.
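A direct numerical evaluation of Eq. 4.6 can be sketched as follows; the parameter values in the example are hypothetical:

```python
import math

def phrase_duration_pdf(d, mu, sigma, a, c, nu, tau, n_max=100):
    """Numerically evaluate Eq. 4.6 for the phrase-duration density P(d),
    assuming a normal P(T) with mean mu and standard deviation sigma."""
    prefactor = (1.0 / (sigma * math.sqrt(2 * math.pi))) * \
                c / (1.0 + a * math.exp(-nu * d / tau))
    total = 0.0
    for n in range(1, n_max + 1):
        gauss = math.exp(-((d / n - mu) ** 2) / (2 * sigma ** 2))
        survive = 1.0
        for i in range(1, n):
            survive *= 1.0 - c / (1.0 + a * math.exp(-nu * i * d / (tau * n)))
        total += gauss * survive
    return prefactor * total

# Hypothetical parameters; the resulting curve could be compared with an
# observed phrase-duration histogram.
print(phrase_duration_pdf(d=1.0, mu=0.1, sigma=0.02, a=20.0, c=0.9, nu=1.0, tau=0.5))
```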



4.4.2 Exponential distributions lead to inverse relationship

We have so far considered the following sigmoidal form for the repeat probability $p_r$ of a syllable:

$$p_r(n) = 1 - \frac{c}{1 + a \exp\left(-\frac{\nu n T}{\tau}\right)}$$

However, this form was based on the model used in the study on adaptation of syllable repeats in the Bengalese finch [51]. Let us consider the following form, with $t_n = nT$:

$$p_r(t_n) = a e^{-t_n^{\,b}}, \qquad p_r(n) = a e^{-(nT)^b}$$

For different ranges of b, the function pr(n) is as seen in the figure below The

function for 0 < b < 1 is called a stretched exponential, b = 1 corresponds to a

regular exponential function, and b > 1 is called a compressed exponential function

(b = 2, being the normal function). b > 1 gives us sigmoids that we can use. More

parameters are required in the function since we want the lower repeat probability

to be non-zero, as well as the location of the fastest change in the sigmoid to be

tunable. This should be possible with the modified function below

71

Page 86: STATISTICAL INFERENCE OF SYNTAX FROM VOCAL SEQUENCES …

0 1 2 3 4 5 6 7 8 9 100.3

0.35

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

n

pr(n

)

a=0.42, b=10, c=0.3, τ=5

a=0.42, b=5, c=0.3, τ=5

$$p_r(n) = a e^{-\left(\frac{nT}{\tau}\right)^b} + c$$

where $0 < a + c < 1$, with $c$ representing the lower repeat probability ($0 < a < 1 - c$), and $\tau > 0$ changing the location of the drop in the sigmoid.

Using this form, the probability of a syllable repeating exactly $N$ times would be

$$P(N) = \bigl(1 - p_r(N)\bigr) \prod_{i=1}^{N-1} p_r(i) = \left(1 - a e^{-\left(\frac{NT}{\tau}\right)^b} - c\right) \prod_{i=1}^{N-1} \left(a e^{-\left(\frac{iT}{\tau}\right)^b} + c\right)$$

Do we still see the inverse relationship between $n_p$ and $T$? Differentiating,

$$p_r'(n) = -a b \left(\frac{T}{\tau}\right)^b n^{b-1} e^{-\left(\frac{nT}{\tau}\right)^b}$$

$$p_r''(n) = -a b \left(\frac{T}{\tau}\right)^b \left[-b\, n^{2(b-1)} e^{-\left(\frac{nT}{\tau}\right)^b} \left(\frac{T}{\tau}\right)^b + e^{-\left(\frac{nT}{\tau}\right)^b} (b-1)\, n^{b-2}\right]$$

The fastest change corresponds to $p_r''(n) = 0$, leading to the condition

$$\left(\frac{T}{\tau}\right)^b b\, n_p^{2(b-1)} = (b-1)\, n_p^{b-2}, \qquad n_p^{b} = \left(1 - \frac{1}{b}\right) \left(\frac{\tau}{T}\right)^{b}$$

The inverse relation follows:

$$n_p = \left(1 - \frac{1}{b}\right)^{\frac{1}{b}} \frac{\tau}{T}$$

In general, the probability of a syllable repeating could follow any exponential form of this kind, and the inverse relationship would still be required to hold. This means that a range of feedback mechanisms could potentially influence syllable repetitions.
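A quick numerical check of this result, with hypothetical parameter values satisfying $0 < a + c < 1$, compares the empirical mode of P(N) with the predicted $n_p$:

```python
import math

def repeat_pmf_exponential(T, a, b, c, tau, n_max=500):
    """P(N) under the compressed-exponential repeat probability
    p_r(n) = a*exp(-((n*T)/tau)**b) + c."""
    pmf, survive = [], 1.0
    for n in range(1, n_max + 1):
        p = a * math.exp(-((n * T) / tau) ** b) + c
        pmf.append(survive * (1.0 - p))
        survive *= p
    return pmf

a, b, c, tau = 0.6, 5.0, 0.3, 1.0
for T in (0.05, 0.1, 0.2):
    pmf = repeat_pmf_exponential(T, a, b, c, tau)
    predicted = (1 - 1 / b) ** (1 / b) * tau / T
    print(T, 1 + pmf.index(max(pmf)), round(predicted, 1))  # mode vs. n_p
```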

4.5 Mechanisms of repeat generation

According to the inverse relationship, $n_p T = c$, where $n_p$ is the most probable number of repeats, $T$ is the syllable duration and $c$ is a constant. However, in Sec. 4.3.2 we showed that the most probable number of repeats and the mean number of repeats are approximately equal for the sigmoidal model of adaptation, i.e., $n_p = \langle n \rangle$. Now, $\langle n \rangle T$ is the average phrase duration $d$. The inverse relationship therefore implies that there is a typical phrase duration (a phrase is a repeated set of syllables). In such a relationship, any two of the three quantities $\langle n \rangle$, $T$ and $d$ have to be encoded by the generative processes. A pertinent question is: which of these is actually being encoded neurally? The inverse relationship between syllable duration and most probable number of repeats is independent of any considerations about underlying physical mechanisms. However, we could speculate about its physical origins.

4.5.1 Auditory feedback could regulate repetition

In the Bengalese finch, auditory feedback is known to influence song sequencing

and may provide the external drive that sustains repetition. Supporting this idea,


a recent study in these birds revealed stimulus-specific adaptation of responses to repeated auditory stimuli in a key sensorimotor forebrain region implicated in song control, as well as an immediate decrease in repetitions after deafening [51]. On the

other hand, earlier reports have suggested that canary phrase structure is largely

intact even for birds that never heard themselves sing. The mechanisms underlying

the long time constant of phrase persistence remain to be elucidated. In swamp

and song sparrows, deafened juveniles retained species-specific characteristics such

as phrase duration [84].

The Waterschlager canary is a songbird specifically bred for its song [11]. Hence

the song is very different from that of wild canaries. The Waterschlager strain also

has a genetic auditory defect. The bird is partially deaf, with sensitivity to sounds

above 2 kHz decreased by as much as 40 dB [11]. These are sounds produced

mainly by the right side of the syrinx. 90 percent of the syllables generated by

the canary in song are produced by the left side of the syrinx. These songs have a

distinct structure of strong repetitions of the same syllable before a new syllable

is sung. Perhaps the lateralization of singing in the canary brain can be used to

understand whether repetitions are a right-brained feature, and if so why? Also,

if these strains are deaf to syllables of frequencies higher than 2 KHz, yet repeat

distributions for these syllables are distinctly non-Markovian, i.e., not a monotoni-

cally decreasing distribution with the most probable number of repeats being one,

then it may be an indication that the mechanism of repeat generation is different

from that of Bengalese finches. We do not have information about whether the

birds whose songs we have analyzed were deaf exactly to the frequencies considered

standard in the literature. However, a clear indication of the possibility could help

motivate new experiments.

In early deafening experiments, canaries that were deafened as juveniles devel-

oped species-specific patterning of syllables into songs - the presence of tours and a

preference for durations of tours and silent intervals known to be statistically more

likely in normal canaries [85]. Further, in experiments aimed at identifying features

of the canary song that are possibly genetically programmed, it was observed that

even canaries reared in isolation develop normal phrase structure on sexual mat-

uration [86]. In the same study, canaries were able to learn and reproduce songs

with atypical features such as absence of syllable repetitions when tutored with

abnormal songs. However, with sexual maturation, these canaries sang songs with


the species-specific phrase structure even if abnormal learned syllables constituted

these phrases. Phrases have been stated as having a fundamental time-scale that

is independent of the syllable duration [87].

4.5.2 Constrained phrase duration

The observed relationship could be a trivial outcome of the fact that the total

duration of the repeated segment is constrained to be a constant independent

of syllable duration. This would mean that even with fluctuations about this

constant value, the average number of repeats has to be inversely proportional to

the syllable duration simply as a consequence of the constraint. This constraint

could be a physical limit - say, the amount of air available in the air sacs of the

bird. In the song system, RA receives projections from LMAN which is a part

of the Anterior Forebrain Pathway (AFP). RA in turn projects to downstream

motor and respiratory neurons. LMAN is thought to be responsible for introducing

variability in juvenile song [88]. If syllable repetitions can be considered a temporal

variant of normal song, it is possible that repetitions are controlled by LMAN or

more generally the basal ganglia circuit (Area X → DLM → LMAN). There also

seems to be a strong link between repetitions and respiration. This can happen

through the LMAN → RA → respiratory nuclei pathway. The phrase duration

could also be genetically encoded or pre-programmed, perhaps a limit on some

neural process [85,86]. These are both mechanisms we speculate to be alternatives

to the regulation of repeats by auditory feedback alone.


Chapter 5
Semi-automated Classification of Song Syllables

The datasets used in the analyses in previous chapters were sets of vocal sequences

transcribed into symbol sequences. All sequences in these datasets were tran-

scribed manually to ensure accuracy. This involves converting an audio recording

into a spectrogram (a visual representation of frequencies in the recording), iden-

tifying distinct syllable types from the spectrogram, and labelling every syllable

by its type. However, this is a time-consuming endeavour and subject to human

error. The automation of song transcription procedures would be a highly wel-

come development in the field since it would enforce consistency and reliability in

transcriptions. However, this is in fact an extremely hard problem. Most human

speech recognition systems rely on large databases containing multiple exemplars

of word utterances [89]. Also, the performance of these recognition systems can

be quantified relatively easily since we are dealing with speech forms that we as

humans can readily identify by ear. But with the vocal sequences of other animals,

we do not know what the syllables in the song should be.

Many studies have tackled the problem of syllable recognition and classifica-

tion in songbird vocalizations with varying degrees of success [6, 90–92]. They

have all been semi-automated procedures. There is no procedure available to date that eliminates the need for manual involvement completely. Currently, to our

knowledge, the most advanced and widely used free software that integrates the recording of songs with the measurement of syllable features and syllable


classification is Sound Analysis Pro [93]. The feature-based classification algo-

rithm that the software implements makes use of acoustic features such as pitch,

amplitude, frequency modulation, etc., for the categorization of syllables. However,

since we humans are good at visually identifying syllables by their spectrograms

(this is how manual transcription of songs is done), it is possible that incorporating

image-based features into classification methods might improve the performance

of syllable classifiers. This chapter details a syllable recognition and classification

system we developed by coupling selective acoustic and image-based features with

a standard supervised learning method used for classification - the Support Vector

Machine (SVM).

5.1 Morphology of a song

The song of a songbird is composed of acoustic units called syllables that are sung

following some syntax. Syllables are separated from each other by silent intervals.

The silent intervals between syllables within a song are much smaller than the

interval between songs. In fact, we identify a new song based on the large silent

intervals. In Fig. 5.1 we see an example of the pressure wave from the recording

of a Bengalese finch song in the top panel. The corresponding spectrogram giving

the spectral or frequency content of the song is displayed below the waveform.

Distinct syllables are identified by differences in temporal and spectral content.

We utilize these differences in the classification method that we develop. However,

before syllable classification, we first need to distinguish syllables from silences in

the song.

5.2 Identification of song syllables

In order to isolate the vocal elements in a recording, it is necessary to devise

a method that filters out the noise and picks out reliable vocal data from the

pressure wave d(t). We do this by identifying the silent intervals in the song since

every vocalization is preceded and followed by a period of silence. Thresholding the

amplitudes of the pressure waves is a common approach of isolating vocal elements

in birdsong [94, 95]. The idea would be to set a threshold θ below which all data

is considered to be either silence or noise. However, this threshold must not be a


[Figure 5.1 here: pressure (a.u.) vs. time (s), with the corresponding spectrogram (frequency, Hz) below.]
Figure 5.1. Waveform of a Bengalese finch song with the corresponding spectrogram.

constant for the entire song, but a function of time θ(t). This is because the mean

equilibrium value of the data could vary with time. This variation is ‘subtracted’

when the threshold is adjusted accordingly with time. The threshold function θ(t)

is determined based on the pressure amplitudes in consecutive time windows of

100 ms each. For example, if the data is sampled at a frequency of 40000 Hz for a

period of 30 s, there are 1.2 million data points.

However, for purposes of identifying the threshold function, it is enough to have

a representative set of pressure amplitudes. The representative set is obtained by

first finding the maximum amplitude | A(t) | in a single oscillation cycle of d(t).

In order to do this, a copy of d(t) , d′(t) is made. Every element in d′(t) is then

shifted by one. A product of d(t) and d′(t) now gives a set of positive and negative

values. Every time a negative value is encountered, it indicates that the pressure

wave has crossed the zero axis - i.e, we have located a node. Three consecutive

negative values would then tell us that a complete oscillation cycle is contained

between the first and third nodes.

The maximum value A(t) of the pressure amplitude in one oscillation cycle


is then determined to obtain the envelope of the waveform. We consider the

logarithm of A(t) to emphasize the difference between different amplitudes. Since

doing so magnifies extremely low noise levels, any noise level below 0.02 (a.u.) is set to be 0.02 (a.u.). This is then smoothed by using the second-order Savitzky-

Golay filter [96]. We then identify vocal elements by detecting continuous regions

in A(t) that are above a threshold function θ(t) defined for a moving window of

constant size (100 ms, 4000 time points). We also define a step size for the moving

window. One issue that was encountered is the loss of data towards the end of the

waveform if the number of data points for the last window happened to be less

than that needed to complete a full window. This is avoided to prevent loss of

data by choosing a step size of 1 data point. For example, if we choose a window

of size equal to 4000 time points - the first window is indexed by data point 1

followed by 3999 points, the second by data point 2 flanked by data point 1 on

one side and 3998 points on the other and so on. This ensures that all the data

points are covered. The maximum Amax(t) and minimum Amin(t) of A(t) within

each window are then determined. The threshold is defined to be a fraction α of

the difference between Amax(t) and Amin(t) in a time window of size 100 ms plus

Amin(t). This fraction was chosen to be α = 0.4 by trial and error.
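A condensed sketch of this thresholding step is given below, assuming the envelope A(t) has already been extracted; the Savitzky-Golay window length of 101 samples is an illustrative choice, and numpy/scipy are assumed dependencies.

```python
import numpy as np
from scipy.signal import savgol_filter

def adaptive_threshold(envelope, window=4000, alpha=0.4):
    """theta(t) = A_min + alpha*(A_max - A_min) over a 100 ms (4000-sample)
    moving window with a step of one sample, as described above."""
    log_env = np.log(np.maximum(envelope, 0.02))  # noise floor at 0.02 a.u.
    smooth = savgol_filter(log_env, window_length=101, polyorder=2)
    theta = np.empty_like(smooth)
    half = window // 2
    for i in range(len(smooth)):
        seg = smooth[max(0, i - half):i + half]
        theta[i] = seg.min() + alpha * (seg.max() - seg.min())
    return smooth, theta

# Vocal elements are the contiguous regions where smooth > theta:
# voiced = smooth > theta
```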

The waveform of an isolated syllable was transformed into a spectrogram s(f, t),

which is the energy density at frequency f and time t using the multitaper method.

Distinct syllables obtained using our method for the song of a canary are shown

in Fig. 5.2. Once the syllables in the song have been isolated, they have to be

classified into identified types by pattern recognition.

5.3 Semi-automated classification of song syllables

Pattern recognition is a standard problem in machine learning [97]. In general

any algorithm for pattern recognition is based on identifying regularities in the

dataset. Such algorithms can fall under one of two broad categories - unsupervised

or supervised learning methods. In an unsupervised learning method, inferences

are drawn from the data without the aid of pre-assigned labels to a representative

subset of the data. Such methods include clustering, k-means, mixture models,

hierarchical clustering, among others. Unsupervised methods have been previ-

ously used to classify syllables in birdsong [6]. By contrast, in supervised learning


Figure 5.2. Spectrograms of different syllable types in the vocal repertoire of a canary (1 bird). Duration and frequency information are not shown to scale. Syllables can be distinguished by visual inspection of the spectrograms.

methods, inferences are drawn from the data based on labelled training data. We

focus on a supervised learning method, the Support Vector Machine, for syllable

classification. The training data for our SVM are 20 exemplars each of all distinct

syllable types identified from the recordings of a songbird.

5.3.1 Support Vector Machines

The Support Vector Machine (SVM) is a machine learning algorithm that can be used to classify data points in $\mathbb{R}^n$ into a predetermined discrete set of classes $Y$. The algorithm uses a training set containing pairs $(x_i, y_i)$ - assigning the data points $x_i$ to the classes $y_i$ - to learn the classifier, and is thus a supervised learning

technique [97, 98]. The learned classifier can be then used to classify new data


[Figure 5.3 here: data in the (x1, x2) plane separated by the hyperplane w^T x + b = 0, with the margin indicated.]
Figure 5.3. Linear boundary surface in an SVM that separates data labelled by {+1, −1}.

points x. This is similar to regression in that SVM uses the training data (xi, yi)

to find a fit function f (x, α) (α being the fitting parameters) such that f minimizes

some empirical loss function evaluated on the training data. SVM however differs

from regression in that the fit function produces discrete outputs.

The simplest of problems that can be tackled with the SVM is the task of

finding a linear boundary surface separating data labelled by {+1,−1} similar to

what is shown in Fig. 5.3. The function f in this situation has the form

f (x|α = (w, b)) = sign[wTx+ b

]and the optimization algorithm can be used to pick the parameters such that some

notion of loss is minimized. Simple and intuitive choices of loss functions could be

the number of misclassified data points, or (negative of) the margin between the

boundary and the data points, or a weighted combination of the two. SVM can

be used also in cases where the separating surface has a non-linear form. Rather than fitting a non-linear boundary surface, the non-linear SVM takes the approach of appending additional dimensions to the data, i.e., placing the original n-dimensional


Figure 5.4. Mapping data that is not separable by a linear boundary into a higher-dimensional feature space leads to a linear boundary in the higher-dimensional feature space. When this is mapped back into the input space, as in the last panel, we see that the boundary is a curve - a circle in this case - that separates the data.

data points into a larger, $(n+d)$-dimensional space through

$$x = (x_1, x_2, x_3, \ldots, x_n) \rightarrow x' = (x_1, x_2, x_3, \ldots, x_n, \pi_1(x), \pi_2(x), \ldots, \pi_d(x)).$$

After a suitable choice of the kernel functions $\pi_i$, the new points $x'$ can be separated by a hyperplane using the techniques of linear SVMs. Fig. 5.4 illustrates a non-linear SVM used to separate the data points using a kernel function $\pi_3(x) = x_1^2 + x_2^2$.

5.3.2 Syllable features for classification

When syllables are considered as training data points xi for an SVM, they are

represented as points in a multidimensional feature space. Each dimension of the

feature space is a distinguishing characteristic of the syllable type. The choice

of features is crucial to the performance of the classifier. We used a minimal

feature set - three acoustic-based and two image-based features - for a reasonable

classification performance on Bengalese finch and canary syllables.

5.3.2.1 Duration

Different syllable types have fairly distinct durations. Hence the first feature of

the syllable we use for the SVM is syllable duration. Although multiple syllables

may have roughly the same duration, the combination of this feature with others

ensures that this is not an issue. The durations (Mean ± Standard Deviation) of

seven types of syllables for a Bengalese finch along with the coefficient of variation


(Standard Deviation/Mean) are shown in Table 5.1.

Type | Duration (ms) | Coefficient of Variation
1    | 40.3 ± 5.1    | 0.1268
2    | 87.4 ± 5.9    | 0.0674
3    | 50.7 ± 6.7    | 0.1314
4    | 53.7 ± 6.0    | 0.1114
5    | 63.6 ± 10.7   | 0.1677
6    | 50.2 ± 8.8    | 0.1757
7    | 50.0 ± 7.7    | 0.1546

Table 5.1. Durations of syllables in a Bengalese finch song.

5.3.2.2 Wiener entropy

The Wiener entropy of an audio signal is a measure that quantifies its deviation from white noise. White noise is characterized by uniform power in all frequency

bands. In contrast, the sound produced by resonant structures or animal vocaliza-

tions contains multiple harmonics of a frequency. Wiener entropy is defined as the

ratio of the geometric mean to arithmetic mean of a power-spectrum, and is often

expressed in logarithmic scale:

$$W = \log_{10} \frac{\left(\prod_{i=1}^{N} s(\omega_i)\right)^{1/N}}{\frac{1}{N} \sum_{i=1}^{N} s(\omega_i)} = \log_{10} \frac{\exp\left[\langle \ln s \rangle\right]}{\langle s \rangle} \qquad (5.1)$$

where ωi labels the ith frequency bin and s(ωi) is the total power spectrum inside

this bin. Within this definition, white noise has the highest Wiener entropy (0)

and a pure note, such as a sinusoidal wave has the lowest Wiener entropy (−∞).

All other signals have intermediate values. A spectrogram gives us both temporal

and spectral information for a syllable. Hence we can calculate the Wiener entropy

along both dimensions, since the variation in the power spectrum is unique along

each dimension. This gives us the two other acoustic features used with the SVM.
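Eq. 5.1 is a one-liner in practice. A minimal sketch (numpy assumed; a small floor avoids log(0) in silent bins):

```python
import numpy as np

def wiener_entropy(power):
    """Wiener entropy (log10 of geometric/arithmetic mean ratio) of a power
    spectrum: 0 for white noise, large and negative for a pure tone."""
    power = np.asarray(power, dtype=float) + 1e-12
    return np.log10(np.exp(np.mean(np.log(power))) / np.mean(power))

# Applied along each axis of a spectrogram s(f, t) to get the two features:
# wiener_entropy(s.sum(axis=0))  # temporal profile
# wiener_entropy(s.sum(axis=1))  # spectral profile
```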

5.3.2.3 Hough transform

We now look for image-based features. When we visually inspect spectrograms

of syllables, we identify matches of syllables roughly based on the orientation and

extent of ‘lines’ on the spectrogram. The Hough transform is a feature extraction tool


Figure 5.5. Summary of the Hough transform. Any edge in an image, which is simply a set of collinear points, is represented as a set of intersecting sinusoidal curves as the result of a Hough transform.

that can be used to identify structures in an image that can be approximated by line segments (and geometric curves in general). The simplest Hough transform allows extraction of linear segments in an image, through combining the Radon transform and binning. The transform takes in a two-dimensional array representing an image A and maps it to another two-dimensional array representing a new image B. The map satisfies the following qualitative features:

1. Isolated points in A map to sinusoidal curves in the output image.

2. Every linear segment in A maps to a point in the image B, with the intensity of the point reflecting the number of pixels in the segment.

3. Almost-collinear sets of pixels in image A map to larger high-intensity spots in the image B.


Figure 5.6. Separation of syllables in feature space based on duration and the two Hough transform coordinates ρ and θ.

The transform maps a point in A, represented by the Cartesian coordinates (x, y), to a sinusoidal curve in B:

$$\mathrm{Radon}\left[(x, y)\right] \rightarrow \{(\theta,\, x\cos\theta + y\sin\theta) \;\text{such that}\; \theta \in (0, \pi)\} \qquad (5.2)$$

The transform of an image A with multiple points is an image B with all the

sinusoidal curves piled together. Intensity of a pixel in the final image B is propor-

tional to the number of sinusoidal curves passing through it. This is summarized

in Fig. 5.5. In particular, it can be shown from simple geometrical arguments that

all the sinusoidal curves in B arising from the transform of a set of collinear points

in A intersect at a single point, reinforcing the intensity of the pixel represented

by that point. This point is represented by two coordinates - ρ and θ.

As can be seen in Fig. 5.2, the images of the spectrogram of individual syllables

contain prominent linear structure. This, in the space of the Hough-transformed variables, appears as a high-intensity point (ρ, θ), the location of which can be

identified by simple peak detection. This process effectively assigns to each syllable

a pair of numbers (ρ, θ) which are the last two features used with the SVM.

Fig. 5.6 shows an example of the separation of six Bengalese finch syllable types

based on three of the five features.
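A sketch of this peak-detection step using scikit-image's line Hough transform; the binarization threshold and the function name are our own illustrative choices:

```python
import numpy as np
from skimage.transform import hough_line, hough_line_peaks

def hough_features(spectrogram, threshold=0.5):
    """Return the (rho, theta) coordinates of the strongest linear
    structure in a spectrogram image."""
    # Keep only high-power pixels so the linear structure forms
    # approximately collinear sets of foreground points.
    binary = spectrogram > threshold * spectrogram.max()
    # Each foreground pixel contributes one sinusoidal curve to the
    # accumulator; collinear pixels reinforce a single (rho, theta) bin.
    accumulator, thetas, rhos = hough_line(binary)
    # Simple peak detection in the accumulator.
    _, peak_thetas, peak_rhos = hough_line_peaks(
        accumulator, thetas, rhos, num_peaks=1)
    return peak_rhos[0], peak_thetas[0]
```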


5.3.3 SVM ensembles

Since we are dealing with a multi-class classification problem, we construct a set

of binary classifiers which in combination act as a multi-class classifier. However,

often the support vectors obtained from the learning are not sufficient to classify

all test samples accurately. To improve performance, we use an ensemble of classifiers [99] in place of each binary classifier. An ensemble of classifiers is a collection of several classifiers whose individual decisions are combined (using, for example, a simple voting strategy) to classify the test samples. Several methods of creating these ensembles exist in the literature, all aiming to make each SVM as different from the other SVMs as possible by creating different training sets for each of them.

Bootstrapping is one such method where k replicate training sets are con-

structed from the training set, by randomly resampling with replacement. A

training syllable may appear more than once or never in a given training set.

Each replicate training set trains a different SVM, all aimed at performing the same binary classification task.

In summary, for classification into one of n classes, there should be n binary

classifiers. Each of the n classifiers in turn consists of an ensemble of binary clas-

sifiers of size k. A majority voting scheme is then employed where a syllable is

classified as class i if the majority of the classifiers in the ensemble classify the

syllable as class i.
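A minimal sketch of the bootstrap-and-vote scheme for a single ensemble, assuming numpy arrays with integer class labels and scikit-learn's SVC (the RBF kernel and the function names are illustrative choices, not the exact configuration used in this work):

```python
import numpy as np
from sklearn.svm import SVC

def train_svm_ensemble(X, y, k=15, seed=0):
    """Train k SVMs, each on a bootstrap replicate of the training set
    (resampled with replacement, so a syllable may appear several times
    or not at all in a given replicate)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    ensemble = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # bootstrap replicate
        ensemble.append(SVC(kernel="rbf").fit(X[idx], y[idx]))
    return ensemble

def ensemble_predict(ensemble, X):
    """Majority vote over the individual classifier decisions."""
    votes = np.stack([clf.predict(X) for clf in ensemble]).astype(int)
    return np.array([np.bincount(v).argmax() for v in votes.T])
```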

5.3.4 Transcription of a song

In a given recording, song sequences were identified by examining the inter-syllable

duration between syllables sn and sn+1. If this duration ∆τ = tn+1−tn was greater

than 200 ms, a given song sequence was assumed to have ended with syllable sn and

a new one begun with syllable sn+1. Syllables were identified and isolated as voiced elements between silent intervals (see top panel of Fig. 5.7). From the complete

set of syllables, 20 training exemplars were identified for each unique syllable type

by visual inspection of their spectrograms. The five classification features were

extracted for these syllables and the hyperplanes that separated distinct syllable

types in feature space were determined using the SVM. All other syllables were

then classified based on their position with respect to these hyperplanes. The


Figure 5.7. Transcription of a Bengalese finch song. In the top panel, a song sequence begins at the magenta line and ends at the cyan line.

classification performance was higher than 85% for all syllables for both Bengalese

finch and canary songs (an example of syllable labelling after SVM classification

is shown in the lower panel of Fig. 5.7), with performance as high as 98% for some

syllables (mostly clean whistles). We constructed a GUI to manually inspect the

classification performance and easily reassign misclassified or unclassified syllables.

Thus although we were unable to eliminate the need for manual involvement, we

were able to construct a good classifier based on a minimal number of syllable

features.
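A sketch of the sequence-splitting rule, assuming syllable onset and offset times are available and taking the inter-syllable duration as the silence between one syllable's offset and the next syllable's onset (the function name and example data are illustrative; onset-to-onset timing works the same way):

```python
def split_into_songs(onsets, offsets, labels, gap=0.200):
    """Split a transcribed recording into song sequences wherever the
    silence between consecutive syllables exceeds `gap` seconds."""
    songs, current = [], [labels[0]]
    for n in range(1, len(onsets)):
        if onsets[n] - offsets[n - 1] > gap:  # inter-syllable gap too long
            songs.append(current)
            current = []
        current.append(labels[n])
    songs.append(current)
    return songs

# Syllable times in seconds and their SVM-assigned labels:
onsets  = [0.00, 0.15, 0.31, 1.20, 1.36]
offsets = [0.10, 0.25, 0.42, 1.30, 1.47]
print(split_into_songs(onsets, offsets, list("abcab")))
# -> [['a', 'b', 'c'], ['a', 'b']]
```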


Chapter 6

Conclusion

Vocal sequences are observable expressions of complex neural processes occurring

in the brain. The study of vocal sequences and the patterns contained in them can

shed light on neural computations. The relative simplicity of sequence structure

and the possibility of laboratory collection of data have made songbird songs an

ideal candidate for such studies. Even so, there are several challenges, especially in

developing methods for efficient and reliable syntax inference using a finite amount

of data.

6.1 Partially Observable Markov Model - inference

and evaluation

Earlier studies have shown that the syntax of Bengalese finch songs can be effi-

ciently represented by a class of probabilistic finite state systems called Partially

Observable Markov Models (POMM). In the first part of the dissertation we iden-

tified and developed various methods to infer the POMM from a set of symbol

sequences. Although they were devoted to modeling the syntax of vocal sequences

in this dissertation, the methods are very general and can easily be applied to other

classes of problems where sequential structures can be mapped into POMMs. We

discussed several different metrics to evaluate the fit of the model and showed that

finiteness of available data places upper bounds on the values of the metrics. Reli-

able inference of the model syntax from finite data involves optimization of model

parameters without over-fitting. We developed a scheme to perform such controlled


inferences, specifically utilizing a metric called sequence completeness.

Discussion

The combinatorial challenge of determining the optimal number of states in a POMM with the grid search algorithm we developed grows with the number of syllables in an animal’s repertoire. The current

method of finding the number of states associated with each syllable is computa-

tionally efficient for a bird with 6-10 syllables in its repertoire such as the Bengalese

finch, but becomes highly inefficient for birds like the nightingale and the warbler

that sing more than 100 types of syllables. There is much scope for the design

of more efficient computational paradigms for inference methods described in this

dissertation.

Also, it was assumed that we knew all syllables in the animal’s vocal repertoire

for our analyses before we constructed syntax models. However, if we want to

consider problems such as real-time computation of syntax, where the syntax is

dynamically inferred during vocalizations, we need to allow for the possibility of

incorporating unseen observations in our models. This goes back historically to a

question in inductive inference - what to do when the utterly unexpected occurs,

an outcome for which no slot has been provided in the support of your distribution.

This is not the problem of observing an impossible event, that is, an event whose existence has been considered but whose probability is taken to be zero.

Rather, this problem arises when we observe an event whose existence we did not

previously suspect. In earlier literature, this is called the sampling of species

problem [100]. In one of the many side-projects that came out of this dissertation,

we considered the use of non-parametric Bayesian inference techniques such as the

Hierarchical Dirichlet Process [101] and the infinite Hidden Markov Model [102,103]

to come up with ways of incorporating the encounter of a new syllable or symbol in

a discrete sequence. However, such extensions were not necessary for the questions

we tried to answer in this dissertation and were therefore not pursued actively.

But we would like to draw attention to these possibilities.


6.2 Comparison of syntactic structures

We then used the inference methods that were developed in the first part of the

dissertation to construct the syntax of the non-repeat structure of Bengalese finch

songs before and after deafening, to understand the role of auditory feedback in the

regulation of syntax. Before deafening, the syntax is a POMM for five out of the

six birds, with the sixth bird having a Markovian syntax. Significantly, the many-to-one mappings that existed between states in the POMM and syllables of the song were absent in the song syntax for all but one of the birds after deafening - i.e.,

the syntax is more Markovian. Many new syllable transitions that did not exist

before the removal of auditory feedback also appear post-deafening. This implies

that auditory feedback has a role in maintaining the song syntax for an individual

Bengalese finch and more specifically in influencing the many-to-one mappings in

the syntax of Bengalese finches. However, the fact that the changes in syntax were

not seen in all the birds studied suggests that auditory feedback most likely acts

in combination with other factors such as the topographical connectivity patterns

in the song-generating neural circuitry. It is necessary to confirm these results in

studies involving a larger number of birds and different species of birds. It would be particularly interesting to consider a bird such as the canary, which is thought not to rely heavily on auditory feedback.

Discussion

The procedure used for the comparison of syntax structures in Bengalese finches

pre- and post-deafening is not limited to birdsong. Our analysis suggests that

metrics such as sequence completeness are tools that can be used in the study

of more general questions - Are some sets of sequences more Markovian or more

POMM-like than others? How different are the syntax models of different species

from each other? Considering grammars in the Chomsky hierarchy, finite-state

grammars can generate strings generated by higher grammars as part of their language; e.g., a Markov model based on transitions between two syllables A and B can generate the palindromic string ABBA, the context-free string AAABBB, and the copy string ABAB. However, a finite-state grammar cannot generate only these

strings [104]. Some interesting questions to ask are - Can we distinguish a Markov


model from a POMM from a higher grammar based on the observed sequences?

If so, how many sequences should we observe, or what is the minimum required

sample size, before we can state with some confidence that the sequences were

generated by one grammar or another?

Also, syntax models such as the POMM are constructed based on the analysis

of local transition rules between the states or syllables. Long vocal sequences such

as humpback whale songs are referred to in the literature as possessing a hierarchi-

cal structure [65,69] with notes arranged into units (syllables), units into phrases,

phrases into themes, and themes into songs. The reference to hierarchy also seems

to imply a parallel organization of structures that generate them including associ-

ated time scales. However, the appearance of hierarchy may be a result of human

perception. It is possible that local rules of transition between syllables would

suffice to describe the syntax, with the apparent hierarchical structures simply emerging from the local rules.

For example, we may tend to split the simple sequence ABCABCDBCABC... into chunks ABC, ABC, DBC, ABC, although it could be generated by a simple Markov model in which the transitions A → B, B → C, D → B occur with probability 1, C → A occurs with probability 2/3, and C → D occurs with probability 1/3 (see the sketch at the end of this discussion). Ascertaining whether the syntax of humpback whale song can be represented by a finite

state model or not would be an important endeavor. However, as we discussed in

the section on humpback whale songs in Chapter 2, the methods we developed are

not suitable for dealing with long and continuous vocalizations since all bounds

are defined based on the availability of multiple independent sequences from the

same individual. Extensions of our methods for a long continuous sequence could

be interesting future research.
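Returning to the chunking example above, a toy generator for the four-syllable Markov model illustrates how chunk-like structure can arise from purely local transition rules (the function and variable names are illustrative):

```python
import random

# Transition rules from the example above: A->B, B->C and D->B are
# deterministic; from C the chain goes to A with probability 2/3 and
# to D with probability 1/3.
RULES = {"A": [("B", 1.0)],
         "B": [("C", 1.0)],
         "D": [("B", 1.0)],
         "C": [("A", 2 / 3), ("D", 1 / 3)]}

def generate(start="A", length=30, seed=1):
    random.seed(seed)
    sequence, state = [start], start
    for _ in range(length - 1):
        options, weights = zip(*RULES[state])
        state = random.choices(options, weights=weights)[0]
        sequence.append(state)
    return "".join(sequence)

print(generate())  # e.g. ABCABCDBCABC... - 'chunks' ABC and DBC emerge
```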

6.3 Statistics of syllable repetitions

Finally, we considered the repeat structure of syllables in song sequences. We

demonstrated that a precise inverse relationship exists between syllable duration

and the most probable number of repetitions of the syllable for canaries and swamp

sparrows - two species with songs composed mostly of syllable repetitions or trills.

However, this relationship does not hold for the songs of a Bengalese finch. This

contrast in behavior can be a window into differences in mechanisms of song gener-

ation in different songbirds. The inverse relationship also implies that the duration


of a set of repeats, or a phrase, is a constant on average. We also raise the question of which variables involved in the inverse relationship are encoded neurally for canaries and swamp sparrows - syllable duration, phrase duration, or the probability of the number of syllable repeats. This is an important question, since in some species it has been assumed that phrase duration and syllable duration are independently encoded. For example, in canaries it is suggested that the phrase duration is genetically encoded [76]. However, given the inverse relationship, it is possible that the phrase duration is simply a result of a constraint on the syllable duration and the number of syllable repetitions.

Discussion

It is plausible that the exact inverse relationship specifies an upper bound on the

performance of syllable repetitions. Repetition-generation mechanisms of some

songbirds like the Bengalese finch may operate in regimes well below the perfor-

mance limit, while others operate exactly at the limit. There is evidence that

swamp sparrows tutored with trills that are artificially accelerated are unable to

keep up with the pace and end up singing trills with a ‘broken syntax’ [105] in

which discrete chunks are missing from what should be a continuous trill. There

exist neurons in the HVC that project to Area X, and are known to exhibit activity

that is precisely time-locked to every repetition of a syllable [106]. In awake singing

swamp sparrows it has been shown that at accelerated trill rates, these neurons fail

to respond to consecutive syllables [107]. Similar experiments in different species

could help test the hypothesis that different species operate at different distances

from the performance limit, with the distance given by how fast the most probable

number of repeats scales with syllable duration.


Appendix A

Baum-Welch Algorithm for estimation of POMM parameters

The algorithmic problem of inferring a Hidden Markov Model from data is described in Rabiner & Juang [55]. For observed syllable sequences {y} = {y1, y2, . . .}, with each observation taking one of m values, a Partially Observable Markov Model (POMM) has to be constructed in which each syllable yi is generated by a hidden state Si, such that corresponding to the syllable sequence y there is a state sequence S. Assume that there are N kinds of states {h1, h2, . . . , hN}. The number of states

Figure A.1. Calculation of forward probabilities in the trellis of the observation sequence y1, y2, y2 for a 4-state POMM. The thick arrows indicate the most probable transitions. As an example, the transition between state s1 at time t = 2 and state s4 at time t = 3 has probability α2(1) T14 E4(y2), where αt(i) is the probability of being in state si at time t.


N and the state-syllable type assignments are initially fixed. Each state is assigned

a single syllable type, but the same syllable type may be assigned to several states.

We consider a single ‘silent’ state (the first state in the system) to represent both the start and end states, emitting a single symbol that signifies either the start

or end of a syllable sequence. All the available observation sequences are concate-

nated into one long sequence with the symbol indicating a silent state marking

the beginning and end of each individual sequence. The elements of the model are

the distribution Γ for the start state, the state transition probabilities T , and the

state-syllable emissions E. For the computational implementation, Γ is a vector

of length N, T is an N ×N matrix and E is an N ×m matrix. The first state in

the state sequence that leads to the observed concatenated syllable sequence will

always be the start state. Hence all the elements of the vector Γ are 0 except for

the first element (corresponding to the silent state), which is 1. Γ is a constant

in our implementation of the POMM. Initially, a random state transition matrix

T is chosen. For the POMM, the emission matrix E is constructed based on the

state-syllable assignments and is fixed in our implementation of the POMM.

$$E_{ij} = \begin{cases} 1 & \text{if state } i \text{ generates syllable } j \\ 0 & \text{otherwise} \end{cases}$$

Using the emission matrix, an ‘observation matrix’ O of size N × N is con-

structed for each syllable type - e.g., in a four-state system, if syllable y1 is of type

1 generated by states 2 and 3, then the observation matrix for type 1 is

$$O_1 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$

In other words, Oi is a diagonal matrix with the diagonal given by the ith column of the emission matrix E. The construction of the observation matrix simplifies the implementation of

the forward-backward algorithm which is the first of two parts of the Baum-Welch

algorithm.

If the concatenated sequence is of length τ , the objective is to obtain the state

transition probabilities that are most likely to give rise to y. In order to do so, we


are interested in calculating the probability γt (h) defined as follows.

γt (h) gives the probability that the state St occurring at step 0 ≤ t ≤ τ is h,

given the entire observed sequence y1,...,τ .

$$\gamma_t(h) = P(S_t = h \mid y_{1,\ldots,\tau}, T) = \frac{P(S_t = h,\ y_{1,\ldots,\tau} \mid T)}{P(y_{1,\ldots,\tau} \mid T)} = \frac{P(A \cap B)}{P(B)}$$

where A is the event of state h occurring at step t and B is the event of observation of

the entire syllable sequence. It turns out that this can be written as

$$\gamma_t(h) = \frac{\alpha_t(h) \times \beta_t(h)}{\prod_t c_t}$$

where α and β are the

forward probabilities: $\alpha_t(h) = P(S_t = h \text{ and } y_{1,\ldots,t})$

backward probabilities: $\beta_t(h) = P(y_{t+1,\ldots,\tau} \mid S_t = h)$

αt(h) gives the probability that the syllables y1→t occur and that the state at step t is h. βt(h) gives the probability that the syllables yt+1→τ occur given that the state at step t was h. ct is a normalization factor, the probability of the tth syllable under the model given the preceding syllables.

A.1 Estimating forward probabilities

The forward variable α is calculated as follows: The probability of starting a new

sequence and making a transition that emits the first observed syllable is given by

the vector

$$\alpha_{1:2} \sim \Gamma T O_2$$

Step t = 1 corresponds to the start state and the start symbol, so α1 = Γ. Also, since α1:2 is a probability distribution over the possible states, the sum of all the elements of the vector should be one. Hence the vector has to be normalized. Define

$$c_2 = \sum_{h = \text{all states}} \alpha_{1:2}(h)$$


to be the normalization factor such that

$$\alpha_2 = \alpha_{1:2} = c_2^{-1}\, \Gamma T O_2$$

Similarly,

$$\alpha_{1:3} = c_3^{-1}\, \alpha_{1:2}\, T O_3$$
$$\vdots$$
$$\alpha_{1:\tau} = c_\tau^{-1}\, \alpha_{1:\tau-1}\, T O_\tau$$

$\prod_{i=1}^{\tau} c_i$ can then be interpreted as the probability, or likelihood, of obtaining the observed sequence, summed over all possible final states, i.e.,

$$P(y_1, y_2, \ldots, y_\tau \mid T) = \prod_{i=1}^{\tau} c_i$$
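A minimal sketch of this scaled forward pass, assuming the diagonal observation matrices O_t have been built as described above (variable and function names are illustrative):

```python
import numpy as np

def forward_scaled(Gamma, T, O):
    """Scaled forward pass of A.1. Gamma: start distribution (1 on the
    silent state); T: (N, N) transition matrix; O: list of (N, N)
    diagonal observation matrices, one per observed symbol after the
    start symbol. Returns the scaled forward vectors, the scaling
    factors c_t, and the log-likelihood of the sequence."""
    alphas, cs = [], []
    a = Gamma
    for O_t in O:
        a = a @ T @ O_t        # propagate, then mask non-emitting states
        c = a.sum()            # normalization factor c_t
        a = a / c
        alphas.append(a)
        cs.append(c)
    loglik = np.log(cs).sum()  # log of prod_t c_t
    return np.array(alphas), np.array(cs), loglik
```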

A.2 Estimating backward probabilities

The backward variable β can be calculated similarly, starting with the last observed syllable, which gives β_{τ:τ} = 1:

$$\beta_{\tau-1:\tau} = c_\tau^{-1}\, T O_\tau \beta_{\tau:\tau}$$
$$\vdots$$
$$\beta_{2:\tau} = c_3^{-1}\, T O_3 \beta_{3:\tau}$$

The β’s are scaled using the same factors c as the α’s. The two can be combined

to obtain the probability distribution γt of being in each state at every point t in

the sequence

$$\gamma_t(h) = \alpha_{1:t}(h)\, \beta_{t+1:\tau}(h) = \frac{\alpha_t(h) \times \beta_t(h)}{\prod_t c_t}$$

A.3 New transition matrix T

Let us take K possible sequences of states that can generate the given syllable

sequence. At a particular step t, γt (h) gives the probability of finding state h. So

we expect that γt(h) K of the K state sequences have h in position t. So the total number


of times state h occurs, at any location, is expected to be

$$K \times \sum_{t=1}^{\tau} \gamma_t(h)$$

Now, consider the number of times a state hp at step t is followed by hq in step

t + 1. Probability that syllables y1→t occur and the state at t is hp is given by

αt (hp). Probability that syllables y1→t occur and the state at t is hp and the state

at t + 1 is hq is given by αt (hp) × Tpq. Probability that syllables y1→t occur and

the state at t is hp and the state at t+1 is hq and that the syllable yt+1 is observed

is given by αt (hp)× Tpq × Eq(yt+1). Probability that syllables y1→t occur and the

state at t is hp and the state at t+1 is hq and that the syllable yt+1 is observed and

the remaining syllables yt+2→τ occur is given by αt(hp) × Tpq × Eq(yt+1) × βt+1(hq).

The probability of the above given the observed sequence can be obtained by

dividing this by the probability of the observed sequence,

$$\zeta_t(p \to q) = \frac{\alpha_t(h_p) \times T_{pq} \times E_q(y_{t+1}) \times \beta_{t+1}(h_q)}{\prod_{t=1}^{\tau} c_t}$$

So the number of times a state hp at step t is followed by hq in step t + 1 is

ζt (p→ q)K. Hence the number of times a state hp at any step is followed by hq

in the next step is

$$K \times \sum_{t=1}^{\tau-1} \zeta_t(p \to q)$$

Combining the above two results, transition probabilities can be estimated to be

$$T_{pq} = \frac{\sum_{t=1}^{\tau-1} \zeta_t(p \to q)}{\sum_{t=1}^{\tau-1} \gamma_t(h_p)}$$

The equation above is used to update the transition probabilities in a given

iteration. The log of the likelihood $\prod_{i=1}^{\tau} c_i$ is calculated for each iteration, and the iterations are stopped once the difference between the log-likelihoods in iterations j and j + 1 is smaller than a set tolerance limit. A tolerance limit of 1 × 10−3 is typically used in all cases of model

inference in this dissertation.
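A sketch of one such re-estimation step using scaled forward and backward vectors; with scaled variables, a per-step normalization of ζ replaces the division by the product of the c's (an implementation detail we assume here, standard for scaled Baum-Welch):

```python
import numpy as np

def reestimate_T(T, E, alpha, beta, y):
    """One re-estimation of the transition matrix following A.3.
    E[q, j] is the 0/1 emission matrix; alpha, beta are the scaled
    forward/backward vectors; y is the symbol sequence."""
    N = T.shape[0]
    zeta_sum = np.zeros((N, N))   # sum over t of zeta_t(p -> q)
    gamma_sum = np.zeros(N)       # sum over t of gamma_t(h_p)
    for t in range(len(y) - 1):
        # zeta_t(p, q) ~ alpha_t(p) T[p,q] E[q, y_{t+1}] beta_{t+1}(q)
        zeta = (alpha[t][:, None] * T
                * E[:, y[t + 1]][None, :] * beta[t + 1][None, :])
        zeta /= zeta.sum()        # per-step normalization (scaled variables)
        zeta_sum += zeta
        gamma_sum += zeta.sum(axis=1)
    return zeta_sum / gamma_sum[:, None]   # row-normalized new T
```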


Appendix B

Confidence Intervals for Entropy of Sequences

The entropy of a distribution ideally should not depend on the number of sample

sequences available. Hence to find the error bounds on the log-likelihood, we should

first find the error bounds on the entropy, and then multiply by the total number

of observed sequences.

B.1 Subsampling

The basic assumption is that if we know the subsampled means and standard

deviation obtained from a data sample (set of song sequences in our case), we

can infer the true means and standard deviation. Why is there a true standard

deviation at all? If we had an infinite set of sequences, the standard deviation

would be zero. But we should account for the fact that we only have finite data.

The procedure to obtain the true mean and standard deviation and the subsampled

mean and standard deviation is as follows. We first pick any POMM and treat

this as our true source or generator. In the analysis below we consider a 10-state

POMM built on N = 850 sequences. We then create 1000 sets of N sequences

generated from this model, calculate the entropy for each set, and find the mean

and standard deviation of the distribution so obtained. This is the true mean

and standard deviation. We then consider a single set of N sequences as our

sample data. We consider a fraction α of these sequences, pick α × N sequences

by subsampling without replacement (1000 times), and calculate the sample mean and sample standard deviation of the entropy distribution obtained. We can now compare these to the true values.

Figure B.1. Scaling of sample and true means and standard deviations: the ratio of the sample mean to the true mean (left) and of the sample standard deviation to the true standard deviation (right), as functions of the subsampled fraction.

Fig. B.1 shows the ratio of the sample mean and standard deviation to the true

mean and standard deviation for different fractions α of sequences picked. We can

use these curves to obtain error bounds on the maximum log-likelihood attainable

by a model inferred from N sequences.
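A sketch of the subsampling step, assuming an entropy_fn that computes the entropy of a set of sequences (both function names are illustrative):

```python
import numpy as np

def subsampled_entropy_stats(sequences, frac, entropy_fn,
                             n_resamples=1000, seed=0):
    """Sample mean and standard deviation of the entropy over subsamples
    of a fixed fraction of the sequences, drawn without replacement."""
    rng = np.random.default_rng(seed)
    n_pick = int(frac * len(sequences))
    values = []
    for _ in range(n_resamples):
        idx = rng.choice(len(sequences), size=n_pick, replace=False)
        values.append(entropy_fn([sequences[i] for i in idx]))
    return np.mean(values), np.std(values)
```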


Appendix C

Finding Dominant Transitions in a POMM

Based on the work by Serrano et al (2009) [58] on complex weighted networks, we

find the backbone of the POMM, or the set of significantly dominant transitions

in the POMM, by testing each transition against a null hypothesis. The null

hypothesis is that the transition probabilities into or out of a state are produced

by a random assignment from a uniform distribution. We can define a p-value for

each transition - the probability that if the null hypothesis is true, then the variable

under consideration would have a value greater than or equal to the observed value.

Dominant transitions in the syntax are defined to be the transitions for which the

null hypothesis is rejected at an assigned significance level.

C.1 Random assignment of transition probabilities

from a uniform distribution

If a state has k transitions, we would like to find the ones that reject the null

hypothesis. A state has transitions both into it and out of it - kin transitions into,

and kout transitions out of the state. The following analysis is applicable to either

set, and we will therefore consider k transitions out of the state for purposes of

explanation.

Since the sum of all k transition probabilities out of a state must be unity, the

problem is to divide the interval (0, 1] into k pieces. This is possible by choosing


k − 1 points within the interval. Let the distribution associated with k random

intervals be defined as pk. We will prove by induction that $p_k(x)\,dx = (k-1)(1-x)^{k-2}\,dx$. All intervals are equivalent, and therefore to find the probability density

function for the random assignment of interval sizes from a uniform distribution,

we could solve the problem of finding the probability of assigning a length to any

one of the k intervals.

Consider the equivalent problem of finding the distance x between two ran-

domly chosen adjacent points on a line of length 1. We arbitrarily decide to keep

track of the distance between the end of the line corresponding to x = 0 and the 1st

point in line beyond it; i.e., the 1st interval. Assume we have k − 2 points already

chosen on the line such that there are k−1 intervals. The probability that the first

interval is of length q is pk−1(q). Now if we want to choose a new point adjacent

to the point that is currently first in line, this point could be chosen at a location

closer or further away from x = 0 than the current first point. One possibility is

that the new point is at location x with probability 1 and the point that was pre-

viously first in line was at a position beyond x. The probability of this occurrence

is given by 1 ×∫ 1

xpk−1(y)dy. The other possibility is that the point already first

in line was at x and the new point is chosen beyond x - the probability of this is

pk−1(x)∫ 1

xp(y)dy where p(y) = 1. Therefore the probability of the first interval

being of length x if there is 1 point or 2 intervals already is

$$p_3(x)\,dx = \left( 1 \times \int_x^1 p_2(y)\,dy + p_2(x) \int_x^1 p(y)\,dy \right) dx = \left[ (1-x) + (1-x) \right] dx = 2(1-x)\,dx$$

If the state has only one transition, i.e., k = 1, it is trivially true that p1(x) = 0. We have already made the argument that p2(x) = 1 and p3(x) = 2(1 − x). Let us assume that it is true that for k intervals

$$p_k(x)\,dx = (k-1)(1-x)^{k-2}\,dx \qquad (C.1)$$

Then, by the same construction,

$$p_{k+1}(x)\,dx = k(1-x)^{k-1}\,dx \qquad (C.2)$$


Figure C.1. The probability density function pk(x) for a variable taking a value based on random assignment from a uniform distribution decreases monotonically.

By the principle of induction, Eq. (C.1) is true for all k.

C.2 Assignment of significance levels

For any POMM, we want to calculate the probability that each transition pij from

state i to state j is compatible with the null hypothesis. This is done by calculating

a p-value αij - the probability under a true null hypothesis of obtaining a value for

the transition probability that is equal to or greater than the observed probability

pij. This is because Eq. (C.1) describes a monotonically decreasing function as

shown for k = 5 in the sketch in Fig. C.1 - which means that all low-probability

events occur at values greater than pij as shown by the shaded region. The p-value

αij is therefore given by

$$\alpha_{ij} = \int_{p_{ij}}^{1} p_k(x)\,dx \qquad (C.3)$$
$$= 1 - \int_{0}^{p_{ij}} p_k(x)\,dx \qquad (C.4)$$
$$= 1 - (k-1) \int_{0}^{p_{ij}} (1-x)^{k-2}\,dx = (1 - p_{ij})^{k-1} \qquad (C.5)$$

Each transition pij in the POMM can be associated with a p-value αij. We

can assign a global significance level α to the p-values to determine the dominant

transitions. Small values of α correspond to the most dominant transitions, i.e.,

αij < α.
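A sketch of this thresholding using the closed form from Eq. (C.5) (the function name is illustrative):

```python
import numpy as np

def dominant_transitions(T, alpha=0.05):
    """Boolean mask of the transitions that reject the null hypothesis,
    using alpha_ij = (1 - p_ij)^(k-1) from Eq. (C.5), where k is the
    number of outgoing transitions of state i."""
    backbone = np.zeros_like(T, dtype=bool)
    for i in range(T.shape[0]):
        out = np.nonzero(T[i])[0]   # transitions actually present
        k = len(out)
        if k < 2:
            continue                # p-value undefined for k = 1 (p1(x) = 0)
        p_values = (1.0 - T[i, out]) ** (k - 1)
        backbone[i, out] = p_values < alpha
    return backbone
```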


Bibliography

[1] Kuiper, K. and J. Nokes (2013) Theories of Syntax: Concepts and Case Studies, Palgrave Macmillan.

[2] Bickerton, D. and E. Szathmary (2009) Biological foundations and origin of syntax, MIT Press.

[3] Chomsky, N. (1957) Syntactic Structures, Mouton, The Hague.

[4] Berwick, R. C., K. Okanoya, G. J. L. Beckers, and J. J. Bolhuis (2011) “Songs to syntax: the linguistics of birdsong.” Trends in Cognitive Sciences, 15(3), pp. 113–21.

[5] Katahira, K., K. Suzuki, K. Okanoya, and M. Okada (2011) “Complex sequencing rules of birdsong can be explained by simple hidden Markov processes.” PLoS ONE, 6(9), p. e24516.

[6] Jin, D. Z. and A. A. Kozhevnikov (2011) “A compact statistical model of the song syntax in Bengalese finch.” PLoS Computational Biology, 7(3), p. e1001108.

[7] Hopcroft, J. E. (1979) Introduction to Automata Theory, Languages, and Computation, Pearson Education India.

[8] Callut, J. and P. Dupont (2004) “A Markovian approach to the induction of regular string distributions,” ICGI, 3264, pp. 77–90.

[9] Wilbrecht, L. and F. Nottebohm (2003) “Vocal learning in birds and humans.” Mental Retardation and Developmental Disabilities Research Reviews, 9(3), pp. 135–48.

[10] Deacon, T. W. (1998) The Symbolic Species: The Co-evolution of Language and the Brain, WW Norton & Company.


[11] Marler, P. R. and H. Slabbekoorn (2004) Nature’s Music: The Science of Birdsong, Academic Press.

[12] Reiss, D. and B. McCowan (1993) “Spontaneous vocal mimicry and production by bottlenose dolphins (Tursiops truncatus): evidence for vocal learning.” Journal of Comparative Psychology, 107(3), p. 301.

[13] Payne, K., P. Tyack, and R. Payne (1983) “Progressive changes in the songs of humpback whales (Megaptera novaeangliae): a detailed analysis of two seasons in Hawaii,” Communication and Behavior of Whales, pp. 9–57.

[14] Foote, A. D., R. M. Griffin, D. Howitt, L. Larsson, P. J. O. Miller, and A. R. Hoelzel (2006) “Killer whales are capable of vocal learning.” Biology Letters, 2(4), pp. 509–12.

[15] Sanvito, S., F. Galimberti, and E. H. Miller (2007) “Observational Evidences of Vocal Learning in Southern Elephant Seals: a Longitudinal Study,” Ethology, 113(2), pp. 137–146.

[16] Pistorio, A. L., B. Vintch, and X. Wang (2006) “Acoustic analysis of vocal development in a New World primate, the common marmoset (Callithrix jacchus),” The Journal of the Acoustical Society of America, 120(3), pp. 1655–1670.

[17] Boughman, J. W. (1998) “Vocal learning by greater spear-nosed bats.” Proceedings. Biological Sciences / The Royal Society, 265(1392), pp. 227–33.

[18] Poole, J. H., P. L. Tyack, A. S. Stoeger-Horwath, and S. Watwood (2005) “Animal behaviour: elephants are capable of vocal learning,” Nature, 434(7032), pp. 455–456.

[19] Arriaga, G., E. P. Zhou, and E. D. Jarvis (2012) “Of mice, birds, and men: the mouse ultrasonic song system has some features similar to humans and song-learning birds.” PLoS ONE, 7(10), p. e46610.

[20] Williams, H. (2004) “Birdsong and singing behavior.” Annals of the New York Academy of Sciences, 1016, pp. 1–30.

[21] Doupe, A. J. and P. K. Kuhl (1999) “Bird Song and Human Speech: Common Themes and Mechanisms,” Annu. Rev. Neurosci., 22, pp. 567–631.

[22] Bolhuis, J. J., K. Okanoya, and C. Scharff (2010) “Twitter evolution: converging mechanisms in birdsong and human speech.” Nature Reviews Neuroscience, 11(11), pp. 747–59.


[23] Thorpe, W. H. (1954) “The Process of Song-Learning in the Chaffinch as Studied by Means of the Sound Spectrograph,” Nature, 173(4402), pp. 465–469.

[24] Marler, P. and D. Isaac (1960) “Song variation in a population of Brown Towhees,” The Condor, 62(4), pp. 272–283.

[25] ——— (1960) “Physical analysis of a simple bird song as exemplified by the Chipping Sparrow,” The Condor, 62(2), pp. 124–135.

[26] Nottebohm, F., T. M. Stokes, and C. M. Leonard (1976) “Central control of song in the canary, Serinus canarius.” The Journal of Comparative Neurology, 165(4), pp. 457–86.

[27] Marler, P. (1970) “A comparative approach to vocal learning: song development in White-crowned Sparrows.” Journal of Comparative and Physiological Psychology, 71(2p2), p. 1.

[28] Calder, W. A. (1970) “Respiration during song in the canary (Serinus canaria),” Comparative Biochemistry and Physiology, 32(2), pp. 251–258.

[29] Hinde, R. (1958) “Alternative motor patterns in chaffinch song,” Animal Behaviour, 6(3), pp. 211–218.

[30] Doupe, A. J. and M. Konishi (1991) “Song-selective auditory circuits in the vocal control system of the zebra finch,” Proceedings of the National Academy of Sciences, 88(24), pp. 11339–11343.

[31] Konishi, M. and E. Akutagawa (1985) “Neuronal growth, atrophy and death in a sexually dimorphic song nucleus in the zebra finch brain,” Nature, 315, pp. 145–147.

[32] Jurgens, U. (2002) “Neural pathways underlying vocal control.” Neuroscience and Biobehavioral Reviews, 26(2), pp. 235–58.

[33] Kroodsma, D. E. and M. Konishi (1991) A suboscine bird (eastern phoebe, Sayornis phoebe) develops normal song without auditory feedback, Elsevier Masson.

[34] Suthers, R., W. Fitch, R. Fay, and A. Popper (2016) Vertebrate Sound Production and Acoustic Communication, Springer Handbook of Auditory Research, Springer International Publishing.

[35] Jurgens, U. (2009) “The neural control of vocalization in mammals: a review.” Journal of Voice, 23(1), pp. 1–10.


[36] Jarvis, E. D., O. Gunturkun, L. Bruce, A. Csillag, H. Karten, W. Kuenzel, L. Medina, G. Paxinos, D. J. Perkel, T. Shimizu, G. Striedter, J. M. Wild, G. F. Ball, J. Dugas-Ford, S. E. Durand, G. E. Hough, S. Husband, L. Kubikova, D. W. Lee, C. V. Mello, A. Powers, C. Siang, T. V. Smulders, K. Wada, S. A. White, K. Yamamoto, J. Yu, A. Reiner, and A. B. Butler (2005) “Avian brains and a new understanding of vertebrate brain evolution,” Nature Reviews Neuroscience, 6(2), pp. 151–159.

[37] Elemans, C. P. H., J. H. Rasmussen, C. T. Herbst, D. N. During, S. A. Zollinger, H. Brumm, K. Srivastava, N. Svane, M. Ding, O. N. Larsen, S. J. Sober, and J. G. Svec (2015) “Universal mechanisms of sound production and control in birds and mammals.” Nature Communications, 6, p. 8978.

[38] Albert, C. Y. and D. Margoliash (1996) “Temporal hierarchical control of singing in birds,” Science, 273(5283), pp. 1871–1875.

[39] Hahnloser, R. H. R., A. A. Kozhevnikov, and M. S. Fee (2002) “An ultra-sparse code underlies the generation of neural sequences in a songbird,” Nature, 419, pp. 65–70.

[40] Fee, M. S., A. A. Kozhevnikov, and R. H. R. Hahnloser (2004) “Neural mechanisms of vocal sequence generation in the songbird.” Annals of the New York Academy of Sciences, 1016, pp. 153–70.

[41] Troyer, T. W. (2013) “Neuroscience: The units of a song.” Nature, 495(7439), pp. 56–7.

[42] Amador, A., Y. S. Perl, G. B. Mindlin, and D. Margoliash (2013) “Elemental gesture dynamics are encoded by song premotor cortical neurons.” Nature, 495(7439), pp. 59–64.

[43] Fiete, I. R., R. H. Hahnloser, M. S. Fee, and H. S. Seung (2004) “Temporal sparseness of the premotor drive is important for rapid learning in a neural network model of birdsong,” Journal of Neurophysiology, 92(4), pp. 2274–2282.

[44] Fiete, I. R., W. Senn, C. Z. Wang, and R. H. Hahnloser (2010) “Spike-time-dependent plasticity and heterosynaptic competition organize networks to produce long scale-free sequences of neural activity,” Neuron, 65(4), pp. 563–576.

[45] Jin, D. Z. (2009) “Generating variable birdsong syllable sequences with branching chain networks in avian premotor nucleus HVC,” Physical Review E, 80(5), pp. 1–13.


[46] Okubo, T. S., E. L. Mackevicius, H. L. Payne, G. F. Lynch, and M. S. Fee (2015) “Growth and splitting of neural sequences in songbird vocal development,” Nature, 528(7582), pp. 352–357.

[47] Jin, D. Z., F. M. Ramazanoğlu, and H. S. Seung (2007) “Intrinsic bursting enhances the robustness of a neural network model of sequence generation by avian brain area HVC.” Journal of Computational Neuroscience, 23(3), pp. 283–99.

[48] Basharin, G. P., A. N. Langville, and V. A. Naumov (2004) “The life and work of A. A. Markov,” Linear Algebra and its Applications, 386, pp. 3–26.

[49] Dobson, C. W. and R. E. Lemon (1979) “Markov sequences in songs of American thrushes,” Behaviour, 68(1), pp. 86–105.

[50] Okanoya, K. (2004) “The Bengalese finch: a window on the behavioral neurobiology of birdsong syntax,” Annals of the New York Academy of Sciences, 1016(1), pp. 724–735.

[51] Wittenbach, J. D., K. E. Bouchard, M. S. Brainard, and D. Z. Jin (2015) “An Adapting Auditory-motor Feedback Loop Can Contribute to Generating Vocal Repetition.” PLoS Computational Biology, 11(10), p. e1004471.

[52] Lyons, J. (1981) Language and Linguistics, Cambridge University Press.

[53] Hennie, F. C. (1968) Finite-State Models for Logical Machines, John Wiley & Sons.

[54] Bourlard, H. and S. Bengio (2002) Hidden Markov Models and other Finite State Automata for Sequence Processing, 2nd ed., MIT Press, Cambridge, MA, USA.

[55] Rabiner, L. R. and B. Juang (2007) “An introduction to hidden Markov models.” Current Protocols in Bioinformatics, (January).

[56] Dupont, P. (2005) “Inducing Hidden Markov Models to Model Long-Term Dependencies,” Machine Learning: ECML, pp. 513–521.

[57] Moon, T. K. (1996) “The expectation-maximization algorithm,” IEEE Signal Processing Magazine, 13(6), pp. 47–60.

[58] Serrano, M. Á., M. Boguñá, and A. Vespignani (2009) “Extracting the multiscale backbone of complex weighted networks,” Proceedings of the National Academy of Sciences, 106(16), pp. 6483–6488.


[59] Woolley, S. M. N. and E. W. Rubel (1997) “Bengalese Finches Lonchura striata domestica Depend upon Auditory Feedback for the Maintenance of Adult Song,” J. Neurosci., 17(16), pp. 6380–6390.

[60] Okanoya, K. and A. Yamaguchi (1997) “Adult Bengalese finches (Lonchura striata var. domestica) require real-time auditory feedback to produce normal song syntax.” Journal of Neurobiology, 33(4), pp. 343–56.

[61] Rajan, R. and A. J. Doupe (2013) “Behavioral and neural signatures of readiness to initiate a learned motor sequence.” Current Biology, 23(1), pp. 87–93.

[62] Nordeen, K. W. and E. J. Nordeen (1992) “Auditory feedback is necessary for the maintenance of stereotyped song in adult zebra finches,” Behavioral and Neural Biology, 57(1), pp. 58–66.

[63] Lombardino, A. J. and F. Nottebohm (2000) “Age at deafening affects the stability of learned song in adult male zebra finches,” The Journal of Neuroscience, 20(13), pp. 5054–5064.

[64] Sakata, J. T. and M. S. Brainard (2008) “Online Contributions of Auditory Feedback to Neural Activity in Avian Song Control Circuitry,” Journal of Neuroscience, 28(44), pp. 11378–11390.

[65] Payne, R. S. and S. McVay (1971) “Songs of humpback whales,” Science, 173(3997), pp. 585–597.

[66] Payne, K. and R. Payne (1985) “Large scale changes over 19 years in songs of humpback whales in Bermuda,” Zeitschrift für Tierpsychologie, 68(2), pp. 89–114.

[67] Helweg, D. A., A. S. Frankel, J. R. Mobley Jr, and L. M. Herman (1992) “Humpback whale song: our current understanding,” in Marine Mammal Sensory Systems, Springer, pp. 459–483.

[68] Garland, E. C., A. W. Goldizen, M. L. Rekdahl, R. Constantine, C. Garrigue, N. D. Hauser, M. M. Poole, J. Robbins, and M. J. Noad (2011) “Dynamic horizontal cultural transmission of humpback whale song at the ocean basin scale,” Current Biology, 21(8), pp. 687–691.

[69] Suzuki, R., J. R. Buck, and P. L. Tyack (2006) “Information entropy of humpback whale songs.” The Journal of the Acoustical Society of America, 119(3), pp. 1849–1866.

[70] Cummings, J. and M. Mega (2003) Neuropsychiatry and Behavioral Neuroscience, Oxford University Press.


[71] Robertson, M. M. (2000) “Tourette syndrome, associated conditions and the complexities of treatment,” Brain, 123(3), pp. 425–462.

[72] Marler, P. and S. Peters (1982) “Structural changes in song ontogeny in the swamp sparrow Melospiza georgiana,” The Auk, pp. 446–458.

[73] Dooling, R. and M. Searcy (1980) “Early perceptual selectivity in the swamp sparrow,” Developmental Psychobiology, 13(5), pp. 499–506.

[74] Mota, P. G. and G. C. Cardoso (2001) “Song organisation and patterns of variation in the serin (Serinus serinus),” Acta Ethologica, 3(2), pp. 141–150.

[75] Guttinger, H. R. (1985) “Consequences of domestication on the song structures in the canary,” Behaviour, 94(3), pp. 254–278.

[76] ——— (1979) “The Integration of Learnt and Genetically Programmed Behaviour,” Zeitschrift für Tierpsychologie, 49(3), pp. 285–303.

[77] Podos, J. (1997) “A performance constraint on the evolution of trilled vocalizations in a songbird family (Passeriformes: Emberizidae),” Evolution, 51(2), pp. 537–551.

[78] Sprau, P., T. Roth, V. Amrhein, and M. Naguib (2013) “The predictive value of trill performance in a large repertoire songbird, the nightingale Luscinia megarhynchos,” Journal of Avian Biology, 44(6), pp. 567–574.

[79] Hailman, J. P. and M. S. Ficken (1986) “Combinatorial animal communication with computable syntax: chick-a-dee calling qualifies as language by structural linguistics,” Animal Behaviour, 34(6), pp. 1899–1901.

[80] Bloomfield, L. L., I. Charrier, and C. B. Sturdy (2004) “Note types and coding in parid vocalizations. II: The chick-a-dee call of the mountain chickadee (Poecile gambeli),” Canadian Journal of Zoology, 82(5), pp. 780–793.

[81] Ficken, M. S., E. D. Hailman, and J. P. Hailman (1994) “The chick-a-dee call system of the Mexican chickadee,” Condor, pp. 70–82.

[82] Leger, D. W. (2005) “First documentation of combinatorial song syntax in a suboscine passerine species,” The Condor, 107(4), pp. 765–774.

[83] Helekar, S., G. Espino, A. Botas, and D. Rosenfield (2003) “Development and adult phase plasticity of syllable repetitions in the birdsong of captive zebra finches (Taeniopygia guttata).” Behavioral Neuroscience, 117(5), p. 939.


[84] Marler, P. and V. Sherman (1983) “Song structure without auditory feedback: emendations of the auditory template hypothesis.” The Journal of Neuroscience, 3(3), pp. 517–531.

[85] Guttinger, H. R. (1981) “Self-differentiation of Song Organization Rules by Deaf Canaries,” Zeitschrift für Tierpsychologie, 56(4), pp. 323–340.

[86] Gardner, T. J., F. Naef, and F. Nottebohm (2005) “Freedom and rules: the acquisition and reprogramming of a bird’s learned song.” Science, 308(5724), pp. 1046–1049.

[87] Markowitz, J. E., E. Ivie, L. Kligler, and T. J. Gardner (2013) “Long-range Order in Canary Song.” PLoS Computational Biology, 9(5), p. e1003052.

[88] Olveczky, B. P., A. S. Andalman, and M. S. Fee (2005) “Vocal experimentation in the juvenile songbird requires a basal ganglia circuit,” PLoS Biology, 3(5), p. e153.

[89] Lippmann, R. P. (1997) “Speech recognition by machines and humans,” Speech Communication, 22(1), pp. 1–15.

[90] Anderson, S. E., A. S. Dave, and D. Margoliash (1996) “Template-based automatic recognition of birdsong syllables from continuous recordings,” The Journal of the Acoustical Society of America, 100(2), pp. 1209–1219.

[91] Kogan, J. A. and D. Margoliash (1998) “Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: A comparative study,” The Journal of the Acoustical Society of America, 103(4), pp. 2185–2196.

[92] Tachibana, R. O., N. Oosugi, and K. Okanoya (2014) “Semi-automatic classification of birdsong elements using a linear support vector machine,” PLoS ONE, 9(3), p. e92584.

[93] Tchernichovski, O., F. Nottebohm, C. E. Ho, B. Pesaran, and P. P. Mitra (2000) “A procedure for an automated measurement of song similarity,” Animal Behaviour, 59(6), pp. 1167–1176.

[94] Janata, P. (2001) “Quantitative assessment of vocal development in the zebra finch using self-organizing neural networks,” The Journal of the Acoustical Society of America, 110(5), pp. 2593–2603.


[95] Du, P. and T. W. Troyer (2006) “A segmentation algorithm for zebra finch song at the note level,” Neurocomputing, 69(10), pp. 1375–1379.

[96] Savitzky, A. and M. J. Golay (1964) “Smoothing and differentiation of data by simplified least squares procedures.” Analytical Chemistry, 36(8), pp. 1627–1639.

[97] Bishop, C. M. (2006) Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc.

[98] Boser, B. E., I. M. Guyon, and V. N. Vapnik (1992) “A Training Algorithm for Optimal Margin Classifiers,” in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, ACM Press, pp. 144–152.

[99] Kim, H.-C., S. Pang, H.-M. Je, D. Kim, and S. Y. Bang (2002) “Pattern classification using support vector machine ensemble,” in Pattern Recognition, 2002. Proceedings. 16th International Conference on, vol. 2, IEEE, pp. 160–163.

[100] Zabell, S. L. (1992) “Predicting the unpredictable,” Synthese, 90(2), pp. 205–232.

[101] Teh, Y. W., M. I. Jordan, M. J. Beal, and D. M. Blei (2012) “Hierarchical Dirichlet processes,” Journal of the American Statistical Association.

[102] Beal, M. J., Z. Ghahramani, and C. E. Rasmussen (2001) “The infinite hidden Markov model,” in Advances in Neural Information Processing Systems, pp. 577–584.

[103] Van Gael, J., Y. Saatci, Y. W. Teh, and Z. Ghahramani (2008) “Beam sampling for the infinite hidden Markov model,” in Proceedings of the 25th International Conference on Machine Learning, ACM, pp. 1088–1095.

[104] Durbin, R., S. R. Eddy, A. Krogh, and G. Mitchison (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press.

[105] Podos, J., S. Nowicki, and S. Peters (1999) “Permissiveness in the learning and development of song syntax in swamp sparrows,” Animal Behaviour, 58(1), pp. 93–103.

[106] Fujimoto, H., T. Hasegawa, and D. Watanabe (2011) “Neural coding of syntactic structure in learned vocalizations in the songbird,” The Journal of Neuroscience, 31(27), pp. 10023–10033.


[107] Prather, J., S. Peters, R. Mooney, and S. Nowicki (2012) “Sensory constraints on birdsong syntax: neural responses to swamp sparrow songs with accelerated trill rates,” Animal Behaviour, 83(6), pp. 1411–1420.


Vita

Sumithra Surendralal

Education

Doctor of Philosophy, Physics, August 2016
The Pennsylvania State University, University Park, PA

Master of Science, Physics, May 2009
University of Madras, Chennai, India

Bachelor of Science, Physics, May 2007
Women’s Christian College, Chennai, India

Work Experience

Research Assistant, Apr-Jul 2010
Institute of Mathematical Sciences, Chennai, India

Awards

Outstanding Physics Teaching Assistant Award, 2016
American Association of Physics Teachers (AAPT)

The Professor Stanley Shepherd Graduate Teaching Assistant Award, 2016
The Pennsylvania State University, University Park, PA

Graduate Teaching Assistant Award, 2013
The Pennsylvania State University, University Park, PA