The Pennsylvania State University
The Graduate School
Eberly College of Science
STATISTICAL INFERENCE OF SYNTAX FROM VOCAL
SEQUENCES AND IMPLICATIONS FOR NEURAL MECHANISMS
A Dissertation in
Physics
by
Sumithra Surendralal
© 2016 Sumithra Surendralal
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
August 2016
The dissertation of Sumithra Surendralal was reviewed and approved∗ by the following:
Dezhe Z. Jin
Associate Professor of Physics
Dissertation Advisor, Chair of Committee
Patrick J. Drew
Assistant Professor of Engineering Science and Mechanics
Assistant Professor of Neurosurgery
Assistant Professor of Biomedical Engineering
John C. Collins
Distinguished Professor of Physics
Lu Bai
Assistant Professor of Biochemistry and Molecular Biology
Assistant Professor of Physics
Nitin Samarth
Professor of Physics
George A. and Margaret M. Downsbrough Department Head
∗Signatures are on file in the Graduate School.
Abstract
Learned vocalization in animals is a fascinating natural behavior, the neural and peripheral mechanisms behind which are not completely known. In vocal sequences, acoustic units called syllables are produced following certain learned rules, or syntax. In order to understand the putative neural correlates of this behavior, we first need a quantitative description of the behavior itself. A vocal sequence, say AAABCCCDD, can be parsed into two structures - a non-repeat structure ABCD, on which is imposed a repeat structure A(3)B(1)C(3)D(2). In this dissertation we develop statistical methods to infer concise, finite-state characterizations of both these aspects of the syntax from observed vocal sequences. The need to exercise caution in assigning vocal sequences to syntactic categories based on small sample sizes is emphasized by designing measures that place bounds on model categories that can be inferred from observations. In particular, we focus on the Partially Observable Markov Model (POMM) - a model with a Markov chain of abstract, hidden states that have a many-to-one mapping to observed syllables - to characterize the non-repeat structure. Through careful quantitative analysis of observed data, we show that the normal song syntax of the Bengalese finch is consistent with the features expected from a POMM. The songs of deafened birds show a deviation from this normal structure. In a statistically significant number of cases among the birds studied, the loss of auditory feedback results in a loss of the many-to-one mappings. The observations suggest that auditory feedback can induce complexity in the Bengalese finch song syntax, but is not sufficient to explain complexity entirely. We suggest that the Bengalese finch song syntax is encoded in the interplay between auditory feedback and the intrinsic song-generating circuitry.

Finally, in canary and swamp sparrow songs we show that there is an exact inverse relationship between syllable duration and the most probable number of repetitions of the syllable. Such a precise relationship indicates the existence of fundamental biological constraints on the performance of syllable repeats.
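The two-structure parse of a vocal sequence described above can be made concrete with a short sketch (an illustration only, not code from the dissertation):

```python
def parse_sequence(seq):
    """Split a vocal sequence into its non-repeat structure and the repeat
    counts imposed on it, e.g. 'AAABCCCDD' -> ('ABCD', [3, 1, 3, 2])."""
    structure, counts = [], []
    for syllable in seq:
        if structure and structure[-1] == syllable:
            counts[-1] += 1             # same syllable repeated: bump its count
        else:
            structure.append(syllable)  # new syllable: extend the non-repeat structure
            counts.append(1)
    return "".join(structure), counts

print(parse_sequence("AAABCCCDD"))  # ('ABCD', [3, 1, 3, 2])
```

The non-repeat structure and the repeat structure can then be modeled separately, as outlined in the abstract.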
Table of Contents
List of Figures
List of Tables
List of Symbols and Abbreviations
Acknowledgments

Chapter 1: Introduction
  1.1 Animal vocal sequences
    1.1.1 Songbirds
  1.2 Neurobiology of sequence generation
    1.2.1 The song production system in the avian brain
    1.2.2 Neural models of sequence generation involving the HVC
  1.3 Models of syntax for vocal sequences
    1.3.1 The Chomsky hierarchy
  1.4 Structure of the dissertation

Chapter 2: Partially Observable Markov Models of Sequence Syntax
  2.1 Introduction
  2.2 Model representation - Finite state machines
    2.2.1 Markov models
    2.2.2 Hidden Markov Model (HMM)
    2.2.3 Partially Observable Markov Model (POMM)
  2.3 Model evaluation
    2.3.1 Measures of model fit
      2.3.1.1 Sequence completeness
      2.3.1.2 Log-likelihood
    2.3.2 Measures of sequence similarity
      2.3.2.1 Repeat distributions
      2.3.2.2 n-gram distributions
      2.3.2.3 Step distributions
  2.4 Model inference - inference of a POMM from data
    2.4.1 Expectation maximization - Baum-Welch algorithm
    2.4.2 Grid search for an optimal model
    2.4.3 Establishing error bounds
    2.4.4 Grid search stopping criterion
    2.4.5 Finding the optimal state vector
    2.4.6 Reduced representation by filtering non-dominant transitions
    2.4.7 Checking for equivalent POMMs - state-merging
    2.4.8 Demonstration of grid search with a toy model

Chapter 3: Comparison of Syntactic Structures
  3.1 The Bengalese finch
    3.1.1 Description of data
    3.1.2 Identification of songs from recording transcriptions
  3.2 Syntax of Bengalese finch song
    3.2.1 Changes in syntax caused by deafening
    3.2.2 Persistence of dominant transitions after deafening
  3.3 Humpback whale
    3.3.1 Description of data
    3.3.2 Challenges in inferring the syntax of humpback whale song
    3.3.3 Markov model of humpback whale themes

Chapter 4: Repeat Structure in Vocal Sequences
  4.1 Syllable repetitions in multiple species
  4.2 Distribution of the number of syllable repetitions
    4.2.1 Sigmoidal model of adaptation
  4.3 Evidence of inverse relationship between syllable duration and most probable repeat number
    4.3.1 Swamp sparrow
    4.3.2 Canary
    4.3.3 Bengalese finch
  4.4 Other calculations
    4.4.1 Distribution of phrase duration and repeat number
    4.4.2 Exponential distributions lead to inverse relationship
  4.5 Mechanisms of repeat generation
    4.5.1 Auditory feedback could regulate repetition
    4.5.2 Constrained phrase duration

Chapter 5: Semi-automated Classification of Song Syllables
  5.1 Morphology of a song
  5.2 Identification of song syllables
  5.3 Semi-automated classification of song syllables
    5.3.1 Support Vector Machines
    5.3.2 Syllable features for classification
      5.3.2.1 Duration
      5.3.2.2 Wiener entropy
      5.3.2.3 Hough transform
    5.3.3 SVM ensembles
    5.3.4 Transcription of a song

Chapter 6: Conclusion
  6.1 Partially Observable Markov Model - inference and evaluation
  6.2 Comparison of syntactic structures
  6.3 Statistics of syllable repetitions

Appendix A: Baum-Welch Algorithm for estimation of POMM parameters
  A.1 Estimating forward probabilities
  A.2 Estimating backward probabilities
  A.3 New transition matrix T

Appendix B: Confidence Intervals for Entropy of Sequences
  B.1 Subsampling

Appendix C: Finding Dominant Transitions in a POMM
  C.1 Random assignment of transition probabilities from a uniform distribution
  C.2 Assignment of significance levels

Bibliography
List of Figures
1.1 The song system in the songbird brain
1.2 Branching synfire chain in HVC for songs with probabilistic transitions between syllables
1.3 Chomsky hierarchy of languages

2.1 State transition diagram for a Finite State Machine
2.2 State transition diagram for a simple Markov process
2.3 Example of an HMM
2.4 Example of a POMM
2.5 Empirical probability distribution over observed sequences
2.6 Sequence completeness distributions under Markov models
2.7 Sequence completeness depends on the number of sequences available for model inference
2.8 Sequence completeness depends on the repeat probabilities of symbols in the Markov model
2.9 Sequence completeness depends on the sparsity of the transition matrix measured by the state transition entropy
2.10 Schematic of search on a grid for an optimal model
2.11 Scaling of true and sample standard deviations as a function of the fraction of sub-sampled sequences
2.12 Construction of a sequence completeness distribution from the data
2.13 Inference of a POMM for a toy model

3.1 Song syntax of Bengalese finch, Bird 1, before and after deafening
3.2 Song syntax of Bengalese finch, Bird 2, before and after deafening
3.3 Song syntax of Bengalese finch, Bird 3, before and after deafening
3.4 Song syntax of Bengalese finch, Bird 4, before and after deafening
3.5 Song syntax of Bengalese finch, Bird 5, before and after deafening
3.6 Song syntax of Bengalese finch, Bird 6, before and after deafening
3.7 On average the state transition entropy increases and sequence length decreases after deafening
3.8 Sequence completeness of Bengalese finch song sequences under Markov models before and after deafening
3.9 p-values of sequence completeness of Bengalese finch song sequences under Markov models before and after deafening
3.10 Sequence completeness and corresponding p-values under syntax models with only the dominant transitions retained at significance levels α
3.11 Dominant transitions in the syntax after deafening
3.12 Transcription of part of a humpback whale song
3.13 Dependence of sequence entropy on the number of unique sub-sequences obtained using different segmentation lengths
3.14 Dependence of sequence entropy on segmentation length and system size using randomly generated sequences
3.15 Entropy of a segmented humpback whale sequence depends on the segmentation length k as well as the total system size S
3.16 Transcription of part of a humpback whale song with repeat units and themes highlighted
3.17 Markov model of themes in the songs of a population of humpback whales and n-gram distribution matches

4.1 Sample repeat distributions of an individual canary’s syllables and fits based on the sigmoidal model of adaptation
4.2 Inverse relationship between syllable duration and most probable number of repetitions for swamp sparrows
4.3 Inverse relationship between syllable duration and most probable repeat number for six individual canaries
4.4 Mean and modal number of repetitions show the inverse relationship with syllable duration for the canary population
4.5 Inverse relationship is not exact between syllable duration and most probable number of repetitions for Bengalese finches

5.1 Waveform of a Bengalese finch song with the corresponding spectrogram
5.2 Syllable types in the vocal repertoire of a canary
5.3 Linear boundary surface in an SVM separating labelled data
5.4 Non-linear SVM separating data points using a kernel function
5.5 Summary of the Hough transform
5.6 Separation of syllables in feature space based on duration, and the two Hough transform coordinates ρ and θ
5.7 Transcription of a Bengalese Finch song

A.1 Calculation of forward probabilities in the trellis of the observation sequence y1, y2, y2 for a 4-state POMM

B.1 Scaling of sample and true means and standard deviations

C.1 The probability density function for a variable taking a value based on random assignment from a uniform distribution
List of Tables
3.1 Bengalese finch song statistics
3.2 Sequence completeness and p-values for Bengalese finch song sequences under Markov models
3.3 Sequence completeness and p-values for pre-deafening sequences based on syntax post-deafening

5.1 Duration of syllables in a Bengalese Finch song
List of Symbols and Abbreviations
FSM Finite State Machines
POMM Partially Observable Markov Model
HMM Hidden Markov Model
T Transition Matrix
E Emission Matrix
SVM Support Vector Machines
Pc Sequence completeness
L Log-likelihood
Acknowledgments
I have a feeling I will look back on my years in graduate school as being enormously influential in changing my views about a lot of things in life - the most important of them being my ideas about what doing research in the sciences really involves - aspects I enjoy, and those I do not. I would like to thank Dezhe Jin, my adviser, for his support over the years and for giving me the opportunity to learn about one of the most fascinating areas of science - the generation of learned vocalizations in the animal kingdom. Dezhe is an excellent teacher - his theoretical mechanics class is perhaps the best physics class I have ever taken - and I aspire to his teaching standards. My deep gratitude to Jefferey Markowitz and Timothy Gardner, Boston University; Dana Moseley and Jeffrey Podos, University of Massachusetts, Amherst; Michael Noad, Cetacean Ecology and Acoustics Laboratory, The University of Queensland; Luca Lamoni and Luke Rendell, University of St Andrews; and Kristofer Bouchard and Michael Brainard, University of California, San Francisco, for generously sharing audio recordings and transcriptions without which this dissertation would not have been possible. Thanks to Phillip Schafer, Jason Wittenbach, Eugene Tupikov, and Leo Tavares, for all the discussions we have had, not just about work. Many thanks to Patrick Drew for checking in on how I was doing once in a while. It was immensely helpful to be able to knock at his office door seeking advice on several occasions. I am greatly indebted to Clare for listening to me. I would also like to thank the physics department for the opportunity to gain valuable teaching experience over the years. Thank you to the Center for Neural Engineering for the office space with windows (short-lived though that was) and for the many grad club discussions and pizza.

Thanks to my friends Ila, Lakshmy, Leo, Nithin, Bruce, Yisi, Riddhi, Neha, Dolon, Salini, Agnetha, Ramya, Kundan, Ganesh, and Latha for being there for me through many fun and not-so-fun times. Jaya aunty, and Rajan uncle, thank you for putting up with extended periods of absolutely no word from me while I figured things out. Amma, Achen, and Chandu, thank you for always cheering me on - even when I’ve been annoyingly difficult to deal with. And Sreejith - you know all that I want to say without me having to actually say it.
Dedication
To Valsu and Lal
Chapter 1

Introduction
Sequences in biological systems such as the locomotive actions of a worm, the ar-
rangement of nucleotides in a genome, and the waggle dance of the honey bee, are
fascinating natural phenomena, the analysis of which has the potential to give
valuable insights into the inner workings of biological systems. Much like the
sequence of numbers in a geometric progression, biological sequences are concatenations of
elements that are realized one after another following some pattern. In order to
understand a biological sequence, two components are necessary - a knowledge of
the rules that should be followed to create the sequence, and the machinery to
generate the sequence. The brain is responsible for both these components in the
case of motor sequences in higher organisms. Specialized networks of neurons in
the brain store the rules, access them when necessary, and facilitate the associ-
ated motor behavior. However, there is no complete understanding till date of
all the brain regions and neural mechanisms involved in neither the learning, nor
the generation, of a large number of motor sequences. Animal vocal sequences are
examples of such poorly-understood motor sequences. The focus of this disserta-
tion is on identifying statistical regularities in animal vocal sequences, inferring
generative rules, or what can broadly be called the syntax, of these sequences, and
establishing comparative measures to distinguish different categories of sequences,
in the context of aiding an understanding of the neurobiology of vocal sequence
generation.
Syntax, as defined by linguists, refers to the ordering of words within the sentence
of a language [1]. This ordering preserves associations between word categories
such as nouns, verbs, and prepositions. It is an association independent
of meaning or semantics, and the morphology of the words themselves. In the
context of animal vocal sequences, the syntax we refer to in this dissertation is
phonetical syntax - patterns of sounds that do not individually or collectively have
any referential meaning [2]. The main focus of the dissertation is on the syntax
of songbird songs. The beginning of neurobiological studies on sequence learn-
ing in songbirds happened in parallel with investigations into abstract syntactic
structures in linguistics led by Noam Chomsky [3]. This led to an emphasis on ab-
stract representations of song sequences such as finite state machines [4–6], in the
spirit of models of computation that were developed in the field of computational
linguistics as a consequence of Chomsky’s inquiry [7].
In this dissertation, we are particularly interested in a finite state represen-
tation of song syntax called the Partially Observable Markov Model [8] that has
been shown in one study to be a good characterization of the song syntax of the
Bengalese finch, a songbird [6]. Is there a prototypical syntax for the songs of a
particular species? What are the factors that influence syntactic structure? These
are both interesting questions. We are specifically interested in understanding the
role of auditory feedback in regulating the syntax. The songs of three songbird
species - Bengalese finches, canaries, and swamp sparrows are used in the research.
A short analysis of humpback whale songs is included to demonstrate that the
methods developed are generalizable to the vocal sequences of other animals.
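A minimal toy sketch may help fix the idea of a POMM before the formal treatment in Chapter 2. The states, syllables, and probabilities below are invented for illustration; this is not a model inferred in the dissertation. Several hidden states may emit the same syllable, so the state sequence is only partially observable from the syllables:

```python
import random

# Toy POMM (hypothetical example). Two hidden states, a1 and a2, both emit
# the syllable 'a' (a many-to-one mapping), so what follows an 'a' depends
# on hidden context that the syllable sequence alone does not reveal.
EMIT = {"a1": "a", "a2": "a", "b": "b", "c": "c"}
TRANS = {  # state -> list of (next state, transition probability)
    "start": [("a1", 1.0)],
    "a1":    [("b", 1.0)],
    "b":     [("a2", 1.0)],
    "a2":    [("c", 0.7), ("end", 0.3)],
    "c":     [("end", 1.0)],
}

def sample_song(rng):
    """Walk the hidden Markov chain from 'start' to 'end', emitting syllables."""
    state, song = "start", []
    while state != "end":
        next_states, probs = zip(*TRANS[state])
        state = rng.choices(next_states, weights=probs)[0]
        if state != "end":
            song.append(EMIT[state])
    return "".join(song)

print(sample_song(random.Random(0)))  # either 'aba' or 'abac'
```

A first-order Markov model over the syllables alone would assign 'a' a single outgoing distribution and could not express the fact that the first 'a' is always followed by 'b' while the second 'a' branches; the extra hidden state is what the many-to-one mapping buys.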
The data analyzed in this dissertation were obtained from several other research
groups. These include laboratory recordings and transcriptions of Bengalese finch
songs from Kristofer Bouchard and Michael Brainard, University of California,
San Francisco; similar data for canary songs from Jefferey Markowitz and Timothy
Gardner, Boston University; laboratory recordings of swamp sparrow songs from
Dana L. Moseley and Jeffrey Podos, University of Massachusetts, Amherst;
recordings of humpback whale songs collected in Eastern Australia by Michael
Noad, Cetacean Ecology and Acoustics Laboratory, The University of Queensland,
Australia, and transcriptions of these songs from Luca Lamoni and Luke
Rendell, University of St Andrews, Scotland. These datasets were analyzed using
tools developed based on ideas motivated by statistical inference. The results will
be presented in the context of implications for neural models.
1.1 Animal vocal sequences
Vocalizations in the animal kingdom can be broadly categorized into those that
are learned and those that are innate. Vocal learners are able to listen to the
vocalizations of another member of their species, or sometimes another species, and
hone their own vocalizations by trial and error to match that of a tutor [9]. Humans
are examples of such vocal learners. Vocal learners may also have some innate
vocalizations. Laughter in humans, for example, is a vocalization that is innate [10],
while everyday speech is learned. This co-existence of two modes of vocalization
is not limited to humans. Among other members of the animal kingdom, there is
evidence that three groups of birds - parrots, hummingbirds and songbirds [11], as
well as cetaceans such as dolphins [12], humpback whales [13] and killer whales [14],
pinnipeds such as seals [15], non-human primates such as marmosets [16], and other
mammals such as bats [17], and more recently, elephants [18] and mice [19], are
vocal learners.
Each vocal sequence is composed of basic acoustic elements, the alphabet of
the vocal sequence if you will, that have variously been referred to as syllables,
notes, and units. We will use the term syllables. Some of these sequences are more
complex than others, with complexity being a loosely-defined property that could
refer to large vocal repertoires composed of many syllables, syllables with a large
range of temporal and spectral modulations, highly non-random, long-range and
complex-correlated sequencing of the syllables, the ability to vocalize for a long
duration, or the number of unique song sequences, to name a few. One
could ideally imagine defining a scale using some such definition of complexity and
seeking to arrange vocal learners on this scale. In what sense are the vocalizations
of a human more complex than that of a canary, if that is indeed the case? For
the purposes of this dissertation, we choose to define complexity on the basis of
the structure of syllable transitions in a finite state representation of sequence
syntax. We assign probabilities to different syllable transitions and can therefore
resort to characterizations of the syntax in terms of standard information-theoretic
measures such as entropy, and some others of our design.
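As an illustration of one such information-theoretic characterization, the following sketch (illustrative only, not the dissertation's code; the toy sequences are invented) estimates syllable-transition probabilities from observed sequences and computes the entropy of each syllable's outgoing transition distribution:

```python
from collections import Counter
from math import log2

def transition_entropy(sequences):
    """Entropy (in bits) of each syllable's outgoing transition distribution,
    estimated from a list of observed syllable sequences. A deterministic
    syntax (each syllable always followed by the same one) gives 0 bits."""
    pair_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
    totals = Counter()
    for (a, _), n in pair_counts.items():
        totals[a] += n
    return {
        a: -sum(
            (n / totals[a]) * log2(n / totals[a])
            for (x, _), n in pair_counts.items() if x == a
        )
        for a in totals
    }

# Toy data: 'a' is always followed by 'b' (0 bits), while 'b' branches
# to 'c' or 'd' with probabilities 2/3 and 1/3 (about 0.918 bits).
print(transition_entropy(["abc", "abd", "abc"]))
```

Higher entropy corresponds to more variable, less stereotyped sequencing; measures of this kind underlie the comparisons made in later chapters.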
1.1.1 Songbirds
Oscines or songbirds are an avian group with about 4500 species (half of all avian
species) that are known for their learned vocalizations called song [20]. Songs are
distinguished from what are known as calls - innate vocalizations that are produced
by most members of the animal kingdom [21]. Just as in human infants, there is a
sensitive period of development during which juvenile songbirds learn their
vocalizations by listening to an adult tutor [22]. The chaffinch was the first
songbird whose songs were established to be learned, following the work of William
Thorpe in 1954 [23]. The research done by Thorpe, Marler [24, 25], and others in the
1950’s and 1960’s further set the stage for the songbird to be considered a model
system in the study of learned behavior. However, it was the work of Fernando
Nottebohm in the 1970’s that truly made the songbird a relevant model system.
Nottebohm in his experiments on canaries found that the male canary had a dedi-
cated network of brain nuclei that were linked to the production of song [26]. With
this discovery began the investigations into the neural mechanisms and functions
that were involved in helping the songbird learn and generate song sequences. The
songbird has since then been used as a model system to study questions not just in
ethology, but also in neurobiology and neurolinguistics. It is an ideal model system
since songbirds can breed under laboratory conditions and many are spontaneous
singers, making possible the collection of a large number of samples of this highly
stereotyped learned behavior under experimentally controlled conditions.
Of the 4500 species of songbirds, very few have been studied in the context of
understanding the neural basis of learned vocalizations. White-crowned sparrows
[27], canaries [26, 28], chaffinches [23, 29], zebra finches [30, 31], and Bengalese
finches have been the typical subjects of most research. Even within this small
group, there are fundamental differences in the learning and production of song.
Zebra finches and Bengalese finches are close-ended or age-limited learners - birds
for which there is a time window after which the bird does not learn any new
song. Canaries, however, are open-ended learners - learning new songs throughout
their adult lives. But as far as song production is concerned, the deterministic
song of the zebra finch (fixed transitions between syllables) contrasts with the
more variable songs of Bengalese finches and canaries (probabilistic transitions
between syllables). Seeking features of neural control systems that could lead to
both inter-species similarities and contrasts such as these is important to further
our understanding of the mechanisms behind learned vocalizations.
1.2 Neurobiology of sequence generation
The generation of vocalizations is a biological feat involving the precise coordina-
tion of oral, vocal, and respiratory muscles, that must be orchestrated by neural
control. The neural circuitry involved in vocalizations has been studied the most
in humans and songbirds. While innate vocalizations have been linked to the
midbrain [32, 33], it is hypothesized that a direct cortical pathway to motor nu-
clei involved with vocalization as seen in songbirds and humans is necessary for
complex vocal learning [10,32,34]. Such a connection is yet to be seen in any non-
learner. In general, in mammals, it is argued that there are two separate neural
pathways for the production of innate and learned vocalizations [35]. Many brain
areas in birds are hypothesized to be homologous - similar in position, structure,
and evolutionary origin but not necessarily in function - to those of mammals [36],
although this is still an active area of research. To add to the list of similarities,
the mechanism of generating vocal fold movements has recently been shown to be
the same in the syrinx of songbirds and the larynx of mammals [37]. Given these
fundamental similarities between generative neural pathways and peripheral mech-
anisms, there are still many differences that need to be explained to understand
the diversity of vocalizations in the animal kingdom.
1.2.1 The song production system in the avian brain
A collection of brain areas in oscines specialized for song is referred to as the
song system or the song circuit [26]. Several forebrain nuclei are involved in the
song system of oscine songbirds - a feature that is conspicuous by its absence in
suboscine birds. There are two main forebrain pathways in the song system -
a posterior forebrain pathway also referred to as the descending motor pathway,
and an anterior forebrain pathway as shown in Fig. 1.1. Both these pathways
start with a cortical-like region called the HVC (proper name). In the anterior
forebrain pathway, involved in the learning of song, the HVC connects to Area X, a
homologue to mammalian basal ganglia which connects to the dorso-lateral division
Figure 1.1. The song system in the songbird brain. The descending motor pathway
associated with the production of song, the anterior forebrain pathway required for
the learning and maintenance of song, and auditory input into the song system are
highlighted.
of the medial thalamus (DLM), which in turn connects to the lateral part of the
magnocellular nucleus of anterior nidopallium (LMAN). In the posterior forebrain
pathway, involved in the production of song, neurons in the HVC project to a
brain nucleus called the Robust nucleus of the Arcopallium (RA). RA neurons then
project downstream to the motor neurons involved in respiration and vocalization.
It is this direct cortical pathway that is hypothesized to be unique to vocal learners.
The vocal production and vocal learning pathways interact through connections
from LMAN in the anterior forebrain pathway to RA in the posterior forebrain
pathway. Learning in the songbird is facilitated by access to auditory information
- both from a tutor as well from the bird’s own song. The songbird brain also has
a set of nuclei that are part of the auditory pathway. The auditory pathway feeds
into the HVC through three connections - through the nucleus Uvaformis (UVA),
through the Nucleus Interfacialis (NIf), and via a direct connection to HVC.
Of the many connections in the songbird brain, a few are central to our discus-
sion - internal connections within the HVC, HVC-RA projections, and auditory
feedback connections. There is much speculation about the HVC being the seat
of syntax. The study of syntactic structure via statistical models - both of form,
and any change due to disruptions - is necessary to further our understanding of
the involvement of these connections.
1.2.2 Neural models of sequence generation involving the HVC
HVC and RA together form the premotor circuit in the song production pathway.
In zebra finches, RA activity during singing is characterized by trains of short
bursts of spikes bound by periods of inhibition, with each burst being associated
with a unique subsyllabic acoustic event [38]. The pattern of activity in RA imme-
diately upstream from motor neurons is therefore considered to be responsible for
sound production. In later experiments, also in awake zebra finches, it was found
that an HVC-RA projection neuron (referred to as HVCRA) emits a single burst of
spikes (≈ 6ms) during a song motif (≈ 1s) [39, 40]. Different HVCRA neurons fire
at different times during the motif. There are two hypotheses about the
role of the HVC that could explain these observations [41]. One is that the HVCRA
neurons form a representation of temporal order by producing a continuous stream
of activity on a 10-millisecond timescale [40]. The other, motivated more recently
by dynamical-systems models of song production, is that the bursts encode
transitions between elemental 'gestures' of the song - periods of
time during which the model parameters representing pressure in the bird's air sac and the
spring-like tension on a vibrating membrane, controlled by the muscles surrounding
the syrinx, are either unchanged or strictly increasing or decreasing [42]. The two
hypotheses are not mutually exclusive since it is possible that while bursting ac-
tivity in HVCRA neurons aligns with transitions between gestures, enough HVCRA
neurons are active throughout each gesture to account for temporal ordering. We
will therefore assume that the HVC is responsible for temporal order in a sequence.
Branching chain model of stochastic sequence generation in HVC
Several neural models consider the topological connectivity of neurons (the graph
formed by neurons physically connected via synapses) in the HVC to take the form
of synfire chains [43–46]. In a synfire chain, neurons are ordered into groups that
are connected in a feed-forward fashion. All connections are excitatory. Activity
propagates synchronously from group to group in the synfire chain. In one model,
HVC-RA neurons are modeled as synfire chains with each neuron having an in-
trinsic bursting property based on dendritic calcium spikes [47]. Global inhibition
through HVC interneurons is included to regulate activity. Working within this
model of the neural circuitry, a neural sequence in the HVC corresponding to the
Figure 1.2. Branching synfire chain in HVC for songs with probabilistic transitions between syllables, such as those of the Bengalese finch. Each chain corresponds to a syllable. Syllable A could transition to syllable B or C depending on whether activity propagation in synfire chain B wins over the activity in synfire chain C or vice versa.
deterministic transitions between syllables in the song of a zebra finch can be set up
as activity propagation through a set of chains, each connected to just one other in
a feed-forward manner. However, to account for probabilistic transitions between
syllables in the song of a Bengalese finch, chains are connected in a branching
manner as seen in Fig. 1.2 [45]. To date, the exact organization of neurons in
the HVC has not been observed. The one available clue is
that HVC-RA neurons activated at the same time
are not located next to each other. This suggests that a group in the synfire chain
is an abstract entity that is defined by co-activation of neurons. When we refer to
‘states’ in models of syntax later on, we will roughly be thinking of a one-to-one
mapping between the states and these groups in the synfire chain model.
1.3 Models of syntax for vocal sequences
Any sequence can be described by considering the statistical patterns that it
presents. If we note down the numbers that show up in a sequence of rolls of
a fair die, for example, we can see that there is no discernible structure to the
number sequence. This is because these numbers result from independent trials,
with each throw of the die having no influence on any of the others. In a paper
from 1907, Russian mathematician Andrei Andreevich Markov considered instead
the possibility of a chain of dependent variables y1, y2, . . . , yn for which yk+n is only
dependent on yk for any k [48]. The simplest case is when n = 1 where dependence
is limited to the immediate predecessor of a variable in the chain. This is called
a first-order dependence. Consider, for example, the board game Snakes and Ladders:
every new position of your marker on the board is influenced only by your
previous position on the board; all moves before that do not matter at all. Such
chains, be they of first or higher order, came to be called Markov chains in the
popular literature. Markov chains are fully specified by stating the set of unique
elements the sequence is made of, and the probabilities of an element in the set
being followed by any element in the set, called transition probabilities. The first
references to the syntax of birdsong assumed that song sequences were Markov
chains [49, 50]. However, in a Markov chain, if an element repeats with probability
p, then the probability of the element repeating n times before transitioning to a
different element is given by the geometric distribution

P(n) = p^(n−1)(1 − p)     (1.1)

This is a monotonically decreasing function of n. However, it has been observed
that in the songs of some songbirds, the distribution of syllable repeats may not
be monotonic [6, 51], indicating that the song sequences are non-Markovian.
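The geometric form of Eq. 1.1 is easy to check numerically. The sketch below (function name and the value of p are ours, purely for illustration) simulates repeat runs in a Markov chain and compares the empirical run-length distribution against p^(n−1)(1 − p):

```python
import random

def repeat_length_distribution(p, n_trials=100_000, seed=0):
    """Simulate repeat runs where a syllable repeats with probability p,
    and tally how often each run length n occurs."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_trials):
        n = 1
        while rng.random() < p:   # keep repeating with probability p
            n += 1
        counts[n] = counts.get(n, 0) + 1
    return {n: c / n_trials for n, c in sorted(counts.items())}

# Compare with Eq. 1.1: P(n) = p^(n-1) * (1 - p), decreasing in n
p = 0.6
empirical = repeat_length_distribution(p)
for n in range(1, 6):
    predicted = p ** (n - 1) * (1 - p)
    print(n, round(empirical.get(n, 0.0), 3), round(predicted, 3))
```

The simulated distribution decreases monotonically with n, as Eq. 1.1 predicts; a non-monotonic observed distribution is therefore evidence against a Markov chain.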
Also, even though the use of the word ‘sequence’ might draw to mind a linear
organization of elements, it is possible that the relationship between elements in
a sequence could be highly non-linear. An example that makes this clear is the
following English sentence - If it is good, then it can be published. If and then are
not adjacent to each other in the sentence, yet the presence of If necessitates that
of then later in the sentence. This sort of ‘non-local’ or long-range dependence
requires that the syntax be non-linear while the production of the elements can be
linear in time. Such dependencies are not limited to natural languages, but could
be seen in motor sequences such as animal vocalizations as well. Hence, descrip-
tions of sequence syntax must go beyond Markov models. In this dissertation, we
Figure 1.3. Chomsky hierarchy of languages. The classes are nested: regular languages (which include songbird songs) sit inside context-free, mildly context-sensitive (which include natural languages), context-sensitive, and recursively enumerable languages.
specifically study a more complex model called the Partially Observable Markov
Model (POMM).
1.3.1 The Chomsky hierarchy
Languages 1 are typically classified into four major classes according to the Chom-
sky hierarchy shown in Fig. 1.3. In order of increasing complexity, these are -
regular languages, context-free languages, context-sensitive languages, and recur-
sively enumerable languages. The distinction between these languages is made
based on the computational machines that can be used to generate or recognize
them. Regular languages are those which can be recognized using a machine with a
finite number of states. Animal vocalizations are thought to belong to the class of
regular languages, and more specifically to the subclass of finite languages, while
natural languages are classified as mildly context-sensitive. However, these are
broad categories; not all animal vocalizations, for example, appear equally complex.
We are interested in devising finer divisions within the hierarchy. A question that exemplifies
our effort is: how different from a Bengalese finch's song is the song of a
canary, and how different from that is the song of a humpback whale? We attempt
to make comparisons of this nature in this dissertation.
1A language here is simply a finite or infinite set of strings, each finite in length and composed of a finite number of elements [52].
1.4 Structure of the dissertation
This dissertation consists of 6 chapters including the introductory chapter of which
this section is a part.
Chapter 2 introduces finite state representations of various syntax models for
symbol sequences - starting with the simple Markov model and considering vari-
ous higher order models. We focus specifically on the Partially Observable Markov
Model (POMM). We study model inference in detail and discuss some new mea-
sures to evaluate the model. The inference of the POMM is limited by data size.
The dependence of all measures discussed on data size is carefully analyzed, es-
pecially since recording limitations often lead to small sample sizes for animal
vocalizations. We derive bounds on all sequence statistics that are discussed in
this context. The importance of considering sample size as a prime factor in in-
ferring the category of models to which the song syntax of a species is assigned is
also emphasized.
In Chapter 3 we apply the methods developed in Chapter 2 to the songs of Ben-
galese finches. First, we show that the syntax of the Bengalese finch is a Partially
Observable Markov Model - there is a many-to-one mapping between syllables in
song and abstract states of the model, which are hypothesized to be chain net-
works of neurons in the songbird brain. Changes in the syntax of Bengalese finch
song after the removal of auditory feedback by deafening are studied. We find that
for four of the six birds studied, removal of auditory feedback leads to the disap-
pearance of the many-to-one mapping. However, the absence of this observation
in the remaining two birds suggests that auditory feedback is not solely respon-
sible for the regulation of the many-to-one mapping. This chapter also includes
a short analysis of humpback whale songs to demonstrate the generalizability of
these methods, as well as limitations.
In the previous chapter, the POMM models of Bengalese finch and humpback
whale songs were inferred from sequences in which all repetitions of syllables were
disregarded. In Chapter 4 we focus solely on the features of syllable repetitions in
song. We show an inverse relationship between the duration of the syllable and the
most probable number of repetitions for the songs of canaries and swamp sparrows.
These are two species of songbirds whose songs are predominantly composed of
syllable repetitions - single occurrences of syllables are rare, if they occur at all. We also show
that this relationship is not exact for the songs of a Bengalese finch. We speculate
about possible neural and peripheral mechanisms behind the generation of syllable
repetitions that could result in adherence to, or deviation from, such a relationship.
Chapter 5 is a stand-alone portion of the dissertation. The data for all analysis
in previous chapters were symbol sequences. However, the mapping of an audio
recording into a symbol sequence was not discussed. Field or laboratory recordings
of vocalizations must first go through several stages of pre-processing. We discuss
methods of identifying the time intervals in the processed recording during which
vocalizations are present, by distinguishing them from silence. We develop a semi-
automated method of classifying the identified syllables into types or categories
using a supervised learning technique - the Support Vector Machine (SVM). The
use of image-based features to distinguish syllables is advocated in comparison
with the predominant use of sound-based features in this field of research.
Chapter 6 concludes the dissertation. We discuss the role of syntax models in
informing research on the neurobiology of sequence generation as well as point out
possible extensions of the work presented in this dissertation.
Chapter 2

Partially Observable Markov Models of Sequence Syntax
It is often difficult to distinguish patterns in sequences originating from a rule as
opposed to statistical coincidences. While statistical coincidences average out in
large enough sets of data, we need to be careful with the distinction when working
with small data sets. Nevertheless, the rule can still be inferred using tools that
analyze the statistics in the data, with a level of confidence that can be quantified
in the language of probabilities. The task of modeling the syntax of vocal sequences
is therefore one of statistical inference. In this chapter, we discuss various models
of syntax, with a focus on the Partially Observable Markov Model (POMM). We
develop methods of inference that allow identifying and encoding these rules into
a POMM. We also address questions of performance of the model as well as the
inference scheme in the case of finite data sets. Finally, we discuss some model
evaluation measures.
2.1 Introduction
Let Y1 = y1, Y2 = y2, ... denote a sequence of observations of the random variable
Y . In the case of animal vocal sequences, the random variable is a syllable type, i.e.,
if there are m syllable types in an animal’s vocal repertoire, then each observation
in a sequence could be one of m possibilities. The simplest scenario is one in which
the syllables appear with probabilities that are independent of preceding syllables.
Such sequences of syllables can be produced without a memory of what transpired
before.
More interesting sequences contain complex patterns which manifest as cor-
relations between the syllables observed at different times. Such patterns could
imply that the ‘machine’ that generated the sequence possesses some memory (in
some physical form), based on which the rules governing the choice of syllable
are tweaked. Thus the patterns require a complex computing process to gener-
ate them. Analysis of patterns and the complexity of the rules that can generate
them opens a window into the complexity of the machine which, in the case of our
interest, is the brain.
One approach to grading the complexity of patterns is to identify the simplest
computing scheme that can reproduce the patterns, borrowing notions from math-
ematical models of computing. This procedure has several aspects, namely -
(i) identifying the level of complexity of the model, (ii) using the available data
to identify the specific model within the given class of complexity, and (iii) eval-
uating the ability of the identified model to reproduce the patterns. All these
tasks are made difficult by various limitations posed by finiteness of the available
sequence data, which can result in an insufficient representation of characteris-
tic correlations and the generation of spurious correlations arising from statistical
coincidences. For the sequences in our study, we are interested in models with a fi-
nite amount of memory - namely Markov models and Partially Observable Markov
Models (POMM). We discuss methods for training the models with data, model
validation, and checks on overfitting.
2.2 Model representation - Finite state machines
A Finite State Machine (FSM) is an abstract representation of regular languages
(see Sec. 1.3.1) that can also be thought of as an abstract apparatus that performs
computations. FSMs are systems that can be described by a ‘current state’ x(t)
which at a time can take a value from one among a finite set of possible states.
A finite set of inputs can trigger a transition in the state. The transition depends
on the current state and the input trigger through a transition function T . The
system can produce output symbol y(t) taking values from a finite set of symbols
either during the transition or in between the transition. The output can depend
on the current and previous state through an emission function E [53]. The state
is an abstract structure that can be chosen appropriately for
the process being represented. FSMs allow the patterns contained in
most simple sequences to be encoded in an appropriate choice of states, symbols, transitions,
and emissions. While FSMs are defined to have deterministic transitions and
emissions, they can be generalized to a class of stochastic FSMs, the simplest of
which are Markov processes and Hidden Markov Models.
FSMs can be visually represented using state transition diagrams. To illustrate
these notions and the mapping to the diagram, let us invent a one-player game of luck
that combines coin flips and die throws. The
goal is to stay in the game for as long as possible. The game always begins with a
coin toss. A player who gets heads on a toss throws the die; on tails, the coin is
flipped again. If a 1 shows up on the die, the coin is flipped again. If a 2, 3, 4, or 5 falls, the player throws the die
again. If a 6 shows up, the player is out. The rules of this game can be
represented by the simple FSM shown in Fig. 2.1.
Figure 2.1. An illustration of a Finite State Machine represented by a state transition diagram.
In the diagram shown, the bubbles represent states, and the arrows represent
state transitions. The arrow labels indicate the output/input symbol corresponding
to a transition. Note that in this FSM the output from one state is the input to
the next state.1 The FSM shown has four states - Start, Coin, Die, and End.
1This need not always be the case. In general, Finite State Machines can have different sets of input and output symbols.
There are two symbols H and T associated with the state Coin, while there are six
symbols 1,2,3,4,5, and 6 associated with the state Die. The Start and End states
are each associated with a null output {}. Game progress happens in discrete time
steps.
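The game can be simulated directly from its state transition diagram. The sketch below is a minimal illustration (function and state names are ours); it returns the sequence of output symbols produced before the End state is reached:

```python
import random

def play_game(seed=None):
    """Simulate one round of the coin-and-die game as an FSM (Fig. 2.1).
    Returns the list of output symbols up to and including the final 6."""
    rng = random.Random(seed)
    state, outputs = "Coin", []
    while True:
        if state == "Coin":
            flip = rng.choice(["H", "T"])
            outputs.append(flip)
            # heads -> throw the die; tails -> flip the coin again
            state = "Die" if flip == "H" else "Coin"
        elif state == "Die":
            roll = rng.randint(1, 6)
            outputs.append(roll)
            if roll == 1:
                state = "Coin"        # back to the coin
            elif roll == 6:
                return outputs        # End state: player is out
            # 2, 3, 4, or 5: stay in Die and throw again
```

Every simulated round ends with a 6, and the symbols emitted along the way depend only on which state (Coin or Die) the game is in at each step.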
The FSM considered is an example of a Moore machine, where the output
depends only on the current state - H or T depends only on the state Coin; it
does not matter whether this state was entered after a throw of 2 or 5 on the die.
The same FSM can also be represented as an equivalent Mealy Machine where
the transition between states is based on the input symbol. This distinction is
made here because different research articles about the syntax of sequences represent
Finite State Machines as either Moore or Mealy machines, and we would like to
emphasize that they are equivalent forms. Any discussion that follows is applicable
to either form.
Finally, a finite state model is an abstract representation and does not necessar-
ily need to have a one-to-one correspondence with the physical process it describes.
However, the hope is to find some mapping between the machine and the process.
In the case of models of vocal sequence syntax, we seek a mapping to neural models
of sequence generation.
2.2.1 Markov models
When the occurrence of an observation in a sequence only depends on an observa-
tion that occurred before it at a particular position in the sequence, the sequence
is said to be Markovian, or generated by a Markov source. An example of such
a Markovian source is shown in Fig. 2.2 where the observations can be one of
two symbols {y1, y2}, and with probabilities assigned to the transitions between
the symbols. Markov sequences can be thought of as originating from stochastic
FSMs with state symbols but no output symbols [54]. The Markov model can also
be represented by a transition matrix T which specifies the transition probabilities
between two symbols. The occurrence of a symbol in position i, depends only on
the symbol in position i − 1. This is a first-order Markov model. If it depended
on the symbol in position i − 2, it would be a second-order Markov model. In
general, if the occurrence of a symbol in position i in the sequence depends on the
symbol in position i−n, we have an nth order Markov model. In this dissertation,
‘Markov model’ refers to the first-order Markov model unless stated otherwise.
T = ( 0.8  0.2
      0.6  0.4 )

Figure 2.2. An example of a first-order Markov model represented using the state transition diagram in a manner similar to an FSM, and the corresponding transition matrix. The numbers next to the arrows show the transition probabilities.
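Sampling from the model of Fig. 2.2 makes the role of the transition matrix concrete. A minimal sketch (the dictionary encoding of T and the function name are ours for illustration):

```python
import random

# Transition probabilities from Fig. 2.2: each key is the current
# symbol; the list gives (next symbol, probability) pairs.
T = {"y1": [("y1", 0.8), ("y2", 0.2)],
     "y2": [("y1", 0.6), ("y2", 0.4)]}

def sample_chain(T, start, length, seed=None):
    """Generate a symbol sequence from a first-order Markov model:
    each symbol depends only on its immediate predecessor."""
    rng = random.Random(seed)
    seq, current = [start], start
    for _ in range(length - 1):
        symbols, probs = zip(*T[current])
        current = rng.choices(symbols, weights=probs, k=1)[0]
        seq.append(current)
    return seq

print("".join(sample_chain(T, "y1", 20, seed=0)))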
2.2.2 Hidden Markov Model (HMM)
A Hidden Markov Model (HMM) is used to model stochastic sequences more com-
plex than a Markov chain. The HMM is a standard statistical model with a wide
range of applications in time-series modeling [55]. In an HMM, an observed sym-
bol sequence is also associated with a hidden state sequence. The hidden state
sequence is a Markov chain. For a symbol sequence consisting of m unique sym-
bols, an HMM is represented by the set {T, E, k}, where k is the number of states,
T is a k × k transition matrix specifying the transition probabilities between the
states, and E is a k ×m emission matrix specifying the probability of each state
emitting each of the symbols. In an HMM, each state has an emission distribution
over all symbols such that the same state could emit more than one symbol. An
example of a 4-state HMM is shown in Fig. 2.3. An HMM can be thought of as a
stochastic FSM [54].
2.2.3 Partially Observable Markov Model (POMM)
A Partially Observable Markov Model2 is a special case of the Hidden Markov
Model [8]. In the HMM, each state is associated with a probability distribution over
all m possible symbols that can appear in the sequence. This means that a single
state could potentially emit any of the m symbols each time a transition to that
state occurs. By contrast, in a Partially Observable Markov Model (POMM) [56],
each state is associated with only one symbol, i.e., a state emits one of the m
2Not to be confused with the Partially Observable Markov Decision Process which has addi-tional structure called a decision maker which can influence state transitions.
Figure 2.3. An HMM with k = 4 states which can emit m = 2 discrete symbols y1 or y2. Tij is the probability of transitioning from state si to state sj. Ej(yk) is the probability of emitting symbol yk in state sj. In this particular HMM, states can only reach neighboring states.
symbols with probability 1. Multiple states could, however, emit the same symbol,
making the POMM a many-to-one mapping scheme, as shown in Fig. 2.4. The use
Figure 2.4. A POMM with k = 4 states which can emit m = 2 discrete symbols y1 or y2. Tij is the probability of transitioning from state si to state sj. Ej(yk) is the probability of emitting symbol yk in state sj. In a POMM, each state emits its single associated symbol with probability 1, e.g., E4(y2) = 1.
of the POMM rather than the full HMM is motivated by computational models
to understand the neural basis of birdsong [45], where each state is considered to
be a chain network of neurons in the songbird HVC (see Sec. 1.2.2.). The neural
sequences arising from a chain network are Markovian, but the resulting syllable
sequences need not be if each chain represents a state in the POMM rather than
a syllable directly. A POMM can model all distributions that can be modeled by
an HMM - i.e., every HMM can be represented by an equivalent POMM [8]. It
has been shown in one study that for Bengalese finches (two birds), the syntax is
consistent with a POMM [6]. In the following sections, we discuss model evaluation
and inference mainly for the POMM since it is very likely that the model is a
representation of syntax for the vocal sequences of other species as well.
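A toy example makes the many-to-one mapping concrete. In the sketch below (states, probabilities, and syllable names are invented for illustration, not drawn from birdsong data), two hidden states both emit the syllable A but lead to different continuations, so the syllable sequence is non-Markovian even though the hidden state sequence is a Markov chain:

```python
import random

# Two hidden states (a1, a2) both emit syllable "A": a many-to-one
# state-to-syllable mapping. The syllable alone cannot reveal the state.
EMIT = {"a1": "A", "a2": "A", "b": "B", "c": "C"}
TRANS = {"start": [("a1", 1.0)],
         "a1": [("b", 1.0)],             # A from a1 is always followed by B
         "a2": [("c", 1.0)],             # A from a2 is always followed by C
         "b":  [("a2", 0.5), ("end", 0.5)],
         "c":  [("end", 1.0)]}

def sample_pomm(seed=None):
    """Walk the hidden Markov chain; each state emits its one syllable."""
    rng = random.Random(seed)
    state, syllables = "start", []
    while True:
        nxt, probs = zip(*TRANS[state])
        state = rng.choices(nxt, weights=probs, k=1)[0]
        if state == "end":
            return "".join(syllables)
        syllables.append(EMIT[state])
```

This model generates only "AB" and "ABAC": the first A is always followed by B and the second always by C, a context dependence that no first-order Markov model over the syllables {A, B, C} can reproduce.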
Figure 2.5. Empirical probability distribution over observed sequences, which is an approximation to the true probability distribution. A model defines a distribution over sequences, which is an approximation to the empirical distribution.
2.3 Model evaluation
The construction of a stochastic model of the syntax defines a probability distribu-
tion over the set of all sequences. One optimizes the parameters of the model such
that the probability distribution defined by the model is a good approximation of
the true probability distribution defined by the process generating the observed
sequences. In the absence of direct and/or complete knowledge of the process gen-
erating the data, one has to infer the match between the distributions purely from
available data (Fig. 2.5). In evaluating the quality of this approximation, we need
to empirically identify the relevant statistical features that have to be reproduced
by the model and accordingly define model performance. Given enough
parameters, a sufficiently complex model can always be found to match any distribu-
tion. In order to avoid such overfitting and the capture of spurious patterns, we need
to place limits on model performance. In the following sections, we discuss these
issues in the context of POMM inference.
2.3.1 Measures of model fit
One approach to evaluating a model is to define metrics to ‘score’ the inferred
model. A high score reflects a good model. We consider the use of two such
metrics - sequence completeness and log-likelihood.
2.3.1.1 Sequence completeness
Consider a model that generates sequences with a distribution P , and let Y be a
set of observed sequences (generated by another process, say Q, possibly the same
as P ). Sequence completeness measures the total probability under P of finding
each of the unique sequences in Y .
Pc = ∑_{y∈Y} Py     (2.1)
If all sequences present within the observed set can be generated by the model
and these are the only sequences that the model can generate, then Pc = 1. In
this sense it can be thought of as a measure of the similarity between the two
distributions. However, it is a weak measure of similarity, in that it does not require
P(y) = Q(y) for any sequence y.
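With P estimated empirically from a fit set, Eq. 2.1 can be computed in a few lines. A sketch (the sequence strings and function name are illustrative):

```python
from collections import Counter

def sequence_completeness(fit_seqs, test_seqs):
    """Eq. 2.1: sum, over the unique sequences in the test set, of their
    empirical probability under the fit set. Sequences never seen in the
    fit set contribute zero."""
    fit_counts = Counter(fit_seqs)
    total = len(fit_seqs)
    return sum(fit_counts[y] / total for y in set(test_seqs))

fit = ["AB", "AB", "AC", "AC"]
test = ["AB", "AD"]              # "AD" never occurs in the fit set
print(sequence_completeness(fit, test))   # 2/4 + 0 = 0.5
```

A completeness of 1 means every sequence the fit set can "generate" also appears in the test set and vice versa; unseen test sequences pull the value down.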
Factors that affect sequence completeness
We use completeness as a tool to check whether a model inferred from the data is
capable of reproducing the unique sequences found in the data. More precisely, we
use part of the data to learn a model and measure the completeness of the model
against the remaining part of the data. Completeness values depend strongly on the
amount of available data - i.e., the number of song sequences and the number of syllables -
and on the nature of the model that generates the data.
Effect of finiteness on sequence completeness
The number of sequences available for the inference of a finite state model could
play a significant role in the reliability of the model fit. We test how using different
numbers of sequences in the inference of the model affects the sequence complete-
ness. We want to quantify the effect of the finite size of the available dataset
with simple first-order Markov models (a first-order Markov model is a POMM in
which the number of states is the same as the number of syllables). We construct
such models using different numbers of symbols with the transition probabilities
out of each symbol/state drawn from a uniform distribution. We use m=2,3,4
Figure 2.6. Sequence completeness distributions under Markov models for N sequences of m symbols (panels: m = 3, N = 200; m = 3, N = 500; m = 5, N = 200; m = 5, N = 500). There is a large variation in possible distributions for the same N and m. The average sequence completeness distribution in each case is shown with a thick line.
and 5 symbols (excluding the silent start/end symbol). N sequences are generated
from each model with N ranging from 100 to 12800. Of these N sequences, half
are used as fit sequences representing a model. The other half are chosen as test
sequences. In order to calculate the completeness of the test sequences using the
fit set, we first calculate the probability distribution over sequences in the fit set.
We then find unique sequences in the test set and sum their probabilities based on
the distribution over the fit set to get the sequence completeness. We then obtain
a distribution of the sequence completeness thus calculated for 1000 random splits
of the N sequences. In Fig. 2.6 we see the distributions of sequence completeness
so obtained for m = 3 and m = 5 symbols, and for different numbers of sequences,
N = 200 and N = 500.
In Fig. 2.7, the markers represent the mean completeness obtained from these
Figure 2.7. Sequence completeness depends on the number of sequences available for model inference (left panel: without repeats; right panel: with repeats; curves for m = 2, 3, 4, and 5 symbols, with fixed and varying T). The mean sequence completeness increases with an increase in the number of sequences N, with and without repeats allowed in the sequences.
distributions and the error bars represent the standard deviation. Since our con-
struction of a POMM is limited to the non-repeat structure of sequences, we first
track the change in sequence completeness with number of sequences for Markov
models in which self-transitions are not allowed (left panel of Fig. 2.7). The cal-
culations are exactly the same as the ones described above. The mean sequence
completeness increases with an increase in the number of sequences N .
We now include the possibility of a symbol repeating, i.e., self-transitions are
allowed. In this case we know that the number of repeats of a symbol in a sequence
is variable. This would mean that there would be fewer sequences in the fit set
identical to those in the test set and we expect the completeness to be lower on
average. This is indeed the case (right panel of Fig. 2.7). Even with repetitions
allowed (right panel), the mean sequence completeness increases with the number
of sequences for Markov models on m = 2, 3, 4, 5 symbols. For the same number
of sequences, the sequence completeness is more likely to be greater for smaller
numbers of symbols, though this is not guaranteed. Overall, however, there is
still an increase in completeness with the number of sequences.
Variability in sequences
Sequence completeness represents a match of sequences in two sets. It can be
argued that in general a model that can generate a broad distribution of possible
sequences, will result in a small sequence completeness, as the total number of
sequences found in a finite set of test sequences will form only a small part of the
possibilities. In other words, a model with higher variability will lead to lower
values of sequence completeness. One possibility is that for the same number of
sequences, if the number of possible combinations of symbols is large, for example if
the number of possible symbols in the alphabet is large, the sequence completeness
would be lower. There are multiple other features of the syntax that can lead to
increased variability and lower sequence completeness -
1. Presence of cycles or repeat structures in the syntax - When a se-
quence is long, the number of possible combinations of symbols that could
lead to a sequence of any given length is large. This means that the prob-
ability of finding an exact match in another set would be low. We could
hypothesize that the average length of the sequences could also affect com-
pleteness. In fact we can imagine that any factor that increases the average
length of sequences could result in a low value of sequence completeness.
This could happen in several ways - if the probability of most symbols tran-
sitioning to the start symbol is low, then the average length of sequences
would be high. This could also happen if there are cycles in the syntax.
The simplest cycle is the zeroth order cycle which is a self-transition. The
greater the repeat probability of a symbol, the higher the possibility of a
sequence containing that symbol being longer than average. To study this,
we constrained the repeat probabilities (diagonal of transition matrix) to all
be the same value for each model picked. We generated 100 such models,
generated 1000 sequences from each, constructed the distribution of sequence
completeness for each, and recorded the mean sequence completeness. We
then tuned the repeat probability over a range of values and repeated the
procedure. As seen in Fig. 2.8 as the repeat probability increases, the mean
sequence completeness decreases on average. Also, we see that it is possible
that a model based on a larger set of symbols with small repeat probabilities
could sometimes lead to a greater sequence completeness than one based on
fewer symbols but with high repeat probabilities.
2. Sparsity of transition matrix - When the number of possible transitions
out of the states in a POMM is small (sparse transition matrix), the number
of unique sequences that can be generated is also small. This would mean
that it is more likely for sequences in the fit set to also be found in the
test set - i.e., we expect the sequence completeness to be high. We need a
single measure for the sparsity of the matrix, so that we can study sequence
completeness as a function of this quantity. The simplest possibility would be
to count the total number of zeros in the matrix (extremely low probabilities,
smaller than a threshold can also be considered to be zero). However, this
would not take into account the actual values of the non-zero probabilities.
Another possibility is to define the entropy of state transitions - every row i
of the transition matrix T represents transition probabilities out of state i,
for each row, the entropy would be∑
j Tij log Tij. We then sum this quantity
across different states. In Fig. 2.9 sequence completeness is studied as a
function of state transition entropy.
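The summed row entropy can be computed directly from the transition matrix; a minimal sketch:

```python
import math

def transition_entropy(T, eps=1e-12):
    """Sum over rows i of -sum_j T_ij log T_ij (natural logarithm).
    A sparser, more deterministic matrix gives a lower value."""
    total = 0.0
    for row in T:
        total -= sum(p * math.log(p) for p in row if p > eps)
    return total
```

A deterministic matrix has entropy 0, while uniform rows maximize it, so the measure reflects both the number and the magnitudes of the non-zero transition probabilities.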
2.3.1.2 Log-likelihood

The likelihood of a set of sequences is defined as the probability of all the
observed sequences given a model. For a set of sequences Y = {y_1, y_2, . . . , y_n},
the likelihood is L = P(Y) = ∏_i P_{y_i}. However, many of the n sequences could
be identical. If there are k_y sequences of type y, then

L(P_y) = ∏_{y′} P_{y′}^{k_{y′}}    (2.2)

The log-likelihood is obtained by taking the logarithm of Eq. 2.2:

ℒ(P_y) = ∑_{y′} k_{y′} log P_{y′}    (2.3)

A high log-likelihood indicates a good model.
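Eq. 2.3 translates directly into code. A small sketch, in which the model is given simply as a mapping from each sequence type to its probability:

```python
from collections import Counter
from math import log

def log_likelihood(seqs, P):
    """Sum of k_y log P_y over unique sequence types y (Eq. 2.3).
    `P` maps a sequence to its probability under the model."""
    return sum(k * log(P[y]) for y, k in Counter(seqs).items())
```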
[Figure 2.8: mean sequence completeness vs. repeat probability, with curves for 3-symbol (3-sym) and 5-symbol (5-sym) models.]

Figure 2.8. Sequence completeness depends on the repeat probabilities of symbols in the Markov model. Each circle in the plot represents the mean sequence completeness over N = 1000 sequences for a single model. Higher repeat probabilities increase the average length of sequences that can be generated by the model.
An upper bound on log-likelihood

When we consider methods of model inference in later sections, we keep track of the
log-likelihood as a means of evaluating improvements in the estimation of model
parameters. It is natural to ask whether there is a bound on the log-likelihood
for a given dataset against which the performance of a model should be measured.
To find this bound, we seek the set of P_y that maximizes ℒ(P_y). This is a
constrained maximization problem, since the P_y are probabilities and must obey
the constraint ∑_{y′} P_{y′} = 1. Hence, using a Lagrange multiplier λ, the
function to maximize is

ℒ(λ, P_y) = ℒ(P_y) − λ(∑_{y′} P_{y′} − 1)

Maximizing with respect to P_y and λ, i.e., setting

∂ℒ(λ, P_y)/∂P_y = 0 and ∂ℒ(λ, P_y)/∂λ = 0
[Figure 2.9: mean sequence completeness vs. state transition entropy; the fit has R² = 0.9.]

Figure 2.9. Sequence completeness depends on the sparsity of the transition matrix, measured by the state transition entropy.
we get

P_y = k_y/λ and λ = N

Hence the maximum log-likelihood is achieved by the empirical, frequentist
approximation to the distribution, P_y = k_y/N. Moreover, the upper bound can be
written in terms of the Shannon entropy of this distribution:

ℒ_max(P_y) = N ∑_{y′} P_{y′} log P_{y′}, where P_{y′} = k_{y′}/N

We have therefore shown that the upper bound on the log-likelihood achievable
by a model is −N times the Shannon entropy of the empirical distribution of the
observed sequences.
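This bound is easy to check numerically. The sketch below computes it from the empirical counts:

```python
from collections import Counter
from math import log

def max_log_likelihood(seqs):
    """Upper bound N * sum_y P_y log P_y, with P_y = k_y / N."""
    n = len(seqs)
    return sum(k * log(k / n) for k in Counter(seqs).values())
```

For three copies of AB and one of CBA, for example, the bound is attained at P_AB = 0.75, P_CBA = 0.25, and any other assignment (say a uniform one) gives a strictly smaller log-likelihood.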
2.3.2 Measures of sequence similarity

Another approach to evaluating a model is to use a set of sequences generated
from the model. The statistics of this set can be compared with the statistics of
the observed sequences. If the model is a good fit to the observed sequences, then
we can demand that a chosen set of key statistics of the data agree within
the limits of estimation error due to the finite size of the dataset.
2.3.2.1 Repeat distributions
Symbols can be repeated multiple times in a sequence. Since the transitions
between symbols, including self-transitions, are stochastic, the number of times a
symbol is repeated varies. The distribution of the number of repetitions over all
appearances of the symbol in the observed set of sequences is called the repeat
distribution.
2.3.2.2 n-gram distributions

A sequence can be segmented into multiple sub-sequences. A good model of the
sequence syntax should be able to replicate the statistics of these sub-sequences as
well. A sub-sequence of length n is called an n-gram. For example, for the sequence
ABCD, the 2-grams are AB, BC, and CD, and the 3-grams are ABC and BCD. A
distribution can be defined over all n-grams in the observed set of sequences for
each n.
2.3.2.3 Step distributions

Some symbols appear more frequently at the beginning of a sequence, others toward
the end, and so on. In general, the location of a symbol within a sequence is a
statistic that must also be replicated by a good model of the sequence syntax.
The distribution over all positions at which a symbol is found - the number of
steps from the beginning of a sequence - taken over the entire set of sequences,
is defined to be the step distribution of that symbol. The step distributions of
all m symbols should be matched by a good model.
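The three statistics above can each be computed in a few lines; a sketch:

```python
from collections import Counter
from itertools import groupby

def repeat_distribution(seqs, symbol):
    """Counts of consecutive-run lengths of `symbol` over all sequences."""
    runs = Counter()
    for seq in seqs:
        for sym, grp in groupby(seq):
            if sym == symbol:
                runs[len(list(grp))] += 1
    return runs

def ngram_distribution(seqs, n):
    """Counts of all length-n sub-sequences."""
    grams = Counter()
    for seq in seqs:
        for i in range(len(seq) - n + 1):
            grams[seq[i:i + n]] += 1
    return grams

def step_distribution(seqs, symbol):
    """Counts of the number of steps from the start at which `symbol` occurs."""
    steps = Counter()
    for seq in seqs:
        for i, sym in enumerate(seq):
            if sym == symbol:
                steps[i] += 1
    return steps
```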
2.4 Model inference - inference of a POMM from data

A model is defined both by its structure and by its parameters. Given a set
of sequences from which the model is to be inferred, some inference algorithms
estimate both the structure and the parameters, while others estimate just the
parameters given a fixed model structure.
2.4.1 Expectation maximization - Baum-Welch algorithm
A partially observable Markov model contains a many-to-one mapping from states
to symbols. The inference of the parameters of the POMM (and HMMs in gen-
eral) from the observed data is difficult compared to a Markov model because
of the many possible state sequences that could result in the same observed se-
quence. Expectation-maximization is a method that is suited to such classes of
problems [57]. In this section we describe the Baum-Welch algorithm that imple-
ments expectation-maximization.
Let the observed probability over the syllable sequences be P_obs(Y) and the
probability of the sequences Y under a POMM with transition matrix T and
emission matrix E be P(Y; T, E). For the cases of interest here, the emission
matrix is taken to be fixed, and so it is not mentioned explicitly in the notation
used below.

By tuning the transition matrix T we hope to find the distribution P(Y; T)
that best approximates the true observed distribution P_obs(Y). A measure of the
dissimilarity of the two distributions is the Kullback-Leibler divergence

D = ∑_Y P_obs(Y) log [P_obs(Y)/P(Y|T)] = ∑_Y P_obs(Y) [log P_obs(Y) − log P(Y|T)]

and this quantity is to be minimized. Since the true distribution P_obs(Y) is
independent of T, minimization of D by tuning T is equivalent to maximization of
the log-likelihood

∑_Y P_obs(Y) log P(Y|T)

P_obs(Y) is the distribution over syllable sequences inferred from the observed
sequences:

P_obs(Y) = (1/n) ∑_{i=1}^{n} δ_{Y, y^{(i)}},

where y^{(i)} is the ith of the n observed sequences (allowing repetitions). We can
replace the above quantity by the average over observed sequences

ℒ(T) = (1/n) ∑_{i=1}^{n} log P(y^{(i)}|T) = ⟨log P(Y|T)⟩
We will use angle brackets to denote an average over the observed sequences in
what follows. Ideally, we want to identify the maxima of the above quantity in the
very high-dimensional parameter space of T. In general this is an impractical task;
however, the Baum-Welch algorithm achieves a simpler one. Given a model T_in,
the algorithm produces a new model T_out which gives a higher log-likelihood for
the data. The algorithm can therefore be used to improve the model iteratively.
Starting from any initial guess, it leads us to a local maximum in the
parameter space.
Overview

In order to explain and justify the algorithm, we define the following functional F
and quote a few of its properties. For a distribution Q(S|Y) over the state
sequences and a transition matrix T, the functional F(Q, T) is defined as

F(Q, T) = ⟨∑_S Q(S|Y) log [P(S, Y|T)/Q(S|Y)]⟩

For well-definedness, Q is restricted to those distributions such that
Q(S|Y) is zero whenever the state sequence S cannot emit the sequence Y under
the fixed emission matrix E. The functional F satisfies the following properties:

1. For a given T, F(Q, T) gives a lower bound on the log-likelihood ℒ(T), i.e.,
for any choice of Q(S|Y),

ℒ(T) ≥ F(Q, T)

This is a consequence of Jensen's inequality for expectation values of concave
functions of real-valued random variables.

2. The transition matrix T and the observed sequences Y together define a
probability distribution over possible state sequences, which we denote P(S|Y, T)
(discussed later). For given T and observed sequences Y, F(Q, T) is maximized
when Q = P(S|Y, T), and the maximum is ℒ(T). This can be seen from the
definition of F above.

3. For a fixed choice of Q, and given the observed data, F(Q, T) can be maximized
by tuning T.
Each iteration of the Baum-Welch algorithm takes in an initial T_in. From property
#2 above, we have

ℒ(T_in) = F(Q̄, T_in), where Q̄(S|Y) = P(S|Y, T_in)

From property #3, a new T_out can be inferred by maximizing F with respect to T,
so that

F(Q̄, T_out) ≥ F(Q̄, T_in).

From property #1, F(Q̄, T_out) forms a lower bound on ℒ(T_out), i.e.,

ℒ(T_out) ≥ F(Q̄, T_out)

Combining these inequalities, we see that

ℒ(T_out) ≥ ℒ(T_in)

Thus we have inferred a transition matrix T_out that improves the log-likelihood
compared to T_in.
Optimization steps

The previous section explains why the Baum-Welch algorithm succeeds in improving
the log-likelihood. In practice the algorithm is useful because the optimization
steps described in properties #2 and #3 above are feasible using the
forward-backward algorithm.

The maximization of F(Q̄, T) with respect to T to obtain T_out, described in
property #3, is also tractable. To see this, we first note that the maximization
of F with respect to T is equivalent to the maximization of

F_Q = ⟨∑_S Q(S|Y) log P(S, Y|T)⟩
Labelling the matrix elements of T as t_{ss′}, the joint distribution of S and Y
can be written as

P(S, Y|T) = δ_{Y, O(S)} P(S|T) = δ_{Y, O(S)} ∏_{s,s′} t_{ss′}^{k_{ss′}}

where O(S) is the observed sequence resulting from a state sequence S; this is
known since the emission matrix of the POMM is assumed to be known. δ_{a,b} is 1
if a = b and 0 otherwise, and k_{ss′} is the number of occurrences of the
transition s to s′ in the state sequence S. Incorporating the constraint that the
rows of the transition matrix must be normalized, we have to maximize the
following quantity with respect to t_{ss′} and the Lagrange multipliers λ_s:

⟨∑_S Q(S|Y) log ∏_{s,s′} t_{ss′}^{k_{ss′}}⟩ − ∑_s λ_s (∑_{s′} t_{ss′} − 1)

From straightforward maximization we find that

t_{ss′} = (1/λ_s) ⟨∑_S Q(S|Y) k_{ss′}⟩
where the λ_s are determined by normalization. The Baum-Welch algorithm calculates
this efficiently employing the forward-backward procedure described below. With τ
the length of S and S(n) representing the state at position n in the sequence,
we write

k_{ss′} = ∑_{n=1}^{τ−1} δ_{s,S(n)} δ_{s′,S(n+1)}

Plugging this into the above equation, we have

t_{ss′} = (1/λ_s) ⟨∑_S ∑_{n=1}^{τ−1} Q(S|Y) δ_{s,S(n)} δ_{s′,S(n+1)}⟩
Note that, from the discussion in the previous section,
Q(S|Y) = P(S|Y, T^in) = P(S, Y|T^in)/P(Y|T^in). The above quantity can then be
written as

(1/λ_s) ⟨(1/A_τ(0|Y)) ∑_{n=1}^{τ−1} A_n(s, Y_{1→n}|T^in) t^{in}_{ss′} B_{n+1}(s′, Y_{n+1→τ}|T^in)⟩

where A_n(s, Y_{1→n}|T^in) is the probability of finding the state to be s at
step n, starting from the beginning of the sequence:

A_n(s, Y_{1→n}|T^in) = ∑_{(S_1, S_2, ..., S_n)} δ_{S_n, s} ∏_{i=1}^{n} E_{S_i, Y_i} t^{in}_{S_{i−1} S_i}

Here E is the emission matrix, and E_{S_k, Y_k} is 1 if S_k can emit Y_k and zero
otherwise. S_0 is set to a dormant/start/end state denoted 0. A_τ(0, Y|T^in) gives
the probability of returning to the dormant state after τ steps and therefore gives
the probability P(Y|T^in). B_n(s, Y_{n+1→τ}|T^in) is the probability of finding
the state to be s at step n + 1, together with the observed subsequence Y_{n+1→τ}:

B_n(s, Y_{n+1→τ}|T^in) = ∑_{(S_{n+1}, S_{n+2}, ..., S_τ)} δ_{S_{n+1}, s} ∏_{i=n+1}^{τ} E_{S_i, Y_i} t^{in}_{S_i S_{i+1}}

with S_{τ+1} = 0.
The above expressions for t_{ss′}, A, and B can be evaluated computationally given
the observed sequences and the input T^in. The start and end states and syllables
need careful treatment in the actual implementation; the details of our
computational implementation of the Baum-Welch and forward-backward algorithms
can be found in Appendix A. To summarize, for a specified POMM structure
(number of states and assignment of symbols to states), the Baum-Welch algorithm
estimates the transition probabilities. However, we would also like an efficient
way of finding the optimal POMM structure.
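As a concrete illustration of the quantities A and P(Y|T), the forward recursion for a small POMM can be sketched as follows. This is a simplified sketch, not the Appendix A implementation: each state deterministically emits one symbol, and state 0 is the shared start/end state.

```python
def forward_prob(seq, T, emit, start=0):
    """P(Y|T) for a POMM in which state s always emits the symbol emit[s].
    T[s][s2] are transition probabilities; `start` is the start/end state."""
    states = range(len(T))
    # alpha[s] = probability of being in state s having emitted seq so far
    alpha = {s: T[start][s] for s in states if emit.get(s) == seq[0]}
    for y in seq[1:]:
        alpha = {s2: sum(a * T[s1][s2] for s1, a in alpha.items())
                 for s2 in states if emit.get(s2) == y}
    return sum(a * T[s][start] for s, a in alpha.items())
```

For a POMM with two states emitting A (one that always ends, one that always continues to B), alpha tracks both hidden states at once, and the model assigns probability 0.5 to each of the sequences A and AB.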
2.4.2 Grid search for an optimal model
We can consider the problem of finding the optimal POMM structure as a param-
eter search on a grid. One of the main components of the model structure is the
number of states associated with each symbol. The combinatorial challenge of de-
termining the optimal number of states in a POMM increases in difficulty with the
Figure 2.10. Schematic of the search on a grid for an optimal model. The grid is a discrete lattice of dimension equal to the number of unique syllables. Each node on the grid corresponds to a specific state vector. The node shown in blue represents the mapping of every syllable to one state - a Markov model. At each node of the grid, an optimal model is found using the Baum-Welch algorithm.
number of syllables in a bird’s repertoire. The trial and error method of finding the
number of states associated with each syllable becomes almost impossible for birds
like the nightingale and the warbler that sing more than 100 types of syllables.
We can define a state vector associated with every model. The ith element of the
state vector S = (s_1, s_2, . . . , s_m) is the number of states assigned to
symbol i in the model. If all elements of S are 1, the state vector represents a
Markov model; if any s_i ≠ 1, it represents a POMM. The grid is a discrete
lattice with the number of dimensions equal to the number of unique symbols in the
set of sequences, and each node represents a unique state vector. The left panel
of Fig. 2.10 shows an example of a three-dimensional grid. The parameter search
starts with the assignment of one state to every syllable, represented by the blue
node on the grid in Fig. 2.10; this corresponds to a Markov model. At every node
on the grid, the optimal transition matrix corresponding to that particular state
vector is obtained using the Baum-Welch algorithm, as represented by the right
panel of Fig. 2.10.
We then consider the addition of a single state to each of the syllables in
turn. For each addition considered, the Baum-Welch algorithm is again used to
obtain the optimal model corresponding to that state vector. The addition with
the highest log-likelihood is retained, while other additions are discarded. We
now consider another round of state additions to the model retained in the
previous step. Although there is no guarantee that the path through the grid is
globally optimal, local log-likelihood maximization at each grid point is
guaranteed. There is, however, an upper bound on the log-likelihood for a given
dataset, which depends on the finiteness of the dataset.
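The greedy walk on the grid can be sketched independently of the Baum-Welch details. Here `fit_pomm` and `accept` are hypothetical callbacks standing in for the Baum-Welch fit of Sec. 2.4.1 and the stopping criterion of Sec. 2.4.4:

```python
def grid_search(symbols, fit_pomm, accept, max_rounds=50):
    """Greedy search on the state-vector grid.  fit_pomm(state_vec) is
    assumed to return (model, log_likelihood) for that structure;
    accept(model) implements the stopping criterion.  The search starts
    at the Markov-model node (one state per symbol)."""
    state_vec = {s: 1 for s in symbols}
    model, ll = fit_pomm(state_vec)
    for _ in range(max_rounds):
        if accept(model):
            break
        best = None
        for s in symbols:                    # try adding one state per symbol
            trial = dict(state_vec)
            trial[s] += 1
            m, l = fit_pomm(trial)
            if best is None or l > best[2]:
                best = (trial, m, l)
        state_vec, model, ll = best          # keep the best addition only
    return state_vec, model
```

Only the single state addition with the highest log-likelihood is retained in each round, exactly as in the procedure described above.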
2.4.3 Establishing error bounds

The entropy of a distribution ideally should not depend on the number of sample
sequences available. Hence, to find error bounds on the log-likelihood, we first
find error bounds on the entropy and then multiply by the total number of observed
sequences. The basic assumption is that if we know the subsampled mean and
standard deviation obtained from a data sample (a set of song sequences in our
case), we can infer the true mean and standard deviation. Why is there a true
standard deviation at all? If we had an infinite set of sequences, the standard
deviation would be zero; but we must account for the fact that we only have
finite data. The procedure to obtain the true mean and standard deviation and the
sample (bootstrapped) mean and standard deviation is as follows. We first pick a
POMM and treat it as our true source or generator. For example, in the analysis
that follows we consider 10-state and 17-state POMMs built on N = 845 songs
of Bengalese finch Bird 2. We then create 1000 sets of N sequences generated
from the POMM (the generator), calculate the entropy of each set, and find the
mean and standard deviation of the distribution so obtained. This is the true mean
and standard deviation. We then consider a single set of N sequences as our
sample data. We take a fraction α of these sequences, picking α × N sequences
without replacement (1000 times), and calculate the mean and standard deviation
of the resulting entropy distribution. These are the sample mean and sample
standard deviation. We can now compare these to the true values. If we find a
functional relationship between the two, we can use it to choose error bounds for
the data at hand.
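The subsampling half of this procedure can be sketched as follows (a sketch; the entropy used here is the plug-in entropy of the empirical sequence distribution):

```python
import random
from collections import Counter
from math import log

def entropy(seqs):
    """Plug-in Shannon entropy of the empirical sequence distribution."""
    n = len(seqs)
    return -sum(k / n * log(k / n) for k in Counter(seqs).values())

def subsample_entropy(seqs, alpha, trials=1000, seed=0):
    """Mean and standard deviation of the entropy over repeated
    subsamples of size alpha * N drawn without replacement."""
    rng = random.Random(seed)
    k = max(1, int(alpha * len(seqs)))
    vals = [entropy(rng.sample(seqs, k)) for _ in range(trials)]
    mu = sum(vals) / trials
    sd = (sum((v - mu) ** 2 for v in vals) / trials) ** 0.5
    return mu, sd
```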
2.4.4 Grid search stopping criterion
Any good model should be sufficiently generalizable. This means that the param-
eters of the model should not be fine-tuned to match the specifics of the observed
[Figure 2.11: σ_sample/σ_true vs. subsampling fraction α.]

Figure 2.11. The scaling of the sample standard deviation to the true standard deviation is shown as a function of the fraction α of sequences chosen for sub-sampling. The blue and red markers represent samples from two different models. The scaling is model-independent.
sequence set precisely. It should instead be representative of the general rules un-
derlying these sequences such that any other sequence produced using the same
rule could also be produced by the model. In order to ensure that the POMMs
we infer are sufficiently generalizable, we split the dataset into two - a training or
fitting set, and a testing or validation set. Consider the grid in Fig. 2.10. Using
the sequences in the training set as input, the Baum-Welch algorithm is first used
to construct a POMM which, for an advance of a single state on the grid, gives
the maximum possible log-likelihood of the training sequences. At this point we
consider cross-validation to evaluate the POMM. We consider the sequence completeness P_c^test of the testing sequences based on the POMM constructed using the training sequences.
Sec. 2.3.1.1 describes the determination of the distribution of sequence completeness D_PC for a finite set of sequences, based on multiple random splits of the dataset into training and testing sequences. Very briefly, given a set of sequences, we construct the distribution of sequence completeness by randomly splitting the data in half multiple times and calculating the completeness of sequences in one
[Figure 2.12: schematic - the data are randomly partitioned into a fit set (used to infer a model) and a test set (used to calculate completeness); repeating this yields a distribution over sequence completeness.]

Figure 2.12. Construction of a sequence completeness distribution from the data. The set of sequences is randomly split into a test set and a fit set multiple times. For each split, the sequence completeness of the test sequences is calculated using the distribution over sequences defined by the fit set. A syntax model of the sequences is accepted if the sequence completeness under the model does not fall in the lowest 5% of the sequence completeness distribution so constructed.
half against the other half (see Fig. 2.12).
The fit and test sets Y_Fit and Y_Test represent a single such split.
Once the sequence completeness P_c^test of the test sequences is determined, a
p-value of P_c^test is computed based on the distribution D_PC. The p-value is
chosen to be the probability, under the distribution, of finding sequence
completeness values below P_c^test; it is computed assuming that the distribution
is normal. If the POMM is representative of the syntax of all sequences in the
dataset, then we expect P_c^test to fall within the range of values represented by
the distribution D_PC, or at the very least within its lower range of values. We
choose this range to exclude only the lowest 5% of values in the distribution.
This means that if p > 0.05, the POMM is accepted.
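Under the normality assumption, the p-value and the acceptance test take only a few lines; a sketch:

```python
from math import erf, sqrt

def completeness_p_value(pc_test, pc_samples):
    """P(completeness < pc_test) under a normal fit to the distribution
    D_PC built from repeated random splits of the data."""
    n = len(pc_samples)
    mu = sum(pc_samples) / n
    sd = (sum((x - mu) ** 2 for x in pc_samples) / n) ** 0.5
    if sd == 0:                              # degenerate distribution
        return 1.0 if pc_test >= mu else 0.0
    z = (pc_test - mu) / sd
    return 0.5 * (1 + erf(z / sqrt(2)))      # normal CDF at z

def accept_pomm(pc_test, pc_samples, alpha=0.05):
    return completeness_p_value(pc_test, pc_samples) > alpha
```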
2.4.5 Finding the optimal state vector
A model is accepted based on the criterion in Sec. 2.4.4 using a single split of the
dataset. However, it is possible that a different split could lead to very different
results. Hence we repeat the splitting of the data and inference of the optimal
state vector multiple times - typically 20 times. We then assign the state vector
that is inferred most frequently to be the optimal state vector.
Finally, we use the full set of observed sequences and run the Baum-Welch
algorithm to retrain a POMM with the structure specified by the optimal state
vector.
2.4.6 Reduced representation by filtering non-dominant transi-
tions
Depending on whether the objective is to construct a model that is compact and
gives us a sense of the most important features of the syntax, or to construct
a model that captures every detail of the observed set of sequences, we could
choose to prune the POMM further or not. Transitions of very low probability
could potentially be ignored without losing the key features of the syntax. Such
a reduction would help highlight key features of the syntax without taking away
from the accuracy of the model. We follow a method of backbone extraction
for complex weighted networks, where dominant weighted edges in a graph are
identified by comparing the normalized weight of each edge against the probability
that the edge of that weight could have occurred just by chance [58]. Although the
networks considered in the study were large-scale networks such as the US airport
network and the Florida Bay food web, the methods developed are relevant for
networks of any size. The null hypothesis is that the edge weights associated with
a node in the network, the transition probabilities into or out of a state in our case,
are assigned by random assignment from a uniform distribution. We only retain
transitions that reject this null hypothesis. The details of the calculations can be
found in Appendix C.
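Under the null hypothesis of uniformly random weight assignment, the significance of an edge of normalized weight p at a node with k edges is (1 − p)^{k−1}, which is the closed form of the disparity filter of Ref. [58]. A sketch of the pruning step over out-transitions:

```python
def backbone(T, alpha=0.05):
    """Keep transition (i, j) when its weight is unlikely under the null of
    uniformly random weight assignment, i.e. (1 - p)**(k - 1) < alpha,
    where k is the number of non-zero out-transitions of state i."""
    kept = []
    for i, row in enumerate(T):
        edges = [(j, p) for j, p in enumerate(row) if p > 0]
        k = len(edges)
        for j, p in edges:
            if k == 1 or (1 - p) ** (k - 1) < alpha:
                kept.append((i, j))
    return kept
```

For a row [0.9, 0.05, 0.05], only the dominant 0.9 transition survives at the 5% level, since (1 − 0.9)² = 0.01 while (1 − 0.05)² ≈ 0.90.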
2.4.7 Checking for equivalent POMMs - state-merging
Another useful procedure to ensure that we obtain the most compact POMM
structure is to allow for the merging of states since there could be equivalent
POMMs of different sizes. We consider the possibility of merging states associated
with the same syllable based on a concatenated observation sequence O as follows.
We use the forward-backward algorithm to find the probability of each state at
every step of the concatenated sequence. For two states 1_A and 1_B that both emit
syllable 1, we find p_iA = p(S_i = 1_A|O, T, E) and p_iB = p(S_i = 1_B|O, T, E),
where S = (S_1, S_2, . . . ) is the state sequence. If p_iA and p_iB are both
non-zero at every step i where O_i = 1, then the two states are good candidates
for merging. If p_iA = 0 whenever p_iB ≠ 0, and vice versa, most of the time, then
they are less mergeable. Using this idea we can calculate a 'merge score', which
we define as

M = (∑′_i (p_iA + p_iB)) / (∑_i (p_iA + p_iB))    (2.4)

where the primed sum in the numerator runs over those steps at which p_iA and
p_iB are both non-zero. Note that M = 0 if the states emit different syllables.
Once M is
calculated for every pair of states, we decide on the ideal choice for merging
by starting with the pair with the largest M and moving through the remaining
pairs in decreasing order of merge score. For every merger, the log-likelihood of
the merged model is calculated, and the model with the maximum log-likelihood is
chosen. We then move on to the merges possible with the new model. The algorithm
thus approaches state-merging in two stages:

1. Merge using maximum merge scores and log-likelihoods.

2. In the model generated by stage 1, look for multiple states associated with
the same syllable and check whether they share all their parent states in common.
If so, merge the states.
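The merge score of Eq. 2.4 can be sketched as follows, with pA and pB the per-step posterior probabilities of the two candidate states:

```python
def merge_score(pA, pB, tol=1e-12):
    """Eq. 2.4: fraction of the combined posterior mass of states A and B
    that falls on steps where both probabilities are simultaneously non-zero."""
    total = sum(a + b for a, b in zip(pA, pB))
    both = sum(a + b for a, b in zip(pA, pB) if a > tol and b > tol)
    return both / total if total > tol else 0.0
```

Disjoint posteriors give M = 0 (the states explain different steps and should not be merged), while identical supports give M = 1.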
2.4.8 Demonstration of grid search with a toy model
We first consider the inference of a POMM given an artificially generated dataset
that only contains sequences AB and CBA in the ratio 3 : 1. Fig. 2.13 details
the inference procedure. The iterative inference procedure starts with a Markov
model on the grid (1,1,1) with one state assigned to each of the syllables A, B and
C. The change in log-likelihood and sequence completeness with the addition of
a state to each of the syllables in turn can be seen in the last row of the figure.
Only the addition that maximizes the log-likelihood and sequence completeness
is retained. For this small artificially generated set of sequences, with the total
addition of two states to the Markov model for syllables A and B, the true model
is inferred exactly (Pc = 1). In the case of more complex datasets, grid search is
continued until the stopping criteria outlined in Sec. 2.4.4 are met.
[Figure 2.13: three rounds of grid search starting from the Markov model (N_A=1, N_B=1, N_C=1); panels show log-likelihood and sequence completeness vs. number of states (including start state S) for the candidate state vectors at each round.]

Figure 2.13. Inference of a POMM given the artificially generated set of sequences [AB (75%), CBA (25%)]. The increases in log-likelihood and sequence completeness, as well as the model obtained at every node on the grid, are shown.
Chapter 3

Comparison of Syntactic Structures
Probabilistic models of song syntax are compact representations of rules for pos-
sible arrangements of vocal elements. An accurate model of the syntax is useful
on several fronts. Many neural and behavioral studies use tutoring paradigms in
which young birds are exposed to artificial songs in order to understand how se-
quences are learned. Having a good model of the syntax of different birds, both
conspecific and heterospecific, provides greater control over the selection and al-
teration of elements to be tutored. This is true for studies of mate preferences in
songbirds as well. Changes in syntax as a result of manipulations such as neu-
ral lesions or altered auditory feedback can also be tracked to identify the role
of different brain regions or peripheral mechanisms involved in the learning and
generation of sequences. Finally, if there exists a mapping between statistical and
neural models, then the topology of the syntax model could be used to infer the
topology of the underlying neural network.
We test if the method of inferring the POMM described in Chapter 2, leads to
a good stochastic model for the syntax of a songbird, the Bengalese finch, and a
cetacean, the humpback whale. In the case of the Bengalese finch we have access
to song sequences from birds both before and after the disruption of auditory
feedback. This allows us to study changes in syntax caused by the disruption. It
seems plausible that the removal of an important feedback signal would lead to a
loss of structure in the songs of the bird. We ask whether this is seen as a
change in the syntactic category - POMM to Markov.
3.1 The Bengalese finch
The Bengalese or society finch (Lonchura striata domestica) has long been the
subject of investigations into learning and song generation in songbirds. These
birds are age-limited or closed-ended learners - i.e., there exists a developmental
time window during which the acquisition of song by learning occurs. Experimental
studies have shown that the adult Bengalese finch relies on auditory feedback
for the maintenance of its song [59, 60]. Supporting this idea, a recent study in
these birds revealed stimulus specific adaptation to repeated auditory responses
in the HVC - a key sensory-motor nucleus implicated in song control, as well as
an immediate decrease in syllable repetitions after deafening [51]. The analysis
of Bengalese finch songs in this study focused on syllable repetitions alone. It
has also been shown in a previous study that the song of the Bengalese finch is a
POMM [6]. We construct the syntax for the songs of six Bengalese finches both
before and after deafening. We first confirm earlier results that the syntax of
Bengalese finch song is consistent with a POMM. We then compare the syntax
before and after deafening. We find that the many-to-one mapping between states
and syllables in the POMM is dependent on the presence of auditory feedback for
some of the Bengalese finches studied, while for others this is not the case. The
results suggest a more complex relationship between many-to-one mapping and
auditory feedback than an on/off dependence.
3.1.1 Description of data
The data used in this analysis is part of a larger dataset collected for another
study [51]. The authors have made the data freely available in the public domain1. The data that we consider from this set are transcriptions of song sequences
from six male adult Bengalese finches - labeled bfa14, bfa16, bfa19,bfa7, o10bk90,
and o46bk78. We will refer to these birds by labels Bird 1, Bird 2, ..., Bird 6
henceforth. Song transcriptions are available for the six birds before, and soon
after deafening (2-3 days post-deafening) by bilateral cochlear removal. Each of
the birds sings songs with about 7-9 unique syllables.
1As of May 2016, the data is freely available and can be downloaded fromhttp://users.phys.psu.edu/ djin/SharedData/ KrisBouchard/
Table 3.1. Bengalese finch song statistics

Bird ID   No. of songs                 No. of syllables
          pre-deaf.   post-deaf.       pre-deaf.   post-deaf.
Bird 1    160         271              687         744
Bird 2    57          330              578         1053
Bird 3    20          30               391         358
Bird 4    108         272              653         892
Bird 5    59          143              354         423
Bird 6    69          83               660         841
3.1.2 Identification of songs from recording transcriptions
In the dataset, syllables are labelled a through l, and x through z, as well as
with symbols 0 and - for unknown syllable identities. A Bengalese Finch song
bout typically begins with short vocalizations called introductory notes. The role
of introductory notes has long been open to speculation. In zebra finches it has
been observed that the number of introductory notes before a bout correlates with
the time since the previous bout suggesting that introductory notes reflect neural
preparation before initiation of a learned motor sequence [61]. Introductory notes
are therefore not considered part of a song sequence. In the dataset, these notes are
i, j, k, and l. We define songs in the transcribed sequences as segments that appear
between introductory notes. In keeping with convention, introductory notes are
not retained as part of the song. However, it must be pointed out that other studies
consider isolated introductory notes that appear between song bouts to be part
of the song [59]. Transcriptions that contain the unknown symbols 0 and - are
not considered for the analysis. With the removal of these syllables, the number
of sequences obtained for each individual before and after deafening are shown in
Table 3.1. As seen in Table 3.1, the number of sequences obtained pre-deafening is
in all cases much smaller than the number available post-deafening. Since we
segment sequences at the appearance of introductory notes, it is possible that
there were simply more introductory notes in the post-deafening songs, leading to
a larger number of segments. Table 3.1 also lists the total number of syllables
available from each recording. If the numbers pre- and post-deafening were
comparable, then the larger number of sequences obtained post-deafening could be
the result of more introductory notes; however, given the numbers, it is hard to
make this inference. The difference in the number of samples before and after
deafening will influence any comparison of sequences if not explicitly accounted
for, and we must therefore treat this difference carefully in our calculations.
[Figure 3.1: inferred syntax diagrams for Bird 1. (a) Pre-deafening: nine states for eight syllables A-H, with syllable C mapped to two states. (b) Post-deafening: eight states, one per syllable A-H.]

Figure 3.1. Song syntax of Bengalese finch, Bird 1, before and after deafening.
3.2 Syntax of Bengalese finch song
Songs of the Bengalese finch consist of variable sequences of discrete syllables.
The variability could be in the number of syllable types sung in each sequence,
the length of the sequence, the number of times a syllable is repeated and so on.
The syllable sequences follow probabilistic rules, and have been shown to be well-
described by the Partially Observable Markov Model with Adaptation (POMMA)
[6]. However, the inference of the model in the study involved a considerable
amount of manual fine-tuning. We use the methods developed in Chapter 2 to
derive the song syntax of the non-repeat structure of six Bengalese finches before
Figure 3.2. Song syntax of Bengalese finch, Bird 2, before and after deafening
and after deafening. The repeats in the songs of these six birds were the sole focus
of the main study [51] from which this data is taken, and are consistent with a
model of repeat adaptation which will be visited in Chapter 4 when we exclusively
discuss the repeat structure of syllables in various songbird songs. The reference
to syntax in the rest of this chapter is therefore for the non-repeat structure of
song sequences.
We confirm that the POMM is a good model of the non-repeat structure of
Bengalese finch songs based on the syntax inferred for the six birds before deafening
using grid search in combination with the Baum-Welch algorithm - Figs. 3.1(a) -
3.6(a). In the state transition diagrams for the syntax, the pink bubble is the
Figure 3.3. Song syntax of Bengalese finch, Bird 3, before and after deafening
start/end state, while all blue bubbles represent states that have transitions to the
end state. Each state is indexed by a number shown in the bubble and is associated
with the syllable indicated in brackets after the number. As far as the many-to-one
mapping is concerned, five of the six birds (all except Bird 4) have at least one
syllable that is assigned to multiple states, indicating deviation from a Markov model, i.e.,
Bengalese finch song is more complex than what we would expect from a Markov
model. The largest deviation from a Markov model is through the addition of
states to three of the syllables (Bird 6). We also find diverse state transition
structures. Some are simple, mainly composed of deterministic transitions with
occasional stochastic branches (Bird 3, for example), while others are composed of
branched transitions from most states (Bird 1, for example). Given this assortment
of structures, we can conclude that there is no prototypical syntax for the songs
Figure 3.4. Song syntax of Bengalese finch, Bird 4, before and after deafening
of all Bengalese finches.
3.2.1 Changes in syntax caused by deafening
Experimental studies have shown that the Bengalese finch relies on auditory feed-
back for the maintenance of its song [59,60]. Disruption in auditory feedback does
not have the same effect on the songs of different species of songbirds. In zebra
finches which are close-ended learners with fixed song, there is no immediate de-
terioration in the song [62, 63]. With bilateral cochlear removal, the crystallized
song of the adult zebra finch deteriorated only slowly: 16 weeks after surgery, it
retained just about 36% of the song syllables produced before surgery. However, in
Bengalese finches, which are also close-ended or age-limited learners but have
variable or stochastic song sequences, the change is immediate [59, 60].

Figure 3.5. Song syntax of Bengalese finch, Bird 5, before and after deafening

As far as neural
responses are concerned, when a singing Bengalese finch is presented with tran-
sient perturbations to auditory feedback, reliable decreases in HVC activity are
observed at short latencies [64]. This means that auditory signals have real-time
access to the song-generating circuitry. The abstract states in a POMM must have
neural correlates. One possibility is that each state is a chain network of neurons
as in the branching synfire chain model discussed in Sec. 1.2.2, and a many-to-one
mapping means that the same syllable is redundantly encoded by multiple chains. The
other possibility is that multiple states associated with the same syllable represent
Figure 3.6. Song syntax of Bengalese finch, Bird 6, before and after deafening
different activity patterns in the same chain network. The differences in activity
could be driven by differing feedback, including auditory feedback. It is therefore
reasonable to speculate that auditory feedback may be necessary for the many-to-
one mapping in the syntax of the Bengalese finch and that the removal of auditory
feedback will drive changes in the structure of the POMM.
On inferring the syntax of song sequences after deafening for the same six
birds, we notice several significant changes (syntax seen in Figs. 3.1(b) - 3.6(b)).
Figure 3.7. On average the state transition entropy increases and sequence length decreases after deafening.
Firstly, there is an increase in the total number of possible transitions, and the
transitions appear to be more random. We can test this by comparing the state
transition entropy as in Sec. 2 before and after deafening. Larger entropy values
after deafening indicate more random transitions. Secondly, the average length
of sequences is smaller after deafening. Both these can be seen in Fig. 3.7. One
possible cause for this could be an increase in how frequently an introductory note
is produced, since song sequences are defined to be the segments between two of
these notes.
Finally, some of the many-to-one mappings between the states and the sylla-
bles vanish after deafening. This is an indication of the syntax becoming more
Markovian. We would like to compare how Markovian the syntax is before and
after deafening. This can be done by computing sequence completeness under a
Markov model pre- and post-deafening - i.e., we infer Markovian transition prob-
abilities from the data and calculate the probabilities of all unique sequences in
the dataset based on these probabilities. The values obtained are displayed in Fig.
3.8. However, this is where we must use caution. A relatively higher sequence
completeness value does not necessarily indicate a more Markovian model. This
is based on our discussion in Chapter 2 where we determined that the sequence
completeness values obtained using models inferred from different numbers of se-
quences are not directly comparable. We must instead compare the p-values of the
sequence completeness obtained under these Markov models. To obtain p-values,
Figure 3.8. Sequence completeness of Bengalese finch song sequences under Markov models before and after deafening. Since the number of sequences is different pre- and post-deafening, direct comparisons of sequence completeness values should not be made.
we first construct sequence completeness distributions using multiple splits of the
sequence set into fit and test sets and finding the completeness of the test set using
the probability distribution over sequences defined by the fit set as described in
Sec. 2.3.1.1. For the syntax to be considered Markovian, the sequence completeness
under the Markov model should not fall below the 5th percentile of the sequence
completeness distribution, or, equivalently, the p-value should be greater than or
equal to 0.05. Based on this criterion, four out of the six birds studied have a syntax
that is more Markovian after deafening (see Fig. 3.9 and Table 3.2).
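A minimal sketch of this procedure follows, under the assumption (ours, not necessarily the exact Chapter 2 definition) that sequence completeness is the total probability mass a model assigns to the unique observed sequences; `fit_markov` and the other helpers are hypothetical names:

```python
import random
from collections import Counter, defaultdict

def fit_markov(seqs):
    """First-order transition probabilities, with start state 'S' and end state 'E'."""
    counts = defaultdict(Counter)
    for seq in seqs:
        path = ["S"] + list(seq) + ["E"]
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def markov_prob(model, seq):
    """Probability of one sequence under the Markov model."""
    p = 1.0
    path = ["S"] + list(seq) + ["E"]
    for a, b in zip(path, path[1:]):
        p *= model.get(a, {}).get(b, 0.0)
    return p

def markov_completeness(seqs):
    """Summed Markov-model probability of the unique observed sequences."""
    model = fit_markov(seqs)
    return sum(markov_prob(model, s) for s in set(seqs))

def completeness_p_value(seqs, n_splits=200, seed=0):
    """Null distribution: split into fit/test halves, score unique test
    sequences by their empirical frequency in the fit set; the p-value is
    the fraction of null values at or below the Markov completeness."""
    rng = random.Random(seed)
    c_markov = markov_completeness(seqs)
    null = []
    for _ in range(n_splits):
        pool = list(seqs)
        rng.shuffle(pool)
        half = len(pool) // 2
        fit, test = pool[:half], pool[half:]
        freq = Counter(fit)
        null.append(sum(freq[s] / len(fit) for s in set(test)))
    return c_markov, sum(c <= c_markov for c in null) / n_splits
```

With this reading, a syntax is treated as consistent with a Markov model when the returned p-value is at least 0.05.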
3.2.2 Persistence of dominant transitions after deafening
As described in Sec. 2.4.6, the dominant transitions in any probabilistic finite state
model can be found by testing if each transition probability is greater than or equal
to the value we would expect if the values were drawn from a uniform distribution.
Since our goal is to have a concise representation of the syntax that also results in
a sequence completeness with p-value greater than 0.05, once we find the dominant
transitions in the syntax at different significance levels α, we select the syntax
Figure 3.9. p-values of sequence completeness under Markov models before and after deafening based on the distributions generated from the data. A low p-value (p < 0.05) indicates that a Markov model is not a good representation of the sequence syntax. Each circle represents a single bird and the lines connect p-values for the same bird pre- and post-deafening.
Table 3.2. Sequence completeness and p-values under Markov models

          Before deafening                After deafening
Bird ID   Sequence completeness  p-value  Sequence completeness  p-value
Bird 1    0.83                   0.03     0.81                   0.95
Bird 2    0.36                   0.001    0.90                   0.003
Bird 3    0.33                   0.002    0.24                   0.24
Bird 4    0.67                   0.04     0.79                   0.84
Bird 5    0.40                   0        0.82                   0.003
Bird 6    0.23                   0        0.35                   0.64
structure at the α for which the p-value condition is satisfied. As an illustration
of this, Fig. 3.11 shows the dominant transitions for Bird 1 after deafening at
significance levels α ranging from 0.1 to 1. As seen from the p-values of sequence
completeness displayed in Fig. 3.10b, the smallest α for which the p-value ≥ 0.05
is α = 0.7. This structure gives us a way of visualizing whether the most
dominant transitions remain unchanged as the result of deafening, as seen
Figure 3.10. Sequence completeness and corresponding p-values under syntax models with only the dominant transitions retained at significance levels α
Table 3.3. Sequence completeness and p-values for pre-deafening sequences based on syntax post-deafening

Bird ID   Sequence completeness  p-value
Bird 1    0.19                   0
Bird 2    0.88                   0.99
Bird 3    0.06                   0
Bird 4    0.38                   0
Bird 6    0.11                   0
in Fig. 3.11 for the syntax of Bird 1 before and after deafening.
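The exact test of Sec. 2.4.6 is not reproduced here; as a hedged stand-in, the sketch below marks a transition dominant when its count is significantly above the uniform expectation 1/k over the state's k outgoing branches (one-sided binomial test at level α; `dominant_transitions` is a hypothetical helper):

```python
from math import comb

def binom_sf(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(x, n + 1))

def dominant_transitions(counts, alpha):
    """counts: {state: {next_state: count}}. Keep a transition as dominant
    when its count is significantly above the uniform expectation 1/k over
    the state's k branches (one-sided binomial test at level alpha)."""
    kept = {}
    for state, nxt in counts.items():
        k, n = len(nxt), sum(nxt.values())
        kept[state] = {t: c for t, c in nxt.items()
                       if binom_sf(c, n, 1.0 / k) <= alpha}
    return kept
```

Consistent with the sweep in Fig. 3.10, raising α retains more transitions, so the resulting structure interpolates between the most dominant skeleton and the full transition diagram.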
Based on the diversity of inferred syntax structures, we conclude that normal
Bengalese finch song syntax with its many-to-one mappings between states and
syllables is more complex than a Markov model. Deafening reduced this complexity
in most of the birds studied, while the complexity persisted in others. These
observations suggest that auditory feedback can induce complexity in the Bengalese
finch song syntax, but is not sufficient to explain the complexity entirely. There
must also be contributions from other intrinsic factors. We suggest that the
Bengalese finch song syntax is encoded in the interplay between auditory feedback
and the intrinsic song-generating circuitry.
Figure 3.11. Dominant transitions in the syntax after deafening compared with the syntax before deafening for Bird 1. Most frequent sequences are retained with lower probabilities after deafening, but many new transitions are introduced. The many-to-one mapping for syllable C is also lost after deafening.
3.3 Humpback whale
Among cetaceans, the humpback whale (Megaptera novaeangliae), found in all
oceans of the world, is capable of complex vocalizations called whale song [65,66].
Whale song is composed of an arrangement of mostly low-frequency vocalizations
(10 Hz - 2 kHz) continuously sung for as long as 30 minutes [65], with a typical
length of 10-20 minutes. The longest songs recorded in the animal kingdom have
been that of the humpback whale. As far as it is known, just as in songbirds,
Figure 3.12. Transcription of part of a humpback whale song. The units in the song are ‘BA’, ‘BW’, ‘GR’, ‘HMM’, ‘HSH’, ‘HSQ’, ‘LBA’, ‘LGO’, ‘MM’, ‘RA’
the singers are always males. The exact purpose of song in these cetaceans is not
agreed upon, but most songs have been recorded during the breeding season [67].
Humpback whales are migratory - migrating annually between high-latitude
waters in summer and low-latitude waters in winter. Whales within a population
sing roughly the same song (similar arrangement of units) [65, 66]. The song that
the population sings can evolve from year to year, with all members of the population
adopting the changes. Transmission of a song type between different populations
in the western and central South Pacific over an 11-year period has been demonstrated
[68]. This is evidence of vocal learning among humpback whales. There is definite
structure to the arrangement of acoustic ‘units’ (syllables in the songbird literature)
in whale song. It has been demonstrated that independently identically distributed
(iid) and first order Markov models fail to capture the full structure of humpback
whale songs [69]. This analysis was from an information theoretic perspective with
the rejection of simple models based on the discrepancy in expected and observed
entropy values. If the syntax is not Markovian, we would like to test whether it is
a POMM. However, the methods we have developed are not suitable for the kind
of song sequence data available for humpback whales. The following sections give
a description of the available datasets and explain the need for developing better
methods of syntax inference than what we have for birdsong. Some preliminary
analyses of alternatives are also described.
3.3.1 Description of data
The dataset consists of transcriptions from recordings of a population of 11
humpback whales recorded off the coast of Eastern Australia in 2003 (record-
ing by Michael Noad, Cetacean Ecology and Acoustics Laboratory, The Univer-
sity of Queensland, transcription by Luca Lamoni, Luke Rendell, University of
St. Andrews). There is one continuous recording available for each whale, lasting
between 15 and 30 minutes. In the transcriptions, the basic element of the se-
quence is the unit or syllable. In terms of the number of units, the longest song
has 2167 units. There are 10 kinds of units in the song - ‘BA’, ‘BW’, ‘GR’, ‘HMM’,
‘HSH’, ‘HSQ’, ‘LBA’, ‘LGO’, ‘MM’, ‘RA’. An example of the transcription for part
of one whale’s song is shown in Fig. 3.12.
3.3.2 Challenges in inferring the syntax of humpback whale song
With animal vocal sequences that are hard to record or very long - such as hump-
back whale songs, the data set often contains one or a few sequences. If we only
have one long sequence to train the POMM on, the grid search method we dis-
cussed in Chapter 2 cannot be used since it depends on the calculation of bounds
on log-likelihood and sequence completeness using multiple observed sequences for
the same individual. Could we use sub-sequences generated from the long sequence
to construct bounds on the maximum possible log-likelihood of the long sequence
given a syntax model? Is there a choice of segmentation length for which we could
consider the corresponding sub-sequences to be independent? Such a possibility
would be a starting point in the syntax analysis of long sequences.
Entropy calculations for long sequences. Our goal is to relate the entropy of
sub-sequences obtained from the long sequence to the log-likelihood of the long
sequence. To do so we first need to understand entropy calculations using data
with correlations, and how the choice of length of the sub-sequence affects these
calculations. We first chose a humpback whale song (without syllable repeats), and
calculated the entropy of sets of sub-sequences. Each set contained sub-sequences
of a particular length. The entropy shows an initial increase with segmentation
length, peaks, and then begins to decrease. We can understand this in terms of the
Figure 3.13. Dependence of sequence entropy on the number of unique sub-sequences obtained using different segmentation lengths for a 300-unit long humpback whale song
number of unique sub-sequences that the segmentation results in. As can be seen
in Fig. 3.13, the entropy values follow the increase and decrease in the number of
unique sub-sequences resulting from different segmentation lengths.
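The segmentation-entropy computation just described can be sketched as follows (entropy in nats; a minimal illustration, not the exact pipeline used for Fig. 3.13):

```python
from collections import Counter
from math import log

def subsequence_entropy(seq, k, overlapping=False):
    """Entropy (in nats) of the empirical distribution of length-k
    sub-sequences of a long symbol sequence."""
    step = 1 if overlapping else k
    subs = [tuple(seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]
    n = len(subs)
    freq = Counter(subs)
    return -sum((c / n) * log(c / n) for c in freq.values())
```

For a perfectly periodic sequence such as "ababab…", non-overlapping segmentation at k = 2 yields a single sub-sequence type and zero entropy, while overlapping segmentation yields two types and an entropy near log 2.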
However, the curves seen in the figure could be dependent on the length of
the long sequence, i.e., the system size S. If S is large, does this shift the location
of the peak? We would expect that the number of unique sequences and thereby
the entropy would begin to decrease for larger segmentation lengths, i.e., the peak
would occur at larger segmentation lengths. Before we use observed data, we would
like to first understand this dependence using randomly generated sequences of
different lengths. The variation in entropy with segmentation length k is studied.
Considering a randomly generated sequence of size S, let \( n_k \) denote the number of
possible sub-sequences of length k. For small k, \( n_k \ll S \). Since the sub-sequences
are independent, the probability of occurrence of each is simply \( 1/n_k \). Hence the
entropy of the sub-sequences is

\[ E = -\sum \frac{1}{n_k} \log\left(\frac{1}{n_k}\right). \]

There are \( n_k \) terms in the sum, therefore \( E = \log(n_k) \). Since the number of possible
combinations of length k is much smaller than the system size, we can get a good
estimate of the distribution of sub-sequences of length k and thereby the entropy for
small k. This is seen as the agreement with the lower asymptote in Fig. 3.14. For
Figure 3.14. Dependence of sequence entropy on segmentation length and system size using randomly generated sequences. The asymptotes for large and small segmentation lengths are displayed.
large k, \( n_k \gg S \), i.e., the number of possible sub-sequences of length k is much
greater than the system size. Hence we expect the combinations that do occur to
appear only once. The number of sub-sequences that can occur is then \( S/k \), so the
entropy in this limit is \( E = \log(S/k) \), the upper asymptote drawn for different
system sizes in Fig. 3.14. The crossover between the two asymptotes occurs at a
segmentation length k such that \( n_k \approx S/k \). It is now reasonable to ask how the
entropy of the actual humpback whale song sub-sequences compares to these
asymptotes. We repeated the analysis above for real song sequences using both
non-overlapping and overlapping segments of length k. Results are shown in
Fig. 3.15. As we expect, the sequences are far from random, as seen by the deviation
from the lower asymptote. For large k, the bottleneck is the system size, and we see
good agreement with the upper asymptote, i.e., the expectation for a random
sequence: for large segmentation lengths there are so few unique sub-sequences that
the segmented song is effectively random.
3.3.3 Markov model of humpback whale themes
Humpback whale sequences are often referred to as possessing a hierarchical struc-
ture [65]. Units are arranged into phrases, phrases into themes, and themes into
songs. The analysis of syntax in humpback whale songs in studies [68, 69] assume
Markovian transitions between themes with a theme defined to be a segment of
Figure 3.15. Entropy of a segmented humpback whale sequence depends on the segmentation length k as well as the total system size S. For small k, entropy calculations do not agree with those expected from random sequences, demonstrating that the segments have non-random structure. However, for large k, the agreement with the upper asymptote implies that they cannot be used to infer the structure of humpback whale songs. This is due to large segmentation lengths leading to very few unique sequences.
the song in which specific units repeat in combination until a new combination of
units is sung.
We suggest redefining the basic element of the humpback whale song to be a
‘repeat unit’ instead of a unit. A repeat unit is a note or combination of notes
that repeat as a whole in the song. For example, in the sequence MM BA MM BA
MM LBA MM LBA MM RA RA RA RA, the repeat units are MMBA, MMLBA,
MM and RA. The number of consecutive repeats of a repeat unit is variable at
different points in the song. This number is not the same even when the repeat unit
appears in the same context at different points in the song. Hence it is reasonable
to assume that an exact number of consecutive repeats of the repeat unit is not
the elementary unit of the song that is learned, and the form of which is conserved.
The transcription in Fig. 3.12 is reorganized in terms of repeat units in Fig. 3.16
with what would be the themes in the song highlighted with different colors.
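One way to extract repeat units automatically is a greedy tandem-repeat parse. The sketch below recovers the units of the example above; `repeat_units` is a hypothetical helper, and the greedy longest-span rule is an assumption of ours, not necessarily the procedure used to produce Fig. 3.16:

```python
def repeat_units(tokens, max_unit=4):
    """Greedy left-to-right parse of a unit sequence into repeat units:
    at each position, choose the unit length (up to max_unit tokens)
    whose consecutive tandem repeats cover the longest span."""
    out, i = [], 0
    while i < len(tokens):
        best_u, best_reps = 1, 1
        for u in range(1, max_unit + 1):
            unit = tokens[i:i + u]
            if len(unit) < u:
                break
            reps = 1
            while tokens[i + reps * u:i + (reps + 1) * u] == unit:
                reps += 1
            # multi-token units must actually repeat to count as a unit
            if (u == 1 or reps > 1) and reps * u > best_reps * best_u:
                best_u, best_reps = u, reps
        out.append(("".join(tokens[i:i + best_u]), best_reps))
        i += best_u * best_reps
    return out
```

Applied to the example sequence, `repeat_units("MM BA MM BA MM LBA MM LBA MM RA RA RA RA".split())` returns `[("MMBA", 2), ("MMLBA", 2), ("MM", 1), ("RA", 4)]`, matching the repeat units identified in the text.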
We can now consider each theme to be an independent sequence and consider
the construction of finite state models for the themes. Although this does not
Figure 3.16. Transcription of part of a humpback whale song with repeat units and themes highlighted
serve the purpose of obtaining the full syntax for humpback whale songs, it is a
good segue into more complicated models. The structure of the Markov model for
themes and the match of n-gram distributions for the observed sequences and those
generated from the model are shown in Fig. 3.17. Preliminarily, we can assume
based on the distribution matches that the syntax at the level of themes has to be
more complex than a Markov model.
Figure 3.17. (a) Markov model of themes in the songs of a population of humpback whales, (b) n-gram distribution matches between observed sequences and sequences generated using the Markov model
Chapter 4

Repeat Structure in Vocal Sequences
Actions or gestures in behavioral sequences are often repeated several times. Move-
ments such as walking, clapping, breathing and blinking are all examples of such
behavior where you can identify a basic motor gesture that is stereotypically and
repetitively generated to form a sequence. Repetitive behavior can also be asso-
ciated with certain pathological conditions. In speech disorders like stuttering,
palilalia, echolalia and verbigeration, syllables, words, or sometimes whole phrases
are repeated [70]. Tourette’s syndrome is characterized by vocal tics including
repetitive throat-clearing, sniffing, and grunting, as well as motor tics such as
repetitive head-bobbing, shoulder-jerking and blinking [71]. Repetitions provide
a good system to understand features of a behavior since it becomes possible to
compare an action or gesture with what are essentially copies of itself.
Many syllables in vocal sequences are repeated multiple times before a new
kind of syllable is vocalized. We refer to a sequence of repetitions of the same
kind of syllable as a phrase or a trill. The number of times a syllable is repeated
varies from rendition to rendition of the phrase. In this chapter we discuss the
statistics of syllable repetitions in the songs of canaries, swamp sparrows and Ben-
galese finches. In canaries and sparrows, various studies have shown that there is
an innate preference for a trilled syntax with each song sequence composed only
of repeats of a single syllable type [72–75]. In Bengalese finches however, song
sequences contain a mix of repeats and single occurrences of different kinds of
syllables. We study the dependence of repetition statistics on the different time
scales involved in the production of vocalizations. Every instance of a syllable, for
example, shows remarkably little variation in duration. Timing therefore has to be
critical in the production of stereotyped motor commands. Constraints in duration
may be imposed by the specifics of neural control, or by peripheral mechanisms,
or by both. We study the relationship between syllable duration and the most
probable number of repetitions of the syllable and show that there is a precise
inverse relationship between the two for canaries and swamp sparrows, while this
is not the case for Bengalese finches.
4.1 Syllable repetitions in multiple species
Syllable repetitions or trills have been studied previously, although the term ‘trills’
may specifically refer to note/syllable repetition at rapid rates. A phrase of sylla-
ble repetitions has also been called a ‘tour’ in reference to green finch and canary
songs [76]. Trilled songs have been previously studied in the context of perfor-
mance tradeoffs between maximal frequency bandwidths and syllable repetition
rates in 34 species of the Emberizid family [77] including birds such as the swamp
sparrow, song sparrow, dark-eyed junco, canary and northern cardinal. High rep-
etition rates are associated with syllables that have narrow frequency bandwidths.
More recently it has been shown that the performance quality of trills can reflect
the age of the singer in the nightingale [78], with older males rendering trills closer
to the performance limit. The songs of multiple songbirds have been observed to
consist of variable syllable repetitions - the Black-capped Chickadee [79], Mountain
Chickadee [80] and Mexican Chickadee [81] vary the number of syllable repeats in
their song. This has been observed in a suboscine passerine as well - the Flam-
mulated Attila (Attila flammulatus), a Neo-tropical tyrant flycatcher [82]. It is
believed that suboscine songs are not learned and do not require auditory feed-
back for normal vocal behavior [33]. The presence of variable repeats in suboscine
song is therefore particularly interesting.
In zebra finches, it has been reported that syllable repetitions can be present in
the songs of non-tutored or isolate birds [83]. This seems to suggest that repetitions
may be a feature of the song that is not learned in the same manner as other
features of the song. Further, in a separate experiment that was part of the same
study, when the offspring of birds with songs that contained no syllable repetitions
were tutored by birds with repeats in their song, many developed repeats in their
song as well. However, the repeated syllables were not necessarily the ones repeated
by the tutors. The researchers concluded that the learned aspect was the tendency
to repeat syllables and not the exact syllable repetitions.
4.2 Distribution of the number of syllable repetitions
The number of repetitions of a syllable varies from rendition to rendition of a
phrase. The distribution of the number of repeats for a syllable could be monoton-
ically decreasing or peaked at repeat numbers greater than 1. If the probability of
the self-transition (a syllable transitioning into itself) is constant, and the generation
of repeats is assumed to be Markovian, then the distribution of repeats should
decrease monotonically [6]. However, the repeat distributions of some syllables of
Bengalese finches, zebra finches, and canaries are non-monotonic, mostly unimodal,
and sometimes bimodal [6, 51]. As
discussed in the text accompanying Eq. 1.1 in Sec. 1.3, a peaked repeat distribution
is indicative of a non-Markovian song syntax. Both monotonic and peaked repeat
distributions can be explained by considering the repeat probability of a syllable -
the probability of a syllable transitioning into itself - to be adaptive, i.e., changing
as a function of the number of repetitions. POMM for the non-repeat structure
of sequences combined with adaptation for the repeat structure is referred to as
POMMA [6]. Typically, for any sequence with repetitions of syllables, a POMM is
constructed using sequences with the repeats removed. So for a sequence
AAABBCDDD, we consider the non-repeat sequence ABCD to construct the POMM.
The repeat distributions for each of the syllables are then considered separately
and fitted with a model of adaptation.
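The Markovian baseline just described is easy to make concrete: with a constant self-transition probability p, the repeat-number distribution is geometric and therefore monotonically decreasing (a quick illustrative check, not analysis code from this study):

```python
def geometric_repeats(p, n_max=10):
    """Repeat-number distribution under a constant self-transition
    probability p (Markovian case): P(n) = p**(n-1) * (1 - p),
    which decreases monotonically in n."""
    return [(p ** (n - 1)) * (1 - p) for n in range(1, n_max + 1)]
```

Any peaked repeat distribution observed in the data therefore cannot come from a constant repeat probability; the probability must change with the repeat count, which is what the adaptation model below provides.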
4.2.1 Sigmoidal model of adaptation
A recent computational model suggests that long syllable repetitions are sustained
by an excitatory input to the neurons encoding the repeating syllables [51]. The
input decays over time, with a time constant that is independent of the syllable
duration. In this model, the initial probability of the syllable repeating is high due
to the strong input. However, with continued repetition, this probability decreases
Figure 4.1. Sample repeat distributions of an individual canary’s syllables and fitsbased on the sigmoidal model of adaptation
as the input decays. This process produces peaked repeat number distributions
with long tails that are typically observed for long repeating syllables. The central
assumption in this model is that given that a syllable has repeated n successive
times (n ≥ 1), the probability that the syllable occurs an (n + 1)th time is given
by
\[ p(n) = 1 - \frac{c}{1 + a\,b^{\,n}} \tag{4.1} \]
The likelihood of such an additional repeat decreases with n in a sigmoidal
manner. The parameters a and b quantify the location and shape of the drop of the
sigmoid. The model builds in the effect of syllable duration T on the location of
the drop in the sigmoid by setting
b = exp(−νT/τ)    (4.2)
where ν and τ are model parameters. The drop in the sigmoidal curve occurs at n_threshold ∼ τ/(νT). The plateaus of the sigmoid at n ≶ n_threshold are determined by
the parameters a, c. The probability of a syllable of duration T occurring exactly
n times is given by
P(n|T) = (1 − p(n)) ∏_{s=1}^{n−1} p(s) = [c / (1 + a exp(−νnT/τ))] ∏_{s=1}^{n−1} [1 − c / (1 + a exp(−νsT/τ))]
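This product form can be built up term by term, accumulating the product over s as a "survival" probability. A sketch with illustrative (not fitted) parameter values:

```python
import math

def p_repeat(n, a, c, nu, T, tau):
    """Probability of an (n+1)th occurrence given n repeats so far."""
    b = math.exp(-nu * T / tau)
    return 1.0 - c / (1.0 + a * b ** n)

def repeat_distribution(a, c, nu, T, tau, n_max=200):
    """P(n|T): probability of exactly n repeats, for n = 1..n_max."""
    dist, survive = [], 1.0   # survive = product of p(s) for s < n
    for n in range(1, n_max + 1):
        p = p_repeat(n, a, c, nu, T, tau)
        dist.append(survive * (1.0 - p))  # stop after exactly n repeats
        survive *= p
    return dist

dist = repeat_distribution(a=50.0, c=0.9, nu=1.0, T=0.1, tau=0.5)
assert abs(sum(dist) - 1.0) < 1e-6  # the distribution is normalized
```

With a strong initial input (large a), the resulting distribution is peaked well away from n = 1, as the model predicts for long repeating syllables.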
In the study, the adapting excitatory input was considered to be auditory feedback.
However, the mathematical model is relevant in the case of any other kind of
adapting feedback as well. Fig. 4.1 shows the fits of this model to the repeat
distributions for a few of an individual canary's syllables. One of the predictions of
the model is that the most probable number of repeats, or the peak repeat number
np of a syllable should be inversely proportional to the duration T of the syllable
n_p = c (1/T)    (4.3)
4.3 Evidence of inverse relationship between syllable
duration and most probable repeat number
The prediction of an inverse relationship between the syllable duration and most
probable number of repeats of a syllable is strong. If true, it would imply the
existence of fundamental biological constraints in the production of syllable repe-
titions. Further, if it holds across species, then we have identified a behavior that is likely conserved across multiple species, one that merits further experimental study since the underlying neural mechanism may be conserved as well. We test if the inverse relationship is observed for the
songs of the canary, the swamp sparrow and the Bengalese finch. We do so by
considering the fit
n_p = c T^α    (4.4)

α ≈ −1 signifies an inverse relationship.
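The exponent α and constant c can be estimated by ordinary least squares on the log-log data. A self-contained sketch (synthetic data, not the birdsong measurements):

```python
import math

def loglog_slope(durations, peak_repeats):
    """Least-squares slope (alpha) and intercept of ln(n_p) vs ln(T)."""
    xs = [math.log(t) for t in durations]
    ys = [math.log(n) for n in peak_repeats]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    return alpha, my - alpha * mx

# synthetic check: data generated with n_p = 1.5 / T must recover alpha = -1
T = [0.05, 0.1, 0.2, 0.4]
alpha, intercept = loglog_slope(T, [1.5 / t for t in T])
print(round(alpha, 3))                # -1.0
print(round(math.exp(intercept), 3))  # 1.5 (the constant c)
```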
4.3.1 Swamp sparrow
Swamp sparrow recordings of six birds were obtained from Dana Moseley and
Jeffrey Podos, University of Massachusetts, Amherst. All songs were manually
transcribed by us. Since the analysis is based on constructing distributions of
the number of repetitions of a syllable for different renditions, we require a large
number of instances of trills of the same syllable. Hence, only syllables appearing
in at least 50 trills of length equal to the peak repeat number were considered (Male
02 - syllable 1 & syllable 2, Male 06 - syllable 2 & syllable 9, Male 21 - syllable 7
& syllable 8). The data from the six birds was combined for the analysis. Syllable
duration is calculated by dividing the total duration of the trill by the number of
syllables in the trill. The most probable number of repetitions for each syllable is
calculated from the repeat distributions and the average duration of syllables are
Figure 4.2. The most probable or modal number of repeats as a function of syllable duration for six syllables from a population of six swamp sparrows, shown on linear and logarithmic axes (fit: α = 0.97 ± 0.06, c = 1.83 s). Song recordings from Dana Moseley and Jeffrey Podos, University of Massachusetts, Amherst.
calculated using 20 instances of each syllable. A linear fit on the logarithmic scale
is considered. An inverse relationship (α = −0.97) does exist between syllable
duration and most probable repeat number for the swamp sparrow (population of
six birds) as seen in Fig. 4.2.
4.3.2 Canary
Canary songs are composed of repetitions of syllables separated by brief silent
intervals. A set of repetitions of a syllable is usually referred to as a phrase or
a tour. Every phrase in a canary's song contains only repetitions of the same
syllable. Canary recordings and transcriptions for six birds were obtained from
Jeffrey Markowitz and Timothy Gardner, Boston University. Each canary sings
between 20 and 25 distinct syllable types, enough that we can test the inverse relationship for each individual bird. The relationship holds remarkably well for each of the six birds
studied as seen in Fig. 4.3.
For the sake of completeness, we would like to check if the inverse relationship
is exact at the population level, i.e., if we consider the combined data of all six
birds. However, since the value of c is different for different birds, we can imagine
that the data from different birds might possibly lie on parallel lines, leading to a
spread in the data that would not be ideal for a good linear fit. One way around
the problem is to collapse the data.
Figure 4.3. Inverse relationship between syllable duration and most probable repeat number for six individual canaries (Bird 1: c = 1.535 s, α = −1.1 ± 0.14; Bird 2: c = 1.354 s, α = −1.1 ± 0.13; Bird 3: c = 1.410 s, α = −1.01 ± 0.12; Bird 4: c = 1.357 s, α = −1.01 ± 0.15; Bird 5: c = 1.563 s, α = −1.0 ± 0.15; Bird 6: c = 1.743 s, α = −1.03 ± 0.19). Error bars show standard error. Canary recordings were obtained from Timothy Gardner and Jeffrey Markowitz, Boston University.
Figure 4.4. Mean and modal number of repetitions show the inverse relationship with syllable duration for the canary population (all six birds); axes are x − ⟨x⟩ vs y − ⟨y⟩ with x = ln T and y = ln n_mode (fitted α = −1.0739) or y = ln n_mean (fitted α = −1.1304).
We are trying to find the exponent and intercept of the linear fit of ln T vs ln n_mode or ln n_mean, where T is the average syllable duration, n_mode is the most probable repeat number and n_mean is the mean repeat number. Consider x = ln T and y = ln n_mode or y = ln n_mean. We can subtract the means from these variables to obtain a data collapse.
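The centering step can be sketched as follows, with two hypothetical birds obeying n_p = c/T with different constants c; after subtracting the means in log space, points from both birds fall on a single line of slope −1:

```python
import math

def center(values):
    """Subtract the mean: v -> v - <v>."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def slope(xs, ys):
    """Least-squares slope of ys vs xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# two hypothetical birds obeying n_p = c / T with different constants c
durations = [0.05, 0.1, 0.2, 0.4]
xs, ys = [], []
for c in (1.4, 1.7):
    x = [math.log(t) for t in durations]
    y = [math.log(c / t) for t in durations]
    # uncentered, the birds lie on parallel lines with intercepts ln(c);
    # centering collapses both onto a single line through the origin
    xs += center(x)
    ys += center(y)

print(round(slope(xs, ys), 3))  # -1.0
```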
Plotting the data for both n_mode and n_mean using the transformed variables y − ⟨y⟩ and x − ⟨x⟩ in Fig. 4.4, we can see that the exponents of
the linear fit are almost the same in both cases. Both the most probable number
and mean repeat number show an inverse relationship with syllable duration. We
also show that the difference between the two is not significant when analyzed
statistically. The null hypothesis for the analysis is that the slopes of the two
regression lines are equal. We obtain p=0.5465 (ANCOVA) and cannot reject the
null hypothesis. The slopes are therefore not significantly different. The differ-
ence between the slopes is 0.0565 and this lies in the 95 % confidence interval
[−0.1269, 0.2400]. If for the sigmoidal model, the most probable and mean repeat
numbers are proportional, then we should not in fact expect the slopes to be dif-
ferent. Even though an analytical expression for the mean is hard to obtain, we
can numerically simulate repeat distributions following the sigmoidal model for
Figure 4.5. The inverse relationship between syllable duration and most probable number of repetitions is not exact for Bengalese finches. Data collected by Kristofer Bouchard and Michael Brainard, University of California, San Francisco. Data available online at http://users.phys.psu.edu/~djin/SharedData/KrisBouchard/
different parameter values. We find that the two are indeed proportional and, more strongly, almost equal.
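A sketch of such a numerical check (parameter values are illustrative; mode and mean are computed directly from the distribution):

```python
import math

def sigmoidal_distribution(a, c, nu, T, tau, n_max=500):
    """P(n|T) under the sigmoidal adaptation model, built term by term."""
    dist, survive = [], 1.0
    for n in range(1, n_max + 1):
        p = 1.0 - c / (1.0 + a * math.exp(-nu * n * T / tau))
        dist.append(survive * (1.0 - p))  # stop after exactly n repeats
        survive *= p
    return dist

for a in (20.0, 100.0):
    for T in (0.05, 0.1):
        dist = sigmoidal_distribution(a, c=0.9, nu=1.0, T=T, tau=0.5)
        mode = dist.index(max(dist)) + 1
        mean = sum((n + 1) * p for n, p in enumerate(dist)) / sum(dist)
        # mode and mean track each other closely for long repeats
        print(a, T, mode, round(mean, 1))
```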
4.3.3 Bengalese finch
Although the prediction of an inverse relationship between syllable duration and
the most probable repeat number follows from a model used to study syllable
repetitions in Bengalese finches [51], the relationship was never tested in the study
for these birds. Statistics of syllable repeats from the study for 32 birds are freely available in the public domain. Using these data, we test if the inverse relationship
holds for Bengalese finches. It turns out that this is not the case. As seen in Fig.
4.5, the decrease in the most probable number of repeats with syllable duration is
much faster than c/T for Bengalese finches (α = −1.7).
4.4 Other calculations
4.4.1 Distribution of phrase duration and repeat number
The data presented in the above sections suggest a simple relation ⟨n⟩⟨T⟩ = d
between the means of the repeat number, syllable duration and phrase duration. In
this section we try to gain an understanding of the distributions of these variables
based on a few simple assumptions and the results of previous studies on these
quantities.
As mentioned earlier, the distribution of the repeat number for the syllables can
be modeled using the sigmoidal model. The probability of a syllable of duration
T occurring exactly n times is given by
P(n|T) = (1 − p(n)) ∏_{s=1}^{n−1} p(s) = [c / (1 + a exp(−νnT/τ))] ∏_{s=1}^{n−1} [1 − c / (1 + a exp(−νsT/τ))]
For a distribution of syllable durations (for a specific syllable) given by some
P(T), the distribution of the phrase duration d, defined as nT, is given by
P(d) = Σ_n ∫ dT P(T) P(n|T) δ(d − nT)    (4.5)
Note that the distribution P(T ) models the variability in the syllable durations
between different phrases and not within a single phrase. For the case of P(T )
being the normal distribution with mean ⟨T⟩ = μ and variance σ², the distribution
P (d) is given by
P(d) = [1/(σ√(2π))] [c / (1 + a exp(−νd/τ))] Σ_n exp(−(d/n − μ)² / (2σ²)) ∏_{i=1}^{n−1} [1 − c / (1 + a exp(−νid/(nτ)))]    (4.6)
This can be computed numerically to get the distribution P (d) of phrase du-
rations, and can be compared to observed phrase duration distributions if there is
enough data to construct these distributions.
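Eq. 4.6 can be evaluated directly by truncating the sum over n. A sketch with illustrative parameter values (normalization over d should be checked numerically after truncation):

```python
import math

def phrase_duration_density(d, a, c, nu, tau, mu, sigma, n_max=100):
    """Numerical evaluation of Eq. 4.6 for the phrase-duration density P(d)."""
    prefactor = (1.0 / (sigma * math.sqrt(2.0 * math.pi))) * \
                c / (1.0 + a * math.exp(-nu * d / tau))
    total = 0.0
    for n in range(1, n_max + 1):
        # syllable duration implied by n repeats in a phrase of duration d
        gauss = math.exp(-((d / n - mu) ** 2) / (2.0 * sigma ** 2))
        product = 1.0
        for i in range(1, n):
            product *= 1.0 - c / (1.0 + a * math.exp(-nu * i * d / (n * tau)))
        total += gauss * product
    return prefactor * total

# density over a grid of phrase durations (illustrative parameters)
ds = [0.1 * k for k in range(1, 60)]
vals = [phrase_duration_density(d, a=50.0, c=0.9, nu=1.0, tau=0.5,
                                mu=0.1, sigma=0.02) for d in ds]
```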
Figure: the repeat probability p_r(n) for a = 0.42 and b = 0.3, 1, 2, 5, illustrating stretched (0 < b < 1), regular (b = 1), and compressed (b > 1) exponential forms.
4.4.2 Exponential distributions lead to inverse relationship
We have so far considered the following sigmoidal form for the repeat probability p_r
of a syllable
p_r(n) = 1 − c / (1 + a exp(−νnT/τ))

However, this form was based on the model used in the study on adaptation of
syllable repeats in the Bengalese finch [51]. Let us consider the following form
p_r(t_n) = a exp(−t_n^b), or, in terms of the repeat number n with t_n = nT,

p_r(n) = a exp(−(nT)^b)
When n → 0, p_r(n) → a, and when n → ∞, p_r(n) → 0. We require 0 < a < 1. For different ranges of b, the function p_r(n) is as seen in the figure above. The
function for 0 < b < 1 is called a stretched exponential, b = 1 corresponds to a
regular exponential function, and b > 1 is called a compressed exponential function
(b = 2 giving a Gaussian form). b > 1 gives us sigmoids that we can use. More
parameters are required in the function since we want the lower repeat probability
to be non-zero, as well as the location of the fastest change in the sigmoid to be
tunable. This should be possible with the modified function below
Figure: the modified function p_r(n) for a = 0.42, c = 0.3, τ = 5, with b = 5 and b = 10.

p_r(n) = a exp(−(nT/τ)^b) + c

with 0 < a + c < 1; c represents the lower repeat probability (so 0 < a < 1 − c), and τ > 0 sets the location of the drop in the sigmoid.
Using this form, the probability of a syllable repeating exactly N times would
be
P(N) = (1 − p_r(N)) ∏_{i=1}^{N−1} p_r(i) = [1 − a exp(−(NT/τ)^b) − c] ∏_{i=1}^{N−1} [a exp(−(iT/τ)^b) + c]

Do we still see the inverse relationship between n_p and T?
p_r′(n) = −ab (T/τ)^b n^(b−1) exp(−(nT/τ)^b)

p_r″(n) = −ab (T/τ)^b [−b (T/τ)^b n^(2(b−1)) exp(−(nT/τ)^b) + (b − 1) n^(b−2) exp(−(nT/τ)^b)]
The fastest change corresponds to p_r″(n) = 0, leading to the condition

(T/τ)^b b n_p^(2(b−1)) = (b − 1) n_p^(b−2)

n_p^b = (1 − 1/b) (τ/T)^b

The inverse relation follows
n_p = (1 − 1/b)^(1/b) (τ/T)
In general, the probability of a syllable repeating could follow any exponential form of this family, and the inverse relationship would still hold. This
means that there could be potentially a range of feedback mechanisms influencing
syllable repetitions.
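The location of the fastest drop can be checked numerically against the derived expression n_p = (1 − 1/b)^(1/b) τ/T. A sketch with illustrative parameter values:

```python
import math

def dpr(n, a, b, T, tau):
    """First derivative of p_r(n) = a*exp(-(n*T/tau)**b)."""
    u = (n * T / tau) ** b
    return -a * b * (T / tau) ** b * n ** (b - 1) * math.exp(-u)

a, b, T, tau = 1.0, 3.0, 0.1, 0.5
# the fastest drop (most negative derivative) should sit at
# n_p = (1 - 1/b)**(1/b) * tau / T
predicted = (1.0 - 1.0 / b) ** (1.0 / b) * tau / T
grid = [k * 0.001 for k in range(1, 20001)]
numeric = min(grid, key=lambda n: dpr(n, a, b, T, tau))
assert abs(numeric - predicted) < 0.01
```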
4.5 Mechanisms of repeat generation
According to the inverse relationship, n_p T = c, where n_p is the most probable
number of repeats, T is the syllable duration and c is a constant. However, in Sec.
4.3.2 we showed that the most probable number of repeats and the mean number of
repeats are approximately equal for the sigmoidal model of adaptation, i.e., n_p = ⟨n⟩. Now, ⟨n⟩T is the average phrase duration d. The inverse relationship
therefore implies that there is a typical phrase length (a phrase is a repeated set
of syllables). In such a relationship, any two of these quantities, ⟨n⟩, T, and d, have
to be encoded by the generative processes. A pertinent question that can be asked
is - Which of these is actually being encoded neurally? The inverse relationship
between syllable duration and most probable number of repeats is independent
of any considerations about underlying physical mechanisms. However, we could
speculate about its physical origins.
4.5.1 Auditory feedback could regulate repetition
In the Bengalese finch, auditory feedback is known to influence song sequencing
and may provide the external drive that sustains repetition. Supporting this idea,
a recent study in these birds revealed stimulus-specific adaptation of auditory responses to repeated stimuli in a key sensorimotor forebrain region implicated in song control, as well as an immediate decrease in repetitions after deafening [51]. On the
other hand, earlier reports have suggested that canary phrase structure is largely
intact even for birds that never heard themselves sing. The mechanisms underlying
the long time constant of phrase persistence remain to be elucidated. In swamp
and song sparrows, deafened juveniles retained species-specific characteristics such
as phrase duration [84].
The Waterschlager canary is a songbird specifically bred for its song [11]. Hence
the song is very different from that of wild canaries. The Waterschlager strain also
has a genetic auditory defect. The bird is partially deaf, with sensitivity to sounds
above 2 kHz decreased by as much as 40 dB [11]. These are sounds produced
mainly by the right side of the syrinx. 90 percent of the syllables generated by
the canary in song are produced by the left side of the syrinx. These songs have a
distinct structure of strong repetitions of the same syllable before a new syllable
is sung. Perhaps the lateralization of singing in the canary brain can be used to
understand whether repetitions are a right-brained feature, and if so, why. Also, if these strains are deaf to syllables of frequencies higher than 2 kHz, yet repeat
distributions for these syllables are distinctly non-Markovian, i.e., not a monotoni-
cally decreasing distribution with the most probable number of repeats being one,
then it may be an indication that the mechanism of repeat generation is different
from that of Bengalese finches. We do not have information about whether the
birds whose songs we have analyzed were deaf exactly to the frequencies considered
standard in the literature. However, a clear indication of the possibility could help
motivate new experiments.
In early deafening experiments, canaries that were deafened as juveniles devel-
oped species-specific patterning of syllables into songs - the presence of tours and a
preference for durations of tours and silent intervals known to be statistically more
likely in normal canaries [85]. Further, in experiments aimed at identifying features
of the canary song that are possibly genetically programmed, it was observed that
even canaries reared in isolation develop normal phrase structure on sexual mat-
uration [86]. In the same study, canaries were able to learn and reproduce songs
with atypical features such as absence of syllable repetitions when tutored with
abnormal songs. However, with sexual maturation, these canaries sang songs with
the species-specific phrase structure even if abnormal learned syllables constituted
these phrases. Phrases have been stated as having a fundamental time-scale that
is independent of the syllable duration [87].
4.5.2 Constrained phrase duration
The observed relationship could be a trivial outcome of the fact that the total
duration of the repeated segment is constrained to be a constant independent
of syllable duration. This would mean that even with fluctuations about this
constant value, the average number of repeats has to be inversely proportional to
the syllable duration simply as a consequence of the constraint. This constraint
could be a physical limit - say the amount of air available in the airsacs of the
bird. In the song system, RA receives projections from LMAN which is a part
of the Anterior Forebrain Pathway (AFP). RA in turn projects to downstream
motor and respiratory neurons. LMAN is thought to be responsible for introducing
variability in juvenile song [88]. If syllable repetitions can be considered a temporal
variant of normal song, it is possible that repetitions are controlled by LMAN or
more generally the basal ganglia circuit (Area X → DLM → LMAN). There also
seems to be a strong link between repetitions and respiration. This can happen
through the LMAN → RA → respiratory nuclei pathway. The phrase duration
could also be genetically encoded or pre-programmed, perhaps a limit on some
neural process [85,86]. These are both mechanisms we speculate to be alternatives
to the regulation of repeats by auditory feedback alone.
Chapter 5

Semi-automated Classification of Song Syllables
The datasets used in the analyses in previous chapters were sets of vocal sequences
transcribed into symbol sequences. All sequences in these datasets were tran-
scribed manually to ensure accuracy. This involves converting an audio recording
into a spectrogram - a visual representation of frequencies in the recording, iden-
tifying distinct syllable types from the spectrogram, and labelling every syllable
by its type. However, this is a time-consuming endeavour and subject to human
error. The automation of song transcription procedures would be a highly wel-
come development in the field since it would enforce consistency and reliability in
transcriptions. However this is in fact an extremely hard problem. Most human
speech recognition systems rely on large databases containing multiple exemplars
of word utterances [89]. Also, the performance of these recognition systems can
be quantified relatively easily since we are dealing with speech forms that we as
humans can readily identify by ear. But with the vocal sequences of other animals,
we do not know what the syllables in the song should be.
Many studies have tackled the problem of syllable recognition and classifica-
tion in songbird vocalizations with varying degrees of success [6, 90–92]. They
have all been semi-automated procedures. No procedure is available to date
that eliminates the need for manual involvement completely. Currently, to our
knowledge, the most advanced and widely used free software that integrates song recording with syllable feature measurement and syllable classification is Sound Analysis Pro [93]. The feature-based classification algorithm that the software implements makes use of acoustic features such as pitch,
amplitude, frequency modulation, etc., for the categorization of syllables. However,
since we humans are good at visually identifying syllables by their spectrograms
(this is how manual transcription of songs is done), it is possible that incorporating
image-based features into classification methods might improve the performance
of syllable classifiers. This chapter details a syllable recognition and classification
system we developed by coupling selective acoustic and image-based features with
a standard supervised learning method used for classification - the Support Vector
Machine (SVM).
5.1 Morphology of a song
The song of a songbird is composed of acoustic units called syllables that are sung following some syntax. Syllables are separated from each other by silent intervals.
The silent intervals between syllables within a song are much smaller than the
interval between songs. In fact, we identify a new song based on the large silent
intervals. In Fig. 5.1 we see an example of the pressure wave from the recording
of a Bengalese finch song in the top panel. The corresponding spectrogram giving
the spectral or frequency content of the song is displayed below the waveform.
Distinct syllables are identified by differences in temporal and spectral content.
We utilize these differences in the classification method that we develop. However,
before syllable classification, we first need to distinguish syllables from silences in
the song.
5.2 Identification of song syllables
In order to isolate the vocal elements in a recording, it is necessary to devise
a method that filters out the noise and picks out reliable vocal data from the
pressure wave d(t). We do this by identifying the silent intervals in the song since
every vocalization is preceded and followed by a period of silence. Thresholding the
amplitudes of the pressure waves is a common approach of isolating vocal elements
in birdsong [94, 95]. The idea would be to set a threshold θ below which all data
is considered to be either silence or noise. However, this threshold must not be a
Figure 5.1. Waveform of a Bengalese finch song with the corresponding spectrogram (pressure in arbitrary units vs time in s; frequency in Hz).
constant for the entire song, but a function of time θ(t). This is because the mean
equilibrium value of the data could vary with time. This variation is ‘subtracted’
when the threshold is adjusted accordingly with time. The threshold function θ(t)
is determined based on the pressure amplitudes in consecutive time windows of
100 ms each. For example, if the data is sampled at a frequency of 40000 Hz for a
period of 30 s, there are 1.2 million data points.
However, for purposes of identifying the threshold function, it is enough to have
a representative set of pressure amplitudes. The representative set is obtained by
first finding the maximum amplitude | A(t) | in a single oscillation cycle of d(t).
In order to do this, a copy of d(t) , d′(t) is made. Every element in d′(t) is then
shifted by one. A product of d(t) and d′(t) now gives a set of positive and negative
values. Every time a negative value is encountered, it indicates that the pressure
wave has crossed the zero axis - i.e, we have located a node. Three consecutive
negative values would then tell us that a complete oscillation cycle is contained
between the first and third nodes.
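This node-detection step can be sketched as follows; we use a synthetic sine wave sampled at 40 kHz in place of real song data:

```python
import math

def cycle_maxima(d):
    """Locate nodes via sign changes of d[i]*d[i+1], then return the
    maximum |amplitude| within each full oscillation cycle
    (between alternate nodes)."""
    nodes = [i for i in range(len(d) - 1) if d[i] * d[i + 1] < 0]
    maxima = []
    # a full cycle spans from one node to the node after next
    for start, end in zip(nodes[:-2:2], nodes[2::2]):
        maxima.append(max(abs(x) for x in d[start:end + 1]))
    return maxima

# 40 kHz sampling of a 1 kHz sine (phase offset keeps zeros off the samples)
fs, f = 40000, 1000
wave = [math.sin(2 * math.pi * f * t / fs + 0.1) for t in range(400)]
A = cycle_maxima(wave)
assert all(abs(m - 1.0) < 0.01 for m in A)  # per-cycle maxima near 1
```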
The maximum value A(t) of the pressure amplitude in one oscillation cycle
is then determined to obtain the envelope of the waveform. We consider the
logarithm of A(t) to emphasize the difference between different amplitudes. Since
doing so magnifies extremely low noise levels, any noise level below 0.02(a.u) is
set to be 0.02(a.u). This is then smoothed by using the second order Savitzky-
Golay filter [96]. We then identify vocal elements by detecting continuous regions
in A(t) that are above a threshold function θ(t) defined for a moving window of
constant size (100 ms, 4000 time points). We also define a step size for the moving
window. One issue that was encountered is the loss of data towards the end of the
waveform if the number of data points for the last window happened to be less
than that needed to complete a full window. This loss is avoided by choosing a step size of 1 data point. For example, if we choose a window
of size equal to 4000 time points - the first window is indexed by data point 1
followed by 3999 points, the second by data point 2 flanked by data point 1 on
one side and 3998 points on the other and so on. This ensures that all the data
points are covered. The maximum Amax(t) and minimum Amin(t) of A(t) within
each window are then determined. The threshold is defined to be a fraction α of
the difference between Amax(t) and Amin(t) in a time window of size 100 ms plus
Amin(t). This fraction was chosen to be α = 0.4 by trial and error.
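One reading of this thresholding step, sketched in Python (toy log-envelope values rather than real data, and a shortened window for illustration):

```python
def adaptive_threshold(logA, window=4000, frac=0.4):
    """Threshold function theta(t): for the window starting at each
    position (step size 1), set theta = A_min + frac * (A_max - A_min)."""
    theta = []
    for i in range(len(logA)):
        w = logA[i:i + window]  # trailing windows are simply shorter
        lo, hi = min(w), max(w)
        theta.append(lo + frac * (hi - lo))
    return theta

# toy log-envelope: silence at -3, one "syllable" at 0
logA = [-3.0] * 50 + [0.0] * 20 + [-3.0] * 50
theta = adaptive_threshold(logA, window=60)
voiced = [x > t for x, t in zip(logA, theta)]
assert any(voiced) and not all(voiced)  # syllable detected, silence rejected
```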
The waveform of an isolated syllable was transformed into a spectrogram s(f, t),
which is the energy density at frequency f and time t using the multitaper method.
Distinct syllables obtained using our method for the song of a canary are shown
in Fig. 5.2. Once the syllables in the song have been isolated, they have to be
classified into identified types by pattern recognition.
5.3 Semi-automated classification of song syllables
Pattern recognition is a standard problem in machine learning [97]. In general
any algorithm for pattern recognition is based on identifying regularities in the
dataset. Such algorithms can fall under one of two broad categories - unsupervised
or supervised learning methods. In an unsupervised learning method, inferences
are drawn from the data without the aid of pre-assigned labels to a representative
subset of the data. Such methods include clustering, k-means, mixture models,
hierarchical clustering, among others. Unsupervised methods have been previ-
ously used to classify syllables in birdsong [6]. By contrast, in supervised learning
Figure 5.2. Spectrograms of different syllable types in the vocal repertoire of a canary (1 bird). Duration and frequency information are not shown to scale. Syllables can be distinguished by visual inspection of the spectrograms.
methods, inferences are drawn from the data based on labelled training data. We
focus on a supervised learning method, the Support Vector Machine, for syllable
classification. The training data for our SVM are 20 exemplars each of all distinct
syllable types identified from the recordings of a songbird.
5.3.1 Support Vector Machines
Support Vector Machine (SVM) is a machine learning algorithm that can be used
to classify data points in R^n into a predetermined discrete set of classes
Y . The algorithm uses a training set containing pairs (xi, yi) - assigning the data
points x_i to the classes y_i - to learn the classifier, and is thus a supervised learning
technique [97, 98]. The learned classifier can be then used to classify new data
Figure 5.3. Linear boundary surface in an SVM that separates data labelled by {+1, −1}. The separating hyperplane is w^T x + b = 0, with a margin on either side.
points x. This is similar to regression in that SVM uses the training data (xi, yi)
to find a fit function f (x, α) (α being the fitting parameters) such that f minimizes
some empirical loss function evaluated on the training data. SVM however differs
from regression in that the fit function produces discrete outputs.
The simplest of problems that can be tackled with the SVM is the task of
finding a linear boundary surface separating data labelled by {+1,−1} similar to
what is shown in Fig. 5.3. The function f in this situation has the form
f(x | α = (w, b)) = sign[w^T x + b]

and the optimization algorithm can be used to pick the parameters such that some
notion of loss is minimized. Simple and intuitive choices of loss functions could be
the number of misclassified data points, or (negative of) the margin between the
boundary and the data points, or a weighted combination of the two. SVM can
be used also in cases where the separating surface has a non-linear form. Rather
than fitting a non-linear boundary surface, non-linear SVM takes the approach of appending additional dimensions to the data, i.e., placing the original n-dimensional
Figure 5.4. Mapping data that is not separable by a linear boundary into a higher-dimensional feature space leads to a linear boundary in the higher-dimensional feature space. When this is mapped back into the input space as in the last panel, we see that the boundary is a curve - a circle in this case - that separates the data.
data points into a larger, n+ d dimensional space through
x = (x_1, x_2, x_3, ..., x_n) → x′ = (x_1, x_2, x_3, ..., x_n, π_1(x), π_2(x), ..., π_d(x)).
After a suitable choice of the kernel functions πi, the new points x′ can be separated
using a hyperplane via the techniques of linear SVMs. Fig. 5.4 illustrates the non-linear SVM used to separate the data points using a kernel function π_3(x) = x_1² + x_2².
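The kernel-mapping idea of Fig. 5.4 can be sketched without any SVM library: lifting 2-D points by π(x) = x_1² + x_2² makes radially separated classes separable by a plane (the threshold of 2.5 below is a hypothetical choice for this toy data):

```python
import math
import random

random.seed(0)
# inner cluster (class +1) and outer ring (class -1): not linearly
# separable in the plane
points, labels = [], []
for _ in range(100):
    r, phi = random.uniform(0.0, 1.0), random.uniform(0.0, 2.0 * math.pi)
    points.append((r * math.cos(phi), r * math.sin(phi)))
    labels.append(+1)
for _ in range(100):
    r, phi = random.uniform(2.0, 3.0), random.uniform(0.0, 2.0 * math.pi)
    points.append((r * math.cos(phi), r * math.sin(phi)))
    labels.append(-1)

# kernel feature pi(x) = x1^2 + x2^2 appended as a third coordinate
z = [x * x + y * y for x, y in points]
# the plane z = 2.5 separates the lifted data, i.e. a circle of radius
# sqrt(2.5) in the original input space
predictions = [+1 if v < 2.5 else -1 for v in z]
assert predictions == labels
```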
5.3.2 Syllable features for classification
When syllables are considered as training data points xi for an SVM, they are
represented as points in a multidimensional feature space. Each dimension of the
feature space is a distinguishing characteristic of the syllable type. The choice
of features is crucial to the performance of the classifier. We used a minimal
feature set - three acoustic-based and two image-based features, for a reasonable
classification performance on Bengalese finch and canary syllables.
5.3.2.1 Duration
Different syllable types have fairly distinct durations. Hence the first feature of
the syllable we use for the SVM is syllable duration. Although multiple syllables
may have roughly the same duration, the combination of this feature with others
ensures that this is not an issue. The durations (Mean ± Standard Deviation) of
seven types of syllables for a Bengalese finch along with the coefficient of variation
(Standard Deviation/Mean) are shown in Table 5.1.

Type | Duration (ms) | Coefficient of Variation
1 | 40.3 ± 5.1 | 0.1268
2 | 87.4 ± 5.9 | 0.0674
3 | 50.7 ± 6.7 | 0.1314
4 | 53.7 ± 6.0 | 0.1114
5 | 63.6 ± 10.7 | 0.1677
6 | 50.2 ± 8.8 | 0.1757
7 | 50.0 ± 7.7 | 0.1546
Table 5.1. Duration of syllables in a Bengalese finch song.
5.3.2.2 Wiener entropy
The Wiener entropy of an audio signal is a measure that quantifies the deviation of the signal
from white noise. White noise is characterized by uniform power in all frequency
bands. In contrast, the sound produced by resonant structures or animal vocaliza-
tions contains multiple harmonics of a frequency. Wiener entropy is defined as the
ratio of the geometric mean to arithmetic mean of a power-spectrum, and is often
expressed in logarithmic scale:
W = log₁₀ [ (∏_{i=1}^{N} s(ω_i))^{1/N} / ( (1/N) Σ_{i=1}^{N} s(ω_i) ) ] = log₁₀ [ exp(⟨ln s⟩) / ⟨s⟩ ]    (5.1)
where ωi labels the ith frequency bin and s(ωi) is the total power spectrum inside
this bin. Within this definition, white noise has the highest Wiener entropy (0)
and a pure note, such as a sinusoidal wave has the lowest Wiener entropy (−∞).
All other signals have intermediate values. A spectrogram gives us both temporal
and spectral information for a syllable. Hence we can calculate the Wiener entropy
along both dimensions, since the variation in the power spectrum is unique along
each dimension. This gives us the two other acoustic features used with the SVM.
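A sketch of the Wiener entropy computation of Eq. 5.1 on toy spectra (the bin values are illustrative):

```python
import math

def wiener_entropy(spectrum):
    """log10 of (geometric mean / arithmetic mean) of a power spectrum."""
    log_gm = sum(math.log(s) for s in spectrum) / len(spectrum)
    am = sum(spectrum) / len(spectrum)
    return math.log10(math.exp(log_gm) / am)

flat = [1.0] * 64             # white-noise-like: uniform power
tonal = [1e-6] * 63 + [1.0]   # power concentrated in one bin
print(wiener_entropy(flat))   # 0.0
print(wiener_entropy(tonal))  # strongly negative
```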
5.3.2.3 Hough transform
We now look for image-based features. When we visually inspect spectrograms
of syllables, we identify matches of syllables roughly based on the orientation and
extent of ‘lines’ on the spectrogram. Hough transform is a feature extraction tool
Figure 5.5. Summary of the Hough transform. Any edge in an image, which is simply a set of collinear points, is represented as a set of intersecting sinusoidal curves as the result of a Hough transform.
that can be used to identify structures in an image that can be approximated by line
segments (and geometric curves in general). The simplest Hough transform allows
extraction of linear segments in an image, through combining radon transform and
binning. The transform takes in a two dimensional array representing an image A
and maps it to another two dimensional array representing an new image B. The
map satisfies the following qualitative features:
1. Isolated points in A map to sinusoidal curves in the output image B.

2. Every linear segment in A maps to a high-intensity point in the image B, with the intensity of the point reflecting the number of pixels in the segment.

3. Almost collinear sets of pixels in image A map to points of larger extent in the image B.
Figure 5.6. Separation of syllables in feature space based on duration and the two Hough transform coordinates ρ and θ.
The transform maps a point in A represented by the Cartesian coordinates (x, y)
to a sinusoidal curve in B:
Radon [(x, y)]→ {(θ, x cos θ + y sin θ) such that θ ∈ (0, π)} (5.2)
The transform of an image A with multiple points is an image B with all the
sinusoidal curves piled together. The intensity of a pixel in the final image B is
proportional to the number of sinusoidal curves passing through it. This is summarized
in Fig. 5.5. In particular, it can be shown from simple geometrical arguments that
all the sinusoidal curves in B arising from the transform of a set of collinear points
in A intersect at a single point, reinforcing the intensity of the pixel represented
by that point. This point is represented by two coordinates - ρ and θ.
As can be seen in Fig. 5.2, the spectrograms of individual syllables
contain prominent linear structure. In the space of the Hough-transformed
variables, this appears as a high-intensity point (ρ, θ), the location of which can be
identified by simple peak detection. This process effectively assigns to each syllable
a pair of numbers (ρ, θ) which are the last two features used with the SVM.
Fig. 5.6 shows an example of the separation of six Bengalese finch syllable types
based on three of the five features.
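The extraction of the (ρ, θ) pair can be sketched with a minimal Hough accumulator. This is an illustrative Python reimplementation, not the dissertation's code; a real spectrogram would first be thresholded into a binary edge image:

```python
import numpy as np

def hough_peak(image, n_theta=180):
    """Return the dominant (rho, theta) of a binary image: each bright
    pixel votes along the sinusoid rho = x cos(theta) + y sin(theta),
    and the accumulator cell with the most votes is the peak."""
    ys, xs = np.nonzero(image)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    diag = int(np.ceil(np.hypot(*image.shape)))
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
    for x, y in zip(xs, ys):
        # One sinusoid of votes per bright pixel.
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1
    r, t = np.unravel_index(np.argmax(acc), acc.shape)
    return r - diag, thetas[t]

# A vertical line at x = 10 should give theta = 0 and rho = 10.
img = np.zeros((32, 32), dtype=bool)
img[:, 10] = True
rho, theta = hough_peak(img)
```

The peak detection here is a plain argmax; any standard peak detector over the accumulator serves the same purpose.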
5.3.3 SVM ensembles
Since we are dealing with a multi-class classification problem, we construct a set
of binary classifiers which in combination act as a multi-class classifier. However,
often the support vectors obtained from the learning are not sufficient to classify
all test samples accurately. To improve performance, we use the strategy of using
an ensemble of classifiers [99] in place of each binary classifier. An ensemble of clas-
sifiers is a collection of several classifiers whose individual decisions are combined
(using, for example, a simple voting strategy) to classify the test samples. Several
methods of creating these ensembles have been proposed in the literature, all
aiming to make each SVM as different from the other SVMs as possible by creating
a different training set for each of them.
Bootstrapping is one such method, in which k replicate training sets are
constructed from the original training set by randomly resampling with replacement.
A training syllable may therefore appear more than once, or never, in a given
replicate training set.
Every replicate training set trains a different SVM all aimed to perform the same
binary classification task.
In summary, for classification into one of n classes, there are n binary
classifiers. Each of the n classifiers in turn consists of an ensemble of binary
classifiers of size k. A majority voting scheme is then employed, whereby a syllable is
classified as class i if the majority of the classifiers in the ensemble classify the
syllable as class i.
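The bootstrap-and-vote logic can be sketched as follows. For clarity, a trivial one-dimensional threshold classifier stands in for the SVM; the stump, the feature values, and all names here are illustrative, not the dissertation's implementation:

```python
import random
from collections import Counter

class ThresholdStump:
    """Stand-in for a binary SVM: thresholds a 1-D feature at the
    midpoint of the two class means."""
    def fit(self, xs, ys):
        m0 = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, ys.count(0))
        m1 = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, ys.count(1))
        self.thresh = (m0 + m1) / 2.0
        self.above = 1 if m1 > m0 else 0
        return self
    def predict(self, x):
        return self.above if x > self.thresh else 1 - self.above

def bagged_ensemble(xs, ys, k=15, seed=0):
    """k classifiers, each trained on a bootstrap resample
    (drawn with replacement) of the training set."""
    rng = random.Random(seed)
    n = len(xs)
    ensemble = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(ThresholdStump().fit([xs[i] for i in idx],
                                             [ys[i] for i in idx]))
    return ensemble

def vote(ensemble, x):
    """Majority vote over the ensemble's individual decisions."""
    return Counter(clf.predict(x) for clf in ensemble).most_common(1)[0][0]

# Two well-separated clusters of a syllable feature (arbitrary units).
xs = [10, 11, 12, 13, 30, 31, 32, 33]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
ens = bagged_ensemble(xs, ys)
```

Replacing the stump with an SVM trained on the five-dimensional feature vectors recovers the scheme described above.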
5.3.4 Transcription of a song
In a given recording, song sequences were identified by examining the inter-syllable
duration between syllables sn and sn+1. If this duration ∆τ = tn+1−tn was greater
than 200 ms, a given song sequence was assumed to have ended with syllable sn and
a new one begun with syllable sn+1. Syllables were identified and isolated as voiced
elements between silent intervals (see top panel of Fig. 5.7). From the complete
set of syllables, 20 training exemplars were identified for each unique syllable type
by visual inspection of their spectrograms. The five classification features were
extracted for these syllables and the hyperplanes that separated distinct syllable
types in feature space were determined using the SVM. All other syllables were
then classified based on their position with respect to these hyperplanes. The
Figure 5.7. Transcription of a Bengalese finch song. In the top panel, a song sequence begins at the magenta line and ends at the cyan line.
classification performance was higher than 85% for all syllables for both Bengalese
finch and canary songs (an example of syllable labelling after SVM classification
is shown in the lower panel of Fig. 5.7), with performance as high as 98% for some
syllables (mostly clean whistles). We constructed a GUI to manually inspect the
classification performance and easily reassign misclassified or unclassified syllables.
Thus although we were unable to eliminate the need for manual involvement, we
were able to construct a good classifier based on a minimal number of syllable
features.
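The 200 ms segmentation rule can be sketched as follows; here the inter-syllable duration is taken as the silent gap between one syllable's offset and the next syllable's onset, and the times and names are illustrative:

```python
def split_into_songs(onsets, offsets, max_gap=0.2):
    """Group syllable indices into song sequences: a silent gap
    (next onset minus current offset) longer than max_gap (200 ms)
    ends one song and begins the next. Times are in seconds."""
    songs, current = [], [0]
    for n in range(1, len(onsets)):
        if onsets[n] - offsets[n - 1] > max_gap:
            songs.append(current)
            current = []
        current.append(n)
    songs.append(current)
    return songs

# Six syllables; a 0.6 s silent gap separates two songs.
onsets  = [0.00, 0.15, 0.30, 1.00, 1.15, 1.30]
offsets = [0.10, 0.25, 0.40, 1.10, 1.25, 1.40]
```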
Chapter 6

Conclusion
Vocal sequences are observable expressions of complex neural processes occurring
in the brain. The study of vocal sequences and the patterns contained in them can
shed light on neural computations. The relative simplicity of sequence structure
and the possibility of laboratory collection of data have made songbird songs an
ideal candidate for such studies. Even so, there are several challenges, especially in
developing methods for efficient and reliable syntax inference using a finite amount
of data.
6.1 Partially Observable Markov Model - inference
and evaluation
Earlier studies have shown that the syntax of Bengalese finch songs can be effi-
ciently represented by a class of probabilistic finite state systems called Partially
Observable Markov Models (POMM). In the first part of the dissertation we iden-
tified and developed various methods to infer the POMM from a set of symbol
sequences. Although they were devoted to modeling the syntax of vocal sequences
in this dissertation, the methods are very general and can easily be applied to other
classes of problems where sequential structures can be mapped into POMMs. We
discussed several different metrics to evaluate the fit of the model and showed that
finiteness of available data places upper bounds on the values of the metrics. Reli-
able inference of the model syntax from finite data involves optimization of model
parameters without over-fitting. We developed a scheme to perform such controlled
inferences, specifically utilizing a metric called sequence completeness.
Discussion
The combinatorial challenge of determining the optimal number of states in a
POMM by the grid-search algorithm we developed for model inference grows
with the number of syllables in an animal's repertoire. The current
method of finding the number of states associated with each syllable is computa-
tionally efficient for a bird with 6-10 syllables in its repertoire such as the Bengalese
finch, but becomes highly inefficient for birds like the nightingale and the warbler
that sing more than 100 types of syllables. There is much scope for the design
of more efficient computational paradigms for the inference methods described in this
dissertation.
Also, our analyses assumed that we knew all the syllables in the animal's vocal
repertoire before we constructed syntax models. However, if we want to
consider problems such as real-time computation of syntax, where the syntax is
dynamically inferred during vocalizations, we need to allow for the possibility of
incorporating unseen observations in our models. This goes back historically to a
question in inductive inference - what to do when the utterly unexpected occurs,
an outcome for which no slot has been provided in the support of your distribution.
This is not the problem of observing an impossible event, that is, an event whose
existence has been considered but whose probability is taken to be zero.
Rather, this problem arises when we observe an event whose existence we did not
previously suspect. In earlier literature, this is called the sampling of species
problem [100]. In one of the many side-projects that came out of this dissertation,
we considered the use of non-parametric Bayesian inference techniques such as the
Hierarchical Dirichlet Process [101] and the infinite Hidden Markov Model [102,103]
to come up with ways of incorporating the encounter of a new syllable or symbol in
a discrete sequence. However, such extensions were not necessary for the questions
we tried to answer in this dissertation and were therefore not pursued actively.
But we would like to draw attention to these possibilities.
6.2 Comparison of syntactic structures
We then used the inference methods that were developed in the first part of the
dissertation to construct the syntax of the non-repeat structure of Bengalese finch
songs before and after deafening, to understand the role of auditory feedback in the
regulation of syntax. Before deafening, the syntax is a POMM for five out of the
six birds, with the sixth bird having a Markovian syntax. Significantly, the
many-to-one mappings that existed between states in the POMM and syllables of
the song were absent in the song syntax of all the birds except one after deafening -
i.e., the syntax became more Markovian. Many new syllable transitions that did not exist
that auditory feedback has a role in maintaining the song syntax for an individual
Bengalese finch and more specifically in influencing the many-to-one mappings in
the syntax of Bengalese finches. However, the fact that the changes in syntax were
not seen in all the birds studied suggests that auditory feedback most likely acts
in combination with other factors such as the topographical connectivity patterns
in the song-generating neural circuitry. It is necessary to confirm these results in
studies involving a larger number of birds and different species of birds. It would
be particularly interesting to consider a bird such as the canary that is considered
to not rely heavily on auditory feedback.
Discussion
The procedure used for the comparison of syntax structures in Bengalese finches
pre- and post-deafening is not limited to birdsong. Our analysis suggests that
metrics such as sequence completeness are tools that can be used in the study
of more general questions - Are some sets of sequences more Markovian or more
POMM-like than others? How different are the syntax models of different species
from each other? Considering grammars in the Chomsky hierarchy, finite-state
grammars can generate strings generated by higher grammars as part of their
language - e.g., a Markov model based on transitions between two syllables A and
B can generate the palindromic string ABBA, the context-free string AAABBB,
the copy string ABAB. However, a finite-state grammar cannot generate only these
strings [104]. Some interesting questions to ask are - Can we distinguish a Markov
model from a POMM from a higher grammar based on the observed sequences?
If so, how many sequences should we observe, or what is the minimum required
sample size, before we can state with some confidence that the sequences were
generated by one grammar or another?
Also, syntax models such as the POMM are constructed based on the analysis
of local transition rules between the states or syllables. Long vocal sequences such
as humpback whale songs are referred to in the literature as possessing a hierarchi-
cal structure [65,69] with notes arranged into units (syllables), units into phrases,
phrases into themes, and themes into songs. The reference to hierarchy also seems
to imply a parallel organization of the structures that generate them, including
associated time scales. However, the appearance of hierarchy may be a result of
human perception. It is possible that local rules of transition between syllables
would suffice to describe the syntax, with hierarchical structures merely appearing to emerge.
For example, we may tend to split the simple sequence ABCABCDBCABC... into
chunks ABC, ABC, DBC, ABC, although it could be generated by a simple Markov
model in which the transitions A → B, B → C, and D → B occur with probability 1,
C → A occurs with probability 2/3, and C → D occurs with probability 1/3. Ascertaining
whether the syntax of humpback whale song can be represented by a finite
state model or not would be an important endeavor. However, as we discussed in
the section on humpback whale songs in Chapter 2, the methods we developed are
not suitable for dealing with long and continuous vocalizations since all bounds
are defined based on the availability of multiple independent sequences from the
same individual. Extensions of our methods for a long continuous sequence could
be interesting future research.
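The chunking illusion can be demonstrated by sampling from the local-rule model above; a small illustrative script using the states and probabilities given in the text:

```python
import random

def sample_sequence(length, seed=1):
    """Sample from the simple Markov model in the text: A->B, B->C,
    and D->B occur with probability 1; from C, go to A with
    probability 2/3 and to D with probability 1/3."""
    rng = random.Random(seed)
    seq, state = [], "A"
    for _ in range(length):
        seq.append(state)
        if state == "A":
            state = "B"
        elif state in ("B", "D"):
            state = "C" if state == "B" else "B"
        else:  # state == "C"
            state = "A" if rng.random() < 2 / 3 else "D"
    return "".join(seq)

s = sample_sequence(60)
# A human reader tends to parse s into chunks like ABC and DBC,
# though only local syllable-to-syllable rules were used.
```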
6.3 Statistics of syllable repetitions
Finally, we considered the repeat structure of syllables in song sequences. We
demonstrated that a precise inverse relationship exists between syllable duration
and the most probable number of repetitions of the syllable for canaries and swamp
sparrows - two species with songs composed mostly of syllable repetitions or trills.
However, this relationship does not hold for the songs of a Bengalese finch. This
contrast in behavior can be a window into differences in mechanisms of song gener-
ation in different songbirds. The inverse relationship also implies that the duration
of a set of repeats, or a phrase, is a constant on average. We also raise the question
of which variables involved in the inverse relationship are encoded neurally for ca-
naries and swamp sparrows - syllable duration, phrase duration or the probability
of the number of syllable repeats. This is an important question, since in some species it
has been assumed that phrase duration and syllable duration are independently
encoded. For example, in canaries it is suggested that the phrase duration is ge-
netically encoded [76]. However given the inverse relationship, it is possible that
it is simply a result of a constraint on the syllable duration and the number of
syllable repetitions.
Discussion
It is plausible that the exact inverse relationship specifies an upper bound on the
performance of syllable repetitions. Repetition-generation mechanisms of some
songbirds like the Bengalese finch may operate in regimes well below the perfor-
mance limit, while others operate exactly at the limit. There is evidence that
swamp sparrows tutored with trills that are artificially accelerated are unable to
keep up with the pace and end up singing trills with a ‘broken syntax’ [105] in
which discrete chunks are missing from what should be a continuous trill. There
exist neurons in the HVC that project to Area X, and are known to exhibit activity
that is precisely time-locked to every repetition of a syllable [106]. In awake singing
swamp sparrows it has been shown that at accelerated trill rates, these neurons fail
to respond to consecutive syllables [107]. Similar experiments in different species
could help test the hypothesis that different species operate at different distances
from the performance limit, with the distance given by how fast the most probable
number of repeats scales with syllable duration.
Appendix A

Baum-Welch Algorithm for estimation of POMM parameters
The algorithmic problem of inferring a Hidden Markov Model from data is de-
scribed in Rabiner & Juang [55]. For observed syllable sequences {y} = {y1, y2, . . .}, with each observation taking one of m values, a Partially Observable Markov Model
(POMM) has to be constructed in which each syllable yi is generated by a hidden
state Si such that corresponding to the syllable sequence y there is a state sequence
S. Assume that there are N kinds of states {h1, h2, . . . , hN}. The number of states
Figure A.1. Calculation of forward probabilities in the trellis of the observation sequence y1, y2, y2 for a 4-state POMM. The thick arrows indicate the most probable transitions. As an example, the transition between state s1 at time t = 2 and state s4 at time t = 3 has probability α2(1) T14 E4(y2), where αt(i) is the probability of being in state si at time t.
N and the state-syllable type assignments are initially fixed. Each state is assigned
a single syllable type, but the same syllable type may be assigned to several states.
We consider a single ‘silent’ state (the first state in the system) to represent both
the start and the end state, emitting a single symbol that signifies either the start
or end of a syllable sequence. All the available observation sequences are concate-
nated into one long sequence with the symbol indicating a silent state marking
the beginning and end of each individual sequence. The elements of the model are
the distribution Γ for the start state, the state transition probabilities T , and the
state-syllable emissions E. For the computational implementation, Γ is a vector
of length N, T is an N ×N matrix and E is an N ×m matrix. The first state in
the state sequence that leads to the observed concatenated syllable sequence will
always be the start state. Hence all the elements of the vector Γ are 0 except for
the first element (corresponding to the silent state), which is 1. Γ is a constant
in our implementation of the POMM. Initially, a random state transition matrix
T is chosen. For the POMM, the emission matrix E is constructed based on the
state-syllable assignments and is fixed in our implementation of the POMM.
Eij = 1 if state i generates syllable j, and 0 otherwise.
Using the emission matrix, an ‘observation matrix’ O of size N × N is
constructed for each syllable type - e.g., in a four-state system, if syllable y1 is of type
1 generated by states 2 and 3, then the observation matrix for type 1 is
O1 =
    [ 0 0 0 0 ]
    [ 0 1 0 0 ]
    [ 0 0 1 0 ]
    [ 0 0 0 0 ]
In other words, Oi is a diagonal matrix whose diagonal is given by the ith column of
E. The construction of the observation matrix simplifies the implementation of
the forward-backward algorithm which is the first of two parts of the Baum-Welch
algorithm.
If the concatenated sequence is of length τ , the objective is to obtain the state
transition probabilities that are most likely to give rise to y. In order to do so, we
are interested in calculating the probability γt (h) defined as follows.
γt (h) gives the probability that the state St occurring at step 0 ≤ t ≤ τ is h,
given the entire observed sequence y1,...,τ .
γt (h) = P (St = h | y1,...,τ , T ) = P (St = h, y1,...,τ | T ) / P (y1,...,τ | T ) = P (A ∩ B) / P (B)

where A is the event of state h occurring at step t, and B is the event of observing
the entire syllable sequence. It turns out that this can be written as

γt (h) = αt (h) βt (h) / ∏t ct
where α and β are
forward probabilities αt (h) = P (St = h and y1,...,t)
backward probabilities βt (h) = P (yt+1,...,τ | St = h)
αt (h) gives the probability that the syllables y1→t occur and that the state at step t is h.
βt (h) gives the probability that the syllables yt+1→τ occur given that the state
at step t was h. ct is a normalization factor: the probability that syllable yt occurs,
given the preceding syllables, under the model.
A.1 Estimating forward probabilities
The forward variable α is calculated as follows: The probability of starting a new
sequence and making a transition that emits the first observed syllable is given by
the vector
α1:2 ∼ Γ T O2

Step 1 corresponds to the start state and the start symbol, so α1 = Γ. Also, since α1:2
is the probability of occurrence of possible states, the sum of all the elements of
the vector should give one. Hence the vector has to be normalized. Define
c2 = ∑_{h = all states} α1:2 (h)
to be the normalization factor such that
α2 = α1:2 = c2^(-1) Γ T O2
Similarly,

α1:3 = c3^(-1) α1:2 T O3
...
α1:τ = cτ^(-1) α1:τ−1 T Oτ
∏_{i=1}^{τ} ci can then be interpreted as the probability or likelihood of obtaining the
observed sequence, summed over all possible final states, i.e.,

P (y1, y2, . . . , yτ | T ) = ∏_{i=1}^{τ} ci
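The scaled forward recursion of this section can be sketched in a few lines of Python/numpy. This is an illustrative implementation with a toy three-state model, not the code used for the analyses; for simplicity the start/end symbol is omitted and state 0 is taken as the silent start state:

```python
import numpy as np

def forward_log_likelihood(T, E, obs):
    """Scaled forward pass: alpha_t ~ alpha_{t-1} T O_{y_t}, where
    O_y = diag(E[:, y]); the sum of each unnormalized alpha is the
    scaling factor c_t, and sum(log c_t) = log P(obs | T)."""
    N = T.shape[0]
    alpha = np.zeros(N)
    alpha[0] = 1.0                             # Gamma: start in state 0
    log_like = 0.0
    for y in obs:
        alpha = alpha @ T @ np.diag(E[:, y])   # propagate and emit
        c = alpha.sum()                        # scaling factor c_t
        alpha /= c
        log_like += np.log(c)
    return log_like

# Toy model: state 0 is silent; states 1 and 2 emit symbols 0 and 1.
E = np.array([[0., 0.], [1., 0.], [0., 1.]])
T = np.array([[0., .5, .5], [0., 0., 1.], [1., 0., 0.]])
```

For instance, `forward_log_likelihood(T, E, [0])` evaluates to log 0.5 under this toy model, since the start state reaches the symbol-0-emitting state with probability 1/2.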
A.2 Estimating backward probabilities
The backward variable β can be calculated similarly starting with the last observed
syllable, giving βτ:τ = 1,

βτ−1:τ = cτ^(-1) T Oτ βτ:τ
...
β2:τ = c3^(-1) T O3 β3:τ
The β’s are scaled using the same factors c as the α’s. The two can be combined
to obtain the probability distribution γt of being in each state at every point t in
the sequence
γt (h) = α1:t (h) βt+1:τ (h) = αt (h) βt (h) / ∏t ct
A.3 New transition matrix T
Let us take K possible sequences of states that can generate the given syllable
sequence. At a particular step t, γt (h) gives the probability of finding state h. So
we expect that γt (h) × K of the K state sequences have h in position t. So the total number
of times state h occurs, at any location, is expected to be
K × ∑_{t=1}^{τ} γt (h)
Now, consider the number of times a state hp at step t is followed by hq in step
t + 1. Probability that syllables y1→t occur and the state at t is hp is given by
αt (hp). Probability that syllables y1→t occur and the state at t is hp and the state
at t + 1 is hq is given by αt (hp) × Tpq. Probability that syllables y1→t occur and
the state at t is hp and the state at t+1 is hq and that the syllable yt+1 is observed
is given by αt (hp)× Tpq × Eq(yt+1). Probability that syllables y1→t occur and the
state at t is hp and the state at t+1 is hq and that the syllable yt+1 is observed and
the remaining syllables yt+2→T occur is given by αt (hp)×Tpq×Eq(yt+1)×βt+1 (hq).
The probability of the above given the observed sequence can be obtained by
dividing this by the probability of the observed sequence,
ζt (p → q) = αt (hp) × Tpq × Eq(yt+1) × βt+1 (hq) / ∏_{t=1}^{τ} ct
So the number of times a state hp at step t is followed by hq in step t + 1 is
ζt (p→ q)K. Hence the number of times a state hp at any step is followed by hq
in the next step is
K × ∑_{t=1}^{τ−1} ζt (p → q)
Combining the above two results, transition probabilities can be estimated to be
Tpq = ∑_{t=1}^{τ−1} ζt (p → q) / ∑_{t=1}^{τ−1} γt (hp)
The equation above is used to update the transition probabilities in a given
iteration. To decide whether to move on to successive iterations, the log of the
likelihood ∏_{i=1}^{τ} ci is calculated for each iteration, and the iterations are
continued until the difference between the log-likelihoods in iteration j and
iteration j + 1 falls below a set tolerance limit. A tolerance limit of 1 × 10^(-3)
is typically used in all cases of model inference in this dissertation.
Appendix B

Confidence Intervals for Entropy of Sequences
The entropy of a distribution ideally should not depend on the number of sample
sequences available. Hence to find the error bounds on the log-likelihood, we should
first find the error bounds on the entropy, and then multiply by the total number
of observed sequences.
B.1 Subsampling
The basic assumption is that if we know the subsampled means and standard
deviation obtained from a data sample (set of song sequences in our case), we
can infer the true means and standard deviation. Why is there a true standard
deviation at all? If we had an infinite set of sequences, the standard deviation
would be zero. But we should account for the fact that we only have finite data.
The procedure to obtain the true mean and standard deviation and the subsampled
mean and standard deviation is as follows. We first pick any POMM and treat
this as our true source or generator. In the analysis below we consider a 10-state
POMM built on N = 850 sequences. We then create 1000 sets of N sequences
generated from each model, calculate the entropy for each set, and find the mean
and standard deviation of the distribution so obtained. This is the true mean
and standard deviation. We then consider a single set of N sequences as our
sample data. We consider a fraction α of these sequences, pick α × N sequences
Figure B.1. Scaling of sample and true means and standard deviations: the ratios (sample mean / true mean) and (sample sd / true sd) plotted against the fraction α of sequences subsampled.
by subsampling without replacement (1000 times), and calculate the sample mean
and sample standard deviation of the entropy distribution obtained. We can now
compare these to the true values.
Fig. B.1 shows the ratio of the sample mean and standard deviation to the true
mean and standard deviation for different fractions α of sequences picked. We can
use these curves to obtain error bounds on the maximum log-likelihood attainable
by a model inferred from N sequences.
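The subsampling procedure can be sketched as follows, using the entropy of the empirical distribution over whole sequences as a stand-in for the entropy estimator used in the analysis; the sequences and numbers here are illustrative, not from the data:

```python
import math
import random
from collections import Counter

def empirical_entropy(seqs):
    """Shannon entropy (bits) of the empirical distribution over sequences."""
    counts = Counter(seqs)
    n = len(seqs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def subsample_entropy(seqs, frac, n_rep=1000, seed=0):
    """Mean and standard deviation of the entropy computed over n_rep
    subsamples (without replacement) of a fraction frac of the sequences."""
    rng = random.Random(seed)
    m = max(1, int(frac * len(seqs)))
    vals = [empirical_entropy(rng.sample(seqs, m)) for _ in range(n_rep)]
    mean = sum(vals) / n_rep
    sd = (sum((v - mean) ** 2 for v in vals) / n_rep) ** 0.5
    return mean, sd

# Two equally likely song types: the full-sample entropy is 1 bit.
seqs = ["ABC"] * 400 + ["ABD"] * 400
mean, sd = subsample_entropy(seqs, 0.5)
```

Repeating this over a range of fractions produces the curves of Fig. B.1, from which the true mean and standard deviation can be inferred.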
Appendix C

Finding Dominant Transitions in a POMM
Based on the work by Serrano et al. (2009) [58] on complex weighted networks, we
find the backbone of the POMM, or the set of significantly dominant transitions
in the POMM, by testing each transition against a null hypothesis. The null
hypothesis is that the transition probabilities into or out of a state are produced
by a random assignment from a uniform distribution. We can define a p-value for
each transition - the probability that if the null hypothesis is true, then the variable
under consideration would have a value greater than or equal to the observed value.
Dominant transitions in the syntax are defined to be the transitions for which the
null hypothesis is rejected at an assigned significance level.
C.1 Random assignment of transition probabilities
from a uniform distribution
If a state has k transitions, we would like to find the ones that reject the null
hypothesis. A state has transitions both into it and out of it - kin transitions into,
and kout transitions out of the state. The following analysis is applicable to either
set, and we will therefore consider k transitions out of the state for purposes of
explanation.
Since the sum of all k transition probabilities out of a state must be unity, the
problem is to divide the interval (0, 1] into k pieces. This is possible by choosing
k − 1 points within the interval. Let the distribution associated with k random
intervals be defined as pk. We will prove by induction that pk(x) dx = (k − 1)(1 − x)^(k−2) dx.
All intervals are equivalent, and therefore, to find the probability density
function for the random assignment of interval sizes from a uniform distribution,
we can solve the problem of finding the probability of assigning a length to any
one of the k intervals.
Consider the equivalent problem of finding the distance x between two ran-
domly chosen adjacent points on a line of length 1. We arbitrarily decide to keep
track of the distance between the end of the line corresponding to x = 0 and the 1st
point in line beyond it; i.e, the 1st interval. Assume we have k − 2 points already
chosen on the line such that there are k−1 intervals. The probability that the first
interval is of length q is pk−1(q). Now if we want to choose a new point adjacent
to the point that is currently first in line, this point could be chosen at a location
closer or further away from x = 0 than the current first point. One possibility is
that the new point is at location x with probability 1 and the point that was pre-
viously first in line was at a position beyond x. The probability of this occurrence
is given by 1 × ∫_x^1 pk−1(y) dy. The other possibility is that the point already first
in line was at x and the new point is chosen beyond x - the probability of this is
pk−1(x) ∫_x^1 p(y) dy, where p(y) = 1. Therefore the probability of the first interval
being of length x if there is 1 point (i.e., 2 intervals) already is

p3(x) = 1 × ∫_x^1 p2(y) dy + p2(x) ∫_x^1 p(y) dy
      = (1 − x) + (1 − x)
      = 2(1 − x)
If the state has only one transition, i.e., k = 1, it is trivially true that p1(x) = 0.
With a single point (k = 2 intervals), p2(x) = 1, and we have shown above that
p3(x) = 2(1 − x). Let us assume that it is true that for k intervals
pk(x) dx = (k − 1)(1 − x)^(k−2) dx (C.1)

Then, repeating the two-possibility argument above with pk in place of p2, we obtain

pk+1(x) dx = k(1 − x)^(k−1) dx (C.2)
Figure C.1. The probability density function for a variable taking a value based on random assignment from a uniform distribution decreases monotonically.
By the principle of induction, Eq. (C.1) is true for all k.
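Equation (C.1) can also be checked numerically: it implies that the mean length of the first of k intervals is ∫ x (k − 1)(1 − x)^(k−2) dx = 1/k. A small Monte-Carlo sketch (illustrative only):

```python
import random

def first_interval(k, rng):
    """Length of the first of k intervals produced by dropping
    k - 1 uniform random points on the unit interval."""
    points = sorted(rng.random() for _ in range(k - 1))
    return points[0] if points else 1.0

# Check of Eq. (C.1): the density (k - 1)(1 - x)^(k - 2)
# implies a mean first-interval length of 1/k.
rng = random.Random(42)
k = 5
n = 200_000
mean = sum(first_interval(k, rng) for _ in range(n)) / n
```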
C.2 Assignment of significance levels
For any POMM, we want to calculate the probability that each transition pij from
state i to state j is compatible with the null hypothesis. This is done by calculating
a p-value αij - the probability under a true null hypothesis of obtaining a value for
the transition probability that is equal to or greater than the observed probability
pij. This is because Eq. (C.1) describes a monotonically decreasing function, as
shown for k = 5 in the sketch in Fig. C.1, which means that all the lower-probability
events occur at values greater than pij, indicated by the shaded region. The p-value
αij is therefore given by
αij = ∫_{pij}^{1} pk(x) dx (C.3)
    = 1 − ∫_{0}^{pij} pk(x) dx (C.4)
    = 1 − (k − 1) ∫_{0}^{pij} (1 − x)^(k−2) dx (C.5)
Each transition pij in the POMM can be associated with a p-value αij. We
can assign a global significance level α to the p-values to determine the dominant
transitions: a transition is dominant if αij < α, and smaller values of α select only
the most dominant transitions.
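Since the integral in Eq. (C.3) has the closed form αij = (1 − pij)^(k−1), the test can be sketched compactly; the transition values below are made up for illustration:

```python
def transition_p_value(p_ij, k):
    """p-value of a transition with probability p_ij among k transitions
    out of (or into) a state, under the uniform-random-division null:
    Eq. (C.3) integrates to (1 - p_ij)^(k - 1)."""
    return (1.0 - p_ij) ** (k - 1)

def dominant(transitions, alpha=0.05):
    """Keep the transitions whose null p-value falls below the global
    significance level alpha. `transitions` maps (i, j) -> p_ij for
    one state's k edges."""
    k = len(transitions)
    return {edge for edge, p in transitions.items()
            if transition_p_value(p, k) < alpha}

# One state with four outgoing transitions: only the 0.85 edge is dominant.
out = {(1, 2): 0.85, (1, 3): 0.05, (1, 4): 0.05, (1, 5): 0.05}
```

Here `dominant(out)` retains only the edge (1, 2), since (1 − 0.85)^3 ≈ 0.003 rejects the null at α = 0.05 while (1 − 0.05)^3 ≈ 0.86 does not.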
Bibliography
[1] Kuiper, K. and J. Nokes (2013) Theories of Syntax: Concepts and CaseStudies, Palgrave Macmillan.
[2] Bickerton, D. and E. Szathmary (2009) Biological foundations and ori-gin of syntax, Mit Press.
[3] Chomsky, N. (1957) Syntactic Structures, Mouton, The Hague.
[4] Berwick, R. C., K. Okanoya, G. J. L. Beckers, and J. J. Bolhuis(2011) “Songs to syntax: the linguistics of birdsong.” Trends in cognitivesciences, 15(3), pp. 113–21.
[5] Katahira, K., K. Suzuki, K. Okanoya, and M. Okada (2011) “Com-plex sequencing rules of birdsong can be explained by simple hidden Markovprocesses.” PloS one, 6(9), p. e24516.
[6] Jin, D. Z. and A. a. Kozhevnikov (2011) “A compact statistical modelof the song syntax in Bengalese finch.” PLoS computational biology, 7(3), p.e1001108.
[7] Hopcroft, J. E. (1979) Introduction to automata theory, languages, andcomputation, Pearson Education India.
[8] Callut, J. and P. Dupont (2004) “A Markovian approach to the inductionof regular string distributions,” Icgi, 3264(3264), pp. 77–90.
[9] Wilbrecht, L. and F. Nottebohm (2003) “Vocal learning in birds andhumans.” Mental retardation and developmental disabilities research reviews,9(3), pp. 135–48.
[10] Deacon, T. W. (1998) The symbolic species: The co-evolution of languageand the brain, WW Norton & Company.
104
[11] Marler, P. R. and H. Slabbekoorn (2004) Nature’s music: the scienceof birdsong, Academic Press.
[12] Reiss, D. and B. McCowan (1993) “Spontaneous vocal mimicry and pro-duction by bottlenose dolphins (Tursiops truncatus): evidence for vocal learn-ing.” Journal of Comparative Psychology, 107(3), p. 301.
[13] Payne, K., P. Tyack, and R. Payne (1983) “Progressive changes in thesongs of humpback whales (Megaptera novaeangliae): a detailed analysis oftwo seasons in Hawaii,” Communication and behavior of whales, pp. 9–57.
[14] Foote, A. D., R. M. Griffin, D. Howitt, L. Larsson, P. J. O.Miller, and A. R. Hoelzel (2006) “Killer whales are capable of vocallearning.” Biology letters, 2(4), pp. 509–12.
[15] Sanvito, S., F. Galimberti, and E. H. Miller (2007) “ObservationalEvidences of Vocal Learning in Southern Elephant Seals: a LongitudinalStudy,” Ethology, 113(2), pp. 137–146.
[16] Pistorio, A. L., B. Vintch, and X. Wang (2006) “Acoustic analy-sis of vocal development in a New World primate, the common marmoset(Callithrix jacchus) a),” The Journal of the Acoustical Society of America,120(3), pp. 1655–1670.
[17] Boughman, J. W. (1998) “Vocal learning by greater spear-nosed bats.”Proceedings. Biological sciences / The Royal Society, 265(1392), pp. 227–33.
[18] Poole, J. H., P. L. Tyack, A. S. Stoeger-Horwath, and S. Wat-wood (2005) “Animal behaviour: elephants are capable of vocal learning,”Nature, 434(7032), pp. 455–456.
[19] Arriaga, G., E. P. Zhou, and E. D. Jarvis (2012) “Of mice, birds, andmen: the mouse ultrasonic song system has some features similar to humansand song-learning birds.” PloS one, 7(10), p. e46610.
[20] Williams, H. (2004) “Birdsong and singing behavior.” Annals of the NewYork Academy of Sciences, 1016, pp. 1–30.
[21] Doupe, A. J. and P. K. Kuhl (1999) “Bird Song and Human Speech:Common Themes and Mechanisms,” Annu. Rev. Neurosci., 22, pp. 567–631.
[22] Bolhuis, J. J., K. Okanoya, and C. Scharff (2010) “Twitter evolu-tion: converging mechanisms in birdsong and human speech.” Nature reviews.Neuroscience, 11(11), pp. 747–59.
105
[23] Thorpe, W. H. (1954) "The Process of Song-Learning in the Chaffinch as Studied by Means of the Sound Spectrograph," Nature, 173(4402), pp. 465–469.
[24] Marler, P. and D. Isaac (1960) "Song variation in a population of Brown Towhees," The Condor, 62(4), pp. 272–283.
[25] ——— (1960) "Physical analysis of a simple bird song as exemplified by the Chipping Sparrow," The Condor, 62(2), pp. 124–135.
[26] Nottebohm, F., T. M. Stokes, and C. M. Leonard (1976) "Central control of song in the canary, Serinus canarius," The Journal of Comparative Neurology, 165(4), pp. 457–486.
[27] Marler, P. (1970) "A comparative approach to vocal learning: song development in White-crowned Sparrows," Journal of Comparative and Physiological Psychology, 71(2p2), p. 1.
[28] Calder, W. A. (1970) "Respiration during song in the canary (Serinus canaria)," Comparative Biochemistry and Physiology, 32(2), pp. 251–258.
[29] Hinde, R. (1958) "Alternative motor patterns in chaffinch song," Animal Behaviour, 6(3), pp. 211–218.
[30] Doupe, A. J. and M. Konishi (1991) "Song-selective auditory circuits in the vocal control system of the zebra finch," Proceedings of the National Academy of Sciences, 88(24), pp. 11339–11343.
[31] Konishi, M. and E. Akutagawa (1985) "Neuronal growth, atrophy and death in a sexually dimorphic song nucleus in the zebra finch brain," Nature, 315, pp. 145–147.
[32] Jurgens, U. (2002) "Neural pathways underlying vocal control," Neuroscience and Biobehavioral Reviews, 26(2), pp. 235–258.
[33] Kroodsma, D. E. and M. Konishi (1991) "A suboscine bird (eastern phoebe, Sayornis phoebe) develops normal song without auditory feedback," Animal Behaviour, 42(3), pp. 477–487.
[34] Suthers, R., W. Fitch, R. Fay, and A. Popper (2016) Vertebrate Sound Production and Acoustic Communication, Springer Handbook of Auditory Research, Springer International Publishing.
[35] Jurgens, U. (2009) "The neural control of vocalization in mammals: a review," Journal of Voice, 23(1), pp. 1–10.
[36] Jarvis, E. D., O. Gunturkun, L. Bruce, A. Csillag, H. Karten, W. Kuenzel, L. Medina, G. Paxinos, D. J. Perkel, T. Shimizu, G. Striedter, J. M. Wild, G. F. Ball, J. Dugas-Ford, S. E. Durand, G. E. Hough, S. Husband, L. Kubikova, D. W. Lee, C. V. Mello, A. Powers, C. Siang, T. V. Smulders, K. Wada, S. A. White, K. Yamamoto, J. Yu, A. Reiner, and A. B. Butler (2005) "Avian brains and a new understanding of vertebrate brain evolution," Nature Reviews Neuroscience, 6(2), pp. 151–159.
[37] Elemans, C. P. H., J. H. Rasmussen, C. T. Herbst, D. N. During, S. A. Zollinger, H. Brumm, K. Srivastava, N. Svane, M. Ding, O. N. Larsen, S. J. Sober, and J. G. Svec (2015) "Universal mechanisms of sound production and control in birds and mammals," Nature Communications, 6, p. 8978.
[38] Yu, A. C. and D. Margoliash (1996) "Temporal hierarchical control of singing in birds," Science, 273(5283), pp. 1871–1875.
[39] Hahnloser, R. H. R., A. A. Kozhevnikov, and M. S. Fee (2002) "An ultra-sparse code underlies the generation of neural sequences in a songbird," Nature, 419, pp. 65–70.
[40] Fee, M. S., A. A. Kozhevnikov, and R. H. R. Hahnloser (2004) "Neural mechanisms of vocal sequence generation in the songbird," Annals of the New York Academy of Sciences, 1016, pp. 153–170.
[41] Troyer, T. W. (2013) "Neuroscience: The units of a song," Nature, 495(7439), pp. 56–57.
[42] Amador, A., Y. S. Perl, G. B. Mindlin, and D. Margoliash (2013) "Elemental gesture dynamics are encoded by song premotor cortical neurons," Nature, 495(7439), pp. 59–64.
[43] Fiete, I. R., R. H. Hahnloser, M. S. Fee, and H. S. Seung (2004) "Temporal sparseness of the premotor drive is important for rapid learning in a neural network model of birdsong," Journal of Neurophysiology, 92(4), pp. 2274–2282.
[44] Fiete, I. R., W. Senn, C. Z. Wang, and R. H. Hahnloser (2010) "Spike-time-dependent plasticity and heterosynaptic competition organize networks to produce long scale-free sequences of neural activity," Neuron, 65(4), pp. 563–576.
[45] Jin, D. Z. (2009) "Generating variable birdsong syllable sequences with branching chain networks in avian premotor nucleus HVC," Physical Review E, 80(5), pp. 1–13.
[46] Okubo, T. S., E. L. Mackevicius, H. L. Payne, G. F. Lynch, and M. S. Fee (2015) "Growth and splitting of neural sequences in songbird vocal development," Nature, 528(7582), pp. 352–357.
[47] Jin, D. Z., F. M. Ramazanoğlu, and H. S. Seung (2007) "Intrinsic bursting enhances the robustness of a neural network model of sequence generation by avian brain area HVC," Journal of Computational Neuroscience, 23(3), pp. 283–299.
[48] Basharin, G. P., A. N. Langville, and V. A. Naumov (2004) "The life and work of A.A. Markov," Linear Algebra and its Applications, 386, pp. 3–26.
[49] Dobson, C. W. and R. E. Lemon (1979) "Markov sequences in songs of American thrushes," Behaviour, 68(1), pp. 86–105.
[50] Okanoya, K. (2004) "The Bengalese finch: a window on the behavioral neurobiology of birdsong syntax," Annals of the New York Academy of Sciences, 1016(1), pp. 724–735.
[51] Wittenbach, J. D., K. E. Bouchard, M. S. Brainard, and D. Z. Jin (2015) "An Adapting Auditory-motor Feedback Loop Can Contribute to Generating Vocal Repetition," PLoS Computational Biology, 11(10), p. e1004471.
[52] Lyons, J. (1981) Language and Linguistics, Cambridge University Press.
[53] Hennie, F. C. (1968) Finite-State Models for Logical Machines, John Wiley & Sons.
[54] Bourlard, H. and S. Bengio (2002) Hidden Markov Models and other Finite State Automata for Sequence Processing, 2nd ed., MIT Press, Cambridge, MA, USA.
[55] Rabiner, L. R. (2007) "An introduction to hidden Markov models," Current Protocols in Bioinformatics.
[56] Dupont, P. (2005) "Inducing Hidden Markov Models to Model Long-Term Dependencies," Machine Learning: ECML, pp. 513–521.
[57] Moon, T. K. (1996) "The expectation-maximization algorithm," IEEE Signal Processing Magazine, 13(6), pp. 47–60.
[58] Serrano, M. Á., M. Boguñá, and A. Vespignani (2009) "Extracting the multiscale backbone of complex weighted networks," Proceedings of the National Academy of Sciences, 106(16), pp. 6483–6488.
[59] Woolley, S. M. N. and E. W. Rubel (1997) "Bengalese Finches Lonchura striata domestica Depend upon Auditory Feedback for the Maintenance of Adult Song," The Journal of Neuroscience, 17(16), pp. 6380–6390.
[60] Okanoya, K. and A. Yamaguchi (1997) "Adult Bengalese finches (Lonchura striata var. domestica) require real-time auditory feedback to produce normal song syntax," Journal of Neurobiology, 33(4), pp. 343–356.
[61] Rajan, R. and A. J. Doupe (2013) "Behavioral and neural signatures of readiness to initiate a learned motor sequence," Current Biology, 23(1), pp. 87–93.
[62] Nordeen, K. W. and E. J. Nordeen (1992) "Auditory feedback is necessary for the maintenance of stereotyped song in adult zebra finches," Behavioral and Neural Biology, 57(1), pp. 58–66.
[63] Lombardino, A. J. and F. Nottebohm (2000) "Age at deafening affects the stability of learned song in adult male zebra finches," The Journal of Neuroscience, 20(13), pp. 5054–5064.
[64] Sakata, J. T. and M. S. Brainard (2008) "Online Contributions of Auditory Feedback to Neural Activity in Avian Song Control Circuitry," Journal of Neuroscience, 28(44), pp. 11378–11390.
[65] Payne, R. S. and S. McVay (1971) "Songs of humpback whales," Science, 173(3997), pp. 585–597.
[66] Payne, K. and R. Payne (1985) "Large scale changes over 19 years in songs of humpback whales in Bermuda," Zeitschrift für Tierpsychologie, 68(2), pp. 89–114.
[67] Helweg, D. A., A. S. Frankel, J. R. Mobley Jr, and L. M. Herman (1992) "Humpback whale song: our current understanding," in Marine Mammal Sensory Systems, Springer, pp. 459–483.
[68] Garland, E. C., A. W. Goldizen, M. L. Rekdahl, R. Constantine, C. Garrigue, N. D. Hauser, M. M. Poole, J. Robbins, and M. J. Noad (2011) "Dynamic horizontal cultural transmission of humpback whale song at the ocean basin scale," Current Biology, 21(8), pp. 687–691.
[69] Suzuki, R., J. R. Buck, and P. L. Tyack (2006) "Information entropy of humpback whale songs," The Journal of the Acoustical Society of America, 119(3), pp. 1849–1866.
[70] Cummings, J. and M. Mega (2003) Neuropsychiatry and Behavioral Neuroscience, Oxford University Press.
[71] Robertson, M. M. (2000) "Tourette syndrome, associated conditions and the complexities of treatment," Brain, 123(3), pp. 425–462.
[72] Marler, P. and S. Peters (1982) "Structural changes in song ontogeny in the swamp sparrow Melospiza georgiana," The Auk, pp. 446–458.
[73] Dooling, R. and M. Searcy (1980) "Early perceptual selectivity in the swamp sparrow," Developmental Psychobiology, 13(5), pp. 499–506.
[74] Mota, P. G. and G. C. Cardoso (2001) "Song organisation and patterns of variation in the serin (Serinus serinus)," Acta Ethologica, 3(2), pp. 141–150.
[75] Guttinger, H. R. (1985) "Consequences of domestication on the song structures in the canary," Behaviour, 94(3), pp. 254–278.
[76] ——— (1979) "The Integration of Learnt and Genetically Programmed Behaviour," Zeitschrift für Tierpsychologie, 49(3), pp. 285–303.
[77] Podos, J. (1997) "A performance constraint on the evolution of trilled vocalizations in a songbird family (Passeriformes: Emberizidae)," Evolution, 51(2), pp. 537–551.
[78] Sprau, P., T. Roth, V. Amrhein, and M. Naguib (2013) "The predictive value of trill performance in a large repertoire songbird, the nightingale Luscinia megarhynchos," Journal of Avian Biology, 44(6), pp. 567–574.
[79] Hailman, J. P. and M. S. Ficken (1986) "Combinatorial animal communication with computable syntax: chick-a-dee calling qualifies as language by structural linguistics," Animal Behaviour, 34(6), pp. 1899–1901.
[80] Bloomfield, L. L., I. Charrier, and C. B. Sturdy (2004) "Note types and coding in parid vocalizations. II: The chick-a-dee call of the mountain chickadee (Poecile gambeli)," Canadian Journal of Zoology, 82(5), pp. 780–793.
[81] Ficken, M. S., E. D. Hailman, and J. P. Hailman (1994) "The chick-a-dee call system of the Mexican chickadee," Condor, pp. 70–82.
[82] Leger, D. W. (2005) "First documentation of combinatorial song syntax in a suboscine passerine species," The Condor, 107(4), pp. 765–774.
[83] Helekar, S., G. Espino, A. Botas, and D. Rosenfield (2003) "Development and adult phase plasticity of syllable repetitions in the birdsong of captive zebra finches (Taeniopygia guttata)," Behavioral Neuroscience, 117(5), p. 939.
[84] Marler, P. and V. Sherman (1983) "Song structure without auditory feedback: emendations of the auditory template hypothesis," The Journal of Neuroscience, 3(3), pp. 517–531.
[85] Guttinger, H. R. (1981) "Self-differentiation of Song Organization Rules by Deaf Canaries," Zeitschrift für Tierpsychologie, 56(4), pp. 323–340.
[86] Gardner, T. J., F. Naef, and F. Nottebohm (2005) "Freedom and rules: the acquisition and reprogramming of a bird's learned song," Science, 308(5724), pp. 1046–1049.
[87] Markowitz, J. E., E. Ivie, L. Kligler, and T. J. Gardner (2013) "Long-range Order in Canary Song," PLoS Computational Biology, 9(5), p. e1003052.
[88] Olveczky, B. P., A. S. Andalman, and M. S. Fee (2005) "Vocal experimentation in the juvenile songbird requires a basal ganglia circuit," PLoS Biology, 3(5), p. e153.
[89] Lippmann, R. P. (1997) "Speech recognition by machines and humans," Speech Communication, 22(1), pp. 1–15.
[90] Anderson, S. E., A. S. Dave, and D. Margoliash (1996) "Template-based automatic recognition of birdsong syllables from continuous recordings," The Journal of the Acoustical Society of America, 100(2), pp. 1209–1219.
[91] Kogan, J. A. and D. Margoliash (1998) "Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: A comparative study," The Journal of the Acoustical Society of America, 103(4), pp. 2185–2196.
[92] Tachibana, R. O., N. Oosugi, and K. Okanoya (2014) "Semi-automatic classification of birdsong elements using a linear support vector machine," PLoS ONE, 9(3), p. e92584.
[93] Tchernichovski, O., F. Nottebohm, C. E. Ho, B. Pesaran, and P. P. Mitra (2000) "A procedure for an automated measurement of song similarity," Animal Behaviour, 59(6), pp. 1167–1176.
[94] Janata, P. (2001) "Quantitative assessment of vocal development in the zebra finch using self-organizing neural networks," The Journal of the Acoustical Society of America, 110(5), pp. 2593–2603.
[95] Du, P. and T. W. Troyer (2006) "A segmentation algorithm for zebra finch song at the note level," Neurocomputing, 69(10), pp. 1375–1379.
[96] Savitzky, A. and M. J. Golay (1964) "Smoothing and differentiation of data by simplified least squares procedures," Analytical Chemistry, 36(8), pp. 1627–1639.
[97] Bishop, C. M. (2006) Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc.
[98] Boser, B. E., I. M. Guyon, and V. N. Vapnik (1992) "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, ACM Press, pp. 144–152.
[99] Kim, H.-C., S. Pang, H.-M. Je, D. Kim, and S. Y. Bang (2002) "Pattern classification using support vector machine ensemble," in Pattern Recognition, 2002. Proceedings. 16th International Conference on, vol. 2, IEEE, pp. 160–163.
[100] Zabell, S. L. (1992) "Predicting the unpredictable," Synthese, 90(2), pp. 205–232.
[101] Teh, Y. W., M. I. Jordan, M. J. Beal, and D. M. Blei (2012) "Hierarchical Dirichlet processes," Journal of the American Statistical Association.
[102] Beal, M. J., Z. Ghahramani, and C. E. Rasmussen (2001) "The infinite hidden Markov model," in Advances in Neural Information Processing Systems, pp. 577–584.
[103] Van Gael, J., Y. Saatci, Y. W. Teh, and Z. Ghahramani (2008) "Beam sampling for the infinite hidden Markov model," in Proceedings of the 25th International Conference on Machine Learning, ACM, pp. 1088–1095.
[104] Durbin, R., S. R. Eddy, A. Krogh, and G. Mitchison (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press.
[105] Podos, J., S. Nowicki, and S. Peters (1999) "Permissiveness in the learning and development of song syntax in swamp sparrows," Animal Behaviour, 58(1), pp. 93–103.
[106] Fujimoto, H., T. Hasegawa, and D. Watanabe (2011) "Neural coding of syntactic structure in learned vocalizations in the songbird," The Journal of Neuroscience, 31(27), pp. 10023–10033.
[107] Prather, J., S. Peters, R. Mooney, and S. Nowicki (2012) "Sensory constraints on birdsong syntax: neural responses to swamp sparrow songs with accelerated trill rates," Animal Behaviour, 83(6), pp. 1411–1420.
Vita
Sumithra Surendralal
Education
Doctor of Philosophy, Physics (August 2016)
The Pennsylvania State University, University Park, PA

Master of Science, Physics (May 2009)
University of Madras, Chennai, India

Bachelor of Science, Physics (May 2007)
Women's Christian College, Chennai, India
Work Experience
Research Assistant (Apr–Jul 2010)
Institute of Mathematical Sciences, Chennai, India
Awards
Outstanding Physics Teaching Assistant Award (2016)
American Association of Physics Teachers (AAPT)

The Professor Stanley Shepherd Graduate Teaching Assistant Award (2016)
The Pennsylvania State University, University Park, PA

Graduate Teaching Assistant Award (2013)
The Pennsylvania State University, University Park, PA