Analysis of Motifs in Carnatic Music: A Computational Perspective
A THESIS
submitted by
SHREY DUTTA
for the award of the degree
of
MASTER OF SCIENCE (by Research)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY, MADRAS
October 2015
THESIS CERTIFICATE
This is to certify that the thesis entitled Analysis of Motifs in Carnatic Music:
A Computational Perspective, submitted by Shrey Dutta, to the Indian Institute
of Technology, Madras, for the award of the degree of Master of Science (by
Research), is a bona fide record of the research work carried out by him under my
supervision. The contents of this thesis, in full or in parts, have not been submitted
to any other Institute or University for the award of any degree or diploma.
Dr. Hema A. Murthy
Research Guide
Professor
Dept. of Computer Science and Engineering
IIT-Madras, 600 036
Place: Chennai
Date:
ACKNOWLEDGEMENTS
I joined IIT Madras with the intention of mastering the techniques used in machine
learning. There is so much data available in digital form and I used to think that
machine learning techniques help in making sense of this data just as the human brain
makes sense of the raw data received from different senses. As I started gaining
a deeper understanding of machine learning techniques, I realized that these
techniques are not mature enough to mimic the human brain and thus should not be
used blindly. I understood that the data needs to be represented in a sensible form
which depends on the task under consideration. These techniques are designed to
use this representation in achieving the desired task. After understanding this, I
was able to use the existing techniques efficiently as well as design new techniques
when required. This level of understanding was not possible without the immense
knowledge and experience shared by my adviser, Prof. Hema A. Murthy, through
endless captivating discussions.
I would like to express my sincere gratitude to her for her excellent guidance
and patience, and for providing an excellent atmosphere for doing research. She
helped me develop my background in signal processing and machine learning
and experience the practical issues beyond the textbooks. She has improved my
perspective not only towards research but also towards life.
I would like to thank my collaborators Vignesh Ishwar, Krishnaraj Sekhar
and Ashwin Bellur. The completion of this thesis would not have been possible
without their contribution. They helped me in building datasets, carrying out the
experiments, analyzing results and in writing research papers.
I am grateful to the members of my General Test Committee, Prof. C. Chandra
Sekhar and Prof. C. S. Ramalingam, for their suggestions and criticisms with
respect to the presentation of my work. I am also grateful for being a part of the
CompMusic project. It was a great learning experience working with the members
of this consortium.
I would like to thank my music teachers Prof. M.V.N. Murthy and Niveditha
Bharath. Prof. M.V.N. Murthy patiently taught me to play the Saraswati Veena
in his unique and excellent style. He always encouraged me to explore the music
beyond what he taught in class, which certainly nurtured my creativity. Madam
Niveditha Bharath taught me to sing Carnatic music. She is an excellent and very
friendly teacher. Her classes were full of fun
and excitement. Learning music from these wonderful teachers also helped me to
better understand the work with respect to this thesis.
I would like to thank Aashish, Anusha, Asha, Jom, Karthik, Manish, Padma,
Praveen, Raghav, Rajeev, Sarala, Saranya, Sridharan, Srikanth and other members
of Donlab for their help and unconditional support over the years. It would have
been a lonely lab without them. I am also grateful to Alastair, Ajay and Sankalp
from MTG Barcelona for always clearing my doubts and helping in my research. I
would also like to acknowledge the help of Kaustuv from IIT Bombay. He always
found time to answer my questions regarding Hindustani music.
I am also obliged to the European Research Council for funding the research under
the European Union's Seventh Framework Programme, as part of the CompMusic
project (ERC grant agreement 267583).
I would like to thank all my friends at IIT Madras without whom the life at IIT
campus would have been dry and boring. If not for them, I would have finished
my thesis much earlier. They have always been a source of refreshment during
stressful times.
I would like to thank my parents who have made many sacrifices so that I can
get a good education and a good life. They have always tolerated my stubborn
and rebellious nature which I am constantly trying to change. I wish to make them
proud one day.
Lastly, I would like to thank my loving brother Anubhav for always being
an anchor in my life. He took on the responsibility of financially supporting
our family at an early age and motivated me to pursue any path I wished to
choose. I will always be grateful to him and I wish him all the happiness in life.
ABSTRACT
KEYWORDS: Carnatic Music, Pattern Discovery, Motif Spotting, Motif Discovery,
Raga Verification, Stationary Points, Rough Longest Common
Subsequence, Longest Common Segment Set
In Carnatic music, a raga is defined by a collective expression of melodies that
consists of svaras (ornamented notes) in a well-defined order and phrases (aesthetic
threads of ornamented notes) that have been formed through the ages. Melodic
motifs are those unique phrases of a raga that collectively give a raga its identity.
These motifs are rendered repeatedly in every rendition of the raga, either compo-
sitional or improvisational, so that the identity of the raga is established. Different
renditions of a motif differ slightly from each other, which makes it challenging
for a time-series matching algorithm to match them. In this thesis, we design
algorithmic techniques to automatically find these motifs and their different
renditions, and then use the regions rich in these motifs to perform raga verification.
The initial focus of the thesis is on finding different renditions of melodic
motifs in an improvisational form of the raga called the alapana. Then we make
an attempt to automatically discover these motifs from the composition lines. The
results suggest that composition lines are indeed replete with melodic motifs.
Using these composition lines, raga verification is performed. In raga verification,
a melody (a single phrase or an aesthetic concatenation of many such phrases)
along with a raga claim is supplied to the system. The system confirms or rejects
the claim.
Two algorithms for time-series matching are proposed in this work. One is
a modification of the existing algorithm, Rough Longest Common Subsequence
(RLCS). Another proposed algorithm, Longest Common Segment Set (LCSS), is
completely novel and uses the set of segments matched in between to give a holistic score.
Using the proposed algorithm LCSS, an error rate of ∼ 12% is obtained for raga
verification on a database consisting of 17 ragas.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS i
ABSTRACT iv
LIST OF TABLES x
LIST OF FIGURES xi
ABBREVIATIONS xii
NOTATION xiii
1 Introduction 1
1.1 Overview of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contribution of the thesis . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Organization of the thesis . . . . . . . . . . . . . . . . . . . . . . . 3
2 Literature Survey 5
3 Motif Spotting 20
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Stationary Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Method of obtaining Stationary Points . . . . . . . . . . . . 23
3.3 Rough Longest Common Subsequence Algorithm . . . . . . . . . 25
3.3.1 Rough match . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.2 WAR and WAQ for local similarity . . . . . . . . . . . . . . 26
3.3.3 Score matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Modified-Rough Longest Common Subsequence . . . . . . . . . . 27
3.4.1 Rough and actual length of RLCS . . . . . . . . . . . . . . 28
vi
3.4.2 RWAR and RWAQ . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.3 Matched rate on the query sequence . . . . . . . . . . . . . 30
3.5 A Two-Pass Dynamic Programming Search . . . . . . . . . . . . . 30
3.5.1 First Pass: Determining Candidate Motif Regions using RLCS 31
3.5.2 Second Pass: Determining Motifs from the Groups . . . . 32
3.6 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7.1 Querying motifs in the alapanas . . . . . . . . . . . . . . . . 33
3.7.2 Comparison between RLCS and Modified-RLCS using longer motifs . . . 36
3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.8.1 Importance of VAD in motif spotting . . . . . . . . . . . . 39
3.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Motif Discovery 41
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Lines from the compositions . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Optimization criteria to find Rough Longest Common Subsequence 44
4.3.1 Density of the match . . . . . . . . . . . . . . . . . . . . . . 45
4.3.2 Normalized weighted length . . . . . . . . . . . . . . . . . 46
4.3.3 Linear trend in stationary points . . . . . . . . . . . . . . . 46
4.4 Discovering typical motifs of ragas . . . . . . . . . . . . . . . . . . 49
4.4.1 Filtering to get typical motifs of a raga . . . . . . . . . . . . 49
4.5 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.6 Experiments and results . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5 Raga Verification 56
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Dataset used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2.1 Extraction of pallavi lines . . . . . . . . . . . . . . . . . . . . 58
vii
5.2.2 Selection of cohorts . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Longest Common Segment Set Algorithm . . . . . . . . . . . . . . 59
5.3.1 Common segments . . . . . . . . . . . . . . . . . . . . . . . 60
5.3.2 Common segment set . . . . . . . . . . . . . . . . . . . . . 62
5.3.3 Longest Common Segment Set . . . . . . . . . . . . . . . . 62
5.4 Raga Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4.1 Score Normalization . . . . . . . . . . . . . . . . . . . . . . 65
5.5 Performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . 66
5.5.1 Experimental configuration . . . . . . . . . . . . . . . . . . 66
5.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.6.1 Combining hard-LCSS and soft-LCSS . . . . . . . . . . . . 69
5.6.2 Reduction of overlap in score distribution by T-norm . . . 69
5.6.3 Scalability of raga verification . . . . . . . . . . . . . . . . . 70
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6 Conclusion 71
6.1 Salient Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2 Criticism of the work . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
LIST OF TABLES
2.1 Svaras and their respective ratios to the base pitch ‘S’. . . . . . . . 6
3.1 Dataset of alapanas . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Short Motifs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Long Motifs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Short Motifs: Retrieved regions after the first pass . . . . . . . . . 34
3.5 Long Motifs: Retrieved regions after the first pass . . . . . . . . . 35
3.6 Short Motifs: Top 10 retrieved motifs after the second pass . . . . 35
3.7 Long Motifs: Top 10 retrieved motifs after the second pass . . . . 35
3.8 Long Motifs: Retrieved regions after the first pass . . . . . . . . . 37
3.9 Long Motifs: Retrieved regions after the second pass . . . . . . . 38
3.10 Retrieved Groups after both the passes for modified-RLCS without VAD . . . 39
4.1 D1: Dataset of composition lines . . . . . . . . . . . . . . . . . . . 50
4.2 D1: Dataset for filtering . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 D2: Dataset of composition lines . . . . . . . . . . . . . . . . . . . 51
4.4 D2: Dataset for filtering . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5 D1: Similar motifs retrieved from composition lines . . . . . . . . 52
4.6 D1: Percentage of motifs preserved after filtering . . . . . . . . . 53
4.7 D2: Similar motifs retrieved from composition lines . . . . . . . . 53
4.8 D2: Percentage of motifs preserved after filtering . . . . . . . . . 54
5.1 Details of the database used. Durations are given in approximate hours (h), minutes (m) or seconds (s). . . . 58
5.2 EER (%) for different algorithms using different normalizations on different datasets. . . . 67
5.3 Number of claims correctly verified by hard-LCSS only, by soft-LCSS only, by both and by neither of them for D1 and D2 using T-norm . . . 69
LIST OF FIGURES
2.1 Comparing Pitch Histogram of Raga 'Sankarabharanam' with its Hindustani and Western classical counterparts. . . . 7
2.2 Comparing a phrase in raga Sankarabharanam with gamakas and without gamakas . . . 8
2.3 The gamakas in their true form are marked in a pitch contour of a melody . . . 9
2.4 Tonic normalization of two similar phrases in raga 'Sankarabharanam' rendered at different tonics. . . . 10
2.5 Different renditions of a melodic motif in raga 'Kalyani' and raga 'Kamboji'. . . . 11
2.6 Different instances of a melodic motif in an alapana marked in red. . . . 12
2.7 Extraction of stationary points and their interpolation to get a smooth pitch contour. . . . 14
3.1 A Phrase with Stationary Points . . . . . . . . . . . . . . . . . . . . 22
3.2 The Pitch and Stationary Point Histograms of the raga Kamboji . 23
3.3 Original and Cubic Interpolated pitch contours . . . . . . . . . . . 24
3.4 a) True positive groups' and false alarm groups' score distribution for RLCS. b) True positive groups' and false alarm groups' score distribution for modified-RLCS. . . . 37
4.1 RLCS matching two sequences partially . . . . . . . . . . . . . . . 42
4.2 Slopes of the linear trend of stationary points help in reducing the false alarms. The last three phrases are false alarms. . . . 47
5.1 An example of a common segment set between two sequences representing the real data . . . 60
5.2 DET curves comparing the LCSS algorithm with different algorithms using different score normalizations . . . 68
5.3 Showing the effect of T-norm on the score distribution . . . . . . . 70
ABBREVIATIONS
DTW Dynamic Time Warping
UE-DTW Unconstrained Endpoint - Dynamic Time Warping
LCS Longest Common Subsequence
RLCS Rough Longest Common Subsequence
RCS Rough Common Subsequence
WAR Width Across Reference
WAQ Width Across Query
RWAR Rough Width Across Reference
RWAQ Rough Width Across Query
HMM Hidden Markov Model
LSF Least Squares Fit
LCSS Longest Common Segment Set
Z-Norm Zero Normalization
T-Norm Test Normalization
EER Equal Error Rate
VAD Voice Activity Detection
NOTATIONS
f              Frequency value in hertz
d_{r_i,q_j}    Distance between the reference's i-th value and the query's j-th value
T_d            Threshold on the distance d_{r_i,q_j}
c_{i,j}        Cost of the RLCS till the reference's i-th value and the query's j-th value
w^r_{i,j}      WAR till the reference's i-th value and the query's j-th value
w^q_{i,j}      WAQ till the reference's i-th value and the query's j-th value
β              A weight on density
ρ              Matching rate
c^a_{i,j}      Actual length of the RLCS till the reference's i-th value and the query's j-th value
w̃^r_{i,j}      RWAR till the reference's i-th value and the query's j-th value
w̃^q_{i,j}      RWAQ till the reference's i-th value and the query's j-th value
st             A semitone in cents
δ_{S_XY}       Density of the RCS S_XY
l^w_{S_XY}     Actual length of the RCS S_XY
g_X            Gaps in sequence X
g_Y            Gaps in sequence Y
τ_sim          Threshold on the similarity scores
μ^X_{S_XY}     Slope of the linear trend of stationary points in sequence X
σ^X_{S_XY}     Standard deviation of the linear trend's slope in sequence X
λ_{S_XY}       Similarity in the linear trend of stationary points in sequences X and Y
γ              The number of gaps between two hard segments
η              Penalty issued for each gap
μ^claim_I      Imposter mean for the claim
σ^claim_I      Imposter standard deviation for the claim
CHAPTER 1
Introduction
1.1 Overview of the thesis
In Carnatic music, a raga is a collective expression of melodies which consists of:
1. A set of svaras (ornamented notes) ordered in a well defined manner.
2. Phrases (aesthetic threads of ornamented notes) as established by performances through the ages, as rendered in well known compositions.
While there are some ragas for which the first condition alone suffices, in
general both conditions are necessary and are used in practice. The
phrases that collectively give a raga its identity are called melodic motifs. The
melodic motifs are unique to a raga. Therefore, in any rendition of the raga, either
compositional or improvisational, these motifs are rendered in order to establish
the raga’s identity. Different renditions of a motif may differ slightly from each
other, but they are enough to confuse a time-series matching algorithm. The goal
of the thesis is to design algorithmic techniques to automatically find these motifs,
their different renditions and, then use the regions replete with these motifs to
perform raga verification.
The initial part of the thesis is dedicated to finding different renditions of
melodic motifs in an improvisational form of the raga called the alapana. This problem
is known as motif spotting. A melodic motif, preselected by a musician, is used
as a query and its different renditions are spotted using a matching algorithm.
Following this work, inspired by how trained listeners identify ragas, automatic
discovery of motifs is attempted using certain segments of compositions which are
supposed to be rich in motifs. Similar phrases are extracted from a number of such
segments of the compositions in a particular raga. Not all similar phrases are
melodic motifs: some of them could also appear in other ragas, thus violating the
uniqueness property of motifs. Therefore, these non-motif phrases are filtered
out if they are found in the composition lines of other ragas. Using this approach,
various motifs are discovered for 14 ragas, confirming that these segments
are replete with motifs. Therefore, using these segments of compositions, raga
verification is performed. In raga verification, a melody (a single phrase or an
aesthetic concatenation of many such phrases) along with a raga claim is supplied
to the system. The system confirms or rejects the claim. Raga verification is
performed by comparing the snippet of audio supplied with various composition
lines of the claimed raga. The obtained score is matched against the scores obtained
with composition lines of confusing ragas using score-normalization techniques.
Two algorithms for time-series matching are proposed in this work. One is
a modification of the existing algorithm, Rough Longest Common Subsequence
(RLCS). Another proposed algorithm, Longest Common Segment Set (LCSS), is
completely novel and uses the set of segments matched in between to give a holistic score.
This algorithm comes in two forms: hard and soft. Hard-LCSS treats individ-
ual matched segments separately irrespective of their lengths and distribution,
whereas soft-LCSS can join two or more segments based on their lengths and
distribution in order to compute a holistic score. Using the proposed algorithms,
an error rate of ∼ 12% is obtained for raga verification on a database consisting of
17 ragas.
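The hard/soft distinction can be illustrated with a toy sketch. This is not the thesis's actual LCSS formulation (that is developed in Chapter 5); the segment representation as (start, length) pairs, the gap threshold max_gap, and the scoring below are illustrative assumptions only.

```python
# Toy illustration of the hard- vs soft-LCSS idea (not the thesis's algorithm).
# Matched segments along the reference are given as (start, length) pairs.

def hard_score(segments):
    """Treat each matched segment separately: score the longest single one."""
    return max((length for _, length in segments), default=0)

def soft_score(segments, max_gap=3):
    """Join consecutive segments whose separating gap is small enough and
    score the longest joined run, giving a more holistic score."""
    best = run = 0
    prev_end = None
    for start, length in sorted(segments):
        if prev_end is not None and start - prev_end <= max_gap:
            run += length              # join across the small gap
        else:
            run = length               # start a new run
        prev_end = start + length
        best = max(best, run)
    return best

segs = [(0, 4), (6, 5), (20, 3)]       # two nearby segments, one far away
print(hard_score(segs))                # 5 (longest single segment)
print(soft_score(segs))                # 9 (first two join: gap of 2 <= 3)
```

Here the soft score rewards two matched segments that lie close together, mirroring the idea that soft-LCSS joins segments based on their lengths and distribution.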
1.2 Contribution of the thesis
The following are the main contributions of the thesis.
1. A measure based on the stationary points of the pitch contour is introduced that reduces the number of false alarms.
2. Modifications to an existing time-series matching algorithm, known as Rough Longest Common Subsequence, are proposed that reduce the number of false alarms and result in better localization.
3. A new time-series matching algorithm, known as Longest Common Segment Set, is proposed which performs better for the task of raga verification.
4. Approaches are proposed to discover melodic motifs automatically from the composition lines and to find their different renditions.
5. A system is designed to perform raga verification which is scalable to any number of ragas.
1.3 Organization of the thesis
The organization of the thesis is as follows. In Chapter 2, a brief background on
Carnatic music is given which is required for a better understanding of the work.
Some of the related work on motif spotting, motif discovery and raga verification
is also discussed in this chapter.
Chapter 3 is dedicated to describing the approach proposed in this thesis
to find different renditions of motifs. This chapter describes the quantization
of a pitch contour into stationary points, which preserves most of the raga
information. This chapter also describes the modifications made to an existing
time-series matching algorithm.
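As a rough sketch of the stationary-point idea (a simplified assumption for illustration; Chapter 3 gives the actual extraction and interpolation procedure), stationary points can be taken as the local extrema of the pitch contour, i.e. samples where the slope changes sign:

```python
def stationary_points(pitch):
    """Return (index, value) pairs at local maxima and minima of a pitch
    contour, keeping the endpoints as well (a simplified sketch)."""
    points = [(0, pitch[0])]
    for i in range(1, len(pitch) - 1):
        left = pitch[i] - pitch[i - 1]
        right = pitch[i + 1] - pitch[i]
        if left * right < 0:           # slope changes sign: an extremum
            points.append((i, pitch[i]))
    points.append((len(pitch) - 1, pitch[-1]))
    return points

contour = [100, 110, 125, 120, 112, 118, 130, 128]
print(stationary_points(contour))
# → [(0, 100), (2, 125), (4, 112), (6, 130), (7, 128)]
```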
Chapter 4 describes the proposed approach for automatically discovering the
melodic motifs from the composition lines of the ragas. A measure based on the
stationary points is defined which reduces the false alarms.
Chapter 5 is dedicated to explaining the raga verification system. Automatic
extraction of composition lines from a given composition is discussed. This
chapter also describes the concept of cohorts for a raga. A new time-series
matching algorithm, named Longest Common Segment Set, is also proposed in this
chapter.
Finally, Chapter 6 summarizes the work and discusses the possible future work.
CHAPTER 2
Literature Survey
Carnatic music is an art music (often also referred to as classical music) tradition
commonly associated with four states of Southern India: Andhra Pradesh, Karnataka,
Kerala and Tamil Nadu, and also some parts of Maharashtra. It is one of the two
main sub-genres of Indian classical music. The other sub-genre is Hindustani Music
which is mainly practiced in North India and also some parts of South India.
A Carnatic music concert is an ensemble of the main performer (usually a
vocalist), an accompanist (usually a violinist, occasionally a vainika or flautist) and
percussionists (a single mridangam vidwan (main percussionist), or an ensemble
of percussionists). If the main percussionist is right handed, s/he sits to the right
of the main artist and the violinist sits to the left. The positions are exchanged when
the mridangam vidwan is left handed. All the performers sit on the stage cross
legged without any support.
The first musical sound of a concert is always that of a tambura, a drone instrument
which provides the tonic for the entire concert. The tambura (tanpura) is a string
instrument that has four strings tuned to three pitches: P-(S)-(S)-S. 'S' is the first
pitch of an octave, whereas 'P' is 1.5 times the pitch of 'S', which places 'P'
seven semitones above 'S'. The two '(S)'s, being twice the pitch of 'S', represent the
first pitch of the second octave. When these four strings are played continuously
in a conventional manner, the perceived sound, rich in harmonics, provides the
harmonic base for the performance.
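The tuning described above maps directly to frequencies, as the following sketch shows (the 146.8 Hz tonic is only an example value, not one used in the thesis):

```python
def tambura_strings(tonic_hz):
    """Frequencies of the four tambura strings tuned P-(S)-(S)-S:
    P is 1.5x the tonic, the two (S) strings are an octave up (2x),
    and the last string is the tonic S itself."""
    return {"P": 1.5 * tonic_hz,
            "(S)_1": 2.0 * tonic_hz,
            "(S)_2": 2.0 * tonic_hz,
            "S": 1.0 * tonic_hz}

print(tambura_strings(146.8))  # e.g. a tonic near D3
```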
Table 2.1: Svaras and their respective ratios to the base pitch ‘S’.
Svaras          S      R1     R2/G1  R3/G2  G3     M1     M2     P      D1     D2/N1  D3/N2  N3     (S)
Ratio with 'S'  1      16/15  9/8    6/5    5/4    4/3    17/12  3/2    8/5    5/3    9/5    15/8   2
Decimal         1.000  1.067  1.125  1.200  1.250  1.333  1.417  1.500  1.600  1.667  1.800  1.875  2.000
The sound of ‘S’, also referred as ‘sruti’, is a base pitch (tonic) with respect
to which all other pitches are defined. These musical pitches in the context of
Carnatic music are referred to as ‘svaras’. ‘S’ (Sadja) and ‘P’ (Panchama) are the
two of the seven svaras in an octave; other 5 being ‘R’ (Rishabha), ‘G’ (Gandhara),
‘M’ (Madhayama), ‘D’ (Dhaivata) and ‘N’ (Nisada). These five svaras have defined
variability. They take two to three pitch positions in contrast to ‘S’ and ‘P’ as
shown in Table 2.1. These manifestations of a svara, into multiple pitch positions,
are collectively defined as svarasthanas (svara positions) [27]. There are 12 pitch
positions within an octave and the total number of svarasthanas is 16. Therefore,
as shown in Table 2.1, there are overlaps between svarasthanas sharing the same
pitch position. For example, Chatusruti Rishabha (R2) and Suddha Gandhara (G1)
share the same pitch position. Therefore, this pitch position can be interpreted as
one of these two svarasthanas depending on the context.
A svara is not perceived as a single point of frequency, although it is referred
to as a definitive pitch. It is perceived as movements within a small range of pitch
values around a dominant mean. Figure 2.1 shows the histogram of pitch values
in a melody of raga Sankarabharanam and compares it with its Hindustani (raga
Bilawal) and Western classical (C-Major) counterparts that share the same scale.
The pitch histogram is continuous for Carnatic music and Hindustani music but
more or less discrete for Western music. It is clearly seen that the svaras are a range
of pitch values and this range is the widest for Carnatic music. This is because the
intonation of a svara within the permissible range cognitively refers to only one
svarasthana. For example, when the svarasthana Antara Gandhara (G3) is constantly
[Figure: overlaid normalized pitch histograms (normalized density versus frequency in cents) for the Western classical C major scale, Hindustani raga Bilawal and Carnatic raga Shankarabharanam, with the svaras S, R2, G3, M1, P, D2 and N3 marked.]

Figure 2.1: Comparing Pitch Histogram of Raga 'Sankarabharanam' with its Hindustani and Western classical counterparts.
moving within a range, it is cognitively recognized as G3 even if it touches upon
other svarasthanas. This concept where a svara is used to create a variability of
movement in relation to the phraseology and melodic identity, creating a cognitive
understanding of the svarasthana, is defined as a gamaka [27]. Therefore, a svara is
a complete embodiment of a svarasthana and its associated gamakas.
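A pitch histogram of the kind compared in Figure 2.1 can be sketched as follows: pitch values (assumed already converted to cents relative to the tonic) are binned, and counts are scaled so the highest bin is 1, matching the figure's normalized-density axis. The 10-cent bin width is an assumption:

```python
from collections import Counter

def pitch_histogram(pitch_cents, bin_width=10):
    """Histogram of a pitch contour given in cents relative to the tonic,
    normalized so that the most populated bin has height 1."""
    bins = Counter(int(p // bin_width) * bin_width for p in pitch_cents)
    peak = max(bins.values())
    return {b: count / peak for b, count in sorted(bins.items())}

# A toy melody hovering around S (0 cents) and P (~702 cents)
melody = [0, 3, -4, 2, 700, 705, 698, 702, 701, 1]
print(pitch_histogram(melody))
# → {-10: 0.25, 0: 1.0, 690: 0.25, 700: 1.0}
```

A broad spread of mass around each svara, rather than a single spike per note, is exactly what distinguishes the Carnatic histogram in Figure 2.1.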
There have been various attempts to document the types and number of
gamakas. In [13] these gamakas are classified into 13 types. A comparison of a
phrase rendered in raga Sankarabharanam with gamakas and without gamakas is
shown in Figure 2.2. The phrases are represented as time-frequency trajectories of
pitch values. This trajectory is also referred to as a 'pitch contour'. From Figure
2.2 it is obvious that the deviations of the pitches from the norm are much larger
for gamaka-laden svaras. The notes of Western classical music are transformed to a
[Figure: two pitch contours (frequency in Hz versus time in seconds, tonic E): a) With gamakas, b) Without gamakas.]

Figure 2.2: Comparing a phrase in raga Sankarabharanam with gamakas and without gamakas
symbolic notation due to their shorter pitch range. Sometimes even the improvisations
are written in symbolic form, but the gamaka-laden svaras of Carnatic
music are difficult to express in a symbolic form.
It is precisely the presence of extensive gamakas that renders developing a sym-
bolic representation of Carnatic music extremely difficult. It also poses significant
challenges in the analysis of Carnatic music. These gamakas, however difficult to
represent, form the essential content of a melodic phrase. Since a svara is mostly
rendered using gamakas, it was earlier thought that a melodic phrase could be
quantized in terms of the 13 gamakas described in [13]. In practice, even though the
svaras are sung using these gamakas, they are mostly present in a modified form
rather than the true form described in [13]. Figure 2.3 shows the pitch contour
of a melodic segment. It is clear that gamakas present in their true form are
very rare, making it difficult to quantize a melody in terms of these gamakas.
If a melody cannot be quantized in a sequence of gamakas, how can a melody be
represented? Before addressing this question it is important to understand the
concept of a ‘raga’ and its various forms of renditions.
Figure 2.3: The gamakas in their true form are marked in a pitch contour of a melody
The concept of a raga is very central to Carnatic music. A raga is a collective
expression of melodies that consists of gamaka laden svaras and phrases (smaller
melodic units) as rendered in well known compositions through the ages [26].
The scale, or svara sequence, of a raga is defined by its arohana and avarohana.
The arohana corresponds to the ascending order of svaras in the raga, whereas
the avarohana corresponds to the descending order of svaras in terms of pitch.
The tonic is crucial in the identification of a raga: a melody heard without a
reference tonic can be perceived as two different ragas depending on the svara
that is considered the tonic [27].
Figure 2.4 shows the pitch contours of two similar phrases in raga ‘Sankarab-
haranam’. These phrases are rendered at different tonics. Any time series match-
ing algorithm will give large error during matching even though these are same
phrases in the same raga but in different tonics. Therefore, before performing
any kind of matching, the normalization of these phrases with respect to tonic is
important. Tonic normalization of these phrases is shown in Figure 2.4.
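Tonic normalization, as in Figure 2.4, maps each pitch value f in Hz to 1200·log2(f/f_tonic) cents, so the same phrase rendered at different tonics lines up on a common scale. A minimal sketch (the tonic frequencies for E and F# below are standard equal-tempered values, used only as examples):

```python
import math

def tonic_normalize(pitch_hz, tonic_hz):
    """Convert a pitch contour in Hz to cents relative to the tonic:
    cents = 1200 * log2(f / f_tonic)."""
    return [1200 * math.log2(f / tonic_hz) for f in pitch_hz]

# The same S-P interval sung at two tonics (E3 ~ 164.8 Hz, F#3 ~ 185.0 Hz)
phrase_e  = [164.8, 164.8 * 1.5]
phrase_fs = [185.0, 185.0 * 1.5]
print(tonic_normalize(phrase_e, 164.8))    # ~[0.0, 702.0]
print(tonic_normalize(phrase_fs, 185.0))   # ~[0.0, 702.0]
```

After normalization the two phrases become directly comparable, which is the precondition for the time-series matching used in the following chapters.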
Raga identification can be done at different levels [27]. In some cases, a svara
itself, even when rendered without a gamaka, may be sufficient to identify a raga.
Identification of a raga can also be aided by the different expressions of a gamaka
on a svara. Phrases of a raga may also be used to identify it.
[Figure: pitch contours of two similar phrases at tonics E and F#, shown first in Hz and then converted to cents via 1200·log2(f/f_tonic), after which the two contours align.]

Figure 2.4: Tonic normalization of two similar phrases in raga 'Sankarabharanam' rendered at different tonics.
A phrase is an aesthetic thread of the articulated and the unarticulated svaras
in a raga. Phrases that collectively give a raga its identity are called melodic motifs.
Each time a musician renders a phrase, its form varies even though the core identity
is recognized. Figure 2.5 shows different renditions of a melodic motif in raga
Kalyani and raga Kamboji. The renditions differ slightly from each other, but these
differences are enough to confuse a time-series matching algorithm. These improvisations
should not make the phrase sound like a different raga. Sometimes, even with a
little improvisation a phrase is perceived as being in a different raga. An example in [27]
states that the phrase P D1 N2 D1 P M1 with an elongated N2 is common to ragas Thodi
and Bhairavi. A gamaka on M1 makes it sound like Bhairavi and without the gamaka
it sounds like raga Thodi. Although phrases can be sung at different speeds, for
some phrases an increase in speed constricts the rendition of gamakas in a svara
which can result in a different raga [27]. Other than variations within a phrase, the
way each phrase connects to another also changes but the raga remains the same.
A raga is rendered in various compositional and improvisational forms. Almost
all compositional forms start with a section called ‘pallavi’. The pallavi is usually
made up of one or two lines but is rich with melodic motifs of the raga [26]. The
anupallavi is the second section of the composition. In anupallavi, the melodic
movements in the higher octaves, with reference to the tonic, are present [26].
Figure 2.5: Different renditions of a melodic motif in raga 'Kalyani' and raga 'Kamboji'.
If the anupallavi is present in a composition, it is always rendered after the pallavi
before any other section. Charanam is another section found in most compositions
which has a variable length depending on the type of composition [26].
There are many improvisational forms in Carnatic music like alapana, tanam, ni-
raval, kalpana svara, etc. We will discuss only alapana in detail as it is relevant to the
work. Alapana is generated by the musician’s distinctive imagination and creativ-
ity. Alapana is the opening of a raga and brings all the aspects of the raga without
using other elements like tala. Every alapana begins with a phrase (melodic motif)
that clearly establishes the identity of the raga. Once the identity is established, mu-
sicians tend to further explore the raga. This exploration leads to small variations
that start appearing in the renditions of svaras and phrases in the form of gamakas
or slight deviations from known phrases. These variations lead to newer phrases
in the raga that, over a period of time, can be used to identify the raga and can be
called as melodic motifs of that raga [26]. The possible ways to move from one
known phrase to another are numerous. The musician exploits this gap between
two known phrases and aesthetically connect them with a new phrase[26].
Figure 2.6: Different instances of a melodic motif in an alapana marked in red.
Therefore, while the raga has an aesthetic core, it is also an entity that evolves
through endless improvisation. In spite of this evolution, the identity of the raga
remains intact in most cases. A raga is much like an evolving personality, while
the person remains the same. An example of an alapana showing the instances of
known phrases (melodic motifs) is given in Figure 2.6. The melody discussed earlier
in Figure 2.3 is also from an alapana, and it made clear that a melody in a raga
cannot be quantized in terms of the 13 gamakas described in [13]. We now address
the question asked earlier: "If a melody cannot be quantized into a sequence of
gamakas, how can a melody be represented?" We know that every raga consists of
well-known phrases (melodic motifs) that are unique to that raga and can be used
to identify it. These phrases are also referred to as characteristic motifs,
distinctive motifs or typical motifs. In any rendition of a raga, the characteristic
motifs must be rendered in order to establish the identity of the raga. If the
motifs in a recording can be located, then they can be used to index the recording.
The focus of the initial part of the thesis is on locating motifs (as defined by a
musician) in a continuous alapana.
In [20], the uniqueness of these characteristic motifs was established using a
closed-set motif recognition experiment with hidden Markov models (HMMs).
Following this work, we attempt to spot motifs in a long alapana interspersed
with them. From Figure 2.5, it is clear that motifs that are seemingly identical
from a perceptual standpoint can appear quite different (visually) when viewed
as time series. Time-series motif recognition has been attempted for Hindustani
music. In [39], the onset point of the rhythmic cycle, emphasized by the beat of
the tabla (an Indian percussion instrument), is used as a cue for potential motif
regions. In another work [40], motif spotting is attempted in a bandish (a type
of composition in Hindustani music) using elongated notes (nyaas svaras).
Spotting motifs in a raga alapana is equivalent to finding a subsequence in a
time-frequency trajectory of the alapana. Interestingly, the duration of these motifs
may vary, but the relative duration of the svaras is preserved across the motif.
The attempt in this thesis is to use pitch contours as a time series and employ
time series pattern capturing techniques to identify the motif. The techniques
are customized to use the properties of Carnatic music. There has been work
done on time series motif recognition in fields other than music. In [36], a time
series motif is defined and motif discovery is attempted using the Enumeration of
Motifs through Matrix Approximation (EMMA) algorithm. In [4] and [28], time
series motifs are discovered by adapting the random projection algorithm to time
series data. In [3], a new warping distance called Spatial Assembling distance is
defined and used for pattern matching in streaming data. In [30], music matching
is attempted using a variant of the Longest Common Subsequence (LCS) algorithm
called Rough Longest Common Subsequence (RLCS).
Chapter 3 attempts similar time series motif matching for Carnatic Music.
Figure 2.7: Extraction of stationary points and their interpolation to get a smooth pitch contour.
Searching for a 2-3 second motif (in terms of a pitch contour) in a 10 min alapana
(also represented as a pitch contour) can be erroneous, owing to pitch estimation
errors. To address this issue, the pitch contour of the alapana is first quantized
to a sequence of stationary points (points in the pitch contour where the first
derivative is 0), as shown in Figure 2.7, which are meaningful in the context of a
raga. The meaningfulness of these stationary points was validated by 13 listeners.
For the validation, the stationary points were interpolated using cubic B-splines,
and the pitch trajectory corresponding to the interpolated curve was used to
generate the melody. A similarity test was then performed to determine whether
the original melodic segments and the melodic segments generated after interpolation
were indeed similar. A high similarity score of 7 out of 10 was obtained.
The examples presented to the listeners for validation are available online1.
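The interpolation step used in this validation can be sketched as follows. This is a minimal sketch: the stationary-point values are illustrative, and `scipy` is assumed to be available.

```python
import numpy as np
from scipy.interpolate import splrep, splev

# Hypothetical stationary points of a phrase: time (s) and pitch (cents).
t_stat = np.array([0.0, 0.3, 0.7, 1.1, 1.6, 2.0])
f_stat = np.array([200.0, 500.0, 300.0, 700.0, 400.0, 200.0])

# Cubic B-spline through the stationary points (s=0 gives an interpolating spline).
tck = splrep(t_stat, f_stat, k=3, s=0)

# Dense, smooth pitch trajectory that can be re-synthesised into a melody
# and compared with the original segment in a listening test.
t_dense = np.linspace(t_stat[0], t_stat[-1], 200)
f_smooth = splev(t_dense, tck)
```

Because `s=0` requests an interpolating spline, the smooth curve passes exactly through every stationary point while filling in a continuous trajectory between them.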
To determine the location of the motif, a two-pass search is performed. In
the first pass, Rough Longest Common Subsequence approach with modifications
is used to find the region corresponding to the location of the motif using the
1 http://www.iitm.ac.in/donlab/motif_analysis.html
stationary points. Once the region is located, another pass is made on this region
using the raw pitch contour instead of the stationary points. Although the results
of this approach were very promising, it required that musicians first identify
typical motifs manually. It was also observed that the number of false alarms was
significantly high. The correlation amongst musicians with respect to correct
phrases was as high as ∼0.8, while for false alarms the correlation was as low as
∼0.4. High-ranking false alarms were primarily due to partial matches with the
given query, and many of these were considered an instance of the queried motif
by some musicians. Initially, motifs of shorter duration were used, and for these
the inconsistency was high. Owing to these problems, the approach does not scale
well to a larger number of ragas. This also suggests that the notion of a typical
motif is itself questionable. Nevertheless, there is a core using which the
audience very quickly identifies ragas. The rest of the thesis focuses on this
ability of listeners.
As the alapana is an improvisational segment, the rendition of the same motif can
differ across alapanas, especially between different schools. Compositions in
Carnatic music, on the other hand, are rendered in a more or less similar manner.
Although the music evolved through an oral tradition and fairly significant changes
have crept in, renditions of compositions do not vary significantly across
performers and schools. The number of variants for each line of a song can
nevertheless vary quite a lot. Still, the typical motifs and the metre of the
motifs are generally preserved. An attempt is therefore made to determine the
typical motifs automatically.
It is discussed in [32] that not all repeating patterns are interesting and relevant.
In fact, the vast majority of exact repetitions within a music piece are not musically
interesting. The algorithm proposed in [32] mostly generates interesting repeating
patterns, along with some uninteresting ones which are later filtered during
post-processing. This work is an attempt from a similar perspective; the only
difference is that the typical motifs of ragas need not be interesting to a
listener. The primary objective of discovering typical motifs is that they can be
used to index the audio of a rendition. For example, as discussed earlier, while
performing an alapana of a raga, musicians bridge two well-known motifs of that
raga with new phrases using their creativity. These new phrases are musically more
interesting, as they are the result of an ever-evolving raga. The known typical
motifs can be used to index the alapana, and the new phrases connecting them could
then be extracted. Typical motifs could also be used for raga classification.
In Carnatic music, the composition still holds a very important position. Many
artists change the phrases in the alapana based on the composition that is likely
to follow. The proposed approach in this work generates similar patterns across
composition lines of a raga. From these similar patterns, the typical motifs are
filtered using the composition lines of other ragas. Motifs are considered typical
of a raga if they are present in the composition lines of that raga and absent from
the composition lines of other ragas. This filtering approach is similar to the
anti-corpus approach of Conklin [8, 9] for the discovery of distinctive patterns.
Most of the earlier work on the discovery of repeated patterns of interest in
music concerns Western music. In [22], B. Jansen et al. discuss current approaches
to repeated pattern discovery, covering string-based and geometric methods. In
[31], Lie Lu et al. used constant-Q transforms and proposed a similarity measure
between musical features for repeated pattern discovery. In [32], Meredith et al.
presented Structure Induction Algorithms (SIA), a geometric approach for
discovering repeated patterns that are musically interesting to the listener. In
[6, 7], Collins et al. introduced improvements to Meredith's Structure Induction
Algorithms. There has also been significant work on detecting melodic motifs in
Hindustani music by Joe Cheri Ross et al. [39]. In this approach, the melody is
converted to a sequence of symbols and a variant of dynamic programming is used
to discover the motif.
As mentioned before, the typical motifs can be used for raga classification, but
as the number of ragas increases, the scalability of this approach becomes an
issue. In Chapter 5, inspired by how a listener identifies a raga during a
concert, an attempt is made to mimic the same process. During a concert, the
performer usually begins by establishing the identity of the raga. While the
musician is establishing this identity, the listener narrows the search space
from hundreds of ragas to a small likely subset. By listening further, the
listener identifies the peculiarities, matches them against the shortlisted ragas
and finally identifies the raga. First, to mimic the reduction of the search
space, a raga recording is presented with a claim: the raga that a listener has
associated it with. For every raga, a set of cohorts is identified by a musician.
Cohorts are ragas that have similar phrases and can be confused with the given
raga. The cohort list is used to reduce the search space. The task that remains
is to determine whether the claimed raga is correct. This is done using a novel
matching algorithm known as Longest Common Segment Set (LCSS), along with score
normalization.
There is no parallel in Western classical music to raga verification. The closest
task one can associate it with is cover song detection [14, 33, 43], where the
objective is to determine whether recordings are the same song rendered by
different musicians. In contrast, as discussed in Chapter 2, two different
renditions of the same motif may not be identical.
Several attempts have been made earlier to identify ragas [5, 11, 12, 18, 20, 25, 29, 47].
Most of these efforts have used small repertoires or have focused on ragas for which
ordering is not important. In [47], the audio is transcribed to a sequence of notes
and string matching techniques are used to perform raga identification. In [5],
pitch-class and pitch-dyads distributions are used for identifying ragas. Bigrams
on pitch are obtained using a twelve semitone scale. In [35], the authors assume that
an automatic note transcription system for the audio is available. The transcribed
notes are then subjected to HMM based raga analysis. In [25, 46], a template based
on the arohana and avarohana is used to determine the identity of the raga. The
frequency of the svaras in Carnatic music is seldom fixed. Further, as indicated
in [48] and [49], the improvisations in extempore enunciation of ragas can vary
across musicians and schools. This behaviour is accounted for in [23, 24, 29] by
decreasing the binwidth for computing melodic histograms. In [29], steady note
transcription along with n-gram models is used to perform raga identification. In
[11], chroma features are used in an HMM framework to perform scale-independent
raga identification, while in [12] a hierarchical random forest classifier is used
to match svara histograms. The svaras are obtained using the Western transcription
system. These experiments are performed on 4 to 8 different ragas of Hindustani
music. In [18], an attempt is made to perform raga identification using
semi-continuous Gaussian mixture models. This will work only for ragas with a
linear ordering of svaras.
Recent research indicates that a raga is characterised best by a time-frequency
trajectory rather than a sequence of quantised pitches [20, 38, 39, 45]. In [38, 39],
the sama of the tala (emphasised by the bol of tabla) is used to segment a piece. The
repeating pattern in a bandish in Hindustani Khayal music is located using the
sama information. In [20, 38], motif identification is performed for Carnatic
music. Motifs for a set of five ragas were defined and carefully marked by a
musician, and motif identification was performed using a hidden Markov model
(HMM) trained for each motif. Similar to [39], motif spotting in an alapana in
Carnatic music is performed in Chapter 3. In [45], a number of different
similarity measures for matching melodic motifs of Indian music were evaluated.
It was shown that the intra-pattern-type variance of melodic motifs is higher for
Carnatic music than for Hindustani music, and that the similarity obtained is
very sensitive to the measure used. All these efforts are ultimately aimed at
obtaining typical signatures of ragas. It is shown in Chapter 3 that there can be
many signatures for a given raga. To alleviate this problem, in Chapter 4 an
attempt is made to obtain as many signatures as possible for a raga by comparing
lines of compositions. Here again, it was observed that typical motif detection
is very sensitive to the distance measure chosen. Using typical motifs/signatures
for raga identification does not scale as the number of ragas under consideration
increases. In raga verification, since the task narrows to a small set of
candidate ragas, the approach scales to any number of new ragas.
CHAPTER 3
Motif Spotting
3.1 Introduction
A raga in Carnatic music can be characterised by a set of distinctive motifs. Dis-
tinctive motifs can be characterised by the trajectory of inflected svaras over time.
These motifs are of utmost aesthetic importance to the raga. Carnatic music is
a genre abundant with compositions. These compositions are replete with many
distinctive motifs. These motifs are used as building blocks for extempore improvi-
sational pieces in Carnatic music. These motifs can also be used for distinguishing
between two ragas, and also for archival and learning purposes. The objective of
the work presented in this chapter is to spot the location of the distinctive motifs in
an extempore enunciation of a raga called the alapana. In Carnatic music, the motifs
are laden with gamakas [27]. In addition, the motifs are similar across musicians
but not necessarily identical. The duration of the motifs can also vary quite signif-
icantly although the rhythm may be preserved. The query motif in general is very
short in duration compared to that of the test music segment. Several factors need
to be considered when dealing with this problem, namely: selection of features,
time complexity, tolerance to noise, tolerance to speed variation, allowing partial
or rough matches rather than exact matches, timbre, etc. [17, 30, 50].
In this chapter, pitch is used as the main feature for the task of motif spot-
ting. Substantial research exists on analysing different aspects of Carnatic music
computationally, using pitch as a feature. In [28], gamakas are characterized and
analysed using pitch contours. In [21], tuning of Indian classical music is studied
using pitch histograms. In [48], the motifs are extensively studied in the raga Thodi
using pitch histograms and pitch contours. All of the above prove the relevance
and importance of pitch as a feature for computational analysis of Carnatic music.
There are a number of dynamic programming techniques, namely the Dynamic
Time Warping (DTW), the Longest Common Subsequence (LCS) and their variants,
which are used for similar music matching tasks. DTW takes care of the speed
variations due to warping but forces the match from end-to-end of both the query
and the test sequences. Even unconstrained endpoint DTW will align an entire
query with a part of the test sequence [16]. In motif-spotting, there can be instances
where one can expect that most of the query is roughly matched with a part of
the test sequence. Although LCS does not force the match between query and test
to be end-to-end, it does not give importance to local similarity. Rough Longest
Common Subsequence (RLCS) addresses the issue of local similarity where some
leeway is given for partial query matches [30]. Other than partial query matches,
when the characteristic motif, for example, “Sa Ni Da Pa Da” is rendered as “Sa
Ri Ni Da Pa Da”, RLCS gives a good match since it gives the longest matched
subsequence.
3.2 Stationary Points
The task therefore is to attempt automatic spotting of a motif that is queried. The
motif is queried against a set of alapanas of a particular raga to obtain locations of
the occurrences of the motif. The task is non-trivial since no particular rhythm
is maintained in an alapana, nor is it accompanied by a percussion instrument.
Figure 3.1: A Phrase with Stationary Points
Figure 2.6 shows repetitive occurrences of motifs in a piece of music. An enlarged
view of the motif is also shown. Since the alapana is much longer than the motif,
searching for a motif in an alapana is like searching for a needle in a haystack. After
an analysis of the pitch contours and discussions with professional musicians, it
was conjectured that the pitch contour can be quantized at stationary points. The
conjecture was confirmed as explained in Chapter 2. Figure 3.1 shows an example
phrase of the raga Kamboji with the stationary points highlighted.
Musically, the stationary points are a measure of the extent to which a particular
svara is intoned. Since svaras in Carnatic music are rendered with gamakas, there
is a difference between the notation and the actual rendition of a phrase. However,
there is a one-to-one correspondence between the stationary-point frequencies and
what is actually rendered by the musician (Figure 3.1). Figure 3.2 shows the pitch
histogram and the stationary-point histogram of an alapana of the raga Kamboji.
The similarity between the two histograms supports our conjecture that stationary
points are important.
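The comparison behind Figure 3.2 can be reproduced on any contour. Below is a toy sketch on a synthetic cent-scale contour; all values are illustrative, not taken from the Kamboji data.

```python
import numpy as np

# Synthetic cent-scale contour oscillating between two pitch extremes,
# loosely imitating a gamaka-laden phrase (illustrative only).
t = np.linspace(0, 10, 5000)
pitch = 500 + 300 * np.sin(2 * np.pi * 1.2 * t)

# Stationary points: samples where the first derivative changes sign.
slope = np.gradient(pitch, t)
stat = pitch[np.where(np.diff(np.sign(slope)) != 0)[0]]

bins = np.linspace(0, 1200, 25)  # 50-cent bins over one octave
h_pitch, _ = np.histogram(pitch, bins=bins, density=True)
h_stat, _ = np.histogram(stat, bins=bins, density=True)
# For this contour both histograms peak near the extremes (about 200 and
# 800 cents), where the contour dwells longest and where the stationary
# points lie, mirroring the similarity observed in Figure 3.2.
```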
Figure 3.2: The Pitch and Stationary Point Histograms of the raga Kamboji
3.2.1 Method of obtaining Stationary Points
Carnatic music is a heterophonic musical form. In a Carnatic music vocal concert,
a minimum of two accompanying instruments play simultaneously along with
the lead artist. These are the violin and the mridangam (a percussion instrument
in Carnatic music). Carnatic music is performed at a fixed tonic [2] to which all
instruments are tuned. The tonic is chosen by the lead artist and is maintained
throughout the performance by an instrument called the Tambura as discussed in
Chapter 2. The simultaneous performance of many instruments in addition to the
voice renders pitch extraction of the predominant voice a tough task. This leads
to octave errors and other erroneous pitch values. For this task it is necessary that
pitch be continuous. After experimenting with various pitch algorithms, it was
observed that the Melodia-Pitch Extraction algorithm [41] produced the fewest
errors. This was verified after re-synthesis using the pitch contours. In case of
an octave error or any other such pitch related anomaly, the algorithm replaces
the erroneous pitch values with zeros. The stationary points are obtained by
processing the pitch contour extracted from the waveform. The extracted pitch
is converted to the cent scale using (3.1) to normalise with respect to the tonics
of different musicians.
Figure 3.3: Original and Cubic Interpolated pitch contours
\mathrm{centFrequency} = 1200 \cdot \log_2\left(\frac{f}{\mathrm{tonic}}\right)                    (3.1)
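As a quick illustration of (3.1), the same melodic shape rendered at two tonics maps to identical cent values after normalization. The tonic frequencies below are assumed values for the E and F# of Figure 2.4.

```python
import numpy as np

def hz_to_cents(f_hz, tonic_hz):
    """Convert pitch in Hz to cents relative to the tonic, as in (3.1)."""
    return 1200.0 * np.log2(np.asarray(f_hz, dtype=float) / tonic_hz)

# The same melodic shape sung at two different tonics (illustrative values).
tonic_e, tonic_fs = 164.81, 185.00  # assumed Hz values for E3 and F#3
shape_cents = np.array([0, 200, 400, 200, 0])
phrase_e = tonic_e * 2 ** (shape_cents / 1200)
phrase_fs = tonic_fs * 2 ** (shape_cents / 1200)

# After tonic normalization the two contours coincide exactly.
cents_e = hz_to_cents(phrase_e, tonic_e)
cents_fs = hz_to_cents(phrase_fs, tonic_fs)
```

This is why the normalized contours in Figure 2.4 overlap even though the raw Hz contours do not.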
A least squares fit (LSF) [37] was used to compute the slope of the extracted pitch.
The zero crossings of the slope correspond to the stationary points (Figure 3.1). A
cubic Hermite interpolation [15] was then performed on the initial estimates of the
stationary points to obtain a continuous curve (Figure 3.3). The stationary points
are then re-estimated from this continuous curve.
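The pipeline just described (local least-squares slope, zero crossings, cubic Hermite interpolation, re-estimation) can be sketched roughly as below. The local-window length `half_win` is an assumed parameter, not one specified in the text, and the contour is synthetic.

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline

def local_lsf_slope(t, f, half_win=5):
    """Slope at each sample via a least-squares line fit over a local window
    (window size is an assumed parameter)."""
    slope = np.empty_like(f)
    for i in range(len(f)):
        lo, hi = max(0, i - half_win), min(len(f), i + half_win + 1)
        slope[i] = np.polyfit(t[lo:hi], f[lo:hi], 1)[0]
    return slope

def stationary_points(t, f):
    """Indices where the LSF slope changes sign (first derivative crosses zero)."""
    s = local_lsf_slope(t, f)
    return np.where(np.diff(np.sign(s)) != 0)[0]

# Illustrative pitch contour in cents.
t = np.linspace(0.0, 2.0, 400)
f = 600.0 + 200.0 * np.sin(2 * np.pi * 2 * t)
idx = stationary_points(t, f)

# Cubic Hermite interpolation through the stationary points (zero slope there)
# yields the smooth continuous curve from which the points are re-estimated.
spline = CubicHermiteSpline(t[idx], f[idx], np.zeros(len(idx)))
t_dense = np.linspace(t[idx[0]], t[idx[-1]], 500)
f_smooth = spline(t_dense)
```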
3.3 Rough Longest Common Subsequence Algorithm
Rough Longest Common Subsequence (RLCS), a variant of Longest Common
Subsequence (LCS), performs an approximate match between a reference sequence
and a query sequence while retaining the local similarity [30]. It introduces three
major changes in LCS namely, rough match, width-across-reference (WAR) and
width-across-query (WAQ) for local similarity and score matrix.
3.3.1 Rough match
In the recurrence of LCS, the cost is incremented by 1 when there is an exact
match. In RLCS, when the distance between a reference point r_i and a query
point q_j is less than a threshold T_d, they are said to be roughly matched,
r_i \approx q_j, i.e. d(r_i, q_j) < T_d \Rightarrow r_i \approx q_j, where
d(r_i, q_j) is the distance between r_i and q_j. The cost is incremented by a
number \delta_{i,j} between 0 and 1, instead of 1, based on how good the match
is, as shown in (3.2).

\delta_{i,j} = 1 - \frac{d(r_i, q_j)}{T_d}                    (3.2)
The cost is estimated using the following recurrence:

c_{i,j} =
  \begin{cases}
    0 & \text{if } i \cdot j = 0 \\
    c_{i-1,j-1} + \delta_{i,j} & \text{if } r_i \approx q_j \\
    \max(c_{i-1,j},\, c_{i,j-1}) & \text{if } r_i \not\approx q_j
  \end{cases}                    (3.3)
In LCS, the cost gives the length of the longest common subsequence. In RLCS, the
cost is incremented by \delta_{i,j} rather than 1, yet it is taken to represent the
length of the Rough Longest Common Subsequence. Later, it is argued that this
quantity is actually a rough length of the RLCS rather than its actual length.
3.3.2 WAR and WAQ for local similarity
To retain the local similarity, width-across-reference, WAR, and width-across-
query, WAQ, are used. WAR and WAQ represent the length of the shortest
substring of the reference and the query respectively, containing the LCS. These
measures represent the density of LCS in the reference and the query. Small values
of WAR and WAQ indicate a dense distribution of the LCS. WAR is incremented by 1
on a rough match or a jump along the reference; likewise, WAQ is incremented on a
rough match or a jump along the query. WAR and WAQ are computed using the
following recurrences:

wr_{i,j} =
  \begin{cases}
    0 & \text{if } i \cdot j = 0 \\
    wr_{i-1,j-1} + 1 & \text{if } r_i \approx q_j \\
    wr_{i-1,j} + 1 & \text{if } r_i \not\approx q_j,\ c_{i-1,j} \ge c_{i,j-1} \\
    wr_{i,j-1} & \text{if } r_i \not\approx q_j,\ c_{i-1,j} < c_{i,j-1}
  \end{cases}                    (3.4)

wq_{i,j} =
  \begin{cases}
    0 & \text{if } i \cdot j = 0 \\
    wq_{i-1,j-1} + 1 & \text{if } r_i \approx q_j \\
    wq_{i-1,j} & \text{if } r_i \not\approx q_j,\ c_{i-1,j} \ge c_{i,j-1} \\
    wq_{i,j-1} + 1 & \text{if } r_i \not\approx q_j,\ c_{i-1,j} < c_{i,j-1}
  \end{cases}                    (3.5)
In (3.4) and (3.5), some of the cases and conditions are dropped from [30] for
the sake of clarity.
3.3.3 Score matrix
WAR, WAQ and cost are used to compute the score of a common subsequence in
the following way:

Score_{i,j} =
  \begin{cases}
    \left( \beta \dfrac{c_{i,j}}{wr_{i,j}} + (1-\beta) \dfrac{c_{i,j}}{wq_{i,j}} \right) \cdot \dfrac{c_{i,j}}{n} & \text{if } c_{i,j} \ge \rho n \\
    0 & \text{otherwise}
  \end{cases}                    (3.6)
In (3.6), a large value of c_{i,j}/wr_{i,j} suggests that the density of the RLCS
in the reference is high. Similarly, a large value of c_{i,j}/wq_{i,j} indicates a
high density of the RLCS in the query. \beta weighs between these two ratios. A
large value of c_{i,j}/n indicates that a large part of the query has been matched,
where n is the length of the query. \rho is the matching rate, which specifies how
long the RLCS should be relative to the length of the query.
The algorithm to compute these values using Dynamic Programming is pre-
sented in [30].
3.4 Modified-Rough Longest Common Subsequence
In this section, the modifications made to the existing RLCS algorithm and the
rationale behind them are discussed.
3.4.1 Rough and actual length of RLCS
In [30], c_{i,j} is defined as the length of the RLCS. However, it actually
represents a rough length of the RLCS, because it is incremented by \delta_{i,j}
on a rough match. The resulting value of c_{i,j} need not be an integer and
therefore cannot be the actual length of any sequence. The actual length of the
RLCS is defined by the following recurrence:
c^{a}_{i,j} =
  \begin{cases}
    0 & \text{if } i \cdot j = 0 \\
    c^{a}_{i-1,j-1} + 1 & \text{if } r_i \approx q_j \\
    \max(c^{a}_{i-1,j},\, c^{a}_{i,j-1}) & \text{if } r_i \not\approx q_j
  \end{cases}                    (3.7)
In (3.7), the cost is incremented by 1 on a rough match. In (3.6), while computing
the score, half of the importance is given to the ratio of the rough length of the
RLCS to the query length. Instead of considering only how large the rough length
of the RLCS is with respect to the query length, it is conjectured that it is also
important to consider how large it is with respect to the actual length of the
RLCS. The term

\frac{c_{i,j} + c_{i,j}}{c^{a}_{i,j} + n}

gives equal importance to both ratios. This term is similar to the F1 score, in
which precision and recall are given equal importance.
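To make the F1 analogy explicit, treat c_{i,j}/c^{a}_{i,j} as a precision and c_{i,j}/n as a recall (an interpretation consistent with the two ratios above); their harmonic mean reduces exactly to the proposed term:

```latex
P = \frac{c_{i,j}}{c^{a}_{i,j}}, \qquad R = \frac{c_{i,j}}{n}, \qquad
F_1 = \frac{2PR}{P + R}
    = \frac{2\,c_{i,j}^{2} / (c^{a}_{i,j}\,n)}
           {c_{i,j}\,(n + c^{a}_{i,j}) / (c^{a}_{i,j}\,n)}
    = \frac{2\,c_{i,j}}{c^{a}_{i,j} + n}
    = \frac{c_{i,j} + c_{i,j}}{c^{a}_{i,j} + n}
```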
3.4.2 RWAR and RWAQ
WAR and WAQ represent the width of the shortest substring that contains the
RLCS. As discussed in the previous subsection, c_{i,j} represents the rough
length, which is shorter than the actual length of the RLCS. Therefore, it is
not clear whether c_{i,j}/wr_{i,j} really represents the density of the RLCS
in the reference. This term also penalizes based on the degree of match, while
such a penalty has already been accounted for in the term c_{i,j}/n. Therefore,
a rough width across the reference and the query is required, representing the
rough width of the shortest substring containing the RLCS. On a rough match,
the cost is incremented by \delta_{i,j}; at the same time, WAR and WAQ are also
incremented by \delta_{i,j}, resulting in Rough WAR (RWAR) and Rough WAQ (RWAQ),
respectively. When there is no match, RWAR and RWAQ are incremented by 1 whereas
the cost is not. RWAR and RWAQ therefore account better for the density of the
RLCS in the reference and the query. They are computed by the following
recurrences:
following recurrences:
rwr_{i,j} =
  \begin{cases}
    0 & \text{if } i \cdot j = 0 \\
    rwr_{i-1,j-1} + \delta_{i,j} & \text{if } r_i \approx q_j \\
    rwr_{i-1,j} + 1 & \text{if } r_i \not\approx q_j,\ c_{i-1,j} \ge c_{i,j-1} \\
    rwr_{i,j-1} & \text{if } r_i \not\approx q_j,\ c_{i-1,j} < c_{i,j-1}
  \end{cases}                    (3.8)

rwq_{i,j} =
  \begin{cases}
    0 & \text{if } i \cdot j = 0 \\
    rwq_{i-1,j-1} + \delta_{i,j} & \text{if } r_i \approx q_j \\
    rwq_{i-1,j} & \text{if } r_i \not\approx q_j,\ c_{i-1,j} \ge c_{i,j-1} \\
    rwq_{i,j-1} + 1 & \text{if } r_i \not\approx q_j,\ c_{i-1,j} < c_{i,j-1}
  \end{cases}                    (3.9)
3.4.3 Matched rate on the query sequence
In (3.6), \rho is an empirical parameter set according to the required match rate
on the entire query sequence. The score is updated with a non-zero value only if
the rough length of the RLCS is greater than \rho \times n. It is not clear how to
set the value of \rho, or what it means for the rough length to be greater than a
fraction of the query length. Instead, it is better to update the score with a
non-zero value when the actual length is greater than \rho \times n. This makes
the interpretation clear and the value of \rho easy to set.
The score update of the modified RLCS is given by the following equation:

Score_{i,j} =
  \begin{cases}
    \left( \beta \dfrac{c_{i,j}}{rwr_{i,j}} + (1-\beta) \dfrac{c_{i,j}}{rwq_{i,j}} \right) \cdot \dfrac{c_{i,j} + c_{i,j}}{c^{a}_{i,j} + n} & \text{if } c^{a}_{i,j} \ge \rho n \\
    0 & \text{otherwise}
  \end{cases}                    (3.10)
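The recurrences (3.7)-(3.9) and the score (3.10) can be combined in one dynamic program. The sketch below is a simplification: widths are accumulated from the start of each sequence rather than from the first matched point, and no direction matrix or traceback is kept, unlike the full algorithm of [30]. All parameter values and sequences are illustrative.

```python
import numpy as np

def modified_rlcs(ref, query, td, beta=0.5, rho=0.5):
    """Fill c (rough length), ca (actual length), rwr, rwq and the score (3.10)."""
    m, n = len(ref), len(query)
    c = np.zeros((m + 1, n + 1))    # rough length of the RLCS
    ca = np.zeros((m + 1, n + 1))   # actual length of the RLCS
    rwr = np.zeros((m + 1, n + 1))  # rough width across reference
    rwq = np.zeros((m + 1, n + 1))  # rough width across query
    score = np.zeros((m + 1, n + 1))
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(ref[i - 1] - query[j - 1])
            if d < td:                            # rough match
                delta = 1.0 - d / td
                c[i, j] = c[i - 1, j - 1] + delta
                ca[i, j] = ca[i - 1, j - 1] + 1
                rwr[i, j] = rwr[i - 1, j - 1] + delta
                rwq[i, j] = rwq[i - 1, j - 1] + delta
            elif c[i - 1, j] >= c[i, j - 1]:      # jump along the reference
                c[i, j], ca[i, j] = c[i - 1, j], ca[i - 1, j]
                rwr[i, j] = rwr[i - 1, j] + 1
                rwq[i, j] = rwq[i - 1, j]
            else:                                 # jump along the query
                c[i, j], ca[i, j] = c[i, j - 1], ca[i, j - 1]
                rwr[i, j] = rwr[i, j - 1]
                rwq[i, j] = rwq[i, j - 1] + 1
            if ca[i, j] >= rho * n and rwr[i, j] > 0 and rwq[i, j] > 0:
                score[i, j] = ((beta * c[i, j] / rwr[i, j]
                                + (1 - beta) * c[i, j] / rwq[i, j])
                               * 2 * c[i, j] / (ca[i, j] + n))
    return score

# Illustrative stationary-point sequences in cents.
ref = [0, 100, 200, 400, 210, 0, 300]
query = [200, 400, 200, 0]
best = modified_rlcs(ref, query, td=50).max()
```

Since c never exceeds rwr, rwq or c^a, the score is bounded above by 1, which makes scores comparable across queries of different lengths.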
3.5 A Two-Pass Dynamic Programming Search
In Section 3.2, it was illustrated that the sequence of stationary points is crucial
for a motif. Therefore, RLCS is used to query for the stationary points of the given
motif in the alapana.
Music matching using LCS methods for Western music is performed on symbolic music
data [19], where the musical notes are the symbols. However, in the context of
Carnatic music, there is no consistent one-to-one correspondence between the
notation and the sung melody. Although stationary points are used in this work
instead of a symbolic notation, one must keep in mind that stationary points are
not symbols but continuous pitch values. To match such pitch values, a rough match
rather than an exact match is required. A variant of the LCS known as the Rough
Longest Common Subsequence [30] allows such a rough match.
In this work, a two-pass RLCS matching is performed. In the first pass, the
stationary points of the reference sequence and the query sequence are matched to
obtain candidate motif regions. However, given two consecutive stationary points,
the pitch contour between them can differ significantly across phrases, which
leads to many false alarms. A second pass of RLCS is therefore performed on the
regions obtained from the first pass to separate the true motifs from the false
alarms.
3.5.1 First Pass: Determining Candidate Motif Regions using
RLCS
The RLCS algorithm used in this work is illustrated in this section. The alapana is
first windowed and then processed with the RLCS algorithm. The window size
chosen for this task is 1.5 times the length of the motif queried for. The matrices
obtained from the RLCS are then processed as follows:
• From the cells of the score matrix with values greater than a threshold,
seqFilterTd, sequences are obtained by tracing the direction matrix backwards.
• Duplicate sequences that may be acquired are discarded, preserving unique
sequences of length greater than ρ times the length of the reference. These are
then added to a sequence buffer.
• This process is repeated for every window. The window is shifted by a hop of
one stationary point.
• The sequences thus obtained are grouped.
• Each group, taken from the first element of its first member to the last
element of its last member, represents a potential motif region.
3.5.2 Second Pass: Determining Motifs from the Groups
In the first pass, only the stationary points are matched. As mentioned
above, even though the stationary points match, it is not necessary
that the trajectory between them matches. This leads to a large number of false
alarms. Now that the search space is reduced, the RLCS is performed between the
entire pitch contour of each potential motif region obtained in the first pass and the
motif queried. The entire pitch contour is used in order to account for the trajectory
information contained in the phrases. The threshold Td used for the first pass is
tightened in this iteration for better precision while matching the entire feature
vector. In this iteration, the cell of the score matrix having the maximum value is
chosen and the sequence is traced back using the direction matrix from this cell.
This sequence is hypothesized to be the motif. The database and experimentation
are detailed in the following sections.
3.6 Dataset
Table 3.1 gives the details of the dataset of alapanas used in this work. As mentioned
above, this task is performed on alapanas. The motifs are categorized into two
types based on their durations: short motifs and long motifs. The details of these
motifs are given in Table 3.2 and Table 3.3. The average duration is obtained from
the labeled ground truth. The long motifs are inspired by the "raga test" conducted
by Rama Verma1, in which most listeners across the globe were able to unambiguously
determine the identity of the ragas from these motifs. An attempt was made to use
the motifs from Rama Verma's raga test directly. As the recordings are rather noisy,
1 http://www.youtube.com/watch?v=3nRtz9EBfeY
the same motifs were generated by a professional musician. In particular, we have
chosen only the raga “Bhairavi” for illustration.
Table 3.1: Dataset of alapanas
Raga Name    Number of Alapanas    Number of Artists    Average Duration (mins)    Total Duration (mins)
Kamboji      27                    12                   9.73                       262.91
Bhairavi     16                    13                   10.65                      170.48
Table 3.2: Short Motifs
Raga Name    Labeled Ground Truth    Average Duration (secs)
Kamboji      70                      1.8837
Bhairavi     103                     1.3213
Table 3.3: Long Motifs
Raga Name    Labeled Ground Truth    Average Duration (secs)
Bhairavi     59                      3.18
3.7 Experiments and Results
3.7.1 Querying motifs in the alapanas
RLCS was performed on the dataset of alapanas. The distance function used for
the RLCS is cubic in nature, with the equation given below.
d_{i,j} =
\begin{cases}
\dfrac{|x_i - y_j|^3}{(3\,st)^3} & \text{if } |x_i - y_j| < 3\,st \\
0 & \text{otherwise}
\end{cases}
\qquad (3.11)
where x_i and y_j represent pitch values and st represents a semitone in cents. Due
to the different styles of various musicians, an exact match between two pitch values
contributing to the same svara cannot be expected. Hence, in this work a leeway
of 3 semitones is allowed between pitch values. Musically, two pitch values that are 3
semitones apart cannot be called similar, but this issue is addressed by the cubic
nature of the function. The function reaches its half value when the
difference between two symbols is approximately half a semitone. Therefore, lower
distance values are obtained when the corresponding pitch values are at most half
a semitone apart. In this work, phrases sung across octaves are ignored. For
this experiment the parameters were set as follows: Td = 0.45, β = 0.5, ρ = 0.8.
The parameter ρ ∈ (0, 1) is user-defined and ensures that at least ρ times the length of
the query motif is matched with that of the alapana.
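The cubic distance (3.11) can be transcribed directly; a minimal sketch, in which the function name and the choice of 100 cents per semitone are the only assumptions:

```python
# Direct transcription of the cubic distance in (3.11); pitch values are
# in cents, and `st` is one semitone expressed in cents.

ST = 100.0  # one semitone, in cents

def cubic_distance(x, y, st=ST):
    """Cubic distance between two pitch values; 0 beyond 3 semitones."""
    diff = abs(x - y)
    if diff < 3 * st:
        return diff ** 3 / (3 * st) ** 3
    return 0.0
```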
The total number of regions retrieved and the number of ground-truth motifs
retrieved are given in Table 3.4 and Table 3.5 for short motifs and long motifs,
respectively. The number of false positives retrieved is, however, substantial. This
is acceptable, since the objective of the first pass is to obtain the maximum number
of regions similar to the motif. The second pass of RLCS is performed to
filter out the false positives.
Table 3.4: Short Motifs: Retrieved regions after the first pass
Raga Name    Total Retrieved    True Retrieved    Precision (%)    Recall (%)
Kamboji      719                58                8.07             82.86
Bhairavi     474                91                19.20            88.35
Now that the candidate motif regions are known, the second pass of RLCS is
conducted wherein the same motifs are queried in the regions retrieved by the
first pass. The entire pitch contour of the query and reference are used for this
Table 3.5: Long Motifs: Retrieved regions after the first pass
Raga Name    Total Retrieved    True Retrieved    Precision (%)    Recall (%)
Bhairavi     194                51                26.29            86.44
task in order to account for the trajectory of pitches between the
stationary points. The need for a query motif in such a search arises because
certain characteristic phrases are rendered only sparingly in an alapana. Spotting
such phrases is useful to musicians and students for analysis purposes.
The hits obtained in the second pass are sorted according to their RLCS scores.
The top 10 hits per alapana are considered to compute the precision and recall. The
matches are not exact, since the motifs correspond to the extempore enunciation by an
artist. The relevant motifs are all motifs that were marked as true in that alapana.
The results are illustrated in Table 3.6 and Table 3.7.
Table 3.6: Short Motifs: Top 10 retrieved motifs after the second pass
Raga Name    Precision (%)    Recall (%)
Kamboji      40.45            76.00
Bhairavi     41.25            91.04
Table 3.7: Long Motifs: Top 10 retrieved motifs after the second pass
Raga Name    Precision (%)    Recall (%)
Bhairavi     31.65            74.58
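The top-10 evaluation above can be sketched as follows; the representation of hits as (score, is_true) pairs is an assumption for illustration, not the thesis's data format.

```python
# Precision and recall over the top-k hits of one alapana. `hits` are
# (score, is_true) pairs; `n_relevant` is the number of ground-truth
# motifs marked as true in that alapana.

def precision_recall_top_k(hits, n_relevant, k=10):
    top = sorted(hits, key=lambda h: h[0], reverse=True)[:k]
    tp = sum(1 for _, is_true in top if is_true)
    precision = tp / len(top) if top else 0.0
    recall = tp / n_relevant if n_relevant else 0.0
    return precision, recall
```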
3.7.2 Comparison between RLCS and Modified-RLCS using longer motifs
Motif spotting is performed using RLCS and modified-RLCS on the dataset of
alapanas with longer motifs as queries. Td is set to 0.45 in both methods,
so that pitch values which are approximately one semitone apart are
considered a rough match. ρ is set to zero because the best value of ρ could
differ between the two methods, which would make the comparison difficult.
First, Voice Activity Detection (VAD) is performed on the alapanas to obtain the
voiced parts. This approximately segments the alapana into phrases. These voiced
regions are used instead of the entire alapana. In the first pass, the stationary points
of the query motif and the test alapana are used, and the motif regions, or groups, are
retrieved along with their scores. Each group corresponds either to a motif or to a
false alarm. A true group consists of one or more true positives. The score distributions
of the true positive groups and the false alarm groups for both algorithms are
shown in Figure 3.4. Each score value is shifted by the mean of the
false alarm scores, so that the mean of the false alarms' distribution becomes zero for both
algorithms. This enables a better comparison between the RLCS and modified-RLCS
algorithms. The overlap between the score distributions of the true positive
groups and the false alarm groups is smaller for the modified-RLCS algorithm than for the
RLCS algorithm.
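The normalization just described amounts to subtracting the false-alarm mean from every score; a minimal sketch:

```python
# Shift all scores by the mean of the false-alarm scores, so that the
# false-alarm distribution has zero mean and the two algorithms' score
# axes become directly comparable.

def normalize_scores(true_scores, false_scores):
    mu_fa = sum(false_scores) / len(false_scores)
    return ([s - mu_fa for s in true_scores],
            [s - mu_fa for s in false_scores])
```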
Motifs are sparsely present in an alapana, and our purpose is to retrieve as many
motifs as possible. Spotting all or most of the motifs is more crucial than removing
all false alarms. Therefore, a higher penalty is given for missing a motif than
for a false alarm group. The score threshold is selected from the minimum of the detection
cost function for both algorithms. The sequences whose scores are above the
Figure 3.4: a) True positive groups' and false alarm groups' score distribution for RLCS. b) True positive groups' and false alarm groups' score distribution for modified-RLCS.
score threshold are preserved. The details of the comparison after the first pass are
shown in Table 3.8. Modified-RLCS has shown a clear improvement over RLCS in
terms of false alarms and average duration of true positives and false alarms.
Table 3.8: Long Motifs: Retrieved regions after the first pass
Algorithm        Total Retrieved    True Retrieved    True Positive Duration (avg.)    False Alarm Duration (avg.)    Precision (%)    Recall (%)
RLCS             194                51                9.61 secs                        12.73 secs                     26.29            86.44
Modified-RLCS    151                52                9.00 secs                        11.95 secs                     34.44            88.14
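The threshold selection by minimizing a detection cost with a higher miss penalty can be sketched as below; the cost weights `c_miss` and `c_fa` are illustrative assumptions, not values from the thesis.

```python
# Pick the score threshold that minimizes a detection cost in which a
# missed motif is penalized more heavily than a retained false alarm.

def select_threshold(true_scores, false_scores, c_miss=10.0, c_fa=1.0):
    candidates = sorted(set(true_scores + false_scores))
    best_t, best_cost = None, float("inf")
    for t in candidates:
        misses = sum(1 for s in true_scores if s < t)
        fas = sum(1 for s in false_scores if s >= t)
        cost = c_miss * misses + c_fa * fas
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```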
These regions are used as the test sequences in the second pass. In the second pass, the tonic-normalized
smoothed pitch contour is used as the feature. The primary objective
of the second pass is to locate the motifs within a group and remove as many
false alarm groups as possible. The details of the comparison after the second
pass are given in Table 3.9. There is a reduction in the number of false alarms for
both algorithms, but the precision and localization of motifs remain better for
modified-RLCS.
Table 3.9: Long Motifs: Retrieved regions after the second pass
Algorithm        Total Retrieved    True Retrieved    True Positive Duration (avg.)    False Alarm Duration (avg.)    Precision (%)    Recall (%)
RLCS             180                50                8.25 secs                        11.12 secs                     27.78            84.75
Modified-RLCS    144                50                7.68 secs                        11.00 secs                     34.72            84.75
3.8 Discussion
From the results in Table 3.6 and Table 3.7, it is clear that even though
the precision is low, the recall is high in most cases. Certain partial matches
are also obtained, where either the first part or the end of
the query is matched. These are movements similar to those of the phrases and
are interesting for a listener, learner, or researcher. High scores were obtained for
certain false alarms, primarily due to melodic similarities between the false
alarm and the original phrase.
Modified-RLCS results in fewer false alarms than RLCS in both passes.
The duration of the hits is also shorter than for RLCS, which means
that localization is better in modified-RLCS. This is primarily due to the fact
that the actual length of the match is used in modified-RLCS. In (3.6), the term
c_{i,j}/n
focuses on finding an RLCS whose rough length is as large as the length of the query motif,
whereas in (3.10) the term
(c_{i,j} + c_{i,j}) / (c^{a}_{i,j} + n)
gives equal importance to both: finding an RLCS whose rough length is as large
as the length of the query motif, and finding an RLCS whose rough length
is as large as the actual length of the RLCS. Due to this, shorter sequences also get
a good score if they represent the motif adequately.
3.8.1 Importance of VAD in motif spotting
Voice Activity Detection (VAD) on the alapanas is a crucial step: it
removes noise and reduces the search space. Table 3.10 shows the results after
the two passes using the modified-RLCS algorithm without VAD. The method for selecting the
score thresholds after both passes remains the same. The groups become
much longer, though the number of true positive groups and false alarm groups
is reduced. Some of the true positive groups contain more than one instance of the
motif; therefore, the number of true positives is much larger than the number
of true positive groups. But the motifs are not localized properly, even after the
second pass. The number of false alarm groups is also small, but their average duration is
very high, approximately one minute. The total duration of the false alarms without
VAD is much greater than with VAD after each of the passes. This
justifies the use of VAD.
Table 3.10: Retrieved Groups after both the passes for modified-RLCS withoutVAD
Pass No.    True Group Retrieved    True Positive Retrieved    True Group Duration (avg.)    False Group Retrieved    False Group Duration (avg.)
Pass 1      29                      57                         80.31 secs                    20                       68.23 secs
Pass 2      38                      57                         44.33 secs                    31                       45.45 secs
3.9 Summary
In this work, RLCS is used for motif spotting in alapanas in Carnatic music. It is
illustrated that the stationary points of the pitch contour of a musical piece hold
significant musical information. It is then shown that quantizing the pitch contour of
the alapana at the stationary points leads to no loss of information while resulting
in a significant reduction in the search space. The RLCS method is shown to
give a high recall for the motif queried. Given that the objective is to explore the
musical traits of a raga by spotting interesting melodic motifs rendered by various
artists, the recall of the motif queried is of higher importance than the precision. A
modified version of the RLCS algorithm is also presented, which gives better scores to
subsequences that are shorter than the query but match most of it reasonably well.
Modified-RLCS was tested on longer motifs and compared
favorably with the original RLCS. The importance of performing Voice Activity
Detection is also discussed.
CHAPTER 4
Motif Discovery
4.1 Introduction
A raga in Carnatic music is characterised by typical phrases or motifs, which are
primarily pitch trajectories in the time-frequency plane. Although, for annotation
purposes, ragas in Carnatic music are based on 12 srutis (or semitones), the gamakas
associated with the same semitone can vary significantly across ragas, as discussed
in Chapter 2. Nevertheless, even though the phrases do not occupy fixed positions in
the time-frequency (t-f) plane, an experienced listener can determine the identity
of a raga within a few seconds of an alapana. The objective of the work presented here
is to determine the typical motifs of a raga automatically. This is achieved by analyzing
various compositions composed in a particular raga. Unlike Hindustani
music, Carnatic music has a huge repository of compositions by a number of composers in
different ragas. Musicians often state that the famous composers have
composed in such a way that a single line of a composition is replete with the motifs of the
raga. In this work, we therefore take single lines of different compositions and
determine the typical motifs of the raga.
In a Carnatic music concert, many listeners from the audience are able to
identify the raga at the very beginning of the composition, usually during the
singing of the first line itself — a line corresponds to one or more tala cycles.
Thus, the first lines of compositions could contain typical motifs of a raga. A
pattern that is repeated within a first line could still be non-specific to the raga.
Figure 4.1: RLCS matching two sequences partially
In contrast, a pattern that is present across different composition lines could be a
typical motif of that raga. Instead of using only first lines, we have also used other
lines of the compositions, namely lines from the pallavi, anupallavi and charanam. In
this chapter, an attempt is made to find repeating patterns across these lines rather
than within a line. Typical motifs are filtered from the generated repeating patterns
during post-processing. These typical motifs are available online1.
The length of the typical motif to be discovered is not known a priori. Therefore,
a technique is needed that can determine the length of the motif
at the time of discovering it. Dynamic Time Warping (DTW) based algorithms can
only find a pattern of a specific length, since they perform end-to-end matching of the
query and test sequences. Another version of DTW, known as Unconstrained
End Point DTW (UE-DTW), can match the whole query with a partial test
1 http://www.iitm.ac.in/donlab/typicalmotifs.html
but still the query is not partially matched. The Longest Common Subsequence (LCS)
algorithm, on the other hand, can match a partial query with a partial test sequence,
since it looks for a longest common subsequence, which need not be end-to-end.
LCS by itself is not appropriate, however, as it requires discrete symbols and does not account
for local similarity. A modified version of LCS, known as the Rough Longest Common
Subsequence, takes continuous symbols and accounts for the local similarity
of the longest common subsequence. The algorithm proposed in [30] to find the
Rough Longest Common Subsequence between two sequences fits the bill for the
task of motif discovery. An example of the RLCS algorithm matching two partial
phrases is shown in Figure 4.1. The two music segments are represented by their
tonic-normalized pitch contours. The stationary points of the tonic-normalized
pitch contour, where the first derivative is zero, are first determined. These points are
then interpolated using cubic Hermite interpolation to smooth the contour.
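Stationary-point extraction can be sketched as below: a point is stationary where the first difference of the contour changes sign. This simple sketch ignores flat plateaus; the subsequent smoothing could then be done with a cubic Hermite interpolator such as `scipy.interpolate.PchipInterpolator` (both the function name below and the plateau handling are assumptions, not the thesis's implementation).

```python
# Local extrema of a sampled pitch contour: points where the slope
# changes sign, i.e. where the first derivative of the underlying
# continuous contour crosses zero. Exact plateaus are not handled.

def stationary_points(pitch):
    """Return (index, value) pairs of local extrema of a pitch contour."""
    pts = []
    for i in range(1, len(pitch) - 1):
        left = pitch[i] - pitch[i - 1]
        right = pitch[i + 1] - pitch[i]
        if left * right < 0:          # slope changes sign -> extremum
            pts.append((i, pitch[i]))
    return pts
```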
In Chapter 3, plenty of false alarms were observed. One of the most prevalent
sources of false alarms was a sustained note appearing in the phrase. The
slope of the linear trend in stationary points, along with its standard deviation, is
used to address this issue.
The rest of the chapter is organized as follows. Section 4.2 discusses the use of
composition lines to find motifs. Section 4.3 discusses the optimization
criteria for finding the Rough Longest Common Subsequence. Section 4.4 describes
the proposed approach for discovering typical motifs of ragas. Section 4.5 describes
the dataset used in this work. Experiments and results are presented in Section
4.6.
4.2 Lines from the compositions
As previously mentioned, the first line of a composition contains the characteristic
traits of a raga. The importance of first lines and the raga information they hold is
illustrated in great detail in T. M. Krishna's book on Carnatic music [26]. T. M.
Krishna states that the opening section, called pallavi (discussed in Chapter 2), directs
the melodic flow of the raga, and that through its rendition the texture of the raga can
be felt. Motivated by this observation, an attempt is made to verify the conjecture
that typical motifs of a raga can be obtained from the first lines of compositions.
Along with lines from the pallavi, we have also selected a few lines from other
sections, namely the anupallavi and charanam. The anupallavi comes after the pallavi, and the
melodic movements in this section tend to explore the raga in the higher reaches
of the octave, as discussed in Chapter 2.
4.3 Optimization criteria to find Rough Longest Common Subsequence
The Rough Longest Common Subsequence (RLCS) between two sequences, X =
⟨x_1, x_2, ..., x_n⟩ and Y = ⟨y_1, y_2, ..., y_m⟩, of lengths n and m, is defined as the Longest
Common Subsequence (LCS) Z_{XY} = ⟨(x_{i_1}, y_{j_1}), (x_{i_2}, y_{j_2}), ..., (x_{i_p}, y_{j_p})⟩, 1 ≤ i_1 < i_2 <
... < i_p ≤ n, 1 ≤ j_1 < j_2 < ... < j_p ≤ m, such that the similarity between x_{i_k}
and y_{j_k} is greater than a threshold, τ_sim, for k = 1, ..., p. This definition does not
include any constraints on the length or the local similarity of the RLCS. Some
applications demand that the RLCS be locally similar or that its length be in a specific
range. As discussed in Section 3.3, [30] uses constraints on local similarity and on
the length to obtain the RLCS. For the task of motif discovery, one more constraint
is used along with these to reduce false alarms. Before discussing
these optimization measures in detail, a few quantities from Section 3.3
are redefined from a different perspective.
l^{w}_{S_{XY}} = \sum_{k=1}^{s} \mathrm{sim}(x_{i_k}, y_{j_k}) \qquad (4.1)

g_X = i_s - i_1 + 1 - s \qquad (4.2)

g_Y = j_s - j_1 + 1 - s \qquad (4.3)
Let S_{XY} = ⟨(x_{i_1}, y_{j_1}), (x_{i_2}, y_{j_2}), ..., (x_{i_s}, y_{j_s})⟩, 1 ≤ i_1 < i_2 < ... < i_s ≤ n, 1 ≤ j_1 < j_2 <
... < j_s ≤ m, be a rough common subsequence (RCS) of length s, and let sim(x_{i_k}, y_{j_k}) ∈
[0, 1] be the similarity between x_{i_k} and y_{j_k} for k = 1, ..., s. (4.1) defines the rough
length of S_{XY} as the sum of these similarities; thus, the rough length is less than or equal to s.
The number of points in the shortest substring of sequence X containing the RCS
S_{XY} that are not part of the RCS is termed the gap in S_{XY} with respect to
sequence X, as defined by (4.2). Similarly, (4.3) defines the gap in S_{XY} with respect
to sequence Y. Small gaps indicate that the distribution of the RCS is dense in that
sequence.
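Under these definitions, the rough length (4.1) and the gaps (4.2)-(4.3) can be computed directly from the matched index pairs; a minimal sketch:

```python
# Rough length (4.1) and gaps (4.2)-(4.3) of a rough common subsequence,
# given the matched (i, j) index pairs and a similarity function.

def rough_length(pairs, X, Y, sim):
    """Sum of pairwise similarities over the matched pairs."""
    return sum(sim(X[i], Y[j]) for i, j in pairs)

def gaps(pairs):
    """Unmatched points inside the covering substrings of X and Y."""
    s = len(pairs)
    i1, j1 = pairs[0]
    i_s, j_s = pairs[-1]
    g_x = i_s - i1 + 1 - s
    g_y = j_s - j1 + 1 - s
    return g_x, g_y
```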
The optimization measures to find the RLCS are described as follows.
4.3.1 Density of the match
(4.4) represents the distribution of the RCS S_{XY} in the sequences X and Y. This is
called the density of match, δ_{S_{XY}}. This quantity needs to be maximized to ensure that
the subsequence S_{XY} is locally similar. β ∈ [0, 1] weighs the individual densities
in sequences X and Y. The terms l^{w}_{S_{XY}} + g_X and l^{w}_{S_{XY}} + g_Y in (4.4) are the same as the
rough widths across the query and the reference from Chapter 3.

\delta_{S_{XY}} = \beta \, \frac{l^{w}_{S_{XY}}}{l^{w}_{S_{XY}} + g_X} + (1 - \beta) \, \frac{l^{w}_{S_{XY}}}{l^{w}_{S_{XY}} + g_Y} \qquad (4.4)
4.3.2 Normalized weighted length
The weighted length of the RCS is normalized as shown in (4.5) to restrict its range to
[0, 1], where n and m are the lengths of sequences X and Y, respectively.

\bar{l}^{w}_{S_{XY}} = \frac{l^{w}_{S_{XY}}}{\min(m, n)} \qquad (4.5)
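Equations (4.4) and (4.5) translate directly into code; a sketch, with `lw` the rough length and `g_x`, `g_y` the gaps defined earlier:

```python
# Density of match (4.4) and normalized weighted length (4.5) of a
# rough common subsequence.

def density_of_match(lw, g_x, g_y, beta=0.5):
    """beta weighs the individual densities in the two sequences."""
    return beta * lw / (lw + g_x) + (1 - beta) * lw / (lw + g_y)

def normalized_weighted_length(lw, n, m):
    """Restrict the rough length to [0, 1] via the shorter sequence."""
    return lw / min(n, m)
```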
4.3.3 Linear trend in stationary points
As discussed in Section 3.3, the density of match, δ_{S_{XY}}, and the normalized weighted
length, \bar{l}^{w}_{S_{XY}}, are used as optimization measures to find the Rough Longest Common
Subsequence. A rough common subsequence S_{XY} between two sequences X
and Y that maximizes these measures is found using a dynamic
programming approach. In this work, an additional measure based on the linear
trend in stationary points is used for optimization, which helps in reducing one type
of false alarm.
As observed in Chapter 3, the RLCS obtained using only the two optimization
measures discussed in Section 3.3 suffered from a large number of false alarms
in the motif spotting task. The false alarms generally consisted of long,
sustained notes, which nevertheless obtained good normalized weighted lengths and densities.
Figure 4.2: Slopes of the linear trend of stationary points help in reducing the false alarms. The last three phrases are false alarms.
To address this issue, the slope of the linear trend in the stationary points of a phrase,
and the standard deviation of that trend, are estimated. Figure 4.2 shows a set of
five phrases which are termed similar based on the density of match and
normalized weighted length. The first two phrases, shown in green,
are true positives, while the remaining three, shown in red, are false alarms.
Figure 4.2 also shows the linear trend in stationary points for the corresponding
phrases. The trends of the true positives are observed to be similar when compared
to those of the false alarms. The slope of the linear trend for the fifth phrase (a false
alarm) is similar to that of the true positives, but its standard deviation is smaller. Therefore,
a combination of the slope and the standard deviation of the linear trend is used
to reduce the false alarms.
Let the stationary points in the shortest substrings of sequences X and Y containing
the RCS S_{XY} be ⟨x_{q_1}, x_{q_2}, ..., x_{q_{t_x}}⟩ and ⟨y_{r_1}, y_{r_2}, ..., y_{r_{t_y}}⟩ respectively, where
t_x and t_y are the number of stationary points in the respective substrings. (4.6)
estimates the slope of the linear trend of stationary points in the substring of sequence X
as the mean of the first difference of stationary points, which is the same as
(x_{q_{t_x}} - x_{q_1})/(t_x - 1) [10]. Its standard deviation is estimated using (4.7). Similarly, μ^{Y}_{S_{XY}} and σ^{Y}_{S_{XY}}
are estimated for the substring of sequence Y.

\mu^{X}_{S_{XY}} = \frac{1}{t_x - 1} \sum_{k=1}^{t_x - 1} (x_{q_{k+1}} - x_{q_k}) \qquad (4.6)

\left(\sigma^{X}_{S_{XY}}\right)^2 = \frac{1}{t_x - 1} \sum_{k=1}^{t_x - 1} \left((x_{q_{k+1}} - x_{q_k}) - \mu^{X}_{S_{XY}}\right)^2 \qquad (4.7)
Let z_1 = μ^{X}_{S_{XY}} σ^{Y}_{S_{XY}} and z_2 = μ^{Y}_{S_{XY}} σ^{X}_{S_{XY}}. For a true positive, the similarity in the
linear trend should be high. (4.8) calculates this similarity, which needs to be
maximized. The similarity is negative when the two slopes have different
signs, and the penalization is thus greater.

\lambda_{S_{XY}} =
\begin{cases}
\dfrac{\max(z_1, z_2)}{\min(z_1, z_2)} & \text{if } z_1 < 0 \text{ and } z_2 < 0 \\
\dfrac{\min(z_1, z_2)}{\max(z_1, z_2)} & \text{otherwise}
\end{cases}
\qquad (4.8)
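A direct transcription of (4.8); note that, as in the equation itself, no special handling is given to zero slopes, so this is a sketch rather than a robust implementation:

```python
# Trend similarity (4.8): compares the cross-weighted slopes of the two
# substrings; the result is negative when the slopes disagree in sign.

def trend_similarity(mu_x, sigma_x, mu_y, sigma_y):
    z1, z2 = mu_x * sigma_y, mu_y * sigma_x
    if z1 < 0 and z2 < 0:
        return max(z1, z2) / min(z1, z2)
    return min(z1, z2) / max(z1, z2)
```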
Finally, (4.9) combines this optimization measure with the other two measures
to obtain a score value which is maximized. The RLCS R_{XY} between the sequences
X and Y is then defined in (4.10) as the RCS with the maximum score. The RLCS
R_{XY} can be obtained by optimizing (4.9) using a dynamic programming algorithm.

\mathrm{Score}_{S_{XY}} = \alpha \, \delta_{S_{XY}} \, \bar{l}^{w}_{S_{XY}} + (1 - \alpha) \, \lambda_{S_{XY}} \qquad (4.9)

R_{XY} = \underset{S_{XY}}{\arg\max} \; \mathrm{Score}_{S_{XY}} \qquad (4.10)
4.4 Discovering typical motifs of ragas
Typical motifs of a raga are discovered using composition lines in that raga and
composition lines from other ragas. For each voiced part in a composition line of
a raga, the RLCS is found with the voiced parts of the other composition lines of the same
raga. Only those RLCSs are selected whose score values and lengths (in seconds) are
greater than the thresholds τ_scr and τ_len, respectively. Voiced parts that generate
no RLCS are interpreted as having no motifs. The RLCSs generated for a voiced part
are grouped, and each group is interpreted as a motif found in that voiced part. This
results in a number of groups (motifs) for a raga. Filtering is then performed to
isolate the typical motifs of that raga.
4.4.1 Filtering to get typical motifs of a raga
The generated motifs are filtered to obtain the typical motifs of a raga using composition
lines of other ragas. The most representative candidate of a motif, the candidate with
the highest score value, is selected to represent that motif or group. The instances of
a motif are spotted in the composition lines of cohort ragas using the motif spotting approach
discussed in Chapter 3. Each motif is considered as a query to be searched for in
a composition line, and the RLCS is found between the query and the composition line.
From the several RLCSs found across the composition lines of a raga, the top τ_n RLCSs
with the highest score values are selected. The average of these score values defines
the presence of the motif in that raga. A motif of a raga is isolated as a typical
motif if its presence in the given raga is greater than in the other ragas. The
value of τ_n is selected empirically.
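The filtering criterion can be sketched as follows, assuming per-raga lists of spotting scores; the function names are illustrative, not from the thesis:

```python
# "Presence" of a motif in a raga = mean of the top tau_n RLCS scores
# obtained when spotting the motif in that raga's composition lines.
# A motif is typical if its presence is highest in its own raga.

def presence(scores, tau_n=3):
    top = sorted(scores, reverse=True)[:tau_n]
    return sum(top) / len(top)

def is_typical(own_scores, cohort_scores_by_raga, tau_n=3):
    own = presence(own_scores, tau_n)
    return all(own > presence(s, tau_n) for s in cohort_scores_by_raga)
```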
4.5 Dataset
Table 4.1: D1: Dataset of composition lines
Raga Name           Number of recordings    Average Duration (secs)    Total Duration (mins)
Bhairavi            17                      16.87                      4.78
Kamboji             12                      13.01                      2.60
Kalyani             9                       12.76                      1.91
Shankarabharanam    12                      12.55                      2.51
Varali              9                       9.40                       1.41
Overall             59                      13.44                      13.22
Table 4.2: D1: Dataset for filtering
Raga Name           Number of recordings    Average Duration (mins)    Total Duration (hrs)
Bhairavi            20                      18.88                      6.29
Kamboji             9                       24.25                      3.64
Kalyani             16                      20.07                      5.35
Shankarabharanam    10                      21.68                      3.61
Varali              18                      17.03                      5.11
Overall             73                      19.73                      24.01
The approach is tested on two datasets, D1 and D2. D1 consists of five
ragas. The details of the composition lines used in D1 for finding similar patterns and
for filtering are given in Table 4.1 and Table 4.2, respectively. The lines are sung
by a musician in isolation. This is done to ensure that the pitch estimation
is not affected by the accompanying instruments. D2 consists of 12 ragas, and its
composition lines are extracted from live concert recordings using the algorithm
described in [42]. The details of this dataset are given in Table 4.3 and Table 4.4.
Table 4.3: D2: Dataset of composition lines
Raga Name          Number of recordings    Average Duration (secs)    Total Duration (mins)
Huseni             6                       9.20                       0.92
Bhairavi           20                      12.03                      4.01
Abhogi             13                      8.66                       1.88
Kalyani            16                      12.36                      3.30
Ananda-Bhairavi    14                      8.16                       1.90
Shri               10                      13.67                      2.28
Bageshri           4                       15.09                      1.01
Hameer-Kalyani     5                       11.45                      0.95
Surati             13                      9.33                       2.02
Thodi              29                      10.99                      5.31
Mukhari            9                       10.01                      1.50
Kapi               14                      9.31                       2.17
Overall            153                     10.69                      27.25
Table 4.4: D2: Dataset for filtering
Raga Name          Number of recordings    Average Duration (secs)    Total Duration (mins)
Huseni             24                      9.21                       3.68
Bhairavi           74                      11.44                      14.11
Abhogi             48                      8.30                       6.64
Kalyani            57                      11.49                      10.91
Ananda-Bhairavi    55                      8.08                       7.41
Shri               39                      13.74                      8.93
Bageshri           16                      15.09                      4.02
Hameer-Kalyani     20                      11.45                      3.81
Surati             52                      9.33                       8.08
Thodi              113                     10.98                      20.68
Mukhari            36                      10.01                      6.00
Kapi               56                      9.31                       8.69
Overall            590                     10.47                      102.99
Table 4.5: D1: Similar motifs retrieved from composition lines

Raga Name           True Motifs    Partial Motifs    Wrong Motifs    Average Duration (secs)
Bhairavi            3              2                 5               3.52
Kamboji             1              2                 2               3.40
Kalyani             2              1                 3               4.48
Shankarabharanam    2              2                 1               3.41
Varali              1              0                 1               4.11
Overall             9              7                 12              3.78
4.6 Experiments and results
The pitch of the music segment, with tonic normalization, is used as a basic feature
in this work. The extraction of pitch is discussed in Chapter 3.
The similarity, sim(x_{i_k}, y_{j_k}), between two symbols x_{i_k} and y_{j_k} is defined as

\mathrm{sim}(x_i, y_j) =
\begin{cases}
1 - \dfrac{|x_i - y_j|^3}{(3\,st)^3} & \text{if } |x_i - y_j| < 3\,st \\
0 & \text{otherwise}
\end{cases}
\qquad (4.11)
where st represents a semitone in cents. The intuition behind this similarity function
is the same as that of (3.11). The similarity threshold, τ_sim, is empirically set
to 0.45, which accepts similarities when two symbols are less than approximately 2.5 semitones
apart, although the penalty is high beyond one semitone. The threshold on the
score of the RLCS, τ_scr, is empirically set to 0.6 to accept RLCSs with higher score values.
The threshold on the length of the RLCS, τ_len, is set to 2 seconds to obtain longer motifs.
The value of β is set to 0.5 to give equal importance to the individual densities in
both sequences, and α is set to 0.6, which gives more importance to the density
of match and normalized weighted length than to the linear trend in stationary
points. τ_n is empirically set to 3.
Table 4.6: D1: Percentage of motifs preserved after filtering

Raga Name           True Motifs (%)    Partial Motifs (%)    Wrong Motifs (%)    Average Duration (secs)
Bhairavi            100.00             100.00                0.00                4.52
Kamboji             0.00               0.00                  0.00                NA
Kalyani             0.00               0.00                  0.00                NA
Shankarabharanam    100.00             100.00                0.00                3.68
Varali              100.00             NA                    0.00                6.09
Overall             66.67              57.14                 0.00                4.76
Table 4.7: D2: Similar motifs retrieved from composition lines

Raga Name          True Motifs    Partial Motifs    Wrong Motifs    Average Duration (secs)
Huseni             1              4                 0               4.84
Bhairavi           8              6                 9               4.60
Abhogi             5              3                 3               3.68
Kalyani            15             13                3               3.87
Ananda-Bhairavi    12             2                 6               3.85
Shri               1              16                1               3.87
Bageshri           1              1                 0               3.48
Hameer-Kalyani     2              0                 3               3.96
Surati             21             15                13              3.96
Thodi              32             18                30              4.17
Mukhari            3              1                 12              3.89
Kapi               3              4                 1               3.48
Overall            104            83                81              4.04
Table 4.8: D2: Percentage of motifs preserved after filtering

Raga Name          True Motifs (%)    Partial Motifs (%)    Wrong Motifs (%)    Average Duration (secs)
Huseni             100.00             25.00                 NA                  5.66
Bhairavi           87.50              66.67                 33.33               5.00
Abhogi             100.00             66.67                 33.33               3.62
Kalyani            80.00              61.54                 66.67               4.00
Ananda-Bhairavi    58.33              100.00                83.33               3.90
Shri               100.00             43.75                 0.00                3.98
Bageshri           100.00             100.00                NA                  3.48
Hameer-Kalyani     50.00              NA                    0.00                3.23
Surati             85.71              66.67                 84.62               4.12
Thodi              71.88              55.56                 40.00               4.50
Mukhari            33.33              0.00                  41.67               4.15
Kapi               100.00             25.00                 0.00                3.23
Overall            76.92              55.42                 48.15               4.22
The similar patterns found across composition lines of a raga are summarized
in Table 4.5 and Table 4.7 for D1 and D2, respectively. These similar phrases are
categorized as true motifs, partial motifs and wrong motifs by a musician. Partial
motifs, with some more context, could be considered true motifs. The task of
filtering is thus to preserve as many true motifs and reject as many
wrong motifs as possible. The details of the typical motifs preserved after filtering
are given in Table 4.6 and Table 4.8. For D1, the chosen threshold
filtered out all the motifs of Kamboji and Kalyani; other thresholds
resulted in preserving many wrong motifs. A better analysis of the
filtering can be made for D2, as shown in Table 4.8. On average, 77% of the true motifs are
preserved, whereas only 48% of the wrong motifs are. The average length of
the typical motifs is similar to that of the longer motifs used in Chapter 3. The shorter
motifs used in Chapter 3 resulted in a great deal of false alarms, while the longer
motifs, of approximately 3 seconds duration, used in Chapter 3 were inspired by
the raga test conducted by Rama Verma2.
4.7 Summary
In this chapter, we have presented an approach to discover the typical motifs of a
raga from the composition lines in that raga. The importance of composition lines
is discussed in detail. A new measure is introduced into the optimization criteria
for finding the Rough Longest Common Subsequence between two given sequences,
in order to reduce false alarms. Using the RLCS algorithm, similar patterns are found across
the composition lines of a raga. The typical motifs are then isolated by
a filtering technique, introduced in this work, which uses composition lines of
various ragas. The filtering preserves most of the true motifs.
2 http://www.youtube.com/watch?v=3nRtz9EBfeY
CHAPTER 5
Raga Verification
5.1 Introduction
Raga identification by machine is a difficult task in Carnatic music. This is primarily
because a raga is defined not just by the solfege but by svaras (ornamented notes)
[27]. The melodic histograms obtained for Carnatic music are more or less
continuous owing to the gamaka-laden svaras of a raga [44]. As discussed in
Chapter 2, the svaras in Carnatic music are not quantifiable, but for notational
purposes an octave is divided into 12 semitones: S, R1, R2(G1), R3(G2), G3, M1,
M2, P, D1, D2(N1), D3(N2) and N3. Each raga is characterised by at least four or five
svaras. The arohana and avarohana correspond to the ordering of svaras in the ascent and
descent of the raga, respectively. Ragas with a linear ordering of svaras, such as the
Mohonam raga (S R2 G3 P D2 S), are referred to as linear ragas. Similarly, non-linear ragas,
such as the Ananda Bhairavi raga (S G2 R2 G2 M1 P D2 P S), have a non-linear ordering.
A further complication arises from the fact that although the svaras in different
ragas may be identical, their ordering can be different. Even when the ordering is the
same, the approach to a svara can differ from one raga to another, as in, for example, Thodi and
Dhanyasi [27].
In this chapter, this problem is addressed in a different way. The objective
is to mimic a listener in a Carnatic music concert. There are at least 100 ragas
that are actively performed today. Most listeners identify ragas by referring to
compositions with similar motivic patterns that they might have heard before. In
raga verification, a raga's name (the claim) and an audio clip are supplied. The
machine has to verify whether the clip belongs to the given raga or not.
This task therefore requires the definition of cohorts for a raga. As discussed
in Chapter 4, the cohorts of a given raga are the ragas that have similar movements
while at the same time having subtle differences, for example, Darbar and Nayaki. In
the Darbar raga, G2 is repeated twice in the avarohana: the first is more or less flat
and short, while the second repetition is inflected. The G2 in Nayaki is characterised
by a very typical gamaka. In order to verify whether a given audio clip belongs to a
claimed raga, its similarity to the claimed raga is measured and compared with that to
the raga's cohorts using a novel algorithm called Longest Common Segment Set (LCSS).
LCSS scores are then normalized using Z- and T-norms [1, 34].
The rest of the chapter is organised as follows. Section 5.2 describes the dataset
used in the study. Section 5.3 describes the LCSS algorithm and its relevance to
raga verification. As the task is raga verification, score normalisation is crucial;
different score normalisation techniques are discussed in Section 5.4. The
experimental results are presented in Section 5.5 and discussed in Section 5.6. The
main conclusions drawn from the key results of this work are presented in Section 5.7.
5.2 Dataset used
Table 5.1 gives the details of the dataset used in this work. The dataset was
obtained from the Charsur Arts Foundation¹. It consists of 254 vocal and
instrumental live recordings spread across 30 ragas, including both the target ragas
and their cohorts. For every new raga that needs to be verified, templates for the raga and

¹http://www.charsurartsfoundation.org
Table 5.1: Details of the database used. Durations are given in approximate hours (h), minutes (m) or seconds (s).

                                       Vocal            Instruments                            Total
                                       Male    Female   Violin   Veena   Saxophone   Flute
    Number of Ragas                    25      27       8        3       2           2         30 (distinct)
    Number of Artists                  53      37       8        3       1           3         105
    Number of Recordings               134     97       14       4       2           3         254
    Total Duration of Recordings       30 h    22 h     3 h      31 m    10 m        58 m      57 h
    Number of Pallavi Lines            655     475      69       20      10          15        1244
    Average Duration of Pallavi Lines  11 s    8 s      10 s     6 s     6 s         8 s       8 s (avg.)
    Total Duration of Pallavi Lines    2 h     1 h      11 m     2 m     55 s        2 m       3 h
its cohorts are required.
5.2.1 Extraction of pallavi lines
A composition in Carnatic music consists of three parts, namely, pallavi,
anupallavi and charanam. It is believed that the first phrase of the pallavi line of
a composition contains the important movements of the raga. A basic sketch is
initiated in the pallavi line and developed further in the anupallavi and charanam
[42]; the pallavi line therefore contains the gist of the raga. The algorithm described
in [42] is used for extracting pallavi lines from compositions. Details of the extracted
pallavi lines are given in Table 5.1. Experiments are performed on template and test
recordings selected from these pallavi lines, as discussed in greater detail in Section 5.5.
5.2.2 Selection of cohorts
Wherever possible, 4-5 ragas are chosen as cohorts of every raga. The cohorts
of every raga were defined by a professional musician. Professionals are very
careful about this, as they need to ensure that during improvisation they do not
accidentally sketch a cohort. Interestingly, as indicated by the musicians, cohorts
need not be symmetric. A raga A can be similar in movement to a raga B, but raga
B need not share the same commonality with raga A. The identity of raga B
may depend on phrases similar to raga A with some additional movement. For
example, to identify the raga Hindolam, the phrase G2 M1 D1 N2 S is adequate,
while Jayantashree raga requires the phrase G2 M1 D1 N2 S N2 D1 P M1 G2 S.
5.3 Longest Common Segment Set Algorithm
In raga verification, matching needs to be performed between two audio clips. The
number of similar portions could be more than one and spread across the entire
clip. Therefore, there is a need for a matching approach that can find these similar
portions without issuing large penalties for the gaps in between them. In this
section, a novel algorithm called Longest Common Segment Set (LCSS) is described,
which attempts to do exactly this.
Let $X = \langle x_1, \cdots, x_m \rangle$, $x_i \in \mathbb{R}$, be a sequence of $m$ symbols and $Y = \langle y_1, \cdots, y_n \rangle$, $y_j \in \mathbb{R}$, be a sequence of $n$ symbols, where $x_i$ and $y_j$ are tonic-normalized pitch values in cents. The similarity between two pitch values, $x_i$ and $y_j$, is calculated using (4.11), defined in Chapter 4.

A common subsequence $S_{XY}$ of the sequences $X$ and $Y$ is defined as

$$
S_{XY} = \left\langle (x_{i_1}, y_{j_1}), \cdots, (x_{i_p}, y_{j_p}) \right\rangle
\quad
\begin{array}{l}
1 \le i_1 < \cdots < i_p \le m \\
1 \le j_1 < \cdots < j_p \le n \\
sim(x_{i_k}, y_{j_k}) \ge \tau_{sim}, \; k = 1, \cdots, p
\end{array}
\qquad (5.1)
$$
Figure 5.1: An example of a common segment set between two sequences representing real data. The common subsequence is shown in red, the hard segments in green, the soft segments in blue, and the soft-segment running score in gray-scale.
where $\tau_{sim}$ is a threshold that decides the membership of the symbol pair $(x_{i_k}, y_{j_k})$
in the subsequence $S_{XY}$. The value of $\tau_{sim}$ is decided empirically based on the domain
of the problem, as discussed in Section 5.5. An example common subsequence is
shown in red in Figure 5.1.
5.3.1 Common segments
A group of contiguous symbol pairs in a common subsequence is referred to as a segment.
Two types of segments are defined, namely hard and soft segments.
A hard segment is a group of common-subsequence symbol pairs with no gaps in
between, shown in green in Figure 5.1. A hard segment starting with the symbol
pair $(x_i, y_j)$ must be of the form

$$
H^{l}_{X_i Y_j} = \left\langle (x_i, y_j), (x_{i+1}, y_{j+1}), \cdots, (x_{i+l}, y_{j+l}) \right\rangle
\quad
\begin{array}{l}
1 \le i < i+1 < \cdots < i+l \le m \\
1 \le j < j+1 < \cdots < j+l \le n
\end{array}
\qquad (5.2)
$$

where $l + 1$ represents the length of the hard segment. The score of the $k$th hard
segment $H^{l}_{X_{i_k} Y_{j_k}}$ is defined as

$$
hc\left(H^{l}_{X_{i_k} Y_{j_k}}\right) = \sum_{d=0}^{l} sim\left(x_{i_k+d}, y_{j_k+d}\right)
\qquad (5.3)
$$
A soft segment is a group of common-subsequence symbol pairs in which gaps are
permitted with a penalty; a soft segment therefore consists of one or more hard
segments (shown in blue in Figure 5.1), and the gaps between the hard segments
decide the penalty assigned. The score of the $k$th soft segment $S_{X_{i_k} Y_{j_k}}$,
consisting of $r$ hard segments, is defined as

$$
sc\left(S_{X_{i_k} Y_{j_k}}\right) = \sum_{s=1}^{r} hc\left(H^{l}_{X_{i_s} Y_{j_s}}\right) - \gamma \eta
\qquad (5.4)
$$
where $\gamma$ is the total number of gaps between the $r$ hard segments and $\eta$ is the penalty
for each gap. The number of hard segments to include in a soft segment is decided
by the running score of the soft segment: the running score increases during a
hard segment and decreases during a gap due to the penalties, as shown in
gray-scale in Figure 5.1. During a gap, if the running score
decreases below a threshold $\tau_{rc}$ (or becomes almost white in Figure 5.1), then that
gap is ignored and all the hard segments encountered before it are included in
a soft segment.
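As a small illustration, the two segment scores can be computed directly from their definitions. The sketch below (a toy example with hypothetical similarity values; the thesis computes the similarities from (4.11)) evaluates (5.3) and (5.4) for a soft segment made of two hard segments separated by three gaps:

```python
def hard_score(sims):
    # Eq. (5.3): the score of a hard segment is the sum of the
    # pairwise similarities of its symbol pairs.
    return sum(sims)

def soft_score(hard_scores, n_gaps, eta):
    # Eq. (5.4): the sum of the hard-segment scores minus a penalty
    # of eta for each of the n_gaps gap symbols.
    return sum(hard_scores) - n_gaps * eta

# Two hard segments with illustrative similarity values,
# separated by three gaps, with a per-gap penalty eta = 0.5:
h1 = hard_score([0.5, 0.5, 0.5])         # 1.5
h2 = hard_score([0.75, 0.75, 0.5, 0.5])  # 2.5
print(soft_score([h1, h2], n_gaps=3, eta=0.5))  # 2.5
```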
5.3.2 Common segment set
All the segments together constitute a segment set. The score of a segment set ($ss$)
is defined as

$$
score\,(ss_{XY}) = \frac{\sum_{k=1}^{p} c\left(Z_{X_{i_k} Y_{j_k}}\right)^2}{\min(m, n)^2}
\qquad (5.5)
$$
where $p$ is the number of segments, $c$ refers to the score computed in either (5.3)
or (5.4), and $Z$ refers to a segment (hard or soft). This equation gives preference
to longer segments. For example, suppose in case 1 there are 10 segments, each of
length 2, and in case 2 there are 4 segments, each of length 5. In both cases the total
length of the segments is 20, but by (5.5), case 1 is scored as 0.1 and case 2 as 0.25
when the denominator is taken to be $20^2$. Longer matched segments can be
considered a phrase, or an essential part of one, whereas shorter matched segments
generally correspond to noise. Therefore, there is a heavier penalty on shorter
segments.
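The preference for longer segments in (5.5) can be checked numerically. The sketch below reproduces the two cases above, assuming a perfect similarity of 1 for every matched pair, so that a segment of length $L$ has score $L$:

```python
def segment_set_score(segment_scores, m, n):
    # Eq. (5.5): the sum of squared segment scores, normalised by
    # min(m, n) squared, so that longer segments are favoured.
    return sum(c * c for c in segment_scores) / min(m, n) ** 2

# 20 matched symbols either way, with sequence lengths m = n = 20:
case1 = segment_set_score([2.0] * 10, 20, 20)  # ten segments of length 2
case2 = segment_set_score([5.0] * 4, 20, 20)   # four segments of length 5
print(case1, case2)  # 0.1 0.25
```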
5.3.3 Longest Common Segment Set
The Longest Common Segment Set (LCSS) is the segment set with the maximum score,
as defined in (5.6):

$$
lcss_{XY} = \operatorname*{argmax}_{ss_{XY}} \; score\,(ss_{XY})
\qquad (5.6)
$$

The LCSS can therefore be obtained by maximizing the score in (5.5) using dynamic
programming.
Dynamic Programming algorithm to find Longest Common Segment Set
The algorithm to find the optimum soft segment set is given in Algorithm 1;
optimum hard segment sets are found similarly. In the algorithm, tables $c$ and $s$
are used for storing the running score and the score of the common segment sets,
respectively. Table $a$ is used for storing partial scores from $s$. Table $d$ is
maintained for backtracking the path of the LCSS; its arrows represent the subpath
to take while backtracking (up, left or diagonal). The input sequences to the function
LCSS are appended with symbols $\phi_x$ and $\phi_y$ whose similarity to any symbol
is 0. This is mainly required to compute the last row and column of the score table.
Line 8 of Algorithm 1 updates the running score with a value based on the similarity,
whereas line 9 updates the score using the previous diagonal entry. When symbols
are dissimilar, a gap is found; lines 12 and 19 penalize the running score. If it is
the end of a segment, then lines 14 and 21 update the score as per (5.5). Line 26
updates table $a$ with the score value of the current segment set when the beginning
of a new segment is encountered. When a gap is encountered, line 28 sets the entry
of table $a$ to $-1$. To find the Longest Common Segment Set, backtracking is
performed to obtain the path in table $d$ that has the maximum score as given by
table $s$. The boundaries of the soft segments can be found using the cost values
while tracing the path.
Algorithm 1 Algorithm for Soft-Longest Common Segment Set
Data:
  c - table of size (m + 2) × (n + 2) for storing the running score
  s - table of size (m + 2) × (n + 2) for storing the score
  d - table of size (m + 2) × (n + 2) for path tracking
  a - table of size (m + 2) × (n + 2) for storing partial scores

 1: function LCSS(⟨x1, ..., xm, φx⟩, ⟨y1, ..., yn, φy⟩)
 2:   Initialize 1st row and column of c, s, d and a to 0
 3:   p ← min(m, n)
 4:   for i ← 1 to m + 1 do
 5:     for j ← 1 to n + 1 do
 6:       if sim(xi, yj) > τsim then
 7:         d[i,j] ← "↖"
 8:         c[i,j] ← c[i−1,j−1] + (sim(xi, yj) − τsim) / (1 − τsim)
 9:         s[i,j] ← s[i−1,j−1]
10:       else if c[i−1,j] > c[i,j−1] then
11:         d[i,j] ← "↑"
12:         c[i,j] ← max(c[i−1,j] − ρ, 0)
13:         if d[i−1,j] = "↖" then
14:           s[i,j] ← (a[i−1,j] · p² + c[i−1,j]²) / p²
15:         else
16:           s[i,j] ← s[i−1,j]
17:       else
18:         d[i,j] ← "←"
19:         c[i,j] ← max(c[i,j−1] − ρ, 0)
20:         if d[i,j−1] = "↖" then
21:           s[i,j] ← (a[i,j−1] · p² + c[i,j−1]²) / p²
22:         else
23:           s[i,j] ← s[i,j−1]
24:       q ← max(a[i−1,j−1], a[i−1,j], a[i,j−1])
25:       if q = −1 and d[i,j] = "↖" then
26:         a[i,j] ← s[i−1,j−1]
27:       else if c[i,j] < τrc then
28:         a[i,j] ← −1
29:       else
30:         a[i,j] ← q
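The dynamic programme can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the thesis implementation: the sentinels φx, φy are realised as `None`, the similarity function is passed in (a toy binary one is used below, whereas the thesis uses the pitch similarity of (4.11)), `rho` is the per-gap running-score penalty ρ, and only the final entry of the score table `s` is returned, omitting the backtracking that recovers segment boundaries.

```python
def lcss_score(x, y, sim, tau_sim=0.45, tau_rc=0.5, rho=0.5):
    """Score of the soft Longest Common Segment Set of x and y (Algorithm 1)."""
    m, n = len(x), len(y)
    p = min(m, n)
    # 1-indexed copies with phi_x / phi_y sentinels (similarity 0 to anything).
    X = [None] + list(x) + [None]
    Y = [None] + list(y) + [None]
    c = [[0.0] * (n + 2) for _ in range(m + 2)]  # running score
    s = [[0.0] * (n + 2) for _ in range(m + 2)]  # segment-set score
    d = [[""] * (n + 2) for _ in range(m + 2)]   # backtracking direction
    a = [[0.0] * (n + 2) for _ in range(m + 2)]  # partial scores
    for i in range(1, m + 2):
        for j in range(1, n + 2):
            s_ij = 0.0 if (X[i] is None or Y[j] is None) else sim(X[i], Y[j])
            if s_ij > tau_sim:                    # match: extend the segment
                d[i][j] = "diag"
                c[i][j] = c[i-1][j-1] + (s_ij - tau_sim) / (1 - tau_sim)
                s[i][j] = s[i-1][j-1]
            elif c[i-1][j] > c[i][j-1]:           # gap, coming from above
                d[i][j] = "up"
                c[i][j] = max(c[i-1][j] - rho, 0.0)
                if d[i-1][j] == "diag":           # a segment just ended
                    s[i][j] = (a[i-1][j] * p * p + c[i-1][j] ** 2) / (p * p)
                else:
                    s[i][j] = s[i-1][j]
            else:                                 # gap, coming from the left
                d[i][j] = "left"
                c[i][j] = max(c[i][j-1] - rho, 0.0)
                if d[i][j-1] == "diag":
                    s[i][j] = (a[i][j-1] * p * p + c[i][j-1] ** 2) / (p * p)
                else:
                    s[i][j] = s[i][j-1]
            q = max(a[i-1][j-1], a[i-1][j], a[i][j-1])
            if q == -1 and d[i][j] == "diag":     # a new segment begins
                a[i][j] = s[i-1][j-1]
            elif c[i][j] < tau_rc:                # running score has died out
                a[i][j] = -1
            else:
                a[i][j] = q
    return s[m + 1][n + 1]

# With a toy binary similarity, an exact self-match scores 1.0 and two
# entirely dissimilar sequences score 0.0:
sim = lambda u, v: 1.0 if u == v else 0.0
print(lcss_score([100, 200, 300, 400, 500], [100, 200, 300, 400, 500], sim))
print(lcss_score([100, 200, 300], [700, 800, 900], sim))
```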
5.4 Raga Verification
Let $T_{raga} = \{t_1, t_2, \cdots, t_{N_{raga}}\}$ represent the set of template recordings of a raga,
where $raga$ refers to the name of the raga and $N_{raga}$ is the total number of templates
for that raga. During testing, an input test recording $X$ with a claim is tested against
all the template recordings of the claimed raga. The final score is computed as

$$
score\,(X, claim) = \max_{Y \in T_{claim}} \; score\,(lcss_{XY})
\qquad (5.7)
$$
Making the final decision of accepting or rejecting the claim directly on the basis
of this score could be erroneous. Score normalisation with cohorts is essential for
making a decision, especially when the difference between two ragas is subtle.
5.4.1 Score Normalization
LCSS scores corresponding to correct and incorrect claims are referred to as true
and imposter scores, respectively. If the imposter is a cohort raga, then the imposter
score is also referred to as a cohort score. Various score normalization techniques are
discussed in the literature on speech recognition, speaker/language verification
and spoken term detection [1, 34].
Zero normalization (Z-norm) uses mean and variance estimates of the cohort
scores for scaling. The advantage of Z-norm is that the normalization parameters
can be estimated offline: the template recordings of a raga are tested against the
template recordings of its cohorts, and the resulting scores are used to estimate a
raga-specific mean and variance for the imposter distribution. The normalized
score under Z-norm is calculated as

$$
score_{norm}\,(X, claim) = \frac{score\,(X, claim) - \mu_{I}^{claim}}{\sigma_{I}^{claim}}
\qquad (5.8)
$$

where $\mu_{I}^{claim}$ and $\sigma_{I}^{claim}$ are the estimated imposter parameters for the claimed raga.
Test normalization (T-norm) is likewise based on mean and variance estimates
of cohort scores. In contrast to their offline estimation in Z-norm, the normalization
parameters in T-norm are estimated online: during testing, the test recording itself
is scored against the template recordings of the cohort ragas, and the resulting
scores are used to estimate the mean and variance. These parameters are then used
to perform the normalization given by (5.8).

The test recordings of a raga may be scored differently against templates of the
same raga or of an imposter raga, for instance because the stored templates and
the test audio clip can come from different recording environments. This can cause
overlap between the true and imposter score distributions, which T-norm attempts
to reduce.
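Both normalizations apply the same scaling (5.8) and differ only in where the cohort scores come from: Z-norm scores the claimed raga's templates against its cohorts' templates offline, while T-norm scores the test recording against the cohorts' templates at test time. A minimal sketch, with hypothetical score values:

```python
import statistics

def normalize_score(raw_score, cohort_scores):
    # Eq. (5.8): shift by the imposter mean and scale by the imposter
    # standard deviation, both estimated from the cohort scores.
    mu = statistics.mean(cohort_scores)
    sigma = statistics.stdev(cohort_scores)
    return (raw_score - mu) / sigma

# Z-norm: cohort_scores come from scoring the claimed raga's templates
# against its cohorts' templates, computed once per raga, offline.
# T-norm: cohort_scores come from scoring this test recording against
# the cohorts' templates, online.
print(normalize_score(3.0, [1.0, 2.0, 3.0]))  # 1.0
```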
5.5 Performance evaluation
In this section, we describe the results of raga verification using the LCSS algorithm,
in comparison with the Rough Longest Common Subsequence (RLCS) algorithm [30]
and the Dynamic Time Warping (DTW) algorithm, under different normalizations.
5.5.1 Experimental configuration
Only 17 of the 30 ragas were used for raga verification, since a sufficient number
of relevant cohorts within the 30 ragas could be obtained only for these 17. This is
due to the non-symmetric nature of cohorts, discussed in Section 5.2. For raga
verification, 40% of the pallavi lines are used as templates and the remaining 60%
are used for
Table 5.2: EER (%) for different algorithms using different normalizations on different datasets.

    Algorithm    Dataset   No Norm   Z-norm   T-norm
    DTW          D1        27.78     29.88    17.45
                 D2        40.81     40.03    35.96
    RLCS         D1        24.43     27.22    14.87
                 D2        41.72     42.58    41.20
    RLCS-MOD     D1        20.88     22.72    13.25
                 D2        36.72     37.68    34.58
    LCSS (hard)  D1        29.00     31.75    15.65
                 D2        40.28     40.99    34.11
    LCSS (soft)  D1        21.89     24.11    12.01
                 D2        37.24     38.96    34.57
testing. The dataset is partitioned in two ways, referred to as D1 and D2. Variations
of a pallavi line differ from the line itself due to improvisation. In D1, the variations
of a pallavi line may fall into both the template and the test sets, though this is not
guaranteed. In D2, the variations either all belong to the template set or all belong
to the test set, but never to both. The values of the thresholds $\tau_{sim}$ and $\tau_{rc}$ are
empirically chosen as 0.45 and 0.5, respectively. The penalty $\eta$ for gaps in segments
is empirically chosen as 0.5.
5.5.2 Results
Table 5.2 and Figure 5.2 compare LCSS with DTW and RLCS under different
normalizations. The Equal Error Rate (EER) is the operating point at which the false
alarm rate and the miss rate are equal. For T-norm, the best 20 cohort scores were
used for normalization. LCSS (soft) with T-norm performs best for D1 around the
EER point, as well as at high miss rates and low false alarm rates, whereas it
performs worse than LCSS (hard) at low miss rates and high false alarm rates. This
behavior appears to be reversed for D2. The error rates around the EER point are
much higher for D2. This
Figure 5.2: DET curves comparing the LCSS algorithm with the other algorithms using different score normalizations. Panel a) shows the DET curves for dataset D1 and panel b) those for dataset D2; each panel plots miss probability against false alarm probability (both in %) for the no-norm, Z-norm and T-norm variants of LCSS (soft), LCSS (hard), RLCS and DTW.
is because none of the variations of the pallavi lines in the test set are present in the
templates. The curves also show that RLCS performs worse than the other algorithms
for D2, and that Z-norm brings no improvement over the baseline with no
normalization. The latter can be attributed to the way the normalization parameters
are estimated for Z-norm: some of the templates, which may not be similar to the
test recording, can nevertheless be similar to some of the cohorts' templates,
resulting in a higher imposter mean. This does not happen in T-norm, where the
test recording itself is scored against the cohorts' templates.
5.6 Discussion
In this section, we discuss how LCSS (hard) and LCSS (soft) can be combined
to achieve better performance. We also verify that T-norm reduces the overlap
between true and imposter scores.
Table 5.3: Number of claims correctly verified by hard-LCSS only, by soft-LCSS only, by both, and by neither of them, for D1 and D2 using T-norm.

    Dataset   Claim type   Hard-only   Soft-only   Both   Neither
    D1        True         23          55          289    77
              False        46          78          1745   54
    D2        True         47          23          155    220
              False        99          75          1585   168
5.6.1 Combining hard-LCSS and soft-LCSS
Instead of selecting a threshold, we will assume that a true claim is correctly
verified when its score is greater than all of the cohort scores. Similarly, a false claim
is correctly verified when its score is less than at least one of the cohort scores.
Table 5.3 shows the number of claims correctly verified only by hard-LCSS, only
by soft-LCSS, by both, and by neither of them. There is clearly an overlap between
the correctly verified claims of hard-LCSS and soft-LCSS. Nonetheless, the number
of claims verified by only one of the two is also significant. Therefore, a
combination of the two algorithms could yield better performance.
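The threshold-free criterion used in this analysis can be stated compactly. In the sketch below (with hypothetical scores), a true claim counts as correctly verified when it beats every cohort score, and a false claim when at least one cohort beats it:

```python
def true_claim_correct(claim_score, cohort_scores):
    # A true claim is correctly verified when its score exceeds
    # all of the cohort scores.
    return all(claim_score > c for c in cohort_scores)

def false_claim_correct(claim_score, cohort_scores):
    # A false claim is correctly verified when at least one cohort
    # score exceeds it.
    return any(c > claim_score for c in cohort_scores)

print(true_claim_correct(0.8, [0.3, 0.5, 0.4]))   # True
print(false_claim_correct(0.4, [0.3, 0.5, 0.2]))  # True
```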
5.6.2 Reduction of overlap in score distribution by T-norm
Figure 5.3 shows the effect of T-norm on the distribution of hard-LCSS scores. The
overlap between the true and imposter score distributions is clearly reduced. For
visualization purposes, the true score distributions are scaled to zero mean and
unit variance, and the corresponding imposter score distributions are scaled
accordingly.
Figure 5.3: The effect of T-norm on the score distribution. The left panel shows the densities of the true and imposter LCSS (hard) scores without normalization; the right panel shows the same distributions with T-norm.
5.6.3 Scalability of raga verification
The verification of a raga depends only on its cohort ragas, of which there are
usually 4 or 5. Since it does not depend on all the ragas in the dataset, as raga
identification does, any number of ragas can be added to the dataset.
5.7 Summary
In this chapter, we have presented a different approach to raga analysis in Carnatic
music: instead of raga identification, raga verification is performed. A set of cohorts
is defined for every raga. An audio clip is presented together with a claimed identity.
The claim is verified by comparing the clip with the templates of the claimed raga
and of its cohorts using a novel approach, the Longest Common Segment Set (LCSS).
A set of 17 target ragas, which together with their cohorts constitutes 30 ragas, is
tested using appropriate score normalization techniques, and an equal error rate of
about 12% is achieved. The approach scales to any number of ragas, as only the
given raga and its cohorts need to be added to the system.
CHAPTER 6
Conclusion
Typical motifs of a raga are used to establish its identity in all improvisational and
compositional forms. Along with raga identity, typical motifs can also be used to
index a recording for archival purposes. Further, indexed motifs can also be used
to explore and to analyze the melodic phrases connecting them, which could be
useful for both listeners and learners of Carnatic music.
The objective of this thesis was to develop algorithmic techniques for automatic
extraction of typical motifs and for performing raga verification using the regions
replete with typical motifs. Some of the salient points presented in this thesis are
as follows:
6.1 Salient Points
• It was shown using pitch histograms that the notes in Carnatic music have a greater pitch range than those in Hindustani music and Western classical music. This renders the symbolic representation of Carnatic music a non-trivial task and poses significant challenges for the analysis of Carnatic music.
• The stationary points of the pitch contour were shown to preserve the essential raga information; however, the exact melodic information is lost. For the task of finding different renditions of typical motifs, these stationary points were used to reduce the search space. A measure based on the slope of the linear trend in the stationary points, along with its standard deviation, is used to reduce false alarms.
• An algorithm was proposed for time-series matching which is a modification of an existing algorithm known as Rough Longest Common Subsequence. This algorithm can match shorter sequences that are common between two longer sequences. However, its score is penalized with respect to the length of the longer sequences; matched shorter sequences can therefore get low scores, suggesting that the match is poor even when it is good.
• The second algorithm proposed for time-series matching, the Longest Common Segment Set, is novel. It can also match shorter sequences that are common between two longer sequences, but its score is not penalized with respect to the length of the longer sequences. It is therefore more effective at extracting the common shorter sequences.
• Typical motifs of approximately four seconds duration were found to be more relevant for raga identity. Shorter motifs had less context and resulted in a great deal of false alarms.
• Typical motifs were found to be prevalent in the pallavi lines of the compositions. Therefore, these pallavi lines were used in the task of raga verification.
• In raga verification, cohort ragas (usually four or five) were used for normalizing the score, instead of all the ragas in the dataset. The proposed raga verification system is therefore scalable to any number of ragas: for a new raga to be added to the system, only the templates of the new raga and its cohorts are required, without altering the existing system.
6.2 Criticism of the work
In this section, we discuss the shortcomings of the approaches proposed in this
thesis.

• The proposed algorithms for time-series matching require that the ordering of the common shorter sequences be the same in both longer sequences. If the ordering is different, not all of the common shorter sequences are matched.
• The algorithms also fail to match sequences that are in different octaves.
• The performance of the algorithms is sensitive to pitch errors. This problem is dealt with to some extent by smoothing the pitch contours, provided the pitch errors are not significantly large.
• Typical motifs are retrieved only if they repeat across the composition lines. This approach therefore relies on a large number of composition lines.
• Raga verification also needs a large number of composition lines (templates), so that most of the typical motifs are represented.
6.3 Future work
Given the drawbacks listed in the previous section, the following improvements
can be made:

• For time-series matching, when the ordering of the common shorter sequences differs, no single alignment can align all of them. In such situations, different alignments can be inspected to extract all the common shorter sequences irrespective of their order.
• For matching sequences that belong to different octaves, one of the two sequences can be shifted to the other octaves and the matching performed against all the shifted versions.
• Instead of using pitch to represent the melody, a transformation of the frequency spectrum can be used that suppresses other noise while preserving the melody. This would improve the performance of the algorithms, which are sensitive to pitch errors.
LIST OF PAPERS BASED ON THESIS
1. Shrey Dutta, Krishnaraj Sekhar PV and Hema A. Murthy. Raga Verificationin Carnatic Music using Longest Common Segment Set. In Proceedings of16th International Society for Music Information Retrieval Conference, 2015.
2. Shrey Dutta and Hema A. Murthy. Discovering Typical Motifs of a Raga fromOne-Liners of Songs in Carnatic Music. In Proceedings of 15th InternationalSociety for Music Information Retrieval Conference, pages 397–402, 2014.
3. Shrey Dutta and Hema A. Murthy. A modified rough longest commonsubsequence algorithm for motif spotting in an Alapana of Carnatic Music.In 20th National Conference on Communications (NCC), pages 1–6, 2014.
4. Vignesh Ishwar, Shrey Dutta, Ashwin Bellur and Hema A. Murthy. Motif Spotting in an Alapana in Carnatic Music. In Proceedings of 14th International Society for Music Information Retrieval Conference, pages 499–504, 2013.
REFERENCES
[1] Roland Auckenthaler, Michael Carey, and Harvey Lloyd-Thomas. Score normal-ization for text-independent speaker verification systems. Digital Signal Processing,10:42–54, 2000.
[2] Ashwin Bellur and Hema A Murthy. A cepstrum based approach for identifying tonicpitch in indian classical music. In National Conference on Communications, pages 1–5,2013.
[3] Yueguo Chen, Mario A. Nascimento, Beng Chin Ooi, and Anthony K. H. Tung. Spade: On shape-based pattern detection in streaming time series. In International Conference on Data Engineering, pages 786–795, 2007.
[4] Bill Chiu, Eamonn Keogh, and Stefano Lonardi. Probabilistic discovery of time seriesmotifs. In Proceedings of the Ninth ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining, pages 493–498, 2003.
[5] P Chordia and A Rae. Raag recognition using pitch- class and pitch-class dyaddistributions. In Proceedings of International Society for Music Information RetrievalConference, ISMIR, pages 431–436, 2007.
[6] Tom Collins, Andreas Arzt, Sebastian Flossmann, and Gerhard Widmer. Siarct-cfp:Improving precision and the discovery of inexact musical patterns in point-set repre-sentations. In Internation Society for Music Information Retrieval, pages 549–554, 2013.
[7] Tom Collins, Jeremy Thurlow, Robin Laney, Alistair Willis, and Paul H. Garthwaite.A comparative evaluation of algorithms for discovering translational patterns inbaroque keyboard works. In International Society for Music Information Retrieval, pages3–8, 2010.
[8] Darrell Conklin. Discovery of distinctive patterns in music. Intelligent Data Analysis,pages 547–554, 2010.
[9] Darrell Conklin. Distinctive patterns in the first movement of brahms string quartetin c minor. Journal of Mathematics and Music, 4(2):85–92, 2010.
[10] Jonathan D. Cryer and Kung-Sik Chan. Time Series Analysis: with Applications in R.Springer, 2008.
[11] Pranay Dighe, Parul Agarwal, Harish Karnick, Siddartha Thota, and Bhiksha Raj.Scale independent raga identification using chromagram patterns and swara basedfeatures. In 2013 IEEE International Conference on Multimedia and Expo Workshops, SanJose, CA, USA, July 15-19, 2013, pages 1–4, 2013.
[12] Pranay Dighe, Harish Karnick, and Bhiksha Raj. Swara histogram based structuralanalysis and identification of indian classical ragas. In Proceedings of the 14th Inter-national Society for Music Information Retrieval Conference, ISMIR 2013, Curitiba, Brazil,November 4-8, 2013, pages 35–40, 2013.
[13] Subbarama Dikshitulu. Sangita sampradaya pradarsini. The Music Academy Madras,Vol. 2, 2011.
[14] D.P.W. Ellis and G.E. Poliner. Identifying ‘cover songs’ with chroma features anddynamic programming beat tracking. In Proceedings of IEEE International Conferenceon Acoustics, Speech and Signal Processing, volume 4, pages 1429–1432, 2007.
[15] F. N. Fritsch and R. E. Carlson. Monotone Piecewise Cubic Interpolation. SIAM Journalon Numerical Analysis, Vol. 17, No. 2., 1980.
[16] Toni Giorgino. Computing and visualizing dynamic time warping alignments in R:The dtw package. Journal of Statistical Software, 31(7):1–24, 2009.
[17] AnYuan Guo and Hava Siegelmann. Time-warped longest common subsequencealgorithm for music retrieval. In Proceedings of 5th International Conference on MusicInformation Retrieval (ISMIR), 2004. http://works.bepress.com/hava_siegelmann/13.
[18] H. G. Ranjani, S. Arthi, and T. V. Sreenivas. Shadja, swara identification and raga verification in alapana using stochastic models. In 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 29–32, 2011.
[19] I. S. H. Suyoto, A. L. Uitdenbogerd, and F. Scholer. Searching musical audio using symbolic queries. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):372–381, 2008.
[20] Vignesh Ishwar, Ashwin Bellur, and Hema A Murthy. Motivic analysis and its rel-evance to raga identification in carnatic music. In Workshop on Computer Music,Instanbul, Turkey, July 2012. http://compmusic.upf.edu/publications.
[21] J. Serra, G. K. Koduri, M. Miron, and X. Serra. Tuning of sung indian classical music. In Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR, pages 157–162, 2011.
[22] Berit Janssen, W. Bas de Haas, Anja Volk, and Peter van Kranenburg. Discoveringrepeated patterns in music: state of knowledge, challenges, perspectives. InternationalSymposium on Computer Music Modeling and Retrieval (CMMR), pages 225–240, 2013.
[23] Gopala Krishna Koduri, Sankalp Gulati, and Preeti Rao. A survey of raaga recognitiontechniques and improvements to the state-of-the-art. Sound and Music Computing,2011.
[24] Gopala Krishna Koduri, Sankalp Gulati, Preeti Rao, and Xavier Serra. Raga recogni-tion based on pitch distribution methods. Journal of New Music Research, 41(4):337–350,2012.
[25] A.S. Krishna, P.V. Rajkumar, K.P. Saishankar, and M. John. Identification of carnaticraagas using hidden markov models. In Applied Machine Intelligence and Informatics(SAMI), 2011 IEEE 9th International Symposium on, pages 107 –110, jan. 2011.
[26] T. M. Krishna. A Southern Music: The Karnatic Story, chapter 5. HarperCollins, India,2013.
[27] T M Krishna and Vignesh Ishwar. Carnatic music : Svara, gamaka, motif andraga identity. In Workshop on Computer Music, Instanbul, Turkey, July 2012. http://compmusic.upf.edu/publications.
[28] A. Krishnaswamy. Application of pitch tracking to south indian classical music. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 557–560, 2003.
[29] V. Kumar, H. Pandya, and C.V. Jawahar. Identifying ragas in indian music. In 22ndInternational Conference on Pattern Recognition (ICPR), pages 767–772, 2014.
[30] Hwei-Jen Lin, Hung-Hsuan Wu, and Chun-Wei Wang. Music matching based onrough longest common subsequence. Journal of Information Science and Engineering,pages 95–110, 2011.
[31] Lie Lu, Muyuan Wang, and Hong-Jiang Zhang. Repeating pattern discovery and structure analysis from acoustic music data. In Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, pages 275–282, 2004.
[32] David Meredith, Kjell Lemström, and Geraint A. Wiggins. Algorithms for discovering repeated patterns in multidimensional representations of polyphonic music. Journal of New Music Research, pages 321–345, 2002.
[33] Meinard Müller, Frank Kurth, and Michael Clausen. Audio matching via chroma-based statistical features. In Proceedings of International Society for Music Information Retrieval (ISMIR), pages 288–295, 2005.
[34] Jiri Navratil and David Klusacek. On linear DETs. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 229–232, 2007.
[35] Gaurav Pandey, Chaitanya Mishra, and Paul Ipe. Tansen: A system for automatic raga identification. In Indian International Conference on Artificial Intelligence, pages 1350–1363, 2003.
[36] Pranav Patel, Eamonn Keogh, Jessica Lin, and Stefano Lonardi. Mining motifs in massive time series databases. In Proceedings of IEEE International Conference on Data Mining (ICDM '02), pages 370–377, 2002.
[37] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press, 1992.
[38] P. Rao, J. Ch. Ross, K. K. Ganguli, V. Pandit, V. Ishwar, A. Bellur, and H. A. Murthy. Melodic motivic analysis of Indian music. Journal of New Music Research, 43(1):115–131, 2014.
[39] Joe Cheri Ross, Vinutha T. P., and Preeti Rao. Detecting melodic motifs from audio for Hindustani classical music. In Proceedings of 13th International Society for Music Information Retrieval (ISMIR), pages 193–198, 2012.
[40] Joe Cheri Ross and Preeti Rao. Detection of raga-characteristic phrases from Hindustani classical music audio. Workshop on Computer Music, 2012. http://compmusic.upf.edu/publications.
[41] Justin Salamon and Emilia Gómez. Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1759–1770, August 2012.
[42] Sridharan Sankaran, Krishnaraj P. V., and Hema A. Murthy. Automatic segmentation of composition in Carnatic music using time-frequency CFCC templates. In Proceedings of 11th International Symposium on Computer Music Multidisciplinary Research (CMMR), 2015.
[43] J. Serra, E. Gomez, P. Herrera, and X. Serra. Chroma binary similarity and local alignment applied to cover song identification. IEEE Transactions on Audio, Speech, and Language Processing, 16(6):1138–1151, August 2008.
[44] Joan Serra, Gopala K. Koduri, Marius Miron, and Xavier Serra. Assessing the tuning of sung Indian classical music. In Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011, Miami, Florida, USA, October 24–28, 2011, pages 157–162, 2011.
[45] Sankalp Gulati, Joan Serra, and Xavier Serra. An evaluation of methodologies for melodic similarity in audio recordings of Indian art music. In Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015, pages 678–682, April 2015.
[46] Surendra Shetty. Raga mining of Indian music by extracting arohana-avarohana pattern. International Journal of Recent Trends in Engineering, 1(1), 2009.
[47] Rajeswari Sridhar and T. V. Geetha. Raga identification of Carnatic music for music information retrieval. International Journal of Recent Trends in Engineering, 1(1):1–4, 2009.
[48] M. Subramanian. Carnatic ragam Thodi: pitch analysis of notes and gamakams. Journal of the Sangeet Natak Akademi, XLI(1):3–28, 2007.
[49] D. Swathi. Analysis of Carnatic music: A signal processing perspective. M.Tech. Thesis, IIT Madras, 2009.
[50] Alexandra L. Uitdenbogerd and Justin Zobel. Manipulation of music for melody matching. In Proceedings of the Sixth ACM International Conference on Multimedia (MULTIMEDIA '98), pages 235–240, 1998.