Identifying Words that are Musically Meaningful
David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet
Computer Audition Lab
UC San Diego
ISMIR
September 25, 2007
Introduction
Our Goal: Create a content-based music search engine for natural language queries.
– CAL Music Search Engine [SIGIR07]
Problem: how do we pick a vocabulary of musically meaningful words?
– A word is musically meaningful if it corresponds to a pattern that is present in the audio content
Solution: find words that are correlated with a set of acoustic signals
Two-View Representation
Consider a set of annotated songs. Each song is represented by:
1. Annotation vector in a Semantic Space
2. Audio feature vector(s) in an Acoustic Space
[Figure: three songs (Mustang Sally by The Commitments, Riverdance by Bill Whelan, Hot Pants by James Brown) plotted in a 2D Acoustic Space (axes x, y) and in a 2D Semantic Space (axes ‘funky’, ‘Ireland’).]
Semantic Representation
Vocabulary of words:
1. CAL500: 174 phrases from a human survey
   • Instrumentation, genre, emotion, usages, vocal characteristics
2. LastFM: ~15,000 tags from a social music site
3. Web Mining: 100,000+ words mined from text documents
Annotation Vector, denoted $s$
1. Each element represents the ‘semantic association’ between a word and the song.
2. Dimension ($D_S$) = size of vocabulary
3. Example: Frank Sinatra’s ‘Fly Me to the Moon’
   1. Vocabulary = {funk, jazz, guitar, female vocals, sad, passionate}
   2. Annotation ($s_i$) = [0/4, 3/4, 4/4, 0/4, 2/4, 1/4]
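Data is represented by an $N \times D_S$ matrix $S$ whose $i$-th row is the annotation vector $s_i$:
$$S = \begin{bmatrix} - \; s_1 \; - \\ \vdots \\ - \; s_i \; - \\ \vdots \\ - \; s_N \; - \end{bmatrix}$$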
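The Sinatra example can be reproduced with a small sketch. It assumes that each fraction such as 3/4 is the share of listeners who applied the word to the song; the individual votes below are hypothetical and were chosen only so that their average matches the annotation vector on the slide. The function name `annotation_vector` is likewise an illustrative choice.

```python
import numpy as np

vocabulary = ["funk", "jazz", "guitar", "female vocals", "sad", "passionate"]

def annotation_vector(votes):
    """Average binary listener votes (one row per listener) into one vector."""
    votes = np.asarray(votes, dtype=float)
    return votes.mean(axis=0)

# Hypothetical votes from 4 listeners (1 = listener applied the word).
votes = [
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
s_i = annotation_vector(votes)  # one row of the matrix S

for word, weight in zip(vocabulary, s_i):
    print(f"{word}: {weight:.2f}")
```

Stacking one such vector per song gives the matrix S.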
Acoustic Representation
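Each song is represented by an audio feature vector $a$ that is automatically extracted from the audio content.
Data is represented by an $N \times D_A$ matrix $A$ whose $i$-th row is the audio feature vector $a_i$:
$$A = \begin{bmatrix} - \; a_1 \; - \\ \vdots \\ - \; a_i \; - \\ \vdots \\ - \; a_N \; - \end{bmatrix}$$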
[Figure: Mustang Sally (The Commitments) shown as a point in the 2D Acoustic Space (axes x, y) and in the 2D Semantic Space (axes ‘funky’, ‘Ireland’).]
Canonical Correlation Analysis (CCA)
CCA is a technique for exploring dependencies between two related spaces.
– Generalization of PCA to multiple spaces
– Constrained optimization problem
• Find weight vectors $w_s$ and $w_a$:
– 1-D projection of the data in the semantic space: $S w_s$
– 1-D projection of the data in the acoustic space: $A w_a$
• Maximize the correlation of the projections
– $\max \; (S w_s)^T (A w_a)$
• Constrain $w_s$ and $w_a$ to prevent infinite correlation
$$\max_{w_a, w_s} \; (S w_s)^T (A w_a) \quad \text{subject to} \quad (S w_s)^T (S w_s) = 1, \quad (A w_a)^T (A w_a) = 1$$
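The constrained problem above can be solved in closed form. The following is a minimal NumPy sketch of ordinary (non-sparse) CCA, not the authors' implementation. It assumes the columns of S and A should be mean-centered first, and it adds a small ridge term `reg` for numerical stability; the function name, the centering step, and `reg` are all assumptions made for illustration.

```python
import numpy as np

def first_canonical_pair(S, A, reg=1e-8):
    """Return (w_s, w_a, rho) maximizing (S w_s)^T (A w_a) subject to
    (S w_s)^T (S w_s) = 1 and (A w_a)^T (A w_a) = 1 (up to the ridge term)."""
    # Assumption: center each column so the objective is a correlation.
    S = S - S.mean(axis=0)
    A = A - A.mean(axis=0)
    Css = S.T @ S + reg * np.eye(S.shape[1])
    Caa = A.T @ A + reg * np.eye(A.shape[1])
    Csa = S.T @ A
    # Cholesky factors: Css = Ls Ls^T and Caa = La La^T.
    Ls = np.linalg.cholesky(Css)
    La = np.linalg.cholesky(Caa)
    # Whitened cross-covariance M = Ls^{-1} Csa La^{-T}.
    M = np.linalg.solve(Ls, Csa)
    M = np.linalg.solve(La, M.T).T
    U, sv, Vt = np.linalg.svd(M)
    # Map the top singular vectors back to the original coordinates.
    w_s = np.linalg.solve(Ls.T, U[:, 0])
    w_a = np.linalg.solve(La.T, Vt[0])
    return w_s, w_a, float(sv[0])
```

The top singular value `rho` equals the correlation between the two 1-D projections. Without a sparsity penalty, every entry of `w_s` is generally non-zero.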
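CCA Visualization
[Figure: songs a, b, c, and d plotted in the audio feature space (axes x, y) and in the semantic space (axes ‘funky’, ‘Ireland’).]
$$S = \begin{bmatrix} 1 & 1 \\ 0 & -1 \\ 0 & -1 \\ -1 & -1 \end{bmatrix}, \quad A = \begin{bmatrix} 1 & -1 \\ 1 & 1 \\ -1 & -1 \\ -1 & 1 \end{bmatrix}, \quad w_s = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad w_a = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
$$S w_s = \begin{bmatrix} 1 & 0 & 0 & -1 \end{bmatrix}^T, \quad A w_a = \begin{bmatrix} 2 & 0 & 0 & -2 \end{bmatrix}^T, \quad (S w_s)^T (A w_a) = 4$$
Sparse solution: $w_s = [1, 0]^T$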
What Sparsity means…
In the previous example,
• $w_{s,\text{‘funky’}} \neq 0$
‘funky’ is correlated with the audio signals, so it is a musically meaningful word
• $w_{s,\text{‘Ireland’}} = 0$
‘Ireland’ is not correlated: it has no linear relationship with the acoustic representation
In practice, $w_s$ is dense even if most words are uncorrelated
– ‘dense’ means many non-zero values
– due to random variability in the data
Key Idea: reformulate CCA to produce a sparse solution.
Introducing Sparse CCA [ICML07]
Plan: penalize the objective function for each non-zero semantic dimension
• Pick a penalty function $f(w_s)$
• Penalizes each non-zero dimension
• Take 1: Cardinality of $w_s$: $f(w_s) = \|w_s\|_0$
  • Combinatorial problem, NP-hard
• Take 2: L1 relaxation: $f(w_s) = \|w_s\|_1$
  • Non-convex, not very tight approximation
• Take 3: SDP relaxation
  • Prohibitively expensive for large problems
• Solution: $f(w_s) = \sum_i \log |w_{s,i}|$
  • Non-convex problem, but
  • Can be solved efficiently with a DC program
  • Tight approximation
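To compare the candidate penalties, the sketch below evaluates each one on a dense and a sparse weight vector with the same Euclidean norm. It is illustrative only. The offset `eps` inside the logarithm is an assumption added here so that the penalty stays finite at exact zeros, and the threshold `tol` in the cardinality count is also an assumption.

```python
import numpy as np

def l0_penalty(w, tol=1e-12):
    """Cardinality: number of entries whose magnitude exceeds tol."""
    return int(np.sum(np.abs(w) > tol))

def l1_penalty(w):
    """L1 norm: sum of absolute values."""
    return float(np.sum(np.abs(w)))

def log_penalty(w, eps=1e-6):
    """Sum of log magnitudes; eps (an assumption) avoids log(0)."""
    return float(np.sum(np.log(np.abs(w) + eps)))

dense = np.array([0.5, 0.5, 0.5, 0.5])   # unit Euclidean norm, 4 non-zeros
sparse = np.array([1.0, 0.0, 0.0, 0.0])  # unit Euclidean norm, 1 non-zero

for name, w in [("dense", dense), ("sparse", sparse)]:
    print(name, l0_penalty(w), l1_penalty(w), log_penalty(w))
```

All three penalties are smaller for the sparse vector. The log penalty separates the two cases most sharply because each near-zero entry contributes a large negative term.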
Introducing Sparse CCA [ICML07]
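• Use a tuning parameter (written $\lambda$ here) to control the importance of sparsity
• Increasing $\lambda$ yields a smaller set of ‘musically relevant’ words
$$\max_{w_a, w_s} \; (S w_s)^T (A w_a) - \lambda f(w_s) \quad \text{subject to} \quad (S w_s)^T (S w_s) = 1, \quad (A w_a)^T (A w_a) = 1$$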
Experimental Setup
CAL500 Data Set [SIGIR07]
– 500 songs by 500 artists
– Semantic Representation
  • 173 words
    – genre, instrumentation, usages, emotions, vocals, etc.
  • Annotation vector is the average from 3+ listeners
  • Word Agreement Score
    – measures how consistently listeners apply a word to songs
– Acoustic Representation
  • Bag of Dynamic MFCC vectors [McKinney03]
    – 52-D vector of spectral modulation intensities
    – 160 vectors per minute of audio content
  • Duplicate the annotation vector for each Dynamic MFCC vector
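The duplication step can be sketched as follows: each song contributes one row to A per feature vector, and its annotation vector is repeated so that S and A have the same number of rows. The function name `stack_song_data` and the input layout (one annotation vector and one feature array per song) are assumptions made for illustration.

```python
import numpy as np

def stack_song_data(annotations, song_features):
    """Build row-aligned matrices S and A from per-song data.

    annotations:   one annotation vector per song
    song_features: one (n_vectors x D_A) array of feature vectors per song
    """
    S_rows, A_rows = [], []
    for s, feats in zip(annotations, song_features):
        feats = np.asarray(feats, dtype=float)
        s = np.asarray(s, dtype=float)
        A_rows.append(feats)
        # Repeat the song's annotation once per feature vector.
        S_rows.append(np.tile(s, (feats.shape[0], 1)))
    return np.vstack(S_rows), np.vstack(A_rows)
```

With 160 feature vectors per minute of audio, the stacked matrices have far more rows than there are songs.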
Experiment 1: Qualitative Results
Words with high acoustic correlation
hip-hop, arousing, sad, drum machine, heavy beat, at a party, rapping
Words with no acoustic correlation
classic rock, normal, constant energy, going to sleep, falsetto
Experiment 2: Vocabulary Pruning
AMG2131 Text Corpus [ISMIR06]
– AMG Allmusic song reviews for most of the CAL500 songs
– 315-word vocabulary
– Annotation vector based on the presence or absence of a word in the review
– Noisier word-song relationships than CAL500
Experimental Design:
1. Merge vocabularies: 173 + 315 = 488 words
2. Prune noisy words as we increase the amount of sparsity in CCA
Hypothesis:
– AMG words will be pruned before CAL500 words
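The pruning step can be sketched as below: given a semantic weight vector from Sparse CCA at some sparsity level, keep the words with non-zero weight and count the survivors per source vocabulary. The function name, the threshold `tol`, and the example words, sources, and weights are hypothetical; they are not results from the experiment.

```python
from collections import Counter

def prune_vocabulary(words, sources, w_s, tol=1e-8):
    """Keep words whose weight magnitude exceeds tol; count survivors per source."""
    kept = [(word, src) for word, src, weight in zip(words, sources, w_s)
            if abs(weight) > tol]
    counts = Counter(src for _, src in kept)
    return [word for word, _ in kept], counts

# Hypothetical merged vocabulary and weights, for illustration only.
words = ["cal_word_1", "cal_word_2", "amg_word_1", "amg_word_2"]
sources = ["CAL500", "CAL500", "AMG2131", "AMG2131"]
w_s = [0.8, 0.3, 0.0, -0.1]

kept_words, counts = prune_vocabulary(words, sources, w_s)
print(kept_words, dict(counts))
```

Repeating this at increasing sparsity levels gives the per-source word counts needed to test the hypothesis.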
Experiment 2: Vocabulary Pruning
Result: As Sparse CCA becomes more aggressive, more AMG words are pruned.
Vocabulary Size   # CAL500 Words   # AMG2131 Words   Fraction of AMG2131 Words
488               173              315               0.64
249               118              131               0.52
149               85               64                0.42
50                39               11                0.22
Experiment 3: Vocabulary Selection
Experimental Design:
1. Rank words by
   • how aggressive Sparse CCA is before the word gets pruned.
   • how consistently humans use a word across the CAL500 corpus.
2. As we decrease vocabulary size, calculate average AROC
Result: Sparse CCA does predict words that have better AROC
[Figure: average AROC (y-axis, 0.68 to 0.76) versus vocabulary size (x-axis: 173, 120, 70, 20).]
Recap
Constructing a ‘meaningful vocabulary’ is the first step in building a content-based, natural-language search engine for music.
Given a semantic representation and an acoustic representation, Sparse CCA can be used to find ‘musically meaningful’ words.
– i.e., semantic dimensions linearly correlated with audio features
Automatically pruning words is important when using noisy sources of semantic information
– e.g., LastFM tags or Web documents
Future Work
Theory: moving beyond linear correlation with kernel methods
Application: Sparse CCA can be used to find ‘musically meaningful’ audio features by imposing sparsity in the acoustic space
Practice: handling large, noisy semantically annotated music corpora
Experiment 3: Vocabulary Selection
Our content-based music search engine rank orders songs given a text-based query [SIGIR 07]
– Area under the ROC curve (AROC) measures quality of each ranking
• 0.5 is random, 1.0 is perfect
• 0.68 is average AROC for all 1-word queries
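AROC for a single query can be computed as the fraction of (relevant, irrelevant) song pairs that the ranking orders correctly. The sketch below uses that pairwise definition, with ties counted as half. The function name and the example scores are hypothetical, and this is not the evaluation code used for the reported numbers.

```python
import numpy as np

def aroc(scores, relevant):
    """Area under the ROC curve for one ranked list.

    scores:   higher means ranked earlier for the query
    relevant: True where the song is truly associated with the query word
    """
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(relevant, dtype=bool)
    pos, neg = scores[relevant], scores[~relevant]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one relevant and one irrelevant song")
    # Compare every relevant song with every irrelevant song.
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

perfect = aroc([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
mixed = aroc([0.9, 0.3, 0.8, 0.1], [True, True, False, False])
print(perfect, mixed)
```

Averaging this value over all one-word queries gives the average AROC used to compare vocabularies.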
Can Sparse CCA pick words that will have higher AROC?
– Idea: words with high correlation have more signal in the audio representation and will be easier to model.
– How does it compare with picking words that humans consistently use to label songs?