
Advanced Multimedia

Music Information Retrieval
Tamara Berg

Announcements

• Still missing a few assignment 1’s

• Assignment 2 is online – due March 10

Audio Indexing and Retrieval

• Motivation
• Features for representing audio:
  – Metadata
  – Low level features
  – High level audio features
• Example usage cases:
  – Audio classification
  – Music retrieval

Howard Leung

Content Based Music Retrieval

Extract music descriptions from a database of music documents.

Extract music description from a query music document.

Compute match between query and database descriptions.

Retrieve music documents similar to the query.

Casey et al IEEE 2008

MIR tasks

H: high level specificity – match specific instances of audio content.

M: mid-level specificity – match high level audio features like melody, but do not require matching the exact audio content.

L: low specificity – match global (statistical) properties of the query

Different usage cases require different descriptions and matching schemes.

Casey et al IEEE 2008

Metadata

• Most common method of accessing music
• Can be rich and expressive
• When catalogues become very large, it is difficult to maintain consistent metadata

Useful for low specificity queries

Casey et al IEEE 2008

Metadata

• Pandora.com – uses metadata to estimate artist similarity and track similarity and creates personalized radio stations. Human-entered metadata of musical-cultural properties (20-30 minutes of an expert’s time per track – 50 person-years for 1 million tracks).

• User contributed metadata repositories (Gracenote, MusicBrainz). Factual metadata (artist, album, year, title, duration). Cultural metadata (mood, emotion, genre, style).

• Automatic metadata methods – generate descriptions from community metadata automatically. Language analysis to associate noun and verb phrases with musical features (Whitman & Rifkin).

Casey et al IEEE 2008

Content features

• Low level or high level
• Want features to be robust to certain changes in the audio signal (why?)
  – Noise
  – Volume
  – Sampling
• High level features will be more robust to changes; low level features will be less robust.
• Low level features will be easy to compute; high level features will be difficult.

Low level audio features

• Low level measurements of audio signal that contain information about a musical work.

• Can be computed periodically (10-1000 ms intervals) or beat synchronous.

Casey et al IEEE 2008

In text analysis we had words; here we have to come up with our own set of features to compute from the audio signal!

Example Low-Level Audio Features


Zero-crossing rate: the average number of times the signal crosses the zero amplitude value,

ZCR = (1/(N-1)) * sum over n = 2..N of 1{ s(n)·s(n-1) < 0 },  where the indicator 1{·} is 1 if true, 0 otherwise.

Howard Leung
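A minimal MATLAB/Octave sketch of the zero-crossing rate for a single frame; the frame vector s and the framing are assumptions, not given in the slides:

```matlab
% Zero-crossing rate of one frame; s is an assumed vector of frame samples.
N = length(s);
crossings = s(1:N-1) .* s(2:N) < 0;   % 1 where consecutive samples change sign
zcr = sum(crossings) / (N - 1);       % average number of crossings per sample
```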

Frequency Domain Reminder

Signals can be decomposed into a weighted sum of sinusoids.

How much of each sinusoid is present describes the frequency spectrum of the signal.

Li & Drew

Frequency domain features

• How do we get to frequency domain?

Time → Frequency

DFT

Discrete Fourier Transform (DFT) of the audio converts it to a frequency representation.

DFT analysis occurs in terms of a number of equally spaced ‘bins’.

Each bin represents a particular frequency range. DFT analysis gives the amount of energy in the audio signal that is present within the frequency range for each bin.

Inverse Discrete Fourier Transform (IDFT): converts from the frequency representation back to the audio signal.


Howard Leung
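A small MATLAB/Octave sketch of the DFT description above; the test tone, sample rate, and DFT length are assumptions chosen for illustration:

```matlab
% A 1 kHz tone sampled at 8 kHz; its energy lands in one DFT bin.
fs = 8000; N = 1024;
t = (0:N-1) / fs;
x = sin(2*pi*1000*t);            % time-domain signal
X = fft(x);                      % DFT: N equally spaced frequency bins
binFreqs = (0:N-1) * fs / N;     % centre frequency of each bin
energy = abs(X).^2;              % energy present in each bin
[~, k] = max(energy(1:N/2));     % strongest positive-frequency bin
fprintf('Peak near %.1f Hz\n', binFreqs(k));   % ~1000 Hz
xr = ifft(X);                    % IDFT: back to the audio signal
```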

Filtering

Removes frequency components from some part of the spectrum.

Low pass filter – removes high frequency components from the input and leaves only the low frequencies in the output signal.

High pass filter – removes low frequency components from the input and leaves only the high frequencies in the output signal.

Band pass filter – removes some part of the frequency spectrum.

How could you do this using the FT and IFT?

Compute FT spectrum of input.

Zero out the part of the frequency spectrum that you want to filter out.

Compute the IFT of this modified spectrum -> output will be input with some frequency components removed.
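A minimal MATLAB/Octave sketch of these three steps as a low pass filter; the input name f, the sample rate, and the cutoff are assumptions:

```matlab
% FT-based low-pass filtering: FT, zero out high-frequency bins, IFT.
fs = 8000; cutoff = 1000;          % assumed sample rate and cutoff in Hz
f = f(:).';                        % f = input signal, forced to a row vector
N = length(f);
F = fft(f);                        % FT spectrum of the input
freqs = (0:N-1) * fs / N;          % frequency associated with each bin
keep = (freqs <= cutoff) | (freqs >= fs - cutoff);   % low band + its mirror
F(~keep) = 0;                      % zero out the part to filter out
o = real(ifft(F));                 % IFT -> frequency-limited output
```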

How could you do this using the FT and IFT?

[Diagram, built up over several slides: f = input → FT(f) → multiply the spectrum point-wise (.*) by a 0/1 mask to zero out some frequency components → IFT → o = frequency limited output.]

What kind of filter is this?

How could you do this using the FT and IFT?

[Same diagram built up again with a different 0/1 mask on the spectrum.]

What kind of filter is this?

Filtering

Alternatively you can convolve the input signal with a filter to get a frequency limited output signal.

Convolution (we’ll see this again for images):

(convolution demo)

f = 1  3  2  5  3  2  4  5        (signal)
g = 1/3  1/3  1/3                 (filter)

f ★ g = –  2  10/3  10/3  10/3  3  11/3  –

What does this filter do?

Filtering

Alternatively you can convolve the input signal with a filter to get a frequency limited output signal.

Convolution:

(convolution demo)

f = 1  3  2  5  3  2  4  5        (signal)
g = 1/4  1/2  1/4                 (filter)

f ★ g = –  2.25  3  3.75  3.25  2.75  3.75  –

In general filters will have a more complex effect on the output.

What is convolution doing?
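Both worked examples can be checked with MATLAB/Octave's conv; the 'valid' option keeps only positions where the filter fully overlaps the signal (the dashed boundary positions above are dropped):

```matlab
f  = [1 3 2 5 3 2 4 5];            % signal
g1 = [1/3 1/3 1/3];                % 3-point moving average
g2 = [1/4 1/2 1/4];                % weighted smoothing filter
conv(f, g1, 'valid')               % -> 2  3.3333  3.3333  3.3333  3  3.6667
conv(f, g2, 'valid')               % -> 2.25  3  3.75  3.25  2.75  3.75
```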

Relationship

f = input, F = FT(f);  g = filter, G = FT(g)

Signal space:      f  ──►  f ★ g
                   │FT        │FT
Frequency space:   F  ──►  F .* G

f ★ g = IFT(F .* G)
F .* G = FT(f ★ g)

Theorem: Convolution in signal space is equivalent to point-wise multiplication in frequency space.
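A quick numeric check of the theorem, reusing the f and g from the convolution example above; both are zero-padded to the length of the full linear convolution so that the DFT's circular convolution matches it:

```matlab
f = [1 3 2 5 3 2 4 5];
g = [1/3 1/3 1/3];
L = length(f) + length(g) - 1;     % length of the full linear convolution
F = fft(f, L);                     % FT of the zero-padded input
G = fft(g, L);                     % FT of the zero-padded filter
lhs = conv(f, g);                  % f ★ g in signal space
rhs = real(ifft(F .* G));          % IFT(F .* G) in frequency space
max(abs(lhs - rhs))                % ~1e-15: the two agree
```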

Matlab demo

soundFilt/demo.m

Howard Leung

Pitch-Class Profile (PCP)

• Represents the energy due to each pitch class
• Integrates the energy in all octaves into a single band
• There are 12 equally spaced pitch classes in western tonal music, so typically 12 bands in the PCP.


How might we calculate this using the DFT?

Howard Leung
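One plausible answer, sketched in MATLAB/Octave: map each DFT bin's frequency to a pitch class and accumulate its energy. The signal name, sample rate, and 440 Hz reference pitch are assumptions:

```matlab
% Pitch-class profile (chroma) from DFT magnitudes; x is an assumed mono
% signal sampled at fs Hz.
N = length(x);
X = abs(fft(x));                       % magnitude spectrum
freqs = (0:N-1) * fs / N;              % frequency of each DFT bin
pcp = zeros(1, 12);
for k = 2:floor(N/2)                   % skip DC, use positive frequencies
    pitch = 69 + 12 * log2(freqs(k) / 440);   % MIDI-style pitch number
    class = mod(round(pitch), 12) + 1;        % fold all octaves into 12 classes
    pcp(class) = pcp(class) + X(k)^2;         % accumulate energy per class
end
pcp = pcp / sum(pcp);                  % normalize the 12-band profile
```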

High level music features

High level intuitive information about a piece of music (melody, harmony etc).

“It is melody that enables us to distinguish one work from another. It is melody that human beings are innately able to reproduce by singing, humming, and whistling. It is melody that makes music memorable: we are likely to recall a tune long after we have forgotten its text.”

– Selfridge-Field

Intuitive features, but they are hard to extract and remain ongoing areas of research.

Casey et al IEEE 2008

Melody & Bass Estimation

• Melody and bass lines represented as continuous temporal trajectory of fundamental frequency, F0 (a series of musical notes).

• PreFEst (Predominant-F0 Estimation method – Goto 1999)
  – Estimate the F0 trajectory in the mid-high frequency range of the input -> melody.
  – Estimate the F0 trajectory in the low frequency range -> bass.

Casey et al IEEE 2008
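PreFEst itself models the harmonic structure probabilistically; as a much simpler illustration of estimating F0 for one frame, here is a hedged autocorrelation sketch. The frame variable, sample rate, and search range are assumptions, and this is not the PreFEst algorithm:

```matlab
% Per-frame F0 estimate by autocorrelation (illustration only, not PreFEst).
% frame is an assumed vector of samples longer than maxLag, at fs Hz.
fs = 16000;
minF0 = 80; maxF0 = 800;                       % assumed F0 search range in Hz
minLag = round(fs / maxF0);
maxLag = round(fs / minF0);
r = zeros(1, maxLag);
for lag = minLag:maxLag                        % autocorrelation at candidate lags
    r(lag) = sum(frame(1:end-lag) .* frame(1+lag:end));
end
[~, i] = max(r(minLag:maxLag));                % strongest periodicity in range
f0 = fs / (minLag + i - 1);                    % fundamental frequency estimate
```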

Chord Recognition

Musical performance is assumed to travel through a sequence of states.

Hidden Markov Model (HMM – probabilistic model good for modeling sequences of data, here sequences of chords over time) is used to model these transitions and predict the best chord sequence given a set of observations (PCP).

Transition model – Probability of transitioning from one chord to another

Output model – Probability of a PCP given a chord.

Casey et al IEEE 2008
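A hedged sketch of the decoding step with the Viterbi algorithm; the matrices obsLik (output model), A (transition model), and prior are assumed inputs, not something given in the slides:

```matlab
% Viterbi decoding of the most likely chord sequence.
% obsLik(k,t) = P(PCP at time t | chord k), A(i,j) = P(chord j | chord i),
% prior(k) = P(first chord is k).  All three are assumptions.
[K, T] = size(obsLik);
logA = log(A);
delta = log(prior(:)) + log(obsLik(:,1));    % best log-prob ending in each chord
psi = zeros(K, T);                            % back-pointers
for t = 2:T
    scores = repmat(delta, 1, K) + logA;      % scores(i,j): end at i, move to j
    [best, arg] = max(scores, [], 1);         % best predecessor for each chord j
    delta = best(:) + log(obsLik(:,t));
    psi(:,t) = arg(:);
end
chords = zeros(1, T);                         % most likely chord sequence
[~, chords(T)] = max(delta);
for t = T-1:-1:1
    chords(t) = psi(chords(t+1), t+1);
end
```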


Music Structure

• Segment into temporal regions with some internal consistency
  – Beat segmentation
  – Verse, chorus, bridge
  – Speech vs music
• Uses:
  – Facilitate audio editing
  – Improve similarity measurements by removing irrelevant parts or selecting the most representative parts (for recommender systems).

Casey et al IEEE 2008

Music Structure

Detect repeated structures and label them as being the same.

Music as vector of features

• Once again we represent (music) documents as a vector of numbers
  – Each entry (or set of entries) in this vector is a different feature


• To retrieve music documents given a query we can:
  – Find exact matches
  – Find the nearest match
  – Find nearby matches
  – Train a classifier to recognize a given category (genre, style, etc.).
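For example, a minimal nearest-match retrieval sketch in MATLAB/Octave, assuming D is an M x d matrix of database feature vectors and q is a 1 x d query vector (both names are assumptions):

```matlab
diffs = bsxfun(@minus, D, q);      % difference from the query to every document
dists = sqrt(sum(diffs.^2, 2));    % Euclidean distance per music document
[~, order] = sort(dists);          % rank documents, closest first
top5 = order(1:5);                 % indices of the 5 nearest matches
```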

Audio Similarity

We have a description of a music document based on some set of features, now how do we compare two descriptions?

Casey et al IEEE 2008

Usage examples

[Usage example figures: Howard Leung]

Query by humming

• Requires high level features because matches will not be exact
• Extract melody from the dataset of songs
• Extract melody from the hum
• Match by comparing similarities of melodies (nearby matches)

Copyright monitoring

• Compute fingerprints from database examples
• Compute a fingerprint from the query example
• Find exact matches

Best performing systems on MIREX 2007

Casey et al IEEE 2008

Music Browsing

Musicream – UI for discovering and managing musical pieces.

The user can select a disc and listen to it. By dragging a disc in the flow, the user can easily pick out other similar pieces (attach similar discs). This interaction allows a user to unexpectedly come across various pieces similar to other pieces the user likes.

Link to demo

Casey et al IEEE 2008

Music Browsing

Musicrainbow – UI for discovering unknown artists.

Artists are mapped on a circular rainbow where colors represent different styles of music. Similar artists are mapped near each other.

The user rotates the rainbow by turning a knob.

Link to demo

Casey et al IEEE 2008

Howard Leung
