2015/10/10

Query-by-Singing/Humming: An Overview

J.-S. Roger Jang (張智星)

Multimedia Information Retrieval Lab

CS Dept., Tsing Hua Univ., Taiwan

http://mirlab.org/jang

-2-

Outline

Introduction
Methods for QBSH
  Pitch tracking
  Database comparison
Demos and commercial applications
Conclusions

-3-

Categories of Music Information Retrieval (MIR)

Metadata-based
  Example: song title, singer, tags, lyricist, composer
  Query input: text or speech
Content-based
  Example: melody, chord, note onsets, moods...
  Query input:
    Symbolic: notes, chords, text
    Acoustic: humming, whistling, tapping

-4-

Acoustic Inputs for MIR

Singing/humming
  Query by humming (usually "ta" or "da")
  Query by singing
Whistling
  Query by whistling
Tapping
  Query by tapping (at the onsets of notes)
Speech
  Query by the user's speech input (for metadata)
Original-audio example
  Query by recordings of the original music made on mobile phones
Beatboxing

-5-

Introduction to QBSH

QBSH: Query by Singing/Humming
  Input: Singing or humming from a microphone
  Output: A ranked list retrieved from the song database
Progression
  First paper: around 1994
  Extensive studies since 2001
  State of the art: QBSH tasks at ISMIR/MIREX

-6-

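The QBSH Workflow

Preprocessing:
  Collect single-track reference melodies (usually MIDI files)
  Convert them into an intermediate format suitable for comparison
Real-time processing:
  Convert the user's audio input into a pitch vector
  Convert the pitch vector into notes (optional)
  Compare against the reference melodies
  List the ranked results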

-7-

Flowchart of QBSH

[Figure: flowchart of QBSH]
  On-line processing: microphone input, filtering, pitch tracking, pitch vector smoothing
  Off-line processing: MIDI files, melody track extraction, frame-based representation
  Both branches feed the similarity comparison, which produces the query results (ranked song list)

-8-

Pitch Tracking for QBSH

Two categories of pitch tracking algorithms
  Time domain
    ACF (Autocorrelation function)
    AMDF (Average magnitude difference function)
    SIFT (Simplified inverse filter tracking)
  Frequency domain
    Harmonic product spectrum method
    Cepstrum method

-9-

Frame Blocking for Pitch Tracking

Frame size = 256 points
Overlap = 84 points
Frame rate = 11025/(256-84) ≈ 64 pitch points/sec

[Figure: waveform (amplitude vs. sample index) with a zoomed-in view marking one frame and its overlap with the next frame]
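The frame blocking step can be sketched as follows, using the frame size, overlap, and sample rate quoted on this slide. The function name `frame_blocking` and the stand-in signal are assumptions for illustration only.

```python
def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a sequence of samples into overlapping frames (incomplete tail is dropped)."""
    hop = frame_size - overlap  # 172 samples between consecutive frame starts
    frames = []
    start = 0
    while start + frame_size <= len(signal):
        frames.append(signal[start:start + frame_size])
        start += hop
    return frames

fs = 11025
signal = list(range(fs))          # stand-in for one second of audio samples
frames = frame_blocking(signal)
frame_rate = fs / (256 - 84)      # pitch points per second, one per frame
print(len(frames), round(frame_rate, 1))
```

Each frame later yields one pitch value, so the hop size (frame size minus overlap) is what sets the length of the pitch vector.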

-10-

ACF: Auto-correlation Function

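Frame: s(i), i = 0, ..., n-1
Shifted frame: s(i+τ), shown for τ = 30
acf(30) = inner product of the overlapping part

$$\mathrm{acf}(\tau) = \sum_{i=0}^{n-1-\tau} s(i)\, s(i+\tau)$$

[Figure: the frame, the shifted frame, and the resulting ACF curve with the pitch period marked]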
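A minimal sketch of ACF-based pitch estimation for a single frame, following the definition above. The function names, the synthetic sine-wave frame, and the lag search range (derived from the 82 Hz to 1047 Hz pitch range quoted elsewhere in these slides) are assumptions for illustration, not the toolbox implementation.

```python
import math

def acf(frame, tau):
    """acf(tau): inner product of the frame with itself shifted by tau samples."""
    return sum(frame[i] * frame[i + tau] for i in range(len(frame) - tau))

def pitch_by_acf(frame, fs, min_freq=82.0, max_freq=1047.0):
    """Return the pitch in Hz for one frame, from the lag that maximizes the ACF."""
    min_lag = int(fs / max_freq)                       # shortest period considered
    max_lag = min(int(fs / min_freq), len(frame) - 1)  # longest period considered
    best_lag = max(range(min_lag, max_lag + 1), key=lambda tau: acf(frame, tau))
    return fs / best_lag

fs = 11025
n = round(0.032 * fs)  # one 32 ms frame
# Synthetic test tone at 220.5 Hz, i.e. a period of exactly 50 samples at 11025 Hz
frame = [math.sin(2 * math.pi * 220.5 * i / fs) for i in range(n)]
pitch = pitch_by_acf(frame, fs)
print(pitch)
```

Restricting the lag range keeps the trivial peak at τ = 0 out of the search.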

-11-

Pitch Tracking via ACF

Specs
  Sample rate = 11025 Hz
  Frame size = 32 ms
  Overlap = 0
  Frame rate = 31.25 frames/sec
Playback
  soo.wav
  sooPitch.wav

-12-

AMDF: Average Magnitude Difference Function

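Frame: s(i), i = 0, ..., n-1
Shifted frame: s(i+τ), shown for τ = 30
amdf(30) = sum of absolute differences over the overlapping part

$$\mathrm{amdf}(\tau) = \sum_{i=0}^{n-1-\tau} \left| s(i) - s(i+\tau) \right|$$

[Figure: the frame, the shifted frame, and the resulting AMDF curve with the pitch period marked]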
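A minimal sketch of AMDF-based period estimation, where the pitch period is the lag that minimizes the AMDF rather than maximizing it. The function names and the integer triangle-wave test frame are assumptions for illustration; the lag range 8 to 160 is borrowed from the UPDUDP slides.

```python
def amdf(frame, tau):
    """amdf(tau): sum of absolute differences between the frame and its shifted copy."""
    return sum(abs(frame[i] - frame[i + tau]) for i in range(len(frame) - tau))

def pitch_period_by_amdf(frame, min_lag=8, max_lag=160):
    """Return the lag (in samples) with the smallest AMDF; ties go to the smallest lag."""
    return min(range(min_lag, max_lag + 1), key=lambda tau: amdf(frame, tau))

# Integer triangle wave with a period of exactly 50 samples
frame = [abs((i % 50) - 25) for i in range(353)]
period = pitch_period_by_amdf(frame)
print(period, 11025 / period)  # period in samples, pitch in Hz at 11025 Hz
```

Multiples of the true period (100, 150, ...) also give a near-zero AMDF, which is one reason a per-frame minimum can jump between octaves.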

-13-

UPDUDP (1/4)

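UPDUDP: Unbroken Pitch Determination Using DP
Goal: To take pitch smoothness into consideration

$$\mathrm{cost}(\mathbf{p}, \theta, m) = \sum_{i=1}^{n} \mathrm{amdf}_i(p_i) + \theta \sum_{i=1}^{n-1} \left| p_i - p_{i+1} \right|^m$$

where
  $\mathbf{p} = [p_1, \ldots, p_i, \ldots, p_n]$: a given path in the AMDF matrix
  $n$: number of frames
  $\theta$: transition penalty
  $m$: exponent of the transition difference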

-14-

UPDUDP (2/4)

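Optimum-value function D(i, j): the minimum cost starting from frame 1 to position (i, j)

Recurrent formula (with exponent m = 2 and transition penalty θ):

$$D(i,j) = \mathrm{amdf}_i(j) + \min_{k \in [8,160]} \left\{ D(i-1,k) + \theta\, |k-j|^2 \right\}, \quad i \in [2,n],\ j \in [8,160]$$

Initial conditions: $D(1,j) = \mathrm{amdf}_1(j),\ j \in [8,160]$

Optimum cost: $\min_{j \in [8,160]} D(n,j)$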
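The dynamic program above can be sketched as follows. This is an illustrative version only: the function name `updudp`, the explicit `lags` list (standing in for the range 8 to 160), and the toy AMDF matrix are assumptions, and back-pointers are added so the optimal path can be read out.

```python
def updudp(amdf_matrix, lags, theta, m=2):
    """amdf_matrix[i][j] is the AMDF of frame i at lag lags[j].
    Returns (optimum cost, optimal path as a list of lags, one per frame)."""
    n, num = len(amdf_matrix), len(lags)
    D = [list(amdf_matrix[0])]   # initial conditions: first frame costs
    back = [[0] * num]           # back-pointers for path recovery
    for i in range(1, n):
        row, ptr = [], []
        for j in range(num):
            # cost of arriving at lag j from every lag k of the previous frame
            costs = [D[i - 1][k] + theta * abs(lags[k] - lags[j]) ** m
                     for k in range(num)]
            k_best = min(range(num), key=costs.__getitem__)
            row.append(amdf_matrix[i][j] + costs[k_best])
            ptr.append(k_best)
        D.append(row)
        back.append(ptr)
    j = min(range(num), key=D[-1].__getitem__)  # optimum cost over the last frame
    cost = D[-1][j]
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return cost, [lags[j] for j in reversed(path)]

lags = [8, 9, 10]
# Toy AMDF values: the middle frame has a spurious minimum at lag 10
amdf_matrix = [[0, 5, 5], [5, 5, 0], [0, 5, 5]]
no_penalty = updudp(amdf_matrix, lags, theta=0)
smooth = updudp(amdf_matrix, lags, theta=10)
print(no_penalty, smooth)
```

With no transition penalty the path follows each frame's own minimum; with a penalty the outlier in the middle frame is smoothed away.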

-15-

UPDUDP (3/4)

A typical example of UPDUDP using AMDF

-16-

UPDUDP (4/4)

Insensitivity to the transition penalty θ

[Figure: top panel shows the waveform and bottom panel shows pitch (semitones) vs. time (seconds) for the syllables xi, lu, chan, sheng, chang, with pitch curves for θ = 0, 2000, 4000, ..., 20000]

-17-

Frequency to Semitone Conversion

Semitone: A music scale based on A440

$$\mathrm{semitone} = 12 \log_2\!\left(\frac{\mathrm{freq}}{440}\right) + 69$$

Reasonable pitch range: E2 - C6 (82 Hz - 1047 Hz)
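The conversion can be checked numerically with a short sketch. The function names are assumptions for illustration; the inverse mapping is included only to show the round trip.

```python
import math

def freq_to_semitone(freq):
    """Map a frequency in Hz to the semitone scale in which A440 is 69."""
    return 12 * math.log2(freq / 440.0) + 69

def semitone_to_freq(semitone):
    """Inverse mapping: semitone value back to Hz."""
    return 440.0 * 2 ** ((semitone - 69) / 12)

low = freq_to_semitone(82.41)     # approximately E2
high = freq_to_semitone(1046.5)   # approximately C6
print(round(low), round(high))
```

On this scale, a difference of 1 between two values is one semitone and a difference of 12 is one octave, which is why pitch vectors are compared in semitones rather than in Hz.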

-18-

Vectors after Pitch Tracking

With rests Without rests

-19-

Typical Result of Pitch Tracking

Pitch tracking via autocorrelation for 茉莉花 (Jasmine Flower), with audio playback

-20-

Comparison of Pitch Vectors

Yellow line: Target pitch vector

-21-

Demo of Pitch Tracking

Real-time display of ACF for pitch tracking
  toolbox/sap/goPtByAcf.mdl
Real-time pitch tracking for real-time mic input
  toolbox/sap/goPtByAcf2.mdl
Pitch scaling
  pitchShiftDemo/project1.exe
  pitchShift-multirate/multirate.m

-22-

Comparison Methods of QBSH

Categories of approaches to QBSH
  Histogram/statistics-based
  Note vs. note: edit distance
  Frame vs. note: HMM
  Frame vs. frame: linear scaling, DTW, recursive alignment

-23-

Range Comparison

Concept
  Reject a candidate song c if its pitch range, range(c), does not match the pitch range of the query q, range(q)
Characteristics
  Extremely fast
  Not effective
  Good for initial filtering
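A minimal sketch of range comparison as a pre-filter. The slide does not say how a match is decided, so the tolerance of 4 semitones, the treatment of zeros as rests, and the function names are all assumptions for illustration.

```python
def pitch_range(pitch_vector):
    """Range in semitones of the voiced part of a pitch vector (zeros are rests)."""
    voiced = [p for p in pitch_vector if p > 0]
    return max(voiced) - min(voiced)

def passes_range_filter(query, candidate, tolerance=4.0):
    """Keep a candidate only if its pitch range is close to the query's (hypothetical rule)."""
    return abs(pitch_range(query) - pitch_range(candidate)) <= tolerance

query = [0, 60, 62, 65, 0]
print(passes_range_filter(query, [55, 57, 60]), passes_range_filter(query, [60, 72]))
```

Because only two numbers per song are compared, this test is very cheap, which fits its role as an initial filter before more expensive comparisons.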

-24-

Linear Scaling (LS)

Concept
  Scale the query linearly to match the candidates
[Figure: example of linear scaling]

-25-

Linear Scaling (II)

Strength
  One-shot for dealing with key transposition
  Efficient and effective
  Indexing methods available
Weakness
  Cannot deal with non-uniform tempo variations
[Figure: typical mapping path]

-26-

Linear Scaling (III)

Distance function for LS
  Normalized L1-norm
  Normalized L2-norm
Rest handling
  Extend previous non-zero note
[Figure: alignment example]
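Linear scaling can be sketched as follows, under several assumptions that are not stated on the slides: a fixed set of scale factors between 0.5 and 2, nearest-neighbour resampling, mean subtraction to handle key transposition in one shot, and comparison against the beginning of the candidate. The distance is a normalized L1-norm, and rests (zeros) are filled by extending the previous non-zero pitch.

```python
def fill_rests(pitch):
    """Replace rests (zeros) with the previous non-zero pitch."""
    first = next((p for p in pitch if p > 0), 0.0)  # used for leading rests
    out, prev = [], first
    for p in pitch:
        if p > 0:
            prev = p
        out.append(prev)
    return out

def resample(vec, new_len):
    """Nearest-neighbour stretch or compression of vec to new_len points."""
    n = len(vec)
    return [vec[min(n - 1, int(k * n / new_len))] for k in range(new_len)]

def linear_scaling_distance(query, candidate,
                            factors=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Smallest mean-removed, length-normalized L1 distance over all scale factors."""
    query, candidate = fill_rests(query), fill_rests(candidate)
    best = float("inf")
    for f in factors:
        length = round(len(query) * f)
        if length < 1 or length > len(candidate):
            continue
        q, c = resample(query, length), candidate[:length]
        q_mean, c_mean = sum(q) / length, sum(c) / length
        dist = sum(abs((a - q_mean) - (b - c_mean)) for a, b in zip(q, c)) / length
        best = min(best, dist)
    return best

candidate = [60] * 8 + [62] * 8 + [64] * 8
query = [65] * 4 + [67] * 4 + [69] * 4   # twice as fast, 5 semitones higher
print(linear_scaling_distance(query, candidate))
```

Each scale factor costs one pass over the vectors, which is why LS is much cheaper than DTW; the price is that tempo is assumed to be uniformly faster or slower.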

-27-

Dynamic Time Warping (DTW)

Goal: Allow comparison with high tolerance to tempo variation
Characteristics
  Robust to irregular tempo variations
  Trial-and-error for dealing with key transposition
  Computationally expensive
  Does not satisfy the triangle inequality
  Some indexing algorithms do exist
#1 method for task 2 in QBSH/MIREX 2006

-28-

Dynamic Time Warping: Type 1

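t: input pitch vector (8 sec, 128 points)
r: reference pitch vector
Local paths: 27-45-63 degrees

DTW recurrence:

$$D(i,j) = |t(i) - r(j)| + \min \begin{cases} D(i-2,\, j-1) \\ D(i-1,\, j-1) \\ D(i-1,\, j-2) \end{cases}$$

[Figure: DTW table with t along the i axis and r along the j axis; cell (i, j) pairs t(i) with r(j)]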
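A minimal sketch of the Type-1 recurrence with 0-based indices. The boundary conditions are assumptions for illustration: the path is anchored at the first point of both vectors and may end anywhere along the reference once the whole input is consumed.

```python
INF = float("inf")

def dtw_type1(t, r):
    """Type-1 DTW (27-45-63 degree local paths), anchored start, free end on r."""
    n, m = len(t), len(r)
    D = [[INF] * m for _ in range(n)]
    D[0][0] = abs(t[0] - r[0])
    for i in range(1, n):
        for j in range(1, m):
            prev = D[i - 1][j - 1]                 # 45-degree step
            if i >= 2:
                prev = min(prev, D[i - 2][j - 1])  # input advances two points per reference point
            if j >= 2:
                prev = min(prev, D[i - 1][j - 2])  # reference advances two points per input point
            if prev < INF:
                D[i][j] = abs(t[i] - r[j]) + prev
    return min(D[n - 1])                           # best end position on the reference

r = [60, 62, 64, 72, 72]
print(dtw_type1([60, 62, 64], r),            # same tempo
      dtw_type1([60, 60, 62, 62, 64], r),    # input at half speed
      dtw_type1([60, 64], r))                # input at double speed
```

The three local paths limit the local tempo ratio to between 1/2 and 2, so horizontal or vertical runs in the mapping path cannot occur.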

-29-

Dynamic Time Warping: Type 2

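t: input pitch vector (8 sec, 128 points)
r: reference pitch vector
Local paths: 0-45-90 degrees

DTW recurrence:

$$D(i,j) = |t(i) - r(j)| + \min \begin{cases} D(i-1,\, j) \\ D(i-1,\, j-1) \\ D(i,\, j-1) \end{cases}$$

[Figure: DTW table with t along the i axis and r along the j axis; cell (i, j) pairs t(i) with r(j)]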

-30-

Local Path Constraints

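Type 1: 27-45-63 local paths

$$D(i,j) = |t(i) - r(j)| + \min \begin{cases} D(i-2,\, j-1) \\ D(i-1,\, j-1) \\ D(i-1,\, j-2) \end{cases}$$

  Predecessors of D(i, j): D(i-1, j-2), D(i-1, j-1), D(i-2, j-1)

Type 2: 0-45-90 local paths

$$D(i,j) = |t(i) - r(j)| + \min \begin{cases} D(i-1,\, j) \\ D(i-1,\, j-1) \\ D(i,\, j-1) \end{cases}$$

  Predecessors of D(i, j): D(i, j-1), D(i-1, j-1), D(i-1, j)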

-31-

DTW Paths of “Match Beginning”

We assume the speed of a user's acoustic input falls between 1/2 and 2 times that of the intended song.
The right end is free to move.
Typical DTW table size = 128 x 180
[Figure: feasible DTW paths in the i-j table]

-32-

DTW Paths of “Match Anywhere”

Both ends are free to move.
Typical DTW table size = 128 x 2880
[Figure: feasible DTW paths in the i-j table]
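One way to let both ends move is to initialize the first row of the DTW table with the local distance only, and to take the minimum over the last row. The sketch below does this with the Type-2 (0-45-90) recurrence; the function name and the choice of Type 2 are assumptions for illustration.

```python
def dtw_type2_anywhere(t, r):
    """Type-2 DTW in which the match may start and end anywhere along r."""
    m = len(r)
    prev = [abs(t[0] - r[j]) for j in range(m)]   # free start: no accumulated cost in row 0
    for i in range(1, len(t)):
        cur = [abs(t[i] - r[0]) + prev[0]]
        for j in range(1, m):
            best = min(prev[j], prev[j - 1], cur[j - 1])
            cur.append(abs(t[i] - r[j]) + best)
        prev = cur
    return min(prev)                              # free end: best cell in the last row

r = [60, 60, 62, 64, 65, 67, 67]
print(dtw_type2_anywhere([64, 65, 65, 67], r))    # query taken from the middle of r
```

Only two rows of the table are kept in memory, but the number of cells visited still grows with the full reference length, which is why the match-anywhere table is so much larger than the match-beginning one.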

-33-

DTW Path of “Match Beginning”

-34-

DTW Path of “Match Anywhere”

-35-

DTW Path of “Match Anywhere”

-37-

Key Transposition

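Goal: Allow user input in different keys
Method 1: Mean shift and heuristic modification
  5 DTW computations when comparing against each song
[Figure: starting from the mean-aligned shift t (step 1), DTW is also evaluated at t-2 and t+2 (step 2); around the best of these, t', it is then evaluated at t'-1 and t'+1 (step 3)]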
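The five-evaluation heuristic can be sketched as below. This is one reading of the slide, not a confirmed implementation: the query is first shifted so that its mean matches the reference mean, then offsets of -2 and +2 semitones are tried, and finally the two neighbours of the best offset. The `distance` argument stands for any comparison function such as DTW; the frame-wise L1 distance in the example is a stand-in.

```python
def best_key_shift(query, reference, distance):
    """Return (shift in semitones, distance) using 5 distance evaluations."""
    base = sum(reference) / len(reference) - sum(query) / len(query)  # mean alignment

    def dist_at(offset):
        return distance([p + base + offset for p in query], reference)

    scores = {0: dist_at(0), -2: dist_at(-2), 2: dist_at(2)}
    center = min(scores, key=scores.get)           # best of the first three
    for offset in (center - 1, center + 1):        # refine around it
        scores[offset] = dist_at(offset)
    best = min(scores, key=scores.get)
    return base + best, scores[best]

calls = []
def l1(a, b):
    """Stand-in frame-wise L1 distance that also counts how often it is called."""
    calls.append(1)
    return sum(abs(x - y) for x, y in zip(a, b))

shift, dist = best_key_shift([70, 70, 70, 70], [60, 60, 60, 64], l1)
print(shift, dist, len(calls))
```

Mean alignment alone is not always the best key, for example when the query covers only part of the reference, so the small search around it trades a few extra comparisons for robustness.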

-38-

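Type-3 DTW: Frame-to-Note Alignment

DP-based method for filling the table

Recurrent formula:

$$D(i,j) = |t(i) - r(j)| + \min \begin{cases} D(i-1,\, j) \\ D(i-1,\, j-1) \end{cases}$$

Local constraint: D(i, j) is reached from D(i-1, j) or D(i-1, j-1)

[Figure: a frame-level pitch vector aligned against notes, with note pitches such as 67, 64, 65, 62, 65]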
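A minimal sketch of the frame-to-note recurrence, where t holds frame-level pitch values and r holds one pitch per note. The boundary conditions (start at the first note, end at any note) and the example values are assumptions for illustration.

```python
def dtw_type3(frames, notes):
    """Frame-to-note DTW: each frame either stays on the current note or moves to the next."""
    INF = float("inf")
    prev = [abs(frames[0] - notes[0])] + [INF] * (len(notes) - 1)  # start on the first note
    for pitch in frames[1:]:
        cur = []
        for j, note in enumerate(notes):
            best = prev[j] if j == 0 else min(prev[j], prev[j - 1])
            cur.append(abs(pitch - note) + best)
        prev = cur
    return min(prev)  # the query may stop at any note

notes = [67, 64, 65, 62, 65]
frames = [67, 67, 67, 64, 64, 65, 65, 65, 62]
print(dtw_type3(frames, notes))
```

The table has one column per note rather than one per frame, which is where the efficiency gain comes from; note durations play no role in the cost.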

-39-

Type-3 DTW

Characteristics
  Frame-based query input vs. note-based music database
  Note duration unused
  More efficient, less effective
  Heuristics for key transposition
[Figure: mapping path]

-40-

RA (Recursive Alignment)

Characteristics
  Combines characteristics of LS & DTW
  #1 method for task 1 in QBSH/MIREX 2006
[Figure: a typical mapping path]

-41-

Modified Edit Distance

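Note segmentation

Modified edit distance

$$d_{i,j} = \min \begin{cases}
d_{i-1,j} + w(a_i, \varnothing) & \text{(deletion)} \\
d_{i,j-1} + w(\varnothing, b_j) & \text{(insertion)} \\
d_{i-1,j-1} + w(a_i, b_j) & \text{(replacement)} \\
\{\, d_{i-k,j-1} + w(a_{i-k+1}, \ldots, a_i, b_j),\ 2 \le k \le i \,\} & \text{(consolidation)} \\
\{\, d_{i-1,j-k} + w(a_i, b_{j-k+1}, \ldots, b_j),\ 2 \le k \le j \,\} & \text{(fragmentation)}
\end{cases}$$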
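The five-way recurrence can be sketched as below. The slide does not define the weight function w, so the weights here are hypothetical: notes are (pitch, duration) pairs, deleting or inserting a note costs its duration, and matching a group of notes to a single note costs the summed pitch differences plus the difference in total duration.

```python
def group_weight(group, single):
    """Hypothetical w: pitch mismatch of each grouped note plus total-duration mismatch."""
    pitch_cost = sum(abs(p - single[0]) for p, _ in group)
    duration_cost = abs(sum(d for _, d in group) - single[1])
    return pitch_cost + duration_cost

def modified_edit_distance(a, b):
    """Edit distance over note sequences with consolidation and fragmentation."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + a[i - 1][1]          # delete all of a
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + b[j - 1][1]          # insert all of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                d[i - 1][j] + a[i - 1][1],                             # deletion
                d[i][j - 1] + b[j - 1][1],                             # insertion
                d[i - 1][j - 1] + group_weight([a[i - 1]], b[j - 1]),  # replacement
            ]
            for k in range(2, i + 1):   # consolidation: k notes of a into one note of b
                options.append(d[i - k][j - 1] + group_weight(a[i - k:i], b[j - 1]))
            for k in range(2, j + 1):   # fragmentation: one note of a into k notes of b
                options.append(d[i - 1][j - k] + group_weight(b[j - k:j], a[i - 1]))
            d[i][j] = min(options)
    return d[n][m]

a = [(60, 2), (62, 1)]
b = [(60, 1), (60, 1), (62, 1)]   # first note of a split into two shorter notes
print(modified_edit_distance(a, b))
```

Consolidation and fragmentation let the distance absorb note segmentation errors, such as one sung note being split into two, without paying for an insertion or deletion.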

-42-

Challenges in QBSH Systems

Song database preparation
  MIDIs, singing clips, or audio music
Reliable pitch tracking for acoustic input
  Input from mobile devices or noisy karaoke bars
Efficient/effective retrieval
  Karaoke machine: ~10,000 songs
  Internet music search engine: ~500,000,000 songs

-43-

-44-

Goal and Approach

Goal: To retrieve songs effectively within a given response time, say 5 seconds or so
Our strategy
  Multi-stage progressive filtering
  Indexing for different comparison methods
  Repeating pattern identification
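Multi-stage progressive filtering can be sketched as a chain of comparisons ordered from cheap to expensive, each keeping only its best candidates for the next stage. The function name, the stage format, and the two toy distance functions are assumptions for illustration; in practice the stages might be range comparison, LS, and DTW.

```python
def progressive_filter(query, database, stages):
    """database maps song name to pitch vector; stages is a list of
    (distance_function, keep_count) pairs ordered from cheap to expensive."""
    candidates = list(database.items())
    for distance, keep in stages:
        ranked = sorted(candidates, key=lambda item: distance(query, item[1]))
        candidates = ranked[:keep]   # only the survivors reach the next stage
    return [name for name, _ in candidates]

def range_diff(q, c):
    """Cheap stand-in: difference of pitch ranges."""
    return abs((max(q) - min(q)) - (max(c) - min(c)))

def l1(q, c):
    """More expensive stand-in: frame-wise L1 distance."""
    return sum(abs(x - y) for x, y in zip(q, c))

database = {"a": [60, 62, 64], "b": [60, 60, 60], "c": [60, 62, 65], "d": [70, 75, 80]}
result = progressive_filter([60, 62, 64], database, [(range_diff, 2), (l1, 1)])
print(result)
```

The expensive comparison runs on only a fraction of the database, so total response time is dominated by the cheap stages; the risk is that an early stage discards the correct song.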

-45-

Demo: MIRACLE

MIRACLE: Music Information Retrieval Acoustically via CLuster Engines

Demo page of MIR Lab: http://mirlab.org/new/mir_products.asp

MIRACLE demo: http://cuda.mirlab.org

-46-

Internet Music Search Engine
  Client-server distributed computing
  Cloud computing via clustered PCs & GPU
[Figure: clients (PC, PDA, cellular phone) send a request (pitch vector) to the master server, which works with the clustered slave servers and returns a response (search result)]

-47-

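Challenge 1: Collecting the Music Database

Music files collected from the Internet:
  MIDI files
    For accuracy, the track containing the main melody must be identified manually. Automatic methods achieve a recognition rate of about 85%.
    MIDI file formats are complex and inconsistent
    MIDI main melodies are not clean (intros, overlapping notes, variations, etc.)
  MP3 files
    Pop music: extracting the pitch of the singing voice is extremely difficult. According to the ISMIR 2011 contest results, the best pitch recognition rate is 84%.
    Symphonic music: may have no main melody at all
Manual labeling:
  To support text search, information such as singer, lyrics, and genre must be added.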

-48-

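Challenge 2: Speeding Up Comparison

Factors affecting comparison speed (and their typical values)
  Query length: 8 sec (128 pitch points)
  Database size: about 13,000 songs
  Comparison method: LS+DTW
  CPU: Pentium 2G (relatively unaffected by memory size)
  Comparison position
    From the beginning: about 2 sec
    From the middle
      Start of the refrain
      Start of each note: about 45 sec
      Anywhere: about 60 sec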

-49-

Response Time of Miracle

8 sec recording of "小毛驢", comparison from beginning:
  LS: 0.4 sec
  DTW: 3.5 sec
  LS+DTW: 0.6 sec
8 sec recording of the refrain of "夢醒時分", comparison from anywhere:
  LS: 40 sec
  DTW: IIS time out
  LS+DTW: 45 sec
  NBDTW: IIS time out

-50-

Could It Be More Efficient?

Algorithms
  Indexing of LS/DTW
  Progressive filtering
New platforms
  GPU (66 times faster for QBSH!)
  Grid/clustered computing
  Multi-core platforms

-51-

Commercial Applications

www.midomo.com
www.soundhound.com
www.shazam.com

-52-

Conclusions

QBSH
  A fun and interesting way to retrieve music
  Can be extended to singing scoring
  Commercial applications are maturing
Challenges
  How to deal with massive music databases?
  How to extract melody from audio music?
