Audio chord recognition using deep neural networks


Bohumír Zámečník @bzamecnik
(A Farewell) Data Science Seminar – 2016-05-25

https://twitter.com/bzamecnik

Agenda

● what are chords & why recognize them?
● task formulation
● data set
● pre-processing
● model
● evaluation
● future work

The dream – Beatles: Penny Lane

https://vid.me/is5

What are chords?

"multiple tones being played at the same time"

~ pitch class sets

group Z12

2^12 = 4096 possibilities
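To make the pitch class set view concrete, here is a minimal sketch. The helper names (`to_vector`, `to_pitch_classes`, `transpose`) are made up for illustration, and numbering the pitch classes from 0 = C to 11 = B is an assumption.

```python
from itertools import product

PITCH_CLASS_COUNT = 12


def to_vector(pitch_classes):
    """Encode a set of pitch classes (assumed 0 = C ... 11 = B) as a binary 12-vector."""
    return [1 if i in pitch_classes else 0 for i in range(PITCH_CLASS_COUNT)]


def to_pitch_classes(vector):
    """Decode a binary 12-vector back into a set of pitch classes."""
    return {i for i, bit in enumerate(vector) if bit}


def transpose(pitch_classes, semitones):
    """Transposition is addition modulo 12, i.e. the group Z12."""
    return {(pc + semitones) % PITCH_CLASS_COUNT for pc in pitch_classes}


c_major = {0, 4, 7}              # C, E, G
e_major = transpose(c_major, 4)  # E, G#, B

# Every binary 12-vector is a candidate pitch class set: 2^12 of them.
all_vectors = list(product((0, 1), repeat=PITCH_CLASS_COUNT))
```

Most of these 4096 sets are not chords anyone would name, which is one reason to predict pitch classes per frame rather than a fixed chord vocabulary.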

Motivation – why recognize chords?

● provide rich high-level musical structure
  ○ → visualization
● hard to pick out by ear
  ○ lyrics & melody – easy
  ○ chords – harder

Representation

● symbolic names
● pitch class sets (unique tones)

[1, 3, 5] [1, 4, 6] [2, 5, 7]

Task formulation – end-to-end task

● segmentation & classification
  ○ input data: sampled audio recording
  ○ output: time segments with symbolic chord labels

start     end       chord
0.440395  1.689818  B
1.689818  2.209188  B/7
2.209188  2.746326  B/6
2.746326  3.280385  B/5
3.280385  3.849274  E:maj6
3.849274  4.406553  C#:min7
4.406553  4.940612  F#:sus4
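Segment annotations like the ones above can be read with a few lines of Python. This is only a sketch: it assumes one segment per line with whitespace-separated `start end chord` fields, as on the slide, and the names `Segment` and `parse_segments` are hypothetical.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    chord: str    # symbolic chord label, kept as-is


def parse_segments(text: str) -> List[Segment]:
    """Parse lines of the form 'start end chord'; other lines are skipped."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # blank or malformed line
        start, end, chord = parts
        segments.append(Segment(float(start), float(end), chord))
    return segments


example = """\
0.440395 1.689818 B
1.689818 2.209188 B/7
3.280385 3.849274 E:maj6
"""
segments = parse_segments(example)
```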

Task formulation – intermediate task

● multi-label classification of frames
  ○ input: chromagram
  ○ output: pitch class labels for each frame

0 0 0 1 0 0 1 0 0 0 1 1
0 0 0 1 0 0 1 0 1 0 0 1
0 0 0 1 0 0 1 0 0 0 0 1
0 1 0 0 1 0 0 0 1 0 0 1
0 1 0 0 1 0 0 0 1 0 0 1
0 1 0 0 0 0 1 0 0 0 0 1

Data set – The Beatles: Reference Annotations (Isophonics)

● 180 songs
● ~ 8 hours
● human-annotated chord labels
● raw audio possible but hard to obtain – due to copyrights :(
  ○ torrent to help

http://isophonics.net/content/reference-annotations-beatles

Pre-processing

● hard part – cleaning the input data :)
● need to synchronize audio & features
● chromagram features
  ○ like log-spectrogram
  ○ bins aligned to musical tones
  ○ linear translation
  ○ time-frequency reassignment
    ■ using phase to "focus" the content position

Pre-processing – audio


● stereo to mono (mean)
● cut to (overlapping) frames
● apply window (Hann)
● FFT – time-domain to frequency-domain → spectrogram
● reassignment – derivative of phase with respect to time & frequency
  ○ better position
● log scaling of frequency
● requantization
● dynamic range compression of values (log)

[Figures: linear spectrogram, log spectrogram, reassigned log spectrogram]
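The first steps of the audio pipeline (mono mix, overlapping frames, Hann window, FFT, log compression) can be sketched with NumPy as below. This is not the author's implementation: the function name and the `frame_size`/`hop_size` defaults are assumptions, and the reassignment and log-frequency requantization steps are left out entirely, so the output is a plain linear-frequency spectrogram.

```python
import numpy as np


def spectrogram(audio, frame_size=4096, hop_size=2048):
    """Log-compressed magnitude spectrogram.

    audio: array of shape (samples,) or (samples, channels).
    Returns an array of shape (frame_count, frame_size // 2 + 1).
    """
    audio = np.asarray(audio, dtype=np.float64)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # stereo to mono (mean of channels)
    frame_count = max(0, 1 + (len(audio) - frame_size) // hop_size)
    # Index matrix of overlapping frames: (frame_count, frame_size).
    offsets = hop_size * np.arange(frame_count)[:, None]
    frames = audio[offsets + np.arange(frame_size)[None, :]]
    windowed = frames * np.hanning(frame_size)  # Hann window
    magnitudes = np.abs(np.fft.rfft(windowed, axis=1))
    return np.log1p(magnitudes)  # dynamic range compression
```

Trailing samples that do not fill a whole frame are dropped here; padding the signal instead is an equally reasonable choice.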

Pre-processing – labels

● symbolic labels to binary pitch class vectors
  ○ chord-labels parser
● sample to frames (to match the audio features)

B        0 0 0 1 0 0 1 0 0 0 1 1
B/7      0 0 0 1 0 0 1 0 1 0 0 1
B/6      0 0 0 1 0 0 1 0 1 0 0 1
B/5      0 0 0 1 0 0 1 0 0 0 0 1
E:maj6   0 1 0 0 1 0 0 0 1 0 0 1
C#:min7  0 1 0 0 1 0 0 0 1 0 0 1
F#:sus4  0 1 0 0 0 0 1 0 0 0 0 1

https://github.com/bzamecnik/chord-labels
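Sampling segment labels to frames could look roughly like this. The sketch assumes the segments are already converted to pitch class vectors, sorted and non-overlapping, and that each frame is represented by a single time stamp (for instance its centre); `sample_labels_to_frames` is a hypothetical helper, not part of chord-labels.

```python
import numpy as np


def sample_labels_to_frames(segments, frame_times):
    """Assign each frame the pitch class vector of the segment containing it.

    segments: list of (start, end, vector), sorted by start, non-overlapping.
    frame_times: one time stamp (in seconds) per frame.
    Frames that fall outside every segment get an all-zero vector.
    """
    starts = np.array([start for start, _, _ in segments])
    ends = np.array([end for _, end, _ in segments])
    vectors = np.array([vector for _, _, vector in segments], dtype=np.uint8)
    frame_times = np.asarray(frame_times, dtype=np.float64)
    # Index of the last segment that starts at or before each frame time.
    idx = np.searchsorted(starts, frame_times, side="right") - 1
    inside = (idx >= 0) & (frame_times < ends[np.clip(idx, 0, None)])
    labels = np.zeros((len(frame_times), vectors.shape[1]), dtype=np.uint8)
    labels[inside] = vectors[idx[inside]]
    return labels


segments = [
    (3.280385, 3.849274, [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]),  # E:maj6
    (4.406553, 4.940612, [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]),  # F#:sus4
]
labels = sample_labels_to_frames(segments, [0.0, 3.5, 4.0, 4.5, 5.0])
```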


Pre-processing – tensor reshaping for the model

● (data points, features)
● cut the sequences to fixed length
  ○ e.g. 100 frames
  ○ → (sequence count, sequence length, features)
● reshape for convolution
  ○ → (sequence count, sequence length, features, channels)
● final shape: (3756, 100, 115, 1)
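The reshaping steps above amount to a trim and a single `reshape`. In this sketch the function name is made up, and dropping the leftover frames that do not fill a whole sequence is an assumption (padding would also work; the slides do not say which was used).

```python
import numpy as np


def to_sequences(features, seq_length=100):
    """(frames, features) -> (sequences, seq_length, features, 1).

    Frames that do not fill a complete sequence are dropped (an assumption).
    """
    frame_count, feature_count = features.shape
    seq_count = frame_count // seq_length
    trimmed = features[:seq_count * seq_length]
    # The trailing axis of size 1 is the channel axis expected by convolutions.
    return trimmed.reshape(seq_count, seq_length, feature_count, 1)


X = np.arange(250 * 115, dtype=np.float32).reshape(250, 115)
X_seq = to_sequences(X)  # shape (2, 100, 115, 1); the last 50 frames are dropped
```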

Dataset size

● ~630k frames
● 115 features
● ~ 4 GB raw audio
● ~ 300 MB of features as a compressed numpy array
● splits
  ○ training 60 %, validation 20 %, test 20 %
  ○ over whole songs to prevent leakage!
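Splitting over whole songs means the song, not the frame, is the unit that gets assigned to a split. A minimal sketch, with a hypothetical `split_songs` helper and an arbitrary seed:

```python
import random


def split_songs(song_ids, train=0.6, valid=0.2, seed=42):
    """Split song identifiers into train/validation/test lists.

    All frames of a song go to the same split, so near-identical neighbouring
    frames cannot leak from the training set into validation or test.
    """
    ids = sorted(set(song_ids))
    random.Random(seed).shuffle(ids)  # deterministic shuffle
    n_train = int(round(train * len(ids)))
    n_valid = int(round(valid * len(ids)))
    return (ids[:n_train],
            ids[n_train:n_train + n_valid],
            ids[n_train + n_valid:])


train_ids, valid_ids, test_ids = split_songs(range(180))
```

Frames are then selected by song membership, so the actual frame proportions only approximate 60/20/20 because songs differ in length.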

Model – using deep neural networks

● the current architecture is inspired by what's used in the wild
● convolutions (+ pooling) at the beginning to extract local features
● recurrent layers to propagate context in time
● sigmoids at the end for multi-label classification
● dropout & batch normalization for regularization
● Adam optimizer

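from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras.layers.normalization import BatchNormalization
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import TimeDistributed

model = Sequential()
# convolution block applied to each frame separately (Keras 1.x API): 2x conv, pooling, dropout
model.add(TimeDistributed(Convolution1D(32, 3, activation='relu'), input_shape=(max_seq_size, feature_count, 1)))
model.add(TimeDistributed(Convolution1D(32, 3, activation='relu')))
model.add(TimeDistributed(MaxPooling1D(2, 2)))
model.add(Dropout(0.25))
# (the slide annotates 6x convolutions in total; only this first block appears here)
model.add(TimeDistributed(Flatten()))
model.add(BatchNormalization())
# recurrent layers propagate context across the frames of a sequence
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64, return_sequences=True))
# one sigmoid output per pitch class for every frame (multi-label)
model.add(TimeDistributed(Dense(12, activation='sigmoid')))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid), nb_epoch=10, batch_size=32)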

implemented in Python using Keras on top of Theano/TensorFlow

6x convolutions

2x recurrent

1x classifier

http://keras.io/

http://deeplearning.net/software/theano/

https://www.tensorflow.org/


Training

● trained on an NVIDIA GTX 980 Ti GPU
● model ~260k parameters
● batch size: 32
● 6 GB GPU RAM
● ~ 60 s per epoch
● a few epochs to overfit
● 46 °C :)

Evaluation

● classification metrics
  ○ accuracy
  ○ Hamming distance – for binary vectors
  ○ AUC
● segmentation metrics
  ○ WAOR (weighted average overlap ratio)
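For binary label vectors the frame-level metrics are short to write down. The sketch below rests on two assumptions: that the "hamming score" is the fraction of individual label bits predicted correctly (one minus the normalized Hamming distance), and that accuracy means an exact match of all 12 bits of a frame. The function names and the 0.5 threshold are illustrative. AUC and WAOR are not covered.

```python
import numpy as np


def binarize(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 labels (the threshold is an assumption)."""
    return (np.asarray(probabilities) >= threshold).astype(np.uint8)


def hamming_score(y_true, y_pred):
    """Fraction of individual label bits that match."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def exact_match_accuracy(y_true, y_pred):
    """Fraction of frames whose whole label vector matches."""
    matches = np.all(np.asarray(y_true) == np.asarray(y_pred), axis=-1)
    return float(np.mean(matches))


y_true = [[0, 1, 0, 1], [1, 0, 0, 0]]
y_pred = binarize([[0.1, 0.9, 0.2, 0.4], [0.8, 0.3, 0.1, 0.0]])
```

A single wrong bit makes a whole frame count as wrong under exact match, which is why that number is always at most the Hamming score.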

Evaluation (validation set)

model        accuracy  hamming score  AUC
CNN + dense  0.402     0.873          0.910
CNN + LSTM   0.512     0.899          0.935

[Figure: predicted probability, predicted labels, true labels, probability error, label error]

[Figure: "And I Love Her", predicted vs. ground-truth]

Future work

● prepare for MIREX 2016
● clean up the project
● write down all the stuff to blog
● make interactive demos / production app
● examine new approaches
  ○ better frame -> segment post-processing
  ○ 2D/nD convolutions – using locality in time/octaves
  ○ bi-directional RNN
  ○ beat-aligned features
  ○ language models
  ○ unsupervised pre-training
  ○ segmental RNN for direct segmentation
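As a point of reference for the frame -> segment item, the simplest possible baseline merges runs of identical frame labels into segments. This is a naive sketch, not the post-processing used in the project; `frames_to_segments` and the fixed `frame_duration` are assumptions.

```python
def frames_to_segments(frame_labels, frame_duration):
    """Merge consecutive frames with equal labels into (start, end, label) segments.

    frame_labels: one comparable label per frame (e.g. a chord name or a tuple
    of 12 bits); frame_duration: hop between frames in seconds.
    """
    segments = []
    start_index = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current run at the end of the input or when the label changes.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start_index]:
            segments.append((start_index * frame_duration,
                             i * frame_duration,
                             frame_labels[start_index]))
            start_index = i
    return segments


segments = frames_to_segments(["B", "B", "E", "E", "E", "B"], frame_duration=0.5)
```

A single misclassified frame splits a segment in two under this scheme, which is the kind of weakness that smoothing or a sequence model in the post-processing step would address.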

Open-source @ GitHub

● bzamecnik/audio-ml – latest ML models & experiments
● bzamecnik/music-processing-experiments – chromagram features
● bzamecnik/chord-labels – labels -> pitch class vectors
● bzamecnik/harmoneye
  ○ real-time chromagram features visualization
  ○ chord timeline visualization (from Penny Lane video)
● bzamecnik/harmoneye-android
● visualmusictheory.com – blog
● bzamecnik/ideas – more ideas :)

https://github.com/bzamecnik/audio-ml


https://github.com/bzamecnik/music-processing-experiments




https://github.com/bzamecnik/harmoneye


https://github.com/bzamecnik/harmoneye-android


http://www.visualmusictheory.com/


https://github.com/bzamecnik/ideas


Thank you!
