
Deep learning for music classification, 2016-05-24


Page 1: Deep learning for music classification, 2016-05-24


Deep Learning for Music Classification
GCT634 Spring 2016, KAIST

[email protected]

Centre for Digital Music, Queen Mary University of London, UK

24 May 2016


Page 2: Deep learning for music classification, 2016-05-24


1 Music classification

2 Data-driven approaches
  Conventional ML
  Deep Learning

3 Reference


Page 3: Deep learning for music classification, 2016-05-24


Music Classification

Definition

Classify music items into certain categories (using audio content)

Genre classification [3]

Rock/Jazz/Hiphop/Classical/...

Instrument identification

Music/Speech segmentation

Emotion recognition

Automatic tagging


Page 4: Deep learning for music classification, 2016-05-24


Music Classification

Feasibility

Is this information really extractable from the audio signal?

Genre: sound, playing style, chords, instruments, melody, ...

Instrument: spectral and/or temporal patterns

Music/Speech: spectral and/or temporal patterns

Emotion: sound/melody/lyrics/..

Tags (instrument/era/emotion/activity): ...


Page 5: Deep learning for music classification, 2016-05-24


Data-driven approaches
Conventional ML

Data-driven + domain knowledge (acoustic/musical features) [3]

"We provide candidates and let the machine choose"


Page 6: Deep learning for music classification, 2016-05-24


Data-driven approaches
Conventional ML

1/2 Feature selection

Any features that might be relevant to the classification

Spectral features

Spectral rolloff, spectral centroid, MFCC, ZCR, ...

Rhythmic features

Tempo, beat histogram

Tonal features

Key, pitch-class distribution, tonality


Page 7: Deep learning for music classification, 2016-05-24


Data-driven approaches
Conventional ML

2/2 Classifiers

Classifiers select relevant features

To map (aggregated N-dim feature) to (decision)

Classifiers are trained with data

After training, it is usually possible to score how relevant each feature is
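For instance, tree-based classifiers expose such relevance scores directly. A minimal scikit-learn sketch with placeholder data (the random forest and the random data are assumptions, not a method prescribed by these slides):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 62))    # placeholder: 100 tracks x 62-dim features
    y_train = rng.integers(0, 4, size=100)  # placeholder genre labels (4 classes)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.feature_importances_)         # one relevance score per feature dimension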


Page 8: Deep learning for music classification, 2016-05-24


Data-driven approaches
Conventional ML - Genre classification example

audio signal → [STFT] → [MFCC], [spectral centroid] → [concatenate] → [mean, var]
length=N → 256-by-100 → 30-by-100, 1-by-100 → 31-by-100 → 62-by-1

for (x, y) in training data:  # x: audio signal, y: genre label

    1. X = stft(x)
    2. x_mfccs = mfcc(X); x_centroids = spectral_centroid(X)
    3. x_feats = concatenate(x_mfccs, x_centroids)
       # size(x_feats) = (31, 100): feature vectors for every frame in the track
    4. x_feat = concatenate(mean(x_feats), var(x_feats))
       # size(x_feat) = (62, 1): the feature vector of the whole track x

Train the classifier with (x_feat, y)

* Now, we have a system that maps audio signal → genre
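A minimal librosa version of steps 1-4 (librosa and its default frame/hop settings are assumptions here, so the exact matrix sizes differ from the 256-by-100 example above):

    import numpy as np
    import librosa

    x, sr = librosa.load(librosa.example('trumpet'))             # placeholder audio
    x_mfccs = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=30)        # (30, n_frames)
    x_centroids = librosa.feature.spectral_centroid(y=x, sr=sr)  # (1, n_frames)
    x_feats = np.concatenate([x_mfccs, x_centroids])             # (31, n_frames)
    # aggregate the frame-wise features into one track-level vector
    x_feat = np.concatenate([x_feats.mean(axis=1), x_feats.var(axis=1)])  # (62,)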


Page 9: Deep learning for music classification, 2016-05-24


Data-driven approaches
Conventional ML - Genre classification example

audio signal → [feature extraction] → s_feat → [trained classifier] → y_prediction

for new audio signal s,

1. get s_feat
2. predict the genre of the signal!
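A toy end-to-end sketch with placeholder features (the SVM is an assumption; any off-the-shelf classifier fits this role):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_feats = rng.normal(size=(200, 62))   # placeholder 62-dim track feature vectors
    genres = rng.integers(0, 4, size=200)  # placeholder genre labels

    clf = SVC().fit(X_feats, genres)       # training

    s_feat = rng.normal(size=(1, 62))      # stand-in for the features of new signal s
    print(clf.predict(s_feat))             # -> predicted genre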


Page 10: Deep learning for music classification, 2016-05-24


Even more data-driven approaches
Deep Learning

"Why don't we optimise/automate more?"

Because designing features is NOT optimised (and boring)

"We provide candidates and let the machine choose"
→ "Let the machine design and choose features"


Page 11: Deep learning for music classification, 2016-05-24


Even more data-driven approaches
Deep Learning

Deep Learning

Deep == More layers (of Neural Networks) == Some layers serve as feature extractors, the others as classifiers


Page 12: Deep learning for music classification, 2016-05-24


Even more data-driven approaches
Deep Learning

Machines might do better than humans

they don't get bored, compute faster, and are not biased, ...

Machines are more flexible than before

learned classifier AND feature extractor

Machines need more examples to learn from than before

because the number of parameters to learn increases

Humans still decide the structure and the input types


Page 13: Deep learning for music classification, 2016-05-24


Even more data-driven approaches
Deep Learning

End-to-end learning for music audio, Sander Dieleman et al., ICASSP, 2014 [2]

Auto-tagging using deep convolutional neural networks, Keunwoo Choi et al., ISMIR, 2016 [1]


Page 14: Deep learning for music classification, 2016-05-24

[1] Choi, K., Fazekas, G., Sandler, M.: Automatic tagging using deep convolutional neural networks. In: Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016), New York, USA (2016)

[2] Dieleman, S., Schrauwen, B.: End-to-end learning for music audio. In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6964-6968. IEEE (2014)

[3] Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5), 293-302 (2002)

Page 15: Deep learning for music classification, 2016-05-24


Bonus 1. Six selected pages from this slide deck on deep CNNs


Convolutional Neural Networks
A brief explanation

[email protected]

Centre for Digital Music, Queen Mary University of London, UK


Page 16: Deep learning for music classification, 2016-05-24


CNNs: Convolutional Neural Networks

(Deep) Convolutional Neural Networks

deep = cascaded
convolutional = filters
neural networks = things are learned

[Images from cns.org and AlexNet]


Page 17: Deep learning for music classification, 2016-05-24


Hierarchical features

Hierarchical feature learning

Each layer learns features in different levels of hierarchy

High-level features are built on low-level features

E.g.

Layer 1: Edges (low-level, concrete)
Layer 2: Simple shapes
Layer 3: Complex shapes
Layer 4: More complex shapes
Layer 5: Shapes of target objects (high-level, abstract)


Page 18: Deep learning for music classification, 2016-05-24


What is learned in CNNs?
(in an image recognition task)

[Figures from [11]]


Page 19: Deep learning for music classification, 2016-05-24


What is learned in CNNs?
(in an image recognition task)

[Figures from [11]]


Page 20: Deep learning for music classification, 2016-05-24


What is learned in CNNs?
(in an image recognition task)

[Figures from [11]]


Page 21: Deep learning for music classification, 2016-05-24


CNN use-cases
Music information retrieval

Anything people can do by looking at spectrograms

E.g. auto-tagging [1], chord recognition [5], instrument recognition [7], music-noise segmentation [8], onset detection [9], boundary detection [10]

+ style change? source separation? effects/de-effects?


Page 22: Deep learning for music classification, 2016-05-24

Bonus 2. Eleven selected pages from this slide deck on auto-tagging with CNNs


Automatic Tagging using Deep Convolutional Neural Networks [1]

[email protected]

Centre for Digital Music, Queen Mary University of London, UK


Page 23: Deep learning for music classification, 2016-05-24


Introduction
Tagging

Tags

Descriptive keywords that people put on music

Multi-label nature, e.g. {rock, guitar, drive, 90's}

Music tags include Genres (rock, pop, alternative, indie), Instruments (vocalists, guitar, violin), Emotions (mellow, chill), Activities (party, drive), Eras (00's, 90's, 80's).

Collaboratively created (Last.fm) → noisy
  false negatives
  synonyms (vocal/vocals/vocalist/vocalists/voice/voices, guitar/guitars)
  popularity bias
  typos (harpsicord)
  irrelevant tags (abcd, ilikeit, fav)


Page 24: Deep learning for music classification, 2016-05-24


CNNs and Music
TF-representations

Options

STFT / Mel-spectrogram / CQT / raw-audio

STFT: Okay, but why not melgram?

Melgram: Efficient

CQT: only if you're interested in fundamentals/pitches

Raw-audio: end-to-end setup (learn the transformation)
  has not outperformed the melgram (yet) in speech/music
  perhaps the way to go in the future?
  we lose the frequency axis though
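For reference, all four input options can be computed with librosa; a minimal sketch with illustrative default parameters (not the paper's exact settings):

    import numpy as np
    import librosa

    y, sr = librosa.load(librosa.example('trumpet'))      # placeholder audio
    stft = np.abs(librosa.stft(y))                        # STFT magnitude
    melgram = librosa.feature.melspectrogram(y=y, sr=sr)  # mel-spectrogram
    cqt = np.abs(librosa.cqt(y=y, sr=sr))                 # constant-Q transform
    raw = y                                               # raw audio, for end-to-end models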


Page 25: Deep learning for music classification, 2016-05-24


Problem definition

Automatic tagging

Automatic tagging is a multi-label classification task

K-dim vector: up to 2^K cases

The majority of tags are False (whether correctly or not)

Measured by AUC-ROC

Area Under Curve of Receiver Operating Characteristics

[Image from Kaggle]
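A toy sketch of computing the metric on multi-label tag predictions (scikit-learn and the toy numbers are assumptions):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    y_true = np.array([[1, 0, 0],        # ground-truth binary tags per track
                       [0, 1, 1],
                       [1, 0, 1]])
    y_score = np.array([[.9, .2, .3],    # sigmoid outputs of a tagger
                        [.1, .8, .7],
                        [.7, .4, .6]])

    print(roc_auc_score(y_true, y_score, average='macro'))  # AUC averaged over tags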


Page 26: Deep learning for music classification, 2016-05-24


The proposed architecture

4-layer fully convolutional network, FCN-4


Page 27: Deep learning for music classification, 2016-05-24


The proposed architecture

Mel-spectrogram (input: 96×1366×1)
Conv 3×3×128 → MP (2, 4) (output: 48×341×128)
Conv 3×3×256 → MP (2, 4) (output: 24×85×256)
Conv 3×3×512 → MP (2, 4) (output: 12×21×512)
Conv 3×3×1024 → MP (3, 5) (output: 4×4×1024)
Conv 3×3×2048 → MP (4, 4) (output: 1×1×2048)
FCN-6: one additional Conv 1×1×1024 · FCN-7: two additional Conv 1×1×1024
Output 50×1 (sigmoid)

Table: The configurations of the 5-, 6-, and 7-layer architectures (FCN-5, FCN-6, FCN-7). The only differences are the number of additional 1×1 convolution layers.
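A minimal Keras sketch of the FCN-5 column above (an approximation: ReLU activations are assumed, and details such as batch normalisation and dropout from the paper are omitted):

    from tensorflow.keras import Input, Model, layers

    def build_fcn5(n_mels=96, n_frames=1366, n_tags=50):
        x = inputs = Input(shape=(n_mels, n_frames, 1))  # mel-spectrogram input
        for n_filters, pool in [(128, (2, 4)), (256, (2, 4)), (512, (2, 4)),
                                (1024, (3, 5)), (2048, (4, 4))]:
            x = layers.Conv2D(n_filters, (3, 3), padding='same', activation='relu')(x)
            x = layers.MaxPooling2D(pool_size=pool)(x)   # feature map shrinks as in the table
        x = layers.Flatten()(x)                          # 1x1x2048 -> 2048
        outputs = layers.Dense(n_tags, activation='sigmoid')(x)  # 50 tag probabilities
        return Model(inputs, outputs)

Being multi-label, the output is one sigmoid per tag; such a network is typically trained with a binary cross-entropy loss.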


Page 28: Deep learning for music classification, 2016-05-24


Experiments and discussions
Overview

            MTT           MSD
# tracks    25k           1M
# songs     5-6k          1M
Length      29.1s         30-60s
Benchmarks  10+           0
Labels      Tags, genres  Tags, genres, EchoNest features, bag-of-word lyrics, ...


Page 29: Deep learning for music classification, 2016-05-24


Experiments and discussions
MagnaTagATune

Same depth (l=4): melgram > MFCC > STFT
  melgram: 96 mel-frequency bins
  STFT: 128 frequency bins
  MFCC: 90 (30 MFCC, 30 delta-MFCC, 30 delta-delta-MFCC)

Methods                  AUC
FCN-3, mel-spectrogram   .852
FCN-4, mel-spectrogram   .894
FCN-5, mel-spectrogram   .890
FCN-4, STFT              .846
FCN-4, MFCC              .862

Still, with more data a ConvNet might learn a frequency aggregation that outperforms the fixed mel-frequency aggregation. But not here.

ConvNet outperformed MFCC


Page 30: Deep learning for music classification, 2016-05-24


Experiments and discussions
MagnaTagATune

Methods                  AUC
FCN-3, mel-spectrogram   .852
FCN-4, mel-spectrogram   .894
FCN-5, mel-spectrogram   .890
FCN-4, STFT              .846
FCN-4, MFCC              .862

FCN-4>FCN-3: Depth worked!

FCN-4>FCN-5 by .004

A deeper model might catch up after ages of training
Deeper models require more data
Deeper models take more time (cf. deep residual networks [6])

Are 4 layers enough, or is it a matter of (data) size?


Page 31: Deep learning for music classification, 2016-05-24


Experiments and discussions
Million Song Dataset

Methods                  AUC
FCN-3, mel-spectrogram   .786
FCN-4, —                 .808
FCN-5, —                 .848
FCN-6, —                 .851
FCN-7, —                 .845

FCN-3<4<5<6 !

Deeper layers pay off, until 6 layers in this case.


Page 32: Deep learning for music classification, 2016-05-24


Conclusion

2D fully convolutional networks work well

Mel-spectrogram can be preferred to STFT
until we have a HUGE dataset, so that the mel-frequency aggregation can be replaced (learned)

Bye bye, MFCC? In the near future, I guess

MIR can go deeper than now

if we have bigger, better, stronger datasets

Q. How do ConvNets actually deal with spectrograms?

A. Stay tuned to this year’s MLSP paper!
