
Deep learning for music classification, 2016-05-24


Page 1: Deep learning for music classification, 2016-05-24


Deep Learning for Music Classification
GCT634 Spring 2016, KAIST

[email protected]

Centre for Digital Music, Queen Mary University of London, UK

24 May 2016


Page 2: Deep learning for music classification, 2016-05-24


1 Music classification

2 Data-driven approaches
  Conventional ML
  Deep Learning

3 Reference


Page 3: Deep learning for music classification, 2016-05-24


Music Classification

Definition

Classify music items into certain categories (using audio content)

Genre classification [3]

Rock/Jazz/Hiphop/Classical/...

Instrument identification

Music/Speech segmentation

Emotion recognition

Automatic tagging


Page 4: Deep learning for music classification, 2016-05-24


Music Classification

Feasibility

Is this information really extractable from the audio signal?

Genre: sound, playing style, chords, instruments, melody, ...

Instrument: spectral and/or temporal patterns

Music/Speech: spectral and/or temporal patterns

Emotion: sound/melody/lyrics/..

Tags (instrument/era/emotion/activity): ...


Page 5: Deep learning for music classification, 2016-05-24


Data-driven approaches
Conventional ML

Data-driven + domain knowledge (acoustic/musical features) [3]

"We provide candidates and let the machine choose"


Page 6: Deep learning for music classification, 2016-05-24


Data-driven approaches
Conventional ML

1/2 Feature selection

Any features that might be relevant to the classification

Spectral features

Spectral rolloff, spectral centroid, MFCC, ZCR, ...

Rhythmic features

Tempo, beat histogram

Tonal features

Key, pitch-class distribution, tonality


Page 7: Deep learning for music classification, 2016-05-24


Data-driven approaches
Conventional ML

2/2 Classifiers

Classifiers select relevant features

To map (aggregated N-dim feature) to (decision)

Classifiers are trained with data

After training, it is usually possible to score how relevant each feature is
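For instance, tree-based classifiers expose such relevance scores directly. A minimal scikit-learn sketch with placeholder data (the random forest and the random data are assumptions, not a method prescribed by these slides):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 62))    # placeholder: 100 tracks x 62-dim features
    y_train = rng.integers(0, 4, size=100)  # placeholder genre labels (4 classes)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.feature_importances_)         # one relevance score per feature dimension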


Page 8: Deep learning for music classification, 2016-05-24


Data-driven approaches
Conventional ML - Genre classification example

audio signal → [STFT] → [MFCC], [spectral centroid] → [concatenate] → [mean, var]
length=N → 256-by-100 → 30-by-100, 1-by-100 → 31-by-100 → 62-by-1

for (x, y) in training data:  # x: audio signal, y: genre label

    1. X = stft(x)
    2. x_mfccs = mfcc(X); x_centroids = spectral_centroid(X)
    3. x_feats = concatenate(x_mfccs, x_centroids)
       # size(x_feats) = (31, 100): feature vectors for every frame in the track
    4. x_feat = concatenate(mean(x_feats), var(x_feats))
       # size(x_feat) = (62, 1): the feature vector of the whole track x

Train the classifier with (x_feat, y)

* Now, we have a system that maps audio signal → genre
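A minimal librosa version of steps 1-4 (librosa and its default frame/hop settings are assumptions here, so the exact matrix sizes differ from the 256-by-100 example above):

    import numpy as np
    import librosa

    x, sr = librosa.load(librosa.example('trumpet'))             # placeholder audio
    x_mfccs = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=30)        # (30, n_frames)
    x_centroids = librosa.feature.spectral_centroid(y=x, sr=sr)  # (1, n_frames)
    x_feats = np.concatenate([x_mfccs, x_centroids])             # (31, n_frames)
    # aggregate the frame-wise features into one track-level vector
    x_feat = np.concatenate([x_feats.mean(axis=1), x_feats.var(axis=1)])  # (62,)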


Page 9: Deep learning for music classification, 2016-05-24


Data-driven approaches
Conventional ML - Genre classification example

audio signal → [feature extraction] → s_feat → [trained classifier] → y_prediction

for new audio signal s,

1. get s_feat
2. predict the genre of the signal!
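A toy end-to-end sketch with placeholder features (the SVM is an assumption; any off-the-shelf classifier fits this role):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_feats = rng.normal(size=(200, 62))   # placeholder 62-dim track feature vectors
    genres = rng.integers(0, 4, size=200)  # placeholder genre labels

    clf = SVC().fit(X_feats, genres)       # training

    s_feat = rng.normal(size=(1, 62))      # stand-in for the features of new signal s
    print(clf.predict(s_feat))             # -> predicted genre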


Page 10: Deep learning for music classification, 2016-05-24


Even more data-driven approaches
Deep Learning

"Why don't we optimise/automate more?"

Because designing features is NOT optimised (and boring)

"We provide candidates and let the machine choose"
→ "Let the machine design and choose features"


Page 11: Deep learning for music classification, 2016-05-24


Even more data-driven approaches
Deep Learning

Deep Learning

Deep == More layers (of Neural Networks) == Some layers serve as feature extractors, the others as classifiers


Page 12: Deep learning for music classification, 2016-05-24


Even more data-driven approaches
Deep Learning

Machines might do better than humans

they don't get bored, compute faster, and are not biased, ...

Machines are more flexible than before

learned classifier AND feature extractor

Machines need more examples to learn from than before

because the number of parameters to learn increases

Humans still decide the structure and the input types


Page 13: Deep learning for music classification, 2016-05-24


Even more data-driven approaches
Deep Learning

End-to-end learning for music audio, Sander Dieleman et al., ICASSP, 2014 [2]

Auto-tagging using deep convolutional neural networks, Keunwoo Choi et al., ISMIR, 2016 [1]


Page 14: Deep learning for music classification, 2016-05-24

[1] Choi, K., Fazekas, G., Sandler, M.: Automatic tagging using deep convolutional neural networks. In: Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016), New York, USA (2016)

[2] Dieleman, S., Schrauwen, B.: End-to-end learning for music audio. In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6964-6968. IEEE (2014)

[3] Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5), 293-302 (2002)

Page 15: Deep learning for music classification, 2016-05-24


Bonus 1. Six selected pages from this slide deck on deep CNNs


Convolutional Neural Networks
A brief explanation

[email protected]

Centre for Digital Music, Queen Mary University of London, UK


Page 16: Deep learning for music classification, 2016-05-24


CNNs: Convolutional Neural Networks

(Deep) Convolutional Neural Networks

deep = cascaded
convolutional = filters
neural networks = things are learned

[Images from cns.org and AlexNet]


Page 17: Deep learning for music classification, 2016-05-24


Hierarchical features

Hierarchical feature learning

Each layer learns features in different levels of hierarchy

High-level features are built on low-level features

E.g.

Layer 1: Edges (low-level, concrete)
Layer 2: Simple shapes
Layer 3: Complex shapes
Layer 4: More complex shapes
Layer 5: Shapes of target objects (high-level, abstract)


Page 18: Deep learning for music classification, 2016-05-24


What is learned in CNNs?
(in an image recognition task)

[Figures from [11]]


Page 19: Deep learning for music classification, 2016-05-24


What is learned in CNNs?
(in an image recognition task)

[Figures from [11]]


Page 20: Deep learning for music classification, 2016-05-24


What is learned in CNNs?
(in an image recognition task)

[Figures from [11]]


Page 21: Deep learning for music classification, 2016-05-24


CNN use-cases
Music information retrieval

Anything people can do by looking at spectrograms

E.g. auto-tagging [1], chord recognition [5], instrument recognition [7], music-noise segmentation [8], onset detection [9], boundary detection [10]

+ style change? source separation? effects/de-effects?


Page 22: Deep learning for music classification, 2016-05-24

Bonus 2. Eleven selected pages from this slide deck on auto-tagging with CNNs


Automatic Tagging using Deep Convolutional Neural Networks [1]

[email protected]

Centre for Digital Music, Queen Mary University of London, UK


Page 23: Deep learning for music classification, 2016-05-24


Introduction
Tagging

Tags

Descriptive keywords that people put on music

Multi-label nature, e.g. {rock, guitar, drive, 90's}

Music tags include Genres (rock, pop, alternative, indie), Instruments (vocalists, guitar, violin), Emotions (mellow, chill), Activities (party, drive), Eras (00's, 90's, 80's).

Collaboratively created (Last.fm) → noisy
  false negatives
  synonyms (vocal/vocals/vocalist/vocalists/voice/voices, guitar/guitars)
  popularity bias
  typos (harpsicord)
  irrelevant tags (abcd, ilikeit, fav)


Page 24: Deep learning for music classification, 2016-05-24


CNNs and Music
TF-representations

Options

STFT / Mel-spectrogram / CQT / raw-audio

STFT: Okay, but why not melgram?

Melgram: Efficient

CQT: only if you're interested in fundamentals/pitches

Raw-audio: end-to-end setup (learn the transformation)
  has not outperformed the melgram (yet) in speech/music
  perhaps the way to go in the future?
  we lose the frequency axis though
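For reference, all four input options can be computed with librosa; a minimal sketch with illustrative default parameters (not the paper's exact settings):

    import numpy as np
    import librosa

    y, sr = librosa.load(librosa.example('trumpet'))      # placeholder audio
    stft = np.abs(librosa.stft(y))                        # STFT magnitude
    melgram = librosa.feature.melspectrogram(y=y, sr=sr)  # mel-spectrogram
    cqt = np.abs(librosa.cqt(y=y, sr=sr))                 # constant-Q transform
    raw = y                                               # raw audio, for end-to-end models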


Page 25: Deep learning for music classification, 2016-05-24


Problem definition

Automatic tagging

Automatic tagging is a multi-label classification task

K-dim vector: up to 2^K cases

The majority of tags are False (whether correctly or not)

Measured by AUC-ROC

Area Under Curve of Receiver Operating Characteristics

[Image from Kaggle]
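A toy sketch of computing the metric on multi-label tag predictions (scikit-learn and the toy numbers are assumptions):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    y_true = np.array([[1, 0, 0],        # ground-truth binary tags per track
                       [0, 1, 1],
                       [1, 0, 1]])
    y_score = np.array([[.9, .2, .3],    # sigmoid outputs of a tagger
                        [.1, .8, .7],
                        [.7, .4, .6]])

    print(roc_auc_score(y_true, y_score, average='macro'))  # AUC averaged over tags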


Page 26: Deep learning for music classification, 2016-05-24


The proposed architecture

4-layer fully convolutional network, FCN-4


Page 27: Deep learning for music classification, 2016-05-24


The proposed architecture

Mel-spectrogram (input: 96×1366×1)
Conv 3×3×128 → MP (2, 4) (output: 48×341×128)
Conv 3×3×256 → MP (2, 4) (output: 24×85×256)
Conv 3×3×512 → MP (2, 4) (output: 12×21×512)
Conv 3×3×1024 → MP (3, 5) (output: 4×4×1024)
Conv 3×3×2048 → MP (4, 4) (output: 1×1×2048)
FCN-6: one additional Conv 1×1×1024 · FCN-7: two additional Conv 1×1×1024
Output 50×1 (sigmoid)

Table: The configurations of the 5-, 6-, and 7-layer architectures (FCN-5, FCN-6, FCN-7). The only differences are the number of additional 1×1 convolution layers.
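A minimal Keras sketch of the FCN-5 column above (an approximation: ReLU activations are assumed, and details such as batch normalisation and dropout from the paper are omitted):

    from tensorflow.keras import Input, Model, layers

    def build_fcn5(n_mels=96, n_frames=1366, n_tags=50):
        x = inputs = Input(shape=(n_mels, n_frames, 1))  # mel-spectrogram input
        for n_filters, pool in [(128, (2, 4)), (256, (2, 4)), (512, (2, 4)),
                                (1024, (3, 5)), (2048, (4, 4))]:
            x = layers.Conv2D(n_filters, (3, 3), padding='same', activation='relu')(x)
            x = layers.MaxPooling2D(pool_size=pool)(x)   # feature map shrinks as in the table
        x = layers.Flatten()(x)                          # 1x1x2048 -> 2048
        outputs = layers.Dense(n_tags, activation='sigmoid')(x)  # 50 tag probabilities
        return Model(inputs, outputs)

Being multi-label, the output is one sigmoid per tag; such a network is typically trained with a binary cross-entropy loss.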


Page 28: Deep learning for music classification, 2016-05-24


Experiments and discussions
Overview

            MTT           MSD
# tracks    25k           1M
# songs     5-6k          1M
Length      29.1s         30-60s
Benchmarks  10+           0
Labels      Tags, genres  Tags, genres, EchoNest features, bag-of-word lyrics, ...


Page 29: Deep learning for music classification, 2016-05-24


Experiments and discussions
MagnaTagATune

Same depth (l=4): melgram > MFCC > STFT
  melgram: 96 mel-frequency bins
  STFT: 128 frequency bins
  MFCC: 90 (30 MFCC, 30 delta-MFCC, 30 delta-delta-MFCC)

Methods                  AUC
FCN-3, mel-spectrogram   .852
FCN-4, mel-spectrogram   .894
FCN-5, mel-spectrogram   .890
FCN-4, STFT              .846
FCN-4, MFCC              .862

Still, with more data a ConvNet might learn a frequency aggregation that outperforms the fixed mel-frequency aggregation. But not here.

ConvNet outperformed MFCC


Page 30: Deep learning for music classification, 2016-05-24


Experiments and discussions
MagnaTagATune

Methods                  AUC
FCN-3, mel-spectrogram   .852
FCN-4, mel-spectrogram   .894
FCN-5, mel-spectrogram   .890
FCN-4, STFT              .846
FCN-4, MFCC              .862

FCN-4>FCN-3: Depth worked!

FCN-4>FCN-5 by .004

A deeper model might catch up after ages of training
Deeper models require more data
Deeper models take more time (cf. deep residual networks [6])

Are 4 layers enough, or is it a matter of (data) size?


Page 31: Deep learning for music classification, 2016-05-24


Experiments and discussions
Million Song Dataset

Methods                  AUC
FCN-3, mel-spectrogram   .786
FCN-4, —                 .808
FCN-5, —                 .848
FCN-6, —                 .851
FCN-7, —                 .845

FCN-3<4<5<6 !

Deeper layers pay off, until 6 layers in this case.


Page 32: Deep learning for music classification, 2016-05-24


Conclusion

2D fully convolutional networks work well

Mel-spectrogram can be preferred to STFT
until we have a HUGE dataset, so that the mel-frequency aggregation can be replaced (learned)

Bye bye, MFCC? In the near future, I guess

MIR can go deeper than now

if we have bigger, better, stronger datasets

Q. How do ConvNets actually deal with spectrograms?

A. Stay tuned to this year’s MLSP paper!
