
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2018

The Effect of Audio Snippet Locations and Durations on Genre Classification Accuracy Using SVM

AXEL KENNEDAL

NICKLAS HERSÉN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Genre Classification with SVM: How the Location and Duration of Audio Snippets Affect Classification Accuracy

Axel Kennedal, Nicklas Hersén

Degree Project in Computer Science, DD142X
Supervisor: Jens Lagergren
Examiner: Örjan Ekeberg

CSC, KTH, May 29, 2018


Summary

Genre classification based on machine learning has a range of applications, for example recommendations in streaming and distribution platforms and automatic tagging of music libraries. Because no exact, objective definitions of specific genres exist, this type of automatic classification is a subjective task. Machine learning based classification attempts to classify songs by comparing so-called feature vectors. The features used have a large impact on the classification accuracy. This report investigates whether the starting location and duration of the selected audio snippets affect the accuracy. Several experiments were run across six music genres, four starting locations and eight snippet durations. The results show that the starting location and duration have a significant impact on the classification accuracy.


Abstract

Real-world scenarios where machine learning based music genre classification could be applied include streaming services, music distribution platforms and automatic tagging of music libraries. Music genre classification is inherently a subjective task; there are no exact boundaries that separate different genres. Machine learning based audio classification attempts to classify audio by comparing feature vectors. Which features to extract, and from which parts of the audio, greatly impacts the classification accuracy. This paper investigates whether different audio snippet locations and durations impact the classification accuracy. A number of experiments were run across six genres, four kinds of snippet locations and eight durations. The results show that these parameters do in fact have a significant impact on the accuracy.


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Scope
  1.3 Approach
  1.4 Thesis Outline

2 Background
  2.1 Definitions
    2.1.1 Genre
    2.1.2 Feature
    2.1.3 Mel-frequency cepstral coefficients (MFCC)
    2.1.4 Support Vector Machine (SVM)
    2.1.5 Timbre
    2.1.6 OffsetMode
  2.2 Previous Work

3 Method
  3.1 Locations and Durations
  3.2 Parameters and Implementation
  3.3 Dataset
  3.4 Statistical significance

4 Results
  4.1 Statistical significance

5 Discussion
  5.1 Possible improvements and future work

6 Conclusion

Appendices

A Confusion Matrices


Chapter 1

Introduction

Genre classification may be applied to a number of real-world scenarios, such as music recommendation in streaming services and music distribution platforms, or automatic tagging of music libraries.

When analyzing audio to perform genre classification it is standard not to analyze entire songs, but to choose small segments ("snippets") of them. This reduces the amount of data that needs to be processed, but may potentially lead to less accurate results. How to choose which part of a song is to represent the entire song therefore becomes an important question. Interestingly, this is a commonly overlooked aspect of genre classification; in our survey of previous work we found that researchers would not provide any argument for their choice of snippet location, but would simply choose the first X seconds of each song, use a library of ready-made snippets [10], or not mention the choice at all [12] [6].

1.1 Problem Statement

This paper aims to investigate whether the audio snippet location and duration significantly impact the classification accuracy.

1.2 Scope

To reduce the complexity of the implementation we will only be using MFCCs for feature extraction (see section 2.1, Definitions). The implementation will make use of existing libraries for feature extraction and machine learning, allowing us to focus on the research questions rather than on the implementation.


1.3 Approach

We will perform a literature analysis and find relevant material about previous studies in the field of machine learning based music genre classification. Based on the results of the literature study we shall identify and apply current state-of-the-art methods for music genre classification to a high-quality dataset. Once we have this foundation we will be ready to perform experiments on the parameters this paper focuses on.

For a more detailed technical description of how this study was conducted,please see the Method section.

1.4 Thesis Outline

The following chapter presents relevant background information on previously conducted studies on machine learning based music genre classification. The subsequent third chapter describes the procedure of the study. The fourth chapter presents the results obtained from the various experiments, while the fifth chapter discusses the results, potential problems and how the study could be further improved. The sixth and final chapter summarises the results and the discussion and answers the research questions.


Chapter 2

Background

2.1 Definitions

2.1.1 Genre

The Oxford Dictionary of English defines genre as "a style or category of art, music, or literature" [2]. Music genres do not necessarily even correlate with the sound of the music; they may be more related to the culture of the musicians creating it, and the time and place where it was created [15]. Kriss notes that there are some problems with determining genres: since there are no exact boundaries between separate genres, the perception of them is rather subjective. In recent times the rapid development of subgenres and hybrid genres continues to blur these lines even further. Tzanetakis, Essl and Cook state that even though the classification of music into genres is subjective, there are perceptual criteria related to the texture (the number of parts and their relationships), instruments and rhythmic structure that can be used to characterize music [5].

Even an experienced listener may sometimes find it difficult to determine the genre of a song. Most people would probably agree that music taste is subjective, and so is the way that we perceive music. As a result there is no standard for classifying music into genres; it is done on the basis of what the listener knows about genres. Songs can be constructed from the same instruments and carry the same emotion, yet still be considered to belong to different genres.


2.1.2 Feature

A feature is an individual measurable property or characteristic of a phenomenon being observed [1]. Selecting informative and independent features is crucial for developing effective algorithms in pattern recognition and machine learning. A set of numeric features can be collected into a feature vector, which is used as input data for classification algorithms.

2.1.3 Mel-frequency cepstral coefficients (MFCC)

Mel-frequency cepstral coefficients are short-term descriptors of the power spectrum of an audio signal. (The power spectrum of an audio signal describes the power of all the frequencies that make up the sound at one given point in time of the audio time series.) MFCCs are among the most popular features used in audio pattern recognition. Their success has partly been due to their ability to represent audio information in a compact form [9]. The frequency bands are equally spaced on the logarithmic mel scale, which approximates the human auditory system more closely than linearly spaced frequency bands.

Figure 2.1: Visualization of 20 MFCC over 30 seconds

The visualization in figure 2.1 shows 20 mel-frequency cepstral coefficients over a period of 30 seconds. One step in the x direction represents one frame. The 20 coefficients are seen as frequency bands along the y-axis. The color represents the power of a certain frequency band in a given frame.
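As a concrete illustration, the sketch below extracts 20 MFCCs with libROSA, the library used in this paper. The file name song.wav is a placeholder, and the frame parameters shown are the libROSA defaults referred to in section 3.2.

```python
import librosa

# "song.wav" is a placeholder path; any audio file readable by libROSA works.
y, sr = librosa.load("song.wav", sr=44100)

# 20 coefficients per frame; n_fft = 2048 samples at 44100 Hz is ~46 ms per
# frame, and hop_length = 512 gives 75% overlap between consecutive frames.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=2048, hop_length=512)
print(mfcc.shape)  # (20, number_of_frames), as visualized in figure 2.1
```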


2.1.4 Support Vector Machine (SVM)

Support vector machines are a family of supervised machine learning models used for binary classification and regression analysis. Given a set of training data where each point is marked as belonging to one of two categories, an SVM model represents the points in an n-dimensional space such that the two categories are divided by a hyperplane. The model strives to maximize the margin separating the two categories. Efficient classification of non-linear data is possible through the use of kernel functions, which implicitly compute a dot product in a high-dimensional space. It is also possible to extend an SVM to classify more than two categories.
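As a minimal illustration of this with scikit-learn (the library used in this paper), the following sketch trains an SVM on toy two-dimensional data; all data values here are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Toy training data: four points in a 2-dimensional feature space,
# each marked as belonging to one of two categories (0 or 1).
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.1], [0.9, 1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="rbf")  # a kernel function enables non-linear boundaries
clf.fit(X, y)
print(clf.predict([[0.1, 0.1], [1.0, 0.9]]))  # points near each cluster
```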

2.1.5 Timbre

Timbre is the characteristic quality of a sound, independent of frequency and loudness. It depends on the relative powers of the frequency components (the fundamental frequency plus overtones) that make up the sound.

2.1.6 OffsetMode

OffsetMode is a term introduced to describe four different types of snippet offsets. Please refer to chapter 3 (Method).

2.2 Previous Work

Previous studies conducted in the field of music genre classification have thus far been concerned with comparing, choosing and optimizing related algorithms, features and parameters.

Algorithms for music genre classification must take the characteristics of a song into account when deciding which genre the song belongs to. Machine learning algorithms attempt to emulate this by comparing the distance between spectral features of the songs. In order to compare songs to each other there is a need for some sort of numerical description of the sound content, which characterizes the sound. This is what is referred to as feature extraction [5]. Feature extraction is a central part of the construction of machine learning algorithms. Deciding which features to extract depends on their computability and their relation to the specified problem. Categorizing audio features into subcategories, based on the work of Weihs et al. [14] and Scaringella [13], assigns audio characteristics to certain categories based on different perspectives (short-term, long-term) and levels (low-level, mid-level). Low-level features can be efficiently extracted from the audio using spectral analysis techniques such as the fast Fourier transform (FFT) or the discrete wavelet transform (DWT). These kinds of features (low-level, short-term) describe the characteristics of an audio signal and are usually called timbre features. Mid-level features (low-level, long-term) attempt to describe patterns such as rhythm, pitch and harmony rather than the characteristics of the signal. Even though mid-level features are easily identified by music listeners, they are not straightforward to define. Low-level features such as timbre are currently the most widely used in music genre classification; however, there are also some algorithms that rely on mid-level features (mostly rhythm) [3].

What features to extract is one of the greatest challenges in comparing the similarity of different songs. How well songs can be classified into genres depends very much on this step. Songs that are considered similar must be described by features that have a relatively small distance between them in feature space, while songs that are considered dissimilar need to be described by features far apart. Another important aspect is that the chosen features should preserve relevant information from the original data (the music snippet). Often these features are chosen by trial and error [5].

After features have been extracted they are used as input to some sort of classification algorithm, for example a machine learning algorithm such as an SVM or a neural network.

The single study we found that investigates the snippet location parameter only compared snippets starting at the beginning of the song with snippets starting exactly one minute into the song. It should also be noted that the paper mainly focused on traditional Malay music, and that no other articles found cited this article for the choice of snippet location [11]. One article states that its authors chose snippet locations "that contained the most representative (i.e. the most instruments playing at the same time) audio information" [5], though it is not established that snippets where the most instruments play at the same time actually are the most representative.

No rigorous studies examining the effect of snippet duration on accuracy werefound, therefore motivating our examination of this parameter as well.


Chapter 3

Method

3.1 Locations and Durations

In order to draw conclusions about how snippet locations affect the accuracy of genre classification we compare a few different types of locations:

• static

– the start of the song (OffsetMode START)

– the middle of the song (OffsetMode MIDDLE)

– the end of the song (OffsetMode END)

• dynamic

– random (OffsetMode RANDOM)

OffsetMode START means that the starting location of the snippet is 0 seconds into the song. For OffsetMode MIDDLE the starting location is chosen as the duration of the song divided by two, minus the duration of the snippet divided by two, effectively centering the snippet in the song. OffsetMode END sets the starting location to the duration of the song minus the duration of the snippet. OffsetMode RANDOM picks, for each song, a different random location between OffsetMode START and OffsetMode END.
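The sketch below illustrates these four starting-location rules; the function name snippet_offset and its arguments (durations in seconds) are our own naming for illustration, not necessarily those used in the actual implementation [4].

```python
import random

def snippet_offset(mode, song_duration, snippet_duration):
    """Snippet start time in seconds for a given OffsetMode."""
    if mode == "START":
        return 0.0
    if mode == "MIDDLE":
        # Song midpoint minus half the snippet length centers the snippet.
        return song_duration / 2 - snippet_duration / 2
    if mode == "END":
        return song_duration - snippet_duration
    if mode == "RANDOM":
        # A different random location per song, between START and END.
        return random.uniform(0.0, song_duration - snippet_duration)
    raise ValueError("Unknown OffsetMode: " + mode)

print(snippet_offset("MIDDLE", 180.0, 30.0))  # -> 75.0
```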

We run the feature extraction algorithm for each OffsetMode using the following durations: 5, 10, 15, 20, 25, 30, 45 and 60 seconds. Classification is performed with an SVM, and then the accuracies of these classifications are compared.

It is important to note that this paper focuses on the relative differences between these classification results.


3.2 Parameters and Implementation

MFCC is a widely used feature for genre classification, justifying our choice of this feature. We use MFCCs with 20 coefficients, as recommended by [12]. The frame size used is approximately 46 ms with 75% overlap between frames (the defaults for libROSA). All music in the dataset has a sample rate of 44.1 kHz. The dataset was split 70/30 for training and testing respectively.

SVM is a common choice of classification algorithm in previous work, which is why it is used in this report. This study applies scikit-learn's SVC Python implementation of an SVM algorithm, using the Gaussian radial basis function (RBF) as the kernel. Multi-class classification is supported through the use of the one-versus-rest approach [7].
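The following sketch shows how such a classifier can be assembled with scikit-learn. The feature matrix here is random placeholder data; in particular, how the MFCC frames of a snippet are aggregated into one fixed-length vector is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder features: one 20-dimensional vector per snippet. In reality
# these would be MFCC-derived vectors, one per song snippet.
features = np.random.rand(300, 20)
labels = np.repeat(np.arange(6), 50)  # six genres, 50 songs each

# 70/30 split between training and test data, as in this paper.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3)

clf = SVC(kernel="rbf", gamma=3, C=1)  # parameters found by the grid search
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # fraction of correctly classified snippets
```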

The parameters for the SVM classifier were found using an exhaustive searchalgorithm (GridSearchCV [8] from scikit-learn) testing combinations of thefollowing parameter values:

• γ: 1/10000, 1/1000, 1/100, 1/10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 1000, 10000

• C: all integers in the range [1, 1000]

Here γ controls the tightness of the model's fit and C is a penalty parameter.
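A sketch of this search, using the grid from the list above; the variables features and labels are assumed to hold the extracted training data, and the fit call is left commented out since the grid contains 17 × 1000 parameter combinations and an actual run is expensive.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "gamma": [1/10000, 1/1000, 1/100, 1/10,
              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 1000, 10000],
    "C": list(range(1, 1001)),  # all integers in [1, 1000]
}

# Exhaustively evaluates every gamma/C combination with cross-validation.
search = GridSearchCV(SVC(kernel="rbf"), param_grid=param_grid)
# search.fit(features, labels)
# print(search.best_params_)  # best found in this paper: gamma = 3, C = 1
```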

Out of all values tested, the ones resulting in the highest classification accuracy were γ = 3 and C = 1. The SVM model was trained using the radial basis function kernel provided by scikit-learn, with the obtained parameters.

Our implementation is written entirely in Python and utilizes the library libROSA for feature extraction and scikit-learn for classification. The full source code for our implementation can be found on GitHub [4].

3.3 Dataset

Kennedal assembled a new dataset for machine learning applied to genre classification, since we were not able to find an existing dataset suitable for this paper. The ones we found either contained only 30 second snippets or had extremely inaccurate genre labels.

We call this new dataset the No Nonsense Music Dataset; it contains a relatively large number of full-length songs distributed across multiple genres. The primary intended use for this dataset is genre classification, and every song has been manually reviewed to verify that it actually belongs to the genre it is labeled with. Each song is approximately 1 to 6 minutes long. The distribution of songs over genres is the following:


• House: 124 tracks

• Rock: 92 tracks

• Dubstep: 67 tracks

• Hip-Hop: 82 tracks

• Classical: 50 tracks

• Reggae: 50 tracks

In total this is 465 tracks, all of which are copyright-free.

300 of these tracks ended up being used in our experiments; 50 tracks wereselected for each genre in order to have a balance between the genres.

3.4 Statistical significance

The classification success rate varied depending on the starting location used, as well as on the permutation of the data used to train the classifier. In order to show that the different starting locations influenced the success rate of the classification we calculated the mean success rate of 121 classifications using different permutations of the data. The number of songs from each genre was constant during each test; that is, 70% of all songs in a specific genre were used to train the classifier, and the other 30% were used as test samples.

$X_k$: observed success rate using the beginning of the song as the starting location.

$$\bar{x} = \frac{1}{|X|} \sum_{k=1}^{|X|} X_k, \qquad s_x^2 = \frac{1}{|X| - 1} \sum_{k=1}^{|X|} (X_k - \bar{x})^2$$

$Y_k$: observed success rate using the middle of the song as the starting location.

$$\bar{y} = \frac{1}{|Y|} \sum_{k=1}^{|Y|} Y_k, \qquad s_y^2 = \frac{1}{|Y| - 1} \sum_{k=1}^{|Y|} (Y_k - \bar{y})^2$$


We assume that the starting location does not have an impact on the success rate of the classification algorithm. If this is the case, then the difference between the two observed success rates should be 0. A significance test is performed at level 0.1% to test this hypothesis.

$$H_0: \bar{y} - \bar{x} = 0 \qquad H_1: \bar{y} - \bar{x} \neq 0$$

Since the variances of the two observations might differ, the number of degrees of freedom is approximated using the Welch-Satterthwaite equation:

$$\mathrm{df} \approx \frac{\left( \frac{s_x^2}{|X|} + \frac{s_y^2}{|Y|} \right)^2}{\frac{s_x^4}{|X|^2 (|X| - 1)} + \frac{s_y^4}{|Y|^2 (|Y| - 1)}}$$

$$-t_{0.0005}(\mathrm{df}) \le \frac{\bar{Y} - \bar{X} - (\bar{y} - \bar{x})}{\sqrt{\frac{s_y^2}{|Y|} + \frac{s_x^2}{|X|}}} \le t_{0.0005}(\mathrm{df})$$

$$I_{\bar{y}-\bar{x}} = (\bar{y} - \bar{x}) \pm t_{0.0005}(\mathrm{df}) \sqrt{\frac{s_y^2}{|Y|} + \frac{s_x^2}{|X|}}$$

If $0 \notin I_{\bar{y}-\bar{x}}$ we reject the null hypothesis at significance level 0.1%; it is then likely that the feature extraction starting location has an impact on the classification success rate. If $0 \in I_{\bar{y}-\bar{x}}$ we cannot reject the null hypothesis, implying that the starting location does not have a significant impact on the classification results.
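For illustration, this interval can be computed with NumPy and SciPy as sketched below, where x and y are arrays of observed success rates for two starting locations; the function name welch_ci is our own.

```python
import numpy as np
from scipy import stats

def welch_ci(x, y, alpha=0.001):
    """Confidence interval for the difference of means at level alpha."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1), y.var(ddof=1)   # sample variances s_x^2, s_y^2
    se2 = vx / len(x) + vy / len(y)         # squared standard error
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2**2 / (vx**2 / (len(x)**2 * (len(x) - 1))
                   + vy**2 / (len(y)**2 * (len(y) - 1)))
    t = stats.t.ppf(1 - alpha / 2, df)      # t_{0.0005}(df) for alpha = 0.1%
    d = y.mean() - x.mean()
    return d - t * np.sqrt(se2), d + t * np.sqrt(se2)
```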


Chapter 4

Results

Figure 4.1: A summary of classification accuracies for different OffsetModes and durations

Figure 4.1 shows a summary of the results. Highlighted in bold is the highest accuracy achieved, 61.1%, and in italics the lowest, 22.2%. OffsetMode END performs the worst of the static locations for all but one duration (20 s). The difference between the best and worst performing combination of OffsetMode and snippet duration was thus 38.9 percentage points.


Figure 4.2: Classification accuracies for OffsetMode START using different snippet durations

Figure 4.3: Classification accuracies for OffsetMode MIDDLE using different snippet durations


Figure 4.4: Classification accuracies for OffsetMode END using different snippet durations

Figure 4.5: Classification accuracies for OffsetMode RANDOM using different snippet durations


Looking at figures 4.2, 4.3, 4.4 and 4.5 one can notice a tendency of decreasing accuracy as the duration decreases. This is most noticeable in the graphs for the static locations, while figure 4.5 shows that the randomness in the location can lead to less predictable accuracies. Averaged over all snippet durations, OffsetMode START performed best at 52.1%, compared to OffsetMode END, which performed worst at 44.2%. OffsetMode START had the highest accuracy of all modes for 6 of the 8 snippet durations.

4.1 Statistical significance

Since zero does not lie in any of the confidence intervals for the respective snippet durations (see table 4.1), we conclude that the starting location has a significant impact on the classification success rate at significance level 0.1%.

Duration (s)       5      10      15      20      25      30      45      60
L              21.58    1.84  -30.15  -51.34  -82.42  -40.25  -44.37  -36.22
U              23.60    4.04  -27.88  -49.30  -80.11  -37.98  -42.13  -33.75

L: CI lower bound (×10⁻³)
U: CI upper bound (×10⁻³)

Table 4.1: Confidence intervals for each duration


Chapter 5

Discussion

Although the highest classification accuracy achieved in this paper was a very modest 61.1%, we are pleased with the results, since what matters in these experiments is the relative difference between the accuracies.

Before running the experiments we had some expectations about the results. In a previous report [5] the snippets for the dataset were manually chosen to contain the most instruments played at the same time, which according to the author would make them the most representative of their genre. With this in mind we expected OffsetMode MIDDLE to perform the best out of the static OffsetModes, since the chorus of a song typically occurs somewhere in the middle of it (and not right at the beginning or end). To our surprise, this assumption about the most instruments played at the same time does not seem to hold.

One thing worth considering for real-world applications of genre classification is the required running time of the algorithms. Longer snippet durations increase the runtime of the feature extraction phase of the classification process. It might therefore be reasonable to choose a snippet duration of approximately 30 seconds if both accuracy and running time are of importance.

When we ran the experiments for this paper, each run (feature extraction + classification) took between 20 seconds and five minutes (when running in a single thread). In general OffsetMode END had the longest running time, presumably because the feature extraction library we used had to skip to the end of the song, thus performing more I/O operations than with OffsetMode START (which was the fastest). Shorter snippet durations also meant shorter running times. The accuracy for OffsetMode START only decreased by 3.4 percentage points (from 57.8% to 54.4%) when decreasing the duration from 60 seconds to 30 seconds. The feature extraction in our implementation is now multithreaded, which led to a very noticeable decrease in running time compared to our early version where it was not. For example, running our program with 60 second snippets using OffsetMode START took approximately three minutes using one thread, and only approximately 50 seconds using eight threads (on an Intel Core i7 with hyperthreading).
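A rough sketch of one way to parallelise the per-song feature extraction, here with a process pool of eight workers; our actual implementation [4] may be structured differently, and extract_features is a placeholder for the real work.

```python
from concurrent.futures import ProcessPoolExecutor

def extract_features(path):
    # Placeholder for the real work: load the snippet at the chosen
    # offset and compute its MFCC-based feature vector.
    return "features for " + path

if __name__ == "__main__":
    paths = ["song_%d.wav" % i for i in range(8)]  # placeholder file names
    with ProcessPoolExecutor(max_workers=8) as pool:
        vectors = list(pool.map(extract_features, paths))
    print(len(vectors))  # -> 8
```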

5.1 Possible improvements and future work

Due to our scope limitation we did not include any mid-level features in our implementation. Previous work has shown that using a combination of mid-level features and low-level features can lead to very high classification accuracy [5], which makes this an interesting addition to consider.

When we conducted the experiments in this paper we used a different frame length and overlap than the ones recommended by Aucouturier and Pachet [12], which may have led to a lower accuracy. Converting stereo files to mono may lead to a loss of precision [5], which is also something we did.

In all of our experiments the algorithm analyzes one snippet per song. An interesting alternative approach to examine would be splitting each song into a number of snippets and letting each snippet cast a "vote" for the genre it most likely belongs to. The final classification would then be the genre that the majority of the snippets voted for.
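A minimal sketch of such majority voting, assuming a list of per-snippet genre predictions is already available:

```python
from collections import Counter

def vote_genre(snippet_predictions):
    """Return the genre that most snippets 'voted' for."""
    return Counter(snippet_predictions).most_common(1)[0][0]

print(vote_genre(["Rock", "Rock", "Reggae", "Hip-Hop"]))  # -> Rock
```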


Chapter 6

Conclusion

The results show that there is a significant difference between different starting locations (at significance level 0.1%). The snippet location generally had a greater impact on the classification accuracy for short snippet durations: for the shortest duration tested in this paper (5 seconds) there was a difference of 21.1 percentage points between the best performing location for this duration (OffsetMode START) and the worst performing one (OffsetMode END). In contrast, the difference between the best and worst performing OffsetModes for the longest duration (60 seconds) was 8.9 percentage points.

In general OffsetMode START outperformed the other OffsetModes. As expected, OffsetMode RANDOM performed neither the best nor the worst, possibly due to overlaps with the other offset modes.

Generally longer snippet durations resulted in higher accuracies.


Bibliography

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. New York, NY: Springer, 2006. ISBN: 0-387-31073-8.

[2] Oxford Living Dictionaries - English. "Definition of genre". In: Oxford Dictionary (2018). URL: https://en.oxforddictionaries.com/definition/genre.

[3] Z. Fu et al. "A Survey of Audio-Based Music Classification and Annotation". In: IEEE Transactions on Multimedia 13.2 (Apr. 2011), pp. 303-319. ISSN: 1520-9210.

[4] Axel Kennedal and Nicklas Hersén. Bachelor's Thesis GitHub. May 2018. URL: https://github.com/axelkennedal/kexjobbet.

[5] Priit Kriss. "Audio Based Genre Classification of Electronic Music". MA thesis. University of Jyväskylä, June 2007.

[6] Matan Lachmish. "Music Genre Classification Using Deep Learning". MA thesis. Tel Aviv University, May 2016.

[7] Scikit-learn. 1.4. Support Vector Machines. Mar. 2018. URL: http://scikit-learn.org/stable/modules/svm.html.

[8] Scikit-learn. Grid Search CV. Apr. 2018. URL: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.

[9] Beth Logan et al. "Mel Frequency Cepstral Coefficients for Music Modeling". In: ISMIR. Vol. 270. 2000, pp. 1-11.

[10] Michael Haggblade, Yang Hong, and Kenny Kao. "Music Genre Classification". In: Department of Computer Science, Stanford University (2011).

[11] Noris Mohd Norowi, Shyamala Doraisamy, and Rahmita Wirza. "Factors affecting automatic genre classification: an investigation incorporating non-western musical forms". In: Queen Mary, University of London (2005).

[12] Francois Pachet and Jean-Julien Aucouturier. "Improving timbre similarity: How high is the sky". In: Journal of Negative Results in Speech and Audio Sciences 1.1 (2004), pp. 1-13.

[13] N. Scaringella, G. Zoia, and D. Mlynek. "Automatic genre classification of music content: a survey". In: IEEE Signal Processing Magazine 23.2 (Mar. 2006), pp. 133-141. ISSN: 1053-5888.

[14] Claus Weihs et al. "Classification in music research". In: Advances in Data Analysis and Classification 1.3 (Dec. 2007), pp. 255-291. ISSN: 1862-5355. URL: https://doi.org/10.1007/s11634-007-0016-x.

[15] Janice Wong. "Visualising music: the problems with genre classification". In: Masters of Media (2011). URL: https://mastersofmedia.hum.uva.nl/blog/2011/04/26/visualising-music-the-problems-with-genre-classification/.


Appendices



Appendix A

Confusion Matrices

Parameters:
Number of songs from each genre: 50
Total number of songs: 300
Test/training split: 30/70
Number of MFCCs: 20
Sample rate: 44.1 kHz

Row: actual genre
Column: guessed genre
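For reference, matrices of this form (and their accuracies) can be computed with scikit-learn as sketched below; y_true and y_pred are toy placeholders standing in for the real test-set labels and predictions.

```python
from sklearn.metrics import confusion_matrix

genres = ["House", "Classical", "Hip-Hop", "Dubstep", "Reggae", "Rock"]

# Toy placeholders; the real lists come from the 30% test split.
y_true = ["House", "Rock", "Reggae", "House"]
y_pred = ["House", "Rock", "Hip-Hop", "Dubstep"]

cm = confusion_matrix(y_true, y_pred, labels=genres)
accuracy = cm.trace() / cm.sum()  # diagonal entries are correct guesses
print(cm, accuracy)
```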

60s snippets

OffsetMode START

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          8          0        0        5       0     2
Classical      0         11        0        1       3     0
Hip-Hop        1          0        5        1       6     2
Dubstep        2          1        3        7       0     2
Reggae         2          0        1        0      12     0
Rock           2          0        4        0       0     9

Accuracy: 57.8% (52/90)


OffsetMode MIDDLE

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House         11          0        1        3       0     0
Classical      0         14        0        0       1     0
Hip-Hop        0          0        2        1       8     4
Dubstep        4          0        2        8       0     1
Reggae         0          0        3        1       9     2
Rock           1          0        2        1       0    11

Accuracy: 61.1% (55/90)

OffsetMode END

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          7          0        1        3       1     3
Classical      0         14        0        0       1     0
Hip-Hop        0          0        8        1       4     2
Dubstep        4          0        3        5       0     3
Reggae         2          0        5        0       6     2
Rock           0          0        1        0       2    12

Accuracy: 57.8% (52/90)

OffsetMode RANDOM

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          4          0        4        4       0     3
Classical      0         13        0        0       2     0
Hip-Hop        1          0        6        2       3     3
Dubstep        1          0        1        9       0     4
Reggae         0          0        8        3       3     1
Rock           0          0        2        0       1    12

Accuracy: 52.2% (47/90)


45s snippets

OffsetMode START

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          7          0        0        6       0     2
Classical      0         11        0        1       3     0
Hip-Hop        1          0        7        1       4     2
Dubstep        2          1        2        7       1     2
Reggae         2          0        1        0      12     0
Rock           3          0        2        0       1     9

Accuracy: 58.9% (53/90)

OffsetMode MIDDLE

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          9          0        2        3       1     0
Classical      0         14        0        0       1     0
Hip-Hop        0          0        1        1       9     4
Dubstep        4          0        3        6       1     1
Reggae         0          0        1        1      10     3
Rock           0          0        2        1       1    11

Accuracy: 56.7% (51/90)

OffsetMode END

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          9          0        1        0       2     3
Classical      0         14        0        0       1     0
Hip-Hop        2          0        7        1       4     1
Dubstep        7          0        3        3       1     1
Reggae         2          0        5        0       6     2
Rock           1          0        1        0       2    11

Accuracy: 55.6% (50/90)


OffsetMode RANDOM

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          2          0        1        7       3     2
Classical      0         13        1        1       0     0
Hip-Hop        0          0        1        2       8     4
Dubstep        2          0        4        3       1     5
Reggae         0          0        3        2       9     1
Rock           2          0        1        1       0    11

Accuracy: 43.3% (39/90)

30s snippets

OffsetMode START

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          5          0        0        5       3     2
Classical      1         11        0        1       2     0
Hip-Hop        1          0        6        3       3     2
Dubstep        3          2        1        6       1     2
Reggae         1          0        2        1      11     0
Rock           2          0        2        0       1    10

Accuracy: 54.4% (49/90)

OffsetMode MIDDLE

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          5          0        2        6       2     0
Classical      0         13        0        1       1     0
Hip-Hop        0          0        1        2       8     4
Dubstep        3          1        1        8       2     0
Reggae         0          0        3        0       9     3
Rock           0          0        2        0       1    12

Accuracy: 53.3% (48/90)


OffsetMode END

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          7          0        2        0       3     3
Classical      0         14        1        0       0     0
Hip-Hop        3          1        6        1       3     1
Dubstep        5          1        3        1       2     3
Reggae         0          0        6        0       6     3
Rock           3          0        1        0       0    11

Accuracy: 50.0% (45/90)

OffsetMode RANDOM

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          4          0        1        2       4     4
Classical      0         13        1        0       1     0
Hip-Hop        0          0        5        1       6     3
Dubstep        0          0        3        5       4     3
Reggae         2          0        2        1       9     1
Rock           3          0        1        2       0     9

Accuracy: 50.0% (45/90)

25s snippets

OffsetMode START

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          5          2        0        4       2     2
Classical      0         11        0        2       2     0
Hip-Hop        1          1        7        2       2     2
Dubstep        3          4        2        4       1     1
Reggae         1          0        2        1      11     0
Rock           2          0        2        0       2     9

Accuracy: 52.2% (47/90)


OffsetMode MIDDLE

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          4          0        3        4       2     2
Classical      0         13        0        1       1     0
Hip-Hop        0          2        1        0       8     4
Dubstep        1          1        1        6       4     2
Reggae         0          0        2        1       9     3
Rock           0          0        1        1       1    12

Accuracy: 50.0% (45/90)

OffsetMode END

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          8          0        3        0       1     3
Classical      0         14        1        0       0     0
Hip-Hop        2          1        6        1       3     2
Dubstep        5          1        3        1       2     3
Reggae         0          1        6        0       5     3
Rock           4          1        0        0       0    10

Accuracy: 48.9% (44/90)

OffsetMode RANDOM

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          4          1        2        1       3     4
Classical      1         12        0        0       2     0
Hip-Hop        2          0        1        0       9     3
Dubstep        1          2        2        2       5     3
Reggae         0          0        1        0      13     1
Rock           0          0        2        1       1    11

Accuracy: 47.8% (43/90)


20s snippets

OffsetMode START

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          7          3        0        2       1     2
Classical      1         12        0        0       2     0
Hip-Hop        1          1        5        2       4     2
Dubstep        3          4        1        4       2     1
Reggae         1          0        1        1      12     0
Rock           2          0        3        0       2     8

Accuracy: 53.3% (48/90)

OffsetMode MIDDLE

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          4          1        1        2       3     4
Classical      0         13        0        1       1     0
Hip-Hop        0          2        1        0       8     4
Dubstep        2          3        0        2       4     4
Reggae         0          0        2        1       9     3
Rock           0          0        1        1       1    12

Accuracy: 45.6% (41/90)

OffsetMode END

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          9          0        2        0       2     2
Classical      0         14        1        0       0     0
Hip-Hop        2          4        5        0       3     1
Dubstep        6          3        1        1       4     0
Reggae         0          1        7        0       4     3
Rock           3          1        1        0       1     9

Accuracy: 46.7% (42/90)


OffsetMode RANDOM

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          1          0        4        2       4     4
Classical      1         12        2        0       0     0
Hip-Hop        1          0        3        0       8     3
Dubstep        0          1        6        1       1     6
Reggae         1          0        2        0       9     3
Rock           3          0        1        0       1    10

Accuracy: 40.0% (36/90)

15s snippets

OffsetMode START

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          6          5        0        1       2     1
Classical      2         11        0        0       2     0
Hip-Hop        2          2        5        0       4     2
Dubstep        2          4        0        4       5     0
Reggae         2          0        0        1      12     0
Rock           2          0        1        0       4     8

Accuracy: 51.1% (46/90)

OffsetMode MIDDLE

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          4          1        1        1       4     4
Classical      0         13        0        1       1     0
Hip-Hop        0          2        0        1       8     4
Dubstep        1          4        1        3       2     4
Reggae         1          0        3        1       7     3
Rock           1          0        0        2       0    12

Accuracy: 43.3% (39/90)


OffsetMode END

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          9          1        4        0       1     0
Classical      0         12        3        0       0     0
Hip-Hop        0          5        6        0       2     2
Dubstep        6          4        1        1       2     1
Reggae         0          3        8        0       3     1
Rock           2          1        1        0       4     7

Accuracy: 42.2% (38/90)

OffsetMode RANDOM

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          0          2        5        1       2     5
Classical      0         13        1        0       1     0
Hip-Hop        0          0        6        0       6     3
Dubstep        2          2        5        3       0     3
Reggae         1          0        4        0       8     2
Rock           1          0        2        0       0    12

Accuracy: 46.7% (42/90)

10s snippets

OffsetMode START

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          6          6        0        0       2     1
Classical      1         11        0        1       2     0
Hip-Hop        1          7        2        0       3     2
Dubstep        1          7        1        2       4     0
Reggae         1          1        1        0      12     0
Rock           2          0        1        0       4     8

Accuracy: 45.6% (41/90)


OffsetMode MIDDLE

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          2          1        0        1       6     5
Classical      0         12        0        2       1     0
Hip-Hop        1          2        1        0       7     4
Dubstep        3          3        0        0       4     5
Reggae         0          0        3        2       7     3
Rock           0          0        2        0       0    13

Accuracy: 38.9% (35/90)

OffsetMode END

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          5          2        4        0       1     3
Classical      0         10        4        1       0     0
Hip-Hop        0         10        1        0       2     2
Dubstep        7          3        1        1       3     0
Reggae         1          5        3        1       4     1
Rock           3          2        0        0       4     6

Accuracy: 30.0% (27/90)

OffsetMode RANDOM

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          2          1        5        1       2     4
Classical      0         13        1        1       0     0
Hip-Hop        0          0        7        0       6     2
Dubstep        2          3        6        0       1     3
Reggae         0          0        5        0       9     1
Rock           0          1        2        0       0    12

Accuracy: 47.8% (43/90)


5s snippets

OffsetMode START

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          5          7        0        1       2     0
Classical      0         12        0        1       2     0
Hip-Hop        0          6        1        2       3     3
Dubstep        4          5        1        3       1     1
Reggae         1          1        1        0      10     2
Rock           0          2        0        1       4     8

Accuracy: 43.3% (39/90)

OffsetMode MIDDLE

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          2          1        0        2       6     4
Classical      0         11        0        3       1     0
Hip-Hop        1          2        1        0       8     3
Dubstep        2          3        0        1       5     4
Reggae         1          0        3        1       8     2
Rock           0          0        0        2       1    12

Accuracy: 38.9% (35/90)

OffsetMode END

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          0          9        1        0       3     2
Classical      0         13        0        0       0     2
Hip-Hop        0         14        0        0       1     0
Dubstep        0          7        2        1       2     3
Reggae         1         11        1        0       0     2
Rock           3          5        1        0       0     6

Accuracy: 22.2% (20/90)


OffsetMode RANDOM

           House  Classical  Hip-Hop  Dubstep  Reggae  Rock
House          5          0        5        0       4     1
Classical      0         13        0        1       1     0
Hip-Hop        1          1        2        0      10     1
Dubstep        4          2        3        0       4     2
Reggae         2          0        2        0      11     0
Rock           9          0        3        0       1     2

Accuracy: 36.7% (33/90)


www.kth.se