Automatic Transcription of Drums and Vocalised percussion · 2019. 7. 14. · segue o método "segment-and-classify" para transcrição de bateria [1]. O LVT tem três módulos: i)

FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Automatic Transcription of Drums andVocalised percussion

António Filipe Santana Ramires

FOR JURY EVALUATION

Mestrado Integrado em Engenharia Eletrotécnica e de Computadores

Supervisor in FEUP: Rui Penha, PhD

Supervisor in INESC: Matthew Davies, PhD

July 13, 2017

c© António Filipe Santana Ramires, 2017

Resumo

A evolução do poder de processamento dos computadores, e consequente capacidade de efetuarprocessamento de sinais digitais em tempo real, levou ao aparecimento de DAWs, tornando acriação musical acessível ao público geral. Com estas alterações, novos instrumentos e inter-faces para a criação de música eletrónica surgiram, mas continua a haver uma grande procura pornovos controladores. Os desenvolvimentos em MIR e em Machine Learning, tornaram possíveissistemas capazes de transcrever frases de bateria e de beatbox. No entanto, esses sistemas são de-senvolvidos com foco na avaliação da performance de algoritmos de transcrição e não são fáceisde usar num cenário de produção musical.

O objetivo principal deste trabalho é criar uma aplicação que permita que produtores musicaispossam usar a sua voz para criar frases de percussão quando compõem em DAWs. Um sistemafácil de utilizar, orientado para o utilizador e capaz de transcrever automaticamente vocalizaçõesde percussão, chamado LVT, é proposto. Esta aplicação foi desenvolvida usando o Max for Live esegue o método "segment-and-classify" para transcrição de bateria [1]. O LVT tem três módulos:i) um detetor de eventos, que deteta o início de uma vocalização; ii) um módulo que extrai ca-racterísticas relevantes do áudio de cada evento; e iii) uma componente de Machine Learning queimplementa o algoritmo k-nearest neighbours para classificação de vocalizações de percussão.

Devido às diferenças nas vocalizações do mesmo som de percussão de diferentes utilizadores,uma abordagem dependente do utilizador foi desenvolvida. Nesta perspetiva, o utilizador finaltem a capacidade de treinar o algoritmo com as vocalizações desejadas para cada som de bateria.Um externo para Max, que implementa o algoritmo Sequential Forward Selection, para escolheras características mais relevantes para cada utilizador é proposto, assim como um dataset anotadode vocalizações de percussão.

A avaliação do LVT feita neste trabalho tem dois objetivos. O primeiro é identificar a melhoriade performance ao ser usado um algoritmo treinado pelo utilizador final, em comparação comum algoritmo treinado por um dataset geral. O segundo objetivo é analisar se o LVT forneceao utilizador um melhor workflow para produção musical em comparação com as ferramentas jáexistentes: LDT [2] e a função do Ableton Live Convert Drums to MIDI. Os resultados mostraramque ambos os objetivos para o LVT foram alcançados.

i

ii

Abstract

The development of computers performance capacity, and consequent possibility for real-timeDigital Signal Processing (DSP) for audio, led to the appearance of Digital Audio Workstations(DAWs), making the creation of computer music available to the general public. Along with thesechanges, new instruments and interfaces for creating electronic music have surfaced. Howeverthere is still a high demand for new controllers. The developments in music information retrieval(MIR) and in machine-learning paved the way for systems capable of transcribing drum loopsand beatboxing. However, these systems are focused on evaluating the performance of transcrip-tion algorithms in offline testing scenarios and are either not easy to operate for end-users or notsufficiently reliable for use in a real music production workflow.

The primary goal of this work is to develop an application that enables music producers touse their voice to create drum patterns when composing in music DAWs. An easy-to-use anduser-oriented system capable of automatically transcribing vocalisations of percussive sounds,called LVT, is presented. This system was developed as a Max for Live device which follows the“segment-and-classify” methodology [1] for drum transcription. LVT includes three modules: i)an onset detector to segment events in time; ii) a module that extracts relevant features from theaudio content; and iii) a machine-learning component that implements the k-nearest neighbours(k-NN) algorithm for classification of vocalised drum timbres.

Due to the differences in vocalisations from distinct users for the same drum sound, a userspecific approach to vocalised transcription was developed. In this perspective, a specific end-user trains the algorithm with their own vocalisations for each drum sound before vocalising thedesired pattern. A Max external, that implements the sequential forward selection for choosingthe features most relevant for their chosen sounds, is proposed as well as a new annotated datasetof vocalised drum sounds.

The evaluation of the LVT presented in this work addresses two objectives. The first one isto identify the improvement when using a user trained algorithm instead of a dataset trained one.The second one is to assess if LVT can provide an optimised workflow for music production inAbleton Live when compared to existing drum transcription algorithms: LDT [2], and the AbletonLive Convert Drums to MIDI function. The results showed that both objectives expected for theLVT were accomplished.

iii

iv

Agradecimentos

Em primeiro lugar, gostaria de agradecer aos meus pais por todo o amor e apoio que me deram,pelo constante desejo de aumentar o meu conhecimento e por sempre terem aceitado as minhasdecisões.

À minha irmã, avós, tios e primos por sempre terem desejado o meu bem e torcerem por mim.Um agradecimento especial à Tia Lurdes, por todo o carinho e, à Mimi, pelos mimos e comidacaseira.

À Catarina, por todo o amor e carinho, por ter aturado todos os meus momentos mais difícies,dando-me força para continuar, pelo quanto me fez crescer e por fazer do meu mundo um mundobem melhor.

Aos orientadores desta dissertação, Professor Matthew Davies e Professor Rui Penha, pelasua supervisão, por terem apoiado esta ideia e me terem dado a oportunidade de trabalhar numaárea que gosto. A Matthew Davies pelo incansável apoio e interesse por esta dissertação e pelaamizade. A Rui Penha pela paixão pela música que consegue sempre transmitir.

A todos os elementos do Sound and Music Computing Group, que me apoiaram nas dificul-dades encontradas nesta tese, especialmente ao Diogo pela sua prontidão em ajudar e esclarecerqualquer dúvida.

Aos meus amigos que sempre lá estiveram para mim e que fortaleceram a minha paixão pelamúsica: Chico, Bicá, Brás, Craveiro, Gonçalo, Martins, Costa, Alex, Cavaleiro e Sérgio.

A todos os que participaram na recolha do dataset e à Rádio Universidade de Coimbra.A todos os outros colegas e amigos que me apoiaram ao longo da vida, especialmente na

Universidade de Coimbra.

António Ramires

v

vi

“Qualquer criador quando estimula outro para fazerem coisas deve sentir-se contente.”

António Pinho Vargas in "À Procura da Perfeita Repetição"

vii

viii

Contents

Abstract i

Acknowledgements v

Abbreviations xv

1 Introduction 11.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.4 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.5 Publication Resulting from this Dissertation . . . . . . . . . . . . . . . . . . . . 2

2 Background and State of the Art 32.1 Vocalised Percussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 Electronic Music Production . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.1 Electronic Music Composition Tools . . . . . . . . . . . . . . . . . . . . 42.3 Music Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.4 Drum Transcription . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.5 Vocalised Percussion Transcription . . . . . . . . . . . . . . . . . . . . . . . . . 92.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Problem Characterization 133.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.2 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Methodology 154.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.1.1 Onset Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.1.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164.1.3 Feature Selection and Machine Learning Algorithm . . . . . . . . . . . . 18

4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.2.1 Onset Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.2.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.2.3 Feature Selection and Machine Learning Algorithm . . . . . . . . . . . . 21

4.3 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244.3.1 LVT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244.3.2 LVT Receiver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

ix

x CONTENTS

5 Data Preparation 275.1 Dataset Recording . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275.2 Dataset Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6 Evaluation 316.1 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

7 Conclusions 417.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427.3 Perspectives on the Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

A seqfeatsel C Code 45A.1 seqfeatsel Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46A.2 Flowchart of the seqfeatsel external . . . . . . . . . . . . . . . . . . . . . . . . 55

References 57

List of Figures

3.1 Different kick drums waveforms overlaid with spectogram. From left to right:Drum kit, beatboxer and vocalised kick drum. . . . . . . . . . . . . . . . . . . . 13

4.1 Flowchart summarising the system . . . . . . . . . . . . . . . . . . . . . . . . . 164.2 Main part of the Max patch responsible for the operation of the system and its

components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.3 Inside the pfft∼ patch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.4 User interface of the LVT device . . . . . . . . . . . . . . . . . . . . . . . . . . 254.5 User interface of LVT receiver device . . . . . . . . . . . . . . . . . . . . . . . 26

5.1 Pattern participants were asked to reproduce . . . . . . . . . . . . . . . . . . . . 275.2 Organization of the dataset files in an Ableton Live project . . . . . . . . . . . . 295.3 Example the audio annotation in Sonic Visualizer . . . . . . . . . . . . . . . . . 305.4 Example of how participants vocalised the pattern . . . . . . . . . . . . . . . . . 305.5 Two different vocalisations of kick drum . . . . . . . . . . . . . . . . . . . . . . 305.6 Two different vocalisations of a snare drum . . . . . . . . . . . . . . . . . . . . 30

6.1 Ableton project for the evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 326.2 Desired Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336.3 How the number of operations was calculated. 1) Delete the extra events; 2) Cor-

rect the events that can be corrected; 3) Add the missing events. . . . . . . . . . 336.4 Effect of changing the window size per vocalised drum sounds and across micro-

phones. All LDT scores are shown in red, Ableton Live (ABL) in green and LVTin blue. The solid lines indicate the laptop microphone, the dotted lines the AKGmicrophone, and the dashed lines the iPad microphone. . . . . . . . . . . . . . . 35

6.5 Transcription of the first user vocalisations using the LVT system trained by thesecond user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6.6 Transcription of the second user vocalisations using the LVT system trained by thefirst user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.7 Effect of choosing a wrong feature for a user. a) 2nd user; b)1st user with featurefor 2nd user; c)1st user. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.8 Example of an LVT transcription . . . . . . . . . . . . . . . . . . . . . . . . . . 376.9 Example of an Ableton Live Convert Drums to MIDI transcription . . . . . . . . 376.10 Example of a LDT transcription . . . . . . . . . . . . . . . . . . . . . . . . . . 37

A.1 Flowchart of the seqfeatsel external . . . . . . . . . . . . . . . . . . . . . . . . 56

xi

xii LIST OF FIGURES

List of Tables

2.1 Summary of the different vocalised percussion approaches . . . . . . . . . . . . 12

5.1 Number of individual hits contained in the recordings . . . . . . . . . . . . . . . 28

6.1 F-measure results for the PC microphone . . . . . . . . . . . . . . . . . . . . . 346.2 F-measure results for the AKG microphone . . . . . . . . . . . . . . . . . . . . 346.3 F-measure results for the iPad microphone . . . . . . . . . . . . . . . . . . . . . 346.4 Number of Operations for the PC microphone . . . . . . . . . . . . . . . . . . . 386.5 Number of Operations for the AKG c4000b microphone . . . . . . . . . . . . . 386.6 Number of Operations for the iPad microphone . . . . . . . . . . . . . . . . . . 38

xiii

xiv LIST OF TABLES

Abbreviations

DAW - Digital Audio Workstationk-NN - k-Nearest NeighboursACE - Autonomous Classification EngineRMS - Root Mean SquareFFT - Fast Fourier TransformANN - Artifical Neural NetworkMFCC - Mel frequency Cepstral CoefficientsBFCC - Bark frequency Cepstral CoefficientsMIR - Music Information RetrievalSVM - Support Vector MachinesGMM - Gaussian Mixture ModelsHMM - Hidden Markov Model IF - Instance Filtering SNR - Signal to Noise RatioNMF - Non-negative Matrix FactorisationDSP - Digital Signal ProcessingMIDI - Musical Instrument Digital Interface

xv

Chapter 1

Introduction

1.1 Context

Music culture has changed a lot in the past years. New music genres were created, new possibili-

ties of production were discovered and new instruments were tested. In this context, in particular

with the emergence of drum machines, different ways of expressing percussive patterns have sur-

faced. The most common interfaces either use pads or a sequencer in order to acquire rhythmic

representations. These tools fail to fulfil their task if the user is not able to reproduce the desired

pattern through finger-drumming or by sequencing it. The human voice is an easy and cheap way

to express a drum pattern. With the development of computers, software Digital Audio Work-

stations began to emerge, and Ableton Live, through the use of the "Convert Drums" to MIDI

function, is able to transcribe drum recordings to a MIDI pattern. The transcription presented by

this function is not accurate if the voiced input does not realistically mimic expected drum sounds,

such as the ones from a drum machine or a drum kit. Therefore this project aims to design an

interface to express drum patterns through the use of human voice.

1.2 Goals

The objectives defined for this project are the following:

• Compile a dataset of vocalised drum patterns to be available online.

• Conceive methods for automatic transcription of vocalised percussion.

• Research techniques for the incorporation of user-input.

• Create a Max for Live device to transcribe vocalised percussion.

• Evaluate the device and compare it to the existent solutions.

1

2 Introduction

1.3 Motivation

With the changes that occurred in music culture, music production and the way musicians work

with their instruments have also changed [3]. The ability to invent and reinvent new ways to pro-

duce music is nowadays a key to progress. Consequently, new proposals, such as designing new

techniques for the composition of music, are necessary. Within the genre of electronic music, the

sequencing of drum patterns plays a critical role. The voice is an important and powerful instru-

ment of rhythm production [4] and it can be used to express a drum pattern. In order to leverage

this concept within a computational system, we create a tool that can help users (both expert mu-

sicians and amateur enthusiasts) input the rhythm patterns they have in mind to a sequencer, via

automatic transcription of vocalised percussion. Our proposed tool is beneficial both from the per-

spective of workflow optimisation (by providing accurate real-time transcriptions) and as a means

to encourage users to engage with technology in the pursuit of creative activities.

1.4 Dissertation Structure

Besides the Introduction, this dissertation contains seven Chapters. In Chapter 2, the state of the

art is described and the evolution of the work in this area is presented. In Chapter 3, the obstacles

we were met with in this project as well as the proposed solution to overcome them are detailed.

In Chapter 4, a description of the theoretical and practical implementation of the LVT system

is given. Chapter 5 describes the procedure used to collect, organise and annotate the dataset. In

Chapter 6, the test methodology and the results for the evaluation of the state of the art systems

and of LVT are presented. Finally, in Chapter 7, the results and contributions of this work are

summarised and future work is proposed.

1.5 Publication Resulting from this Dissertation

This dissertation led to the presentation of the following paper:

• A. Ramires, M. Davies and R. Penha, “Automatic Transcription of Vocalised Percussion” in

DCE17 - 2nd Doctoral Congress in Engineering, 2017.

Chapter 2

Background and State of the Art

2.1 Vocalised Percussion

Vocalised percussion is one of the most intuitive ways for humans to express a rhythm. This

universal language uses phonemes with no meaning to mimic instruments and has been used in

many different cultures, throughout history, either to represent and teach percussive patterns or as

a percussive instrument itself [4] [5].

Firstly, in Australia, the ‘didjeridu talk’ or ‘tongue talk’ is used to memorise and guide a yidaki

(didjeridu) performance and the onomatopoeias used vary between different communities. Both

the conga players from Cuba and the Ewe people from Ghana use vocalised percussion sounds to

speak their riffs. In Asia, the bols and Konnakol from India comprise sound symbols to represent

different tabla hits. In the case of Europe, vocalisation is also used as an instrument itself. The

puirt a’ bhèil from Scotland and Ireland is used as an instrument substitute when the fiddles and

pipes are not available [5].

In the United States, in addition to early examples of jazz skat singing, we have beatboxing,

a form of vocal percussion originated in 1980’s hip-hop culture, where musicians use their lips,

cheeks, and throat to create different beats. This term originally referred to the vocal reproduction

of 80’s drum-machines, also known as beatboxes, that were not affordable by the vast majority of

people from this culture. With the evolution of beatboxing and with the use of microphones, the

range of expressions that beatboxers use is not simply restricted to drum sounds. By using inhaled

and exhaled sounds, different vocal modes such as head voice, growl or falsetto and trills, rolls

and buzzes, the beatboxer can create sounds such as a vocal scratch or a "synth kick" [6] [5] [4]

[7].

2.2 Electronic Music Production

With the increase in computer’s performance capacity, it became possible to do real-time Digi-

tal Signal Processing (DSP) on their personal systems and the emphasis in music-making tools

has gone from hardware to software, and the general public can now make music on their home

3

4 Background and State of the Art

computers. While computer music has been performed in academic research and composition

communities for many years, the availability of accessible software music tools has given rise to

a computer music culture outside these circles. Many exciting kinds of music are being made by

non-academic artists and producers in home studios all over the world [3].

Electronically produced music is part of popular culture. Musical ideas that were once con-

sidered far out, such as the use of environmental sounds, ambient music, turntable music, digital

sampling, computer music, the electronic modification of acoustic sounds, and music made from

fragments of speech, have now been incorporated into popular music. Genres including new age,

rap, hip-hop, electronic music, techno and jazz have been influenced with production values and

techniques that originated with classic electronic music [8].

2.2.1 Electronic Music Composition Tools

Various inventions have been devised to assist musicians in performing, arranging, recording and

composing music. A historically early method of recording music which is still in use today is the

player piano. Holes, corresponding to particular notes, are punched in paper which is rotated as

the player piano is played [9]. Newer tools used in both analogue and software electronic music

production comprise sounds generators, effects processors and mixers [10].

Any technology that can transduce human gesture or movement into electrical signal is avail-

able to be used as an electronic music composition tool. The commonly used technologies include

infrared, ultrasonic, hall effect, electromagnetic and video. With the development of MIDI, com-

puter hardware such as keyboards, switches, pushbuttons, sliders, joysticks or drum pads can also

be used as a means to input patterns and melodies in the computer. Many musicians have built

their own input devices as prototypes by using microphones, accelerometers and other types of

sensors combined to electronic circuitry. But all this hardware only works if it is connected to the

computer and managed by some software - the performance software [11].

2.3 Music Information Retrieval

Music Information Retrieval (MIR) is concerned with the extraction and inference of meaningful

features from music, indexing of music using these features and search and retrieval schemes, as

defined by [12] [13].

During the 2000s, with the development of computers and the corresponding increase in com-

puter power, MIR research shifted its focus from analysing symbolic representation of music

pieces to the use of signal processing techniques directly to the music audio signals [13].

According to Schedl et al., in [13], MIR comprises several investigation subfields. The most

typical ones are the following:

2.3 Music Information Retrieval 5

• Feature Extraction: This first group of topics is related to the extraction of relevant features

from music content. It includes several tasks such as timbre description; music transcrip-

tion and melody extraction; onset detection, beat tracking and tempo estimation; tonality

estimation and structural analysis, segmentation and summarisation.

• Similarity: This subsystem is the core of many applications such as music retrieval and mu-

sic recommendation systems. This comprises tasks such as similarity measurement, identi-

fication of cover songs and query by humming.

• Classification: This group uses the information retrieved by the previous subfields in order

to classify music. Emotion and mood recognition; music genre classification; instrument

classification; composer, singer and artist identification and auto-tagging are common areas

of research.

• Applications: This final subfield comprises the development of application that use MIR

tools. These can vary from audio fingerprinting to playlist generation and music visualisa-

tion.

Feature extraction is essential to this work and its tasks will, therefore, be described more

thoroughly.

Automatic music transcription, according to [14] "is the process of converting an acoustic

musical signal into some form of musical notation". This area is one of the most intensively re-

searched in MIR and is often considered the core technology to improve any MIR system. While

most publications deal with pitched instruments, rhythm extraction is also a major focus. A com-

plete music transcription system comprises various sub-systems such as multi-pitch detection, on-

set detection, instrument recognition and rhythm extraction. A large subset of current approaches

for the transcription of harmonic sounds employs spectogram factorisation techniques, such as

NMF and probabilistic latent component analysis [15] [16].

Beat tracking is defined in [17] as deriving from a music audio signal a sequence of beat

instants that might correspond to when a human listener would tap his or her foot. This task is

related both to note onset detection, which consists in identifying the start points of musical events

in an audio signal, and tempo induction, which resides in finding the underlying rate of a piece of

music [18]. Despite the differences between these two tasks, their investigation has always been

closely connected. Research on this topic started in the 1970s and an overview on its evolution

can be found in [19].

Research on structural analysis or self-similarity analysis mainly consists in detecting signal

changes and repetitions, within the same musical piece. This analysis is based on the computation

of the self-similarity matrix, proposed by Foote in [20]. An important application of this research

is music summarisation, as songs may be represented by their most frequently repeated segments

[13].


2.4 Drum Transcription

Drum transcription is essential in automatic music transcription as, in several music genres, the

drum track possesses information about tempo, rhythm, style and possibly the structure of the song

[15]. Various problems can arise when dealing with this task. These are related to the diversity of

the drum sounds to be labelled, the difference in loudness in different loops and the possibility of

overlapping sounds.

Most drum transcription methods can be separated in three different groups, as proposed in [1]

and [21]:

• Segment and Classify: This first approach segments different drum events and, based on

the features extracted, classifies them using machine learning techniques, such as support

vector machine (SVM) or gaussian mixture models (GMM). This proved successful in solo

drum recordings, but, in polyphonic music, its application is more challenging, as most of

the features used for classification are sensitive to the presence of background music.

• Separate and Detect: The input signal is split in its various components via source separa-

tion. The different streams then go through an onset detector, such as an energy threshold

based one, in order to find the instances of each signal. To achieve source separation, a

time-frequency transform is normally used. This decomposition is traditionally achieved

with independent subspace analysis (ISA) or non-negative matrix factorisation (NMF).

• Match and Adapt: These methods search for the occurrence of a temporal or time-frequency

templates within the music signal and browse a database to find the most similar pattern to

the queried one.

The "segment and classify" and "separate and detect" methods are the most relevant ones

for this work, as they do not need previously created templates to match the query. In order to

power creativity, the user should be able to input any sequence of their own design, and not only

previously constructed ones.

Gillet et al. [22], study the performance of hidden Markov models (HMM) and SVMs on the

transcription of drum loops in a "segment and classify" method. Instead of focusing the identifi-

cation only on sounds taken in isolation, the dataset used consists of pre-recorded drum patterns,

such as those found in commercial sample CDs, where an event can contain more than one drum

hit. In order to split the loop into the corresponding events, an onset detection algorithm based

on sub-band decomposition was used. A k-NN classifier was used to find the most appropriate

group of features to be used in the event classification sub-system. The selected features were:

(1) the mean of 13 Mel frequency Cepstral Coefficients (MFCC); (2) the spectral centroid; (3) the

spectral width; (4) the spectral asymmetry; (5) the spectral flatness; (6) 6 band-wise frequency

content parameters, that correspond to the log-energy in six pre-defined bands. The classification

was done both with an HMM and an SVM. The first class of models was used as it proved efficient

when short term time dependencies exist. This is the case if the sound produced by a drum con-

tinues to resonate when the following stroke happens. Both classes were tested with two different

2.4 Drum Transcription 7

approaches. The first one using only one 2n-ary classifier, in which each possible combination

of strokes is represented as a separate class. The second one uses n binary classifiers, one per

instrument. A third experiment was conducted using a drum kit dependent approach, where four

different HMM classifiers were used, one for each kind of drum kit (Electro, Light, Heavy or Hip-

Hop). The results obtained show that the SVM surpasses the HMMs in all approaches, acquiring

65.1% accuracy using only one classifier, and 64.8% using n binary classifiers. The highest accu-

racy for the HMM classifier occurred when the drum kit dependent approach was used, attaining a

precision of 62.5%, while only 59.1% was achieved in the model trained on all data. The authors

state this probably was due to the high variability of the data set, which the HMM approach could

not handle.

Gillet et al. [23], in order to remove the non-percussive parts of polyphonic music, use a

band-wise harmonic/noise decomposition. This algorithm is only able to identify two classes:

kick and snare. The aim of the first stage of this system is to obtain the stochastic part of the input

signal. Percussive sounds have a strong stochastic component, contrary to the pitched instruments.

In order to achieve this, the input signal is decomposed into eight non-overlapping sub-bands, by

passing it through an octave-band filter bank, since, in this way, the computational cost of the noise

subspace projection is greatly reduced. The second stage is to project the noise subspace. This

is accomplished by using the Exponentially Damped Sinusoidal model. After this, the signals

still contain attacks and transients from pitched instruments, therefore, in the next stage, where

the resulting signal goes to an onset detection algorithm, non-percussive events are also detected.

This is handled by adding another class for these sounds. The onset detection is done by half-

wave rectifying and low-pass filtering the sub-band noise signals, and then finding the peaks of

their derivative. The features extracted from each segment were the energy in the first 6 sub-bands

and the average of the first 12 MFCCs, without c0. The classification used is different from the

standard "segment and classify" approach as some onsets do not contain drum events and have

to be discarded. Two SVM classifiers were used, one for the kick and one for the snare. The

probabilistic output of the SVM classifier was used as a likelihood measure, in order to retrain the

system with the most probable events. The best F-measure achieved was 89.2% for a mix where

the drum was 6dB louder than the rest, and 84.0% for a balanced mix.

Similarly to Gillet et al. in [23], Tanghe et al., in [24], present a strategy to segment and

classify drum events, in real time, in polyphonic music, using an SVM. The algorithm operates in

a streaming way, which allows the processing of big audio files and never ending audio streams.

The onset detection consists of several sub-systems, in order to only detect local maxima, and

detects both drum and non-drum events. Firstly, the audio signal is fed into a short term Fourier

transform and then input to a Mel filterbank. The weighted sum of the differences between the

current amplitude levels and that of the recent past is calculated. Here, the more recent the value

is, the more important it is. By dividing the envelope follower of the output of the Mel filterbank

by the result of the weighted sum, the relative differences in each frequency band are calculated.

The output of this sub-system goes to a peak detector and, if this peak is higher than a selected

threshold, it is sent to a heuristic grouping peak detector that outputs “true” if a local maximum


is reached and “false” if not. A new peak can only be detected if the calculated sum decreases

after the previous peak is identified. A module was created to extract features from the stream of

audio samples. The descriptors obtained were the overall RMS, the RMS in 3 frequency bands, the

RMS per band relative to the overall RMS, the RMS per band relative to RMS of other bands, the

zero-crossing rate, the crest factor, the temporal centroid, the spectral centroid, kurtosis, skewness,

rolloff and flatness and the MFCC and ∆ MFCC. The classification is then executed by an SVM,

trained with annotated audio files. The highest average classification F-measure achieved was

61.1%.

Miron et al. [2] introduce a drum transcription algorithm capable of handling real-time audio.

In order to detect events, first, a high frequency onset detection is used, since this was reported

to be the best for percussive onsets. As this stage can detect false positives, an instance filtering

(IF) method using sub-band onset detection is used. This method uses the complex domain onset

detector in three different frequency bands, one for each class (kick, snare and hi-hat). Since

different drum strokes can occur at the same time, features for each frequency band are computed

separately, so that the noise influence is reduced. The obtained features are computed in the decay

part of each sound and, in order to give less importance to silent frames, weighted with the RMS

value. The extracted features are energy in 23 bark bands, 23 bark frequency cepstrum coefficients,

spectral centroid and spectral rolloff. The machine learning part consists of three k-NN classifiers,

each adapted to the corresponding class. This system is implemented in PD-extended, Max MSP

and in Max for Live. The F-measure result obtained by the event detection sub-system was 93%,

and by the complete system was of 81% for the validation dataset. The use of the IF stage along

with the k-NN classifier led to an increase in the performance and precision in all classes.

A "separate and detect" method for drum transcription was presented by Roebel et al. [25].

The separation of the three sources is done using a non-negative matrix deconvolution, in which

the update rules are obtained from the Itakuro Saito divergence. The detection of drum events uses

three criteria. The first one comprises an activation based test. The second threshold is used in

order to establish a minimum SNR for detected events. The prominence of the target class in the

power spectrum is weighted and compared to a third threshold. If all test are passed, the event is

retained. The use of these three conditions show a better overall performance in the algorithm.

The non-negative matrix factorisation approach can also be used in a "match and adapt"

method, as proposed by Wu et al. [15]. A template adaptive drum transcription algorithm that

uses partially fixed non-negative matrix factorisation is presented. This algorithm uses two dictio-

naries: one previously defined with drum templates and one trained with melodic content, in the

standard NMF manner. Two methods are then tested for the template adaption. For three classes

of drums, the system is able to achieve average F-measures of 77.9% and 72.2% in monophonic

and polyphonic music respectively.

2.5 Vocalised Percussion Transcription 9

2.5 Vocalised Percussion Transcription

The transcription of vocalised percussion deals with automatic identification of vocalisations of

percussive sounds. This area has many applications such as live music transcription, human-

machine musical interaction or even identifying drum loops within a database [26].

Most systems for the transcription of vocalised percussion follow the "segment and classify"

approach and integrate three different parts: a component that separates the different events, a

component that generates descriptors for each event and a machine learning component that as-

signs the different events to the corresponding class [27]. Therefore, vocalised percussion is a

monophonic transcription problem.

Amaury Hazan et al. [27] created a tool for transcribing voice percussive rhythms that aims to

reduce the gap between the user and the device used for acquiring rhythmic representation. This

work focus in transcribing not only standard vocalised drum sounds but also a whole range of

acoustic oral rhythms. In order to separate the different events, an energy based algorithm is used.

This decomposes the input stream into several frames, computes their energy and compares it to

a threshold that is user-defined. In order to obtain descriptors, each event is split into attack and

decay, by finding the maximum of the event’s sound envelope. The features to be extracted were

divided between temporal and spectral. From the decay, both temporal and spectral descriptors

were obtained, whereas in the attack only temporal features were analysed. The spectral descrip-

tors used were obtained through the use of an FFT. These descriptors are: spectral energy, spectral

centroid, flatness, kurtosis and the first five MFCC. In the case of temporal features, the descriptors

used were the duration, the log-duration, the energy, the zero-crossing rate and the temporal cen-

troid. Two different machine learning components were used. The first one was the tree induction

algorithm method C4.5, with and without two optimisations: boosting and bagging. The second

algorithm used was the k-NN. The attained results favoured the utilisation of the C4.5 algorithm

with bagging with 90% accuracy for a test set with recordings from unseen performers, compared

to 87% accuracy for the C4.5 algorithm with boosting, and 79% both for the C4.5 algorithm alone

and for the k-NN.

In the report "Automatic Transcription of Beatboxing" by Christensen et al. [28], the group

developed a MATLAB application that identifies three beatboxing sounds, the kick, the snare and

the hi-hat. In order to segment different events, they use an energy based algorithm similar to

the one in the paper by Amaury Hazan et al. [27]. The classification was done using the k-NN

classifier, previously trained with a beatbox dataset recorded by them. A choice was made to

only test one feature at a time. These were energy, zero crossing rate, the first 20 MFCCs and

spectral centroid, spread and flux. Later, they analysed which k values showed better results for

each feature used. The best performing feature was the MFCC feature vector, with k = 7 or 8. An

accuracy of 98.9% was achieved when using these parameters.

In their paper [4], besides presenting a data collection comprising recordings from both beat-

boxers and non-beatboxers, Sinyor et al. studied the efficiency of the Autonomous Classification

Engine in identifying vocalised percussion sounds. This engine optimises the set of features to


be used by the machine learning component [29]. A second experiment was performed using a

genetic algorithm for selecting features that proved to be superior compared to the 1-NN ACE

experiment. The segmentation of the input was done manually and the descriptors used were both

the average and the standard deviation of compactness, of spectral rolloff, of RMS derivative and

of zero crossing overall, the standard deviation of the overal RMS and of the frequency corre-

sponding to the highest peak of the FFT, and the average of both the the zero crossing derivative

and the strongest frequency of the spectral centroid. The classifier that proved to work best with

ACE was AdaBoost with C4.5 decision trees as base learners, which obtained a 98.15% accuracy

when using 3 classes of sounds, and 95.55% when using 5 classes.

Kapur et al. [7] introduce two different systems. The first one receives a beatboxing loop

and identifies the corresponding drum loop within a bank. The second one transcribes the same

input to the corresponding drum sounds (kick, snare and hi-hat). Despite the first system being

an interesting utilisation for beatbox transcription, we will focus on the transcription application.

The segmentation of the input is also done by splitting the audio when its volume is higher than

a definable value. The classification algorithm used was a backpropagation Artificial Neural Net-

work with a single dimensional feature that was the number of Zero Crossings. This method was

used since it was the one to achieve best results in a real time implementation, having obtained

an accuracy of 97.3%. In this method each of the drum sounds should be recorded 4 times before

the transcription, in order to train the ANN. The user may then record a beatboxing loop, that will

be processed and fed to a sampler, where the user can select the desired real drum sounds. The

transformed query can the be saved as a new audio file.

In [26], Stowell et al. study the effect of delaying the classification in a real time transcription

system and present a new annotated beatbox dataset. They choose to have 3 classes that are kick,

snare and hi-hat. The onset detection was done manually, in order to factor out the influence of

this component in the system. Several different features were obtained through SuperCollider

3.3, and then analysed to see the corresponding effect in the accuracy of the transcription by the

naive Bayes classifier, with the Kullback-Leibler divergence. The best value of time frames to

be analysed for each feature was also tested. The features that proved to be more appropriate for

beatbox transcription were the 25th- and 50th-percentile and the spectral centroid and flux. The

delay that performed best in most tests was 23ms. The obtained accuracy with these parameters

was 88.4 % for the kick, 81.6% for the hi-hat and 53.1% for the snare. A perceptual experiment

was also conducted in order to evaluate the tolerable latency in the decision-making component.

For common drum sounds, the best maximum delay which preserved an excellent or good audio

quality varied from 12ms to 35ms.

Nakano et al. [30] present a "match and adapt" method to retrieve drum patterns from a

database by voice percussion recognition. By using the Viterbi algorithm and only two sound

classes (kick and snare), a recognition rate of the desired pattern of 93% is attained. Gillet et

al. [31] present a system, with the same function as the previous one, that uses a segment and

classify approach. The input is manually segmented and the features used are 13 MFCCs and 13

∆ MFCCs. The transcription of the query is done by using the Bakis (left-right) HMM model.

2.6 Summary 11

Hipke et al. [32] present a transcription system which uses an identifier that is trained by

the end user. This system is named BeatBox and enables end-user creation of custom beatbox

recognisers, represented in the GUI by different pads. Each pad also shows the reliability of

the differentiation for each vocalisation. The onset detection is threshold based and the features

computed are the spectral centroid and RMS. The classification algorithm used was k-NN, which

k value is automatically selected by the system.

The DAW Ableton Live has a "Convert Drums to New MIDI Track" function that "extracts

the rhythms from unpitched, percussive audio and places them into a clip on a new MIDI track"

and should be able to work "with your own recordings such as beatboxing" [33]. In the context of

this work, this is the most similar system to the one we propose, as Ableton Live is a DAW widely

used by both expert musicians and amateur enthusiasts. This feature works satisfactorily when

transcribing drums and beatboxing recordings that imitate drum sets. By testing it with simple

onomatopoeia such as boom, pam, ta, pa or tss the transcription did not work as intended. In

recordings with a cheap microphone, the noise was identified as a hi-hat. Moreover, some of the

vocalisations of snares or kicks were identified as a snare and kick at the same time.

Table 2.1 resumes the previous articles in terms of accuracy, number of classes used, type of

segmentation, the descriptors used and the machine learning algorithm adopted.

2.6 Summary

As was described in this chapter, there are many articles focused on the complications regarding

the transcription of percussive sounds. The presented systems are mostly focused on evaluating

the performance of transcription algorithms and not on the possible applications to end-users.

Despite the satisfactory results, the transcription of percussion presents highly diverse prob-

lems in need of techniques specifically adapted to the main target scenario, whether the input of

the system consists in vocalised percussion, beatboxing, isolated or mixed polyphonic recordings.


Table 2.1: Summary of the different vocalised percussion approaches

Author NumClass

Segmentation Descriptors Used MachineLearning

Accuracyachieved

Hazan [27] 4 EnergyThreshold

Spectral energy, spectral cen-troid, flatness, kurtosis, the firstfive MFCC,duration, the log-duration, the energy, the zero-crossing rate and the temporalcentroid

C4.5w/Bagging

90%

Christensen[28]

3 EnergyThreshold

20 MFCC k-NN 98.9%

Sinyor [4] 5 Manual Average and standard deviationof compactness, spectral rolloff,RMS derivative and zero cross-ing overall; standard deviationof the overall RMS and of themaximum frequency and aver-age of the zero crossing deriva-tive and of the strongest fre-quency of the spectral centroid.

C4.5 w/-Boosting

95.55%

Sinyor [4] 3 Manual Average and standard deviationof compactness, spectral rolloff,RMS,derivative and zero cross-ing overall; standard deviationof the overall RMS and of themaximum frequency and aver-age of the zero crossing deriva-tive and of the strongest fre-quency of the spectral centroid.

C4.5 w/-Boosting

98.15%

Kapur [7] 3 EnergyThreshold

Number of zero crossings ANN 97.3%

Stowell [26] 3 Manual The first 8 MFCCs, the spectralcentroid, spread, flatness, flux,slope,crest, crest in subbandsand distribution percentiles, thehigh-frequency content and thezero-crossing rate.

naiveBayes

Kick 88.4%Snare 81.6%Hi hat 53.1%

Chapter 3

Problem Characterization

3.1 Problem Definition

The problem consists on the creation of a tool that assists producers, either trained in beatboxing

or not, to create patterns with the use of their own voice. This application should receive as

input either a recording or a stream of audio that contains vocalised percussive sounds and output

the corresponding transcription in a ready-to-use MIDI file. Furthermore, the system should be

adapted to vocalised input constraints, such as the inability of producing two sounds at the same

time.

The applications stated in the previous chapter do not suffice if a user-specific vocalised per-

cussion transcription system, aimed at computer musicians, is desired. These tools are aimed at

testing the behaviour of transcription systems, they export results instead of patterns and their

interfaces are not easy to use. Moreover, most of them are aimed at either beatboxing or drum

transcription and are tuned to receive only a selection of sounds.

Figure 3.1: Different kick drums waveforms overlaid with spectogram. From left to right: Drumkit, beatboxer and vocalised kick drum.

A substantial difference exists between the sound produced by a drum kit, by a beatboxer and

13

14 Problem Characterization

by a common user not trained in beatboxing as can be seen informally in Figure 3.1. In addition,

different people vocalise drum sounds in different manners.

Therefore, for this tool to function as required, it has to be adapted to each user and to the

characteristics of the human voice.

3.2 Proposed Solution

In order to solve this problem, a vocalised drum transcription software, able to be trained with the

user vocalisations is proposed. The system is integrated in a Max for Live project. Max for Live

is a visual programming environment, based on Max 7, that allows users to build instruments and

effects for use within Ableton Live.

Firstly, a dataset of vocalised percussion was compiled. It was then annotated using Sonic

Visualiser1, a free application for viewing and analysing the contents of music audio files. The

recordings were saved and organised both in a compressed archive and in an Ableton Live project

file, for compatibility and to facilitate the testing of the transcription systems. These files are

hosted in the project’s web page2.

Then, the system was developed following an user-specific approach. This system follows the

"segment and classify" method previously described and integrates three elements: an onset de-

tector, a component that generates descriptors for each event, and a machine learning component.

The onset detection was done with Aubio Onset∼ [34]. The extraction of the features described

in the state of the art was done in real-time with the use of the object Zsa.mfcc∼, the library

Zsa.descriptors [35] and other Max MSP objects. The first of these tools outputs the MFCCs as

a list, the second one extracts spectral centroid, spread, slope, decrease and rolloff, a sinusoidal

model based on peaks detection and a tempered virtual fundamental [35]. The zero crossing rate

and number of zero crossings were calculated with the zerox∼ object. The machine learning com-

ponent was trained with the user’s preferred vocalisation and the features selected to show the

better results for the provided input. This was done through the use of the Sequential Forward Se-

lection method, with the most significant features selected by the accuracy obtained from testing

the training data. This metric evaluates the most adequate feature in order to achieve the maximum

separation between clusters in a machine learning algorithm. The Sequential Forward Selection

method works by selecting the most significant feature, according to a specific parameter (in this

case the accuracy obtained from testing the training data), and adding it to an initially empty set

until there are no improvements or no features remain. An user interface was created in Max for

Live, so as to facilitate the utilisation of the application.

1http://www.sonicvisualiser.org/2https://lvtsmc.wordpress.com/

Chapter 4

Methodology

In this chapter, the approach used to create an automatic system that transcribes vocalised per-

cussion is described. When developing the system, the focus was on creating a model capable of

delivering a reliable and accurate transcription and which is easily operated by music professionals

and amateur enthusiasts.

This chapter is divided in three different parts. The first one details the approach chosen for

the design of the system, the second one describes how this solution was implemented and in the

third part, the way the system should be used is described. A flowchart of the system functioning

is presented in Figure 4.1.

4.1 Approach

In this section, the operation of the system is detailed and divided into various components. Fol-

lowing [1], this device uses the segment-and-classify approach. This approach proved to be partic-

ularly successful on solo drum signals [1] but not as efficient in polyphonic sounds. In our system

it is only desired to transcribe percussive events and, therefore, this approach was chosen as it is

the most suitable option.

The first component is an onset detector, which is responsible for detecting when events occur.

When these are detected, the second module extracts the features from the relevant time frame and

outputs their value to the final stage, the machine learning and feature selection component. The

features that provide a better classification are chosen and used in a k-NN classifier.

4.1.1 Onset Detection

The onset detection algorithm used is the high frequency content (HFC). In the tests conducted

in [36], it proved to be the most effective method for identifying non-pitched percussive onsets,

detecting 96.7% of all the events and not detecting any false positive. The function was originally

proposed in [37]. This algorithm calculates the weighted mean of the amplitude for each bin. The

higher the frequency is, the more weight the bin has. It is powerful at detecting onsets that can be

modelled as bursts of white noise, such as snares and cymbal sounds.

15

16 Methodology

Figure 4.1: Flowchart summarising the system

4.1.2 Feature Extraction

The module that follows the onset detection is the feature extraction. A set of temporal and spectral

features are extracted from the incoming audio signal when an onset is detected.

The temporal features are the RMS value of the energy and the number of zero crossings. The

number of zero crossings corresponds to the number of times that a signal crosses the x-axis in a

fixed time frame, while the energy corresponds to the RMS value of the energy contained in an

audio frame. These features are calculated over a time frame of 4096 samples, so that the features

are extracted from the first 93ms of the vocalisation for audio sampled at 44.1kHz.

The spectral descriptors extracted are the following:

• Spectral Centroid: This feature corresponds to the centre of mass of a spectrum and is

connected to the perception of a sounds’ “brightness” [38]. Its value can be calculated as

follows:

µ =∑

n−1i=0 f [i]a[i]

∑n−1i=0 a[i]

(4.1)

4.1 Approach 17

where n is half of the FFT window size, i is the bin index, a[i] the corresponding amplitude

and f [i] is its frequency and is calculated as follows:

f [i] = i∗sample rate

FFT window size(4.2)

[35]

• Spectral Spread: This descriptor measures the variance of the spectral centroid:

ν =∑

n−1i=0 ( f [i]−µ)2 a[i]

∑n−1i=0 a[i]

(4.3)

[35]

• Spectral Slope: Calculates the slope of the magnitude spectrum by doing a linear regression

of it:

slope =1

∑n−1i=0 a[i]

n∑n−1i=0 f [i]a[i]−∑

n−1i=0 f [i]∑n−1

i=0 a[i]

n∑n−1i=0 f 2[i]−

(∑

n−1i=0 f [i]

)2 (4.4)

[35]

• Spectral Decrease: This feature is similar to the spectral slope as it also represents the

decreasing of the magnitude spectrum, but according to [39], is supposed to relate to human

perception:

decrease =∑

n−1i=0 a[i]−a[1]

∑n−1i=2:K a[i] (i−1)

(4.5)

[35]

• Spectral Roll-Off: Computes the frequency value so that 95% of the signal is below this

frequency. For x = rolloff point

fc[i]

∑i=0

a2[ f [i]] = xn−1

∑i=0

a2[ f [i]] (4.6)

[35]

• Spectral Skewness: Measures the asymmetry of the spectrum around its centre of mass and

was originally proposed in [39].

• Spectral Flux: Measures how quickly the energy of a signal is changing and is calculated

as the Euclidean distance between two normalised spectra. [40]

18 Methodology

• Spectral Kurtosis: Is similar to skewness but, instead of measuring the asymmetry of the

spectrum, it measures the flatness around its centre of mass.

• Spectral Flatness: Provides a measure of how similar to white noise a sound is and is mea-

sured in four frequency bands (250-500Hz, 500-1000Hz, 1000-2000Hz and 2000-4000Hz)

[41].

• MFCC: These are the coefficients that form a mel-frequency cepstrum and are commonly

used in speech recognition. They represent the spectrum based on its perception and is

considered in [42] to be the “best available approximation of human ear”.

• BFCC: This method is similar to the MFCC, but uses the Bark frequency filter bank instead

of the Mel filters [42].

The extracted features are normalised and, if the output is a frequency value, the scale is

changed from exponential to linear so that the lower frequencies are as important as high frequen-

cies.

4.1.3 Feature Selection and Machine Learning Algorithm

The features extracted in the previous module are given to the feature selection object that is

connected to a machine learning algorithm.

Our system is meant to be user-specific. The classification is adapted to each user and not to

a general dataset. Therefore, each user should train the algorithm with their own vocalisations,

in order to obtain a higher accuracy of prediction. The features that better differentiate each

drum sound vocalisation for each user should be chosen automatically without user interference.

To achieve this, we implemented a feature selection method which is the Sequential Forward

Selection, to be used along with the k-NN machine learning algorithm.

The SFS method was first proposed in [43] and is a bottom-up feature selection algorithm,

which means that it starts with an empty set of features. The feature that provides a best measure-

ment is initially added to this set. Additional features are added sequentially to this set until the

stopping condition is met. This condition normally is a threshold in the performance of the system

or a number of features selected.

The user should first train the system with vocalisations of each drum sound. Therefore the

SFS can use the number of correct k-NN guesses from the training set as a measure of efficiency

for each feature. This was the approach chosen in order to select the features that work better for

each user’s vocalisations. Whenever there is no improvement with the addition of features, the

algorithm stops.

The machine learning algorithm used is the k-Nearest Neighbour. This is a simple method

based on the measurement of the distances between the training data and the input sample. The

Euclidean distance is a common measure of distance between points and can be calculated as√∑

ni=1(xi − yi)2, where x and y are the two points, i the index of the axis and n the total number

4.2 Implementation 19

of axis in the Euclidean n-space. The input sample is classified as the class that has k samples less

distant to it [44].

4.2 Implementation

The system was implemented as a Max for Live device, in order for it to be easy to install and to

work with for Ableton Live users. Max for Live is a toolkit that allows to build devices to be used

in Ableton Live, using the visual programming language Max.

The implementation of the most relevant part of the back end system is shown in Figure 4.2.

Figure 4.2: Main part of the Max patch responsible for the operation of the system and its compo-nents

4.2.1 Onset Detection

Before the audio input that comes from Ableton Live is given to the onset detector, the left and

right channel are summed to mono. This way, there is only one signal chain in the system instead

of two audio channels to be analysed separately.

In Max, the HFC algorithm for onset detection can be implemented using the external object

aubioOnset∼. Aubio [45] is a free and open source C library designed for the extraction of anno-

tations from audio signals. Some of the functions provided by this library were wrapped as Pure

Data externals and, later, the onset detection function was adapted for Max MSP as an external

20 Methodology

by Marius Miron. This object receives one audio signal and, when a peak is detected, outputs a

“bang”.

The aubioOnset∼ is initialised as “aubioOnset∼ hfc 512 128 -70 0.7”. Therefore, the param-

eters used in our system are the following:

• Onset Detection Function: HFC. As previously shown, high frequency content is the most

appropriate method to use when detecting non pitched percussive sounds, which is the case

of the vocalised percussion sounds.

• Threshold: 0.7. This parameter controls the threshold value for the onset peak picking

and the values should be between 0.001 and 0.9 [34]. Different values were tested for

the threshold of the onset detection algorithm. Different audio clips from the dataset were

analysed and 0.7 provided a good balance between detecting most of the vocalisations while

not detecting many false positives.

• Silence Threshold: -70dB. This option corresponds to the volume under which the onsets

will not be detected.

• Buffer Size: 512 samples. This value deals with the number of samples that are present in

the buffer to be analysed. It also corresponds to the window of the spectral and temporal

computations [34]. The bigger this buffer is, the bigger the frequency resolution will be and

the longer it will take to detect an onset. A buffer size of 512 samples provides an accurate

detection of onsets and only corresponds to a delay of approximately 11.6ms if the sample

rate is 44.1kHz.

• Hop Size: 128 samples. This parameter corresponds to the number of samples between two

consecutive analysis frames [34]. The selected value provided a good temporal resolution.

4.2.2 Feature Extraction

The feature extraction is implemented by using either Max MSP native objects or the library

Zsa.Descriptors [35].

The number of zero crossings is calculated using the zerox∼ Max MSP object. This function

receives audio in its first inlet and outputs the number of times the analysed signal passed through

the X axis. The audio frame used for the analysis is the last signal vector. A signal vector is the

block that MSP uses in its operation and its size can be defined in the audio setup window. In

order to derive the energy, an envelope follower is used. The value of the envelope is stored when

an onset is detected and, unlike the rest of the features, it is not used for the classification but

for acquiring a velocity value. The sampled value is compared to a maximum and mapped to a

number between 1 and 127, that corresponds to the velocity of the given vocalisation.

In order to extract the spectral features from the audio, the Zsa.Descriptors library is used. This

library, which was developed in IRCAM by Mikhail Malt and Emmanuel Jordan, covers a large

set of descriptors and is able to work in a real-time situation. In order to increase CPU efficiency,


the Zsa objects are implemented inside the same pfft∼ patch, as shown in Figure 4.3. This object

is a spectral processing manager for patchers and allows to work in Max MSP in the frequency

domain. The window size chosen for the FFT is 4096 samples, in order to provide a good fre-

quency resolution. The purpose of this system is to provide good accuracy in the identification of

the vocalisations, and is not focused on real-time applications. The overlap factor chosen for the

FFT analysis is 8, which corresponds to a hop size of 512 samples.

The number of zero crossings and all the spectral features are extracted 3584 samples after the

onset is detected. The buffer size of the onset detection module is 512 samples. An event can only

be detected in the end of the buffer. As we only want to analyse the audio frame that is after the

onset, the values of the feature extraction are only evaluated 4096 samples after the earliest onset

can happen. If the onset is at the beginning of the onset detector buffer, it will take 512 samples

to be detected and the features that correspond to this event will be extracted 512+3584 = 4096

samples after it occurred. The value of the energy is only calculated when the onset is detected, as

we want to use the maximum power to calculate the velocity, and this occurs at the beginning of

the vocalisation.

Figure 4.3: Inside the pfft∼ patch

4.2.3 Feature Selection and Machine Learning Algorithm

The k-NN algorithm was implemented by using TimbreID, a Pure Data external developed by

William Brent [46] and ported to Max by Marius Miron that implements the k-NN machine learn-

ing algorithm. It can use different metrics for the calculation of the distance but the one chosen

for this system is the Euclidean distance.

22 Methodology

As it did not exist, an external written in C that implements the Sequential Forward Selection

was created, using the Max API [47]. This external was developed to work together with Tim-

breID, the messages sent and received are adapted to this classifier. A flowchart resuming most of

the functioning and the full code can be seen in Appendix A.

When writing an external object for Max, according to Max API, there are five basic steps.

The first one corresponds to adding the ext.h and ext_obex.h header files

The object declaration follows. Here a C structure is declared that contains all the class vari-

ables. In this case, the C structure is declared as follows:

1 t y p e d e f s t r u c t _ s e q f e a t s e l2 {3 t _ o b j e c t ob ; / / t h e o b j e c t i t s e l f ( must be f i r s t )4 boo l i d e n ; / / i s i t a l r e a d y t r a i n e d o r n o t ?5 boo l f l a g ; / / has t i m b r e I D g i v e n an answer ?6 boo l debug ; / / used f o r debugg ing p u r p o s e s7 boo l f a s e ; / / i s t i m b r e I D answer a no c a r e ?8 l ong numFea tu res ; / / number o f f e a t u r e s r e c e i v e d9 l ong rowCount ; / / c o u n t s t h e rows

10 l ong u l t i m a N o t a ; / / l a s t r e c e i v e d MIDI n o t e11 l ong r e s p o s t a ; / / answer from t i m b r e I D12 l ong knn ; / / knn v a l u e13 l ong numNotas ; / / t o t a l number o f n o t e s14 s h o r t nSe l ; / / number o f s e l e c t e d f e a t u r e s15 t_member ∗ n o t a s P C l u s t e r ; / / a r r a y wi th t h e n o t e s f o r t h e c l u s t e r msg16 t _ i n s t a n c e ∗ t r a i n i n g T a b ; / / a r r a y wi th t h e r e c e i v e d f e a t u r e s17 i n t ∗ s e l C o l ; / / columns s e l e c t e d t h r o u g h s f s18 vo id ∗ a _ o u t ; / / o u t p u t t h e column t o use19 vo id ∗ b_ou t ; / / o u t p u t s t h e rows f o r t h e kNN e x t e r n a l20 } t _ s e q f e a t s e l ;

struct.c

The next step is to create an initialisation routine. When the object is loaded in Max this

routine is ran and it informs Max which methods should be run when an instance of the object is

created, destroyed or when it receives a message.

1 vo id ex t_main ( vo id ∗ r )2 {3 t _ c l a s s ∗c ;4

5 c = c l a s s _ n e w ( " s e q f e a t s e l " , ( method ) s e q f e a t s e l _ n e w , ( method ) s e q f e a t s e l _ f r e e ,( l ong ) s i z e o f ( t _ s e q f e a t s e l ) ,

6 0L , A_GIMME, 0) ;7

8 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ m e s s a g e , " l i s t " , A_GIMME, 0) ;9 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ i n 1 , " i n 1 " , A_LONG, 0) ;

10 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ i d , " i d " , 0 ) ;11 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ c l e a r , " c l e a r " , 0 ) ;12 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ d e b u g , " debug " , 0 ) ;


13 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ k n n , " knn " , A_LONG, 0) ;14 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ a s s i s t , " a s s i s t " , A_CANT, 0 ) ;15

16 c l a s s _ r e g i s t e r (CLASS_BOX, c ) ;17 s e q f e a t s e l _ c l a s s = c ;18 }

initialization.c

After this, the new instance routine should be developed. In this function, all the class variables

are initialised, the memory space for the storage of the arrays is allocated and the inputs for the

object are created. The code is presented next:

1 vo id ∗ s e q f e a t s e l _ n e w ( t_symbol ∗s , l ong argc , t_a tom ∗ a rgv )2 {3 t _ s e q f e a t s e l ∗x = NULL;4 x = ( t _ s e q f e a t s e l ∗ ) o b j e c t _ a l l o c ( s e q f e a t s e l _ c l a s s ) ;5 x−>b_ou t = l i s t o u t ( x ) ;6 x−>a _ o u t = o u t l e t _ n e w ( ( t _ s e q f e a t s e l ∗ ) x , NULL) ;7 x−>n o t a s P C l u s t e r = ( t_member ∗ ) sysmem_newptr ( 0 ) ;8 x−> t r a i n i n g T a b = ( t _ i n s t a n c e ∗ ) sysmem_newptr ( 0 ) ;9 x−>s e l C o l = ( i n t ∗ ) sysmem_newptr ( 0 ) ; / / columns s e l e c t e d t h r o u g h s f s

10 i n t i n ( x , 1 ) ;11 x−>i d e n = f a l s e ;12 x−> f l a g = f a l s e ;13 x−>debug = f a l s e ;14 x−>f a s e = f a l s e ;15 x−>numFea tu res = 0 ;16 x−>rowCount = 0 ; / / c o u n t s t h e rows17 x−>u l t i m a N o t a = 0 ;18 x−> r e s p o s t a = 0 ;19 x−>knn = 1 ;20 x−>numNotas = 0 ;21 x−>nSe l = 0 ;22 r e t u r n x ;23 }

newinstance.c

Finally, the message handlers were written. These are the methods that are run when a message

arrives. As it can be seen in the initialisation routine, this object handles 7 different messages:

When a feature list is received, if the object is not trained, the number of features and the

label is stored and the number of instances is incremented. Otherwise, if the object is trained, the

received message is filtered and the object outputs the selected features.

Whenever a message containing “id” is received, this object will start the process to identify

which features provide a better performance for the module. For each feature:

• If the feature has already been selected, jump to the end.

• All instances are sent to timbreID, in order to train it with data.

24 Methodology

• Messages are sent to timbreID informing it how to cluster the training data.

• A message to set the k-NN value is sent to timbreID.

• Send each instance again but this time compare the output from timbreID with the correct

label. In order to make this method wait for timbreID’s answer, a flag is set to 0 and, while

this flag is not changed, the thread is put to sleep.

• A message to reset timbreID is sent.

The feature that best improves the classification is added to the selected features list. If a feature

is in this list, every time features are sent to timbreID, the values corresponding to the instance of

the said feature is also sent. The improvement of the iteration is calculated and, if it is bigger than

0, another feature can be added and the cycle repeats itself. Otherwise, the object is considered

trained, messages are sent in order to train timbreID with the selected features and with the cluster

information.

If an integer is received in the right inlet, the inlet to which the timbreID output is connected,

the flag is set to 1 and the “id” method can stop the while cycle.

When a “debug” message is received, the debug flag is set to 1. This is used for debugging

proposes and prints to Max window important log messages.

The “clear” message resets all the class variables to their initial values and a reset message is

sent to timbreID.

If the “knn (int)” message is received, the k-NN value is set to the one specified in the message.

The final method handles the assist message. This is used to provide visual information in the

Max patch window related to what each input and output does.

4.3 User Interface

Due to the constrains imposed by the Max for Live toolkit, in order to have an instrument capable

of outputting MIDI notes, two devices have to be used. The first one is a Max for Live audio

effect, while the second one is a MIDI effect. The user should load both devices in Ableton Live,

LVT loaded as an audio effect (in an effect rack or on an audio track) and the LVT Receiver loaded

in a MIDI track. The audio effect, LVT, receives the audio input and sends messages containing

the transcribed MIDI notes to the MIDI effect, LVT Receiver, which outputs these notes to the

Ableton Live track. These objects’ user interfaces and how they should be used are described in

the following subsections.

4.3.1 LVT

The user interface of the LVT device contains two panels as can be seen in Figure 4.4. In the first

panel, the number of times that each drum vocalisation to train the identifier is repeated and their

corresponding MIDI note are set and a light for each drum sound is displayed. The second panel

4.3 User Interface 25

contains the Train button to initialise the system’s training, the Reset button that resets the system

to its initial state, a light that informs if the system is trained or not and the device identifier.

Figure 4.4: User interface of the LVT device

The first step when using this system is to set the number of repetitions for each vocalisation.

If one of the percussive instruments is not going to be vocalised, setting this value to 0 will disable

it. Then, the desired MIDI notes for each vocalisation should be set. These should be the notes

that correspond to the desired sounds in the instrument that follows the LVT receiver in the MIDI

chain. LVT allows for up to five different drum types.

To start the training of the system starts by pressing the Train button in the second panel. The

user should then begin vocalising the desired drum sounds by repeating each sound the number

of times defined earlier. As each vocalisation is detected, the correspondingly labelled drum type

will light up. When the training phase has finished, the Trained light is turned on.

Now, the system is trained, or it can be reset to its initial state by pressing the reset button.

When a vocalisation is performed, the corresponding light will flash. The messages are not yet

reaching the LVT Receiver. The set up of the LVT Receiver is described in the next subsection.

In order to link the interface with the transcription system, several small Max patches were

used. The train button works as switch for the audio input. A patch was created to count the num-

ber of training instances and to label each one appropriately. One patch calculates the velocity of

the detected event from the instantaneous RMS energy value. Another patch packs the classifica-

tion from timbreID as a MIDI note and sends it to the LVT Receiver. The final patch is responsible

for getting the device identifier from Ableton Live and to present it in the LVT window.

4.3.2 LVT Receiver

The LVT Receiver user interface only has one panel. This panel contains a drop-down menu used

to select the identification of the LVT sender device, a reset button and a mute button, used to stop

listening the the messages from LVT.

The setup of the LVT receiver is simple. The identifier that is displayed in the LVT window

should be selected from the LVT receiver drop-down menu. After this step, the LVT system is

ready to use and a VST or a plugin can be loaded in Ableton Live after the LVT receiver. The

26 Methodology

Figure 4.5: User interface of LVT receiver device

MIDI transcription can be stored in an Ableton Live MIDI clip by creating a new MIDI track that

gets its MIDI input from the Post FX channel of the track where the LVT Receiver is loaded. There

is a constant delay between the vocalisation and the transcription, therefore, the start time of the

MIDI clip should be adjusted manually when the recording is finished.

This receiver device is responsible for unpacking the messages received from the chosen LVT

device and converting them to MIDI notes. The message is formatted using the midiformat object

which packs the information in a MIDI message that is then output to Ableton Live.

Chapter 5

Data Preparation

In this chapter, the approach used to collect a dataset of vocalised percussion will be described

in detail. The method used to record the vocalisations will be explained first, followed by the

procedure used to annotate these recordings.

5.1 Dataset Recording

In order to collect different vocalisations of percussion, a group of 11 men and 9 women were

selected. A similar gender distribution was desired so that this dataset samples the real world.

Almost all of the participants work at a University Radio and 7 of the participants are music en-

thusiasts and have some knowledge of music making, while the rest have basic music knowledge.

Only one of the people involved has beatboxing skills, while the others vocalised the percussive

sounds in a “less professional” way.

Participants were first asked to reproduce 4 bars of a fixed drum pattern with the vocalisations

they were more comfortable with and, after this, 4 bars of improvisation using the same sounds.

The pattern the participants were asked to reproduce was a simple 4 bar loop with kicks on the first

and second beat, snare on the third and hi-hats between them, as shown in Figure 5.1. Participants

were first familiarised with this pattern by listening to it played through an 808 drum-kit, as some

of them could not read music scores. Based on the state of the art, only these three drum sounds

were chosen to be vocalised.

Figure 5.1: Pattern participants were asked to reproduce

27

28 Data Preparation

Participants were given Audio Technica ATH M50X headphones with a metronome at 140

bpm to have a time reference. In order to collect data for possible different use cases, the audio

output was recorded through 3 different microphones (one from a laptop, an AKG c4000b and one

from an iPad). The first one had a lot of noise due to its poor quality, the second one provided a ref-

erence of a good microphone, while the last one is a reference to good mobile phone microphone.

The recordings were made in a sound treated studio in order to isolate external noises.

This process led to 120 audio clips with approximately 6 seconds of duration. The number of

different drum hits contained in these recordings can be seen in Table 5.1.

Fixed Pattern Improvisation TotalKick 8×20×3 = 480 164×3 = 492 972Snare 4×20×3 = 240 98×3 = 294 534Hi-hat 8×20×3 = 480 181×3 = 543 1023

Table 5.1: Number of individual hits contained in the recordings

5.2 Dataset Annotation

The audio files that resulted from the recordings referred in the previous section were compiled in

2 folders, one for the improvisation part and the other one for the fixed pattern recordings. Each

file was given a name that contained the code for each person, ‘I’ if it was an improvisation or ‘P’

if it was a fixed pattern and a number corresponding to the microphone that recorded the audio (1

for laptop microphone, 2 for the AKG microphone and 3 for the iPad one).

Besides saving the recordings in files, these were also collected in an Ableton Live project, as

seen in Figure 5.2. The files were split in 2 groups, each one having a track for each microphone.

Both the files with the annotations and the Ableton Live project are available in the dissertation

website1.

The annotation of the audio was done using Sonic Visualizer2, an application for viewing and

analysing the contents of music audio files, developed at the Centre for Digital Music, Queen

Mary University of London. The detection of the onsets was done manually and each event was

labelled as kick, snare or hi-hat. The transcription was both saved as a .csv file and as a MIDI file,

with the kick in the note 36, snare being the note 38 and the hi-hat 42. The resulting files were

named with the same name as the audio files, but without the microphone number. An example of

a transcription can be seen in Figure 5.3.

From the resulting transcriptions, we could see that participants did not reproduce the hits on

time as per the 140bpm metronome, as seen on Figure 5.4. However, this has no effect on the

transcription system or accuracy which is not tempo dependent.

Participants did not vocalise the drum hits in the same way. First, the user with beatboxing

knowledge vocalised the kick and snare in a different manner than the rest of the participants, as

1https://lvtsmc.wordpress.com/2http://www.sonicvisualiser.org/

5.2 Dataset Annotation 29

Figure 5.2: Organization of the dataset files in an Ableton Live project

can be heard in the dataset (JSil audio clip). Besides this, participants vocalised the kick and the

snare in the way it was easiest for them, therefore, different sounds were used to reproduce the

same drum hits, as can be seen in Figure 5.5 and 5.6.

30 Data Preparation

Figure 5.3: Example the audio annotation in Sonic Visualizer

Figure 5.4: Example of how participants vocalised the pattern

(a) A vocalised kick

(b) A kick reproduced by a beatboxer

Figure 5.5: Two different vocalisations of kick drum

(a) One vocalisation of a snare (b) Another possible vocalisation of a snare

Figure 5.6: Two different vocalisations of a snare drum

Chapter 6

Evaluation

This chapter describes the methodology used to test and evaluate the performance of the LVT

system, in comparison with the existing solutions, LDT [2] and Ableton Live Convert Drums to

MIDI function.

6.1 Experiment Design

The evaluation for LVT comprises one principal experiment that serves two purposes. The first

is to understand how a user specific trained system performs compared to the state of the art, i.e.,

systems which are trained to work on general drum timbres, while the second purpose is to explore

whether LVT can help to improve a producer’s workflow by examining the effort required to get

from a vocalised input pattern to a accurate MIDI representation.

In order to evaluate the three systems in the same data and still being able to use different data

for the training and the evaluation of LVT, some work was done on the dataset. Five kick, snare

and hi-hat vocalisations were extracted from the improvisation part of the dataset so as to create

training clips for the LVT (presented in Section 4.3). These clips were created in a manner that

simulates how a user would train the algorithm, with a speed similar to the one each participant had

in their improvisation audio recording. Seven of the contributors of the dataset did not vocalise at

least five times each drum sound and, therefore, their recordings were removed from the evaluation

data. This resulted in an evaluation set of 13 participants with both a training and a testing audio

clip recorded in three different microphones, which corresponds to 78 audio clips. These clips

were compiled in an Ableton Live project, which is available in the dissertation website1 and that

can be seen in Figure 6.1. This Ableton Live project contains an audio track and three MIDI

tracks for each microphone. The audio tracks contain the training and the testing audio clips and

the MIDI tracks contain the clips with the transcriptions from each system: LVT, LDT Max for

Live device [2] and Ableton Live Convert Drums to MIDI.

To obtain a measure of the accuracy of a user trained system compared to the state of the art

systems, the F-measure of the transcriptions was calculated. The F-measure is the harmonic mean

1https://lvtsmc.wordpress.com/

31

32 Evaluation

Figure 6.1: Ableton project for the evaluation

of precision and recall, and can be calculated as follows:

2∗ precision∗ recallprecision+ recall

. (6.1)

Where:

precision =truepositives

truepositives+ f alsepositives(6.2)

recall =truepositives

truepositives+ f alsenegatives. (6.3)

This was calculated by importing all the transcription MIDI clips with MIDI Tools2 into MAT-

LAB, comparing the transcriptions with the annotations and then calculating the F-measure for

each drum and the average of these values. These results were plotted to see the effect of increas-

ing the F-measure tolerance window (as a means to understand the effect of temporal localisation)

in this accuracy measurement.

Finally, to acquire a measure of how this system can improve a producers workflow, the time

to get a transcription and the number of operations needed to get to the desired patterns were

calculated.

To measure the time an Ableton Live transcription takes, a stopwatch was used to measure the

time since the "Convert Drums to MIDI" button was pressed until the resulting MIDI clip came

into view. This procedure was done in 9 random clips from the 3 different microphones. These

measurements were then averaged. As LDT works in real time, the time to achieve a transcription

is the same as the time of the audio recording to be transcribed. Finally, in order to obtain a

2http://www.ee.columbia.edu/∼csmit/matlab_midi.html

6.2 Results 33

transcription from LVT, two times have to be measured. First, the training time corresponds to the

time of the training audio clips which was calculated and averaged. The time to transcribe a given

audio recording when the system is trained corresponds to the time of this recording.

Then, the number of operations to achieve the desired pattern, that can be seen in Figure 6.2

was calculated.

Figure 6.2: Desired Pattern

Figure 6.3: How the number of operations was calculated. 1) Delete the extra events; 2) Correctthe events that can be corrected; 3) Add the missing events.

The possible operations are divided in three categories: to correct, remove, or add an event.

The procedure to compute these values is explained in Figure 6.3. First, all the additional events

are removed. These are the events that do not correspond to a real onset or that cannot be corrected

to the real classification. Afterwards, the events that are the result of a misclassification are cor-

rected and, finally, the missing events are added to the MIDI clip. The number of operations for

each clip was written down and, then, the number of operations for each category was calculated

for each microphone and for each transcription system. Since it is not reasonable to assume that

a producer editing the transcriptions would work at a constant speed, it was deemed more reliable

to take an objective measurement of workflow effort in terms of the number of operations, rather

than recording the temporal duration - as per the algorithm processing.

6.2 Results

In this section the results from the evaluation previously described are presented.

The calculated results of the F-measure accuracy (with a tolerance window of ± 0.035 sec-

onds) for each microphone are presented in Tables 6.1, 6.2 and 6.3.

34 Evaluation

Kick Snare Hi-hatLVT 27.9% 18.1% 7.6%Ableton Live 34.4% 34.3% 15.8%LDT 3.87% 15.4% 11.8%

Table 6.1: F-measure results for the PC microphone


Table 6.2: F-measure results for the AKG microphone


Table 6.3: F-measure results for the iPad microphone

These results show that, except for the laptop microphone and the snare from the iPad micro-

phone, LVT achieves much better performance than the other systems, sometimes even the double

of the F-measure is achieved. All systems report low accuracy on the laptop microphone. The

performance of the state of the art systems in the AKG and iPad microphone are similar, whilst,

for LVT, the use of the laptop microphone leads to the worst overall performance.

The effect of changing the window size value for the F-measure for every drum can be seen

in Figure 6.4, where each line represents the performance of each system for each microphone (1-

laptop microphone, 2- AKG microphone and 3-iPad microphone).

The values for the F-measure stay approximately constant for values higher than 0.035 sec-

onds. The performance of LVT on the recordings from the AKG microphone surpasses the others

by a significant amount, as shown by the fact that it outperforms all other configurations for all

window sizes. On the kick and hi-hat, it is followed by the performance of LVT on the iPad

recordings.

In order to see the effect of user-specific training on the performance of LVT, an example is

provided where LVT is trained on one user and tested on another – and vice-versa. When training

the LVT with a different person with different vocalisations, the accuracy of the transcription

is decreased as can be seen in Figures 6.5 and 6.6. In the upper part of the pictures there is the

transcription of the user when trained with its own vocalisations, while the bottom part corresponds

to the transcription when trained with the other user. As has been pointed out by these figures,

without the user-specific training, a lot of misclassifications occur.

A surprising observation for these two users is that the feature selection in the training phase

suggested the need only for a single feature per user. In other words, the different drum vocalisa-

tions in training could be perfectly separated using just one dimension. The selected feature for

6.2 Results 35

0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10

0.2

0.4

0.6

0.8

1Kick

Window Size in seconds

Fm

easu

re

0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10

0.2

0.4

0.6

0.8

1Snare


Fm

easu

re

0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10

0.2

0.4

0.6

0.8

1Hi hat


Fm

easu

re

LDT1LDT2LDT3ABL1ABL2ABL3LVT1LVT2LVT3

0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10

0.2

0.4

0.6

0.8

1Average


Fm

easu

re

Figure 6.4: Effect of changing the window size per vocalised drum sounds and across micro-phones. All LDT scores are shown in red, Ableton Live (ABL) in green and LVT in blue. Thesolid lines indicate the laptop microphone, the dotted lines the AKG microphone, and the dashedlines the iPad microphone.

Figure 6.5: Transcription of the first user vocalisations using the LVT system trained by the seconduser

the first user was the spectral flatness from 0 to 250 Hz and for the second one was the spectral

skewness.

The effect of selecting the wrong feature on clustering can be seen in Figures 6.5, 6.6 and 6.7.

In the latest, the distribution of kicks (circle), snares (diamond) and hi-hat (*) are shown. In a),

the second user vocalised pattern elements are distributed from 0 to 1 according to the value of

the selected feature (skewness). In b), the elements from the first user vocalisation according to

36 Evaluation

Figure 6.6: Transcription of the second user vocalisations using the LVT system trained by thefirst user

the feature selected from the second user are displayed and finally, in c) these same elements are

distributed according to the most appropriate feature according to SFS (flatness from 0 to 250 Hz).

When the appropriate feature is selected, it can be seen that the different drum hits are closely

clustered. When this is not the case, the drum hits are more spread and the regions for each drum

are not well defined.

Figure 6.7: Effect of choosing a wrong feature for a user. a) 2nd user; b)1st user with feature for2nd user; c)1st user.

In terms of a qualitative comparison between LVT, LDT and Ableton Live, an example tran-

scription can be seen in Figure 6.8 for LVT; for Ableton Live Convert Drums to MIDI in 6.9; and

for LDT in 6.10.

For an example where LVT transcribes the vocalised pattern accurately, Ableton Live detects

constantly hi-hats on top of the other drum sounds and even on silence. Furthermore, it did not

detect any kick drum in this recording. In turn, LDT, besides detecting all the ground truth events,

also identified a lot of false positives.

6.2 Results 37

Figure 6.8: Example of an LVT transcription

Figure 6.9: Example of an Ableton Live Convert Drums to MIDI transcription

Figure 6.10: Example of a LDT transcription

38 Evaluation

The timing measurements for each of the systems are the following:

• Ableton Live Convert Drums to MIDI: 12.9s

• LVT: 6.2s+6.9s= 13.1s (average of training clips + audio clip time)

• LDT: 6.9s (audio clip time)

In order to achieve a transcription, LDT is the quickest one, followed by Ableton Live and then

LVT. Ableton Live Convert Drums to MIDI function firstly shows a loading window, but after it

is finished, some time elapses until this DAW displays the MIDI clip in its user interface.

In addition to these processing times, required to give an initial automatic transcription, Tables

6.4, 6.5 and 6.6 summarise the results obtained from counting the total number of operations

needed to obtain the desired pattern for each microphone. The total number of events for each

microphone is 13∗20 = 260 which corresponds to the number of users ∗ the number of events in

each audio clip.

Corrected Added RemovedLVT 26 182 3Ableton Live 22 33 440LDT 52 124 95

Table 6.4: Number of Operations for the PC microphone


Table 6.5: Number of Operations for the AKG c4000b microphone


Table 6.6: Number of Operations for the iPad microphone

From these tables, it is easy to see that the transcriptions from LDT and Ableton Live require

a lot of events to be removed, whilst the ones from LVT do not. The number of corrected vocalisa-

tions from LVT and Ableton are similar, while the ones from LDT remain approximately constant

for all microphones. For the laptop and iPad microphones, LVT under-detected more events than

the rest of the systems.

6.3 Discussion 39

6.3 Discussion

In this section, the results presented in the previous section are analysed and discussed.

By examining the previously shown results provided in Tables 6.1, 6.2 and 6.3, we can under-

stand that the LVT provides a transcription closer to the ground truth than the generally trained

state of the art systems, shown by the higher F-measure. Besides the detail that LVT is trained

per user, these results may derive from the fact that this system does not try to detect polyphonic

events (more than one drum vocalisation at the same time) as the other systems do. Furthermore,

LVT does not detect as many events as the other systems, and, therefore, this has an influence in

the F-measure results, in terms of false positives.

From Figure 6.7, we can see that feature selection is an important step to acquiring an accurate

transcription. In order to have the k-NN algorithm work as good as possible, the different vocali-

sations must be tightly clustered, as it can be seen in a) and c). From Figures 6.5 and 6.6 we can

understand that having a system trained with a different user reports an inaccurate classification

and, therefore, we can see the importance of training the system to adapt the user vocalisations.

For the small cost in terms of timing for training the LVT, the transcription accuracy is greatly

increased and, as shown by having far fewer post-transcription operations, a significant amount of

time is saved when correcting the transcribed pattern. On this basis, the end-to-end workflow, from

training to transcription to correction is most efficient for LVT suggesting a real tangible benefit for

user-adaptive analysis. However, while LVT performs especially well with the AKG microphone

and with the iPad microphone, its performance with the computer microphone is particularly poor.

This microphone has a lot of background noise and the system is not able to detect onsets and

hence cannot provide a transcription. Thus within the processing pipeline, accurate onset detection

is extremely important, and its impact is directly observable in the transcription accuracy.

Concerning other possible limitations of LVT, if a user does not vocalise the drum sounds the

same way in the training and in the identification phase, the transcription will not work well – a

factor which more generally trained systems would not be susceptible to. Furthermore, if a user

vocalises drums that sound too similar to each other, as it was the case in some of the clips from the

dataset, there will not be a separation clear enough of the events from the perspective of the audio

features, and therefore the machine learning component will struggle to identify them correctly.

Finally, if a user vocalises the drum sounds too quietly, the onset detection will not work and the

event will not be transcribed.

As a final point, in order to have a quantitative measure on how a system performs in terms of

producer’s workflow, the number of operations to achieve a desired pattern is considered a more

meaningful measure than the F-measure. This is due to the fact that F-measure is dependent on

the window size and on the fact that a misclassification is represented both as a false positive and

as a false negative. The same occurs with poorly timed transcriptions, where an onset outside the

F-measure window also contributes to this calculation. These two possible deviations are easily

fixed in a DAW via simple shifting operations and are thus, less significant errors than totally

spurious false positives or totally missing events.

40 Evaluation

Chapter 7

Conclusions

7.1 Summary

In this dissertation, a new interface for music creation, called LVT, was presented. This system

allows Ableton Live users to sequence MIDI patterns that can be used for designing rhythms by

using their voice. The state of the art systems, including one already in Ableton Live, are not

able to transcribe vocalised percussion effectively, as these are trained for general recorded drum

sounds which are not vocalised. Different people vocalise drum sounds in different manners, a

snare drum vocalisation from a user can sound similar to the hi-hat vocalisation from another user.

LVT has to be trained before it is used, in order to fit the vocalisations of any end user. As each

user can choose the desired vocalisations for each drum sound, the system is versatile enough to

also transcribe drums or any kind of unpitched percussive sounds. As long as the training sounds

are different enough from each other, the system is able to choose the features that provide a

good separation and therefore a good classification accuracy for any input. In order to improve

the accuracy of this system, a Max external that implements the Sequential Forward Selection for

selecting features was developed. LVT is implemented as a Max for Live device, which enables

Ableton Live users to use this system by interacting with the simple and easy-to-use graphical user

interface designed for it.

The evaluation of the LVT and of the existing state of the art systems was done by running tests

on a dataset that was recorded and annotated. In order to collect different percussion vocalisations,

partakers expressed a fixed pattern with the vocalisations they found more suitable. The evaluation

of the accuracy of the transcription was done by calculating the F-measure and by counting the

number of actions needed to transform the resulting transcription in the desired pattern. The F-

measure is an adequate evaluation of the transcription accuracy while the number of operations

relates to how this accuracy affects the Ableton Live user workflow. LVT produced superior results

in both tests, showing that this tool can be used as an alternative to the existing drum transcription

systems in order to create MIDI drum patterns using the voice as the instrument.

41

42 Conclusions

7.2 Future Work

Despite the good results of the LVT described earlier, there are some features that can be added

to this system in order to improve its usability and performance. Due to time restrictions and the

amount of work some of these features need, they were not yet implemented. These improvements

are the following:

• Settings window: Add a window to the User Interface where system parameters can be set.

Examples of possible relevant parameters that may be changed by the user are the number

of neighbours for the k-NN classifier and the threshold and mode for the onset detection.

These values can only be changed inside the Max for Live patch and, therefore, they are not

easily accessible to the user.

• Save and Load button: Adding a Save and a Load button. This enables the possibility to

save the training in a file, so that the system does not have to be trained each time the user

loads it and provides more portability between computers for the same user.

• More feature selection methods and machine learning algorithms: Evaluate the effec-

tiveness of other feature selection or extraction methods and add the possibility to choose

the one desired by the user. Possible methods to be added are Sequential Backward Selec-

tion, Generalised Sequential Forward and Backward Selection, Sequential Forward Floating

Search or even adding the possibility for the user to manually select the desired features to

be extracted. Another machine learning algorithms can also be implemented and evaluated.

• Export the selected features report: The system prints to the Max command window the

index of the selected features. Adding the possibility of exporting a report that contains the

chosen features is a possible improvement.

• More features extracted: Adding more features to the feature extraction module. Temporal

and spectral features can be added such as the duration or the spectral roughness. Adding

the duration as a feature can help to make a distinction between open and closed hi-hats, as

some vocalisations of these cymbals only differ in their duration.

• Further Testing: More testing should be done on LVT with a different number of vocalised

drum sounds and with different values for the training instances.

• Pattern-Based Analysis: Researching pattern-based analysis techniques to be incorporated

in the LVT system. A possible implementation of this topic is to detect microtimings on the

vocalised pattern and correct the result when these are not present.

7.3 Perspectives on the Project 43

7.3 Perspectives on the Project

To introduce participants with no background in music production to music creation and its tech-

nology, as well as allowing them to improvise with vocalised drum sounds, was a rewarding expe-

rience. Furthermore, this project provided me with insight into the development of Max externals

and how Max MSP works, on feature extraction, on onset detection and on machine learning.

44 Conclusions

Appendix A

seqfeatsel C Code

This appendix contains the C code for the Max external seqfeatsel and the corresponding flowchart.

45

46 seqfeatsel C Code

A.1 seqfeatsel Code

1 # i n c l u d e " e x t . h " / / s t a n d a r d Max i n c l u d e , a lways r e q u i r e d2 # i n c l u d e " e x t _ o b e x . h " / / r e q u i r e d f o r new s t y l e Max o b j e c t3 / / / / / / / / / / / / / / / / / / / / / / / / / / o b j e c t s t r u c t4

5 t y p e d e f s t r u c t i n s t a n c e6 {7 do ub l e ∗ i n s t a n c e ;8 } t _ i n s t a n c e ;9

10 t y p e d e f s t r u c t member11 {12 i n t ∗member ;13 } t_member ;14

15 t y p e d e f s t r u c t _ s e q f e a t s e l16 {17 t _ o b j e c t ob ; / / t h e o b j e c t i t s e l f ( must be f i r s t )18 boo l i d e n ; / / i s i t a l r e a d y t r a i n e d o r n o t ?19 boo l f l a g ; / / has t i m b r e I D g i v e n an answer ?20 boo l debug ; / / used f o r debugg ing p u r p o s e s21 boo l f a s e ; / / i s t i m b r e I D answer a no c a r e ?22 l ong numFea tu res ; / / number o f f e a t u r e s r e c e i v e d23 l ong rowCount ; / / c o u n t s t h e rows24 l ong u l t i m a N o t a ; / / l a s t r e c e i v e d MIDI n o t e25 l ong r e s p o s t a ; / / answer from t i m b r e I D26 l ong knn ; / / knn v a l u e27 l ong numNotas ; / / t o t a l number o f n o t e s28 s h o r t nSe l ; / / number o f s e l e c t e d f e a t u r e s29 t_member ∗ n o t a s P C l u s t e r ; / / a r r a y wi th t h e n o t e s f o r t h e c l u s t e r message30 t _ i n s t a n c e ∗ t r a i n i n g T a b ; / / a r r a y wi th t h e r e c e i v e d f e a t u r e s31 i n t ∗ s e l C o l ; / / columns s e l e c t e d t h r o u g h s f s32 vo id ∗ a _ o u t ; / / o u t p u t t h e column t o use33 vo id ∗ b_ou t ; / / o u t p u t s t h e rows f o r t h e kNN e x t e r n a l34 } t _ s e q f e a t s e l ;35

36 / / / / / / / / / / / / / / / / / / / / / / / / / f u n c t i o n p r o t o t y p e s37 / / / / s t a n d a r d s e t38 vo id ∗ s e q f e a t s e l _ n e w ( t_symbol ∗s , l ong argc , t_a tom ∗ a rgv ) ; / / o b j e c t c r e a t i o n

method39 vo id s e q f e a t s e l _ a s s i s t ( t _ s e q f e a t s e l ∗x , vo id ∗b , l ong m, long a , c h a r ∗ s ) ;40 vo id s e q f e a t s e l _ f r e e ( t _ s e q f e a t s e l ∗x ) ;41

42

43 vo id s e q f e a t s e l _ m e s s a g e ( t _ s e q f e a t s e l ∗x , t_symbol ∗s , l ong argc , t_a tom ∗ a rgv ) ;44 vo id s e q f e a t s e l _ i n 1 ( t _ s e q f e a t s e l ∗x , l ong e n t r a d a ) ;45 vo id s e q f e a t s e l _ d e b u g ( t _ s e q f e a t s e l ∗x ) ;46 vo id s e q f e a t s e l _ i d ( t _ s e q f e a t s e l ∗x ) ;

A.1 seqfeatsel Code 47

47 vo id s e q f e a t s e l _ c l e a r ( t _ s e q f e a t s e l ∗x ) ;48 vo id s e q f e a t s e l _ p r i n t ( l ong argc , t_a tom ∗ a rgv ) ;49 vo id s e q f e a t s e l _ p r i n t 2 ( l ong c e r t a , l ong p r o p o s t a , l ong maxrows , l ong j ) ;50 vo id s e q f e a t s e l _ k n n ( t _ s e q f e a t s e l ∗x , l ong knn ) ;51

52 / / / / / / / / / / / / / / / / / / / / / / / / g l o b a l c l a s s p o i n t e r v a r i a b l e53 vo id ∗ s e q f e a t s e l _ c l a s s ;54

55

56 vo id ex t_main ( vo id ∗ r )57 {58 t _ c l a s s ∗c ;59

60 c = c l a s s _ n e w ( " s e q f e a t s e l " , ( method ) s e q f e a t s e l _ n e w , ( method ) s e q f e a t s e l _ f r e e ,( l ong ) s i z e o f ( t _ s e q f e a t s e l ) ,

61 0L , A_GIMME, 0) ;62

63 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ m e s s a g e , " l i s t " , A_GIMME, 0) ;64 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ i n 1 , " i n 1 " , A_LONG, 0) ;65 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ i d , " i d " , 0 ) ;66 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ c l e a r , " c l e a r " , 0 ) ;67 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ d e b u g , " debug " , 0 ) ;68 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ k n n , " knn " , A_LONG, 0) ;69 c l a s s _ a d d m e t h o d ( c , ( method ) s e q f e a t s e l _ a s s i s t , " a s s i s t " , A_CANT, 0 ) ;70

71 c l a s s _ r e g i s t e r (CLASS_BOX, c ) ;72 s e q f e a t s e l _ c l a s s = c ;73 }74

75 vo id s e q f e a t s e l _ p r i n t 2 ( l ong c e r t a , l ong p r o p o s t a , l ong maxrows , l ong j )76 {77 p o s t ( " R e s p o s t a c e r t a : %ld , R e s p o s t a TimbreID : %l d , Numero de i n s t a n c e s :

%l d , j : %l d . " , c e r t a , p r o p o s t a , maxrows , j ) ;78 }79

80 vo id s e q f e a t s e l _ k n n ( t _ s e q f e a t s e l ∗x , l ong k nnv a l ) {81

82 x−>knn = k nn va l ;83

84 }85 vo id s e q f e a t s e l _ p r i n t ( l ong argc , t_a tom ∗ a rgv )86 {87 l ong i ;88 t _a tom ∗ap ;89 p o s t ( " t h e r e a r e %l d a rgumen t s " , a r g c ) ;90 / / i n c r e m e n t ap each t ime t o g e t t o t h e n e x t atom91 f o r ( i = 0 , ap = a rgv ; i < a r g c ; i ++ , ap ++) {92 s w i t c h ( a t o m _ g e t t y p e ( ap ) ) {93 c a s e A_LONG:


94 p o s t ( "%l d : %l d " , i + 1 , a t o m _ g e t l o n g ( ap ) ) ;95 b r e a k ;96 c a s e A_FLOAT :97 p o s t ( "%l d : %.2 f " , i + 1 , a t o m _ g e t f l o a t ( ap ) ) ;98 b r e a k ;99 c a s e A_SYM:

100 p o s t ( "%l d : %s " , i + 1 , atom_getsym ( ap )−>s_name ) ;101 b r e a k ;102 d e f a u l t :103 p o s t ( "%l d : unknown atom t y p e (% l d ) " , i + 1 , a t o m _ g e t t y p e ( ap ) ) ;104 b r e a k ;105 }106 }107 }108

109 vo id s e q f e a t s e l _ m e s s a g e ( t _ s e q f e a t s e l ∗x , t_symbol ∗s , l ong argc , t_a tom ∗ a rgv ) {110

111 i f ( x−>i d e n == f a l s e ) {112 i n t i , l i s t L e n g t h ;113 l ong l i n h a = x−>rowCount ;114 l i s t L e n g t h = a r g c ;115

116 x−> t r a i n i n g T a b = ( t _ i n s t a n c e ∗ ) s y s m e m _ r e s i z e p t r ( x−>t r a i n i n g T a b ,( x−>rowCount + 1) ∗ s i z e o f ( t _ i n s t a n c e ) ) ;

117 x−> t r a i n i n g T a b [ l i n h a ] . i n s t a n c e = ( d ou b l e ∗ ) sysmem_newptr ( l i s t L e n g t h ∗s i z e o f ( d ou b l e ) ) ;

118

119 x−>rowCount ++;120 i f ( x−>debug == 1) { p o s t ( " Receb i Mensagem e e n t r e i no i d e n == f a l s e " ) ; }121 i f ( x−>numFea tu res != l i s t L e n g t h ) {122 x−>s e l C o l = ( i n t ∗ ) s y s m e m _ r e s i z e p t r ( x−>s e l C o l , l i s t L e n g t h ∗ s i z e o f ( i n t ) ) ;123 x−>numFea tu res = l i s t L e n g t h ; }124 i f ( x−>debug == 1) { p o s t ( " a r g c = %l d e numFea tu res = %l d " , a rgc ,

x−>numFea tu res ) ; }125

126 i f ( l i n h a == 0) { / / F i r s t Note127 i f ( x−>debug == 1) { p o s t ( " P r i m e i r a Nota " ) ; }128

129 x−>u l t i m a N o t a = a t o m _ g e t l o n g ( a rgv ) ;130 i f ( x−>debug == 1) { p o s t ( " u l t i m a n o t a = %l d " , x−>u l t i m a N o t a ) ; }131 x−>numNotas = 0 ;132 x−>n o t a s P C l u s t e r = ( t_member ∗ ) s y s m e m _ r e s i z e p t r ( x−>n o t a s P C l u s t e r ,

s i z e o f ( t_member ) ) ;133 x−>n o t a s P C l u s t e r [ 0 ] . member = ( i n t ∗ ) sysmem_newptr (3 ∗ s i z e o f ( i n t ) ) ;134 x−>n o t a s P C l u s t e r [ 0 ] . member [ 0 ] = 0 ;135 x−>n o t a s P C l u s t e r [ 0 ] . member [ 1 ] = 0 ;136 }137


138 i f ( x−>u l t i m a N o t a != a t o m _ g e t l o n g ( a rgv ) ) { / / i f t h e r e c e i v e d n o t e i sd i f f e r e n t from t h e p r e v i o u s one

139 i f ( x−>debug == 1) { p o s t ( " Nota D i f e r e n t e " ) ; }140

141 x−>u l t i m a N o t a = a t o m _ g e t l o n g ( a rgv ) ;142

143 x−>n o t a s P C l u s t e r [ x−>numNotas ] . member [ 2 ] = l i n h a − 1 ; / / end of t h ep r e v i o u s row

144

145 x−>numNotas ++;146 x−>n o t a s P C l u s t e r = ( t_member ∗ ) s y s m e m _ r e s i z e p t r ( x−>n o t a s P C l u s t e r ,

( x−>numNotas + 1) ∗ s i z e o f ( t_member ) ) ;147 x−>n o t a s P C l u s t e r [ x−>numNotas ] . member = ( i n t ∗ ) sysmem_newptr (3 ∗

s i z e o f ( i n t ) ) ;148

149 x−>n o t a s P C l u s t e r [ x−>numNotas ] . member [ 0 ] = x−>numNotas ; / / new n o t e150 x−>n o t a s P C l u s t e r [ x−>numNotas ] . member [ 1 ] = l i n h a ; / / s t a r t i n d e x o f new

n o t e151 }152 x−> t r a i n i n g T a b [ l i n h a ] . i n s t a n c e [ 0 ] = x−>numNotas ;153 i f ( x−>debug == 1) { p o s t ( " [% l d ][% l d ] : %l d " , 0 , l i n h a , x−>numNotas ) ; }154

155 f o r ( i = 1 ; i < l i s t L e n g t h ; i ++)156 x−> t r a i n i n g T a b [ l i n h a ] . i n s t a n c e [ i ] = a t o m _ g e t l o n g ( a rgv + i ) ;157 }158

159 i f ( x−>i d e n == t r u e ) {160

161 t _a tom ∗ s a i d a = ( t_a tom ∗ ) sysmem_newptr ( x−>nSe l ∗ s i z e o f ( t_a tom ) ) ;162 / / f i l t e r s columns163 i f ( x−>debug == 1) { p o s t ( " x i d e n e v e r d a d e " ) ; }164

165 f o r ( i n t k = 0 ; k < x−>nSe l ; k ++) {166 a t o m _ s e t f l o a t ( s a i d a + k , a t o m _ g e t f l o a t ( a rgv + x−>s e l C o l [ k ] ) ) ;167 }168 o u t l e t _ l i s t ( x−>b_out , NULL, x−>nSel , s a i d a ) ;169 s y s m e m _ f r e e p t r ( s a i d a ) ;170

171 }172 }173

174 vo id s e q f e a t s e l _ i d ( t _ s e q f e a t s e l ∗x ) {175

176 s h o r t i , j , k , improv , numCorrec tasOld , l , numCorrec tas , c , maximum ,ind iceMax ;

177 numCorrec tasOld = 0 ;178

179 t _a tom ∗mensknn = ( t_a tom ∗ ) sysmem_newptr ( s i z e o f ( t_a tom ) ) ;180 t _a tom ∗m e n s C l u s t e r = ( t_a tom ∗ ) sysmem_newptr (4 ∗ s i z e o f ( t_a tom ) ) ;


181

182 t _a tom ∗ s a i d a = ( t_a tom ∗ ) sysmem_newptr ( 0 ) ;183 t _a tom ∗ t r e i n o = ( t_a tom ∗ ) sysmem_newptr ( 0 ) ;184

185 i n t ∗ c o r r e c t a s = ( i n t ∗ ) sysmem_newptr ( x−>numFea tu res ∗ s i z e o f ( i n t ) ) ;186 t _ symbol ∗ c l u s t e r ,∗ knnsym ;187 c l u s t e r = gensym ( " m a n u a l _ c l u s t e r " ) ;188 knnsym = gensym ( " knn " ) ;189 improv = 1 ;190

191 x−>n o t a s P C l u s t e r [ x−>numNotas ] . member [ 2 ] = x−>rowCount −1; / / marca o f im dat a b e l a

192 i f ( x−>debug == 1) { p o s t ( " Es tou no ID " ) ; }193 i f ( x−>rowCount > 1) {194 w h i l e ( improv > 0) {195

196 s a i d a = ( t_a tom ∗ ) s y s m e m _ r e s i z e p t r ( s a i d a , ( x−>nSe l + 1) ∗s i z e o f ( t_a tom ) ) ;

197 t r e i n o = ( t_a tom ∗ ) s y s m e m _ r e s i z e p t r ( t r e i n o , x−>nSe l ∗ s i z e o f ( t_a tom ) ) ; ;198

199 / / Runs a l l columns e x c e p t t h e one wi th l a b e l s200 f o r ( i = 1 ; i < x−>numFea tu res ; i ++) {201 i f ( x−>debug == 1) { p o s t ( " E n t r e i na Coluna %l d " , i ) ; }202

203 / / i f column a l r e a d y s e l e c t e d don t run204 f o r ( k = 0 ; k < x−>nSe l ; k ++) {205 i f ( i == x−>s e l C o l [ k ] ) { c o r r e c t a s [ i ] = 0 ; go to f o r a ; }206

207 }208

209 / / T r a i n s t i m b r e I D one row a t a t ime210 x−>f a s e = f a l s e ;211 f o r ( j = 0 ; j < x−>rowCount ; j ++) {212

213 / / c r e a t e s message and s e n d s i t214 f o r ( k = 0 ; k < x−>nSe l ; k ++) {215 a t o m _ s e t f l o a t ( s a i d a + k , x−> t r a i n i n g T a b [ j ] . i n s t a n c e [ x−>s e l C o l [ k ] ] ) ;216 }217 a t o m _ s e t f l o a t ( s a i d a + x−>nSel , x−> t r a i n i n g T a b [ j ] . i n s t a n c e [ i ] ) ;218 o u t l e t _ l i s t ( x−>a_out , NULL, x−>nSe l + 1 , s a i d a ) ;219

220 i f ( x−>debug == 1) { p o s t ( " E n v i e i p a r a t r e i n a r o t i m b r e I D %f " ,x−> t r a i n i n g T a b [ j ] . i n s t a n c e [ i ] ) ; }

221 }222

223 / / s e n d s c l u s t e r i n g messages224 f o r ( l = 0 ; l < x−>numNotas + 1 ; l ++) {225 a t o m _ s e t l o n g ( mensClus t e r , x−>numNotas + 1) ;226 a t o m _ s e t l o n g ( m e n s C l u s t e r + 1 , l ) ;


227 a t o m _ s e t l o n g ( m e n s C l u s t e r + 2 , x−>n o t a s P C l u s t e r [ l ] . member [ 1 ] ) ;228 a t o m _ s e t l o n g ( m e n s C l u s t e r + 3 , x−>n o t a s P C l u s t e r [ l ] . member [ 2 ] ) ;229

230 o u t l e t _ a n y t h i n g ( x−>a_out , c l u s t e r , 4 , m e n s C l u s t e r ) ;231 i f ( x−>debug == 1) { p o s t ( " E n v i e i mensagem de C l u s t e r " ) ; }232 }233 a t o m _ s e t l o n g ( mensknn , x−>knn ) ;234 o u t l e t _ a n y t h i n g ( x−>a_out , knnsym , 1 , mensknn ) ;235

236 numCor rec t a s = 0 ;237 x−>f a s e = t r u e ;238 / / s e n d s messages f o r t i m b r e I D t o i d e n t i f y239 f o r ( j = 0 ; j < x−>rowCount ; j ++) { / / f o r e v e r y row240 / / s e n d s row t o o u t p u t 2241 f o r ( k = 0 ; k < x−>nSe l ; k ++) {242 a t o m _ s e t f l o a t ( s a i d a + k , x−> t r a i n i n g T a b [ j ] . i n s t a n c e [ x−>s e l C o l [ k ] ] ) ;243 }244 a t o m _ s e t f l o a t ( s a i d a + x−>nSel , x−> t r a i n i n g T a b [ j ] . i n s t a n c e [ i ] ) ;245 x−> f l a g = f a l s e ;246 o u t l e t _ l i s t ( x−>b_out , NULL, x−>nSe l + 1 , s a i d a ) ;247 i f ( x−>debug == 1) { p o s t ( " E n v i e i p a r a o t i m b r e I D i d e n t i f i c a r %f " ,

x−> t r a i n i n g T a b [ j ] . i n s t a n c e [ i ] ) ; }248 / / w h a i t f o r answer249 w h i l e ( x−> f l a g = f a l s e ) {250

251 s y s t h r e a d _ s l e e p ( 1 ) ;252 }253 i f ( x−>debug == 1) { p o s t ( " S a i do w h i l e " ) ; }254 / / s e e i f answer i s r i g h t255 l ong r e s p o s t a = x−> r e s p o s t a ;256 i f ( r e s p o s t a == ( long ) x−> t r a i n i n g T a b [ j ] . i n s t a n c e [ 0 ] ) {257 / / add t o v e c t o r wi th number o f c o r r e c t answer s258 numCor rec t a s ++;259 i f ( x−>debug == 1) { s e q f e a t s e l _ p r i n t 2 ( ( l ong )

x−> t r a i n i n g T a b [ j ] . i n s t a n c e [ 0 ] , r e s p o s t a , x−>rowCount , j ) ; }260 }261 }262 c o r r e c t a s [ i ] = numCor rec t a s ;263 i f ( x−>debug == 1) { p o s t ( " Coluna %l d tem %l d c e r t a s " , i ,

numCor rec t a s ) ; }264 / / send r e s e t message t o t i m b r e I D265 o u t l e t _ a n y t h i n g ( x−>a_out , gensym ( " c l e a r " ) , 0 , NULL) ;266 f o r a : i f ( x−>debug == 1) { p o s t ( " E n v i e i mensagem de C l e a r " ) ; }267 }268 / / adds t h e column wi th t h e b e s t a c c u r a c y t o t h e s e l e c t e d columns269 maximum = c o r r e c t a s [ 1 ] ;270 ind iceMax = 1 ;271 f o r ( c = 1 ; c < x−>numFea tu res ; c ++)272 {


273 i f ( x−>debug == t r u e ) { p o s t ( " c o r r e c t a s [% l d ] = %l d e max = %l d " , c ,c o r r e c t a s [ c ] , maximum ) ; }

274 i f ( c o r r e c t a s [ c ] > maximum )275 {276 maximum = c o r r e c t a s [ c ] ;277 ind iceMax = c ;278 }279 }280 / / c a l c u l improv281 improv = maximum − numCorrec tasOld ;282 numCorrec tasOld = maximum ;283 i f ( improv > 0) {284 p o s t ( " Added column %l d " , ind iceMax ) ;285 x−>s e l C o l [ x−>nSe l ] = ind iceMax ;286 x−>nSe l ++;287 }288 i f ( x−>debug == 1) { p o s t ( " Improv %l d " , improv ) ; }289 } / / f im do w h i l e290

291 f o r ( j = 0 ; j < x−>rowCount ; j ++) {292

293 / / c r i a t_a tom [ ] com a l i n h a a e n v i a r e e n v i a294 f o r ( k = 0 ; k < x−>nSe l ; k ++) {295 a t o m _ s e t f l o a t ( t r e i n o + k , x−> t r a i n i n g T a b [ j ] . i n s t a n c e [ x−>s e l C o l [ k ] ] ) ;296 }297 o u t l e t _ l i s t ( x−>a_out , NULL, x−>nSel , t r e i n o ) ;298 i f ( x−>debug == 1) { p o s t ( " TREINEI O TIMBREID" ) ; }299 }300

301 / / C l u s t e r message302 f o r ( l = 0 ; l < x−>numNotas + 1 ; l ++) {303 a t o m _ s e t l o n g ( mensClus t e r , x−>numNotas + 1) ;304 a t o m _ s e t l o n g ( m e n s C l u s t e r + 1 , l ) ;305 a t o m _ s e t l o n g ( m e n s C l u s t e r + 2 , x−>n o t a s P C l u s t e r [ l ] . member [ 1 ] ) ;306 a t o m _ s e t l o n g ( m e n s C l u s t e r + 3 , x−>n o t a s P C l u s t e r [ l ] . member [ 2 ] ) ;307

308 o u t l e t _ a n y t h i n g ( x−>a_out , c l u s t e r , 4 , m e n s C l u s t e r ) ;309 i f ( x−>debug == 1) { p o s t ( " E n v i e i mensagem de C l u s t e r " ) ; }310 }311 a t o m _ s e t l o n g ( mensknn , x−>knn ) ;312 o u t l e t _ a n y t h i n g ( x−>a_out , knnsym , 1 , mensknn ) ;313

314 s y s m e m _ f r e e p t r ( mensknn ) ;315 s y s m e m _ f r e e p t r ( m e n s C l u s t e r ) ;316 s y s m e m _ f r e e p t r ( s a i d a ) ;317 s y s m e m _ f r e e p t r ( t r e i n o ) ;318 s y s m e m _ f r e e p t r ( c o r r e c t a s ) ;319

320 o u t l e t _ a n y t h i n g ( x−>a_out , gensym ( " i d f i n i s h " ) , 0 , NULL) ;


321 x−>i d e n = t r u e ;322 i f ( x−>debug == t r u e ) { p o s t ( " a c a b e i o c i c l o w h i l e " ) ; }323 }324 }325

326 vo id s e q f e a t s e l _ i n 1 ( t _ s e q f e a t s e l ∗x , l ong e n t r a d a ) {327 / / r e c e i v e s t h e t i m b r e I D g u e s s328 i f ( x−>debug == t r u e ) { p o s t ( " Receb i do t i m b r e I D " ) ; }329

330 i f ( x−>f a s e == t r u e ) {331 i f ( x−>debug == t r u e ) { p o s t ( " Receb i do t i m b r e I D e l i g u e i " ) ; }332 x−> r e s p o s t a = e n t r a d a ;333 x−> f l a g = t r u e ;334 }335 }336

337 vo id s e q f e a t s e l _ a s s i s t ( t _ s e q f e a t s e l ∗x , vo id ∗b , l ong msg , l ong arg , c h a r ∗ d s t )338 { / / a s s i s t message339 i f ( msg == ASSIST_INLET ) {340 s w i t c h ( a r g ) {341 c a s e 0 : s t r c p y ( d s t , " ( l i s t s / messages ) R e c e i v e s f e a t u r e l i s t s w i th t h e

f i r s t e l e m e n t l a b e l i n g t h e e v e n t " ) ; b r e a k ;342 c a s e 1 : s t r c p y ( d s t , " ( i n t e g e r ) R e c e i v e s t h e i d e n t i f i c a t i o n from t h e

c l a s s i f i e r " ) ; b r e a k ;343 }344 }345 e l s e i f ( msg == ASSIST_OUTLET ) {346 s w i t c h ( a r g ) {347 c a s e 0 : s t r c p y ( d s t , " ( l i s t s / messages ) Sends messages and t r a i n i n g l i s t s t o

t h e c l a s s i f i e r " ) ; b r e a k ;348 c a s e 1 : s t r c p y ( d s t , " ( l i s t s ) Sends l i s t s f o r i d e n t i f i c a t i o n as w e l l a s t h e

f i n a l f i l t e r e d l i s t t o t h e c l a s s i f i e r " ) ; b r e a k ;349

350 }351 }352 }353

354 vo id s e q f e a t s e l _ c l e a r ( t _ s e q f e a t s e l ∗x )355 { / / when r e c e i v e s c l e a r356 i f ( x−>debug == 1) { p o s t ( " Receb i C l e a r " ) ; }357

358 x−>i d e n = f a l s e ;359 x−> f l a g = f a l s e ;360 x−>debug = f a l s e ;361

362 x−>numFea tu res = 0 ;363 x−>rowCount = 0 ; / / c o u n t s t h e rows364 x−>u l t i m a N o t a = 0 ;365 x−> r e s p o s t a = 0 ;


366

367 x−>numNotas = 0 ;368 x−>nSe l = 0 ;369 o u t l e t _ a n y t h i n g ( x−>a_out , gensym ( " c l e a r " ) , 0 , NULL) ;370

371 }372

373 vo id s e q f e a t s e l _ d e b u g ( t _ s e q f e a t s e l ∗x )374 { / / s e t s debug t o 1375

376 x−>debug = t r u e ;377 i f ( x−>debug == 1) { p o s t ( " Receb i debug " ) ; }378

379 }380

381 vo id ∗ s e q f e a t s e l _ n e w ( t_symbol ∗s , l ong argc , t_a tom ∗ a rgv )382 {383 t _ s e q f e a t s e l ∗x = NULL;384

385 x = ( t _ s e q f e a t s e l ∗ ) o b j e c t _ a l l o c ( s e q f e a t s e l _ c l a s s ) ;386

387

388 x−>b_ou t = l i s t o u t ( x ) ;389 x−>a _ o u t = o u t l e t _ n e w ( ( t _ s e q f e a t s e l ∗ ) x , NULL) ;390

391 x−>n o t a s P C l u s t e r = ( t_member ∗ ) sysmem_newptr ( 0 ) ;392 x−> t r a i n i n g T a b = ( t _ i n s t a n c e ∗ ) sysmem_newptr ( 0 ) ;393 x−>s e l C o l = ( i n t ∗ ) sysmem_newptr ( 0 ) ; / / columns s e l e c t e d t h r o u g h s f s394

395 i n t i n ( x , 1 ) ;396

397 x−>i d e n = f a l s e ;398 x−> f l a g = f a l s e ;399 x−>debug = f a l s e ;400 x−>f a s e = f a l s e ;401

402 x−>numFea tu res = 0 ;403 x−>rowCount = 0 ; / / c o u n t s t h e rows404 x−>u l t i m a N o t a = 0 ;405 x−> r e s p o s t a = 0 ;406 x−>knn = 1 ;407

408 x−>numNotas = 0 ;409 x−>nSe l = 0 ;410 r e t u r n x ;411 }412

413 vo id s e q f e a t s e l _ f r e e ( t _ s e q f e a t s e l ∗x ) {414 i n t i ;

A.2 Flowchart of the seqfeatsel external 55

415 i f ( x−>numNotas != 0) {416 f o r ( i = 0 ; i < x−>numNotas + 1 ; i ++)417 s y s m e m _ f r e e p t r ( x−>n o t a s P C l u s t e r [ i ] . member ) ;418 }419 f o r ( i = 0 ; i <x−>rowCount ; i ++)420 s y s m e m _ f r e e p t r ( x−> t r a i n i n g T a b [ i ] . i n s t a n c e ) ;421

422 s y s m e m _ f r e e p t r ( x−>n o t a s P C l u s t e r ) ;423 s y s m e m _ f r e e p t r ( x−> t r a i n i n g T a b ) ;424 s y s m e m _ f r e e p t r ( x−>s e l C o l ) ;425 }

seqfeatselclean.c

A.2 Flowchart of the seqfeatsel external


Figure A.1: Flowchart of the seqfeatsel external

References

[1] O. Gillet and G. Richard, “Transcription and separation of drum signals from polyphonicmusic,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, pp. 529–540, March 2008.

[2] M. Miron, M. E. P. Davies, and F. Gouyon, “An open-source drum transcription system forpure data and max msp,” in 2013 IEEE International Conference on Acoustics, Speech andSignal Processing, pp. 221–225, May 2013.

[3] M. Zadel and G. Scavone, “Laptop performance: Techniques, tools, and a new interface de-sign,” in Proceedings of the International Computer Music Conference, pp. 643–648, 2006.

[4] E. Sinyor, C. Mckay, R. Fiebrink, D. Mcennis, and I. Fujinaga, “BEATBOX CLASSIFICA-TION USING ACE,” in Proceedings of the International Conference on Music InformationRetrieval, pp. 672–675, 2005.

[5] M. Atherton, “RHYTHM-SPEAK : MNEMONIC , LANGUAGE PLAY OR SONG ?,” inProceedings of the Inaugural International Conference on Music Communication Science.,pp. 15–18, 2007.

[6] D. Stowell and M. D. Plumbley, “Characteristics of the beatboxing vocal style,” tech. rep.,Centre for Digital Music Department of Electronic Engineering Queen Mary, University ofLondon, 2008.

[7] A. Kapur, M. Benning, and G. Tzanetakis, “QUERY-BY-BEAT-BOXING: MUSIC RE-TRIEVAL FOR THE DJ,” in Proceedings of the International Conference on Music Infor-mation Retrieval, pp. 170–178, 2004.

[8] T. B. Holmes and T. Holmes, Electronic and experimental music: pioneers in technology andcomposition. Psychology Press, 2002.

[9] S. Sanderson, “Low profile keyboard device and system for recording and scoring music,”Dec. 13 1988. US Patent 4,790,230.

[10] M. Duignan, J. Noble, P. Barr, and R. Biddle, Metaphors for Electronic Music Production inReason and Live, pp. 111–120. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004.

[11] S. Sapir, “Gestural control of digital audio environments,” Journal of New Music Research,vol. 31, no. 2, pp. 119–129, 2002.

[12] J. S. Downie, “Music information retrieval,” Annual Review of Information Science and Tech-nology, vol. 37, no. 1, pp. 295–340, 2003.

57

58 REFERENCES

[13] M. Schedl, E. Gómez, and J. Urbano, “Music information retrieval: Recent developmentsand applications,” Foundations and Trends in Information Retrieval, vol. 8, no. 2-3, pp. 127–261, 2014.

[14] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, “Automatic musictranscription: challenges and future directions,” Journal of Intelligent Information Systems,vol. 41, no. 3, pp. 407–434, 2013.

[15] C. W. Wu and A. Lerch, “Drum transcription using partially fixed non-negative matrix factor-ization,” in 2015 23rd European Signal Processing Conference (EUSIPCO), pp. 1281–1285,Aug 2015.

[16] E. Benetos, S. Ewert, and T. Weyde, “Automatic transcription of pitched and unpitchedsounds from polyphonic music,” in Acoustics, Speech and Signal Processing (ICASSP), 2014IEEE International Conference on, pp. 3107–3111, IEEE, 2014.

[17] D. P. W. Ellis, “Beat tracking by dynamic programming,” Journal of New Music Research,vol. 36, no. 1, pp. 51–60, 2007.

[18] M. E. P. Davies and M. D. Plumbley, “Context-dependent beat tracking of musical audio,”IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1009–1020,March 2007.

[19] M. F. McKinney, D. Moelants, M. E. Davies, and A. Klapuri, “Evaluation of audio beattracking and music tempo extraction algorithms,” Journal of New Music Research, vol. 36,no. 1, pp. 1–16, 2007.

[20] J. Foote, “Visualizing music and audio using self-similarity,” in Proceedings of the seventhACM international conference on Multimedia (Part 1), pp. 77–80, ACM, 1999.

[21] C. Dittmar and D. Gärtner, “Real-time transcription and separation of drum recordings basedon NMF decomposition,” in Proc. of the Intl. Conference on Digital Audio Effects (DAFx),pp. 187–194, 2014.

[22] O. Gillet and G. Richard, “Automatic transcription of drum loops,” in 2004 IEEE Interna-tional Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. iv–269–iv–272vol.4, May 2004.

[23] O. Gillet and G. Richard, “Drum track transcription of polyphonic music using noise sub-space projection,” in In Proc. of ISMIR, pp. 92–99, 2005.

[24] K. Tanghe, S. Degroeve, and B. D. Baets, “An algorithm for detecting and labeling drumevents in polyphonic music,” in In Proc. of First Annual Music Information Retrieval Evalu-ation eXchange, pp. 11–15, 2005.

[25] A. Roebel, J. Pons, M. Liuni, and M. Lagrangey, “On automatic drum transcription usingnon-negative matrix deconvolution and itakura saito divergence,” in 2015 IEEE InternationalConference on Acoustics, Speech and Signal Processing (ICASSP), pp. 414–418, April 2015.

[26] D. Stowell and M. D. Plumbley, “Delayed decision-making in real-time beatbox percussionclassification,” Journal of New Music Research, vol. 39, pp. 203–213, Sep 2010.

REFERENCES 59

[27] A. Hazan, “Towards automatic transcription of expressive oral percussive performances,” inProceedings of the 10th International Conference on Intelligent User Interfaces, IUI ’05,(New York, NY, USA), pp. 296–298, ACM, 2005.

[28] D. Christensen, E. R. Høeg, R. B. Lind, S. A. Nilsson, D. M. Smed, C. Sørensen, and S. P.Vinkel, “Automatic transcription of beatboxing,” tech. rep., Aalborg University, 2014.

[29] C. Mckay, R. Fiebrink, D. Mcennis, B. Li, and I. Fujinaga, “Ace: A framework for optimizingmusic classification,” in Proceedings of the International Conference on Music InformationRetrieval, pp. 42–49, 2005.

[30] T. Nakano, J. Ogata, M. Goto, and Y. Hiraga, “A drum pattern retrieval method by voicepercussion,” in In Proc. of ISMIR 2004, pp. 550–553, 2004.

[31] O. Gillet and G. Richard, “Drum loops retrieval from spoken queries,” Journal of IntelligentInformation Systems, vol. 24, no. 2, pp. 159–177, 2005.

[32] K. Hipke, M. Toomim, R. Fiebrink, and J. Fogarty, “Beatbox: end-user interactive definitionand training of recognizers for percussive vocalizations,” in Proceedings of the 2014 Inter-national Working Conference on Advanced Visual Interfaces, pp. 121–124, ACM, 2014.

[33] D. DeSantis, I. Gallagher, K. Haywood, R. Knudsen, G. Behles, J. Rang, R. Henke, andT. Slama, Ableton Reference Manual Version 9. Ableton, 2016.

[34] P. Brossier, “Man page of aubioonset,” Jan 2017.

[35] M. Malt and E. Jourdan, “Zsa. descriptors: a library for real-time descriptors analysis,” inProceedings of 5th Sound and Music Computing Conference Berlin, 2008.

[36] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler, “A tutorialon onset detection in music signals,” IEEE Transactions on Speech and Audio Processing,vol. 13, pp. 1035–1047, Sept 2005.

[37] P. Masri, Computer modelling of sound for transformation and synthesis of musical signals.PhD thesis, University of Bristol, 1996.

[38] J. M. Grey and J. W. Gordon, “Perceptual effects of spectral modifications on musical tim-bres,” The Journal of the Acoustical Society of America, vol. 63, no. 5, pp. 1493–1500, 1978.

[39] G. Peeters, “A large set of audio features for sound description (similarity and classification)in the CUIDADO project,” tech. rep., IRCAM, 2004.

[40] D. Giannoulis, M. Massberg, and J. D. Reiss, “Parameter automation in a dynamic rangecompressor,” Journal of the Audio Engineering Society, vol. 61, no. 10, pp. 716–726, 2013.

[41] S. Dubnov, “Generalization of spectral flatness measure for non-gaussian linear processes,”IEEE Signal Processing Letters, vol. 11, pp. 698–701, Aug 2004.

[42] T. Gulzar, A. Singh, and S. Sharma, “Comparative analysis of lpcc, mfcc and bfcc for therecognition of hindi words using artificial neural networks,” International Journal of Com-puter Applications, vol. 101, no. 12, pp. 22–27, 2014.

[43] A. W. Whitney, “A direct method of nonparametric measurement selection,” IEEE Trans.Comput., vol. 20, pp. 1100–1103, Sept. 1971.

60 REFERENCES

[44] L. E. Peterson, “K-nearest neighbor,” Scholarpedia, vol. 4, no. 2, p. 1883, 2009. revision#136646.

[45] P. Brossier, MartinHN, N. Philippsen, T. Seaver, E. Müller, and S. Alexander, “aubio/aubio:0.4.5.” https://doi.org/10.5281/zenodo.496134. Accessed: 2017-5-17.

[46] W. Brent, “A timbre analysis and classification toolkit for pure data,” in ICMC, 2010.

[47] “Max API.” https://cycling74.com/sdk/MaxSDK-7.1.0/html/index.html.Accessed: 2017-5-20.

https://doi.org/10.5281/zenodo.496134

https://cycling74.com/sdk/MaxSDK-7.1.0/html/index.html

Documents

Automatic Transcription of Drums and Vocalised percussion · 2019. 7. 14. · segue o método "segment-and-classify" para transcrição de bateria [1]. O LVT tem três módulos: i)