Deep Learning Music Generation

Kinbert Chou
Stanford University
Stanford, CA 94305
[email protected]

Ryan Peng
Stanford University
Stanford, CA 94305
[email protected]

Abstract

We are interested in using deep learning models to generate new music. Using the MAESTRO dataset, we train an LSTM architecture that takes tokenized MIDI files as input and outputs predictions for the next note. Accuracy is measured by comparing predicted notes to the ground truth. Using A.I. for music is a relatively new area of study, and this project provides an investigation into creating an effective model for the music industry.

1 Introduction

Music is deeply embedded in our everyday lives, from listening to the radio to YouTube music videos. With everyone having a distinct music taste, the area of music is only expanding. Every day, new songs are created, and the art of music holds the interest of people across the world, from different countries and cultures. We are interested in exploring the intersection between music and A.I. In recent years, deep learning has reached impressive levels of text generation in natural language processing (NLP); however, less research has been done on generating music. While there have been projects aimed at generating new music, this project investigates whether music generation using an existing language-model architecture is achievable. Our project tackles music generation using classical music, and we plan to model musical data similarly to human language.

2 Related Work

There are existing implementations built using a recurrent neural network architecture, and a few that explore the use of long short-term memory (LSTM) architectures. Yu's article (1) argues that recurrent neural networks (RNNs) are inefficient at generating music; the research shows that vanilla neural networks are poor at sequential data generation. Combined with Briot et al.'s study "Deep Learning Techniques for Music Generation" (2), RNNs are known to have problems such as vanishing and exploding gradients when the network is too deep. This is addressed by using LSTMs, which essentially create shortcuts in the network. We used an LSTM model because its cell state can carry information about longer-term structures in music, as opposed to a gated recurrent unit (GRU). To improve the algorithm, changing the hyperparameters and choosing our own set of hidden layers in accordance with the piano music data may prove to be the best option. An advanced component of this project is feeding tokenized MIDI data into an LSTM model similar to what is typically used to generate text sequences, which is derived from Skúli's data preparation for tokenizing MIDI files (3). Spezzatti's study gives tips for encoding, tokenizing, and sequencing lyrics, as well as an analysis of different state-of-the-art music generation examples, including Magenta, MuseGAN, WaveNet, and MuseNet (4). Each uses different encoding methods and architectures, and the study also notes shortcomings of each model (notes not in key, excessive repetition of notes, etc.). Another component to consider is the evaluation metrics for a model. From Jung's analysis, many models use a variety of evaluation techniques, all of which emphasize certain qualities in a song such as rhythm, uniqueness, and aesthetics. While most of these metrics are fairly subjective, Jung proposed an evaluation model that considers the structure of the song (intro, verse, bridge, chorus, etc.) using a self-similarity matrix (5).

3 Dataset and Features

We used the MAESTRO dataset (6) for our project, which comes from a leading project in the area of processing, analyzing, and creating music using artificial intelligence. The dataset consists of over 200 hours of piano music. It is well defined and cleaned: it includes MIDI files from over ten years of the International Piano-e-Competition. The genre of the music files is mostly classical, with composers from the 17th to early 20th century. The metadata has the following fields for every MIDI/WAV pair: canonical composer, canonical title, split, year, MIDI filename, audio filename, and duration. We follow the train, validation, and test splits defined by the metadata .csv file, as sketched below.
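As a rough illustration of how these splits can be consumed, the metadata CSV can be partitioned on its split field. The CSV filename and exact column names below are assumptions based on the fields listed above, not verified against our code.

    import pandas as pd

    # Load the MAESTRO metadata and group MIDI paths by their predefined split.
    # The CSV filename and column names are assumptions based on the metadata
    # fields described above.
    meta = pd.read_csv("maestro-v2.0.0.csv")
    splits = {
        name: meta.loc[meta["split"] == name, "midi_filename"].tolist()
        for name in ("train", "validation", "test")
    }
    print({name: len(files) for name, files in splits.items()})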

3.1 Preprocessing

All of the data was put into three categories (train, validation, and test). The MIDI files are read using Music21. Each file contains Note objects, which carry information about the pitch, octave, and offset of a note. For the base model there are 967 samples in the train set, 137 samples in the validation set, and 178 samples in the test set. We start by loading each file into a Music21 stream object using the converter.parse(file) function. Then, we get a list of all the notes in the MIDI file. Next, we append the pitch of every note object using its string representation, since the most significant parts of the note can be recreated from the string notation of the pitch. We tokenize those string outputs to feed them into the network. For each example, we use a sequence of the 100 preceding notes to predict the next note, and we continue this "sliding window" process until we have seen all notes in the file. Our model is fed these (window, next note) pairs. These encodings allow us to easily decode the output generated by the network into the correct notes. We write this processed data to our data folder and load it at training time. This preprocessing workflow is heavily adapted from the work of Sigurður Skúli (3); a sketch of the pipeline is shown below.
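The sketch below outlines this pipeline. The function names mirror get_notes() and prepare_sequences() from our repository, but the bodies here follow Skúli's tutorial and are illustrative, so details such as chord handling and building the vocabulary across all files may differ from our actual code.

    from music21 import converter, note, chord

    def get_notes(midi_path):
        """Parse one MIDI file into a list of pitch tokens (strings)."""
        tokens = []
        stream = converter.parse(midi_path)          # load into a Music21 stream
        for element in stream.flat.notes:            # notes and chords, in offset order
            if isinstance(element, note.Note):
                tokens.append(str(element.pitch))    # e.g. "G4"
            elif isinstance(element, chord.Chord):
                # a chord becomes its normal-order pitch classes, e.g. "4.7.11"
                tokens.append('.'.join(str(p) for p in element.normalOrder))
        return tokens

    def prepare_sequences(tokens, sequence_length=100):
        """Slide a 100-note window over the piece to build (window, next note) pairs."""
        vocab = sorted(set(tokens))
        note_to_int = {tok: i for i, tok in enumerate(vocab)}
        inputs, targets = [], []
        for i in range(len(tokens) - sequence_length):
            window = tokens[i:i + sequence_length]
            inputs.append([note_to_int[t] for t in window])
            targets.append(note_to_int[tokens[i + sequence_length]])
        return inputs, targets, note_to_int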

4 Approach

The general architecture of our model is a long short-term memory (LSTM) network that takes in a sequence of note tokens generated from our data and outputs a predicted token from our vocabulary. The input passes through three LSTM layers. The first layer turns the note indices into hidden state encodings. The second layer refines the hidden state outputs from the first layer. The third layer takes in the refined hidden state outputs, and we pass the third layer's last hidden state to a series of fully connected layers to make predictions. This is followed by a layer of batch norm and two linear layers that produce a vector with the same dimension as our vocabulary size; there is also a layer of batch norm and a ReLU activation between the linear layers. The last fully connected layer maps the output of the previous linear layer to a vector the length of our vocabulary. We pass this final output through a softmax function, and the softmax output represents our prediction for the next note. LSTMs are well suited to processing sequential data, of which music is an example: the notes leading up to the note in question hold crucial information that informs the model of what the next note will be. For our loss function we use cross-entropy loss, since we are predicting the next token from our input, and we use the stochastic gradient descent (SGD) optimizer during training. We closely follow the structure of https://github.com/Skuldur/Classical-Piano-Composer, but recreate the architecture in PyTorch; a sketch is shown below.
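The following is a minimal PyTorch sketch of the architecture described above. The hidden and intermediate layer sizes and the learning rate are assumptions for illustration, and the softmax is folded into CrossEntropyLoss rather than applied explicitly.

    import torch
    import torch.nn as nn

    class MusicLSTM(nn.Module):
        """Three stacked LSTM layers, then batch norm and two linear layers over the vocabulary."""
        def __init__(self, vocab_size, hidden_dim=256, fc_dim=128):
            super().__init__()
            # Note indices are fed directly (no embedding layer), so the feature size is 1.
            self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim,
                                num_layers=3, batch_first=True)
            self.bn1 = nn.BatchNorm1d(hidden_dim)
            self.fc1 = nn.Linear(hidden_dim, fc_dim)
            self.bn2 = nn.BatchNorm1d(fc_dim)
            self.relu = nn.ReLU()
            self.fc2 = nn.Linear(fc_dim, vocab_size)

        def forward(self, x):
            # x: (batch, 100) integer note indices
            x = x.float().unsqueeze(-1)        # (batch, 100, 1)
            out, _ = self.lstm(x)              # (batch, 100, hidden_dim)
            last = out[:, -1, :]               # last hidden state of the third LSTM layer
            z = self.relu(self.bn2(self.fc1(self.bn1(last))))
            return self.fc2(z)                 # logits over the vocabulary

    # note_to_int comes from the preprocessing sketch in Section 3.1.
    model = MusicLSTM(vocab_size=len(note_to_int))
    criterion = nn.CrossEntropyLoss()          # applies log-softmax + NLL, i.e. the softmax step
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)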

5 Training and Evaluation

5.1 Training

We trained our model on 2,839,786 (length-100 sequence, next note) example pairs generated from roughly 1,000 MIDI files. Training the model on a p3.2xlarge instance on AWS took roughly 36 hours. The following plot shows the training loss values over the first 10 epochs.

[Figure: Training loss for the first 10 epochs]

From the progression of loss over iterations, we can see that the loss decreases, albeit minimally, for each epoch, and roughly reaches its minimum at the same point in each epoch. However, within each epoch, the loss can vary widely. We believe this can be explained by the structure of the data: notes, chords, key signatures, and other musical features vary widely from song to song. Thus, over one epoch, the model encounters notes that occur very often, for which it can update its weights easily, and notes that occur infrequently in the dataset. Because we are not using embedding representations, as similar tasks such as next-word prediction do, the model must first learn its own internal representation of each note/chord token. Thus, some songs can be inherently easier because they are in a common key signature, and some can be more difficult.

[Figure: Training loss for our model with batch size 128, epochs 20-100]

For the remaining epochs, the model shows a consistent decrease in average loss over time, but the loss still oscillates over a large range between easy and hard songs. The range over which the loss oscillates is very high compared to a model with a lower batch size.

5.2 Hyperparameter Tuning

We trained the model several times with different hyperparameters to analyze loss and accuracy, using batch sizes of 32, 64, and 128 and training for 30, 50, and 100 epochs. The model with the best results used a batch size of 128 trained for 100 epochs; it achieved lower loss and higher accuracy than the other configurations. A minimal training loop reflecting these settings is sketched below.
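This is a rough sketch of the training loop under the best configuration, reusing the model, criterion, and optimizer from the Section 4 sketch and the (inputs, targets) pairs from the Section 3.1 sketch; the dataset wrapping is an assumption for illustration.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Wrap the preprocessed (window, next note) pairs; inputs is (N, 100), targets is (N,).
    train_x = torch.tensor(inputs, dtype=torch.long)
    train_y = torch.tensor(targets, dtype=torch.long)
    loader = DataLoader(TensorDataset(train_x, train_y), batch_size=128, shuffle=True)

    for epoch in range(100):                       # best configuration: batch size 128, 100 epochs
        total_loss = 0.0
        for batch_x, batch_y in loader:
            optimizer.zero_grad()
            logits = model(batch_x)                # (batch, vocab_size)
            loss = criterion(logits, batch_y)      # cross-entropy against the ground-truth next note
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"epoch {epoch}: mean loss {total_loss / len(loader):.4f}")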


5.3 Results and Evaluation

Table 1: Accuracy & Loss for Different Models (batch size / epochs)

    Model                 Accuracy   Loss
    128/100 Train         0.0460     1.62
    128/100 Validation    0.0152     -
    128/100 Test          0.0146     -
    64/85 Train           0.0108     4.27
    64/85 Validation      0.0134     -
    64/85 Test            0.0124     -
    32/30 Train           0.0108     4.81
    32/30 Validation      0.0134     -
    32/30 Test            0.0124     -
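Accuracy here is the fraction of windows whose predicted next note matches the ground truth. A minimal evaluation sketch, reusing the model and data-loader names from the earlier sketches, is:

    import torch

    def evaluate(model, loader):
        """Top-1 accuracy: fraction of windows whose predicted next note matches the ground truth."""
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for batch_x, batch_y in loader:
                preds = model(batch_x).argmax(dim=1)     # most likely note index
                correct += (preds == batch_y).sum().item()
                total += batch_y.numel()
        return correct / total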

5.4 Error Analysis

We observe that while the model has poor training accuracy, it performs even worse on the dev and test sets. There are several possible reasons for this. Analysis of our model's predictions shows that, for the validation and test sets, it mostly predicts the notes G or D at different octaves. These are very common notes in many musical keys, and during training the model could have learned to predict them frequently to reduce loss. One proposed solution is to train the model for more epochs, allowing it to learn better representations of the input sequences. Additionally, the dev and test sets may come from a different distribution than the training set: pieces in these sets can have new key signatures, or new and unseen chords. One suggested improvement to address this is to model our input data similarly to the character embeddings used in NLP deep learning models, which would allow the model to generalize its learning to unseen chords; a rough sketch of this idea follows.
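As an illustration of that direction (not something we implemented), a chord token could be decomposed into its constituent pitch classes, each mapped to a learned embedding, so an unseen chord still receives a meaningful representation. The vocabulary size and embedding dimension below are arbitrary placeholders.

    import torch
    import torch.nn as nn

    class PitchEmbedding(nn.Module):
        """Hypothetical sketch: embed each pitch class and average them per token,
        so an unseen chord such as "4.7.11" maps to the mean of its pitch embeddings."""
        def __init__(self, num_pitch_classes=128, embed_dim=64):
            super().__init__()
            self.embed = nn.Embedding(num_pitch_classes, embed_dim)

        def forward(self, pitch_ids):
            # pitch_ids: (num_pitches_in_token,) tensor of integer pitch ids for one token
            return self.embed(pitch_ids).mean(dim=0)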


Another possible source of error in the approach is that the vocabulary is simply too complex to learn. To test this, we trained on a significantly reduced dataset, roughly 12% of the full data, expecting the model to overfit to this set. However, due to either a low number of epochs (30) or an insufficiently complex model, performance on this smaller dataset was also poor.

6 Conclusion

With our training procedure, the loss does generally decrease with each subsequent epoch. Since training is done on a variety of songs, harder, more complex songs incur a higher loss. This is mainly seen in the rise and fall of the loss over time within a single training epoch.


Our results show that a batch size of 128 trained for 100 epochs produces the best accuracy. The train set accuracy is higher than the validation and test accuracies, as expected, but the model is still significantly underfitting. Accuracy remains low because the model requires more epochs, additional tuning, and possibly a deeper network. For further work, finding a dataset with more similar songs may be a good starting point for improving this baseline. Much of the fluctuation in the loss during training can be attributed to some songs containing notes that are harder to predict than others.


Given the data gathered during this study, music generation requires more training and more elaborate model structure than we used. Other studies show that music generation with an LSTM model needs many epochs to accurately predict the next notes, coupled with a complex multi-layer architecture for maximum accuracy. In the future, this model can be improved by adding another hidden layer and a method for handling unseen notes in the validation and test sets. Ultimately, the area of music generation is large; once a model is established, more research can be done for different genres (pop, rock, EDM, etc.), and while piano music generation is a start, building on these models will open a new, creative way to make music.


7 GitHub Repo

https://github.com/kl-chou/CS-230-Project145

Our work can be found in the LSTMModel folder of the above GitHub repository. get_notes() and prepare_sequences() are adapted from the Classical-Piano-Composer repository, the work of Sigurður Skúli.


References

1. https://towardsdatascience.com/neural-networks-for-music-generation-97c983b50204
2. https://arxiv.org/pdf/1709.01620.pdf
3. https://towardsdatascience.com/how-to-generate-music-using-a-lstm-neural-network-in-keras-68786834d4c5
4. https://towardsdatascience.com/neural-networks-for-music-generation-97c983b50204
5. https://towardsdatascience.com/making-music-when-simple-probabilities-outperform-deep-learning-75f4ee1b8e69
6. https://magenta.tensorflow.org/datasets/maestro
