Music generation using tracker music and machine learning
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Music generation using tracker music and machine learning

BJÖRN A. LINDQVIST

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Music generation using tracker music and machine learning

BJÖRN A. LINDQVIST
BJOLIN2@KTH.SE

Degree Programme in Computer Science
Date: June 27, 2021
Supervisor: Bobby Lee Townsend Sturm Jr
Examiner: Sten Ternström
School of Electrical Engineering and Computer Science
Swedish title: Musikgenerering med trackermusik och maskininlärning
Abstract

We investigate the modelling of polyphonic "tracker music" using deep neural networks. Tracker music is a music storage format invented in the late 1980s for use on the home computers of that era, and it is often used for storing synthesized electronic music. Tracker music differs significantly from other music formats and has properties that make it both harder and easier to use for training neural networks than other formats. This makes it interesting to explore which methods are most suitable for extracting musical information from the format. As far as we know, we are the first to explore how to use tracker music for music generation.

We design a method for turning tracker music into sequential data usable for training neural networks. The sequential nature of the data means that musically unaware sequence models can be used for training. The method is general and can be applied to other kinds of symbolic music. We then compile a dataset of about 20,000 freely available instrumental songs in the tracker format MOD, downloaded from the website The Mod Archive. We use the dataset to train several different sequence models, including a Long Short-Term Memory (LSTM) network and a Transformer model. We evaluate the models using a sequence completion task and we investigate the statistical properties of the output. We also conduct a listener study involving some 100 participants to determine how often music generated by the models is preferred over human-composed music. The listener study's result indicates that music generated by the models trained on the dataset is sometimes competitive with music composed by humans.

We conclude that neural networks for music generation can be trained on tracker music using our proposed conversion method, but that doing so is cumbersome. Due to how the tracker music format is constructed, it is significantly more difficult to extract musical information from it than we initially thought.
Sammanfattning

We investigate how best to use tracker music to train neural networks to generate polyphonic music. Tracker music can be said to be both a particular instrumental music genre and a particular music format. The format, which differs markedly from, for example, MIDI and MP3, has some properties that make it harder and others that make it easier to train neural networks with than comparable music formats. It is therefore interesting to explore which methods are best suited for extracting musical information from the format. As far as we know, the topic has not been explored before, and our exploration of it is the thesis's central contribution to research on music generation with neural networks.

In the thesis we propose a method that converts tracker music into a sequential format suitable for training neural networks. We also demonstrate that the method works in practice by training a number of neural networks on a collection of roughly 20,000 instrumental songs in the tracker storage format MOD, which we then evaluate. The evaluation includes, among other things, a listener study. Its result shows that the music generated by three neural networks trained on the tracker music collection is sometimes preferred by listeners over music created by humans.

We conclude that music-generating neural networks can be trained with tracker music using our proposed conversion method, but that it is cumbersome. Because of how tracker music is constructed and organized, it is considerably harder than we initially thought to extract musical information from it.
Acknowledgements

I would like to thank my supervisor Bob L. Sturm, without whose plentiful support and insistence this thesis would not have been completed.
Contents

1 Introduction
2 Theoretical background
  2.1 Symbolic music
  2.2 Sequence modelling
    2.2.1 Sequence completion
    2.2.2 Decoding
    2.2.3 Evaluation
  2.3 Artificial neural networks
    2.3.1 Feed-forward networks
    2.3.2 Recurrent neural networks
    2.3.3 The Transformer
    2.3.4 Inferencing
3 Related work
  3.1 Sturm, Santos, et al. (2016)
  3.2 Hadjeres and François Pachet (2016)
  3.3 Donahue, Mao, Y. E. Li, et al. (2019)
  3.4 Huang et al. (2019)
4 Tracker music and machine learning
  4.1 The MOD file format
  4.2 Turning MOD files to training data
    4.2.1 Filtering
    4.2.2 Dcode
  4.3 Turning training data to music
5 Dataset and neural network training
6 Evaluation
  6.1 Statistical analysis
    6.1.1 Plagiarism
  6.2 Listener study
7 Discussion
  7.1 Training on modules
  7.2 Sequential encodings
  7.3 LSTM versus pcode GPT-2
  7.4 Ethics and sustainability
Bibliography
8 Appendix
Chapter 1

Introduction

Music is the art of combining sounds with silence in a way that elicits emotions. The sounds are not merely noise; they have meaning, structure, and a purpose. To create music is to compose – to organize smaller atomic elements into a larger whole that is greater than the sum of its parts. While most music is composed by humans, an intriguing question is whether some algorithm can excel at the same task. This leads to deeper questions about what would be required of such an algorithm and how it should be implemented.

The recent surge of interest in machine learning has seen many researchers use neural networks to generate music (Sturm, Santos, et al. 2016; Peracha 2019; Donahue, Mao, and McAuley 2018; Huang et al. 2019). The challenge for them and for us is that music is predictable and unpredictable, familiar and alien, structured and chaotic, all at the same time. Good music cannot be too much of either; too structured and it is repetitive and dull – too chaotic and it is confusing and without meaning. Music is structured into multiple layers such as beats, bars, riffs, voices, etc., that interact with each other. Capturing the interactions between all these layers makes music generation difficult.

Tracker music is a unique way of representing synthesized electronic music, invented in the late 1980s and designed for use on the home computers of that era. Tracker music is sample-based and frequently used for composing chiptunes, a genre of instrumental music that sounds reminiscent of vintage arcade machines, computers, and video game consoles (Driscoll and Diaz 2009). The music is created with tracker software – a type of music sequencing software. The term "tracker" derives from the first tracker, Ultimate Soundtracker, written
by Karsten Obarski and released in 1987 for the Commodore Amiga. Trackers made it possible for amateurs without access to expensive synthesizers to create music with their own computers. Obarski's software became hugely popular with videogame developers, hobbyist musicians, and on the demoscene. The success of Ultimate Soundtracker spawned several clones and lookalikes that expanded on the tracker concept, including ProTracker, NoiseTracker, Scream Tracker, and FastTracker II (Cant 2020). Ultimate Soundtracker also introduced the MOD file format for storing tracker music. MOD is shorthand for module, and tracker music is often called MOD/module music (see chapter 4 for details on the MOD format).

The objective of this thesis is to explore how tracker music can be used for training neural networks to generate music. We fulfill our objective by training neural networks on a corpus of freely available tracker music in MOD format. To the best of our knowledge, neither this dataset nor tracker music in general has been used for music generation. We also evaluate the trained networks; in a listener study conducted online involving over 100 participants, we find that the networks produce music that is preferred over human-composed music fairly often (see section 6.2 for details). This demonstration of how to use tracker music for music generation is our main contribution. A secondary contribution is our analysis of the neural networks' performance. A tertiary contribution is the software we have developed for training and evaluating neural networks trained on tracker music, which we have published online.¹

¹ See https://github.com/bjourne/musicgen.

The target audience for this thesis follows from its objective: machine learning practitioners interested in novel datasets for music generation. It ought to be an interesting read for them because it introduces a music format they probably have not encountered. We consider this thesis a proof-of-concept to inspire others to improve the methods we present, resulting in better ways to use tracker music. Or, at the very least, to adapt our proposed methods for other musical datasets.

The rest of this thesis is structured as follows. In chapter 2, we define symbolic music and we discuss sequence modelling and why it is a good tool for generating symbolic music. We review the theory underlying artificial neural networks, paying special attention to two specific state-of-the-art sequencing approaches – the LSTM and the
Transformer architectures – employed to implement our networks. In chapter 3, we discuss some recent music modelling research and how it relates to our work. In chapter 4, we discuss tracker music and explain our algorithm for encoding it as sequences amenable to sequence modelling. In chapters 5 and 6, we present the tracker music dataset, our training work, and what results it yielded. In chapter 7, we discuss what we have learned and stake out directions for future research.
Chapter 2

Theoretical background

The theoretical basis of this thesis is symbolic music, sequence modelling, and artificial neural networks. We define symbolic music in section 2.1, discuss sequence modelling, including decoding and evaluation, in section 2.2, and in section 2.3, we discuss artificial neural networks in general and the LSTM and the Transformer architecture in particular.

2.1 Symbolic music

Symbolic music consists of compositions presented in symbolic notation. Examples of symbolic music are sheet music, tablature, and MIDI files. The fundamental elements in symbolic music are notes. They instruct the performer on how to play the composition. A note has three central properties: an onset, defining when the note is played relative to the start of the composition; a duration, defining for how long the note sounds; and a pitch, defining the frequency of the note.¹

¹ Notes can have additional properties called embellishments. Embellishments are beyond the scope of our discussion.

A composition's duration is usually subdivided into intervals of non-overlapping measures. The measures' durations are in turn subdivided into intervals of non-overlapping note durations. For example, if a composition contains 16 measures, every measure contains four quarter-notes, and the duration of each quarter-note is 500 ms, the composition's total length is 32 seconds. Quarter-note durations may be further subdivided into smaller non-overlapping durations, such as sixteenth-notes, triplets, or quintuplets. Other subdivisions of time exist and it is also
possible for durations of measures to vary, but for simplicity's sake, in this thesis we assume that all measures are exactly 16 sixteenth-notes long, that there are always four sixteenth-notes per quarter-note, and that all notes begin at and cover one or more sixteenth-notes.

The composition's tempo is the pace of its rhythm and specifies the duration of its quarter-notes. It is often notated as the number of quarter-notes, or beats, per minute (BPM). For example, 125 BPM is the same as 480 ms per quarter-note.

A note's pitch is its frequency relative to some base frequency. In Western music, notes' frequencies are fit into octaves – non-overlapping subdivisions of the frequency spectrum. Octaves contain 12 relative frequencies known as tone steps or "notes"; terminology that may invite some confusion since a "note" in a composition also has an onset and a duration. The 12 tone steps in an octave are, in ascending order of frequency: C, C#, D, D#, E, F, F#, G, G#, A, Bb, and B. Octaves wrap around so that the B note of the first octave is followed by the C note of the second octave and so on.

Scientific pitch notation specifies pitch by note and octave. For example, F#3 denotes the third octave's F# note. The absolute frequency f of a note given in scientific pitch notation is given by:

    f = A_0 × 2^{i/12}.    (2.1)

A_0 is a standard frequency, usually set to 440 Hz, and i is the distance in half-steps between the note and A_0. E.g., if the note is D1, the distance is 5 because there are four notes in between: Bb0, B0, C1, and C#1.
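To make equation 2.1 concrete, here is a minimal Python sketch (ours, not part of the thesis) that computes the frequency of a note given in scientific pitch notation, using the tone-step ordering and the A_0 = 440 Hz reference described above; the function and constant names are our own.

```python
# Order of tone steps within an octave, as listed in section 2.1.
TONE_STEPS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "Bb", "B"]
A0_HZ = 440.0  # reference frequency used in the text

def frequency(note: str, octave: int) -> float:
    """Frequency of a note in scientific pitch notation (equation 2.1)."""
    # Half-step distance i between the given note and the reference note A0.
    i = (octave * 12 + TONE_STEPS.index(note)) - TONE_STEPS.index("A")
    return A0_HZ * 2 ** (i / 12)

print(round(frequency("A", 0)))     # 440, the reference itself
print(round(frequency("D", 1), 1))  # five half-steps up, ~587.3 Hz
```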
There are several solutions to this "simultaneity" problem. The one we use is to interleave simultaneous notes. See section 4.2 for details on our various music encoding schemes.

In sequence modelling, the goal is to model a latent probability distribution over variable-length token sequences drawn from a finite vocabulary. We denote token sequences x = (x_1, . . . , x_n), where n is the length of the token sequence, the latent distribution p^*(x), and our model of the distribution p_θ(x). The model is parametrized by θ and it should resemble p^*(x) so that p^*(x) is approximately equal to p_θ(x) for all x (Welleck, Kulikov, Roller, et al. 2019). The chain rule lets us factorize the joint probability as a product of conditional probabilities:

    p_θ(x) = ∏_{t=1}^{|x|} p_θ(x_t | x_{<t}).    (2.2)

|x| denotes the length of the sequence and x_{<t} the subsequence (x_1, . . . , x_{t−1}) of tokens preceding x_t.
2.2.1 Sequence completion

Sequence completion is analogous to text completion in language modelling – the model is prompted with questions and has to come up with suitable answers. Let x_p = (x_1, . . . , x_k) be the prefix and x_c = (x_{k+1}, . . . , x_n) the continuation, so that (x_1, . . . , x_k, x_{k+1}, . . . , x_n) is the completion. Our goal is to find the x_c that maximizes the likelihood:

    ∏_{t=k+1}^{n} p_θ(x_t | x_{<t}).    (2.4)

This is known as maximum a posteriori (MAP) decoding since p_θ is a probability model. However, since the search space is exponentially large, solving the problem exactly is intractable, and one has to resort to approximate decoding algorithms (Gu, Cho, and V. O. K. Li 2017; Meister, Vieira, and Cotterell 2021).

2.2.2 Decoding

Generating sequences from a sequence model is called decoding. Several decoding strategies have been invented, each with its own advantages and disadvantages. Broadly speaking, they can be categorized as either deterministic or stochastic. In the following sections, we review some popular ones.

Deterministic decoding

Greedy decoding is a simple approximate decoding algorithm. It selects the highest probability token at each time step:

    x̂_t = argmax_{x_t} p_θ(x_t | x_{<t}).
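As an illustration of greedy decoding, here is a sketch of our own (not code from the thesis): the function repeatedly picks the most probable next token from any model that maps a token sequence to a distribution over the vocabulary; the toy model below is invented purely for the example.

```python
import numpy as np

def greedy_decode(model, prefix, n_tokens):
    """Greedily extend `prefix` by repeatedly taking the most probable token."""
    seq = list(prefix)
    for _ in range(n_tokens):
        probs = model(seq)             # p_theta(x_t | x_<t) over the vocabulary
        seq.append(int(np.argmax(probs)))
    return seq

# Toy "model": always favours the token after the previous one (mod 4).
def toy_model(seq):
    probs = np.full(4, 0.1)
    probs[(seq[-1] + 1) % 4] = 0.7
    return probs

print(greedy_decode(toy_model, [0], 6))  # [0, 1, 2, 3, 0, 1, 2]
```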
Beam search generalizes greedy decoding. It keeps B sequences in memory and, at each time step and for each sequence, it considers the B tokens with the highest conditional probability. Among the B² possible continuations, it selects the B sequences with the highest likelihood and repeats the process. When a sufficient number of tokens have been generated, the sequence with the highest likelihood is returned. With B = 1, beam search is equivalent to greedy decoding.

While beam search will always yield sequences with equal or higher likelihood than greedy decoding, it is also much more computationally expensive. Furthermore, several authors have shown that beam search suffers from severe excessive repetition just like greedy decoding (Holtzman et al. 2020).

Stochastic decoding

Stochastic decoding uses randomness to generate sequences that, ideally, are more "diverse" than deterministically decoded sequences. In its most basic form, it samples one token from the distribution p_θ(x_t | x_{<t}) at each time step.
suggested setting the p parameter to values in the range 0.9 to 1. A third variant is tempered sampling, which samples from q, where q is derived from p_θ(x_t | x_{<t}) by applying a temperature parameter.
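The stochastic strategies above can be combined in a single sampling routine. The sketch below is our own illustration, not code from the thesis: it applies a temperature to implement tempered sampling and an optional threshold p to implement nucleus (top-p) filtering, the p parameter discussed above; setting p = 1 disables the filter.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_token(probs, temperature=1.0, top_p=1.0):
    """Sample one token id from `probs` with tempered and nucleus sampling."""
    # Tempered sampling: q proportional to p^(1/T).
    q = np.asarray(probs, dtype=float) ** (1.0 / temperature)
    q /= q.sum()
    # Nucleus (top-p) sampling: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize.
    order = np.argsort(q)[::-1]
    cutoff = np.searchsorted(np.cumsum(q[order]), top_p) + 1
    keep = order[:cutoff]
    q_keep = q[keep] / q[keep].sum()
    return int(rng.choice(keep, p=q_keep))

probs = [0.5, 0.3, 0.15, 0.05]
print([sample_token(probs, temperature=0.8, top_p=0.9) for _ in range(10)])
```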
mean negative log-likelihood per token:

    PPL(x) = exp( −(1/|x|) ∑_{t=1}^{|x|} log p_θ(x_t | x_{<t}) ).
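As a concrete illustration (ours, not the thesis's code), perplexity can be computed directly from the conditional probabilities a model assigns to the tokens of a sequence:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the conditional probabilities p(x_t | x_<t) of a sequence."""
    log_probs = np.log(token_probs)
    return float(np.exp(-log_probs.mean()))

# Conditional probabilities assigned by some model to each token of a sequence.
probs = np.array([0.25, 0.5, 0.125, 0.5])
print(perplexity(probs))  # ~3.36; a uniform model over a 4-token vocabulary gives 4.0
```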
the set of 2-grams or bigrams: {(hello, how), (how, are), (are, you), (you, doing)}, the set of 3-grams or trigrams: {(hello, how, are), (how, are, you), (are, you, doing)}, and so on.

BLEU computes the fraction of the ngrams in the candidate sentence that is matched by an identical ngram in a reference sentence. The metric therefore ranges from 0 to 1 and is 1 only if the candidate translation has the same ngrams as one of the reference translations (Papineni et al. 2002). Let C be the set of ngrams in the candidate sequence, R = {R_1, . . . , R_n} the set of sets of ngrams in the reference sequences, and π_S(g) the number of occurrences of the ngram g in the sequence S. Then BLEU is defined as:

    BLEU(C, R) = ∑_{c∈C} min{ max_{R_i∈R} π_{R_i}(c), π_C(c) } / ∑_{c∈C} π_C(c).    (2.10)

While BLEU measures how similar sentences are to each other, Zhu et al. (2018) proposed turning the metric around to measure the diversity of generated data, which they called Self-BLEU. Self-BLEU works by sampling, say, 1000 sentences and, for each sentence, calculating its BLEU score using the others as references. The lower the score, the higher the diversity.

2.3 Artificial neural networks

Artificial neural networks form a subfield of machine learning centered around computational graphs, loosely inspired by how neurons communicate with each other (Goodfellow, Bengio, and Courville 2016, Chapter 6). The neurons in artificial neural networks (often called "neural networks" or just "networks") form directed graphs along whose edges they propagate signals. The signals' strengths depend on the weights of the edges they are transmitted on. If the strength exceeds some threshold, the neuron propagates the signal further. Eventually, the signal reaches a set of output neurons from which
12 CHAPTER 2. THEORETICAL BACKGROUND the network’s value is read. Neural networks have been applied for many tasks for which it is difficult to come up with explicit logical constraints, such as pattern recognition, predictive analysis, and image recognition. Thus, they ought to be an excellent choice for modelling music for which coming up with such constraints is difficult. Mathematically speaking, a neural network is a general class of parametric nonlinear functions from a vector x of input variables to a vector o of output variables, o = f (x, θ ). (2.11) θ represents the parameters; i.e. the set of weights for all edges in the graph. The Universal Approximation Theorem states that sufficiently large neural networks can approximate any function. Thus, we can use a neural network for solving prediction problems by finding proper values for θ. The method used to fit a neural network is analoguous to poly- nomial curve fitting. Let D = {(x1 , y1 ), . . . , (xn , yn )} be a sequence of training examples, comprised of pairs of input vectors together with matching target vectors. The parameters that minimize an error function over the training examples, such as the sum-of-squares, is sought: 1 n E(θ ) = ∑ || f (xi , θ ) − yi ||2 . (2.12) 2 i =1 If f is smooth and continuous,5 it follows that E is too. Therefore, E forms a surface over θ. All its minima will occur where its gradient is zero: ∇ E(θ ) = 0. (2.13) Due to the large number of free parameters, finding the gradient’s zero points with analytical methods is infeasible. Instead, iterative numerical methods are used – commonly gradient descent. Gradient descent can be imagined as a ball dropped on a hilly surface consisting of peaks and valleys. No matter where the ball is dropped, it will roll downhill until it comes to rest in one of the valleys. The gradient at that point must be zero, otherwise the ball would keep rolling. 5 Any neural network function must have these properties.
Gradient descent simulates the process by picking some initial values for the ball's position, θ^{(0)}, and updating them in a stepwise fashion:

    θ^{(t+1)} = θ^{(t)} − η ∇E(θ^{(t)}).    (2.14)

η is a scalar parameter called the learning rate. It controls how far the ball rolls in the gradient's opposite direction on each update. The process repeats until E(θ) stops decreasing, at which point a local minimum has been found.

The local minimum gradient descent finds may be different from, and larger than, the error function's global minimum. Suppose the hilly surface contains a valley in a mountainous region. The lowest elevation of the valley may be higher than the elevation of the flatlands in the other regions of the surface. Gradient descent could get stuck in this valley, unable to escape the local minimum. To alleviate this problem, researchers have proposed variants of the basic gradient descent update process that let it explore a larger portion of the search space (Bottou 1991).

One such variant is stochastic gradient descent. Unlike total gradient descent, which computes ∇E(θ^{(t)}) on every example in the dataset D and is infeasible if the dataset is large, stochastic gradient descent computes ∇E(θ^{(t)}) on a single randomly selected element of D. This introduces noise and helps the optimization process escape local minima. But it may also lead to the error function's variance becoming prohibitively large, making the optimization process inefficient. Furthermore, since only one piece of data is used at a time, data-parallel hardware cannot be exploited. Mini-batch gradient descent offers a compromise between these two extremes. It computes ∇E(θ^{(t)}) on a random subsequence of D. The size of the subsequence is called the batch size and is commonly set to values ranging from 10 to 1,000. Mini-batch gradient descent is the gradient descent variant most often used in practice.

So far, we have not discussed how neural networks are implemented. The reason is that there are many different types of networks and, besides what we stated above, they have little in common with each other. While it is not a property all neural networks share, many of them organize their neurons into layers. The network's input represents the values of the neurons in the input layer and the network's output the values of the neurons in the output layer (x and o in equation 2.11). In between the input and output layers sit one or more hidden layers, so called because they are not directly observable from the inputs and the outputs (Reed and Marks II 1999, Chapter 4).
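A minimal sketch of mini-batch gradient descent with the update rule in equation 2.14, shown here (our own illustration) on a least-squares line-fitting problem rather than a neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: y = 3x + 1 plus noise.
X = rng.uniform(-1, 1, size=(1000, 1))
Y = 3 * X + 1 + 0.1 * rng.standard_normal((1000, 1))

theta = np.zeros(2)          # parameters: slope and intercept
eta, batch_size = 0.1, 32    # learning rate and batch size

for _ in range(500):
    idx = rng.integers(0, len(X), batch_size)      # random mini-batch
    x, y = X[idx], Y[idx]
    pred = theta[0] * x + theta[1]
    # Gradient of the sum-of-squares error (equation 2.12) over the batch.
    grad = np.array([((pred - y) * x).mean(), (pred - y).mean()])
    theta -= eta * grad                            # update rule (equation 2.14)

print(theta)  # close to [3, 1]
```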
Organizing neurons into layers makes neural networks very flexible because the parameters of each layer, like the number of neurons, can be configured independently. Indeed, one can even think of the layers, rather than the neurons themselves, as the nodes in a computational graph, and the job of the practitioner to be to combine these layers by drawing the graph's edges.

In the following sections, we describe a few different neural network designs: feed-forward networks, recurrent neural networks, and the LSTM and the Transformer architecture – the two architectures designed for sequential data that we use in this thesis.

2.3.1 Feed-forward networks

The simplest neural network architecture is the feed-forward network. The information in feed-forward networks flows in a single direction – there are no cycles in the computational graph. This sets them apart from recurrent networks, wherein cycles exist (Goodfellow, Bengio, and Courville 2016, Chapter 6). A feed-forward network is a composition of parametric functions;

    f(x, (θ_1, . . . , θ_n)) = f_n(f_{n−1}(. . . f_1(x, θ_1) . . . , θ_{n−1}), θ_n).    (2.15)

The functions define the layers of the network and the number of layers its depth. "Deep" in "deep learning" comes from this terminology.

Most layers in most feed-forward networks are fully-connected layers – so called because every neuron in the layer is connected to every neuron in the preceding layer. A fully-connected layer composes a parametrized affine transformation with a non-parametrized scalar activation function. The activation function is often nonlinear, which allows the network to represent nonlinear functions (Goodfellow, Bengio, and Courville 2016, Chapter 6). A network with only fully-connected layers is called a fully-connected network.

Let θ = ((W_1, b_1), . . . , (W_n, b_n)) specify the parameters of a fully-connected network's affine transformations and σ_1, . . . , σ_n its activation functions; then the network computes:

    f(x, θ) = σ_n(σ_{n−1}(. . . σ_1(x W_1 + b_1) . . . W_{n−1} + b_{n−1}) W_n + b_n).    (2.16)

While the Universal Approximation Theorem states that a two-layer fully-connected network can represent any function, it may require a prohibitively large number of parameters or may generalize poorly (Goodfellow, Bengio, and Courville 2016, Chapter 6). Therefore, more sophisticated architectures have been designed to overcome the limitations of fully-connected feed-forward networks.
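The forward pass in equation 2.16 takes only a few lines of code. The sketch below is ours, with arbitrary layer sizes and a ReLU activation chosen for illustration; it uses the same right-multiplication convention x W + b as the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def init_layer(n_in, n_out):
    """One fully-connected layer: weight matrix W and bias b."""
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

def forward(x, layers, activations):
    """Compute sigma_n(... sigma_1(x W_1 + b_1) ... W_n + b_n)."""
    for (W, b), sigma in zip(layers, activations):
        x = sigma(x @ W + b)
    return x

layers = [init_layer(8, 16), init_layer(16, 4)]
activations = [relu, lambda z: z]        # linear output layer
x = rng.standard_normal((1, 8))          # one input vector
print(forward(x, layers, activations).shape)  # (1, 4)
```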
2.3.2 Recurrent neural networks

A shortcoming of feed-forward networks is that they have no notion of "previous" and cannot recall what they were doing earlier. For example, suppose a neural network classifies frames in movies. The current frame is of course important, but information from prior frames could perhaps improve the network's predictions. If the last frame was of an elephant, the probability of the next frame also being of an elephant must be quite high. It is unclear how a feed-forward neural network could incorporate such information.

Recurrent neural networks (RNNs) were invented to address this issue. They excel at sequence prediction tasks and are a good fit for symbolic music which, as described in section 2.2, can be modelled as sequences.

Unlike feed-forward networks, RNNs contain state, known as hidden state, that is affected by the data that passes through them. Let h_t be the network's hidden state at time t and x_t the t-th element of a sequence; then the network is defined by the recurrence relation

    h_t = f(h_{t−1}, x_t, θ).    (2.17)

Typically, an RNN will use the hidden state to make predictions (Goodfellow, Bengio, and Courville 2016, Chapter 10):

    o_t = g(h_t).    (2.18)

If the network is employed to predict the next element of the sequence, then x_{t+1} = o_t and we can write the above as

    x_{t+1} = g(f(h_{t−1}, x_t, θ)).    (2.19)

Like feed-forward networks, RNNs can be layered. Suppose we have three recurrent layers, defined by the hidden states h^{(1)}, h^{(2)}, and h^{(3)} and the parameters θ^{(1)}, θ^{(2)}, and θ^{(3)}; then the network computes
the following recurrences and prediction function g:

    h_t^{(1)} = f(h_{t−1}^{(1)}, x_t, θ^{(1)})
    h_t^{(2)} = f(h_{t−1}^{(2)}, h_t^{(1)}, θ^{(2)})    (2.20)
    h_t^{(3)} = f(h_{t−1}^{(3)}, h_t^{(2)}, θ^{(3)})
    o_t = g(h_t^{(3)}).

To train an RNN, the recurrence relation has to be removed by unfolding it over training sequences of fixed lengths. Unfolding effectively means making a copy of the network for every element in the training sequence, with every copy sharing the same parameters. Let x = (x_1, . . . , x_n) be a training sequence of length n; then the unfolded computation is defined by iteratively applying f to the hidden state and the sequence elements:

    h_n = f(. . . f(h_0, x_1, θ) . . . , x_n, θ).    (2.21)

The initial hidden state, h_0, is set to some default value, commonly zero (Zimmermann et al. 2005). The unfolded computation is free of cycles and the RNN can be trained using backpropagation and gradient descent like a feed-forward network.

The unfolding of the computation graph causes two major problems for RNNs. First, the depth is proportional to n, causing the complexity of the training to also become proportional to n. Due to the inherently sequential nature of the data flow, the problem cannot be meaningfully parallelized, making training with longer sequences very expensive. Furthermore, the activations of the neurons inside the network are replicated n times, which consumes memory proportional to n (Hwang and Sung 2017). The second problem is the vanishing/exploding gradient problem, caused by the long chains of repeated multiplication necessary for propagating gradients in deep computational graphs. Repeated multiplication of numbers greater than one tends towards infinity and causes gradients to explode, while repeated multiplication of numbers less than one tends towards zero and causes gradients to vanish (Goodfellow, Bengio, and Courville 2016, Chapter 10). Vanishing gradients cause training to become very slow and exploding gradients cause it to be wildly unstable. This puts a bound on the length of the temporal dependencies RNNs can learn (McGonagle, Williams, and Khim 2021). Goodfellow, Bengio, and Courville (2016) estimate that the limit is somewhere around 10 to 20 tokens for traditional RNNs trained with stochastic gradient descent.
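To illustrate the recurrence (2.17) and its unfolding (2.21), here is a sketch of our own with a plain tanh recurrence; the weight shapes are arbitrary and every step reuses the same parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden, vocab = 16, 8
W = rng.standard_normal((hidden, hidden)) * 0.1   # hidden-to-hidden weights
U = rng.standard_normal((vocab, hidden)) * 0.1    # input-to-hidden weights
b = np.zeros(hidden)

def step(h_prev, x_t):
    """One application of the recurrence h_t = f(h_{t-1}, x_t, theta)."""
    return np.tanh(h_prev @ W + x_t @ U + b)

def unroll(xs):
    """Unfold the recurrence over a whole sequence, starting from h_0 = 0."""
    h = np.zeros(hidden)
    states = []
    for x_t in xs:
        h = step(h, x_t)
        states.append(h)      # each h_t could feed a prediction o_t = g(h_t)
    return states

xs = np.eye(vocab)[[0, 3, 2, 1]]          # a length-4 sequence of one-hot tokens
states = unroll(xs)
print(len(states), states[-1].shape)      # 4 (16,)
```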
To overcome this limitation, gated RNNs were invented. They are based on the idea of creating paths through time whose derivatives neither vanish nor explode (Goodfellow, Bengio, and Courville 2016, Chapter 10).

LSTM

The Long Short-Term Memory (LSTM) network is one of the most successful gated RNN types (Mauthes 2018). Researchers have shown that LSTMs can learn long-term dependencies more easily than simple recurrent architectures (Goodfellow, Bengio, and Courville 2016, Chapter 10).

The LSTM has a cell state in addition to its hidden state. The cell state works as an auxiliary memory onto which the LSTM puts, and from which it removes, information that needs to be remembered long-term. Thus, the LSTM's recurrence relation is:

    h_t, C_t = f(h_{t−1}, C_{t−1}, x_t, θ).    (2.22)

C_t is the cell state at time t and the other variables are as previously defined. The equations implementing the recurrence are:

    f_t = σ(x_t U_f + h_{t−1} W_f + b_f)
    i_t = σ(x_t U_i + h_{t−1} W_i + b_i)
    o_t = σ(x_t U_o + h_{t−1} W_o + b_o)    (2.23)
    C̃_t = tanh(x_t U_C + h_{t−1} W_C + b_C)
    C_t = f_t ◦ C_{t−1} + i_t ◦ C̃_t
    h_t = tanh(C_t) ◦ o_t

The subscripted U, W, and b variables are appropriately sized parameter matrices and biases, ◦ the Hadamard product (element-wise multiplication), and σ the sigmoid function:

    σ(x) = 1 / (1 + e^{−x}).    (2.24)

Three gates – the forget, input, and output gates, represented by the vectors f_t, i_t, and o_t in the equations above – control how the LSTM
stores information in the cell state and how it uses it to update its hidden state (Olah 2015). They all apply a parametric affine transformation to h_{t−1} and x_t, followed by the sigmoid function. The range of the sigmoid function is 0 to 1, so the gates' values are also constrained to that range. Note that the range of the tanh function is −1 to 1, meaning that the range of h_t is −1 to 1.

The forget gate determines what fraction of the cell state to discard. The input gate determines how much of the candidate cell state, C̃_t, to discard; 0 again means discard everything and 1 means keep everything. The update of C_t in equation 2.23 can be thought of as taking a weighted sum of the old cell state, C_{t−1}, and the candidate cell state, C̃_t. This is what allows the LSTM to solve the vanishing gradient problem and to better capture long-term dependencies. The final gate is the output gate, which controls how much memory information is passed through to the predictor.
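The equations in (2.23) translate almost directly into code. The following sketch is our own, with arbitrary dimensions; in practice a framework implementation of the LSTM layer would be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (U, W, b) triple per gate/candidate: forget, input, output, candidate.
params = {g: (rng.standard_normal((n_in, n_hid)) * 0.1,
              rng.standard_normal((n_hid, n_hid)) * 0.1,
              np.zeros(n_hid)) for g in "fioc"}

def lstm_step(x_t, h_prev, C_prev):
    """One step of the recurrence h_t, C_t = f(h_{t-1}, C_{t-1}, x_t, theta)."""
    def affine(g):
        U, W, b = params[g]
        return x_t @ U + h_prev @ W + b
    f_t = sigmoid(affine("f"))            # forget gate
    i_t = sigmoid(affine("i"))            # input gate
    o_t = sigmoid(affine("o"))            # output gate
    C_tilde = np.tanh(affine("c"))        # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde    # Hadamard products
    h_t = np.tanh(C_t) * o_t
    return h_t, C_t

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):  # run a length-5 sequence through
    h, C = lstm_step(x, h, C)
print(h.shape, C.shape)  # (16,) (16,)
```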
CHAPTER 2. THEORETICAL BACKGROUND 19 The Transformer’s data flow is:6 h0 = PE(x) hn = En (. . . E1 (h0 ) . . . ) (2.25) hm+n = Dm (. . . D2 ( D1 (hn , hn ), hn ) . . . , hn ) f (x) = FC (hm+n ). E1 , . . . , En and D1 . . . , Dm represents the encoder and decoder layers, h0 to hm+n the context vector as it flows through the layers, PE is a layer for embedding and positionally encoding sequence elements, and FC a fully-connected layer that produces predictions. Note that the decoder layers takes two values as input; the output of the previous decoder layer and the output of the last encoder layer. The encoder layers’ data flow is: h0 = LN ( MH A(h, h, h) + h) (2.26) E(h) = LN ( FFN (h0 ) + h0 ) and the decoder layers’ is: h0 = LN ( MH A(h, h, h) + h) h00 = LN ( MH A(he , he , h0 ) + h0 ) (2.27) 00 00 D (h, he ) = LN ( FFN (h ) + h ). he is the context vector returned by the last encoder layer, MH A is the multi-head attention mechanism, LN a normalization layer, and FNN a two-layer feed-forward network with ReLU activation inbetween: FFN ( x ) = max(0, xW1 + b1 )W2 + b2 . (2.28) Multi-head attention The multi-head attention mechanism is the heart of the Transformer architecture. In a recurrent seq2seq model, the encoder processes the input sequence one element at a time. Each element updates the encoder’s context vector or hidden state. The decoder takes the context vector as input and produces predictions. The problem with this is 6 For brevity, in this and the following equations we omit most parameters and dropout rates. We hope it is clear from context where parameters are missing. For these and other details about the Transformer, see Vaswani et al. (2017).
that older elements tend to be forgotten or "overwritten" by more recent elements. To deal with this problem, the attention mechanism was invented. It creates "shortcut" connections between the context vector and the entire input sequence. The weights of these shortcut connections are learnable for each input element, allowing the model to learn which prior elements are most important for predicting the current element. In other words, the model learns which elements to pay attention to.

There are many ways to implement attention. In the Transformer model it is implemented as

    MHA(q, k, v) = softmax( FC(q) FC(k)^T / √(s/n) ) FC(v).    (2.29)

s is the dimensionality of the context vector and n the number of attention heads in the layer (see below). FC is a fully-connected layer with no activation and no bias, i.e. multiplication by a learnable matrix. The multi-headed nature of the Transformer's attention mechanism is not communicated in the equation above. The Transformer splits FC(q), FC(k), and FC(v) into as many parts as there are heads and computes the attention for each part separately. Intuitively, multiple attention heads allow the model to attend to different parts of the sequence in different ways.

Positional encodings

As the elements of a sequence flow through the Transformer simultaneously, it has no sense of the elements' ordering. While ordering is implicit in RNNs, it has to be provided to the Transformer explicitly through a positional encoding, which adds a piece of information to every element so that the model can track its order. The positional encoding is fixed in the vanilla Transformer but learnable in GPT-2.

GPT-2

GPT-2 (Generative Pre-trained Transformer 2) is a variant of the Transformer architecture. Developed by OpenAI and published in 2019, it is a refinement of the GPT architecture from 2018 which, in turn, is a refinement of the Transformer-Decoder architecture (Radford, Wu, et al. 2019). It differs from the vanilla Transformer in that it uses only decoder layers and a learnable positional encoding, rather than a fixed one.
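As an illustration of equation 2.29 (our own sketch, not the thesis's code), the function below computes attention for a single head, n = 1; the multi-head version splits the projected queries, keys, and values into n parts, attends with each part separately, and concatenates the results.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 64                      # dimensionality of the context vectors

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# The three bias-free fully-connected projections in equation 2.29.
Wq, Wk, Wv = (rng.standard_normal((s, s)) / np.sqrt(s) for _ in range(3))

def attention(q, k, v):
    """softmax(FC(q) FC(k)^T / sqrt(s)) FC(v) for one head (n = 1)."""
    Q, K, V = q @ Wq, k @ Wk, v @ Wv
    scores = Q @ K.T / np.sqrt(s)        # similarity of every query to every key
    return softmax(scores) @ V           # weighted sum of the values

x = rng.standard_normal((10, s))         # a sequence of 10 embedded elements
print(attention(x, x, x).shape)          # (10, 64): one context vector per element
```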
The following equations describe the architecture (Radford, Narasimhan, et al. 2018):

    h_0 = x W_e + W_p
    h_m = D_m(. . . D_2(D_1(h_0)) . . .)    (2.30)
    f(x) = softmax(h_m W_e^T)

W_p and W_e are matrices containing the positional encoding and the embedding vectors. Both are learnable parameters. Note that the embedding matrix is reused for producing predictions. GPT-2's decoder layers are similar to the encoder layers in the Transformer:

    d' = LN(d)
    d'' = MHA(d', d', d') + d    (2.31)
    D(d) = FORW(LN(d'')) + d''.

FORW is a feed-forward network with the Gaussian Error Linear Unit as its activation function, and LN and MHA are as previously defined.

2.3.4 Inferencing

To get a neural network to perform inferencing over categorical distributions, its last layer needs to be a computation whose result is a probability distribution. A probability distribution is a vector whose element-wise sum is 1 and whose elements are all in the range 0 to 1. Very often, this is accomplished by using softmax as the activation function for the last layer:

    softmax_i(x) = e^{x_i} / ∑_{j=1}^{K} e^{x_j}.    (2.32)

K is the number of classes in the categorical distribution. For classification problems, cross-entropy loss is almost always preferred over other types of error functions (Janocha and Czarnecki 2017):

    E(θ) = − ∑_{(x,y)∈D} y log(f(x, θ)).    (2.33)
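A small sketch (ours) of equations 2.32 and 2.33 for a single training example; since y is one-hot, the sum reduces to the negative log-probability of the correct class.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y, p):
    """E = -sum(y * log p) for a one-hot target y and predicted distribution p."""
    return float(-(y * np.log(p)).sum())

logits = np.array([2.0, 0.5, -1.0, 0.1])  # raw outputs of the last layer
p = softmax(logits)                       # a proper probability distribution
y = np.array([1.0, 0.0, 0.0, 0.0])        # the correct class is class 0
print(p.round(3), cross_entropy(y, p))
```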
Chapter 3

Related work

A common approach to generating symbolic music is to model it as token sequences and generate it using autoregressive models. This is the sequence modelling approach we discussed in section 2.2, applied to music. The model starts with a possibly empty sequence and adds one token at a time until the sequence reaches a sufficient length or until an end token is seen.

The autoregressive method's major drawback is that it often fails to consider the overall structure of the music it is generating. Melodies can wander and, while pleasant-sounding notes may be generated, they often lack coherence. The opposite, that the generated music is too repetitive, is also a problem. A musical phrase or a motif perhaps should be repeated four to eight times, depending on the musical genre, but repeating it more might make the music sound repetitive. The cause of these problems may be the models' sizes. To capture long-term structure, models need to operate on very long sequences.

Name                  Year  Representation  Architecture    Length
folk-rnn              2015  ABC notation    3-layer LSTM    200 tokens
vgm-rnn               2018  MIDI            RNN             4-8 measures
LakhNES               2019  Event-based     Transformer-XL  5-10 s
DeepBach              2016  Pitch matrix    LSTMs + CNNs    12 s
Relative Transformer  2019  Mixed           Transformer     10-15 s

Table 3.1: Overview of some music generation models. The length column indicates the duration of clips used in the model authors' listener surveys or, in the case of folk-rnn, the average number of tokens per transcription (all models are able to generate clips of any length).
But the longer the sequences, the more computationally expensive the models become. With today's hardware, autoregressive models are limited to operating on sequences a few hundred tokens long. Many models reviewed in this chapter use novel techniques to get around the autoregressive paradigm's limitations. Sturm, Santos, et al. (2016) modeled complete musical pieces that required, for the most part, no more than 200 tokens to describe – short enough to fit into their model's memory. Donahue, Mao, Y. E. Li, et al. (2019) limited the model's output to about 500 tokens, equivalent to nine seconds of audio. Huang et al. (2019) implemented a memory-efficient Transformer able to operate on 2,000-token-long sequences. Coming up with space-efficient music encodings that minimize the number of tokens required to represent music appears to be very important for creating successful models.

Deep neural networks' impressive performances stem from having massive datasets to train on. Such datasets are readily available for modelling, for example, English text, but not for modelling music. Often, data augmentation techniques, including transposing songs and adjusting the tempo by some small percentage, are used to expand the available training data.

3.1 Sturm, Santos, et al. (2016)

Sturm, Santos, et al. (2016) used a dataset consisting of some 23 thousand folk music songs in ABC notation to train an RNN. ABC notation is a simple text-based format created for notating folk music. The notation is very compact – a few hundred characters are enough to represent an entire song – and can be read by humans. While there are extensions to ABC notation that make it usable for polyphonic multi-instrumental music, it is more suitable for monophonic music. Figure 3.1 shows a reel (a type of folk dance) notated in ABC and sheet notation.

The authors trained two networks using the dataset, char-rnn and folk-rnn, using different tokenization strategies. In char-rnn individual characters were tokens, while in folk-rnn syntactic units were tokens. For example, ":|" (right repeat) and "M:4/4" (meter) would be single tokens to folk-rnn but sequences of tokens to char-rnn. Both networks were three-layer LSTM networks with 512 units in each layer, totaling
X:262
T:ReelDuGin
M:4/4
L:1/8
K:Gmaj
|:gdgbagab|gdgbagef|gdgbagbg|gedBABG2:|
|:gedBABGA|BGABcdef|gedBABGA|BGEBA2G2:|

Figure 3.1: The same reel in both ABC notation and sheet notation. Note the repetition markers on every other line.

Figure 3.2: Screenshot of folk-rnn's web interface at https://folkrnn.org

about 5.6 million parameters. Figure 3.2 shows a screenshot of folk-rnn in action. Creating a full transcription takes only a few seconds.

In their evaluation of folk-rnn, they found that the probability of each token in the generated data roughly matched that of the training data, but that the generated transcriptions' lengths differed from the ones in the training data (Sturm, Santos, et al. 2016). folk-rnn tended to generate transcriptions whose lengths were closer to the peaks of the training data's length distribution. In other words, the variance of the distribution of the number of tokens was lower than expected.

The compact notation and the homogeneity of the dataset probably contributed to the good performance of the networks they trained. Folk music is a very regular type of music with similar themes and structures recurring in many songs. According to the authors, 54% of all transcriptions in the dataset had an AABB structure with each section being eight bars long, i.e., eight bars repeated once, followed by another set of eight bars also repeated once.
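To illustrate the difference between the two tokenization strategies, here is a rough sketch of our own – not the authors' code – where the token pattern for syntactic units is a simplified, hypothetical assumption:

```python
import re

# Hypothetical, simplified token pattern: header fields (e.g. "M:4/4", "K:Gmaj"),
# repeat/bar symbols ("|:", ":|", "|"), or single characters.
TOKEN_RE = re.compile(r"[A-Z]:[^\s|]+|\|:|:\||\||.")

def tokenize_chars(abc):
    return [c for c in abc if not c.isspace()]

def tokenize_units(abc):
    return [t for t in TOKEN_RE.findall(abc) if not t.isspace()]

line = "M:4/4 K:Gmaj |:gdgb agab:|"
print(tokenize_chars(line)[:8])  # ['M', ':', '4', '/', '4', 'K', ':', 'G']
print(tokenize_units(line))      # ['M:4/4', 'K:Gmaj', '|:', 'g', 'd', 'g', 'b', 'a', 'g', 'a', 'b', ':|']
```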
A drawback of their model was that it only generated monophonic music. However, the authors argued that "harmony is implicit in the melody" and that richer compositions could be constructed based on the generated transcriptions. One interesting issue they did not cover is the effect of various sampling techniques on the quality of the generated transcriptions.

3.2 Hadjeres and François Pachet (2016)

Hadjeres and François Pachet (2016) created a program called DeepBach for generating four-voice chorales in the style of Johann Sebastian Bach. DeepBach first generates a random chorale and then iteratively refines it using random resampling (Gibbs sampling). This, they argued, is a more flexible approach than autoregressive generation because a part of the chorale can be held constant while the remainder is resampled. For example, the user can specify one voice and let DeepBach reharmonize the chorale, i.e. generate the three remaining voices based on the user-specified voice.

DeepBach represents each chorale as a pitch matrix. The pitch matrix has four rows for the four voices: soprano (S), alto (A), tenor (T), and bass (B). The columns are time in the chorale, quantized as sixteenth notes. Thus, each element in the matrix describes a voice's pitch at a given time step. Special hold values (–––) are used to distinguish between repeated notes and notes longer than one sixteenth note. For example, a C# eighth-note in the third octave is notated as C#3 followed by –––.

In addition to the pitch matrix, the representation includes two sequences whose elements correspond to the chorale's time steps: the subdivision list and the fermata list. The subdivision list cycles through four integers, 1, 2, 3, 4, 1, 2, 3, 4, 1, . . ., and represents the indexes of the sixteenth notes within their beats. The elements of the fermata list are 1 where notes annotated by the fermata symbol occur and 0 elsewhere. Fermatas are used in sheet music to introduce slight pauses in performances and demarcate musical phrases. The subdivision list helps DeepBach keep the rhythm steady and the fermata list helps it understand the overall structure of the chorale it generates. Figure 3.3 shows the same measure represented in three ways: in DeepBach's grid representation, in piano roll format, and in sheet notation.
   qn1             qn2             qn3             qn4
S: D-5 --- E-5 F-5 D-5 --- --- --- C-5 --- --- --- E-5 --- --- ---
A: A-4 --- --- --- G-4 --- F-4 --- E-4 --- E-4 --- E-4 --- --- ---
T: C-4 --- --- --- B-3 --- --- --- G-3 --- --- --- A-3 --- --- ---
B: F-3 --- D-3 --- G-3 --- --- --- C-2 --- --- --- C#2 --- --- ---
s: 1   2   3   4   1   2   3   4   1   2   3   4   1   2   3   4
f: 0   0   0   0   0   0   0   0   1   1   1   1   0   0   0   0

(a) Grid representation  (b) Piano roll  (c) Sheet notation

Figure 3.3: A measure in the grid representation Hadjeres and François Pachet (2016) created, and the equivalent piano roll and sheet music notation. The top four lines show the voices (soprano, alto, tenor, and bass) and the bottom two show the subdivision (s) and fermata (f) lists. The four ones in the fermata list denote one fermata over one of the notes in the third quarter note. qn1 to qn4 are not part of the representation – they clarify at which time steps quarter notes can begin.
The authors formalized DeepBach as a dependency network over the conditional probability distributions for each pitch in the chorale,

    p_{i,t}(V_{i,t} | V_{∖i,t}, M, θ_{i,t}),  for i ∈ [4] and t ∈ [T].    (3.1)

V_{i,t} is the pitch voice i plays at time step t, V_{∖i,t} the pitches for all voices at all time steps except for V_{i,t}, M the subdivision and fermata lists, and θ_{i,t} the distributions' parametrizations. They dropped the t indexes to make DeepBach time invariant, resulting in distributions of the form

    p_i(V_{i,t} | V_{∖i,t}, M, θ_i).    (3.2)

I.e., each voice is modeled separately but depends on every other voice. Note that this formalization, unlike traditional step-by-step generation methods, incorporates information from both the past and the future.

To implement this sophisticated scheme, DeepBach uses four networks for the four voices. Each of the four networks consists of several subnetworks: two LSTMs for processing the voice's past and future pitches, and a feed-forward network for processing simultaneous pitches. These three subnetworks can be thought of as looking left, right, and vertically (up and down). A fourth network merges the underlying networks' outputs, yielding an approximation of p_i(V_{i,t} | V_{∖i,t}, M, θ_i). The LSTMs only consider 16 future and past time steps, rather than summing over the whole of V_{∖i,t}, to keep DeepBach fast. The authors stated that "[t]his approximation appears to be accurate since musical analysis reveals that Bach chorales do not exhibit clear long-term dependencies."

DeepBach generates chorales by first selecting a subdivision and fermata list from an existing chorale and then initializing the pitch matrix with random values. It then repeatedly updates the pitch matrix a set number of times by resampling a randomly chosen V_{i,t}. To improve efficiency, it batches updates in groups of 16 or 32. Figure 3.4 shows the piano roll of two measures as DeepBach incrementally improves them using Gibbs sampling.

The authors trained DeepBach on the Johann Sebastian Bach Chorales dataset, a collection of 382 roughly one-minute-long, four-voice chorales. To this collection they added all chorale transpositions fitting within the vocal ranges defined by the initial dataset, growing the dataset to 2,503 chorales.
Figure 3.4: Piano rolls of two measures (32 sixteenth-notes) iteratively updated by Gibbs sampling. The top-left figure shows the random initialization, the figure to its right the state after 50 batched iterations, the next figure the state after 100, and so on. The bottom-right figure shows the state after 250 iterations.
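The generation procedure can be summarized in a short sketch (our own, deliberately simplified; conditional_model stands in for the per-voice networks and is not part of the published DeepBach code):

```python
import numpy as np

rng = np.random.default_rng(0)
N_VOICES, N_STEPS, N_PITCHES = 4, 64, 60   # illustrative sizes

def gibbs_generate(conditional_model, metadata, n_iterations=10_000, batch=16):
    """Iteratively resample random cells of a random initial pitch matrix."""
    V = rng.integers(0, N_PITCHES, size=(N_VOICES, N_STEPS))   # random chorale
    for _ in range(n_iterations // batch):
        # Pick a batch of cells (i, t) and resample each from p_i(V_it | V_\it, M).
        voices = rng.integers(0, N_VOICES, batch)
        steps = rng.integers(0, N_STEPS, batch)
        for i, t in zip(voices, steps):
            probs = conditional_model(i, t, V, metadata)
            V[i, t] = rng.choice(N_PITCHES, p=probs)
    return V

# Stand-in model that ignores its context and returns a uniform distribution.
def uniform(i, t, V, M):
    return np.full(N_PITCHES, 1.0 / N_PITCHES)

print(gibbs_generate(uniform, metadata=None, n_iterations=1000).shape)  # (4, 64)
```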
In a test of whether humans could tell DeepBach's harmonizations apart from real harmonizations written by Bach, the authors found that DeepBach outperformed two baselines: a Maximum Entropy model as in Hadjeres, Sakellariou, and François Pachet (2016) and a multilayer perceptron network. The models were given random soprano voices from the chorales in the validation dataset and had to generate the alto, tenor, and bass voices. Around 50% of the time, the survey takers would judge a reharmonization generated by DeepBach as composed by Bach, which the authors considered to be a good score.

3.3 Donahue, Mao, Y. E. Li, et al. (2019)

Donahue, Mao, Y. E. Li, et al. (2019) trained a Transformer-XL network to generate video game music in the style of the 1980s video game console the Nintendo Entertainment System (NES). The NES sound system has four channels:¹ P1, P2, and TR for playing melodic notes, and NO for generating noise serving as percussion. The channels playing melodic notes cover six to seven octaves and the noise channel can play 16 types of noise. Each channel can only play one note at a time.

¹ A fifth channel exists for playing low-quality samples, but it wasn't used in the authors' work.

The authors created a simple event-based format for NES-compatible music, with NOTEON and NOTEOFF events for turning pitches and percussive sounds on and off, and DT (delta tick) events for advancing time. Since they subdivided time into 44,100 ticks per second, they quantized time advancements to keep the number of events manageable. For example, instead of having one time-advance event for every conceivable tick delta, they represented a time advancement of 1,840 ticks as the three-event sequence DT_1000, DT_800, and DT_40. On average, nine seconds of audio required about 500 events to represent.

They used the Lakh MIDI dataset for pre-training and NES-MDB for fine-tuning. Lakh MIDI is a collection of about 175 thousand songs in MIDI format and NES-MDB a collection of about five thousand songs from the soundtracks of about three hundred NES games (Donahue, Mao, and McAuley 2018). Since the MIDI format is much richer than the authors' event-based format, they invented a protocol for converting MIDI files to their format by randomly mapping melodic MIDI instruments to one of the NES's three melodic channels and
P2_NOTEON_87, DT_30, TR_NOTEON_20, DT_50, P2_NOTEOFF, NO_NOTEON_12, DT_50, P2_NOTEON_90, DT_50, P1_NOTEON_76, TR_NOTEON_22, ...

Figure 3.5: Example of Donahue, Mao, Y. E. Li, et al. (2019)'s event-based representation and the resulting piano roll. The event sequence should be read as follows: play pitch 87 on P2, wait 30 ticks, play pitch 20 on TR, wait 50 ticks, silence P2, play noise 12 on NO, wait 50 ticks, play pitch 90 on P2, wait 50 ticks, play pitch 76 on P1, change to pitch 22 on TR, and so on.

percussive instruments to one of the 16 noise types. This conversion grew their pre-training dataset to some 775,000 songs, as there were multiple ways to convert each song. They also augmented their datasets by transposing songs and randomly adjusting the speed of songs by some small percentage.

They compared their Transformer-XL with other neural networks of similar sizes and found that it performed very well both when comparing perplexities and in a listener survey. In the listener survey, respondents indicated which of two clips they preferred and which they thought was composed by a human. However, they only used five- and nine-second-long clips – perhaps because their network failed to stay coherent for longer durations.

They theorized that a beat-based format, rather than their tick-based one, would fare better. Likely, it would reduce the number of tokens required to encode the music (500 tokens is a lot for nine seconds of audio) and also make it easier for the network to learn rhythms.
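The time-advance quantization described above can be sketched as follows; this is our own illustration of the idea, not the authors' code, and the set of allowed DT amounts is an assumption chosen so that the 1,840-tick example comes out as DT_1000, DT_800, DT_40.

```python
# Allowed time-advance amounts, largest first (an assumed, illustrative set).
DT_BUCKETS = [1000, 800, 400, 200, 100, 80, 40, 20, 10, 8, 4, 2, 1]

def quantize_delta(ticks):
    """Greedily split a tick delta into a short sequence of DT events."""
    events = []
    for bucket in DT_BUCKETS:
        while ticks >= bucket:
            events.append(f"DT_{bucket}")
            ticks -= bucket
    return events

print(quantize_delta(1840))  # ['DT_1000', 'DT_800', 'DT_40']
```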
   qn1         qn2         qn3         qn4
S: 65 65 65 65 72 72 70 70 69 69 67 67 65 65 65 65
A: 60 60 60 60 60 60 60 60 60 60 60 60 62 62 64 64
T: 57 57 57 57 55 55 55 55 53 53 55 55 57 57 58 58
B: 53 53 53 53 52 52 52 52 53 53 52 52 50 50 50 50

Figure 3.6: The opening measure of Chorale #305 encoded in grid representation and the resulting piano roll. The tenor and bass voices overlap in the third quarter note.

3.4 Huang et al. (2019)

Like the previously mentioned authors, Huang et al. (2019) used sequence modelling for generating symbolic music. They developed a memory-efficient implementation of the Relative Transformer model introduced by Shaw, Uszkoreit, and Vaswani (2018), allowing them to train on 2,000-token-long sequences, far longer than the training sequences used in other works. The authors reported state-of-the-art results for their model, which outperformed two baselines – an LSTM model augmented with attention and a vanilla Transformer model – both in listener tests and when measuring negative log-likelihood.

The authors used two datasets for evaluating their model: the Johann Sebastian Bach Chorales dataset (see section 3.2 for a description of the dataset and the structure of chorales), and the Piano-e-Competition dataset, consisting of 1,100 classical piano performances in MIDI format. Due to differences in the structure of the songs in each dataset, they encoded them differently.

The authors used a pitch matrix with time quantized as sixteenth notes for representing the chorales, as shown in figure 3.6. Thus, one measure is represented using a 4 × 16 grid of integers.² The matrix is

² Figure 6 on page 11 in Huang et al. (2019) indicates that time was actually quantized as quarter-notes and not sixteenth-notes, so that one measure would