Eurovision song festival melody generation

Bachelor thesis

                   Lex Johan
                   12181242

        Bachelor Kunstmatige Intelligentie

            University of Amsterdam
               Faculty of Science
               Science Park 904
             1098 XH Amsterdam

                   Supervisor
               Dr. J.A. Burgoyne
       Faculteit der Geesteswetenschappen
       Capaciteitsgroep Muziekwetenschap
           Nieuwe Doelenstraat 16-18
              1090 BB Amsterdam

                   April, 2021

Abstract
The research question that this thesis will try to answer is: “What quality of melody are state-of-the-art music generation artificial intelligence algorithms able to produce based on Eurovision song festival songs?” To achieve this, live audio recordings have been turned into MIDI files to train three neural networks: Melody RNN, MusicVAE and MidiNet. These models were compared to one another both objectively and subjectively through a user study. Both of these comparisons agree that MusicVAE is able to make the best melodies of the three.

Contents

1 Introduction

2 Background
  2.1 Variational autoencoder (VAE)
  2.2 Generative adversarial network (GAN)
  2.3 Long short-term memory (LSTM)
  2.4 Convolutional neural network (CNN)

3 Related work
  3.1 Melody generation
  3.2 MusicVAE
  3.3 MidiNet
  3.4 Melody RNN

4 Method
  4.1 Dataset
  4.2 Preprocessing
      4.2.1 Segmentation (SBIC)
      4.2.2 Melody extraction (MELODIA)
      4.2.3 BPM estimation (multi-feature beat tracker)
      4.2.4 Melody frequency to MIDI file
  4.3 Training the models

5 Results
  5.1 Objective comparison
  5.2 Subjective comparison

6 Conclusion and discussion
Chapter 1

Introduction
The goal of artificial intelligence (AI) has long been to emulate human intelligence,
and a part of that intelligence is our creativity. Computational creativity is a mul-
tidisciplinary field that tries to obtain creative behaviors from computers. One of
its most prolific subfields is that of music generation (also called algorithmic com-
position or musical meta-creation), which uses computational means to compose
music. This process of automatically generating new musical pieces is particularly hard for many reasons; for example, there is no loss function to test the quality of the generated music. Furthermore, what does good-quality music mean to human
listeners? These questions will not be answered any time soon, but researchers
have made many strides in the last decade.

The goal of music-generating systems is sometimes to create a formalization of a certain musical style, such as that of Bach. Other times, the generation of music itself is
the only goal. These diverging goals also lead to diverging methods, each with its
strengths and weaknesses. Not all these methods use AI in the sense of an artificial
neural network (ANN). Some try to approach music as if it were a natural language
with formal grammars, while others try to solve the problem of music generation
with genetic algorithms.

     This thesis will focus on melody generation based on live Eurovision audio
files of 1593 songs from 1956 to 2020, in an attempt to create ’pleasant-sounding’
melodies. To achieve this, three types of ANNs will be compared with each other:
a generative adversarial network built from convolutional neural networks (Yang et al. 2017),
a recurrent neural network with long short-term memory (Waite 2016), and a
variational autoencoder (Roberts et al. 2018). To evaluate the quality of music
generation, people will be asked to state a preference among melodies created
through the three ANNs. The research question that this thesis will try to answer is: “What quality of melody are state-of-the-art music generation artificial intelligence algorithms able to produce based on Eurovision song festival songs?”

Chapter 2

Background
This is a quick overview of the different kinds of ANNs that have been used in this
thesis with their major characteristics.

2.1     Variational autoencoder (VAE)
A VAE (Kingma & Welling 2019) is a type of unsupervised ANN which can learn a representation, or encoding, of the input data. It consists of three major parts: the encoder, the latent space and the decoder. The latent space acts as a bottleneck between the encoder and decoder, forcing the model to ignore insignificant data in order to learn the representation. The training steps for this type of model are as follows: the input data is encoded as a distribution over the latent space; from this distribution a point is sampled and decoded; the resulting reconstruction loss can then be backpropagated through the model.
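As an illustration of these training steps, a minimal, generic VAE sketch in PyTorch (a framework assumed here purely for illustration; it is not the implementation used for the models in this thesis) could look as follows, with arbitrary layer sizes:

```python
# Minimal VAE sketch: encoder -> latent distribution -> sampled point -> decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, input_dim=128, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(input_dim, 64)
        self.to_mu = nn.Linear(64, latent_dim)      # mean of the latent distribution
        self.to_logvar = nn.Linear(64, latent_dim)  # log-variance of the latent distribution
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, input_dim))

    def forward(self, x):
        h = F.relu(self.encoder(x))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a point from the latent distribution.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction loss plus a KL term that regularizes the latent distribution.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```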

2.2     Generative adversarial network (GAN)
A GAN (Creswell et al. 2018) consists of two types of networks: a generator and a discriminator. The generator is trained to produce new samples resembling the input data, while the discriminator is trained to distinguish between real samples and generated samples. These two models are trained together in a zero-sum game, which is halted when the generator is able to reproduce the training data and the discriminator can do no better than distinguishing real from generated samples 50% of the time.
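A minimal sketch of one training step of this zero-sum game, again in PyTorch with placeholder networks and sizes (not the MidiNet architecture), could look like this:

```python
# One GAN training step: update the discriminator, then the generator.
import torch
import torch.nn as nn

latent_dim, data_dim = 32, 128
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_batch):
    batch = real_batch.size(0)
    # Discriminator step: real samples should score 1, generated samples 0.
    fake = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = (bce(discriminator(real_batch), torch.ones(batch, 1)) +
              bce(discriminator(fake), torch.zeros(batch, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()
    # Generator step: try to make the discriminator score generated samples as real.
    fake = generator(torch.randn(batch, latent_dim))
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()

# Example call with a random "real" batch:
# d_loss, g_loss = train_step(torch.rand(16, data_dim))
```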

2.3     Long short-term memory (LSTM)
LSTMs (Hochreiter & Schmidhuber 1997) are a special kind of recurrent neural network (RNN), which allows information to persist like memory. RNNs can use this memory to process sequences of variable length as input. However, RNNs struggle to learn long-term dependencies (Bengio et al. 1993), in which the interval between the relevant information and the current state of the input sequence has become too large. LSTMs were introduced to combat this issue and are capable of remembering values over arbitrary time intervals. An LSTM unit contains a cell,
an input gate, an output gate and a forget gate. These components work together
to control the flow of information into and out of the cell.
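The sketch below illustrates, with PyTorch and arbitrary dimensions, how an LSTM carries a hidden state and a cell state across the steps of a note sequence; it is a generic example, not the Melody RNN implementation:

```python
# An LSTM consuming a sequence of embedded note events; the cell state c_n
# acts as the long-term memory that persists across time steps.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 130, 64, 128   # e.g. 128 MIDI pitches + rest + hold (illustrative)
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_logits = nn.Linear(hidden_dim, vocab_size)

notes = torch.randint(0, vocab_size, (1, 32))       # one sequence of 32 note events
outputs, (h_n, c_n) = lstm(embed(notes))            # h_n: hidden state, c_n: cell (memory) state
next_note_logits = to_logits(outputs[:, -1])        # predict the next note from the last step
```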

2.4     Convolutional neural network (CNN)
CNNs (O’Shea & Nash 2015) utilize convolutions to capture the spatial and tem-
poral dependencies of an image. A convolutional kernel is repeatedly applied to the
input image, which results in a feature map detailing the locations and strengths
of a detected feature in the input image. The strength of CNNs lies in the fact that
they are capable of automatically learning a large number of these convolutional
kernels in parallel.
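A short example of a 2-D convolution over a piano-roll-like matrix (pitch by time), with placeholder shapes:

```python
# A learned kernel slides over the (pitch x time) input and produces one
# feature map per output channel.
import torch
import torch.nn as nn

piano_roll = torch.rand(1, 1, 128, 16)          # (batch, channel, pitches, time steps)
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
feature_maps = conv(piano_roll)                 # eight feature maps, one per learned kernel
print(feature_maps.shape)                       # torch.Size([1, 8, 128, 16])
```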

Chapter 3

Related work

3.1     Melody generation
Music generation is far from a new idea, going as far back as the 1960s (Ames
1987). Research in this area has not stalled since then, and many new methods
have been created with the goal of music generation. However, the advent of machine learning has breathed new life into this field. These recent models utilize state-of-the-art techniques from other domains of machine learning; for example, transformer models primarily used in natural language processing tasks have also been adapted to music generation (Huang et al. 2018). This thesis
focuses on a sub-domain of music generation: melody generation, which is much
simpler because these models only have to predict the probability of a single note
at each time step. A few popular models in melody generation from the past five
years are detailed below in figure 3.1.

          Figure 3.1: Chronology of melody generation. (Ji et al. 2020)

According to Ji et al. (2020), these kinds of models face three major challenges: structure, closure and creativity. Structure in melodies refers to recurring themes, motifs, and patterns; these are often lacking in the output of the models, which makes the melodies boring. Closure is the sense of conclusion that follows from any tension created in the melody (Hopkins 1990); since these models lack any control over when a melody will end, there will always be a lack of closure. Lastly, creativity in melody generation is the ability to create new melodies not found in the dataset; however, contemporary machine learning techniques are mainly capable of interpolating the training dataset and are therefore unable to truly come up with something new.

3.2     MusicVAE
MusicVAE (Roberts et al. 2018) follows the basic structure of VAEs for sequential data proposed in Bowman et al. (2015); however, it tries to solve the challenge of structure with a unique hierarchical decoder. For the encoder, the authors use a two-layer bidirectional LSTM network with a state size of 2048 for both layers, which feeds into a latent space with 512 dimensions.
In their preliminary testing, they found that using a simple RNN as the decoder would result in the loss of long-term structure in the generated melodies. This was
hypothesized to be caused by the vanishing influence of the latent space as the
output sequence was generated. To alleviate this issue, the authors came up with
the ’conductor’, which is a two-layer unidirectional LSTM with a hidden state size
of 1024 and 512 output dimensions. This conductor uses the latent vector to create
embeddings, which through a tanh activation layer serve as initial states for the
bottom-layer decoder. The decoder consists of a two-layer LSTM with 1024 units
per layer. An overview of this model can be found in figure 3.2.

            Figure 3.2: Schematic overview of the MusicVAE model.
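A highly simplified sketch of this hierarchical decoding, based only on the description above (reduced and simplified layer sizes, no teacher forcing or output sampling, and placeholder inputs), might look as follows in PyTorch:

```python
# Conductor-based decoding: the latent vector drives a conductor LSTM that
# emits one embedding per bar; each embedding initializes the bottom-level
# decoder that generates the notes of that bar.
import torch
import torch.nn as nn

latent_dim, conductor_dim, decoder_dim, vocab = 512, 1024, 1024, 130
n_bars, steps_per_bar = 16, 16

conductor = nn.LSTM(latent_dim, conductor_dim, num_layers=2, batch_first=True)
embed_to_state = nn.Linear(conductor_dim, decoder_dim)      # tanh-projected initial state
decoder = nn.LSTM(vocab, decoder_dim, num_layers=2, batch_first=True)
to_logits = nn.Linear(decoder_dim, vocab)

def decode(z):                                   # z: (batch, latent_dim)
    batch = z.size(0)
    # The conductor turns the latent vector into one embedding per bar.
    conductor_in = z.unsqueeze(1).repeat(1, n_bars, 1)
    bar_embeddings, _ = conductor(conductor_in)
    bars = []
    for b in range(n_bars):
        # Each bar embedding initializes the bottom-level decoder's state.
        h0 = torch.tanh(embed_to_state(bar_embeddings[:, b])).unsqueeze(0).repeat(2, 1, 1)
        c0 = torch.zeros_like(h0)
        tokens = torch.zeros(batch, steps_per_bar, vocab)    # placeholder decoder inputs
        out, _ = decoder(tokens, (h0, c0))
        bars.append(to_logits(out))
    return torch.cat(bars, dim=1)                # (batch, n_bars * steps_per_bar, vocab)
```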

3.3     MidiNet
MidiNet is a CNN-based GAN model that generates melodies in the symbolic domain. Instead of creating a continuous melody sequence,
MidiNet proposes to generate melodies in a successive manner. This is achieved by
converting the MIDI sequence of notes into a two-dimensional matrix representing
the presence of notes over time in a bar and employing convolutions over this
matrix. Furthermore, to promote the creativity displayed by the network, random
noise is used as input to the generator part of MidiNet.
This generator’s goal is to transform the random noise into two-dimensional
score-like representations which should resemble real MIDI sequences. The gen-
erator uses a special convolution operator called transposed convolution, which
“up-samples” smaller matrices or vectors to larger ones. The output of the gener-
ator is given to a discriminator, which has to learn which of its inputs come from real MIDI files and which come from the generator. Together, this generator and discriminator pair form the GAN, but this GAN alone does not ensure the temporal dependency between different bars that this model aims to capture.
    MidiNet resolves this by introducing a conditional mechanism to use music
from the previous bars to condition the generation of the current bar. To do this,
a separate CNN model has to be trained, which is referred to as the conditioner.
Using this conditioner, MidiNet has the ability to look back at its previously
generated bars without the use of recurrent units, as is done in music generation
models using RNN techniques.

Figure 3.3: Schematic overview of MidiNet.
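The score-like representation described above can be sketched as follows; the 16-steps-per-bar resolution and the 128-pitch range are illustrative assumptions, not necessarily MidiNet's exact settings:

```python
# Each bar becomes a binary (pitch x time-step) matrix marking where notes sound.
import numpy as np

def bar_to_matrix(notes, steps_per_bar=16, n_pitches=128):
    """notes: list of (pitch, start_step, end_step) events within one bar."""
    matrix = np.zeros((n_pitches, steps_per_bar), dtype=np.float32)
    for pitch, start, end in notes:
        matrix[pitch, start:end] = 1.0          # mark where the note sounds
    return matrix

# Example: one bar containing two notes of the melody.
bar = bar_to_matrix([(60, 0, 4), (64, 4, 8)])    # C4 then E4, a quarter note each
print(bar.shape)                                 # (128, 16)
```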

3.4     Melody RNN
Melody RNN (Waite 2016) consists of two methods to improve a basic RNN’s ability to learn longer-term structures: Lookback RNN and Attention RNN. The first method, Lookback RNN, in addition to inputting the previous event in the sequence as a normal RNN would, can look back one and two bars to recognize patterns such as mirrored or contrasting melodies. Furthermore, the position within the
measure is also used as an input, allowing the model to more easily learn patterns
associated with 4/4 time music.
    The second method, Attention RNN, builds upon the basic RNN by imple-
menting a mechanism of attention as found in Bahdanau et al. (2014). Attention
is a different approach in ANNs to access previous information without having to
store it in an RNN cell’s state. In this implementation, the amount of attention
the last n steps receive from the current step will determine how much of their
activation is concatenated with the output of the current cell.
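A simplified sketch of this idea, where attention weights over the last n outputs produce a context vector that is concatenated with the current output (a generic Bahdanau-style illustration, not Magenta's exact implementation):

```python
# Attention over the last n RNN outputs; the weighted sum (context) is
# concatenated with the current output before the prediction layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, n = 128, 16
score = nn.Linear(2 * hidden_dim, 1)            # scores a (past, current) pair

def attend(past_outputs, current):              # past: (n, hidden), current: (hidden,)
    pairs = torch.cat([past_outputs, current.expand(n, -1)], dim=1)
    weights = F.softmax(score(pairs).squeeze(-1), dim=0)      # attention over the last n steps
    context = (weights.unsqueeze(1) * past_outputs).sum(dim=0)
    return torch.cat([current, context])        # fed to the output layer

out = attend(torch.randn(n, hidden_dim), torch.randn(hidden_dim))
print(out.shape)                                # torch.Size([256])
```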

Chapter 4

Method

4.1     Dataset
The dataset used to train the models consists of all 1593 Eurovision Song
Contest songs in the period of 1956 to 2020 in the form of MP3 live audio files.
The total duration of play time is around 88 hours, which includes introductory
talks and Charpentier’s ’Te Deum’ as the opening theme in certain years. Both of
these have to be removed in later preprocessing steps.

4.2     Preprocessing
To train the models, the live audio recordings need to be turned into the symbolic
music format MIDI. A four-part data preprocessing pipeline was constructed to
achieve this.

4.2.1    Segmentation (SBIC)
The live audio recordings still contain the introductory talks and opening themes in certain years; these had to be removed to achieve the best possible result in the melody extraction in the next step. To accomplish this, the SBIC algorithm (Gravier et al. 2010) from the Python Essentia library (Bogdanov et al. 2013) was utilized. This algorithm segments audio using the Bayesian information criterion given a matrix of feature vectors. For this use case, the Mel-frequency cepstral
coefficients (MFCCs) were used as feature vectors, since the MFCCs of either
an introductory talk or ’Te Deum’ are sure to differ greatly from the rest of the
song. The algorithm then searches for homogeneous segments in which the feature
vectors have the same probability distribution.
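A sketch of this segmentation step, assuming Essentia's standard-mode MFCC and SBic algorithms; the file name, frame sizes and default SBic parameters are illustrative rather than the exact settings used here:

```python
import numpy as np
import essentia.standard as es

# Load one live recording (the file name is a placeholder).
audio = es.MonoLoader(filename="song.mp3", sampleRate=44100)()

# Compute MFCCs per frame as the feature vectors.
window, spectrum, mfcc = es.Windowing(type="hann"), es.Spectrum(), es.MFCC()
mfcc_frames = []
for frame in es.FrameGenerator(audio, frameSize=2048, hopSize=1024):
    _, coefficients = mfcc(spectrum(window(frame)))
    mfcc_frames.append(coefficients)

# SBic expects a (features x frames) matrix and returns the frame indices at
# which the feature distribution changes, i.e. the segment boundaries.
features = np.ascontiguousarray(np.array(mfcc_frames).T)
segment_boundaries = es.SBic()(features)
```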

4.2.2    Melody extraction (MELODIA)
To convert the live audio recordings to MIDI, the fundamental frequency has to be extracted. This was accomplished with the MELODIA algorithm (Salamon et al. 2013) from the Essentia library. The working of this algorithm can be summarized in four parts: sinusoidal extraction, a salience function, pitch contour creation and melody selection. The sinusoidal extraction takes the frequencies present in the live audio recordings and transforms them using the short-time Fourier transform to compute the prevalence of each frequency.
The output of this transformation acts as the input for the salience function, whose goal is to estimate the pitches present at each moment in time. For every possible pitch (within a reasonable range), a harmonic series is sought which contributes to our perception of that pitch. The weighted sum of the energy at these harmonic frequencies is referred to as the ’salience’ of that pitch; this is repeated for each time step in the audio.
The peaks of the salience function at each time step can be used to track the pitch contours: series of consecutive pitch values which are continuous in both time and frequency. Any outliers left after this step are removed in the melody
selection, and we are left with an approximation of the fundamental frequency of
the audio and thus its melody.
This algorithm is focused on the melody produced by the singer’s voice, mean-
ing the generated melody will fall silent when the singer’s voice cannot be detected.
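A sketch of this step, assuming Essentia's PredominantPitchMelodia implementation of MELODIA; the equal-loudness preprocessing and the frame and hop sizes shown are common defaults, not necessarily the exact settings used here:

```python
import essentia.standard as es

audio = es.MonoLoader(filename="segment.mp3", sampleRate=44100)()  # placeholder file name
audio = es.EqualLoudness()(audio)                                  # common preprocessing for MELODIA

melodia = es.PredominantPitchMelodia(frameSize=2048, hopSize=128)
pitch_hz, pitch_confidence = melodia(audio)   # one f0 estimate per hop; 0 Hz where no voice is detected
```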

4.2.3    BPM estimation (multi-feature beat tracker)
For the construction of MIDI files, a BPM estimation is necessary. This was achieved through the multi-feature beat tracker algorithm (Oliveira et al. 2012) from the Essentia library, which utilizes the mutual agreement of five different features to estimate the BPM of the input audio: complex spectral difference, energy flux, spectral flux in MFCCs, a beat emphasis function and spectral flux between histogrammed spectrum frames.
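A sketch of this step, assuming Essentia's RhythmExtractor2013 with the 'multifeature' method, which implements this beat tracker; the file name is a placeholder:

```python
import essentia.standard as es

audio = es.MonoLoader(filename="segment.mp3", sampleRate=44100)()  # placeholder file name
rhythm = es.RhythmExtractor2013(method="multifeature")
bpm, beat_positions, beat_confidence, _, beat_intervals = rhythm(audio)
print(f"Estimated tempo: {bpm:.1f} BPM")
```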

4.2.4    Melody frequency to MIDI file
Given a frequency f of a melody in a certain time step, the corresponding MIDI
note d can be found with:
\[
  d = \left\lfloor 69 + 12 \log_2\!\left(\frac{f}{440}\right) \right\rceil \tag{4.1}
\]
Applying equation 4.1 to all frequencies in the melody, together with the estimated BPM, makes it possible to create a MIDI file.
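A sketch of this conversion, using the pretty_midi library as an assumed MIDI-writing tool (the thesis does not specify one); consecutive hops with the same rounded pitch are merged into a single note, and the hop duration and velocity are illustrative:

```python
import numpy as np
import pretty_midi

def f0_to_midi(pitch_hz, hop_seconds, bpm, out_path="melody.mid"):
    """Convert a per-hop f0 track (Hz, 0 = unvoiced) into a single-track MIDI file."""
    midi = pretty_midi.PrettyMIDI(initial_tempo=bpm)
    instrument = pretty_midi.Instrument(program=0)
    # Equation 4.1, applied element-wise; unvoiced hops are marked with -1.
    notes = np.where(pitch_hz > 0,
                     np.round(69 + 12 * np.log2(np.maximum(pitch_hz, 1e-6) / 440.0)),
                     -1)
    start = 0
    for i in range(1, len(notes) + 1):
        if i == len(notes) or notes[i] != notes[start]:
            if notes[start] >= 0:                          # skip unvoiced stretches
                instrument.notes.append(pretty_midi.Note(
                    velocity=100, pitch=int(notes[start]),
                    start=start * hop_seconds, end=i * hop_seconds))
            start = i
    midi.instruments.append(instrument)
    midi.write(out_path)

# Example: hop_seconds = 128 / 44100 for the hop size used in the melody extraction.
```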

These steps were combined in the preprocessing pipeline summarized in figure 4.1. After preprocessing, the total amount of playtime in the MIDI files was reduced to around 81 hours.

                 Figure 4.1: Overview of preprocessing pipeline

4.3     Training the models
These MIDI files cannot be used directly in the training of the ANNs; however, the models’ respective GitHub repositories all contain tools to convert MIDI files to
the correct format. Furthermore, all models were trained with their recommended
parameters for the recommended number of epochs.

Chapter 5

Results

5.1     Objective comparison
The aesthetics of music cannot be measured by a computer; however, it is possible to compare the outputs of the different models in an objective manner through the
features in the MIDI files. To compare these models, each model had to generate
10 melodies, each with a length of 30 seconds. As a control group, the MIDI files
from the melody extraction can be used.

Figure 5.1: Distribution of the number of pitch classes. The models appear on the x-axis and the number of pitch classes appears on the y-axis.

  Firstly, there is the distribution of the total number of pitch classes from each
model in figure 5.1. Melody RNN immediately jumps out with the highest deviation, while the control group centers clearly on 15 pitch classes; both MidiNet and
MusicVAE fall in between these two results. If there are too few different pitch
classes in the output, it will result in a boring melody, while too many different
pitch classes can result in a lack of cohesion. The fact that Melody RNN displays
both behaviors could mean that this ANN was unable to learn anything significant.
The control group also displays remarkable behavior with its small deviation, but
this could also be the result of the melody extraction.

Figure 5.2: Distribution of the average pitch range per melody. The models appear on the x-axis and the average pitch range appears on the y-axis.

Secondly, there is the distribution of the average pitch range in figure 5.2. This number is calculated by subtracting the lowest used pitch from the highest, in semitones. Melody RNN is an outlier here as well, with the control group and MusicVAE having a much smaller deviation. MidiNet also comes close to the control group. Too low a pitch range could result in a boring melody, while too high a pitch range could indicate a chaotic melody. Again, Melody RNN displays both behaviors, just as in figure 5.1, and the control group likewise shows behavior similar to that in figure 5.1.

Figure 5.3: Distribution of the average pitch shift. The models appear on the x-axis and the average pitch shift appears on the y-axis.

    Finally, we have the distribution of the average pitch shift in figure 5.3. This
number is calculated as the average value of the interval between two consecutive
pitches in semitones. Melody RNN again has by far the largest deviation; however, this time MidiNet comes closest to the control group while MusicVAE deviates
more. Here as well, Melody RNN is on both ends of the spectrum, but MusicVAE
has the highest mean average pitch shift, possibly pointing towards more chaotic
melodies generated by this model.
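As an illustration, the three statistics discussed above could be computed from a generated MIDI file roughly as follows, using pretty_midi as an assumed tool; the exact feature definitions used in this thesis may differ:

```python
import numpy as np
import pretty_midi

def melody_statistics(path):
    """Approximate versions of the statistics shown in figures 5.1-5.3."""
    midi = pretty_midi.PrettyMIDI(path)
    pitches = [note.pitch for inst in midi.instruments for note in inst.notes]
    n_pitch_classes = len({p % 12 for p in pitches})             # octave-folded pitch classes
    n_distinct_pitches = len(set(pitches))                       # distinct pitches used
    pitch_range = max(pitches) - min(pitches)                    # semitones (figure 5.2)
    avg_pitch_shift = float(np.mean(np.abs(np.diff(pitches))))   # semitones (figure 5.3)
    return n_pitch_classes, n_distinct_pitches, pitch_range, avg_pitch_shift
```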

5.2     Subjective comparison
The aesthetic qualities of the music were measured with a user study. Fifteen participants each had to listen to 30 seconds of melody per ANN, plus a melody from the training data. They were asked to answer three questions in Dutch on a scale from one to seven. These questions can be translated as follows: 1. How pleasant is the melody to listen to? 2. How cohesive is the melody? And 3. How interesting is the melody? The results are as follows:

Figure 5.4: Distribution of answering scores for the control group

The results for the control group are quite high across the board, as seen in figure 5.4, with questions one and two both having a mean score above 4 points; however, these melodies were lacking in interesting aspects according to the participants.

          Figure 5.5: Distribution of answering scores for Melody RNN

The results for Melody RNN are much lower, as seen in figure 5.5, never reaching a mean higher than three points and even dropping to a mean of two points on question two. The lack of cohesion, which the participants rated in question 2, is in line with the objective comparisons made earlier. This could also explain the
low scores on the other two questions.

Figure 5.6: Distribution of answering scores for MidiNet

The results for MidiNet, as seen in figure 5.6, are quite average: while it scores above average on question one with a mean of 4 points, the deviation on question 3 is quite high.

           Figure 5.7: Distribution of answering scores for MusicVAE

The results for MusicVAE, as seen in figure 5.7, are the highest outside the control group, with all mean scores higher than 4 points. It even scores higher than the control group on whether the melody is interesting.

From the results of both the objective and subjective comparisons, MusicVAE is the winner. Melody RNN sits at the low end of the subjective comparison, which is in line with the objective comparison in section 5.1.

Chapter 6

Conclusion and discussion
The question posed in the introduction of this thesis was: “What quality of melody are state-of-the-art music generation artificial intelligence algorithms able to produce based on Eurovision song festival songs?” To this end, a processing pipeline was made to turn the live audio performances into viable training data for three kinds of ANNs. The generated melodies were put through both objective and subjective comparisons, highlighting their differences. Melody RNN is the clear loser of this comparison from both perspectives, despite the promise of long-term structure that LSTMs bring with them; however, Melody RNN is also by far the simplest model of the three. The fact remains that there is enough evidence to say that Melody RNN was unable to learn anything significant from the data.
Second in place is MidiNet, which scored average in the subjective comparison and did not stand out in the objective comparison: while the participants overall agreed that the melody was pleasant, they were more divided on whether it was interesting.
    Finally, there is MusicVAE, which has scored comparatively high on the sub-
jective comparison and was very similar to the control group in the objective
comparison. Most remarkable is that this model was able to create more inter-
esting melodies than the control group according to the participants of the user
study. Perhaps this can also be seen in figure 5.3, where the distribution of the average pitch shift is higher than that of the other models and the control group.
    To improve on the results found in this thesis, the hyperparameters of both the
models and the MELODIA algorithm could be tweaked. The latter is especially
important because the training data was far from perfect; in future work, the processing pipeline should contain a step which removes the long silences that are left after the MELODIA algorithm.

Bibliography
Ames, C. (1987), ‘Automated composition in retrospect: 1956-1986’, Leonardo
 pp. 169–185.
Bahdanau, D., Cho, K. & Bengio, Y. (2014), ‘Neural machine translation by jointly
  learning to align and translate’, arXiv preprint arXiv:1409.0473 .
Bengio, Y., Frasconi, P. & Simard, P. (1993), Problem of learning long-term dependencies in recurrent networks, pp. 1183–1188, vol. 3.
Bogdanov, D., Wack, N., Gómez Gutiérrez, E., Gulati, S., Boyer, H., Mayor, O., Roma Trepat, G., Salamon, J., Zapata González, J. R., Serra, X. et al. (2013), Essentia: An audio analysis library for music information retrieval, in ‘14th Conference of the International Society for Music Information Retrieval (ISMIR)’, Curitiba, Brazil, pp. 493–498.
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R. & Ben-
  gio, S. (2015), ‘Generating sentences from a continuous space’, arXiv preprint
  arXiv:1511.06349 .
Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B. & Bharath,
  A. A. (2018), ‘Generative adversarial networks: An overview’, IEEE Signal Pro-
  cessing Magazine 35(1), 53–65.
Gravier, G., Betser, M. & Ben, M. (2010), ‘Audioseg: Audio segmentation toolkit, release 1.2’, IRISA, January.
Hochreiter, S. & Schmidhuber, J. (1997), ‘Long short-term memory’, Neural com-
 putation 9(8), 1735–1780.
Hopkins, R. G. (1990), Closure and Mahler’s Music: The Role of Secondary Pa-
  rameters, University of Pennsylvania Press, pp. 4–28.
  URL: http://www.jstor.org/stable/j.ctv5138cg.5

Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Hawthorne, C., Dai,
 A. M., Hoffman, M. D. & Eck, D. (2018), ‘Music transformer: Generating music
 with long-term structure’, arXiv preprint arXiv:1809.04281 .

Ji, S., Luo, J. & Yang, X. (2020), ‘A comprehensive survey on deep music genera-
   tion: Multi-level representations, algorithms, evaluations, and future directions’,
   arXiv preprint arXiv:2011.06801 .

Kingma, D. P. & Welling, M. (2019), ‘An Introduction to Variational Autoen-
  coders’, arXiv e-prints p. arXiv:1906.02691.

Oliveira, J. L., Zapata, J., Holzapfel, A., Davies, M. & Gouyon, F. (2012), ‘As-
  signing a confidence threshold on automatic beat annotation in large datasets’.

O’Shea, K. & Nash, R. (2015), ‘An introduction to convolutional neural networks’,
  arXiv preprint arXiv:1511.08458 .

Roberts, A., Engel, J., Raffel, C., Hawthorne, C. & Eck, D. (2018), A hierarchical
  latent vector model for learning long-term structure in music, in ‘International
  Conference on Machine Learning’, PMLR, pp. 4364–4373.

Salamon, J. J. et al. (2013), Melody extraction from polyphonic music signals,
  PhD thesis, Universitat Pompeu Fabra.

Waite, E. (2016), ‘Generating long-term structure in songs and stories’.
 URL: https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn

Yang, L.-C., Chou, S.-Y. & Yang, Y.-H. (2017), ‘Midinet: A convolutional gener-
  ative adversarial network for symbolic-domain music generation’.
