TeleMelody: Lyric-to-Melody Generation with a Template-Based Two-Stage Method
Zeqian Ju¹, Peiling Lu², Xu Tan*², Rui Wang², Chen Zhang³, Songruoyao Wu³, Kejun Zhang³, Xiangyang Li¹, Tao Qin², Tie-Yan Liu²
¹ University of Science and Technology of China  ² Microsoft Research Asia  ³ Zhejiang University, China
juzeqian@mail.ustc.edu.cn, {peil, xuta, ruiwa, taoqin, tyliu}@microsoft.com, {zc99, 22021296, zhangkejun}@zju.edu.cn, xiangyangli@ustc.edu.cn

arXiv:2109.09617v1 [cs.SD] 20 Sep 2021

Abstract

Lyric-to-melody generation is an important task in automatic songwriting. Previous lyric-to-melody generation systems usually adopt end-to-end models that directly generate melodies from lyrics, which suffer from two issues: 1) lack of paired lyric-melody training data, and 2) lack of control over the generated melodies. In this paper, we develop TeleMelody, a two-stage lyric-to-melody generation system that uses a music template (e.g., tonality, chord progression, rhythm pattern, and cadence) to bridge the gap between lyrics and melodies (i.e., the system consists of a lyric-to-template module and a template-to-melody module). TeleMelody has two advantages. First, it is data efficient. The template-to-melody module is trained in a self-supervised way (i.e., the source template is extracted from the target melody), which requires no paired lyric-melody data. The lyric-to-template module consists of several rules and a lyric-to-rhythm model, which is trained with paired lyric-rhythm data that is much easier to obtain than paired lyric-melody data. Second, it is controllable. The design of the template ensures that the generated melodies can be controlled by adjusting the musical elements in the template. Both subjective and objective experimental evaluations demonstrate that TeleMelody generates melodies with higher quality, better controllability, and lower requirements on paired lyric-melody data than previous generation systems.
1 Introduction

With the rapid development of artificial intelligence, automatic songwriting has drawn much attention from both academia and industry. Automatic songwriting covers many tasks, such as lyric generation (Malmi et al. 2016; Xue et al. 2021), melody generation (Wu et al. 2020; Choi, Fazekas, and Sandler 2016; Zhu et al. 2018), lyric-to-melody generation (Yu, Srivastava, and Canales 2021; Sheng et al. 2020; Bao et al. 2019; Lee, Fang, and Ma 2019), and melody-to-lyric generation (Sheng et al. 2020; Watanabe et al. 2018; Li et al. 2020). In this paper, we focus on lyric-to-melody generation, since it is one of the most important and common tasks in songwriting and is still under-explored.

In recent years, deep learning techniques have been widely used to develop end-to-end lyric-to-melody systems (Yu, Srivastava, and Canales 2021; Sheng et al. 2020; Bao et al. 2019; Lee, Fang, and Ma 2019). However, these systems suffer from the following issues: 1) They require a large amount of paired lyric-melody data to learn the correlation between syllables in lyrics and notes in melodies (Sheng et al. 2020), but collecting paired data is difficult and costly. Sheng et al. (2020) attempt to alleviate this low-resource challenge with unsupervised lyric-to-lyric and melody-to-melody pre-training; however, the unpaired data mainly helps the understanding and generation of lyrics and melodies, and has little effect on learning the correlation between them. 2) They generate melodies directly from lyrics, which prevents end users from controlling musical elements (e.g., tonality and chord progression) during generation. Without controllability, requirements from users may be ignored and the application scenarios are limited.

In this paper, we propose TeleMelody¹, a two-stage lyric-to-melody generation system with a carefully designed template as a bridge to connect lyrics and melodies. The template contains tonality, chord progression, rhythm pattern, and cadence. This template is effective because: 1) it can be conveniently extracted from melodies and predicted from lyrics, and successfully captures the characteristics of both; 2) it is easy for users to understand and manipulate. Accordingly, we break the lyric-to-melody task down into a lyric-to-template module and a template-to-melody module, which reduces the task difficulty, improves data efficiency, and enables controllability over the generated melodies. The details are as follows:

This two-stage framework helps reduce the difficulty of learning the correlation between lyrics and melodies in end-to-end models. In the template-to-melody module, we train a template-to-melody model with templates extracted from melodies by rules. Generating melodies from templates is much easier than from lyrics, since the correlation between templates and melodies is much stronger than that between lyrics and melodies, and the paired template-melody data can be easily obtained in a self-supervised way. In the lyric-to-template module, the rhythm pattern in the template is obtained by a lyric-to-rhythm model, which is trained with paired lyric-rhythm data. This paired data is obtained by extracting rhythm patterns from crawled lyric-audio data with audio processing tools, which is much easier to get than paired lyric-melody data. Cadence is inferred based on punctuation mappings. Chord progression and tonality in the template can be acquired with predefined musical knowledge. In this way, the two modules rely on either self-supervised learning or data mining on external lyric-audio data, which requires no paired lyric-melody data and is more data-efficient than end-to-end models.

* This work was conducted at Microsoft. Corresponding author: Xu Tan, xuta@microsoft.com
¹ Tele is from TEmpLatE.
Moreover, benefiting from the template-based framework, end users can control the generated melodies by changing the musical elements in templates. Besides, in the sequence-to-sequence template-to-melody model, we use musical knowledge to guide the learning of attention alignments between the template tokens and the corresponding melody tokens, which leads to better controllability.

The main contributions of this work are as follows:
• We propose TeleMelody, a two-stage lyric-to-melody system with a carefully designed template as the bridge. It decomposes the lyric-to-melody task into a lyric-to-template module and a template-to-melody module. This framework helps reduce the task difficulty and improve data efficiency.
• The chord progression, tonality, rhythm pattern, and cadence designed in the template help control basic musical elements and high-level music structure. We introduce alignment regularization based on musical knowledge to ensure better controllability of the generated melodies.
• Experimental results demonstrate that TeleMelody significantly outperforms previous end-to-end lyric-to-melody generation models in both objective and subjective evaluations of generation quality, and is capable of better controlling the generated melodies.

[Figure 1: The song "Twinkle Twinkle Little Star" in "C major" tonality. (a) The melody, lyric, and chord progression ("C-F-G-F-G-C"). (b) The corresponding template: tonality "C major" at the start, then per note a chord (C C C C F F G | F F F F G G C), a rhythm pattern token (0 1 2 3 0 1 2 | 0 1 2 3 0 1 2), and a cadence label (six "No" then "Half"; six "No" then "Authentic").]
2 Background

In this section, we introduce the background of the lyric-to-melody generation task and then briefly introduce each musical element in the template.

2.1 Lyric-to-Melody Generation

Lyric-to-melody generation has developed considerably in recent years, from rule-based or statistical methods to deep learning methods. Traditional rule-based or statistical methods usually need extensive manual design with abundant domain knowledge in music, and also prevent end users from controlling musical elements. Monteith, Martinez, and Ventura (2012) generate rhythm patterns with rule-based modules and construct an n-gram model to predict note pitch². Long, Wong, and Sze (2013) and Rabin (1963) learn the lyric-note correlation with statistical methods on limited paired data. In these works, the melody is generated either without control of any musical element or considering only one specific musical element. Fukayama et al. (2010) obtain the optimal pitch sequence by maximizing the conditional probability given chords, tonality, accompaniment bass, rhythm, and pitch-accent information, but the generated melodies may suffer from poor musical structure without repetition patterns, and the algorithm cannot be directly applied to lyrics written in "stress accent" languages like English.

Recently, developing lyric-to-melody systems based on machine learning has attracted much attention. Scirea et al. (2015) allocate the same number of notes as the syllables in lyrics using a Markov chain. Ackerman and Loker (2017) leverage random forest models to construct a rhythm model and a melody model separately on paired lyric-audio data. Bao et al. (2019) and Yu, Srivastava, and Canales (2021) use sequence-to-sequence models to generate melodies from lyrics. However, deep learning methods usually require a large amount of paired lyric-melody data to learn the correlation between lyrics and melodies. Sheng et al. (2020) attempt to address the low-resource challenge by pre-training lyric-to-lyric and melody-to-melody modules and incorporating supervised learning into the pre-training to learn a shared latent space between lyrics and melodies. This is pioneering, but the unpaired data is still not sufficiently utilized for correlation learning between lyrics and melodies. Moreover, these works do not consider controlling specific musical elements of the generated melodies. Inspired by template-based methods in language generation (Fabbri et al. 2020; Yang et al. 2020; Li et al. 2020), we propose TeleMelody, a two-stage template-based system consisting of a lyric-to-template module and a template-to-melody module. The two-stage framework helps address the issue of limited paired data, and the designed template, together with the alignment regularization in this framework, ensures better controllability over the generated melodies.

2.2 Music Background Knowledge

In this subsection, we use the song "Twinkle Twinkle Little Star" in Figure 1(a) as an example to introduce the musical elements of the template.
• Tonality³ is composed of a scale⁴ and a root note. For a given tonality, the notes are distributed within a specific order and pitch intervals. For example, the tonality of the melody in Figure 1(a) is "C major", since the notes are ordered within the pitches of the major scale⁵ and end on the root pitch "C".
• Chord progression⁶ is an ordered sequence of chords, where a chord⁷ is a set of more than two notes with harmonious pitches. In Figure 1(a), the chord progression is "C-F-G-F-G-C". The chord progression, interacting with the melody, should create a sense of harmony when composing music.
• Rhythm⁸ generally refers to the pattern of occurrence of notes and rests. In Figure 1(a), each note is aligned with a syllable, and notes in green boxes are in the same rhythm patterns.
• Cadence⁹ occurs at the end of a phrase, gives a sense of ending in the melody, and is often aligned with punctuation marks in the lyrics. In Figure 1(a), we assign the half cadence and the authentic cadence to the comma and the period respectively.

² https://en.wikipedia.org/wiki/Pitch_(music)
³ https://en.wikipedia.org/wiki/Tonality
⁴ https://en.wikipedia.org/wiki/Scale_(music)
⁵ https://en.wikipedia.org/wiki/Major_scale
⁶ https://en.wikipedia.org/wiki/Chord_progression
⁷ https://en.wikipedia.org/wiki/Chord_(music)
⁸ https://en.wikipedia.org/wiki/Rhythm#Composite_rhythm
⁹ https://en.wikipedia.org/wiki/Cadence

3 Methodology

The system architecture is shown in Figure 2(a). We introduce a template as a bridge between lyrics and melodies, and decompose lyric-to-melody generation into a lyric-to-template module and a template-to-melody module. In this section, we describe each component of the system (i.e., the template, the lyric-to-template module, and the template-to-melody module) in detail.

[Figure 2: Architecture of TeleMelody. (a) Pipeline of TeleMelody: Lyric → Lyric-to-Template Module → Template → Template-to-Melody Module → Melody. (b) Architecture of the lyric-to-template module: a lyric-to-rhythm model produces the rhythm pattern, a punctuation-to-cadence mapping produces the cadence, and tonality and chord progression complete the template. (c) The representation of the template (a tonality token, then chord, rhythm pattern, and cadence tokens per note) and its connection to the melody (bar, position, pitch, and duration tokens per note).]

3.1 Template

In this subsection, we introduce the template from several aspects: its definition, its design principles, the contained musical elements and their representation, and its connection to the generated melody.

The template is a well-designed sequence of musical elements that can capture the characteristics of both lyrics and melodies. With this template as the connection, we decompose lyric-to-melody generation into a lyric-to-template module and a template-to-melody module, as shown in Figure 2(a).

Several high-level principles of template design are as follows: 1) Templates can be extracted directly from melodies, so that a template-to-melody model can be trained in a self-supervised way. 2) Templates obtained from lyrics should be in accordance with those extracted from melodies. 3) Compared with the hidden representations in end-to-end lyric-to-melody models, templates should be easier to manipulate on demand.

Based on these principles, our designed template consists of tonality, chord progression, rhythm pattern, and cadence. The template representation is shown in Figure 2(c): we use one token at the start of the sequence to represent the tonality of the melody, and three consecutive tokens (chord, rhythm pattern, cadence) to represent the musical elements of each note (a token-level sketch is given at the end of this subsection). For better understanding, we provide an example of a template in Figure 1(b), which corresponds to the melody in Figure 1(a).

The influence of the musical elements in the template on the generated melody is as follows (shown in Figure 2(c)): 1) Tonality controls the pitch distribution. 2) Chords influence the harmony of the generated melody. 3) The rhythm pattern constrains note positions and controls high-level musical structure through repetitive patterns. 4) Cadence guarantees the accordance between punctuation marks in the lyrics and rest notes in the melody.
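To make the token layout in Figure 2(c) concrete, the following is a minimal sketch of how a melody and its template could be serialized into token sequences. The Note fields and token spellings are illustrative assumptions for this sketch, not the repository's actual vocabulary.

```python
# Minimal sketch of the token layout in Figure 2(c); field names and
# token spellings are illustrative assumptions, not the official code.
from dataclasses import dataclass

@dataclass
class Note:
    bar: int        # bar index
    position: int   # onset position within the bar (1/16-note grid)
    pitch: int      # MIDI pitch
    duration: int   # duration in 1/16 notes
    chord: str      # e.g., "C:maj"
    rhythm: int     # beat-level onset position in the bar (0-3)
    cadence: str    # "No", "Half", or "Authentic"

def to_melody_tokens(notes):
    # Each note becomes four consecutive tokens: bar, position, pitch, duration.
    tokens = []
    for n in notes:
        tokens += [f"Bar_{n.bar}", f"Pos_{n.position}",
                   f"Pitch_{n.pitch}", f"Dur_{n.duration}"]
    return tokens

def to_template_tokens(tonality, notes):
    # One tonality token at the start, then (chord, rhythm, cadence) per note.
    tokens = [f"Tonality_{tonality}"]
    for n in notes:
        tokens += [f"Chord_{n.chord}", f"Rhythm_{n.rhythm}", f"Cadence_{n.cadence}"]
    return tokens

notes = [Note(0, 0, 60, 4, "C:maj", 0, "No"), Note(0, 4, 60, 4, "C:maj", 1, "No")]
print(to_template_tokens("C_major", notes))
print(to_melody_tokens(notes))
```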
3.2 Template-to-Melody Module

The template-to-melody module generates melodies from given templates. We adopt an encoder-attention-decoder Transformer model in this module, which learns the alignment between source and target sequences through the attention mechanism. In this subsection, we first introduce how to extract the musical elements of templates from melodies for model training, and then introduce alignment regularization, which leverages musical knowledge to improve the controllability of the model.

Template Extraction Method We introduce the method for extracting templates (i.e., tonality, chord progression, rhythm pattern, cadence) from melodies for model training:
• Tonality can be inferred from the note pitch distribution of a whole melody, following Liang et al. (2020).
• Chord progression can be inferred from the note pitch distribution with a Viterbi algorithm proposed by Magenta¹⁰.
• Rhythm pattern can be inferred from the position information of notes in melodies.
• Cadence can be inferred from note pitch, onset, and duration¹¹ (a simplified sketch follows this list). Specifically: 1) "no cadence" is assigned to a note when its duration is short (e.g., less than 1 beat¹²) and the onset interval between it and the next note is small (e.g., less than 1.5 beats). 2) "authentic cadence" is assigned to a note when the note is the root note¹³ of the tonic¹⁴ chord, or the inferred chord of this note is the tonic chord. There is also a probability p (e.g., 0.3) of labeling a note with "authentic cadence" when the note is another note in the tonic chord rather than the root note and the onset interval is large (e.g., more than 2 beats). 3) "half cadence" is assigned to notes outside the above two situations.

¹⁰ https://github.com/magenta/note-seq/blob/master/note_seq/chord_inference.py
¹¹ https://en.wikipedia.org/wiki/Duration_(music)
¹² https://en.wikipedia.org/wiki/Beat_(music)
¹³ https://en.wikipedia.org/wiki/Root_(chord)
¹⁴ https://en.wikipedia.org/wiki/Tonic_(music)
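As a rough illustration of the cadence extraction rule, here is a simplified sketch. The thresholds follow the examples in the text; the note format and the tonic-chord test (a fixed pitch-class tuple, root first) are assumptions, and the inferred-chord branch of rule 2 is omitted.

```python
import random

def label_cadence(note, next_onset, tonic_chord_pcs=(0, 4, 7), p=0.3):
    """Assign a cadence label to one note (simplified sketch).

    note: dict with 'pitch' (MIDI number), 'onset' and 'duration' in beats.
    next_onset: onset of the following note, in beats.
    tonic_chord_pcs: pitch classes of the tonic chord, root first
        (default: the C major tonic chord C-E-G).
    """
    onset_interval = next_onset - note["onset"]
    # Rule 1: a short note followed closely by the next note -> no cadence.
    if note["duration"] < 1.0 and onset_interval < 1.5:
        return "No"
    pc = note["pitch"] % 12
    # Rule 2: the root of the tonic chord -> authentic cadence; another
    # tonic-chord tone with a large onset gap -> authentic with probability p.
    # (The paper also accepts notes whose inferred chord is the tonic chord;
    # that check is omitted here for brevity.)
    if pc == tonic_chord_pcs[0]:
        return "Authentic"
    if pc in tonic_chord_pcs and onset_interval > 2.0 and random.random() < p:
        return "Authentic"
    # Rule 3: everything else -> half cadence.
    return "Half"

print(label_cadence({"pitch": 60, "onset": 14.0, "duration": 2.0}, 18.0))
```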
Alignment Regularization With the introduction of templates, we are able to control melody generation by adjusting the musical elements in templates. To further increase model controllability, we introduce musical knowledge into the template-to-melody model through well-designed alignment regularization during training. Each alignment is designed based on musical knowledge and is imposed on the encoder-decoder attention.

Following Garg et al. (2019), the alignment regularization is defined as follows. We denote m_k as the k-th note information token in the melody sequence, and t_j as the j-th musical element token in the template sequence. Let \hat{w} be a 0-1 matrix such that \hat{w}_{k,j} = 1 if m_k is aligned with t_j. We simply normalize the rows of \hat{w} to get a matrix w, and we expect the encoder-decoder attention weight A_{k,j} between m_k and t_j to be close to w_{k,j}, defined as:

w_{k,j} = \begin{cases} \frac{1}{T} & \text{if } m_k \text{ is aligned to } t_j \\ 0 & \text{otherwise} \end{cases}  (1)

where T is the number of tokens in the template sequence that m_k is aligned to. The alignment regularization term is defined as:

L_{attn} = -\frac{1}{K \cdot J} \sum_{k=1}^{K} \sum_{j=1}^{J} w_{k,j} \log A_{k,j}  (2)

where J and K are the numbers of tokens in the source and target sequences respectively. Finally, we train our model to minimize L_attn in conjunction with the negative log-likelihood loss L_nll. The overall loss is:

L = L_{nll} + \lambda_{attn} \cdot L_{attn}  (3)

where λ_attn is a hyperparameter.

As shown in Figure 2(c), we give an example to illustrate the alignment between the consecutive i-th and (i+1)-th notes in the melody and the corresponding musical elements in the template. The note information consists of the bar index, the position in the bar, the pitch, and the duration. The alignments mentioned in Equation (1) are designed as follows (a sketch of the resulting loss follows the list):
• We add an alignment between the tonality (at the beginning of the template) and every note pitch in the melody, since tonality controls the pitch distribution of the entire melody.
• We add an alignment between the chord of the i-th note in the template and the pitch of the i-th note in the melody, since the chord influences the note pitch.
• We add an alignment between the rhythm pattern of the i-th note in the template and the position of the i-th note in the melody, since the rhythm pattern determines the note position.
• We add an alignment between the cadence of the i-th note in the template and the duration of the i-th note in the melody, since cadence controls the note duration. Besides, since the onset interval between the i-th and (i+1)-th notes is also closely related to the pause, we add alignments between the cadence of the i-th note and the bar index and position of the (i+1)-th note. We also add an alignment between the cadence of the i-th note and the pitch of the i-th note, since pitch is an important factor in distinguishing "half cadence" from "authentic cadence".
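The following toy NumPy sketch shows one way to build the row-normalized alignment matrix w of Equation (1) from alignment pairs like those above and to compute L_attn as in Equation (2). Shapes and the alignment pairs are made up for illustration; the real model would apply this to the encoder-decoder attention during training.

```python
import numpy as np

def alignment_matrix(K, J, aligned_pairs):
    """Build w from Eq. (1): a row-normalized 0-1 alignment matrix.

    aligned_pairs: iterable of (k, j) pairs where melody token k is aligned
    with template token j (tonality-to-pitch, chord-to-pitch,
    rhythm-to-position, cadence-to-duration, etc.).
    """
    w_hat = np.zeros((K, J))
    for k, j in aligned_pairs:
        w_hat[k, j] = 1.0
    row_sums = w_hat.sum(axis=1, keepdims=True)  # T for each melody token
    return np.divide(w_hat, row_sums, out=np.zeros_like(w_hat),
                     where=row_sums > 0)

def attn_regularization(A, w, eps=1e-9):
    """Eq. (2): L_attn = -(1 / (K*J)) * sum_{k,j} w_{k,j} * log A_{k,j}."""
    K, J = A.shape
    return -(w * np.log(A + eps)).sum() / (K * J)

# Toy example: 4 melody tokens (bar, pos, pitch, dur) vs. 4 template tokens.
A = np.full((4, 4), 0.25)               # uniform attention weights
w = alignment_matrix(4, 4, [(2, 0), (2, 1), (1, 2), (3, 3)])
loss = attn_regularization(A, w)        # combined with L_nll via lambda_attn
print(loss)
```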
3.3 Lyric-to-Template Module

In this subsection, we describe the lyric-to-template module, which generates the musical elements in the template (i.e., tonality, chord progression, rhythm pattern, and cadence) from lyrics. Tonality and chord progression are only weakly correlated with lyrics and can be manipulated on demand. Therefore, we focus on generating the rhythm pattern and the cadence in the following subsections.

Lyric-to-Rhythm Model Previous works (Monteith, Martinez, and Ventura 2012; Fukayama et al. 2010) usually generate a fixed rhythm pattern based on the syllable count in a bar, which requires a lot of manual design and constrains the diversity of rhythm patterns. Therefore, we introduce a lyric-to-rhythm model to predict the rhythm pattern in the template from given lyrics. We utilize an encoder-attention-decoder Transformer model to generate the rhythm pattern in an auto-regressive way, which requires a large amount of paired lyric-rhythm data for training. To collect adequate training data, we extract the paired data from crawled lyric-singing audio data with audio processing tools. The data pipeline is similar to Xue et al. (2021) and is described in Appendix A.

Punctuation-Cadence Mapping Since punctuation marks in lyrics are closely related to cadences in melodies, we design a mapping based on musical knowledge as follows: 1) We align the "authentic cadence" and the "half cadence" with the period and the comma in lyrics respectively, indicating the end of a sentence. 2) We align the "no cadence" label with each syllable in the lyrics, since it is a voiced part.
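A minimal sketch of this punctuation-to-cadence mapping at inference time; the syllable input format and the handling of punctuation marks other than the period and the comma are assumptions.

```python
def lyrics_to_cadence(syllables):
    """Map a syllable stream to per-note cadence labels.

    syllables: list of syllable strings, where a syllable may carry a
    trailing punctuation mark, e.g. ["Twin", "kle", ..., "star,"].
    """
    labels = []
    for syl in syllables:
        if syl.endswith("."):        # other end marks ("!", "?") would be
            labels.append("Authentic")  # treated analogously (assumption)
        elif syl.endswith(","):
            labels.append("Half")
        else:
            labels.append("No")      # voiced syllable -> no cadence
    return labels

print(lyrics_to_cadence(["how", "I", "won", "der", "what", "you", "are."]))
```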
4 Experimental Settings

In this section, we introduce the experimental settings, including datasets, model configurations, and evaluation metrics. Our code is available via this link¹⁵.

4.1 Dataset

We conduct experiments on both English (EN) and Chinese (ZH) lyric-to-melody generation tasks to verify the effectiveness of TeleMelody. For lyric-to-template generation, we construct two lyric-rhythm datasets, in English and Chinese respectively, for training the lyric-to-rhythm model. For template-to-melody generation, we only require one template-melody dataset, since templates are language-independent. The processing details for melodies and templates are provided in Appendix B.

Lyric-Rhythm Dataset Following the data mining pipeline described in Appendix A, we collect adequate paired lyric-rhythm training data: 9,761 samples in English and 74,328 samples in Chinese. Statistics of these two datasets are shown in Table 4 of Appendix C.

Template-Melody Dataset Since templates are language-independent, we only require one template-to-melody model for lyrics in both languages. We use the LMD-matched MIDI dataset¹⁶ (Raffel 2016), which contains 45,129 MIDI files. We pre-process it to obtain high-quality melody data as follows. First, we extract the melody from the track¹⁷ with at least 50 notes that has the highest average pitch among all tracks, and then delete polyphonic notes. Second, we normalize the tonality to "C major" or "A minor", and normalize the pitches to fit the vocal range. Finally, we filter out the empty bars in each sample. After these steps, we obtain the processed target melody dataset. To construct the paired template-melody dataset, we utilize our proposed knowledge-based rules to extract the corresponding template from each target melody. Statistics of the template-melody dataset are shown in Table 5 of Appendix C.

¹⁵ https://github.com/microsoft/muzic/tree/main/telemelody
¹⁶ https://colinraffel.com/projects/lmd/
¹⁷ https://en.wikipedia.org/wiki/Multitrack_recording
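A rough sketch of the melody-extraction step, using pretty_midi as an assumed library choice (the paper does not name one): keep the track with the highest average pitch among tracks with at least 50 notes, then drop overlapping notes to remove polyphony.

```python
import pretty_midi

def extract_melody(path, min_notes=50):
    """Heuristically pick the melody track from a MIDI file (sketch)."""
    midi = pretty_midi.PrettyMIDI(path)
    candidates = [inst for inst in midi.instruments
                  if not inst.is_drum and len(inst.notes) >= min_notes]
    if not candidates:
        return None
    # The track with the highest average pitch is likely the vocal melody.
    melody = max(candidates,
                 key=lambda inst: sum(n.pitch for n in inst.notes) / len(inst.notes))
    # Remove polyphony: keep one note at a time (highest pitch wins on ties).
    notes = sorted(melody.notes, key=lambda n: (n.start, -n.pitch))
    mono, last_end = [], 0.0
    for n in notes:
        if n.start >= last_end:
            mono.append(n)
            last_end = n.end
    return mono
```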
4.2 System Configurations

We use encoder-attention-decoder Transformer models in our system. The configurations of the two models in TeleMelody are as follows: 1) The lyric-to-rhythm Transformer model consists of 4 encoder layers and 4 decoder layers. The hidden size and the number of attention heads are set to 256 and 4. During training, we apply dropout with rate 0.2 and use the Adam optimizer (Kingma and Ba 2014) with a learning rate of 0.0005. 2) The template-to-melody Transformer model has a 4-layer encoder and a 4-layer decoder with 4 attention heads in each layer. We set the hidden size to 256 and the dropout to 0.0005. The alignment regularization weight λ_attn is set to 0.05, and the learning rate of the Adam optimizer is set to 0.0005. 3) To ensure the diversity of the generated melodies, we use stochastic sampling at inference time, following Huang et al. (2018). The temperature and top-k parameters are set to 0.5 and 2 in lyric-to-rhythm generation, and to 0.5 and 10 in template-to-melody generation (a decoding sketch is given at the end of this section).

4.3 Evaluation Metrics

We conduct both objective and subjective evaluations to validate the performance of our proposed system.

Objective Evaluation Following previous work (Sheng et al. 2020), we use two kinds of metrics to measure the similarity between the generated and the ground-truth melodies: 1) similarity of the pitch and duration distributions (PD and DD); 2) melody distance (MD).

Besides, we use the accuracy of tonality, chord, rhythm pattern, and cadence (TA, CA, RA, AA) to measure the consistency between the generated melody and the musical elements in the template: the more consistent the generated melody is with the template, the more controllable the model is. TA, RA, and AA of ground-truth melodies are 100%, while the CA score of ground-truth melodies is less than 100%, since introducing some notes outside the chord is encouraged to avoid monotony. Therefore, CA is better when it is closer to that of the ground-truth melodies. Accordingly, we consider the model more controllable the closer its TA, CA, RA, and AA are to those of the ground-truth melodies. The definitions of these metrics are given in Appendix F.

Subjective Evaluation We invite 10 participants with music knowledge as human annotators to evaluate 10 songs in each language. We ask each participant to rate properties of the melodies on a five-point scale, from 1 (Poor) to 5 (Perfect). The whole evaluation is conducted in blind-review mode. Following previous works (Sheng et al. 2020; Zhu et al. 2018; Watanabe et al. 2018), we use the following subjective metrics: 1) Harmony: is the melody itself harmonious? 2) Rhythm: does the rhythm of the melody sound natural and suit the lyrics? 3) Structure: does the melody contain repetitive and impressive segments? 4) Quality: what is the overall quality of the melody?
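To make the stochastic sampling in Section 4.2 concrete, here is a generic top-k temperature sampling sketch over next-token logits; this is standard decoding machinery rather than TeleMelody-specific code.

```python
import numpy as np

def sample_next_token(logits, temperature=0.5, top_k=10, rng=np.random):
    """Sample one token id from logits with temperature and top-k filtering."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # Keep only the k highest-scoring tokens.
    top_ids = np.argsort(scaled)[-top_k:]
    top_scores = scaled[top_ids]
    # Softmax over the surviving tokens (stabilized by subtracting the max).
    probs = np.exp(top_scores - top_scores.max())
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))

# E.g., lyric-to-rhythm decoding uses temperature=0.5, top_k=2, and
# template-to-melody decoding uses temperature=0.5, top_k=10.
print(sample_next_token([1.2, 0.3, -0.5, 2.0], temperature=0.5, top_k=2))
```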
5 Experimental Results

In this section, we first compare TeleMelody with baselines to demonstrate its effectiveness, and then analyze TeleMelody in terms of controllability and data efficiency. Audio samples of the generated melodies are available via this link¹⁸.

5.1 Main Results

In this subsection, we evaluate the performance of TeleMelody against two baselines: 1) SongMASS (Sheng et al. 2020), an end-to-end Transformer model that also addresses the low-resource scenario, via unsupervised lyric-to-lyric and melody-to-melody pre-training; 2) a Transformer baseline, a Transformer model directly trained on paired lyric-melody data, which is similar to SongMASS without pre-training and the alignment constraint. Our TeleMelody (including the lyric-to-rhythm and template-to-melody models) has a similar number of model parameters to SongMASS and the Transformer baseline for fair comparison. For English, the two baselines use 8,000 paired lyric-melody samples (Yu, Srivastava, and Canales 2021), and SongMASS additionally uses 362,237 unpaired lyrics and 176,581 unpaired melodies for pre-training. For Chinese, the two baselines use 18,000 paired lyric-melody samples (Bao et al. 2019), and SongMASS additionally uses 228,000 unpaired lyrics and 283,000 unpaired melodies crawled from the Web.

Table 1: Objective (PD, DD, MD) and subjective (Harmony, Rhythm, Structure, Quality) evaluation of TeleMelody and the baseline systems.

                            PD(%)↑  DD(%)↑  MD↓   Harmony↑  Rhythm↑  Structure↑  Quality↑
(EN) Transformer Baseline   24.41   47.04   2.69  2.0       2.1      1.8         1.8
(EN) SongMASS               34.32   47.88   2.61  2.4       2.3      2.3         2.2
(EN) TeleMelody             43.86   52.06   2.40  3.2       3.4      3.3         3.3
(ZH) Transformer Baseline   11.40   38.03   6.75  1.5       2.0      1.5         1.5
(ZH) SongMASS               26.36   50.09   3.29  1.8       2.2      1.8         1.7
(ZH) TeleMelody             49.76   51.48   2.80  3.0       3.5      3.3         3.2

As shown in Table 1, TeleMelody significantly outperforms the Transformer baseline on all objective metrics (improvement on EN: 19.45% in PD, 5.02% in DD, and 0.29 in MD; improvement on ZH: 38.36% in PD, 13.45% in DD, and 3.95 in MD). Compared with SongMASS, TeleMelody also performs better on all objective metrics (improvement on EN: 9.54% in PD, 4.18% in DD, and 0.21 in MD; improvement on ZH: 23.40% in PD, 1.39% in DD, and 0.49 in MD). Meanwhile, on all subjective metrics, TeleMelody is better than both the Transformer baseline and SongMASS. Specifically for quality, TeleMelody outperforms the Transformer baseline by 1.49 in English and 1.7 in Chinese, and outperforms SongMASS by 1.10 in English and 1.43 in Chinese.

The results show that an end-to-end Transformer model performs poorly, since the available paired lyric-melody data is limited. In SongMASS, the lyric-to-lyric and melody-to-melody unsupervised pre-training effectively improves the performance of end-to-end models, but it is still insufficient, since the unlabeled data is not effectively utilized to improve correlation learning between lyrics and melodies. TeleMelody performs the best, since it successfully reduces the task difficulty and effectively utilizes the unpaired melodies in the two-stage framework. Besides, the improvements in Table 1 are consistent with the intuition behind the template design: 1) tonality and chord progression in the template control note pitch and thus help improve PD and harmony; 2) rhythm pattern and cadence in the template control both the onset and the duration of notes and thus help improve DD and rhythm; 3) the repetitive patterns in the template help improve structure.

¹⁸ https://speechresearch.github.io/telemelody
5.2 Method Analyses

In this subsection, we analyze the proposed method from two aspects: 1) verifying the controllability of TeleMelody, as well as the effect of the designed template and the alignment regularization on controllability; 2) verifying data efficiency by testing TeleMelody with varying amounts of paired lyric-rhythm training data. Besides, in Appendix D, we also evaluate the effectiveness of the rule for extracting cadence from melodies.

Controllability Study Previous deep-learning-based works on lyric-to-melody generation prevent end users from controlling musical elements in the generated melodies. As shown in Table 2, TeleMelody with the template scores very high in RA and AA and close to the ground truth in CA, which indicates that the melody generated by TeleMelody is highly consistent with the rhythm, chord, and cadence in the template. Meanwhile, TeleMelody can also control the tonality, with a good TA. Moreover, the proposed alignment regularization (AR) further improves controllability (with TA, CA, RA, and AA closer to the ground truth).

Table 2: Results on model controllability. The closer TA, CA, RA, and AA are to the ground truth, the better the controllability.

Metric         TA(%)    CA(%)   RA(%)    AA(%)
Ground Truth   100.00   62.10   100.00   100.00
Template       75.64    64.11   99.88    99.94
Template+AR    77.41    62.62   99.99    99.98

To show the effect of the proposed alignment regularization intuitively, we further visualize the average encoder-decoder attention weights of all heads in the last layer. As shown in Figure 3, after adding alignment regularization (Template+AR), the related elements in the template and the generated melody are clearly aligned, according to the alignment rules introduced in Section 3.2.

We also conduct a case study to illustrate the controllability of TeleMelody and how the elements in the template affect the generated melody. The baseline melody is shown in Figure 4(a), with a "C major" tonality and an "Am-F-C-G" chord progression. We evaluate the control performance from the following aspects:
• Tonality determines all the pitches in the melody. As shown in Figure 4(b), when we change the tonality in the template from "C major" to "A minor", most of the pitches change, and the pitch of the ending note changes from the tonic pitch of "C major" to the tonic pitch of "A minor".
• For each note, a chord is provided in the template, which affects the pitch of the note. As shown in Figure 4(c), when we change the chords of the notes in the third bar from "C" to "Am", the pitches in the third bar change correspondingly.
• Rhythm affects the onset positions of the notes. For example, in Figure 4, when we use the same rhythm for the first and third bars, as labeled by the green or blue boxes in each melody, the onset positions of the notes in the two bars are the same.
• Cadence affects the pauses in the generated melody. Meanwhile, for a note with an "authentic cadence" in the template, the pitch should be the tonic pitch. As shown in Figure 4, a pause exists in the melody for each punctuation mark at the end of the second and fourth bars, as labeled by the pink boxes, since a "half cadence" or an "authentic cadence" is added for each punctuation mark. Moreover, the pitch of the ending note of the entire melody is the tonic pitch, as labeled by the orange boxes, since an "authentic cadence" is added for each period.

[Figure 3: Alignment visualization of the average encoder-decoder attention, for (a) Template and (b) Template+AR. The x-coordinate "i-feature" denotes the feature (bar, position, pitch, duration) of the i-th note in the melody sequence, and the y-coordinate "i-element" denotes the musical element (chord, rhythm, cadence, plus the tonality) of the i-th note in the template sequence.]

[Figure 4: Case study on template adjustment, on the lyric "days like this I want to drive away". (a) Basic melody: tonality "C major", chord progression "Am-F-C-G". (b) Adjusting the tonality to "A minor". (c) Adjusting the chord progression to "Am-F-Am-F". Similar bars are labeled by blue boxes, while repetitive bars are labeled by green boxes.]
Data Efficiency Study Our proposed TeleMelody is data-efficient: 1) in the template-to-melody module, we train the model with templates extracted from melodies in a self-supervised way; 2) in the lyric-to-template module, only the lyric-to-rhythm model requires paired training data, which is much easier to obtain than paired lyric-melody data. We further study the performance of the lyric-to-rhythm model with varying amounts of paired lyric-rhythm data, to demonstrate that this model is also data-efficient.

Table 3: Objective evaluations of the lyric-to-rhythm model with varying amounts of lyric-rhythm training data.

                              PD(%)↑   DD(%)↑   MD↓
(EN) TeleMelody (100% data)   43.86    52.06    2.40
(EN) TeleMelody (50% data)    41.42    48.59    2.40
(EN) TeleMelody (0% data)     35.60    47.93    2.55
(EN) SongMASS                 34.32    47.88    2.61
(ZH) TeleMelody (100% data)   49.76    51.48    2.80
(ZH) TeleMelody (50% data)    46.25    50.14    3.12
(ZH) TeleMelody (0% data)     30.22    33.84    4.95
(ZH) SongMASS                 26.36    50.09    3.29

As shown in Table 3, when the paired lyric-rhythm data is reduced to 50%, the performance of TeleMelody declines only slightly, while still outperforming SongMASS on all metrics. We further reduce the paired lyric-rhythm data to zero to demonstrate that TeleMelody can be applied in scenarios without any paired data, replacing the lyric-to-rhythm model with hand-crafted rules (described in Appendix E). Table 3 shows that this replacement significantly degrades performance, which demonstrates the advantage of the lyric-to-rhythm model in TeleMelody. Moreover, Table 3 also shows that, compared with SongMASS, which relies on a large amount of paired data and large-scale pre-training, TeleMelody achieves comparable performance on all objective metrics in English without any paired data. This promising result illustrates that TeleMelody is also effective in the zero-resource scenario.

6 Conclusion

In this paper, we proposed TeleMelody, a two-stage lyric-to-melody generation system with a music template (e.g., tonality, chord progression, rhythm pattern, and cadence) to bridge the gap between lyrics and melodies. TeleMelody is data-efficient and can be controlled by end users through adjusting the musical elements in the template. Both subjective and objective experimental evaluations demonstrate that TeleMelody generates melodies with higher quality than previous lyric-to-melody generation systems. Moreover, experimental studies also verify the data efficiency and controllability of TeleMelody. In future work, we will extend our proposed two-stage framework to other music generation tasks (e.g., melody-to-lyric generation and melody-to-accompaniment generation).
References

Ackerman, M.; and Loker, D. 2017. Algorithmic songwriting with ALYSIA. In International Conference on Evolutionary and Biologically Inspired Music and Art, 1–16. Springer.
Bao, H.; Huang, S.; Wei, F.; Cui, L.; Wu, Y.; Tan, C.; Piao, S.; and Zhou, M. 2019. Neural melody composition from lyrics. In CCF International Conference on Natural Language Processing and Chinese Computing, 499–511. Springer.
Choi, K.; Fazekas, G.; and Sandler, M. 2016. Text-based LSTM networks for automatic music composition. arXiv preprint arXiv:1604.05358.
Fabbri, A. R.; Ng, P.; Wang, Z.; Nallapati, R.; and Xiang, B. 2020. Template-based question generation from retrieved sentences for improved unsupervised question answering. arXiv preprint arXiv:2004.11892.
Fukayama, S.; Nakatsuma, K.; Sako, S.; Nishimoto, T.; and Sagayama, S. 2010. Automatic song composition from the lyrics exploiting prosody of the Japanese language. In Proc. 7th Sound and Music Computing Conference (SMC), 299–302.
Garg, S.; Peitz, S.; Nallasamy, U.; and Paulik, M. 2019. Jointly learning to align and translate with transformer models. arXiv preprint arXiv:1909.02074.
Huang, C.-Z. A.; Vaswani, A.; Uszkoreit, J.; Shazeer, N.; Simon, I.; Hawthorne, C.; Dai, A. M.; Hoffman, M. D.; Dinculescu, M.; and Eck, D. 2018. Music Transformer. arXiv preprint arXiv:1809.04281.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Lee, H.-P.; Fang, J.-S.; and Ma, W.-Y. 2019. iComposer: An automatic songwriting system for Chinese popular music. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 84–88.
Li, P.; Zhang, H.; Liu, X.; and Shi, S. 2020. Rigid formats controlled text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 742–751.
Liang, H.; Lei, W.; Chan, P. Y.; Yang, Z.; Sun, M.; and Chua, T.-S. 2020. PiRhDy: Learning pitch-, rhythm-, and dynamics-aware embeddings for symbolic music. In Proceedings of the 28th ACM International Conference on Multimedia, 574–582.
Long, C.; Wong, R. C.-W.; and Sze, R. K. W. 2013. T-Music: A melody composer based on frequent pattern mining. In 2013 IEEE 29th International Conference on Data Engineering (ICDE), 1332–1335. IEEE.
Malmi, E.; Takala, P.; Toivonen, H.; Raiko, T.; and Gionis, A. 2016. DopeLearning: A computational approach to rap lyrics generation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 195–204.
Monteith, K.; Martinez, T. R.; and Ventura, D. 2012. Automatic Generation of Melodic Accompaniments for Lyrics. In ICCC, 87–94.
Rabin, M. O. 1963. Probabilistic automata. Information and Control, 6(3): 230–245.
Raffel, C. 2016. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. Ph.D. thesis, Columbia University.
Scirea, M.; Barros, G. A.; Shaker, N.; and Togelius, J. 2015. SMUG: Scientific Music Generator. In ICCC, 204–211.
Sheng, Z.; Song, K.; Tan, X.; Ren, Y.; Ye, W.; Zhang, S.; and Qin, T. 2020. SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint. arXiv preprint arXiv:2012.05168.
Watanabe, K.; Matsubayashi, Y.; Fukayama, S.; Goto, M.; Inui, K.; and Nakano, T. 2018. A melody-conditioned lyrics language model. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 163–172.
Wu, J.; Liu, X.; Hu, X.; and Zhu, J. 2020. PopMNet: Generating structured pop music melodies using neural networks. Artificial Intelligence, 286: 103303.
Xue, L.; Song, K.; Wu, D.; Tan, X.; Zhang, N. L.; Qin, T.; Zhang, W.-Q.; and Liu, T.-Y. 2021. DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling. arXiv preprint arXiv:2107.01875.
Yang, J.; Ma, S.; Zhang, D.; Li, Z.; and Zhou, M. 2020. Improving neural machine translation with soft template prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5979–5989.
Yu, Y.; Srivastava, A.; and Canales, S. 2021. Conditional LSTM-GAN for melody generation from lyrics. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17(1): 1–20.
Zhu, H.; Liu, Q.; Yuan, N. J.; Qin, C.; Li, J.; Zhang, K.; Zhou, G.; Wei, F.; Xu, Y.; and Chen, E. 2018. XiaoIce Band: A melody and arrangement generation framework for pop music. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2837–2846.

A Pipeline for Crawling Lyric-Rhythm Data

We crawl paired lyric-singing audio data from the Web and then utilize Spleeter¹⁹, a public music source separation tool, to separate the vocals from the accompaniment. The timestamps of the lyrics and the tempo of the melody are the two pieces of information needed to collect rhythm information. To extract lyric timestamps, we first split the singing audio into sentence-level segments using the crawled start and end timestamps. We convert the lyrics into phoneme sequences with Phonemizer²⁰ and then obtain the phoneme-level vocal-lyric alignment with an audio alignment tool, the Montreal Forced Aligner²¹. Based on these phoneme-level vocal-lyric alignments, we can obtain the corresponding timestamp of each lyric. To extract tempo information, we estimate it directly from the accompaniment audio with an audio information retrieval tool, librosa²² (see the sketch below). Finally, with the tempo information and the timestamp of each lyric, we can infer the beat-level onsets, i.e., the corresponding rhythm of each lyric.

¹⁹ https://github.com/deezer/spleeter
²⁰ https://github.com/bootphon/phonemizer
²¹ https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner
²² https://github.com/librosa/librosa
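A minimal sketch of the tempo-estimation step with librosa; the audio path is a placeholder, and the conversion from lyric timestamps to beat-level onsets is deliberately simplified (it ignores tempo drift and bar alignment).

```python
import librosa

# Load the separated accompaniment audio (placeholder path) and estimate tempo.
y, sr = librosa.load("accompaniment.wav")
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

def seconds_to_beats(timestamp, tempo):
    """Convert a lyric timestamp (seconds) to a beat index at a given tempo."""
    return round(timestamp * tempo / 60.0)

# E.g., a syllable aligned at 3.2 s under 120 BPM starts near beat 6; the
# within-bar position then follows from the 4/4 assumption in Appendix B.
print(seconds_to_beats(3.2, 120.0))
```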
B Processing Details of Melody and Template

B.1 Melody In this paper, we only consider melodies with a constant tempo and a 4/4 time signature²³. Each note is represented by four consecutive tokens (bar, position, pitch, and duration). We use 256 tokens to represent different bars, and 16 tokens to represent different positions in a bar with a granularity of a 1/16 note. We use 128 tokens to represent pitch values, following the MIDI format, and 16 tokens to represent duration values ranging from a 1/16 note to a whole note.

B.2 Template In this paper, the template contains the musical elements tonality, chord progression, rhythm pattern, and cadence. We only consider the "C major" and "A minor" tonalities for simplicity, since other tonalities can be transposed to these two based on their scales. A chord consists of a root note and a chord quality. We consider 12 chord roots (C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B) and 7 chord qualities (major, minor, diminished, augmented, major7, minor7, half-diminished), resulting in 84 possible chords in total. We use 4 tokens, ranging from 0 to 3, to represent rhythm patterns, i.e., beat-level onset positions in a bar. For cadence, we consider "half cadence", "authentic cadence", and "no cadence", which are aligned with the comma, the period, and other syllables in the lyrics respectively.

²³ 4/4 denotes that each beat is a 1/4 note and each bar has 4 beats.

C Dataset Statistics

Table 4: Statistics of the lyric-rhythm dataset.

(a) English
# of data samples                        9,761
Average # of words per song              26.02
Average # of syllables per song          31.15
Average # of punctuation marks per song  3.79

(b) Chinese
# of data samples                        74,328
Average # of words per song              78.77
Average # of punctuation marks per song  9.30

Table 5: Statistics of the template-melody dataset.

# of data samples            157,702
Average # of notes per song  28.76
Average # of bars per song   5.93

D Effect of the Cadence Extraction Rule

In the lyric-to-template module, we obtain cadence directly from lyrics at inference time through the punctuation-to-cadence mapping. In the template-to-melody module, we extract cadence from melodies at training time through the cadence extraction rule. Therefore, a question may arise: is there a gap in cadence between training and inference? To answer this question, we examine the statistics of cadences in the template-melody dataset, shown in Table 6:

Table 6: Statistics of notes labeled with different cadences in the template-melody dataset. Both duration and onset interval are quantized to 1/16 notes.

Cadence    Average # per song  Duration  Onset Interval
No         24.57               1.74      2.29
Half       2.53                8.54      9.66
Authentic  1.66                10.34     11.70

• Notes labeled with "no cadence", which are aligned with syllables in lyrics, have short durations and onset intervals. Notes labeled with "half cadence", which are aligned with commas in lyrics, have 4.91× longer durations and 4.22× longer onset intervals than those labeled with "no cadence". Notes labeled with "authentic cadence", which are aligned with periods in lyrics, have 5.94× longer durations and 5.11× longer onset intervals than those labeled with "no cadence". This is consistent with musical knowledge, since punctuation marks are usually aligned with pauses.
• The average number of "no cadence" notes is 5.86× greater than the average numbers of "half cadence" and "authentic cadence" notes combined. This ratio is similar to the ratio of syllables to punctuation marks in lyrics, as shown in Table 4.
E Lyric-to-Rhythm Hand-Crafted Rules

We design several lyric-to-rhythm rules to demonstrate that TeleMelody can be applied in scenarios without any paired data. Specifically, for English lyrics, each note corresponds to a syllable, and we generate the rhythm pattern syllable by syllable: the onset interval between a note and its previous note is 2 beats if its syllable starts a sentence, 1 beat if its syllable starts a word but not a sentence, and 0.5 beats otherwise. For Chinese lyrics, each note corresponds to a character, and we generate the rhythm pattern character by character: the onset interval between a note and its previous note is 2 beats if its character starts a sentence, and 1 beat otherwise.
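A minimal sketch of the English rule; syllable segmentation is assumed to be given, and onsets are accumulated directly from the stated intervals. Under the paper's 4/4 assumption, the beat-level rhythm token would then be the onset's beat index modulo 4.

```python
def rule_based_rhythm_en(sentences):
    """Generate beat-level onsets for English lyrics by rule (sketch).

    sentences: list of sentences, each a list of words, each word a list
    of syllables, e.g. [[["Twin", "kle"], ["twin", "kle"], ...]].
    Returns onsets in beats: +2 at a sentence start, +1 at a word start,
    +0.5 between syllables of the same word.
    """
    onsets, t = [], 0.0
    for sentence in sentences:
        for w_idx, word in enumerate(sentence):
            for s_idx, _syl in enumerate(word):
                if w_idx == 0 and s_idx == 0:
                    t += 2.0      # sentence start
                elif s_idx == 0:
                    t += 1.0      # word start
                else:
                    t += 0.5      # within-word syllable
                onsets.append(t)
    return onsets

print(rule_based_rhythm_en([[["Twin", "kle"], ["twin", "kle"], ["lit", "tle"], ["star"]]]))
```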
F Definitions of TA, RA, CA, and AA

We evaluate model controllability with the accuracy of tonality, rhythm pattern, chord, and cadence (TA, RA, CA, AA), i.e., the proportion of notes consistent with the given template. Specifically, a melody is consistent with the tonality if its inferred tonality is the same as the template tonality; a note is consistent with the rhythm pattern if its position accords with the given rhythm pattern information; a note is consistent with the chord if its pitch is within the chord; and a note is consistent with the cadence if both its duration and the onset interval between it and the next note comply with the extraction rules described in Section 3.2.
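As an illustration of these definitions, here is a minimal sketch of the chord accuracy (CA) and rhythm accuracy (RA) computations; the chord table and note format are simplified assumptions.

```python
# Pitch classes for a few chord symbols (simplified assumption; the full
# system covers 12 roots x 7 qualities = 84 chords).
CHORD_PCS = {
    "C:maj": {0, 4, 7},
    "F:maj": {5, 9, 0},
    "G:maj": {7, 11, 2},
    "A:min": {9, 0, 4},
}

def chord_accuracy(notes, chords):
    """CA: fraction of notes whose pitch class lies within the given chord."""
    hits = sum(1 for n, c in zip(notes, chords) if n["pitch"] % 12 in CHORD_PCS[c])
    return hits / len(notes)

def rhythm_accuracy(notes, rhythm_tokens):
    """RA: fraction of notes whose beat-level position matches the template
    rhythm token (positions on a 1/16 grid, 4 sixteenths per beat)."""
    hits = sum(1 for n, r in zip(notes, rhythm_tokens) if n["position"] // 4 == r)
    return hits / len(notes)

notes = [{"pitch": 60, "position": 0}, {"pitch": 62, "position": 4}]
print(chord_accuracy(notes, ["C:maj", "C:maj"]),
      rhythm_accuracy(notes, [0, 1]))
```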