IMPROVED CHORD RECOGNITION BY COMBINING DURATION AND HARMONIC LANGUAGE MODELS
Filip Korzeniowski and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University, Linz, Austria
filip.korzeniowski@jku.at

arXiv:1808.05335v1 [cs.SD] 16 Aug 2018

© Filip Korzeniowski and Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Filip Korzeniowski and Gerhard Widmer. "Improved Chord Recognition by Combining Duration and Harmonic Language Models", 19th International Society for Music Information Retrieval Conference, Paris, France, 2018.

ABSTRACT

Chord recognition systems typically comprise an acoustic model that predicts chords for each audio frame, and a temporal model that casts these predictions into labelled chord segments. However, temporal models have been shown to only smooth predictions, without being able to incorporate musical information about chord progressions. Recent research discovered that it might be the low hierarchical level such models have been applied to (directly on audio frames) which prevents learning musical relationships, even for expressive models such as recurrent neural networks (RNNs). However, if applied on the level of chord sequences, RNNs indeed can become powerful chord predictors. In this paper, we disentangle temporal models into a harmonic language model—to be applied on chord sequences—and a chord duration model that connects the chord-level predictions of the language model to the frame-level predictions of the acoustic model. In our experiments, we explore the impact of each model on the chord recognition score, and show that using harmonic language and duration models improves the results.

1. INTRODUCTION

Chord recognition methods recognise and transcribe musical chords from audio recordings. Chords are highly descriptive harmonic features that form the basis of many kinds of applications: theoretical, such as computational harmonic analysis of music; practical, such as automatic lead-sheet creation for musicians (e.g. https://chordify.net/) or music tutoring systems (e.g. https://yousician.com); and finally, as basis for higher-level tasks such as cover song identification or key classification. Chord recognition systems face the two key problems of extracting meaningful information from noisy audio, and casting this information into sensible output. These translate to acoustic modelling (how to predict a chord label for each position or frame in the audio), and temporal modelling (how to create meaningful segments of chords from these possibly volatile frame-wise predictions).

Acoustic models extract frame-wise chord predictions, typically in the form of a distribution over chord labels. Originally, these models were hand-crafted and split into feature extraction and pattern matching, where the former computed some form of pitch-class profiles (e.g. [26, 29, 33]), and the latter used template matching or Gaussian mixtures [6, 14] to model these features. Recently, however, neural networks became predominant for acoustic modelling [18, 22, 23, 27]. These models usually compute a distribution over chord labels directly from spectral representations and thus fuse both feature extraction and pattern matching. Due to the discriminative power of deep neural networks, these models achieve superior results.

Temporal models process the predictions of an acoustic model and cast them into coherent chord segments. Such models are either task-specific, such as hand-designed Bayesian networks [26], or general models learned from data. Here, it is common to use hidden Markov models (HMMs) [8], conditional random fields (CRFs) [23], or recurrent neural networks (RNNs) [2, 32]. However, existing models have shown only limited capabilities to improve chord recognition results. First-order models are not capable of learning meaningful musical relations, and only smooth the predictions [4, 7]. More powerful models, such as RNNs, do not perform better than their first-order counterparts [24].

In addition to the fundamental flaw of first-order models (chord patterns comprise more than two chords), both approaches are limited by the low hierarchical level they are applied on: the temporal model is required to predict the next symbol for each audio frame. This makes the model focus on short-term smoothing, and neglect longer-term musical relations between chords, because, most of the time, the chord in the next audio frame is the same as in the current one. However, exploiting these longer-term relations is crucial to improve the prediction of chords. RNNs, if applied on chord sequences, are capable of learning these relations, and become powerful chord predictors [21].

Our contributions in this paper are as follows: i) we describe a probabilistic model that allows for the integration of chord-level language models with frame-level acoustic models, by connecting the two using chord duration models; ii) we develop and apply chord language models and chord duration models based on RNNs within this framework;
and iii) we explore how these models affect chord recognition results, and show that the proposed integrated model out-performs existing temporal models.

2. CHORD SEQUENCE MODELLING

Chord recognition is a sequence labelling task, i.e. we need to assign a categorical label y_t ∈ Y (a chord from a chord alphabet) to each member of the observed sequence x_t (an audio frame), such that y_t is the harmonic interpretation of the music represented by x_t. Formally,

    ŷ_{1:T} = argmax_{y_{1:T}} P(y_{1:T} | x_{1:T}).    (1)

Assuming a generative structure as shown in Fig. 1, the probability distribution factorises as

    P(y_{1:T} | x_{1:T}) ∝ ∏_t [1 / P(y_t)] · P_A(y_t | x_t) · P_T(y_t | y_{1:t−1}),

where P_A is the acoustic model, P_T the temporal model, and P(y_t) the label prior, which we assume to be uniform as in [31].

Figure 1. Generative chord sequence model. Each chord label y_t depends on all previous labels y_{1:t−1}.

The temporal model P_T predicts the chord symbol of each audio frame. As discussed earlier, this prevents both finite-context models (such as HMMs or CRFs) and unrestricted models (such as RNNs) from learning meaningful harmonic relations. To enable this, we disentangle P_T into a harmonic language model P_L and a duration model P_D, where the former models the harmonic progression of a piece, and the latter models the duration of chords.

The language model P_L is defined as P_L(ȳ_k | ȳ_{1:k−1}), where ȳ_{1:k} = C(y_{1:t}), and C(·) is a sequence compression mapping that removes all consecutive duplicates of a chord (e.g. C((C, C, F, F, G)) = (C, F, G)). The frame-wise labels y_{1:t} are thus reduced to chord changes, and P_L can focus on modelling these.

The duration model P_D is defined as P_D(s_t | y_{1:t−1}), where s_t ∈ {c, s} indicates whether the chord changes (c) or stays the same (s) at time t. P_D thus only predicts whether the chord will change or not, but not which chord will follow—this is left to the language model P_L. This definition allows P_D to consider the preceding chord labels y_{1:t−1}; in practice, we restrict the model to only depend on the preceding chord changes, i.e. P_D(s_t | s_{1:t−1}). Exploring more complex models of harmonic rhythm is left for future work.

Using these definitions, the temporal model P_T factorises as

    P_T(y_t | y_{1:t−1}) = P_L(ȳ_k | ȳ_{1:k−1}) · P_D(c | y_{1:t−1})   if y_t ≠ y_{t−1},
    P_T(y_t | y_{1:t−1}) = P_D(s | y_{1:t−1})                          otherwise.    (2)

The chord progression can then be interpreted as a path through a chord-time lattice as shown in Fig. 2.

Figure 2. Chord-time lattice representing the temporal model P_T, split into a language model P_L and duration model P_D. Here, ȳ_{1:K} represents a concrete chord sequence. For each audio frame, we move along the time-axis to the right. If the chord changes, we move diagonally to the upper right. This corresponds to the first case in Eq. 2. If the chord stays the same, we move only to the right. This corresponds to the second case of the equation.

This model cannot be decoded efficiently at test-time because each y_t depends on all predecessors. We will thus use either models that restrict these connections to a finite past (such as higher-order Markov models) or use approximate inference methods for other models (such as RNNs).
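To make the two cases of Eq. 2 concrete, here is a minimal Python sketch (not from the paper; the callables p_language and p_duration are hypothetical stand-ins for P_L and P_D) of the sequence compression mapping C(·) and the temporal factorisation:

```python
from itertools import groupby

def compress(frame_labels):
    """C(.): drop consecutive duplicates, e.g. ['C','C','F','F','G'] -> ['C','F','G']."""
    return [label for label, _ in groupby(frame_labels)]

def temporal_prob(y_next, y_past, p_language, p_duration):
    """Eq. 2: P_T(y_t | y_{1:t-1}) for one candidate chord y_next.

    p_language(chord, chord_history) and p_duration(change, change_history)
    are assumed interfaces for P_L and P_D, not part of the paper's code.
    """
    chord_history = compress(y_past)                       # \bar{y}_{1:k-1}
    change_history = [a != b for a, b in zip(y_past[1:], y_past[:-1])]
    if y_next != y_past[-1]:                               # chord changes (first case)
        return p_language(y_next, chord_history) * p_duration(True, change_history)
    return p_duration(False, change_history)               # chord stays (second case)
```

During decoding, this quantity would be multiplied with the acoustic term P_A(y_t | x_t) and the uniform prior from Eq. 1.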
3. MODELS

The general model described above requires three sub-models: an acoustic model P_A that predicts a chord distribution from each audio frame, a duration model P_D that predicts when chords change, and a language model P_L that predicts the progression of chords in the piece.

3.1 Acoustic Model

The acoustic model we use is a VGG-style convolutional neural network, similar to the one presented in [23]. It uses three convolutional blocks: the first consists of 4 layers of 32 3×3 filters (with zero-padding in each layer), followed by 2×1 max-pooling in frequency; the second comprises 2 layers of 64 such filters followed by the same pooling scheme; the third is a single layer of 128 12×9 filters. Each of the blocks is followed by feature-map-wise dropout with probability 0.2, and each layer is followed by batch normalisation [19] and an ELU activation function [10]. Finally, a linear convolution with 25 1×1 filters followed by global average pooling and a softmax produces the chord class probabilities P_A(y_t | x_t).

The input to the network is a 1.5 s patch of a quarter-tone spectrogram computed using a logarithmically spaced triangular filter bank. Concretely, we process the audio at a sample rate of 44 100 Hz using the STFT with a frame size of 8192 and a hop size of 4410. Then, we apply to the magnitude of the STFT a triangular filter bank with 24 filters per octave between 65 Hz and 2 100 Hz. Finally, we take the logarithm of the resulting magnitudes to compress the input range.

Neural networks tend to produce over-confident predictions, which in further consequence could over-rule the predictions of a temporal model [9]. To mitigate this, we use two techniques: first, we train the model using uniform smoothing (i.e. we assign a proportion of 1 − β to other classes during training); second, during inference, we apply the temperature softmax function

    σ_τ(z)_j = e^{z_j/τ} / Σ_{k=1}^{K} e^{z_k/τ}

instead of the standard softmax in the final layer. Higher values of τ produce smoother probability distributions. In this paper, we use β = 0.9 and τ = 1.3, as determined in preliminary experiments.

3.2 Language Model

The language model P_L predicts the next chord, regardless of its duration, given the chord sequence it has previously seen. As shown in [21], RNN-based models perform better than n-gram models at this task. We thus adopt this approach, and refer the reader to [21] for details.

To give an overview, we follow the set-up introduced by [28] and use a recurrent neural network for next-chord prediction. The network's task is to compute a probability distribution over all possible next chord symbols, given the chord symbols it has observed before. Figure 3 shows an RNN in a general next-step prediction task. In our case, the inputs z_k are the chord symbols given by C(y_{1:T}).

We will describe the network's hyper-parameters in detail in Section 4, where we will also evaluate the effect the language models have on chord recognition.

Figure 3. Sketch of an RNN used for next-step prediction, where z_k refers to an arbitrary categorical input, v(·) is a (learnable) input embedding vector, and h_k the hidden state at step k. Arrows denote matrix multiplications followed by a non-linear activation function. The input is padded with a dummy input z_0 in the beginning. The network then computes the probability distribution for the next symbol.

3.3 Duration Model

The duration model P_D predicts whether the chord will change in the next time step. This corresponds to modelling the duration of chords. Existing temporal models induce implicit duration models: for example, an HMM implies an exponential chord duration distribution (if one state is used to model a chord), or a negative binomial distribution (if multiple left-to-right states are used per chord). However, such duration models are simplistic, static, and do not adapt to the processed piece.

An explicit duration model has been explored in [4], where beat-synchronised chord durations were stored as discrete distributions. Their approach is useful for beat-synchronised models, but impractical for frame-wise models—the probability tables would become too large, and data too sparse to estimate them. Since our approach avoids the potentially error-prone beat synchronisation, the approach of [4] does not work in our case.

Instead, we opt to use recurrent neural networks to model chord durations. These models are able to adapt to characteristics of the processed data [21], and have shown great potential in processing periodic signals [1] (and chords do change periodically within a piece). To train an RNN-based duration model, we set up a next-step-prediction task, identical in principle to the set-up for harmonic language modelling: the network has to compute the probability of a chord change in the next time step, given the chord changes it has seen in the past. We thus simplify P_D(s_t | y_{1:t−1}) = P̂_D(s_t | s_{1:t−1}), as mentioned earlier. Again, see Fig. 3 for an overview (for duration modelling, replace z_k with s_t).

In Section 4, we will describe in detail the hyper-parameters of the networks we employed, and compare the properties of various settings to baseline duration models. We will also assess the impact of the duration modelling quality on the final chord recognition result.
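Both the language model of Sec. 3.2 and the duration model of Sec. 3.3 are instances of the next-step predictor sketched in Fig. 3: a learnable embedding v(·), one recurrent layer, and a softmax over the next symbol. The following PyTorch snippet is a minimal illustrative sketch of such a predictor, not the authors' implementation; the default sizes follow the best language model settings reported later in Section 4 (a single layer of 512 GRUs, a 16-dimensional embedding, a 25-symbol chord vocabulary).

```python
import torch
import torch.nn as nn

class NextStepRNN(nn.Module):
    """Next-step predictor as in Fig. 3: embedding v(.), one GRU layer, softmax.
    For the language model, tokens are chord symbols (vocab_size=25);
    for the duration model, tokens are change/stay indicators (vocab_size=2)."""

    def __init__(self, vocab_size=25, embed_dim=16, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # learnable v(.)
        self.gru = nn.GRU(embed_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):                                # tokens: (batch, steps)
        hidden, _ = self.gru(self.embed(tokens))              # (batch, steps, hidden)
        return torch.log_softmax(self.out(hidden), dim=-1)    # log P(z_{k+1} | z_{1:k})
```

Training such a model then amounts to minimising the negative log-likelihood of the observed next symbol at every step, which is exactly the next-step-prediction task described above.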
3.4 Model Integration

Dynamic models such as RNNs have one main advantage over their static counterparts (e.g. n-gram models for language modelling or HMMs for duration modelling): they consider all previous observations when predicting the next one. As a consequence, they are able to adapt to the piece that is currently processed—they assign higher probabilities to sub-sequences of chords they have seen earlier [21], or predict chord changes according to the harmonic rhythm of a song (see Sec. 4.3). The flip side of the coin is, however, that this property prohibits the use of dynamic programming approaches for efficient decoding. We cannot exactly and efficiently decode the best chord sequence given the input audio.

Hence we have to resort to approximate inference. In particular, we employ hashed beam search [32] to decode the chord sequence. General beam search restricts the search space by keeping only the N_b best solutions up to the current time step (in our case, the N_b best paths through all possible chord-time lattices, see Fig. 2). However, as pointed out in [32], the beam might saturate with almost identical solutions, e.g. the same chord sequence differing only marginally in the times the chords change. Such pathological cases may impair the final estimate. To mitigate this problem, hashed beam search forces the tracked solutions to be diverse by pruning similar solutions with lower probability.

The similarity of solutions is determined by a task-specific hash function. For our purpose, we define the hash function of a solution to be the last N_h chord symbols in the sequence, regardless of their duration; formally, the hash function f_h(y_{1:t}) = ȳ_{(k−N_h):k}. (Recall that ȳ_{1:k} = C(y_{1:t}).) In contrast to the hash function originally proposed in [32], which directly uses y_{(t−N_h):t}, our formulation ensures that sequences that differ only in timing, but not in chord sequence, are considered similar.

To summarise, we approximately decode the optimal chord transcription as defined in Eq. 1 using hashed beam search, which at each time step keeps the best N_b solutions, and at most N_s similar solutions.
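The pruning step of the hashed beam search described above could be sketched as follows (illustrative only; the representation of beam entries as (log-probability, frame-label) pairs is an assumption, not the authors' implementation; the default values correspond to those used later in Sec. 4.4):

```python
from collections import defaultdict
from itertools import groupby

def hash_key(frame_labels, n_h=5):
    """f_h(y_{1:t}): last n_h chord symbols after removing consecutive duplicates,
    so that solutions differing only in chord timing receive the same key."""
    compressed = [label for label, _ in groupby(frame_labels)]
    return tuple(compressed[-n_h:])

def prune_beam(beam, n_b=25, n_s=4, n_h=5):
    """One time step of hashed-beam-search pruning (sketch).
    `beam` is a list of (log_prob, frame_labels) partial solutions."""
    by_hash = defaultdict(list)
    for log_prob, labels in beam:
        by_hash[hash_key(labels, n_h)].append((log_prob, labels))
    survivors = []
    for group in by_hash.values():                         # keep at most n_s per hash
        group.sort(key=lambda entry: entry[0], reverse=True)
        survivors.extend(group[:n_s])
    survivors.sort(key=lambda entry: entry[0], reverse=True)
    return survivors[:n_b]                                 # keep the n_b best overall
```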
4. EXPERIMENTS

In our experiments, we will first evaluate harmonic language and duration models individually. Here, we will compare the proposed models to common baselines. Then, we will integrate these models into the chord recognition framework we outlined in Section 2, and evaluate how the individual parts interact in terms of chord recognition score.

4.1 Data

We use the following datasets in 4-fold cross-validation. Isophonics (http://isophonics.net/datasets): 180 songs by the Beatles, 19 songs by Queen, and 18 songs by Zweieck, 10:21 hours of audio; RWC Popular [15]: 100 songs in the style of American and Japanese pop music, 6:46 hours of audio; Robbie Williams [13]: 65 songs by Robbie Williams, 4:30 hours of audio; and McGill Billboard [3]: 742 songs sampled from the American billboard charts between 1958 and 1991, 44:42 hours of audio. The compound dataset thus comprises 1125 unique songs, and a total of 66:21 hours of audio.

Furthermore, we used the following data sets (with duplicate songs removed) as additional data for training the language and duration models: 173 songs from the Rock corpus [11]; a subset of 160 songs from UsPop2002 (https://labrosa.ee.columbia.edu/projects/musicsim/uspop2002.html) for which chord annotations are available (https://github.com/tmc323/Chord-Annotations); 291 songs from Weimar Jazz (http://jazzomat.hfm-weimar.de/dbformat/dboverview.html), with chord annotations taken from lead sheets of Jazz standards; and Jay Chou [12], a small collection of 29 Chinese pop songs.

We focus on the major/minor chord vocabulary, and following [7], map all chords containing a minor third to minor, and all others to major. This leaves us with 25 classes: 12 root notes × {major, minor} and the 'no-chord' class.

4.2 Language Models

The performance of neural networks depends on a good choice of hyper-parameters, such as number of layers, number of units per layer, or unit type (e.g. vanilla RNN, gated recurrent unit (GRU) [5] or long short-term memory unit (LSTM) [17]). The findings in [21] provide a good starting point for choosing hyper-parameter settings that work well. However, we strive to find a simpler model to reduce the computational burden at test time. To this end, we perform a grid search in a restricted search space, using the validation score of the first fold. We search over the following settings: number of layers ∈ {1, 2, 3}, number of units ∈ {256, 512}, unit type ∈ {GRU, LSTM}, input embedding ∈ {one-hot, R^8, R^16, R^24}, learning rate ∈ {0.001, 0.005}, and skip connections ∈ {on, off}. Other hyper-parameters were fixed for all trials: we train the networks for 100 epochs using stochastic gradient descent with mini-batches of size 4, employ the Adam update rule [20], and, starting from epoch 50, linearly anneal the learning rate to 0.

To increase the diversity in the training data, we use two data augmentation techniques, applied each time we show a piece to the network. First, we randomly shift the key of the piece; the network can thus learn that harmonic relations are independent of the key, as in roman numeral analysis. Second, we select a sub-sequence of random length instead of the complete chord sequence; the network thus has to learn to cope with varying context sizes.

The best model turned out to be a single-layer network of 512 GRUs, with a learnable 16-dimensional input embedding and without skip connections, trained using a learning rate of 0.005. (Due to space constraints, we cannot present the complete grid search results.) We compare this model and a smaller, but otherwise identical RNN with 32 units, to two baselines: a 2-gram model, and a 4-gram model. Both can be used for chord recognition in a higher-order HMM [25]. We train the n-gram models using maximum likelihood estimation with Lidstone smoothing as described in [21], using the key-shift data augmentation technique (sub-sequence cropping is futile for finite context models). As evaluation measure, we use the average log-probability of predicting the correct next chord. Table 1 presents the test results. The GRU models predict chord sequences with much higher probability than the baselines.

Table 1. Language model results: average log-probability of the correct next chord computed by each model.
        GRU-512   GRU-32   4-gram   2-gram
log-P   −1.293    −1.576   −1.887   −2.393

When we look into the input embedding v(·), which was learned by the RNN during training from a random initialisation, we observe an interesting positioning of the chord symbols (see Figure 4). We found that similar patterns develop for all 1-layer GRUs we tried, and these patterns are consistent for all folds we trained on. We observe i) that chords form three clusters around the center, in which the minor chords are farther from the center than major chords; ii) that the clusters group major and minor chords with the same root, and the distances between the roots are minor thirds (e.g. C, E♭, F♯, A); iii) that clockwise movement in the circle of fifths corresponds to clockwise movement in the projected embedding; and iv) that the way chords are grouped in the embedding corresponds to how they are connected in the Tonnetz.

Figure 4. Chord embedding projected into 2D using PCA (left); Tonnetz of triads (right). The "no-chord" class resides in the center of the embedding. Major chords are upper-case and orange, minor chords lower-case and blue. Clusters in the projected embedding and the corresponding positions in the Tonnetz are marked in color. If projected into 3D (not shown here), the chord clusters split into a lower and upper half of four chords each. The chords in the lower halves are shaded in the Tonnetz representation.

At this time, we cannot provide an explanation for these automatically emerging patterns. However, they warrant a further investigation to uncover why this specific arrangement seems to benefit the predictions of the model.

4.3 Duration Models

As for the language model, we performed a grid search on the first fold to find good choices for the recurrent unit type ∈ {vanilla RNN, GRU, LSTM}, and number of recurrent units ∈ {16, 32, 64, 128, 256} for the LSTM and GRU, and {128, 256, 512} for the vanilla RNN. We use only one recurrent layer for simplicity. We found networks of 256 GRU units to perform best; although this indicates that even bigger models might give better results, for the purposes of this study, we think that this configuration is a good balance between prediction quality and model complexity.

The models were trained for 100 epochs using the Adam update rule [20] with a learning rate linearly decreasing from 0.001 to 0. The data was processed in mini-batches of 10, where the sequences were cut in excerpts of 200 time steps (20 s). We also applied gradient clipping at a value of 0.001 to ensure a smooth learning progress.

We compare the best RNN-based duration model with two baselines. The baselines are selected because both are implicit consequences of using HMMs as temporal model, as is common in chord recognition. We assume a single parametrisation for each chord; this ostensible simplification is justified, because simple temporal models such as HMMs do not profit from chord information, as shown by [4, 7]. The first baseline we consider is a negative binomial distribution. It can be modelled by a HMM using n states per chord, connected in a left-to-right manner, with transitions of probability p between the states (self-transitions thus have probability 1 − p). The second, a special case of the first with n = 1, is an exponential distribution; this is the implicit duration distribution used by all chord recognition models that employ a simple 1-state-per-chord HMM as temporal model. Both baselines are trained using maximum likelihood estimation.

To measure the quality of a duration model, we consider the average log-probability it assigns to a chord duration. The results are shown in Table 2. We further added results for the simplest GRU model we tried—using only 16 recurrent units—to indicate the performance of small models of this type. We will also use this simple model when judging the effect of duration modelling on the final result in Sec. 4.4. As seen in the table, both GRU models clearly out-perform the baselines.

Table 2. Duration model results: average log-probability of chord durations computed by each model.
        GRU-256   GRU-16   Neg. Binom.   Exp.
log-P   −2.014    −2.868   −3.946        −4.003

Figure 5 shows the reason why the GRU performs so much better than the baselines: as a dynamic model, it can adapt to the harmonic rhythm of a piece, while static models are not capable of doing so. We see that a GRU with 128 units predicts chord changes with high probability at periods of the harmonic rhythm. It also reliably remembers the period over large gaps in which the chord did not change (between seconds 61 and 76). During this time, the peaks decay differently for different multiples of the period, which indicates that the network simultaneously tracks multiple periods of varying importance. In contrast, the negative binomial distribution statically yields a higher chord change probability that rises with the number of audio frames since the last chord change. Finally, the smaller GRU model with only 16 units also manages to adapt to the harmonic rhythm; however, its predictions between the peaks are noisier, and it fails to remember the period correctly in the time without chord changes.

Figure 5. Probability of chord change P_D(s_t | s_{1:t−1}) over time (seconds 55–80), computed by different models (Negative Binomial, GRU-16, GRU-128). Gray vertical dashed lines indicate true chord changes.
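For reference, the two static baseline distributions just described can be evaluated directly with scipy. The sketch below (with made-up durations and parameters) computes the per-duration log-probabilities whose average is reported in Table 2; it illustrates only the baselines, not the GRU models.

```python
import numpy as np
from scipy.stats import nbinom, geom

def neg_binomial_logpmf(durations, n, p):
    """Duration log-probability implied by an n-state left-to-right HMM with
    inter-state transition probability p (self-transitions 1 - p).
    scipy's nbinom counts failures before the n-th success, hence durations - n."""
    return nbinom.logpmf(durations - n, n, p)

def exponential_logpmf(durations, p):
    """Special case n = 1: the geometric ('exponential') duration distribution
    of a 1-state-per-chord HMM with change probability p."""
    return geom.logpmf(durations, p)

# Made-up chord durations in frames, just to show the averaging:
durations = np.array([20, 35, 18, 41])
print(np.mean(neg_binomial_logpmf(durations, n=3, p=0.1)))
print(np.mean(exponential_logpmf(durations, p=0.05)))
```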
4.4 Integrated Models

The individual results for the language and duration models are encouraging, but only meaningful if they translate to better chord recognition scores. This section will thus evaluate if and how the duration and language models affect the performance of a chord recognition system.

The acoustic model used in these experiments was trained for 300 epochs (with 200 parameter updates per epoch) using a mini-batch size of 512 and the Adam update rule with standard parameters. We linearly decay the learning rate to 0 in the last 100 epochs.

We compare all combinations of language and duration models presented in the previous sections. For language modelling, these are the GRU-512, GRU-32, 4-gram, and 2-gram models; for duration modelling, these are the GRU-256, GRU-16, and negative binomial models. (We leave out the exponential model, because its results differ negligibly from the negative binomial one.) The models are decoded using the hashed beam search algorithm, as described in Sec. 3.4: we use a beam width of N_b = 25, where we track at most N_s = 4 similar solutions as defined by the hash function f_h, where the number of chords considered is set to N_h = 5. These values were determined by a small number of preliminary experiments.

Additionally, we evaluate exact decoding results for the n-gram language models in combination with the negative binomial duration distribution. This will indicate how much the results suffer due to the approximate beam search.

As main evaluation metric, we use the weighted chord symbol recall (WCSR) over the major/minor chord alphabet, as defined in [30]. We thus compute WCSR = t_c / t_a, where t_c is the total duration of chord segments that have been recognised correctly, and t_a is the total duration of chord segments annotated with chords from the target alphabet. We also report chord root accuracy and a measure of segmentation (see [16], Sec. 8.3).

Table 3 compares the results of the standard model (the combination that implicitly emerges in simple HMM-based temporal models) to the best model found in this study. Although the improvements are modest, they are consistent, as shown by a paired t-test (p < 2.487 × 10⁻²³ for all differences).

Table 3. Results of the standard model (2-gram language model with negative binomial durations) compared to the best one (GRU language and duration models).
Model                    Root    Maj/Min   Seg.
2-gram / neg. binom.     0.812   0.795     0.804
GRU-512 / GRU-256        0.821   0.805     0.814

Figure 6 presents the effects of duration and language models on the WCSR. Better language and duration models directly improve chord recognition results, as the WCSR increases linearly with higher log-probability of each model. As this relationship does not seem to flatten out, further improvement of each model type can still increase the score. We also observe that the approximate beam search does not impair the result by much compared to exact decoding (compare the dotted blue line with the solid one).

Figure 6. Effect of language and duration models on the final result: WCSR (maj/min) plotted against language model log-P (for each duration model) and against duration model log-P (for each language model). Both plots show the same results from different perspectives.
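As an illustration of the WCSR metric defined above, the following sketch computes t_c / t_a from frame-wise reference and estimated labels. It is a simplified frame-level approximation (the official evaluation follows [30] and operates on labelled segments); the 0.1 s default frame duration matches the hop size of the acoustic model.

```python
import numpy as np

def wcsr(ref_labels, est_labels, frame_duration=0.1, target_alphabet=None):
    """Weighted chord symbol recall: t_c / t_a (frame-wise sketch).
    ref_labels/est_labels are equal-length frame label sequences; frames whose
    reference chord lies outside the target alphabet are excluded from t_a."""
    ref = np.asarray(ref_labels)
    est = np.asarray(est_labels)
    if target_alphabet is None:
        in_target = np.ones(len(ref), dtype=bool)
    else:
        in_target = np.isin(ref, list(target_alphabet))
    t_a = in_target.sum() * frame_duration                     # annotated duration
    t_c = (in_target & (ref == est)).sum() * frame_duration    # correct duration
    return t_c / t_a

# e.g. wcsr(['C:maj', 'C:maj', 'F:maj'], ['C:maj', 'F:maj', 'F:maj']) ~= 0.67
```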
5. CONCLUSION AND DISCUSSION

We described a probabilistic model that disentangles three components of a chord recognition system: the acoustic model, the duration model, and the language model. We then developed better duration and language models than have been used for chord recognition, and illustrated why the RNN-based duration models perform better and are more meaningful than their static counterparts implicitly employed in HMMs. (For a similar investigation for chord language models, see [21].) Finally, we showed that improvements in each of these models directly influence chord recognition results.

We hope that our contribution facilitates further research in harmonic language and duration models for chord recognition. These aspects have been neglected because they did not show great potential for improving the final result [4, 7]. However, we believe (see [24] for some evidence) that this was due to the improper assumption that temporal models applied on the time-frame level can appropriately model musical knowledge. The results in this paper indicate that chord transitions modelled on the chord level, and connected to audio frames via strong duration models, indeed have the capability to improve chord recognition results.

6. ACKNOWLEDGEMENTS

This work is supported by the European Research Council (ERC) under the EU's Horizon 2020 Framework Programme (ERC Grant Agreement number 670035, project "Con Espressione").

7. REFERENCES

[1] Sebastian Böck and Markus Schedl. Enhanced Beat Tracking With Context-Aware Neural Networks. In Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11), Paris, France, September 2011.

[2] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Audio chord recognition with recurrent neural networks. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR), Curitiba, Brazil, 2013.

[3] John Ashley Burgoyne, Jonathan Wild, and Ichiro Fujinaga. An Expert Ground Truth Set for Audio Chord Recognition and Music Analysis. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), Miami, USA, October 2011.

[4] Ruofeng Chen, Weibin Shen, Ajay Srinivasamurthy, and Parag Chordia. Chord Recognition Using Duration-Explicit Hidden Markov Models. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), Porto, Portugal, 2012.

[5] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv:1409.1259 [cs, stat], September 2014.

[6] Taemin Cho. Improved Techniques for Automatic Chord Recognition from Music Audio Signals. Dissertation, New York University, New York, 2014.

[7] Taemin Cho and Juan P. Bello. On the Relative Importance of Individual Components of Chord Recognition Systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):477–492, February 2014.

[8] Taemin Cho, Ron J. Weiss, and Juan Pablo Bello. Exploring common variations in state of the art chord recognition systems. In Proceedings of the Sound and Music Computing Conference (SMC), pages 1–8, 2010.

[9] Jan Chorowski and Navdeep Jaitly. Towards better decoding and language model integration in sequence to sequence models. arXiv:1612.02695 [cs, stat], December 2016.

[10] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In International Conference on Learning Representations (ICLR), arXiv:1511.07289, San Juan, Puerto Rico, February 2016.

[11] Trevor de Clercq and David Temperley. A corpus analysis of rock harmony. Popular Music, 30(01):47–70, January 2011.

[12] Junqi Deng and Yu-Kwong Kwok. Automatic Chord estimation on seventhsbass Chord vocabulary using deep neural network. In International Conference on Acoustics Speech and Signal Processing (ICASSP), Shanghai, China, March 2016.

[13] Bruno Di Giorgi, Massimiliano Zanoni, Augusto Sarti, and Stefano Tubaro. Automatic chord recognition based on the probabilistic modeling of diatonic modal harmony. In Proceedings of the 8th International Workshop on Multidimensional Systems, Erlangen, Germany, 2013.

[14] Takuya Fujishima. Realtime Chord Recognition of Musical Sound: A System Using Common Lisp Music. In Proceedings of the International Computer Music Conference (ICMC), Beijing, China, 1999.

[15] Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka. RWC Music Database: Popular, Classical and Jazz Music Databases. In Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR), Paris, France, 2002.

[16] Christopher Harte. Towards Automatic Extraction of Harmony Information from Music Signals. Dissertation, Department of Electronic Engineering, Queen Mary, University of London, London, United Kingdom, 2010.

[17] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, November 1997.

[18] Eric J. Humphrey and Juan P. Bello. Rethinking Automatic Chord Recognition with Convolutional Neural Networks. In 11th International Conference on Machine Learning and Applications (ICMLA), Boca Raton, USA, December 2012. IEEE.

[19] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[21] Filip Korzeniowski, David R. W. Sears, and Gerhard Widmer. A Large-Scale Study of Language Models for Chord Prediction. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, April 2018.
[22] Filip Korzeniowski and Gerhard Widmer. Feature Learning for Chord Recognition: The Deep Chroma Extractor. In Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR), New York, USA, August 2016.

[23] Filip Korzeniowski and Gerhard Widmer. A Fully Convolutional Deep Auditory Model for Musical Chord Recognition. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Salerno, Italy, September 2016.

[24] Filip Korzeniowski and Gerhard Widmer. On the Futility of Learning Complex Frame-Level Language Models for Chord Recognition. In Proceedings of the AES International Conference on Semantic Audio, Erlangen, Germany, June 2017.

[25] Filip Korzeniowski and Gerhard Widmer. Automatic Chord Recognition with Higher-Order Harmonic Language Modelling. In Proceedings of the 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, September 2018.

[26] M. Mauch and S. Dixon. Simultaneous Estimation of Chords and Musical Context From Audio. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1280–1289, August 2010.

[27] Brian McFee and Juan Pablo Bello. Structured Training for Large-Vocabulary Chord Recognition. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, October 2017.

[28] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pages 1045–1048, Makuhari, Chiba, Japan, September 2010.

[29] Meinard Müller, Sebastian Ewert, and Sebastian Kreuzer. Making chroma features more robust to timbre changes. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2009.

[30] Johan Pauwels and Geoffroy Peeters. Evaluating automatically estimated chord sequences. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 749–753. IEEE, 2013.

[31] S. Renals, N. Morgan, H. Bourlard, M. Cohen, and H. Franco. Connectionist Probability Estimators in HMM Speech Recognition. IEEE Transactions on Speech and Audio Processing, 2(1):161–174, January 1994.

[32] Siddharth Sigtia, Nicolas Boulanger-Lewandowski, and Simon Dixon. Audio chord recognition with a hybrid recurrent neural network. In 16th International Society for Music Information Retrieval Conference (ISMIR), Málaga, Spain, October 2015.

[33] Yushi Ueda, Yuki Uchiyama, Takuya Nishimoto, Nobutaka Ono, and Shigeki Sagayama. HMM-based approach for automatic chord detection using refined acoustic features. In International Conference on Acoustics Speech and Signal Processing (ICASSP), Dallas, USA, March 2010.