Early Abandoning and Pruning for Elastic Distances
Matthieu Herrmann, Geoffrey I. Webb
Monash University, Australia
{matthieu.herrmann,geoff.webb}@monash.edu
(* This research has been supported by Australian Research Council grant DP210100072.)

Abstract

Elastic distances are key tools for time series analysis. Straightforward implementations require O(n²) space and time complexities, preventing many applications from scaling to long series. Much work has been devoted to speeding up these applications, mostly through the development of lower bounds, which make it possible to avoid costly distance computations when a given threshold is exceeded. This threshold also allows the computation of the distance itself to be abandoned early. Another approach, developed for DTW, is to prune parts of the computation. All these techniques are orthogonal to each other. In this work, we develop a new generic strategy, "EAPruned", that tightly integrates pruning with early abandoning. We apply it to DTW, CDTW, WDTW, ERP, MSM and TWE, showing substantial speedups in NN1-like scenarios. Pruning alone also shows substantial speedups for some distances, benefiting applications such as clustering where all pairwise distances are required and hence early abandoning is not applicable. We release our implementation as part of a new C++ library for time series classification, along with easy-to-use Python/Numpy bindings.

1 Introduction

Elastic distances are important tools for time series classification [30], clustering [12], outlier detection [2] and similarity search [26]. An elastic distance computes a similarity score, lower scores indicating greater similarity. This property makes them a good fit for nearest neighbor classifiers, which have historically been a preferred approach to time series classification. Dynamic Time Warping (DTW) is a widely used elastic distance, introduced in 1971 [21, 20] along with its constrained variant CDTW. While state-of-the-art classifiers are now significantly more accurate than simple nearest neighbor classifiers, some utilize elastic distances in more sophisticated ways [14, 15, 16, 24].

Two key strategies have been developed for speeding up elastic distance computations: pruning and early abandoning. Pruning identifies, and avoids computing, parts of the cost matrix that cannot be on the optimal path (see Section 2). Early abandoning comes in two kinds. The first one, "computation abandoning", terminates the computation when some internal variable exceeds a threshold. The second one, "lower bounding", first computes a lower bound (an approximate minimal score), and only proceeds to compute the actual score if the lower bound is below a threshold. Early abandoning is useful in applications such as nearest neighbor search, for which it is possible to derive thresholds above which the precise score is not relevant.

In this paper we develop a new strategy, "EAPruned", that tightly integrates pruning and early abandoning. We investigate its effectiveness with six key elastic distances. To enable fair comparison, we implemented several versions of the distances in C++ and compared their run times using NN1 classification over the UCR archive (in its 85 univariate datasets version). We show that EAPruned offers significant speedups: around 7.62 times faster than a simple implementation, and around 2.88 times faster than an implementation with the usual computation abandoning scheme.
We also show, in the cases of DTW and CDTW, that lower bounding is complementary to EAPruned.

The rest of this paper is organised as follows. Section 2 presents the background on ensemble classifiers that leads us to consider a given set of elastic distances. It then presents these elastic distances and highlights their common structure. Section 3 presents EAPruned itself. We then present our experimental results in Section 4, and conclude in Section 5.

2 Background and Related Work

In this paper, we only consider univariate time series, although our technique is applicable to multivariate series. Any mention of "series" or "time series" should be understood as "univariate time series". We usually denote series with the letters Q (for a query), C (for a candidate), S, T and U, and their length with the letter L (using subscripts such as L_S to disambiguate the lengths of different series). Subscripts C_i are used to distinguish multiple candidates. A series S = (s_1, s_2, ..., s_{L_S}) has elements s_1, s_2, ..., s_{L_S}, where s_i is the i-th element of S, with 1 ≤ i ≤ L_S.

2.1 The Recent History of Ensemble Classifiers

Ensemble classifiers have recently revolutionized time series classification. The "Elastic Ensemble" ("EE", see [14]), introduced in 2015, was one of the first classifiers to be consistently more accurate than NN1-DTW over a wide variety of tasks. EE combines eleven NN1 classifiers based on eleven elastic distances (presented in Sections 2.3 and 2.5). Most elastic distances have parameters. At train time, EE fine-tunes them using cross-validation. At test time, the query's class is determined through a majority vote between the component classifiers. "Proximity Forest" ("PF", see [16]) uses the same set of distance measures as EE, but deploys them for nearest neighbor tests between a single exemplar of each class within an ensemble of random trees. Both EE and PF work solely in the time domain, leading to poor accuracies when discriminant features lie in other domains. This was addressed by their respective evolutions into "HIVE-COTE" ("HC", see [15]) and "TS-CHIEF" (see [24]), which combine more classifiers working in different domains (interval, shapelets, dictionary and spectral, see [15] for an overview).

While EE provided a qualitative jump in accuracy, it did so at the cost of a quantitative jump in computation time. For example, [29] reports 17 days to learn and classify the UCR archive's ElectricDevices benchmark ([4]). Indeed, in their simplest forms and given series of length L, elastic distances have O(L²) space and time complexities. This is only compounded by an extensive search for the best possible parametrization through leave-one-out cross-validation. For a training set of N time series, searching among M parameters (M = 100 for EE), EE has a training time complexity in O(M·N²·L²).

HIVE-COTE not only expands on EE, it actually embeds it as a component – or did. Recently, EE was dropped from HIVE-COTE (see [1]) because of its computational cost. Doing so caused an average drop of 0.6% in accuracy, and a drop of more than 5% on some datasets (tested on the UCR archive). The authors considered that this was a fair price to pay for the resulting speedup (they only report that the new version is "significantly faster").

2.2 Early Abandoning and Pruning

A double buffer technique [18] overcomes the space complexity, but the time complexity remains challenging.
Luckily, nearest neighbor search opens the door to early abandoning strategies. When searching for the nearest neighbor of a query Q under a distance D, we sequentially check Q against each candidate C_i from a database. Let S be the current nearest neighbor of Q. Early abandoning encompasses a set of techniques that abort before actually completing the computation of D(Q, C_i) if they can determine that D(Q, C_i) > D(Q, S) (i.e. C_i cannot be the nearest neighbor). The two main kinds of early abandoning techniques are "lower bounding" and "computation abandoning".
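This threshold-passing pattern can be sketched as follows (a minimal illustration of ours, not the paper's library code): the best-so-far distance D(Q, S) is handed to the distance function as a cut-off, and a distance that can prove it will exceed the cut-off simply returns +∞.

```cpp
// Minimal sketch (ours): NN1 search where the best-so-far distance acts as the
// cut-off D(Q, S). Any early-abandoning distance fits the 'dist' parameter, as
// long as it returns +infinity once it can prove D(Q, C_i) > cutoff.
#include <cstddef>
#include <limits>
#include <vector>

using Series = std::vector<double>;

template <typename Dist>
std::size_t nn1(const Series& query, const std::vector<Series>& database, Dist dist) {
  double best = std::numeric_limits<double>::infinity();  // D(Q, S), +inf until a first candidate is seen
  std::size_t best_index = 0;
  for (std::size_t i = 0; i < database.size(); ++i) {
    const double d = dist(query, database[i], best);       // may abandon early and return +inf
    if (d < best) { best = d; best_index = i; }
  }
  return best_index;
}
```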
Lower bounding employs lower bounds lb such that lb(Q, C_i) ≤ D(Q, C_i). It follows that if lb(Q, C_i) > D(Q, S) then D(Q, C_i) > D(Q, S), i.e. we do not need to compute D(Q, C_i) if the lower bound is already above D(Q, S). This allows C_i to be discarded even before starting to compute D(Q, C_i). To be effective, lower bounds must be computationally cheap while remaining as "tight" as possible, i.e. as close as possible to the actual result of the distance. Lower bounding has been shown to significantly speed up classification [10, 29] and similarity search [19]. Lower bounds have mainly been developed for DTW and CDTW, the most common being LB_Kim and LB_Keogh [23, 10]. They also exist for other elastic distances [29], and remain an active field of research.

The second way to early abandon is to stop the computation of D(Q, C_i) itself. We can give an intuition of how this works without diving into the details yet. While computing D(Q, C_i), a minimum value v ≤ D(Q, C_i) is maintained. The value v usually starts at 0 and increases toward D(Q, C_i). If (and as soon as) v > D(Q, S), then D(Q, C_i) > D(Q, S), allowing the computation to be abandoned.

Finally, recent progress has been made on the DTW distance. PrunedDTW [25] computes DTW while avoiding the computation of some alignments that cannot be on an optimal path [19], and was subsequently extended to incorporate early abandoning [26].

2.3 Elastic Distances with a Common Structure

EE uses eleven elastic distances. This section examines DTW, CDTW, WDTW, ERP, MSM and TWE, the computations of which follow a similar pattern and can be sped up by a common strategy. DDTW, DCTW and DWDTW are respectively DTW, CDTW and WDTW applied to the derivative of the series [11]. We do not examine these variants in this paper, but we note that they automatically benefit from our strategy. The two remaining distances, LCSS and SQED, are examined in Section 2.5.

All the distances compute an alignment cost between two series. For most distances, this alignment cost depends on a cost function, cost, applied to pairs of elements of the series. The squared Euclidean distance is often used, and this is the cost we use in all our implementations. However, other cost functions are possible, such as the L1 norm.

2.3.1 Dynamic Time Warping

The Dynamic Time Warping (DTW) distance was first introduced in 1971 by Sakoe and Chiba as a speech recognition tool [21, 20]. Compared to the classic Euclidean distance, DTW handles distortion and disparate lengths. Intuitively, DTW "aligns" two series by aligning their points. The end points must be aligned, and point-alignments must not cross each other. Figure 1a illustrates how DTW aligns two series S and T, where the cost function is the squared Euclidean distance. DTW computes the minimal (optimal) alignment cost. Note that several optimal alignments with the same cost may exist.

To compute the optimal alignment cost, DTW computes an optimal alignment, represented by an optimal "warping path" in the (L+1) × (L+1) DTW cost matrix M_DTW. Its cells represent the cumulative optimal cost of aligning the series according to Equation 1d; Equations 1a to 1c describe the boundary conditions. Figure 1b shows the cost matrix for two series S and T along with the optimal warping path. DTW(S, T) = M_DTW(L_S, L_T).

  M_DTW(0, 0) = 0                                                                   (1a)
  M_DTW(i, 0) = +∞                                                                  (1b)
  M_DTW(0, j) = +∞                                                                  (1c)
  M_DTW(i, j) = cost(s_i, t_j) + min{ M_DTW(i−1, j−1), M_DTW(i−1, j), M_DTW(i, j−1) }   (1d)
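For concreteness, here is a minimal sketch (ours, not the paper's implementation) of Equations 1 with the squared Euclidean point-wise cost; it uses the full O(L²) matrix, before the linear-space and pruning refinements discussed later.

```cpp
// Minimal sketch of Equations 1: plain DTW with the squared point-wise cost.
// O(L^2) time and space; the optimisations of Sections 2.4 and 3 come later.
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

double dtw(const std::vector<double>& S, const std::vector<double>& T) {
  const double INF = std::numeric_limits<double>::infinity();
  const std::size_t n = S.size(), m = T.size();
  // (n+1) x (m+1) cost matrix; borders set to +infinity except M(0,0) = 0 (Eq. 1a-1c).
  std::vector<std::vector<double>> M(n + 1, std::vector<double>(m + 1, INF));
  M[0][0] = 0.0;
  for (std::size_t i = 1; i <= n; ++i) {
    for (std::size_t j = 1; j <= m; ++j) {
      const double d = S[i - 1] - T[j - 1];
      // Eq. 1d: point cost plus the cheapest of the diag, top and left dependencies.
      M[i][j] = d * d + std::min({M[i - 1][j - 1], M[i - 1][j], M[i][j - 1]});
    }
  }
  return M[n][m];  // DTW(S, T) = M(L_S, L_T)
}
```

On the running example S = (3, 1, 4, 4, 1, 1) and T = (1, 3, 2, 1, 2, 2), this yields 9, matching Figure 1.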
Figure 1: M_DTW(S,T) with warping path and alignments for S = (3, 1, 4, 4, 1, 1) and T = (1, 3, 2, 1, 2, 2). We have DTW(S, T) = M_DTW(S,T)(6, 6) = 9. (a) DTW alignments with costs, plotting series S against series T with their point-alignments. (b) DTW cost matrix with warping path:

          T:    1    3    2    1    2    2
  S = 3  i=1    4    4    5    9   10   11
  S = 1  i=2    4    8    5    5    6    7
  S = 4  i=3   13    5    9   14    9   10
  S = 4  i=4   22    6    9   18   13   13
  S = 1  i=5   22   10    7    7    8    9
  S = 1  i=6   22   14    8    7    8    9

Taking the diagonal means a new alignment between two new points. Going down or right means that one point is reused. For example, in Figure 1b, the fifth point of S is aligned with the third, fourth and fifth points of T. This can also be visualised in Figure 1a, with three dotted lines reaching the fifth point of S. An alignment requires the end points to be aligned. This means that a valid warping path starts from the top left cell and reaches the bottom right cell.

One approach to speeding up elastic distances is through approximation, e.g. FastDTW (see [22], and [31] for a counterpoint). Our approach computes exact DTW (and other distances), which is useful when approximation is not desirable. Comparing approximate approaches with exact ones is left for future research.

In the following, we will use notions of direction (left, right, previous, up, top, diagonal, etc.). They should be understood in the context of a matrix similar to the one depicted in Figure 1b. A diagonal step always means a new alignment between two new points. We call such an alignment a "canonical alignment". Going down or right (i.e. with a top or left dependency) represents an "alternate alignment". Alternate alignments happen either because the canonical alignment is too expensive, or because the series have differing lengths. Note that the alternate cases are always symmetric. This is reflected in the commutative nature of the distances regarding their series arguments (e.g. DTW(S, T) = DTW(T, S)).

2.3.2 Constrained DTW

In CDTW, the "constrained" variant of DTW, the warping path is restricted to a subarea of the matrix. Different constraints exist [20, 8]. We focus on the popular Sakoe-Chiba band [20], also known as the "warping window" (or "window" for short). This constraint can also be applied to other distances such as ERP and LCSS (Sections 2.3.4 and 2.5.2). The window is a parameter w controlling how far the warping path can deviate from the diagonal. Given a matrix M, a line index 1 ≤ l ≤ L and a column index 1 ≤ c ≤ L, we require |l − c| ≤ w. Figure 2a illustrates the state of M_CDTW(S, T) for a window of 1. Notice how the path can only step one cell away from each side of the diagonal. The resulting cost is 11, which is greater than the cost of 9 obtained earlier by unconstrained DTW (see Figure 1). Due to the window constraint, the cost of CDTW is always greater than or equal to DTW's.
Figure 2: Example of CDTW cost matrices with a window of 1. (a) CDTW cost matrix with w = 1 for S and T of equal length; cells outside the band (shown as ".") are not computed:

          T:    1    3    2    1    2    2
  S = 3  i=1    4    4    .    .    .    .
  S = 1  i=2    4    8    5    .    .    .
  S = 4  i=3    .    5    9   14    .    .
  S = 4  i=4    .    .    9   18   18    .
  S = 1  i=5    .    .    .    9   10   11
  S = 1  i=6    .    .    .    .   10   11

(b) CDTW cost matrix with w = 1 for S and a longer series U = (1, 3, 2, 1, 2, 2, 4, 4) of disparate length: the window is too small to allow an alignment between the last two points at (6, 8).

Figure 2b shows what happens with a window too small for series of disparate lengths: the last two points cannot be aligned. Given two series S and T of respective lengths L_S and L_T, an alignment is possible only with a window w ≥ |L_S − L_T|. A window of 0 is akin to the squared Euclidean distance, while a window of L is equivalent to DTW (no constraint). With a correctly set window, NN1-CDTW can achieve better accuracy than NN1-DTW by preventing spurious alignments. It is also faster to compute than DTW, as cells beyond the window are ignored. However, the window parameter must be set. A technique exploiting several properties of CDTW (e.g. increasing the window can only lead to a smaller or equal cost) has been developed to learn this parameter faster [28].

2.3.3 Weighted DTW

The Weighted Dynamic Time Warping (WDTW, see [9]) distance is another variant of DTW. If CDTW can be seen as a hard constraint on the warping paths, WDTW can be seen as a soft one. The cells (l, c) of the M_WDTW cost matrix are weighted according to their distance to the diagonal d = |l − c|. The larger the weight, the more the cell is penalized and the less likely it is to be part of an optimal path. The weight is computed according to Equation 2, where g controls the penalization. Its optimal range is 0.01 – 0.6 [9].

  w(d) = 1 / (1 + exp(−g × (d − L/2)))                                              (2)
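Since the weight depends only on d = |l − c|, it can be precomputed once per series length; a small illustrative sketch of ours follows.

```cpp
// Sketch of Equation 2: precompute the WDTW weights for every possible distance
// to the diagonal of series of length L. 'g' controls the penalization.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<double> wdtw_weights(std::size_t L, double g) {
  std::vector<double> w(L);
  for (std::size_t d = 0; d < L; ++d) {
    w[d] = 1.0 / (1.0 + std::exp(-g * (static_cast<double>(d) - static_cast<double>(L) / 2.0)));
  }
  return w;
}
// In the WDTW recurrence, the point-wise cost of cell (l, c) is then scaled by
// weights[|l - c|], penalizing cells far from the diagonal.
```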
2.3.4 Edit Distance with Real Penalty

The Edit Distance with Real Penalty (ERP, see [3]) is actually a metric fulfilling the triangular inequality. It takes as parameters a warping window (see CDTW, Section 2.3.2) and a "gap value" g. Intuitively, ERP aligns the points in the case of canonical alignments, and inserts a "gap" at a cost based on g in the case of alternate alignments. The authors suggest a gap value of 0, but EE still fine-tunes it, along with the window size. ERP is described by Equations 3. Again, the cost function is usually the squared Euclidean distance, but other norms are allowed. Compared to DTW, the borders are computed (Equations 3b and 3c) rather than being set to +∞. The cost function is now specialized per branch of the min function.

  M_ERP(0, 0) = 0                                                                   (3a)
  M_ERP(i, 0) = M_ERP(i−1, 0) + cost(s_i, g)                                        (3b)
  M_ERP(0, j) = M_ERP(0, j−1) + cost(g, t_j)                                        (3c)
  M_ERP(i, j) = min{ M_ERP(i−1, j−1) + cost(s_i, t_j),
                     M_ERP(i−1, j)   + cost(s_i, g),
                     M_ERP(i, j−1)   + cost(t_j, g) }                               (3d)

2.3.5 Move-Split-Merge

Move-Split-Merge (MSM, see [27]) is another metric developed to overcome shortcomings of other elastic distances. Compared to ERP, it is robust to translation (in ERP, the gap cost is given by cost(s_i, g); if S is translated, the gap cost changes while the canonical alignment cost remains unchanged, making ERP translation-sensitive). The authors showed that MSM is competitive against DTW, CDTW and ERP for NN1 classification.

MSM (Equations 5) defines its own cost function C_c (Equation 4), used to compute the cost of the alternate alignments. It takes as arguments the new point (np) of the alternate alignment, and the two previously considered points (x and y) of each series. C_c also takes a real number c as a constant penalty. This penalty is the only parameter of MSM (and is fine-tuned by EE).

  C_c(np, x, y) = c                                  if x ≤ np ≤ y or x ≥ np ≥ y
                = c + min{ |np − x|, |np − y| }      otherwise                      (4)

  M_MSM(0, 0) = 0                                                                   (5a)
  M_MSM(i, 0) = +∞                                                                  (5b)
  M_MSM(0, j) = +∞                                                                  (5c)
  M_MSM(i, j) = min{ M_MSM(i−1, j−1) + |s_i − t_j|,
                     M_MSM(i−1, j)   + C_c(s_i, s_{i−1}, t_j),
                     M_MSM(i, j−1)   + C_c(t_j, s_i, t_{j−1}) }                     (5d)
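The case analysis of Equation 4 is compact enough to show directly; the following is an illustrative sketch of ours.

```cpp
// Sketch of Equation 4: the MSM split/merge cost. 'np' is the new point of the
// alternate alignment, 'x' and 'y' are the two previously considered points, and
// 'c' is the constant penalty (the only MSM parameter).
#include <algorithm>
#include <cmath>

double msm_cost(double np, double x, double y, double c) {
  if ((x <= np && np <= y) || (x >= np && np >= y)) {
    return c;                                              // np lies between x and y
  }
  return c + std::min(std::fabs(np - x), std::fabs(np - y));
}
```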
2.3.6 Time Warp Edit Distance

The Time Warp Edit Distance (TWE, see [17]) is motivated by the idea that timestamps (i.e. when a value is recorded) should be taken into account. This is important for series with non-uniform sampling rates. Note that our current implementation does not use timestamps, also assuming a constant sampling rate. The i-th timestamp of a series S is denoted by τ_{s,i}.

TWE (Equations 7) defines its cost functions (Equations 6) with two parameters. The first one, ν, is a "stiffness" parameter weighting the timestamp contribution; ν = 0 makes TWE similar to DTW. The second one, λ, is a constant penalty added to the cost of the alternate alignments ("delete" in TWE terminology: deleteA for the rows, deleteB for the columns). The cost of an alternate case is the cost between the current and the previous point of the same series, plus ν times their timestamp difference, plus the λ penalty. The canonical alignment ("match") cost is the sum of the cost between the two current and the two previous points, plus a ν-weighted contribution of their respective timestamp differences.

  match:   γ_M = cost(s_i, t_j) + cost(s_{i−1}, t_{j−1}) + ν(|τ_{s,i} − τ_{t,j}| + |τ_{s,i−1} − τ_{t,j−1}|)
  deleteA: γ_A = cost(s_i, s_{i−1}) + ν|τ_{s,i} − τ_{s,i−1}| + λ                    (6)
  deleteB: γ_B = cost(t_j, t_{j−1}) + ν|τ_{t,j} − τ_{t,j−1}| + λ

  M_TWE(0, 0) = 0                                                                   (7a)
  M_TWE(i, 0) = +∞                                                                  (7b)
  M_TWE(0, j) = +∞                                                                  (7c)
  M_TWE(i, j) = min{ M_TWE(i−1, j−1) + γ_M,
                     M_TWE(i−1, j)   + γ_A,
                     M_TWE(i, j−1)   + γ_B }                                        (7d)
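A small illustrative sketch of ours for Equations 6, under the constant-sampling-rate assumption used in our implementation (the timestamp of the i-th point is simply i):

```cpp
// Sketch of Equations 6 with the squared point-wise cost and timestamps equal to
// the indices (constant sampling rate). The series are passed 1-indexed with a
// padding value at position 0 (a common TWE convention), so s[k] holds s_k.
#include <cmath>
#include <cstddef>
#include <vector>

struct TweCosts { double match; double deleteA; double deleteB; };

TweCosts twe_costs(const std::vector<double>& s, const std::vector<double>& t,
                   std::size_t i, std::size_t j, double nu, double lambda) {
  auto sq = [](double x) { return x * x; };
  TweCosts g;
  // match: costs of the two current and the two previous points, plus the
  // nu-weighted timestamp differences |i - j| and |(i-1) - (j-1)| (equal here).
  g.match = sq(s[i] - t[j]) + sq(s[i - 1] - t[j - 1])
          + nu * 2.0 * std::fabs(static_cast<double>(i) - static_cast<double>(j));
  // deleteA / deleteB: consecutive points of the same series are one time step apart.
  g.deleteA = sq(s[i] - s[i - 1]) + nu + lambda;
  g.deleteB = sq(t[j] - t[j - 1]) + nu + lambda;
  return g;
}
```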
2.4 A Common Structure

These six distances share a common structure, described by Equations 8. For a distance D, these equations assign a value to every cell of a cost matrix M_D. Given a series S "along the rows" and a series T "along the columns", the 0-indexed matrix M_D has dimension (L_S + 1) × (L_T + 1). M_D(i, j) is the cumulative cost of aligning the first i points of S with the first j points of T. It follows that the last cell, at (L_S, L_T), holds the actual cost computed by D.

The first three equations describe the border initialization of the matrix. Equation 8a initialises the top left corner with 0. Equation 8b initialises the left vertical border, and Equation 8c initialises the top horizontal border. The borders must be monotonically non-decreasing, e.g. M_D(i, 0) ≤ M_D(i+1, 0). Among the presented distances, only ERP has computed borders; the borders of all the others are initialised to +∞.

The last equation (8d) is the main one. A cell is computed by taking the minimum over three different alignment costs, each added to its respective dependency. The canonical alignment ("Canonical") cost depends on the top left diagonal cell ("diag" dependency). The vertical alternate alignment ("AlternateLine") cost depends on the cell above ("top" dependency). The horizontal alternate alignment ("AlternateColumn") cost depends on the cell to the left ("left" dependency). This process amounts to computing an optimal alignment (with minimum cost), represented by an optimal warping path in the cost matrix. For example, in Figure 1b, the cell (4, 2) is part of the optimal path, meaning that the fourth value of S is aligned with the second value of T.

  M_D(0, 0) = 0                                                                     (8a)
  M_D(i, 0) = InitVBorder, such that M_D(i, 0) ≤ M_D(i+1, 0)                        (8b)
  M_D(0, j) = InitHBorder, such that M_D(0, j) ≤ M_D(0, j+1)                        (8c)
  M_D(i, j) = min{ M_D(i−1, j−1) + Canonical,
                   M_D(i−1, j)   + AlternateLine,
                   M_D(i, j−1)   + AlternateColumn }                                (8d)

This generic structure guarantees the following properties:

• An optimal path (with minimum cost) is computed.
• Series extremities are aligned with each other ((1, 1) and (L_S, L_T)).
• The path is continuous and monotonous, i.e. for two of its consecutive cells (i1, j1) ≠ (i2, j2), we have i1 ≤ i2 ≤ i1 + 1 and j1 ≤ j2 ≤ j1 + 1.

This structure hints at a linear space complexity. All dependencies of a cell are satisfied either by the previous row (diag and top dependencies), or by the previously computed cell on the same row (left dependency). It follows that if we proceed row by row, only two rows of the matrix are required at any time, achieving linear space complexity. We can further contain the memory footprint by using the shortest series as the "column series", minimizing the row length.

Algorithm 1 implements Equations 8 with the two-row technique. We have two arrays representing the current row (curr) and the previous row (prev). The arrays are swapped (line 6) at each iteration of the outer loop. The horizontal border is stored in the curr array (line 4), the swap operation putting it in prev so that it can act as the previous row when computing the first one. The vertical border is gradually computed for each new row (line 7). Finally, the inner loop (line 8) computes, from left to right, the value of the cells of the current row.

Algorithm 1: Generic linear space complexity for a distance D.
  Input: the time series S and T
  Result: Cost D(S, T)
  1   co ← shortest series between S and T
  2   li ← longest series between S and T
  3   (prev, curr) ← arrays of length L_co + 1; curr[0] ← 0
  4   curr[1 .. L_co] ← InitHBorder
  5   for i ← 1 to L_li do
  6       swap(prev, curr)
  7       curr[0] ← InitVBorder
  8       for j ← 1 to L_co do
  9           curr[j] ← min( prev[j−1] + Canonical,
                             prev[j]   + AlternateLine,
                             curr[j−1] + AlternateColumn )
  10  return curr[L_co]
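To make the genericity concrete, here is a minimal C++ sketch (ours; the actual library specializes each distance) of Algorithm 1, where the three alignment costs and the two border initialisers are supplied as callables over the cell indices.

```cpp
// Sketch of Algorithm 1: generic two-row (linear space) evaluation of Equations 8.
// The callables receive 1-based cell indices and typically capture the two series.
#include <algorithm>
#include <cstddef>
#include <vector>

template <typename Canonical, typename AltLine, typename AltCol,
          typename VBorder, typename HBorder>
double generic_distance(std::size_t L_rows, std::size_t L_cols,
                        Canonical canonical, AltLine alt_line, AltCol alt_col,
                        VBorder init_vborder, HBorder init_hborder) {
  std::vector<double> prev(L_cols + 1), curr(L_cols + 1);
  curr[0] = 0.0;
  for (std::size_t j = 1; j <= L_cols; ++j) curr[j] = init_hborder(j);   // horizontal border
  for (std::size_t i = 1; i <= L_rows; ++i) {
    std::swap(prev, curr);                   // the previous row is now in 'prev'
    curr[0] = init_vborder(i);               // vertical border, one value per row
    for (std::size_t j = 1; j <= L_cols; ++j) {
      curr[j] = std::min({prev[j - 1] + canonical(i, j),    // diag dependency
                          prev[j]     + alt_line(i, j),     // top dependency
                          curr[j - 1] + alt_col(i, j)});    // left dependency
    }
  }
  return curr[L_cols];
}
```

For DTW, for instance, all three callables return the same squared difference of the aligned points and both borders return +∞; WDTW additionally scales that cost by the weight of Equation 2, while ERP replaces the alternate costs and the borders with gap-based ones (the running-sum borders of Equations 3b and 3c then need a stateful or prefix-sum callable).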
Algorithm 2 builds upon Algorithm 1, adding support for a warping window w and the usual early abandoning technique with a cut-off value. Hence, it is a suitable base for early-abandoned CDTW and ERP. These two changes being orthogonal to each other, removing the window-handling code yields the base for early abandoning all the other distances presented so far.

When dealing with a warping window w, we first check whether it permits an alignment at all (line 3) – see Figure 2b for a graphical representation of a window too small. The horizontal border is only initialized up to w + 1 (line 6), and the inner loop is capped within the window around the diagonal (from jStart to jStop, lines 9, 10 and 13). The vertical border is only computed while the window covers the first column (line 11).

More interestingly, the arrays are now initialized to +∞. To understand why, let us consider the cell (2, 3) of M_CDTW in Figure 2a. The cell (1, 3) is accessed to compute the AlternateLine case. However, this cell is outside the window and should not contribute to the minimum. By initializing the arrays to +∞, we implicitly set this cell, and all of the upper right triangle outside the window, to +∞. The lower triangle is implicitly set to +∞ at line 11, after the window stops covering the first column. This explains why we assign to curr[jStart − 1] and not to curr[0]. Only the diagonal edge of the triangle is set to +∞, which is enough for the next cell's AlternateColumn case to be properly ignored (e.g. in Figure 2a, the cell (3, 2) ignoring the cell (3, 1)).

Algorithm 2: Generic distance with window and early abandoning.
  Input: the time series S and T, a warping window w, a cut-off point cutoff
  Result: Cost D(S, T)
  1   co ← shortest series between S and T
  2   li ← longest series between S and T
  3   if w < L_li − L_co then return +∞
  4   (prev, curr) ← arrays of length L_co + 1 filled with +∞
  5   curr[0] ← 0
  6   curr[1 .. w+1] ← InitHBorder
  7   for i ← 1 to L_li do
  8       swap(prev, curr)
  9       jStart ← max(1, i − w)
  10      jStop ← min(i + w, L_co)
  11      curr[jStart − 1] ← if jStart = 1 then InitVBorder else +∞
  12      minv ← +∞
  13      for j ← jStart to jStop do
  14          v ← min( prev[j−1] + Canonical,
                       prev[j]   + AlternateLine,
                       curr[j−1] + AlternateColumn )
  15          minv ← min(minv, v)
  16          curr[j] ← v
  17      if minv > cutoff then return +∞
  18  return curr[L_co]

2.5 Other Distances

The two remaining distances from EE are the squared Euclidean distance and the longest common sub-sequence distance. They do not fit the common structure described in Section 2.4, nor the approach described in Section 3. Hence, we treat them fully here.

2.5.1 Squared Euclidean Distance

The squared Euclidean distance is exactly what you would expect. It requires series of equal length (i.e. it is not elastic), returning a score of +∞ otherwise. Algorithm 3 presents an early-abandoned squared Euclidean distance; our C++ implementation is based on it. As mentioned before, this distance is equivalent to CDTW (Section 2.3.2) with a window of 0.

Algorithm 3: Squared Euclidean Distance.
  Input: the time series S and T, a cut-off point cutoff
  Result: sqed(S, T)
  1   if L_S ≠ L_T then return +∞
  2   acc ← 0
  3   for i ← 1 to L_S do
  4       acc ← acc + (S_i − T_i)²
  5       if acc > cutoff then return +∞
  6   return acc
2.5.2 Longest Common Sub-Sequence

The Longest Common Sub-Sequence distance (LCSS, see [7]) is another elastic distance, more commonly used for string matching. Time series are not usually made of discrete characters that can be compared for equality, so LCSS considers two values equal if they are within a threshold ε. LCSS is described by Equations 9.

  M_LCSS(0, 0) = 0                                                                  (9a)
  M_LCSS(i, 0) = 0                                                                  (9b)
  M_LCSS(0, j) = 0                                                                  (9c)
  M_LCSS(i, j) = 1 + M_LCSS(i−1, j−1)                                      if |s_i − t_j| ≤ ε
               = max{ M_LCSS(i−1, j−1), M_LCSS(i−1, j), M_LCSS(i, j−1) }   otherwise    (9d)

Contrary to the other distances, LCSS maximizes a score, a higher score indicating greater similarity. To be usable with nearest neighbor classifiers, its result is transformed using Equation 10. The M_LCSS(L_S, L_T) score is first normalized by dividing it by the length of the shortest series, i.e. the maximum number of possible matches. Then, it is subtracted from 1, returning a score between 0 and 1 with the desired property (lower means greater similarity).

  LCSS(S, T) = 1 − M_LCSS(L_S, L_T) / min(L_S, L_T)                                 (10)

Algorithm 4: LCSS with warping window and early abandoning.
  Input: the time series S and T, a threshold ε, a warping window w, a cut-off point cutoff
  Result: Cost LCSS(S, T)
  1   co ← shortest series between S and T
  2   li ← longest series between S and T
  3   if w < L_li − L_co then return +∞
  4   (prev, curr) ← arrays of length L_co + 1 filled with 0
  5   toreach ← ⌈(1 − cutoff) × L_co⌉
  6   maxv ← 0
  7   for i ← 1 to L_li do
  8       if maxv + (L_li − i) < toreach then return +∞
  9       swap(prev, curr)
  10      jStart ← max(1, i − w)
  11      jStop ← min(i + w, L_co)
  12      curr[jStart − 1] ← 0
  13      for j ← jStart to jStop do
  14          if |li_i − co_j| ≤ ε then
  15              v ← prev[j−1] + 1
  16              curr[j] ← v
  17              maxv ← max(maxv, v)
  18          else
  19              curr[j] ← max( prev[j−1], prev[j], curr[j−1] )
  20  return curr[L_co]
Algorithm 4 presents an early-abandoned version of LCSS (and how our C++ implementation works). To early abandon, the cut-off value (which must be between 0 and 1) is turned into a number of matches that must be reached (line 5). We early abandon if the highest value of a row, added to the number of remaining potential matches, is below the required value (line 8). Note that the number of remaining matches is given by min(L_co, L_li − i). However, because the score is normalized with respect to L_co, we can only early abandon when fewer than L_co lines remain (L_li − i < L_co). Hence, the minimum operation is not required.

3 Pruning and Early Abandoning

We saw in Section 2.4 the idea behind early abandoning. We now introduce the idea of pruning. Given a cost matrix like the one presented in Figure 1b, pruning refers to avoiding the computation of some of the cells. As we still proceed row by row (as in Algorithm 1), pruning all the cells in a row leads to early abandoning.

In this section, we present how to prune and early abandon the distances based on the common structure described in Section 2.4. We present our strategy, "EAPruned", in a generic manner: all the distances based on the common structure can benefit from it. We present and illustrate the idea behind EAPruned using DTW; however, we present the algorithm with a warping window and computed borders.

Pruning was first developed for DTW in 2016 with PrunedDTW [25]. It aimed at speeding up all-pairwise calculations of DTW. In this context, no cut-off value is provided. Instead, PrunedDTW computes its own cut-off value based on the diagonal of the cost matrix. In the case of DTW with series of equal length, this amounts to the squared Euclidean distance. With disparate lengths, the final corner can be reached by completing the diagonal with the last column. Regardless of the distance, these cut-off computations are all similar to Algorithm 3, i.e. they have O(L) time complexity and O(1) space complexity. The cut-off is an upper bound ub on the actual DTW cost, i.e. the actual DTW cost is smaller than or equal to ub. PrunedDTW uses ub not to early abandon, but to skip the computation of some alignments (i.e. some cells of the cost matrix) that we know will be above it.

PrunedDTW was later used for similarity search [26] with early abandoning. However, its core was not revisited, and early abandoning was implemented in the classical way (as in Algorithm 2).

We revisited the idea behind PrunedDTW and developed a strategy that tightly integrates early abandoning and pruning. In addition, it delivers more pruning than PrunedDTW and is carefully designed for computational efficiency. It is organized in successive stages in which we avoid computing unnecessary dependencies. EAPruned itself is presented in Algorithm 5; the strategy is presented in two parts, "Pruning from the left" and "Pruning on the right".

3.1 Pruning From the Left

As stated earlier, we compute the cost matrix row by row, from left to right. While computing a row, we look at discarding as many as possible of the leftmost cells of the matrix. We define a "discard zone" as the columns topped by a "discard point". In a row, discard points are all the cells preceding the first cell with a cost below the cut-off (i.e. a cell that can still be part of an optimal warping path), and that are not already part of the discard zone. Discard points and all the cells in the discard zone have a cost strictly greater than the cut-off.
In Figures 3a and 3b, the white cells with red borders are discard points, and the discard zone is made of the grey columns they top. Notice how the cell (1, 5) is greater than the cut-off, but is not a discard point: it lies to the right of (1, 1), which has a cost below the cut-off. Besides the discard points, no cell in the discard zone is ever evaluated: they are pruned.

For DTW, the discard zone starts with the left border at (1, 0). The cell at (2, 0) is not a discard point, as it is already in the discard zone. This is due to the DTW borders being initialized to +∞, except for (0, 0) = 0. When the border is computed, the discard zone starts at the first line i such that M_D(S,T)(i, 0) > cutoff. If a window w is used, the discard zone starts at line w + 1, unless it started earlier (see Algorithm 5, line 14).
Figure 3: Illustration of "from the left" pruning on M_DTW(S,T) (the matrix of Figure 1b) with two different cut-off values. (a) cutoff = 9: the discard points are (3, 1) and (5, 2); the arrows represent the dependencies of the cell (4, 1). (b) cutoff = 6: the discard points are (3, 1), (5, 2), and (5, 3) up to (5, 6); early abandoning occurs at the end of the fifth line.

To explain this pruning scheme, we examine a cell's dependencies, as illustrated in Figure 3a for the cell (4, 1). As before (see Section 2.4), we give them the names "left", "top" and "diag". All dependencies of (4, 1) are discarded (either in the discard zone or a discard point). A cell's cost can only be greater than (or equal to) its smallest dependency. It follows that if all of a cell's dependencies are discarded, the cell can be discarded. Starting on the left border, a cell only has a top dependency. Hence, as soon as a border cell is above the cut-off (i.e. (0, 1)), the remainder of the column can be discarded. In turn, this creates the same conditions for the next column, starting from the next row. We only need to check the top dependency for cells (i > 1, 1), stopping as soon as possible. The cell (3, 1) is above the cut-off, allowing us to discard the remainder of the column.

When proceeding row by row, pruning from the left is implemented by starting the next row after the discarded columns. Discarding all the columns in a row, as in Figure 3b, leads to early abandoning. Note that in this case, the discard points from (5, 3) up to (5, 6) need to check both their top and diag dependencies.

3.2 Pruning on the Right

In the previous section, we extended a discard zone from the left to the right, leading to early abandoning when it reaches the end of a row. This strategy relies on the row-by-row evaluation order, so we cannot do the same starting from the top border, as this would require a column-by-column order. We have to keep pruning inside a row. Looking back at Figure 3b with a cut-off value of 6, we can find such cells. The cell at (1, 4) is above the cut-off. In conjunction with the top border, the following cells can only be above the cut-off. Dealing with this kind of scenario allows us to "prune on the right".

Pruning on the right is not as straightforward as pruning from the left. The main reason is that pruning from the left allows us to discard a column once and for all. Still, the idea is similar. In the previous section we mentioned that a discard point creates the "same conditions for the next column, starting from the next row", allowing the pruning process to continue. When pruning on the right, we have to identify a condition akin to the top border. This condition is a continuous block of cells above the cut-off value, reaching the end of their row.
Figure 4: Illustration of "from the left" and "on the right" pruning on M_DTW(S,T) with two different cut-off values. The blue cell represents the point of early abandoning, white cells represent discard points, and dark gray cells represent pruning points. (a) cutoff = 9: the discard points are (3, 1) and (5, 2), and the pruning points are (1, 5), (3, 6) and (4, 4); no early abandoning occurs. (b) cutoff = 6: the discard points are (3, 1) and (5, 2), the pruning points are (1, 4), (2, 6), (3, 3) and (4, 3), and early abandoning occurs at (5, 3).

In the case of the top border, this block starts at (0, 1). In Figure 3b, the second one starts at (1, 4). The start of such a block is called a "pruning point". See Figure 4, where they are represented in dark grey with red borders. Pruning points give information to the next row (similar to discard points, which instruct the next row to skip their columns). In the next row, as soon as we find a cell above the cut-off and after the pruning point, the remainder of the row can be dropped. Note that the cell just under the pruning point is a special case, as its diag dependency is not discarded by the pruning point (e.g. the cell (1, 1) in Figure 4b).

The tricky part is that pruning points move back and forth across rows. To determine their position, a cell (i, j) with a cost below the cut-off always assumes that the next cell is the row's pruning point. If (i, j + 1) > cutoff, and so on up to (i, L_co), then (i, j + 1) indeed is a pruning point. Else, we assume that (i, j + 2) is the pruning point, repeating the same logic. Note that all rows (up to an early abandoning row) have one and only one pruning point. It may be out of bounds, e.g. at L_co + 1, in which case it does not appear in the figures.

Let us look at some examples in Figure 4b.

• The cell (2, 2) > cutoff is not a pruning point, because it does not start a continuous block of cells above cutoff reaching the end of the row.
• The cell (3, 3) and all its following cells are above cutoff. The row must be fully evaluated to ensure that (3, 3) indeed is the row's pruning point. It contributes to enabling pruning on the fourth row, starting at (4, 4).

Finally, we have to address what happens when both pruning strategies "collide". Figure 4b illustrates such a collision in blue. Without pruning on the right, the cell (5, 3) would be a discard point. Because it is below a pruning point, we know that the rest of the row is above the cut-off. EAPruned prunes a total of 16 cells out of 36. The usual early abandoning strategy presented in Section 2.4 would only prune the last line, i.e. 6 cells.
Also, if the cell at (1, 1) is above cutoff, EAPruned immediately early abandons, whereas the classical way either evaluates the full row or requires a special case for this cell. As expected, Figure 4a does not show any early abandoning. In this case, EAPruned still prunes 5 cells.

3.3 The EAPruned Algorithm

We now present the EAPruned algorithm (Algorithm 5), which is compatible with any distance matching the common structure described in Section 2.4. We present the algorithm with a window and computed borders, covering the most complex case. Our implementations are specialized for each distance's needs.

Our algorithm exploits the properties resulting from pruning to further reduce the computational effort. For example, cells after the pruning point from the previous row can ignore their top and diag dependencies, along with the associated cost computations, as these are known to be above cutoff: only the left dependency needs to be computed. To do so, we split the computation of a row into several stages. For each stage, we indicate the dependencies that must be checked.

1. Compute the value of the left border (computed or outside the window). A computed border may require the top dependency.
2. Compute discard points until a non-discard point or the pruning point. Check the diag and top dependencies.
3. Compute non-discard points until the pruning point. Check all dependencies.
4. Deal with the cell at the pruning point. Check the diag dependency (always) and the left dependency (unless the left cell was a discard point).
5. Compute the cells after the pruning point. Check the left dependency.

In Algorithm 5, the discard points are represented by the next_start variable. The position of the pruning point from the previous row, used to prune the current row, is represented by the pruning_point variable. Finally, the next_pruning_point variable represents the pruning point being computed in the current row, to be used when computing the next row. To do so, its value is assigned to pruning_point once a row is done (line 39).

Beginning a new row, we first determine the indices of the first and last columns. Then, the first stage updates the left border, computing its value or applying the window (line 14). The second stage (line 16) computes discard points, which requires the row to be bordered on the left by a value above cutoff (tested at line 17). The condition in the for loop ensures that all discard points form a continuous block. As soon as a value below cutoff is found, we move on to the next stage, as we cannot have any more discard points. As explained in the previous section, next_pruning_point is set (line 20) to the next column index. Note that next_start can only be updated in the second stage, while next_pruning_point must be checked for update after every cost computation.

The third stage (line 21) computes the cost taking all dependencies into account, until we reach pruning_point. If pruning_point is out of bounds, the third stage completes the row and the following stages are skipped.

We then enter the fourth stage (line 25). If we are still within bounds, two cases are possible: either the left cell is a discard point, or it is not. If it is a discard point (line 27), then we only need to check the diag dependency, early abandoning if the cost is above cutoff. If it is not (line 30), both the left and diag dependencies are checked, and no early abandoning is possible. Finally, we early abandon if we are out of bounds and the discard points reach the end of the row (line 34).
At the fifth and final stage (line 35), only the left dependency is checked, as both the diag and top dependencies are known to be above cutoff. The loop stops as soon as we find a cost above cutoff, effectively pruning the rest of the row.
Algorithm 5: Generic EAPruned Algorithm.
  Input: the time series S and T, a warping window w, a cut-off point cutoff
  Result: Cost D(S, T)
  1   co ← shortest series between S and T
  2   li ← longest series between S and T
  3   if w < L_li − L_co then return +∞
  4   (prev, curr) ← arrays of length L_co + 1 filled with +∞
  5   curr[0] ← 0
  6   curr[1 .. w+1] ← InitHBorder
  7   next_start ← 1
  8   pruning_point ← 1
  9   for i ← 1 to L_li do
  10      swap(prev, curr)
  11      jStart ← max(i − w, next_start)
  12      jStop ← min(i + w, L_co)
  13      j ← jStart
  14      /* Stage 1: init the vertical border */
  15      curr[jStart − 1] ← if jStart = 1 then InitVBorder else +∞
  16      /* Stage 2: discard points up to, excluding, the pruning point */
  17      if curr[jStart − 1] > cutoff then
  18          for j to pruning_point − 1 while j = next_start do
  19              curr[j] ← min( prev[j−1] + Canonical, prev[j] + AlternateLine )
  20              if curr[j] ≤ cutoff then next_pruning_point ← j + 1 else next_start++
  21      /* Stage 3: continue up to, excluding, the pruning point */
  22      for j to pruning_point − 1 do
  23          curr[j] ← min( prev[j−1] + Canonical, prev[j] + AlternateLine, curr[j−1] + AlternateColumn )
  24          if curr[j] ≤ cutoff then next_pruning_point ← j + 1
  25      /* Stage 4: at the pruning point */
  26      if j ≤ jStop then
  27          if j = next_start then
  28              curr[j] ← prev[j−1] + Canonical
  29              if curr[j] ≤ cutoff then next_pruning_point ← j + 1 else return +∞
  30          else
  31              curr[j] ← min( prev[j−1] + Canonical, curr[j−1] + AlternateColumn )
  32              if curr[j] ≤ cutoff then next_pruning_point ← j + 1
  33          j++
  34      else if j = next_start then return +∞
  35      /* Stage 5: after the pruning point */
  36      for j to jStop while j = next_pruning_point do
  37          curr[j] ← curr[j−1] + AlternateColumn
  38          if curr[j] ≤ cutoff then next_pruning_point ← j + 1
  39      pruning_point ← next_pruning_point
  40  return curr[L_co]
4 Experiments

EAPruned distances produce exactly the same results as their classic counterparts. Hence, our experiments are not about accuracy, but about execution speed. We implemented the DTW, CDTW, WDTW, ERP, MSM and TWE distances following Algorithm 5, simplifying (e.g. removing the handling of the warping window) where possible. All the experiments were run under the same conditions, on a computer equipped with 64GB of RAM (enough memory to fit any dataset from the UCR Archive) and an AMD Opteron 6338P at 2.4GHz.

We tested the performance of four versions. The first one, "Base", is the classical double-buffered implementation without early abandoning. The second one, "Pruned", is based on Algorithm 5; however, it uses its own computed cutoff based on the diagonal of the cost matrix, as suggested by [25]. Such a cutoff only allows pruning, and may be quite loose. However, it allows us to show how our implementations perform when early abandoning is not possible. The third one, "EABase", is the classical double-buffered implementation with early abandoning, as presented in Algorithm 2. Finally, the fourth one, "EAPruned", is based on Algorithm 5. The Pruned and EAPruned versions are not available for SQED and LCSS; these distances are implemented following Algorithms 3 and 4.

The first experiment compares the various versions against each other. Every version of every distance is used within a NN1 classifier. Every classifier is run over every default train/test split from the UCR Archive in its 85-dataset version [4], for a total of 2380 runs. We use the parameters found by EE [14] after training. Hence, our benchmarking conditions are similar to EE at test time. The second experiment assesses the impact of lower bounding on CDTW and DTW. The C++ source code for all the versions and the EE parameters are available on GitHub [5].

4.1 Evaluation of Pruning and Early Abandoning

We measured the time taken by each distance, in each available version, to perform NN1 classification on the default train/test splits from the UCR Archive with 85 datasets. Figure 5 presents the results in hours per distance, except for Figure 5a, which aggregates the prunable distances and hence excludes LCSS and SQED.

LCSS and SQED (Figures 5h and 5i) are only early abandoned, not pruned. Both are ≈ 3 times faster when early abandoned. SQED is so fast that its time is negligible compared to the other distances, early abandoned or not. This is expected, as its time complexity is O(L). The time complexity of LCSS is O(Lw) for a warping window of size w, leading to a worst-case complexity of O(L²). When early abandoning, the best-case complexity is O(L), due to buffer initialization.

Now looking at the prunable distances in Figure 5a: compared to the Base case, pruning seems to help. However, this is not always the case when looking at individual distances: CDTW and TWE (Figures 5c and 5g) are actually slower. This may be explained by the small windows used in the case of CDTW (note that CDTW is ≈ 12 times faster than DTW in the base version, giving an idea of the effect of small windows on the computation time). Small windows mean fewer pruning opportunities. It follows that the time saved by pruning is less than its overhead cost (coming from the pruning technique itself, plus the computation of a cutoff). In the case of TWE, which does not have a window, the computed cutoff may not allow enough pruning. On the other hand, WDTW (Figure 5d), which does not have a window either, is the only distance which benefits more from pruning than from early abandoning in its base version.
We explain this by WDTW's non-linear weights: close to 0 around the diagonal, they increase rapidly as we step away from it. Hence, the computed cutoff based on the diagonal is relatively tight, allowing a lot of pruning.

Finally, EAPruned is ≈ 7.61 times faster than Base, and ≈ 2.88 times faster than EABase. Looking at individual distances, EAPruned is always the fastest, with a speedup ranging from ≈ 3.93 (TWE) up to ≈ 39.2 (WDTW) compared to Base, and from ≈ 1.85 (TWE) up to ≈ 8.44 (WDTW) compared to EABase. Note that CDTW (speedup ≈ 5.8) and TWE (speedup ≈ 3.93) remain the distances that appear to benefit the least from pruning. However, while TWE already benefits from EABase, only EAPruned manages to speed up CDTW.
Figure 5: Accumulated timings (in hours) of NN1 classification over 85 datasets from the UCR archive, using parameters discovered by EE, per distance and per version.

  Distance(s)                                    Base    Pruned    EABase    EAPruned
  (a) All prunable distances
      (DTW, CDTW, WDTW, ERP, MSM, TWE)         192.37    135.02     72.71       25.24
  (b) DTW                                       24.83     20.85      5.51        1.01
  (c) CDTW                                       1.98      2.35      1.99        0.34
  (d) WDTW                                      24.72      4.76      5.32        0.63
  (e) ERP                                       18.44     12.17      5.09        1.39
  (f) MSM                                       74.75     46.29     32.37        9.75
  (g) TWE                                       47.64     48.61     22.43       12.12
  (h) LCSS                                       5.92         –      1.98           –
  (i) SQED                                       0.03         –      0.01           –
  Dataset                          Base    Pruned    EABase    EAPruned
  UWaveGestureLibraryX             2.77      2.37      1.33        0.57
  Phoneme                          5.92      4.21      3.54        2.26
  ElectricDevices                  6.09      5.11      2.84        1.16
  FordB                            8.76      6.11      4.91        2.36
  FordA                           14.20      8.43      7.84        3.79
  NonInvasiveFetalECGThorax2      16.40      9.49      3.80        0.38
  NonInvasiveFetalECGThorax1      18.99      7.27      3.89        0.36
  HandOutlines                    19.55      9.91      6.90        1.33
  UWaveGestureLibraryAll          21.16     17.30      8.47        3.47
  StarLightCurves                 63.71     48.40     19.58        7.61
  total                          177.54    118.61     63.10       23.28

Table 1: The 10 slowest datasets, timings in hours.

  Dataset                          Base    Pruned    EABase    EAPruned
  ItalyPowerDemand                0.030     0.024     0.017       0.012
  Coffee                          0.035     0.029     0.032       0.017
  SonyAIBORobotSurface1           0.044     0.028     0.038       0.016
  BirdChicken                     0.060     0.065     0.057       0.030
  BeetleFly                       0.062     0.050     0.047       0.032
  ECG200                          0.075     0.039     0.026       0.017
  Wine                            0.085     0.030     0.035       0.008
  GunPoint                        0.095     0.064     0.037       0.016
  Beef                            0.098     0.080     0.074       0.040
  SonyAIBORobotSurface2           0.108     0.076     0.064       0.047
  total                           0.693     0.484     0.427       0.236

Table 2: The 10 fastest datasets, timings in minutes.

The results presented in Figure 5 are the timings over all 85 datasets. We have to stress that the majority of the computation time is due to only a few of the slowest datasets. The 10 slowest datasets, reported in Table 1 (in hours), make up ≈ 92.29% of the Base time. The 10 fastest datasets, reported in Table 2 (in minutes), confirm that EAPruned is also beneficial (it makes up for its overhead) even in these cases.

4.1.1 On the Time Complexity of EAPruned

In Tables 1 and 2, datasets are ordered according to the Base column. It is interesting to see that this order does not carry over to the other columns. For example, the second slowest dataset, "UWaveGestureLibraryAll", is markedly slower than "FordA" with Base, but becomes faster with EAPruned. This is due to the unpredictable nature of early abandoning and pruning. The time complexity can be as low as O(1) when an implementation does not require any initialization (such as DTW's). EAPruned's time complexity can also be as high as O(L²), with a worse constant factor than Base due to EAPruned's overhead.

Time complexity under early abandoning is a moving target, sitting between the best and worst case scenarios and depending on the order of evaluation. When performing a NN1 classification, if our first candidate is the actual nearest neighbor of the query, the average complexity will be close to O(1) (or O(L) with buffer initialization) for all following distance computations. On the other hand, if the candidates are ordered from the furthest to the nearest, the computation will never be early abandoned, and probably not pruned much. The average complexity will then be around O(L²).
Figure 6: CDTW runtime (in hours) comparison over 85 datasets from the UCR Archive between Base, EABase and EAPruned, under lb-none, lb-keogh and lb-keogh2. Window parameters obtained from EE.

  Version       lb-none    lb-keogh    lb-keogh2
  Base             1.98        0.37         0.28
  EABase           1.99        0.38         0.27
  EAPruned         0.34        0.17         0.15

Figure 7: DTW runtime (in hours) comparison over 85 datasets from the UCR Archive between Base, EABase and EAPruned, under lb-none, lb-keogh and lb-keogh2. No parameter required.

  Version       lb-none    lb-keogh    lb-keogh2
  Base            24.83       14.13        10.54
  EABase           5.51        4.52         4.14
  EAPruned         1.01        0.94         0.92

4.2 Comparing with Lower Bounds

The results obtained by EAPruned lead us to ask how it behaves in the presence of lower bounding. Lower bounds early abandon even before starting a distance's computation. Hence, EAPruned only sees the cases that the lower bound could not early abandon, meaning it now sees a higher ratio of cases that cannot be early abandoned at all. Under these circumstances, is EAPruned still worth it?

To study this question, we compare the results of CDTW and DTW for the Base, EABase and EAPruned versions, with no lower bound ("lb-none"), the LB_Keogh lower bound ("lb-keogh"), and a cascade of two applications of LB_Keogh ("lb-keogh2", reversing their arguments, as Keogh(a, b) ≠ Keogh(b, a), see [18]). Note that LB_Keogh requires computing the envelopes of the series. For a given run, envelopes are only computed once, keeping this overhead to a minimum. Moreover, envelopes are computed using Lemire's O(L) algorithm [13], the fastest known technique.

Figures 6 and 7 present the results for CDTW and DTW respectively. The "lb-none" values correspond to the cases presented in Figure 5 (i.e. without lower bound). In the CDTW case, EAPruned without lower bound is on par with Base and EABase with lower bounds; lower bounding EAPruned then provides a further ≈ 2 times speedup. In the DTW case, EAPruned is more than ≈ 4 times faster than EABase with lb-keogh2, one of the fastest configurations known until now. Lower bounding EAPruned still offers a ≈ 10% speedup, which is good news. Indeed, an envelope computed over a window as wide as the series does not contain much information (see how CDTW benefits far more from lower bounding in the Base and EABase cases than DTW does).
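For reference, LB_Keogh itself can be sketched as follows (our illustration, using the squared point-wise cost of this paper; the candidate's envelopes, e.g. a running max/min over the window, are assumed to be precomputed once, as described above):

```cpp
// Illustrative sketch (ours) of LB_Keogh with the squared point-wise cost.
// 'upper' and 'lower' are the candidate's precomputed envelopes over the window;
// the returned value lower-bounds CDTW/DTW between the query and that candidate.
#include <cstddef>
#include <limits>
#include <vector>

double lb_keogh(const std::vector<double>& query,
                const std::vector<double>& upper,
                const std::vector<double>& lower,
                double cutoff) {
  double lb = 0.0;
  for (std::size_t i = 0; i < query.size(); ++i) {
    const double q = query[i];
    if (q > upper[i]) {                  // query point above the envelope
      const double d = q - upper[i];
      lb += d * d;
    } else if (q < lower[i]) {           // query point below the envelope
      const double d = lower[i] - q;
      lb += d * d;
    }
    if (lb > cutoff) return std::numeric_limits<double>::infinity();  // bound already exceeds the threshold
  }
  return lb;  // the caller skips the full distance whenever lb exceeds its threshold
}
```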
Our results indicate that lower bounding (at least with LB_Keogh) and EAPruned complement each other. We explain this by the difference in the modus operandi of the two techniques. Lower bounds have the opportunity to pre-compute some data over the series, giving them an overall but approximate view. On the other hand, EAPruned works locally, but with the exact costs.

5 Conclusion

In this paper, we presented how to efficiently prune and early abandon elastic distances. We implemented our algorithm, EAPruned, for the major elastic distances used by state-of-the-art ensemble classifiers, and compared the timings with existing techniques. We show over 85 datasets from the UCR archive that, in scenarios akin to NN1 classification, our algorithm is always the fastest, even in already fast cases. We also show that pruning alone can be beneficial for some distances, although caution is advised as it may be counter-productive in some cases. Finally, we show that not only is EAPruned competitive against lower bounding alone, the two techniques are actually complementary.

In light of these results, we encourage researchers to keep developing lower bounds, and practitioners to use our algorithm. We make the latter easy by releasing our C++ implementations with Python/Numpy bindings (see [6]) under the permissive MIT licence. Our next steps will be to implement our algorithm for the multivariate case, and to implement more lower bounds in Tempo. We also plan to fit ensemble classifiers such as Proximity Forest or TS-CHIEF with our distances, expecting significant speedups.

References

[1] Anthony Bagnall, Michael Flynn, James Large, Jason Lines, and Matthew Middlehurst. A tale of two toolkits, report the third: On the usage and performance of HIVE-COTE v1.0. arXiv:2004.06069 [cs, stat], April 2020.
[2] Seif-Eddine Benkabou, Khalid Benabdeslem, and Bruno Canitia. Unsupervised outlier detection for time series by entropy and dynamic time warping. Knowledge and Information Systems, 54(2):463–486, 2018.
[3] Lei Chen and Raymond Ng. On the marriage of Lp-norms and edit distance. In Proceedings 2004 VLDB Conference, pages 792–803, 2004.
[4] Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh. The UCR Time Series Archive. arXiv:1810.07758 [cs, stat], September 2019.
[5] Matthieu Herrmann. Experimentation source code and resources. https://github.com/HerrmannM/paper-2021-EAPElasticDist, 2021.
[6] Matthieu Herrmann. Tempo, the Monash time series classification library. https://github.com/MonashTS/tempo, 2021.
[7] Daniel S. Hirschberg. Algorithms for the Longest Common Subsequence Problem. Journal of the ACM (JACM), 24(4):664–675, October 1977.
[8] F. Itakura. Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):67–72, 1975.