Early Abandoning and Pruning for Elastic Distances
Matthieu Herrmann, Geoffrey I. Webb
Monash University, Australia
{matthieu.herrmann,geoff.webb}@monash.edu
(* This research has been supported by Australian Research Council grant DP210100072.)

Abstract

Elastic distances are key tools for time series analysis. Straightforward implementations require O(n²) space and time complexities, preventing many applications from scaling to long series. Much work has been devoted to speeding up these applications, mostly through the development of lower bounds, which make it possible to avoid costly distance computations when a given threshold is exceeded. This threshold also allows the computation of the distance itself to be abandoned early. Another approach, developed for DTW, is to prune parts of the computation. All these techniques are orthogonal to each other. In this work, we develop a new generic strategy, "EAPruned", that tightly integrates pruning with early abandoning. We apply it to DTW, CDTW, WDTW, ERP, MSM and TWE, showing substantial speedups in NN1-like scenarios. Pruning alone also shows substantial speedups for some distances, benefiting applications such as clustering where all pairwise distances are required and hence early abandoning is not applicable. We release our implementation as part of a new C++ library for time series classification, along with easy-to-use Python/Numpy bindings.

1 Introduction

Elastic distances are important tools for time series classification [30], clustering [12], outlier detection [2] and similarity search [26]. An elastic distance computes a similarity score, lower scores indicating greater similarity. This property makes them a good fit for nearest neighbor classifiers, which have historically been a preferred approach to time series classification. Dynamic Time Warping (DTW) is a widely used elastic distance, introduced in 1971 [21, 20] along with its constrained variant CDTW. While state-of-the-art classifiers are now significantly more accurate than simple nearest neighbor classifiers, some utilize elastic distances in more sophisticated ways [14, 15, 16, 24].

Two key strategies have been developed for speeding up elastic distance computations: pruning and early abandoning. Pruning identifies, and avoids computing, parts of the cost matrix that cannot be on the optimal path (see Section 2). Early abandoning comes in two kinds. The first one, "computation abandoning", terminates the computation when some internal variable exceeds a threshold. The second one, "lower bounding", first computes a lower bound (an approximate minimal score), and only proceeds to compute the actual score if the lower bound is below a threshold. Early abandoning is useful in applications such as nearest neighbor search, for which it is possible to derive thresholds above which the precise score is not relevant.

In this paper we develop a new strategy, "EAPruned", that tightly integrates pruning and early abandoning. We investigate its effectiveness with six key elastic distances. To enable fair comparison, we implemented several versions of the distances in C++ and compared their run times using NN1 classification over the UCR archive (in its 85 univariate datasets version). We show that EAPruned offers significant speedups: around 7.62 times faster than a simple implementation, and around 2.88 times faster than an implementation with the usual computation abandoning scheme.
We also show, in the cases of DTW and CDTW, that lower bounding is complementary to EAPruned.

The rest of this paper is organised as follows. Section 2 presents the background on ensemble classifiers that leads us to consider a given set of elastic distances. It then presents these elastic distances and highlights their common structure. Section 3 presents EAPruned itself. We then present our experimental results in Section 4, and conclude in Section 5.

2 Background and Related Work

In this paper, we only consider univariate time series, although our technique is applicable to multivariate series. Any mention of "series" or "time series" should be understood as "univariate time series". We usually denote series with the letters Q (for a query), C (for a candidate), S, T and U, and their length with the letter L (using subscripts such as L_S to disambiguate the lengths of different series). Subscripts C_i are used to distinguish multiple candidates. A series S = (s_1, s_2, ..., s_{L_S}) has elements s_1, s_2, ..., s_{L_S}, where s_i is the i-th element of S, with 1 ≤ i ≤ L_S.

2.1 The Recent History of Ensemble Classifiers

Ensemble classifiers have recently revolutionized time series classification. The "Elastic Ensemble" ("EE", see [14]), introduced in 2015, was one of the first classifiers to be consistently more accurate than NN1-DTW over a wide variety of tasks. EE combines eleven NN1 classifiers based on eleven elastic distances (presented in Sections 2.3 and 2.5). Most elastic distances have parameters. At train time, EE fine-tunes them using cross-validation. At test time, the query's class is determined through a majority vote between the component classifiers. "Proximity Forest" ("PF", see [16]) uses the same set of distance measures as EE, but deploys them for nearest neighbor tests between a single exemplar of each class within an ensemble of random trees. Both EE and PF work solely in the time domain, leading to poor accuracies when discriminant features lie in other domains. This was addressed by their respective evolutions into "HIVE-COTE" ("HC", see [15]) and "TS-CHIEF" (see [24]), which combine more classifiers working in different domains (interval, shapelets, dictionary and spectral, see [15] for an overview).

While EE provided a qualitative jump in accuracy, it did so at the cost of a quantitative jump in computation time. For example, [29] reports 17 days to learn and classify the UCR archive's ElectricDevices benchmark ([4]). Indeed, in their simplest forms and given series of length L, elastic distances have O(L²) space and time complexities. This is only compounded by an extensive search for the best possible parametrization through leave-one-out cross-validation. For a training set of N time series, searching among M parameters (M = 100 for EE), EE has a training time complexity in O(M·N²·L²).

HIVE-COTE not only expands on EE, it actually embeds it as a component – or did. Recently, EE was dropped from HIVE-COTE (see [1]) because of its computational cost. Doing so caused an average drop of 0.6% in accuracy, and a drop of more than 5% on some datasets (tested on the UCR archive). The authors considered that this was a fair price to pay for the resulting speedup (they only report that the new version is "significantly faster").

2.2 Early Abandoning and Pruning

A double buffer technique [18] overcomes the space complexity, but the time complexity remains challenging.
Luckily, nearest neighbor search opens the door to early abandoning strategies. When searching for the nearest neighbor of a query Q under a distance D, we sequentially check Q against each candidate C_i from a database. Let S be the current nearest neighbor of Q. Early abandoning encompasses a set of techniques that abort before actually completing the computation of D(Q, C_i) if they can determine that D(Q, C_i) > D(Q, S) (i.e. C_i cannot be the nearest neighbor). The two main kinds of early abandoning techniques are "lower bounding" and "computation abandoning".
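This threshold-passing pattern can be sketched as follows (a minimal illustration of ours, not the paper's library code): the best-so-far distance D(Q, S) is handed to the distance function as a cut-off, and a distance that can prove it will exceed the cut-off simply returns +∞.

```cpp
// Minimal sketch (ours): NN1 search where the best-so-far distance acts as the
// cut-off D(Q, S). Any early-abandoning distance fits the 'dist' parameter, as
// long as it returns +infinity once it can prove D(Q, C_i) > cutoff.
#include <cstddef>
#include <limits>
#include <vector>

using Series = std::vector<double>;

template <typename Dist>
std::size_t nn1(const Series& query, const std::vector<Series>& database, Dist dist) {
  double best = std::numeric_limits<double>::infinity();  // D(Q, S), +inf until a first candidate is seen
  std::size_t best_index = 0;
  for (std::size_t i = 0; i < database.size(); ++i) {
    const double d = dist(query, database[i], best);       // may abandon early and return +inf
    if (d < best) { best = d; best_index = i; }
  }
  return best_index;
}
```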
Lower bounding employs lower bounds lb such that lb(Q, C_i) ≤ D(Q, C_i). It follows that if lb(Q, C_i) > D(Q, S) then D(Q, C_i) > D(Q, S), i.e. we do not need to compute D(Q, C_i) if the lower bound is already above D(Q, S). This allows C_i to be discarded even before starting to compute D(Q, C_i). To be effective, lower bounds must be computationally cheap while remaining as "tight" as possible, i.e. as close as possible to the actual result of the distance. Lower bounding has been shown to significantly speed up classification [10, 29] and similarity search [19]. Lower bounds have mainly been developed for DTW and CDTW, the most common being LB_Kim and LB_Keogh [23, 10]. They also exist for other elastic distances [29], and remain an active field of research.

The second way to early abandon is to stop the computation of D(Q, C_i) itself. We can give an intuition of how this works without diving into the details yet. While computing D(Q, C_i), a minimum value v ≤ D(Q, C_i) is maintained. The value v usually starts at 0 and increases toward D(Q, C_i). If (and as soon as) v > D(Q, S), then D(Q, C_i) > D(Q, S), allowing the computation to be abandoned.

Finally, recent progress has been made on the DTW distance. PrunedDTW [25] computes DTW while avoiding the computation of some alignments that cannot be on an optimal path [19], and was subsequently extended to incorporate early abandoning [26].

2.3 Elastic Distances with a Common Structure

EE uses eleven elastic distances. This section examines DTW, CDTW, WDTW, ERP, MSM and TWE, the computations of which follow a similar pattern and can be sped up by a common strategy. DDTW, DCTW and DWDTW are respectively DTW, CDTW and WDTW applied to the derivative of the series [11]. We do not examine these variants in this paper, but we note that they automatically benefit from our strategy. The two remaining distances, LCSS and SQED, are examined in Section 2.5.

All the distances compute an alignment cost between two series. For most distances, this alignment cost depends on a cost function, cost, applied to pairs of elements of the series. The squared Euclidean distance is often used, and this is the cost we use in all our implementations. However, other cost functions are possible, such as the L1 norm.

2.3.1 Dynamic Time Warping

The Dynamic Time Warping (DTW) distance was first introduced in 1971 by Sakoe and Chiba as a speech recognition tool [21, 20]. Compared to the classic Euclidean distance, DTW handles distortion and disparate lengths. Intuitively, DTW "aligns" two series by aligning their points. The end points must be aligned, and point-alignments must not cross each other. Figure 1a illustrates how DTW aligns two series S and T, where the cost function is the squared Euclidean distance. DTW computes the minimal (optimal) alignment cost. Note that several optimal alignments with the same cost may exist.

To compute the optimal alignment cost, DTW computes an optimal alignment, represented by an optimal "warping path" in the (L+1) × (L+1) DTW cost matrix M_DTW. Its cells represent the cumulative optimal cost of aligning the series according to Equation 1d; Equations 1a to 1c describe the boundary conditions. Figure 1b shows the cost matrix for two series S and T along with the optimal warping path. DTW(S, T) = M_DTW(L_S, L_T).

  M_DTW(0, 0) = 0                                                                   (1a)
  M_DTW(i, 0) = +∞                                                                  (1b)
  M_DTW(0, j) = +∞                                                                  (1c)
  M_DTW(i, j) = cost(s_i, t_j) + min{ M_DTW(i−1, j−1), M_DTW(i−1, j), M_DTW(i, j−1) }   (1d)
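For concreteness, here is a minimal sketch (ours, not the paper's implementation) of Equations 1 with the squared Euclidean point-wise cost; it uses the full O(L²) matrix, before the linear-space and pruning refinements discussed later.

```cpp
// Minimal sketch of Equations 1: plain DTW with the squared point-wise cost.
// O(L^2) time and space; the optimisations of Sections 2.4 and 3 come later.
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

double dtw(const std::vector<double>& S, const std::vector<double>& T) {
  const double INF = std::numeric_limits<double>::infinity();
  const std::size_t n = S.size(), m = T.size();
  // (n+1) x (m+1) cost matrix; borders set to +infinity except M(0,0) = 0 (Eq. 1a-1c).
  std::vector<std::vector<double>> M(n + 1, std::vector<double>(m + 1, INF));
  M[0][0] = 0.0;
  for (std::size_t i = 1; i <= n; ++i) {
    for (std::size_t j = 1; j <= m; ++j) {
      const double d = S[i - 1] - T[j - 1];
      // Eq. 1d: point cost plus the cheapest of the diag, top and left dependencies.
      M[i][j] = d * d + std::min({M[i - 1][j - 1], M[i - 1][j], M[i][j - 1]});
    }
  }
  return M[n][m];  // DTW(S, T) = M(L_S, L_T)
}
```

On the running example S = (3, 1, 4, 4, 1, 1) and T = (1, 3, 2, 1, 2, 2), this yields 9, matching Figure 1.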
Figure 1: M_DTW(S,T) with warping path and alignments for S = (3, 1, 4, 4, 1, 1) and T = (1, 3, 2, 1, 2, 2). We have DTW(S, T) = M_DTW(S,T)(6, 6) = 9. (a) DTW alignments with costs, plotting series S against series T with their point-alignments. (b) DTW cost matrix with warping path:

          T:    1    3    2    1    2    2
  S = 3  i=1    4    4    5    9   10   11
  S = 1  i=2    4    8    5    5    6    7
  S = 4  i=3   13    5    9   14    9   10
  S = 4  i=4   22    6    9   18   13   13
  S = 1  i=5   22   10    7    7    8    9
  S = 1  i=6   22   14    8    7    8    9

Taking the diagonal means a new alignment between two new points. Going down or right means that one point is reused. For example, in Figure 1b, the fifth point of S is aligned with the third, fourth and fifth points of T. This can also be visualised in Figure 1a, with three dotted lines reaching the fifth point of S. An alignment requires the end points to be aligned. This means that a valid warping path starts from the top left cell and reaches the bottom right cell.

One approach to speeding up elastic distances is through approximation, e.g. FastDTW (see [22], and [31] for a counterpoint). Our approach computes exact DTW (and other distances), which is useful when approximation is not desirable. Comparing approximate approaches with exact ones is left for future research.

In the following, we will use notions of direction (left, right, previous, up, top, diagonal, etc.). They should be understood in the context of a matrix similar to the one depicted in Figure 1b. A diagonal step always means a new alignment between two new points. We call such an alignment a "canonical alignment". Going down or right (i.e. with a top or left dependency) represents an "alternate alignment". Alternate alignments happen either because the canonical alignment is too expensive, or because the series have differing lengths. Note that the alternate cases are always symmetric. This is reflected in the commutative nature of the distances regarding their series arguments (e.g. DTW(S, T) = DTW(T, S)).

2.3.2 Constrained DTW

In CDTW, the "constrained" variant of DTW, the warping path is restricted to a subarea of the matrix. Different constraints exist [20, 8]. We focus on the popular Sakoe-Chiba band [20], also known as the "warping window" (or "window" for short). This constraint can also be applied to other distances such as ERP and LCSS (Sections 2.3.4 and 2.5.2). The window is a parameter w controlling how far the warping path can deviate from the diagonal. Given a matrix M, a line index 1 ≤ l ≤ L and a column index 1 ≤ c ≤ L, we require |l − c| ≤ w. Figure 2a illustrates the state of M_CDTW(S, T) for a window of 1. Notice how the path can only step one cell away from each side of the diagonal. The resulting cost is 11, which is greater than the cost of 9 obtained earlier by unconstrained DTW (see Figure 1). Due to the window constraint, the cost of CDTW is always greater than or equal to DTW's.
Figure 2: Example of CDTW cost matrices with a window of 1. (a) CDTW cost matrix with w = 1 for S and T of equal length; cells outside the band (shown as ".") are not computed:

          T:    1    3    2    1    2    2
  S = 3  i=1    4    4    .    .    .    .
  S = 1  i=2    4    8    5    .    .    .
  S = 4  i=3    .    5    9   14    .    .
  S = 4  i=4    .    .    9   18   18    .
  S = 1  i=5    .    .    .    9   10   11
  S = 1  i=6    .    .    .    .   10   11

(b) CDTW cost matrix with w = 1 for S and a longer series U = (1, 3, 2, 1, 2, 2, 4, 4) of disparate length: the window is too small to allow an alignment between the last two points at (6, 8).

Figure 2b shows what happens with a window too small for series of disparate lengths: the last two points cannot be aligned. Given two series S and T of respective lengths L_S and L_T, an alignment is possible only with a window w ≥ |L_S − L_T|. A window of 0 is akin to the squared Euclidean distance, while a window of L is equivalent to DTW (no constraint). With a correctly set window, NN1-CDTW can achieve better accuracy than NN1-DTW by preventing spurious alignments. It is also faster to compute than DTW, as cells beyond the window are ignored. However, the window parameter must be set. A technique exploiting several properties of CDTW (e.g. increasing the window can only lead to a smaller or equal cost) has been developed to learn this parameter faster [28].

2.3.3 Weighted DTW

The Weighted Dynamic Time Warping (WDTW, see [9]) distance is another variant of DTW. If CDTW can be seen as a hard constraint on the warping paths, WDTW can be seen as a soft one. The cells (l, c) of the M_WDTW cost matrix are weighted according to their distance to the diagonal d = |l − c|. The larger the weight, the more the cell is penalized and the less likely it is to be part of an optimal path. The weight is computed according to Equation 2, where g controls the penalization. Its optimal range is 0.01 – 0.6 [9].

  w(d) = 1 / (1 + exp(−g × (d − L/2)))                                              (2)
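Since the weight depends only on d = |l − c|, it can be precomputed once per series length; a small illustrative sketch of ours follows.

```cpp
// Sketch of Equation 2: precompute the WDTW weights for every possible distance
// to the diagonal of series of length L. 'g' controls the penalization.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<double> wdtw_weights(std::size_t L, double g) {
  std::vector<double> w(L);
  for (std::size_t d = 0; d < L; ++d) {
    w[d] = 1.0 / (1.0 + std::exp(-g * (static_cast<double>(d) - static_cast<double>(L) / 2.0)));
  }
  return w;
}
// In the WDTW recurrence, the point-wise cost of cell (l, c) is then scaled by
// weights[|l - c|], penalizing cells far from the diagonal.
```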
2.3.4 Edit Distance with Real Penalty

The Edit Distance with Real Penalty (ERP, see [3]) is actually a metric fulfilling the triangular inequality. It takes as parameters a warping window (see CDTW, Section 2.3.2) and a "gap value" g. Intuitively, ERP aligns the points in the case of canonical alignments, and inserts a "gap" at a cost based on g in the case of alternate alignments. The authors suggest a gap value of 0, but EE still fine-tunes it, along with the window size. ERP is described by Equations 3. Again, the cost function is usually the squared Euclidean distance, but other norms are allowed. Compared to DTW, the borders are computed (Equations 3b and 3c) rather than being set to +∞. The cost function is now specialized per branch of the min function.

  M_ERP(0, 0) = 0                                                                   (3a)
  M_ERP(i, 0) = M_ERP(i−1, 0) + cost(s_i, g)                                        (3b)
  M_ERP(0, j) = M_ERP(0, j−1) + cost(g, t_j)                                        (3c)
  M_ERP(i, j) = min{ M_ERP(i−1, j−1) + cost(s_i, t_j),
                     M_ERP(i−1, j)   + cost(s_i, g),
                     M_ERP(i, j−1)   + cost(t_j, g) }                               (3d)

2.3.5 Move-Split-Merge

Move-Split-Merge (MSM, see [27]) is another metric developed to overcome shortcomings of other elastic distances. Compared to ERP, it is robust to translation (in ERP, the gap cost is given by cost(s_i, g); if S is translated, the gap cost changes while the canonical alignment cost remains unchanged, making ERP translation-sensitive). The authors showed that MSM is competitive against DTW, CDTW and ERP for NN1 classification.

MSM (Equations 5) defines its own cost function C_c (Equation 4), used to compute the cost of the alternate alignments. It takes as arguments the new point (np) of the alternate alignment, and the two previously considered points (x and y) of each series. C_c also takes a real number c as a constant penalty. This penalty is the only parameter of MSM (and is fine-tuned by EE).

  C_c(np, x, y) = c                                  if x ≤ np ≤ y or x ≥ np ≥ y
                = c + min{ |np − x|, |np − y| }      otherwise                      (4)

  M_MSM(0, 0) = 0                                                                   (5a)
  M_MSM(i, 0) = +∞                                                                  (5b)
  M_MSM(0, j) = +∞                                                                  (5c)
  M_MSM(i, j) = min{ M_MSM(i−1, j−1) + |s_i − t_j|,
                     M_MSM(i−1, j)   + C_c(s_i, s_{i−1}, t_j),
                     M_MSM(i, j−1)   + C_c(t_j, s_i, t_{j−1}) }                     (5d)
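The case analysis of Equation 4 is compact enough to show directly; the following is an illustrative sketch of ours.

```cpp
// Sketch of Equation 4: the MSM split/merge cost. 'np' is the new point of the
// alternate alignment, 'x' and 'y' are the two previously considered points, and
// 'c' is the constant penalty (the only MSM parameter).
#include <algorithm>
#include <cmath>

double msm_cost(double np, double x, double y, double c) {
  if ((x <= np && np <= y) || (x >= np && np >= y)) {
    return c;                                              // np lies between x and y
  }
  return c + std::min(std::fabs(np - x), std::fabs(np - y));
}
```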
2.3.6 Time Warp Edit Distance

The Time Warp Edit Distance (TWE, see [17]) is motivated by the idea that timestamps (i.e. when a value is recorded) should be taken into account. This is important for series with non-uniform sampling rates. Note that our current implementation does not use timestamps, also assuming a constant sampling rate. The i-th timestamp of a series S is denoted by τ_{s,i}.

TWE (Equations 7) defines its cost functions (Equations 6) with two parameters. The first one, ν, is a "stiffness" parameter weighting the timestamp contribution; ν = 0 makes TWE similar to DTW. The second one, λ, is a constant penalty added to the cost of the alternate alignments ("delete" in TWE terminology: deleteA for the rows, deleteB for the columns). The cost of an alternate case is the cost between the current and the previous point of the same series, plus ν times their timestamp difference, plus the λ penalty. The canonical alignment ("match") cost is the sum of the cost between the two current and the two previous points, plus a ν-weighted contribution of their respective timestamp differences.

  match:   γ_M = cost(s_i, t_j) + cost(s_{i−1}, t_{j−1}) + ν(|τ_{s,i} − τ_{t,j}| + |τ_{s,i−1} − τ_{t,j−1}|)
  deleteA: γ_A = cost(s_i, s_{i−1}) + ν|τ_{s,i} − τ_{s,i−1}| + λ                    (6)
  deleteB: γ_B = cost(t_j, t_{j−1}) + ν|τ_{t,j} − τ_{t,j−1}| + λ

  M_TWE(0, 0) = 0                                                                   (7a)
  M_TWE(i, 0) = +∞                                                                  (7b)
  M_TWE(0, j) = +∞                                                                  (7c)
  M_TWE(i, j) = min{ M_TWE(i−1, j−1) + γ_M,
                     M_TWE(i−1, j)   + γ_A,
                     M_TWE(i, j−1)   + γ_B }                                        (7d)
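A small illustrative sketch of ours for Equations 6, under the constant-sampling-rate assumption used in our implementation (the timestamp of the i-th point is simply i):

```cpp
// Sketch of Equations 6 with the squared point-wise cost and timestamps equal to
// the indices (constant sampling rate). The series are passed 1-indexed with a
// padding value at position 0 (a common TWE convention), so s[k] holds s_k.
#include <cmath>
#include <cstddef>
#include <vector>

struct TweCosts { double match; double deleteA; double deleteB; };

TweCosts twe_costs(const std::vector<double>& s, const std::vector<double>& t,
                   std::size_t i, std::size_t j, double nu, double lambda) {
  auto sq = [](double x) { return x * x; };
  TweCosts g;
  // match: costs of the two current and the two previous points, plus the
  // nu-weighted timestamp differences |i - j| and |(i-1) - (j-1)| (equal here).
  g.match = sq(s[i] - t[j]) + sq(s[i - 1] - t[j - 1])
          + nu * 2.0 * std::fabs(static_cast<double>(i) - static_cast<double>(j));
  // deleteA / deleteB: consecutive points of the same series are one time step apart.
  g.deleteA = sq(s[i] - s[i - 1]) + nu + lambda;
  g.deleteB = sq(t[j] - t[j - 1]) + nu + lambda;
  return g;
}
```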
2.4 A Common Structure

These six distances share a common structure, described by Equations 8. For a distance D, these equations assign a value to every cell of a cost matrix M_D. Given a series S "along the rows" and a series T "along the columns", the 0-indexed matrix M_D has dimension (L_S + 1) × (L_T + 1). M_D(i, j) is the cumulative cost of aligning the first i points of S with the first j points of T. It follows that the last cell, at (L_S, L_T), holds the actual cost computed by D.

The first three equations describe the border initialization of the matrix. Equation 8a initialises the top left corner with 0. Equation 8b initialises the left vertical border, and Equation 8c initialises the top horizontal border. The borders must be monotonically non-decreasing, e.g. M_D(i, 0) ≤ M_D(i+1, 0). Among the presented distances, only ERP has computed borders; the borders of all the others are initialised to +∞.

The last equation (8d) is the main one. A cell is computed by taking the minimum over three different alignment costs, each added to its respective dependency. The canonical alignment ("Canonical") cost depends on the top left diagonal cell ("diag" dependency). The vertical alternate alignment ("AlternateLine") cost depends on the cell above ("top" dependency). The horizontal alternate alignment ("AlternateColumn") cost depends on the cell to the left ("left" dependency). This process amounts to computing an optimal alignment (with minimum cost), represented by an optimal warping path in the cost matrix. For example, in Figure 1b, the cell (4, 2) is part of the optimal path, meaning that the fourth value of S is aligned with the second value of T.

  M_D(0, 0) = 0                                                                     (8a)
  M_D(i, 0) = InitVBorder, such that M_D(i, 0) ≤ M_D(i+1, 0)                        (8b)
  M_D(0, j) = InitHBorder, such that M_D(0, j) ≤ M_D(0, j+1)                        (8c)
  M_D(i, j) = min{ M_D(i−1, j−1) + Canonical,
                   M_D(i−1, j)   + AlternateLine,
                   M_D(i, j−1)   + AlternateColumn }                                (8d)

This generic structure guarantees the following properties:

• An optimal path (with minimum cost) is computed.
• Series extremities are aligned with each other ((1, 1) and (L_S, L_T)).
• The path is continuous and monotonous, i.e. for two of its consecutive cells (i1, j1) ≠ (i2, j2), we have i1 ≤ i2 ≤ i1 + 1 and j1 ≤ j2 ≤ j1 + 1.

This structure hints at a linear space complexity. All dependencies of a cell are satisfied either by the previous row (diag and top dependencies), or by the previously computed cell on the same row (left dependency). It follows that if we proceed row by row, only two rows of the matrix are required at any time, achieving linear space complexity. We can further contain the memory footprint by using the shortest series as the "column series", minimizing the row length.

Algorithm 1 implements Equations 8 with the two-row technique. We have two arrays representing the current row (curr) and the previous row (prev). The arrays are swapped (line 6) at each iteration of the outer loop. The horizontal border is stored in the curr array (line 4), the swap operation putting it in prev so that it can act as the previous row when computing the first one. The vertical border is gradually computed for each new row (line 7). Finally, the inner loop (line 8) computes, from left to right, the value of the cells of the current row.

Algorithm 1: Generic linear space complexity for a distance D.
  Input: the time series S and T
  Result: Cost D(S, T)
  1   co ← shortest series between S and T
  2   li ← longest series between S and T
  3   (prev, curr) ← arrays of length L_co + 1; curr[0] ← 0
  4   curr[1 .. L_co] ← InitHBorder
  5   for i ← 1 to L_li do
  6       swap(prev, curr)
  7       curr[0] ← InitVBorder
  8       for j ← 1 to L_co do
  9           curr[j] ← min( prev[j−1] + Canonical,
                             prev[j]   + AlternateLine,
                             curr[j−1] + AlternateColumn )
  10  return curr[L_co]
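To make the genericity concrete, here is a minimal C++ sketch (ours; the actual library specializes each distance) of Algorithm 1, where the three alignment costs and the two border initialisers are supplied as callables over the cell indices.

```cpp
// Sketch of Algorithm 1: generic two-row (linear space) evaluation of Equations 8.
// The callables receive 1-based cell indices and typically capture the two series.
#include <algorithm>
#include <cstddef>
#include <vector>

template <typename Canonical, typename AltLine, typename AltCol,
          typename VBorder, typename HBorder>
double generic_distance(std::size_t L_rows, std::size_t L_cols,
                        Canonical canonical, AltLine alt_line, AltCol alt_col,
                        VBorder init_vborder, HBorder init_hborder) {
  std::vector<double> prev(L_cols + 1), curr(L_cols + 1);
  curr[0] = 0.0;
  for (std::size_t j = 1; j <= L_cols; ++j) curr[j] = init_hborder(j);   // horizontal border
  for (std::size_t i = 1; i <= L_rows; ++i) {
    std::swap(prev, curr);                   // the previous row is now in 'prev'
    curr[0] = init_vborder(i);               // vertical border, one value per row
    for (std::size_t j = 1; j <= L_cols; ++j) {
      curr[j] = std::min({prev[j - 1] + canonical(i, j),    // diag dependency
                          prev[j]     + alt_line(i, j),     // top dependency
                          curr[j - 1] + alt_col(i, j)});    // left dependency
    }
  }
  return curr[L_cols];
}
```

For DTW, for instance, all three callables return the same squared difference of the aligned points and both borders return +∞; WDTW additionally scales that cost by the weight of Equation 2, while ERP replaces the alternate costs and the borders with gap-based ones (the running-sum borders of Equations 3b and 3c then need a stateful or prefix-sum callable).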
Algorithm 2 builds upon Algorithm 1, adding support for a warping window w and the usual early abandoning technique with a cut-off value. Hence, it is a suitable base for early-abandoned CDTW and ERP. These two changes being orthogonal to each other, removing the window-handling code yields the base for early abandoning all the other distances presented so far.

When dealing with a warping window w, we first check whether it permits an alignment at all (line 3) – see Figure 2b for a graphical representation of a window too small. The horizontal border is only initialized up to w + 1 (line 6), and the inner loop is capped within the window around the diagonal (from jStart to jStop, lines 9, 10 and 13). The vertical border is only computed while the window covers the first column (line 11).

More interestingly, the arrays are now initialized to +∞. To understand why, let us consider the cell (2, 3) of M_CDTW in Figure 2a. The cell (1, 3) is accessed to compute the AlternateLine case. However, this cell is outside the window and should not contribute to the minimum. By initializing the arrays to +∞, we implicitly set this cell, and all of the upper right triangle outside the window, to +∞. The lower triangle is implicitly set to +∞ at line 11, after the window stops covering the first column. This explains why we assign to curr[jStart − 1] and not to curr[0]. Only the diagonal edge of the triangle is set to +∞, which is enough for the next cell's AlternateColumn case to be properly ignored (e.g. in Figure 2a, the cell (3, 2) ignoring the cell (3, 1)).

Algorithm 2: Generic distance with window and early abandoning.
  Input: the time series S and T, a warping window w, a cut-off point cutoff
  Result: Cost D(S, T)
  1   co ← shortest series between S and T
  2   li ← longest series between S and T
  3   if w < L_li − L_co then return +∞
  4   (prev, curr) ← arrays of length L_co + 1 filled with +∞
  5   curr[0] ← 0
  6   curr[1 .. w+1] ← InitHBorder
  7   for i ← 1 to L_li do
  8       swap(prev, curr)
  9       jStart ← max(1, i − w)
  10      jStop ← min(i + w, L_co)
  11      curr[jStart − 1] ← if jStart = 1 then InitVBorder else +∞
  12      minv ← +∞
  13      for j ← jStart to jStop do
  14          v ← min( prev[j−1] + Canonical,
                       prev[j]   + AlternateLine,
                       curr[j−1] + AlternateColumn )
  15          minv ← min(minv, v)
  16          curr[j] ← v
  17      if minv > cutoff then return +∞
  18  return curr[L_co]

2.5 Other Distances

The two remaining distances from EE are the squared Euclidean distance and the longest common sub-sequence distance. They do not fit the common structure described in Section 2.4, nor the approach described in Section 3. Hence, we treat them fully here.

2.5.1 Squared Euclidean Distance

The squared Euclidean distance is exactly what you would expect. It requires series of equal length (i.e. it is not elastic), returning a score of +∞ otherwise. Algorithm 3 presents an early-abandoned squared Euclidean distance; our C++ implementation is based on it. As mentioned before, this distance is equivalent to CDTW (Section 2.3.2) with a window of 0.

Algorithm 3: Squared Euclidean Distance.
  Input: the time series S and T, a cut-off point cutoff
  Result: sqed(S, T)
  1   if L_S ≠ L_T then return +∞
  2   acc ← 0
  3   for i ← 1 to L_S do
  4       acc ← acc + (S_i − T_i)²
  5       if acc > cutoff then return +∞
  6   return acc
2.5.2 Longest Common Sub-Sequence

The Longest Common Sub-Sequence distance (LCSS, see [7]) is another elastic distance, more commonly used for string matching. Time series are not usually made of discrete characters that can be compared for equality, so LCSS considers two values equal if they are within a threshold ε. LCSS is described by Equations 9.

  M_LCSS(0, 0) = 0                                                                  (9a)
  M_LCSS(i, 0) = 0                                                                  (9b)
  M_LCSS(0, j) = 0                                                                  (9c)
  M_LCSS(i, j) = 1 + M_LCSS(i−1, j−1)                                      if |s_i − t_j| ≤ ε
               = max{ M_LCSS(i−1, j−1), M_LCSS(i−1, j), M_LCSS(i, j−1) }   otherwise    (9d)

Contrary to the other distances, LCSS maximizes a score, a higher score indicating greater similarity. To be usable with nearest neighbor classifiers, its result is transformed using Equation 10. The M_LCSS(L_S, L_T) score is first normalized by dividing it by the length of the shortest series, i.e. the maximum number of possible matches. Then, it is subtracted from 1, returning a score between 0 and 1 with the desired property (lower means greater similarity).

  LCSS(S, T) = 1 − M_LCSS(L_S, L_T) / min(L_S, L_T)                                 (10)

Algorithm 4: LCSS with warping window and early abandoning.
  Input: the time series S and T, a threshold ε, a warping window w, a cut-off point cutoff
  Result: Cost LCSS(S, T)
  1   co ← shortest series between S and T
  2   li ← longest series between S and T
  3   if w < L_li − L_co then return +∞
  4   (prev, curr) ← arrays of length L_co + 1 filled with 0
  5   toreach ← ⌈(1 − cutoff) × L_co⌉
  6   maxv ← 0
  7   for i ← 1 to L_li do
  8       if maxv + (L_li − i) < toreach then return +∞
  9       swap(prev, curr)
  10      jStart ← max(1, i − w)
  11      jStop ← min(i + w, L_co)
  12      curr[jStart − 1] ← 0
  13      for j ← jStart to jStop do
  14          if |li_i − co_j| ≤ ε then
  15              v ← prev[j−1] + 1
  16              curr[j] ← v
  17              maxv ← max(maxv, v)
  18          else
  19              curr[j] ← max( prev[j−1], prev[j], curr[j−1] )
  20  return curr[L_co]
Algorithm 4 presents an early-abandoned version of LCSS (and how our C++ implementation works). To early abandon, the cut-off value (which must be between 0 and 1) is turned into a number of matches that must be reached (line 5). We early abandon if the highest value of a row, added to the number of remaining potential matches, is below the required value (line 8). Note that the number of remaining matches is given by min(L_co, L_li − i). However, because the score is normalized with respect to L_co, we can only early abandon when fewer than L_co lines remain (L_li − i < L_co). Hence, the minimum operation is not required.

3 Pruning and Early Abandoning

We saw in Section 2.4 the idea behind early abandoning. We now introduce the idea of pruning. Given a cost matrix like the one presented in Figure 1b, pruning refers to avoiding the computation of some of the cells. As we still proceed row by row (as in Algorithm 1), pruning all the cells in a row leads to early abandoning.

In this section, we present how to prune and early abandon the distances based on the common structure described in Section 2.4. We present our strategy, "EAPruned", in a generic manner: all the distances based on the common structure can benefit from it. We present and illustrate the idea behind EAPruned using DTW; however, we present the algorithm with a warping window and computed borders.

Pruning was first developed for DTW in 2016 with PrunedDTW [25]. It aimed at speeding up all-pairwise calculations of DTW. In this context, no cut-off value is provided. Instead, PrunedDTW computes its own cut-off value based on the diagonal of the cost matrix. In the case of DTW with series of equal length, this amounts to the squared Euclidean distance. With disparate lengths, the final corner can be reached by completing the diagonal with the last column. Regardless of the distance, these cut-off computations are all similar to Algorithm 3, i.e. they have O(L) time complexity and O(1) space complexity. The cut-off is an upper bound ub on the actual DTW cost, i.e. the actual DTW cost is smaller than or equal to ub. PrunedDTW uses ub not to early abandon, but to skip the computation of some alignments (i.e. some cells of the cost matrix) that we know will be above it.

PrunedDTW was later used for similarity search [26] with early abandoning. However, its core was not revisited, and early abandoning was implemented in the classical way (as in Algorithm 2).

We revisited the idea behind PrunedDTW and developed a strategy that tightly integrates early abandoning and pruning. In addition, it delivers more pruning than PrunedDTW and is carefully designed for computational efficiency. It is organized in successive stages in which we avoid computing unnecessary dependencies. EAPruned itself is presented in Algorithm 5; the strategy is presented in two parts, "Pruning from the left" and "Pruning on the right".

3.1 Pruning From the Left

As stated earlier, we compute the cost matrix row by row, from left to right. While computing a row, we look at discarding as many as possible of the leftmost cells of the matrix. We define a "discard zone" as the columns topped by a "discard point". In a row, discard points are all the cells preceding the first cell with a cost below the cut-off (i.e. a cell that can still be part of an optimal warping path), and that are not already part of the discard zone. Discard points and all the cells in the discard zone have a cost strictly greater than the cut-off.
In Figures 3a and 3b, the white cells with red borders are discard points, and the discard zone is made of the grey columns they top. Notice how the cell (1, 5) is greater than the cut-off, but is not a discard point: it lies to the right of (1, 1), which has a cost below the cut-off. Besides the discard points, no cell in the discard zone is ever evaluated: they are pruned.

For DTW, the discard zone starts with the left border at (1, 0). The cell at (2, 0) is not a discard point, as it is already in the discard zone. This is due to the DTW borders being initialized to +∞, except for (0, 0) = 0. When the border is computed, the discard zone starts at the first line i such that M_D(S,T)(i, 0) > cutoff. If a window w is used, the discard zone starts at line w + 1, unless it started earlier (see Algorithm 5, line 14).
Figure 3: Illustration of "from the left" pruning on M_DTW(S,T) (the matrix of Figure 1b) with two different cut-off values. (a) cutoff = 9: the discard points are (3, 1) and (5, 2); the arrows represent the dependencies of the cell (4, 1). (b) cutoff = 6: the discard points are (3, 1), (5, 2), and (5, 3) up to (5, 6); early abandoning occurs at the end of the fifth line.

To explain this pruning scheme, we examine a cell's dependencies, as illustrated in Figure 3a for the cell (4, 1). As before (see Section 2.4), we give them the names "left", "top" and "diag". All dependencies of (4, 1) are discarded (either in the discard zone or a discard point). A cell's cost can only be greater than (or equal to) its smallest dependency. It follows that if all of a cell's dependencies are discarded, the cell can be discarded. Starting on the left border, a cell only has a top dependency. Hence, as soon as a border cell is above the cut-off (i.e. (0, 1)), the remainder of the column can be discarded. In turn, this creates the same conditions for the next column, starting from the next row. We only need to check the top dependency for cells (i > 1, 1), stopping as soon as possible. The cell (3, 1) is above the cut-off, allowing us to discard the remainder of the column.

When proceeding row by row, pruning from the left is implemented by starting the next row after the discarded columns. Discarding all the columns in a row, as in Figure 3b, leads to early abandoning. Note that in this case, the discard points from (5, 3) up to (5, 6) need to check both their top and diag dependencies.

3.2 Pruning on the Right

In the previous section, we extended a discard zone from the left to the right, leading to early abandoning when it reaches the end of a row. This strategy relies on the row-by-row evaluation order, so we cannot do the same starting from the top border, as this would require a column-by-column order. We have to keep pruning inside a row. Looking back at Figure 3b with a cut-off value of 6, we can find such cells. The cell at (1, 4) is above the cut-off. In conjunction with the top border, the following cells can only be above the cut-off. Dealing with this kind of scenario allows us to "prune on the right".

Pruning on the right is not as straightforward as pruning from the left. The main reason is that pruning from the left allows us to discard a column once and for all. Still, the idea is similar. In the previous section we mentioned that a discard point creates the "same conditions for the next column, starting from the next row", allowing the pruning process to continue. When pruning on the right, we have to identify a condition akin to the top border. This condition is a continuous block of cells above the cut-off value, reaching the end of their row.
Figure 4: Illustration of "from the left" and "on the right" pruning on M_DTW(S,T) with two different cut-off values. The blue cell represents the point of early abandoning, white cells represent discard points, and dark gray cells represent pruning points. (a) cutoff = 9: the discard points are (3, 1) and (5, 2), and the pruning points are (1, 5), (3, 6) and (4, 4); no early abandoning occurs. (b) cutoff = 6: the discard points are (3, 1) and (5, 2), the pruning points are (1, 4), (2, 6), (3, 3) and (4, 3), and early abandoning occurs at (5, 3).

In the case of the top border, this block starts at (0, 1). In Figure 3b, the second one starts at (1, 4). The start of such a block is called a "pruning point". See Figure 4, where they are represented in dark grey with red borders. Pruning points give information to the next row (similar to discard points, which instruct the next row to skip their columns). In the next row, as soon as we find a cell above the cut-off and after the pruning point, the remainder of the row can be dropped. Note that the cell just under the pruning point is a special case, as its diag dependency is not discarded by the pruning point (e.g. the cell (1, 1) in Figure 4b).

The tricky part is that pruning points move back and forth across rows. To determine their position, a cell (i, j) with a cost below the cut-off always assumes that the next cell is the row's pruning point. If (i, j + 1) > cutoff, and so on up to (i, L_co), then (i, j + 1) indeed is a pruning point. Else, we assume that (i, j + 2) is the pruning point, repeating the same logic. Note that all rows (up to an early abandoning row) have one and only one pruning point. It may be out of bounds, e.g. at L_co + 1, in which case it does not appear in the figures.

Let us look at some examples in Figure 4b.

• The cell (2, 2) > cutoff is not a pruning point, because it does not start a continuous block of cells above cutoff reaching the end of the row.
• The cell (3, 3) and all its following cells are above cutoff. The row must be fully evaluated to ensure that (3, 3) indeed is the row's pruning point. It contributes to enabling pruning on the fourth row, starting at (4, 4).

Finally, we have to address what happens when both pruning strategies "collide". Figure 4b illustrates such a collision in blue. Without pruning on the right, the cell (5, 3) would be a discard point. Because it is below a pruning point, we know that the rest of the row is above the cut-off. EAPruned prunes a total of 16 cells out of 36. The usual early abandoning strategy presented in Section 2.4 would only prune the last line, i.e. 6 cells.
Also, if the cell at (1, 1) is above cutoff, EAPruned immediately early abandons, whereas the classical way either evaluates the full row or requires a special case for this cell. As expected, Figure 4a does not show any early abandoning. In this case, EAPruned still prunes 5 cells.

3.3 The EAPruned Algorithm

We now present the EAPruned algorithm (Algorithm 5), which is compatible with any distance matching the common structure described in Section 2.4. We present the algorithm with a window and computed borders, covering the most complex case. Our implementations are specialized for each distance's needs.

Our algorithm exploits the properties resulting from pruning to further reduce the computational effort. For example, cells after the pruning point from the previous row can ignore their top and diag dependencies, along with the associated cost computations, as these are known to be above cutoff: only the left dependency needs to be computed. To do so, we split the computation of a row into several stages. For each stage, we indicate the dependencies that must be checked.

1. Compute the value of the left border (computed or outside the window). A computed border may require the top dependency.
2. Compute discard points until a non-discard point or the pruning point. Check the diag and top dependencies.
3. Compute non-discard points until the pruning point. Check all dependencies.
4. Deal with the cell at the pruning point. Check the diag dependency (always) and the left dependency (unless the left cell was a discard point).
5. Compute the cells after the pruning point. Check the left dependency.

In Algorithm 5, the discard points are represented by the next_start variable. The position of the pruning point from the previous row, used to prune the current row, is represented by the pruning_point variable. Finally, the next_pruning_point variable represents the pruning point being computed in the current row, to be used when computing the next row. To do so, its value is assigned to pruning_point once a row is done (line 39).

Beginning a new row, we first determine the indices of the first and last columns. Then, the first stage updates the left border, computing its value or applying the window (line 14). The second stage (line 16) computes discard points, which requires the row to be bordered on the left by a value above cutoff (tested at line 17). The condition in the for loop ensures that all discard points form a continuous block. As soon as a value below cutoff is found, we move on to the next stage, as we cannot have any more discard points. As explained in the previous section, next_pruning_point is set (line 20) to the next column index. Note that next_start can only be updated in the second stage, while next_pruning_point must be checked for update after every cost computation.

The third stage (line 21) computes the cost taking all dependencies into account, until we reach pruning_point. If pruning_point is out of bounds, the third stage completes the row and the following stages are skipped.

We then enter the fourth stage (line 25). If we are still within bounds, two cases are possible: either the left cell is a discard point, or it is not. If it is a discard point (line 27), then we only need to check the diag dependency, early abandoning if the cost is above cutoff. If it is not (line 30), both the left and diag dependencies are checked, and no early abandoning is possible. Finally, we early abandon if we are out of bounds and the discard points reach the end of the row (line 34).
At the fifth and final stage (line 35), only the left dependency is checked, as both the diag and top dependencies are known to be above cutoff. The loop stops as soon as we find a cost above cutoff, effectively pruning the rest of the row.
Algorithm 5: Generic EAPruned Algorithm.
  Input: the time series S and T, a warping window w, a cut-off point cutoff
  Result: Cost D(S, T)
  1   co ← shortest series between S and T
  2   li ← longest series between S and T
  3   if w < L_li − L_co then return +∞
  4   (prev, curr) ← arrays of length L_co + 1 filled with +∞
  5   curr[0] ← 0
  6   curr[1 .. w+1] ← InitHBorder
  7   next_start ← 1
  8   pruning_point ← 1
  9   for i ← 1 to L_li do
  10      swap(prev, curr)
  11      jStart ← max(i − w, next_start)
  12      jStop ← min(i + w, L_co)
  13      j ← jStart
  14      /* Stage 1: init the vertical border */
  15      curr[jStart − 1] ← if jStart = 1 then InitVBorder else +∞
  16      /* Stage 2: discard points up to, excluding, the pruning point */
  17      if curr[jStart − 1] > cutoff then
  18          for j to pruning_point − 1 while j = next_start do
  19              curr[j] ← min( prev[j−1] + Canonical, prev[j] + AlternateLine )
  20              if curr[j] ≤ cutoff then next_pruning_point ← j + 1 else next_start++
  21      /* Stage 3: continue up to, excluding, the pruning point */
  22      for j to pruning_point − 1 do
  23          curr[j] ← min( prev[j−1] + Canonical, prev[j] + AlternateLine, curr[j−1] + AlternateColumn )
  24          if curr[j] ≤ cutoff then next_pruning_point ← j + 1
  25      /* Stage 4: at the pruning point */
  26      if j ≤ jStop then
  27          if j = next_start then
  28              curr[j] ← prev[j−1] + Canonical
  29              if curr[j] ≤ cutoff then next_pruning_point ← j + 1 else return +∞
  30          else
  31              curr[j] ← min( prev[j−1] + Canonical, curr[j−1] + AlternateColumn )
  32              if curr[j] ≤ cutoff then next_pruning_point ← j + 1
  33          j++
  34      else if j = next_start then return +∞
  35      /* Stage 5: after the pruning point */
  36      for j to jStop while j = next_pruning_point do
  37          curr[j] ← curr[j−1] + AlternateColumn
  38          if curr[j] ≤ cutoff then next_pruning_point ← j + 1
  39      pruning_point ← next_pruning_point
  40  return curr[L_co]
4 Experiments

EAPruned distances produce exactly the same results as their classic counterparts. Hence, our experiments are not about accuracy, but about execution speed. We implemented the DTW, CDTW, WDTW, ERP, MSM and TWE distances following Algorithm 5, simplifying (e.g. removing the handling of the warping window) where possible. All the experiments were run under the same conditions, on a computer equipped with 64GB of RAM (enough memory to fit any dataset from the UCR Archive) and an AMD Opteron 6338P at 2.4GHz.

We tested the performance of four versions. The first one, "Base", is the classical double-buffered implementation without early abandoning. The second one, "Pruned", is based on Algorithm 5; however, it uses its own computed cutoff based on the diagonal of the cost matrix, as suggested by [25]. Such a cutoff only allows pruning, and may be quite loose. However, it allows us to show how our implementations perform when early abandoning is not possible. The third one, "EABase", is the classical double-buffered implementation with early abandoning, as presented in Algorithm 2. Finally, the fourth one, "EAPruned", is based on Algorithm 5. The Pruned and EAPruned versions are not available for SQED and LCSS; these distances are implemented following Algorithms 3 and 4.

The first experiment compares the various versions against each other. Every version of every distance is used within a NN1 classifier. Every classifier is run over every default train/test split from the UCR Archive in its 85-dataset version [4], for a total of 2380 runs. We use the parameters found by EE [14] after training. Hence, our benchmarking conditions are similar to EE at test time. The second experiment assesses the impact of lower bounding on CDTW and DTW. The C++ source code for all the versions and the EE parameters are available on GitHub [5].

4.1 Evaluation of Pruning and Early Abandoning

We measured the time taken by each distance, in each available version, to perform NN1 classification on the default train/test splits from the UCR Archive with 85 datasets. Figure 5 presents the results in hours per distance, except for Figure 5a, which aggregates the prunable distances and hence excludes LCSS and SQED.

LCSS and SQED (Figures 5h and 5i) are only early abandoned, not pruned. Both are ≈ 3 times faster when early abandoned. SQED is so fast that its time is negligible compared to the other distances, early abandoned or not. This is expected, as its time complexity is O(L). The time complexity of LCSS is O(Lw) for a warping window of size w, leading to a worst-case complexity of O(L²). When early abandoning, the best-case complexity is O(L), due to buffer initialization.

Now looking at the prunable distances in Figure 5a: compared to the Base case, pruning seems to help. However, this is not always the case when looking at individual distances: CDTW and TWE (Figures 5c and 5g) are actually slower. This may be explained by the small windows used in the case of CDTW (note that CDTW is ≈ 12 times faster than DTW in the base version, giving an idea of the effect of small windows on the computation time). Small windows mean fewer pruning opportunities. It follows that the time saved by pruning is less than its overhead cost (coming from the pruning technique itself, plus the computation of a cutoff). In the case of TWE, which does not have a window, the computed cutoff may not allow enough pruning. On the other hand, WDTW (Figure 5d), which does not have a window either, is the only distance which benefits more from pruning than from early abandoning in its base version.
We explain this by WDTW's non-linear weights: close to 0 around the diagonal, they increase rapidly as we step away from it. Hence, the computed cutoff based on the diagonal is relatively tight, allowing a lot of pruning.

Finally, EAPruned is ≈ 7.61 times faster than Base, and ≈ 2.88 times faster than EABase. Looking at individual distances, EAPruned is always the fastest, with a speedup ranging from ≈ 3.93 (TWE) up to ≈ 39.2 (WDTW) compared to Base, and from ≈ 1.85 (TWE) up to ≈ 8.44 (WDTW) compared to EABase. Note that CDTW (speedup ≈ 5.8) and TWE (speedup ≈ 3.93) remain the distances that appear to benefit the least from pruning. However, while TWE already benefits from EABase, only EAPruned manages to speed up CDTW.
Figure 5: Accumulated timings (in hours) of NN1 classification over 85 datasets from the UCR archive, using parameters discovered by EE, per distance and per version.

  Distance(s)                                    Base    Pruned    EABase    EAPruned
  (a) All prunable distances
      (DTW, CDTW, WDTW, ERP, MSM, TWE)         192.37    135.02     72.71       25.24
  (b) DTW                                       24.83     20.85      5.51        1.01
  (c) CDTW                                       1.98      2.35      1.99        0.34
  (d) WDTW                                      24.72      4.76      5.32        0.63
  (e) ERP                                       18.44     12.17      5.09        1.39
  (f) MSM                                       74.75     46.29     32.37        9.75
  (g) TWE                                       47.64     48.61     22.43       12.12
  (h) LCSS                                       5.92         –      1.98           –
  (i) SQED                                       0.03         –      0.01           –
  Dataset                          Base    Pruned    EABase    EAPruned
  UWaveGestureLibraryX             2.77      2.37      1.33        0.57
  Phoneme                          5.92      4.21      3.54        2.26
  ElectricDevices                  6.09      5.11      2.84        1.16
  FordB                            8.76      6.11      4.91        2.36
  FordA                           14.20      8.43      7.84        3.79
  NonInvasiveFetalECGThorax2      16.40      9.49      3.80        0.38
  NonInvasiveFetalECGThorax1      18.99      7.27      3.89        0.36
  HandOutlines                    19.55      9.91      6.90        1.33
  UWaveGestureLibraryAll          21.16     17.30      8.47        3.47
  StarLightCurves                 63.71     48.40     19.58        7.61
  total                          177.54    118.61     63.10       23.28

Table 1: The 10 slowest datasets, timings in hours.

  Dataset                          Base    Pruned    EABase    EAPruned
  ItalyPowerDemand                0.030     0.024     0.017       0.012
  Coffee                          0.035     0.029     0.032       0.017
  SonyAIBORobotSurface1           0.044     0.028     0.038       0.016
  BirdChicken                     0.060     0.065     0.057       0.030
  BeetleFly                       0.062     0.050     0.047       0.032
  ECG200                          0.075     0.039     0.026       0.017
  Wine                            0.085     0.030     0.035       0.008
  GunPoint                        0.095     0.064     0.037       0.016
  Beef                            0.098     0.080     0.074       0.040
  SonyAIBORobotSurface2           0.108     0.076     0.064       0.047
  total                           0.693     0.484     0.427       0.236

Table 2: The 10 fastest datasets, timings in minutes.

The results presented in Figure 5 are the timings over all 85 datasets. We have to stress that the majority of the computation time is due to only a few of the slowest datasets. The 10 slowest datasets, reported in Table 1 (in hours), make up ≈ 92.29% of the Base time. The 10 fastest datasets, reported in Table 2 (in minutes), confirm that EAPruned is also beneficial (it makes up for its overhead) even in these cases.

4.1.1 On the Time Complexity of EAPruned

In Tables 1 and 2, datasets are ordered according to the Base column. It is interesting to see that this order does not carry over to the other columns. For example, the second slowest dataset, "UWaveGestureLibraryAll", is markedly slower than "FordA" with Base, but becomes faster with EAPruned. This is due to the unpredictable nature of early abandoning and pruning. The time complexity can be as low as O(1) when an implementation does not require any initialization (such as DTW's). EAPruned's time complexity can also be as high as O(L²), with a worse constant factor than Base due to EAPruned's overhead.

Time complexity under early abandoning is a moving target, sitting between the best and worst case scenarios and depending on the order of evaluation. When performing a NN1 classification, if our first candidate is the actual nearest neighbor of the query, the average complexity will be close to O(1) (or O(L) with buffer initialization) for all following distance computations. On the other hand, if the candidates are ordered from the furthest to the nearest, the computation will never be early abandoned, and probably not pruned much. The average complexity will then be around O(L²).
Figure 6: CDTW runtime (in hours) comparison over 85 datasets from the UCR Archive between Base, EABase and EAPruned, under lb-none, lb-keogh and lb-keogh2. Window parameters obtained from EE.

  Version       lb-none    lb-keogh    lb-keogh2
  Base             1.98        0.37         0.28
  EABase           1.99        0.38         0.27
  EAPruned         0.34        0.17         0.15

Figure 7: DTW runtime (in hours) comparison over 85 datasets from the UCR Archive between Base, EABase and EAPruned, under lb-none, lb-keogh and lb-keogh2. No parameter required.

  Version       lb-none    lb-keogh    lb-keogh2
  Base            24.83       14.13        10.54
  EABase           5.51        4.52         4.14
  EAPruned         1.01        0.94         0.92

4.2 Comparing with Lower Bounds

The results obtained by EAPruned lead us to ask how it behaves in the presence of lower bounding. Lower bounds early abandon even before starting a distance's computation. Hence, EAPruned only sees the cases that the lower bound could not early abandon, meaning it now sees a higher ratio of cases that cannot be early abandoned at all. Under these circumstances, is EAPruned still worth it?

To study this question, we compare the results of CDTW and DTW for the Base, EABase and EAPruned versions, with no lower bound ("lb-none"), the LB_Keogh lower bound ("lb-keogh"), and a cascade of two applications of LB_Keogh ("lb-keogh2", reversing their arguments, as Keogh(a, b) ≠ Keogh(b, a), see [18]). Note that LB_Keogh requires computing the envelopes of the series. For a given run, envelopes are only computed once, keeping this overhead to a minimum. Moreover, envelopes are computed using Lemire's O(L) algorithm [13], the fastest known technique.

Figures 6 and 7 present the results for CDTW and DTW respectively. The "lb-none" values correspond to the cases presented in Figure 5 (i.e. without lower bound). In the CDTW case, EAPruned without lower bound is on par with Base and EABase with lower bounds; lower bounding EAPruned then provides a further ≈ 2 times speedup. In the DTW case, EAPruned is more than ≈ 4 times faster than EABase with lb-keogh2, one of the fastest configurations known until now. Lower bounding EAPruned still offers a ≈ 10% speedup, which is good news. Indeed, an envelope computed over a window as wide as the series does not contain much information (see how CDTW benefits far more from lower bounding in the Base and EABase cases than DTW does).
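For reference, LB_Keogh itself can be sketched as follows (our illustration, using the squared point-wise cost of this paper; the candidate's envelopes, e.g. a running max/min over the window, are assumed to be precomputed once, as described above):

```cpp
// Illustrative sketch (ours) of LB_Keogh with the squared point-wise cost.
// 'upper' and 'lower' are the candidate's precomputed envelopes over the window;
// the returned value lower-bounds CDTW/DTW between the query and that candidate.
#include <cstddef>
#include <limits>
#include <vector>

double lb_keogh(const std::vector<double>& query,
                const std::vector<double>& upper,
                const std::vector<double>& lower,
                double cutoff) {
  double lb = 0.0;
  for (std::size_t i = 0; i < query.size(); ++i) {
    const double q = query[i];
    if (q > upper[i]) {                  // query point above the envelope
      const double d = q - upper[i];
      lb += d * d;
    } else if (q < lower[i]) {           // query point below the envelope
      const double d = lower[i] - q;
      lb += d * d;
    }
    if (lb > cutoff) return std::numeric_limits<double>::infinity();  // bound already exceeds the threshold
  }
  return lb;  // the caller skips the full distance whenever lb exceeds its threshold
}
```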
Our results indicate that lower bounding (at least with LB_Keogh) and EAPruned complement each other. We explain this by the difference in the modus operandi of the two techniques. Lower bounds have the opportunity to pre-compute some data over the series, giving them an overall but approximate view. On the other hand, EAPruned works locally, but with the exact costs.

5 Conclusion

In this paper, we presented how to efficiently prune and early abandon elastic distances. We implemented our algorithm, EAPruned, for the major elastic distances used by state-of-the-art ensemble classifiers, and compared the timings with existing techniques. We show over 85 datasets from the UCR archive that, in scenarios akin to NN1 classification, our algorithm is always the fastest, even in already fast cases. We also show that pruning alone can be beneficial for some distances, although caution is advised as it may be counter-productive in some cases. Finally, we show that not only is EAPruned competitive against lower bounding alone, the two techniques are actually complementary.

In light of these results, we encourage researchers to keep developing lower bounds, and practitioners to use our algorithm. We make the latter easy by releasing our C++ implementations with Python/Numpy bindings (see [6]) under the permissive MIT licence. Our next steps will be to implement our algorithm for the multivariate case, and to implement more lower bounds in Tempo. We also plan to fit ensemble classifiers such as Proximity Forest or TS-CHIEF with our distances, expecting significant speedups.

References

[1] Anthony Bagnall, Michael Flynn, James Large, Jason Lines, and Matthew Middlehurst. A tale of two toolkits, report the third: On the usage and performance of HIVE-COTE v1.0. arXiv:2004.06069 [cs, stat], April 2020.
[2] Seif-Eddine Benkabou, Khalid Benabdeslem, and Bruno Canitia. Unsupervised outlier detection for time series by entropy and dynamic time warping. Knowledge and Information Systems, 54(2):463–486, 2018.
[3] Lei Chen and Raymond Ng. On the marriage of Lp-norms and edit distance. In Proceedings 2004 VLDB Conference, pages 792–803, 2004.
[4] Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh. The UCR Time Series Archive. arXiv:1810.07758 [cs, stat], September 2019.
[5] Matthieu Herrmann. Experimentation source code and resources. https://github.com/HerrmannM/paper-2021-EAPElasticDist, 2021.
[6] Matthieu Herrmann. Tempo, the Monash time series classification library. https://github.com/MonashTS/tempo, 2021.
[7] Daniel S. Hirschberg. Algorithms for the Longest Common Subsequence Problem. Journal of the ACM (JACM), 24(4):664–675, October 1977.
[8] F. Itakura. Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):67–72, 1975.