PRELIMINARY PREPRINT VERSION: DO NOT CITE
The AAAI Digital Library will contain the published version some time after the conference.

Partial Wasserstein Covering

Keisuke Kawano, Satoshi Koide, Keisuke Otaki
Toyota Central R&D Labs., Inc.
{kawano, koide, otaki}@mosk.tytlabs.co.jp

Abstract

We consider a general task called partial Wasserstein covering with the goal of providing information on what patterns are not being taken into account in a dataset (e.g., a dataset used during development) compared with another dataset (e.g., a dataset obtained from actual applications). We model this task as a discrete optimization problem with partial Wasserstein divergence as an objective function. Although this problem is NP-hard, we prove that it satisfies the submodular property, allowing us to use a greedy algorithm with a 0.63 approximation. However, the greedy algorithm is still inefficient because it requires solving a linear program for each objective function evaluation. To overcome this inefficiency, we propose quasi-greedy algorithms that consist of a series of acceleration techniques, such as sensitivity analysis based on strong duality and the so-called C-transform from the optimal transport field. Experimentally, we demonstrate that we can efficiently fill in the gaps between the two datasets and find missing scenes in real driving scene datasets.

[Figure 1: Concept of PWC. PWC extracts some data from an unlabeled application dataset by minimizing the partial Wasserstein divergence between the application dataset and the union of the selected data and a development dataset. PWC focuses on regions in which development data are lacking compared with the application dataset, whereas anomaly detection extracts irregular data and active learning selects data to improve accuracy. Panels: original data distribution, partial Wasserstein covering, anomaly detection (LOF), active learning (coreset); legend: application dataset, development dataset, selected data.]

1 Introduction

A major challenge in real-world machine learning applications is coping with mismatches between the data distribution obtained in real-world applications and the one used for development. Regions of the real-world data distribution that are not well supported by the development data distribution (i.e., regions with low relative densities) result in potential risks such as a lack of evaluation or high generalization error, which in turn leads to low product quality. Our motivation is to provide developers with information on what patterns are not being taken into account when developing products by selecting some of the (usually unlabeled) real-world data distribution, also referred to as the application dataset,¹ to fill in the gaps in the development dataset. Note that the term "development" includes choosing which models to use, designing subroutines for fail-safety, and training/testing models.

Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
¹ More precisely, we call a finite set of data sampled from the real-world data distribution the application dataset in this study.

Our research question is formulated as follows: to resolve the lack of data density in development datasets, how, and using which metric, can we select a limited amount of data from application datasets for developers to understand? One reasonable approach is to define the discrepancy between the data distributions and select data to minimize this discrepancy. The Wasserstein distance has attracted significant attention as a metric for data distributions (Arjovsky, Chintala, and Bottou 2017; Alvarez-Melis and Fusi 2020). However, the Wasserstein distance is not capable of representing the asymmetric relationship between application datasets and development datasets, i.e., the parts that are over-included during development also increase the Wasserstein distance.

In this paper, we propose partial Wasserstein covering (PWC), which selects a limited amount of data from the application dataset by minimizing the partial Wasserstein divergence (Bonneel and Coeurjolly 2019) between the application dataset and the union of the development dataset and the selected data. PWC, as illustrated in Fig. 1, selects data from areas with fewer development data than application data in the data distributions (the lower-right area of the blue distribution in Fig. 1) while ignoring areas with sufficient development data (the upper-middle of the orange distribution). We also highlight the data selected through an anomaly detection method, LOF (Breunig et al. 2000), where irregular data (upper-right points) were selected but the major density difference was ignored. Furthermore, we show the selection obtained using an active learning method, coreset with a k-center (Sener and Savarese 2018), where the data are chosen to improve the accuracy rather than to fill the gap in terms of the distribution mismatch.
Our main contributions are summarized as follows.
• We propose PWC, which extracts data that are lacking in a development dataset from an application dataset by minimizing the partial Wasserstein divergence between the application dataset and the development dataset.
• We prove that PWC is a maximization problem involving a submodular function whose input is the set of selected data. This allows us to use a greedy algorithm with a guaranteed approximation ratio.
• Additionally, we propose fast heuristics based on sensitivity analysis and the Sinkhorn derivative for accelerated computation.
• Experimentally, we demonstrate that, compared with baselines, PWC extracts data lacking in the development data distribution from the application distribution more efficiently.

The remainder of this paper is organized as follows. Section 2 briefly introduces the partial Wasserstein divergence, linear programming (LP), and submodular functions. In Sect. 3, we detail the proposed PWC, fast algorithms for approximating the optimization problem, and some theoretical results. Section 4 presents a literature review of related work. We demonstrate the PWC in numerical experiments in Sect. 5. In Sect. 6, we present our conclusions.

2 Preliminaries

In this section, we introduce the notations used throughout this paper. We then describe partial Wasserstein divergences, sensitivity analysis for LP, and submodular functions.

Notations Vectors and matrices are denoted in bold (e.g., x and A), where x_i and A_ij denote the i-th and (i, j)-th elements, respectively. To clarify the elements, we use notations such as x = (f(i))_i and A = (g(i, j))_ij, where f and g specify the element values depending on their subscripts. ⟨·, ·⟩ denotes the inner product of matrices or vectors. 1_n ∈ R^n is a vector with all elements equal to one. For a natural number n, we define [[n]] := {1, ..., n}. For a finite set V, we denote its power set as 2^V and its cardinality as |V|. The L2-norm is denoted as ‖·‖. The delta function is denoted as δ(·). R_+ is the set of positive real numbers, and [a, b] ⊂ R denotes a closed interval. I[·] denotes an indicator function (zero for false and one for true).

Partial Wasserstein divergence In this paper, we consider the partial Wasserstein divergence (Figalli 2010; Bonneel and Coeurjolly 2019; Chapel, Alaya, and Gasso 2020) as an objective function. The partial Wasserstein divergence is designed to measure the discrepancy between two distributions with different total masses by considering variations of optimal transport. Throughout this paper, datasets are modeled as empirical distributions represented by mixtures of delta functions without necessarily having the same total mass. Suppose two empirical distributions (or datasets) X and Y with probability masses a ∈ R^m_+ and b ∈ R^n_+, denoted as X = Σ_{i=1}^{m} a_i δ(x^(i)) and Y = Σ_{j=1}^{n} b_j δ(y^(j)). Without loss of generality, the total mass of Y is greater than or equal to that of X (i.e., Σ_{i=1}^{m} a_i = 1 and Σ_{j=1}^{n} b_j ≥ 1).

Based on the definitions above, we define the partial Wasserstein divergence as follows:²

PW²(X, Y) := min_{P ∈ U(a, b)} ⟨P, C⟩, where U(a, b) = {P ∈ [0, 1]^{m×n} | P 1_n = a, Pᵀ 1_m ≤ b},   (1)

where C_ij := ‖x^(i) − y^(j)‖² is the transport cost between x^(i) and y^(j), and P_ij is the amount of mass flowing from x^(i) to y^(j) (to be optimized). Unlike the standard Wasserstein distance, the second constraint in Eq. (1) is not defined with "=" but with "≤". This modification allows us to treat distributions with different total masses. The condition Pᵀ 1_m ≤ b indicates that the mass in Y does not all need to be transported, whereas the condition P 1_n = a indicates that all of the mass in X should be transported without excess or deficiency (just as in the original Wasserstein distance). This property is useful for the problem defined below, which treats datasets with vastly different sizes.

² The partial optimal transport problems in the literature admit a wider problem definition than Eq. (1), as summarized in Table 1 (b) of (Bonneel and Coeurjolly 2019), but this paper employs the one-side relaxed Wasserstein divergence corresponding to Table 1 (c) in (Bonneel and Coeurjolly 2019) without loss of generality.

To compute the partial Wasserstein divergence, we must solve the minimization problem in Eq. (1). In this paper, we consider the following two methods: (i) LP using the simplex method; (ii) generalized Sinkhorn iterations with entropy regularization (with a small regularization parameter ε > 0) (Benamou et al. 2015; Peyré and Cuturi 2019).

As will be detailed later, an element of the mass b varies when adding data to the development dataset. A key to our algorithm is to quickly estimate the extent to which the partial Wasserstein divergence will change when an element of b varies. If we compute PW² using LP, we can employ sensitivity analysis, which is described in the following paragraph. If we use generalized Sinkhorn iterations, we can use automatic differentiation techniques to obtain a partial derivative with respect to b_j (see Sect. 3.3).

LP and sensitivity analysis The sensitivity analysis of LP plays an important role in our algorithm. Given a variable x ∈ R^m and parameters c ∈ R^m, d ∈ R^n, and A ∈ R^{n×m}, the standard form of LP can be written as follows: min cᵀx, s.t. Ax ≤ d, x ≥ 0. Sensitivity analysis is a framework for estimating changes in the solution when the parameters c, A, and d of the problem vary. We consider a sensitivity analysis for the right-hand side of the constraints (i.e., d). When d_j changes to d_j + Δd_j, the optimal value changes by y_j^* Δd_j, provided that the change Δd_j lies within (d̲_j, d̄_j), where y^* is the optimal solution of the dual problem corresponding to the primal problem: max dᵀy s.t. Aᵀy ≤ c, y ≤ 0. We refer readers to (Vanderbei 2015) for the details of calculating the upper bound d̄_j and the lower bound d̲_j.
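To make the LP route concrete, the following is a minimal sketch (our own illustration, not the authors' code) that solves Eq. (1) with SciPy's HiGHS-based linprog and reads off the dual values used by the sensitivity analysis. The function name is ours, and the reliance on res.eqlin.marginals / res.ineqlin.marginals assumes a recent SciPy version.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def partial_wasserstein_lp(X, Y, a, b):
    """Solve Eq. (1): min <P, C> s.t. P @ 1 = a, P.T @ 1 <= b, P >= 0.

    X: (m, d) source points, Y: (n, d) target points,
    a: (m,) masses summing to 1, b: (n,) masses with total >= 1.
    Returns the divergence and the dual values (f for rows, g for columns).
    """
    m, n = len(X), len(Y)
    C = cdist(X, Y, metric="sqeuclidean")          # C_ij = ||x_i - y_j||^2
    # Decision variable: vec(P) of length m*n in row-major order.
    A_eq = np.zeros((m, m * n))                    # row sums: P @ 1_n = a
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0
    A_ub = np.zeros((n, m * n))                    # column sums: P.T @ 1_m <= b
    for j in range(n):
        A_ub[j, j::n] = 1.0
    res = linprog(C.ravel(), A_ub=A_ub, b_ub=b, A_eq=A_eq, b_eq=a,
                  bounds=(0, None), method="highs")
    # HiGHS exposes shadow prices; the column duals are non-positive,
    # consistent with g <= 0 in the dual problem of the main text.
    f = res.eqlin.marginals                        # duals of P @ 1 = a
    g = res.ineqlin.marginals                      # duals of P.T @ 1 <= b
    return res.fun, f, g
```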
Submodular function Our covering problem is modeled as a discrete optimization problem involving submodular functions, which are a subclass of set functions that play an important role in discrete optimization. A set function φ : 2^V → R is called submodular iff φ(S ∪ T) + φ(S ∩ T) ≤ φ(S) + φ(T) (∀S, T ⊆ V). A submodular function is monotone iff φ(S) ≤ φ(T) for S ⊆ T. An important property of monotone submodular functions is that a greedy algorithm provides a (1 − 1/e) ≈ 0.63-approximate solution to the maximization problem under a budget constraint |S| ≤ K (Nemhauser, Wolsey, and Fisher 1978). The contraction φ̃ : 2^V → R of a (monotone) submodular function φ, defined as φ̃(S) := φ(S ∪ T) − φ(T) for a fixed T ⊆ V, is also a (monotone) submodular function.

3 Partial Wasserstein covering problem

3.1 Formulation

As discussed in Sect. 1, our goal is to fill in the gap with respect to the application dataset D_app by adding some data from a candidate dataset D_cand to a small development dataset D_dev. We consider D_cand = D_app (i.e., we copy some data from D_app and add them to D_dev) in the above-mentioned scenarios, but we herein consider the most general formulation. We model this task as a discrete optimization problem called the partial Wasserstein covering problem.

Given a dataset for development D_dev = {y^(j)}_{j=1}^{N_dev}, a dataset obtained from an application D_app = {x^(i)}_{i=1}^{N_app}, and a dataset containing candidates for selection D_cand = {s^(j)}_{j=1}^{N_cand}, where N_app ≥ N_dev, the PWC problem is defined as the following optimization:³

max_{S ⊆ D_cand} −PW²(D_app, S ∪ D_dev) + PW²(D_app, D_dev)  s.t. |S| ≤ K.   (2)

We select a subset S from the candidate dataset D_cand under the budget constraint |S| ≤ K (≪ N_cand), and then add that subset to the development data D_dev to minimize the partial Wasserstein divergence between the two datasets D_app and S ∪ D_dev. The second term is a constant with respect to S, which is included to make the objective non-negative.

In Eq. (2), we must specify the probability mass (i.e., a and b in Eq. (1)) for each data point. Here, we employ a uniform mass, which is a natural choice because we do not have prior information regarding each data point. Specifically, we set the weights to a_i = 1/N_app for D_app and b_j = 1/N_dev for S ∪ D_dev. With this choice of masses, we can easily show that PW²(D_app, S ∪ D_dev) = W²(D_app, D_dev) when S = ∅, where W²(D_app, D_dev) is the Wasserstein distance. Therefore, based on the monotone property (Sect. 3.2), the objective value is non-negative for any S.

³ We herein consider the partial Wasserstein divergence between the unlabeled datasets, because the application and candidate datasets are usually unlabeled. If they are labeled, we can use the label information as in (Alvarez-Melis and Fusi 2020).

The PWC problem in Eq. (2) can be written as a mixed integer linear program (MILP) as follows. Instead of using the divergence between D_app and S ∪ D_dev, we consider the divergence between D_app and D_cand ∪ D_dev. Hence, the mass b is an (N_cand + N_dev)-dimensional vector, where the first N_cand elements correspond to the data in D_cand and the remaining elements correspond to the data in D_dev. In this case, we never transport to data points in D_cand \ S, meaning we use the following mass that depends on S:

b_j(S) = I[s^(j) ∈ S] / N_dev  if 1 ≤ j ≤ N_cand,  and  b_j(S) = 1 / N_dev  if N_cand + 1 ≤ j.   (3)

As a result, the problem is an MILP with objective function ⟨P, C⟩ w.r.t. S ⊆ D_cand and P ∈ U(a, b(S)), with |S| ≤ K and a_i = 1/N_app.

One may wonder why we use the partial Wasserstein divergence in Eq. (2) instead of the standard Wasserstein distance. This is because the asymmetry in the conditions of the partial Wasserstein divergence enables us to extract only the parts with a density lower than that of the application dataset, whereas the standard Wasserstein distance becomes large even for parts with sufficient density in the development dataset. Furthermore, the guaranteed approximation algorithms that we describe later can be utilized only for the partial Wasserstein divergence version.
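As a small illustration of Eq. (2) and Eq. (3), the sketch below (our own, with hypothetical names) builds the uniform mass a and the selection-dependent mass b(S), and evaluates the objective φ(S) by solving two LPs; it assumes the partial_wasserstein_lp helper sketched in the preliminaries.

```python
import numpy as np

def pwc_objective(selected_idx, X_app, X_cand, X_dev, pw_solver):
    """Evaluate phi(S) = -PW^2(D_app, S u D_dev) + PW^2(D_app, D_dev) as in Eq. (2)."""
    n_app, n_cand, n_dev = len(X_app), len(X_cand), len(X_dev)
    a = np.full(n_app, 1.0 / n_app)                # uniform mass on D_app
    # Targets ordered as [D_cand, D_dev]; Eq. (3) zeroes out unselected candidates.
    b = np.zeros(n_cand + n_dev)
    b[list(selected_idx)] = 1.0 / n_dev
    b[n_cand:] = 1.0 / n_dev
    Y = np.vstack([X_cand, X_dev])
    pw_S, _, _ = pw_solver(X_app, Y, a, b)         # PW^2(D_app, S u D_dev)
    b0 = b.copy()
    b0[:n_cand] = 0.0                              # S = {}: only the development mass remains
    pw_0, _, _ = pw_solver(X_app, Y, a, b0)        # PW^2(D_app, D_dev)
    return -pw_S + pw_0
```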
3.2 Submodularity of the PWC problem

In this section, we prove the following theorem to guarantee the approximation ratio of the proposed algorithms.

Theorem 1. Given datasets X = {x^(i)}_{i=1}^{N_x} and Y = {y^(j)}_{j=1}^{N_y}, and a subset of a dataset S ⊆ {s^(j)}_{j=1}^{N_s}, the function φ(S) = −PW²(X, S ∪ Y) + PW²(X, Y) is a monotone submodular function.

To prove Theorem 1, we reduce our problem to the partial maximum weight matching problem (Bar-Noy and Rabanca 2016), which is known to be submodular. First, we present the following lemmas; their proofs are provided in the supplementary materials.

Lemma 1. Let Q be the set of all rational numbers, and let m and n be positive integers with n ≤ m. Consider A ∈ Q^{m×n} and b ∈ Q^m. Then, the extreme points x^* of the convex polytope defined by the linear inequalities Ax ≤ b are also rational, meaning that x^* ∈ Q^n.

Lemma 2. Let l, m, and n be positive integers, and let Z = [[n]]. Given a positive-valued m-by-n matrix R > 0, the following set function ψ : 2^Z → R is a submodular function:

ψ(S) = max_{P ∈ U^≤(1_m / m, b(S))} ⟨R, P⟩, where b_j(S) = I[j ∈ S] / l  ∀j ∈ [[n]].   (4)

Here, U^≤(a, b(S)) is the set defined by replacing the constraint P 1_n = a in Eq. (1) with P 1_n ≤ a.

Lemma 3. If |S| ≥ l, there exists an optimal solution P^* in the maximization problem of ψ satisfying P^* 1_n = 1_m / m (i.e., P^* := arg max_{P ∈ U^≤(1_m/m, b(S))} ⟨R, P⟩ = arg max_{P ∈ U(1_m/m, b(S))} ⟨R, P⟩).
Proof of Theorem 1. Let C_max = max_{i,j} C_ij + γ, where γ > 0. Then the following holds:

φ(S) = (−C_max + max_{P ∈ U(a, b(S))} ⟨R, P⟩) − (−C_max + max_{P ∈ U(a, b(∅))} ⟨R, P⟩),   (5)

where R = (C_max − C_ij)_ij > 0, m = N_x, n = N_s + N_y, and l = N_y. Here, |S ∪ Y| ≥ N_y = l, and Lemma 3 yields φ(S) = ψ(S ∪ Y) − ψ(Y). Since φ(S) is a contraction of the submodular function ψ(S) (Lemma 2), φ(S) = −PW²(X, S ∪ Y) + PW²(X, Y) is a submodular function. We now prove the monotonicity of φ. For S ⊆ T, we have U(a, b(S)) ⊆ U(a, b(T)) because b(S) ≤ b(T). This implies that φ(S) ≤ φ(T).

Finally, we state Proposition 1 to clarify the computational intractability of the PWC problem.

Proposition 1. The PWC problem with marginals a and b(S) is NP-hard.

The proof of Proposition 1 is provided in the supplementary materials.

3.3 Algorithms

MILP and greedy algorithms The PWC problem in Eq. (2) can be solved directly as an MILP problem. We refer to this method as PW-MILP. However, this is extremely inefficient when N_app, N_dev, and N_cand are large.

As a consequence of Theorem 1, a simple greedy algorithm provides a 0.63 approximation because the problem is a type of monotone submodular maximization with a budget constraint |S| ≤ K (Nemhauser, Wolsey, and Fisher 1978). Specifically, this greedy algorithm selects data from the candidate dataset D_cand sequentially, in each iteration choosing the element that maximizes φ. We consider two such baseline methods, called PW-greedy-LP and PW-greedy-ent. PW-greedy-LP solves the LP problem using the simplex method. PW-greedy-ent computes the optimal P^* using Sinkhorn iterations (Benamou et al. 2015).

Even for the greedy algorithms mentioned above, the computational cost is still high because we need to calculate the partial Wasserstein divergence for every candidate in each iteration, yielding a time complexity of O(K · N_cand · C_PW). Here, C_PW is the complexity of computing the partial Wasserstein divergence with the simplex method for the LP or with Sinkhorn iterations.

To reduce the computational cost, we propose heuristic approaches called quasi-greedy algorithms. A high-level description of the quasi-greedy algorithms is as follows. In each step of greedy selection, we use a type of sensitivity value that allows us to estimate how much the partial Wasserstein divergence varies when adding candidate data, instead of performing the actual computation of the divergence. Specifically, if we add a data point s^(j) to S, denoted as T = {s^(j)} ∪ S, we have b(T) = b(S) + e_j / N_dev, where e_j is a one-hot vector with the j-th element activated. Hence, we can estimate the change in the divergence caused by adding data by computing the sensitivity with respect to b(S). If efficient sensitivity computation is possible, we can speed up the greedy algorithms. We describe the concrete methods for the two approaches, the simplex method and Sinkhorn iterations, in the following paragraphs.

Quasi-greedy algorithm for the simplex method When we use the simplex method for the partial Wasserstein divergence computation, we employ a sensitivity analysis for LP. The dual problem of the partial Wasserstein divergence computation is defined as follows:

max_{f ∈ R^{N_app}, g ∈ R^{N_cand + N_dev}} ⟨f, a⟩ + ⟨g, b(S)⟩  s.t. g_j ≤ 0, f_i + g_j ≤ C_ij, ∀i, j,   (6)

where f and g are dual variables. Using sensitivity analysis, the change in the partial Wasserstein divergence can be estimated as g_j^* Δb_j, where g_j^* is an optimal dual solution of Eq. (6) and Δb_j is the change in b_j. It is worth noting that Δb_j now corresponds to the term I[s^(j) ∈ S]/N_dev in Eq. (3), and a smaller optimal value of Eq. (6) results in a larger value of Eq. (2). Thus, we propose a heuristic algorithm that iteratively selects s^(j*) satisfying j^* = arg min_{j ∈ [[N_cand]], s^(j) ∉ S} g_j^{*(t)} at iteration t, where g_j^{*(t)} is the optimal dual solution at iteration t. We emphasize that we have −(1/N_dev) g_j^* = φ({s^(j)} ∪ S) − φ(S) as long as b̄_j − b_j ≥ 1/N_dev holds, where b̄_j is the upper bound obtained from the sensitivity analysis, leading to the same selection as that of the greedy algorithm. The computational complexity of this heuristic algorithm is O(K · C_PW) because we obtain g^* by solving the dual simplex. It should be noted that we add a small value to b_j when b_j = 0 to avoid undefined values in g. We refer to this algorithm as PW-sensitivity-LP and summarize the overall procedure in Algorithm 1.

Algorithm 1: Data selection for the PWC problem with sensitivity analysis (PW-sensitivity-LP)
1: Input: D_app, D_dev, D_cand
2: Output: S
3: S ← {}, t ← 0
4: while |S| < K do
5:   Calculate PW²(D_app, S ∪ D_dev) and obtain g_j^{*(t)} from the sensitivity analysis.
6:   j^* = arg min_{j ∈ [[N_cand]], s^(j) ∉ S} g_j^{*(t)}
7:   S ← S ∪ {s^(j*)}
8:   t ← t + 1
9: end while

For computational efficiency, we use the solution matrix P from the previous step as the initial value for the simplex algorithm in the current step. Any feasible solution P ∈ U(a, b(S)) is also feasible for U(a, b(T)) for all S ⊆ T. The previous solutions of the dual variables f and g are also utilized for the initialization of the dual simplex and the Sinkhorn iterations.
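The selection loop of Algorithm 1 can be rendered roughly as follows. This is our illustrative sketch, reusing the hypothetical partial_wasserstein_lp helper from the preliminaries; it omits the warm starts and uses a simple small-value perturbation for the zero entries of b, as mentioned in the text.

```python
import numpy as np

def pw_sensitivity_lp(X_app, X_cand, X_dev, K, pw_solver, eps=1e-9):
    """Quasi-greedy selection via dual sensitivities (sketch of Algorithm 1)."""
    n_app, n_cand, n_dev = len(X_app), len(X_cand), len(X_dev)
    a = np.full(n_app, 1.0 / n_app)
    Y = np.vstack([X_cand, X_dev])                   # targets ordered as [D_cand, D_dev]
    selected = []
    while len(selected) < K:
        b = np.full(n_cand + n_dev, eps)             # small value where b_j would be 0
        b[n_cand:] = 1.0 / n_dev                     # development data always carry mass
        b[selected] = 1.0 / n_dev                    # previously selected candidates
        _, _, g = pw_solver(X_app, Y, a, b)          # dual values of the column constraints
        candidates = [j for j in range(n_cand) if j not in selected]
        j_star = min(candidates, key=lambda j: g[j]) # most negative dual = largest estimated gain
        selected.append(j_star)
    return selected
```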
Faster heuristic using C-transform The heuristics above still require solving a large LP in each step, in which the number of dual constraints is N_app × (N_cand + N_dev). Here, we aim to reduce the size of the LP constraints from N_app × (N_cand + N_dev) to N_app × (|S| + N_dev) using the C-transform (Sect. 3.1 in (Peyré and Cuturi 2019)). To derive the algorithm, we consider the dual problem in Eq. (6). Using the fact that b_j(S) = 0 for s^(j) ∉ S, we first solve the smaller LP problem that ignores all j such that s^(j) ∉ S, whose system size is N_app × (|S| + N_dev). Let the optimal dual solution of this smaller LP be (f^*, g^*), where f^* ∈ R^{N_app} and g^* ∈ R^{|S| + N_dev}. For each s^(j) ∉ S, we consider an LP in which s^(j) is added to S. Instead of solving each such LP as PW-greedy-LP does, we approximate the optimal solution using a technique called the C-transform. More specifically, for each j such that s^(j) ∉ S, we only optimize g_j and fix the other variables to the dual optimum above. This is done by setting g_j^C := min{0, min_{i ∈ [[N_app]]} (C_ij − f_i^*)}. Note that this is the largest value of g_j satisfying the dual constraints. This g_j^C gives the estimated increase in the objective function when s^(j) ∉ S is added to S. Finally, we select the instance j^* = arg min_{j ∈ [[N_cand]], s^(j) ∉ S} g_j^C. As shown later, this heuristic works efficiently in terms of computational time in our experiments. We refer to this algorithm as PW-sensitivity-Ctrans.

Quasi-greedy algorithm for Sinkhorn iterations When we use generalized Sinkhorn iterations, we can easily obtain the derivative ∇_j := ∂PW²(D_app, S ∪ D_dev)/∂b_j of the entropy-regularized partial Wasserstein divergence using automatic differentiation techniques such as those provided by PyTorch (Paszke et al. 2019). Based on this derivative, we can derive a quasi-greedy algorithm for the Sinkhorn iteration as j^* = arg min_{j ∈ [[N_cand]], s^(j) ∉ S} ∇_j (i.e., we replace Lines 6 and 7 of Algorithm 1 with this formula). Because the derivative can be obtained with the same computational complexity as the Sinkhorn iterations, the overall complexity is O(K · C_PW). We refer to this algorithm as PW-sensitivity-ent.
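For illustration, the C-transform scoring step can be vectorized as in the following sketch (our own, with hypothetical names); it assumes the optimal dual values f^* of the smaller LP are already available.

```python
import numpy as np

def ctransform_scores(C_cand, f_star, selected):
    """Approximate dual values g_j^C for unselected candidates.

    C_cand: (N_app, N_cand) transport costs from application points to candidates.
    f_star: (N_app,) optimal dual variables of the smaller LP (selected + dev columns only).
    Returns g^C with +inf on already-selected candidates so that argmin skips them.
    """
    # g_j^C = min(0, min_i (C_ij - f_i^*)): the largest g_j satisfying f_i + g_j <= C_ij and g_j <= 0.
    g_c = np.minimum(0.0, np.min(C_cand - f_star[:, None], axis=0))
    g_c[list(selected)] = np.inf
    return g_c

# Selection step of PW-sensitivity-Ctrans (sketch):
# j_star = int(np.argmin(ctransform_scores(C_cand, f_star, selected)))
```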
4 Related work

Instance selection and data summarization Tasks for selecting an important portion of a large dataset are common in machine learning. Instance selection involves extracting a subset of a training dataset while maintaining the target task performance. According to (Olvera-López et al. 2010), two types of instance selection methods have been studied: model-dependent methods (Hart 1968; Ritter 1975; Chou, Kuo, and Chang 2006; Wilson 1972; Vázquez, Sánchez, and Pla 2005) and label-dependent methods (Wilson and Martinez 2000; Brighton and Mellish 2002; Riquelme, Aguilar-Ruiz, and Toro 2003; Bezdek and Kuncheva 2001; Liu and Motoda 2002; Spillmann et al. 2006). Unlike instance selection methods, PWC depends on neither the model nor ground-truth labels. Data summarization involves finding representative elements in a large dataset (Ahmed 2019), where submodularity plays an important role (Mitrovic et al. 2018; Lin and Bilmes 2011; Mirzasoleiman, Zadimoghaddam, and Karbasi 2016; Tschiatschek et al. 2014). Unlike data summarization, PWC focuses on filling in gaps between datasets.

Anomaly detection Our covering problem can be considered as the task of selecting a subset of a dataset that is not included in the development dataset. In this regard, our method is related to anomaly detectors, e.g., LOF (Breunig et al. 2000), one-class SVM (Schölkopf et al. 1999), and deep learning-based methods (Kim et al. 2019; An and Cho 2015). However, our goal is to extract patterns that are less included in the development dataset but have a certain volume in the application dataset, rather than to select rare data that would be considered observation noise or infrequent patterns that are not as important as frequent ones.

Active learning The problem of selecting data from an unlabeled dataset is also considered in active learning. A typical approach uses the output of the target model, e.g., entropy-based methods (Holub, Perona, and Burl 2008; Coletta et al. 2019), whereas some methods are independent of the model's output, e.g., the k-center-based approach (Sener and Savarese 2018) and the Wasserstein distance-based approach (Shui et al. 2020). In contrast, our goal is to find patterns that were not considered during development, even if they do not contribute to improving prediction performance when directly added to the training data; developers may use these unconsidered data to test the generalization performance, design subroutines to process the patterns, redevelop a new model, or alert users not to use the products in certain special situations. Furthermore, we emphasize that the submodularity and guaranteed approximation algorithms can be used only with the partial Wasserstein divergence, not with the vanilla Wasserstein distance.

Facility location problem (FLP) and k-median The FLP and the related k-median problem (ReVelle and Swain 1970) are similar to our problem in the sense that we select data S ⊆ D_cand to minimize an objective function. FLP is a mathematical model for finding desirable locations for facilities (e.g., stations, schools, and hospitals). While FLP models facility opening costs, our data covering problem does not consider data selection costs, making it more similar to k-median problems. However, unlike the k-median, our covering problem allows relaxed transportation between distributions, which is known as Kantorovich relaxation in the optimal transport field (Peyré and Cuturi 2019). Hence, our approach using partial Wasserstein divergence enables us to model more flexible assignments between distributions beyond naïve assignments among data.
5 Experiments

In our experiments, we considered a scenario in which we wish to select some data from the application dataset D_app for the PWC problem (i.e., D_app = D_cand). For PW-*-LP and PW-sensitivity-Ctrans, we used IBM ILOG CPLEX 20.01 (IBM ILOG CPLEX Optimization Studio 2013) for the dual simplex algorithm, in which sensitivity analysis for LP is also available. For PW-*-ent, we computed the entropic regularized version of the partial Wasserstein divergence using PyTorch (Paszke et al. 2019) on a GPU. We computed the sensitivity using automatic differentiation. We set the coefficient for entropy regularization to ε = 0.01 and terminated the Sinkhorn iterations when the divergence did not change by at least max_{i,j} C_ij × 10^{-12} compared with the previous step, or when the number of iterations reached 50,000. All experiments were conducted with an Intel Xeon Gold 6142 CPU and an NVIDIA TITAN RTX GPU.

5.1 Comparison of algorithms

First, we compare our algorithms, namely the greedy-based PW-greedy-*, the sensitivity-based PW-sensitivity-*, and random selection as a baseline. For the evaluation, we generated two-dimensional random values following normal distributions for D_app and D_dev, respectively. Figure 2 (left) presents the partial Wasserstein divergences between D_app and S ∪ D_dev while varying the number of selected data when N_app = N_dev = 30, and Fig. 2 (right) presents the computational times when K = 30 and N_app = N_dev = N_cand varied.

[Figure 2: Partial Wasserstein divergence PW²(D_app, S ∪ D_dev) when data are sequentially selected and added to S (left). Computational time, where colored areas indicate standard deviations (right). Compared methods: PW-MILP, PW-greedy-LP, PW-greedy-ent, PW-sensitivity-LP, PW-sensitivity-Ctrans, PW-sensitivity-ent, PW-sensitivity-ent(CPU), and random.]

As shown in Fig. 2 (left), PW-greedy-LP and PW-sensitivity-LP select exactly the same data as the global optimal solution obtained by PW-MILP. As shown in Fig. 2 (right), considering the logarithmic scale, PW-sensitivity-* significantly reduces the computational time compared with PW-MILP, while the naïve greedy algorithms do not scale. In particular, PW-sensitivity-ent can be computed quickly as long as the GPU memory is sufficient, whereas its CPU version (PW-sensitivity-ent(CPU)) is significantly slower. PW-sensitivity-Ctrans is the fastest among the methods without GPUs: it is 8.67 and 3.07 times faster than PW-MILP and PW-sensitivity-LP, respectively. For a quantitative evaluation, we define the empirical approximation ratio as φ(S_K)/φ(S_K^*), where S_K^* is the optimal solution obtained by PW-MILP when |S| = K. Figure 3 shows the empirical approximation ratio (N_app = N_dev = 30, K = 15) for 50 different random seeds. Both PW-*-LP variants achieve approximation ratios close to one, whereas PW-*-ent and PW-sensitivity-Ctrans have slightly lower approximation ratios.⁴

[Figure 3: Approximation ratios evaluated empirically for K = 15 and 50 random instances, for PW-greedy-LP, PW-greedy-ent, PW-sensitivity-LP, PW-sensitivity-Ctrans, PW-sensitivity-ent, and random.]

⁴ An important feature of PW-*-ent is that it can be executed without using any commercial solvers.

5.2 Finding a missing category

For a quantitative evaluation of missing-pattern finding, we herein consider a scenario in which a category (i.e., a label) is less included in the development dataset than in the application dataset.⁵ We employ subsets of the MNIST dataset (LeCun and Cortes 2010), where the development dataset contains the label 0 at a rate of 0.5%, whereas all labels are included in the application data in equal ratios (i.e., 10%). We randomly sampled 500 images from the validation split for each of D_app and D_dev. We employ the squared L2 distance on the pixel space as the ground metric for our method. As a baseline, we employed the LOF (Breunig et al. 2000) novelty detection provided by scikit-learn (Pedregosa et al. 2011) with default parameters. For the LOF, we selected the top-K data according to the anomaly scores.

⁵ Note that in real scenarios, missing patterns do not always correspond to labels.
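The LOF baseline described above can be reproduced roughly as follows with scikit-learn; this is a sketch of the stated setup (flattened pixels, default parameters, top-K by anomaly score), with illustrative variable names rather than the authors' exact code.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_topk(dev_images, app_images, K):
    """Select the K application images that look most anomalous w.r.t. the development data."""
    X_dev = dev_images.reshape(len(dev_images), -1)    # flatten to pixel vectors
    X_app = app_images.reshape(len(app_images), -1)
    lof = LocalOutlierFactor(novelty=True).fit(X_dev)  # novelty mode: score unseen samples
    scores = lof.score_samples(X_app)                  # lower score = more anomalous
    return np.argsort(scores)[:K]                      # indices of the top-K anomalies
```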
We also employed active learning approaches based on the entropies of the predictions (Holub, Perona, and Burl 2008; Coletta et al. 2019) and coreset with the k-center and k-center-greedy methods (Sener and Savarese 2018; Bachem, Lucic, and Krause 2017). For the entropy-based method, we train a LeNet-like neural network using the training split and then select the K data points with the highest prediction entropies from the application dataset. The LOF and coreset algorithms are conducted on the pixel space.

We conducted 10 trials using different random seeds for the proposed algorithms and baselines, except for the PW-greedy-* algorithms, because they took over 24 h.

[Figure 4: Relative frequency of the missing category in the selected data. The black bars denote the standard deviations. Compared methods: PW-sensitivity-LP, PW-sensitivity-Ctrans, PW-sensitivity-ent, k-center, k-center-greedy, active-ent, LOF, and random.]

Figure 4 presents a histogram of the selected labels when K = 30, where the x-axis corresponds to the labels of the selected data (i.e., 0 or not 0) and the y-axis shows the relative frequencies of the selected data. The proposed methods (i.e., PW-*) extract more data corresponding to the label 0 (0.71 for PW-MILP) than the baselines. We can conclude that the proposed methods successfully select data from the candidate dataset D_cand (= D_app) to fill in the missing areas of the distribution.

5.3 Missing scene extraction in driving datasets

Finally, we demonstrate PWC in a realistic scenario using driving scene images. We adopted two datasets, BDD100K (Yu et al. 2020) and KITTI (Object Detection Evaluation 2012) (Geiger et al. 2013), as the application and development datasets, respectively. The major difference between these two datasets is that KITTI (D_dev) contains only daytime images, whereas BDD100K (D_app) contains both daytime and nighttime images. To reduce computational time, we randomly selected 1,500 data points for the development dataset from the test split of the KITTI dataset and 3,000 data points for the application dataset from the test split of BDD100K. To compute the transport costs between images, we calculated the squared L2 norms between feature vectors extracted by a pretrained 50-layer ResNet (He et al. 2016) obtained from Torchvision (Marcel and Rodriguez 2010). Before inputting the images into the ResNet, each image was resized to a height of 224 pixels and then center-cropped to a width of 224, followed by normalization. As baseline methods, we adopted the LOF (Breunig et al. 2000) and coreset with the k-center greedy (Sener and Savarese 2018).
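The feature-based transport costs can be computed along the following lines; this is a hedged sketch using the current Torchvision weights API and preprocessing that approximates the description above, not the authors' exact pipeline.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50

# Resize so the short side is 224, center-crop to 224x224, then ImageNet normalization.
preprocess = T.Compose([
    T.Resize(224), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_images, device="cpu"):
    model = resnet50(weights="IMAGENET1K_V1").to(device).eval()
    model.fc = torch.nn.Identity()                 # keep the 2048-d pooled features
    batch = torch.stack([preprocess(img) for img in pil_images]).to(device)
    return model(batch).cpu()

def cost_matrix(feat_app, feat_dev):
    # C_ij = squared L2 distance between feature vectors.
    return torch.cdist(feat_app, feat_dev, p=2) ** 2
```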
[Figure 5: Covering results when the application dataset (BDD100K) contains nighttime scenes but the development dataset (KITTI) does not. Panels: (a) partial Wasserstein covering, (b) LOF, (c) coreset (k-center greedy). PWC (PW-sensitivity-LP) extracts the major difference between the two datasets (i.e., nighttime images), whereas LOF and coreset (k-center greedy) mainly extract some rare cases.]

Figure 5 presents the top-3 images obtained by each method. One can see that PWC (PW-sensitivity-LP) selects the major pattern (i.e., nighttime scenes) that is not included in the development data, whereas LOF and coreset mainly extract specific rare scenes (e.g., a truck crossing a street or an out-of-focus scene). The coreset does select a single nighttime image; however, we emphasize that providing multiple images is essential for developers to understand which patterns in the images (e.g., nighttime, type of cars, or roadside) are less included in the development data.

The above results indicate that the PWC problem enables us to accelerate updating ML-based systems when the distributions of application and development data are different, because our method does not focus on isolated anomalies but on major missing patterns in the application data, as shown in Fig. 1 and Fig. 5. An ML workflow using our method can efficiently find such patterns and allow developers to incrementally evaluate and update ML-based systems (e.g., test the generalization performance, redevelop a new model, and design a special subroutine for the pattern). Identifying patterns that are not taken into account during development and addressing them individually can improve the reliability and generalization ability of ML-based systems, such as image recognition in autonomous vehicles.

6 Conclusion

In this paper, we proposed PWC, which fills in the gaps between development datasets and application datasets based on the partial Wasserstein divergence. We also proved the submodularity of the PWC problem, leading to a greedy algorithm with a guaranteed approximation ratio. In addition, we proposed quasi-greedy algorithms based on sensitivity analysis and derivatives of Sinkhorn iterations. Experimentally, we demonstrated that the proposed method covers the major areas of development datasets with densities lower than those of the corresponding areas in application datasets. The main limitation of PWC is scalability: the space complexity of the dual simplex or Sinkhorn iteration is at least O(N_app · N_dev), which might require approximations, e.g., stochastic optimization (Aude et al. 2016), slicing (Bonneel and Coeurjolly 2019), or neural networks (Xie et al. 2019). As a potential risk, if infrequent patterns in application datasets are as important as frequent patterns, our proposed method may ignore some important cases, and it may be desirable to combine our covering with methods focusing on fairness.
References

Ahmed, M. 2019. Data summarization: a survey. Knowledge and Information Systems, 58(2): 249–273.

Alvarez-Melis, D.; and Fusi, N. 2020. Geometric Dataset Distances via Optimal Transport. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M. F.; and Lin, H., eds., Advances in Neural Information Processing Systems 33, volume 33, 21428–21439. Curran Associates, Inc.

An, J.; and Cho, S. 2015. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2(1): 1–18.

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein Generative Adversarial Networks. In Precup, D.; and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 214–223. PMLR.

Aude, G.; Cuturi, M.; Peyré, G.; and Bach, F. 2016. Stochastic optimization for large-scale optimal transport. arXiv preprint arXiv:1605.08527.

Bachem, O.; Lucic, M.; and Krause, A. 2017. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476.

Bar-Noy, A.; and Rabanca, G. 2016. Tight approximation bounds for the seminar assignment problem. In International Workshop on Approximation and Online Algorithms, 170–182. Springer.

Benamou, J.-D.; Carlier, G.; Cuturi, M.; Nenna, L.; and Peyré, G. 2015. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2): A1111–A1138.

Bezdek, J. C.; and Kuncheva, L. I. 2001. Nearest prototype classifier designs: An experimental study. International Journal of Intelligent Systems, 16(12): 1445–1473.

Bonneel, N.; and Coeurjolly, D. 2019. SPOT: sliced partial optimal transport. ACM Transactions on Graphics (TOG), 38(4): 1–13.

Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; and Sander, J. 2000. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 93–104.

Brighton, H.; and Mellish, C. 2002. Advances in instance selection for instance-based learning algorithms. Data Mining and Knowledge Discovery, 6(2): 153–172.

Chapel, L.; Alaya, M.; and Gasso, G. 2020. Partial optimal transport with applications on positive-unlabeled learning. In Advances in Neural Information Processing Systems 33, 2903–2913.

Chou, C.-H.; Kuo, B.-H.; and Chang, F. 2006. The generalized condensed nearest neighbor rule as a data reduction method. In Proceedings of the 18th International Conference on Pattern Recognition, volume 2, 556–559. IEEE.

Coletta, L. F.; Ponti, M.; Hruschka, E. R.; Acharya, A.; and Ghosh, J. 2019. Combining clustering and active learning for the detection and learning of new image classes. Neurocomputing, 358: 150–165.

Figalli, A. 2010. The optimal partial transport problem. Archive for Rational Mechanics and Analysis, 195(2): 533–560.

Geiger, A.; Lenz, P.; Stiller, C.; and Urtasun, R. 2013. Vision meets Robotics: The KITTI Dataset. International Journal of Robotics Research.

Hart, P. 1968. The condensed nearest neighbor rule (corresp.). IEEE Transactions on Information Theory, 14(3): 515–516.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Holub, A.; Perona, P.; and Burl, M. C. 2008. Entropy-based active learning for object recognition. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 1–8. IEEE.

IBM ILOG CPLEX Optimization Studio. 2013. CPLEX User's Manual. Vers, 12: 207–258.

Kim, K. H.; Shim, S.; Lim, Y.; Jeon, J.; Choi, J.; Kim, B.; and Yoon, A. S. 2019. RaPP: Novelty detection with reconstruction along projection pathway. In International Conference on Learning Representations.

LeCun, Y.; and Cortes, C. 2010. MNIST handwritten digit database.

Lin, H.; and Bilmes, J. 2011. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 510–520.

Liu, H.; and Motoda, H. 2002. On issues of instance selection. Data Mining and Knowledge Discovery, 6(2): 115–130.

Marcel, S.; and Rodriguez, Y. 2010. Torchvision: the machine-vision package of Torch. In Proceedings of the 18th ACM International Conference on Multimedia, 1485–1488.

Mirzasoleiman, B.; Zadimoghaddam, M.; and Karbasi, A. 2016. Fast Distributed Submodular Cover: Public-Private Data Summarization. In Advances in Neural Information Processing Systems 29, 3594–3602.

Mitrovic, M.; Kazemi, E.; Zadimoghaddam, M.; and Karbasi, A. 2018. Data summarization at scale: A two-stage submodular approach. In Proceedings of the 35th International Conference on Machine Learning, 3596–3605. PMLR.

Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1): 265–294.

Olvera-López, J. A.; Carrasco-Ochoa, J. A.; Martínez-Trinidad, J. F.; and Kittler, J. 2010. A review of instance selection methods. Artificial Intelligence Review, 34(2): 133–143.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Köpf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H. M.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E. B.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32, 8024–8035.

Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825–2830.

Peyré, G.; and Cuturi, M. 2019. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6): 355–607.

ReVelle, C. S.; and Swain, R. W. 1970. Central facilities location. Geographical Analysis, 2(1): 30–42.

Riquelme, J. C.; Aguilar-Ruiz, J. S.; and Toro, M. 2003. Finding representative patterns with ordered projections. Pattern Recognition, 36(4): 1009–1018.

Ritter, G. 1975. An algorithm for a selective nearest neighbor decision rule. IEEE Transactions on Information Theory, 21(11): 665–669.

Schölkopf, B.; Williamson, R. C.; Smola, A. J.; Shawe-Taylor, J.; Platt, J. C.; et al. 1999. Support vector method for novelty detection. In Advances in Neural Information Processing Systems 12, volume 12, 582–588. Citeseer.

Sener, O.; and Savarese, S. 2018. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In International Conference on Learning Representations.

Shui, C.; Zhou, F.; Gagné, C.; and Wang, B. 2020. Deep active learning: Unified and principled method for query and training. In International Conference on Artificial Intelligence and Statistics, 1308–1318. PMLR.

Spillmann, B.; Neuhaus, M.; Bunke, H.; Pekalska, E.; and Duin, R. P. 2006. Transforming strings to vector spaces using prototype selection. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), 287–296. Springer.

Tschiatschek, S.; Iyer, R. K.; Wei, H.; and Bilmes, J. A. 2014. Learning mixtures of submodular functions for image collection summarization. In Advances in Neural Information Processing Systems 27, 1413–1421.

Vanderbei, R. J. 2015. Linear Programming, volume 3. Springer.

Vázquez, F.; Sánchez, J. S.; and Pla, F. 2005. A stochastic approach to Wilson's editing algorithm. In Iberian Conference on Pattern Recognition and Image Analysis, 35–42. Springer.

Wilson, D. L. 1972. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, (3): 408–421.

Wilson, D. R.; and Martinez, T. R. 2000. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3): 257–286.

Xie, Y.; Chen, M.; Jiang, H.; Zhao, T.; and Zha, H. 2019. On scalable and efficient computation of large scale optimal transport. In International Conference on Machine Learning, 6882–6892. PMLR.

Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; and Darrell, T. 2020. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition.