Black Box Submodular Maximization: Discrete and Continuous Settings

Lin Chen^{1†}  Mingrui Zhang^{1‡}  Hamed Hassani^{♮}  Amin Karbasi^{†}

† Department of Electrical Engineering, Yale University; ‡ Department of Statistics and Data Science, Yale University; ♮ Department of Electrical and Systems Engineering, University of Pennsylvania

^1 These authors contributed equally to this work.
^2 We note that black-box optimization (BBO) and derivative-free optimization (DFO) are not identical terms. Audet and Hare [2017] defined DFO as "the mathematical study of optimization algorithms that do not use derivatives" and BBO as "the study of design and analysis of algorithms that assume the objective and/or constraint functions are given by blackboxes". However, as the differences are nuanced in most scenarios, this paper uses the two terms interchangeably.

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

Abstract

In this paper, we consider the problem of black box continuous submodular maximization, where we only have access to function values and no information about the derivatives is provided. For a monotone continuous DR-submodular function, and subject to a bounded convex body constraint, we propose Black-box Continuous Greedy, a derivative-free algorithm that provably achieves the tight [(1 − 1/e)OPT − ε] approximation guarantee with O(d/ε^3) function evaluations. We then extend our result to the stochastic setting, where function values are subject to stochastic zero-mean noise. It is through this stochastic generalization that we revisit the discrete submodular maximization problem and use the multilinear extension as a bridge between the discrete and continuous settings. Finally, we extensively evaluate the performance of our algorithm on continuous and discrete submodular objective functions, using both synthetic and real data.

1 Introduction

Black-box optimization, also known as zeroth-order or derivative-free optimization^2, has been extensively studied in the literature [Conn et al., 2009, Bergstra et al., 2011, Rios and Sahinidis, 2013, Shahriari et al., 2016]. In this setting, we assume that the objective function is unknown and that we can only obtain zeroth-order information such as (stochastic) function evaluations.

Fueled by a growing number of machine learning applications, black-box optimization methods are usually considered in scenarios where gradients (i.e., first-order information) are 1) difficult or slow to compute, e.g., graphical model inference [Wainwright et al., 2008] and structured prediction [Taskar et al., 2005, Sokolov et al., 2016], or 2) inaccessible, e.g., hyper-parameter tuning for natural language processing or image classification [Snoek et al., 2012, Thornton et al., 2013] and black-box attacks for finding adversarial examples [Chen et al., 2017c, Ilyas et al., 2018]. Even though heuristics such as random or grid search, with undesirable dependencies on the dimension, are still used in some applications (e.g., parameter tuning for deep networks), there has been a growing number of rigorous methods that address the convergence rate of black-box optimization in convex and non-convex settings [Wang et al., 2017, Balasubramanian and Ghadimi, 2018, Sahu et al., 2018].

The focus of this paper is constrained continuous DR-submodular maximization over a bounded convex body. We aim to design an algorithm that uses only zeroth-order information while avoiding expensive projection operations. Note that one way optimization methods can deal with constraints is to apply a projection oracle once the proposed iterates land outside the feasible region. However, computing the projection in many constrained settings is computationally prohibitive (e.g., projection onto bounded trace-norm matrices, the flow polytope, the matroid polytope, or rotation matrices).
In such scenarios, projection-free algorithms, a.k.a. Frank-Wolfe [Frank and Wolfe, 1956], replace the projection with a linear program. Indeed, our proposed algorithm efficiently combines zeroth-order information with a series of linear programs to ensure convergence to a near-optimal solution.

Continuous DR-submodular functions are an important subset of non-convex functions that can be minimized exactly [Bach, 2016, Staib and Jegelka, 2017] and maximized approximately [Bian et al., 2017a,b, Hassani et al., 2017, Mokhtari et al., 2018a, Hassani et al., 2019, Zhang et al., 2019b]. This class of functions generalizes the notion of diminishing returns, usually defined over discrete set functions, to continuous domains. They have found numerous applications in machine learning, including MAP inference in determinantal point processes (DPPs) [Kulesza et al., 2012], experimental design [Chen et al., 2018c], resource allocation [Eghbali and Fazel, 2016], and mean-field inference in probabilistic models [Bian et al., 2018], among many others.

Motivation: Computing the gradient of a continuous DR-submodular function has been shown to be computationally prohibitive (or even intractable) in many applications. For example, the objective function of influence maximization is defined via specific stochastic processes [Kempe et al., 2003, Rodriguez and Schölkopf, 2012], and computing or estimating the gradient of the multilinear extension would require a relatively high computational complexity. In the problem of D-optimal experimental design, the gradient of the objective function involves inversion of a potentially large matrix [Chen et al., 2018c]. Moreover, when one attacks a submodular recommender model, only black-box information is available and the service provider is unlikely to provide additional first-order information (this is known as the black-box adversarial attack model) [Lei et al., 2019].

There has been very recent progress on developing zeroth-order methods for constrained optimization problems in convex and non-convex settings [Ghadimi and Lan, 2013, Sahu et al., 2018]. Such methods typically assume that the objective function is defined on the whole R^d, so that they can sample points from a proper distribution defined on R^d. For DR-submodular functions, this assumption might be unrealistic, since many DR-submodular functions are defined only on a subset of R^d; e.g., the multilinear extension [Vondrák, 2008], a canonical example of a DR-submodular function, is defined only on the unit cube. Moreover, these methods can only guarantee convergence to a first-order stationary point. However, Hassani et al. [2017] showed that for a monotone DR-submodular function, stationary points can only guarantee a 1/2 approximation to the optimum. Therefore, if a state-of-the-art zeroth-order non-convex algorithm is used for maximizing a monotone DR-submodular function, it is likely to terminate at a suboptimal stationary point whose approximation ratio is only 1/2.

Our contributions: In this paper, we propose a derivative-free and projection-free algorithm, Black-box Continuous Greedy (BCG), that maximizes a monotone continuous DR-submodular function over a bounded convex body K ⊆ R^d. We consider three scenarios:

(1) In the deterministic setting, where function evaluations can be obtained exactly, BCG achieves the tight [(1 − 1/e)OPT − ε] approximation guarantee with O(d/ε^3) function evaluations.

(2) In the stochastic setting, where function evaluations are noisy, BCG achieves the tight [(1 − 1/e)OPT − ε] approximation guarantee with O(d^3/ε^5) function evaluations.

(3) In the discrete setting, Discrete Black-box Greedy (DBG) achieves the tight [(1 − 1/e)OPT − ε] approximation guarantee with O(d^5/ε^5) function evaluations.

All the theoretical results are summarized in Table 1.

Table 1: Number of function queries in different settings, where D_1 is the diameter of K.

  Function                        | Additional Assumptions          | Function Queries
  continuous DR-submodular        | monotone, G-Lipschitz, L-smooth | O(max{G, LD_1}^3 · d/ε^3)   [Theorem 1]
  stoch. continuous DR-submodular | monotone, G-Lipschitz, L-smooth | O(max{G, LD_1}^3 · d^3/ε^5) [Theorem 2]
  discrete submodular             | monotone                        | O(d^5/ε^5)                  [Theorem 3]

We would like to note that in the discrete setting, due to the conservative upper bounds on the Lipschitz and smoothness parameters of general multilinear extensions, and the variance of the gradient estimators under noisy function evaluations, the number of function queries required in theory is larger than the best known result, O(d^{5/2}/ε^3), in Mokhtari et al. [2018a,b]. However, our experiments show that, empirically, our proposed algorithm often requires significantly fewer function evaluations and less running time, while achieving a practically similar utility.

Novelty of our work: All the previous results in constrained DR-submodular maximization assume access to (stochastic) gradients. In this work, we address a harder problem, i.e., we provide the first rigorous analysis for the case where only (stochastic) function values can be obtained.
More specifically, with the smoothing trick [Flaxman et al., 2005], one can construct an unbiased gradient estimator from function queries alone. However, this estimator has a large O(d^2/δ^2) variance, which may cause FW-type methods to diverge. To overcome this issue, we build on the momentum method proposed by Mokhtari et al. [2018a], in which they assumed access to first-order information.

Given a point x, the smoothed version of F at x is defined as E_{v∼B^d}[F(x + δv)]. If x is close to the boundary of the domain D, the point x + δv may fall outside of D, leaving the smoothed function undefined for many instances of DR-submodular functions (e.g., the multilinear extension is only defined over the unit cube). Thus the vanilla smoothing trick will not work. To this end, we transform the domain D and the constraint set K in a proper way and run our zeroth-order method on the transformed constraint set K'. Importantly, we retrieve the same convergence rate of O(T^{−1/3}) as in Mokhtari et al. [2018a] with a minimum number of function queries in the different settings (continuous, stochastic continuous, discrete).

We further note that by using more recent variance reduction techniques [Zhang et al., 2019b], one might be able to reduce the required number of function evaluations.

1.1 Further Related Work

Submodular functions [Nemhauser et al., 1978], which capture the intuitive notion of diminishing returns, have become increasingly important in various machine learning applications. Examples include graph cuts in computer vision [Jegelka and Bilmes, 2011a,b], data summarization [Lin and Bilmes, 2011a,b, Tschiatschek et al., 2014, Chen et al., 2018a, 2017b], influence maximization [Kempe et al., 2003, Rodriguez and Schölkopf, 2012, Zhang et al., 2016], feature compression [Bateni et al., 2019], network inference [Chen et al., 2017a], active and semi-supervised learning [Guillory and Bilmes, 2010, Golovin and Krause, 2011, Wei et al., 2015], crowd teaching [Singla et al., 2014], dictionary learning [Das and Kempe, 2011], fMRI parcellation [Salehi et al., 2017], compressed sensing and structured sparsity [Bach, 2010, Bach et al., 2012], fairness in machine learning [Balkanski and Singer, 2015, Celis et al., 2016], and learning causal structures [Steudel et al., 2010, Zhou and Spanos, 2016], to name a few. Continuous DR-submodular functions naturally extend the notion of diminishing returns to continuous domains [Bian et al., 2017b]. Monotone continuous DR-submodular functions can be (approximately) maximized over convex bodies using first-order methods [Bian et al., 2017b, Hassani et al., 2017, Mokhtari et al., 2018a]. Bandit maximization of monotone continuous DR-submodular functions [Zhang et al., 2019a] is a closely related setting to ours. However, to the best of our knowledge, none of the existing works has developed a zeroth-order algorithm for maximizing a monotone continuous DR-submodular function. For a detailed review of DFO and BBO, interested readers may refer to the book by Audet and Hare [2017].

2 Preliminaries

Submodular Functions. We say a set function f : 2^Ω → R is submodular if it satisfies the diminishing returns property: for any A ⊆ B ⊆ Ω and x ∈ Ω \ B, we have

    f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B).    (1)

In words, the marginal gain of adding an element x to a subset A is no less than that of adding x to its superset B.

For the continuous analogue, consider a function F : X → R_+, where X = Π_{i=1}^n X_i and each X_i is a compact subset of R_+. We define F to be continuous submodular if F is continuous and for all x, y ∈ X we have

    F(x) + F(y) ≥ F(x ∨ y) + F(x ∧ y),    (2)

where ∨ and ∧ are the component-wise maximum and minimum operators, respectively.

The continuous function F is called DR-submodular [Bian et al., 2017b] if F is differentiable and ∇F(x) ≥ ∇F(y) for all x ≤ y. An important implication of DR-submodularity is that F is concave along any non-negative direction, i.e., for x ≤ y we have

    F(y) ≤ F(x) + ⟨∇F(x), y − x⟩.    (3)

The function F is called monotone if F(x) ≤ F(y) for all x ≤ y.

Smoothing Trick. For a function F defined on R^d, its δ-smoothed version is given by

    F̃_δ(x) ≜ E_{v∼B^d}[F(x + δv)],    (4)

where v is chosen uniformly at random from the d-dimensional unit ball B^d. In words, the function F̃_δ at any point x is obtained by "averaging" F over a ball of radius δ around x. In the sequel, we omit the subscript δ for the sake of simplicity and use F̃ instead of F̃_δ.
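To make the smoothing trick (4) concrete, the following is a minimal Monte Carlo sketch, not taken from the paper, that estimates F̃_δ(x) for a black-box objective F given as a Python callable; the function name, the ball-sampling scheme, and the sample count are our own illustrative choices.

```python
import numpy as np

def smoothed_value(F, x, delta, num_samples=1000, rng=None):
    """Monte Carlo estimate of the delta-smoothed function
    F_tilde(x) = E_{v ~ B^d}[F(x + delta * v)], with v uniform in the unit ball."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    total = 0.0
    for _ in range(num_samples):
        g = rng.standard_normal(d)
        u = g / np.linalg.norm(g)        # uniform direction on the unit sphere
        r = rng.random() ** (1.0 / d)    # radius that makes r * u uniform in the unit ball
        total += F(x + delta * r * u)
    return total / num_samples
```

Note that this sketch implicitly assumes x is at distance at least δ from the boundary of the domain of F, which is exactly the issue that the domain transformation discussed above (and formalized in Section 3) is designed to handle.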
Lemma 1 below shows that, under the Lipschitz assumption on F, the smoothed version F̃ is a good approximation of F and also inherits the key structural properties of F (such as monotonicity and submodularity). Thus one can (approximately) optimize F via optimizing F̃.

Lemma 1 (Proof in Appendix A). If F is monotone continuous DR-submodular and G-Lipschitz continuous on R^d, then so is F̃, and

    |F̃(x) − F(x)| ≤ δG.    (5)

An important property of F̃ is that one can obtain an unbiased estimate of its gradient ∇F̃ from a single query of F. This property plays a key role in our proposed derivative-free algorithms.

Lemma 2 (Lemma 6.5 in [Hazan, 2016]). Given a function F on R^d, if we choose u uniformly at random from the (d − 1)-dimensional unit sphere S^{d−1}, then we have

    ∇F̃(x) = E_{u∼S^{d−1}}[(d/δ) F(x + δu) u].    (6)

3 DR-Submodular Maximization

In this paper, we mainly focus on the constrained optimization problem

    max_{x∈K} F(x),    (7)

where F is a monotone continuous DR-submodular function on R^d, and the constraint set K ⊆ X ⊆ R^d is convex and compact.

For first-order monotone DR-submodular maximization, one can use Continuous Greedy [Calinescu et al., 2011, Bian et al., 2017b], a variant of the Frank-Wolfe algorithm [Frank and Wolfe, 1956, Jaggi, 2013, Lacoste-Julien and Jaggi, 2015], to achieve the [(1 − 1/e)OPT − ε] approximation guarantee. At iteration t, the FW variant first maximizes the linearization of the objective function F:

    v_t = arg max_{v∈K} ⟨v, ∇F(x_t)⟩.    (8)

Then the current point x_t moves in the direction of v_t with a step size γ_t ∈ (0, 1]:

    x_{t+1} = x_t + γ_t v_t.    (9)

Hence, by solving linear optimization problems, the iterates are updated without resorting to a projection oracle.

Here we introduce our main algorithm, Black-box Continuous Greedy, which assumes access only to function values (i.e., zeroth-order information). This algorithm is partially based on the idea of Continuous Greedy. The basic idea is to utilize function evaluations of F at carefully selected points to obtain unbiased estimates of the gradient of the smoothed version, ∇F̃. By extending Continuous Greedy to the derivative-free setting and using recently proposed variance reduction techniques, we can then optimize F̃ near-optimally. Finally, by Lemma 1 we show that the obtained optimizer also provides a good solution for F.

Recall that continuous DR-submodular functions are defined on a box X = Π_{i=1}^n X_i. To simplify the exposition, we can assume, without loss of generality, that the objective function F is defined on D ≜ Π_{i=1}^d [0, a_i] [Bian et al., 2017a]. Moreover, we note that since F̃(x) = E_{v∼B^d}[F(x + δv)], for x close to ∂D (the boundary of D), the point x + δv may fall outside of D, leaving the function F̃ undefined.

To circumvent this issue, we shrink the domain D by δ. Precisely, the shrunk domain is defined as

    D'_δ = {x ∈ D | d(x, ∂D) ≥ δ}.    (10)

Since we assume D = Π_{i=1}^d [0, a_i], the shrunk domain is D'_δ = Π_{i=1}^d [δ, a_i − δ]. Then for all x ∈ D'_δ we have x + δv ∈ D, so F̃ is well-defined on D'_δ. By Lemma 1, the optimum of F̃ on the shrunk domain D'_δ will be close to that on the original domain D if δ is small enough. Therefore, we can first optimize F̃ on D'_δ, and thereby approximately optimize F̃ (and thus F) on D. For simplicity of analysis, we also translate the shrunk domain D'_δ by −δ and denote the result by D_δ = Π_{i=1}^d [0, a_i − 2δ].

Besides the domain D, we also need to consider the transformation of the constraint set K. Intuitively, if there were no translation, we would consider the intersection of K and the shrunk domain D'_δ. But since we translate D'_δ by −δ, the same transformation should be performed on K. Thus, we define the transformed constraint set as the translated intersection (by −δ) of D'_δ and K:

    K' ≜ (D'_δ ∩ K) − δ1 = D_δ ∩ (K − δ1).    (11)

It is well known that the FW algorithm is sensitive to the accuracy of the gradient, and may have arbitrarily poor performance with stochastic gradients [Hazan and Luo, 2016, Mokhtari et al., 2018b]. Thus we incorporate two variance reduction methods into our proposed algorithm Black-box Continuous Greedy; they correspond to Step 7 and Step 8 in Algorithm 1, respectively.

First, instead of the one-point gradient estimate in Lemma 2, we adopt the two-point estimator of ∇F̃(x) [Agarwal et al., 2010, Shamir, 2017]:

    (d/2δ)(F(x + δu) − F(x − δu)) u,    (12)

where u is chosen uniformly at random from the unit sphere S^{d−1}. We note that (12) is an unbiased gradient estimator with lower variance than the one-point estimator. We also average over a mini-batch of B_t independently sampled two-point estimators for further variance reduction.
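As an illustration of the mini-batch two-point estimator in (12), here is a short sketch of ours (not the authors' code); it assumes the query points x ± δu lie inside the domain on which F is defined, which Algorithm 1 enforces by working on the shrunk, translated constraint set K' and shifting its queries by δ1.

```python
import numpy as np

def two_point_gradient(F, x, delta, batch_size, rng=None):
    """Mini-batch two-point estimator of the smoothed gradient (12):
    average of (d / (2*delta)) * (F(x + delta*u) - F(x - delta*u)) * u
    over batch_size directions u drawn uniformly from the unit sphere S^{d-1}."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    grad = np.zeros(d)
    for _ in range(batch_size):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)           # uniform direction on the unit sphere
        grad += (d / (2.0 * delta)) * (F(x + delta * u) - F(x - delta * u)) * u
    return grad / batch_size
```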
The second variance reduction technique is the momentum method used in [Mokhtari et al., 2018a], which estimates the gradient by a vector ḡ_t that is updated at each iteration as follows:

    ḡ_t = (1 − ρ_t) ḡ_{t−1} + ρ_t g_t.    (13)

Here ρ_t is a given step size, ḡ_0 is initialized as the all-zero vector 0, and g_t is an unbiased estimate of the gradient at iterate x_t. As ḡ_t is a weighted average of the previous gradient approximation ḡ_{t−1} and the newly computed stochastic gradient g_t, it has a lower variance than g_t. Although ḡ_t is not an unbiased estimate of the true gradient, its error approaches zero as time proceeds. The detailed description of Black-box Continuous Greedy is provided in Algorithm 1.

Algorithm 1 Black-box Continuous Greedy
1: Input: constraint set K, iteration number T, radius δ, step size ρ_t, batch size B_t
2: Output: x_{T+1} + δ1
3: x_1 ← 0, ḡ_0 ← 0
4: for t = 1 to T do
5:   Sample u_{t,1}, ..., u_{t,B_t} i.i.d. from S^{d−1}
6:   For i = 1 to B_t, let y^+_{t,i} ← δ1 + x_t + δu_{t,i}, y^−_{t,i} ← δ1 + x_t − δu_{t,i}, and evaluate F(y^+_{t,i}), F(y^−_{t,i})
7:   g_t ← (1/B_t) Σ_{i=1}^{B_t} (d/2δ)[F(y^+_{t,i}) − F(y^−_{t,i})] u_{t,i}
8:   ḡ_t ← (1 − ρ_t) ḡ_{t−1} + ρ_t g_t
9:   v_t ← arg max_{v∈K'} ⟨v, ḡ_t⟩
10:  x_{t+1} ← x_t + v_t/T
11: end for
12: Output x_{T+1} + δ1

Theorem 1 (Proof in Appendix B). For a monotone continuous DR-submodular function F, which is also G-Lipschitz continuous and L-smooth on a convex and compact constraint set K, if we set ρ_t = 2/(t + 3)^{2/3} in Algorithm 1, then we have

    (1 − 1/e) F(x*) − E[F(x_{T+1} + δ1)] ≤ 3D_1 Q^{1/2} / T^{1/3} + L D_1^2 / (2T) + δG(1 + (√d + 1)(1 − 1/e)),

where Q = max{4^{2/3} G^2, 4cdG^2/B_t + 6L^2 D_1^2}, c is a constant, D_1 = diam(K'), and x* is the global maximizer of F on K.

Remark 1. By setting T = O(1/ε^3), B_t = d, and δ = ε/√d, the error term (RHS) is guaranteed to be at most O(ε). Also, the total number of function evaluations is at most O(d/ε^3).

We can also extend Algorithm 1 to the stochastic case, in which we obtain information about F only through its noisy function evaluations F̂(x) = F(x) + ξ, where ξ is stochastic zero-mean noise. In particular, in Step 6 of Algorithm 1, we obtain independent stochastic function evaluations F̂(y^+_{t,i}) and F̂(y^−_{t,i}) instead of the exact function values F(y^+_{t,i}) and F(y^−_{t,i}). For unbiased function evaluation oracles with uniformly bounded variance, we have the following theorem.

Theorem 2 (Proof in Appendix C). Under the conditions of Theorem 1, if we further assume that for all x, E[F̂(x)] = F(x) and E[|F̂(x) − F(x)|^2] ≤ σ_0^2, then we have

    (1 − 1/e) F(x*) − E[F(x_{T+1} + δ1)] ≤ 3D_1 Q^{1/2} / T^{1/3} + L D_1^2 / (2T) + δG(1 + (√d + 1)(1 − 1/e)),

where D_1 = diam(K'), Q = max{4^{2/3} G^2, 6L^2 D_1^2 + (4cdG^2 + 2d^2 σ_0^2/δ^2)/B_t}, c is a constant, and x* is the global maximizer of F on K.

Remark 2. By setting T = O(1/ε^3), B_t = d^3/ε^2, and δ = ε/√d, the error term (RHS) is at most O(ε). The total number of evaluations is at most O(d^3/ε^5).

4 Discrete Submodular Maximization

In this section, we describe how Black-box Continuous Greedy can be used to solve a discrete submodular maximization problem under a general matroid constraint, i.e., max_{S∈I} f(S), where f is a monotone submodular set function and I is a matroid.

For any monotone submodular set function f : 2^Ω → R_{≥0}, its multilinear extension F : [0, 1]^d → R_{≥0}, defined as

    F(x) = Σ_{S⊆Ω} f(S) Π_{i∈S} x_i Π_{j∉S} (1 − x_j),    (14)

is monotone and DR-submodular [Calinescu et al., 2011]. Here, d = |Ω| is the size of the ground set Ω. Equivalently, we have F(x) = E_{S∼x}[f(S)], where S ∼ x means that each element i ∈ Ω is included in S with probability x_i independently.

It can be shown that, in lieu of solving the discrete optimization problem, one can solve the continuous optimization problem max_{x∈K} F(x), where K = conv{1_I : I ∈ I} is the matroid polytope [Calinescu et al., 2011]. This equivalence is obtained by showing that (i) the optimal values of the two problems are the same, and (ii) for any fractional vector x ∈ K we can deploy efficient, lossless rounding procedures that produce a set S ∈ I such that E[f(S)] ≥ F(x) (e.g., pipage rounding [Ageev and Sviridenko, 2004, Calinescu et al., 2011] and contention resolution [Chekuri et al., 2014]). So we can view F̃ as the underlying function that we intend to optimize, and invoke Black-box Continuous Greedy.
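For readers who prefer code to pseudocode, the following is a compact sketch of Algorithm 1 under simplifying assumptions: the linear maximization oracle over K' (Step 9) is passed in as a user-supplied callable lmo (e.g., a wrapper around an LP solver when K' is a polytope), the step size ρ_t follows Theorem 1, and the names and oracle interface are our own. It is an illustration, not the authors' reference implementation.

```python
import numpy as np

def black_box_continuous_greedy(F, lmo, d, T, delta, batch_size, rng=None):
    """Sketch of Algorithm 1 (Black-box Continuous Greedy).
    F   : zeroth-order oracle, evaluated at points of the original domain D.
    lmo : linear maximization oracle over the transformed set K',
          lmo(g) = argmax_{v in K'} <v, g>  (assumed to be supplied by the caller).
    Returns x_{T+1} + delta * 1, a point of the original constraint set."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(d)
    g_bar = np.zeros(d)
    one = np.ones(d)
    for t in range(1, T + 1):
        rho = 2.0 / (t + 3) ** (2.0 / 3.0)       # step size rho_t from Theorem 1
        g = np.zeros(d)
        for _ in range(batch_size):              # Steps 5-7: mini-batch two-point estimates
            u = rng.standard_normal(d)
            u /= np.linalg.norm(u)
            y_plus = delta * one + x + delta * u
            y_minus = delta * one + x - delta * u
            g += (d / (2.0 * delta)) * (F(y_plus) - F(y_minus)) * u
        g /= batch_size
        g_bar = (1.0 - rho) * g_bar + rho * g    # Step 8: momentum averaging (13)
        v = lmo(g_bar)                           # Step 9: linear maximization over K'
        x = x + v / T                            # Step 10: Frank-Wolfe style update
    return x + delta * one
```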
As a result, we want F to be G-Lipschitz and L-smooth as in Theorem 1. The following lemma shows that these properties are satisfied automatically if f is bounded.

Lemma 3. For a submodular set function f defined on Ω with sup_{X⊆Ω} |f(X)| ≤ M, its multilinear extension F is 2M√d-Lipschitz and 4M√(d(d−1))-smooth.

We note that the bounds on the Lipschitz and smoothness parameters actually depend on the norms that we consider. However, different norms are equivalent up to a factor that may depend on the dimension; if we consider another norm, some dimension factors may be absorbed into the norm. Therefore, we only study the Euclidean norm in Lemma 3.

We further note that computing the exact value of F is difficult, as it requires evaluating f over all the subsets S ⊆ Ω. However, one can construct an unbiased estimate of the value F(x) by simply sampling a random set S ∼ x and returning f(S) as the estimate. We present our algorithm in detail in Algorithm 2, where we have D = [0, 1]^d, since F is defined on [0, 1]^d, and thus D_δ = [0, 1 − 2δ]^d. We state the theoretical result formally in Theorem 3.

Algorithm 2 Discrete Black-box Greedy
1: Input: matroid constraint I, transformed constraint set K' = D_δ ∩ (K − δ1) where K = conv{1_I : I ∈ I}, number of iterations T, radius δ, step size ρ_t, batch size B_t, sample size S_{t,i}
2: Output: X_{T+1}
3: x_1 ← 0, ḡ_0 ← 0
4: for t = 1 to T do
5:   Sample u_{t,1}, ..., u_{t,B_t} i.i.d. from S^{d−1}
6:   For i = 1 to B_t, let y^+_{t,i} ← δ1 + x_t + δu_{t,i}, y^−_{t,i} ← δ1 + x_t − δu_{t,i}; independently sample subsets Y^+_{t,i,j}, Y^−_{t,i,j}, ∀j ∈ [S_{t,i}], according to y^+_{t,i}, y^−_{t,i}; evaluate the function values f(Y^+_{t,i,j}), f(Y^−_{t,i,j}), ∀j ∈ [S_{t,i}]; and calculate the averages f̄^+_{t,i} ← (1/S_{t,i}) Σ_{j=1}^{S_{t,i}} f(Y^+_{t,i,j}), f̄^−_{t,i} ← (1/S_{t,i}) Σ_{j=1}^{S_{t,i}} f(Y^−_{t,i,j})
7:   g_t ← (1/B_t) Σ_{i=1}^{B_t} (d/2δ)(f̄^+_{t,i} − f̄^−_{t,i}) u_{t,i}
8:   ḡ_t ← (1 − ρ_t) ḡ_{t−1} + ρ_t g_t
9:   v_t ← arg max_{v∈K'} ⟨v, ḡ_t⟩
10:  x_{t+1} ← x_t + v_t/T
11: end for
12: Output X_{T+1} = round(x_{T+1} + δ1)

Theorem 3 (Proof in Appendix E). For a monotone submodular set function f with sup_{X⊆Ω} |f(X)| ≤ M, if we set ρ_t = 2/(t + 3)^{2/3} and S_{t,i} = l in Algorithm 2, then we have

    (1 − 1/e) f(X*) − E[f(X_{T+1})] ≤ 3D_1 Q^{1/2} / T^{1/3} + 2M√(d(d−1)) D_1^2 / T + 2Mδ√d (1 + (√d + 1)(1 − 1/e)),

where D_1 = diam(K'), Q = max{2d^2 M^2 (1/l + 8c)/(δ^2 B_t) + 96 d(d−1) M^2 D_1^2, 4^{5/3} d M^2}, c is a constant, and X* is the global maximizer of f under the matroid constraint I.

Remark 3. By setting T = O(d^3/ε^3), B_t = 1, l = d^2/ε^2, and δ = ε/d, the error term (RHS) is at most O(ε). The total number of evaluations is at most O(d^5/ε^5).

We note that in Algorithm 2, f̄^+_{t,i} is an unbiased estimate of F(y^+_{t,i}), and the same holds for f̄^−_{t,i} and F(y^−_{t,i}). As a result, we can analyze the algorithm within the framework of stochastic continuous submodular maximization. By applying Theorem 2 and Lemma 3, together with the facts E[|f̄^+_{t,i} − F(y^+_{t,i})|^2] ≤ M^2/S_{t,i} and E[|f̄^−_{t,i} − F(y^−_{t,i})|^2] ≤ M^2/S_{t,i}, we directly obtain Theorem 3.
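As an illustration of the averages f̄^+_{t,i}, f̄^−_{t,i} computed in Step 6 of Algorithm 2, the following sketch (our own naming, not from the paper) draws sets according to a fractional vector x and averages f over them, which yields an unbiased estimate of the multilinear extension F(x) = E_{S∼x}[f(S)]; with num_samples = S_{t,i} it plays the role of the sampling estimator described above.

```python
import numpy as np

def sampled_multilinear_value(f, x, num_samples=1, rng=None):
    """Unbiased estimate of the multilinear extension F(x) = E_{S ~ x}[f(S)]:
    include each element i independently with probability x[i], evaluate the
    set function f on the sampled set, and average over num_samples draws."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(num_samples):
        sample = {i for i in range(len(x)) if rng.random() < x[i]}
        total += f(sample)
    return total / num_samples
```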
5 Experiments

In this section, we compare Black-box Continuous Greedy (BCG) and Discrete Black-box Greedy (DBG) with the following baselines:

(1) Zeroth-Order Gradient Ascent (ZGA) is the projected gradient ascent algorithm equipped with the same two-point gradient estimator that BCG uses. It is therefore a zeroth-order projected algorithm.

(2) Stochastic Continuous Greedy (SCG) is the state-of-the-art first-order algorithm for maximizing continuous DR-submodular functions [Mokhtari et al., 2018a,b]. Note that it is a projection-free algorithm.

(3) Gradient Ascent (GA) is the first-order projected gradient ascent algorithm [Hassani et al., 2017].

The stopping criterion for all algorithms is reaching a given number of iterations. Moreover, the batch sizes B_t in Algorithm 1 and S_{t,i} in Algorithm 2 are both set to 1. Therefore, in the experiments, DBG uses 1 query per iteration while SCG uses O(d) queries.

We perform four sets of experiments, described in detail in the following. The first two are maximization problems over continuous DR-submodular functions, which Black-box Continuous Greedy is designed to solve. The last two are submodular set maximization problems; we will apply Discrete Black-box Greedy to solve them. The
Lin Chen1† , Mingrui Zhang1‡ , Hamed Hassani] , Amin Karbasi† function values at different rounds and the execution problem is defined by f (S) = log det(I + ΣS,S ), where times are presented in Fig. 1 and Section 5. The first- S ⊆ [d] and ΣS,S is the principal submatrix indexed order algorithms (SCG and GA) are marked in orange, by S. The total number of 22 attributes are parti- and the zeroth-order algorithms are marked in blue. tioned into 5 disjoint subsets with sizes 4, 4, 4, 5 and 5, respectively. The problem is subject to a partition Non-convex/non-concave Quadratic Program- matroid requiring that at most one attribute should be ming (NQP): In this set of experiments, we apply our active within each subset. Since this is a submodular proposed algorithm and the baselines to the problem of set maximization problem, in order to evaluate the non-convex/non-concave quadratic programming. The gradient (i.e., obtain an unbiased estimate of gradi- objective function is of the form F (x) = 12 x> Hx + b> x, ent) required by first-order algorithms SCG and GA, where x is a 100-dimensional vector, H is a 100-by-100 it needs 2d function value queries. To be precise, the matrix, and every component of H is an i.i.d. ran- i-th component of gradient is ES∼x [f (S ∪ {i}) − f (S)] dom variable whose distribution is equal to that of the and requires two function value queries. It can be ob- negated absolute value ofPa standard normal P60 distribu- 30 served from Fig. 1c that DBG outperforms the other tion. The constraints are i=1 xi ≤ 30, i=31 xi ≤ 20, P100 zeroth-order algorithm ZGA. Although its performance and i=61 xi ≤ 20. To guarantee that the gradient is slightly worse than the two first-order algorithms is non-negative, we set bt = −H > 1. One can observe SCG and GA, it require significantly less number of from Fig. 1a that the function value that BCG attains function value queries than the other two first-order is only slightly lower than that of the first-order algo- methods (as discussed above). rithm SCG. The final function value that BCG attains is similar to that of ZGA. Influence Maximization In the influence maximiza- tion problem, we assume that every node in the network Topic Summarization: Next, we consider the topic is able to influence all of its one-hop neighbors. The ob- summarization problem [El-Arini et al., 2009, Yue and jective of influence maximization is to select a subset of Guestrin, 2011], which is to maximize the probabilistic nodes in the network, called the seed set (and denoted coverage of selected articles on news topics. Each news by S), so that the total number of influenced nodes, article is characterized by its topic distribution, which including the seed nodes, is maximized. We choose is obtained by applying latent Dirichlet allocation to the social network of Zachary’s karate club Zachary the corpus of Reuters-21578, Distribution 1.0. The [1977] in this study. The subjects in this social net- number of topics is set to 10. We will choose from 120 work are partitioned into three disjoint groups, whose news articles. The probabilistic coverage of a subset sizes are 10, 14, and 10 respectively. The chosen seed of news articles (denoted by X) is defined by f (X) = nodes should be subject to a partition matroid; i.e., We 1 P10 Q 10 j=1 [1 − a∈X (1 − pa (j))], where pa (·) is the topic will select at most two subjects from each of the three distribution of article a. PThe multilinear extension groups. 
Note that this problem is also a submodular 1 10 Q function of f is F (x) = 10 j=1 [1− a∈Ω (1−p a (j)xa )], set maximization problem. Similar to the situation in 120 where x ∈ [0, 1] Iyer et al. [2014]. The constraint the active set selection problem, first-order algorithms P40 P80 P120 is i=1 xi ≤ 25, i=41 xi ≤ 30, i=81 xi ≤ 35. It need function value queries to obtain an unbiased es- can be observed from Fig. 1b that the proposed BCG timate of gradient. We can observe from Fig. 1d that algorithm achieves the same function value as the first- DBG attains a better influence coverage than the other ordered algorithm SCG and outperforms the other two. zeroth-order algorithm ZGA. Again, even though SCG As shown in Fig. 2a, BCG is the most efficient method. and GA achieve a slightly better coverage, due to their The two projection-free algorithms BCG and SCG run first-order nature, they require a significantly larger faster than the projected methods ZGA and GA. We number of function value queries. will elaborate on the running time later in this section. Active Set Selection We study the active set selec- Running Time The running times of the our pro- tion problem that arises in Gaussian process regres- posed algorithms and the baselines are presented in sion Mirzasoleiman et al. [2013]. We use the Parkin- Section 5 for the above-mentioned experimental set- sons Telemonitoring dataset, which is composed of ups. There are two main conclusions. First, the two biomedical voice measurements from people with early- projection-based algorithms (ZGA and GA) require stage Parkinson’s disease [Tsanas et al., 2010]. Let significantly higher time complexity compared to the X ∈ Rn×d denote the data matrix. Each row X[i, :] projection-free algorithms (BCG, DBG, and SCG), is a voice recording while each column X[:, j] denotes as the projection-based algorithms require solving an attribute. The covariance matrix Σ is defined by quadratic optimization problems whereas projection- Σij = exp(−kX[:, i] − X[:, j]k2 )/h2 , where h is set to free ones require solving linear optimization problems 0.75. The objective function of the active set selection which can be solved more efficiently. Second, when we compare first-order and zeroth-order algorithms, we
can observe that the zeroth-order algorithms (BCG, DBG, and ZGA) run faster than their first-order counterparts (SCG and GA).

Figure 1: Function value vs. number of oracle queries, for (a) NQP, (b) topic summarization, (c) active set selection, and (d) influence maximization. The vertical axes report function values. Note that every chart has dual horizontal axes: the orange lines (first-order methods) use the orange horizontal axes above (number of first-order oracle queries / queries to compute gradients), while the blue lines (zeroth-order methods) use the blue axes below (number of zeroth-order oracle queries).

Figure 2: Relative running time on (a) NQP, (b) topic summarization, (c) active set selection, and (d) influence maximization, normalized with respect to BCG (for continuous DR-submodular maximization in the first two sets of experiments) and DBG (for submodular set maximization in the last two sets of experiments).

Summary: The above experimental results show the following major advantages of our method over the baselines, including SCG and ZGA.

1. BCG/DBG is at least twice as fast as SCG and ZGA in all tasks in terms of running time (Figs. 2a to 2d).

2. DBG requires remarkably fewer function evaluations in the discrete setting (Figs. 1c and 1d).

3. In addition to saving function evaluations, BCG/DBG achieves an objective function value comparable to that of the first-order baselines SCG and GA.

Furthermore, we note that the number of first-order queries required by SCG is only half the number of zeroth-order queries required by BCG. However, as shown in Figs. 2a and 2b, BCG runs significantly faster than SCG, since a zeroth-order evaluation is faster than a first-order one.

In the topic summarization task (Fig. 1b), BCG attains an objective function value similar to that of the first-order baselines SCG and GA. In the other three tasks, BCG/DBG runs notably faster while achieving an only slightly inferior function value. Therefore, BCG/DBG is particularly preferable in large-scale machine learning tasks and in applications where the total number of function evaluations or the running time is subject to a budget.

6 Conclusion

In this paper, we presented Black-box Continuous Greedy, a derivative-free and projection-free algorithm for maximizing a monotone continuous DR-submodular function subject to a general convex body constraint. We showed that Black-box Continuous Greedy achieves the tight [(1 − 1/e)OPT − ε] approximation guarantee with O(d/ε^3) function evaluations. We then extended the algorithm to the stochastic continuous setting and to the discrete submodular maximization problem. Our experiments on both synthetic and real data validated the performance of the proposed algorithms. In particular, we observed that Black-box Continuous Greedy practically achieves the same utility as Continuous Greedy while being far more efficient in terms of the number of function evaluations.

Acknowledgements

LC is supported by the Google PhD Fellowship.
HH baselines SCG and GA, in terms of the attained is supported by AFOSR Award 19RT0726, NSF HDR objective function value. In the other three tasks, TRIPODS award 1934876, NSF award CPS-1837253, BCG/DBG runs notably faster while achieving an only NSF award CIF-1910056, and NSF CAREER award slightly inferior function value. Therefore, BCG/DBG CIF-1943064. AK is partially supported by NSF
Lin Chen1† , Mingrui Zhang1‡ , Hamed Hassani] , Amin Karbasi† (IIS-1845032), ONR (N00014-19-1-2406), and AFOSR Andrew An Bian, Baharan Mirzasoleiman, Joachim (FA9550-18-1-0160). Buhmann, and Andreas Krause. Guaranteed non- convex optimization: Submodular maximization over References continuous domains. In Artificial Intelligence and Statistics, pages 111–120, 2017b. Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal Gruia Calinescu, Chandra Chekuri, Martin Pál, and algorithms for online convex optimization with multi- Jan Vondrák. Maximizing a monotone submodular point bandit feedback. In COLT, pages 28–40. Cite- function subject to a matroid constraint. SIAM seer, 2010. Journal on Computing, 40(6):1740–1766, 2011. Alexander A Ageev and Maxim I Sviridenko. Pipage L Elisa Celis, Amit Deshpande, Tarun Kathuria, and rounding: A new method of constructing algorithms Nisheeth K Vishnoi. How to be fair and diverse? with proven performance guarantee. Journal of Com- arXiv preprint arXiv:1610.07183, 2016. binatorial Optimization, 8(3):307–328, 2004. Chandra Chekuri, Jan Vondrák, and Rico Zenklusen. Charles Audet and Warren Hare. Derivative-free and Submodular function maximization via the multi- blackbox optimization. Springer, 2017. linear relaxation and contention resolution schemes. Francis Bach. Submodular functions: from discrete SIAM Journal on Computing, 43(6):1831–1879, 2014. to continuous domains. Mathematical Programming, pages 1–41, 2016. Lin Chen, Forrest W Crawford, and Amin Karbasi. Submodular variational inference for network recon- Francis Bach, Rodolphe Jenatton, Julien Mairal, Guil- struction. In UAI, 2017a. laume Obozinski, et al. Optimization with sparsity- inducing penalties. Foundations and Trends R in Lin Chen, Andreas Krause, and Amin Karbasi. Interac- Machine Learning, 4(1):1–106, 2012. tive submodular bandit. In NeurIPS, pages 141–152, 2017b. Francis R Bach. Structured sparsity-inducing norms through submodular functions. In Advances in Neu- Lin Chen, Moran Feldman, and Amin Karbasi. Weakly ral Information Processing Systems, pages 118–126, submodular maximization beyond cardinality con- 2010. straints: Does randomization help greedy? In ICML, pages 804–813, 2018a. Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)-convex stochastic optimization Lin Chen, Christopher Harshaw, Hamed Hassani, and via conditional gradient and gradient updates. In Amin Karbasi. Projection-free online optimization Advances in Neural Information Processing Systems, with stochastic gradient: From convexity to submod- pages 3459–3468, 2018. ularity. In ICML, pages 813–822, 10–15 Jul 2018b. Eric Balkanski and Yaron Singer. Mechanisms for fair Lin Chen, Hamed Hassani, and Amin Karbasi. Online attribution. In Proceedings of the Sixteenth ACM continuous submodular maximization. In AISTATS, Conference on Economics and Computation, pages pages 1896–1905, 2018c. 529–546. ACM, 2015. Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, Mohammadhossein Bateni, Lin Chen, Hossein Esfan- and Cho-Jui Hsieh. Zoo: Zeroth order optimization diari, Thomas Fu, Vahab Mirrokni, and Afshin Ros- based black-box attacks to deep neural networks tamizadeh. Categorical feature compression via sub- without training substitute models. In Proceedings modular optimization. In International Conference of the 10th ACM Workshop on Artificial Intelligence on Machine Learning, pages 515–523, 2019. and Security, pages 15–26. ACM, 2017c. 
James S Bergstra, Rémi Bardenet, Yoshua Bengio, Andrew R Conn, Katya Scheinberg, and Luis N Vicente. and Balázs Kégl. Algorithms for hyper-parameter Introduction to derivative-free optimization, volume 8. optimization. In Advances in neural information Siam, 2009. processing systems, pages 2546–2554, 2011. Abhimanyu Das and David Kempe. Submodular meets An Bian, Kfir Levy, Andreas Krause, and Joachim M spectral: Greedy algorithms for subset selection, Buhmann. Continuous dr-submodular maximization: sparse approximation and dictionary selection. arXiv Structure and algorithms. In Advances in Neural In- preprint arXiv:1102.3975, 2011. formation Processing Systems, pages 486–496, 2017a. Reza Eghbali and Maryam Fazel. Designing smooth- An Bian, Joachim M Buhmann, and Andreas Krause. ing functions for improved worst-case competitive Optimal dr-submodular maximization and applica- ratio in online optimization. In Advances in Neural tions to provable mean field inference. arXiv preprint Information Processing Systems, pages 3287–3295, arXiv:1805.07482, 2018. 2016.
Black Box Submodular Maximization: Discrete and Continuous Settings Khalid El-Arini, Gaurav Veda, Dafna Shahaf, and Car- Martin Jaggi. Revisiting frank-wolfe: Projection-free los Guestrin. Turning down the noise in the blogo- sparse convex optimization. In ICML (1), pages sphere. In Proceedings of the 15th ACM SIGKDD 427–435, 2013. international conference on Knowledge discovery and Stefanie Jegelka and Jeff Bilmes. Submodularity be- data mining, pages 289–298. ACM, 2009. yond submodular energies: coupling edges in graph Abraham D Flaxman, Adam Tauman Kalai, and cuts. 2011a. H Brendan McMahan. Online convex optimization Stefanie Jegelka and Jeff A Bilmes. Approximation in the bandit setting: gradient descent without a gra- bounds for inference using cooperative cuts. In Pro- dient. In Proceedings of the sixteenth annual ACM- ceedings of the 28th International Conference on Ma- SIAM symposium on Discrete algorithms, pages 385– chine Learning (ICML-11), pages 577–584, 2011b. 394. Society for Industrial and Applied Mathematics, 2005. David Kempe, Jon Kleinberg, and Éva Tardos. Maxi- mizing the spread of influence through a social net- Marguerite Frank and Philip Wolfe. An algorithm work. In Proceedings of the ninth ACM SIGKDD for quadratic programming. Naval research logistics international conference on Knowledge discovery and quarterly, 3(1-2):95–110, 1956. data mining, pages 137–146. ACM, 2003. Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic Alex Kulesza, Ben Taskar, et al. Determinantal point programming. SIAM Journal on Optimization, 23 processes for machine learning. Foundations and (4):2341–2368, 2013. Trends R in Machine Learning, 5(2–3):123–286, 2012. Daniel Golovin and Andreas Krause. Adaptive submod- Simon Lacoste-Julien and Martin Jaggi. On the global ularity: Theory and applications in active learning linear convergence of frank-wolfe optimization vari- and stochastic optimization. Journal of Artificial ants. In Advances in Neural Information Processing Intelligence Research, 42:427–486, 2011. Systems, pages 496–504, 2015. Andrew Guillory and Jeff Bilmes. Interactive submodu- Qi Lei, Lingfei Wu, Pin-Yu Chen, Alexandros Dimakis, lar set cover. arXiv preprint arXiv:1002.3345, 2010. Inderjit Dhillon, and Michael Witbrock. Discrete adversarial attacks and submodular optimization Hamed Hassani, Mahdi Soltanolkotabi, and Amin Kar- with applications to text classification. Systems and basi. Gradient methods for submodular maximiza- Machine Learning (SysML), 2019. tion. In Advances in Neural Information Processing Systems, pages 5841–5851, 2017. Hui Lin and Jeff Bilmes. A class of submodular Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and functions for document summarization. In Proceed- Zebang Shen. Stochastic continuous greedy++: ings of the 49th Annual Meeting of the Association When upper and lower bounds match. In Advances for Computational Linguistics: Human Language in Neural Information Processing Systems, pages Technologies-Volume 1, pages 510–520. Association 13066–13076, 2019. for Computational Linguistics, 2011a. Elad Hazan. Introduction to online convex optimization. Hui Lin and Jeff Bilmes. Word alignment via submodu- Foundations and Trends R in Optimization, 2(3-4): lar maximization over matroids. In Proceedings of the 157–325, 2016. 49th Annual Meeting of the Association for Compu- tational Linguistics: Human Language Technologies: Elad Hazan and Haipeng Luo. Variance-reduced and short papers-Volume 2, pages 170–175. 
Association projection-free stochastic optimization. In Inter- for Computational Linguistics, 2011b. national Conference on Machine Learning, pages 1263–1271, 2016. Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maxi- Andrew Ilyas, Logan Engstrom, Anish Athalye, Jessy mization: Identifying representative elements in mas- Lin, Anish Athalye, Logan Engstrom, Andrew Ilyas, sive data. In Advances in Neural Information Pro- and Kevin Kwok. Black-box adversarial attacks cessing Systems, pages 2049–2057, 2013. with limited queries and information. In Proceedings of the 35th International Conference on Machine Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Learning,{ICML} 2018, 2018. Conditional gradient method for stochastic submod- Rishabh Iyer, Stefanie Jegelka, and Jeff Bilmes. Mono- ular maximization: Closing the gap. In AISTATS, tone closure of relaxed constraints in submodular pages 1886–1895, 2018a. optimization: Connections between minimization Aryan Mokhtari, Hamed Hassani, and Amin Kar- and maximization. In Uncertainty in Artificial In- basi. Stochastic conditional gradient methods: From telligence (UAI), Quebic City, Quebec Canada, July convex minimization to submodular maximization. 2014. AUAI. arXiv preprint arXiv:1804.09554, 2018b.
Lin Chen1† , Mingrui Zhang1‡ , Hamed Hassani] , Amin Karbasi† George L Nemhauser, Laurence A Wolsey, and Mar- Chris Thornton, Frank Hutter, Holger H Hoos, and shall L Fisher. An analysis of approximations for Kevin Leyton-Brown. Auto-weka: Combined selec- maximizing submodular set functions—i. Mathemat- tion and hyperparameter optimization of classifica- ical programming, 14(1):265–294, 1978. tion algorithms. In Proceedings of the 19th ACM Luis Miguel Rios and Nikolaos V Sahinidis. Derivative- SIGKDD international conference on Knowledge dis- free optimization: a review of algorithms and compar- covery and data mining, pages 847–855. ACM, 2013. ison of software implementations. Journal of Global Athanasios Tsanas, Max A Little, Patrick E McSharry, Optimization, 56(3):1247–1293, 2013. and Lorraine O Ramig. Enhanced classical dysphonia Manuel Gomez Rodriguez and Bernhard Schölkopf. measures and sparse regression for telemonitoring of Influence maximization in continuous time diffusion parkinson’s disease progression. In Acoustics Speech networks. arXiv preprint arXiv:1205.1682, 2012. and Signal Processing (ICASSP), 2010 IEEE Inter- national Conference on, pages 594–597. IEEE, 2010. Anit Kumar Sahu, Manzil Zaheer, and Soummya Kar. Towards gradient free and projection free stochastic Sebastian Tschiatschek, Rishabh K Iyer, Haochen Wei, optimization. arXiv preprint arXiv:1810.03233, 2018. and Jeff A Bilmes. Learning mixtures of submodular functions for image collection summarization. In Mehraveh Salehi, Amin Karbasi, Dustin Scheinost, and Advances in neural information processing systems, R Todd Constable. A submodular approach to cre- pages 1413–1421, 2014. ate individualized parcellations of the human brain. In International Conference on Medical Image Com- Jan Vondrák. Optimal approximation for the submod- puting and Computer-Assisted Intervention, pages ular welfare problem in the value oracle model. In 478–485. Springer, 2017. Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 67–74. ACM, 2008. Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human Martin J Wainwright, Michael I Jordan, et al. Graph- out of the loop: A review of bayesian optimization. ical models, exponential families, and variational Proceedings of the IEEE, 104(1):148–175, 2016. inference. Foundations and Trends R in Machine Learning, 1(1–2):1–305, 2008. Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feed- Yining Wang, Simon Du, Sivaraman Balakrishnan, and back. Journal of Machine Learning Research, 18(52): Aarti Singh. Stochastic zeroth-order optimization in 1–11, 2017. high dimensions. arXiv preprint arXiv:1710.10551, 2017. Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. Near-optimally teach- Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity ing the crowd to classify. In ICML, pages 154–162, in data subset selection and active learning. In In- 2014. ternational Conference on Machine Learning, pages 1954–1963, 2015. Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learn- Yisong Yue and Carlos Guestrin. Linear submodular ing algorithms. In Advances in neural information bandits and their application to diversified retrieval. processing systems, pages 2951–2959, 2012. In Advances in Neural Information Processing Sys- tems, pages 2483–2491, 2011. Artem Sokolov, Julia Kreutzer, Stefan Riezler, and Christopher Lo. 
Stochastic structured prediction Wayne W Zachary. An information flow model for under bandit feedback. In Advances in Neural Infor- conflict and fission in small groups. Journal of an- mation Processing Systems, pages 1489–1497, 2016. thropological research, 33(4):452–473, 1977. Matthew Staib and Stefanie Jegelka. Robust budget al- Mingrui Zhang, Lin Chen, Hamed Hassani, and Amin location via continuous submodular functions. arXiv Karbasi. Online continuous submodular maximiza- preprint arXiv:1702.08791, 2017. tion: From full-information to bandit feedback. In Bastian Steudel, Dominik Janzing, and Bernhard Advances in Neural Information Processing Systems, Schölkopf. Causal markov condition for sub- pages 9206–9217, 2019a. modular information measures. arXiv preprint Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed arXiv:1002.4020, 2010. Hassani, and Amin Karbasi. One sample stochastic Ben Taskar, Vassil Chatalbashev, Daphne Koller, and frank-wolfe. arXiv preprint arXiv:1910.04322, 2019b. Carlos Guestrin. Learning structured prediction mod- Yuanxing Zhang, Yichong Bai, Lin Chen, Kaigui els: A large margin approach. In Proceedings of the Bian, and Xiaoming Li. Influence maximization in 22nd international conference on Machine learning, messenger-based social networks. In GLOBECOM, pages 896–903. ACM, 2005. pages 1–6. IEEE, 2016.
Black Box Submodular Maximization: Discrete and Continuous Settings Yuxun Zhou and Costas J Spanos. Causal meets sub- modular: Subset selection with directed information. In Advances in Neural Information Processing Sys- tems, pages 2649–2657, 2016.