Ranked Prioritization of Groups in Combinatorial Bandit Allocation
Lily Xu¹, Arpita Biswas¹, Fei Fang², Milind Tambe¹
¹Center for Research on Computation and Society, Harvard University
²Institute for Software Research, Carnegie Mellon University
lily_xu@g.harvard.edu, arpitabiswas@seas.harvard.edu, feif@cs.cmu.edu, milind_tambe@harvard.edu

arXiv:2205.05659v1 [cs.AI] 11 May 2022

Abstract
Preventing poaching through ranger patrols protects endangered wildlife, directly contributing to the UN Sustainable Development Goal 15 of life on land. Combinatorial bandits have been used to allocate limited patrol resources, but existing approaches overlook the fact that each location is home to multiple species in varying proportions, so a patrol benefits each species to differing degrees. When some species are more vulnerable, we ought to offer more protection to these animals; unfortunately, existing combinatorial bandit approaches do not offer a way to prioritize important species. To bridge this gap, (1) we propose a novel combinatorial bandit objective that trades off reward maximization against a prioritization over species, which we call ranked prioritization. We show this objective can be expressed as a weighted linear sum of Lipschitz-continuous reward functions. (2) We provide RankedCUCB,¹ an algorithm to select combinatorial actions that optimize our prioritization-based objective, and prove that it achieves asymptotic no-regret. (3) We demonstrate empirically that RankedCUCB leads to up to 38% improvement in outcomes for endangered species using real-world wildlife conservation data. Along with adapting to other challenges such as preventing illegal logging and overfishing, our no-regret algorithm addresses the general combinatorial bandit problem with a weighted linear objective.

¹Code is available at https://github.com/lily-x/rankedCUCB

Figure 1: Animal density distributions and poaching risk (expected number of snares) across a 10 × 10 km region in Srepok Wildlife Sanctuary, Cambodia. The distribution of the species of greatest conservation concern, the elephant, differs from that of other animals and from overall poaching risk.

1 Introduction
More than a quarter of mammals assessed by the IUCN Red List are threatened with extinction [Gilbert, 2008]. As part of the UN Sustainable Development Goals, Target 15.7 focuses on ending poaching, and Target 15.5 directs us to halt the loss of biodiversity and prevent the extinction of threatened species. To efficiently allocate limited resources, multi-armed bandits (MABs), and in particular combinatorial bandits [Chen et al., 2016; Cesa-Bianchi and Lugosi, 2012], have been widely used for a variety of tasks [Bastani and Bayati, 2020; Segal et al., 2018; Misra et al., 2019], including ranger patrols to prevent poaching [Xu et al., 2021a]. In this poaching prevention setting, the patrol planner is tasked with repeatedly and efficiently allocating a limited number of patrol resources across different locations within the park [Plumptre et al., 2014; Fang et al., 2016; Xu et al., 2021b].

In past work, we worked with the World Wide Fund for Nature (WWF) to deploy machine learning approaches for patrol planning in Srepok Wildlife Sanctuary in Cambodia [Xu et al., 2020]. Subsequent conversations with park managers and conservation biologists raised the importance of focusing on conservation of vulnerable species during patrols, revealing a key oversight in our past work. In this paper, we address these shortcomings to better align our algorithmic objectives with on-the-ground conservation priorities.

Looking at the real-world animal distributions and snare locations in Srepok visualized in Figure 1, we observe that the locations that maximize expected reward, defined as finding the most snares (darkest red in the risk map), are not necessarily the regions with high density of endangered animals (elephants).
To effectively improve conservation outcomes, it is essential to account for these disparate impacts, as the relatively abundant muntjac and wild pig would benefit most if we simply patrol the regions with the most snares, neglecting the endangered elephant. Prioritization of species is well-grounded in existing conservation best practices [Regan et al., 2008; Arponen, 2012; Dilkina et al., 2017]. The IUCN Red List of Threatened Species, which classifies species into nine groups from critically endangered to least concern, is regarded as an important tool to focus conservation efforts on species with the greatest risk of extinction [Rodrigues et al., 2006]. We term the goal of preferentially allocating resources to maximize benefit to priority groups ranked prioritization.
Some existing multi-armed bandit approaches have considered priority-based fairness [Joseph et al., 2016; Kearns et al., 2017; Schumann et al., 2022]. However, unlike ours, these prior works consider only stochastic, not combinatorial, bandits. In our combinatorial bandit setup, we must determine the number of hours of ranger patrol to allocate across N locations at each timestep, subject to a budget constraint. The rewards obtained from these actions are unknown a priori, so they must be learned in an online manner. Existing combinatorial bandit approaches [Chen et al., 2016; Xu et al., 2021a] achieve no-regret guarantees with respect to the objective of maximizing the direct sum of rewards obtained from all the arms. However, we wish to directly consider ranked prioritization of groups in our objective; learning combinatorial rewards in an online fashion while ensuring we justly prioritize vulnerable groups as well as reward makes the problem more challenging. How to trade off reward and prioritization in the combinatorial bandit setting has so far remained an open question. We show experimentally that straightforward solutions fail to make the appropriate trade-off.

To improve wildlife conservation outcomes by addressing the need for species prioritization, we contribute to the literature on online learning for resource allocation across groups with the following:

1. We introduce a novel problem formulation that considers resource allocation across groups with ranked prioritization, which has significant implications for wildlife conservation as well as for illegal logging and overfishing. We show that an objective that considers both prioritization and reward can be expressed as a weighted linear sum, providing the useful result that the optimal action in hindsight can be found efficiently via a linear program.

2. We provide a no-regret learning algorithm, RankedCUCB, for the novel problem of selecting a combinatorial action (the amount of patrol effort to allocate to each location) at each timestep to simultaneously attain high reward and group prioritization. We prove that RankedCUCB achieves sub-linear regret of O(J ln T / (NT) + NJ/T) for a setting with N locations, O(J^N) combinatorial actions, and time horizon T.

3. Using real species density and poaching data from Srepok Wildlife Sanctuary in Cambodia along with synthetic data, we experimentally demonstrate that RankedCUCB achieves up to 38% improvement over existing approaches in ranked prioritization for endangered wildlife.
2 Problem formulation
Consider a protected area with N locations and G groups of interest, with each group representing a species or a set of species in the context of poaching prevention. Let d_gi denote the (known) fraction of animals of a group g ∈ [G] present in location i ∈ [N]. Note that \sum_{i=1}^N d_gi = 1 for all g ∈ [G]. We assume the groups are disjoint, i.e., each animal is a member of exactly one group, and that the impact of an action on a location equally benefits all animals present there.

At each timestep t until the horizon T, the planner determines an action, which is an effort vector β⃗^(t) = (β_1, ..., β_N) specifying the amount of effort (e.g., number of patrol hours) to allocate to each location. The total effort at each timestep is constrained by a budget B such that \sum_{i=1}^N β_i ≤ B. To enable implementation in practice, we assume the effort is discretized into J levels, thus β_i ∈ Ψ = {ψ_1, ..., ψ_J} for all i.

After taking action β⃗, the decision maker receives some reward µ(β⃗). We assume the reward is decomposable [Chen et al., 2016], defined as the sum of the reward at each location:

    expected reward = µ(β⃗) = \sum_{i=1}^N µ_i(β_i) .    (1)

We assume µ_i ∈ [0, 1] for all i. In the poaching context, the reward µ_i corresponds to the true probability of detecting snares at location i. The reward function is initially unknown, leading to a learning problem. Following the model of combinatorial bandit allocation from Xu et al. [2021a], we assume that the reward function µ_i(·) is (1) Lipschitz continuous, i.e., |µ_i(ψ_j) − µ_i(ψ_k)| ≤ L · |ψ_j − ψ_k| for some Lipschitz constant L and all pairs ψ_j, ψ_k, and (2) monotonically non-decreasing, such that exerting greater effort never results in a decrease in reward, i.e., β_i′ > β_i implies µ_i(β_i′) ≥ µ_i(β_i).

Finally, we assume that an ordinal ranking is known to indicate which group is more vulnerable. The ranking informs the planner which groups are of relatively greater concern, as we often lack sufficient data to quantify the extent to which one group should be prioritized. Without loss of generality, assume the groups are numerically ordered with g = 1 being the most vulnerable group and g = G being the least, so the true rank is rank = ⟨1, ..., G⟩.

2.1 Measuring ranked prioritization
To evaluate prioritization, we must measure how closely the outcome of a combinatorial action β⃗ aligns with the priority ranking over groups. Later, in Section 6, we discuss how our approach can generalize to other prioritization definitions, such as the case where we have greater specificity over the relative prioritization of each group.

To formalize prioritization, we define benefit(β⃗; g), which quantifies the benefit group g receives from action β⃗. As mentioned earlier, we assume that the reward µ_i(β_i) obtained by taking an action β_i impacts all individuals at location i. Let η_gi be the number of individuals from group g at location i and η_g be the total number of individuals in group g. We define

    benefit(β⃗; g) := \frac{\sum_{i=1}^N η_gi µ_i(β_i)}{η_g} = \sum_{i=1}^N d_gi µ_i(β_i) .    (2)

An action β⃗ perfectly follows ranked prioritization when

    benefit(β⃗; 1) ≥ benefit(β⃗; 2) ≥ ··· ≥ benefit(β⃗; G) .
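To make these definitions concrete, the following Python sketch computes benefit(β⃗; g) as in Eq. (2) and checks the perfect-prioritization condition on a small toy instance; the densities, the stand-in reward function, and the effort vector are illustrative assumptions rather than values from the paper.

    import numpy as np

    # Toy instance: G = 3 groups over N = 4 locations. Each row of d is a group's
    # (known) distribution over locations and sums to 1, matching Section 2.
    d = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.3, 0.3, 0.2],
                  [0.1, 0.2, 0.3, 0.4]])

    def mu(i, beta_i):
        # Stand-in reward function mu_i(beta_i) in [0, 1]; the true one is unknown
        # to the learner and is only used here to illustrate Eq. (2).
        return min(1.0, 0.5 * beta_i + 0.1 * i * beta_i)

    beta = np.array([2.0, 1.0, 0.0, 1.0])    # one candidate effort vector

    rewards = np.array([mu(i, b) for i, b in enumerate(beta)])
    benefit = d @ rewards                    # benefit(beta; g) for each group g
    perfectly_ranked = all(benefit[g] >= benefit[g + 1] for g in range(len(benefit) - 1))
    print(benefit, perfectly_ranked)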
However, when β⃗ does not follow perfect rank prioritization, we need a metric to measure the extent to which groups are appropriately prioritized. This metric should quantify the similarity between the ranking induced by the benefits of β⃗ and the true rank. One common approach to evaluate the similarity between rankings is the Kendall tau coefficient:

    \frac{(\text{# concordant pairs}) − (\text{# discordant pairs})}{\binom{G}{2}} .    (3)

The number of concordant pairs is the number of pairwise rankings that agree, across all \binom{G}{2} pairwise rankings; the discordant pairs are those that disagree. This metric yields a value in [−1, 1] which signals the amount of agreement, where +1 is perfect agreement and −1 is complete disagreement (i.e., reverse ordering). However, a critical weakness of Kendall tau is that it is discontinuous, abruptly jumping in value when the effort β_i on an arm is perturbed, rendering optimization difficult.

Fortunately, we have at our disposal not just each pairwise ranking, but also the magnitude by which each pair is in concordance or discordance. Leveraging this information, we construct a convex function P(β⃗) that quantifies group prioritization by averaging the benefit gaps across all ranked pairs g < h:

    P(β⃗) := \frac{\sum_{g=1}^{G−1} \sum_{h=g+1}^{G} \left( benefit(β⃗; g) − benefit(β⃗; h) \right)}{\binom{G}{2}} .    (5)

2.2 Objective with prioritization and reward
We wish to take actions that maximize our expected reward (Eq. 1) while also distributing our effort across the various groups as effectively as possible (Eq. 5). Recognizing that targeting group prioritization requires us to sacrifice reward, we set up the objective to balance reward and prioritization with a parameter λ ∈ (0, 1] that enables us to tune the degree to which we emphasize reward vs. prioritization:

    obj(β⃗) = λ µ(β⃗) + (1 − λ) · P(β⃗) .    (6)

3 Approach
We show that the objective (Eq. 6) can be reformulated as a weighted linear combination of the reward of each individual arm, which we then solve using our RankedCUCB algorithm, producing a general-form solution for combinatorial bandits.

Using the prioritization metric from Equation (5), we can re-express the objective (6) as

    obj(β⃗) = \sum_{i=1}^N µ_i(β_i) \left( λ + (1 − λ) \frac{\sum_{g=1}^{G−1} \sum_{h=g+1}^{G} (d_gi − d_hi)}{\binom{G}{2}} \right) .    (7)

Observe that the large parenthetical is comprised only of known constants: the tuning parameter λ and the densities d_gi. Denoting it by Γ_i, the objective becomes a weighted linear sum of the per-location rewards,

    obj(β⃗) = \sum_{i=1}^N Γ_i µ_i(β_i),  where  Γ_i := λ + (1 − λ) \frac{\sum_{g=1}^{G−1} \sum_{h=g+1}^{G} (d_gi − d_hi)}{\binom{G}{2}} ,    (8)

so each weight Γ_i can be precomputed before learning begins.
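The weights in Eq. (8) depend only on quantities known up front, so they can be computed once, as in line 1 of Algorithm 1. The sketch below shows one way to do so; the variable names and the toy densities are our own illustrative choices.

    import numpy as np
    from math import comb

    def gamma_weights(d, lam):
        # d[g, i]: known fraction of group g in location i; lam: tuning parameter lambda.
        G, N = d.shape
        pairwise = np.zeros(N)
        for g in range(G - 1):
            for h in range(g + 1, G):          # sum of (d_gi - d_hi) over ranked pairs g < h
                pairwise += d[g] - d[h]
        return lam + (1 - lam) * pairwise / comb(G, 2)    # Gamma_i from Eq. (8)

    d = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.3, 0.3, 0.2],
                  [0.1, 0.2, 0.3, 0.4]])
    print(gamma_weights(d, lam=0.8))           # obj(beta) = sum_i Gamma_i * mu_i(beta_i)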
To learn the unknown rewards, we track for each arm (i, ψ_j) the empirical mean reward µ̂_t(i, ψ_j), computed as the average of the accumulated reward reward_t(i, ψ_j) over the n_t(i, ψ_j) arm pulls by timestep t. The confidence radius r_t of an arm (i, j) is then a function of the number of times we have pulled that arm:

    r_t(i, j) = \sqrt{\frac{3 Γ_i^2 \log T}{2 n_t(i, ψ_j)}} .    (9)

We distinguish between UCB and a term we call SELF_UCB. The SELF_UCB of an arm (i, j), representing location i with effort ψ_j at time t, is the UCB of an arm based only on its own observations, given by

    SELF_UCB_t(i, j) = Γ_i µ̂_t(i, j) + r_t(i, j) .    (10)

This definition of SELF_UCB corresponds with the standard interpretation of confidence bounds from the standard UCB1 algorithm [Auer et al., 2002]. The UCB of an arm is then computed by taking the minimum of the bounds of all SELF_UCBs as applied to the arm. These bounds are determined by adding the distance between arm (i, j) and the other arms (i, k) to the SELF_UCB:

    UCB_t(i, j) = \min_{k ∈ [J]} \{ SELF_UCB_t(i, k) + L · dist \} ,    (11)
    dist = \max\{0, ψ_j − ψ_k\} ,    (12)

which leverages the assumptions described in Section 2 that the expected rewards are L-Lipschitz continuous and monotonically nondecreasing.

Given these UCB estimates for each arm (i, j), we must now select a combinatorial action β⃗ to take at each timestep. As the prioritization metric can be expressed as a linear combination of the reward (8), we can directly optimize the overall objective using an integer linear program LP (in Appendix A), which selects an optimal action that respects the budget constraint.

Algorithm 1 RankedCUCB
Input: time horizon T, budget B, discretization levels Ψ = {ψ_1, ..., ψ_J}, arms i ∈ [N] with unknown reward µ_i
Parameters: tuning parameter λ
 1: Precompute Γ_i for each arm i ∈ [N]
 2: n(i, ψ_j) = 0, reward(i, ψ_j) = 0  ∀ i ∈ [N], j ∈ [J]
 3: for timestep t = 1, 2, ..., T do
 4:   Let ε_t = t^(−1/3)
 5:   Compute UCB_t for all arms using Eq. (11)
 6:   Solve LP(UCB_t, {Γ_i}, B) to select super arm β⃗
 7:   // Execute action
 8:   Act on β⃗ to observe rewards X_1^(t), X_2^(t), ..., X_N^(t)
 9:   for arm i = 1, ..., N do
10:     reward(i, β_i) = reward(i, β_i) + X_i^(t)
11:     n(i, β_i) = n(i, β_i) + 1
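The per-arm UCB computation in line 5 of Algorithm 1 can be vectorized directly from Eqs. (9)-(12), as in the sketch below. The array names and toy pull counts are our own illustrative assumptions; the budgeted selection in line 6 is the integer LP of Appendix A, not anything shown here.

    import numpy as np

    def ucb_scores(reward_sum, n_pulls, gamma, psi, L, horizon_T):
        # UCB_t(i, j) for every location i and effort level j (Eqs. 9-12).
        N, J = reward_sum.shape
        mu_hat = np.where(n_pulls > 0, reward_sum / np.maximum(n_pulls, 1), 0.0)
        radius = np.where(n_pulls > 0,
                          np.sqrt(3 * gamma[:, None] ** 2 * np.log(horizon_T)
                                  / (2 * np.maximum(n_pulls, 1))),
                          np.inf)                                 # Eq. (9); unpulled arms stay optimistic
        self_ucb = gamma[:, None] * mu_hat + radius               # Eq. (10)
        ucb = np.empty((N, J))
        for j in range(J):
            dist = np.maximum(0.0, psi[j] - psi)                  # Eq. (12), over all levels k
            ucb[:, j] = np.min(self_ucb + L * dist[None, :], axis=1)   # Eq. (11)
        return ucb

    psi = np.array([0.0, 1.0, 2.0])
    gamma = np.array([0.9, 0.6])
    reward_sum = np.array([[0.0, 1.4, 0.0],
                           [0.0, 0.0, 2.6]])
    n_pulls = np.array([[4, 2, 0],
                        [3, 0, 4]])
    print(ucb_scores(reward_sum, n_pulls, gamma, psi, L=0.4, horizon_T=500))

Note how an arm that has never been pulled (e.g., location 0 at the highest effort level) still receives a finite UCB through the Lipschitz tightening, borrowing information from neighboring effort levels.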
3.2 Trading off prioritization and learning
A key challenge with our prioritization metric is that it is defined with respect to the true expected reward functions µ_i, which are initially unknown. Instead, we estimate per-round prioritization based on our current belief of the true reward, µ̂_i, related to subjective fairness from the algorithmic fairness literature [Dimitrakakis et al., 2017]. However, this belief is nonsensical in early rounds when our reward estimates are extremely coarse, so we discount the weight of prioritization in our objective until our learning improves.

Inspired by decayed epsilon-greedy, we incorporate an ε coefficient to tune how much we prioritize rank order at each step (versus learning to maximize reward):

    obj(β⃗) = λ µ(β⃗) + (1 − λ)(1 − ε) · P(β⃗) .    (13)

For example, epsilon-greedy methods often use exploration probability ε_t ∼ t^(−1/3), which attenuates at a decreasing rate with each timestep. This definition gives the nice properties that at t = 1 we have ε_t = 1, so we do not consider ranked prioritization at all (we have no estimate of the reward values, so we cannot reasonably estimate the ranking), while as t → ∞ we have ε_t → 0, so we fully weigh the ranking according to λ (since we have full knowledge of the reward and can thus precisely estimate the rank order).

All together, we call the approach described here RankedCUCB and provide pseudocode in Algorithm 1.

4 Regret analysis
We now prove that our iterative algorithm RankedCUCB (Alg. 1) guarantees no regret with respect to the optimal solution for the objective (8) that jointly considers reward and prioritization. More formally, we show that Regret_T → 0 as T → ∞, where

    Regret_T := µ(β⃗^⋆) − \frac{1}{T} \sum_{t=1}^T µ(β⃗^(t)) .    (14)

Here, β⃗^⋆ is an optimal action and the expected reward is µ(β⃗) := \sum_{i=1}^N µ_i(β_i) Γ_i for an effort vector β⃗ = {β_1, ..., β_N}. Note that if Γ_i < 0, any solution to the maximization problem (8) would allocate β_i = 0, and so would RankedCUCB. Hence, for the analysis we assume that we consider only those locations whose Γ_i > 0.

4.1 Convergence of estimates
To prove the no-regret guarantee of RankedCUCB, we first establish Lemma 1, which states that, with high probability, each weighted estimate µ̂_t(i, j) Γ_i lies within a confidence radius r_t(i, j) = \sqrt{3 Γ_i^2 \ln t / (2 n_t(i, j))} of its true value.

Lemma 1. Using RankedCUCB, after t timesteps, each µ̂_t(i, j) Γ_i estimate converges to a value within a radius r_t(i, j) = \sqrt{3 Γ_i^2 \ln t / (2 n_t(i, j))} of the corresponding true µ_t(i, j) Γ_i value with probability 1 − \frac{2NJ}{t^2}, for all i, j.

Proof sketch. Using the Chernoff–Hoeffding bound, we show that, at any timestep t, the probability that the difference between µ_t(i, j) Γ_i and µ̂_t(i, j) Γ_i is greater than r_t(i, j) is at most 2/t^3. We then use a union bound to show that µ̂_t(i, j) Γ_i converges to a value within radius r_t(i, j) of µ_t(i, j) Γ_i with probability 1 − \frac{2NJ}{t^2}. The complete proof is given in Appendix B.2.
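As a quick numerical illustration of Lemma 1 (not part of the paper's analysis), the sketch below simulates Bernoulli rewards for a single arm and checks how often the weighted empirical mean falls outside the radius; the reward probability, weight Γ_i, and sample sizes are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    mu_true, gamma_i = 0.6, 0.8          # arbitrary true detection probability and weight
    n_t, t = 200, 500                    # pulls of this arm and current timestep
    trials, violations = 2000, 0
    for _ in range(trials):
        mu_hat = rng.binomial(1, mu_true, size=n_t).mean()
        r_t = np.sqrt(3 * gamma_i**2 * np.log(t) / (2 * n_t))    # radius from Lemma 1
        if abs(gamma_i * mu_hat - gamma_i * mu_true) > r_t:
            violations += 1
    print(f"violation rate: {violations / trials:.4f} (Chernoff bound allows at most 2/t^3)")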
4.2 Achieving no regret
Theorem 1. The cumulative regret of RankedCUCB is O\left(\frac{J \ln T}{N} + NJ\right) with N arms, J discrete effort values, and time horizon T.

Proof sketch. We first show that the expected regret (Eq. 14) of RankedCUCB can be rewritten as

    \frac{1}{T} \sum_{i=1}^N \sum_{j=1}^J E[L_{i,j,T}] \, ζ_{i,j} ,

where L_{i,j,T} specifies the number of times that selecting effort ψ_j for location i results in a suboptimal solution, and ζ_{i,j} denotes the minimum loss incurred due to that suboptimal selection. Then, by contradiction, we show that when all µ̂_t(i, j) Γ_i estimates are within their confidence radius, RankedCUCB selects an optimal effort ψ_j (and not a suboptimal one) at time t. Using this fact and Lemma 1, we show that E[L_{i,j,T}] can be upper bounded by a finite term, and hence the expected regret is O\left(\frac{J \ln T}{N} + NJ\right). The complete proof is given in Appendix B.3.

Our analysis improves upon the regret analysis for standard (non-weighted) combinatorial bandits [Chen et al., 2016] by not having to rely on the "bounded smoothness" assumption on reward functions. The technical tools we employ for this regret analysis may be of independent interest to the general multi-armed bandits community.

[Figure 2: five panels per dataset (SREPOK and SYNTHETIC): the objective at λ = 0.3 and λ = 0.8, reward, and prioritization over timestep t, plus the reward–prioritization Pareto frontier; legend: Optimal, RankedCUCB, LIZARD, Random, NaiveRank.]
Figure 2: The performance of each approach. LEFT evaluates the objective with tuning parameter λ = 0.3 and λ = 0.8. Our approach, RankedCUCB, performs significantly better than baselines. CENTER evaluates reward and prioritization (at λ = 0.8), the two components of the combined objective. The reward-maximizing LIZARD algorithm rapidly maximizes reward but performs worse than random in terms of rank order. RIGHT visualizes the Pareto frontier trading off the two components of our objective. Labels represent different values of λ ∈ {0.1, 0.2, ..., 1.0}. Each point plots the reward and ranked prioritization as the average of the final 10 timesteps. All results are averaged over 30 random seeds and smoothed with an exponential moving average.

5 Experiments
We evaluate the performance of our RankedCUCB approach on real-world conservation data from Srepok Wildlife Sanctuary to demonstrate significant improvement over approaches that ignore prioritization.

Of the key species in Srepok for which we have species density data, we prioritize elephants, followed by banteng, muntjac, and wild pig, corresponding to their IUCN Red List categories of critically endangered, endangered, least concern but decreasing, and least concern, respectively. Thus our setting has G = 4 groups distributed across N = 25 locations within the park. We further evaluate our algorithm on a synthetic dataset with G = 5 groups randomly distributed across the park.

Baselines. We compare the performance of our RankedCUCB approach with naive priority-aware, priority-blind, random, and optimal benchmarks. NaiveRank takes the straightforward approach of directly solving for the objective that weighs each target by its individual ranked prioritarian metric, which accounts for the prioritization induced by each target independently but ignores the coupled effect across targets. LIZARD [Xu et al., 2021a] maximizes the combinatorial reward objective; this algorithm enhances standard UCB approaches by accounting for smoothness and monotonicity in the reward function, but ignores prioritization. Optimal solves for the action β⃗^⋆ = argmax_{β⃗} obj(β⃗) that maximizes the objective, using the ground-truth values of µ_i to directly compute the best action. Random selects an arbitrary subset of actions that satisfies the budget constraint.
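For reference, a synthetic setting like the one described above can be instantiated along the following lines; the Dirichlet draw, effort levels, and budget are our own assumptions and not the exact generator used in the paper.

    import numpy as np

    rng = np.random.default_rng(42)
    N, G, J = 25, 5, 4                           # locations, groups, effort levels
    d = rng.dirichlet(np.ones(N), size=G)        # d[g] is group g's distribution over locations
    psi = np.linspace(0.0, 3.0, J)               # discretized effort levels Psi
    B = 10.0                                     # patrol budget per timestep
    assert np.allclose(d.sum(axis=1), 1.0)       # each group's densities sum to 1 (Section 2)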
See Figure 2 for a comparison of the approaches. RankedCUCB performs consistently the best on our overall objective, and the breakdown of the reward and prioritization components reveals that this gain is a result of the tradeoff between the two components: although LIZARD is able to learn high-reward actions, these actions lead to prioritization outcomes that are worse than random. LIZARD even achieves reward that exceeds that of the optimal action (which also considers rank priority), but that comes at the cost of poor prioritization. Notably, NaiveRank performs worse than random, even measured on fairness: focusing on the individual targets with the best prioritization neglects group-wide patterns throughout the park.

The reward–prioritization tradeoff becomes more apparent when we analyze the Pareto frontier in Figure 2 (right), which plots the reward and prioritization of each approach as we change the value of λ. The price of prioritization is clear: the more we enforce prioritization (smaller values of λ), the lower our possible reward. However, this tradeoff is not one-for-one; the steepness of the Pareto curve allows us to strategically select the value of λ. For example, we achieve a nearly two-fold increase in prioritization by going from λ = 0.9 to 0.7 while sacrificing less than 10% of the reward, indicating a worthwhile tradeoff.
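A Pareto frontier like the one in Figure 2 (right) can be traced by sweeping λ and re-solving the objective. The brute-force sketch below does this on a tiny toy instance (four locations, three effort levels); the instance, the stand-in reward function, and the exhaustive search are illustrative assumptions, not the paper's experimental pipeline.

    import numpy as np
    from itertools import product
    from math import comb

    d = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.3, 0.3, 0.2],
                  [0.1, 0.2, 0.3, 0.4]])
    G, N = d.shape
    psi, B = np.array([0.0, 1.0, 2.0]), 3.0
    mu = lambda i, b: min(1.0, (0.3 + 0.1 * i) * b)              # stand-in true rewards

    # Per-location pairwise density gaps, as in the parenthetical of Eq. (7).
    pairwise = sum(d[g] - d[h] for g in range(G) for h in range(g + 1, G)) / comb(G, 2)

    for lam in np.arange(0.1, 1.01, 0.1):
        best, best_obj = None, -np.inf
        for levels in product(range(len(psi)), repeat=N):        # enumerate all effort vectors
            beta = psi[list(levels)]
            if beta.sum() > B:
                continue
            rewards = np.array([mu(i, b) for i, b in enumerate(beta)])
            reward, prior = rewards.sum(), float(pairwise @ rewards)
            obj = lam * reward + (1 - lam) * prior               # Eq. (6)
            if obj > best_obj:
                best_obj, best = obj, (reward, prior)
        print(f"lambda={lam:.1f}  reward={best[0]:.2f}  prioritization={best[1]:.2f}")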
6 Generalizability to other settings
Beyond wildlife conservation, our problem formulation applies to domains with the following characteristics: (1) repeated resource allocation, (2) multiple groups with some priority ordering, (3) an a priori unknown reward function that must be learned over time, and (4) actions that impact some subset of the groups to differing degrees. Our approach also adapts to the following related objectives.

Weighted rank. Gatmiry et al. [2021] suggest specific metrics for how rangers should prioritize different species, noting that ranger enforcement against poaching of urial (a wild sheep) should be 2.5 times stricter than for red deer. Our approach can easily adapt to such settings, where domain experts have a more precise specification of desired outcomes. In this case, we introduce a parameter α_g for each group g that specifies the relative importance of group g, leading to the following prioritization metric:

    P_CF(β⃗) = \sum_{i=1}^N µ_i(β_i) \frac{\sum_{g=1}^{G−1} \sum_{h=g+1}^{G} (α_g d_gi − α_h d_hi)}{\binom{G}{2}} ,    (15)

where we set α_g > α_h when g < h to raise our threshold of how strongly group g should be favored. From our example, we would set α_urial = 2.5 α_deer. This metric aligns with the definition of calibrated fairness [Liu et al., 2017].

Weighted reward. We can also accommodate the setting where the reward is a weighted combination, i.e., if the reward is \sum_{i=1}^N c_i µ_i(β_i) for some coefficients c_i ∈ R, as the coefficients c_i would simply be absorbed into the Γ_i term of the objective.
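Both extensions fold into the same precomputed per-location weights; the sketch below shows one way to combine them, with α, c, λ, and the densities chosen purely for illustration rather than taken from the paper.

    import numpy as np
    from math import comb

    def generalized_gamma(d, lam, alpha, c):
        # alpha[g]: relative importance of group g (Eq. 15); c[i]: reward coefficient of location i.
        G, N = d.shape
        pairwise = np.zeros(N)
        for g in range(G - 1):
            for h in range(g + 1, G):
                pairwise += alpha[g] * d[g] - alpha[h] * d[h]
        return lam * c + (1 - lam) * pairwise / comb(G, 2)   # c_i absorbed into the reward term

    d = np.array([[0.6, 0.2, 0.2],
                  [0.3, 0.4, 0.3]])
    print(generalized_gamma(d, lam=0.8, alpha=np.array([2.5, 1.0]), c=np.ones(3)))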
7 Related work
Multi-armed bandits. MABs [Lattimore and Szepesvári, 2020] have been applied to resource allocation for healthcare [Bastani and Bayati, 2020], education [Segal et al., 2018], and dynamic pricing [Misra et al., 2019]. These papers solve various versions of the stochastic MAB problem [Auer et al., 2002].

Several prior works consider resource allocation settings where each arm pull is costly, limiting the total number of pulls by a budget. Tran-Thanh et al. [2010] use an ε-based approach to achieve regret linear in the budget B; Tran-Thanh et al. [2012] provide a knapsack-based policy using UCB to improve to O(ln B) regret. Chen et al. [2016] extend UCB1 to a combinatorial setting, matching the regret bound of O(log T). Slivkins [2013] addresses ad allocation where each stochastic arm is limited by a budget constraint on the number of pulls. Our objective (8) is similar to theirs, but they do not handle combinatorial actions, which makes our regret analysis even harder.

Prioritization notions in MABs. Addressing prioritization requires reward trade-offs, which is related to objectives from algorithmic fairness [Dwork et al., 2012; Kleinberg et al., 2018; Corbett-Davies et al., 2017]. However, most MAB literature favors traditional reward-maximizing approaches [Lattimore et al., 2015; Agrawal and Devanur, 2014; Verma et al., 2019]. The closest metric to our prioritization objective is meritocratic fairness [Kearns et al., 2017]. For MABs, meritocratic fairness was introduced by Joseph et al. [2016] to ensure that lower-reward arms are never favored above a high-reward one. Liu et al. [2017] apply calibrated fairness by enforcing a smoothness constraint on the selection of similar arms, using Thompson sampling to achieve an Õ(T^{2/3}) fairness regret bound. For contextual bandits, Chohlas-Wood et al. [2021] take a consequentialist (outcome-oriented) approach and evaluate possible outcomes on a Pareto frontier, which we also investigate. Wang et al. [2021] ensure that arms are proportionally represented according to their merit, a metric they call fairness of exposure. Others have considered fairness in multi-agent bandits [Hossain et al., 2021], restless bandits [Herlihy et al., 2021], sleeping bandits [Li et al., 2019], contextual bandits with biased feedback [Schumann et al., 2022], and infinite bandits [Joseph et al., 2018]. However, none of these papers consider correlated groups, which is our focus.

8 Conclusion
We address the challenge of allocating limited resources to prioritize endangered species within protected areas where true snare distributions are unknown, closing a key gap identified by our conservation partners at Srepok Wildlife Sanctuary. Our novel problem formulation introduces the metric of ranked prioritization to measure impact across disparate groups. Our RankedCUCB algorithm offers a principled way of balancing the tradeoff between reward and ranked prioritization in an online learning setting. Notably, our theoretical guarantee bounding the regret of RankedCUCB applies to a broad class of combinatorial bandit problems with a weighted linear objective.

Acknowledgments
This work was supported in part by NSF grant IIS-2046640 (CAREER) and the Army Research Office (MURI W911NF-18-1-0208). Biswas was supported by the Center for Research on Computation and Society (CRCS). Thank you to Andrew Plumptre for a helpful discussion on species prioritization in wildlife conservation; Jessie Finocchiaro for comments on an earlier draft; and all the rangers on the front lines of biodiversity efforts in Srepok and around the world.
References
Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In EC, 2014.
Anni Arponen. Prioritizing species for conservation planning. Biodiversity and Conservation, 21(4):875–893, 2012.
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. ML, 2002.
Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. OR, 68(1), 2020.
Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. J. Comput. Syst. Sci., 78(5), 2012.
Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. JMLR, 17(1), 2016.
Alex Chohlas-Wood, Madison Coots, Emma Brunskill, and Sharad Goel. Learning to be fair: A consequentialist approach to equitable decision-making. arXiv, 2021.
Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In KDD, 2017.
Bistra Dilkina, Rachel Houtman, Carla P. Gomes, et al. Trade-offs and efficiencies in optimal budget-constrained multi-species corridor networks. Conservation Biology, 2017.
Christos Dimitrakakis, Yang Liu, David Parkes, and Goran Radanovic. Subjective fairness: Fairness is in the eye of the beholder. Technical report, 2017.
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In ITCS, 2012.
Fei Fang, Thanh H. Nguyen, Rob Pickles, Wai Y. Lam, et al. Deploying PAWS: Field optimization of the Protection Assistant for Wildlife Security. In IAAI, 2016.
Zohreh Gatmiry et al. A security game approach for strategic conservation against poaching considering food web complexities. Ecological Complexity, 2021.
Natasha Gilbert. A quarter of mammals face extinction. Nature, 455(7214):717–718, 2008.
Christine Herlihy, Aviva Prins, Aravind Srinivasan, and John Dickerson. Planning to fairly allocate: Probabilistic fairness in the restless bandit setting. arXiv, 2021.
Safwan Hossain, Evi Micha, and Nisarg Shah. Fair algorithms for multi-agent multi-armed bandits. In NeurIPS, 2021.
Matthew Joseph, Michael Kearns, Jamie H. Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In NeurIPS, 2016.
Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. Meritocratic fairness for infinite and contextual bandits. In AIES, 2018.
Michael Kearns, Aaron Roth, and Zhiwei Steven Wu. Meritocratic fairness for cross-population selection. In ICML, 2017.
Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ashesh Rambachan. Algorithmic fairness. In AEA, 2018.
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In NeurIPS, 2015.
Fengjiao Li, Jia Liu, and Bo Ji. Combinatorial sleeping bandits with fairness constraints. IEEE Trans. Netw. Sci. Eng., 7(3), 2019.
Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, and David C. Parkes. Calibrated fairness in bandits. FAT-ML Workshop, 2017.
Kanishka Misra, Eric M. Schwartz, and Jacob Abernethy. Dynamic online pricing with incomplete information using multiarmed bandit experiments. Marketing Science, 2019.
Andrew J. Plumptre, Richard A. Fuller, Aggrey Rwetsiba, et al. Efficiently targeting resources to deter illegal activities in protected areas. J. of Appl. Ecology, 51(3), 2014.
Helen Regan, Lauren Hierl, et al. Species prioritization for monitoring and management in regional multiple species conservation plans. Diversity and Distributions, 2008.
Ana S. L. Rodrigues, John D. Pilgrim, John F. Lamoreux, Michael Hoffmann, et al. The value of the IUCN Red List for conservation. Trends Ecol. Evol., 21(2), 2006.
Candice Schumann, Zhi Lang, Nicholas Mattei, and John P. Dickerson. Group fairness in bandit arm selection. In AAMAS, 2022.
Avi Segal, Yossi Ben David, Joseph Jay Williams, Kobi Gal, et al. Combining difficulty ranking with multi-armed bandits to sequence educational content. In AIED, 2018.
Aleksandrs Slivkins. Dynamic ad allocation: Bandits with budgets. arXiv, 2013.
Long Tran-Thanh, Archie Chapman, Enrique Munoz De Cote, Alex Rogers, et al. Epsilon-first policies for budget-limited multi-armed bandits. In AAAI, 2010.
Long Tran-Thanh, Archie Chapman, Alex Rogers, and Nicholas Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In AAAI, 2012.
Arun Verma, Manjesh K. Hanawal, Arun Rajkumar, et al. Censored semi-bandits: A framework for resource allocation with censored feedback. In NeurIPS, 2019.
Lequn Wang, Yiwei Bai, Wen Sun, and Thorsten Joachims. Fairness of exposure in stochastic bandits. In ICML, 2021.
Lily Xu, Shahrzad Gholami, Sara Mc Carthy, Bistra Dilkina, Andrew Plumptre, et al. Stay ahead of poachers: Illegal wildlife poaching prediction and patrol planning under uncertainty with field test evaluations. In ICDE, 2020.
Lily Xu, Elizabeth Bondi, Fei Fang, Andrew Perrault, Kai Wang, and Milind Tambe. Dual-mandate patrols: Multi-armed bandits for green security. In AAAI, 2021.
Lily Xu, Andrew Perrault, Fei Fang, Haipeng Chen, and Milind Tambe. Robust reinforcement learning under minimax regret for green security. In UAI, 2021.
A Linear program
The following linear program LP takes the current UCB_t(i, j) estimates and selects a combinatorial action that maximizes our optimistic reward. Each z_{i,j} is an indicator variable specifying whether we choose effort level ψ_j for location i.

    \max_z \; \sum_{i=1}^N \sum_{j=1}^J z_{i,j} \, UCB_t(i, j)    (LP)
    s.t.  z_{i,j} ∈ \{0, 1\}   ∀ i ∈ [N], j ∈ [J]
          \sum_{j=1}^J z_{i,j} = 1   ∀ i ∈ [N]
          \sum_{i=1}^N \sum_{j=1}^J z_{i,j} ψ_j ≤ B .
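Because every location picks exactly one effort level and the levels are discretized, the selection above is a multiple-choice knapsack; the sketch below solves it with a small dynamic program, assuming the effort levels and budget are integer multiples of a common unit. This is an illustrative stand-in only; the paper solves the integer LP directly.

    import numpy as np

    def select_action(ucb, psi_units, budget_units):
        # ucb[i, j]: UCB of effort level j at location i; returns one level per location.
        N, J = ucb.shape
        NEG = -np.inf
        value = np.full(budget_units + 1, NEG)   # value[b]: best total UCB using exactly b units
        value[0] = 0.0
        choice = np.zeros((N, budget_units + 1), dtype=int)
        for i in range(N):
            new_value = np.full(budget_units + 1, NEG)
            for b in range(budget_units + 1):
                for j in range(J):
                    spent = b - psi_units[j]
                    if spent >= 0 and value[spent] > NEG:
                        cand = value[spent] + ucb[i, j]
                        if cand > new_value[b]:
                            new_value[b], choice[i, b] = cand, j
            value = new_value
        # Backtrack from the best attainable budget.
        b = int(np.argmax(value))
        levels = np.zeros(N, dtype=int)
        for i in reversed(range(N)):
            levels[i] = choice[i, b]
            b -= psi_units[levels[i]]
        return levels

    ucb = np.array([[0.0, 0.5, 0.9], [0.0, 0.7, 0.8]])   # toy UCBs for N = 2, J = 3
    print(select_action(ucb, psi_units=[0, 1, 2], budget_units=2))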
B Full Proofs

B.1 Notation
For ease of representation, we use ρ_{i,j} := µ_t(i, j) Γ_i to denote the expected reward of visiting location i with effort ψ_j, and ρ̂^{(t)}_{i,j} := µ̂_t(i, j) Γ_i as the average empirical reward (the index j denotes the one corresponding to the discretization level of β_i, i.e., β_i = ψ_j). We use ρ̂ to denote the empirical average reward and ρ̄ to denote the upper confidence bound. Note that ρ̄ = ρ̂ + r_t(i, j), and similarly for µ̂ and µ̄.

B.2 Convergence of estimates
Lemma 1. Using RankedCUCB, after t timesteps, each µ̂_t(i, j) Γ_i estimate converges to a value within a radius r_t(i, j) = \sqrt{3 Γ_i^2 \ln t / (2 n_t(i, j))} of the corresponding true µ_t(i, j) Γ_i value with probability 1 − \frac{2NJ}{t^2}, for all i, j.

Proof. In other words, we wish to show that each ρ̂^{(t)}_{i,j} estimate converges to a value within radius r_t(i, j) of the corresponding true ρ_{i,j} value. We provide a lower bound on the probability that the estimated values are within the bounded radius for all i ∈ [N] and all j ∈ [J]. At time t,

    P\left( |ρ_{i,j} − ρ̂^{(t)}_{i,j}| ≤ r_t(i, j), \; ∀ i ∈ [N], ∀ j ∈ [J] \right) = 1 − \sum_{i=1}^N \sum_{j=1}^J P\left( |ρ_{i,j} − ρ̂^{(t)}_{i,j}| > r_t(i, j) \right) ,    (16)

as the events are all independent. Assume that the number of samples used for computing the estimate ρ̂_{i,j} is n_t(i, j). Using the fact that ρ_{i,j} ∈ [0, Γ_i] and the Chernoff–Hoeffding bound, we find that

    P\left( |ρ_{i,j} − ρ̂^{(t)}_{i,j}| > r_t(i, j) \right) ≤ 2 \exp\left( −\frac{2 n_t(i, j) \, r_t(i, j)^2}{Γ_i^2} \right)
      = 2 \exp(−3 \ln t)   (obtained by substituting the value of r_t(i, j))
      = \frac{2}{t^3} .    (17)

By substituting Eq. (17) into Eq. (16), we obtain

    1 − \sum_{s=1}^t \sum_{i=1}^N \sum_{j=1}^J P\left( |ρ_{i,j} − ρ̂^{(s)}_{i,j}| > \sqrt{3 Γ_i^2 \ln t / (2s)} \right) ≥ 1 − \sum_{s=1}^t \sum_{i=1}^N \sum_{j=1}^J \frac{2}{t^3} = 1 − \frac{2NJ}{t^2} .    (18)

This completes the proof.

B.3 Regret bound
Theorem 1. The cumulative regret of RankedCUCB is O\left(\frac{J \ln T}{N} + NJ\right) with N arms, J discrete effort values, and time horizon T.

Proof. For a finite time horizon T, the average cumulative regret of an algorithm that takes action β⃗^(t) at time t is given by Eq. (14). We use B to denote the set of all sub-optimal actions:

    B := \{ β⃗ \mid µ(β⃗) < µ(β⃗^⋆) \} .

We now define regret as the expected loss incurred from choosing β⃗ from the set B. Thus, Eq. (14) can be written as

    \sum_{i=1}^N µ_i(β_i^⋆) Γ_i − \frac{1}{T} \sum_{t=1}^T E\left[ \sum_{i=1}^N µ_i(β_i^{(t)}) Γ_i \right] ≤ \frac{1}{T} \sum_{i=1}^N \sum_{j=1}^J E[L_{i,j,T}] \, ζ_{i,j} ,    (19)

where L_{i,j,T} denotes the number of times the effort for arm i is set as β_i = ψ_j while the corresponding β⃗^(t) ∈ B. That is, L_{i,j,T} specifies the number of times the pair (i, j) is chosen in a suboptimal way. Note that L_{i,j,T} ≤ n_T(i, j). Let ζ_{i,j} denote the minimum loss incurred due to a sub-optimal selection (of effort ψ_j) on arm i. In other words,

    ζ_{i,j} = µ(β⃗^⋆) − \max \{ µ(β⃗) \mid β⃗ ∈ B \text{ and } β_i = ψ_j \} .

Let ζ_min = \min_{i,j} ζ_{i,j} and τ_t = \frac{6 Γ_i^2 \ln t}{N^2 ζ_min^2}. Let us assume that, at time step t, all arms have been visited at least τ_t times, so n_t(i, j) ≥ τ_t for all i ∈ [N] and j ∈ [J].

Using contradiction, we show that the algorithm will not choose any sub-optimal vector β⃗ ∈ B at time t when the ρ̂^{(t)}_{i,j} estimates converge to a value within a radius r_t(i, j) = \sqrt{3 Γ_i^2 \ln t / (2 n_t(i, j))} of ρ_{i,j}:

    R_µ(β⃗) = \sum_{i=1}^N \left( ρ̂^{(t)}_{i,j} + \sqrt{\frac{3 Γ_i^2 \ln t}{2 n_t(i, j)}} \right)
      ≤ \sum_{i=1}^N \left( ρ^{(t)}_{i,j} + 2 \sqrt{\frac{3 Γ_i^2 \ln t}{2 n_t(i, j)}} \right)   (since estimates are within radius)
      ≤ \sum_{i=1}^N \left( ρ^{(t)}_{i,j} + \sqrt{\frac{6 Γ_i^2 \ln t}{τ_t}} \right)   (since n_t(i, j) ≥ τ_t)
      ≤ ζ_min + \sum_{i=1}^N ρ^{(t)}_{i,j} .   (substituting τ_t)    (20)

However, the assumption that β⃗ was chosen by the algorithm at time t implies that

    R_µ(β⃗) > R_µ(β⃗^⋆) ≥ µ̂(β⃗^⋆) + \sum_{i=1}^N \sqrt{\frac{3 Γ_i^2 \ln t}{2 n_t(i, j^⋆)}}   (j^⋆ is the effort level corresponding to β_i^⋆)
      ≥ µ(β⃗^⋆) .   (since estimates are within radius)    (21)

By combining Eqs. (20) and (21) we obtain

    ζ_min + \sum_{i=1}^N ρ^{(t)}_{i,j} ≥ µ(β⃗^⋆) ,
    i.e.,  ζ_min ≥ µ(β⃗^⋆) − µ(β⃗^{(t)}) .    (22)

Inequality (22) contradicts the definition of ζ_min. Thus, the algorithm selects an optimal effort (and not a sub-optimal one) at time t when all the ρ̂^{(t)}_{i,j} estimates are within the radius r_t(i, j) = \sqrt{3 Γ_i^2 \ln t / (2 n_t(i, j))} of the true ρ_{i,j}. Using this fact along with Lemma 1, we obtain an upper bound on E[L_{i,j,T}]:

    E[L_{i,j,T}] ≤ (τ_T + 1) N J + \sum_{t=1}^T \frac{2NJ}{t^2}
      ≤ (τ_T + 1) N J + \frac{π^2}{3} N J .    (23)

We use ζ_max = \max_{i,j} ζ_{i,j} to provide a bound on Eq. (19), which is O(N). This produces the following regret for RankedCUCB by time T:

    Regret_T ≤ \frac{1}{T} \left( τ_T + 1 + \frac{π^2}{3} \right) N J ζ_max
      = \left( \frac{6 Γ_i^2 \ln T}{N^2 ζ_min^2} + 1 + \frac{π^2}{3} \right) \frac{N J ζ_max}{T} .    (24)

This result shows that the worst-case regret increases linearly with the number of discrete effort levels. However, the regret grows only sub-linearly with time T. This proves that RankedCUCB asymptotically converges as a no-regret algorithm; that is, as T → ∞, the regret of RankedCUCB tends to 0.