Ranked Prioritization of Groups in Combinatorial Bandit Allocation


Lily Xu1, Arpita Biswas1, Fei Fang2 and Milind Tambe1
1 Center for Research on Computation and Society, Harvard University
2 Institute for Software Research, Carnegie Mellon University
lily_xu@g.harvard.edu, arpitabiswas@seas.harvard.edu, feif@cs.cmu.edu, milind_tambe@harvard.edu

arXiv:2205.05659v1 [cs.AI] 11 May 2022

Abstract

Preventing poaching through ranger patrols protects endangered wildlife, directly contributing to the UN Sustainable Development Goal 15 of life on land. Combinatorial bandits have been used to allocate limited patrol resources, but existing approaches overlook the fact that each location is home to multiple species in varying proportions, so a patrol benefits each species to differing degrees. When some species are more vulnerable, we ought to offer more protection to these animals; unfortunately, existing combinatorial bandit approaches do not offer a way to prioritize important species. To bridge this gap, (1) we propose a novel combinatorial bandit objective that trades off between reward maximization and prioritization over species, which we call ranked prioritization. We show this objective can be expressed as a weighted linear sum of Lipschitz-continuous reward functions. (2) We provide RankedCUCB,¹ an algorithm to select combinatorial actions that optimize our prioritization-based objective, and prove that it achieves asymptotic no-regret. (3) We demonstrate empirically that RankedCUCB leads to up to 38% improvement in outcomes for endangered species using real-world wildlife conservation data. Along with adapting to other challenges such as preventing illegal logging and overfishing, our no-regret algorithm addresses the general combinatorial bandit problem with a weighted linear objective.

¹ Code is available at https://github.com/lily-x/rankedCUCB

Figure 1: Animal density distributions and poaching risk (expected number of snares) across a 10 × 10 km region in Srepok Wildlife Sanctuary, Cambodia. The distribution of the species of greatest conservation concern, the elephant, differs from that of other animals and overall poaching risk.

1 Introduction

More than a quarter of mammals assessed by the IUCN Red List are threatened with extinction [Gilbert, 2008]. As part of the UN Sustainable Development Goals, Target 15.7 focuses on ending poaching, and Target 15.5 directs us to halt the loss of biodiversity and prevent the extinction of threatened species. To efficiently allocate limited resources, multi-armed bandits (MABs), and in particular combinatorial bandits [Chen et al., 2016; Cesa-Bianchi and Lugosi, 2012], have been widely used for a variety of tasks [Bastani and Bayati, 2020; Segal et al., 2018; Misra et al., 2019], including ranger patrols to prevent poaching [Xu et al., 2021a]. In this poaching prevention setting, the patrol planner is tasked with repeatedly and efficiently allocating a limited number of patrol resources across different locations within the park [Plumptre et al., 2014; Fang et al., 2016; Xu et al., 2021b].

In past work, we worked with the World Wide Fund for Nature (WWF) to deploy machine learning approaches for patrol planning in Srepok Wildlife Sanctuary in Cambodia [Xu et al., 2020]. Subsequent conversations with park managers and conservation biologists raised the importance of focusing on the conservation of vulnerable species during patrols, revealing a key oversight in our past work. In this paper, we address these shortcomings to better align our algorithmic objectives with on-the-ground conservation priorities.

Looking at the real-world animal distributions and snare locations in Srepok visualized in Figure 1, we observe that the locations that maximize expected reward, defined as finding the most snares (darkest red in the risk map), are not necessarily the regions with a high density of endangered animals (elephants). To effectively improve conservation outcomes, it is essential to account for these disparate impacts: the relatively abundant muntjac and wild pig would benefit most if we simply patrolled the regions with the most snares, neglecting the endangered elephant. Prioritization of species is well-grounded in existing conservation best practices [Regan et al., 2008; Arponen, 2012; Dilkina et al., 2017]. The IUCN Red List of Threatened Species, which classifies species into nine groups from critically endangered to least concern, is regarded as an important tool to focus conservation efforts on species with the greatest risk of extinction [Rodrigues et al., 2006]. We term the goal of preferentially allocating resources to maximize benefit to priority groups ranked prioritization.
Some existing multi-armed bandit approaches have considered priority-based fairness [Joseph et al., 2016; Kearns et al., 2017; Schumann et al., 2022]. However, unlike ours, these prior works only consider stochastic, not combinatorial, bandits. In our combinatorial bandit problem setup, we must determine how many hours of ranger patrol to allocate across N locations at each timestep, subject to a budget constraint. The rewards obtained from these actions are unknown a priori, so they must be learned in an online manner. Existing combinatorial bandit approaches [Chen et al., 2016; Xu et al., 2021a] achieve no-regret guarantees with respect to the objective of maximizing the direct sum of rewards obtained from all the arms. However, we wish to directly consider ranked prioritization of groups in our objective; learning combinatorial rewards in an online fashion while ensuring we justly prioritize vulnerable groups as well as reward makes the problem more challenging. How to trade off reward and prioritization in the combinatorial bandit setting has so far remained an open question. We show experimentally that straightforward solutions fail to make the appropriate trade-off.

To improve wildlife conservation outcomes by addressing the need for species prioritization, we contribute to the literature on online learning for resource allocation across groups with the following:

1. We introduce a novel problem formulation that considers resource allocation across groups with ranked prioritization, which has significant implications for wildlife conservation, as well as illegal logging and overfishing. We show that an objective that considers both prioritization and reward can be expressed as a weighted linear sum, providing the useful result that the optimal action in hindsight can be found efficiently via a linear program.

2. We provide a no-regret learning algorithm, RankedCUCB, for the novel problem of selecting a combinatorial action (the amount of patrol effort to allocate to each location) at each timestep to simultaneously attain high reward and group prioritization. We prove that RankedCUCB achieves sub-linear regret of O( J ln T / (N T) + N J / T ) for a setting with N locations, O(J^N) combinatorial actions, and time horizon T.

3. Using real species density and poaching data from Srepok Wildlife Sanctuary in Cambodia along with synthetic data, we experimentally demonstrate that RankedCUCB achieves up to 38% improvement over existing approaches in ranked prioritization for endangered wildlife.

2 Problem formulation

Consider a protected area with N locations and G groups of interest, with each group representing a species or a set of species in the context of poaching prevention. Let d_gi denote the (known) fraction of animals of a group g ∈ [G] present in location i ∈ [N]. Note that Σ_{i=1}^N d_gi = 1 for all g ∈ [G]. We assume the groups are disjoint, i.e., each animal is a member of exactly one group, and that the impact of an action on a location equally benefits all animals that are present there.

At each timestep t until the horizon T, the planner determines an action, which is an effort vector β^(t) = (β1, . . . , βN) specifying the amount of effort (e.g., number of patrol hours) to allocate to each location. The total effort at each timestep is constrained by a budget B such that Σ_{i=1}^N βi ≤ B. To enable implementation in practice, we assume the effort is discretized into J levels, thus βi ∈ Ψ = {ψ1, . . . , ψJ} for all i.

After taking action β, the decision maker receives some reward µ(β). We assume the reward is decomposable [Chen et al., 2016], defined as the sum of the reward at each location:

    expected reward = µ(β) = Σ_{i=1}^N µi(βi) .    (1)

We assume µi ∈ [0, 1] for all i. In the poaching context, the reward µi corresponds to the true probability of detecting snares at a location i. The reward function is initially unknown, leading to a learning problem. Following the model of combinatorial bandit allocation from Xu et al. [2021a], we assume that the reward function µi(·) is (1) Lipschitz continuous, i.e., |µi(ψj) − µi(ψk)| ≤ L · |ψj − ψk| for some Lipschitz constant L and all pairs ψj, ψk, and (2) monotonically nondecreasing, such that exerting greater effort never results in a decrease in reward, i.e., βi′ > βi implies µi(βi′) ≥ µi(βi).

Finally, we assume that an ordinal ranking is known to indicate which group is more vulnerable. The ranking informs the planner which groups are of relatively greater concern, as we often lack sufficient data to quantify the extent to which one group should be prioritized. Without loss of generality, assume the groups are numerically ordered with g = 1 being the most vulnerable group and g = G being the least, so the true rank is rank = ⟨1, . . . , G⟩.

2.1 Measuring ranked prioritization

To evaluate prioritization, we must measure how closely the outcome of a combinatorial action β aligns with the priority ranking over groups. Later, in Section 6, we discuss how our approach can generalize to other prioritization definitions, such as the case where we have greater specificity over the relative prioritization of each group.

To formalize prioritization, we define benefit(β; g), which quantifies the benefit group g receives from action β. As mentioned earlier, we assume that the reward µi(βi) obtained by taking an action βi impacts all individuals at location i. Let η_gi be the number of individuals from group g at location i and η_g be the total number of individuals in group g. We define

    benefit(β; g) := ( Σ_{i=1}^N η_gi µi(βi) ) / η_g = Σ_{i=1}^N d_gi µi(βi) .    (2)

An action β perfectly follows ranked prioritization when

    benefit(β; 1) ≥ benefit(β; 2) ≥ · · · ≥ benefit(β; G) .
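To make the definitions above concrete, here is a minimal Python sketch (not from the paper's codebase; the group distribution d and the realized per-location rewards below are invented toy numbers) that computes benefit(β; g) from Eq. (2) and checks the perfect-ranking condition.

import numpy as np

# Toy instance (hypothetical values): N = 4 locations, G = 2 groups.
# d[g, i] = fraction of group g's individuals at location i; rows sum to 1.
d = np.array([[0.7, 0.1, 0.1, 0.1],    # group 1 (most vulnerable)
              [0.1, 0.2, 0.3, 0.4]])   # group 2

def benefit(mu_of_beta, d):
    """Eq. (2): benefit(beta; g) = sum_i d[g, i] * mu_i(beta_i).

    mu_of_beta[i] is the reward mu_i(beta_i) realized at location i
    under the chosen effort vector beta.
    """
    return d @ mu_of_beta

def follows_ranked_prioritization(mu_of_beta, d):
    """True if benefit(beta; 1) >= benefit(beta; 2) >= ... >= benefit(beta; G)."""
    b = benefit(mu_of_beta, d)
    return bool(np.all(np.diff(b) <= 0))

# Suppose some action yields these per-location rewards mu_i(beta_i):
mu_of_beta = np.array([0.5, 0.2, 0.3, 0.6])
print(benefit(mu_of_beta, d))                      # per-group benefits
print(follows_ranked_prioritization(mu_of_beta, d))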
However, when β does not follow perfect rank prioritization, we need a metric to measure the extent to which groups are appropriately prioritized. This metric would quantify the similarity between the ranking induced by the benefits of β and the true rank. One common approach to evaluate the similarity between rankings is the Kendall tau coefficient:

    [ (# concordant pairs) − (# discordant pairs) ] / (G choose 2) .    (3)

The number of concordant pairs is the number of pairwise rankings that agree, across all (G choose 2) pairwise rankings; the discordant pairs are those that disagree. This metric yields a value in [−1, 1] which signals the amount of agreement, where +1 is perfect agreement and −1 is complete disagreement (i.e., reverse ordering). However, a critical weakness of Kendall tau is that it is discontinuous, abruptly jumping in value when the effort βi on an arm is perturbed, rendering optimization difficult.

Fortunately, we have at our disposal not just each pairwise ranking, but also the magnitude by which each pair is in concordance or discordance. Leveraging this information, we construct a convex function P(β) that quantifies group prioritization:

    P(β) := ( 1 / (G choose 2) ) Σ_{g=1}^{G−1} Σ_{h=g+1}^{G} [ benefit(β; g) − benefit(β; h) ] .    (5)

2.2 Objective with prioritization and reward

We wish to take actions that maximize our expected reward (Eq. 1) while also distributing our effort across the various groups as effectively as possible (Eq. 5). Recognizing that targeting group prioritization requires us to sacrifice reward, we set up the objective to balance reward and prioritization with a parameter λ ∈ (0, 1] that enables us to tune the degree to which we emphasize reward vs. prioritization:

    obj(β) = λ µ(β) + (1 − λ) · P(β) .    (6)

3 Approach

We show that the objective (Eq. 6) can be reformulated as a weighted linear combination of the reward of each individual arm, which we then solve using our RankedCUCB algorithm, producing a general-form solution for combinatorial bandits.

Using the prioritization metric from Equation (5), we can re-express the objective (6) as

    obj(β) = Σ_{i=1}^N µi(βi) [ λ + (1 − λ) Σ_{g=1}^{G−1} Σ_{h=g+1}^{G} (d_gi − d_hi) / (G choose 2) ] .    (7)

Observe that the large parenthetical is comprised only of known constants, since the group distributions d_gi and the tuning parameter λ are known a priori. Denoting this per-location weight by Γi, the objective becomes the weighted linear sum

    obj(β) = Σ_{i=1}^N Γi µi(βi) .    (8)
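The following sketch illustrates the reformulation in Eqs. (7)–(8): it computes the per-location weights Γi for made-up values of d, µ, and λ (illustrative only, not code from the released repository) and verifies that the weighted linear sum matches λ·µ(β) + (1 − λ)·P(β).

import numpy as np
from math import comb

def gamma_weights(d, lam):
    """Per-location weights from Eq. (7):
    Gamma_i = lam + (1 - lam) * sum_{g<h} (d[g, i] - d[h, i]) / C(G, 2)."""
    G, N = d.shape
    pair_sum = np.zeros(N)
    for g in range(G - 1):
        for h in range(g + 1, G):
            pair_sum += d[g] - d[h]
    return lam + (1 - lam) * pair_sum / comb(G, 2)

def objective(mu_of_beta, d, lam):
    """Eq. (6): lam * mu(beta) + (1 - lam) * P(beta), with P from Eq. (5)."""
    G = d.shape[0]
    b = d @ mu_of_beta                      # benefit(beta; g) for each group
    P = sum(b[g] - b[h] for g in range(G - 1)
            for h in range(g + 1, G)) / comb(G, 2)
    return lam * mu_of_beta.sum() + (1 - lam) * P

d = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.2, 0.3, 0.4],
              [0.2, 0.3, 0.3, 0.2]])
mu_of_beta = np.array([0.5, 0.2, 0.3, 0.6])
lam = 0.8

Gamma = gamma_weights(d, lam)
# The weighted linear form of Eq. (8) matches the original objective of Eq. (6).
assert np.isclose(Gamma @ mu_of_beta, objective(mu_of_beta, d, lam))
print(Gamma)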
We maintain, for each arm (i, ψj), that is, each location i paired with a discrete effort level ψj, an estimate µ̂t(i, j) of its expected reward, computed from the cumulative observed reward reward_t(i, ψj) over nt(i, ψj) arm pulls by timestep t. The confidence radius rt of an arm (i, j) is then a function of the number of times we have pulled that arm:

    rt(i, j) = √( 3 Γi² log T / (2 nt(i, ψj)) ) .    (9)

We distinguish between UCB and a term we call SELF UCB. The SELF UCB of an arm (i, j), representing location i with effort ψj at time t, is the UCB of an arm based only on its own observations, given by

    SELF UCB_t(i, j) = Γi µ̂t(i, j) + rt(i, j) .    (10)

This definition of SELF UCB corresponds with the standard interpretation of confidence bounds from the standard UCB1 algorithm [Auer et al., 2002]. The UCB of an arm is then computed by taking the minimum of the bounds of all SELF UCBs as applied to the arm. These bounds are determined by adding the distance between arm (i, j) and every other arm (i, k) at the same location to the SELF UCB:

    UCB_t(i, j) = min_{k ∈ [J]} { SELF UCB_t(i, k) + L · dist } ,    (11)
    dist = max{ 0, ψj − ψk } ,    (12)

which leverages the assumptions described in Section 2 that the expected rewards are L-Lipschitz continuous and monotonically nondecreasing.

Given these UCB estimates for each arm (i, j), we must now select a combinatorial action β to take at each timestep. As the prioritization metric can be expressed as a linear combination of the reward (8), we can directly optimize the overall objective using an integer linear program LP (in Appendix A), which selects an optimal action that respects the budget constraint.
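As an illustration of Eqs. (9)–(12), the sketch below computes the tightened upper confidence bounds for a single location, assuming toy pull counts, empirical means, and a hypothetical Lipschitz constant L; it is my own illustration rather than the paper's implementation.

import numpy as np

def ucb_for_location(mu_hat, n_pulls, psi, Gamma_i, L, T):
    """UCBs for all effort levels at one location (Eqs. 9-12).

    mu_hat[j]  : empirical mean reward of arm (i, psi_j)
    n_pulls[j] : number of times arm (i, psi_j) has been pulled (>= 1 here)
    psi        : discretized effort levels psi_1 < ... < psi_J
    """
    # Eq. (9): confidence radius per effort level.
    radius = np.sqrt(3 * Gamma_i**2 * np.log(T) / (2 * n_pulls))
    # Eq. (10): SELF UCB based only on an arm's own observations.
    self_ucb = Gamma_i * mu_hat + radius
    J = len(psi)
    ucb = np.empty(J)
    for j in range(J):
        # Eqs. (11)-(12): Lipschitz continuity and monotonicity let every
        # other level k bound level j by SELF_UCB(k) + L * max(0, psi_j - psi_k).
        dist = np.maximum(0.0, psi[j] - psi)
        ucb[j] = np.min(self_ucb + L * dist)
    return ucb

psi = np.array([0.0, 0.5, 1.0])                 # J = 3 effort levels
ucb = ucb_for_location(mu_hat=np.array([0.0, 0.3, 0.4]),
                       n_pulls=np.array([5, 2, 1]),
                       psi=psi, Gamma_i=0.9, L=0.5, T=500)
print(ucb)   # never larger than each arm's own SELF UCB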
3.2 Trading off prioritization and learning

A key challenge with our prioritization metric is that it is defined with respect to the true expected reward functions µi, which are initially unknown. Instead, we estimate per-round prioritization based on our current belief of the true reward, µ̂i, related to subjective fairness from the algorithmic fairness literature [Dimitrakakis et al., 2017]. However, this belief is nonsensical in early rounds when our reward estimates are extremely coarse, so we discount the weight of prioritization in our objective until our learning improves.

Inspired by decayed epsilon-greedy, we incorporate an ε coefficient to tune how much we want to prioritize rank order at each step (versus learning to maximize reward):

    obj(β) = λ µ(β) + (1 − λ)(1 − ε) · P(β) .    (13)

For example, epsilon-greedy methods often use exploration probability εt ∼ t^(−1/3), which attenuates at a decreasing rate with each additional timestep. Our definition gives the nice properties that εt = 1 at t = 1, so initially we do not consider ranked prioritization at all (we have no estimate of the reward values, so we cannot reasonably estimate the ranking), but as t → ∞, εt → 0 and we fully weight the ranking according to λ (since we have full knowledge of the reward and thus can precisely estimate the rank order).

All together, we call the approach described here RankedCUCB and provide pseudocode in Algorithm 1.

Algorithm 1 RankedCUCB
Input: time horizon T, budget B, discretization levels Ψ = {ψ1, . . . , ψJ}, arms i ∈ [N] with unknown reward µi
Parameters: tuning parameter λ
 1: Precompute Γi for each arm i ∈ [N]
 2: n(i, ψj) = 0, reward(i, ψj) = 0  ∀i ∈ [N], j ∈ [J]
 3: for timestep t = 1, 2, . . . , T do
 4:   Let εt = t^(−1/3)
 5:   Compute UCBt for all arms using Eq. (11)
 6:   Solve LP(UCBt, {Γi}, B) to select super arm β
 7:   // Execute action
 8:   Act on β to observe rewards X1^(t), X2^(t), . . . , XN^(t)
 9:   for arm i = 1, . . . , N do
10:     reward(i, βi) = reward(i, βi) + Xi^(t)
11:     n(i, βi) = n(i, βi) + 1

4 Regret analysis

We now prove that our iterative algorithm RankedCUCB (Alg. 1) guarantees no regret with respect to the optimal solution for the objective (8) that jointly considers reward and prioritization. More formally, we show that Regret_T → 0 as T → ∞, where

    Regret_T := µ(β⋆) − (1/T) Σ_{t=1}^T µ(β^(t)) .    (14)

Here, β⋆ is an optimal action and the expected reward is µ(β) := Σ_{i=1}^N µi(βi) Γi for an effort vector β = (β1, . . . , βN). Note that if Γi < 0, any solution to the maximization problem (8) would allocate βi = 0, and so would RankedCUCB. Hence, for the analysis we assume that we consider only those locations whose Γi > 0.

4.1 Convergence of estimates

To prove the no-regret guarantee of RankedCUCB, we first establish Lemma 1, which states that, with high probability, the UCB estimate µ̂t(i, j)Γi + rt(i, j) converges within a confidence bound of rt(i, j) = √( 3Γi² log T / (2 nt) ).

Lemma 1. Using RankedCUCB, after t timesteps, each µ̂t(i, j)Γi estimate converges to a value within a radius rt(i, j) = √( 3Γi² ln t / (2 nt(i, j)) ) of the corresponding true µt(i, j)Γi value with probability 1 − 2NJ/t², for all i, j.

Proof sketch. Using the Chernoff–Hoeffding bound, we show that, at any timestep t, the probability that the difference between µt(i, j)Γi and µ̂t(i, j)Γi is greater than rt(i, j) is at most 2/t³. We then use a union bound to show that µ̂t(i, j)Γi converges to a value within radius rt(i, j) of µt(i, j)Γi with probability 1 − 2NJ/t². The complete proof is given in Appendix B.2.

4.2 Achieving no regret

Theorem 1. The cumulative regret of RankedCUCB is O( J ln T / N + N J ) with N arms, J discrete effort values, and time horizon T.
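To give a feel for the average regret in Eq. (14), here is a small self-contained simulation. It is a toy stand-in rather than the paper's experiments: the reward table, effort levels, weights, and budget are invented, action selection is done by brute-force enumeration instead of the LP, and both the decayed-ε reweighting and the Lipschitz tightening of Eq. (11) are omitted. Even so, the UCB-style learner's average regret against the optimal weighted reward shrinks as the horizon T grows.

import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy instance (all numbers invented): N = 3 locations, J = 3 effort levels.
psi = np.array([0.0, 1.0, 2.0])          # discretized effort levels
B = 3.0                                   # effort budget per timestep
true_mu = np.array([[0.0, 0.4, 0.5],      # true_mu[i, j] = mu_i(psi_j),
                    [0.0, 0.2, 0.8],      # monotone nondecreasing in j
                    [0.0, 0.3, 0.4]])
Gamma = np.array([1.0, 0.6, 0.8])         # positive per-location weights
N, J = true_mu.shape

# All feasible super arms: one effort level per location, total effort <= B.
actions = [a for a in itertools.product(range(J), repeat=N)
           if psi[list(a)].sum() <= B]

def weighted_value(a, mu):
    return sum(Gamma[i] * mu[i, j] for i, j in enumerate(a))

opt_value = max(weighted_value(a, true_mu) for a in actions)

def run(T):
    n = np.zeros((N, J))
    total = np.zeros((N, J))
    earned = 0.0
    for t in range(1, T + 1):
        mu_hat = np.where(n > 0, total / np.maximum(n, 1), 0.0)
        radius = np.where(
            n > 0,
            np.sqrt(3 * Gamma[:, None]**2 * np.log(T) / (2 * np.maximum(n, 1))),
            1e6)                            # force exploration of unseen arms
        ucb = Gamma[:, None] * mu_hat + radius
        beta = max(actions, key=lambda a: sum(ucb[i, j] for i, j in enumerate(a)))
        for i, j in enumerate(beta):
            x = rng.random() < true_mu[i, j]    # Bernoulli feedback
            total[i, j] += x
            n[i, j] += 1
        earned += weighted_value(beta, true_mu)  # expected value of chosen action
    return opt_value - earned / T                # average regret, as in Eq. (14)

for T in (100, 1000, 10000):
    print(T, round(run(T), 3))   # average regret typically shrinks as T grows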
[Figure 2: for the SREPOK and SYNTHETIC settings, five panels show the objective at λ = 0.3 and λ = 0.8, reward, and prioritization over timesteps, plus the Pareto frontier of reward vs. prioritization; legend: Optimal, RankedCUCB, LIZARD, Random, NaiveRank.]

Figure 2: The performance of each approach. LEFT evaluates the objective with tuning parameter λ = 0.3 and λ = 0.8. Our approach, RankedCUCB, performs significantly better than baselines. CENTER evaluates reward and prioritization (at λ = 0.8), the two components of the combined objective. The reward-maximizing LIZARD algorithm rapidly maximizes reward but performs worse than random in terms of rank order. RIGHT visualizes the Pareto frontier trading off the two components of our objective. Labels represent different values of λ ∈ {0.1, 0.2, . . . , 1.0}. Each point plots the reward and ranked prioritization as the average of the final 10 timesteps. All results are averaged over 30 random seeds and smoothed with an exponential moving average.

Proof sketch. We first show the expected regret (Eq. 14) of RankedCUCB can be rewritten as

    (1/T) Σ_{i=1}^N Σ_{j=1}^J E[ L_{i,j,T} ] ζ_{i,j} ,

where L_{i,j,T} specifies the number of times the effort ψj selected for location i results in a suboptimal solution, and ζ_{i,j} denotes the minimum loss incurred due to the suboptimal selection. Then, by contradiction, we show that when all µ̂t(i, j)Γi estimates are within their confidence radius, RankedCUCB selects an optimal effort ψj (and not a suboptimal one) at time t. Using this fact and Lemma 1, we show that E[L_{i,j,T}] can be upper bounded by a finite term, and hence the expected regret is O( J ln T / N + N J ). The complete proof is given in Appendix B.3.

Our analysis improves upon the regret analysis for standard (non-weighted) combinatorial bandits [Chen et al., 2016] by not having to rely on the "bounded smoothness" assumption on reward functions. The technical tools we employ for this regret analysis may be of independent interest to the general multi-armed bandits community.

5 Experiments

We evaluate the performance of our RankedCUCB approach on real-world conservation data from Srepok Wildlife Sanctuary to demonstrate significant improvement over approaches that ignore prioritization.

Of the key species in Srepok for which we have species density data, we prioritize elephants, followed by banteng, muntjac, and wild pig, corresponding to their IUCN Red List categories of critically endangered, endangered, least concern but decreasing, and least concern, respectively. Thus our setting has G = 4 groups distributed across N = 25 locations within the park. We further evaluate our algorithm on a synthetic dataset with G = 5 groups randomly distributed across the park.

Baselines We compare the performance of our RankedCUCB approach with naive priority-aware, priority-blind, random, and optimal benchmarks. NaiveRank takes the straightforward approach of directly solving for the objective that weighs each target by its individual ranked prioritization metric, which accounts for the prioritization induced by each target independently but ignores the coupled effect across targets. LIZARD [Xu et al., 2021a] maximizes the combinatorial reward objective; this algorithm enhances standard UCB approaches by accounting for smoothness and monotonicity in the reward function, but ignores prioritization. Optimal solves for the action β⋆ = arg max_β obj(β) that maximizes the objective, using the ground truth values of µi to directly compute the best action. Random selects an arbitrary subset of actions that satisfies the budget constraint.

See Figure 2 for a comparison of the approaches. RankedCUCB consistently performs the best on our overall objective, and the breakdown of the reward and prioritization components reveals that this gain is a result of the tradeoff between the two components: although LIZARD is able to learn high-reward actions, these actions lead to prioritization outcomes that are worse than random. LIZARD even achieves reward that exceeds that of the optimal action (which also considers rank priority), but that comes at the cost of poor prioritization. Notably, NaiveRank performs worse than random,
even when measured on fairness: focusing on the individual targets with the best prioritization neglects group-wide patterns throughout the park.

The reward–prioritization tradeoff becomes more apparent when we analyze the Pareto frontier in Figure 2 (right), which plots the reward and prioritization of each approach as we change the value of λ. The price of prioritization is clear: the more we enforce prioritization (smaller values of λ), the lower our possible reward. However, this tradeoff is not one-for-one; the steepness of the Pareto curve allows us to strategically select the value of λ. For example, we achieve a nearly two-fold increase in prioritization by going from λ = 0.9 to 0.7 while sacrificing less than 10% of the reward, indicating a worthwhile tradeoff.

6 Generalizability to other settings

Beyond wildlife conservation, our problem formulation applies to domains with the following characteristics: (1) repeated resource allocation, (2) multiple groups with some priority ordering, (3) an a priori unknown reward function that must be learned over time, and (4) actions that impact some subset of the groups to differing degrees. Our approach also adapts to the following related objectives.

Weighted rank Gatmiry et al. [2021] suggest specific metrics for how rangers should prioritize different species, noting that ranger enforcement against poaching urial (a wild sheep) should be 2.5 times stricter than for red deer. Our approach could easily adapt to these settings where domain experts have a more precise specification of desired outcomes. In this case, we introduce a parameter αg for each group g that specifies the relative importance of group g, leading to the following prioritization metric (see the code sketch at the end of this section):

    P_CF(β) = Σ_{i=1}^N µi(βi) [ Σ_{g=1}^{G−1} Σ_{h=g+1}^{G} ( αg d_gi − αh d_hi ) / (G choose 2) ] ,    (15)

where we set αg > αh when g < h to raise our threshold of how strongly group g should be favored. From our example, we would set α_urial = 2.5 α_deer. This metric aligns with the definition of calibrated fairness [Liu et al., 2017].

Weighted reward We can also accommodate the setting where the reward is a weighted combination, i.e., where the reward is Σ_{i=1}^N ci µi(βi) for some coefficients ci ∈ R, as the coefficients ci would be absorbed into the Γi term of the objective.
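A sketch of how the generalized weights could be computed, combining the αg scaling of Eq. (15) with the weighted-reward coefficients ci from the paragraph above (the combination into a single per-location weight, and all numbers shown, are my own illustrative assumptions, not the paper's formulation).

import numpy as np
from math import comb

def priority_weights(d, lam, alpha=None, c=None):
    """Per-location weights generalizing Eq. (7).

    alpha[g] scales group g in the pairwise terms (Eq. 15); c[i] rescales
    the reward term, which is simply absorbed into the weight (Section 6).
    """
    G, N = d.shape
    alpha = np.ones(G) if alpha is None else np.asarray(alpha)
    c = np.ones(N) if c is None else np.asarray(c)
    pair_sum = np.zeros(N)
    for g in range(G - 1):
        for h in range(g + 1, G):
            pair_sum += alpha[g] * d[g] - alpha[h] * d[h]
    return lam * c + (1 - lam) * pair_sum / comb(G, 2)

# Example: urial (group 0) weighted 2.5x relative to red deer (group 1).
d = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.3, 0.5]])
print(priority_weights(d, lam=0.5, alpha=[2.5, 1.0]))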
7 Related work

Multi-armed bandits MABs [Lattimore and Szepesvári, 2020] have been applied to resource allocation for healthcare [Bastani and Bayati, 2020], education [Segal et al., 2018], and dynamic pricing [Misra et al., 2019]. These papers solve various versions of the stochastic MAB problem [Auer et al., 2002].

Several prior works consider resource allocation settings where each arm pull is costly, limiting the total number of pulls by a budget. Tran-Thanh et al. [2010] use an ε-based approach to achieve regret linear in the budget B; Tran-Thanh et al. [2012] provide a knapsack-based policy using UCB to improve to O(ln B) regret. Chen et al. [2016] extend UCB1 to a combinatorial setting, matching the regret bound of O(log T). Slivkins [2013] addresses ad allocation where each stochastic arm is limited by a budget constraint on the number of pulls. Our objective (8) is similar to theirs, but they do not handle combinatorial actions, which makes our regret analysis even harder.

Prioritization notions in MABs Addressing prioritization requires reward trade-offs, which is related to objectives from algorithmic fairness [Dwork et al., 2012; Kleinberg et al., 2018; Corbett-Davies et al., 2017]. However, most of the MAB literature favors traditional reward-maximizing approaches [Lattimore et al., 2015; Agrawal and Devanur, 2014; Verma et al., 2019]. The closest metric to our prioritization objective is meritocratic fairness [Kearns et al., 2017]. For MABs, meritocratic fairness was introduced by Joseph et al. [2016] to ensure that lower-reward arms are never favored over a higher-reward one. Liu et al. [2017] apply calibrated fairness by enforcing a smoothness constraint on the selection of similar arms, using Thompson sampling to achieve an Õ(T^(2/3)) fairness regret bound. For contextual bandits, Chohlas-Wood et al. [2021] take a consequentialist (outcome-oriented) approach and evaluate possible outcomes on a Pareto frontier, which we also investigate. Wang et al. [2021] ensure that arms are proportionally represented according to their merit, a metric they call fairness of exposure. Others have considered fairness in multi-agent bandits [Hossain et al., 2021], restless bandits [Herlihy et al., 2021], sleeping bandits [Li et al., 2019], contextual bandits with biased feedback [Schumann et al., 2022], and infinite bandits [Joseph et al., 2018]. However, none of these papers consider correlated groups, which is our focus.

8 Conclusion

We address the challenge of allocating limited resources to prioritize endangered species within protected areas where the true snare distributions are unknown, closing a key gap identified by our conservation partners at Srepok Wildlife Sanctuary. Our novel problem formulation introduces the metric of ranked prioritization to measure impact across disparate groups. Our RankedCUCB algorithm offers a principled way of balancing the tradeoff between reward and ranked prioritization in an online learning setting. Notably, our theoretical guarantee bounding the regret of RankedCUCB applies to a broad class of combinatorial bandit problems with a weighted linear objective.

Acknowledgments

This work was supported in part by NSF grant IIS-2046640 (CAREER) and the Army Research Office (MURI W911NF-18-1-0208). Biswas was supported by the Center for Research on Computation and Society (CRCS). Thank you to Andrew Plumptre for a helpful discussion on species prioritization in wildlife conservation; to Jessie Finocchiaro for comments on an earlier draft; and to all the rangers on the front lines of biodiversity efforts in Srepok and around the world.
References

Shipra Agrawal and Nikhil R Devanur. Bandits with concave rewards and convex knapsacks. In EC, 2014.
Anni Arponen. Prioritizing species for conservation planning. Biodiversity and Conservation, 21(4):875–893, 2012.
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. ML, 2002.
Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. OR, 68(1), 2020.
Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. J. Comput. Syst. Sci., 78(5), 2012.
Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. JMLR, 17(1), 2016.
Alex Chohlas-Wood, Madison Coots, Emma Brunskill, and Sharad Goel. Learning to be fair: A consequentialist approach to equitable decision-making. arXiv, 2021.
Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In KDD, 2017.
Bistra Dilkina, Rachel Houtman, Carla P Gomes, et al. Trade-offs and efficiencies in optimal budget-constrained multi-species corridor networks. Conservation Biology, 2017.
Christos Dimitrakakis, Yang Liu, David Parkes, and Goran Radanovic. Subjective fairness: Fairness is in the eye of the beholder. Technical report, 2017.
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In ITCS, 2012.
Fei Fang, Thanh H Nguyen, Rob Pickles, Wai Y Lam, et al. Deploying PAWS: Field optimization of the Protection Assistant for Wildlife Security. In IAAI, 2016.
Zohreh Gatmiry et al. A security game approach for strategic conservation against poaching considering food web complexities. Ecological Complexity, 2021.
Natasha Gilbert. A quarter of mammals face extinction. Nature, 455(7214):717–718, 2008.
Christine Herlihy, Aviva Prins, Aravind Srinivasan, and John Dickerson. Planning to fairly allocate: Probabilistic fairness in the restless bandit setting. arXiv, 2021.
Safwan Hossain, Evi Micha, and Nisarg Shah. Fair algorithms for multi-agent multi-armed bandits. In NeurIPS, 2021.
Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In NeurIPS, 2016.
Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. Meritocratic fairness for infinite and contextual bandits. In AIES, 2018.
Michael Kearns, Aaron Roth, and Zhiwei Steven Wu. Meritocratic fairness for cross-population selection. In ICML, 2017.
Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ashesh Rambachan. Algorithmic fairness. In AEA, 2018.
Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In NeurIPS, 2015.
Fengjiao Li, Jia Liu, and Bo Ji. Combinatorial sleeping bandits with fairness constraints. IEEE Trans. Netw. Sci. Eng., 7(3), 2019.
Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, and David C Parkes. Calibrated fairness in bandits. FAT-ML Workshop, 2017.
Kanishka Misra, Eric M Schwartz, and Jacob Abernethy. Dynamic online pricing with incomplete information using multiarmed bandit experiments. Marketing Science, 2019.
Andrew J Plumptre, Richard A Fuller, Aggrey Rwetsiba, et al. Efficiently targeting resources to deter illegal activities in protected areas. J. of Appl. Ecology, 51(3), 2014.
Helen Regan, Lauren Hierl, et al. Species prioritization for monitoring and management in regional multiple species conservation plans. Diversity and Distributions, 2008.
Ana SL Rodrigues, John D Pilgrim, John F Lamoreux, Michael Hoffmann, et al. The value of the IUCN Red List for conservation. Trends Ecol. Evol., 21(2), 2006.
Candice Schumann, Zhi Lang, Nicholas Mattei, and John P Dickerson. Group fairness in bandit arm selection. AAMAS, 2022.
Avi Segal, Yossi Ben David, Joseph Jay Williams, Kobi Gal, et al. Combining difficulty ranking with multi-armed bandits to sequence educational content. In AIED, 2018.
Aleksandrs Slivkins. Dynamic ad allocation: Bandits with budgets. arXiv, 2013.
Long Tran-Thanh, Archie Chapman, Enrique Munoz De Cote, Alex Rogers, et al. Epsilon-first policies for budget-limited multi-armed bandits. In AAAI, 2010.
Long Tran-Thanh, Archie Chapman, Alex Rogers, and Nicholas Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In AAAI, 2012.
Arun Verma, Manjesh K. Hanawal, Arun Rajkumar, et al. Censored semi-bandits: A framework for resource allocation with censored feedback. In NeurIPS, 2019.
Lequn Wang, Yiwei Bai, Wen Sun, and Thorsten Joachims. Fairness of exposure in stochastic bandits. In ICML, 2021.
Lily Xu, Shahrzad Gholami, Sara Mc Carthy, Bistra Dilkina, Andrew Plumptre, et al. Stay ahead of poachers: Illegal wildlife poaching prediction and patrol planning under uncertainty with field test evaluations. In ICDE, 2020.
Lily Xu, Elizabeth Bondi, Fei Fang, Andrew Perrault, Kai Wang, and Milind Tambe. Dual-mandate patrols: Multi-armed bandits for green security. In AAAI, 2021.
Lily Xu, Andrew Perrault, Fei Fang, Haipeng Chen, and Milind Tambe. Robust reinforcement learning under minimax regret for green security. In UAI, 2021.
A Linear program

The following linear program LP takes the current UCBt(i, j) estimates to select a combinatorial action that maximizes our optimistic reward. Each zi,j is an indicator variable specifying whether we choose effort level ψj for location i.

    max_z   Σ_{i=1}^N Σ_{j=1}^J zi,j UCBt(i, j)    (LP)
    s.t.    zi,j ∈ {0, 1}   ∀i ∈ [N], j ∈ [J]
            Σ_{j=1}^J zi,j = 1   ∀i ∈ [N]
            Σ_{i=1}^N Σ_{j=1}^J zi,j ψj ≤ B .
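For concreteness, the integer program above can be handed to an off-the-shelf solver. The snippet below uses the PuLP library (an assumption about tooling; any MILP solver would work) with placeholder UCB values rather than estimates from a real run.

import pulp

# Placeholder inputs (not real estimates): N = 3 locations, J = 3 levels.
N, J, B = 3, 3, 3.0
psi = [0.0, 1.0, 2.0]
ucb = [[0.1, 0.5, 0.6],
       [0.0, 0.3, 0.9],
       [0.2, 0.4, 0.5]]

prob = pulp.LpProblem("select_super_arm", pulp.LpMaximize)
z = [[pulp.LpVariable(f"z_{i}_{j}", cat="Binary") for j in range(J)]
     for i in range(N)]

# Objective: maximize the optimistic weighted reward.
prob += pulp.lpSum(ucb[i][j] * z[i][j] for i in range(N) for j in range(J))
# Exactly one effort level per location.
for i in range(N):
    prob += pulp.lpSum(z[i][j] for j in range(J)) == 1
# Budget constraint on total effort.
prob += pulp.lpSum(psi[j] * z[i][j] for i in range(N) for j in range(J)) <= B

prob.solve(pulp.PULP_CBC_CMD(msg=False))
beta = [next(psi[j] for j in range(J) if z[i][j].value() > 0.5) for i in range(N)]
print(beta)   # selected effort per location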
B Full Proofs

B.1 Notation

For ease of representation, we use ρ^(t)_{i,j} := µt(i, j)Γi to denote the expected reward of visiting location i with effort ψj, and ρ̂^(t)_{i,j} := µ̂t(i, j)Γi as the average empirical reward (the index j denotes the one that corresponds to the discretization level of βi, i.e., βi = ψj). We use ρ̂ to denote the empirical average reward and ρ̄ to denote the upper confidence bound. Note that ρ̄ = ρ̂ + rt(i, j). Similarly for µ̂ and µ̄.

B.2 Convergence of estimates

Lemma 1. Using RankedCUCB, after t timesteps, each µ̂t(i, j)Γi estimate converges to a value within a radius rt(i, j) = √( 3Γi² ln t / (2 nt(i, j)) ) of the corresponding true µt(i, j)Γi value with probability 1 − 2NJ/t², for all i, j.

Proof. In other words, we wish to show that each ρ̂^(t)_{i,j} estimate converges to a value within radius rt(i, j) of the corresponding true ρ^(t)_{i,j} value.

We provide a lower bound on the probability that the estimated µ̂t(i, j) values are within a bounded radius for all i ∈ [N] and all j ∈ [J]. At time t,

    P( |ρ^(t)_{i,j} − ρ̂_{i,j}| ≤ rt(i, j), ∀i ∈ [N], ∀j ∈ [J] )
      = 1 − Σ_{i=1}^N Σ_{j=1}^J P( |ρ^(t)_{i,j} − ρ̂_{i,j}| > rt(i, j) ) ,    (16)

as the events are all independent. Assume that the number of samples used for computing the estimate ρ̂_{i,j} is nt(i, j). Using the fact that ρ_{i,j} ∈ [0, Γi] and the Chernoff–Hoeffding bound, we find that

    P( |ρ^(t)_{i,j} − ρ̂_{i,j}| > rt(i, j) ) ≤ 2 exp( − (2 nt(i, j) / Γi²) rt(i, j)² )
                                            = 2 exp( −3 ln t )    (obtained by substituting the value of rt(i, j))
                                            = 2 / t³ .    (17)

By substituting Eq. (17) in Eq. (16), we obtain

    1 − Σ_{s=1}^t Σ_{i=1}^N Σ_{j=1}^J P( |ρ^(s)_{i,j} − ρ̂_{i,j}| > √( 3Γi² ln t / (2s) ) )
      ≥ 1 − Σ_{s=1}^t Σ_{i=1}^N Σ_{j=1}^J 2/t³
      = 1 − 2NJ/t² .    (18)

This completes the proof.

B.3 Regret bound

Theorem 1. The cumulative regret of RankedCUCB is O( J ln T / N + N J ) with N arms, J discrete effort values, and time horizon T.

Proof. For a finite time horizon T, the average cumulative regret of an algorithm that takes action β^(t) at time t is given by Eq. (14). We use B to denote the set of all sub-optimal actions:

    B := { β | µ(β) < µ(β⋆) } .

We now define regret as the expected loss incurred from choosing β from the set B. Thus, Eq. (14) can be written as

    Σ_{i=1}^N µi(βi⋆)Γi − (1/T) Σ_{t=1}^T E[ Σ_{i=1}^N µi(βi^(t)) Γi ] ≤ (1/T) Σ_{i=1}^N Σ_{j=1}^J E[ L_{i,j,T} ] ζ_{i,j} ,    (19)

where L_{i,j,T} denotes the number of times the effort for arm i is set as βi = ψj and the corresponding β ∈ B. That is, L_{i,j,T} specifies the number of times the pair (i, j) is chosen in a suboptimal way. Note that L_{i,j,T} ≤ nT(i, j). Let ζ_{i,j} denote the minimum loss incurred due to a sub-optimal selection (of effort ψj) on arm i. In other words,

    ζ_{i,j} = µ(β⋆) − max{ µ(β) | β ∈ B and βi = ψj } .

Let ζmin = min_{i,j} ζ_{i,j} and τt = 6Γi² ln t / (N² ζmin²). Let us assume that, at time step t, all arms have been visited at least τt times, so nt(i, j) ≥ τt for all i ∈ [N] and j ∈ [J]. Using contradiction, we show that the algorithm will not choose any sub-optimal vector β ∈ B at time t when the ρ̂^(t)_{i,j} estimates converge to a value within a radius
rt(i, j) = √( 3Γi² ln t / (2 nt(i, j)) ) of the true ρ^(t)_{i,j}:

    Rµ(β) = Σ_{i=1}^N [ ρ̂^(t)_{i,j} + √( 3Γi² ln t / (2 nt(i, j)) ) ]
          ≤ Σ_{i=1}^N [ ρ^(t)_{i,j} + 2 √( 3Γi² ln t / (2 nt(i, j)) ) ]    (since estimates are within radius)
          ≤ Σ_{i=1}^N [ ρ^(t)_{i,j} + √( 6Γi² ln t / τt ) ]    (since nt(i, j) ≥ τt)
          ≤ ζmin + Σ_{i=1}^N ρ^(t)_{i,j} .    (substituting τt)    (20)

However, the assumption that β was chosen by the algorithm at time t implies that

    Rµ(β) ≥ Rµ(β⋆)
          ≥ µ̂(β⋆) + Σ_{i=1}^N √( 3Γi² ln t / (2 nt(i, j⋆)) )    (j⋆ is the effort level corresponding to βi⋆)
          ≥ µ(β⋆) .    (since estimates are within radius)    (21)

By combining Eqs. (20) and (21), we obtain

    ζmin + Σ_{i=1}^N ρ^(t)_{i,j} ≥ µ(β⋆)
    ⟹ ζmin ≥ µ(β⋆) − µ(β^(t)) .    (22)

Inequality (22) contradicts the definition of ζmin. Thus, the algorithm selects an optimal effort (and not a sub-optimal one) at time t when all the ρ^(t)_{i,j} estimates are within the radius rt(i, j) = √( 3Γi² ln t / (2 nt(i, j)) ) of the true ρ_{i,j}. Using this fact along with Lemma 1, we obtain an upper bound on E[L_{i,j,T}]:

    E[L_{i,j,T}] ≤ (τT + 1) N J + Σ_{t=1}^T 2NJ/t²
                 ≤ (τT + 1) N J + (π²/3) N J .    (23)

We use ζmax = max_{i,j} ζ_{i,j} to provide a bound on Eq. (19), which is O(N). This produces the following regret for RankedCUCB by time T:

    Regret_T ≤ (1/T) ( τT + 1 + π²/3 ) N J ζmax
             = ( 6Γi² ln T / (N² ζmin²) + 1 + π²/3 ) N J ζmax / T .    (24)

This result shows that the worst-case regret increases linearly with the number of discrete effort levels. However, the regret grows only sub-linearly with time T. This proves that RankedCUCB asymptotically converges as a no-regret algorithm; that is, as T → ∞, the regret of RankedCUCB tends to 0.
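As a quick numerical illustration of Eq. (24), the sketch below evaluates the right-hand side for arbitrary example constants (not values estimated from the paper) and shows the bound decaying as the horizon T grows.

import numpy as np

def regret_bound(T, N, J, Gamma, zeta_min, zeta_max):
    """Right-hand side of Eq. (24)."""
    tau_T = 6 * Gamma**2 * np.log(T) / (N**2 * zeta_min**2)
    return (tau_T + 1 + np.pi**2 / 3) * N * J * zeta_max / T

for T in (10**3, 10**4, 10**5):
    print(T, regret_bound(T, N=25, J=5, Gamma=1.0, zeta_min=0.1, zeta_max=1.0))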