Improving Multi-agent Coordination by Learning to Estimate Contention
Panayiotis Danassis, Florian Wiedemair and Boi Faltings
Artificial Intelligence Laboratory, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
{firstname.lastname}@epfl.ch
arXiv:2105.04027v2 [cs.MA] 20 Jun 2021

Abstract

We present a multi-agent learning algorithm, ALMA-Learning, for efficient and fair allocations in large-scale systems. We circumvent the traditional pitfalls of multi-agent learning (e.g., the moving target problem, the curse of dimensionality, or the need for mutually consistent actions) by relying on the ALMA heuristic as a coordination mechanism for each stage game. ALMA-Learning is decentralized, observes only own action/reward pairs, requires no inter-agent communication, and achieves near-optimal (< 5% loss) and fair coordination in a variety of synthetic scenarios and a real-world meeting scheduling problem. The lightweight nature and fast learning make ALMA-Learning ideal for on-device deployment.

1 Introduction

One of the most relevant problems in multi-agent systems is finding an optimal allocation between agents, i.e., computing a maximum-weight matching, where edge weights correspond to the utility of each alternative. Many multi-agent coordination problems can be formulated as such. Example applications include role allocation (e.g., team formation [Gunn and Anderson, 2013]), task assignment (e.g., smart factories, or taxi-passenger matching [Danassis et al., 2019b; Varakantham et al., 2012]), resource allocation (e.g., parking/charging spaces for autonomous vehicles [Geng and Cassandras, 2013]), etc. What follows is applicable to any such scenario, but for concreteness we focus on the assignment problem (bipartite matching), one of the most fundamental combinatorial optimization problems [Munkres, 1957].

A significant challenge for any algorithm for the assignment problem emerges from the nature of real-world applications, which are often distributed and information-restrictive. Sharing plans, utilities, or preferences creates high overhead, and there is often a lack of responsiveness and/or communication between the participants [Stone et al., 2010]. Achieving fast convergence and high efficiency in such information-restrictive settings is extremely challenging.

A recently proposed heuristic (ALMA [Danassis et al., 2019a]) was specifically designed to address the aforementioned challenges. ALMA is decentralized, completely uncoupled (agents are only aware of their own history), and requires no communication between the agents. Instead, agents make decisions locally, based on the contest for resources that they are interested in, and the agents that are interested in the same resources. As a result, in the realistic case where each agent is interested in a subset (of fixed size) of the total resources, ALMA's convergence time is constant in the total problem size. This condition holds by default in many real-world applications (e.g., resource allocation in urban environments), since agents only have a local (partial) knowledge of the world, and there is typically a cost associated with acquiring a resource. This lightweight nature of ALMA, coupled with the lack of inter-agent communication and the highly efficient allocations [Danassis et al., 2019b; Danassis et al., 2019a; Danassis et al., 2020], makes it ideal for an on-device solution for large-scale intelligent systems (e.g., IoT devices, smart cities and intelligent infrastructure, industry 4.0, autonomous vehicles, etc.).

Despite ALMA's high performance in a variety of domains, it remains a heuristic, i.e., sub-optimal by nature. In this work, we introduce a learning element (ALMA-Learning) that allows agents to quickly close the gap in social welfare compared to the optimal solution, while simultaneously increasing the fairness of the allocation. Specifically, in ALMA, while contesting for a resource, each agent will back-off with a probability that depends on their own utility loss of switching to some alternative. ALMA-Learning improves upon ALMA by allowing agents to learn the chances that they will actually obtain the alternative option they consider when backing-off, which helps guide their search.

ALMA-Learning is applicable in repeated allocation games (e.g., self-organization of intelligent infrastructure, autonomous mobility systems, etc.), but can also be applied as a negotiation protocol in one-shot interactions, where agents can simulate the learning process offline, before making their final decision. A motivating real-world application is presented in Section 3.2, where ALMA-Learning is applied to solve a large-scale meeting scheduling problem.

1.1 Our Contributions

(1) We introduce ALMA-Learning, a distributed algorithm for large-scale multi-agent coordination, focusing on scalability and on-device deployment in real-world applications.
(2) We prove that ALMA-Learning converges.
(3) We provide a thorough evaluation in a variety of synthetic benchmarks and a real-world meeting scheduling problem. In all of them ALMA-Learning is able to quickly (in as little as 64 training steps) reach allocations of high social welfare (less than 5% loss) and fairness (up to almost 10% lower inequality compared to the best performing baseline).

1.2 Discussion and Related Work

Multi-agent coordination can usually be formulated as a matching problem. Finding a maximum weight matching is one of the best-studied combinatorial optimization problems (see [Su, 2015; Lovász and Plummer, 2009]). There is a plethora of polynomial time algorithms, with the Hungarian algorithm [Kuhn, 1955] being the most prominent centralized one for the bipartite variant (i.e., the assignment problem). In real-world problems, a centralized coordinator is not always available, and even if it is, it has to know the utilities of all the participants, which is often not feasible. Decentralized algorithms (e.g., [Giordani et al., 2010]) solve this problem, yet they require polynomial computational time and a polynomial number of messages – such as cost matrices [Ismail and Sun, 2017], pricing information [Zavlanos et al., 2008], or a basis of the LP [Bürger et al., 2012] (see also [Kuhn et al., 2016; Elkin, 2004] for general results in distributed approximability under only local information/computation).

While the problem has been 'solved' from an algorithmic perspective – having both centralized and decentralized polynomial algorithms – it is not so from the perspective of multi-agent systems, for two key reasons: (1) complexity, and (2) communication. The proliferation of intelligent systems will give rise to large-scale, multi-agent based technologies. Algorithms for maximum-weight matching, whether centralized or distributed, have runtime that increases with the total problem size, even in the realistic case where agents are interested in a small number of resources. Thus, they can only handle problems of some bounded size. Moreover, they require a significant amount of inter-agent communication. As the number and diversity of autonomous agents continue to rise, differences in origin, communication protocols, or the existence of legacy agents will bring forth the need to collaborate without any form of explicit communication [Stone et al., 2010]. Most importantly though, communication between participants (sharing utility tables, plans, and preferences) creates high overhead. On the other hand, under reasonable assumptions about the preferences of the agents, ALMA's runtime is constant in the total problem size, while requiring no message exchange (i.e., no communication network) between the participating agents. The proposed approach, ALMA-Learning, preserves these two properties of ALMA.

From the perspective of Multi-Agent Learning (MAL), the problem at hand falls under the paradigm of multi-agent reinforcement learning, where for example it can be modeled as a Multi-Armed Bandit (MAB) problem [Auer et al., 2002], or as a Markov Decision Process (MDP) and solved using a variant of Q-Learning [Busoniu et al., 2008]. In MAB problems, an agent is given a number of arms (resources) and at each time-step has to decide which arm to pull to get the maximum expected reward. In Q-learning, agents solve Bellman's optimality equation [Bellman, 2013] using an iterative approximation procedure so as to maximize some notion of expected cumulative reward. Both approaches have arguably been designed to operate in a more challenging setting, thus making them susceptible to many pitfalls inherent in MAL. For example, there is no stationary distribution; in fact, rewards depend on the joint action of the agents, and since all agents learn simultaneously, this results in a moving-target problem. Thus, there is an inherent need for coordination in MAL algorithms, stemming from the fact that the effect of an agent's action depends on the actions of the other agents, i.e., actions must be mutually consistent to achieve the desired result. Moreover, the curse of dimensionality makes it difficult to apply such algorithms to large scale problems. ALMA-Learning solves both of the above challenges by relying on ALMA as a coordination mechanism for each stage of the repeated game. Another fundamental difference is that the aforementioned algorithms are designed to tackle the exploration/exploitation dilemma. A bandit algorithm, for example, will constantly explore, even if an agent has acquired his most preferred alternative. In matching problems, though, agents know (or have an estimate of) their own utilities. ALMA-Learning in particular requires the knowledge of the personal preference ordering and pairwise differences of utility (which are far easier to estimate than the exact utility table). The latter gives a great advantage to ALMA-Learning, since agents do not need to continue exploring after successfully claiming a resource, which stabilizes the learning process.

2 Proposed Approach: ALMA-Learning

2.1 The Assignment Problem

The assignment problem refers to finding a maximum weight matching in a weighted bipartite graph, G = {N ∪ R, V}. In the studied scenario, N = {1, . . . , N} agents compete to acquire R = {1, . . . , R} resources. The weight of an edge (n, r) ∈ V represents the utility (u_n(r) ∈ [0, 1]) agent n receives by acquiring resource r. Each agent can acquire at most one resource, and each resource can be assigned to at most one agent. The goal is to maximize the social welfare (sum of utilities), i.e.,

    max_{x ≥ 0} Σ_{(n,r) ∈ V} u_n(r) x_{n,r},   where x = (x_{1,1}, . . . , x_{N,R}),

subject to Σ_{r | (n,r) ∈ V} x_{n,r} = 1, ∀n ∈ N, and Σ_{n | (n,r) ∈ V} x_{n,r} = 1, ∀r ∈ R.
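For reference (this snippet is ours, not part of the paper), the centralized version of the above objective can be solved directly with an off-the-shelf Hungarian-style solver. It assumes full knowledge of the utility matrix, which is exactly what the decentralized setting studied here does not grant:

    # Illustrative centralized reference for the assignment problem of Section 2.1.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    rng = np.random.default_rng(0)
    N, R = 5, 5
    u = rng.random((N, R))                       # u[n, r] = utility of agent n for resource r

    agents, resources = linear_sum_assignment(u, maximize=True)
    social_welfare = u[agents, resources].sum()  # objective of the LP above
    print(list(zip(agents, resources)), social_welfare)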
2.2 Learning Rule

We begin by describing (a slightly modified version of) the ALMA heuristic of [Danassis et al., 2019a], which is used as a subroutine by ALMA-Learning. The pseudo-codes for ALMA and ALMA-Learning are presented in Algorithms 1 and 2, respectively. Both ALMA and ALMA-Learning are run independently and in parallel by all the agents (to improve readability, we have omitted the subscript n).

We make the following two assumptions. First, we assume (possibly noisy) knowledge of personal utilities by each agent. Second, we assume that agents can observe feedback from their environment to detect collisions and free resources. This could be achieved by the use of sensors, or by a single bit (0 / 1) feedback from the resource (note that these messages would be between the requesting agent and the resource, not between the participating agents themselves).

For both ALMA and ALMA-Learning, each agent sorts his available resources (possibly R_n ⊆ R) in decreasing utility (r_0, . . . , r_{Rn−1}) under his preference ordering ≺_n.

ALMA: ALtruistic MAtching heuristic. ALMA converges to a resource through repeated trials. Let A = {Y, A_{r_1}, . . . , A_{r_R}} denote the set of actions, where Y refers to yielding and A_r refers to accessing resource r, and let g denote the agent's strategy. As long as an agent has not acquired a resource yet, at every time-step there are two possible scenarios: (1) If g = A_r (the strategy points to resource r), then the agent attempts to acquire that resource. In case of a collision, the colliding parties back-off (set g ← Y) with some probability. (2) If g = Y, the agent chooses another resource r for monitoring; if the resource is free, he sets g ← A_r. Monitoring proceeds in sequential order, starting from the most preferred resource. In the original ALMA algorithm, all agents start by attempting to claim their most preferred resource, and back-off with a probability that depends on their loss of switching to the immediately next best resource. The back-off probability (P(·)) is computed individually and locally, based on each agent's expected utility loss: agents that do not have good alternatives will be less likely to back-off, and vice versa; the ones that do back-off select another resource and examine its availability. In the simplest case, the probability to back-off when contesting resource r_i would be given by P(loss(i)) = 1 − loss(i), where loss(i) = u_n(r_i) − u_n(r_{i+1}) and r_{i+1} is the next best resource according to agent n's preferences ≺_n. In this work we use P(loss) = f(loss) = (1 − loss)^β, where β controls the aggressiveness (willingness to back-off) of the agents.

Sources of Inefficiency of ALMA. ALMA is a heuristic, i.e., sub-optimal by nature. It is worth understanding the sources of inefficiency, which in turn motivated ALMA-Learning. To do so, we provide a couple of adversarial examples (Table 1).

Table 1: Adversarial examples. (a) Inaccurate loss estimate: agent n3 backs-off with high probability when contesting for resource r1, assuming a good alternative, only to find resource r2 occupied. (b) Inaccurate reward expectation: agents n1 and n3 always start by attempting to acquire resource r1, reasoning that it is the most preferred one, yet each of them only wins r1 half of the time.

    (a)        r1    r2    r3        (b)        r1    r2    r3
    n1         1     0     0.5       n1         1     0.9   0
    n2         0     1     0         n2         0     1     0.9
    n3         1     0.9   0         n3         1     0.9   0

The first example is given in Table 1a. Agent n3 backs-off with high probability (higher than agent n1) when contesting for resource r1, assuming a good alternative, only to find resource r2 occupied. Thus, n3 ends up matched with resource r3. The social welfare of the final allocation is 2, which is 20% worse than the optimal (where agents n1, n2, n3 are matched with resources r3, r2, r1, respectively, achieving a social welfare of 2.5). ALMA-Learning solves this problem by learning an empirical estimate of the loss an agent will incur if he backs-off from a resource. In this case, agent n3 will learn that his loss is not 1 − 0.9 = 0.1, but actually 1 − 0 = 1, and thus will not back-off in subsequent stage games, resulting in an optimal allocation.

Table 1b presents the second example. Agents n1 and n3 always start by attempting to acquire resource r1, reasoning that it is the most preferred one. Yet, in a repeated game, each of them only wins r1 half of the time (achieving social welfare 2, which is 28.5% worse than the optimal 2.8); thus, in expectation, resource r1 has utility 0.5. ALMA-Learning solves this problem by learning an empirical estimate of the reward of each resource. In this case, after learning, either agent n1 or n3 (or both) will start from resource r2. Agent n2 will back-off, since he has a good alternative, and the result will be the optimal allocation, where agents n1, n2, n3 are matched with resources r2, r3, r1 (or r1, r3, r2), respectively.

ALMA-Learning: A Multi-Agent (Meta-)Learning Algorithm. ALMA-Learning uses ALMA as a sub-routine, specifically as a coordination mechanism for each stage of the repeated game. Over time, ALMA-Learning learns which resource to select first (r_start) when running ALMA, and an accurate empirical estimate of the loss the agent will incur by backing-off (loss[]). By learning these two values, agents take more informed decisions, specifically: (1) If an agent often loses the contest for his starting resource, the expected reward of that resource will decrease, thus in the future the agent will switch to an alternative starting resource. (2) If an agent backs-off from contesting resource r expecting low loss, only to find that all his high-utility alternatives are already occupied, then his expected loss for resource r (loss[r]) will increase, making him more reluctant to back-off in some future stage game. In more detail, ALMA-Learning learns and maintains the following information (we have omitted the subscript n from all the variables and arrays, but every agent maintains their own estimates):

(i) rewardHistory[R][L]: A 2D array. For each r ∈ R it maintains the L most recent reward values received by agent n, i.e., the L most recent u_n(r_won), where r_won ← ALMA(r, loss[]). See line 11 of Alg. 2. The array is initialized to the utility of each resource (line 3 of Alg. 2).

(ii) reward[R]: A 1D array. For each r ∈ R it maintains an empirical estimate of the expected reward received by starting at resource r and continuing to play according to Alg. 1. It is computed by averaging the reward history of the resource, i.e., ∀r ∈ R: reward[r] ← rewardHistory[r].getMean(). See line 12 of Alg. 2.

(iii) loss[R]: A 1D array. For each r ∈ R it maintains an empirical estimate of the loss in utility agent n incurs if he backs-off from the contest for resource r. The loss of each resource r is initialized to loss[r] ← u_n(r) − u_n(r_next), where r_next is the next most preferred resource to r according to agent n's preferences ≺_n (see line 5 of Alg. 2). Subsequently, at every stage game, agent n starts by selecting resource r_start and ends up winning resource r_won. The loss of r_start is then updated according to the following averaging process, where α is the learning rate:

    loss[r_start] ← (1 − α) loss[r_start] + α (u(r_start) − u(r_won))

Finally, the last condition (lines 17-18 of Alg. 2) ensures that agents who have acquired resources of high preference stop exploring, thus stabilizing the learning process.
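To make the bookkeeping above concrete, the following is a minimal, illustrative Python sketch of a single agent's ALMA-Learning state and one stage update. It is not the authors' implementation: the ALMA call is stubbed out via run_alma, and the window length L, the learning rate α, and the simplest-case back-off rule are assumed values.

    # Illustrative single-agent sketch of the ALMA-Learning bookkeeping (Section 2.2).
    from collections import deque

    def backoff_probability(loss, beta=1.0):
        # Simplest-case back-off rule; beta = 1 gives P = 1 - loss (beta is an assumption).
        return max(0.0, 1.0 - loss) ** beta

    class AlmaLearningAgent:
        def __init__(self, utilities, window=8, alpha=0.1):   # window (L) and alpha assumed
            self.u = dict(utilities)                           # r -> u_n(r)
            ranked = sorted(self.u, key=self.u.get, reverse=True)
            self.alpha = alpha
            self.reward_history = {r: deque([self.u[r]], maxlen=window) for r in ranked}
            self.reward = {r: self.u[r] for r in ranked}
            self.loss = {r: (self.u[r] - self.u[ranked[i + 1]]) if i + 1 < len(ranked) else 0.0
                         for i, r in enumerate(ranked)}        # loss[r] = u(r) - u(r_next)
            self.r_start = max(self.reward, key=self.reward.get)

        def stage(self, run_alma):
            # One stage game: run ALMA (Alg. 1, stubbed), then update as in lines 11-18 of Alg. 2.
            r_won = run_alma(self.r_start, self.loss)
            self.reward_history[self.r_start].append(self.u[r_won])
            hist = self.reward_history[self.r_start]
            self.reward[self.r_start] = sum(hist) / len(hist)
            realized = self.u[self.r_start] - self.u[r_won]
            if realized > 0:
                self.loss[self.r_start] = ((1 - self.alpha) * self.loss[self.r_start]
                                           + self.alpha * realized)
            if r_won != self.r_start:
                self.r_start = max(self.reward, key=self.reward.get)
            return r_won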
Algorithm 1 ALMA: ALtruistic MAtching heuristic
Require: Sort resources (R_n ⊆ R) in decreasing order of utility r_0, . . . , r_{Rn−1} under ≺_n
 1: procedure ALMA(r_start, loss[R])
 2:   Initialize g ← A_{r_start}
 3:   Initialize current ← −1
 4:   Initialize converged ← False
 5:   while !converged do
 6:     if g = A_r then
 7:       Agent n attempts to acquire r
 8:       if Collision(r) then
 9:         Back-off (set g ← Y) with prob. P(loss[r])
10:       else
11:         converged ← True
12:     else (g = Y)
13:       current ← (current + 1) mod R
14:       Agent n monitors r ← r_current
15:       if Free(r) then set g ← A_r
16:   return r, such that g = A_r

Algorithm 2 ALMA-Learning
Require: Sort resources (R_n ⊆ R) in decreasing order of utility r_0, . . . , r_{Rn−1} under ≺_n
Require: rewardHistory[R][L], reward[R], loss[R]
 1: procedure ALMA-LEARNING
 2:   for all r ∈ R do                              ▷ Initialization
 3:     rewardHistory[r].add(u(r))
 4:     reward[r] ← rewardHistory[r].getMean()
 5:     loss[r] ← u(r) − u(r_next)
 6:   r_start ← arg max_r reward[r]
 7:
 8:   for t ∈ [1, . . . , T] do                      ▷ T: Time horizon
 9:     r_won ← ALMA(r_start, loss[])                ▷ Run ALMA
10:
11:     rewardHistory[r_start].add(u(r_won))
12:     reward[r_start] ← rewardHistory[r_start].getMean()
13:     if u(r_start) − u(r_won) > 0 then
14:       loss[r_start] ←
15:         (1 − α) loss[r_start] + α (u(r_start) − u(r_won))
16:
17:     if r_start ≠ r_won then
18:       r_start ← arg max_r reward[r]

2.3 Convergence

Convergence of ALMA-Learning does not translate to a fixed allocation at each stage game. The system has converged when agents no longer switch their starting resource, r_start. The final allocation of each stage game is controlled by ALMA, which means that even after convergence there can be contest for a resource, i.e., more than one agent selecting the same starting resource. As we will demonstrate later, this translates to fairer allocations, since agents with similar preferences can alternate between acquiring their most preferred resource. Due to lack of space we only provide a sketch of the proof; please refer to the supplement.

Theorem 1. There exists a time-step t_conv such that ∀t > t_conv: r^n_start(t) = r^n_start(t_conv), where r^n_start(t) denotes the starting resource r_start of agent n at the stage game of time-step t.

Proof. (Sketch; see Appendix A) Theorem 2.1 of [Danassis et al., 2019a] proves that ALMA (called at line 9 of Alg. 2) converges in polynomial time (in fact, under some assumptions it converges in constant time, i.e., each stage game converges in constant time). In ALMA-Learning, agents switch their starting resource only when the expected reward for the current starting resource drops below that of the best alternative, i.e., for an agent to switch from r_start to r'_start, it has to be that reward[r_start] < reward[r'_start]. Given that utilities are bounded in [0, 1], there is a maximum, finite number of switches until reward_n[r] = 0, ∀r ∈ R, ∀n ∈ N. In that case, the problem is equivalent to having N balls thrown randomly and independently into N bins (since R = N). Since both R and N are finite, the process will result in a distinct allocation in finite steps with probability 1.

3 Evaluation

We evaluate ALMA-Learning in a variety of synthetic benchmarks and a meeting scheduling problem based on real data from [Romano and Nunamaker, 2001]. Error bars represent one standard deviation (SD) of uncertainty.

For brevity and to improve readability, we only present the most relevant results in the main text. We refer the interested reader to the appendix for additional results for both Sections 3.1 and 3.2, implementation details and hyper-parameters, and a detailed model of the meeting scheduling problem.

Fairness. The usual predicament of efficient allocations is that they assign the resources only to a fixed subset of agents, which leads to an unfair result. Consider the simple example of Table 2. Both ALMA (with higher probability) and any optimal allocation algorithm will assign the coveted resource r1 to agent n1, while n3 will receive utility 0. But, using ALMA-Learning, agents n1 and n3 will update their expected loss for resource r1 to 1, and randomly acquire it between stage games, increasing fairness. Recall that convergence of ALMA-Learning does not translate to a fixed allocation at each stage game. To capture the fairness of this 'mixed' allocation, we report the average fairness over 32 evaluation time-steps that follow the training period.

To measure fairness, we use the Gini coefficient [Gini, 1912]. An allocation x = (x_1, . . . , x_N)ᵀ is fair iff G(x) = 0, where:

    G(x) = ( Σ_{n=1}^{N} Σ_{n'=1}^{N} |x_n − x_{n'}| ) / ( 2N Σ_{n=1}^{N} x_n )
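As an illustration (ours, not from the paper), the metric above can be computed directly from the agents' realized utilities:

    # Illustrative computation of the Gini coefficient G(x) used as the fairness metric.
    import numpy as np

    def gini(x):
        x = np.asarray(x, dtype=float)          # x[n] = realized utility of agent n
        n = len(x)
        diff_sum = np.abs(x[:, None] - x[None, :]).sum()
        return diff_sum / (2.0 * n * x.sum())

    print(gini([1.0, 1.0, 0.0]))   # the fixed allocation of Table 2: n3 receives utility 0
    print(gini([0.5, 1.0, 0.5]))   # e.g., n1 and n3 alternating on r1 across stage games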
Table 2: Adversarial example: Unfair allocation. Both ALMA (with higher probability) and any optimal allocation algorithm will assign the coveted resource r1 to agent n1, while n3 will receive utility 0.

               r1    r2     r3
    n1         1     0.5    0
    n2         0     1      0
    n3         1     0.75   0

3.1 Test Case #1: Synthetic Benchmarks

Setting. We present results on three benchmarks:
(a) Map: Consider a Cartesian map on which the agents and resources are randomly distributed. The utility of agent n for acquiring resource r is proportional to the inverse of their distance, i.e., u_n(r) = 1/d_{n,r}, where d_{n,r} denotes the Manhattan distance. We assume a grid of side length 4 × √N.
(b) Noisy Common Utilities: This pertains to an anti-coordination scenario, i.e., competition between agents with similar preferences. We model the utilities as: ∀n, n' ∈ N, |u_n(r) − u_{n'}(r)| ≤ noise, where the noise is sampled from a zero-mean Gaussian distribution, i.e., noise ∼ N(0, σ²).
(c) Binary Utilities: This corresponds to each agent being indifferent to acquiring any resource amongst his set of desired resources, i.e., u_n(r) is randomly assigned to 0 or 1.

Baselines. We compare against: (i) the Hungarian algorithm [Kuhn, 1955], which computes a maximum-weight matching in a bipartite graph, (ii) ALMA [Danassis et al., 2019a], and (iii) the Greedy algorithm, which goes through the agents randomly and assigns each of them their most preferred, unassigned resource.

Results. We begin with the loss in social welfare. Figure 1a depicts the results for the Map test-case, while Table 3 aggregates all three test-cases². ALMA-Learning reaches near-optimal allocations (less than 2.5% loss), in most cases in just 32-512 training time-steps. The exception is the Noisy Common Utilities test-case, where the training time was slightly higher. Intuitively, we believe that this is because ALMA already starts with a near-optimal allocation (especially for R > 256), and given the high similarity of the agents' utility tables (especially for σ = 0.1), it requires a lot of fine-tuning to improve the result.

Moving on to fairness, ALMA-Learning achieves the most fair allocations in all of the test-cases. As an example, Figure 1b depicts the Gini coefficient for the Map test-case. ALMA-Learning's Gini coefficient is −18% to −90% lower on average (across problem sizes) than ALMA's, −24% to −93% lower than Greedy's, and −0.2% to −7% lower than the Hungarian's.

² For the Noisy Common Utilities test-case, we report results for σ = 0.1, which is the worst performing scenario for ALMA-Learning. Similar results were obtained for σ = 0.2 and σ = 0.4.

[Figure 1: Map test-case. Results for an increasing number of resources ([2, 1024], x-axis in log scale), and N = R. ALMA-Learning was trained for 512 time-steps. Panels: (a) Relative Difference in SW (%); (b) Gini coefficient (lower is better). Lines: Hungarian, Greedy, ALMA, ALMA-Learning.]

Table 3: Range of the average loss (%) in social welfare compared to the (centralized) optimal for the three different benchmarks.

                  Greedy            ALMA              ALMA-Learning
    (a) Map       1.51% - 18.71%    0.00% - 9.57%     0.00% - 0.89%
    (b) Noisy     8.13% - 12.86%    2.96% - 10.58%    1.34% - 2.26%
    (c) Binary    0.10% - 14.70%    0.00% - 16.88%    0.00% - 0.39%

3.2 Test Case #2: Meeting Scheduling

Motivation. The problem of scheduling a large number of meetings between multiple participants is ubiquitous in everyday life [Nigam and Srivastava, 2020; Ottens et al., 2017; Zunino and Campo, 2009; Maheswaran et al., 2004; BenHassine and Ho, 2007; Hassine et al., 2004; Crawford and Veloso, 2005; Franzin et al., 2002]. The advent of social media brought forth the need to schedule large-scale events, while the era of globalization and the shift to working-from-home require business meetings to account for participants with diverse preferences (e.g., different timezones).

Meeting scheduling is an inherently decentralized problem. Traditional approaches (e.g., distributed constraint optimization [Ottens et al., 2017; Maheswaran et al., 2004]) can only handle a bounded, small number of meetings. Interdependences between meetings' participants can drastically increase the complexity. While there are many commercially available electronic calendars (e.g., Doodle, Google Calendar, Microsoft Outlook, Apple's Calendar, etc.), none of these products is capable of autonomously scheduling meetings, taking into consideration user preferences and availability.

While the problem is inherently online, meetings can be aggregated and scheduled in batches, similarly to the approach for tackling matchings in ridesharing platforms [Danassis et al., 2019b]. In this test-case, we map meeting scheduling to an allocation problem and solve it using ALMA and ALMA-Learning. This showcases an application where ALMA-Learning can be used as a negotiation protocol.

Modeling. Let E = {E1, . . . , En} denote the set of events and P = {P1, . . . , Pm} the set of participants. To formulate the participation, let part: E → 2^P, where 2^P denotes the power set of P. We further define the variables days and slots to denote the number of days and time slots per day of our calendar (e.g., days = 7, slots = 24). In order to add length to each event, we define an additional function len: E → N. Participants' utilities are given by:

    pref: E × part(E) × {1, . . . , days} × {1, . . . , slots} → [0, 1].
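To make the notation concrete, here is a small, illustrative Python sketch of the data model described above. The names and numbers are hypothetical (ours, not the authors'), and the per-event aggregation of preferences is an assumption, since the paper only states that the event agent aggregates its participants' preferences; the actual scheduling is left to ALMA / ALMA-Learning.

    # Illustrative encoding of the meeting-scheduling model (hypothetical names/values).
    from dataclasses import dataclass

    DAYS, SLOTS = 7, 24                          # calendar: days x one-hour slots

    @dataclass(frozen=True)
    class Event:
        name: str
        participants: frozenset                  # part(E) ⊆ P
        length: int                              # len(E), in slots

    def pref(event, participant, day, slot):
        # pref(E, P, day, slot) in [0, 1]; 0 encodes "unavailable".
        if slot < 8 or slot > 18:                # assumption: no meetings outside 08:00-18:00
            return 0.0
        return max(0.0, 1.0 - 0.05 * abs(slot - 10))   # assumption: mornings preferred

    def event_utility(event, day, slot):
        # Utility an "event agent" could assign to the resource (day, slot): here the mean
        # of its participants' preferences, 0 if any participant is unavailable (assumed).
        prefs = [pref(event, p, day, slot) for p in event.participants]
        return 0.0 if min(prefs) == 0.0 else sum(prefs) / len(prefs)

    e1 = Event("project sync", frozenset({"P1", "P2"}), length=1)
    print(event_utility(e1, day=2, slot=9), event_utility(e1, day=2, slot=20))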
Mapping the above to the assignment problem of Section 2.1, the set of (day, slot) tuples corresponds to R, while each event is represented by one event agent (that aggregates the participant preferences), the set of which corresponds to N.

Baselines. We compare against four baselines: (a) the IBM ILOG CP optimizer [Laborie et al., 2018], used to formulate and solve the problem as a CSP³; an additional benefit of this solver is that it provides an upper bound for the optimal solution (which is infeasible to compute); (b) a modified version of the MSRAC algorithm [BenHassine and Ho, 2007]; and finally, (c) the Greedy algorithm and (d) ALMA, as before.

³ Computation time limit: 20 minutes.

Designing Large Test-Cases. As the problem size grows, CPLEX's estimate of the upper bound of the optimal solution becomes too loose (see Figure 2a). To get a more accurate estimate of the loss in social welfare for larger test-cases, we designed large instances by combining smaller problem instances, making them easier for CPLEX to solve, which in turn allowed for tighter upper bounds as well (see Figure 2b). We begin by solving two smaller problem instances with a low number of events. We then combine the two in a calendar of twice the length by duplicating the preferences, resulting in an instance with twice the number of events (agents) and calendar slots (resources). Specifically, in this case we generated seven one-day long sub-instances (with 10, 20, 30 and 40 events each), and combined them into one-week long instances with 70, 140, 210 and 280 events, respectively. The fact that preferences repeat periodically corresponds to participants being indifferent about the day (yet still having a preference on the time). These instances are depicted in Figure 2b and in the last line of Table 4.

[Figure 2: Meeting Scheduling. Results for 100 participants (P) and an increasing number of events (x-axis in log scale). ALMA-Learning was trained for 512 time-steps. Panels: (a) Relative Difference in SW (%); (b) Relative Difference in SW (%), larger instances; (c) Gini coefficient (lower is better). Lines: CPLEX, Upper bound, Greedy, MSRAC, ALMA, ALMA-Learning.]

Table 4: Range of the average loss (%) in social welfare compared to the IBM ILOG CP optimizer for increasing number of participants, P (|E| ∈ [10, 100]). The final line corresponds to the loss compared to the upper bound for the optimal solution for the large test-case with |P| = 100, |E| = 280 (Figure 2b).

                   Greedy            MSRAC             ALMA              ALMA-Learning
    |P| = 20       6.16% - 18.35%    0.00% - 8.12%     0.59% - 8.69%     0.16% - 4.84%
    |P| = 30       1.72% - 14.92%    1.47% - 10.81%    0.50% - 8.40%     0.47% - 1.94%
    |P| = 50       3.29% - 12.52%    0.00% - 15.74%    0.07% - 7.34%     0.05% - 1.68%
    |P| = 100      0.19% - 9.32%     0.00% - 8.52%     0.15% - 4.10%     0.14% - 1.43%
    |E| = 280      0.00% - 15.31%    0.00% - 22.07%    0.00% - 10.81%    0.00% - 8.84%

Results. Figures 2a and 2b depict the relative difference in social welfare compared to CPLEX for 100 participants (|P| = 100) and an increasing number of events, for the regular (|E| ∈ [10, 100]) and larger test-cases (|E| up to 280), respectively. Table 4 aggregates the results for various values of P. ALMA-Learning is able to achieve less than 5% loss compared to CPLEX, and this difference diminishes as the problem instance grows (less than 1.5% loss for |P| = 100). Finally, for the largest hand-crafted instance (|P| = 100, |E| = 280, last line of Table 4 and Figure 2b), ALMA-Learning loses less than 9% compared to the possible upper bound of the optimal solution.

Moving on to fairness, Figure 2c depicts the Gini coefficient for the large, hand-crafted instances (|P| = 100, |E| up to 280). ALMA-Learning exhibits low inequality, up to −9.5% lower than ALMA in certain cases. It is worth noting, though, that the fairness improvement is not as pronounced as in Section 3.1. In the meeting scheduling problem, all of the employed algorithms exhibit high fairness, due to the nature of the problem: every participant has multiple meetings to schedule (contrary to being matched to a single resource), all of which are drawn from the same distribution. Thus, as the number of meetings to be scheduled increases, the fairness naturally improves.

4 Conclusion

The next technological revolution will be interwoven with the proliferation of intelligent systems. To truly allow for scalable solutions, we need to shift from traditional approaches to multi-agent solutions, ideally run on-device. In this paper, we present a novel learning algorithm (ALMA-Learning), which exhibits such properties, to tackle a central challenge in multi-agent systems: finding an optimal allocation between agents, i.e., computing a maximum-weight matching. We prove that ALMA-Learning converges, and provide a thorough empirical evaluation in a variety of synthetic scenarios and a real-world meeting scheduling problem. ALMA-Learning is able to quickly (in as little as 64 training steps) reach allocations of high social welfare (less than 5% loss) and fairness.
Appendix: Contents

In this appendix we include several details that have been omitted from the main text for the sake of brevity and to improve readability. In particular:
- In Section A, we prove Theorem 1.
- In Section B, we describe in detail the modeling of the meeting scheduling problem, including the problem formulation, the data generation, the modeling of the events, the participants, and the utility functions, and finally several implementation related details.
- In Section C, we provide a thorough account of the simulation results – including but not limited to omitted graphs and tables – both for the synthetic benchmarks and the meeting scheduling problem.

A Proof of Theorem 1

Proof. Theorem 2.1 of [Danassis et al., 2019a] proves that ALMA (called at line 9 of Algorithm 2) converges in polynomial time. In fact, under the assumption that each agent is interested in a subset of the total resources (i.e., R_n ⊂ R), and thus at each resource there is a bounded number of competing agents (N^r ⊂ N), Corollary 2.1.1 of [Danassis et al., 2019a] proves that the expected number of steps any individual agent requires to converge is independent of the total problem size (i.e., N and R). In other words, by bounding these two quantities (i.e., we consider R_n and N^r to be constant functions of N, R), the convergence time of ALMA is constant in the total problem size N, R. Thus, under the aforementioned assumptions, each stage game converges in constant time.

Now that we have established that the call to the ALMA procedure will return, the key observation to prove convergence for ALMA-Learning is that agents switch their starting resource only when the expected reward for the current starting resource drops below that of the best alternative, i.e., for an agent to switch from r_start to r'_start, it has to be that reward[r_start] < reward[r'_start]. Given that utilities are bounded in [0, 1], there is a maximum, finite number of switches until reward_n[r] = 0, ∀r ∈ R, ∀n ∈ N. In that case, the problem is equivalent to having N balls thrown randomly and independently into N bins (since R = N). Since both R and N are finite, the process will result in a distinct allocation in finite steps with probability 1. In more detail, we can make the following arguments:

(i) Let r_start be the starting resource for agent n, and r'_start ← arg max_{r ∈ R \ {r_start}} reward[r]. There are two possibilities: either reward[r_start] > reward[r'_start] for all time-steps t > t_converged – i.e., reward[r_start] can oscillate but always stays larger than reward[r'_start] – or there exists a time-step t when reward[r_start] < reward[r'_start], and then agent n switches to the starting resource r'_start.

(ii) Only the reward of the starting resource r_start changes at each stage game. Thus, for the reward of a resource to increase, it has to be the r_start. In other words, at each stage game in which we select r_start as the starting resource, the reward of every other resource remains (1) unchanged and (2) reward[r] < reward[r_start], ∀r ∈ R \ {r_start} (except when an agent switches starting resources).

(iii) There is a finite number of times each agent can switch his starting resource r_start. This is because u_n(r) ∈ [0, 1] and |u_n(r) − u_n(r')| > δ, ∀n ∈ N, r ∈ R, where δ is a small, strictly positive minimum increment value. This means that either the agents will perform the maximum number of switches until reward_n[r] = 0, ∀r ∈ R, ∀n ∈ N (which will happen in a finite number of steps), or the process will have converged before that.

(iv) If reward_n[r] = 0, ∀r ∈ R, ∀n ∈ N, the question of convergence is equivalent to having N balls thrown randomly and independently into R bins and asking whether we can have exactly one ball in each bin – or, in our case, where N = R, have no empty bins. The probability of bin r being empty is ((R−1)/R)^N, i.e., of being occupied is 1 − ((R−1)/R)^N. The probability of all the bins being occupied is (1 − ((R−1)/R)^N)^R. The expected number of trials until this event occurs is 1/(1 − ((R−1)/R)^N)^R, which is finite for finite N, R.
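As a quick numeric illustration of the counting argument in step (iv) above (our own example, not from the paper), the occupancy probabilities and the expected number of trials can be evaluated directly:

    # Balls-into-bins quantities from step (iv), for N = R:
    # P(bin occupied) = 1 - ((R-1)/R)^N, P(all occupied) = (1 - ((R-1)/R)^N)^R,
    # expected number of trials = 1 / P(all occupied) -- finite for any finite N = R.
    def expected_trials(n_balls, n_bins):
        p_occupied = 1.0 - ((n_bins - 1) / n_bins) ** n_balls
        return 1.0 / (p_occupied ** n_bins)

    for size in (4, 8, 16):
        print(size, expected_trials(size, size))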
A.1 Complexity

ALMA-Learning is an anytime algorithm. At each training time-step, we run ALMA once. Thus, the computational complexity is bounded by T times the bound for ALMA, where T denotes the number of training time-steps (see Equation 2, where N and R denote the number of agents and resources, respectively, p* = f(loss*), and loss* is given by Equation 3).

    O( T R ( (2 − p*) / (2(1 − p*)) · log N + R · 1/p* ) )                                    (2)

    loss* = argmin_{loss_n^r} { min_{r ∈ R, n ∈ N} (loss_n^r), 1 − max_{r ∈ R, n ∈ N} (loss_n^r) }    (3)

B Modeling of the Meeting Scheduling Problem

B.1 Problem Formulation

Let E = {E1, . . . , En} denote the set of events we want to schedule and P = {P1, . . . , Pm} the set of participants. Additionally, we define a function mapping each event to the set of its participants, part: E → 2^P, where 2^P denotes the power set of P. Let days and slots denote the number of days and time slots per day of our calendar (e.g., days = 7, slots = 24 would define a calendar for one week where each slot is 1 hour long). In order to add length to each event we define an additional function len: E → N, where N denotes the set of natural numbers (excluding 0). We do not limit the length; this allows events to exceed a single day and even the entire calendar if needed. Finally, we assume that each participant has a preference for attending certain events at a given starting time, given by:

    pref: E × part(E) × {1, . . . , days} × {1, . . . , slots} → [0, 1].

For example, pref(E1, P1, 2, 6) = 0.7 indicates that participant P1 has a preference of 0.7 to attend event E1 starting at day 2 and slot 6. The preference function allows participants to differentiate between different kinds of meetings (personal, business, etc.), or to assign priorities. For example, one could be available in the evening for personal events while preferring to schedule business meetings in the morning.

Finding a schedule consists of finding a function that assigns each event to a given starting time, i.e.,

    sched: E → ({1, . . . , days} × {1, . . . , slots}) ∪ ∅,

where sched(E) = ∅ means that the event E is not scheduled. For the schedule to be valid, the following hard constraints need to be met:
1. Scheduled events with common participants must not overlap.
2. An event must not be scheduled at a (day, slot) tuple if any of the participants is not available. We represent an unavailable participant as one that has a preference of 0 (as given by the function pref) for that event at the given (day, slot) tuple.

More formally, the hard constraints are:

    ∀E1 ∈ E, ∀E2 ∈ E \ {E1}:
      (sched(E1) ≠ ∅ ∧ sched(E2) ≠ ∅ ∧ part(E1) ∩ part(E2) ≠ ∅)
      ⇒ (sched(E1) > end(E2) ∨ sched(E2) > end(E1))

and

    ∀E ∈ E:
      (∃P ∈ P, ∃d ∈ [1, days], ∃s ∈ [1, slots]: pref(E, P, d, s) = 0 ⇒ sched(E) ≠ (d, s)),

where end(E) returns the ending time (last slot) of the event E, as calculated by the starting time sched(E) and the length len(E).

In addition to finding a valid schedule, we focus on maximizing the social welfare, i.e., the sum of the preferences for all scheduled meetings:

    Σ_{E ∈ E, sched(E) ≠ ∅} Σ_{P ∈ P} pref(E, P, sched(E))

B.2 Modeling Events, Participants, and Utilities

Event Length. To determine the length of each generated event, we used information on real meeting lengths in corporate America in the 80s [Romano and Nunamaker, 2001]. These data were then used to fit a logistic curve, which in turn was used to yield probabilities for an event having a given length. The function was designed such that the maximum number of hours was 11; according to the data, less than 1% of meetings exceeded that limit.

Participants. To determine the number of participants in an event, we used information on real meeting sizes [Romano and Nunamaker, 2001]. As before, we used the data to fit a logistic curve, which we used to sample the number of participants. The curve was designed such that no more than 90 people were chosen for an event at a time⁴ (only 3% of the meetings exceed this number). Finally, for instances where the number of people in the entire system was below 90, that bound was reduced accordingly.

⁴ The median is significantly lower.

In order to simulate the naturally occurring clustering of people (see also B.4), we assigned each participant to a point on a 1 × 1 plane in a way that enabled the emergence of clusters. Participants that are closer on the plane are more likely to attend the same meeting. In more detail, we generated the points in an iterative way. The first participant was assigned a uniformly random point on the plane. For each subsequent participant, there is a 30% probability that they also get assigned a uniformly random point. With a 70% probability the person would be assigned a point based on a normal distribution centered at one of the previously created points. The selection of the latter point is based on the time of creation; a recently created point is exponentially more likely to be chosen as the center than an older one. This ensures the creation of clusters, while the randomness and the preference for recently generated points prevent a single cluster from growing excessively. Figure 3 displays an example of the aforedescribed process.

[Figure 3: Generated datapoints on the 1 × 1 plane for p number of people. Panels: (a) p = 10, (b) p = 50, (c) p = 100, (d) p = 200.]

Utilities. The utility function has two independent components. The first was designed to roughly reflect availability on an average workday. This function depends only on the time and is independent of the day. For example, a participant might prefer to schedule meetings in the morning rather than the afternoon (or during lunch time). A second function decreases the utility for an event over time. This function is independent of the time slot and only depends on the day.
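A minimal sketch of the clustered placement procedure described above (our own illustration; the standard deviation of the Gaussian and the recency-weighting constant are assumptions, as the paper does not report them here):

    # Illustrative generation of clustered participant locations on the 1x1 plane:
    # 30% of participants are placed uniformly at random; the rest are drawn from a
    # Gaussian centered at a previously created point, with recent points exponentially
    # more likely to be chosen as the center. sigma and decay are assumed values.
    import numpy as np

    def generate_participants(p, sigma=0.05, decay=0.9, seed=0):
        rng = np.random.default_rng(seed)
        points = [rng.random(2)]                      # first participant: uniform
        for _ in range(p - 1):
            if rng.random() < 0.3:                    # 30%: uniform placement
                points.append(rng.random(2))
            else:                                     # 70%: cluster around an earlier point
                weights = decay ** np.arange(len(points))[::-1]   # newer => larger weight
                center = points[rng.choice(len(points), p=weights / weights.sum())]
                points.append(np.clip(center + rng.normal(0.0, sigma, 2), 0.0, 1.0))
        return np.array(points)

    pts = generate_participants(100)
    print(pts.shape)    # (100, 2), cf. Figure 3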