Computation and Communication Co-Design for Real-Time Monitoring and Control in Multi-Agent Systems
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Computation and Communication Co-Design for Real-Time Monitoring and Control in Multi-Agent Systems Vishrant Tripathi∗ , Luca Ballotta† , Luca Carlone∗ , and Eytan Modiano∗ ∗ Laboratory for Information and Decision Systems, MIT † Department of Information Engineering, University of Padova Abstract—We investigate the problem of co-designing compu- tation and communication in a multi-agent system (e.g., a sensor network or a multi-robot team). We consider the realistic setting arXiv:2108.03122v2 [cs.NI] 9 Aug 2021 where each agent acquires sensor data and is capable of local processing before sending updates to a base station, which is in charge of making decisions or monitoring phenomena of interest in real time. Longer processing at an agent leads to more informative updates but also larger delays, giving rise to a delay-accuracy trade-off in choosing the right amount of local processing at each agent. We assume that the available commu- nication resources are limited due to interference, bandwidth, and power constraints. Thus, a scheduling policy needs to be designed to suitably share the communication channel among the agents. To that end, we develop a general formulation to jointly optimize the local processing at the agents and the scheduling of transmissions. Our novel formulation leverages the notion of Age of Information to quantify the freshness of data and capture the Fig. 1. Example: four drones monitor different regions and send updates to delays caused by computation and communication. We develop a base station over a wireless channel. Each agent spends time τi processing efficient resource allocation algorithms using the Whittle index the collected measurements before sending. A scheduling algorithm prioritizes approach and demonstrate our proposed algorithms in two transmissions to the base station. This paper focuses on the co-design of the practical applications: multi-agent occupancy grid mapping in processing times τi and the scheduling policy. time-varying environments, and ride sharing in autonomous vehicle networks. Our experiments show that the proposed co- innovation involve a) pushing the computation to be distributed design approach leads to a substantial performance improvement across the network, such that all agents perform local processing (18 − 82% in our tests). of the collected measurements, and b) designing scheduling al- Index Terms—wireless networks; Age of Information; dis- gorithms that efficiently share limited communication resources tributed computing; robotics; networked control systems. across all devices and ensure timely delivery of information. I. I NTRODUCTION However, existing work on communication scheduling [1]–[5] disregards distributed processing, while related work on sensor Monitoring and control of dynamical systems are fundamen- fusion [6]–[8] focuses on designing distributed algorithms, tal and well-studied problems. Many emerging applications rather than allocating computational resources at each node. involve performing these tasks over communication networks. In this work, we explore the joint optimization of computa- Examples include: sensing for IoT applications, control of robot tion and communication resources for monitoring and control swarms, real-time surveillance, and environmental monitoring tasks. We consider a multi-agent system where each agent is in by sensor networks. Such systems typically involve multiple charge of monitoring a time-varying phenomenon and sending agents collecting and sending information to a central entity information to a central base station. For instance, this setup where data is stored, aggregated, analyzed, and then used to can model a team of robots mapping a dynamic environment send back control commands. Due to the dramatic improve- and sending map updates to a base station, which aggregates ments both in on-device and edge computing, and in wireless a global map for centralized decision-making (Fig. 1). communication over the past two decades, there has been a The agents are capable of local processing before transmit- rapid growth in the size and scale of such networked systems. ting the acquired information. This could involve operations This has motivated the design of scalable architectures, both such as refining, denoising, or compressing the data or simply for computation and communication. Two key directions of gathering more informative updates. We assume that the more time an agent spends in processing locally, the higher the The first two authors contributed equally to this paper. This work was funded by NSF Grant CNS-1713725, by Army Research quality of the generated update. However, longer processing Office (ARO) grant no. W911NF-17-1-0508, by the Office of Naval Research also induces a delay in between subsequent updates. This under the ONR RAIDER program (N00014-18-1-2828), by the CARIPARO yields a delay-accuracy trade-off : is it better to send outdated Foundation Visiting Programme HiPeR, and by the Italian Ministry of Education, University and Research through the PRIN project no. 2017NS9FEY but high-quality updates, or to reduce the overall latency by and through the initiative “Departments of Excellence” (Law 232/2016). communicating low-quality information?
agent acquires sample agent sends processed sample We consider the realistic scenario where the total commu- B.S. requests update new update delivered nication resources available are limited due to interference, limited bandwidth, and/or power constraints. Thus, in any given Age time-slot, only one of the agents is allowed to communicate with the base station. The communication constraints mean that, in addition to optimizing the local processing times, a scheduling policy needs to be designed to specify which agents Time (slotted) can communicate in every time-slot. Channel Therefore, the goal of this work is to develop a general framework to determine the optimal amount of local processing Agent at each agent in the network and design a scheduling policy to prioritize communication in order to maximize performance. Related Work. Over the past few years, there has been a Fig. 2. AoI evolution for agent i. The agent acquires and processes new samples every τi time-slots. When the base station (B.S.) requests a new rapidly growing body of work using Age of Information (AoI) update, the agent sends the most recent sample that has finished processing, as a metric for designing scheduling policies in communication (k) taking ri (τi ) time-slots for transmission. The variable δi represents the networks [3]–[5] and for control-driven tasks in networked con- waiting time in the buffer for update k. Upon a new update delivery, the AoI trol systems [9]–[12]. AoI captures the timeliness of received at the base station Ai (t) drops to the age of the delivered update. information at the destination (see [13], [14] for recent surveys). timization (Section V). Our simulations show that we can Our processing and scheduling co-design problem is motivated achieve performance improvements of 18−35% in the mapping by recent advances in embedded electronics, as well as the application and 75 − 82% in the ride-sharing application with development of efficient estimation and inference algorithms respect to baseline approaches. for real-time applications on low-powered devices [15]–[18]. The output accuracy of such algorithms increases with the II. P ROBLEM F ORMULATION runtime, in line with the delay-accuracy trade-off we consider We consider a discrete-time setting with N agents in a in this paper. Another application of such a trade-off involves networked system, where each agent is in charge of monitoring deciding on computation offloading in cloud robotics, which a time-varying phenomenon and sending information updates to has been the focus of recent works on real-time inference by a base station. Each agent processes the collected measurements resource-constrained robots [19], [20]. In this context, sending locally, before sending its updates. The i-th agent spends τi raw data can induce long transmission delays, but allow better time slots to process a new update. We refer to this quantity inference by shifting the computational burden the cloud. as the processing time associated with agent i. Contributions. We address the computation and communi- We assume that sensing and processing happen sequentially cation co-design problem and develop a) a scheduling policy at each agent. Thus, agent i acquires a new sample every τi that ensures timely delivery of updates, and b) an algorithm time slots. Further, each agent stores in a buffer the freshest to determine the optimal amount of local processing at each processed measurement. We will assume that the processing agent. To do so, we use AoI to measure the lag in obtaining time allocations τi , ∀i are constant during operation. information for monitoring and control of time-critical systems. To communicate the acquired and processed updates, the Our contribution is threefold. First, we develop a general agents use a wireless communication channel. We assume that, framework to jointly optimize computation and communication due to interference and bandwidth constraints, only one of the for real-time monitoring and decision-making (Section II). This agents can transmit to the base station in any given time-slot. framework extends existing work [21] by a) considering joint At every transmission opportunity, the base station polls one optimization of scheduling in addition to processing, and b) of the agents regarding the state of its system and receives the addressing a general model that goes beyond linear systems. most recent measurement that has been processed. Second, we develop low-complexity scheduling and pro- Scheduling decisions are modeled as indicator variables ui (t) cessing allocation schemes that perform well in practice where ui (t) = 1 if the i-th agent is scheduled at time t and zero (Sections III-IV). The co-design problem is a multi-period otherwise. We assume that a transmission from the i-th agent resource allocation problem and is hard to solve in general due takes ri (τi ) time slots, with ri (·) a monotone sequence. This to its combinatorial nature. We resolve this by considering a captures one aspect of the delay-accuracy trade-off, namely Lagrangian relaxation that decouples the problem into multiple that the size of the update depends on the amount of time single-agent problems, which can be solved effectively. To spent in processing it. When the agents spend local processing solve the scheduling problem, we generalize the Whittle index to collect more detailed information, e.g., in exploration tasks, framework proposed in [5] for sources that generate updates the measurements get larger overtime and ri (·) is increasing. at different rates and of different sizes. Conversely, when the agents compress the collected data, e.g., Finally, we demonstrate the benefits of using our methods extracting visual features from images, ri (·) is decreasing. in two practical applications from robotics and autonomous To measure the freshness of the information at the base systems: multi-agent occupancy grid mapping in time-varying station, we use a metric called Age of Information (AoI). The environments, and ride-sharing systems with local route op- AoI Ai (t) measures how old the information at the base station
is regarding agent i at time t. Upon receiving a new update, it drops to the age of the delivered update. Otherwise, it increases Problem 1 (Computation and Computation Co-design). linearly. The evolution is described below: Given the set of agents V = {1, . . . , N }, cost functions {Ji (·, ·)}i∈V , and AoI evolution (1), find the processing ( (k) τi + ri (τi ) + δi , if update k is delivered, Ai (t + 1) = times {τi }i∈V and the scheduling policy π that minimize Ai (t) + 1, otherwise. the infinite-horizon time-averaged cost: (1) (k) " T # Here δi is the waiting time spent by the k-th update from X 1 X π agent i in the buffer, i.e., the delay from the time the update min lim sup Eπ Ji (τi , Ai (t)) τi ∈Ti ∀i∈V T →+∞ T t=t was processed to the time it was actually transmitted. Since π∈Π i∈V 0 X a new processed update is generated every τi time-slots, the s.t. uπi (t) ≤ 1, ∀t (k) waiting time δi ranges from 0 to τi − 1 time-slots. Fig. 2 i∈V depicts the AoI process for agent i. Observe that the lowest (P1) value that the AoI can drop to is τi + ri (τi ), since every update where Π is the set of causal scheduling policies, uπi (t) = 1 spends time τi in processing and time ri (τi ) in communication. if policy π schedules agent i at time t and uπi (t) = 0 The AoI evolution in (1) is involved since it requires otherwise. Ti is the set of admissible processing times for analyzing waiting times that vary with each update. To simplify agent i, and Aπi (t) is the AoI of the i-th agent at time t the analysis, while still capturing the relevant features of the under policy π. (k) AoI dynamics, we assume that the sequences δi are constant (k) over time, i.e., δi ≡ δi ∀k, ∀i ∈ V. Each δi accounts for the average waiting time accumulated by a processed measurement Finding the optimal processing times requires iterating over before it is sent by the i-th agent. We are interested in the the combinatorial space Ti × ... × TN , while finding the optimal practical setting where processing times τi are small, and scheduling policy requires solving a dynamic program which the number of agents N is large. Thus, our assumption of suffers from the curse of dimensionality. constant waiting times is reasonable, since the waiting time’s III. A L AGRANGIAN R ELAXATION contribution to the overall AoI is negligible on average (being upper bounded by τi ) as compared to the time between We now discuss a relaxation of Problem 1 that enables us subsequent requests from the base station, which grows linearly to develop efficient algorithms. This approach is motivated with the number of agents N [3]. The smallest AoI for agent by the work of Whittle [22] and its applications to network i is defined as ∆i , τi + ri (τi ) + δi , which is the value that scheduling [5]. The relaxation will be useful not only for finding AoI resets to upon a new update delivery. a scheduling policy, but also in optimizing the processing times. We start by considering a relaxation of (P1) where the It has been shown in recent works [9]–[12] that real-time scheduling constraint is to be satisfied on average, rather than monitoring error for linear dynamical systems can be seen as an at each time slot. The relaxed problem is given by increasing function of the AoI. Intuitively, fresher updates lead " # to higher monitoring accuracy and better control performance. T X 1 X π Motivated by this, we assume that each agent has an associated min lim sup Eπ Ji (τi , Ai (t)) τi ∈Ti ∀i∈V T →+∞ T t=t cost function Ji (τi , Ai (t)) that maps the processing time and π∈Π i∈V 0 PT (2) the current AoI to a cost that reflects how useful the current π t=t0 ui (t) X information at the base station is for monitoring or control. s.t. lim sup ≤ 1. T →+∞ T i∈V Assumption 1 (Delay-Accuracy Trade-off). The cost functions To solve (2), we introduce a Lagrange multiplier C > 0 for Ji (τi , Ai (t)) are increasing with the AoI Ai (t) and decreasing the average scheduling constraint. The Lagrange optimization with the processing time τi . Thus, longer processing leads is given by the following equation: to more useful measurements (for a fixed age), while fresher X information induces a lower cost than outdated information. max min J¯i (τi , C) − C (3) C>0 τi ∈Ti ∀i∈V π∈Π i∈V Remark 1 (Task-related cost function). The functional form " T # of Ji (τi , Ai (t)) depends on the underlying dynamics of the 1 X J¯i (τi , C) , lim sup Eπ π π Ji (τi , Ai (t)) + Cui (t) system i and on the impact of agent processing on the quality of T →+∞ T t=t0 updates. These functions are typically estimated using domain Due to the Lagrangian relaxation, the inner minimization knowledge or learned from data offline. The approach in this can be decoupled as the sum of N independent problems. paper holds for any functions Ji (τi , Ai (t)) that satisfy the above assumption. We discuss numerical examples in Section V. Our goal is to design a causal scheduling policy π and Problem 2 (Decoupled Problem i). Given a constant cost find the processing times τ1 , ..., τN for every agent so as to C > 0, find a scheduling policy πi = {ui (t)}t≥t0 and a minimize the sum of the time-average costs.
where J˜iW (τi ) , JiW τi , Hi (τi ) . The optimal processing processing time τi ∈ Ti that minimize the infinite-horizon times τi∗ and policies πi∗ , with thresholds Hi (τi∗ ), computed time-averaged cost of agent i: for each decoupled problem provide an optimal solution to the " T # inner minimization of (3). 1 X πi min lim sup Eπi Ji (τi , Ai (t)) + Cui (t) τi ∈Ti T →+∞ T t=t πi ∈Π 0 (P2) B. Optimizing Processing Times in Problem 1 Leveraging the solution of the decoupled problems found in Section III-A, we now design a procedure to optimize the In Problem 2, the multiplier C can be interpreted as a processing times for the original multi-agent Problem 1. transmission cost: whenever ui (t) = 1, agent i has to pay a cost of C for using the channel. Further, transmitting an entire Given a cost C > 0, we can use (4) and (6) to compute update costs Cri (τi ), since i transmits for ri (τi ) time-slots. the optimal processing times τi∗ and the corresponding AoI In the next section, we look at the single-agent problem (P2) thresholds Hi (τi∗ ) for the N decoupled problems in (3). Further, in greater detail, and show how to solve it exactly. Since the observe that, for the i-th decoupled problem, the optimal problem involves a single agent, it is much easier to solve scheduling policy for agent i chooses to send a new update than the original combinatorial formulation. The solution also every time the AoI exceeds Hi (τi∗ ) and the AoI drops to ∆i provides key insights in choosing both the scheduling policy after each update delivery. Thus, the fraction of time that agent and the processing times for the original problem (P1). i occupies the channel (on average) is given by ri (τi∗ ) A. Solving the Decoupled Problem fi (τi∗ ) = . (7) Hi (τi∗ ) + ri (τi∗ ) − ∆i We now solve Problem 2 for each agent separately. First, we characterize the structure of the optimal scheduling policy The totalPchannel utilization given the Lagrange multiplier πi∗ given a fixed value of τi . Then, we optimize over the latter. C is f = i∈V fi (τi∗ ). From (2), f must lie in the interval [0, 1] to represent a feasible allocation of computation and Theorem 1. The solution to Problem 2, given a fixed value of communication resources. If not, then more than one agent τi , is a stationary threshold-based policy: let H e i , Hi + ri (τi ) is transmitting in every time-slot on average, which is not and suppose there exists an age Hi that satisfies possible given the (relaxed) interference constraint. e i − 1) ≤ JiW (τi , Hi ) ≤ Ji τi , H Ji (τi , H ei) (4) This suggests a natural way to optimize over both the Lagrange cost C and the processing times τi , which is presented where in Algorithm 1. In particular, we optimize the processing times PHe i −1 h=∆i Ji (τi , h) + Cri (τi ) τi by using (4) and (6) (line 4 in Algorithm 1), and update C JiW (τi , Hi ) , . (5) via a dual-ascent scheme (lines 6–7) using the average channel e i − ∆i H utilization fcurr . Then, an optimal scheduling policy πi∗ is to start sending an update whenever Ai (t) ≥ Hi and to not transmit otherwise. If Algorithm 1 Optimizing Processing Times no such Hi exists, the optimal policy is to never transmit. The quantity JiW (τ, Hi ) represents the time-average cost of using Input: Costs JiW (·), set of admissible processing times Ti for a threshold policy with the AoI threshold Hi . each agent i ∈ V, stepsize α > 0. Output: Locally optimal processing times {τi∗ }i∈V . Proof: See Appendix A. 1: C ← C0 ; The structure of the optimal scheduling policy πi∗ according 2: loop to Theorem 1 is intuitive, due to the monotonicity of the cost 3: for sensor i ∈ V do // optimization (6) functions Ji (τi , ·) in the AoI. If it is optimal to transmit and 4: τi∗ ← arg minτi ∈Ti J˜iW (τi ); pay the cost C for ri (τi ) time-slots at a particular AoI, it 5: end for P should be also be optimal to do so when the AoI is higher, 6: fcurr ← i∈V fi (τi∗ ); since the gain from AoI reduction would be even more. Given 7: C ← C + α(fcurr − 1); τi and C , a way to compute the optimal threshold is to start 8: end loop from Hi = ∆i and increase Hi until condition (4) is satisfied. 9: return {τi∗ }i∈V . Let the value that this procedure terminates at be denoted by Hi (τi ). Then, Hi (τi ) is an optimal threshold for agent i. Next, we look at how to compute the optimal processing Intuitively, the algorithm keeps increasing the virtual com- time τi∗ to solve Problem 2. To do so, given the admissible munication cost (quantified by the Lagrange multiplier C) until set Ti , we find the value of τi ∈ Ti that induces the lowest the processing times computed in line 4 become compatible time-averaged cost for agent i by enumerating over the set Ti : with the scheduling constraint. The decoupling reduces the complexity of finding the optimal processing P times from τi∗ = arg min J˜iW (τi ). (6) Q τi ∈Ti combinatorial O i∈V |T i | to linear search O i∈V |Ti | .
IV. W HITTLE - INDEX S CHEDULING index which solves the scheduling for the original Problem 1. In the previous section, we established a threshold structure for the optimal scheduling policy of the relaxed problem (2), Definition 1. For the i-th decoupled problem, the Whittle where each agent transmits when its AoI exceeds Hi (τi∗ ). index Wi (H) is defined as the minimum cost C that makes Next, we exploit this threshold structure to design an efficient both scheduling decisions (transmit, not transmit) equally scheduling policy for the original Problem 1. Given the preferable at AoI H. Let He , H + ri (τi ). The expression processing times τi∗ computed via Algorithm 1, we need to for Wi (H), given a processing time τi , is: solve: " T # H−1 e 1 X X ∗ π e − P J(τi , k) e − ∆i J τi , H H min lim sup Eπ Ji (τi , Ai (t)) π∈Π T →+∞ T t=t Wi (H) , k=∆i . (9) i∈V X 0 (8) ri (τi ) s.t. ui (t) ≤ 1, ∀t. i∈V Minimizing the time-average of increasing functions of AoI We derive the expression above in Appendix C. Using (9), was considered in [5]. There, the authors introduced a low- we can now design the Whittle index policy to solve (8). complexity near-optimal scheduling policy using the Whittle Whenever the channel is unoccupied, the agent with the most index approach. Unlike the setting in [5], our agents generate critical update should be asked for an update. This leads to the updates at different rates (every τi time-slots for agent i) and scheduling policy presented in Algorithm 2. The Whittle index induce different communication delays (ri (τi ) time-slots). We policy chooses the agent with the highest index (line 4), since now generalize the Whittle index approach for our setting. it represents the minimum cost each agent would be willing The Whittle index approach consists of four steps: 1) to pay to transmit at the current time-slot. When the channel converting the problem into an equivalent restless multi-armed is occupied, no other transmission is allowed (line 6). The bandit (RMAB) formulation, 2) decoupling the problem via a variable z keeps track of ongoing communication and drops Lagrange relaxation, 3) establishing a structural property called to zero when a new transmission can be scheduled. indexability for the decoupled problems, and 4) using this structure to formulate a Whittle index policy for the original Algorithm 2 Whittle Index Scheduling scheduling problem. We go through these steps below. Input: Processing time τi , communication delay ri (·), and Step 1. We first need to establish (8) can be equivalently cost Ji (·, ·) for each agent i ∈ V, time horizon T . formulated as a restless multi-armed bandit problem. We do 1: t = t0 , z = 0; so in Appendix B. 2: while t ≤ T do Step 2. As we observed in Section III, the original scheduling 3: if z = 0 then // schedule transmission at time t problem can be split into N decoupled problems of the form 4: π ← arg max Wi (Ai (t)); // trigger agent π (P2) via a Lagrange relaxation. Further, through Theorem 1, i∈V we know that the optimal scheduling policy for each decoupled 5: z ← rπ (τπ ) − 1; problem has a threshold structure, i.e., agent i should transmit 6: else // continue ongoing transmission only if its associated AoI Ai (t) exceeds the threshold Hi (τi∗ ). 7: z ← z − 1; Step 3. Whittle showed in [22] that when there is added 8: end if 9: end while structure in the form of a property called indexability for the decoupled problems, then the RMAB admits a low-complexity solution called the Whittle index, that is known to be near The Whittle index is known to be asymptotically optimal as optimal [23]. The indexability property for the i-th decoupled N → ∞, if a fluid limit condition is satisfied [23], [24]. These problem requires that, as the transmission cost C increases results, along with our simulations, suggest that the Whittle from 0 to ∞, the set of AoI values for which it is optimal for index is a very good low-complexity heuristic for scheduling agent i to transmit must decrease monotonically from the entire in real-time monitoring and control applications. set (all ages Ai (t) ≥ ∆i ) to the empty set (never transmit). In other words, the optimal threshold Hi (τi∗ ) should increase as V. A PPLICATIONS the transmission cost C increases. Next, we use Theorem 1 ∗ and the monotonicity of the cost functions Ji (τi , ·) to establish We demonstrate our co-design algorithms in two applications: that the decoupled problems are indeed indexable. multi-agent occupancy grid mapping in time-varying environ- ments (Section V-A), and ride sharing in autonomous vehicle Lemma 1. The indexability property holds for the decoupled networks (Section V-B). The results show that we can achieve problems (2), given an allocation of processing times τi . performance improvements of 18 − 35% for grid mapping and Proof: See Appendix C. 75 − 82% for ride-sharing compared to baseline approaches. Step 4. Having established indexability for the decou- We also provide a video briefly summarizing and visualizing pled Problem 2, we can derive a functional form for the Whittle our simulation results [25].
Fig. 4. Transition probabilities and optimal processing time allocations plotted for each region. The probabilities are plotted on a logarithmic scale while the processing times are plotted in number of time-slots. Fig. 3. Multi-agent mapping over 9 regions: each agent monitors and if more time was spent in processing, since the base station builds a local grid map of a region, and sends map updates to a base station. is more certain about the quality of the received update. Our The occupancy in the regions is time-varying. A scheduling policy specifies how to share the communication channel among the agents. Processing times goal is to minimize the time-average of the entropies summed specify how much time each agent spends in generating new map updates. across each region through the joint optimization of processing times and the scheduling policy. A. Multi-agent Mapping of Time-Varying Environments Results. Fig. 4 shows an example of transition probabilities Setup. We co-design computation and communication for a pi (for each of the 9 regions) and the corresponding optimal multi-agent mapping problem. We assume there are N separate processing times τi∗ found using Algorithm 1. We observe that regions each of which is being mapped by an agent. The agents for regions that change quickly (i.e., have large value of pi ), send updates —in the form of occupancy grid maps of their the corresponding processing time allocated is smaller. This is surroundings— to a base station over a single communication because there is not much benefit to spending large amounts of channel, where the local maps are aggregated into a global time generating high quality updates if they become outdated map for centralized monitoring (Fig. 3). very quickly. Conversely, for slowly changing regions (with In our tests, each region is 40m × 40m in size and is low values of pi ), Algorithm 1 assigns much longer processing represented by an occupancy grid map with 1m×1m cells. The times. In this case, high quality useful updates can be created state of each cell can be either occupied (1) or unoccupied (0). by taking longer time since the regions don’t change quickly. We consider a dynamic environment where the state of each Further, we compare the performance of various scheduling cell within region i evolves according to a Markov chain, with algorithms in Fig. 5. We consider the setting where the cells remaining in their original state with probability 1 − pi processing times τi are fixed to be the same parameter τ and switching from occupied to unoccupied and vice-versa for every region (uniform processing allocation). We then plot with probability pi . This is a common model for grid mapping the performance of three scheduling algorithms –a uniform in dynamic environments in the robotics community [26], [27]. stationary randomized policy, a round-robin policy, and the Each agent is equipped with a range-bearing sensor (e.g., proposed Whittle index-based policy– for different values of τ . lidar), with a fixed maximum scanning distance (25m) and We also plot the performance of the Whittle index policy and the angular range [−π/2, π/2]. The agents move around the regions stationary randomized policy under the optimized processing randomly, taking scans of the area round them. Scanning an times, computed using Algorithm 1, shown via dotted lines entire region takes an agent multiple time-slots. We use the in Fig. 5. We observe that Algorithm 1 can find processing Navigation toolbox in MATLAB to create sensors such that the times that perform well in practice. We also observe that the resolution of the readings θmin improves with the processing Whittle index policy outperforms the two “traditional” classes time. We set θmin = 0.5/τ . We also set the noise variance in of scheduling policies for every value of the parameter τ . angle and distance measurements to be inversely proportional Overall, choosing the processing times using Algorithm 1 to τ . These settings capture the delay-accuracy trade-off. We and using the Whittle schedule from Algorithm 2 together further set the update communication times to increase linearly leads to a performance improvement of 28 − 35% over with the amount of processing, i.e., r(τ ) = 5 + dτ /2e. the baseline versions of randomized policies. Similarly, our The base station maintains an estimate of the current map proposed approach leads to a performance improvement of for each region based on the most recent update it received 17 − 28% over the baseline versions of round-robin policies. and the Markov transition probabilities {pi }i∈V associated with each region. As is common in mapping literature [28], B. Smart Ride Sharing Control in Vehicle Networks [29], we measure uncertainty at the base station in terms of Setup. We consider the scenario in which a ride-sharing entropy of the current estimated occupancy grid map for each taxi fleet serves a city coordinated by a central scheduler, region and set the cost functions Ji (·, ·) to be the entropy which receives riding requests and assigns them to the drivers. of region i. In Appendix D, we show that the entropy cost Assigned requests are enqueued into a FIFO-like queue for increases monotonically with the AoI of a region and satisfies each driver. In particular, a rider is matched to the driver whose the assumptions of our framework. It drops to a lower value predicted route has the shortest distance to the pick-up location.
2D 1D 3D Request queue 6P 5P 4P 3P 2P 1P 3P 4D 5D 6D 5D 4D 3D 2D 1D 2P 1P TSP 4P 5P R=2 Fig. 6. Drivers calculate their route by processing the oldest R requests (green queue portion). The TSP solver starts from the current driver location and involves pick ups (P) and drop offs (D) of the processed requests. Fig. 5. Performance of different scheduling policies vs. processing times τ . Solid lines represent performance of different classes of scheduling policies as the processing time τ varies. The dotted lines represent the scheduling performance with processing times computed using Algorithm 1. Fig. 7. Left: long processing may cause large gaps between new routes calculated by the driver (solid gray) and the outdated ones stored at the scheduler (dashed gray), yielding bad matches (red dots). Right: with short In our setup, routes are calculated locally by drivers and processing, the matched requests are close to the actual routes (green dots). transmitted on demand to the scheduler, which uses this infor- mation to match future requests. Such distributed processing for fleet with five “myopic” drivers, which can only process the route optimization is different from current architectures, which oldest request (τm = 1), and five “smart” drivers whose are usually centralized. However, it allows for much greater processing can be designed: in particular, we assign the same scalability and is envisioned as a key component in increasing processing time τs to all such “smart” drivers. The cost of each efficiency and scale of future ride-sharing systems [30]–[32]. driver, given by its average service time (AST), is modeled as Given communication constraints, only one driver can Ji (τi , Ai (t)) = Pi (τi ) + Ai (t) (10) transmit at a time. Drivers update their route periodically to embed real-time road conditions and remove served requests where the estimated contribution of the local processing (TSPs) from the queue. Routes are calculated via the Travelling . Pi (τi ) = 2 + 2e−0.2τi q̂i (t) (11) Salesman Problem (TSP) involving the first R pick-up and drop-off locations in the request queue (Fig. 6). Processing was fitted from simulations with an initial queue and no many requests ensures more efficient paths for enqueued riders, assignments. Because the number of enqueued requests affects thus shortening their travel time from pick up to drop off. the AST but cannot be computed offline, we modeled Pi (τi ) as Conversely, the complexity of the TSP (i.e., its processing linear with the queue length. The scheduler approximates the time) increases with the amount of processed requests R. As queue length at time t with the latest received value q̂i (t). The a consequence, the information collected by the scheduler dependence on Ai (t) is hard to assess and we let it linear.1 is usually older, inducing larger gaps with the actual route 1 Other cost functions decreasing with τ and increasing with A (t) also followed by the driver (Fig. 7). This leads to worse driver- i i request matching and increases the waiting time experienced yield good performance, suggesting that our approach is indeed robust. by riders before they are actually picked up. Since the overall Quality of Service (QoS) is measured through the service time, 430 Stationary Randomized given by the sum of travel and waiting times of riders, the 390 Whittle index (proposed) drivers face a trade-off: processing many requests shortens the Average service time 350 Baselines travels, while processing few reduces the waiting time. Whittle index + optimized processing (proposed) 310 In our tests, we model the city as a 200-node graph where 270 55 each driver travels one edge per time slot. Requests are 230 50 randomly generated according to a Poisson process of unit 190 45 40 intensity and assigned immediately to the matching driver 150 4 5 6 by the scheduler. Each request contributes one time slot to 110 the processing time of the TSP (e.g., τ = 2 corresponds 70 to processing two requests) and we set r(τ ) = τ (longer 30 processing yields longer routes to transmit). To exploit the 1 2 3 4 5 6 7 advantage of the Whittle index, we simulate an heterogeneous Processing time of smart drivers τs Fig. 8. Average service time with varying processing time τs .
Results. We compute statistics over 1000 Monte Carlo [11] J. P. Champati, M. H. Mamduhi, K. H. Johansson, and J. Gross, runs. Fig. 8 shows the AST with 10000 requests assigned “Performance characterization using aoi in a single-loop networked control system,” in Proc. IEEE INFOCOM AoI Workshop, 2019, pp. 197–203. during the simulation for τs ∈ {1, ..., 7}. The circles refer [12] M. Klügel, M. H. Mamduhi, S. Hirche, and W. Kellerer, “Aoi-penalty to the performance obtained with the Whittle index policy, minimization for networked control systems with packet loss,” in Proc. while the squares to Stationary Randomized which is used IEEE INFOCOM AoI Workshop, 2019, pp. 189–196. as a benchmark. Combining Whittle index-based scheduling [13] A. Kosta, N. Pappas, V. Angelakis et al., “Age of information: A new with processing optimization (green circle) yields a striking concept, metric, and tool,” Foundations and Trends in Networking, vol. 12, no. 3, pp. 162–259, 2017. improvement of the QoS (AST = 41) compared to the [14] Y. Sun, I. Kadota, R. Talak, and E. Modiano, “Age of information: A new Stationary Randomized with standards policies (red squares) metric for information freshness,” Synthesis Lectures on Communication such as FIFO request service (τs = 1, AST = 225), or back- Networks, vol. 12, no. 2, pp. 1–224, 2019. to-back trips [33] (τs = 2, AST = 165). In particular, the [15] M. Amir and T. Givargis, “Priority neuron: A resource-aware neural network for cyber-physical systems,” IEEE Trans. Comp.-Aided Design minimum at τs∗ = 5 indicates that it is optimal to process of Integrated Circ. and Sys., vol. 37, no. 11, pp. 2732–2742, Nov 2018. the five oldest requests in the queue. Also, the Whittle index [16] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, outperforms Stationary Randomized for all values of the “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proc. IEEE processing time, with a decrease at the optimum of 25%. CVPR, 2018, pp. 4510–4520. [17] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” VI. C ONCLUSION arXiv e-prints, p. arXiv:1804.02767, Apr 2018. [18] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, In this work, we developed a novel framework for com- Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for mobilenetv3,” in putation and communication co-design for real-time multi- Proc. IEEE/CVF Int. Conf. Comp. Vision, 2019, pp. 1314–1324. agent monitoring and control. We designed efficient algorithms [19] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, “Clipper: A low-latency online prediction serving system,” in that jointly allocate the processing time for each agent and 14th USENIX Symposium on Networked Systems Design and Implemen- schedule the available network communication resources. tation (NSDI 17), 2017, pp. 613–627. Through simulations, we further demonstrated that the proposed [20] S. Chinchali, A. Sharma, J. Harrison, A. Elhafsi, D. Kang, E. Pergament, approach works well for two different applications: multi-agent E. Cidon, S. Katti, and M. Pavone, “Network offloading policies for cloud robotics: A learning-based approach,” in Proceedings of Robotics: occupancy grid mapping in time-varying environments and Science and Systems, FreiburgimBreisgau, Germany, June 2019. distributed ride sharing in autonomous vehicle networks. [21] L. Ballotta, L. Schenato, and L. Carlone, “Computation-communication Possible directions of future work involve extending the trade-offs and sensor selection in real-time estimation for processing theoretical framework to consider more complex and realistic networks,” IEEE Trans. Net. Sci. Eng., vol. 7, no. 4, 2020. [22] P. Whittle, “Restless bandits: Activity allocation in a changing world,” cost functions that are coupled across multiple agents, time- Journal of applied probability, pp. 287–298, 1988. varying or unknown, requiring learning-based approaches. [23] R. R. Weber and G. Weiss, “On an index policy for restless bandits,” Journal of applied probability, pp. 637–648, 1990. R EFERENCES [24] A. Maatouk, S. Kriouile, M. Assaad, and A. Ephremides, “On the [1] S. Kaul, R. Yates, and M. Gruteser, “Real-time status: How often should optimality of the whittles index policy for minimizing the age of one update?” in Proc. IEEE INFOCOM, 2012, pp. 2731–2735. information,” IEEE Trans. Wireless Commun., 2020. [2] Y. Sun, E. Uysal-Biyikoglu, R. D. Yates, C. E. Koksal, and N. B. Shroff, [25] V. Tripathi, L. Ballotta, L. Carlone, and E. Modiano, “Computation “Update or wait: How to keep your data fresh,” IEEE Trans. Information and communication co-design for real-time monitoring and control Theory, vol. 63, no. 11, pp. 7492–7508, Nov. 2017. in multi-agent systems,” Video Attachment, 2021. [Online]. Available: [3] I. Kadota, A. Sinha, E. Uysal-Biyikoglu, R. Singh, and E. Modiano, https://www.dropbox.com/s/q7ijfsfc6eoko9d/video hq.mp4 “Scheduling policies for minimizing age of information in broadcast [26] D. Meyer-Delius, M. Beinhofer, and W. Burgard, “Occupancy grid models wireless networks,” IEEE/ACM Trans. Netw., vol. 26, no. 6, pp. 2637– for robot mapping in changing environments,” in Proceedings of the 2650, 2018. AAAI Conference on Artificial Intelligence, vol. 26, no. 1, 2012. [4] R. Talak, S. Karaman, and E. Modiano, “Optimizing information [27] J. Saarinen, H. Andreasson, and A. J. Lilienthal, “Independent markov freshness in wireless networks under general interference constraints,” in chain occupancy grid maps for representation of dynamic environment,” Proc. ACM Int. Symp. Mobile Ad Hoc Netw. Comput. (MobiHoc), 2018, in IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2012, pp. 3489–3495. pp. 61–70. [5] V. Tripathi and E. Modiano, “A whittle index approach to minimizing [28] F. Bourgault, A. A. Makarenko, S. B. Williams, B. Grocholsky, and functions of age of information,” in Proc. 57th Allerton Conf. Commun. H. F. Durrant-Whyte, “Information based adaptive robotic exploration,” Control Comput. IEEE, 2019, pp. 1160–1167. in IEEE/RSJ Int. Conf. Intell. Robot. Syst., vol. 1, 2002, pp. 540–545. [6] L. Xiao, S. Boyd, and S. Lall, “A scheme for robust distributed sensor [29] H. Carrillo, P. Dames, V. Kumar, and J. A. Castellanos, “Autonomous fusion based on average consensus,” in IPSN 2005. Fourth Int. Symp. robotic exploration using occupancy grid maps and graph slam based on Inf. Proc. Sensor Netw., 2005, 2005, pp. 63–70. shannon and rényi entropy,” in IEEE ICRA, 2015, pp. 487–494. [7] R. Carli, A. Chiuso, L. Schenato, and S. Zampieri, “Distributed kalman [30] A. Y. S. Lam, Y. Leung, and X. Chu, “Autonomous-vehicle public filtering based on consensus strategies,” IEEE Journal on Selected Areas transportation system: Scheduling and admission control,” IEEE Trans. in communications, vol. 26, no. 4, pp. 622–633, 2008. Intell. Transp. Syst., vol. 17, no. 5, pp. 1210–1226, 2016. [8] R. Olfati-Saber and J. S. Shamma, “Consensus filters for sensor networks [31] A. O. Al-Abbasi, A. Ghosh, and V. Aggarwal, “Deeppool: Distributed and distributed sensor fusion,” in Proc. 44th IEEE CDC, 2005, pp. 6698– model-free algorithm for ride-sharing using deep reinforcement learning,” 6703. IEEE Trans. Intell. Transp. Syst., vol. 20, no. 12, pp. 4714–4727, 2019. [9] Y. Sun, Y. Polyanskiy, and E. Uysal-Biyikoglu, “Remote estimation of the wiener process over a channel with random delay,” in Proc. IEEE [32] S. Muelas, A. LaTorre, and J.-M. Pena, “A distributed vns algorithm for Int. Symp. Information Theory (ISIT), 2017, pp. 321–325. optimizing dial-a-ride problems in large-scale scenarios,” Transportation [10] T. Z. Ornee and Y. Sun, “Sampling for remote estimation through queues: Research Part C: Emerging Technologies, vol. 54, pp. 110–130, 2015. Age of information and beyond,” IEEE Int. Symp. Model. Optim. Mobile, [33] Uber. (2018) How does uber pool expand access? [Online]. Available: Ad Hoc Wireless Netw. (WiOpt), 2019. www.uber.com/us/en/marketplace/matching/shared-rides/
A PPENDIX The case when r(τ ) = 1 is a direct application of Theorem 1 in [5], but with an adjusted minimum AoI value. For the A. Proof of Theorem 1 discussion that follows, we assume the more interesting case We first establish that the decoupled Problem 2 is equivalent of r(τ ) > 1. to a Markov decision process (MDP). We then solve the MDP We start from a state where the AoI A(t) = h and there is no using dynamic programming. Since the analysis looks similar ongoing transmission (z = 0), so a scheduling decision needs for each of the N decoupled problems, we drop the subscript to be made. We denote the differential cost-to-go function by i and solve the problem for a generic agent. S(h, z) and the time-average cost by λ. Then, the Bellman The state of the MDP describing Problem 2 consists of two equation is given by: non-negative integers A(t), z(t) . A(t) denotes the AoI of the agent at time t while z(t) denotes how much time is left S(h, 0) = J(τ, h) + min S(h + 1, 0), in the ongoing transmission from this agent. When the agent u∈{0,1} is not transmitting, z(t) is set to be 0. The variable u(t) is an indicator variable that denotes the C + S h + 1, r(τ ) − 1 − λ. (15) action of the agent: whether it is transmitting in time-slot t or Similarly, we write down the Bellman equation when there is not. Its value is chosen from the action set {0, 1}, 0 meaning an ongoing transmission. In this case, no scheduling decision the agent is at rest and 1 meaning an ongoing transmission. needs to be made. When z > 1, the AoI keeps increasing and When z(t) > 0, that means a transmission is ongoing and u(t) the Bellman equation is given by: can only be set to 1. This ensures that an entire update must be finished by the agent before making the next scheduling S(h, z) = J(τ, h) + C + S h + 1, z − 1 − λ. (16) decision. Whenever z(t) = 0, the scheduler can choose u(t) to be either 0 or 1, indicating the beginning of a new transmission. When z = 1, the AoI drops in the next time-slot and the The MDP evolution can be split into 2 cases. When the Bellman recursion is given by: agent is not transmitting (u(t) = 0), AoI increases by 1 and S(h, 1) = J(τ, h) + C + S ∆, 0 − λ. (17) z(t) remains at 0. Using (16), we expand the term S h + 1, r(τ ) − 1 : A(t + 1), z(t + 1) u=0 = (A(t) + 1, 0). (12) S h + 1, r(τ ) − 1 = J(τ, h + 1) + C + S h + 2, r(τ ) − 2 − λ. When the agent is transmitting (u(t) = 1), the AoI drops when (18) a new update completes delivery. Otherwise, it keeps increasing Applying (16) recursively to the right-hand side till we reach by 1. The variable z(t) is set to r(τ ) − 1 at the beginning of a S ∆, 0), we get: new transmission to indicate the time left in completing it. It decreases by 1 in every time-slot thereon, until the transmission r(τ )−1 X completes and z becomes 0. Thus, the state evolution is given S h + 1, r(τ ) − 1 = J(τ, h + k) + C(r(τ ) − 1) by: ( k=1 r(τ ) − 1, if z(t) = 0 + S ∆, 0 − λ r(τ ) − 1 . (19) z(t + 1) u=1 = (13) z(t) − 1, otherwise. Replacing S h + 1, r(τ ) − 1 in (15) with (19), we get: ( ∆, if z(t + 1) = 0. A(t + 1) u=1 = (14) S(h, 0) = J(τ, h) + min S(h + 1, 0), A(t) + 1, otherwise. u∈{0,1} r(τ )−1 Now that we have specified the state space, the action space X and the evolution equations; we also need to specify a cost Cr(τ ) + S ∆, 0 − λ r(τ ) − 1 + J(τ, h + k) − λ. k=1 function. We assume that in each time-slot the scheduler pays (20) a cost of the form Cu(t) + J(τ, A(t)). This maps the current state and action to a cost, where C acts like a transmission Note that now we can simplify the differential cost-to-go 0 charge and J(τ, A(t)) is an increasing function of the AoI, function to depend on the AoI only. Let S (h) , S(h, 0). Then, given a fixed value of τ . we get the simplified Bellman equation for our setting: Note that the decision process we have set up above is S 0 (h) = J(τ, h) + min S 0 (h + 1), Cr(τ ) + S 0 ∆ Markov since the state evolution depends only on the states u∈{0,1} and the actions taken in the previous time-slot. We wouldn’t r(τ )−1 have been able to make this conclusion without assuming a X fixed value of the waiting times δi , since that would have − λ r(τ ) − 1 + J(τ, h + k) − λ. (21) required us to maintain history per update. k=1 Next, we aim to minimize the infinite horizon time-average Without loss of generality, we can set S 0 (∆) = 0, since 0 cost for this MDP using dynamic programming. We follow the S (·) is a differential cost-to-go function. standard approach by first setting up the Bellman recursions. Part 1. We consider the case when there exists a threshold
H that satisfies the condition (4). Simplifying the above yields: We start by looking at a policy with an arbitrary transmission J(τ, H + r(τ ) − 1) ≤ λ (29) threshold H, i.e. transmit if and only if the AoI h ≥ H. We will show that if H satisfies (4) then this policy’s differential Similarly, for h = H − 2, we get: cost-to-go function satisfies the optimal Bellman recursion (21). r(τ )−1 X To do so, we first compute the differential cost-to-go function (C − λ)r(τ ) − λ + J(τ, H + k) ≤ for this policy. For all h ≥ H, we set u = 1 in (21) to get: k=−1 r(τ )−1 r(τ )−1 X X Cr(τ ) + J(τ, H − 2 + k) − λ . (30) S 0 (h) = J(τ, h) + Cr(τ ) + (J(h + k) − λ) − λ = k=1 k=1 r(τ )−1 Simplifying, we get: X = (C − λ)r(τ ) + J(τ, h + k) J(τ, H + r(τ ) − 1) + J(τ, H + r(τ ) − 2) ≤ 2λ. (31) k=0 (22) Repeating the above procedure for any h < H, we get: For h = H − 1 we again use (21) and set u = 0 to get: j X 0 S (H − 1) = J(τ, H − 1) + S (H) − λ = 0 J(τ, H + r(τ ) − k) ≤ jλ. (32) k=1 r(τ )−1 X (23) Observe that due to the monotonicity of the cost function = J(τ, H + k) − λ + (C − λ)r(τ ) k=−1 J(τ, ·), the most restrictive of these conditions is (29), since J(τ, H +r(τ )−1) ≤ λ implies J(τ, H +r(τ )−k) ≤ λ, ∀k > 1 where the second equality follows by expanding S 0 (H) using as well. Thus, for it to be optimal to not transmit at any AoI (22). Repeating this process j times gives us: values below the threshold H, it is sufficient for the following H+r(τ )−1 X to hold: 0 S (H − j) = J(τ, k) − jλ + (C − λ)r(τ ) (24) J(τ, H + r(τ ) − 1) ≤ λ (33) k=H−j For AoI h = H, we instead require that the optimal choice Setting H − j = ∆ in the equation above, we obtain the be to transmit, i.e. u = 1. Thus, the following must hold: following equality: r(τ )−1 H+r(τ )−1 X X Cr(τ ) + (J(τ, H + k) − λ) ≤ S 0 (H + 1) (34) J(τ, k) − (H − ∆)λ + Cr(τ ) − λr(τ ) = 0. (25) k=1 k=∆ Using (22) to expand S(H + 1), we get: Using this, we can compute λ: r(τ )−1 H+r(τ P )−1 X Cr(τ ) + (J(τ, H + k) − λ) ≤ J(τ, k) + Cr(τ ) k=∆ k=1 λ= . (26) r(τ )−1 H + r(τ ) − ∆ X (C − λ)r(τ ) + J(τ, H + 1 + k). (35) For this threshold policy to be optimal, it has to satisfy the k=0 Bellman equation (21) such that the minimization procedure over action u computed for each value of AoI h matches the Simplifying the above yields threshold structure. λ ≤ J(τ, H + r(τ )). (36) Thus, for h = H − 1, the optimal decision must be to not Similarly, for h = H + 1, we require the optimal decision to transmit, i.e. be transmit and get: r(τ )−1 r(τ )−1 X 0 S (H) ≤ Cr(τ ) + J(τ, H − 1 + k) − λ (27) X k=1 Cr(τ ) + (J(H + 1 + k) − λ) ≤ S 0 (H + 2) = k=1 Plugging in the expression of S 0 (H) using (22), we get: r(τ )−1 X r(τ )−1 (C − λ)r(τ ) + J(τ, H + 2 + k). (37) X (C − λ)r(τ ) + J(τ, H + k) ≤ k=0 k=0 Simplifying the above yields r(τ )−1 X λ ≤ J(τ, H + r(τ ) + 1). (38) Cr(τ ) + J(τ, H − 1 + k) − λ (28) k=1 Repeating the above procedure for any value of h ≥ H, we
obtain similar inequalities: λ, ∀h ≥ ∆. This, together with (44) implies r(τ )−1 λ ≤ J(τ, h + r(τ )), ∀h ≥ H. (39) X S 0 (h + 1) ≤ Cr(τ ) + (J(τ, h + k) − λ), ∀h ≥ ∆. (46) Clearly, the most restrictive of these upper bounds is λ ≤ k=1 J(H + r(τ )). Thus, for it to be optimal to transmit at all AoI The condition above implies that the minimization procedure values ≥ H, it is sufficient for the following to hold: to choose u ∈ {0, 1} will always select 0, i.e. never transmit. λ ≤ J(τ, H + r(τ )). (40) Thus, our choice of λ and S 0 (h) satisfies the Bellman equations and is optimal. This completes the proof of Theorem 1. The two conditions (33) and (40) together imply that if there exists a threshold H that satisfies (41), then an optimal policy B. Restless Multi-Armed Bandit Formulation is to transmit only when the AoI is ≥ H. We establish that the scheduling optimization described by (8) is equivalent to a restless multi-armed bandit problem J τ, H + r(τ ) − 1 ≤ λ ≤ J τ, H + r(τ ) . (41) (RMAB). A restless multi-armed bandit problem [22] consists Observe that this is identical to the optimal threshold condition of N “arms”. Each arm is a Markov decision process (MDP) (4) presented in Theorem 1. This completes one part of the with two actions (activate, rest). There are two transition proof. matrices per arm, one describing how the states evolve when Part 2. It still remains to be shown that in case no such the arm is active and one describing how the states evolve threshold can be found, then the optimal policy is to never when the arm is at rest. Each arm has a cost function mapping transmit. For ease of notation, we denote h+r(τ ) as eh. Consider states to costs. In the classic RMAB formulation, only one arm the function V : Z → R, for all AoI values h ≥ ∆ − r(τ ) + 1, can be activated in each time-slot, similar to our scheduling + given by: constraint and the goal is to find the schedule that minimizes the long-term time-average cost. h−1 To create a RMAB from (8), we first define the arms to e X V (h) , (eh − ∆)J τ, e h−1 − J(τ, k), ∀h. (42) represent each agent in the network. The state of every arm i k=∆ consists of two non-negative integers (Ai (t), zi (t)) ∈ Z2 . Here, Observe that for all values of h ≥ ∆ − r(τ ) + 1, we have Ai (t) is the AoI of the i-th agent while zi (t) is variable that V (h+1)−V (h) = (e h+1−∆)(J(τ, e h+1)−J(τ, e h)) ≥ 0. Thus, tracks the number of remaining time-slots to finish an ongoing V (·) is an increasing function. Further, V (∆ − r(τ ) + 1) = transmission from agent i. Thus, zi (t) is set to ri (τi ) − 1 at the J(τ, ∆) − J(τ, ∆) = 0. Thus, V (h) is a non-negative function start of a new transmission. It decreases by 1 in each time-slot for all values of AoI ≥ ∆ − r(τ ) + 1. as the transmission proceeds and is set to 0 when agent i is Using the function V (·) and the expression for λ (26), we not transmitting. can rewrite the condition (41) as follows: The state evolution of the arm (agent) depends on whether it is currently active (transmitting) or not. If agent i is transmitting V (H) ≤ Cr(τ ) ≤ V (H + 1). (43) in time-slot t, then it either initiates a new transmission; or the Suppose there exists some h such that Cr(τ ) ≤ V (h + 1). time remaining to finish sending the current update decreases Then, clearly (43) has a solution at H = h, since V (·) is a by 1. Under this condition, zi (t) evolves as follows: non-decreasing function. Since we are interested in the case ( ri (τi ) − 1, if zi (t) = 0 when (43) does not have a solution, we can safely assume (zi (t + 1))ui (t)=1 = (47) Cr(τ ) > V (h), ∀h. zi (t) − 1, otherwise. Since V (h) ∈ [0, Cr(τ )], ∀h ≥ ∆, so V (h) converges to a If the agent is transmitting in time-slot t and a new update finite value (bounded sequences always converge). The relation finished delivery at time-slot t + 1, i.e. zi (t + 1) = 0, then the V (h + 1) − V (h) = (e h + 1 − ∆)(J(τ, eh + 1) − J(τ, eh)) ≥ AoI drops to the age of the delivered packet. Otherwise, the 0 also ensures that the function J(τ, ·) is bounded. This is AoI increases by 1 in every time-slot. because J(τ, e h) is a non-decreasing sequence and has smaller ( increments than V (h) for each value of h. So, we can set ∆i , if zi (t + 1) = 0, (Ai (t + 1))ui (t)=1 = (48) λ = limh→∞ J(τ, h) and λ is well-defined. Ai (t) + 1, otherwise. We also set the differential cost-to-go function to be: If the agent is not transmitting in time-slot t, then there is ∞ X no update to be delivered and the state evolution is simply S 0 (h) = (J(τ, k) − λ) + Cr(τ ), ∀h ≥ ∆. (44) given by k=h (Ai (t + 1), zi (t + 1))ui (t)=0 = (Ai (t) + 1, 0) . (49) Clearly, S 0 (h) satisfies the following Bellman recurrence for never transmitting, i.e. For every arm i, there is a cost function Ji (τi , Ai (t)) which maps the state of the arm (Ai (t), zi (t)) to its associated costs, S 0 (h) = J(τ, h) + S 0 (h + 1) − λ, ∀h ≥ ∆. (45) given the processing time allocations τi . This completes the By the monotonicity of J(τ, ·), we know that J(τ, h) ≤ MDP specification for each arm.
You can also read