Value Functions Factorization with Latent State Information Sharing in Decentralized Multi-Agent Policy Gradients
arXiv:2201.01247v1 [cs.MA] 4 Jan 2022

Hanhan Zhou, Tian Lan,* and Vaneet Aggarwal†

* Hanhan Zhou and Tian Lan are with the Department of Electrical and Computer Engineering, the George Washington University, Washington, DC, 20052, e-mail: {hanhan, tlan}@gwu.edu.
† Vaneet Aggarwal is with the School of Industrial Engineering and the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, 47907, e-mail: vaneet@purdue.edu.

Abstract

Value function factorization via centralized training and decentralized execution is promising for solving cooperative multi-agent reinforcement learning tasks. One of the approaches in this area, QMIX, has become state-of-the-art and achieved the best performance on the StarCraft II micromanagement benchmark. However, the monotonic mixing of per-agent estimates in QMIX is known to restrict the joint action Q-values it can represent, and the global state information available for single-agent value function estimation is insufficient, often resulting in suboptimality. To this end, we present LSF-SAC, a novel framework that features a variational inference-based information-sharing mechanism as extra state information to assist individual agents in the value function factorization. We demonstrate that such latent individual state information sharing can significantly expand the power of value function factorization, while fully decentralized execution can still be maintained in LSF-SAC through a soft-actor-critic design. We evaluate LSF-SAC on the StarCraft II micromanagement challenge and demonstrate that it outperforms several state-of-the-art methods in challenging collaborative tasks. We further conduct extensive ablation studies to locate the key factors accounting for its performance improvements. We believe that this new insight can lead to new local value estimation methods and variational deep learning algorithms. A demo video and code of implementation can be found at https://sites.google.com/view/sacmm.

1 Introduction

Reinforcement learning has been shown to match or surpass human performance in multiple domains, including Atari games [24], Go [19], and StarCraft II [42]. Many real-world problems, like autonomous vehicle coordination [14] and network packet delivery [47], often involve multiple agents' decision making, which can be modeled as multi-agent reinforcement learning (MARL). Even though multi-agent cooperative
problems could be solved by single-agent algorithms, the joint state and action space implies limited scalability. Further, partial observability and communication constraints pose additional challenges to MARL problems. One approach to deal with such issues is the paradigm of centralized training and decentralized execution (CTDE) [18]. The approaches for CTDE mainly include value function decomposition [37, 32] and multi-agent policy gradients [4].

Value-decomposition-based approaches like QMIX [32] represent the joint action value using a monotonic mixing function of per-agent estimates and have recorded the best performance on many StarCraft II micromanagement challenge maps [22]. Further, it has been demonstrated [29] that multi-agent policy gradients are substantially outperformed by QMIX on both the multi-agent particle world environment (MPE) [25] and the StarCraft multi-agent challenge (SMAC) [33]. Despite recent attempts to combine policy gradient methods and value decomposition, e.g., VDAC [36] and mSAC [30], the achieved improvements over QMIX are limited. One of the fundamental challenges is that the restricted function class permitted by QMIX limits the joint action Q-values it can represent, leading to suboptimal value approximations and inefficient exploration [22]. A number of proposals have been made to refine the value function factorization of QMIX, e.g., QTRAN [35] and weighted QMIX [31]. However, solving tasks that require significant coordination remains a key challenge.

To this end, we propose LSF-SAC, a Latent State information sharing assisted value function Factorization under the multi-agent Soft-Actor-Critic paradigm. In particular, we introduce a novel peer-assisted information-sharing mechanism to enable effective value function factorization by sharing the latent individual states, which can be considered extra state information for more accurate individual Q-value estimation by each agent. While global information sharing or communications in MARL - e.g., TarMAC [2] - typically prevents fully distributed decision making, we show that by leveraging the design of soft-actor-critic, LSF-SAC is able to retain fully decentralized execution while enjoying the benefits of latent individual state sharing. It also incorporates the entropy measure of the policy into the reward to encourage exploration.

The key insight of LSF-SAC is that existing approaches to value function factorization use the joint state information only in the mixing network, which is nevertheless restricted by the function class it can represent. We believe an accurate individual value function estimation requires not only the state information of one specific agent, but also a proper representation of all individual state information. We propose a way to extract and utilize this extra state information for individual, per-agent value function estimation through a variational inference method, serving as latent individual state information. It is shown to significantly improve the power of value function factorization. Since we utilize such latent state information sharing only in the centralized critic, the CTDE assumptions are preserved without affecting fully decentralized decision making, unlike previous work that introduces global communication [44].
Further, we note that combining the actor-critic framework with value decomposition in LSF-SAC offers a way to decouple the decision making of individual agents (through separate policy networks) from the value function networks, while also allowing the maximization of entropy to enhance stability and exploration. Our key contributions are summarized as follows:
• We propose a novel method, LSF-SAC, the first framework for value function factorization that provides extra individual latent state information to facilitate individual, per-agent value function estimation. It is shown that latent state information can significantly improve the power of monotonic factorization operators.

• LSF-SAC leverages a soft-actor-critic design to separate individual agents' policy networks from value function networks and to maintain fully decentralized execution, while enjoying the benefits of peer-assisted value function factorization. It also leads to entropy-maximization MARL for more effective exploration.

• We demonstrate the effectiveness of LSF-SAC and show that LSF-SAC significantly outperforms a number of state-of-the-art baselines on the StarCraft II micromanagement challenge in terms of better performance and faster convergence.

2 Background

2.1 Value Function Decomposition

Value function decomposition methods [37, 32, 35, 45] learn a joint Q function Q_tot(τ, a) as a combination of individual Q functions conditioned on local observation histories; these local Q values are combined by a learnable mixing neural network to produce the joint Q value:

Q_tot(τ, a) = q^mix(s, q^1(τ^1, a^1), ..., q^n(τ^n, a^n))   (1)

Under the principle of guaranteed consistency between the global optimal joint actions and the local optimal actions, a global argmax performed on Q_tot yields the same result as a set of individual argmax operations performed on each local q^i, also known as the Individual-Global-Max (IGM) principle. QMIX proposes a more general case than VDN by approximating a broader class of monotonic functions to represent the joint action-value function rather than the summation of local action values:

∂Q_tot(τ, u) / ∂Q_i(τ_i, u_i) > 0,  ∀i ∈ N.   (2)

QPLEX [43] provides IGM consistency by taking advantage of a duplex dueling architecture,

Q_tot(τ, u) = Σ_{i=1}^{N} Q_i(τ, u_i) + Σ_{i=1}^{N} (λ_i(τ, u) − 1) A_i(τ, u_i)   (3)

where

A_i(τ, u_i) = w_i(τ) [Q_i(τ_i, u_i) − V_i(τ_i)],   V_i(τ_i) = max_{u_i} Q_i(τ_i, u_i),   (4)

and w_i(τ) is a positive weight; yet its operator still limits it to discrete action spaces [48].
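To make the monotonic mixing of Eq. (1)-(2) concrete, the sketch below shows a QMIX-style mixing network in which a hypernetwork conditioned on the global state s produces the mixer's weights; taking their absolute value makes the weights non-negative, so Q_tot is monotonic in every per-agent value by construction. Layer sizes and module names are illustrative assumptions, not the exact architecture used here.

    import torch
    import torch.nn as nn

    class MonotonicMixer(nn.Module):
        """QMIX-style mixer: Q_tot = f(s, q_1, ..., q_n) with dQ_tot/dq_i >= 0."""
        def __init__(self, n_agents, state_dim, embed_dim=32):
            super().__init__()
            self.n_agents, self.embed_dim = n_agents, embed_dim
            # Hypernetworks map the global state to the mixer's weights and biases.
            self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
            self.hyper_b1 = nn.Linear(state_dim, embed_dim)
            self.hyper_w2 = nn.Linear(state_dim, embed_dim)
            self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                          nn.Linear(embed_dim, 1))

        def forward(self, agent_qs, state):
            # agent_qs: (batch, n_agents), state: (batch, state_dim)
            bs = agent_qs.size(0)
            q = agent_qs.view(bs, 1, self.n_agents)
            # Absolute value enforces monotonicity of Q_tot in each q_i.
            w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
            b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
            h = torch.relu(torch.bmm(q, w1) + b1)            # (batch, 1, embed_dim)
            w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
            b2 = self.hyper_b2(state).view(bs, 1, 1)
            return (torch.bmm(h, w2) + b2).view(bs, 1)       # Q_tot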
2.2 Maximum Entropy Deep Reinforcement Learning

In the maximum entropy reinforcement learning framework, also known as soft actor-critic [10], the objective is to maximize not only the cumulative expected total reward, but also the expected entropy of the policy:

J(π) = Σ_{t=0}^{T} E_{(s_t, a_t)∼ρ_π} [ r(s_t, a_t) + α H(π(·|s_t)) ]   (5)

where ρ_π(s_t, a_t) denotes the state-action marginal distribution of the trajectory induced by the policy π(a_t|s_t). Soft actor-critic utilizes an actor-critic architecture with independent policy and value networks, an off-policy paradigm for efficient data collection, and entropy maximization for effective exploration. It is considered a state-of-the-art baseline for many RL problems with continuous actions due to its stability and capability.

2.3 Multi-Agent Policy Gradient Methods

Multi-agent policy gradient (MAPG) methods are extensions of policy gradient algorithms with a policy π_{θ_a}(u^a|o^a). Compared with single-agent policy gradient methods, MAPG usually faces the issues of high-variance gradient estimates [21] and credit assignment [5]. A general multi-agent gradient can be written as:

∇_θ J = E_π [ Σ_a ∇_θ log π_θ(u^a|o^a) Q_π(s, u) ]

Multi-agent policy gradients in the current literature often take advantage of CTDE by using a central critic to obtain extra state information s, and avoid the vanilla multi-agent policy gradient above due to its high variance. For instance, (Lowe et al. 2017) utilize a central critic to estimate Q(s, (a_1, ..., a_n)) and optimize the parameters of the actors by following a multi-agent DDPG gradient derived from the gradient above:

∇_{θ_a} J = E_π [ ∇_{θ_a} π(u^a|o^a) ∇_{u^a} Q^a(s, u) |_{u^a = π(o^a)} ]

Unlike most actor-critic frameworks, (Foerster et al. 2018) claim to solve the credit assignment issue by applying the counterfactual policy gradient

∇_θ J = E_π [ Σ_a ∇_θ log π_θ(u^a|τ^a) A^a(s, u) ],   A^a(s, u) = Q_π(s, u) − Σ_{u'^a} π_θ(u'^a|τ^a) Q^a_π(s, (u^{−a}, u'^a))

where A^a(s, u) is the counterfactual advantage for agent a. Note that (Foerster et al. 2018) argue that the COMA gradients provide agents with tailored gradients, thus achieving credit assignment. At the same time, they also prove that COMA is a variance reduction technique.
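For discrete action spaces (as in SMAC), the entropy-regularized objective in Eq. (5) leads to a policy loss of the form E[α log π(a|s) − Q(s, a)] under the policy's own distribution. The sketch below is one common way to compute this loss for a discrete-action soft actor-critic; the tensor shapes and the discrete formulation are assumptions for illustration.

    import torch.nn.functional as F

    def discrete_soft_policy_loss(logits, q_values, alpha):
        """Entropy-regularized (soft) policy loss for discrete actions.

        logits:   (batch, n_actions) output of a policy network
        q_values: (batch, n_actions) critic estimates Q(s, a)
        alpha:    temperature weighting the entropy bonus in Eq. (5)
        """
        probs = F.softmax(logits, dim=-1)
        log_probs = F.log_softmax(logits, dim=-1)
        # Expectation over the policy: E_{a~pi}[ alpha * log pi(a|s) - Q(s, a) ]
        return (probs * (alpha * log_probs - q_values)).sum(dim=-1).mean()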
2.4 Variational Autoencoders

Consider variables x ∈ X generated from an unknown latent variable z through a generative distribution p_u(x|z) with unknown parameters u, together with a prior on the latent variables, which we assume to be a Gaussian with zero mean and unit variance, p(z) = N(z; 0, I). To approximate the true posterior p(z|x) with a variational distribution q_w(z|x) = N(z; μ, Σ; w), [16] proposed Variational Autoencoders (VAE), which learn this distribution by minimizing the Kullback-Leibler (KL) divergence from the approximate to the true posterior, D_KL(q_w(z|x) || p(z|x)); the lower bound on the evidence log p(x) is derived as:

log p(x) ≥ E_{z∼q_w(z|x)} [ log p_u(x|z) ] − D_KL(q_w(z|x) || p(z)).

[12] proposed β-VAE, where a parameter β ≥ 0 is used to control the trade-off between the reconstruction loss and the KL divergence.

2.5 Information Bottleneck Method

The information bottleneck method [40] is a technique in information theory introduced as a principle for extracting the information in an input random variable X ∈ 𝒳 that is relevant to an output random variable Y ∈ 𝒴, while finding the proper trade-off between extraction accuracy and complexity. Given the joint distribution p(x, y), the relevant information is defined as the mutual information I(X; Y). The problem can also be seen as a rate-distortion problem [41] with a non-fixed distortion measure conditioned on the optimal map, defined as

d_IB(x, x̂) = D_KL(p(y|x) || p(y|x̂))

where D_KL is the Kullback-Leibler divergence. The expected IB distortion is then E[d_IB(x, x̂)] = D_IB = I(X; Y | X̂), with the variational principle

L[p(x̂|x)] = I(X; X̂) − β I(X; Y | X̂)

where β is a positive Lagrange multiplier that operates as a trade-off parameter between accuracy and complexity. [1] further proposed a variational approximation to the information bottleneck using deep neural networks.
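The β-VAE objective of Section 2.4 is typically computed as below for a Gaussian encoder with a standard-normal prior: the reconstruction term and the closed-form KL term are the two sides of the evidence lower bound, with β weighting the KL penalty. The encoder/decoder interfaces, the squared-error reconstruction term, and the value of β are assumptions made only for illustration.

    import torch
    import torch.nn.functional as F

    def beta_vae_loss(x, encoder, decoder, beta=0.001):
        """Negative ELBO with a beta-weighted KL term (beta-VAE sketch)."""
        mu, log_var = encoder(x)                                 # q_w(z|x) = N(mu, diag(exp(log_var)))
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()    # reparameterization trick
        recon = decoder(z)                                       # mean of p_u(x|z)
        recon_loss = F.mse_loss(recon, x, reduction="sum")
        # KL( N(mu, sigma^2) || N(0, I) ) in closed form
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon_loss + beta * kl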
3 Related Works

Cooperative multi-agent decision making often suffers from exponential joint state and action spaces. Multiple approaches, including independent Q-learning and mean-field games, have been considered in the literature, but they either do not perform well in challenging tasks or require homogeneous agents [36]. Recently, the paradigm of centralized training and decentralized execution (CTDE) has been proposed for scalable decision making [18]. Key CTDE approaches include value function decomposition and multi-agent policy gradient methods.

Policy gradient methods are considered to have more stable convergence compared to value-based methods [8] and can be extended to continuous-action problems easily. A representative multi-agent policy gradient method is COMA [4], which utilizes a centralized critic module for estimating the counterfactual advantage of an individual agent. However, as pointed out in [29], multi-agent policy gradient methods like MADDPG [21] are significantly outperformed by QMIX on both the multi-agent particle world environment (MPE) [25] and the StarCraft multi-agent challenge (SMAC) [33].

Decomposed actor-critic methods, which combine value function decomposition and policy gradient methods by using decomposed critics rather than centralized critics to guide policy gradients, have also been introduced. VDAC [36] combines the actor-critic structure with QMIX for joint state-value function estimation, while DOP [45] directly uses a network similar to Qatten [46] for policy gradients with off-policy tree backup and on-policy TD. The authors of [45] point out that decomposed critics are limited by their restricted expressive capability and thus cannot guarantee convergence to the global optimum, even though the individual policies may converge to local optima [48]. Extensions of the monotonic mixing function have also been considered, e.g., QTRAN [35] and weighted QMIX [31], but solving tasks that require significant coordination remains a key challenge.

Another related topic is representation learning in reinforcement learning. A VAE-based forward model is proposed in [9] to learn state representations of the environment. A model that learns Gaussian embedding representations of different tasks during meta-testing is considered in [7]. The authors of [15] propose a recurrent VAE model that encodes the observation and action history and learns a variational distribution of the task. A method using an inference model to represent the decision making of opponents is presented in [28].

The closest work to ours is NDQ [44], which also utilizes latent variables to represent information, but as communication messages during decentralized agent execution. Although both works cast the information extraction as an information bottleneck problem, there are several key differences between our work and NDQ: (I) NDQ is a value-based method, while our work is a policy-based method under the soft-actor-critic framework. (II) NDQ requires communication between agents during decentralized execution, which limits its use cases, while we only utilize the latent extra state information in the centralized critic so that CTDE is maintained. (III) NDQ requires one-to-one communication during the execution stage, while in this work we introduce a latent information-sharing mechanism that can be considered an all-to-all message sharing method, which potentially requires less training time.

The proposed LSF-SAC method leverages an actor-critic design with latent state information for value function factorization. We introduce a novel way to utilize the extra state information, inspired by β-VAE [12], by using variational inference in the decomposed critic as latent state information for better individual value estimation. Despite this information sharing, CTDE is still maintained due to the use of the actor-critic structure. We also utilize entropy and expected-return maximization for better exploration through a soft actor-critic with separate actor and critic networks.
4 System Model

Consider a fully cooperative multi-agent task modeled as a decentralized partially observable Markov decision process (Dec-POMDP) [26], given by a tuple G = ⟨I, S, U, P, r, Z, O, n, γ⟩, where I ≡ {1, 2, ..., n} is the finite set of agents. The state is given as s ∈ S, from which each agent draws its own observation through the observation function o_i ∈ O(s, i). At each timestep t, each agent i chooses an action u_i ∈ U, composing a joint action u. A shared reward is then given as r = R(s, u): S × U → R, and the environment moves to the next state s′ with transition probability P(s′|s, u): S × U → [0, 1]. Each agent has an action-observation history τ_i ∈ T ≡ (O × U)*. A joint action value function

Q^π_tot(τ, u) = E_{s_{0:∞}, u_{0:∞}} [ Σ_{t=0}^{∞} γ^t r_t | s_0 = s, u_0 = u, π ]

is defined under policy π, where γ ∈ [0, 1) is the discount factor. Quantities in bold denote joint quantities across all agents, and a quantity with superscript i denotes a quantity belonging specifically to agent i.

Figure 1: Overview of LSF-SAC Approach (GRU-based agent networks for decentralized execution; per-agent critics, the latent information sharing module, and the mixing network for centralized training). Best viewed in color.

5 Proposed Approach

In this section, we first introduce the main structure of our proposed method, LSF-SAC; we then discuss the detailed implementation of its key designs, namely the soft-actor-critic framework for multi-agent reinforcement learning and value decomposition with the latent information-sharing mechanism, together with their corresponding optimization strategies.

5.1 Framework Overview

In our learning framework (Fig. 1), each individual actor (Green part) outputs π_θ(a_i|τ_i) conditioned only on its own local observation history. The centralized mixing network (Orange part) approximates the joint action-value function from the individual value functions (Blue part). A latent information-sharing mechanism (Purple part) is proposed to encode the extracted extra state information to assist individual agents in local action-value estimation. Function approximators (neural networks) are used for both actor and critic networks and are optimized with stochastic gradient descent.

The centralized critic network consists of (i) a local Q-network for each agent, (ii) a mixing network that takes all individual action-values, with its weights and biases
generated by a separate hypernetwork, and (iii) an extra state information encoder that generates latent state information to facilitate individual Q-value estimation. For each agent i, the local Q-network represents its local Q value function q_i(τ_i, a_i, m_i), where m_i is the extra state information for agent i drawn from the global information-sharing pool. More precisely, the information for agent i is generated from the messages of all other agents, which follow multivariate Gaussian distributions; it is denoted as m_i, where each outgoing message is drawn as m^out_i ∼ N(f_m(τ_i; θ_m), I), τ_i is the local observation history, θ_m are the parameters of the encoder f_m, and I is an identity matrix.

The mixing network is a feed-forward network, following the approach in QMIX, which mixes all local Q values to produce an estimate Q_tot. The weights and biases of the mixing network are generated by a hypernetwork that takes the joint state information s. To enforce monotonicity, the weights generated by the hypernetwork are passed through an absolute-value function to produce non-negative values.

The decentralized actor network is similar to the individual Q-network, except that it conditions only on its own observation and action history, and a softmax layer is added to the end of the network to convert logits into a categorical distribution. The overall goal is to minimize:

L(θ) = L_TD(θ_TD) + λ_1 L_m(θ_m) + λ_2 L_π(θ_π)   (6)

where L_TD(θ_TD) is the TD loss, which we show can also serve as the central critic loss, L_m(θ_m) is the message encoding loss, and L_π(θ_π) is the joint actor (policy) loss; λ_1 and λ_2 are weighting terms. The details of latent state information generation and the soft-actor-critic framework, along with how to optimize them, are discussed in the following sections.
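A minimal sketch of the latent information-sharing pieces described above is given next, assuming illustrative layer sizes: an encoder maps each agent's history embedding to the mean of a unit-variance Gaussian message, and each agent's local critic consumes its own history embedding together with the concatenated messages of the other agents. This is not the exact architecture of the paper, only an illustration of the data flow inside the centralized critic.

    import torch
    import torch.nn as nn

    class MessageEncoder(nn.Module):
        """Maps an agent's history embedding tau_i to a message m_i ~ N(f_m(tau_i), I)."""
        def __init__(self, hist_dim, msg_dim=3, hidden=16):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(hist_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, msg_dim))

        def forward(self, tau):
            mu = self.net(tau)
            return mu + torch.randn_like(mu)    # unit-variance Gaussian sample (reparameterized)

    class LocalCritic(nn.Module):
        """Per-agent critic q_i(tau_i, ., m_i): history embedding plus incoming messages."""
        def __init__(self, hist_dim, msg_dim, n_agents, n_actions, hidden=64):
            super().__init__()
            in_dim = hist_dim + msg_dim * (n_agents - 1)
            self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_actions))

        def forward(self, tau_i, messages_from_others):
            # messages_from_others: (batch, (n_agents-1) * msg_dim), used only during training
            x = torch.cat([tau_i, messages_from_others], dim=-1)
            return self.net(x)                  # Q-values for each action of agent i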
5.2 Variational Approach Based Latent State Information

One of the key advantages of multi-agent policy gradients under the CTDE assumption is the effective utilization of extra state information. In our design, the extra state information is accessible not only to the mixing network, but also to the individual agents' value networks (through information sharing). Due to the partial observability and uncertainty of multi-agent environments, an individual value estimate conditioned only on the agent's own observation and action history can be volatile and unreliable. Intuitively, introducing extra information from other agents helps remove the ambiguity and uncertainty of the current observation and enables effective individual value estimation.

However, how to efficiently and effectively encode such extra state information remains a crucial problem. We consider this an information bottleneck problem [40]: specifically, for agent i, we maximize the mutual information between its encoded message and the other agents' action selections, while minimizing the mutual information between the encoded message and its own observation history, so that only the necessary information is chosen and then efficiently encoded. Formally, the objective for each agent i can be written as:

J_m(θ_m) = Σ_{j=1}^{n} [ I_{θ_m}(A_j; M_i | T_j, M_j) − β I_{θ_m}(M_i; T_i) ]   (7)

where A_j is agent j's action selection, M_i is a random variable of m^out_i, T_j is a random variable of τ_j, and a parameter β ≥ 0 is used to control the trade-off between the two mutual information terms.

Yet this does not directly lead to a learnable model, since the mutual information terms are intractable. With the help of variational approximation, specifically the deep variational information bottleneck [1], we are able to parameterize this model using a neural network. We then derive and optimize a variational lower bound of the first term of this objective as follows. Detailed derivations and proofs can be found in Appendix A.1.

Lemma 1. A lower bound of the mutual information I_{θ_m}(A_j; M_i | T_j, M_j) is

E_{T∼D, M_j∼f_m} [ −H[ p(A_j|T), q_ψ(A_j|T_j, M) ] ]

where q_ψ is a variational Gaussian distribution with parameters ψ that approximates the unknown posterior p(A_j|T_j, M_j), T = {T_1, T_2, ..., T_n}, and M = {M_1, M_2, ..., M_n}.

Proof. We provide a proof outline as follows.

I_{θ_c}(A_j; M_i | T_j, M_j) = ∫ da_j dτ_j dm_j p(a_j, τ_j, m_j) log [ p(a_j|τ_j, m_j) / p(a_j|τ_j, m^out_j) ]

where p(a_j|τ_j, m_j) is fully defined by our decoder f_m and the Markov chain. Since this term is intractable in our case, let q_ψ(a_j|τ_j, m_j) be a variational approximation to p(a_j|τ_j, m_j). Because the KL divergence is always non-negative,

I_{θ_c}(A_j; M_i | T_j, M_j) ≥ ∫ da_j dτ_j dm_j p(a_j, τ_j, m_j) log [ q_ψ(a_j|τ_j, m_j) / p(a_j|τ_j, m^out_j) ]
= E_{T∼D, M_j∼f_m} [ −H[p(A_j|T), q_ψ(A_j|T_j, M)] ] + H(A_j|T_j, M^out_j)

Since H(A_j|T_j, M^out_j) is a non-negative term that is independent of our optimization procedure, it can be ignored, and we have

I_{θ_m}(A_j; M_i | T_j, M_j) ≥ E_{T∼D, M_j∼f_m} [ −H[ p(A_j|T), q_ψ(A_j|T_j, M) ] ]   (8)

Similarly, by introducing another variational approximator q_φ, we have

I_{θ_m}(M_i; T_i) = E_{T_i∼D, M_j∼f_m} [ D_KL(p(M_i|T_i) || p(M_i)) ] ≤ E_{T_i∼D, M_j∼f_m} [ D_KL(p(M_i|T_i) || q_φ(M_i)) ]   (9)

where D_KL denotes the Kullback-Leibler divergence operator and q_φ(M_i) is a variational posterior estimator of p(M_i) with parameters φ (see Appendix A.1 for details).
Then, with the evidence lower bound derived above, we optimize this bound for the message encoding objective, which is to minimize

L_m(θ_m) = E_{T∼D, M_j∼f_m} [ −H[p(A_j|T), q_ψ(A_j|T_j, M_j)] + β D_KL(p(M_i|T_i) || q_φ(M_i)) ].   (10)

Algorithm 1 LSF-SAC
1: for k = 0 to max train steps do
2:   Initialize environment
3:   for t = 0 to max episode limit do
4:     For each agent i, take action a_i ∼ π_i
5:     Execute joint action a; observe reward r, state-action history τ, and next state s_{t+1}
6:     Store (τ, a, r, τ′) in replay buffer D
7:   end for
8:   for t = 1 to T do
9:     Sample minibatch B from D
10:    Generate latent state information m^out_i ∼ N(f_m(τ_i; θ_m), I) for i = 1 to n
11:    Update critic network θ_TD ← η ∇̂L_TD(θ_TD) w.r.t. Eq. (14)
12:    Update policy network π ← η ∇̂L(π) w.r.t. Eq. (12)
13:    Update encoding network θ_m ← η ∇̂L_m(θ_m) w.r.t. Eq. (10)
14:    Update temperature parameter α ← η ∇̂J(α) w.r.t. Eq. (13)
15:    if time to update target network then
16:      θ⁻ ← θ
17:    end if
18:  end for
19: end for
20: Return π
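A sketch of how the message-encoding loss in Eq. (10) can be estimated from a minibatch is shown below. The cross-entropy term uses a variational action predictor q_ψ, and the KL term has a closed form because the message posterior N(μ_i, I) and the variational marginal q_φ are both Gaussian; treating q_φ as a standard normal is an assumption made for illustration, as are the tensor shapes.

    import torch
    import torch.nn.functional as F

    def message_encoding_loss(mu_msgs, action_logits_pred, actions_taken, beta=0.001):
        """Estimate of Eq. (10): action-prediction cross-entropy plus beta-weighted KL.

        mu_msgs:            (batch, n_agents, msg_dim) means of the Gaussian messages f_m(tau_i)
        action_logits_pred: (batch, n_agents, n_actions) logits of the variational predictor q_psi
        actions_taken:      (batch, n_agents) long tensor of actions sampled from the buffer
        """
        # -E[ log q_psi(a_j | tau_j, m) ]: cross-entropy averaged over batch and agents
        ce = F.cross_entropy(action_logits_pred.flatten(0, 1), actions_taken.flatten(0, 1))
        # KL( N(mu, I) || N(0, I) ) = 0.5 * ||mu||^2 per message (unit variances cancel)
        kl = 0.5 * mu_msgs.pow(2).sum(dim=-1).mean()
        return ce + beta * kl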
5.3 Factorizing Multi-Agent Maximum Entropy RL

In this section, we present one possible implementation of extending soft actor-critic to the multi-agent domain with latent state information assisted value function decomposition. Recent works have shown that Boltzmann exploration policy iteration is guaranteed to improve the policy and converge to the optimum [10]; its objective extended to the multi-agent domain can be defined as

J(π) = Σ_t E [ r(s_t, a_t) + α H(π(·|s_t)) ]   (11)

where the temperature α is a hyper-parameter that controls the trade-off between maximizing the expected return and maximizing the entropy for better exploration.

Following previous research on value decomposition, to maximize both the expected return and the entropy, we define the soft policy loss of LSF-SAC as:

L(π) = E_D [ α log π(a_t|τ_t) − Q^π_tot(s_t, τ_t, a_t) ] = q^mix( s_t, E_{π_i}[ α log π^i(a^i_t|τ^i_t) − q^i(τ^i_t, a^i_t, m^i_t) ] )   (12)

where Q^π_tot is the soft value decomposition network with a_i ∼ π_i(o_i), and D is the replay buffer used to sample training data (state-action histories, rewards, etc.).

We can then tune the temperature α as proposed in [10] by optimizing:

J(α) = E_{a_t∼π_t} [ −α log π_t(a_t|s_t) − α H_0 ]   (13)

Unlike VDAC, which shares the same network between the actor and the local Q value estimator, we use separate policy networks and train them independently from the critic networks. Latent state information is used in the individual critics for joint action value function factorization. We propose a latent state information assisted soft value decomposition design as

Q_tot(τ, a, m; θ) = q^mix( s_t, E_{π_i}[ q^i(τ^i_t, a^i_t, m^i_t); θ ] )

We then use the TD advantage with the latent information sharing design as the critic loss, i.e.,

L_TD(θ) = [ r + γ max_{a′} Q_tot(τ′, a′, m′; θ⁻) − Q^π_tot(τ, a, m; θ) ]²
        = [ r + γ max_{a′} q^mix( s_{t+1}, E_{π_i}[ q^i(τ^i_{t+1}, a^i_{t+1}, m^i_{t+1}); θ⁻ ] ) − q^mix( s_t, E_{π_i}[ q^i(τ^i_t, a^i_t, m^i_t); θ ] ) ]²   (14)

where a_i ∼ π_i(o_i) and θ⁻ are the parameters of the target network, which are periodically updated. Detailed derivations can be found in Appendix A.2.
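The sketch below shows how a factored critic loss in the spirit of Eq. (14) can be assembled: each agent's expected value under its own policy is computed from the per-agent, message-conditioned Q-values, mixed with the global state, and regressed against a bootstrapped target from a periodically updated target mixer. The max over the next joint action in Eq. (14) is replaced here by an expectation under the next-step policies, and the shapes are illustrative; both are assumptions for this example.

    import torch

    def factored_td_loss(q_i, q_i_next_target, pi, pi_next,
                         mixer, target_mixer, state, next_state,
                         reward, done, gamma=0.99):
        """TD loss for the value-factorized critic (cf. Eq. (14)).

        q_i:             (batch, n_agents, n_actions) per-agent Q-values q^i(tau^i_t, ., m^i_t)
        q_i_next_target: (batch, n_agents, n_actions) target-network Q-values at t+1
        pi, pi_next:     (batch, n_agents, n_actions) per-agent policy distributions
        reward, done:    (batch, 1) reward and terminal flags
        """
        # E_{pi_i}[ q^i ] at time t, then mixed with the global state s_t
        v_i = (pi * q_i).sum(-1)                          # (batch, n_agents)
        q_tot = mixer(v_i, state)                         # (batch, 1)

        with torch.no_grad():
            # Bootstrapped target from the target mixer at t+1
            v_i_next = (pi_next * q_i_next_target).sum(-1)
            y = reward + gamma * (1.0 - done) * target_mixer(v_i_next, next_state)

        return torch.mean((q_tot - y) ** 2)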
6 Experiments

In this section, we first empirically study the improvement in the power of value function factorization achieved by LSF-SAC through a non-monotonic matrix game and compare the results with several existing value function factorization methods. Then, in StarCraft II, we compare LSF-SAC with several state-of-the-art baselines. Finally, we perform several ablation studies to analyze the factors that contribute to the performance.

6.1 Single-State Matrix Game

Proposed in QTRAN [35], the non-monotonic matrix game, as illustrated in Table 1(a), consists of two agents with three available actions and a shared reward. We show the value function factorization results of QTRAN, LSF-SAC, VDN, QMIX, and DOP [45].

(a) Payoff of matrix game
u1 \ u2        A        B        C
A            8.0    -12.0    -12.0
B          -12.0      0.0      0.0
C          -12.0      0.0      0.0

(b) QTRAN
Q1 \ Q2    4.2(A)   2.3(B)   2.3(C)
3.8(A)       8.0      6.1      6.1
-2.1(B)      2.1      0.2      0.2
-2.3(C)      1.9      0.0      0.0

(c) LSF-SAC
Q1 \ Q2    1.7(A)  -11.5(B) -12.7(C)
0.4(A)       8.1     -6.2     -6.0
-9.9(B)     -6.0     -5.9     -6.1
-9.5(C)     -5.9     -6.0     -6.0

(d) VDN
Q1 \ Q2   -3.1(A)  -2.3(B)  -2.4(C)
-2.3(A)     -5.4     -4.6     -4.7
-1.2(B)     -4.4     -3.5     -3.6
-0.7(C)     -3.9     -3.0     -3.1

(e) QMIX
Q1 \ Q2   -0.9(A)   0.0(B)   0.0(C)
-1.0(A)     -8.1     -8.1     -8.1
0.1(B)      -8.1      0.0      0.0
0.1(C)      -8.1      0.0      0.0

(f) DOP
Q1 \ Q2   -2.5(A)  -1.3(B)   0.0(C)
-1.0(A)     -7.8     -6.0     -4.2
0.1(B)      -6.1     -4.4     -2.6
0.1(C)      -4.2     -2.4     -0.7

Table 1: Payoff matrix of the one-step matrix game, and Q1, Q2, and the reconstructed Qtot of the selected algorithms. Boldface denotes optimal/greedy actions from state-action values. The use of variational information can significantly improve the power of the factorization operators.

Tables 1(b)-1(f) show the learning results of the selected algorithms. QTRAN and LSF-SAC learn a policy in which the agents jointly take the optimal action conditioned only on their local observations, indicating successful factorization. DOP falls into the sub-optimum caused by miscoordination penalties, similar to VDN and QMIX, which are limited by the additivity and monotonicity constraints. Although QTRAN manages to address such limitations with a more general value decomposition, it has been pointed out in later work [22] that it poses computationally intractable constraints that can lead to poor empirical performance on complex MARL domains. It is also worth noting that LSF-SAC finds the optimal joint actions under the constraints of monotonicity by providing variational information; this indicates that the utilization of latent state information can significantly improve the power of the monotonic factorization operators in a mixing network like QMIX.
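As a quick illustration of the IGM check underlying Table 1, the snippet below takes per-agent utilities and a reconstructed joint table and verifies whether the greedy local actions coincide with the greedy joint action. The payoff values are those of Table 1(a); the additive per-agent utilities used in the example are arbitrary numbers chosen only to exercise the check, not values learned by any of the compared methods.

    import numpy as np

    payoff = np.array([[8.0, -12.0, -12.0],
                       [-12.0, 0.0, 0.0],
                       [-12.0, 0.0, 0.0]])              # Table 1(a)

    def igm_holds(q1, q2, q_tot):
        """True if individually greedy actions also maximize the joint table (IGM)."""
        greedy = (int(np.argmax(q1)), int(np.argmax(q2)))
        return greedy == tuple(np.unravel_index(np.argmax(q_tot), q_tot.shape))

    # Example: a purely additive (VDN-style) reconstruction
    q1 = np.array([1.0, -1.0, -1.0])
    q2 = np.array([1.0, -1.0, -1.0])
    q_tot_additive = q1[:, None] + q2[None, :]
    print(igm_holds(q1, q2, q_tot_additive))            # IGM holds for the additive table itself
    print(np.abs(q_tot_additive - payoff).max())        # but it cannot match the non-monotonic payoffs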
6.2 Decentralized StarCraft II Micromanagement Benchmark

In this section, we compare our experimental results against several state-of-the-art algorithms on the decentralized StarCraft II micromanagement benchmark [33], not limited to multi-agent policy gradient methods but also including decomposed value methods and combined methods, namely COMA [4], MAVEN [22], QMIX [32], and VDAC-vmix from VDAC [36], as the authors report that it delivers the best performance of their two proposed methods. We then perform several ablation studies to analyze the factors that contribute to the performance.

Figure 2: Comparisons with baselines on the SMAC benchmark

It is worth noting that results on the StarCraft Multi-Agent Challenge (SMAC) are significantly affected by various code-level optimizations, i.e., hyper-parameter tuning; as also found by [13], some works rely on heavy hyper-parameter tuning to achieve results that they otherwise could not. Consistent with previous work, we carry out the tests with the same hyper-parameter settings across all algorithms. More details about the algorithm implementations and settings can be found in Appendix C.

We choose six different maps from both symmetric and asymmetric scenarios for the general test, ranging from easy to hard. For each algorithm run, training is paused every 5000 steps for an evaluation phase in which 32 independent episodes are generated, with each agent acting greedily according to its policy or value function. The median winning rate over five independent training cycles is used for performance comparison. Specifically, we choose maps ranging from symmetric ones with identical units: 8m (easy); symmetric ones with different units: 1c3s5z, 3s5z; asymmetric ones with different units: 2s vs 1sc, 5m vs 6m; and different units with a large action space: MMM2. Details about the StarCraft Multi-Agent Challenge settings can be found in Appendix B. Note that LSF-SAC performs exceptionally well on maps with challenging tasks that require more state information or substantial cooperation.
6.3 General Results

Following the practice of previous works, as suggested in [33], for every map we compare the winning rate and plot the median, with the shaded area representing the highest and lowest ranges of the testing results, in Figure 2.

In general, we observe that LSF-SAC achieves strong performance on all selected SMAC maps; notably, it outperforms the state-of-the-art algorithms or achieves faster and more stable convergence at a higher win rate.

In easy scenarios like 8m and 1c3s5z, almost all algorithms perform well. Since the built-in AI tends to attack the nearest enemy, pulling back the friendly unit with the lowest health is a simple strategy to learn for winning. However, it is worth noting that although QMIX can achieve a relatively high winning rate, its convergence is quite unstable, indicating that its policy might be overfitting to some specific scenarios. On these two maps, LSF-SAC outperforms all the baselines in both convergence speed and final performance with more stable results, demonstrating its potential for more generalized policy expressiveness in value decomposition.

On the 2s vs 1sc map, where a specific strategy is required to win - the two units must cooperate and take turns attacking the enemy unit - LSF-SAC achieves a high winning rate, though it fluctuates at the early stage of training. This is potentially due to the penalty from entropy maximization, which forces the agents to try additional tactics even though an optimal policy has already been learned.

On more challenging scenarios like 5m vs 6m and 3s5z, LSF-SAC achieves a higher winning rate than the other algorithms listed. In MMM2, a complex environment with more unit types and numbers, VDAC soon falls into a sub-optimum and converges to it, while LSF-SAC keeps exploring for a better policy; this demonstrates LSF-SAC's improved exploration ability. Both COMA and MAVEN fail to learn a consistent policy to defeat the built-in AI due to the non-stationarity of the environment and their lack of utilization of extra state information.

6.4 Ablation Study

In this section, we compare LSF-SAC with several modified algorithms to understand the contribution of the different modules in LSF-SAC. We choose two of the previously tested SMAC maps: 8m and 5m vs 6m. Each experiment is repeated with four independent runs with random seeds, and their median results are presented.

6.4.1 Ablation 1

First, we consider the setting of LSF-SAC without the extra state information encoding (Purple part in Fig. 1), denoted MASAC. This demonstrates how the multi-agent soft-actor-critic works alone, and it highlights the importance of latent state information by comparing the results of MASAC against the original LSF-SAC.

6.4.2 Ablation 2

We then consider our implementation of multi-agent advantage actor-critic with value decomposition, denoted MAA2C, which can also be considered QMIX under an A2C setting
[36]. This is to isolate the contribution of the soft-actor-critic design to enhancing exploration.

6.4.3 Ablation 3

We also consider a fixed-temperature design, MASAC with fixed α = 1.0 (MASAC α = 1.0); this is to understand the effectiveness of the design that automatically updates the temperature α.

6.4.4 Ablation 4

Finally, we note that the original (single-agent) soft-actor-critic algorithm [10] and several other works use two independently trained soft Q-functions and take the minimum of the two when optimizing the policy, since, as [11, 6] point out, policy steps are known to degrade the performance of value-based methods; e.g., in [30] they train with

L(θ) = [ (r_t + γ min_{j∈{1,2}} Q_tot(s′_t, τ′_t, a′_t; θ⁻_j)) − Q_tot(s_t, τ_t, a_t; θ) ]².

Their performance comparison can be found in the ablation studies as MASAC DoubleQ [30]. This is to determine whether the TD advantage with double Q-learning is more stable in MARL when combined with value function decomposition.

Figure 3: Ablation Results on 8m

6.4.5 Ablation Results

By comparing the results of MASAC and LSF-SAC, we observe an improvement on both maps in the performance of LSF-SAC, which confirms the contribution of the latent state information assisted value decomposition design.

Both MASAC and MASAC with α = 1.0 were able to outperform MAA2C, even though the latter variant uses a fixed α, which can be viewed as training with aggressive exploration throughout the entire training session.
Figure 4: Ablation Results on 5m vs 6m

Note that MAA2C soon converges to a local optimum on 3s5z, while it fails to learn a usable policy on the 5m vs 6m map. Also, MASAC achieves a higher winning rate and faster convergence than MASAC with α = 1.0. On the 5m vs 6m map, MASAC with α = 1.0 initially finds a correct direction of optimization, but the constant penalty from entropy maximization forces it to keep exploring and try other policies. This illustrates the advantage of the automatic temperature-updating design.

Finally, although MASAC DoubleQ delivers a learnable policy on the 3s5z environment at a plodding pace, it fails to learn a policy on 5m vs 6m within the episode limit; this could potentially be the result of a complex model and the relatively continuous reward of this specific environment. Also, due to its redundant network size, we find that MASAC DoubleQ, with its double value function design, takes a significantly longer time for training. This shows that the TD advantage with a single value function is sufficient to optimize multi-agent actor-critics with value decomposition.

Figure 5: Ablation Results on 3s5z

7 Conclusions

In this paper, we propose LSF-SAC, a novel framework that combines latent state information assisted individual value estimation for joint value function factorization with multi-agent entropy maximization, for collaborative multi-agent reinforcement learning under the CTDE paradigm. We introduce an information-theoretical regularization method for optimizing the latent information generator so that extra state information is utilized efficiently and effectively in individual value estimation, while CTDE can still be maintained through a soft-actor-critic design. We also propose one possible implementation of extending off-policy maximum entropy deep reinforcement learning to the multi-agent domain with latent state information. We empirically show that latent state information sharing significantly improves the
power of value function decomposition operators. Empirical results show that our framework significantly outperforms the baseline methods on the SMAC environment. We further analyze the key factors contributing to the performance of our framework through a set of ablation studies. In future work, we plan to focus on extending the proposed method to continuous action spaces with different policy gradient methods.

References

[1] Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2016). Deep variational information bottleneck. ArXiv Preprint arXiv:1612.00410.

[2] Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., and Pineau, J. (2019). Tarmac: Targeted multi-agent communication. International Conference on Machine Learning, 1538–1546.

[3] Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. International Conference on Machine Learning, 1329–1338.

[4] Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. (2018). Counterfactual multi-agent policy gradients. Proceedings of the AAAI Conference on Artificial Intelligence, 32.

[5] Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P. H., Kohli, P., and Whiteson, S. (2017). Stabilising experience replay for deep multi-agent reinforcement learning. International Conference on Machine Learning, 1146–1155.
[6] Fujimoto, S., Hoof, H., and Meger, D. (2018). Addressing function approxima- tion error in actor-critic methods. International Conference on Machine Learning, 1587–1596. [7] Grover, A., Al-Shedivat, M., Gupta, J., Burda, Y., and Edwards, H. (2018). Learn- ing policy representations in multiagent systems. International Conference on Ma- chine Learning, 1802–1811. [8] Gupta, J. K., Egorov, M., and Kochenderfer, M. (2017). Cooperative multi- agent control using deep reinforcement learning. International Conference on Au- tonomous Agents and Multiagent Systems, 66–83. [9] Ha, D., and Schmidhuber, J. (2018). World models. ArXiv Preprint arXiv:1803.10122. [10] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. In- ternational Conference on Machine Learning, 1861–1870. [11] Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Process- ing Systems, 23, 2613–2621. [12] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mo- hamed, S., and Lerchner, A. (2016). beta-vae: Learning basic visual concepts with a constrained variational framework. [13] Hu, J., Jiang, S., Harding, S. A., Wu, H., and Liao, S. (2021). RIIT: Rethinking the Importance of Implementation Tricks in Multi-Agent Reinforcement Learning. ArXiv Preprint arXiv:2102.03479. [14] Hu, Y., Nakhaei, A., Tomizuka, M., and Fujimura, K. (2019). Interaction-aware decision making with adaptive strategies under merging scenarios. 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 151–158. [15] Igl, M., Zintgraf, L., Le, T. A., Wood, F., and Whiteson, S. (2018). Deep varia- tional reinforcement learning for POMDPs. International Conference on Machine Learning, 2117–2126. [16] Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi- supervised learning with deep generative models. Advances in Neural Information Processing Systems, 3581–3589. [17] Kingma, D. P., and Welling, M. (2013). Auto-encoding variational bayes. ArXiv Preprint arXiv:1312.6114. [18] Kraemer, L., and Banerjee, B. (2016). Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190, 82–94. [19] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. ArXiv Preprint arXiv:1509.02971. 18
[20] Littman, M. L. (1994). Markov games as a framework for multi-agent reinforce- ment learning. In Machine learning proceedings 1994 (pp. 157–163). Elsevier. [21] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. (2017). Multi- Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Neural In- formation Processing Systems (NIPS). [22] Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. (2019). Maven: Multi- agent variational exploration. ArXiv Preprint arXiv:1910.07483. [23] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. ArXiv Preprint arXiv:1312.5602. [24] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., and others. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. [25] Mordatch, I., and Abbeel, P. (2017). Emergence of Grounded Compositional Lan- guage in Multi-Agent Populations. ArXiv Preprint arXiv:1703.04908. [26] Oliehoek, F. A., and Amato, C. (2016). A concise introduction to decentralized POMDPs. Springer. [27] Panait, L., and Luke, S. (2005). Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3), 387–434. [28] Papoudakis, G., and Albrecht, S. V. (2020). Variational autoencoders for opponent modeling in multi-agent systems. ArXiv Preprint arXiv:2001.10829. [29] Papoudakis, G., Christianos, F., Schäfer, L., and Albrecht, S. V. (2020). Com- parative evaluation of multi-agent deep reinforcement learning algorithms. ArXiv Preprint arXiv:2006.07869. [30] Pu, Y., Wang, S., Yang, R., Yao, X., and Li, B. (2021). Decomposed Soft Actor-Critic Method for Cooperative Multi-Agent Reinforcement Learning. ArXiv Preprint arXiv:2104.06655. [31] Rashid, T., Farquhar, G., Peng, B., and Whiteson, S. (2020). Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Rein- forcement Learning. [32] Rashid, T., Samvelyan, M., Schroeder, C., Farquhar, G., Foerster, J., and White- son, S. (2018). Qmix: Monotonic value function factorisation for deep multi- agent reinforcement learning. International Conference on Machine Learning, 4295–4304. [33] Samvelyan, M., Rashid, T., De Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G., Hung, C.-M., Torr, P. H., Foerster, J., and Whiteson, S. (2019). The starcraft multi-agent challenge. ArXiv Preprint arXiv:1902.04043. 19
[34] Shao, J., Zhang, H., Jiang, Y., He, S., and Ji, X. (2021). Credit Assignment with Meta-Policy Gradient for Multi-Agent Reinforcement Learning. ArXiv Preprint arXiv:2102.12957. [35] Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. (2019). Qtran: Learning to factorize with transformation for cooperative multi-agent reinforce- ment learning. International Conference on Machine Learning, 5887–5896. [36] Su, J., Adams, S., and Beling, P. (2021). Value-Decomposition Multi-Agent Actor-Critics. Proceedings of the AAAI Conference on Artificial Intelligence, 35(13), 11352–11360. [37] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., and others. (2017). Value- decomposition networks for cooperative multi-agent learning. ArXiv Preprint arXiv:1706.05296. [38] Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J., and Vicente, R. (2017). Multiagent cooperation and competition with deep rein- forcement learning. PloS One, 12(4), e0172395. [39] Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. Proceedings of the Tenth International Conference on Machine Learning, 330–337. [40] Tishby, N., Pereira, F., and Bialek, W. (2000). The information bottleneck method. ArXiv Preprint physics/0004057. [41] Tishby, N., and Zaslavsky, N. (2015). Deep learning and the information bottle- neck principle. 2015 IEEE Information Theory Workshop (ITW), 1–5. [42] Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., and others. (2019). Grand- master level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354. [43] Wang, J., Ren, Z., Liu, T., Yu, Y., and Zhang, C. (2020). Qplex: Duplex dueling multi-agent q-learning. ArXiv Preprint arXiv:2008.01062. [44] Wang, T., Wang, J., Zheng, C., and Zhang, C. (2019). Learning nearly de- composable value functions via communication minimization. ArXiv Preprint arXiv:1910.05366. [45] Wang, Y., Han, B., Wang, T., Dong, H., and Zhang, C. (2020). Off-policy multi- agent decomposed policy gradients. ArXiv Preprint arXiv:2007.12322. [46] Yang, Y., Hao, J., Liao, B., Shao, K., Chen, G., Liu, W., and Tang, H. (2020). Qatten: A general framework for cooperative multiagent reinforcement learning. ArXiv Preprint arXiv:2002.03939. 20
[47] Ye, D., Zhang, M., and Yang, Y. (2015). A multi-agent framework for packet routing in wireless sensor networks. Sensors, 15(5), 10026–10047.

[48] Zhang, T., Li, Y., Wang, C., Xie, G., and Lu, Z. (2021). FOP: Factorizing Optimal Joint Policy of Maximum-Entropy Multi-Agent Reinforcement Learning. International Conference on Machine Learning, 12491–12500.

A Mathematical Details

A.1 Boundaries for Extra State Information

To efficiently and effectively encode extra state information for individual value estimation, we consider this information encoding problem as an information bottleneck problem [40]; the objective for each agent i can be written as:

J_m(θ_m) = Σ_{j=1}^{n} [ I_{θ_m}(A_j; M_i | T_j, M_j) − β I_{θ_m}(M_i; T_i) ]   (15)

This objective is appealing because it defines a good representation in terms of the trade-off between a succinct representation and inference ability. Its main shortcoming is that the computation of mutual information is computationally challenging. Inspired by recent advances in Bayesian inference and variational autoencoders [17, 28, 44], we propose a novel way of representing it by utilizing latent vectors from variational inference models with an information-theoretic regularization method, and then derive the evidence lower bound (ELBO) of its objective.

Lemma 2. A lower bound of the mutual information I_{θ_m}(A_j; M_i | T_j, M_j) is

E_{T∼D, M_j∼f_m} [ −H[ p(A_j|T), q_ψ(A_j|T_j, M) ] ]

where q_ψ is a variational Gaussian distribution with parameters ψ that approximates the unknown posterior p(A_j|T_j, M_j), T = {T_1, T_2, ..., T_n}, and M = {M_1, M_2, ..., M_n}.

Proof.

I_{θ_c}(A_j; M_i | T_j, M_j)
= ∫ da_j dτ_j dm_j p(a_j, τ_j, m_j) log [ p(a_j, m^out_i | τ_j, m^out_j) / ( p(a_j|τ_j, m^out_j) p(m^out_i | τ_j, m^out_j) ) ]
= ∫ da_j dτ_j dm_j p(a_j, τ_j, m_j) log [ p(a_j|τ_j, m_j) / p(a_j|τ_j, m^out_j) ]

where p(a_j|τ_j, m_j) is fully defined by our encoder and the Markov chain. Since this is intractable in our case, let q_ψ(a_j|τ_j, m_j) be a variational approximation
to p(a_j|τ_j, m_j); this is our decoder, which we implement as another neural network with its own set of parameters ψ. Using the fact that the Kullback-Leibler divergence is always non-negative, we have

KL[ p(a_j|τ_j, m_j), q_ψ(a_j|τ_j, m_j) ] ≥ 0
⟹ ∫ da_j dτ_j dm_j p(a_j, τ_j, m_j) log p(a_j|τ_j, m_j) ≥ ∫ da_j dτ_j dm_j p(a_j, τ_j, m_j) log q_ψ(a_j|τ_j, m_j)

and hence

I_{θ_c}(A_j; M_i | T_j, M_j)
≥ ∫ da_j dτ_j dm_j p(a_j, τ_j, m_j) log [ q_ψ(a_j|τ_j, m_j) / p(a_j|τ_j, m^out_j) ]
= ∫ da_j dτ_j dm_j p(a_j, τ_j, m_j) log q_ψ(a_j|τ_j, m_j) − ∫ da_j dτ_j dm_j p(a_j, τ_j, m_j) log p(a_j|τ_j, m^out_j)
= ∫ da_j dτ_j dm_j p(τ_j) p(m_j|τ_j) p(a_j|τ_j) log q_ψ(a_j|τ_j, m_j) + H(A_j|T_j, M_j)
= E_{T∼D, M_j∼f_m} [ ∫ da_j p(A_j|T) log q_ψ(a_j|τ_j, m_j) ] + H(A_j|T_j, M_j)
= E_{T∼D, M_j∼f_m} [ −H[p(A_j|T), q_ψ(A_j|T_j, M)] ] + H(A_j|T_j, M_j)

Notice that the entropy of the labels H(A_j|T_j, M_j) is a non-negative term that is independent of our optimization procedure and can thus be ignored. Then we have

I_{θ_m}(A_j; M_i | T_j, M_j) ≥ E_{T∼D, M_j∼f_m} [ −H[ p(A_j|T), q_ψ(A_j|T_j, M) ] ]

which is the lower bound of the first term in Eq. (15).

Lemma 3. An upper bound of the mutual information I_{θ_m}(M_i; T_i) is

E_{T_i∼D, M_j∼f_m} [ D_KL(p(M_i|T_i) || q_φ(M_i)) ]

where D_KL denotes the Kullback-Leibler divergence operator and q_φ(M_i) is a variational posterior estimator of p(M_i) with parameters φ.

Proof.

I_{θ_m}(M_i; T_i)
= ∫ dm^out_i dτ_i p(m^out_i|τ_i) p(τ_i) log [ p(m^out_i|τ_i) / p(m^out_i) ]
= ∫ dm^out_i dτ_i p(m^out_i|τ_i) p(τ_i) log p(m^out_i|τ_i) − ∫ dm^out_i dτ_i p(m^out_i|τ_i) p(τ_i) log p(m^out_i)
Again, p(m^out_i) is fully defined by our encoder and the Markov chain, but computing the marginal distribution ∫ dτ_i p(m^out_i|τ_i) p(τ_i) may be difficult. We therefore use q_φ(m^out_i) as a variational approximation to this marginal. Since KL[ p(m^out_i), q_φ(m^out_i) ] ≥ 0, we have

∫ dm^out_i p(m^out_i) log p(m^out_i) ≥ ∫ dm^out_i p(m^out_i) log q_φ(m^out_i)

Then

I_{θ_m}(M_i; T_i)
≤ ∫ dm^out_i dτ_i p(m^out_i|τ_i) p(τ_i) log p(m^out_i|τ_i) − ∫ dm^out_i dτ_i p(m^out_i|τ_i) p(τ_i) log q_φ(m^out_i)
= ∫ dm^out_i dτ_i p(m^out_i|τ_i) p(τ_i) log [ p(m^out_i|τ_i) / q_φ(m^out_i) ]
= E_{T_i∼D, M_j∼f_m} [ D_KL(p(M_i|T_i) || q_φ(M_i)) ]

Combining Lemma 2 and Lemma 3, we obtain the ELBO for the message encoding objective, which is to minimize

L_m(θ_m) = E_{T∼D, M_j∼f_m} [ −H[p(A_j|T), q_ψ(A_j|T_j, M_j)] + β D_KL(p(M_i|T_i) || q_φ(M_i)) ].

A.2 Soft Value Decomposition with Latent State Information

The joint soft action value estimation under latent state information value function decomposition with a monotonic mixing network can be written as:

E_{π_i}[ α log π(a_t|τ_t) − Q^π_tot(s_t, τ_t, a_t) ]
= Σ_i k^i(s) E_π[ α log π^i(a_t|τ_t) ] − E_π[ Q_tot(τ, a, m; θ) ]
= Σ_i k^i(s) E_π[ α log π^i(a_t|τ_t) ] − Σ_a π(a|τ) Q_tot(τ, a, m; θ)
= Σ_i k^i(s) E_π[ α log π^i(a_t|τ_t) ] − Σ_a π(a|τ) [ Σ_i k^i(s) q^i(τ^i_t, a^i_t, m^i_t) + b(s) ]
= Σ_i k^i(s) E_π[ α log π^i(a_t|τ_t) ] − [ Σ_i k^i(s) E_π[ q^i(τ^i_t, a^i_t, m^i_t) ] + b(s) ]
= q^mix( s_t, E_{π_i}[ α log π^i(a^i_t|τ^i_t) − q^i(τ^i_t, a^i_t, m^i_t) ] )

Then we have the objective for the soft policy gradient update using latent state information value function decomposition with a monotonic mixing network:

L(π) = E_D[ α log π(a_t|τ_t) − Q^π_tot(s_t, τ_t, a_t) ] = q^mix( s_t, E_{π_i}[ α log π^i(a^i_t|τ^i_t) − q^i(τ^i_t, a^i_t, m^i_t) ] )
B StarCraft Multi-Agent Challenge

For the experiments on StarCraft II micromanagement, we follow the setup of SMAC [33] with the open-source implementations of COMA [4], MAVEN [22], QMIX [32], and VDAC [36]. We consider combat scenarios where the enemy units are controlled by the StarCraft II built-in AI and the friendly units are controlled by the algorithm-trained agents. The built-in AI difficulty ranges from 0 to 7, covering Very Easy, Easy, Medium, Hard, Very Hard, and Insane. We carry out the experiments with the ally units controlled by the learning agents while the built-in AI controls the enemy units with difficulty = 7 (Insane). Depending on the specific scenario (map), the units of the enemy and friendly sides can be symmetric or asymmetric. At each time step, each agent chooses one action from a discrete action space, including noop, move[direction], attack[enemy id], and stop. Dead units can only choose the noop action. Killing an enemy unit results in a reward of 10, while winning by eliminating all enemy units results in a reward of 200. The global state information is only available to the centralized critic. For easier maps, we train each baseline algorithm for 1.5 million time steps, while we train for 2 million steps on the other maps. The maps used in the experiments are:

• 1c3s5z is a symmetric battle that consists of 1 Colossus, 3 Stalkers, and 5 Zealots on each side.
• 2s vs 1sc is an asymmetric battle in which the friendly side controls two Stalker units while the opposing side controls one Spine Crawler.
• 3s5z is a symmetric battle that consists of 3 Stalkers and 5 Zealots on each side.
• 5m vs 6m is an asymmetric battle where the friendly side controls 5 Marines competing against 6 Marines on the opposing side.
• 8m is a symmetric battle that consists of 8 Marines on each side.
• MMM2 is an asymmetric battle where 1 Medivac, 2 Marauders, and 7 Marines battle against 1 Medivac, 3 Marauders, and 8 Marines. Medivacs are healing units that can heal friendly units with limited healing energy; they cannot attack enemies, and their healing energy regrows over time.

Readers are encouraged to watch the game replays for a better understanding. More details about the environment can be found in [33].

C Implementation Details

We use PyTorch for all implementations. The pseudocode for optimizing LSF-SAC is summarized in Algorithm 1. Experiments are run on an Nvidia RTX 2080 Ti GPU. The training uses episode runners, i.e., non-parallel runners, to discourage a very large batch size
for training. Each independent run takes around 12 hours depending on the scenario; we carry out 4 different training sessions at the same time, bringing the amortized training time to around 3 hours. Each training session runs with a random seed generated at the beginning of the session.

The agent networks of all algorithms resemble a DRQN with a recurrent layer implemented as a GRU with 64-dimensional hidden states. The latent state information encoding network is a feed-forward network that outputs 3 latent vectors per agent. The latent state information encoder is a fully connected network with two 16-dimensional hidden layers. The posterior estimator is a fully connected network with one 16-dimensional hidden layer. Unless mentioned with an update policy, all parameters introduced in LSF-SAC remain the same throughout the training session. In Eq. (6), λ_1 = λ_2 = 1.0; in Eq. (7) and Eq. (10), β = 0.001. All algorithms are trained with the same default hyper-parameter settings. RMSprop is used for optimizing all algorithms with a learning rate of 5 × 10⁻⁴. The replay buffer stores the latest 5000 episodes, with batch size 32. The reward discount factor is γ = 0.99. The target network is updated every 200 training steps.
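Based on the description above, a minimal sketch of the recurrent agent network (DRQN-style, with a 64-dimensional GRU hidden state) might look as follows; the input dimensions and the final head, which serves as Q-values for the critic or as logits for the actor's softmax, are illustrative assumptions rather than the exact implementation.

    import torch
    import torch.nn as nn

    class RecurrentAgent(nn.Module):
        """DRQN-style agent: MLP encoder, GRU cell with 64-dim hidden state, per-action head."""
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.hidden = hidden
            self.fc = nn.Linear(obs_dim, hidden)
            self.gru = nn.GRUCell(hidden, hidden)
            self.head = nn.Linear(hidden, n_actions)

        def init_hidden(self, batch_size):
            return torch.zeros(batch_size, self.hidden)

        def forward(self, obs, h_prev):
            x = torch.relu(self.fc(obs))
            h = self.gru(x, h_prev)
            logits = self.head(h)    # Q-values for the critic, or logits for the actor
            # For the decentralized actor, a softmax converts these logits into a categorical policy.
            return logits, h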