A Memory Efficient Deep Reinforcement Learning Approach For Snake Game Autonomous Agents
Md. Rafat Rahman Tushar (1) and Shahnewaz Siddique (2)
Department of Electrical and Computer Engineering, North South University, Dhaka, Bangladesh
rafat.tushar@northsouth.edu, shahnewaz.siddique@northsouth.edu

arXiv:2301.11977v1 [cs.AI] 27 Jan 2023

Abstract—To perform well, Deep Reinforcement Learning (DRL) methods require significant memory resources and computational time. Also, sometimes these systems need additional environment information to achieve a good reward. However, for many applications and devices it is more important to reduce memory usage and computational time than to achieve the maximum reward. This paper presents a modified DRL method that performs reasonably well with compressed imagery data without requiring additional environment information and also uses less memory and time. We have designed a lightweight Convolutional Neural Network (CNN) with a variant of the Q-network that efficiently takes preprocessed image data as input and uses less memory. Furthermore, we use a simple reward mechanism and a small experience replay memory so as to provide only the minimum necessary information. Our modified DRL method enables our autonomous agent to play Snake, a classical control game. The results show our model can achieve similar performance as other DRL methods.

Index Terms—Deep Reinforcement Learning, Convolutional Neural Network, Deep Q Learning, Hyperparameter Tuning, Replay Size, Image Preprocessing

I. INTRODUCTION

Complex problems can be solved in real-world applications by carefully designing Deep Reinforcement Learning (DRL) models that take high dimensional input data and produce discrete or continuous outputs. It is challenging to build an agent that uses sensory data and is capable of controlling and acting in an environment. The environment is also complex and primarily unknown to the acting agent. The agent needs to learn the underlying distribution of the state and action spaces, and the distribution changes as the agent encounters new data from the environment. Earlier reinforcement learning algorithms [1]–[5] were presented on low-constraint problems to demonstrate their effectiveness. However, these systems did not generalize well to high dimensional inputs; thus, they could not meet the requirements of practical applications.

Recently, DRL has had success in CNN-based vision problems [6]–[8]. These works successfully implement DRL methods that learn control directly from image pixels. Although image-based DRL methods have enjoyed considerable success, they are memory intensive during training as well as deployment. Since they require a massive amount of memory, they are not suitable for training or deployment on mobile devices or mid-range autonomous robots.

All modern reinforcement learning algorithms, mainly off-policy ones, use a replay buffer for sampling uncorrelated data for online training. The experience replay buffer also improves data efficiency [9] during data sampling. Since the use of neural networks in various DRL algorithms is increasing, it is necessary to stabilize the neural network with uncorrelated data. That is why the experience replay buffer is a desirable component of many reinforcement learning algorithms. The first successful implementation of DRL in a high dimensional observation space, Deep Q-learning [6], used a replay buffer of size 10^6. After that, [8], [10]–[12], to name a few, have solved complex high dimensional problems but still use a replay buffer of the same size.

The experience replay buffer suffers from two types of issues. One is choosing the size of the replay buffer, and the second is the method of sampling data from the buffer. [13]–[15] consider the latter problem of how best to sample from the replay buffer, but the favorable size for the replay buffer remains unknown. Although [15] points out that the learning algorithm is sensitive to the size of the replay buffer, they do not come to a definite conclusion on the size of the buffer.

In this paper, we tackle the memory usage of DRL algorithms by implementing a modified approach to image preprocessing and replay buffer sizing. Although we want the agent to obtain a decent score, we are more concerned about memory usage. We choose a Deep Q-Network (DQN) [6] for our algorithm, with some variations. Our objective is to design a DRL model that can be implemented on mobile devices during training and deployment. To be deployed on mobile devices, memory consumption must be minimized, as traditional DRL models with visual inputs sometimes need half a terabyte of memory. We achieve low memory consumption by preprocessing the visual image data and tuning the replay buffer size along with other hyperparameters. We then evaluate our model in our simulation environment using the classical control game Snake.* The results show that our model can achieve similar performance as other DRL methods.

1 Research Assistant. 2 Assistant Professor, IEEE Member.
* GitHub implementation: https://github.com/rafattushar/rl-snake
II. RELATED WORK

The core idea of reinforcement learning is a sequential decision making process in which an agent learns from experience and acts in an uncertain environment. After the development of a formal framework for reinforcement learning, many algorithms have been introduced, such as [1]–[5].

Q-learning [1] is a model-free asynchronous dynamic programming algorithm of reinforcement learning. Q-learning proposes that by sampling all the actions in all states and iterating the action-value function repeatedly, convergence can be achieved. Q-learning works well on limited state and action spaces but collapses on high dimensional, effectively infinite state spaces. Later, [6] proposed the Deep Q-Network algorithm, which demonstrates significant results with image data. Among other variations, it uses a convolutional neural network and a replay buffer. Double Q-learning [16] is applied with DQN to overcome the overestimation of the action-value function and is named Deep Reinforcement Learning with Double Q-Learning (DDQN) [8]. DDQN proposes another neural network with the same structure as DQN but updated less frequently. Refined DQN [17] proposes another DRL method that involves a carefully designed reward mechanism and a dual experience replay structure. Refined DQN evaluates this design by enabling the agent to play the Snake game.

The experience replay buffer is a desirable component of modern DRL algorithms. It provides powerful, model-free, off-policy DRL algorithms with uncorrelated data and improves data efficiency [9] during data sampling. DQN [6] shows the power of the replay buffer in sampling data, using a buffer of size 10^6. After that, [8], [10]–[12], [17], among others, presented their work with a replay buffer of the same size and structure. Schaul et al. propose an efficient sampling strategy in their prioritized experience replay (PER) [13]. PER shows that instead of sampling data uniformly at random, the latest data should get the highest priority; hence the latest data have a higher probability of being selected, and this selection method seems to improve results. [15] shows that a large experience replay buffer can hurt performance. They also propose that when sampling data to train DRL algorithms, the most recent data should be appended to the batch.

TABLE I
REWARD MECHANISM FOR SNAKE GAME
Moves                              | Reward | Result
Eats an apple                      | +1     | Score increases
Hits the wall or itself            | -1     | End of episode
Neither eats nor hits wall/itself  | -0.1   | Continue playing

TABLE II
MEMORY REQUIREMENT FOR DIFFERENT PIXEL DATA
                                 | RGB     | Grayscale | Binary
Data type                        | float   | float     | int
Size (kB)                        | 165.375 | 55.125    | 6.890
Memory saving w.r.t. RGB         | 0%      | 67%       | 96%
Memory saving w.r.t. Grayscale   | -       | 0%        | 87.5%

Fig. 1. Visual image data (a) before and (b) after preprocessing.

III. METHOD

Our objective is to reduce memory usage during training while achieving the best performance possible. The replay memory takes a considerable amount of memory, as described later. We try to achieve memory efficiency by reducing the massive replay buffer requirement through image preprocessing and a smaller buffer size. The buffer size is carefully chosen so that the agent has the necessary information to train well and achieves a moderate score. We use a slight variation of the deep Q-learning algorithm for this purpose.

A. Image Preprocessing

The agent gets RGB values in 3-D array format from the game environment. We convert the RGB array into grayscale because it does not affect the performance [18] and it saves three times the memory. We resize the grayscale data to 84 × 84 pixels. Finally, for further memory reduction, we convert this resized grayscale data into binary data (values of 0 and 1 only). The memory requirement for storing the various image formats (scaled between 0 and 1) is given in Table II. Table II shows that converting RGB into grayscale saves around 67% and converting RGB into binary saves around 96%. Also, the memory requirement is reduced by around 87.5% when converting from grayscale into binary. The visual pixel data transformation is shown in Fig. 1, and the preprocessing method is presented as a flowchart in Fig. 2.

Fig. 2. Diagram of image preprocessing: game environment frame -> grayscale -> resize to 84 × 84 -> pixel values 0 or 1.
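To make the pipeline concrete, the sketch below shows one way this preprocessing could be written in Python. It is an illustrative sketch rather than the authors' released code: the use of OpenCV, the function name, and the 0.5 binarization threshold are assumptions.

```python
import numpy as np
import cv2  # assumed dependency; any image library with grayscale/resize works


def preprocess_frame(rgb_frame: np.ndarray) -> np.ndarray:
    """Convert an RGB game frame into an 84 x 84 binary image.

    rgb_frame: H x W x 3 uint8 array as returned by the game environment.
    Returns an 84 x 84 array of 0/1 values stored as uint8.
    """
    # RGB -> grayscale: reported in the paper not to affect performance [18].
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)

    # Downscale to the 84 x 84 input resolution of the Q-network.
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

    # Grayscale -> binary: keep only object-present / object-absent information.
    # The 0.5 threshold on the normalized frame is a hypothetical choice.
    binary = (small.astype(np.float32) / 255.0 > 0.5).astype(np.uint8)

    # Stored as one byte per pixel, a frame takes 84 * 84 = 7,056 bytes,
    # i.e. about 6.89 kB, matching the binary column of Table II.
    return binary
```

Stacking four consecutive preprocessed frames then gives the 84 × 84 × 4 network input listed in Table III.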
B. Game Selection and Their Environments

The use-case of our target applications is less complex tasks. For this reason, we implemented the classical Snake game [19] in the 'pygame' module.
The game screen is divided into a 12 × 12 grid, and the resolution of the game is set to 252 × 252. The initial snake size is 3, and the controller has four inputs to navigate. Table I shows the valid actions and the respective rewards for the Snake game environment.

C. Reinforcement Learning Preliminaries

Any reinforcement learning or sequential decision-making problem can be formulated as a Markov Decision Process (MDP). An MDP is a triplet M = (X, A, P_0), where X is a set of valid states, A is a set of valid actions, and P_0 is the transition probability kernel that maps X × A to a next-state transition probability. For a deterministic system, the state transition is defined as

    s_{t+1} = f(s_t, a_t)                                                      (1)

The reward is defined as

    r_t = R(s_t, a_t)                                                          (2)

The cumulative reward over a trajectory or episode is called the return, R(\tau). The discounted return is given by

    R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t                                 (3)

D. Deep Q-Learning

The goal of the RL agent is to maximize the expected return. Following a policy \pi, the expected return J(\pi) is defined as

    J(\pi) = E_{\tau \sim \pi}[R(\tau)]                                        (4)

The optimal action-value (or Q) function Q^*(s, a) maximizes the expected return by taking any action a at state s and acting optimally in the following states:

    Q^*(s, a) = \max_{\pi} E_{\tau \sim \pi}[R(\tau) | s_0 = s, a_0 = a]       (5)

For finding the optimal actions based on the optimal action-value function at time t, Q^* must satisfy the Bellman equation,

    Q^*(s, a) = E_{s' \sim \rho}[ r(s, a) + \gamma \max_{a'} Q^*(s', a') ]     (6)

The optimal action-value function gives rise to the optimal action a^*(s), which can be described as

    a^*(s) = \arg\max_{a} Q^*(s, a)                                            (7)

For training an optimal action-value function, a non-linear function approximator such as a neural network [6] is sometimes used. We use a convolutional neural network.

TABLE III
THE ARCHITECTURE OF THE NEURAL NETWORK
Layer   | Filter | Stride | Units/Filters  | Activation | Zero Padding | Output
Input   |        |        |                |            |              | 84*84*4
Conv1   | 8*8    | 4      | 32             | ReLU       | Yes          | 21*21*32
M. Pool | 2*2    | 2      |                |            | Yes          | 11*11*32
Conv2   | 4*4    | 2      | 64             | ReLU       | Yes          | 6*6*64
M. Pool | 2*2    | 2      |                |            | Yes          | 3*3*64
B. Norm |        |        |                |            |              | 3*3*64
Conv3   | 3*3    | 2      | 128            | ReLU       | Yes          | 2*2*128
M. Pool | 2*2    | 2      |                |            | Yes          | 1*1*128
B. Norm |        |        |                |            |              | 1*1*128
Flatten |        |        |                |            |              | 128
FC      |        |        | 512            | ReLU       |              | 512
FC      |        |        | 512            | ReLU       |              | 512
Output  |        |        | No. of actions | Linear     |              | No. of actions
M. Pool = Max Pooling, B. Norm = Batch Normalization, FC = Fully Connected

TABLE IV
MEMORY REQUIREMENT FOR THE EXPERIENCE REPLAY
                               | RGB     | Grayscale | Binary
Memory usage (GB)              | 1261.71 | 420.57    | 2.628
Memory saving w.r.t. RGB       | 0%      | 67%       | 99.7%
Memory saving w.r.t. Grayscale | -       | 0%        | 99.4%

E. Neural Network

The action-value function is iteratively updated to reach the optimal action-value function. The neural network used to approximate the action-value function and updated at each iteration is called the Q-network. We train the Q-network, parameterized by \theta, by minimizing a loss function L_i(\theta_i) at the i-th iteration,

    L_i(\theta_i) = E_{s, a \sim \rho}[ (y_i - Q(s, a; \theta_i))^2 ]          (8)

where y_i = E_{s' \sim \rho}[ r(s, a) + \gamma \max_{a'} Q'(s', a'; \theta'_k) ] is the target for that update. Here Q' is another Q-network with the same shape as the Q-network but with frozen parameters \theta'_k, called the target Q-network, used for training stability. We train the Q-network by minimizing the loss (8) with respect to the parameters \theta_i. We use the Adam [20] optimizer for fast convergence. Our convolutional neural network structure is shown in Table III.
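For concreteness, the layer listing in Table III can be expressed directly in a deep-learning framework. The paper does not state which framework was used, so the Keras code below is only a sketch under that assumption; the "same" padding mirrors the Zero Padding column, and any stride or pooling detail not given in the table is filled in so that the listed output shapes are reproduced.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_q_network(num_actions: int) -> tf.keras.Model:
    """Q-network following the layer listing in Table III (illustrative sketch)."""
    inputs = layers.Input(shape=(84, 84, 4))  # four stacked preprocessed frames
    x = layers.Conv2D(32, 8, strides=4, padding="same", activation="relu")(inputs)  # 21x21x32
    x = layers.MaxPooling2D(2, strides=2, padding="same")(x)                        # 11x11x32
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(x)       # 6x6x64
    x = layers.MaxPooling2D(2, strides=2, padding="same")(x)                        # 3x3x64
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)      # 2x2x128
    x = layers.MaxPooling2D(2, strides=2, padding="same")(x)                        # 1x1x128
    x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)                                                         # 128
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dense(512, activation="relu")(x)
    outputs = layers.Dense(num_actions, activation="linear")(x)  # one Q-value per action
    return models.Model(inputs, outputs)
```

In this sketch the online Q-network and the frozen target Q-network Q' would share this architecture, differing only in how often their weights are updated.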
Fig. 3. Structure of the experience replay memory: experience tuples E_t = (s_t, a_t, r_{t+1}, s_{t+1}) produced by random actions or actions taken by the agent in the environment are stored, together with the screen data, in the replay memory.

Fig. 4. The deep reinforcement learning design structure of our model: preprocessed frames form experience tuples E_t = (s_t, a_t, r_{t+1}, s_{t+1}) stored in the experience replay memory; random mini-batches feed the online deep Q-network; the target deep Q-network supplies y_t = R_{t+1} + \gamma \max_a Q'(a) for the loss [y_t - Q(a_t)]^2; and the two networks sync weights every p steps.

F. Experience Replay Buffer

As our focus is to keep the memory requirement as low as possible during training, choosing the size of the replay buffer is one of the critical design decisions. The size of the replay buffer directly alters the memory requirement. We use a replay buffer of size 50,000, requiring far less memory (only 5%) than [6], [8], [17], which use a replay buffer of size 1,000,000. [6], [8], [17] store grayscale data in the replay buffer; Table IV shows that we use 99.4% less memory compared to these works. The replay buffer stores data in FIFO (first in, first out) order so that the buffer contains only the latest data. We present the complete cycle of the experience replay buffer in Fig. 3, and Fig. 4 illustrates our complete design.

IV. EXPERIMENTS

A. Training

For training our model, we take a random batch of 32 experiences from the replay buffer at each iteration. Our model has two convolutional neural networks (the online DQN and the target DQN) sharing the same structure, but they do not sync automatically. The weights of the target network are frozen so that it cannot be trained. The state history from the mini-batch is fed into the online DQN, which outputs the Q-values Q(s_t, a_t).

    Loss = [ y_t - Q(s_t, a_t) ]^2                                             (9)

The target y_t is calculated from the target Q-network. We pass the next-state values to the target Q-network and, for each next state in the batch, obtain the corresponding Q-value. That gives the \max_{a'} Q'(s', a') term in the equation below,

    y_t = R_{t+1} + \gamma \max_{a'} Q'(s', a')                                (10)

Here \gamma is the discount factor, one of the many hyperparameters in our model; we set \gamma to 0.99. R_{t+1} is the reward in each experience tuple. With these values we obtain y_t, and the loss is computed by substituting them into (9). We then use this loss to backpropagate through our online DQN with the Adam optimizer, which is used instead of classical stochastic gradient descent for faster convergence. The target DQN is synced with the online DQN every 10,000 steps. The values of the hyperparameters we choose are listed in Table VI.
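As a companion to (9) and (10), the sketch below shows how the 50,000-capacity FIFO buffer and one gradient step could look in code. It assumes the hypothetical build_q_network model above and TensorFlow; the helper names and the terminal-state masking are illustrative additions, not taken from the paper.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

REPLAY_CAPACITY = 50_000   # buffer size used in this paper (vs. 10^6 in [6], [8], [17])
BATCH_SIZE = 32            # mini-batch size from Table VI
GAMMA = 0.99               # discount factor from Table VI

replay_buffer = deque(maxlen=REPLAY_CAPACITY)  # FIFO: oldest experience is dropped first


def store(state, action, reward, next_state, done):
    """Append one experience tuple (s_t, a_t, r_{t+1}, s_{t+1}, done)."""
    replay_buffer.append((state, action, reward, next_state, done))


def train_step(online_dqn, target_dqn, optimizer, num_actions):
    """One gradient step on the loss in (9) with targets from (10)."""
    batch = random.sample(replay_buffer, BATCH_SIZE)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = np.stack(states).astype(np.float32)
    next_states = np.stack(next_states).astype(np.float32)
    actions = np.array(actions, dtype=np.int32)
    rewards = np.array(rewards, dtype=np.float32)
    dones = np.array(dones, dtype=np.float32)

    # y_t = R_{t+1} + gamma * max_a' Q'(s', a'); the (1 - done) factor zeroes the
    # bootstrap term at episode ends, a standard detail not spelled out in (10).
    next_q = target_dqn(next_states)
    targets = rewards + GAMMA * tf.reduce_max(next_q, axis=1) * (1.0 - dones)

    with tf.GradientTape() as tape:
        q_values = online_dqn(states)
        # Select Q(s_t, a_t) for the action actually taken in each transition.
        q_taken = tf.reduce_sum(q_values * tf.one_hot(actions, num_actions), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))  # equation (9), batch-averaged

    grads = tape.gradient(loss, online_dqn.trainable_variables)
    optimizer.apply_gradients(zip(grads, online_dqn.trainable_variables))
    return float(loss)


# Every 10,000 steps (Table VI) the target network is synced with the online one:
#   target_dqn.set_weights(online_dqn.get_weights())
```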
B. Results and Comparisons

We allow the DRL agents to play 140,000 episodes to match the training results presented in [17]. We train one agent with our method and another with the DQN method presented in [6]; we refer to [6] as the baseline DQN model. We then compare our model with the baseline DQN model [6] and the refined DQN model [17].

The results of training the Snake game with our model are shown in Fig. 5. Fig. 5(a) shows the game score with our model during training. Fig. 5(b) shows that even though our reward mechanism is simpler than that of the refined DQN model, the agent maximizes the cumulative reward well. Fig. 6 displays the baseline DQN results during training on the Snake game.

Fig. 5. Results of our agent playing the Snake game during training: (a) score vs. episode, (b) reward vs. episode.

Fig. 6. Results of the baseline DQN model playing the Snake game during training: (a) score vs. episode, (b) reward vs. episode.

In Fig. 7 we present the score and reward comparison between our model and the baseline DQN model. The blue line in Fig. 7(a) represents our model's score, and the purple line represents the score of the baseline DQN model. Over the 140,000 training episodes, our model remains better in episode score even though it requires fewer resources. Likewise, Fig. 7(b) demonstrates that our model is capable of achieving higher cumulative rewards than the baseline DQN model.

Fig. 7. Comparison between our model and the baseline DQN model: (a) score comparison, (b) reward comparison.

We also compare the results between our model and the refined DQN model [17]. Refined DQN follows a dual experience replay memory architecture and a complex reward mechanism; however, our model surpasses their score. Since their game is similar to ours, we compare our results with the results provided in their paper. Fig. 8(a) shows the results presented in [17], and Fig. 8(b) shows our model's results during training.

Fig. 8. Comparison between the refined DQN model and our model: (a) score graph of refined DQN (taken from [17]), (b) score graph of our model.

In Section III-F we showed that our model is more memory efficient than the baseline DQN model and the refined DQN model during training. In this section we show that, despite low memory usage, our model can achieve similar if not better results than the baseline and refined DQN models.

Fig. 9. Testing evaluation by playing 50 random games after training: (a) refined DQN score (taken from [17]), (b) our model's score.

TABLE V
LIST OF PERFORMANCE COMPARISON OF DIFFERENT AGENTS
Snake Game          | Score
Human Average       | 1.98 *
Baseline Average    | 0.26 *
Refined DQN Average | 9.04 *
Our Average         | 9.53
Human Best          | 15 *
Baseline Best       | 2 *
Refined DQN Best    | 17 *
Our Best            | 20
* Data taken from [17]
By comparing Fig. 8(a) and Fig. 8(b), we can safely say that our model achieves better scores despite having a simpler replay buffer, a simpler reward mechanism, and lower memory consumption. Fig. 9(a) and Fig. 9(b) show the scores of 50 random episodes during testing of the refined DQN model and our model, respectively. Table V summarizes the scores of the refined DQN model and our model. We can see from Table V that their refined DQN average is 9.04 while ours is 9.53, and their refined DQN best score is 17 while ours is 20. So our model performs better in both the training and testing phases.

TABLE VI
LIST OF HYPERPARAMETERS
Hyperparameter           | Value   | Description
Discount Factor          | 0.99    | gamma-value in the max Q-function
Initial Epsilon          | 1.0     | Initial value of the exploration epsilon
Final Epsilon            | 0.01    | Final value of the exploration epsilon
Batch Size               | 32      | Mini-batch size sampled from replay memory
Max Steps                | 10,000  | Maximum number of steps allowed per episode
Learning Rate            | 0.0025  | Learning rate for the Adam optimizer
Clip-Norm                | 1.0     | Gradient clipping value for the Adam optimizer
Random Frames            | 50,000  | Number of random initial steps
Epsilon Greedy Frames    | 500,000 | Number of frames over which epsilon decays from its initial to its final value
Experience Replay Memory | 50,000  | Capacity of the experience replay memory
Update of DQN            | 4       | Number of steps between updates of the online DQN
Update Target DQN        | 10,000  | Number of steps between syncs of the target and online DQN

V. CONCLUSION

In this paper, we have shown that better image preprocessing and a better replay buffer mechanism can reduce the memory consumption of DRL algorithms during training. We have also demonstrated that, using our method, the performance of the DRL agent on a lower-constraint application is comparable, if not better. We combined our method with the DQN algorithm (with some modifications) to observe the method's effectiveness. Our presented design requires less memory and a simple CNN. We established that our method's results are as good as other DRL approaches for the Snake game autonomous agent.

ACKNOWLEDGMENT

This work was supported by North South University research grant CTRG-21-SEPS-18.

The authors would like to gratefully acknowledge that the computing resources used in this work were housed at the National University of Sciences and Technology (NUST), Pakistan. The cooperation was pursued under the South Asia Regional Development Center (RDC) framework of the Belt & Road Aerospace Innovation Alliance (BRAIA).

REFERENCES

[1] C. J. C. H. Watkins and P. Dayan, "Q-learning," in Machine Learning, 1992, pp. 279–292.
[2] G. Tesauro, "Temporal difference learning and TD-Gammon," Commun. ACM, vol. 38, no. 3, pp. 58–68, Mar. 1995.
[3] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, S. Solla, T. Leen, and K. Müller, Eds., vol. 12. MIT Press, 1999.
[4] J. Peters, S. Vijayakumar, and S. Schaal, "Natural actor-critic," in Machine Learning: ECML 2005. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 280–291.
[5] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proceedings of the 31st International Conference on Machine Learning (ICML'14). JMLR.org, 2014, pp. I-387–I-395.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with deep reinforcement learning," Computing Research Repository, vol. abs/1312.5602, 2013.
[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, Feb. 2015.
[8] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI'16). AAAI Press, 2016, pp. 2094–2100.
[9] L.-J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Mach. Learn., vol. 8, no. 3–4, pp. 293–321, May 1992.
[10] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," Computing Research Repository, 2019.
[11] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, and S. Russell, "Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 4213–4220, Jul. 2019.
[12] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in ICML, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 2018, pp. 1856–1865.
[13] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," 2015. [Online]. Available: https://arxiv.org/abs/1511.05952
[14] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba, "Hindsight experience replay," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
[15] S. Zhang and R. S. Sutton, "A deeper look at experience replay," Computing Research Repository, vol. abs/1712.01275, 2017.
[16] H. Hasselt, "Double Q-learning," in Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., vol. 23. Curran Associates, Inc., 2010.
[17] Z. Wei, D. Wang, M. Zhang, A.-H. Tan, C. Miao, and Y. Zhou, "Autonomous agents in Snake game via deep reinforcement learning," in 2018 IEEE International Conference on Agents (ICA), 2018, pp. 20–25.
[18] T. D. Nguyen, K. Mori, and R. Thawonmas, "Image colorization using a deep convolutional neural network," Computing Research Repository, vol. abs/1604.07904, 2016.
[19] A. Punyawee, C. Panumate, and H. Iida, "Finding comfortable settings of Snake game using game refinement measurement," in Advances in Computer Science and Ubiquitous Computing. Singapore: Springer Singapore, 2017, pp. 66–73.
[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.