Mobile Robot Path Optimization Technique Based on Reinforcement Learning Algorithm in Warehouse Environment
HyeokSoo Lee and Jongpil Jeong *
Department of Smart Factory Convergence, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon 16419, Korea; huxlee@g.skku.edu
* Correspondence: jpjeong@skku.edu; Tel.: +82-31-299-4260

Appl. Sci. 2021, 11, 1209. https://doi.org/10.3390/app11031209
Academic Editor: Marina Paolanti
Received: 30 December 2020; Accepted: 26 January 2021; Published: 28 January 2021

Abstract: This paper reports on the use of reinforcement learning technology for optimizing mobile robot paths in a warehouse environment with automated logistics. First, we compared the results of experiments conducted using two basic algorithms to identify the fundamentals required for planning the path of a mobile robot and utilizing reinforcement learning techniques for path optimization. The algorithms were tested using a path optimization simulation of a mobile robot in the same experimental environment and under the same conditions. Thereafter, we attempted to improve on the previous experiment and conducted additional experiments to confirm the improvement. The experimental results helped us understand the characteristics of and differences between the reinforcement learning algorithms. The findings of this study will facilitate our understanding of the basic concepts of reinforcement learning for further studies on more complex and realistic path optimization algorithm development.

Keywords: warehouse environment; mobile robot path optimization; reinforcement learning

1. Introduction

The fourth industrial revolution and digital innovation have led to increased interest in the use of robots across industries; thus, robots are being used at various industrial sites. Robots can be broadly categorized as industrial or service robots. Industrial robots are automated equipment and machines used at industrial sites such as factory lines for assembly, machining, inspection, and measurement as well as pressing and welding [1]. The use of service robots is expanding beyond manufacturing environments into other areas such as industrial sites, homes, and institutions. Service robots can be broadly divided into personal service robots and professional service robots. In particular, professional service robots are robots used by workers to perform tasks, and mobile robots are one type of professional service robot [2].

A mobile robot needs the following elements to perform tasks: localization, mapping, and path planning. First, the location of the robot must be known in relation to the surrounding environment. Second, a map is needed to identify the environment and move accordingly. Third, path planning is necessary to find the best path to perform a given task [3]. Path planning has two types: point-to-point and complete coverage. Complete coverage path planning is performed when all positions must be checked, such as for a cleaning robot. Point-to-point path planning is performed for moving from the start position to the target position [4].
To review the use of reinforcement learning as a path optimization technique for mobile robots in a warehouse environment, the following points must be understood. First, it is necessary to understand the warehouse environment and the movement of mobile robots. To do this, we must check the configuration and layout of the warehouse in which the mobile robot is used as well as the tasks and actions that the mobile robot performs in this environment. Second, it is necessary to understand the recommended research steps for the path planning and optimization of mobile robots. Third, it is necessary to understand the connections between layers and modules as well as how they work when
they are integrated with mobile robots, environments, hardware, and software. With this basic information, a simulation program for real experiments can be implemented. Because reinforcement learning can be tested only when a specific experimental environment is prepared in advance, substantial preparation of the environment is required early in the study, and the researcher must prepare it. Subsequently, the reinforcement learning algorithm to be tested is integrated with the simulation environment for the experiment.

This paper discusses an experiment performed using a reinforcement learning algorithm as a path optimization technique in a warehouse environment. Section 2 briefly describes the warehouse environment, the concept of path planning for mobile robots, the concept of reinforcement learning, and the algorithms to be tested. Section 3 briefly describes the structure of the mobile robot path-optimization system, the main logic of path search, and the idea of dynamically adjusting the reward value according to the moving action. Section 4 describes the construction of the simulation environment for the experiments, the parameters and reward values used for the experiments, and the experimental results. Section 5 presents the conclusions and directions for future research.

2. Related Work

2.1. Warehouse Environment

2.1.1. Warehouse Layout and Components

The major components of an automated warehouse with mobile robots are as follows [5–10].

• Robot drive device: It instructs the robot from the center to pick or load goods between the workstation and the inventory pod.
• Inventory pod: This is a shelf rack that includes storage boxes. There are two standard sizes: a small pod can hold up to 450 kg, and a large pod can load up to 1300 kg.
• Workstation: This is the work area in which the worker operates and performs tasks such as replenishment, picking, and packaging.

Figure 1 shows an example of a warehouse layout with inventory pods and workstations.

Figure 1. Sample Layout of Warehouse Environment.

2.1.2. Mobile Robot's Movement Type

In the robotic mobile fulfillment system, a robot performs a storage trip, in which it transports goods, and a retrieval trip, in which it moves to pick up goods. After dropping the goods into the inventory pod, additional goods can be picked up or the robot can return to the workstation [11].
The main movements of the mobile robot in the warehouse are to perform the following tasks [12]:

1. move goods from the workstation to the inventory pod;
2. move from one inventory pod to another to pick up goods;
3. move goods from the inventory pod to the workstation.

2.2. Path Planning for Mobile Robot

For path planning in a given work environment, a mobile robot searches for an optimal path from the starting point to the target point according to specific performance criteria. In general, path planning is performed for two path types, global and local, depending on whether the mobile robot has environmental information. In the case of global path planning, all information about the environment is provided to the mobile robot before it starts moving. In local path planning, the robot does not know most of the information about the environment before it moves [13,14].

The path planning development process for mobile robots usually follows the steps below, shown in Figure 2 [13]:

• Environmental modeling
• Optimization criteria
• Path finding algorithm
Figure 2. Development steps for global/local path optimization [13].

The classification of the path planning of the mobile robot is illustrated in Figure 3.

Figure 3. Classification of path planning and global path search algorithm [13].

The Dijkstra, A*, and D* algorithms of the heuristic method are the most commonly used algorithms for path search in global path planning [13,15]. Traditionally, warehouse path planning has been optimized using heuristics and meta-heuristics. Most of these methods work without additional learning and require significant computation time as the problem becomes complex. In addition, it is challenging to design and understand heuristic algorithms, making it difficult to maintain and scale the path. Therefore, interest in artificial intelligence technology for path planning is increasing [16].
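For reference, the following minimal sketch (Python) illustrates the kind of classical global path search mentioned above, using Dijkstra's algorithm on a small occupancy grid; the grid contents, the uniform step cost, and the helper name are illustrative assumptions, not part of the original paper. With a uniform step cost this reduces to breadth-first search; A* would add a heuristic such as the Manhattan distance to the goal.

import heapq

def dijkstra_grid(grid, start, goal):
    # Shortest 4-connected path on a grid; grid[r][c] == 1 marks an obstacle cell.
    rows, cols = len(grid), len(grid[0])
    dist, prev = {start: 0}, {}
    heap = [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist.get(cell, float("inf")):
            continue                                      # stale heap entry
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1                                # uniform step cost
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)], prev[(nr, nc)] = nd, cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None                                       # no collision-free path exists
    path, cell = [goal], goal
    while cell != start:                                  # walk predecessors back to the start
        cell = prev[cell]
        path.append(cell)
    return path[::-1]

# Example: 5 x 5 grid with a small obstacle block.
grid = [[0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]
print(dijkstra_grid(grid, (0, 0), (4, 4)))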
2.3. Reinforcement Learning Algorithm

Reinforcement learning is a method that uses trial and error to find a policy to reach a goal by learning through failure and reward. Deep neural networks learn weights and biases from labeled data, whereas reinforcement learning learns weights and biases using the concept of rewards [17].

The essential elements of reinforcement learning are as follows; see Figure 4 for how they interact.

Figure 4. The essential elements and interactions of Reinforcement Learning [18].

First, agents are the subjects of learning and decision making. Second, the environment refers to the external world recognized by the agent, which interacts with the agent. Third is the policy: the agent finds actions to be performed from the state of the surrounding environment. The fourth element is the reward, information that tells the agent whether its behavior was good or bad. In reinforcement learning, positive rewards are given for actions that help agents reach their goals, and negative rewards are given for actions that prevent them from reaching their goals. The fifth element is the value function. The value function tells the agent what is good in the long term; value is the total amount of reward an agent can expect. Sixth is the environment model. The environment model predicts the next state and the reward that follow from the current state. Planning is a way to decide what to do before experiencing future situations. Reinforcement learning using models and planning is called a model-based method, and reinforcement learning using trial and error is called a model-free method [18].

The basic concept of reinforcement learning modeling aims to find the optimal policy for problem solving and can be defined on the basis of the Markov Decision Process (MDP). If the set of state and action pairs is finite, this is called a finite MDP [18]. The most basic Markov process concept in an MDP consists of a state s, a state transition probability p, and an action a performed. The probability p is the probability that, when performing an action a in state s, the state will change to s' [18].

Formula (1), which defines the relationship between state, action, and probability, is as follows [18,19]:
$p(s' \mid s, a) = \Pr\{ S_t = s' \mid S_{t-1} = s, A_{t-1} = a \}$  (1)

The agent's goal is to maximize the sum of regularly accumulated rewards, and the expected return is the sum of rewards. In other words, using rewards to formalize the concept of a goal is characteristic of reinforcement learning. In addition, the discount factor is also important in the concept of reinforcement learning. The discount factor (γ) is a coefficient that discounts the future expected reward and has a value between 0 and 1 [18].
The discounted return, obtained through the reward and the discount rate, can be defined as the following Formula (2) [18,19]:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$  (2)

The reinforcement learning algorithm has the concept of a value function, a function of a state that estimates whether the environment given to an agent is good or bad. Moreover, the value function is defined according to the behavioral method called the policy. The policy is the probability of choosing possible actions in a state [18]. While the value function considers the cumulative reward and the change of state together, the state value function additionally takes the policy into account [20]. The value function is the expected value of the return if policy π is followed starting at state s, and the formula is as follows [18,20]:

$v_\pi(s) = \mathbb{E}_\pi[ G_t \mid S_t = s ] = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]$  (3)

Among the policies, the policy with the highest expected return is the optimal policy, and the optimal policy has an optimal state value function and an optimal action value function [18]. The optimal state value function is given by Equation (4), and the optimal action value function by Equations (5) and (6) [18,20]:

$v_*(s) = \max_\pi v_\pi(s)$  (4)

$q_*(s, a) = \max_\pi q_\pi(s, a)$  (5)

$= \mathbb{E}[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a ]$  (6)

The value of a state that follows an optimal policy is equal to the expected value of the return from the best action that can be chosen in that state [18]. The formula is as follows [18,20]:

$q_*(s, a) = \mathbb{E}\!\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a \right]$  (7)

$= \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma \max_{a'} q_*(s', a') \right]$  (8)

2.3.1. Q-Learning

One of the most basic reinforcement learning algorithms is the off-policy temporal-difference control algorithm known as Q-Learning [18]. The Q-Learning algorithm stores the state and action in the Q table in the form of a Q value; for action selection, it selects a random action or the maximum-value action according to the epsilon-greedy value and stores the reward values in the Q table [18]. The algorithm stores the values in the Q table using Equation (9) below [21]:

$Q(s, a) = r(s, a) + \max_{a} Q(s', a)$  (9)

Equation (9) is generally converted into the update Equation (10) by considering a factor of time [18,22,23]:

$Q(s, a) \leftarrow Q(s, a) + \alpha\left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$  (10)
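To make the update rule in Equation (10) concrete, the following minimal sketch (Python) applies a single tabular Q-value update; the dictionary-based Q table, the state labels, the action set, and the reward value are illustrative assumptions, while α = 0.01 and γ = 0.9 anticipate the parameter values used later in Table 1.

from collections import defaultdict

alpha, gamma = 0.01, 0.9            # learning rate and discount rate (cf. Table 1)
actions = ["up", "down", "left", "right"]
Q = defaultdict(float)              # Q[(state, action)] defaults to 0.0

def q_update(s, a, r, s_next):
    # Equation (10): Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Example: the agent moved one cell to the right and received a reward of 0.
q_update(s=(0, 0), a="right", r=0.0, s_next=(0, 1))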
The procedure of the Q-Learning algorithm is as follows (Algorithm 1) [18].

Algorithm 1: Q-Learning.
  Initialize Q(s, a)
  Loop for each episode:
    Initialize s (state)
    Loop for each step of episode until s (state) is destination:
      Choose a (action) from s (state) using policy from Q values
      Take a (action), find r (reward), s' (next state)
      Calculate Q value and save it into Q table
      Change s' (next state) to s (state)

In reinforcement learning algorithms, the process of exploration and exploitation plays an important role. Initially, lower-priority actions should be selected to investigate the environment as thoroughly as possible, and then higher-priority actions are selected and pursued [23]. The Q-Learning algorithm has become the basis of reinforcement learning algorithms because it is simple and shows excellent learning ability in a single-agent environment. However, Q-Learning updates only once per action; it is thus difficult to solve complex problems effectively when many states and actions have not yet been realized.

In addition, because the Q table for reward values is predefined, a considerable amount of storage memory is required. For this reason, the Q-Learning algorithm is disadvantageous in a complex and advanced multi-agent environment [24].
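As a concrete reading of Algorithm 1, the following minimal sketch (Python) runs tabular Q-Learning with epsilon-greedy action selection on a small grid world; the grid size, obstacle cells, reward scheme, epsilon, and step cap are illustrative assumptions rather than the exact simulation used in Section 4.

import random
from collections import defaultdict

ROWS, COLS = 5, 5
OBSTACLES = {(1, 1), (1, 2), (2, 2)}                      # assumed inventory-pod cells
START, GOAL = (0, 0), (4, 4)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
alpha, gamma, epsilon = 0.01, 0.9, 0.1
Q = defaultdict(float)

def step(state, action):
    # Basic reward scheme: -1 for hitting a wall or obstacle, 1 at the target, 0 otherwise.
    r, c = state[0] + ACTIONS[action][0], state[1] + ACTIONS[action][1]
    if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) in OBSTACLES:
        return state, -1.0, False                          # invalid move: stay in place
    return (r, c), (1.0 if (r, c) == GOAL else 0.0), (r, c) == GOAL

def choose_action(state):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit the Q values.
    if random.random() < epsilon:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(5000):
    state = START
    for _ in range(1000):                                  # step cap per episode
        action = choose_action(state)
        next_state, reward, done = step(state, action)
        target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (target - Q[(state, action)])   # Equation (10)
        state = next_state
        if done:
            break

After training, the learned path can be read off by repeatedly taking the greedy action from the start state until the target is reached.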
2.3.2. Dyna-Q

Dyna-Q is a combination of the Q-Learning and Q-Planning algorithms. The Q-Learning algorithm changes a policy based on real experience. The Q-Planning algorithm obtains simulated experience through search control and changes the policy based on the simulated experience [18]. Figure 5 illustrates the structure of the Dyna-Q model.

Figure 5. Dyna-Q model structure [18].

The Q-Planning algorithm randomly selects samples only from previously experienced information, and the Dyna-Q model obtains information for simulated experience through search control in the same way that it obtains information through model learning based on real experience [18]. The Dyna-Q update equation is the same as that for Q-Learning, Equation (10) [22].
The Dyna-Q algorithm is as follows (Algorithm 2) [18].

Algorithm 2: Dyna-Q.
  Initialize Q(s, a)
  Loop forever:
    (a) Get s (state)
    (b) Choose a (action) from s (state) using policy from Q values
    (c) Take a (action), find r (reward), s' (next state)
    (d) Calculate Q value and save it into Q table
    (e) Save s (state), a (action), r (reward), s' (next state) in memory
    (f) Loop for n times:
          Take s (state), a (action), r (reward), s' (next state) from memory
          Calculate Q value and save it into Q table
    (g) Change s' (next state) to s (state)

Steps (a) to (d) and (g) are the same as in Q-Learning. Steps (e) and (f), added in the Dyna-Q algorithm, store past experiences in memory to generate the next state and a more accurate reward value.
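A minimal sketch (Python) of the planning part added by Dyna-Q, i.e., steps (e) and (f) layered on top of the same tabular update; the deterministic model dictionary and the value of n are illustrative assumptions.

import random
from collections import defaultdict

alpha, gamma, n_planning = 0.01, 0.9, 10
ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)
model = {}                                    # (s, a) -> (r, s'): memorized real experience

def q_update(s, a, r, s_next):
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(s, a, r, s_next):
    q_update(s, a, r, s_next)                 # (d) direct update from the real experience
    model[(s, a)] = (r, s_next)               # (e) store the transition in memory
    for _ in range(n_planning):               # (f) planning with n simulated experiences
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next)

Because every real step triggers n additional simulated updates, Dyna-Q can stabilize in fewer episodes while spending more time per episode, which is consistent with the behavior observed in Section 4.4.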
3. Reinforcement Learning-Based Mobile Robot Path Optimization

3.1. System Structure

In the mobile robot framework structure in Figure 6, the agent is controlled by the supervisor controller, and the robot controller is executed by the agent, so that the robot operates in the warehouse environment and the necessary information is observed and collected [25–27].

Figure 6. Warehouse mobile robot framework structure.

3.2. Operation Procedure

The mobile robot framework works as follows. At the beginning, the simulator is set to its initial state, and the mobile robot collects the first observation information and sends it to the supervisor controller. The supervisor controller verifies the received information and communicates the necessary action to the agent. At this stage, the supervisor controller can obtain additional information if necessary. The supervisor controller can also communicate job information to the agent. The agent conveys the action to the mobile robot through the robot controller, and the mobile robot performs the received action using its actuators. After the action is performed by the robot controller, the changed position and status information is transmitted to the agent and the supervisor controller, and the supervisor controller calculates the reward value for the performed action. Subsequently, the action, reward value, and status information are stored in memory. These procedures are repeated until the mobile robot achieves the defined goal or reaches the defined number of epochs/episodes or other conditions that satisfy the goal [25].

In Figure 7, we show an example class that is composed of the essential functions and main logic required by the supervisor controller for path planning.

Figure 7. Required methods and logic of path finding [25].
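As a rough outline of the operation procedure above and of the kind of class indicated in Figure 7, the following sketch (Python) shows one possible shape of a supervisor controller loop; the class and method names (reset, choose_action, execute, learn, and so on) are hypothetical placeholders, not the API of the framework in [25].

class SupervisorController:
    # Hypothetical skeleton of the supervisor-side loop described in Section 3.2.
    def __init__(self, agent, robot_controller, environment, max_episodes=5000):
        self.agent = agent
        self.robot_controller = robot_controller
        self.environment = environment
        self.max_episodes = max_episodes
        self.memory = []

    def get_reward(self, observation):
        raise NotImplementedError             # e.g., one of the reward schemes of Section 4.3

    def is_done(self, observation):
        raise NotImplementedError             # target reached, or the step/episode limit hit

    def run(self):
        for episode in range(self.max_episodes):
            observation = self.environment.reset()        # simulator set to its initial state
            done = False
            while not done:
                action = self.agent.choose_action(observation)
                # The robot controller drives the actuators and reports the changed
                # position and status back to the supervisor controller.
                next_observation = self.robot_controller.execute(action)
                reward = self.get_reward(next_observation)
                done = self.is_done(next_observation)
                self.memory.append((observation, action, reward, next_observation))
                self.agent.learn(observation, action, reward, next_observation)
                observation = next_observation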
3.3. Dynamic Reward Adjustment Based on Action

The simplest type of reward for the movement of a mobile robot in a grid-type warehouse environment is achieved by specifying a negative reward value for an obstacle and a positive reward value for the target location. The direction of the mobile robot is determined by its current and target positions. The path planning algorithm requires significant time to find the best path by repeating many random path-finding attempts. In the case of the basic reward method introduced earlier, an agent searches randomly for a path, regardless of whether it is located far from or close to the target position. To improve on these vulnerabilities, we considered a reward method as shown in Figure 8.

The suggestions are as follows. First, when the travel path and target position are determined, a partial area close to the target position is selected (the reward adjustment area close to the target position). Second, a higher reward value than in the basic reward method is assigned to the path positions within the area selected in the previous step. Third, we propose to select such an area whenever the movement path changes and to change the positions of the movement path in the selected area to a high reward value. This is called dynamic reward adjustment based on action, and the following Algorithm 3 describes the process.
Algorithm 3: Dynamic reward adjustment based on action.
  1. Get new destination to move to.
  2. Select a partial area near the target position.
  3. Assign a higher positive reward value to the selected area, excluding the target point and obstacles (highest positive reward value at the target position).

Figure 8. Dynamic adjusting of reward values for the moving action.
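A minimal sketch (Python) of Algorithm 3 as a reward-map construction, using the reward values listed later in Section 4.3 (−1 for obstacles, 10 for the target, 0.3 inside the selected area, and 0 elsewhere); the grid size, obstacle cells, and area radius are illustrative assumptions. In line with step 1, such a map would be rebuilt whenever a new destination is assigned.

import numpy as np

def build_reward_map(shape, obstacles, target, area_radius=3):
    # Step 2: select a partial area near the target position.
    rewards = np.zeros(shape)
    r0, c0 = max(0, target[0] - area_radius), max(0, target[1] - area_radius)
    r1 = min(shape[0], target[0] + area_radius + 1)
    c1 = min(shape[1], target[1] + area_radius + 1)
    # Step 3: assign a higher positive reward inside the selected area ...
    rewards[r0:r1, c0:c1] = 0.3
    # ... excluding obstacles, with the highest positive reward at the target itself.
    for obstacle in obstacles:
        rewards[obstacle] = -1.0
    rewards[target] = 10.0
    return rewards

# Example: 25 x 25 grid as in Section 4.1, with a few assumed obstacle cells.
obstacle_cells = {(10, 12), (11, 12), (12, 12)}
reward_map = build_reward_map((25, 25), obstacle_cells, target=(20, 20))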
4. Test and Results

4.1. Warehouse Simulation Environment

To create a simulation environment, we referred to the paper [23] and built a virtual experiment environment based on the previously described warehouse environment. A total of 15 inventory pods were included in a 25 × 25 grid layout. A storage trip of a mobile robot from the workstation to the inventory pod was simulated. This experiment was conducted to find the optimal path from the starting position, where a single mobile robot was defined, to the target position while avoiding inventory pods. Figure 9 shows the target position for the experiments.

Figure 9. Target position of simulation experiment.

4.2. Test Parameter Values

The Q-Learning and Dyna-Q algorithms were tested in a simulation environment. Table 1 shows the parameter values used in the experiment.

Table 1. Parameter values for experiment.

Parameter           Q-Learning    Dyna-Q
Learning Rate (α)   0.01          0.01
Discount Rate (γ)   0.9           0.9

4.3. Test Reward Values

The experiment was conducted with the following reward methods.

• Basic method: The obstacle was set as −1, the target position as 1, and the other positions as 0.
• Dynamic reward adjustment based on action: To check the efficacy of the proposal, an experiment was conducted designating set values similar to those in the proposed method, as shown in Figure 10 [28–30]. The obstacle was set as −1, the target position as 10, the positions in the selected area (the area close to the target) as 0.3, and the other positions as 0.

Figure 10. Selected area for dynamic reward adjustment based on action.

4.4. Results

First, we tested two basic reinforcement learning algorithms: the Q-Learning algorithm and the Dyna-Q algorithm. The experiment was performed five times for each algorithm; because there may be deviations in the results between runs, all runs were performed under the same conditions and environment. The following points were verified and compared using the experimental results. The optimization efficiency of each algorithm was checked based on the length of the final path, and the path optimization performance was confirmed by the path search time. We were also able to identify learning patterns from the number of learning steps per episode.

The path search time of the Q-Learning algorithm was 19.18 min for the first test, 18.17 min for the second test, 24.28 min for the third test, 33.92 min for the fourth test, and 19.91 min for the fifth test; it took about 23.09 min on average. The path search time of the Dyna-Q algorithm was 90.35 min for the first test, 108.04 min for the second test, 93.01 min for the third test, 100.79 min for the fourth test, and 81.99 min for the fifth test; the overall average was about 94.83 min.

The final path length of the Q-Learning algorithm was 65 steps for the first test, 34 steps for the second test, 60 steps for the third test, 91 steps for the fourth test, and 49 steps for the fifth test; the average was 59.8 steps. The final path length of the Dyna-Q algorithm was 38 steps for the first test, 38 steps for the second test, 38 steps for the third test, 34 steps for the fourth test, and 38 steps for the fifth test; the average was 37.2 steps.
The Q-Learning algorithm was faster than Dyna-Q in terms of path search time, and Dyna-Q selected shorter and simpler final paths than those selected by Q-Learning. Figures 11 and 12 show the learning pattern through the number of training steps per episode. Q-Learning found the path by repeating many steps throughout the 5000 episodes. Dyna-Q stabilized after finding the optimal path between 1500 and 3000 episodes. In both algorithms, the number of steps per episode rarely exceeded 800. All tests ran 5000 episodes each. In the case of the Q-Learning algorithm, the second test exceeded 800 steps 5 times (0.1%) and the fourth test exceeded 800 steps 5 times (0.1%); thus, the Q-Learning tests exceeded 800 steps at a total level of 0.04% over the five tests. In the case of the Dyna-Q algorithm, the first test exceeded 800 steps once (0.02%), the third test exceeded 800 steps 7 times (0.14%), and the fifth test exceeded 800 steps 3 times (0.06%); thus, the Dyna-Q tests exceeded 800 steps at a total level of 0.044% over the five tests. To compare the path search time against the learning pattern, we compared experiment (e) of the two algorithms. The path search time was 19.91 min for the Q-Learning algorithm and 81.99 min for Dyna-Q. Looking at the learning patterns in Figures 11 and 12, the Q-Learning algorithm learned by executing 200 to 800 steps per episode until the end of the 5000 episodes, whereas the Dyna-Q algorithm learned by executing 200 to 800 steps per episode only until around the middle of the 2000th episode.
However, the Dyna-Q algorithm required more than 75% more time than the Q-Learning algorithm. From this perspective, it was confirmed that Dyna-Q takes longer to learn in a single episode than Q-Learning.

Figure 11. Path optimization learning progress status of Q-Learning: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.
Figure 12. Path optimization learning progress status of Dyna-Q: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.

Subsequently, additional experiments were conducted by applying 'dynamic reward adjustment based on action' to the Q-Learning algorithm. The path search time in this experiment was 19.04 min for the first test, 15.21 min for the second test, 17.05 min for the third test, 17.51 min for the fourth test, and 17.38 min for the fifth test; it took about 17.23 min on average. The final path length in this experiment was 38 steps for the first test, 34 steps for the second test, 34 steps for the third test, 34 steps for the fourth test, and 34 steps for the fifth test; the average was 34.8 steps. The experimental results can be seen in Figure 13, which also includes the experimental results of Q-Learning and Dyna-Q reported previously.

In the case of the Q-Learning algorithm to which 'dynamic reward adjustment based on action' was applied, the path search time was slightly faster than that of plain Q-Learning, and the final path length was confirmed to be improved to a level similar to that of Dyna-Q. Figure 14 shows the learning pattern obtained from the additional experiments compared with the result of Q-Learning; there is no outstanding difference. Figure 15 shows, on the map, the path found through the experiment of the Q-Learning algorithm applying 'dynamic reward adjustment based on action'.
Figure 13. Comparison of experiment results: (a) Total elapsed time; (b) Length of final path.

Figure 14. Path optimization learning progress status of Q-Learning with action based reward: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.
Figure 15. Path results of Q-Learning with action based reward: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.

5. Conclusions

In this paper, we reviewed the use of reinforcement learning as a path optimization technique in a warehouse environment. We built a simulation environment to test path navigation in a warehouse environment and tested the two most basic algorithms for reinforcement learning, the Q-Learning and Dyna-Q algorithms. We then compared the experimental results in terms of the path search times and the final path lengths.
Compared with the averages of the results of the five experiments, the path search time of the Q-Learning algorithm was about 4.1 times faster than that of the Dyna-Q algorithm, and the final path length of the Dyna-Q algorithm was about 37% shorter than that of the Q-Learning algorithm. As a result, the experiments with both the Q-Learning and Dyna-Q algorithms showed unsatisfactory results for either the path search time or the final path length. To improve the results, we conducted various experiments using the Q-Learning algorithm and found that it does not perform well when finding search paths in a complex environment with many obstacles. The Q-Learning algorithm has the property of finding a path at random. To minimize these random path searching trials, once the target position for movement was determined, the reward value was set high around the target position; thus, the performance and efficiency of the agent improved with 'dynamic reward adjustment based on action'. Agents mainly explore locations with high reward values. In the further experiments, the path search time was about 5.5 times faster than the Dyna-Q algorithm's, and the final path length was about 71% shorter than the Q-Learning algorithm's. It was confirmed that the path search time was slightly faster than that of the Q-Learning algorithm, and the final path length was improved to the same level as that of the Dyna-Q algorithm.

It is a difficult goal to meet both minimization of path search time and high accuracy of path search. If the learning speed is too fast, it will not converge to the optimal value function, and if the learning speed is set too slow, the learning may take too long. In this paper, we have experimentally confirmed how to reduce inefficient exploration with a simple technique in order to improve path search accuracy while maintaining path search time.

In this study, we have seen how to use reinforcement learning for a single agent and a path optimization technique for a single path, but the warehouse environment in the field is much more complex and dynamically changing. In an automated warehouse, multiple mobile robots work simultaneously. Therefore, this basic study is an example of improving the basic algorithm in a single-path condition for a single agent, and it is insufficient to apply the algorithm to an actual warehouse environment. Based on the obtained results, an in-depth study of an improved reinforcement learning algorithm is needed, considering highly challenging multi-agent environments to solve more complex and realistic problems.

Author Contributions: Conceptualization, H.L. and J.J.; methodology, H.L.; software, H.L.; validation, H.L. and J.J.; formal analysis, H.L.; investigation, H.L.; resources, J.J.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, J.J.; visualization, H.L.; supervision, J.J.; project administration, J.J.; funding acquisition, J.J. All authors have read and agreed to the published version of the manuscript.

Funding: This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2018-0-01417) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Acknowledgments: This work was supported by the Smart Factory Technological R&D Program S2727115 funded by the Ministry of SMEs and Startups (MSS, Korea).

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Hajduk, M.; Koukolová, L. Trends in Industrial and Service Robot Application. Appl. Mech. Mater. 2015, 791, 161–165. [CrossRef]
2. Belanche, D.; Casaló, L.V.; Flavián, C. Service robot implementation: A theoretical framework and research agenda. Serv. Ind. J. 2020, 40, 203–225. [CrossRef]
3. Fethi, D.; Nemra, A.; Louadj, K.; Hamerlain, M. Simultaneous localization, mapping, and path planning for unmanned vehicle using optimal control. SAGE 2018, 10, 168781401773665. [CrossRef]
4. Zhou, P.; Wang, Z.-M.; Li, Z.-N.; Li, Y. Complete Coverage Path Planning of Mobile Robot Based on Dynamic Programming Algorithm. In Proceedings of the 2nd International Conference on Electronic & Mechanical Engineering and Information Technology (EMEIT-2012), Shenyang, China, 7 September 2012; pp. 1837–1841.
5. Azadeh, K.; Koster, M.; Roy, D. Robotized and Automated Warehouse Systems: Review and Recent Developments. INFORMS 2019, 53, 917–1212. [CrossRef]
6. Koster, R.; Le-Duc, T.; Jan Roodbergen, K. Design and control of warehouse order picking: A literature review. EJOR 2007, 182, 481–501. [CrossRef]
7. Jan Roodbergen, K.; Vis, I.F.A. A model for warehouse layout. IIE Trans. 2006, 38, 799–811. [CrossRef]
8. Larson, T.N.; March, H.; Kusiak, A. A heuristic approach to warehouse layout with class-based storage. IIE Trans. 1997, 29, 337–348. [CrossRef]
9. Kumara, N.V.; Kumarb, C.S. Development of collision free path planning algorithm for warehouse mobile robot. In Proceedings of the International Conference on Robotics and Smart Manufacturing (RoSMa2018), Chennai, India, 19–21 July 2018; pp. 456–463.
10. Sedighi, K.H.; Ashenayi, K.; Manikas, T.W.; Wainwright, R.L.; Tai, H.-M. Autonomous local path planning for a mobile robot using a genetic algorithm. In Proceedings of the 2004 Congress on Evolutionary Computation, Portland, OR, USA, 19–23 June 2004; pp. 456–463.
11. Lamballaisa, T.; Roy, D.; De Koster, M.B.M. Estimating performance in a Robotic Mobile Fulfillment System. EJOR 2017, 256, 976–990. [CrossRef]
12. Merschformann, M.; Lamballaisb, T.; de Kosterb, M.B.M.; Suhla, L. Decision rules for robotic mobile fulfillment systems. Oper. Res. Perspect. 2019, 6, 100128. [CrossRef]
13. Zhang, H.-Y.; Lin, W.-M.; Chen, A.-X. Path Planning for the Mobile Robot: A Review. Symmetry 2018, 10, 450. [CrossRef]
14. Han, W.-G.; Baek, S.-M.; Kuc, T.-Y. Genetic algorithm based path planning and dynamic obstacle avoidance of mobile robots. In Proceedings of the 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, Orlando, FL, USA, 12–15 October 1997.
15. Nurmaini, S.; Tutuko, B. Intelligent Robotics Navigation System: Problems, Methods, and Algorithm. IJECE 2017, 7, 3711–3726. [CrossRef]
16. Johan, O. Warehouse Vehicle Routing Using Deep Reinforcement Learning. Master's Thesis, Uppsala University, Uppsala, Sweden, November 2019.
17. Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron. Available online: https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/ch01.html (accessed on 22 December 2020).
18. Sutton, R.S.; Barto, A.G. Introduction to Reinforcement Learning, 2nd ed.; MIT Press: London, UK, 2018; pp. 1–528.
19. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed.; Wiley: New York, NY, USA, 1994; pp. 1–649.
20. Deep Learning Wizard. Available online: https://www.deeplearningwizard.com/deep_learning/deep_reinforcement_learning_pytorch/bellman_mdp/ (accessed on 22 December 2020).
21. Meerza, S.I.A.; Islam, M.; Uzzal, M.M. Q-Learning Based Particle Swarm Optimization Algorithm for Optimal Path Planning of Swarm of Mobile Robots. In Proceedings of the 1st International Conference on Advances in Science, Engineering and Robotics Technology, Dhaka, Bangladesh, 3–5 May 2019.
22. Hsu, Y.-P.; Jiang, W.-C. A Fast Learning Agent Based on the Dyna Architecture. J. Inf. Sci. Eng. 2014, 30, 1807–1823.
23. Sichkar, V.N. Reinforcement Learning Algorithms in Global Path Planning for Mobile Robot. In Proceedings of the 2019 International Conference on Industrial Engineering Applications and Manufacturing, Sochi, Russia, 25–29 March 2019.
24. Jang, B.; Kim, M.; Harerimana, G.; Kim, J.W. Q-Learning Algorithms: A Comprehensive Classification and Applications. IEEE Access 2019, 7, 133653–133667. [CrossRef]
25. Kirtas, M.; Tsampazis, K.; Passalis, N. Deepbots: A Webots-Based Deep Reinforcement Learning Framework for Robotics. In Proceedings of the 16th IFIP WG 12.5 International Conference AIAI 2020, Marmaras, Greece, 5–7 June 2020; pp. 64–75.
26. Altuntas, N.; Imal, E.; Emanet, N.; Ozturk, C.N. Reinforcement learning-based mobile robot navigation. Turk. J. Electr. Eng. Comput. Sci. 2016, 24, 1747–1767. [CrossRef]
27. Glavic, M.; Fonteneau, R.; Ernst, D. Reinforcement Learning for Electric Power System Decision and Control: Past Considerations and Perspectives. In Proceedings of the 20th IFAC World Congress, Toulouse, France, 14 July 2017; pp. 6918–6927.
28. Wang, Y.-C.; Usher, J.M. Application of reinforcement learning for agent-based production scheduling. Eng. Appl. Artif. Intell. 2005, 18, 73–82. [CrossRef]
29. Mehta, N.; Natarajan, S.; Tadepalli, P.; Fern, A. Transfer in variable-reward hierarchical reinforcement learning. Mach. Learn. 2008, 73, 289. [CrossRef]
30. Hu, Z.; Wan, K.; Gao, X.; Zhai, Y. A Dynamic Adjusting Reward Function Method for Deep Reinforcement Learning with Adjustable Parameters. Math. Probl. Eng. 2019, 2019, 7619483. [CrossRef]