Applied Sciences
Article
Mobile Robot Path Optimization Technique Based on
Reinforcement Learning Algorithm in Warehouse Environment
HyeokSoo Lee and Jongpil Jeong *
Department of Smart Factory Convergence, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu,
Suwon 16419, Korea; huxlee@g.skku.edu
* Correspondence: jpjeong@skku.edu; Tel.: +82-31-299-4260
Abstract: This paper reports on the use of reinforcement learning technology for optimizing mobile
robot paths in a warehouse environment with automated logistics. First, we compared the results of
experiments conducted using two basic algorithms to identify the fundamentals required for planning
the path of a mobile robot and utilizing reinforcement learning techniques for path optimization. The
algorithms were tested using a path optimization simulation of a mobile robot in the same experimental
environment and conditions. Thereafter, we attempted to improve the previous experiment and
conducted additional experiments to confirm the improvement. The experimental results helped us
understand the characteristics and differences in the reinforcement learning algorithm. The findings
of this study will facilitate our understanding of the basic concepts of reinforcement learning for
further studies on more complex and realistic path optimization algorithm development.
Keywords: warehouse environment; mobile robot path optimization; reinforcement learning
Citation: Lee, H.; Jeong, J. Mobile Robot Path Optimization Technique Based on Reinforcement Learning Algorithm in Warehouse Environment. Appl. Sci. 2021, 11, 1209. https://doi.org/10.3390/app11031209

Academic Editor: Marina Paolanti
Received: 30 December 2020
Accepted: 26 January 2021
Published: 28 January 2021

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

The fourth industrial revolution and digital innovation have led to increased interest in the use of robots across industries; thus, robots are being used in various industrial sites. Robots can be broadly categorized as industrial or service robots. Industrial robots are automated equipment and machines used in industrial sites such as factory lines for assembly, machining, inspection, and measurement as well as pressing and welding [1]. The use of service robots is expanding in manufacturing environments and other areas such as industrial sites, homes, and institutions. Service robots can be broadly divided into personal service robots and professional service robots. In particular, professional service robots are used by workers to perform tasks, and mobile robots are one type of professional service robot [2].

A mobile robot needs the following elements to perform tasks: localization, mapping, and path planning. First, the location of the robot must be known in relation to the surrounding environment. Second, a map is needed to identify the environment and move accordingly. Third, path planning is necessary to find the best path to perform a given task [3]. Path planning has two types: point-to-point and complete coverage. Complete coverage path planning is performed when all positions must be checked, such as for a cleaning robot. Point-to-point path planning is performed for moving from the start position to the target position [4].

To review the use of reinforcement learning as a path optimization technique for mobile robots in a warehouse environment, the following points must be understood. First, it is necessary to understand the warehouse environment and the movement of mobile robots. To do this, we must check the configuration and layout of the warehouse in which the mobile robot is used as well as the tasks and actions that the mobile robot performs in this environment. Second, it is necessary to understand the recommended research steps for the path planning and optimization of mobile robots. Third, it is necessary to understand the connections between layers and modules as well as their working when
they are integrated with mobile robots, environments, hardware, and software. With this basic information, a simulation program for real experiments can be implemented. Because reinforcement learning can be tested only when a specific experimental environment is prepared in advance, substantial preparation of the environment is required early in the study, and the researcher must prepare it. Subsequently, the reinforcement learning algorithm to be tested is integrated with the simulation environment for the experiment.

This paper discusses an experiment performed using a reinforcement learning algorithm as a path optimization technique in a warehouse environment. Section 2 briefly describes the warehouse environment, the concept of path planning for mobile robots, the concept of reinforcement learning, and the algorithms to be tested. Section 3 briefly describes the structure of the mobile robot path-optimization system, the main logic of path search, and the idea of dynamically adjusting the reward value according to the moving action. Section 4 describes the construction of the simulation environment for the experiments, the parameters and reward values used for the experiments, and the experimental results. Section 5 presents the conclusions and directions for future research.

2. Related Work

2.1. Warehouse Environment

2.1.1. Warehouse Layout and Components

The major components of an automated warehouse with mobile robots are as follows [5–10].

• Robot drive device: It moves the robot, as instructed from the center, to pick or load goods between the workstation and the inventory pod.
• Inventory pod: This is a shelf rack that includes storage boxes. There are two standard sizes: a small pod can hold up to 450 kg, and a large pod can load up to 1300 kg.
• Workstation: This is the work area in which the worker operates and performs tasks such as replenishment, picking, and packaging.

Figure 1 shows an example of a warehouse layout with inventory pods and workstations.

Figure 1. Sample Layout of Warehouse Environment.
2.1.2. Mobile Robot's Movement Type

In the robotic mobile fulfillment system, a robot performs a storage trip, in which it transports goods, and a retrieval trip, in which it moves to pick up goods. After dropping the goods into the inventory pod, additional goods can be picked up or returned to the workstation [11].

The main movements of the mobile robot in the warehouse are to perform the following tasks [12]:

1. move goods from the workstation to the inventory pod;
2. move from one inventory pod to another to pick up goods;
3. move goods from the inventory pod to the workstation.
2.2. Path Planning for Mobile Robot

For path planning in a given work environment, a mobile robot searches for an optimal path from the starting point to the target point according to specific performance criteria. In general, path planning is performed for two path types, global and local, depending on whether the mobile robot has environmental information. In the case of global path planning, all information about the environment is provided to the mobile robot before it starts moving. In local path planning, the robot does not know most information about the environment before it moves [13,14].

The path planning development process for mobile robots usually follows the steps below in Figure 2 [13].

• Environmental modeling
• Optimization criteria
• Path finding algorithm
Figure 2. Development steps for global/local path optimization [13].

The classification of the path planning of the mobile robot is illustrated in Figure 3.
Figure 3. Classification of path planning and global path search algorithm [13].
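As a concrete instance of global path search on a fully known grid map, the sketch below uses plain breadth-first search, which finds a shortest obstacle-free path on an unweighted grid; the Dijkstra, A*, and D* algorithms discussed next generalize this idea with edge costs and heuristics. The 5 × 5 map is an invented example, not a layout from the paper.

```python
from collections import deque

# Global path search on a fully known grid map: breadth-first search
# finds a shortest obstacle-free path on an unweighted grid.
# The 5x5 map below is illustrative only.
grid = [
    "S....",
    ".###.",
    ".....",
    ".###.",
    "....G",
]
N = len(grid)

def neighbors(r, c):
    """4-connected free cells inside the map."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < N and 0 <= nc < N and grid[nr][nc] != "#":
            yield nr, nc

start, goal = (0, 0), (4, 4)
prev = {start: None}
queue = deque([start])
while queue:
    cur = queue.popleft()
    if cur == goal:
        break
    for nxt in neighbors(*cur):
        if nxt not in prev:   # first visit = shortest in an unweighted grid
            prev[nxt] = cur
            queue.append(nxt)

path = []                     # walk the predecessor chain back to the start
node = goal
while node is not None:
    path.append(node)
    node = prev[node]
path.reverse()
print(len(path) - 1)  # 8 moves from S to G
```

Replacing the FIFO queue with a priority queue ordered by accumulated cost gives Dijkstra; adding a distance heuristic to that ordering gives A*.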
Dijkstra, A*, and D* algorithms for the heuristic method are the most commonly used algorithms for path search in global path planning [13,15]. Traditionally, warehouse path planning has been optimized using heuristics and metaheuristics. Most of these methods work without additional learning and require significant computation time as the problem becomes complex. In addition, it is challenging to design and understand heuristic algorithms, making it difficult to maintain and scale the path. Therefore, interest in artificial intelligence technology for path planning is increasing [16].

2.3. Reinforcement Learning Algorithm

Reinforcement learning is a method that uses trial and error to find a policy to reach a goal by learning through failure and reward. Deep neural networks learn weights and biases from labeled data, whereas reinforcement learning learns weights and biases using the concept of rewards [17].

The essential elements of reinforcement learning are as follows; see Figure 4 for how they interact.
Figure 4. The essential elements and interactions of Reinforcement Learning [18].
First, agents are the subjects of learning and decision making. Second, the environment refers to the external world recognized by the agent and interacts with the agent. Third is policy: the agent finds actions to be performed from the state of the surrounding environment. The fourth element is a reward, information that tells the agent whether its behavior was good or bad. In reinforcement learning, positive rewards are given for actions that help agents reach their goals, and negative rewards are given for actions that prevent them from reaching their goals. The fifth element is a value function, which indicates what is good in the long term; value is the total amount of reward an agent can expect. Sixth is the environmental model. Environmental models predict the next state and the reward that changes according to the current state. Planning is a way to decide what to do before experiencing future situations. Reinforcement learning using models and plans is called a model-based method, and reinforcement learning using trial and error is called a model-free method [18].
The basic concept of reinforcement learning modeling aims to find the optimal policy
for problem solving and can be defined on the basis of the Markov Decision Process (MDP).
If the state and action pair is finite, this is called a finite MDP [18]. The most basic Markov process concept in MDP consists of a state s, a state transition probability p, and an action a performed. The p is the probability that, when performing an action a in state s, the state will change to s' [18].
Formula (1) defines the relationship between state, action, and probability as follows [18,19].
p(s' | s, a) = Pr{S_t = s' | S_{t-1} = s, A_{t-1} = a}    (1)
The agent’s goal is to maximize the sum of regularly accumulated rewards, and the
expected return is the sum of rewards. In other words, using rewards to formalize the
concept of a goal is characteristic of reinforcement learning. In addition, the discount factor
is also important in the concept of reinforcement learning. The discount factor (γ) is a
coefficient that discounts the future expected reward and has a value between 0 and 1 [18]. The discounted return obtained through the reward and the discount factor can be defined as the following Formula (2) [18,19].
G_t = R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}    (2)
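As a quick illustration of Formula (2), the discounted return can be computed directly from a reward sequence; the reward values and γ below are hypothetical, chosen only for the example.

```python
# Discounted return, Formula (2): G_t = sum over k of gamma^k * R_{t+k+1}.
# The reward sequence and discount factor are illustrative values only.

def discounted_return(rewards, gamma):
    """Sum future rewards, discounting the k-th one by gamma^k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward of 1 now and 10 three steps later, with gamma = 0.9:
g = discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9)
print(round(g, 2))  # 8.29, i.e. 1 + 0.9**3 * 10
```

With γ close to 1 the distant reward of 10 dominates; with γ close to 0 the immediate reward does, which is exactly the trade-off the discount factor controls.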
The reinforcement learning algorithm has the concept of a value function, a function of
a state that estimates whether the environment given to an agent is good or bad. Moreover,
the value function is defined according to the behavioral method called policy. Policy is
the probability of choosing possible actions in a state [18].
Whereas the value function considers the cumulative reward together with the change of state, the state value function also takes the policy into account [20]. The value function is the expected value of the return if policy π is followed starting from state s, and the formula is as follows [18,20].
v_π(s) = E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]    (3)
Among the policies, the policy with the highest expected value of return is the optimal
policy, and the optimal policy has an optimal state value function and an optimal action
value function [18]. The optimal state value function is given by Equation (4), and the optimal action value function by Equations (5) and (6) [18,20].
v_*(s) = max_π v_π(s)    (4)

q_*(s, a) = max_π q_π(s, a)    (5)

          = E[R_{t+1} + γ v_*(S_{t+1}) | S_t = s, A_t = a]    (6)
The value of a state that follows an optimal policy is equal to the expected value of
the return from the best action that can be chosen in that state [18]. The formula is as
follows [18,20].
q_*(s, a) = E[R_{t+1} + γ max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a]    (7)

          = Σ_{s',r} p(s', r | s, a) [r + γ max_{a'} q_*(s', a')]    (8)
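Equations (4)–(8) are also the basis of value iteration: repeatedly applying the Bellman optimality backup converges to the optimal values. The two-state MDP below (its states, transitions, and rewards) is invented purely for illustration and is not taken from the paper.

```python
# Value iteration: repeatedly apply the Bellman optimality backup,
# v(s) <- max_a sum_{s',r} p(s', r | s, a) [r + gamma * v(s')],
# until the state values converge.  The two-state MDP is illustrative.

# p[(s, a)] = list of (probability, next_state, reward)
p = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "go"):   [(1.0, 1, 1.0)],
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "go"):   [(1.0, 0, 0.0)],
}
gamma = 0.5
v = {0: 0.0, 1: 0.0}

for _ in range(100):  # iterate the backup until (approximate) convergence
    v = {s: max(sum(prob * (r + gamma * v[s2])
                    for prob, s2, r in p[(s, a)])
                for a in ("stay", "go"))
         for s in (0, 1)}

print(v)  # converges to v(0) = 3.0, v(1) = 4.0 (staying in state 1 pays 2/(1-0.5))
```

Because γ = 0.5, each sweep halves the remaining error, so 100 sweeps are far more than enough for convergence here.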
2.3.1. Q-Learning

One of the most basic reinforcement learning algorithms is the off-policy temporal-difference control algorithm known as Q-Learning [18]. The Q-Learning algorithm stores the state and action in the Q Table in the form of a Q Value; for action selection, it selects a random action or the maximum-value action according to the epsilon-greedy value and stores the reward values in the Q Table [18].
The algorithm stores the values in the Q Table using Equation (9) below [21].
Q(s, a) = r(s, a) + max_a Q(s', a)    (9)
Equation (9) is generally converted into the update Equation (10) by considering a factor of time [18,22,23].

Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]    (10)
The procedure of the Q-Learning algorithm is as follows (Algorithm 1) [18].
Algorithm 1: Q-Learning.

Initialize Q(s, a)
Loop for each episode:
    Initialize s (state)
    Loop for each step of episode until s (state) is destination:
        Choose a (action) from s (state) using policy from Q values
        Take a (action), find r (reward), s' (next state)
        Calculate Q value and save it into Q Table
        Change s' (next state) to s (state)
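Algorithm 1 can be sketched in a few lines of Python. The 4 × 4 grid, reward values, and hyperparameters below are hypothetical stand-ins, not the paper's actual warehouse setup.

```python
import random

# Tabular Q-Learning (Algorithm 1) on a tiny grid; the layout, rewards,
# and hyperparameters are illustrative, not the paper's setup.
random.seed(0)
N = 4
GOAL = (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

Q = {((r, c), a): 0.0 for r in range(N) for c in range(N) for a in ACTIONS}

def step(s, a):
    """One move; bumping a wall keeps the robot in place with a penalty."""
    r, c = s[0] + a[0], s[1] + a[1]
    if not (0 <= r < N and 0 <= c < N):
        return s, -1.0
    return (r, c), (10.0 if (r, c) == GOAL else -0.1)

for episode in range(500):
    s = (0, 0)
    while s != GOAL:
        # epsilon-greedy: explore at random, otherwise exploit the Q Table
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, reward = step(s, a)
        # the update of Equation (10)
        best_next = max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])
        s = s2

# A greedy rollout with the learned Q Table should now reach the goal.
s, steps = (0, 0), 0
while s != GOAL and steps < 20:
    s, _ = step(s, max(ACTIONS, key=lambda x: Q[(s, x)]))
    steps += 1
print(s == GOAL)  # True
```

The epsilon-greedy choice implements the exploration/exploitation trade-off discussed next: with probability ε the agent explores a random action, otherwise it exploits the current Q values.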
In reinforcement learning algorithms, the process of exploration and exploitation plays an important role. Initially, lower priority actions should be selected to investigate the best possible environment, and then higher priority actions are selected and pursued [23]. The Q-Learning algorithm has become the basis of the reinforcement learning algorithm because it is simple and shows excellent learning ability in a single-agent environment. However, Q-Learning updates only once per action; it is thus difficult to solve complex problems effectively when many states and actions have not yet been realized.

In addition, because the Q table for reward values is predefined, a considerable amount of storage memory is required. For this reason, the Q-Learning algorithm is disadvantageous in a complex and advanced multi-agent environment [24].
2.3.2. Dyna-Q

Dyna-Q is a combination of the Q-Learning and Q-Planning algorithms. The Q-Learning algorithm changes a policy based on real experience. The Q-Planning algorithm obtains simulated experience through search control and changes the policy based on simulated experience [18]. Figure 5 illustrates the structure of the Dyna-Q model.
Figure 5. Dyna-Q model structure [18].
The Q-Planning algorithm randomly selects samples only from previously experienced information, and the Dyna-Q model obtains information for simulated experience through search control exactly as it obtains information through model learning based on real experience [18]. The Dyna-Q update equation is the same as that for Q-Learning (10) [22]. The Dyna-Q algorithm is as follows (Algorithm 2) [18].
Algorithm 2: Dyna-Q.

Initialize Q(s, a)
Loop forever:
    (a) Get s (state)
    (b) Choose a (action) from s (state) using policy from Q values
    (c) Take a (action), find r (reward), s' (next state)
    (d) Calculate Q value and save it into Q Table
    (e) Save s (state), a (action), r (reward), s' (next state) in memory
    (f) Loop for n times:
            Take s (state), a (action), r (reward), s' (next state)
            Calculate Q value and save it into Q Table
    (g) Change s' (next state) to s (state)
Steps (a) to (d) and (g) are the same as in Q-Learning. Steps (e) and (f), added in the Dyna-Q algorithm, store past experiences in memory to generate the next state and a more accurate reward value.
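Steps (e) and (f) amount to replaying stored transitions after each real update. The sketch below shows only that planning loop; the environment, the deterministic model dictionary, and the hyperparameters are illustrative assumptions, not the paper's configuration.

```python
import random

# Dyna-Q planning (Algorithm 2, steps (e)-(f)): after each real update,
# replay n transitions sampled from the learned model.  The environment,
# model contents, and hyperparameters are illustrative assumptions.
random.seed(1)
ALPHA, GAMMA, N_PLANNING = 0.5, 0.9, 10
ACTIONS = ["left", "right"]
Q = {}
model = {}  # (s, a) -> (r, s2): deterministic model built from real experience

def q(s, a):
    return Q.get((s, a), 0.0)

def update(s, a, r, s2):
    """Same update rule as Q-Learning, Equation (10)."""
    best_next = max(q(s2, x) for x in ACTIONS)
    Q[(s, a)] = q(s, a) + ALPHA * (r + GAMMA * best_next - q(s, a))

def dyna_q_step(s, a, r, s2):
    update(s, a, r, s2)          # (d) direct learning from real experience
    model[(s, a)] = (r, s2)      # (e) store the transition in memory
    for _ in range(N_PLANNING):  # (f) planning: replay sampled experience
        ps, pa = random.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        update(ps, pa, pr, ps2)

# One real transition, then 10 simulated replays of it:
dyna_q_step(s=0, a="right", r=1.0, s2=1)
print(q(0, "right") > 0.9)  # True: repeated replay drives Q toward r = 1.0
```

The planning loop is what lets Dyna-Q extract far more learning from each real action than plain Q-Learning, at the cost of storing the model in memory.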
3. Reinforcement Learning-Based Mobile Robot Path Optimization

3.1. System Structure

In the mobile robot framework structure in Figure 6, the agent is controlled by the supervisor controller, and the robot controller is executed by the agent, so that the robot operates in the warehouse environment and the necessary information is observed and collected [25–27].
Figure 6. Warehouse mobile robot framework structure.
3.2. Operation Procedure

The mobile robot framework works as follows. At the beginning, the simulator is set to its initial state, and the mobile robot collects the first observation information and sends it to the supervisor controller. The supervisor controller verifies the received information and communicates the necessary action to the agent. At this stage, the supervisor controller can obtain additional information if necessary. The supervisor controller can communicate job information to the agent. The agent conveys the action to the mobile robot through the robot controller, and the mobile robot performs the received action using actuators. After the action is performed by the robot controller, the changed position and status information is transmitted to the agent and the supervisor controller, and the supervisor controller calculates the reward value for the performed action. Subsequently, the action, reward value, and status information are stored in memory. These procedures are repeated until the mobile robot achieves the defined goal, or until the defined number of epochs/episodes or the conditions that satisfy the goal are reached [25].
In Figure 7, we show an example class that is composed of the essential functions and main logic required by the supervisor controller for path planning.

Figure 7. Required methods and logic of path finding [25].
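The operation procedure above can be condensed into a small control loop. All class names, method names, and the toy agent and robot below are hypothetical illustrations of the described flow, not the paper's actual framework code.

```python
# Sketch of the Section 3.2 operation procedure: the supervisor controller
# receives observations, relays actions, scores them, and stores
# transitions.  All names here are hypothetical illustrations.

class GreedyAgent:
    """Toy agent: moves one grid axis at a time toward the goal."""
    def __init__(self, goal):
        self.goal = goal
    def choose(self, obs):
        if obs[0] != self.goal[0]:
            return (1 if self.goal[0] > obs[0] else -1, 0)
        return (0, 1 if self.goal[1] > obs[1] else -1)

class SimRobot:
    """Toy robot controller: actuating just adds the commanded offset."""
    def reset(self):
        self.pos = (0, 0)
        return self.pos
    def execute(self, action):
        self.pos = (self.pos[0] + action[0], self.pos[1] + action[1])
        return self.pos

class SupervisorController:
    def __init__(self, agent, robot, goal):
        self.agent, self.robot, self.goal = agent, robot, goal
        self.memory = []   # (observation, action, reward, next observation)

    def compute_reward(self, obs):
        return 10.0 if obs == self.goal else -0.1

    def run_episode(self, max_steps=100):
        obs = self.robot.reset()                    # simulator initial state
        for _ in range(max_steps):
            action = self.agent.choose(obs)         # action conveyed to agent
            next_obs = self.robot.execute(action)   # robot actuates, reports back
            reward = self.compute_reward(next_obs)  # supervisor scores the move
            self.memory.append((obs, action, reward, next_obs))
            obs = next_obs
            if obs == self.goal:                    # defined goal reached
                break
        return obs

sup = SupervisorController(GreedyAgent((2, 2)), SimRobot(), (2, 2))
final = sup.run_episode()
print(final, len(sup.memory))  # (2, 2) 4
```

Keeping reward calculation and transition storage inside the supervisor, as described above, leaves the agent and robot controller free of learning-specific logic.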
3.3. Dynamic Reward Adjustment Based on Action

The simplest type of reward for the movement of a mobile robot in a grid-type warehouse environment is achieved by specifying a negative reward value for an obstacle and a positive reward value for the target location. The direction of the mobile robot is determined by its current and target positions. The path planning algorithm requires significant time to find the best path by repeating many random path finding attempts. In the case of the basic reward method introduced earlier, an agent searches randomly for a path, regardless of whether it is located far from or close to the target position. To improve on these vulnerabilities, we considered a reward method as shown in Figure 8.

The suggestions are as follows. First, when the travel path and target position are determined, a partial area close to the target position is selected (a reward adjustment area close to the target position). Second, a higher reward value than in the basic reward method is assigned to the path positions within the area selected in the previous step. Third, we propose to select such an area whenever the movement path changes and to change the positions of the movement path in the selected area to a high reward value. This is called dynamic reward adjustment based on action, and the following (Algorithm 3) describes the process.
Algorithm 3: Dynamic reward adjustment based on action.
1. Get new destination to move.
2. Select a partial area near the target position.
3. A higher positive reward value is assigned to the selected area, excluding the target point and obstacles.
(Highest positive reward value to target position.)
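The steps of Algorithm 3 can be sketched as a function that rewrites a reward grid around the target. This is an illustrative sketch, not the authors' implementation: the square selection area, its radius, and the reward values used here are assumptions for demonstration only.

```python
def adjust_rewards(rewards, target, obstacles, radius=1,
                   area_reward=0.3, target_reward=10.0):
    """Sketch of dynamic reward adjustment based on action: boost the
    reward in a partial area around the target so the agent's random
    exploration is drawn toward it (radius and values are illustrative)."""
    adjusted = [row[:] for row in rewards]  # copy the grid (list of lists)
    tr, tc = target
    rows, cols = len(rewards), len(rewards[0])
    # Step 2: select a partial (square) area near the target position.
    for r in range(max(0, tr - radius), min(rows, tr + radius + 1)):
        for c in range(max(0, tc - radius), min(cols, tc + radius + 1)):
            # Step 3: higher positive reward inside the area, excluding
            # the target point and obstacles.
            if (r, c) != target and (r, c) not in obstacles:
                adjusted[r][c] = area_reward
    adjusted[tr][tc] = target_reward  # highest positive reward at the target
    return adjusted
```

The function would be re-run whenever a new destination is assigned (Step 1), so the boosted area follows the current movement path.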
Figure 8. Dynamic adjusting reward values for the moving action.
4. Test and Results
4.1. Warehouse Simulation Environment
To create a simulation environment, we referred to the paper [23] and built a virtual
experiment environment based on the previously described warehouse environment. A total
of 15 inventory pods were included in a 25 × 25 grid layout. A storage trip of a mobile
robot from the workstation to the inventory pod was simulated. This experiment was
conducted to find the optimal path to the target position, avoiding inventory pods, from
the starting position where a single mobile robot was defined. Figure 9 shows the target
position for the experiments.
Figure 9. Target position of simulation experiment.
4.2. Test Parameter Values
Q-Learning and Dyna-Q algorithms were tested in a simulation environment. Table 1 shows
the parameter values used in the experiment.
Table 1. Parameter values for experiment.
Parameter             Q-Learning    Dyna-Q
Learning Rate (α)     0.01          0.01
Discount Rate (γ)     0.9           0.9
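With the parameters of Table 1, the core of the tabular Q-Learning agent can be sketched as follows. This is a minimal sketch under stated assumptions: the four grid actions and the epsilon-greedy exploration rate are not given in Table 1 and are assumed here for illustration.

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.01, 0.9      # learning and discount rates from Table 1
EPSILON = 0.1                 # exploration rate (assumed; not in Table 1)
ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)        # Q[(state, action)] -> value, zero-initialized

def choose_action(state):
    # Epsilon-greedy selection over the four grid moves.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    # One-step Q-Learning update:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

With a learning rate of 0.01, a single rewarded transition moves the Q-value only slightly, which is consistent with the many episodes the experiments require.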
4.3. Test Reward Values
The experiment was conducted with the following reward methods.
• Basic method: The obstacle was set as −1, the target position as 1, and the other
positions as 0.
• Dynamic reward adjustment based on action: To check the efficacy of the proposal, an
experiment was conducted designating set values that were similar to those in the
proposed method, as shown in Figure 10 [28–30]. The obstacle was set as −1, the target
position as 10, the positions in the selected area (area close to target) as 0.3, and
the other positions as 0.
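The two reward configurations above can be sketched side by side as grid constructors. The values (−1, 1, 10, 0.3, 0) come from this section; the function names and the explicit `selected_area` argument are assumptions for illustration.

```python
def basic_rewards(rows, cols, target, obstacles):
    # Basic method: obstacles -1, target +1, all other positions 0.
    grid = [[0.0] * cols for _ in range(rows)]
    for r, c in obstacles:
        grid[r][c] = -1.0
    grid[target[0]][target[1]] = 1.0
    return grid

def action_based_rewards(rows, cols, target, obstacles, selected_area):
    # Dynamic reward adjustment values: obstacles -1, target +10,
    # positions in the selected area near the target 0.3, others 0.
    grid = [[0.0] * cols for _ in range(rows)]
    for r, c in selected_area:
        grid[r][c] = 0.3
    for r, c in obstacles:
        grid[r][c] = -1.0  # obstacles keep their penalty even inside the area
    grid[target[0]][target[1]] = 10.0
    return grid
```

The only difference the agent sees between the two experiments is this reward grid; the learning algorithm itself is unchanged.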
Figure 10. Selected area for dynamic reward adjustment based on action.
4.4. Results
First, we tested two basic reinforcement learning algorithms: the Q-Learning algorithm
and the Dyna-Q algorithm. Because the results may deviate between runs, the experiment
was performed five times for each algorithm under the same conditions and environment.
The following points were verified
and compared using the experimental results. The optimization efficiency of each algorithm
was checked based on the length of the final path, and the path optimization performance
was confirmed by the path search time. We were also able to identify learning patterns by
the number of learning steps per episode.
The path search time of the Q-Learning algorithm was 19.18 min for the first test,
18.17 min for the second test, 24.28 min for the third test, 33.92 min for the fourth test, and
19.91 min for the fifth test. It took about 23.09 min on average. The path search time of the
Dyna-Q algorithm was 90.35 min for the first test, 108.04 min for the second test, 93.01 min
for the third test, 100.79 min for the fourth test, and 81.99 min for the fifth test. The overall
average took about 94.83 min.
The final path length of the Q-Learning algorithm was 65 steps for the first test,
34 steps for the second test, 60 steps for the third test, 91 steps for the fourth test,
and 49 steps for the fifth test; the average was 59.8 steps. The final path length of
the Dyna-Q algorithm was 38 steps for the first test, 38 steps for the second test,
38 steps for the third test, 34 steps for the fourth test, and 38 steps for the fifth
test; the average was 37.2 steps.
The Q-Learning algorithm was faster than Dyna-Q in terms of path search time, and Dyna-Q
selected shorter and simpler final paths than those selected by Q-Learning. Figures 11
and 12 show the learning pattern through the number of training steps per episode.
Q-Learning found the path by repeating many steps throughout the 5000 episodes, whereas
Dyna-Q stabilized after finding the optimal path between 1500 and 3000 episodes. In both
algorithms, the maximum number of steps rarely exceeded 800. All tests ran 5000 episodes.
In the case of the Q-Learning algorithm, the second test exceeded 800 steps 5 times
(0.1%) and the fourth test exceeded 800 steps 5 times (0.1%); thus, the Q-Learning tests
exceeded 800 steps at a total level of 0.04% over the five tests. In the case of the
Dyna-Q algorithm, the first test exceeded 800 steps once (0.02%), the third test 7 times
(0.14%), and the fifth test 3 times (0.06%); thus, the Dyna-Q tests exceeded 800 steps
at a total level of 0.044% over the five tests. To compare the path search time with the
learning pattern results, we compared experiment (e) of the two algorithms. The path
search time was 19.91 min for the Q-Learning algorithm and 81.99 min for Dyna-Q. Looking
at the learning patterns in Figures 11 and 12, the Q-Learning algorithm learned by
executing 200 to 800 steps per episode until the end of the 5000 episodes, while the
Dyna-Q algorithm did so only until about the middle of episode 2000. However, the
Q-Learning algorithm's search time was more than 75% shorter than the Dyna-Q algorithm's.
From this perspective, it was confirmed that Dyna-Q takes longer to learn in a single
episode than Q-Learning.
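The longer per-episode time of Dyna-Q comes from its planning phase: after every real transition, it replays simulated transitions from a learned model. A minimal sketch of one Dyna-Q step follows; the number of planning steps (`N_PLANNING`) is an assumed illustrative value, not one reported in the paper.

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.01, 0.9   # learning and discount rates from Table 1
N_PLANNING = 10            # planning updates per real step (assumed value)
ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)
model = {}                 # (state, action) -> (reward, next_state)

def q_update(s, a, r, s2):
    # Standard one-step Q-Learning update rule.
    best_next = max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def dyna_q_step(s, a, r, s2):
    # Direct RL update from the real transition...
    q_update(s, a, r, s2)
    model[(s, a)] = (r, s2)
    # ...then N_PLANNING simulated updates replayed from the learned model.
    # These extra updates per real step are why each Dyna-Q episode takes
    # longer in wall-clock time, even though fewer episodes are needed.
    for _ in range(N_PLANNING):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps2)
```

Each real step thus performs 1 + N_PLANNING Q-value updates, trading computation per episode for faster convergence in episode count, which matches the observed stabilization of Dyna-Q between episodes 1500 and 3000.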
Figure 11. Path optimization learning progress status of Q-Learning: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.
Figure 12. Path optimization learning progress status of Dyna-Q: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.
Subsequently, additional experiments were conducted by applying 'dynamic reward
adjustment based on action' to the Q-Learning algorithm. The path search time of this
experiment was 19.04 min for the first test, 15.21 min for the second test, 17.05 min
for the third test, 17.51 min for the fourth test, and 17.38 min for the fifth test; it
took about 17.23 min on average. The final path length of this experiment was 38 steps
for the first test, 34 steps for the second test, 34 steps for the third test, 34 steps
for the fourth test, and 34 steps for the fifth test; the average was 34.8 steps. The
experimental results can be seen in Figure 13, which also includes the experimental
results of the Q-Learning and Dyna-Q tests conducted previously. In the case of the
Q-Learning algorithm to which 'dynamic reward adjustment based on action' was applied,
the path search time was slightly faster than that of Q-Learning, and the final path
length was confirmed to be improved to a level similar to that of Dyna-Q. Figure 14
shows the learning pattern obtained in the additional experiments compared with the
result of Q-Learning; there is no outstanding difference. Figure 15 shows, on the map,
the paths found through the experiments of the Q-Learning algorithm applying 'dynamic
reward adjustment based on action'.
Figure 13. Comparison of experiment results: (a) Total elapsed time; (b) Length of final path.
Figure 14. Path optimization learning progress status of Q-Learning with action based reward: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.
Figure 15. Path results of Q-Learning with action based reward: (a) 1st test; (b) 2nd test; (c) 3rd test; (d) 4th test; (e) 5th test.
5. Conclusions
In this paper, we reviewed the use of reinforcement learning as a path optimization
technique in a warehouse environment. We built a simulation environment to test path
navigation in a warehouse environment and tested the two most basic reinforcement
learning algorithms, the Q-Learning and Dyna-Q algorithms. We then compared the
experimental results in terms of the path search times and the final path lengths.
Compared over the averages of the five experiments, the path search time of the
Q-Learning algorithm was about 4.1 times faster than the Dyna-Q algorithm's, and the
final path length of the Dyna-Q algorithm was about 37% shorter than the Q-Learning
algorithm's. As a result, the experimental results of both the Q-Learning and Dyna-Q
algorithms were unsatisfactory for either the path search time or the final path length.
To improve the results, we
conducted various experiments using the Q-Learning algorithm and found that it does not
perform well when finding search paths in a complex environment with many obstacles.
The Q-Learning algorithm has the property of finding a path at random. To minimize these
random path-searching trials, once the target position for movement was determined, the
reward value was set high around the target position; thus, the performance and
efficiency of the agent improved with 'dynamic reward adjustment based on action', since
agents mainly explore locations with high reward values. In the further experiments, the
path search time was about 5.5 times faster than the Dyna-Q algorithm's, and the final
path length was about 42% shorter than the Q-Learning algorithm's. It was confirmed that
the path search time was slightly faster than the Q-Learning algorithm's, and the final
path length improved to the same level as the Dyna-Q algorithm's.
It is difficult to achieve both a minimal path search time and high path search accuracy.
If the learning rate is too high, the algorithm will not converge to the optimal value
function, and if the learning rate is set too low, learning may take too long. In this
paper, we have experimentally confirmed how a simple technique that reduces inefficient
exploration can improve path search accuracy while maintaining path search time.
In this study, we examined how to use reinforcement learning for a single agent and a
path optimization technique for a single path, but the warehouse environment in the
field is much more complex and changes dynamically; in an automated warehouse, multiple
mobile robots work simultaneously. This basic study is therefore an example of improving
the basic algorithm under a single-path condition for a single agent, and it is
insufficient to apply the algorithm to an actual warehouse environment. Based on the
obtained results, an in-depth study of an improved reinforcement learning algorithm is
needed, considering highly challenging multi-agent environments, to solve more complex
and realistic problems.
Author Contributions: Conceptualization, H.L. and J.J.; methodology, H.L.; software, H.L.; vali-
dation, H.L. and J.J.; formal analysis, H.L.; investigation, H.L.; resources, J.J.; data curation, H.L.;
writing–original draft preparation, H.L.; writing—review and editing, J.J.; visualization, H.L.; super-
vision, J.J.; project administration, J.J.; funding acquisition, J.J. All authors have read and agreed to
the published version of the manuscript.
Funding: This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the
ITRC (Information Technology Research Center) support program (IITP-2020-2018-0-01417) super-
vised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
Acknowledgments: This work was supported by the Smart Factory Technological R&D Program
S2727115 funded by Ministry of SMEs and Startups (MSS, Korea).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Hajduk, M.; Koukolová, L. Trends in Industrial and Service Robot Application. Appl. Mech. Mater. 2015, 791, 161–165. [CrossRef]
2. Belanche, D.; Casaló, L.V.; Flavián, C. Service robot implementation: A theoretical framework and research agenda. Serv. Ind. J.
2020, 40, 203–225. [CrossRef]
3. Fethi, D.; Nemra, A.; Louadj, K.; Hamerlain, M. Simultaneous localization, mapping, and path planning for unmanned vehicle
using optimal control. SAGE 2018, 10, 168781401773665. [CrossRef]
4. Zhou, P.; Wang, Z.-M.; Li, Z.-N.; Li, Y. Complete Coverage Path Planning of Mobile Robot Based on Dynamic Programming
Algorithm. In Proceedings of the 2nd International Conference on Electronic & Mechanical Engineering and Information
Technology (EMEIT-2012), Shenyang, China, 7 September 2012; pp. 1837–1841.
5. Azadeh, K.; Koster, M.; Roy, D. Robotized and Automated Warehouse Systems: Review and Recent Developments. INFORMS
2019, 53, 917–1212. [CrossRef]
6. Koster, R.; Le-Duc, T.; JanRoodbergen, K. Design and control of warehouse order picking: A literature review. EJOR 2007, 182,
481–501. [CrossRef]Appl. Sci. 2021, 11, 1209 16 of 16
7. Jan Roodbergen, K.; Vis, I.F.A. A model for warehouse layout. IIE Trans. 2006, 38, 799–811. [CrossRef]
8. Larson, T.N.; March, H.; Kusiak, A. A heuristic approach to warehouse layout with class-based storage. IIE Trans. 1997, 29,
337–348. [CrossRef]
9. Kumara, N.V.; Kumarb, C.S. Development of collision free path planning algorithm for warehouse mobile robot. In Proceedings
of the International Conference on Robotics and Smart Manufacturing (RoSMa2018), Chennai, India, 19–21 July 2018; pp. 456–463.
10. Sedighi, K.H.; Ashenayi, K.; Manikas, T.W.; Wainwright, R.L.; Tai, H.-M. Autonomous local path planning for a mobile robot
using a genetic algorithm. In Proceedings of the 2004 Congress on Evolutionary Computation, Portland, OR, USA, 19–23 June
2004; pp. 456–463.
11. Lamballaisa, T.; Roy, D.; De Koster, M.B.M. Estimating performance in a Robotic Mobile Fulfillment System. EJOR 2017, 256,
976–990. [CrossRef]
12. Merschformann, M.; Lamballaisb, T.; de Kosterb, M.B.M.; Suhla, L. Decision rules for robotic mobile fulfillment systems. Oper.
Res. Perspect. 2019, 6, 100128. [CrossRef]
13. Zhang, H.-Y.; Lin, W.-M.; Chen, A.-X. Path Planning for the Mobile Robot: A Review. Symmetry 2018, 10, 450. [CrossRef]
14. Han, W.-G.; Baek, S.-M.; Kuc, T.-Y. Genetic algorithm based path planning and dynamic obstacle avoidance of mobile robots.
In Proceedings of the 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and
Simulation, Orlando, FL, USA, 12–15 October 1997.
15. Nurmaini, S.; Tutuko, B. Intelligent Robotics Navigation System: Problems, Methods, and Algorithm. IJECE 2017, 7, 3711–3726.
[CrossRef]
16. Johan, O. Warehouse Vehicle Routing Using Deep Reinforcement Learning. Master’s Thesis, Uppsala University, Uppsala,
Sweden, November 2019.
17. Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron. Available online: https://www.oreilly.com/
library/view/hands-on-machine-learning/9781491962282/ch01.html (accessed on 22 December 2020).
18. Sutton, R.S.; Barto, A.G. Introduction to Reinforcement Learning, 2nd ed.; MIT Press: London, UK, 2018; pp. 1–528.
19. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed.; Wiley: New York, NY, USA, 1994; pp.
1–649.
20. Deep Learning Wizard. Available online: https://www.deeplearningwizard.com/deep_learning/deep_reinforcement_learning_
pytorch/bellman_mdp/ (accessed on 22 December 2020).
21. Meerza, S.I.A.; Islam, M.; Uzzal, M.M. Q-Learning Based Particle Swarm Optimization Algorithm for Optimal Path Planning of
Swarm of Mobile Robots. In Proceedings of the 1st International Conference on Advances in Science, Engineering and Robotics
Technology, Dhaka, Bangladesh, 3–5 May 2019.
22. Hsu, Y.-P.; Jiang, W.-C. A Fast Learning Agent Based on the Dyna Architecture. J. Inf. Sci. Eng. 2014, 30, 1807–1823.
23. Sichkar, V.N. Reinforcement Learning Algorithms in Global Path Planning for Mobile Robot. In Proceedings of the 2019
International Conference on Industrial Engineering Applications and Manufacturing, Sochi, Russia, 25–29 March 2019.
24. Jang, B.; Kim, M.; Harerimana, G.; Kim, J.W. Q-Learning Algorithms: A Comprehensive Classification and Applications. IEEE
Access 2019, 7, 133653–133667. [CrossRef]
25. Kirtas, M.; Tsampazis, K.; Passalis, N. Deepbots: A Webots-Based Deep Reinforcement Learning Framework for Robotics. In
Proceedings of the 16th IFIP WG 12.5 International Conference AIAI 2020, Marmaras, Greece, 5–7 June 2020; pp. 64–75.
26. Altuntas, N.; Imal, E.; Emanet, N.; Ozturk, C.N. Reinforcement learning-based mobile robot navigation. Turk. J. Electr. Eng.
Comput. Sci. 2016, 24, 1747–1767. [CrossRef]
27. Glavic, M.; Fonteneau, R.; Ernst, D. Reinforcement Learning for Electric Power System Decision and Control: Past Considerations
and Perspectives. In Proceedings of the 20th IFAC World Congress, Toulouse, France, 14 July 2017; pp. 6918–6927.
28. Wang, Y.-C.; Usher, J.M. Application of reinforcement learning for agent-based production scheduling. Eng. Appl. Artif. Intell.
2005, 18, 73–82. [CrossRef]
29. Mehta, N.; Natarajan, S.; Tadepalli, P.; Fern, A. Transfer in variable-reward hierarchical reinforcement learning. Mach. Learn. 2008,
73, 289. [CrossRef]
30. Hu, Z.; Wan, K.; Gao, X.; Zhai, Y. A Dynamic Adjusting Reward Function Method for Deep Reinforcement Learning with
Adjustable Parameters. Math. Probl. Eng. 2019, 2019, 7619483. [CrossRef]