A Deep Reinforcement Learning Algorithm Based on Tetanic Stimulation and Amnesic Mechanisms for Continuous Control of Multi-DOF Manipulator
Yangyang Hou, Huajie Hong *, Dasheng Xu, Zhe Zeng, Yaping Chen and Zhaoyang Liu

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China; houyangyang14@nudt.edu.cn (Y.H.); ixudasheng@163.com (D.X.); zengzhe@nudt.edu.cn (Z.Z.); yaping_chen2021@163.com (Y.C.); ZhaoyangLiuNUDT@163.com (Z.L.)
* Correspondence: honghuajie@nudt.edu.cn; Tel.: +86-138-7313-0046

Academic Editors: Ioan Doroftei and Karsten Berns. Received: 11 August 2021; Accepted: 20 September 2021; Published: 29 September 2021.

Citation: Hou, Y.; Hong, H.; Xu, D.; Zeng, Z.; Chen, Y.; Liu, Z. A Deep Reinforcement Learning Algorithm Based on Tetanic Stimulation and Amnesic Mechanisms for Continuous Control of Multi-DOF Manipulator. Actuators 2021, 10, 254. https://doi.org/10.3390/act10100254

Abstract: Deep Reinforcement Learning (DRL) has been an active research area in view of its capability in solving large-scale control problems. Until presently, many algorithms have been developed, such as Deep Deterministic Policy Gradient (DDPG), Twin-Delayed Deep Deterministic Policy Gradient (TD3), and so on. However, the convergence of DRL often requires extensive collected data sets and training episodes, which is data inefficient and computing resource consuming. Motivated by this problem, in this paper we propose a Twin-Delayed Deep Deterministic Policy Gradient algorithm with a Rebirth Mechanism, Tetanic Stimulation and Amnesic Mechanisms (ATRTD3) for continuous control of a multi-DOF manipulator. In the training process of the proposed algorithm, the weighting parameters of the neural network are learned using Tetanic stimulation and the Amnesia mechanism. The main contribution of this paper is that we show a biomimetic way to speed up the converging process, inspired by the biochemical reactions generated by neurons in the biological brain during memory and forgetting. The effectiveness of the proposed algorithm is validated by a simulation example including comparisons with previously developed DRL algorithms. The results indicate that our approach shows performance improvement in terms of convergence speed and precision.

Keywords: multi-DOF manipulator; tetanic stimulation; amnesia mechanism; deep reinforcement learning

1. Introduction

Deep Reinforcement Learning (DRL) is an advanced intelligent control method. It uses a neural network to parameterize the Markov decision process (MDP). DRL has been successfully applied to the fields of robots [1–4], machine translation [5], autonomous driving [6], and target positioning [7], and shows strong adaptability. There are two kinds of DRL algorithms. One kind is based on a value function, such as Deep Q Network (DQN) [8] and Nature DQN [9]; the output of a value-based DRL algorithm is a discrete state-action value. The other kind is policy-based, such as Deep Deterministic Policy Gradient (DDPG) [10], Trust Region Policy Optimization (TRPO) [11], Asynchronous Advantage Actor-Critic (A3C) [12], Distributed Proximal Policy Optimization (DPPO) [13,14], and Twin-Delayed Deep Deterministic Policy Gradient (TD3) [15]. For continuous action spaces, an advanced search policy can improve the sampling efficiency of the underlying algorithms [10]. Many research results focus on improving the exploration policy. Among them, Fortunato et al. [16] and Plappert et al. [17] put forward noise-based exploration policies that add noise to the action space and the observation space. Bellemare et al. propose an exploration algorithm based on pseudo-counting for efficient exploration; the algorithm evaluates visit frequency by designing a density model that satisfies certain properties and calculates pseudo-counts that generalize to continuous spaces to encourage exploration [18]. In [19], Fox et al. innovatively propose a framework, DORA, that uses two parallel MDP processes to inject exploration signals into random tasks.
Reward-based exploration results in a slower approximation of functions and a failure to provide an intrinsic reward signal in time. Therefore, Badia et al. [20] propose a "never give up" (NGU) exploration strategy, which is designed to quickly prevent repeated visits to the same state within the same episode.

Due to its coupling and complexity, the multi-degree-of-freedom (multi-DOF) manipulator is a popular application background for DRL. The direct applications of DRL algorithms have, so far, been restricted to simulated settings and relatively simple tasks, due to their apparently high sample complexity [21]. Therefore, the application of DRL to multi-DOF robots requires a more effective DRL algorithm. Neuroscience provides a rich source of inspiration for new types of algorithms and architectures, independent of and complementary to the mathematical and logic-based methods and ideas that have largely dominated traditional approaches to AI [22]. In this paper, we design a new DRL algorithm named ATRTD3 based on research results from neuroscience and an analysis of the human brain's memory and learning process. We show a biomimetic way to speed up the converging process, inspired by the biochemical reactions generated by neurons in the biological brain during memory and forgetting. We apply a neural network parameter updating mechanism with Tetanic stimulation and the Amnesia mechanism to the DRL algorithm to further improve its efficiency in manipulator applications.

2. Related Work

The advancement of DRL has led to the development of intelligent control of multi-DOF manipulators in recent years. Kim et al. [23] propose a motion planning algorithm for robot manipulators using TD3, whose designed paths are smoother and shorter after 140,000 episodes than those designed by a Probabilistic Roadmap. Based on the classic DDPG algorithm, Zhang et al. smoothly add Gaussian parameters to improve the exploratory nature of the algorithm, dynamically set the robot grasping space parameters to adapt to workspaces of multiple scales, and realize accurate robot grasping [24]. Robert Kwiatkowski et al. [25] used deep learning methods to make a manipulator build a self-model after 35 h of training. By comparing the application of DDPG and Proximal Policy Optimization (PPO) on a manipulator, Iriondo et al. [26] concluded that current DRL algorithms cannot yet obtain robust motion ability with acceptable training efficiency. The difficulty of applying DRL to the motion control of a multi-DOF manipulator lies in how to improve exploration efficiency and the robustness of the manipulator's output actions. Therefore, it is necessary to draw inspiration from the research results of neuroscience. For flexible manipulators, some studies [27,28] are very interesting and provide help for the kinematics modeling and control design in this paper.

Long-term potentiation (LTP) is a form of activity-dependent plasticity which results in a persistent enhancement of synaptic transmission. LTP has been a source of great fascination to neuroscientists since its discovery in the early 1970s [29] because it satisfies the criteria proposed by Donald Hebb for a synaptic memory mechanism in his influential book 'The Organization of Behavior' [30].
LTP is a persistent enhancement of excitatory synaptic transmission induced by certain preceding operations of high-frequency stimulation (HFS) [31]. In LTP, stimulation changes the synaptic proteins, that is, it changes the sensitivity of postsynaptic neurons to presynaptic neurons, thus changing the strength and efficiency of synaptic signal transmission. Memory formation is considered to be the result of long-term synaptic plasticities, such as long-term depression (LTD) and LTP [32].

LTP and LTD have another potentially important role in modern neuroscience: the possibility that they may be exploited to treat disorders and diseases of the human central nervous system (CNS). A variety of neurological conditions arise from lost or excessive synaptic drive due to sensory deprivation during childhood, brain damage, or disease [33]. Memory and forgetting are stages that the human brain must go through in the process of accepting knowledge and accumulating experience.
This paper modifies the Actor-network module in DRL, changing the Actor-network module optimized purely by gradient descent into a network module with biological characteristics.

3. Methods

ATRTD3 is the DRL algorithm proposed in this paper to improve the motion ability of a multi-DOF manipulator. The innovation of the algorithm is to transfer research results from neuroscience into DRL. The algorithm is based on the Twin-Delayed Deep Deterministic Policy Gradient algorithm with Rebirth Mechanism (RTD3) [34] and improves the update of the network weight parameters of the Actor-network module. The highlight of the algorithm is that it uses Tetanic stimulation and the Amnesia mechanism to randomly enhance and weaken the weighting parameters of the neural network, thus realizing a bionic update of the neural network. The Actor-network module obtained through the deterministic policy gradient is further updated through these two mechanisms. Compared with other DRL algorithms, ATRTD3 adds a network update step with biological characteristics and further expands the scope of exploration. Figure 1 shows the framework of the overall algorithm.

Figure 1. ATRTD3 algorithm framework.

The pseudo-code of ATRTD3 is shown in Appendix A at the end of this paper. The following is a detailed description of the Tetanic stimulation and Amnesia mechanisms.

3.1. Tetanic Stimulation
Tetanic stimulation is the memory part of the Actor-network module. In back propagation, the neural network obtains its parameter updates by gradient descent, which realizes the iterative updating of the network. By evaluating these updates, we can determine which neural nodes' weights are enhanced and which are weakened. Among the strengthened neurons, the degree of strengthening also differs. Therefore, it is necessary to evaluate and sort the parameters of the strengthened neuron nodes, select the top-ranked ones by degree of strengthening, obtain the neuron nodes qualified for Tetanic stimulation, and conduct Tetanic stimulation on those parameters to achieve LTP, as shown in Figure 2; the specific pseudo code is shown in Algorithm 1. Directly modifying the parameters of neural nodes directly affects the nonlinear expression of the neural network. This is an immediate influence, which will immediately show the changed effect of the neuron weights in the continuous MDP.
We therefore need to keep this influence within a reasonable range and continuously update the parameters of the neural network in the iterative process. The designed Tetanic stimulation mechanism is nested in the fixed delayed-update step, which exerts the effect of Tetanic stimulation to a certain extent without affecting the overall effect of the network update iteration. This ensures that the network converges towards the direction of performance improvement during training while not weakening the exploration attempts in the process.

Algorithm 1 Tetanic stimulation
1: Tetanic stimulation coefficient κ
2: Load Actor network φ, W ← Actor.fc.weight
3: Update Actor network φ′ ← φ
4: Load new Actor network, W_new ← Actor.fc.weight
5: ΔW = W_new − W
6: Select the indices (row_list, col_list) of the T largest entries in ΔW
7: For t = 1 to T do:
8:   If random(0, 1) < κ:
9:     If A.w(row_t, col_t) > 0:
10:      A.w(row_t, col_t) ← (1 + random(0, 0.01)) × A.w(row_t, col_t)
11:    Else:
12:      A.w(row_t, col_t) ← (1 − random(0, 0.01)) × A.w(row_t, col_t)
13:    End if
14:  End if
15: End for

Figure 2. Schematic diagram of the working mechanism of Tetanic stimulation.
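To make the mechanism concrete, the following is a minimal PyTorch-style sketch of Algorithm 1. It is illustrative only: the layer name fc, the default values of kappa and top_t, and the function name are our own assumptions, not the authors' released code.

import torch

def apply_tetanic_stimulation(actor, old_fc_weight, kappa=0.5, top_t=10):
    # Sketch of Algorithm 1: with probability kappa, further strengthen the weights
    # that the latest policy-gradient step increased the most.
    with torch.no_grad():
        w = actor.fc.weight                      # weights after the gradient step
        delta = w - old_fc_weight                # change produced by the update
        flat_idx = torch.topk(delta.flatten(), top_t).indices
        rows = (flat_idx // w.shape[1]).tolist()
        cols = (flat_idx % w.shape[1]).tolist()
        for r, c in zip(rows, cols):
            if torch.rand(1).item() < kappa:          # line 8 of Algorithm 1
                force = torch.rand(1).item() * 0.01   # random force in (0, 0.01)
                if w[r, c] > 0:
                    w[r, c] *= 1.0 + force            # enhance positive weights
                else:
                    w[r, c] *= 1.0 - force            # nudge negative weights upward

# usage: old_w = actor.fc.weight.detach().clone() is taken before the optimizer step,
# and apply_tetanic_stimulation(actor, old_w) is called right after it.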
3.2. Amnesia Mechanism

The Amnesia mechanism is the forgetting part of the Actor-network module. When there are problems in information transmission between neurons, synapses cannot normally perform the function of neurotransmitter transmission, information transmission in the brain starts to fail, and forgetting begins to occur; at this point, problems appear in the huge, redundant brain neural network, and some memory and logic units begin to fail. Neuron function is not always in a stable and good working state; there will be all kinds of accidents, just as the world's best snooker players cannot guarantee that every shot will be accurate. So, there is forgetting throughout the whole period and process of neuron activity. For this reason, the Amnesia mechanism is added to the neural network by randomly selecting neurons in the network with a small probability and then weakening the parameters of those neuron nodes. When using the Amnesia mechanism to weaken the representation ability of the neural network, the influence of this weakening must be controllable, that is to say, it must not affect the convergence trend of the neural network. Therefore, to ensure that the influence of the Amnesia mechanism can be controlled, the weights of neurons are weakened by a random force (a random percentage) in this paper, as shown in Figure 3; the specific pseudo code is shown in Algorithm 2.
Algorithm 2 Amnesia framework
1: Load Actor network, W ← A.w(Actor.fc.weight)
2: N is the number of nodes in W
3: For i = 1 to N:
4:   Draw a random(0, 1) number ξ; Amnesia threshold value τ; mutation coefficient k
5:   If ξ < τ:
6:     Actor.fc.weight[i] ← Actor.fc.weight[i] × (1 − k × random(0, 1))
7:   End if
8: End for

Figure 3. Schematic diagram of the Amnesia framework.
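A minimal PyTorch-style sketch of Algorithm 2 could look as follows. Again, this is illustrative only; the layer name fc and the default values of tau and k are assumptions.

import torch

def apply_amnesia(actor, tau=0.01, k=0.1):
    # Sketch of Algorithm 2: with small probability tau, weaken each weight of the
    # actor's linear layer by a random percentage.
    with torch.no_grad():
        w = actor.fc.weight
        mask = torch.rand_like(w) < tau          # weights selected to "forget"
        shrink = 1.0 - k * torch.rand_like(w)    # random force: shrink by up to k*100%
        w[mask] = w[mask] * shrink[mask]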
4. Experiment

4.1. Experiment Setup

For the control problem of the multi-DOF manipulator, if we only consider the kinematics model of the manipulator and regard the motion of the multi-DOF manipulator as a discrete process from one position of the end effector to another, a DRL method with a deterministic policy gradient, such as RTD3, can achieve good results. However, if the motion of the manipulator is regarded as a continuous motion process, a new group of inputs and outputs for the DRL must be found. The idea adopted in this paper is to discretize the motion process of the manipulator in time: the position deviation of the end effector from the target, the angular velocities of the joints, and the joint angles are taken as the input information of the DRL, and the angular acceleration command of the controlled joints for the next interval is taken as the output information, as shown in Figure 4. In this way, by controlling the angular acceleration of the joints, the problem of discrete processes in the previous position-control scheme can be solved. However, this change inevitably increases the model dimensions, which not only puts forward new requirements for the capability of the DRL algorithm but also requires a redesign of the reward function.

Figure 4. Schematic diagram of the control flow.

The UR manipulator is a representative manipulator in industrial production and scientific research. Therefore, this paper uses its structural size and joint layout to establish the simulation manipulator.

4.2. Task Introduction

In this paper, the DRL algorithm is used to train a model controller of the multi-DOF manipulator. The model controls the angular acceleration of the joints of the manipulator so that the manipulator starts to move from its initial position in the workspace in a static state, moves to the target position, and stops. In the whole training process, the target position of the task is a fixed position in the workspace of the manipulator. The core of the task is that the manipulator reaches the target position and, at the moment it reaches it, every joint of the manipulator is in a static state. Finally, the manipulator can reach the target position smoothly by controlling the angular acceleration of the joints. In order to limit the boundaries of the entire task, the training process must be restricted: each episode is divided into twenty steps. This setting mainly takes into account that the training convergence process takes a long time, so the duration of a single episode must be shortened. This task is a simulation experiment to test the convergence ability and learning ability of the improved algorithm, ATRTD3.
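For clarity, the observation and action interface described above can be written down as a short sketch. This is a sketch under our own assumptions about variable names and ordering; the paper itself only fixes the nine inputs and three outputs.

import numpy as np

def build_observation(target_xyz, ee_xyz, joint_vel, joint_ang):
    # nine inputs: end-effector position deviation from the target (dx, dy, dz),
    # the three joint angular velocities, and the three joint angles
    dx, dy, dz = np.asarray(target_xyz, dtype=float) - np.asarray(ee_xyz, dtype=float)
    return np.concatenate(([dx, dy, dz], joint_vel, joint_ang)).astype(np.float32)

def clip_action(raw_action):
    # three outputs: angular acceleration commands for base, shoulder, and elbow,
    # limited to (-0.5, 0.5) as in Equation (3)
    return np.clip(raw_action, -0.5, 0.5)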
4.3. Simulation Environment Construction

The DRL algorithm establishes the manipulator model through the standard DH method [35] and uses the forward kinematics solution to obtain the spatial pose of the end effector from the joint angles. The DH modeling method is a general modeling method for multi-link mechanisms, and standard DH models are used for serial-structure robots. Table 1 shows the DH parameters of this manipulator, where a is the length of the link, d is the offset of the link, α is the twist angle of the link, and θ is the joint angle. The units of a and d are meters, and the units of α and θ are radians.
Table 1. The D-H parameters of the manipulator.

Joint      a (m)      d (m)     α (rad)   θ
Base       0          0.0892    π/2       θ1
Shoulder   −0.4250    0         0         θ2
Elbow      −0.3923    0         0         θ3
Wrist1     0          0.1092    π/2       θ4
Wrist2     0          0.0946    −π/2      θ5
Wrist3     0          0.0823    0         θ6

During the experiment, only the base, shoulder, and elbow are controlled; wrist1, wrist2, and wrist3 are locked. Because the problem studied in this paper focuses on the position-reaching ability of the manipulator, which can be realized using only three joints, the three wrist joints are locked. The homogeneous transformation matrix is established through the D-H parameters, as shown in Equation (1):

{}^{i-1}_{i}T = \begin{bmatrix} \cos\theta_i & -\sin\theta_i\cos\alpha_i & \sin\theta_i\sin\alpha_i & a_i\cos\theta_i \\ \sin\theta_i & \cos\theta_i\cos\alpha_i & -\cos\theta_i\sin\alpha_i & a_i\sin\theta_i \\ 0 & \sin\alpha_i & \cos\alpha_i & d_i \\ 0 & 0 & 0 & 1 \end{bmatrix}   (1)

The forward kinematics solution of the manipulator is obtained by multiplying the homogeneous transformation matrices, as shown in Equation (2); in the base coordinate system {B}, the position of the end of the manipulator can then be obtained:

{}^{0}_{6}T = {}^{0}_{1}T \, {}^{1}_{2}T \, {}^{2}_{3}T \, {}^{3}_{4}T \, {}^{4}_{5}T \, {}^{5}_{6}T   (2)

In each episode, the target position is randomly generated in the workspace of the manipulator. In the experiment, the distance differences between the center position of the end effector and the target position in three directions (dx, dy, and dz), the angular velocities of the first three joints (\omega_{Joint\_Base}, \omega_{Joint\_Shoulder}, \omega_{Joint\_Elbow}), and their absolute angles (\theta_{Joint\_Base}, \theta_{Joint\_Shoulder}, \theta_{Joint\_Elbow}) are used as the input of the DRL, and the angular acceleration control commands of the base, shoulder, and elbow (\dot{\omega}_{Joint\_Base}, \dot{\omega}_{Joint\_Shoulder}, \dot{\omega}_{Joint\_Elbow}) are output by the DRL. In order to ensure the safe operation of the virtual manipulator, the angular acceleration (rad/s²) is limited, as shown in Equation (3):

\dot{\omega}_i \in (-0.5, 0.5), \quad i \in \{Base, Shoulder, Elbow\}   (3)

When the DRL outputs the angular acceleration control command \dot{\omega}_i, the joint angle increment \Delta\theta_i for this step is calculated by Equation (4) from the interval time t = 0.1 s and the angular velocity \omega_i of the previous step. The current joint angle \theta_i is then updated through the joint angle increment \Delta\theta_i in Equation (5), the position of the manipulator end effector in the {B} coordinate system is obtained from the homogeneous transformation {}^{0}_{6}T, and the joint angular velocity is updated as shown in Equation (6):

\Delta\theta_i = \omega_i t + \tfrac{1}{2}\dot{\omega}_i t^2, \quad i \in \{Base, Shoulder, Elbow\}   (4)

\theta_i \leftarrow \theta_i + \Delta\theta_i   (5)

\omega_i \leftarrow \omega_i + \dot{\omega}_i t   (6)

During this motion process, the DRL model sends the angular acceleration commands of the joints (the output of the DRL algorithm) to the manipulator according to the perceived environment and its own state (the input of the DRL algorithm), and issues the termination command when judging that the motion should end.
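The kinematics just described can be prototyped in a few lines of NumPy. This is a sketch under the standard-DH convention of Equation (1) and the parameters of Table 1; the function names and the locked-wrist simplification to three controlled joints are our own choices.

import numpy as np

# standard DH parameters (a, d, alpha) from Table 1; theta comes from the joint state
DH = [(0.0,     0.0892,  np.pi / 2),   # base
      (-0.4250, 0.0,     0.0),         # shoulder
      (-0.3923, 0.0,     0.0),         # elbow
      (0.0,     0.1092,  np.pi / 2),   # wrist1 (locked)
      (0.0,     0.0946, -np.pi / 2),   # wrist2 (locked)
      (0.0,     0.0823,  0.0)]         # wrist3 (locked)

def dh_transform(a, d, alpha, theta):
    # homogeneous transform of Equation (1)
    ct, st, ca, sa = np.cos(theta), np.sin(theta), np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def end_effector_position(thetas):
    # forward kinematics of Equation (2): chain the six link transforms
    T = np.eye(4)
    for (a, d, alpha), theta in zip(DH, thetas):
        T = T @ dh_transform(a, d, alpha, theta)
    return T[:3, 3]                    # position of the end effector in {B}

def step_joints(theta, omega, omega_dot, t=0.1):
    # joint update of Equations (4)-(6) for one 0.1 s control interval
    omega_dot = np.clip(omega_dot, -0.5, 0.5)          # Equation (3)
    delta_theta = omega * t + 0.5 * omega_dot * t**2   # Equation (4)
    theta = theta + delta_theta                        # Equation (5)
    omega = omega + omega_dot * t                      # Equation (6)
    return theta, omega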
4.4. Rewriting the Experience Playback Mechanism and Reward Function Design

The motion process of a multi-DOF manipulator is no longer a set of discrete spatial position points. As can be seen from Figure 5 below, this experiment completes the specified task by controlling the angular accelerations of the three joints. In the joint angular velocity field drawn from the three joint angular velocities, in Case 1 the manipulator reaches the target position and stops, which means that the task is successfully completed; in Case 2, the angular velocity of the joints never stops, or the end effector passes through the target position quickly, and the task fails; in Case 3, although the manipulator stops within the scheduled time, the end of the manipulator does not reach the target position, and the task again fails. Figure 5 also shows that the task of this experiment is more difficult than the task of only reaching the goal through discrete spatial positions.

Figure 5. Schematic diagram of the angular velocity field of the joints. ξ is the threshold for determining joint stop.

Therefore, the design of the reward function must be reconsidered; that is to say, the angular velocity information of the joints must be introduced. In the whole process of the manipulator movement, each instance of acceleration and deceleration has an impact on the angular velocity of each joint when the termination condition is reached and the iteration of the round stops. Therefore, it is necessary to further improve the experience playback mechanism and change the experience pool storage. In other words, the final angular velocity of each joint should be shared by all the previous continuous movements. As shown in Equation (7), the absolute value of the final angular velocity of each joint is taken, multiplied by a constant \lambda_i, and divided by the number of steps T_{Stop} to obtain the corresponding reward value:

R_{Joint\_Vel} = -\frac{\lambda_1 |\omega_{Base}| + \lambda_2 |\omega_{Shoulder}| + \lambda_3 |\omega_{Elbow}|}{T_{Stop}}, \quad \lambda_i > 0, \; i = \{1, 2, 3\}   (7)

This part of the reward is added as a negative reward to the corresponding rewards in the experience pool, so as to feed the joint angular velocity state at termination back into the neural network parameter updates, as shown in Figure 6.

Figure 6. Schematic diagram of the rewriting experience playback mechanism.
The design of the reward function in this paper adds the angular velocity reward part to the Step-by-Step reward function (r_{Step-by-Step}) [35]. The Step-by-Step reward function mainly includes two parts: the first part is the negative value of the Euclidean distance between the end of the manipulator and the target; the second part is the reward obtained by comparing how much closer to the target the current position is than the last position of the manipulator end during the movement. Therefore, the reward function in this paper is shown in Equation (8):

r' = \lambda_4 r_{Step\text{-}by\text{-}Step} + \lambda_5 R_{Joint\_Vel}   (8)

where \lambda_4 and \lambda_5 are two constants.
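As a concrete illustration of Equations (7) and (8) and of the rewritten experience playback described above, the sketch below corrects an episode's stored transitions before they enter the replay buffer. The function names, the simplified stand-in for r_Step-by-Step, and the λ values are our own assumptions; the exact step reward is the one defined in [35].

import numpy as np

def step_by_step_reward(dist_now, dist_prev):
    # simplified stand-in for r_Step-by-Step [35]: penalize the remaining distance
    # to the target and reward getting closer than in the previous step
    return -dist_now + (dist_prev - dist_now)

def joint_vel_penalty(final_joint_vel, lams=(1.0, 1.0, 1.0), t_stop=20):
    # Equation (7): R_Joint_Vel built from the terminal joint speeds
    return -sum(l * abs(w) for l, w in zip(lams, final_joint_vel)) / t_stop

def rewrite_episode(transitions, final_joint_vel, lam4=1.0, lam5=1.0):
    # Equation (8) applied to every stored step (s, a, s', r) of the finished episode,
    # so that the terminal joint-velocity state is shared by all previous movements;
    # the stored r values are assumed to be the per-step r_Step-by-Step rewards
    r_vel = joint_vel_penalty(final_joint_vel, t_stop=len(transitions))
    return [(s, a, s2, lam4 * r + lam5 * r_vel) for (s, a, s2, r) in transitions]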
4.5. Simulation Experimental Components

In the application of a DRL algorithm, a problem that cannot be avoided is the random generation of a large number of neural network parameters. It is because of this randomness of the parameters that we cannot train and learn effectively when facing specific tasks, so we need to explore a more efficient, faster-converging, and more stable algorithm framework to make up for this disadvantage. Since ATRTD3, RTD3, and TD3 are all improvements and innovations built on DDPG, the contrast group of this experiment comprises these four algorithms. In the contrast experiment, we train them on the same task and let them acquire the ability to solve the target task through learning and training. In the experiment, we specify two kinds of evaluation indexes. The first index is to calculate all the reward scores in an episode and divide them by the total number of steps in the episode to get the average score. The second index is to record the absolute values of the angular velocities of the base, shoulder, and elbow at the end of an episode. Additionally, within the same group of experiments, in order to ensure a fair comparison of the practical application ability and the model convergence ability of the four algorithms, we use the same initialization model as the test algorithm model.

5. Discussion

Through the comparison of DDPG, TD3, RTD3, and ATRTD3 in Figure 7a, we can clearly see the improvement in ATRTD3's learning ability. Therefore, we further analyze the average-score evaluation index when the final distance error is roughly the same. From Figure 7a, we can see that the average score of ATRTD3 is higher than that of the other algorithms when the final distance error is the same in the purple areas A, B, and C, which indicates that the learning effect of ATRTD3 is better. The reason is that part of the reward function is the negative reward introduced by the absolute value of the angular velocity of each joint after the termination of an episode. As shown by the purple areas A, B, and C of Figure 7b, ATRTD3 can better guide the multi-DOF manipulator to move to the target position and stop, so ATRTD3 is more efficient and stable than the other algorithms.
Figure 7. Four randomized experiments (Group1, Group2, Group3, and Group4) were conducted to evaluate the performance of the four algorithms (RTD3, ATRTD3, TD3, and DDPG). The average score (Avg_Score) of each group is shown in (a) and the final error distance (Final_Dis) in (b). Purple areas A, B, and C need special attention.

Secondly, from Figure 7 we can draw the conclusion that ATRTD3 shows stronger stability than the other algorithms in the late training period. As can be seen in Figure 8, although DDPG reaches a score level close to that of ATRTD3 through the later training, the average score curve and the final error distance curve clearly show that ATRTD3 has better stability in the later training stage compared with DDPG: the two curves of ATRTD3 are straighter, while there are many spikes in the two curves of DDPG.
Figure 8. The results of the four experiments (Group1, Group2, Group3, and Group4) of the four algorithms (RTD3, ATRTD3, TD3, and DDPG) are compared in a stacked way. From left to right, (a) represents the final error distance and (b) represents the average score.

From Figure 9, we can see that ATRTD3 improves the average score by at least 49.89% compared with the other three algorithms. In terms of stability, ATRTD3 performs better, with an improvement of at least 89.27% compared with the other three algorithms.

Figure 9. Combining the four groups of experiments (Group1, Group2, Group3, and Group4), the ATRTD3 algorithm improves the average score and final error distance performance of the model.

To further demonstrate the advantages of ATRTD3 from the underlying control variables, we collected the final angular speeds of the three joints during model training, as shown in Figure 10a,b. In Figure 10a, we show the final angular velocity of the three joints of the arm controlled by the high score model (Group1). In Figure 10b, we show the final angular velocity of the three joints of the arm controlled by the low score model (Group2). By placing a local magnification in each image, we can clearly see the final angular velocity of the three joints at the end of the training process, where the red curve is the speed of each joint under the guidance of ATRTD3 throughout the training process. ATRTD3 has obvious advantages over the other three algorithms.
Figure 10. (a) The final angular velocity of the joints in the high score model Group1; (b) the final angular velocity of the joints in the low score model Group2.

The final joint angular velocity of the manipulator based on ATRTD3 is always around 0 rad/s, with the minimum fluctuation and the best stability. In the high score model (Group1), only DDPG achieves a training score similar to that of ATRTD3, so only the joint angular velocities of DDPG and ATRTD3 are compared. It can be seen from Table 2 that DDPG has an order-of-magnitude disadvantage in the comparison of the angular velocities of the other two joints, except that the final angular velocity of its base joint is roughly the same as that of ATRTD3. By comparing the variances in Table 2, it is not difficult to find that ATRTD3 has an absolute advantage in stability, which is generally one order of magnitude better, because the smaller the variance is, the more stable the training is. In the low score model (Group2), the advantages of ATRTD3 cannot be seen directly in Table 3. However, after accumulating and comparing the angular velocities and angular velocity variances of the three joints, it can be seen from Figure 11 that ATRTD3 is still better than the other three algorithms.
Table 2. The mean and variance of joint angular velocity in the local enlarged images of the high score model (Group1).

           Base                              Shoulder                          Elbow
           Average (rad/s)   Var             Average (rad/s)   Var             Average (rad/s)   Var
ATRTD3     −2.50 × 10^−3     2.80 × 10^−5    2.51 × 10^−3      3.09 × 10^−5    −3.90 × 10^−4     2.83 × 10^−5
RTD3       −4.65 × 10^−3     4.58 × 10^−4    −3.13 × 10^−2     1.07 × 10^−3    3.35 × 10^−3      6.81 × 10^−4
TD3        3.13 × 10^−2      7.26 × 10^−4    −2.79 × 10^−2     1.32 × 10^−3    −6.69 × 10^−3     4.73 × 10^−4
DDPG       2.66 × 10^−3      4.60 × 10^−5    2.52 × 10^−2      6.41 × 10^−5    −2.01 × 10^−2     3.95 × 10^−5

Table 3. The mean and variance of joint angular velocity in the local enlarged images of the low score model (Group2).

           Base                              Shoulder                          Elbow
           Average (rad/s)   Var             Average (rad/s)   Var             Average (rad/s)   Var
ATRTD3     −1.71 × 10^−2     4.17 × 10^−5    8.04 × 10^−3      1.08 × 10^−4    −1.23 × 10^−2     6.18 × 10^−5
RTD3       3.03 × 10^−2      1.14 × 10^−4    −2.50 × 10^−3     5.87 × 10^−4    −1.73 × 10^−2     4.48 × 10^−4
TD3        1.39 × 10^−2      3.40 × 10^−5    3.67 × 10^−2      9.69 × 10^−5    −6.58 × 10^−2     6.95 × 10^−5
DDPG       3.20 × 10^−2      5.38 × 10^−5    −1.40 × 10^−2     7.07 × 10^−5    8.21 × 10^−3      1.27 × 10^−4

Figure 11. The improvement of the ATRTD3 algorithm over the other three algorithms in the high score model (Group1) and the low score model (Group2), in two aspects: the skill level of joint angular velocity control and late-training stability.
Through Figure 11, we can see that ATRTD3 significantly improves the skill level and stability compared with the other three algorithms. ATRTD3 generally improves the skill level by more than 25.27% and improves stability by more than 15.90%. Compared with TD3, the stability improvement of ATRTD3 in Group2 is −5.54%. The main reason is that TD3 falls into a more stable local optimum, so the performance of ATRTD3 cannot be questioned under this index.

In the Actor neural network, the Tetanic stimulation mechanism forcibly changes the parameters of some neurons and enlarges the exploration space of actions within a certain range. Through Tetanic stimulation, the weights of the selected neurons can be increased again, which helps to improve the convergence speed of the neural network and speed up training. The Amnesia mechanism forcibly changes the parameters of neurons, which is in line with the decay of biological cells.

6. Conclusions

In this paper, we propose an algorithm named ATRTD3 for continuous control of a multi-DOF manipulator. In the training process of the proposed algorithm, the weighting parameters of the neural network are learned using Tetanic stimulation and the Amnesia mechanism. The main contribution of this paper is that we show a biomimetic way to speed up the converging process, inspired by the biochemical reactions generated by neurons in the biological brain during memory and forgetting. The integration of the two mechanisms is of great significance for expanding the scope of exploration, jumping out of local optimal solutions, speeding up the learning process, and enhancing the stability of the algorithm. The effectiveness of the proposed algorithm is validated by a simulation example including comparisons with previously developed DRL algorithms. The results indicate that our approach shows performance improvement in terms of convergence speed and precision for the multi-DOF manipulator.

The proposed ATRTD3 successfully applies research results from neuroscience to the DRL algorithm to enhance its performance, and uses computer technology to carry out a biologically inspired design in code that approximately realizes some functions of the biological brain. In the future, we will further learn from neuroscience achievements, improve the learning efficiency of DRL, and use DRL to achieve reliable and accurate manipulator control. Furthermore, we will continue to pursue a scheme that controls all six joints to realize the cooperative control of position and orientation.

Author Contributions: Conceptualization, H.H. and Y.H.; methodology, Y.H.; software, Y.H.; validation, Y.H., D.X. and Z.Z.; formal analysis, Y.H.; investigation, Y.H.; resources, H.H.; data curation, Y.C.; writing—original draft preparation, Y.H.; writing—review and editing, Y.C. and Z.L.; visualization, Y.H.; supervision, H.H. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The authors exclude this statement.

Conflicts of Interest: The authors declare no conflict of interest.
Appendix A

Pseudo code of Algorithm A1.

Algorithm A1. ATRTD3
1: Initialize critic networks Q_θ1, Q_θ2 and actor network π_φ with random parameters θ1, θ2, φ
2: Initialize target networks Q_θ1′, Q_θ2′, π_φ′
3: Target network node assignment θ1 → θ1′; θ2 → θ2′; φ → φ′
4: Initialize replay buffer B
5: For t = 1 to T do
6:   Amnesia framework
7:   Select action with exploration noise a ~ π_φ + ε, ε ~ N(0, σ)
8:   Temporarily store transition tuple (s, a, s′, r) in B
9:   Fix transition tuple r′ ← r + R_Joint_Vel, with angular velocity correction R_Joint_Vel
10:  Store transition tuple (s, a, s′, r′) in B
11:  If Sum_B < mini_batch then
12:    return
13:  Sample mini-batch of N transitions (s, a, r, s′) from B
14:  ã ← π_φ′(s′) + ε, ε ~ clip(N(0, σ̃), −c, c)
15:  y = r + min_{i=1,2} Q_θi′(s′, ã)
16:  Statistical calculation of Q value utilization ratios w_θ1′, w_θ2′
17:  If w_θ1′ < Mini_utilization then
18:    Rebirth target network θ1′
19:  End if
20:  If w_θ2′ < Mini_utilization then
21:    Rebirth target network θ2′
22:  End if
23:  Update critics θi ← argmin_θi N^{−1} Σ (y − Q_θi(s, a))^2
24:  If t mod d then
25:    Update φ by the deterministic policy gradient:
26:    ∇_φ J(φ) = N^{−1} Σ ∇_a Q_θ1(s, a)|_{a = π_φ(s)} ∇_φ π_φ(s)
27:    Tetanic stimulation framework
28:    Update target networks:
29:    θi′ ← τθi + (1 − τ)θi′
30:    φ′ ← τφ + (1 − τ)φ′
31:  End if
32: End if
33: End for
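For readers who prefer working code to pseudo code, the following is a rough PyTorch rendering of one update step of Algorithm A1. It is a sketch under several assumptions (the network sizes, the omission of the Rebirth mechanism of RTD3 in lines 16-22, and the placement of the Amnesia call), and it reuses the apply_tetanic_stimulation and apply_amnesia sketches from Section 3; it is not the authors' released implementation.

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim=9, act_dim=3, max_acc=0.5):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                  nn.Linear(256, 256), nn.ReLU())
        self.fc = nn.Linear(256, act_dim)   # layer touched by the two bio-inspired hooks
        self.max_acc = max_acc

    def forward(self, s):
        return self.max_acc * torch.tanh(self.fc(self.body(s)))

class Critic(nn.Module):
    def __init__(self, obs_dim=9, act_dim=3):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                               nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, s, a):
        return self.q(torch.cat([s, a], dim=1))

def atrtd3_update(batch, actor, actor_t, critics, critics_t, actor_opt, critic_opts,
                  step, gamma=0.99, tau=0.005, policy_delay=2, noise=0.2, clip=0.5):
    # One gradient step in the spirit of Algorithm A1: a TD3 backbone plus the two
    # bio-inspired hooks. Algorithm A1 applies the Amnesia framework once per
    # environment step (line 6); it is shown here next to the Tetanic hook for brevity.
    s, a, r, s2 = batch
    with torch.no_grad():
        eps = (torch.randn_like(a) * noise).clamp(-clip, clip)
        a2 = (actor_t(s2) + eps).clamp(-0.5, 0.5)
        y = r + gamma * torch.min(critics_t[0](s2, a2), critics_t[1](s2, a2))
    for critic, opt in zip(critics, critic_opts):          # update the twin critics
        loss = nn.functional.mse_loss(critic(s, a), y)
        opt.zero_grad(); loss.backward(); opt.step()
    if step % policy_delay == 0:                           # delayed policy update
        old_w = actor.fc.weight.detach().clone()
        actor_loss = -critics[0](s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        apply_tetanic_stimulation(actor, old_w)            # memory hook (Section 3.1 sketch)
        apply_amnesia(actor)                               # forgetting hook (Section 3.2 sketch)
        pairs = [(actor, actor_t), (critics[0], critics_t[0]), (critics[1], critics_t[1])]
        for net, net_t in pairs:                           # soft target updates (lines 29-30)
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)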
References
1. Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; Quillen, D. Learning Hand-Eye Coordination for Robotic Grasping with Large-Scale Data Collection. In International Symposium on Experimental Robotics; Springer: Cham, Switzerland, 2016; pp. 173–184.
2. Zhang, M.; McCarthy, Z.; Finn, C.; Levine, S.; Abbeel, P. Learning deep neural network policies with continuous memory states. In Proceedings of the International Conference on Robotics and Automation, Stockholm, Sweden, 16 May 2016; pp. 520–527.
3. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1–40.
4. Lenz, I.; Knepper, R.; Saxena, A. DeepMPC: Learning deep latent features for model predictive control. In Proceedings of Robotics: Science and Systems, Rome, Italy, 13–17 July 2015; pp. 201–209.
5. Satija, H.; Pineau, J. Simultaneous machine translation using deep reinforcement learning. In Proceedings of the Workshops of the International Conference on Machine Learning, New York, NY, USA, 24 June 2016; pp. 110–119.
6. Sallab, A.; Abdou, M.; Perot, E.; Yogamani, S. Deep reinforcement learning framework for autonomous driving. Electron. Imaging 2017, 19, 70–76.
7. Caicedo, J.; Lazebnik, S. Active Object Localization with Deep Reinforcement Learning. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 2488–2496.
8. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
9. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.