Neuro-evolutionary Frameworks for Generalized Learning Agents


Thommen George Karimpanal
February 2020

Abstract
The recent successes of deep learning and deep reinforcement learning have firmly established their status as state-of-the-art artificial learning techniques. However, longstanding drawbacks of these approaches, such as their poor sample efficiency and limited generalization capabilities, point to a need for re-thinking the way such systems are designed and deployed. In this paper, we emphasize how the use of these learning systems, in conjunction with a specific variation of evolutionary algorithms, could lead to the emergence of unique characteristics such as the automated acquisition of a variety of desirable behaviors and useful sets of behavior priors. This could pave the way for learning to occur in a generalized and continual manner, with minimal interactions with the environment. We discuss the anticipated improvements from such neuro-evolutionary frameworks, along with the associated challenges, as well as their potential for application to a number of research areas.

1    Introduction
The ultimate aim of artificial intelligence research is to develop agents with truly intelligent behaviors, akin to those found in humans and animals. To this end, a number of tools and techniques have been developed. In recent years, two approaches in particular, deep learning (DL) and reinforcement learning (RL), seem to have made considerable progress towards this goal. Both these fields have been widely studied, with numerous successful examples [22, 29, 42, 25, 40] reported, particularly in recent years. However, even with the unprecedented success of recent approaches such as deep RL [28, 27, 36], poor sample efficiency and limited generalization remain major concerns to be addressed, keeping in view the ultimate goal of developing general purpose agents.
    The poor generalization capability of DL is exposed by its susceptibility to adversarial examples [30, 39]. Recent work [38] showed that it was possible to degrade the performance of DL-based image recognition systems by carefully altering just a single pixel. Although DL is an extremely data-hungry approach, the issue of sample efficiency is perhaps more
significant for RL systems, in which the time and energy costs associated with
obtaining the samples via interactions with the environment can be prohibitively
high. This is evident from the small number of cases where RL is directly ap-
plied to embodied systems. Traditional RL approaches require the learning to be
carried out from scratch (tabula-rasa), and typically, in very large state-action
spaces. The cost of exploring and learning in such spaces typically renders RL
infeasible in many practical circumstances. Even in simulated environments
such as Atari games [4], several thousands of training episodes and millions
of environment interactions are typically needed in order to obtain acceptable
agent behaviors in just a single game. Moreover, the resulting policy suffers
from poor transfer properties, and is often limited to the particular task/game
it was trained to solve. These issues pose major impediments to the practical
utility of deep RL, particularly in embodied applications such as robotics. Such
physical systems typically do not have the luxury of enduring the consequences of
such a large number of agent-environment interactions, especially when highly
sub-optimal actions are occasionally chosen during the learning process.
    These drawbacks of DL and RL stand in stark contrast to how learning
actually takes place in animals, who seem to acquire a diverse range of skills in
a much more sample-efficient and generalized manner, seemingly, without any
explicitly defined objective function. In addition, unlike in DL and RL, they do
not learn tabula-rasa. They are born with several simple or elaborate priors,
ingrained into their neural systems through millions of years of evolution and
natural selection. Innate behaviors such as the suckling and grasping reflexes in
human babies are examples of such priors. Apart from innate behaviors, intrinsic
mappings which help drive individuals towards survival, are also inherited as
priors. For example, thirst, hunger and pain are mapped to negative states of
being, which in turn drive individuals to take appropriate actions to escape these
states. Other states such as warmth and satiety generally correspond to positive
states of being, which individuals actively seek to attain. These intrinsic priors
and mappings have a general applicability, in the sense that they are useful to
the individual, irrespective of the tasks they go on to learn during their lifetime.
Recent studies in humans [10] have shown that by directing exploration towards
particular environmental states, such priors may also facilitate sample efficient
learning. This naturally leads us to the questions:
   • What is the underlying mechanism that allows for the emergence of such
     diverse and generalized learning without the need for an explicit objective
     function?
   • Can this process, along with prior acquisition mechanisms be mimicked in
     artificial agents to achieve sample-efficient learning?
In this paper, we qualitatively analyse these mechanisms and examine their
potential applicability to artificial learning systems, and discuss how they may
help achieve more generalized and sample-efficient learning, two of the necessary
ingredients for artificial general intelligence [14, 6]. We also discuss the relevance
of these mechanisms to a range of important research areas in artificial learning,
along with some of the associated challenges.

2    Acquisition of Generalized Skills via Relaxed Selection
Generalization has been a longstanding problem in both DL and RL, both
of which tend to produce solutions whose performance drastically deteriorates
when evaluated on new tasks/environments. The quest for generalization has
driven entire fields of research such as feature engineering [26], multi-task learn-
ing [46, 45], transfer learning [32, 41], etc., all of which are concerned with the
decoupling of task-specific characteristics from the skills that are generally use-
ful to learn. A majority of these approaches essentially aim to extract and utilize
domain-specific and task-independent knowledge that would likely be useful for
a range of tasks within the domain. Although considerable progress has been
made towards generalizing to task distributions [9, 35, 12], the choice of these
distributions is often arbitrary, and poorly justified. In addition, the domain
knowledge that results from these approaches directly or indirectly relies on the
optimization of a predefined objective function, the choice of which is again,
usually poorly justified.
    Even the field of Evolutionary Algorithms (EA) [18], which was originally
motivated by the problem of mimicking the emergence of generalized intelligent
agent behavior, has failed in this regard. The idea behind EA is to initialize
a population of ‘random’ agents, and to subsequently maintain the population
over a number of generations (iterations) through a fitness-proportional selection
process. Following the selection process, the subsequent generation is populated
by copying over a proportion of individuals directly as they are (exploitation),
and by subjecting the remaining individuals to crossover and mutation opera-
tions (exploration). By selectively and repeatedly propagating fitter individuals
to subsequent generations, the average performance of the population progres-
sively improves, until finally, the desired agent behavior is achieved. However,
the problem with this approach is that the agent behaviors are optimized as per
a specific and manually defined fitness function. This inevitably leads to a loss
in the generality of the solutions, much like in RL and DL. In addition, man-
ual specification of the fitness function can be counterintuitive, often leading to
unexpected and undesired behaviors.
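    For concreteness, the traditional EA loop described above can be sketched as follows. This is only an illustrative outline: the init_agent, fitness, crossover and mutate routines are hypothetical placeholders for problem-specific components, and the fitness values are assumed to be non-negative.

import random

def evolve(init_agent, fitness, crossover, mutate,
           pop_size=100, elite_frac=0.2, generations=50):
    """Traditional EA loop: fitness-proportional selection, with a fraction
    of individuals copied over unchanged (exploitation) and the remainder
    generated via crossover and mutation (exploration)."""
    population = [init_agent() for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(a) for a in population]  # assumed non-negative
        ranked = [a for _, a in sorted(enumerate(population),
                                       key=lambda p: scores[p[0]],
                                       reverse=True)]
        n_elite = int(elite_frac * pop_size)
        next_gen = ranked[:n_elite]                # fittest copied as-is
        while len(next_gen) < pop_size:
            # Fitness-proportional parent selection.
            p1, p2 = random.choices(population, weights=scores, k=2)
            next_gen.append(mutate(crossover(p1, p2)))
        population = next_gen
    return max(population, key=fitness)
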
    In contrast to these approaches, natural agents (humans and animals) seem
to be able to acquire a multitude of skills/behaviors (be it through learning
or evolution), without an explicit objective function. Moreover, many of these
skills seem to be general in nature, and applicable to a multitude of tasks. The
apparent universal utility of these skills may seem to point to a violation of
the no free lunch theorems for optimization [44]. However, this is not
the case. As general as they may seem, the skills acquired by natural agents
are applicable only to a subset of all possible problems. For example, certain
fundamental skills acquired by humans, although beneficial for understanding
language, and for complex motor skills such as walking, have no utility if the
task is to communicate with other species, or to be able to jump over a tall
building. This is because these are seldom encountered tasks in our natural
environment, and the corresponding skills to solve them are generally not useful
to acquire, in the sense that our survival and ability to reproduce/self-replicate
would generally not be affected by the absence of these skills. Hence, the appli-
cability of the skills acquired by natural agents seems to be limited to a subset of
problems that are useful, in the sense as described above. Apart from usefulness,
the environment also imposes feasibility constraints. For example, although it
may be useful to control the motion of objects with just our thoughts, it is not
possible to acquire such skills, as it violates the fundamental laws of physics,
which are strictly imposed by the environment we inhabit.
    As it follows from the above arguments, from an evolutionary point of view,
the only skills that seem to matter are those that are both feasible to acquire,
and are directly or indirectly useful for the successful propagation of genetic
information. This feasible usefulness is completely determined by the environ-
ment under consideration, and is a primary factor that ultimately guides the
acquisition of skills. For example, for a given environment, only certain sets
of traits may be feasibly useful. Individuals possessing these traits would be
able to survive long enough to reproduce, in turn making it likely for their very
similar offspring to do the same. Unlike in traditional EA, where the selection is
fitness proportional (as per a specifically defined fitness function), the selection
criterion in natural evolutionary systems is decided by the environment under
consideration, and is more relaxed, in the sense that it is merely dependent on
whether the set of traits possessed by an agent allows it to repro-
duce within its lifetime. Even simpler individuals which may correspond to ‘low
fitness values’ in the given environment, are allowed to self-replicate and prop-
agate to the next generation, as long as they can survive for long enough to do
so. As a natural consequence of this, the initial generations would be composed
of simpler individuals that are capable of self-replication, and as the generations
proceed, more and more complex individuals would emerge, owing to mutations
that occur during self-replication. This process would continue, producing a
diverse set of agents (with a diverse range of skills and behaviors), all of which
are adequately equipped for self-replication. The relaxed selection criterion im-
posed by the environment serves to implicitly encode a more generalized fitness
function, which would eventually result in the generation of diverse sets of agent
behaviors.
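    The contrast with the traditional loop can be made concrete with the following sketch of relaxed selection. The environment.live() call is a hypothetical interface that simulates a single lifetime and returns whatever mutated copies an agent manages to produce before dying; no fitness function is evaluated anywhere, and the population size is not held fixed.

def relaxed_selection(init_agent, environment, lifetime_steps, generations,
                      initial_size=20):
    """Relaxed selection sketch: an agent propagates to the next generation
    simply by surviving long enough to self-replicate."""
    population = [init_agent() for _ in range(initial_size)]
    for _ in range(generations):
        next_gen = []
        for agent in population:
            # Hypothetical API: simulate one lifetime in the environment and
            # return the (possibly empty) list of mutated offspring produced.
            offspring = environment.live(agent, lifetime_steps)
            next_gen.extend(offspring)
        # The population may grow or shrink; the environment alone decides
        # which lineages persist.
        population = next_gen
    return population
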
    Despite the absolute autonomy made possible via relaxed selection, the un-
derlying evolutionary mechanism is inherently slow, which might render the
process infeasible for developing artificial agents. However, when deployed in
combination with learning mechanisms, some of these drawbacks may be ad-
dressed by virtue of specific neuro-evolutionary mechanisms. A more detailed
discussion of the nature of such neuro-evolutionary processes is presented in
Section 3.

3    From Learning to Inheritance - The Baldwin Effect
Evolutionary mechanisms, in some sense, correspond to a form of inter-life trans-
fer of knowledge, whereas learning is typically concerned with behaviors attain-
able within an individual’s lifetime. However, in nature, the two mechanisms
interact to produce interesting effects, particularly, the eventual acquisition of
certain learned behaviors as evolutionary priors. The mechanism responsible
for this phenomenon is called the Baldwin effect [3], and it provides an elegant
explanation for how behaviors learned within a lifetime can eventually become
instinctual over generations.
    The Baldwin effect can be understood with a typical example of a predator-
prey scenario. If certain individuals in the prey population learn a particular
predator-avoidance strategy, then, on average, more of these individuals would
go on to survive and populate subsequent generations. As the generations pro-
ceed, individuals that learn this strategy earlier in their lives would survive for
longer periods, thus making them more likely to reproduce successfully. Due to
natural selection, individuals possessing learning
mechanisms with shorter and shorter learning times would be selected, until
eventually, the learning time is so short that the behavior effectively becomes
instinctual.
    This phenomenon has also been studied in artificial learning systems [17, 1,
11]. Hinton and Nowlan [17] showed how the processes of evolution and learn-
ing interact to accelerate the acquisition of desired behaviors via the Baldwin
effect. It was shown to be particularly useful in cases where the fitness function
was highly non-smooth, which is the case with many tasks in nature. However,
previous studies were based on traditional EA, which probably restricted the
generality of the solutions. In principle, Baldwinian effects would also arise
when learning is combined with relaxed selection-driven evolutionary mecha-
nisms (Section 2). Such a system could potentially result in diverse sets of
generalized behaviors eventually being inherited as priors, which could poten-
tially equip agents to learn a range of useful behaviors with minimal interactions
with the environment.
    To demonstrate this phenomenon, consider a harsh (scarce food, presence of
predators, etc.) simulated environment in which artificial agents are equipped
with RL-like learning mechanisms. For simplicity, let us assume that self-
replication, food consumption and predator avoidance are some of the possible
agent behaviors that can be learned in such an environment. A natural con-
sequence of applying the relaxed selection criterion would be the emergence of
an initial population of learning agents with reward structures that have an
affinity for self-replication. This is because agents who are born with reward
structures that are averse to self-replication would automatically be selected
against, and would eventually die out. Within this initial population, there
could be considerable variation in the agents’ individual behaviors, in terms of
their relative affinities for the learnable behaviors mentioned above. For exam-
ple, individuals with greater affinity for self-replication would naturally cause
similar agents to become proportionately more abundant in the population. In
subsequent generations, affinities that enable agents to survive for longer peri-
ods may be discovered, and would propagate through the population. This is
because agents that learn to survive for longer periods, say, by learning to forage
and/or avoid predators, would probably produce a greater number of offspring,
owing to the longer period in which they could potentially self-replicate. In some
sense, learned behaviors such as foraging and predator avoidance simply serve
to make the agents more efficient at self-replication. In reality, the number of
offspring produced is not the only measure of replication efficiency. How reli-
ably these offspring reproduce would also matter. The process thus incentivises
frequent and reliable self-replication.
    As the process continues, one could imagine the automatic inheritance of
complex learned behaviors that serve to directly or indirectly improve the repli-
cation efficiency. For example, as the generations proceed, this could lead to the
emergence of agents that are equipped with the necessary priors to quickly and
easily learn to build tools for more efficient foraging, construct enclosures to
protect themselves against predators, communicate with each other to exchange
survival strategies, etc. It is worth noting that the emergence of these behav-
iors is rooted in the Baldwin effect, coupled with the implicit fitness function
specified by the relaxed selection criterion.
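    The within-lifetime component of this illustrative scenario could be sketched as follows, assuming hypothetical agent and environment interfaces (env.observe, env.step, env.can_replicate, agent.act, agent.learn, agent.reset_policy) and a numpy reward-weight vector. Each agent inherits its reward weights (the prior) and learns a policy during its lifetime from rewards derived from those weights; offspring inherit a mutated copy of the weights but not the learned policy, so behaviors must be re-learned in each generation (a Baldwinian, rather than Lamarckian, setup).

import copy
import numpy as np

def live_one_lifetime(agent, env, max_steps, mutation_scale=0.01):
    """Within-lifetime loop: the agent learns from rewards defined by its
    inherited weights, and replicates when the environment permits. Only the
    reward weights are heritable; the policy is reset in each offspring."""
    offspring = []
    for _ in range(max_steps):
        obs = env.observe(agent)
        action = agent.act(obs)                  # policy learned in this lifetime
        reward = float(agent.reward_weights @ env.state_features(agent))
        agent.learn(obs, action, reward)         # e.g. a TD-style update
        if not env.step(agent, action):          # the agent has died
            break
        if env.can_replicate(agent):
            child = copy.deepcopy(agent)
            child.reward_weights = child.reward_weights + \
                mutation_scale * np.random.randn(*child.reward_weights.shape)
            child.reset_policy()                 # behavior must be re-learned
            offspring.append(child)
    return offspring
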

4     Research Potential
The unique effects of relaxed selection, operating in conjunction with the Bald-
win effect, could uncover novel frameworks for the design of artificial agents
with generalized behaviors. Such frameworks could possibly address and re-
solve a number of issues currently plaguing the field of artificial intelligence.
We discuss some of these issues, with the hope that the core idea of enabling
the autonomous acquisition of behaviors and priors through the interaction of
learning and evolution will inspire future attempts at developing more general
purpose agents:

4.1    Freedom From Objective Functions
One of the principal benefits of using the described approach would be that
the design of agents would no longer be tied down to arbitrary and pre-defined
fitness or objective functions. As relaxed selection allows the self-replication of
any agent that can survive for long enough to do so, the focus would need to
shift from designing the perfect fitness function to simulating the environment
as closely as possible. Results from the field of model-based learning [16, 34, 19],
which is dedicated to the cause of building more accurate environment models,
could provide a useful starting point for this undertaking. However, the de-
scribed neuro-evolutionary frameworks would require higher-resolution, more
detailed models that can account for various aspects of real-world interactions.
Although this is likely to be a highly computationally intensive task, with the
proliferation of cheaper and more powerful computing resources, the idea may
not be too far-fetched. Even with existing computational capabilities, novel con-
cepts along the lines of observational dropout [13] could enable sample-efficient
ways of obtaining models, with emphasis on their usefulness, rather than mere
accuracy. An alternative would be to deploy these mechanisms on physical sys-
tems such as field robots in the real world. This would circumvent the need for
developing computationally intensive, and possibly inaccurate simulated envi-
ronments. However, it would translate to drastically increased training times,
as well as issues such as wear and high cost, as is the case in evolutionary
robotics [31].

4.2    Intrinsic and Innate Behaviors
One of the primary limitations of DL and RL algorithms is that the learning
generally occurs from scratch, without appropriate initial behaviors that aid
learning in the long run. Topics such as intrinsic motivation [37, 8] and transfer
learning [41], which have garnered interest in the learning community, attempt
to partially address this issue. However, they are limited by the requirement
of arbitrary task distributions and/or directly or indirectly specified, but often
poorly justified objective functions. The unique properties of relaxed selection,
and the corresponding inheritance of prior behaviors via the Baldwin effect
could lay the foundation for novel approaches to acquire sets of generalized
priors and mappings, which could translate to several innate behaviors that
are beneficial for self-replication. It would be interesting to study this effect
in artificial systems, and to analyze the transformation of learned behaviors
into intrinsic ones. It also opens up new avenues for the study of other related
topics such as the nature of behaviors which are likely to become innate, the
conditions under which innate/instinctual behaviors arise, the type of tasks
which may benefit most from priors and mappings, etc.

4.3    Automation of Task Hierarchies
An artificial agent deployed in the real world may be required to learn and man-
age a number of tasks. In many cases, the tasks may be interdependent [23, 24],
such that specific tasks or sets of tasks may need to be completed before others
can be undertaken. For example, the task of cooking can only be undertaken
following the successful completion of the task of procuring food and other mate-
rials. This may in turn be dependent on other tasks such as farming or hunting.
The idea is that there exists a tree of task hierarchies which the agent must
be aware of in order to evaluate the feasibility of a given task. With relaxed
selection, the hypothesis is that self-replication lies at the root of a hierarchi-
cal evolutionary reward structure, and any behavior (such as food gathering,
learning to ward off threats, developing energy-efficient technologies, etc.) that
aids this ultimate objective, is assigned partial credit, which is encoded into the
reward structure for the corresponding learning mechanism.
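    One illustrative (and entirely hypothetical) way to encode such a hierarchy is as a dependency tree rooted at self-replication, with sub-tasks receiving geometrically discounted partial credit that could then seed the reward structure of the corresponding learning mechanism:

# Hypothetical task-dependency tree, with self-replication at the root.
TASK_TREE = {
    "self_replication": ["forage", "avoid_predators"],
    "forage": ["build_tools", "farm_or_hunt"],
    "avoid_predators": ["build_enclosure"],
    "build_tools": [], "farm_or_hunt": [], "build_enclosure": [],
}

def assign_partial_credit(tree, root="self_replication",
                          root_reward=1.0, decay=0.5):
    """Propagate credit down from the root: tasks further from the ultimate
    objective receive geometrically discounted reward weight."""
    credit = {root: root_reward}
    frontier = [root]
    while frontier:
        task = frontier.pop()
        for sub in tree[task]:
            credit[sub] = credit[task] * decay
            frontier.append(sub)
    return credit

# assign_partial_credit(TASK_TREE) yields, e.g.,
# {'self_replication': 1.0, 'forage': 0.5, 'avoid_predators': 0.5,
#  'build_tools': 0.25, 'farm_or_hunt': 0.25, 'build_enclosure': 0.25}
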

The study of this evolutionary credit-assignment problem [43] could be used
for extracting task dependencies, which could in turn be used to equip agents
with intent, enabling them to autonomously choose their own tasks, or decide
which task or set of tasks should be undertaken next. This is an aspect that is
generally ignored by current approaches, which typically do not allow room for
automated task identification [21] and selection. Although the idea of artificial
intent could be very useful for learning agents, the permitted set of intentions
must be strictly restricted, and perhaps even monitored with human oversight.
Automated intent could also be coupled with existing approaches for safe learn-
ing [2]. Such an integration may inform the development of the type of safety
algorithms that need to be used in conjunction with truly autonomous agents
of the future.
    In addition, the hypothesis of self-replication efficiency being at the root of
a hierarchical reward structure is implicitly related to the idea of explainabil-
ity [15, 7], as such a structure may help dictate why a certain solution was
proposed or why a certain task was assigned a higher priority.

4.4    Learning Efficient Representations
During an agent’s lifetime, it may encounter a number of tasks. Some of these
may be fairly simple, while others may be more complex. A wide spectrum of
task complexities motivates a corresponding variation in the representations [5]
used for learning. As relaxed selection operates on the basis of usefulness for
self-replication, it may naturally allow for the emergence of sets of representa-
tions that are likely to be useful during the agent’s lifetime. Representations that
directly or indirectly boost the replication efficiency would be favored, while oth-
ers would be selected against. This could potentially lead to the discovery and
utilization of novel representations that are equipped to perform tasks such as
causal reasoning [33]. In addition, relaxed selection could perhaps also give rise
to mechanisms to select appropriately complex representations commensurate
with the significance or complexities of different tasks. Acquiring generalized
representation systems in this manner may also allow for the seamless transfer
of knowledge across tasks.

5     Challenges
Although the combination of learning algorithms with relaxed selection mech-
anisms seems to be a promising approach for the realization of various facets
of general intelligence, there are a number of challenges that may need to be
overcome. Existing learning mechanisms, for instance, may not be sufficiently
flexible for the emergence of certain aspects of generalized behaviors, such as
learning task hierarchies. For such behaviors, more dynamic mechanisms that
can encode ways for learning networks to split, merge and communicate with
other networks may be needed.
    Unlike in traditional EA, where the agent population remains constant across
generations, a natural consequence of relaxed selection is that the agent popu-
lation would continue to grow exponentially. As a result, it may quickly
become infeasible to evaluate all agents in the subsequent generation. It may
thus be necessary to develop methods to evaluate only a representative subset
of the population. Alternatively, periodic artificial extinction events [20] may
need to be introduced in order to restrict the population size.
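    A minimal sketch of this population-control idea is given below; the cap and survival fraction are arbitrary illustrative values, and the random subsampling plays the role of an artificial extinction event that does not consult any fitness measure.

import random

def control_population(population, max_size=10_000, survival_fraction=0.1):
    """If relaxed selection lets the population grow past a cap, trigger an
    artificial extinction event that retains a random subset of agents."""
    if len(population) > max_size:
        keep = max(1, int(survival_fraction * len(population)))
        population = random.sample(population, keep)
    return population
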
    Apart from the computational aspects, allowing behaviors to emerge natu-
rally by means of relaxed selection mechanisms has the disadvantage that the
process is completely autonomous, and thus, may not be controllable. Success-
ful, but undesirable behaviors that emerge may have to be weeded out frequently,
either with appropriate safety mechanisms or using human oversight.
    Despite these challenges, the fact remains that generalized behaviors en-
tail generalized objectives, which may be implicitly encoded via the framework
discussed in this work. To this end, evolutionary mechanisms such as relaxed
selection and the Baldwin effect are worth studying further, especially in an era
marked by powerful and inexpensive computer hardware, and an ardent interest
in equipping artificial agents with generalized capabilities.

References
 [1] David Ackley and Michael Littman. Interactions between learning and
     evolution. Artificial life II, 10:487–509, 1991.
 [2] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schul-
     man, and Dan Mané. Concrete problems in ai safety. arXiv preprint
     arXiv:1606.06565, 2016.
 [3] J Mark Baldwin. A new factor in evolution. The american naturalist,
     30(354):441–451, 1896.
 [4] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The
     arcade learning environment: An evaluation platform for general agents.
     Journal of Artificial Intelligence Research, 47:253–279, 2013.
 [5] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learn-
     ing: A review and new perspectives. IEEE transactions on pattern analysis
     and machine intelligence, 35(8):1798–1828, 2013.
 [6] Nick Bostrom. Superintelligence. Dunod, 2017.
 [7] Bruce Chandrasekaran, Michael C Tanner, and John R Josephson. Ex-
     plaining control strategies in problem solving. IEEE Intelligent Systems,
     (1):9–15, 1989.
 [8] Nuttapong Chentanez, Andrew G. Barto, and Satinder P. Singh. Intrinsi-
     cally motivated reinforcement learning. In Advances in neural information
     processing systems, pages 1281–1288, 2004.

[9] Yilun Du and Karthik Narasimhan. Task-agnostic dynamics priors for deep
     reinforcement learning. arXiv preprint arXiv:1905.04819, 2019.
[10] Rachit Dubey, Pulkit Agrawal, Deepak Pathak, Tom Griffiths, and Alexei
     Efros. Investigating human priors for playing video games. In International
     Conference on Machine Learning, pages 1348–1356, 2018.
[11] Chrisantha Fernando, Jakub Sygnowski, Simon Osindero, Jane Wang, Tom
     Schaul, Denis Teplyashin, Pablo Sprechmann, Alexander Pritzel, and An-
     drei Rusu. Meta-learning by the baldwin effect. In Proceedings of the Ge-
     netic and Evolutionary Computation Conference Companion, pages 1313–
     1320. ACM, 2018.
[12] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-
     learning for fast adaptation of deep networks. In Proceedings of the 34th
     International Conference on Machine Learning-Volume 70, pages 1126–
     1135. JMLR. org, 2017.
[13] Daniel Freeman, David Ha, and Luke Metz. Learning to predict without
     looking ahead: World models without forward prediction. In Advances in
     Neural Information Processing Systems, pages 5380–5391, 2019.
[14] Ben Goertzel and Cassio Pennachin. Artificial general intelligence, vol-
     ume 2. Springer, 2007.
[15] David Gunning. Explainable artificial intelligence (xai). Defense Advanced
     Research Projects Agency (DARPA), nd Web, 2, 2017.
[16] David Ha and Jürgen Schmidhuber.        World models.     arXiv preprint
     arXiv:1803.10122, 2018.
[17] Geoffrey E Hinton and Steven J Nowlan. How learning can guide evolution.
     Complex systems, 1(3):495–502, 1987.
[18] John H Holland. Genetic algorithms. Scientific american, 267(1):66–73,
     1992.
[19] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski,
     Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Pi-
     otr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning
     for atari. arXiv preprint arXiv:1903.00374, 2019.
[20] Thommen George Karimpanal. A self-replication basis for designing com-
     plex agents. In Proceedings of the Genetic and Evolutionary Computation
     Conference Companion, pages 45–46. ACM, 2018.
[21] Thommen George Karimpanal and Erik Wilhelm. Identification and off-
     policy learning of multiple objectives using adaptive clustering. Neurocom-
     puting, 263:39 – 47, 2017. Multiobjective Reinforcement Learning: Theory
     and Applications.

[22] Nate Kohl and Peter Stone. Policy gradient reinforcement learning for
     fast quadrupedal locomotion. In Proceedings of the IEEE International
     Conference on Robotics and Automation, pages 2619–2624, May 2004.
[23] Kamran Kowsari, Donald E Brown, Mojtaba Heidarysafa, Kiana Jafari
     Meimandi, Matthew S Gerber, and Laura E Barnes. Hdltex: Hierarchical
     deep learning for text classification. In 2017 16th IEEE International Con-
     ference on Machine Learning and Applications (ICMLA), pages 364–371.
     IEEE, 2017.
[24] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenen-
     baum. Hierarchical deep reinforcement learning: Integrating temporal ab-
     straction and intrinsic motivation. In Advances in neural information pro-
     cessing systems, pages 3675–3683, 2016.
[25] Quoc V Le. Building high-level features using large scale unsupervised
     learning. In 2013 IEEE international conference on acoustics, speech and
     signal processing, pages 8595–8598. IEEE, 2013.
[26] Martin D Levine. Feature extraction: A survey. Proceedings of the IEEE,
     57(8):1391–1407, 1969.
[27] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis
     Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with
     deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[28] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel
     Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K
     Fidjeland, Georg Ostrovski, et al. Human-level control through deep rein-
     forcement learning. Nature, 518(7540):529, 2015.
[29] Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, and Shankar Sastry. Inverted
     autonomous helicopter flight via reinforcement learning. In International
     Symposium on Experimental Robotics. MIT Press, 2004.
[30] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are
     easily fooled: High confidence predictions for unrecognizable images. In
     Proceedings of the IEEE conference on computer vision and pattern recog-
     nition, pages 427–436, 2015.
[31] Stefano Nolfi, Josh C Bongard, Phil Husbands, and Dario Floreano. Evo-
lutionary robotics. 2016.
[32] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE
     Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
[33] Judea Pearl. The seven tools of causal inference, with reflections on machine
     learning. Commun. ACM, 62(3):54–60, 2019.

[34] Athanasios S Polydoros and Lazaros Nalpantidis. Survey of model-based
     reinforcement learning: Applications on robotics. Journal of Intelligent &
     Robotic Systems, 86(2):153–173, 2017.
[35] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal
     value function approximators. In International Conference on Machine
     Learning, pages 1312–1320, 2015.
[36] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre,
     George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda
     Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep
     neural networks and tree search. Nature, 529(7587):484–489, 2016.
[37] Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg.
     Intrinsically motivated reinforcement learning: An evolutionary perspec-
     tive. Autonomous Mental Development, IEEE Transactions on, 2(2):70–82,
     2010.
[38] Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel at-
     tack for fooling deep neural networks. IEEE Transactions on Evolutionary
     Computation, 2019.
[39] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Du-
     mitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of
     neural networks. arXiv preprint arXiv:1312.6199, 2013.
[40] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deep-
     face: Closing the gap to human-level performance in face verification. In
     Proceedings of the IEEE conference on computer vision and pattern recog-
     nition, pages 1701–1708, 2014.
[41] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement
     learning domains: A survey. Journal of Machine Learning Research,
     10(Jul):1633–1685, 2009.
[42] Gerald Tesauro. Temporal difference learning and td-gammon. Commun.
     ACM, 38(3):58–68, March 1995.
[43] James M Whitacre, Tuan Q Pham, and Ruhul A Sarker. Credit assignment
     in adaptive evolutionary algorithms. In Proceedings of the 8th annual con-
     ference on Genetic and evolutionary computation, pages 1353–1360. ACM,
     2006.
[44] David H Wolpert, William G Macready, et al. No free lunch theorems for
     optimization. IEEE transactions on evolutionary computation, 1(1):67–82,
     1997.
[45] Zhaoyang Yang, Kathryn E Merrick, Hussein A Abbass, and Lianwen Jin.
     Multi-task deep reinforcement learning for continuous action control. In
     IJCAI, pages 3301–3307, 2017.

[46] Yu Zhang and Qiang Yang. A survey on multi-task learning. arXiv preprint
     arXiv:1707.08114, 2017.
