
Continual learning in humans and neuroscience-inspired AI

Lucas Weber,∗ Elia Bruni,† and Dieuwke Hupkes‡

University of Amsterdam

(Dated: June 28, 2018)

Abstract

The field of Artificial Intelligence (AI) research is more prosperous than ever. However, current research and applications still aim to optimise single-task performance instead of building algorithms that generalize over multiple tasks and reuse prior knowledge for new challenges. This makes these systems data hungry, computationally expensive and inflexible. A main obstacle on the way towards more flexible and generalizing algorithms is the phenomenon of catastrophic forgetting in connectionist networks. While the generalization abilities of artificial neural networks suffer from catastrophic forgetting, biological neural networks seem to be relatively unaffected by it. In this literature review we aim to understand the mechanisms implemented in the human nervous system to overcome catastrophic forgetting, and we review to what extent these mechanisms are already realized in AI systems. Our review is guided by Marr's levels of analysis and comes to the conclusion that an integration of the partial solutions already realized in AI may be able to overcome catastrophic forgetting more completely than prior solutions.

Keywords: neuroscience, interdisciplinary, catastrophic forgetting, connectionist

[email protected][email protected][email protected]


CONTENTS

1. Introduction
2. Machine learning literature on catastrophic forgetting
3. Neuroscientific literature on catastrophic forgetting
3.1. Neuroscientific framework
3.2. Neuroscientific Theory and Evidence
3.2.1. Complementary Learning Systems Theory
3.2.2. Selective Constraints of neuroplasticity
3.2.3. Neurogenesis within the hippocampus
4. Integration of neuroscientific insight into machine learning
4.1. Using complementary learning systems: from the DQN-model to deep generative replay
4.2. Constraining weight plasticity within the network
5. Discussion
References


1. INTRODUCTION

Reports in mass media and popular science in recent years have come thick and fast with accounts of how artificial intelligence (AI) research is breaking through over and over again, leading the lay reader to expect the machine revolution just around the corner. As usual, news reports on current developments in science are tremendously exaggerated: no company is currently working on the creation of Skynet, nor will your boss be fired tomorrow because machine learning has made their job obsolete. However, the current enthusiasm is not completely unfounded, since state-of-the-art algorithms have made great leaps in their capabilities in recent years, surpassing human-level performance on important tasks.

This success is mainly carried by so-called artificial neural networks (ANNs). ANNs are simplified computational models of biological brains comprising graph-like structures: their nodes correspond to biological neurons and are organized layer-wise. From layer to layer the network computes non-linear transformations by weighting the output from the previous layer(s) and applying a non-linearity (e.g. by setting negative values to zero; Glorot et al. 2011). Especially popular are deep neural networks (DNNs), which stack larger numbers of neural layers on top of each other, making up large architectures with several million optimizable parameters. Examples of benchmark-shifting neural network architectures come, amongst others, from the domains of computer vision (e.g. AlexNet, Krizhevsky et al. 2012) and control of artificial environments (e.g. DQN, Mnih et al. 2015). In other domains, like speech recognition (Graves et al. 2013; Hinton et al. 2012), machine learning has made great progress, even though it does not yet beat human expertise.
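To make the layer-wise computation concrete, the following is a minimal sketch of a single fully connected layer with a rectifying non-linearity. It is a toy illustration of ours (in Python/NumPy), not any specific architecture from the papers cited above, and all names and sizes are arbitrary:

import numpy as np

def relu(x):
    # Non-linearity: set negative values to zero (cf. Glorot et al. 2011).
    return np.maximum(0.0, x)

def dense_layer(x, W, b):
    # Weight the previous layer's output, add a bias, apply the non-linearity.
    return relu(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # output of the previous layer
W = rng.normal(size=(3, 4))   # trainable weights
b = np.zeros(3)               # trainable biases
h = dense_layer(x, W, b)      # activation of the next layer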

Advances in these areas are especially impressive, since these domains of learning largely depend on unstructured data, which traditionally posed the more difficult form of data for algorithms to learn from. As opposed to structured data, where every datapoint has a specific inherent meaning and is organized in a predefined manner (e.g. lists of housing prices mapped to the number of rooms in a house), learning from unstructured data requires drawing conclusions from datapoints that only get their meaning through their context. As an illustrative example of unstructured data, consider the many pixels in a picture that together make up a pattern that looks like a dog: while a single brown pixel on the tip of the dog's nose has no meaning without its surroundings, the pattern of many pixels together is meaningful. Another example, from speech recognition, is sound frequencies that have to be combined in certain patterns to make up comprehensible speech. While humans are normally extraordinarily good at finding regularities in and making sense of this kind of data, AI was traditionally better at inferring from structured information. The advance of machines in these domains is reason for excitement, since most data in the world is unstructured in nature, giving AI a greater scope of possible application. On top of that, it may make machines' interactions with humans more intuitive, as the data utilized by both agents becomes more coherent: computer speech recognition and computer vision might play a crucial role in handling increasingly complex technology, as opposed to classical, highly structured interfaces.

However, under closer inspection, these recent advances lose some of their gloriousness. Looking for the reasons for this success, one finds that it is, to a not insignificant part, based on two factors: the availability of large amounts of data and of computational power. The increased distribution and usage of mobile information technology (statista.com 2018a,b) and the accompanying surge in produced digital data (Kitchin 2014) made it possible to build large-scale online databases (e.g. ImageNet, Jia Deng et al. 2009). These databases provide researchers with millions of labeled training examples, making it possible to train larger architectures that approximate more complex functions (e.g. by building deeper networks, Cabestany et al. 2005). Training these large architectures with such amounts of data comes at a great computational cost, which brings us to the second reason for the current surge in machine learning: Moore's law (Moore 1965) still holds, providing researchers with a previously unmatched amount of computational power. The combination of the explosion in available data and in computational power enabled the training of ever larger models, resulting in better and better performances.

An instance of this development is the previously mentioned AlexNet (Krizhevsky et al. 2012), which caught the AI community by surprise by winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 (Russakovsky et al. 2015), almost halving the error rate of its competitors in the process. While the underlying technique, convolutional neural networks (CNNs), had been widely known in the AI community since the introduction of LeNet by Lecun & Bengio in 1995, it was the refinement of this technique, and especially its use in a computationally expensive DNN trained on millions of labeled images, that made this success possible. To illustrate how far the trend of increasing architecture size has gone, we briefly mention state-of-the-art architectures like deep residual networks (ResNets) (He et al. 2015), which comprise up to 152 layers of convolutional computations.

If increasing network size works so well, why would we want to change anything about it? And why do we dismiss the media's reports about machines' rise to intelligence? There are multiple ways to answer these questions.

First of all, while Moore's law still holds at the moment, it is projected to end in the near future (see e.g. Kumar 2012; Waldrop 2016): engineers are approaching the physical limitations of possible transistor size, which may decelerate the growth of the computational efficiency of processing units. With its dependence on ever-growing amounts of available computational power, the approach of building bigger and bigger systems to yield better performances is then likely to decelerate the advance of AI as well. If the field of AI does not want to rely on a revolution in electrical engineering and in the way we do computations, it is well advised to avoid getting too heavily invested in brute-force systems (e.g. deep ResNets). The objective during the development of new architectures should not only be to decrease error rates, but also to improve how data- and computationally efficiently results are obtained.

Second, the current approach leads to algorithms whose capabilities are limited to the very specific domain they are trained for. Even within their domain they are unable to adapt to new environments or changes in task demands without complete retraining of the system (Lake et al. 2016). While DNNs have solved the problem of finding patterns within a task, finding regularities on a larger scope, between different tasks, is still a mostly unsolved issue. This reveals a lack of self-organized transfer learning and larger-scale generalization. These, however, are attributes necessary to achieve what psychologists term general intelligence or the g-factor (Spearman 1904), the ultimate goal in the creation of AI. In psychometrics, g describes the positive correlation of an individual's performance across different cognitive tasks (a transfer of g to machine intelligence is given by Legg & Hutter 2007). This generalization of cognitive abilities over multiple domains and tasks, central to psychologists' definition of intelligence for over a century, is missing in current AI. On the contrary, the current brute-force, data-hungry models perform extraordinarily well but are restricted to their predefined, very narrow domain. Stating that today's systems are intelligent is therefore per definitionem false.


However, building a system comprising real general intelligence is an immensely complex task. Luckily, researchers can draw inspiration from the most sophisticated cognitive processor currently known: the human brain. Emerging from selective pressure over thousands of years (Wynn 1988), the human mind is the most adaptive and productive computational agent we know. To illustrate the extraordinary capabilities of human intelligence compared to contemporary AI, we would like to cite Lake et al. (2016), who give a comprehensible example concerning control in Atari 2600 video games. When humans without noteworthy experience in one of the games and a contemporary deep reinforcement learning algorithm (DQN) (Mnih et al. 2013) both learn to play Atari video games, their learning curves differ tremendously. When the DQN is trained on the equivalent of 924 hours of unique playing time and additionally revisits these 924 hours of playing eight times, it still reaches only 19% of the performance of a human player who played the very same game for 2 hours. This illustrates how much more efficient humans are in making use of the data they are given. Even though subsequent, enhanced variants of the same algorithm (DQN+ and DQN++, Wang et al. 2015) achieve up to 83% and even 98% of the human player's performance, their learning curves are still far from being as steep. This is especially significant in the early phase of learning: while humans demonstrate particularly large performance gains in the initial phase of learning, DQN++ needs more time to show improvements. After being trained for only two hours, like its human competitor, DQN++ reaches only 3.5% of human performance.

How is this possible? As explained above, g, as it is found in humans, requires generalizing and transferring knowledge from prior tasks and applying it to new challenges posed by the environment. Most machine learning agents currently lack this ability.

To transfer knowledge from one domain to another, a cognitive agent needs to be able to learn sequentially from a myriad of different experiences over a lifetime and integrate their commonalities. This sequential learning task appears very natural to humans, but it is in fact of great difficulty for other cognitive agents: learning opportunities often appear unanticipated, last only briefly, and are separated in time. To nevertheless make sense of this apparent mess of inputs, Lake et al. (2016) formulated guiding principles that are likely to be central to how humans learn. They highlight three principles necessary for efficient, generalizing sequential learning: (1) compositionality, (2) learning-to-learn and (3) causality. We will briefly introduce these principles here and relate them to the core issue of this paper, the obstacle of catastrophic forgetting in sequentially learning systems, which we will explain in more depth in part 2.

(1) Compositionality is the idea that concepts are built out of more primitive building blocks. While these concepts can be decomposed into their elements, they can in turn also serve as building blocks for even more complex concepts, which can then be recombined again, and so on. As an anecdotal illustration, consider computer programming: basic functions can be combined into more complex functions, which in turn can be recombined to make up even more sophisticated ones. In this way functions stack up from machine code to high-level programming languages and sophisticated computer programs. To reach this high level of complexity, the lower-level concepts need to be of general form and be shared among as many higher-level concepts as possible.
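As a toy rendering of the programming analogy above (a sketch of ours, not of any model discussed in this review), primitives can be composed into reusable higher-level functions:

def compose(f, g):
    # Build a more complex function out of two simpler building blocks.
    return lambda x: f(g(x))

double = lambda x: 2 * x
increment = lambda x: x + 1

# The same low-level concepts are shared among different higher-level ones:
double_then_increment = compose(increment, double)
increment_then_double = compose(double, increment)

assert double_then_increment(3) == 7   # (3 * 2) + 1
assert increment_then_double(3) == 8   # (3 + 1) * 2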

Compositionality naturally connects to (2) learning-to-learn, first introduced by Harlow (1949): when humans are confronted with situations that go beyond the data they have encountered so far, they can infer from prior learned concepts what is most reasonable in the new situation and therewith try to deal with the new circumstances. Since concepts are often (partially) shared between different tasks, learning will be faster and require less data. While very similar to transfer learning, an idea already very popular in current AI, learning-to-learn places greater emphasis on being grounded in the aforementioned compositionality. Transfer learning describes how parts of learned concepts are taken and utilized to solve other tasks; it usually takes place when two similar tasks are trained in a row, and it is already partly realized in deep CNNs through feature sharing between tasks, albeit on a very small scale. Learning-to-learn, on the other hand, intends to take transfer learning to a higher, more human-like level, by not only extracting shareable features and letting them loosely coexist next to each other, but also relating these features (or concepts) to each other in a causal way.

This leads us to Lake's third principle. (3) Causality refers to knowledge about how the observed data comes about. Systems that feature causality therefore concentrate not only on the final product, but also on the process by which it is created. In general, causal models are generative (as opposed to purely discriminative models). Causality gives generative models the possibility to grasp how concepts relate to each other, usually making generative models that embrace causality better at capturing regularities in the environment. Causality comprises knowledge of state-to-state transitions in the environment and therefore goes naturally hand in hand with sequential learning. When knowledge about state-to-state transitions is learned as well, the system can relate concepts to each other and determine how they usually interact. By inverting the idea of causality, a cognitive agent can infer and reason about the causes of its current situation.

While making these points, Lake et al. (2016) refer to their implementation of these very same ideas in Lake et al. (2015). Their generative model, called Bayesian Programme Learning (BPL), recognizes and categorizes characters from different alphabets by combining a set of primitives according to the aforementioned principles. In doing so, it reaches super-human performance in one-shot learning of new character concepts, demonstrating its ability to learn new concepts from sparse data with little computational power.

While promoting important ideas and yielding promising results, BPL has the problem that it needs a lot of top-down, knowledge-based hand-crafting to obtain its impressive performance. Further, its task is limited to a very specific, simple domain, the opposite of the greater generalization ability that it is meant to promote. This hand-crafting includes providing the generative model with the primitives it can use to create its more complex characters. While it is practicable to provide a generative model with appropriate primitives for a relatively simple character recognition task, it becomes more difficult to do so for models learning more complicated functions. McClelland et al. (2010) named this problem more eloquently as the need for 'considerable initial knowledge about the hypothesis-space, space of possible concepts and structures for related concepts' that is inherent to generative, probabilistic models. The idea of top-down design of model architectures becomes less feasible when we aim for a more general AI agent. The principal ideas of compositionality, causality and learning-to-learn, however, are to be considered fundamental for building more intelligent AI systems. Their implementation, though, has to take another route: it has to be driven by an emergent approach that can add the needed complexity to the model. The previously mentioned ANN architectures offer this emergent complexity. One way to harness top-down ideas while sticking to an emergent, connectionist framework is to create modular architectures, encoding top-down inspiration in the functionality of the modules and their interactions, while keeping the benefits of emergent complex structure within the single components (Marblestone et al. 2016).

An example of the value of approaches integrating emergent structure with top-down guiding knowledge is the so-called long short-term memory (LSTM) (Hochreiter & Schmidhuber 1997), which has enjoyed ever-growing popularity since its introduction. LSTMs are related to the function of human working memory (Baddeley & Hitch 1974): similar to working memory, they can hold small portions of information that will be needed later by providing a temporary memory buffer that can store, retrieve or erase its contents as needed. This modular addition to classical recurrent neural network (RNN) architectures allows a great increase in performance on sequential-behavior tasks that rely on the use of information over a larger number of timesteps. Building on this idea, subsequent algorithms (e.g. memory networks, Weston et al. 2014) are constructed even closer to the biological archetype (e.g. by separating memory and control functions) and, by refining the modular structure of the architecture, yield even better performance without adding excessive amounts of trainable parameters.
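A minimal usage sketch of an LSTM as a memory buffer over a sequence follows below; PyTorch is our choice of framework here, and all sizes are arbitrary:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 20, 8)   # one sequence of 20 timesteps, 8 features each
output, (h, c) = lstm(x)    # c is the gated cell state that stores,
                            # retrieves and erases content over time
print(output.shape)         # torch.Size([1, 20, 16])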

In the transfer of Lake et al.'s (2016) principles to such a modular, emergent approach, however, lies an old, well-known problem: to learn sequentially with the prospect of achieving true intelligence, one needs to avoid catastrophic forgetting. A system that forgets catastrophically will not be able to utilize Lake's principles of learning-to-learn, compositionality and causality. Catastrophic forgetting, however, is inherent to classical emergent connectionist approaches.

In the following part 2 of this review we will explain catastrophic forgetting in more detail and expand on early ideas in the AI community to resolve the problem. In the subsequent part 3, we will look at how humans handle the problem of catastrophic forgetting. Since we have already argued that human cognitive agents can learn sequentially and implement Lake's principles, we should be able to find mechanisms by which catastrophic forgetting is prevented in biological neural networks. While doing so, we will locate the ideas we find on Marr's (1982) levels of analysis, to make it easier to put them into context. Thereafter, we will present evidence from the individual disciplines that is likely to be useful in the construction of new AI architectures. In part 4 we will demonstrate machine learning algorithms in which those ideas are already implemented. In part 5 we will discuss how the aforementioned ideas might be integrated.


2. MACHINE LEARNING LITERATURE ON CATASTROPHIC FORGETTING

The phenomenon of catastrophic forgetting, also known as catastrophic interference, was first brought up by McCloskey & Cohen (1989). It describes the interference of a new task with previously learned tasks in classical, sequentially trained connectionist networks. The reason these networks are prone to interference is that when a task A is learned by the network, information regarding this task is not saved locally, but in a distributed manner, spread over many nodes in the network (see parallel distributed processing (PDP), Rumelhart et al. 1986). When a second task B is trained afterwards, the network uses the very same connections to learn task B that were previously used to memorize task A. Training on task B therewith overwrites the pattern for task A within the weight distribution.

What happens when the knowledge representations of two tasks interfere with each other is easiest to understand when we consider the learned solution to a task in weight space. Weight space is a multidimensional space in which every parameter of the network represents one dimension; it represents all combinations of weight values that a network can possibly adopt. What happens in weight space when we train the network on two different tasks (A and B), one after the other? While being trained on task A, the weight configuration of the system slowly migrates through weight space and finally converges on a weight combination that solves task A satisfactorily well. When the network is subsequently trained on the second task B, the weights migrate through weight space towards a solution of task B. During this second training phase the network neglects prior learned information, veering away from the solution of task A and therewith consequentially causing catastrophic forgetting. To prevent this from happening, the network needs to find a solution within weight space for task B that also poses a solution to task A. Since networks are usually overparameterized, it is very likely that there are multiple points in weight space (certain weight combinations) that yield overlapping solutions for both tasks (Kirkpatrick et al. 2017). If the learning algorithm can be constrained in a way that makes it find a solution residing in such an overlap area, task B can be learned without interfering with task A.
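The effect described above can be demonstrated in a few lines. The following is a hedged toy sketch of ours (PyTorch; tasks, sizes and names are all arbitrary) that trains one network sequentially on two random regression tasks and measures how the solution for task A degrades:

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

def make_task():
    # A toy regression task: random inputs mapped through a random linear rule.
    X = torch.randn(256, 10)
    return X, X @ torch.randn(10, 1)

def train(X, y, steps=500):
    opt = torch.optim.SGD(net.parameters(), lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(net(X), y).backward()
        opt.step()

(XA, yA), (XB, yB) = make_task(), make_task()
train(XA, yA)
print("task A loss after A:", loss_fn(net(XA), yA).item())
train(XB, yB)   # unconstrained training on B overwrites the solution for A
print("task A loss after B:", loss_fn(net(XA), yA).item())  # typically much higher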

Catastrophic forgetting is an extreme case of the stability-plasticity dilemma (Carpenter & Grossberg 1987). The stability-plasticity dilemma concerns how the global rate of learning (or plasticity) in a network influences the stability or instability of distributed knowledge representations. In a parallel distributed system a certain amount of plasticity is necessary to integrate new knowledge: when plasticity is too low, a so-called entrenchment effect can be observed, in which the rate of change of the connection weights is too low to cause any noteworthy adaptation when the network is confronted with new information. While new information then does not erase prior knowledge, the network is also no longer able to adapt to new information. If, however, there is too much plasticity, prior knowledge is constantly overwritten and catastrophic forgetting occurs. An optimal learner thus has to keep its connections partially plastic to be able to integrate new knowledge, while at the same time constraining plasticity selectively so as not to overwrite prior knowledge.

Figure 1: In this example a two-weight system has found a solution for task A and is subsequently trained on a second task B. During training the system is unconstrained by its prior knowledge and therefore neglects task A while migrating through weight space towards a solution of task B, resulting in catastrophic forgetting of task A. To avoid forgetting catastrophically, the system would need to migrate towards the region where the two solution areas overlap.

Catastrophic forgetting poses a substantial problem for the sequential learning of neural networks, and thereby for the development of continually learning and generalizing systems. This is why different solutions to the problem have been proposed over the years. One of the first came from French (1992), who argues that reducing the overlap between different representations is key to avoiding catastrophic forgetting; almost all subsequent solutions follow this line of thinking. French introduced a 'sharpening' technique, in which activations of nodes that are already high are increased and activations of nodes that are already low are decreased, making the activation pattern for a given representation more sharply separated and therewith disentangling it from the activation patterns of other representations. He called the outcome 'semi-distributed representations'. While his approach was partly successful in reducing catastrophic forgetting, it may reduce the ability of the network to generalize. Generalization relies on the emergence of more abstract features that contribute to several tasks rather than to a single one. Since sharpening makes it harder for the network to change prior knowledge representations, the network will not create such abstract features; instead it is more likely to find solutions for different tasks separately from each other and store them in parallel in different parts of the network.
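A minimal sketch of such a sharpening step follows; this is our loose reading of French's (1992) idea, with parameter names and the specific update rule chosen by us for illustration:

import numpy as np

def sharpen(activations, k=2, alpha=0.3):
    # Nudge the k most active units towards 1 and the rest towards 0,
    # yielding a more sharply separated, 'semi-distributed' pattern.
    a = np.asarray(activations, dtype=float)
    target = np.zeros_like(a)
    target[np.argsort(a)[-k:]] = 1.0   # winners are pushed up ...
    return a + alpha * (target - a)    # ... losers are pushed down

h = np.array([0.9, 0.2, 0.6, 0.4, 0.1])
print(sharpen(h))   # [0.93, 0.14, 0.72, 0.28, 0.07]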

In another approach, Brousse & Smolensky (1989) and McRae & Hetherington (1993) stated that humans do not learn from scratch, but can base new representations on large pretrained networks. They describe that when tasks are highly internally structured (like, e.g., speaking a language), new data samples share the regularities of previous data and therefore do not elicit drastically different activations and changes in weights. According to French (1994) and Sharkey & Sharkey (1995), however, this idea suffers from an inability to generalize as well: since, according to Brousse & Smolensky (1989), only highly internally structured tasks can be learned sequentially this way, new data must have the same regularities as previous data, while differing tasks will most likely have different internal structure. Generalizing knowledge from one domain to another, as is necessary for general intelligence, therefore won't be possible.

3. NEUROSCIENTIFIC LITERATURE ON CATASTROPHIC FORGETTING

In the previous section we depicted the problem that catastrophic forgetting poses for AI systems on their way to becoming better sequential learners, to utilizing causality, compositionality and learning-to-learn, and thereby to becoming truly generalizing, intelligent agents. Almost thirty years have passed since the first description of the problem, but the AI research community has still not resolved it. We also depicted the early attempts to do so, which mostly sought a solution through dexterous mathematical insight, but which could not completely sort out the problem or brought other issues with them (e.g. loss of the ability to discriminate properly after pretraining). Considering this apparent persistence of catastrophic forgetting against prevailing practice, broadening our perspective to other disciplines (those studying human information processing) for inspiration, as suggested in the introduction, might be a good idea. In the following sections we will carve out the most promising fields of human-related research in which to look for a solution and present the available evidence relating to the problem of catastrophic forgetting.

3.1. Neuroscientific framework

So, how do humans prevent catastrophic forgetting in their biological neural networks? To harness scientific insight about this question, and perhaps finally answer it, we need to consider not only the discipline of neuroscience in its broadest sense, but also theory from the cognitive sciences. Together they span the field of Brain and Cognitive Science, a strongly interdisciplinary endeavor. Ranging from loose cognitive-psychological theory to deterministic molecular-biological mechanisms, the field is not straightforward to comprehend from the outside. To make it more palatable, we will present Marr's (1982) influential framework of levels of analysis, which guides the discipline to this day. Based on Marr's levels it becomes easier to organize the research we discuss in a meaningful way and to appreciate the idea that there is most likely not a single approach, but multiple approaches that conjointly may give us a satisfying and complete solution.

Marr introduces three levels of analysis: computational, algorithmic and implementational. He describes the computational level as where we state the problem we would like to address, without providing an answer on how to solve it. In the Brain and Cognitive Sciences it is best described by the discipline of cognitive psychology, which offers modularized cognitive concepts whose interconnections are loose and unformalized, providing only few concrete mechanisms by which they come about and interact. It helps us state relevant questions (e.g. what kind of modules do humans need to store short-term memories, manipulate them, and integrate them into prior knowledge?). We mostly obtain knowledge on this level through highly controlled, quantified psychological experiments and deductive reasoning. Approaches that are solely inspired by the computational level of analysis and not informed by the other two are top-down oriented, like Lake et al.'s BPL mentioned in the introduction.

Figure 2: The three levels of Marr, (a) computational, (b) algorithmic and (c) implementational, exemplified by the process of human vision. Every lower level is a realization of the levels above it, and every higher level can be realized in different ways on lower levels. In the example of vision, the algorithm at the algorithmic level can be realized not only in vivo in biological neural networks, but also in silico using artificial neural networks.

The algorithmic level helps to find solutions to the problems stated on the computational level, holding the concrete mechanisms by which they may be solved; it provides the bridge between computation and implementation. It corresponds to the discipline of cognitive neuroscience, which locates cognitive concepts within different brain areas and therewith matches them to neural substrate, utilizing evidence from, e.g., neuroanatomy, neuroimaging and electrophysiology. By finding correlations between the use of certain cognitive resources and brain activity, the algorithmic level builds a bridge between high-level concepts like cognition and the underlying neural 'hardware'.

On the implementational level we define how the aforementioned mechanisms or algorithms are realized, i.e. the physical substrate the mechanisms are performed on. This substrate may be in silico, through transistors on a microchip, or in vivo, through populations of neurons and their interactions. While some substrates may be more suitable than others, in principle any theory from the higher levels may be implemented in multiple different substrates. With regard to humans, this level is best represented by molecular and behavioral neuroscience, which shows us exactly how single neurons function and how they can be affected through different kinds of stimulation (e.g. neuromodulation or in vivo electrical stimulation). Evolved from ideas on the implementational level of analysis, emergent connectionist networks are a very successful account in state-of-the-art AI (e.g. in the form of feedforward and recurrent networks).

Thus it appears that in the Brain and Cognitive Sciences we have the same approaches to knowledge acquisition as we have modeling approaches in AI: a top-down, knowledge-guided approach represented by cognitive science and psychology, and a bottom-up, emergent approach represented by molecular neurobiology. Both approaches try to inform an intermediate level of understanding, which yields the mechanisms by which neural and cognitive processing works.

In Marr’s framework, every new level should be considered as a realization of its predeces-

sor, meaning that for example the algorithmic level is realizing the problem stated on the

computational level. It is worth mentioning that this, however, does not mean that insights

from lower level research cannot inform higher level theories (for example the discovery

of grid cells in the human cortex changed the way we think about spatial-memory and

-cognition [Moser et al., 2008]). Choosing the right level of analysis to conduct research has

been a controversial subject for a long time. The protracted debate about the supposed

superiority of one approach over another has seen no single winner. The opposite is the

case: Holding on to a single framework has not been proven to be fruitful, and it is wide

consensus by now that a complete theory should be informed by all levels of analysis. As

systems in artificial intelligence grows more and more sophisticated, this idea will become

increasingly important in that discipline as well. For illustrative purposes, we would like to

give two brief examples of contemporary modeling approaches ignoring this notion.

Our first example is Bayesian Programme Learning (BPL), which we already mentioned in the introduction. BPL is an algorithm inspired by cognitive science only, and therewith resides on the computational level. It categorizes handwritten characters from different alphabets by combining a set of primitives, namely possible pen strokes that, when combined, make up a character. These primitives are fairly simple and few in number, which is why it is straightforward to provide an appropriate hypothesis space (a set of primitives) for the Bayesian inference. As soon as the task becomes more complex, the primitives have to become more abstract and greater in number, and finding such an appropriate hypothesis space and hand-crafting it into the system is not trivial. A purely top-down oriented approach is therefore unable to capture complexity as found in human information processing.

On the other hand, even though they have been successful in the past and still are, purely emergent connectionist networks guided by insights from the implementational level, like standard feedforward networks, pose the problems stated in the introduction: they lack the ability to generalize broadly and to learn sequentially. Focusing on the implementational level can produce more powerful networks that perform extraordinarily well on single tasks, but these will most likely not learn such tasks flexibly, adapt to changing task requirements or transfer knowledge from one domain to another. Thus it will not satisfy our quest for real general intelligence. A demonstration of this is how the idea of pure feedforward neural networks has recently been led ad absurdum by the creation of incredibly deep networks (e.g. ResNet-152). Those networks may yield better performance, but they need an unreasonably large amount of training, consuming computational power and data on an exaggerated scale, and are only possible through clever hacks in the network architecture (skip connections, in the case of ResNet). From a biological perspective these models lack all plausibility. Considering human vision, which is commonly modeled by this kind of network, such an excessive number of layers in the feedforward sweep of processing would lead to exaggerated perceptual delays in humans, since stage-to-stage processing time in the ventral visual stream is approximated at 10 ms per neural population (Panzeri et al. 2001). A biologically implemented ResNet would therefore be no match for human object recognition (estimated at 120 ms) in terms of efficiency. It thus seems unnecessary to maintain large numbers of expensive computational layers to reach sufficient performance for object recognition. In accordance with this, Serre et al. (2007) suggest that the depth of the human ventral visual pathway may be estimated at only about 10 processing stages.

An algorithm providing a generally intelligent solution should therefore be informed by all levels of analysis, thereby providing flexibility paired with complexity. We will keep this in mind and see how it may also apply to a solution for the problem of catastrophic forgetting.

3.2. Neuroscientific Theory and Evidence

So far we have laid out the problem of catastrophic forgetting in machine learning and isolated the levels of description at which we are searching for a solution in neuroscience. Now we will look at evidence from the brain and cognitive sciences that we might profit from. Interestingly, even though humans do suffer from retroactive interference of newly acquired information with older memories (Barnes & Underwood 1959), this interference is never catastrophic. Since artificial neural networks are thought to function in the same way as biological networks, the human neural system must have implemented countermeasures against this sequential learning problem that have not been adopted by its highly simplified artificial counterparts. There are different ideas about what these countermeasures are.

3.2.1. Complementary Learning Systems Theory

The first idea we present here is a cognitive neuroscientific theory, informed by molecular neuroscience as well as cognitive-scientific ideas. Rooted in ideas of Marr (1970, 1971) and Tulving (1985), the so-called complementary learning systems (CLS) theory was first formalized by McClelland et al. (1995). CLS may give an account of how catastrophic forgetting is avoided in humans. The theory proposes that human learning functions via two separate memory systems. The first system is the neocortex, which, as the name implies, is a very recent evolutionary acquisition shared only among mammals and is responsible for all their higher cognitive functions (Lodato & Arlotta 2015). To achieve the complexity of higher cognitive functions, the neocortex has to integrate ambiguous information over long time spans, which it does by slowly estimating the statistics of the individual's environment. The functionality of modern deep ANNs is mainly inspired by the neocortex; due to their similar architecture, neocortex and deep ANNs share a large set of properties, such as their large capacity and their slow, statistical way of learning. Since the neocortex is a statistical learner, it integrates general knowledge (i.e. semantic knowledge) about the world that is no longer connected to the specific learning experience (i.e. it stores no episodic memory).

The storage of episodic information is achieved by the second of the two memory systems in CLS theory, located in the medial temporal lobe structure of the hippocampus. This system is thought to be a fast learner with very limited storage capacity. The hippocampus' main objective is to store episodic memories and preprocess them for later integration into the statistically learning neocortex. To be able to store specific events, the hippocampus has to orthogonalize incoming activation patterns to make them distinct from previous experience, a process called pattern separation. Further, the hippocampus extracts regularities from these distinct experiences and then trains the neocortex in an interleaved fashion. This interleaved memory replay helps the neocortex in the learning process by reactivating the cortical connections central to the memory. Replay is essential, since the slowly learning neocortex will hardly learn from a single exposure to an experience. Additionally, interleaving different memories and replaying them can also be a means to prevent catastrophic forgetting in the neocortex (McCloskey & Cohen 1989). This is similar to optimizing multiple tasks in parallel during the training of ANNs, where interleaving examples of different tasks can also help to overcome catastrophic forgetting. In support of the assumption that the hippocampus' interleaved replay is important to avoid catastrophic forgetting, McClelland et al. (1995) argue that in lower mammals lacking the hippocampal-neocortical division (and therewith complementary learning systems), catastrophic forgetting might actually take place; whether that is the case is still an open question (French 1999). As mentioned before, the architecture and functionality of current ANNs can be seen as analogous to the human neocortex, while the second memory system, the hippocampus, has no counterpart in most current AI systems. Since it seems to be important for preventing catastrophic forgetting, however, it might be a worthwhile additional module. To better understand how such a module might work, we should consider the inner dynamics of the hippocampus in more detail (see also figure 3).
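The interleaving idea translates directly to ANN training. The following is a hedged toy sketch of ours (PyTorch; all names, sizes and data are arbitrary) in which stored examples of an old task A play the role of the hippocampal buffer and are mixed into each batch for a new task B:

import torch
import torch.nn as nn

def interleaved_step(net, opt, loss_fn, xb, yb, memory_X, memory_y, k=8):
    idx = torch.randint(len(memory_X), (k,))   # sample old experiences
    x = torch.cat([xb, memory_X[idx]])         # interleave old and new
    y = torch.cat([yb, memory_y[idx]])
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
    return loss.item()

net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.05)
memory_X, memory_y = torch.randn(100, 10), torch.randn(100, 1)  # task A buffer
xb, yb = torch.randn(16, 10), torch.randn(16, 1)                # task B batch
interleaved_step(net, opt, nn.MSELoss(), xb, yb, memory_X, memory_y)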

The hippocampal circuitry consists mainly of the trisynaptic pathway or loop (TSP) and the monosynaptic pathway (MSP). The TSP is made up of the entorhinal cortex (ERC), the dentate gyrus (DG), CA3 and CA1. These neural populations are connected through forward connections, while CA3 additionally has recurrent connections onto itself. The TSP is responsible for the encoding of new information and for pattern separation (the orthogonalization of single experiences) (Schapiro et al. 2016). The encoded and orthogonalized information is then stored in CA3 (Tulving 1985). This episodic memory buffer is similar in its mechanism to Hopfield networks (Wiskott et al. 2006; see Amit 1989 for an introduction to Hopfield networks as a neural circuit model).

Figure 3: The entorhinal cortex (ERC) serves as both input and output module for the episodic memory buffer of the hippocampus. The trisynaptic pathway (green) comprises the dentate gyrus (DG) and the cornu ammonis fields 3 and 1 (CA3, CA1). The DG orthogonalizes the input from the ERC so that it can be stored in CA3 without overlap with prior experiences. The monosynaptic pathway (red) comprises CA1 and the output layer of the ERC. CA1 extracts statistical regularities from the episodic memory buffer in CA3, which are necessary to subsequently train the slow, statistically learning neocortex.

The MSP, on the other hand, consisting of the ERC and CA1, is trained by CA3 in a statistical manner, similar to the neocortex in CLS theory. The generalized knowledge representation in the MSP is then used to train the neocortex via repetitive, interleaved memory replay. Replay takes place predominantly during low-activity phases (e.g. during slow-wave sleep, Stickgold 2005).
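For readers unfamiliar with the Hopfield-style autoassociative buffer that the CA3 store is compared to above, here is a minimal sketch of ours (the learning rule and patterns are illustrative only): patterns are stored with a Hebbian outer-product rule, and a noisy cue settles onto the nearest stored attractor.

import numpy as np

def store(patterns):
    # Hebbian outer-product learning over +/-1 patterns.
    P = np.array(patterns, dtype=float)
    W = P.T @ P / len(P)
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, probe, steps=10):
    s = np.array(probe, dtype=float)
    for _ in range(steps):
        s = np.sign(W @ s)   # update towards a stored attractor
        s[s == 0] = 1.0
    return s

memory = [[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]]
W = store(memory)
print(recall(W, [1, -1, 1, -1, 1, 1]))  # noisy cue settles to the first pattern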

So far CLS theory seems like a reasonable and parsimonious solution to our problem. However, there are three reasons why it is unlikely that the hippocampal-neocortical division, and therewith episodic memory replay, is the only mechanism contributing to the prevention of catastrophic forgetting in humans. Firstly, there are no known cases of catastrophic forgetting in higher mammals, even when the hippocampus is damaged. Lesion studies in animal models, and case studies of humans with hippocampal lesions due to stroke, should lead to conditions similar to catastrophic forgetting in neural network models, since the biological brain would no longer be able to interleave new learning experiences with older ones. A lesioned hippocampus, however, leads to a related but different condition: medial temporal lobe amnesia (MTL amnesia) (Squire et al. 1991, 2004). In MTL amnesia, individuals suffer from a loss of episodic memory, which we described above as the orthogonal, pattern-separated memories stored in CA3 of the hippocampus. At the same time, patients have relatively unimpaired generalized semantic memory (Race et al. 2013); since their general semantic memory is unimpaired, they do not seem to suffer from catastrophic forgetting. This appears to speak against a central role of complementary learning systems in the prevention of catastrophic forgetting in humans, because the lack or malfunction of the hippocampus should leave the neocortex exposed to new experiences that are not interleaved with prior knowledge. One might argue, however, that lacking the memory replay unit, the neocortex of MTL amnesia patients is affected by a new experience only once, namely while the event is actually taking place, as opposed to the exposure through multiple replays in healthy individuals. It may be that the neocortex is simply not stimulated enough to make fundamental changes to its connections: while it is still stimulated by the original experience, there is no replay of this memory, which is essential for the slowly learning neocortex to efficiently alter its connectivity patterns. This would 'freeze' the knowledge stored in the neocortex, making it inaccessible to new information, but at the same time preventing the loss of older semantic knowledge. This is indeed what is observed in MTL amnesia patients: while retrospective knowledge, consolidated before the loss of the hippocampus, is relatively unimpaired, the acquisition of new knowledge almost comes to a standstill, with new factual information being learned only after long time intervals and many repetitions (Bayley & Squire 2002).

Another compelling objection to an exclusive role of the hippocampal system as a countermeasure to catastrophic forgetting is that every experience an individual was ever confronted with would have to be saved within the capacity-limited hippocampal system and constantly be replayed, interleaved with new experience. We know that the hippocampus has relatively limited capacity, and the number of memories that would need to be replayed would grow linearly with lifetime; beyond a certain age, memory replay would thus become unfeasible. A solution to this issue is the so-called pseudorehearsal introduced by Robins (1995). Pseudorehearsal works without access to all prior training data (in our case, the memories of a lifetime); instead, it creates its own training examples (pseudoitems) by passing random binary input through the network and using the resulting outputs as training examples for interleaved training. Pseudoitems created in this way are described by Robins as a kind of 'map' that can reproduce the original weight distribution. As an anecdotal side note from the authors, this intuitively makes sense: sleep is the time during which memory replay is thought to predominantly take place (Stickgold 2005), and sleep co-occurs with the subjective experience of dreams, which often resemble a commingling of recent experiences with odd intermixtures of past memories.
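A hedged sketch of pseudorehearsal as just described follows (PyTorch; the network, the stand-in "new task" data and all names are our illustrative choices, not Robins' original setup): a frozen copy of the old network labels random binary inputs, and these pseudoitems are interleaved with the new data.

import copy
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
old_net = copy.deepcopy(net).eval()   # snapshot of prior knowledge

with torch.no_grad():
    pseudo_X = torch.randint(0, 2, (128, 10)).float()  # random binary input
    pseudo_y = old_net(pseudo_X)                       # old net's responses

new_X, new_y = torch.randn(128, 10), torch.randn(128, 1)  # stand-in new task
X = torch.cat([new_X, pseudo_X])
y = torch.cat([new_y, pseudo_y])

opt = torch.optim.SGD(net.parameters(), lr=0.05)
loss_fn = nn.MSELoss()
for _ in range(200):   # interleaved training on new data plus pseudoitems
    opt.zero_grad()
    loss_fn(net(X), y).backward()
    opt.step()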

Lastly, how is the hippocampus itself kept free of catastrophic forgetting? It is itself a connectionist network and should suffer from the very problem it is trying to avoid in the neocortex, so an additional mechanism would be necessary to protect the hippocampus from catastrophic forgetting (Wiskott et al. 2006). Excitingly, such a mechanism exists and is laid out by the research of Wiskott et al. (2006): adult neurogenesis, the generation of new neurons out of neural stem cells in the DG of the hippocampus, might be a countermeasure against catastrophic forgetting within the hippocampus itself. The DG is one of only two regions of the brain capable of neurogenesis (Eriksson et al. 1998). We will consider the idea of neurogenesis in more depth in part 3.2.3 of this review.

3.2.2. Selective Constraints of neuroplasticity

The next neuroscientific insight we present is located on the molecular level. In the human central nervous system, a change in plasticity (in other words, the readiness of a synapse to change its connection strength) can either render the connections (in ANNs: weights) in a network modifiable or fix their status quo. By selectively increasing or decreasing plasticity, the brain can learn new tasks while conserving old skills and knowledge (Yang et al. 2009). Importantly, these changes in plasticity are selective, which makes it possible to learn a new task without overwriting existing skills. This is opposed to what happens in most ANNs during training: while the learning rate is often dynamic, meaning that it changes over the course of learning, it is applied globally over all connections and not adapted separately in different regions.
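A toy analogue of such selectivity in an ANN is a per-weight plasticity mask that scales each gradient, so some connections stay stable while others remain free to change. This is a hedged sketch of ours (PyTorch), not a model of the biological signalling discussed below, and the random mask is purely illustrative:

import torch
import torch.nn as nn

net = nn.Linear(10, 5)
plasticity = {name: torch.rand_like(p) for name, p in net.named_parameters()}

opt = torch.optim.SGD(net.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 5)
loss = nn.functional.mse_loss(net(x), y)
loss.backward()
with torch.no_grad():
    for name, p in net.named_parameters():
        p.grad *= plasticity[name]   # low-plasticity weights barely move
opt.step()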

When talking about plasticity changes in the human nervous system on the cellular level, dendritic spines are considered essential. Spines are little protrusions on the post-synaptic neuron, formed at synaptic connections between neurons. Changes in connection strength between two neurons are due to morphological changes (remodeling) of the dendritic spines of the post-synaptic neuron (Yang et al. 2009). Additionally, increased plasticity can lead to the formation of new spines, and therewith new interneuronal connections, or to the elimination of existing spines and therewith the loss of connections. These changes, formations and eliminations of dendritic spines are the means of changing the connection strength between two neurons and are thus the basis of learning.

To better understand how selective neuroplasticity comes about, we will take a look at the molecular basis of spine remodeling. The changes in spine morphology depend on N-methyl-D-aspartate (NMDA) receptor activity. As opposed to other excitatory receptors (e.g. AMPA in figure 4), the NMDA receptor not only allows sodium ions (Na+) to enter the neuron, but also allows an influx of calcium ions (Ca2+). Ca2+ influx renders the cell morphology changeable and thus increases plasticity. If connection plasticity is elevated, connection strength can either be increased (potentiation) (Bliss & Lømo 1973) or decreased (depotentiation) (Ito 1989) as a consequence of activity; which of the two occurs depends on the time interval between the synaptic activity and the Ca2+ spike. Modulating the activity of NMDA receptors, and consequently Ca2+ influx, thus changes the plasticity of the neural connection.

One means the brain has of influencing NMDA receptor activity is the hormone somatostatin (SST). According to Pittaluga et al. (2000), the hormone can increase NMDA activity by releasing the Mg2+ block of the receptor. Normally, the Mg2+ block is only released through high activity of the neuron and its consequent strong depolarization. Only the removal of the Mg2+ block enables Ca2+ to pass through the membrane into the cell.

SST is distributed via interneurons, which are neurons that connect different neural circuits

with each other without taking a primary function within either of them. To make learn-

ing without forgetting possible it is important that SST-related plasticity changes can be

selective for certain branches of a neuron. This is indeed the case: Changes do not occur

over all connections (i.e. branches) the neuron maintains, but only in branches that are

relevant for the task that the cognitive agent is engaged in (Cichon & Gan 2015). When

SST-release is disrupted (e.g. in SST-interneuron deleted mice), SST is not longer influenc-

ing NMDA-receptor activity and branch-specific plasticity of spine morphology is lost. As

a result the same branches show similar synaptic changes during learning of different tasks,

22

Page 23: Continual learning in humans and neuroscience-inspired AI

Figure 4: (a) The NMDA-receptor's Mg2+-block prevents calcium ions (Ca2+) from flowing into the neuron. In this case only sodium ions (Na+) will enter the neuron via the AMPA-receptor on the right side of the dendritic spine, which may cause the cell to fire, but won't lead to a strengthening of the synaptic connection. (b) If the Mg2+-block is removed via a high level of depolarization of the neuron (high rate of activity) or somatostatin (SST) in the synaptic cleft, Ca2+ will enter the cell, which results in changes of the cell metabolism. (c) The changes in cell metabolism due to Ca2+-influx result in additional AMPA-receptors being integrated into the neuron's membrane. A higher density of AMPA-receptors increases the rate of Na+-ion influx upon stimulation, which makes the neuron more likely to fire.

causing subsequent tasks to 'erase' memories of preceding tasks: the new tasks alter the synapse strengths learned for the preceding tasks, which corresponds to what happens in ANNs when plasticity is globally high. Cichon & Gan (2015) also provide evidence for this on the behavioural level: the previously mentioned mice lacking SST-interneurons do indeed exhibit catastrophic forgetting when learning two different tasks sequentially. This gives us causal evidence that selective changes in neuroplasticity, when they are branch-specific, can serve as a countermeasure to catastrophic forgetting.

A single SST-interneuron directly targets a single other neuron, which we refer to as homosynaptic interaction. Not all neural interaction is homosynaptic: there are other substances, referred to as neuromodulators, that act in a heterosynaptic fashion. Generally speaking, heterosynaptic neuromodulation means that a neurotransmitter released by a neuron does not only affect a single target neuron, but a whole population of neurons in close proximity. This happens when the neuromodulator is not only released into the targeted synaptic cleft, but also 'spilled over' into the extracellular space (ECS), where it can diffuse and reach other, previously uninvolved neurons nearby. In addition to spillover, some neuromodulators are directly released into the ECS, for example classical neurotransmitters like dopamine (Descarries et al. 1996) and serotonin (De-Miguel & Trueta


2005). Additionally, there are highly diffusible gaseous substances like nitric oxide (NO), carbon monoxide (CO) and hydrogen sulfide (H2S) (Wang 2002), which, owing to their high diffusibility, have a greater area of effect. By diffusing through the ECS, neuromodulators are able to alter the plasticity of adjacent neurons as well, making them more prone to change their connection strength (or, vice versa, making them more stable). This localized change of plasticity might be a means to avoid catastrophic forgetting by rendering the currently relevant parts of the neural network changeable while keeping the rest of the network stable, therewith antagonizing interference with old information. While the branch-specific, SST-induced changes to neuroplasticity act on a rather fine-grained level, neuromodulators are able to render larger portions of cortex more plastic; both, however, are able to influence memory interference.

3.2.3. Neurogenesis within the hippocampus

A third mechanism in the human nervous system that might be able to prevent catastrophic forgetting was already mentioned in our section about CLS and is located inside the hippocampus. As depicted before, an episodic memory unit that stores experiences temporarily in order to replay them and therewith train a slow statistical learner like the neocortex will face the problem of catastrophic forgetting itself. Interestingly, there is another mechanism within the episodic memory buffer of the hippocampus to circumvent this problem. Prominently, the DG of the hippocampus is one of two cortical areas capable of adult neurogenesis (Altman & Das 1965, Gould & Gross 2002, Kempermann et al. 2004). Neurogenesis describes the constant production of nerve cells from neural stem cells. The newly generated neurons in the DG differ from older cells in that they exhibit a greater degree of synaptic plasticity (Schmidt-Hieber et al. 2004), greater ease in forming new connections to other neurons (Gould & Gross 2002) and greater mortality (apoptosis) (Eriksson et al. 1998). These properties draw the picture of a nerve cell that can easily be integrated into an existing neural circuit, but may also be easily eliminated when it does not prove useful.

The extent of neurogenesis and cell survival is decreased by age (Altman & Das 1965) and aversive, stressful experiences (Gould & Tanapat 1999), and increased by diet (Lee et al. 2000), physical activity (van Praag et al. 1999) and enriched environments (Kempermann et al. 1998). But how do new neurons help to tackle catastrophic forgetting? French (1991)


suggested that, within a large network, sparsity of representations in the hidden layers is a means to reduce CF, since representations will be localized and therefore not interfere with each other. This strategy, however, has the effect of reducing generalization, since solutions will simply be stored in parallel and there is no need to generalize. Wiskott et al. (2006) complement this idea by suggesting that newly generated neurons open up the opportunity to learn new feature-representations while, through the reduction of plasticity in old neurons, preserving the ability to remember older feature-representations. In early life, people encounter many new environments with many new features, so the DG needs a greater capacity to adapt. In later life, on the other hand, new environments mostly consist of recombinations of known features, which reduces the need to create new feature-representations. This is an intuitive account for the reduction of neurogenesis over the individual's lifetime. The same goes for enriched environments: in a complex environment, the capability to adapt and learn is more important than it is in an impoverished one, and higher rates of neurogenesis make this possible.
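As a toy sketch of Wiskott et al.'s hypothesis (our own simplification, not their model), one can imagine a layer that grows new, highly plastic units while existing units become progressively more stable:

```python
import numpy as np

class GrowingLayer:
    """Neurogenesis-inspired layer: new units start out plastic,
    existing units are stabilized whenever a unit is added."""

    def __init__(self, n_in, n_out, seed=0):
        self.rng = np.random.default_rng(seed)
        self.weights = self.rng.normal(size=(n_in, n_out))
        self.plasticity = np.ones(n_out)  # per-unit plasticity factors

    def add_unit(self, stability_factor=0.5):
        new_column = self.rng.normal(size=(self.weights.shape[0], 1))
        self.weights = np.hstack([self.weights, new_column])
        self.plasticity *= stability_factor              # old units stabilize
        self.plasticity = np.append(self.plasticity, 1.0)  # newborn is plastic

    def update(self, gradients, lr=0.01):
        # each unit's (column's) update is scaled by its own plasticity
        self.weights -= lr * self.plasticity * gradients
```

New units can thus encode new feature-representations, while the stabilized older units retain the old ones.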

4. INTEGRATION OF NEUROSCIENTIFIC INSIGHT INTO MACHINE LEARNING

Having illustrated different approaches from the different disciplines of the brain and cognitive sciences, we now take a look at how far these ideas are already implemented in contemporary AI systems.

4.1. Using complementary learning systems: from the DQN-model to deep generative replay

When it comes to tackling the problem of catastrophic forgetting in connectionist networks, the aforementioned CLS-theory has probably received the most attention in recent years. In their very influential approach to training a neural network architecture to control Atari 2600 games, Mnih et al. (2013, 2015) introduced a memory replay unit in the deep Q-network (DQN). In their attempt to make a DNN architecture learn from less data, Mnih et al. introduced a separate memory unit, saving all prior experiences and replaying them randomly up to eight times after the initial training. Even though this idea bears a lot of what we think might help overcome catastrophic forgetting, and corresponds functionally somewhat to the episodic memory buffer in CA3 of the hippocampus that we are also


referring to, it exhibits some conceptual flaws that limit its capacities as a means against catastrophic forgetting (to be fair, overcoming catastrophic forgetting was not Mnih et al.'s intention here). As we explained in part 3.2.1, an episodic memory buffer that saves all recent experiences in a one-to-one manner, like in DQNs, is not feasible. An AI system comprising real general intelligence needs to learn continually over a long time period. An episodic memory buffer like in the DQN would become exorbitantly large over time, and replaying the randomly, uniformly (every memory is equally likely) sampled memories becomes a computationally heavy task. To lower the number of replayed memories, Schaul et al. (2016) suggested prioritizing memories that are likely to yield a high reward over other memories that might not be as important for success. By doing so, their model outperforms a uniformly sampling system given the same amount of training. This selective, reward-guided replay is biologically plausible (see Atherton et al. 2015; Hattori 2014). Even though this constraining of memory sampling is already a step in the right direction, reducing the number of memories that have to be replayed, over an agent's lifetime too many memories would still need to be stored.
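For illustration, the difference between uniform and prioritized sampling can be sketched as follows (a simplification of our own, not the original DQN code; note that Schaul et al. in fact derive priorities from temporal-difference errors):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity experience buffer; the oldest memories are
    dropped automatically once capacity is reached."""

    def __init__(self, capacity=10000):
        self.memories = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def store(self, transition, priority=1.0):
        # transition = (state, action, reward, next_state, done)
        self.memories.append(transition)
        self.priorities.append(priority)

    def sample_uniform(self, batch_size):
        # every memory is equally likely (DQN-style replay)
        return random.sample(list(self.memories), batch_size)

    def sample_prioritized(self, batch_size, alpha=0.6):
        # probability proportional to priority**alpha (Schaul et al. 2016)
        p = np.array(self.priorities) ** alpha
        p = p / p.sum()
        idx = np.random.choice(len(self.memories), size=batch_size, p=p)
        return [self.memories[i] for i in idx]
```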

In part 3.2.1 of this paper we presented Robins' (1995) idea of pseudo-pattern replay. In pseudo-pattern replay there is no need to save actual memories in a one-to-one fashion; instead, the patterns with which newly acquired memories are interleaved are generated from the prior weight distribution. Mocanu et al. (2016) pick up on this idea and describe the Online Contrastive Divergence with Generative Replay (OCD-GR) model, which uses generative Restricted Boltzmann Machines (gRBMs) to store past experiences. Saving past experiences in gRBMs makes it obsolete to save them explicitly (in a one-to-one fashion). Using this idea, the OCD-GR model outperforms regular experience replay models and adds more biological plausibility to the approach by substantially reducing the memory requirements.

Generative replay is applicable to all common types of machine learning (reinforcement, supervised and unsupervised learning). However, Mocanu et al. do not evaluate their model on its capability to cope with catastrophic forgetting. Just recently, Shin et al. (2017) put a generative replay model to the test to see how far it may help overcome catastrophic forgetting on the MNIST-dataset (LeCun et al. 2010). Their results imply that generative replay is competitive with other contemporary countermeasures (e.g. elastic weight consolidation [EWC], Kirkpatrick et al. 2017; learning without forgetting [LwF], Li & Hoiem 2017).

Additionally, they state that their approach is superior to weight-constraining approaches like EWC and LwF, since there is no trade-off between the performances on old and new tasks. We will explain weight-constraining approaches in more detail in part 4.2.
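A minimal sketch of the core generative-replay loop looks as follows (our simplification; the generator and solver interfaces with `sample`, `predict` and `train_batch` methods are hypothetical stand-ins for whatever generative model and task network are used):

```python
import copy

def train_with_generative_replay(solver, generator, new_task_batches,
                                 replay_ratio=0.5):
    """One pass of generative replay on a new task (sketch only)."""
    # Frozen copies of the previous networks supply the 'memories';
    # no raw experiences need to be stored.
    old_generator = copy.deepcopy(generator)
    old_solver = copy.deepcopy(solver)

    for inputs, targets in new_task_batches:   # inputs/targets as lists
        n_replay = int(len(inputs) * replay_ratio)
        replay_inputs = old_generator.sample(n_replay)   # pseudo-patterns
        replay_targets = old_solver.predict(replay_inputs)

        # Interleave generated 'old' experiences with the new ones.
        solver.train_batch(inputs + replay_inputs,
                           targets + replay_targets)
        generator.train_batch(inputs + replay_inputs)
```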

4.2. Constraining weight plasticity within the network

Another approach to tackling catastrophic forgetting in connectionist networks is the selective constraining of weights. As explained in section 2, catastrophic forgetting in neural networks is caused by the plasticity of connections needed for a first task A remaining high during the training of a subsequent second task B. When plasticity is high, the information for task A will be forgotten, since the weights holding this information will adapt to task B. On the other hand, when the plasticity of the weights is constrained globally, the network will lose its ability to learn (see the plasticity-stability dilemma; Carpenter & Grossberg 1987, cited in Gerstner & Kistler 2002). Current approaches to preventing catastrophic forgetting therefore try to selectively constrain weights in such a way that weights necessary for task A are protected during learning of task B and vice versa. There are several models implementing this, based on the previously mentioned neurobiological insights.

The first of these implementations is elastic weight consolidation (EWC) (Kirkpatrick et al. 2017). EWC is inspired by ideas on the molecular neurobiological level: SST-expressing interneurons are able to selectively constrain plasticity on certain branches of a cortical neuron, while leaving it intact for other branches of the very same neuron (Yang et al. 2009). The branch-wise constraints are functionally determined: when a branch is necessary for task A, plasticity will be unconstrained during learning of task A and constrained during task B, and vice versa. Kirkpatrick et al. (2017) take this idea of selectively constrained plasticity and apply it to DNNs. However, while taking biology as an inspiration, they do not try to model the underlying mechanics, but instead use a Bayesian approximation to determine the importance of single connections (i.e. weights) for the current task. When the network is trained on the following task, the algorithm determines, based on this Bayesian approximation, how important the different weights were for the previously solved tasks and puts constraints on them, so that they remain relatively unchanged during backpropagation and the subsequent weight updates. The trajectory through weight space changes accordingly (see figure 5).
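Concretely, Kirkpatrick et al. (2017) estimate each weight's importance by the diagonal of the Fisher information matrix F after training on task A, and then minimize on task B the penalized loss

$$\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2}\, F_i \left(\theta_i - \theta^{*}_{A,i}\right)^2 ,$$

where \(\theta^{*}_{A}\) are the weights found for task A and \(\lambda\) sets how important the old task is relative to the new one. Weights with large \(F_i\) are effectively anchored, while unimportant weights remain free to adapt.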


Figure 5: Similar to figure 1, a two-weight system is trained sequentially on two different tasks (A and B). While being trained on task B in an unconstrained manner, the system will migrate towards a solution of task B, neglecting prior knowledge of task A (see bottom trajectory). If the weights are constrained by elastic weight consolidation, the system will constantly be 'pulled back' towards the solution of task A while migrating through weight space. Ultimately, it will converge on a weight combination that solves both tasks satisfactorily (if such a solution exists).

Velez & Clune (2017) take their inspiration from the way neuromodulation in the human cortex is thought to work. Part 3.2.2 explained how neuromodulators spread locally within the human cortex and affect the plasticity of the neural connections in their range. Velez & Clune (2017) translate this by locally modulating the plasticity of connection weights through the spread of an 'artificial neuromodulator' within their ANNs. This artificial neuromodulator selectively increases the plasticity of the weights around the diffusion node. For different tasks, different diffusion nodes are activated during training, and the network therewith creates local functional clusters while the rest of the network remains relatively unaffected during

training. Their model has only been validated on a very primitive, small network, and its capabilities on large-scale, state-of-the-art architectures still remain to be verified.
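The gist of the diffusion idea can be sketched as follows (our own simplified illustration, not Velez & Clune's implementation; the unit positions and the Gaussian falloff are modelling choices of this sketch):

```python
import numpy as np

def diffusion_mask(unit_positions, node_position, sigma=1.0):
    """Per-unit plasticity multiplier: Gaussian falloff with distance
    from the diffusion node that is active for the current task."""
    d = np.linalg.norm(unit_positions - node_position, axis=-1)
    return np.exp(-d**2 / (2.0 * sigma**2))

# Five hidden units placed on a line; task 1 activates a node at x=0.
positions = np.arange(5.0).reshape(-1, 1)
mask = diffusion_mask(positions, node_position=np.array([0.0]))

# During training on task 1, each unit's weight update is scaled by the
# mask: units far from the node barely change, protecting other clusters.
gradients = np.random.default_rng(0).normal(size=(5, 3))
updates = 0.1 * mask[:, None] * gradients
```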

The main difference between these two weight-constraining approaches is that, on the one hand, the diffusion-based implementation of Velez and Clune has greater biological plausibility than Kirkpatrick et al.'s EWC. On the other hand, EWC directly targets the functionality of certain weights, while the diffusion-based approach lets functionality emerge within predetermined local clusters around the diffusion nodes. In this regard EWC might be superior to the diffusion implementation, since the functional diffusion clusters are not as flexible in their scope as the EWC constraints are. The scope of the EWC weight constraints is determined only by the set of weights needed for the task and is not handcrafted like in the diffusion-based model. This makes EWC the more elegant and flexible solution.

In both weight-constraining approaches, the functional structure of the network is relatively fixed. When a weight combination that maximizes the performance on one task is found, it is protected against change. This makes the network overall less flexible and limits its capacity to a certain number of tasks. It is also possible that the network does not find generalized solutions, since the approach minimizes overlap between the representations of different tasks, not allowing the more parsimonious, flexible solution that might be found when both tasks are optimized in parallel.

5. DISCUSSION

In this literature review we took a closer look at current developments in AI research. We found the field prospering, especially in recent years: in several important machine-learning applications, performance benchmarks have been shifted to reach human-level performance or even go beyond it. However, recent performance achievements were often built on large amounts of data and computational power, and the systems often lack the ability to generalize the acquired skills to other related or slightly changed tasks. We stated that this ability, however, is central to acquiring true intelligence. One obstacle on the way to more generalizing, intelligent systems is catastrophic forgetting in connectionist networks. Years of approaching the problem with mathematical insight alone did not resolve the issue satisfyingly. As a consequence, researchers turned to neuroscience to draw inspiration from human cognitive agents, who do not suffer from catastrophic forgetting. We introduced Marr's (1982) levels of analysis to help us better understand the neuroscientific research we encounter. Here we emphasized that, just like in

neuroscientific research, where complete theories are always informed by all levels of analysis, an algorithm for a truly intelligent, neuroscience-inspired AI system should always be informed by all levels of analysis as well. Hereafter, we brought up complementary learning systems theory (CLS) and constrained neuroplasticity, two of the main ideas in contemporary neuroscience about how catastrophic forgetting is avoided in humans. In addition, we briefly explained how neurogenesis in the hippocampus might be able to prevent catastrophic forgetting as well. Finally, we surveyed recent implementations of these ideas in AI and the advantages and shortcomings of the different approaches. In doing so, we presented different examples of realizations of CLS, with an emphasis on the most promising and recent 'deep generative replay' approach (Shin et al. 2017), which utilizes two complementary learning systems, just like humans do, to interleave current and past experiences. We further depicted two realizations of the molecular neuroscientific idea of selectively constrained neuroplasticity: the elastic weight consolidation (EWC) of Kirkpatrick et al. (2017) and the diffusion-based approach of Velez & Clune (2017).

Both of the main approaches using constrained neuroplasticity have their shortcomings. EWC is only able to change plasticity in one direction: from plastic to stable. As soon as the network's capacity is reached and the system is saturated, no new information can be learned, and a blackout catastrophe, a phenomenon known from saturated Hopfield networks (Amit 1989), may occur. A blackout catastrophe renders the information in the network unretrievable. To learn continually, the cognitive agent has to be able to selectively forget information in order to prevent a blackout.
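For intuition, the classical capacity result for Hopfield networks illustrates how abrupt this saturation is: with N units storing random patterns, reliable retrieval collapses once the number of stored patterns p exceeds a critical load (a standard result, see Amit 1989),

$$p_{\max} \approx 0.138\, N ,$$

beyond which not only do new patterns fail to be stored, but previously stored patterns also become unretrievable.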

In the diffusion-based approach, on the other hand, the number and size of the functional clusters are handcrafted into the system. Handcrafting limits the range of applicability of the system. To obtain a model that can function without these top-down decisions, the model will have to learn the scale of the clusters from data. The question remains which mechanism might be able to supervise the assignment and the size of diffusion nodes within an ANN. One possible candidate for this might be the human basal ganglia (Alexander & Crutcher 1990).

The basal ganglia (especially the ventral striatum) are central to human reward prediction and processing (e.g. Schultz et al. 1992). The basal ganglia seem to maintain neuromodulatory projections to the cortex (Alcaro et al. 2007, Graybiel 1990), and, as stated in part 3.2.2, dopamine is able to serve as a potent neuromodulator (Descarries et al. 1996), even though dopaminergic neurons are relatively rare and mainly located in the basal ganglia (Bjoerklund & Dunnett 2007). This might pose a natural connection to currently popular reinforcement learning algorithms in AI (like the previously mentioned DQN, Mnih et al. 2013; for an introduction to reinforcement learning see Sutton & Barto 1998).

As converging evidence in neuroscience shows, catastrophic forgetting in humans is overcome not by a single mechanism, but by multiple mechanisms on different levels. While there is direct causal evidence in animal models for the importance of branch-specific constraints on neuroplasticity for the avoidance of catastrophic forgetting (Cichon & Gan 2015), there is also broad evidence for the relevance of complementary learning systems in human learning and for the interleaved fashion in which the hippocampus trains the slowly learning neocortex (Kumaran et al. 2016). To overcome catastrophic forgetting in a way equivalent to humans, developers of connectionist AI-systems might need to integrate the separately developed frameworks into an all-embracing solution. This solution might not be straightforward, but the prospect of being able to understand and prevent catastrophic forgetting in a more complete fashion might be worth the while. Our outlook sees importance in the connection of reinforcement processing and selective changes in neuroplasticity, which might be facilitated through flexibly acting neuromodulatory nodes. An additional episodic memory replay unit that creates pseudo-patterns to replay recent experiences in an interleaved manner can not only help to consolidate recent memories within the neocortex (as intended by DQNs), but also bind the network to earlier learned tasks and prevent catastrophic forgetting. A next step might be to create a memory replay unit that more closely resembles the inner dynamics of the human hippocampus. A potential candidate for matching an artificial memory replay unit more closely to the human hippocampus would be the REMERGE-model (Kumaran & McClelland 2012). REMERGE models hippocampal encoding, memory orthogonalization and retrieval in a down-scaled fashion. When the episodic memory replay unit becomes more complex by more closely resembling its biological archetype (the hippocampus), it might be necessary to also mimic the internal hippocampal process of adult neurogenesis to protect the new module from forgetting catastrophically as well.

Even though the envisioned integration of different neuroscience-inspired solutions to catastrophic forgetting poses a big challenge, the prospects it has to offer are compelling enough to shoulder the effort. Deeper analysis of the ideas collected here is necessary to find reasonable and effective ways for a fusion. The final result might be an algorithm grasping the complexity of human neural processing on all levels of analysis in a complete fashion.

With such an algorithm available, and with the connected possibility of creating sequentially learning AI-systems, a big stepping-stone towards the development of true intelligence would be taken. Sequentially learning models make it easier to train models in which compositionality, explained in the introduction, emerges, since they may enable connectionist networks to recycle previously used knowledge (Kirkpatrick et al. 2017).

Alcaro, A., Huber, R., & Panksepp, J. (2007). Behavioral Functions of the Mesolimbic Dopamin-

ergic System: an Affective Neuroethological Perspective. Brain Res Rev, 56(2), 283–321.

Alexander, G. E. & Crutcher, M. D. (1990). Functional architecture of basal ganglia circuits:

neural substrates of parallel processing. Trends in Neurosciences, 13(7), 266–271.

Altman, J. & Das, G. D. (1965). Autoradiographic and histological evidence of postnatal hip-

pocampal neurogenesis in rats. The Journal of comparative neurology, 124(3), 319–35.

Amit, D. (1989). Modeling brain function: The world of attractor neural networks.

Atherton, L. A., Dupret, D., & Mellor, J. R. (2015). Memory trace replay: The shaping of memory

consolidation by neuromodulation.

Baddeley, A. D. & Hitch, G. (1974). Working Memory. Psychology of Learning and Motivation, 8,

47–89.

Barnes, J. & Underwood, B. (1959). "Fate" of first-list associations in transfer theory. Journal of Experimental Psychology, 58(2), 97–105.

Bayley, P. J. & Squire, L. R. (2002). Medial temporal lobe amnesia: Gradual acquisition of factual

information by nondeclarative memory. The Journal of neuroscience : the official journal of

the Society for Neuroscience, 22(13), 5741–8.

Bjoerklund, A. & Dunnett, S. B. (2007). Dopamine neuron systems in the brain: an update. Trends

in Neurosciences, 30(5), 194–202.

Bliss, T. V. P. & Lømo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. The Journal of Physiology, 232(2), 331–356.

Brousse, O. & Smolensky, P. (1989). Virtual memories and massive generalization in connectionist combinatorial learning. Technical Report CU-CS-431-89.

Cabestany, J., Prieto, A., & Sandoval, F. (2005). LNCS 3512 - Computational Intelligence and

Bioinspired Systems. 8th International Work-Conference on Artificial Neural Networks.

Carpenter, G. A. & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54–115.

Cichon, J. & Gan, W.-b. (2015). Branch-specific dendritic Ca2+ spikes cause persistent synaptic plasticity. Nature, 520(7546), 180–185.

De-Miguel, F. F. & Trueta, C. (2005). Synaptic and extrasynaptic secretion of serotonin.

Descarries, L., Watkins, K. C., Garcia, S., Bosler, O., & Doucet, G. (1996). Dual character,

asynaptic and synaptic, of the dopamine innervation in adult rat neostriatum: A quantitative

autoradiographic and immunocytochemical analysis. Journal of Comparative Neurology.


Eriksson, P. S., Perfilieva, E., Bjork-Eriksson, T., Alborn, A.-M., Nordborg, C., Peterson, D. A., &

Gage, F. H. (1998). Neurogenesis in the adult human hippocampus. Nature Medicine, 4(11),

1313–1317.

French, R. (1992). Semi-distributed representations and catastrophic forgetting in connectionist

networks.

French, R. (1994). Dynamically constraining connectionist networks to produce distributed, or-

thogonal representations to reduce catastrophic interference - Semantic Scholar. Proceedings

of the 16th Annual Cognitive Science Society Conference.

French, R. M. (1991). Catastrophic Forgetting in Connectionist Networks. In Encyclopedia of

Cognitive Science. Chichester: John Wiley & Sons, Ltd.

French, R. M. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4), 128–135.

Gerstner, W. & Kistler, W. M. (2002). Spiking neuron models : single neurons, populations,

plasticity. Cambridge University Press.

Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks.

Gould, E. & Gross, C. G. (2002). Neurogenesis in adult mammals: some progress and problems.

The Journal of neuroscience : the official journal of the Society for Neuroscience, 22(3), 619–

23.

Gould, E. & Tanapat, P. (1999). Stress and hippocampal neurogenesis. In Biological Psychiatry.

Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural

Networks.

Graybiel, A. M. (1990). Neurotransmitters and neuromodulators in the basal ganglia. Trends in

neurosciences, 13(7), 244–54.

Harlow, H. F. (1949). The formation of learning sets. Psychological Review, 56(1), 51–65.

Hattori, M. (2014). A biologically inspired dual-network memory model for reduction of catas-

trophic forgetting. Neurocomputing.

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V.,

Nguyen, P., Sainath, T., & Kingsbury, B. (2012). Deep Neural Networks for Acoustic Modeling

in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing

Magazine, 29(6), 82–97.

Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8),

1735–80.

Ito, M. (1989). Long-term depression. Annual Review of Neuroscience, 12, 85–102.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition.

Kempermann, G., Jessberger, S., Steiner, B., & Kronenberg, G. (2004). Milestones of neuronal

development in the adult hippocampus. Trends in Neurosciences, 27(8), 447–452.

Kempermann, G., Kuhn, H. G., & Gage, F. H. (1998). Experience-induced neurogenesis in the

senescent dentate gyrus. The Journal of neuroscience : the official journal of the Society for

Neuroscience, 18(9), 3206–12.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., & Rusu, A. A. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 3521–3526.

Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their

Consequences. 1st edition edition.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems, 1, 1097–1105.

Kumar, S. (2012). Fundamental Limits to Moore’s Law.

Kumaran, D., Hassabis, D., & McClelland, J. L. (2016). What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7), 512–534.

Kumaran, D. & McClelland, J. L. (2012). Generalization through the recurrent interaction of

episodic memories: A model of the hippocampal system. Psychological Review, 119(3), 573–

616.

Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through

probabilistic program induction. Science (New York, N.Y.), 350(6266), 1332–8.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2016). Building Machines That

Learn and Think Like People.

LeCun, Y. & Bengio, Y. (1995). Convolutional networks for images, speech, and time-series.

LeCun, Y., Cortes, C., & Burges, C. J. C. (2010). MNIST handwritten digit database.

Legg, S. & Hutter, M. (2007). Universal Intelligence: A Definition of Machine Intelligence.

Li, Z. & Hoiem, D. (2017). Learning without Forgetting.

Lodato, S. & Arlotta, P. (2015). Generating Neuronal Diversity in the Mammalian Cerebral Cortex.

Annual Review of Cell and Developmental Biology.

Marblestone, A., Wayne, G., & Kording, K. (2016). Towards an integration of deep learning and

neuroscience.

Marr, D. (1970). A Theory for Cerebral Neocortex. Proceedings of the Royal Society B: Biological

Sciences.

Marr, D. (1971). Simple Memory: A Theory for Archicortex. Philosophical Transactions of the

Royal Society B: Biological Sciences.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. MIT Press.

McClelland, J. L., Botvinick, M. M., Noelle, D. C., Plaut, D. C., Rogers, T. T., Seidenberg,

M. S., & Smith, L. B. (2010). Letting structure emerge: Connectionist and dynamical systems

approaches to cognition. Trends in Cognitive Sciences.

McClelland, J. L., McNaughton, B. L., & O’Reilly, R. C. (1995). Why there are complementary

learning systems in the hippocampus and neocortex: Insights from the successes and failures

of connectionist models of learning and memory. Psychological Review.

McCloskey, M. & Cohen, N. J. (1989). Catastrophic Interference in Connectionist Networks: The

Sequential Learning Problem. Psychology of Learning and Motivation, 24, 109–165.

McRae, K. & Hetherington, P. (1993). Catastrophic Interference is Eliminated in Pretrained

Networks. Proceedings of the 15th Annual Conference of the Cognitive Science Society, (pp.

723–728).

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M.

(2013). Playing Atari with Deep Reinforcement Learning.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Ried-

miller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou,

I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control

through deep reinforcement learning. Nature, 518(7540).

Mocanu, D. C., Vega, M. T., Eaton, E., Stone, P., & Liotta, A. (2016). Online contrastive divergence with generative replay: Experience replay without storing data.

Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics, 38(8).

Moser, E. I., Kropff, E., & Moser, M.-B. (2008). Place Cells, Grid Cells, and the Brain’s Spatial

Representation System. Annual Review of Neuroscience, 31(1), 69–89.

Panzeri, S., Petersen, R. S., Schultz, S. R., Lebedev, M., & Diamond, M. E. (2001). The role of

spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 29(3),

769–77.

Pittaluga, A., Bonfanti, A., & Raiteri, M. (2000). Somatostatin potentiates NMDA receptor function via activation of InsP3 receptors and PKC leading to removal of the Mg2+ block without depolarization. British Journal of Pharmacology, 130, 557–566.

Race, E., Keane, M. M., & Verfaellie, M. (2013). Living in the moment: patients with MTL

amnesia can richly describe the present despite deficits in past and future thought. Cortex; a

journal devoted to the study of the nervous system and behavior, 49(6), 1764–6.

Robins, A. (1995). Catastrophic Forgetting, Rehearsal and Pseudorehearsal. Connection Science.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-

propagating errors. Nature, 323(6088), 533–536.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,

Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual

Recognition Challenge. International Journal of Computer Vision.

Schapiro, A. C., Turk-Browne, N. B., Norman, K. A., & Botvinick, M. M. (2016). Statistical

learning of temporal community structure in the hippocampus. Hippocampus, 26(1), 3–8.

Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay.

Schmidt-Hieber, C., Jonas, P., & Bischofberger, J. (2004). Enhanced synaptic plasticity in newly

generated granule cells of the adult hippocampus. Nature.

Schultz, W., Apicella, P., Scarnati, E., & Ljungberg, T. (1992). Neuronal activity in monkey

ventral striatum related to the expectation of reward. The Journal of neuroscience : the

official journal of the Society for Neuroscience, 12(12), 4595–610.

Serre, T., Oliva, A., & Poggio, T. (2007). A feedforward architecture accounts for rapid catego-

rization. Proceedings of the National Academy of Sciences of the United States of America,

104(15), 6424–9.

Sharkey, N. E. & Sharkey, A. J. C. (1995). Backpropagation Discrimination Geometric Analysis

Interference Memory Modelling Neural Nets. Connection Science, 7(3-4), 301–330.

Shin, H., Lee, J. K., Kim, J., & Kim, J. (2017). Continual Learning with Deep Generative Replay.

Spearman, C. (1904). "General Intelligence," Objectively Determined and Measured.

The American Journal of Psychology, 15(2), 201.

Squire, L. R., Stark, C. E., & Clark, R. E. (2004). The medial temporal lobe. Annual Review of Neuroscience.

Squire, L. R. & Zola-Morgan, S. (1991). The medial temporal lobe memory system. Science, 253(5026), 1380–1386.

statista.com (2018a). Statistics on smartphone sales.

statista.com (2018b). Statistics on smartphone usage.

Stickgold, R. (2005). Sleep-dependent memory consolidation.

Sutton, R. S. & Barto, A. G. (1998). Reinforcement learning : an introduction. MIT Press.

Tulving, E. (1985). Memory and consciousness. Canadian Psychology/Psychologie canadienne,

26(1), 1–12.


van Praag, H., Kempermann, G., & Gage, F. H. (1999). Running increases cell proliferation and

neurogenesis in the adult mouse dentate gyrus. Nature Neuroscience, 2(3), 266–270.

Velez, R. & Clune, J. (2017). Diffusion-based neuromodulation can eliminate catastrophic forget-

ting in simple neural networks. PLOS ONE, 12(11), e0187736.

Waldrop, M. M. (2016). The chips are down for Moore's law. Nature, 530(7589), 144–147.

Wang, R. (2002). Two's company, three's a crowd: can H2S be the third endogenous gaseous transmitter? The FASEB Journal, 16(13), 1792–1798.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2015). Dueling

Network Architectures for Deep Reinforcement Learning.

Weston, J., Chopra, S., & Bordes, A. (2014). Memory Networks.

Wiskott, L., Rasch, M. J., & Kempermann, G. (2006). A functional hypothesis for adult hippocam-

pal neurogenesis: Avoidance of catastrophic interference in the dentate gyrus. Hippocampus.

Wynn, T. (1988). Tools and the evolution of human intelligence. New York: Clarendon

Press/Oxford University Press.

Yang, G., Pan, F., & Gan, W.-b. (2009). Stably maintained dendritic spines are associated with

lifelong memories. Nature, 462(7275), 920–924.
