7/25/2019 A Distributed Adaptive Control System for a Quadruped Mobile Robot
1/6
A Distributed Adaptive Control System for a Quadruped Mobile Robot

Bruce L. Digney and M. M. Gupta
Intelligent Systems Research Laboratory, College of Engineering
University of Saskatchewan, Saskatoon, Sask., CANADA S7N 0W0
Email: [email protected]
Abstract— In this research, a method by which reinforcement learning can be combined into a behavior based control system is presented. Behaviors which are impossible or impractical to embed as predetermined responses are learned through self-exploration and self-organization using a temporal difference reinforcement learning technique. This results in what is referred to as a distributed adaptive control system (DACS); in effect, the robot's artificial nervous system. A DACS is developed for a simulated quadruped mobile robot and the locomotion behavior level is isolated and evaluated. At the locomotion level the proper actuator sequences were learned for all possible gaits and eventually graceful gait transitions were also learned. When confronted with an actuator malfunction, all gaits and transitions were adapted, resulting in new limping gaits for the quadruped.
I. INTRODUCTION
Although conventional control and artificial intelligence researchers have made many advances, neither ideology seems capable of realizing autonomous operation. That is, neither can produce machines which can interact with the world with an ease comparable to humans or at least higher animals. In responding to such limitations, many researchers have looked to biologically (physiologically) based systems as the motivation to design artificial systems. Examples are the behavior based systems of Brooks [1] and Beer [2]. Behavior based control systems consist of a hierarchical structure of simple behavior modules. Each module is responsible for the sensory motor responses of a particular level of behavior. The overall effect is that higher level behaviors are recursively built upon lower ones and the resulting system operates in a self-organizing manner. Both Brooks' and Beer's systems were loosely based upon the nervous systems of insects. These artificial insects operated in a hardwired manner and exhibited an interesting repertoire of simple behaviors. By hardwired it
0-7803-0999-5/93/$03.00 © 1993 IEEE

144
is meant that each behavior module had its responses predetermined and was simply programmed externally. Although this approach is successful with simple behaviors, it is obvious that many situations exist where predetermined solutions are impossible or impractical to obtain. It is subsequently proposed that by incorporating learning into the behavior based control system, these difficult behaviors could be acquired through self-exploration and self-learning.
Complex behaviors are usually characterized by a sequence of actions with success or failure only known at the end of that sequence. Also, the critical error signal is only an indication of the success or failure of the system and no information regarding error gradients can be determined, as in the case of continuous valued error feedback. Thus the required learning mechanism must be capable of both reinforcement learning as well as temporal credit assignment. Incremental dynamic programming techniques such as Barto's temporal difference (TD) learning [3] appear to be well suited to such tasks. Based upon Barto's previous adaptive heuristic critic [4], TD employs adaptive state and action evaluation functions to incrementally improve its action policy until successful operation is attained. The incorporation of TD learning into behavior based control results in a framework of adaptive (ABMs) and non-adaptive behavior modules which is referred to here as a distributed adaptive control system (DACS). The remainder of this report will be concerned with a brief description of the DACS and ABMs, and the implementation of the locomotion level ABM within the DACS of a simulated quadruped mobile robot. This level is considered appropriate because the actuator sequences for quadruped locomotion are not intuitively obvious and are difficult to determine. Other levels such as global navigation, task planning and task coordination are implemented and discussed by Digney [5].
II. DISTRIBUTED ADAPTIVE CONTROL SYSTEMS
The DACS shown in Figure 1 is composed of various adaptive and non-adaptive behavior modules. Non-adaptive
Authorized licensed use limited to: Khajeh Nasir Toosi University of Technology. Downloaded on December 21, 2009 at 05:54 from IEEE Xplore. Restrictions apply.
modules are present as inherent knowledge and are used where adaptive solutions are not required. All modules receive sensory inputs and respond with actions in an attempt to perform a command specified by a higher level. The performance of commands in most cases will require a sequence of actions by the lower level system and possibly the cooperation of many lower level systems. The coupling between ABMs is shown in Figure 2. In this configuration, the action from level l+1 becomes the command for level l. Level l+1 also supplies goal based reinforcement, r_g, to drive level l towards successful completion of that command. Level l in turn issues actions to level l−1 and receives environment based reinforcement, r_e, from level l−1. This environment based reinforcement is representative of the difficulty or cost incurred while performing the requested actions and is included to drive level l to a cost effective solution. While operating, level l may enter a state which is in some way damaging or dangerous. To drive the system away from such a state, sensor based reinforcement, r_s, is used. Sensor based reinforcement is supplied from sensors at level l. It is analogous to pain or fear and will ensure that level l operates in a safe manner. These three reinforcements are combined into a total reinforcement signal, r_t, according to Equation 1:

    r_t = r_e + α_g r_g + α_s r_s    (1)

where α_g and α_s are the relative importance weights of the reinforcements.
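The combination rule of Equation 1 can be sketched as a short function. The default weight values below are illustrative assumptions only; the paper does not give numeric settings for α_g and α_s.

```python
# Hedged sketch of Equation 1: combining environment, goal and sensor
# based reinforcement into one scalar. The default weights are
# illustrative assumptions, not values from the paper.
def total_reinforcement(r_e, r_g, r_s, alpha_g=0.5, alpha_s=1.0):
    """r_t = r_e + alpha_g * r_g + alpha_s * r_s (Equation 1)."""
    return r_e + alpha_g * r_g + alpha_s * r_s
```

Because all three reinforcements are penalties in this framework, a larger α_s makes the system more conservative about damaging states relative to goal progress.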
Figure 1: Schematic of DACS
It can be seen from Figure 2 that the flow of environment and sensor based reinforcement is in the upward direction. This will result in lower level skills and behaviors being learned first, then other higher level behaviors, converging in a recursive manner toward the highest level. Figure 1 shows this highest level as existing within a single physical machine. However, in the case of multiple machines operating in a collective, higher abstract behavior levels are possible. Within the context of this paper, only behaviors relevant to individual machines will be discussed. In the absence of higher collective behaviors controlling individual machines, the purpose or task of the machine is embedded within the DACS as an instinct or drive. This instinct is the high level action which results in a feeling of accomplishment or positive reinforcement within the DACS. It is then the responsibility of the adaptive behavior modules within the DACS to learn the skills and behaviors necessary to fulfill this drive. This concept, as well as the self-organizing characteristics that result from such interactions, is further discussed by Digney [5].

Figure 2: Hierarchy of Three ABMs
The ABM is the primary adaptive building block for the DACS. Within it exist computational mechanisms for state classification, learning and the combination of reinforcement signals. Figure 3 shows a schematic of an ABM complete with incoming command, sensory and reinforcement signals. For clarity the outgoing reinforcement signals have been omitted. For any particular level, say l, the ABM observes the relevant system states through appropriate sensors. For a perception system consisting of N sensors, the state S_l is defined as

    S_l = { s_1, s_2, ..., s_N }    (2)

where s_n is the individual sensor reading, 0 < n ≤ N.
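The state formation of Equation 2, together with a stand-in for the classification step discussed below, might be sketched as follows. The nearest-prototype scheme and its vigilance threshold are assumptions standing in for the idealized ART2-A-like network; they are not the paper's actual mechanism.

```python
# Hedged sketch: form the state vector of Equation 2 and classify it
# with a simple nearest-prototype scheme. This stands in for the
# idealized unsupervised neural classifier; the vigilance threshold
# is an illustrative assumption.
def classify(sensors, prototypes, vigilance=1.0):
    """Return the index of the matching prototype, adding a new one if none is close."""
    state = tuple(sensors)                      # S_l = (s_1, ..., s_N)
    for i, p in enumerate(prototypes):
        dist = sum(abs(a - b) for a, b in zip(state, p))
        if dist <= vigilance:                   # close enough: same state class
            return i
    prototypes.append(state)                    # novel state: allocate a new class
    return len(prototypes) - 1
```

The important property for the ABM is only that similar sensor vectors map to the same discrete state index, so the evaluation functions can be stored per classified state.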
Figure 3: Single ABM
State transitions are detected and the resulting states are classified using an idealized neural classification
scheme. This classification embodies the macroscopic operating principles of unsupervised neural networks such as ART2-A [6] and will be assumed adequate in the context of these simulations. The Temporal Difference (TD) algorithm as developed by Barto [3] learns by adjusting state and action evaluation functions, then uses these evaluations to choose an optimum action policy. It can be shown that these two evaluation functions can be combined into a single action dependent evaluation function, say Q_{s,u}, similar to that described by Barto [7]. Given the system at state s, the action taken, u*, is the action which satisfies

    Q_{s,u*} + η_{s,u*} = max_u { Q_{s,u} + η_{s,u} }    (3)

where η is a random valued function.
In Equation 3, Q_{s,u} and η_{s,u} can be thought of as the goal driven and exploration driven components of the action policy, respectively. Taking the action u* results in the transition from state s to state w and the incurring of a total reinforcement signal r_t. The action dependent evaluation function error is obtained by modifying the TD error equation and is

    e = r_t + γ Q_virtual − Q_{s,u*}    (4)

where Q_virtual is the virtual state evaluation value of the next state w and γ is the temporal discount factor.

If the action, u*, does not achieve the desired goal, the virtual state evaluation is

    Q_virtual = max_u { Q_{w,u} }    (5)

It is easily seen that Q_virtual becomes the minimum cost action dependent evaluation function of the new state, w (remember the evaluation functions are negative in sign), and in effect corresponds to the action most likely to be taken when the system leaves state w.

If the action, u*, achieves the desired goal, the virtual state evaluation is

    Q_virtual = 0.    (6)

This provides relative state evaluations and allows for open-ended or cyclic goal states. This is illustrated by considering that for cyclic goals it is the dynamic transitions between states that constitute a goal state and not simply the arrival at a static system state(s).

This error is used to adapt the evaluation functions according to LMS rules as follows:

    Q_{s,u}^{k+1} = Q_{s,u}^{k} + β e    (7)

where β is the rate of adaptation and k is the index of adaptation.

As the evaluation function converges, the goal driven component begins to dominate over the exploration driven component. The resulting action policy will perform the command in a successful and efficient manner. Generally, an ABM will be capable of performing more than a single command. For an ABM capable of C_max commands, the vector of the evaluation functions is defined as

    Q_{s,u} = [ Q_{s,u}^c ],  0 < c < C_max

where Q_{s,u}^c is the evaluation function for the particular command c.

III. DACS FOR A QUADRUPED MOBILE ROBOT

To evaluate the DACS, the simulated quadruped shown in Figure 4 was used. This mobile robot was placed inside a simulated three dimensional landscape where it is left to develop skills and behaviors as it interacts with its environment. This world is made up of ramps, plateaus, cliffs and walls, as well as various substances of interest. In the absence of any predetermined knowledge it is the responsibility of the DACS, and in particular the ABMs, to acquire the skills and behaviors for successful operation.
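The learning cycle of Equations 3 through 7 above can be sketched as a small class. The tabular dictionary storage, the uniform exploration noise standing in for η, and the parameter values are all implementation assumptions not given in the paper; the evaluations remain negative valued, as in the paper's convention.

```python
import random

# Hedged sketch of the ABM learning cycle (Equations 3-7). Storage,
# the exploration term, and parameter values are assumptions.
class ActionEvaluator:
    def __init__(self, actions, beta=0.3, gamma=0.9, noise=0.1):
        self.q = {}              # (state, action) -> evaluation (negative cost)
        self.actions = actions
        self.beta, self.gamma, self.noise = beta, gamma, noise

    def select(self, s):
        # Equation 3: extremize Q plus a random exploration term eta.
        return max(self.actions,
                   key=lambda u: self.q.get((s, u), 0.0)
                   + random.uniform(0.0, self.noise))

    def update(self, s, u, r_t, w, goal_reached):
        # Equations 5 and 6: virtual evaluation of the next state w.
        q_virtual = 0.0 if goal_reached else max(
            self.q.get((w, a), 0.0) for a in self.actions)
        # Equation 4: modified TD error.
        e = r_t + self.gamma * q_virtual - self.q.get((s, u), 0.0)
        # Equation 7: LMS adaptation of the evaluation function.
        self.q[(s, u)] = self.q.get((s, u), 0.0) + self.beta * e
```

As the stored evaluations converge, the noise term matters less and the goal driven component dominates, mirroring the convergence behavior described above.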
Figure 4: Simulated Quadruped
Although not the most efficient method of locomotion, the learning of quadruped walking provides interesting and challenging problems. Involved is the learning of complex actuator sequences in the midst of numerous false goal states and modes of failure. Figure 5 shows the locomotion ABM with the appropriate sensory, reinforcement and motor action connections.
The reinforcement signals are defined as:

Figure 5: Locomotion ABM
The commands, C_locomotion, are issued from the ABM above and are dependent upon the possible sensory states of that module. In this case these sensors are capable of detecting all realizable modes of body motion. The commands for the locomotion level are defined in Equation 10:

    C_locomotion = { 0: forward, 1: left turn, ..., C_max: all possible modes }    (10)
For any specific command the locomotion ABM will issue action responses, u_locomotion, to the actuators driving the legs in the horizontal, h, and vertical, v, directions. Within this action vector are the individual actuator commands to extend, ex, or retract, rt, as shown in Equations 11 and 12:

    u_locomotion = [ C_leg1, C_leg2, C_leg3, C_leg4 ]    (11)

where

    C_leg = { hold | v_ex (extend vertical) | v_rt (retract vertical) | h_ex (extend horizontal) | h_rt (retract horizontal) }    (12)
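The per-leg action alphabet of Equation 12 and the assembly of a four-leg action vector can be sketched as follows. The string names and the helper function are hypothetical; the paper only lists hold, extend and retract actions in each axis.

```python
# Hedged sketch of the locomotion action vector (Equations 11 and 12).
# The action names and helper are hypothetical illustrations.
LEG_ACTIONS = ("hold", "v_ex", "v_rt", "h_ex", "h_rt")

def make_action_vector(per_leg):
    """Validate one per-leg action each and assemble u_locomotion."""
    assert len(per_leg) == 4, "quadruped: exactly one action per leg"
    for a in per_leg:
        assert a in LEG_ACTIONS, "unknown leg action: " + a
    return tuple(per_leg)
```

Even with only five actions per leg, the joint action space has 5^4 = 625 combinations per step, which is what makes learning the gait sequences nontrivial.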
Each leg is equipped with sensors for measuring the forces on each foot and the positions of each leg. The forces on the foot are biased such that −f_max