
DOB-Net: Actively Rejecting Unknown Excessive Time-Varying Disturbances

Tianming Wang1, Wenjie Lu1, Zheng Yan2 and Dikai Liu1

Abstract— This paper presents an observer-integrated Reinforcement Learning (RL) approach, called Disturbance OBserver Network (DOB-Net), for robots operating in environments where disturbances are unknown, time-varying, and may frequently exceed robot control capabilities. The DOB-Net integrates a disturbance dynamics observer network and a controller network. Originating from conventional DOB mechanisms, the observer is built and enhanced via Recurrent Neural Networks (RNNs), encoding the estimation of past values and the prediction of future values of the unknown disturbances in the RNN hidden state. Such encoding allows the controller to generate optimal control signals that actively reject disturbances, under the constraints of the robot's control capabilities. The observer and the controller are jointly learned within policy optimization by advantage actor critic. Numerical simulations on position regulation tasks demonstrate that the proposed DOB-Net significantly outperforms a conventional feedback controller and classical RL algorithms.

I. INTRODUCTION

Autonomous Underwater Vehicles (AUVs) have become vital assets in search and recovery, exploration, surveillance, monitoring, and military applications [1]. For large AUVs in deep water applications, the strength and changes of external wave and current disturbances are negligible, due to the vehicles' considerable size and thrust capabilities. Small AUVs, however, are required for some shallow water applications, such as bridge pile inspection [2], where the disturbances coming from turbulent flows may frequently exceed the AUVs' thrust capabilities. These unknown disturbances inevitably bring adverse effects and may even destabilize the robots [3]. This paper therefore studies an optimal control problem for robots subject to excessive time-varying disturbances, and presents an observer-integrated RL solution. Such a problem also arises in many other applications, e.g., aerial quadrotors for surveillance in wind conditions [4] and manipulators operating with constantly varying loads. In the case of actuator failure, the robot's control capabilities may likewise be lower than the external disturbances.

RL [5] is a trial-and-error method that does not require anexplicitly system model, and can naturally adapt to noises

This work was supported in part by the Australian Research Council Linkage Project (LP150100935), the Roads and Maritime Services of NSW, and the Centre for Autonomous Systems at the University of Technology Sydney.

1 Tianming Wang, Wenjie Lu and Dikai Liu are with the Centre for Autonomous Systems (CAS), Faculty of Engineering and Information Technology (FEIT), University of Technology Sydney (UTS), Ultimo, NSW 2007, Australia. [email protected], {wenjie.lu,dikai.liu}@uts.edu.au

2 Zheng Yan is with the Centre for Artificial Intelligence (CAI), FEIT, UTS, Ultimo, NSW 2007, Australia. [email protected]

Fig. 1. Working flow of the DOB-Net ($\underline{u}$ and $\overline{u}$ are the control limits).

However, the excessive disturbances can no longer be treated simply as noise, since the AUV's state transition is heavily affected by the external disturbances, violating the assumption of a Markov Decision Process (MDP). Considering the time-varying characteristics of the current and wave disturbances, if future disturbances can be predicted, RL may be able to generate optimal controls.

Conventional DOBs [6] and related methods have been researched and applied in various industrial sectors over the last four decades. The main objective of a DOB is to deduce the unknown disturbances from measurable variables, without additional sensors. A control action can then be taken, based on the disturbance estimate, to compensate for the influence of the disturbances; this is called Disturbance-Observer-Based Control (DOBC) [7].

However, conventional DOBC has some limitations for our problem. The first is that a DOB normally needs an accurate system model, which is usually unavailable for underwater vehicles due to complex hydrodynamics. In this case, model uncertainties are lumped with the external disturbances and estimated by the DOB together, so the original properties of some disturbances, such as harmonic ones, are changed. Furthermore, even with a sufficiently accurate estimate of the disturbances at the current timestep, the optimal control solution is still unreachable because the time correlation of the disturbance signals is neglected: disturbances exceeding the control constraints cannot be well rejected through feedback regulation alone. Sun and Guo [8] proposed a neural network approach to construct DOBC. Compared with conventional DOBC methods, the primary merit of their method is that the model uncertainties are approximated using a radial basis function neural network (RBFNN) and not regarded as part of the disturbances.


Then, the external disturbances can be separately estimated by a conventional DOB. However, this work still fails to consider the control constraints, leading to poor performance under excessive disturbances. For optimal overall performance, the AUV behaviour needs to be optimized over a future time horizon using a sequence of disturbance estimates.

This paper proposes a novel RL approach called DOB-Net, which enables integrated learning of disturbance dynamics and an optimal controller, for current and wave disturbance rejection control of AUVs in shallow and turbulent water, as shown in Fig. 1. The DOB-Net consists of a disturbance dynamics observer network and a controller network. The observer network is built and enhanced via RNNs, imitating conventional DOB mechanisms. This network is more flexible than the conventional DOB, since it encodes a prediction of the external disturbances in the RNN hidden state, instead of only estimating their current value. The observer is also more robust to model uncertainties and to the rapidly time-varying characteristics of the external disturbances. Based on the encoded disturbance prediction, the controller network is able to actively reject the unknown disturbances. The observer and the controller are jointly learned within policy optimization by advantage actor critic. This integrated learning may achieve an optimized representation of the observer outputs, compared with traditional hand-designed features. The policy is trained using simulated disturbances consisting of multiple sinusoidal waves, and evaluated using both simulated disturbances and collected disturbances, the latter gathered from an Inertial Measurement Unit (IMU) onboard an AUV in a water tank at the University of Technology Sydney. During training, the amplitude, period and phase of the sine-wave disturbances are randomly sampled in each episode.

The rest of the paper is organized as follows. Related work is presented in Section II. Section III introduces the problem formulation. Section IV provides a detailed description of the DOB-Net. Section V then presents the validation procedures and result analysis. Potential future improvements are discussed in the last section.

II. RELATED WORK

A. Feedback and Predictive Control

In the early development of disturbance rejection control, feedback control strategies were used to suppress unknown disturbances [9]–[11]. Disturbance estimation and attenuation methods that add a feedforward compensation term were then proposed and practiced, such as DOBC [6], [7]. However, these methods often assume that the disturbances are bounded and sufficiently small, and thus fail to guarantee stability under control constraints when strong disturbances are encountered [12].

To this end, Model Predictive Control (MPC) [13] is often applied due to its constraint handling capacity, optimizing plant behaviour over a certain time horizon [12]. MPC requires a prediction model of the system to optimize future behaviour; this model includes not only the robot dynamics, but also the predicted disturbances over the next optimization horizon. Thus, researchers have developed a compound control scheme consisting of a feedforward compensation part based on a conventional DOB and a feedback regulation part based on MPC (DOB-MPC) [14]. However, the performance of MPC relies heavily on the accuracy of the given system model, and the requirement for online optimization at each timestep leads to low computational efficiency. Besides, such a separated modeling and control optimization process might not produce models and controls that jointly optimize robot performance, as evidenced in [15]. In contrast, the DOB-Net uses neural networks to construct both the observer and the controller, achieving model-free control, high run-time efficiency, and joint optimization of the observer and the controller.

B. Classical RL

RL has drawn a lot of attention for finding optimal controllers for systems that are difficult to model accurately. Recently, deep RL algorithms based on Q-learning [16], policy gradients [17], and actor-critic methods [18] have been shown to learn complex skills in high-dimensional state and action spaces, including simulated robotic locomotion, driving, video game playing, and navigation. RL generally considers stochastic systems of the form [19]

$$x_{t+1} = f(x_t, u_t) + \varepsilon, \quad (1)$$

with state variables $x \in \mathbb{R}^D$, control signal $u \in \mathbb{R}^K$, and i.i.d. system noise marginalized over time, $\varepsilon \sim \mathcal{N}(0, E)$, where $E = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_D^2)$. In our case, however, the current and wave disturbances should be regarded as functions of time instead of random noise, see (2), due to their large amplitudes and time-varying characteristics, as evidenced in Section V.

C. History Window Approach

When using RL to deal with unknown disturbances, the problem cannot be defined as an MDP, since the transition function depends not only on the current state and action, but also heavily on the disturbances. The history window approaches [20] attempt to resolve the hidden state by making the selected action depend not only on the current state, but also on a fixed number of the most recent states and actions. Wang et al. [21] applied this approach to handle the external disturbances of an AUV by characterizing the disturbed AUV dynamics model as a multi-order Markov chain $x_{t+1} = f_h(H_t, x_t, u_t)$, assuming that the unobserved time-varying disturbances and their prediction over the next planning horizon are encoded in a state-action history of fixed length, $H_t = \{x_{t-N}, u_{t-N}, \cdots, x_{t-1}, u_{t-1}\}$, where $N$ is the history length. The policy is then trained to take a fixed length of state-action history along with the current state as input to generate control signals. However, it is difficult to determine an optimal length of the history: a shorter history may not provide sufficient information about the disturbances, while a longer history makes training difficult. Wang et al. [21] treated the history length as a hyperparameter that was statistically optimized during training.
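To make the history window idea concrete, the following is a minimal sketch of assembling $H_t$ and feeding it to a policy; the buffer handling, the `policy` placeholder and the zero-padding are illustrative assumptions, not the implementation of [21].

```python
import numpy as np
from collections import deque

N = 10  # history length (a hyperparameter, tuned statistically in [21])

# fixed-length buffer of the N most recent (state, action) pairs
history = deque(maxlen=N)

def policy_input(x_t, history):
    """Concatenate H_t = {x_{t-N}, u_{t-N}, ..., x_{t-1}, u_{t-1}} with the current state."""
    if len(history) == N:
        flat_history = np.concatenate([np.concatenate([x, u]) for x, u in history])
    else:
        flat_history = np.zeros(N * (6 + 3))  # zero-pad until the buffer fills
    return np.concatenate([flat_history, x_t])

# inside the control loop (policy() stands for any trained feed-forward network):
# u_t = policy(policy_input(x_t, history))
# history.append((x_t, u_t))
```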


D. Recurrent Policy

Due to the difficulty of determining an optimal history length with the history window approach, an RNN can instead be utilized to automatically learn how much past experience should be explored to achieve optimal performance. Using RNNs to represent policies is a popular approach for handling partial observability [22], [23], the idea being that the RNN can retain information from states further back in time and incorporate this information into predicting better actions and value functions, thus performing better on tasks that require long-term planning. In particular, an RNN policy produces an action and a hidden vector based on the current state and the hidden vector from the last timestep. Since the hidden vector contains past processed information, the action is a function of all previous states. Such policies are able to solve tasks that require memory by loading sequences of states [23]. However, most of them consider only the state history, which, for example, has been used for estimating velocities when training video game players [16].

Sutskever et al. [24] presented a novel usage of RNNs, where the input to the RNN comes from the previously predicted output. This architecture gives rise to the idea of using both past states and actions in recurrent RL, where the recurrent policy produces an action $u_t$ given both the current state $x_t$ and the previously executed action $u_{t-1}$ as the RNN input. These approaches encode additional dynamics information besides kinematic information, enabling the capability of observing disturbances. Compared with recurrent RL using state-action history, the contributions of the DOB-Net lie in the exploration and application of the architectural similarities between the Gated Recurrent Unit (GRU) and the conventional DOB, and in the resulting ability to encode a prediction of the disturbance dynamics.
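A hedged sketch of such a recurrent policy step, where the previous action is concatenated with the current state as the RNN input; the module choice (a GRU cell) and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """pi(u_t | x_t, u_{t-1}, h_{t-1}): a GRU over [state, previous action]."""
    def __init__(self, state_dim=6, action_dim=3, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRUCell(state_dim + action_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, action_dim)

    def forward(self, x_t, u_prev, h_prev):
        h_t = self.gru(torch.cat([x_t, u_prev], dim=-1), h_prev)  # update hidden state
        u_t = torch.tanh(self.head(h_t))                          # action scaled to [-1, 1]
        return u_t, h_t
```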

E. Meta-Reinforcement Learning

Meta-Reinforcement Learning (Meta-RL) [25]–[27] defines a framework which leverages data from previous tasks to acquire a learning procedure that can quickly adapt to new tasks using only a small amount of experience. The tasks are considered as MDPs or Partially Observable MDPs (POMDPs) drawn from a task distribution with varying transition functions or varying reward functions. Nagabandi et al. [26] and Rakelly et al. [27] both considered adaptation to the current task setting using past transitions. Rakelly et al. [27] considered adaptation at test time in meta-RL as a special case of RL in a POMDP by including the task as the unobserved part of the state, and learned a probabilistic latent representation of prior experience to capture uncertainty over the task.

However, these papers either assume each task is an invariable MDP [25], [27], or assume the environment is locally consistent during a rollout [26], which may not be suitable for our problem due to the rapidly time-varying characteristic of the current and wave disturbances. Moreover, these disturbances are better described as functions of time, instead of just uncertainties in the transition function; the external disturbances (or variations in the transition function) for two consecutive samples are therefore not independent and identically distributed (i.i.d.). The current meta-RL formulation does not clearly indicate an appropriate solution for such time-correlated variations in the transition function. A potential idea is to characterize the disturbed transition function as a multi-step MDP, where the state space contains not only the current state, but also the most recent states and actions. This naturally leads to the history window approach and recurrent RL.

III. PROBLEM FORMULATION

A. System Description

Our 6 Degree Of Freedom (DOF) AUV is designed to be sufficiently stable in orientation even under strong disturbances, thanks to its large restoring forces. Thus, we only consider disturbance rejection control of the vehicle's 3-DOF position, and assume its orientation is well controlled at all times. The framework can be easily extended to the 6-DOF case, where a larger network and longer training time may be required. It is also applicable to other kinds of mobile robots subject to excessive time-varying disturbances, such as quadrotors, gliders, and surface vessels, although its effectiveness there needs further investigation.

The AUV model can be considered as a floating rigid body with external disturbances, which can be represented by

$$M(q)\ddot{q} + G(q,\dot{q}) = u + d(t), \qquad G(q,\dot{q}) = C(q,\dot{q})\dot{q} + D(q,\dot{q})\dot{q} + g(q), \quad (2)$$

where $M(q) \in \mathbb{R}^{3\times3}$ is the inertia matrix, $C(q,\dot{q}) \in \mathbb{R}^{3\times3}$ is the matrix of Coriolis and centripetal terms, $D(q,\dot{q}) \in \mathbb{R}^{3\times3}$ is the drag matrix, $g(q) \in \mathbb{R}^3$ is the vector of gravity and buoyancy forces, $q, \dot{q}, \ddot{q} \in \mathbb{R}^3$ are the displacements, velocities and accelerations of the AUV, $u \in \mathbb{R}^3$ are the control forces, and $d(t) \in \mathbb{R}^3$ are the time-varying disturbance forces. The variation of $d(t)$ with time, from the past into the future, is the disturbance dynamics, which is exactly what the observer network tries to produce. The AUV dynamic model is assumed to have fixed parameters; neither the model nor the parameters are known. In our case, we assume that the magnitudes of the disturbances will exceed the AUV control limits $\underline{u}, \overline{u} \in \mathbb{R}^3$, but are constrained within reasonable ranges, ensuring the controller is able to stabilize the AUV in a sufficiently small region.
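As a rough illustration of how (2) can be rolled out in simulation, the following semi-implicit Euler sketch uses placeholder dynamics terms, since the true $M$, $C$, $D$ and $g$ of the AUV are not specified here:

```python
import numpy as np

def step_dynamics(q, q_dot, u, d_t, dt=0.05):
    """One integration step of M(q) q_ddot + G(q, q_dot) = u + d(t)."""
    M = 60.0 * np.eye(3)                 # placeholder inertia matrix (60 kg rigid body)
    D = 20.0 * np.diag(np.abs(q_dot))    # placeholder quadratic drag
    g = np.zeros(3)                      # neutrally buoyant -> no net restoring force
    G = D @ q_dot + g                    # Coriolis terms omitted for this translational sketch
    q_ddot = np.linalg.solve(M, u + d_t - G)
    q_dot_next = q_dot + dt * q_ddot     # semi-implicit Euler
    q_next = q + dt * q_dot_next
    return q_next, q_dot_next
```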

B. Problem Definition

In RL, the goal is to learn a policy that chooses actions $u_t$ at each timestep $t$ in response to the current state $x_t$, such that the total expected sum of discounted rewards is maximized over all time. The state of the robot consists of its position and the corresponding velocities, $x = [q^T\ \dot{q}^T]^T \in \mathcal{X} \subseteq \mathbb{R}^6$. The action consists of the control forces $u \in \mathcal{U} \subseteq \mathbb{R}^3$. At each timestep, the system transitions from $x_t$ to $x_{t+1}$ in response to the chosen action $u_t$ and the transition dynamics function $f: \mathcal{X} \times \mathcal{U} \to \mathcal{X}$, collecting a reward $r_t$ according to a running reward function or a final reward function,


$$r(x_t, u_t) = x_t^T Q x_t + u_t^T R u_t, \qquad r_f(x_t) = x_t^T Q x_t, \quad (3)$$

where $Q \in \mathbb{R}^{6\times6}$ and $R \in \mathbb{R}^{3\times3}$ are weight matrices. The discounted sum of future rewards is then defined as $\sum_{t'=t}^{T-1} \gamma^{t'-t} r(x_{t'}, u_{t'}) + \gamma^{T-t} r_f(x_T)$, where $\gamma \in [0,1]$ is a discount factor that prioritizes near-term rewards over distant rewards [28].
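For concreteness, a small sketch of the quadratic reward in (3) and the discounted return; the weight matrices shown are placeholders, assumed negative (semi-)definite so that the return is maximized at the target.

```python
import numpy as np

Q = -np.eye(6)          # placeholder state penalty (negative so larger reward is better)
R = -0.01 * np.eye(3)   # placeholder control penalty

def running_reward(x, u):
    return x @ Q @ x + u @ R @ u                    # r(x_t, u_t) in (3)

def discounted_return(xs, us, gamma=0.99):
    """sum_{t'} gamma^{t'-t} r(x_t', u_t') + gamma^{T-t} r_f(x_T), with xs of length T+1."""
    T = len(us)
    ret = sum(gamma ** k * running_reward(xs[k], us[k]) for k in range(T))
    return ret + gamma ** T * (xs[T] @ Q @ xs[T])   # final reward r_f(x_T)
```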

IV. METHODOLOGY

The underwater disturbances present great challenges for stabilization control due to their excessive amplitudes and rapidly time-varying characteristics. In this section, a conventional DOB is first compared with a GRU; the comparison reveals similarities in how hidden information is processed. An enhanced observer network for excessive time-varying disturbances is then designed using GRUs, encoding the disturbance dynamics into the GRU hidden state. A controller network is subsequently built upon this encoding to generate optimal controls.

A. Conventional DOB

The basic idea of the conventional DOB is to estimate the current disturbance forces based on the robot state and the executed controls; its formulation is

$$\dot{y} = -L(q,\dot{q})\,y + L(q,\dot{q})\big(G(q,\dot{q}) - p(q,\dot{q}) - u\big), \qquad \hat{d} = y + p(q,\dot{q}), \quad (4)$$

where $\hat{d} \in \mathbb{R}^3$ is the estimated disturbance, $y \in \mathbb{R}^3$ is the internal state of the nonlinear observer, and $p(q,\dot{q})$ is a nonlinear function to be designed. The DOB gain $L(q,\dot{q})$ is determined by the following nonlinear function:

$$L(q,\dot{q})\,M(q)\,\ddot{q} = \left[\frac{\partial p(q,\dot{q})}{\partial q}\ \ \frac{\partial p(q,\dot{q})}{\partial \dot{q}}\right]\begin{bmatrix}\dot{q}\\ \ddot{q}\end{bmatrix}. \quad (5)$$

It has been shown in [6] that the DOB is globally asymptotically stable by choosing

$$L(q,\dot{q}) = \mathrm{diag}\{c, c\}, \quad (6)$$

where $c > 0$. More specifically, the exponential convergence rate can be specified by the choice of $c$. The convergence and performance of the DOB have been established for slowly time-varying disturbances and disturbances with bounded rate of change in [29]. A discrete version of the DOB is also provided.

Fig. 2. Architecture of DOB and GRU: (a) DOB; (b) GRU.

The discrete DOB (illustrated in Fig. 2(a)) is given by

$$\bar{y}_t = G(x_t) - p(x_t) - u_{t-1}, \qquad y_t = \big(1 - L(x_t)\,dt\big)\,y_{t-1} + L(x_t)\,dt\,\bar{y}_t, \qquad \hat{d}_{t-1} = y_{t-1} + p(x_t), \quad (7)$$

where $dt$ is the sampling interval.
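A minimal numerical sketch of the discrete observer update in (7); `G_fn`, `p_fn` and the scalar gain `L` are placeholders standing in for the (here unknown) model terms.

```python
import numpy as np

def dob_step(y_prev, x_t, u_prev, G_fn, p_fn, L, dt=0.05):
    """One step of the discrete disturbance observer (7).

    y_prev : previous observer internal state
    x_t    : current robot state (positions and velocities)
    u_prev : control applied at the previous step
    """
    d_hat_prev = y_prev + p_fn(x_t)                   # disturbance estimate for step t-1
    y_bar = G_fn(x_t) - p_fn(x_t) - u_prev            # observer target from state and control
    y_t = (1.0 - L * dt) * y_prev + L * dt * y_bar    # first-order filter with gain L
    return y_t, d_hat_prev
```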

B. Gated Recurrent Unit (GRU)

The architecture of the GRU [30] is shown in Fig. 2(b). Its formulation is given below:

$$z_t = \sigma(W_z[h_{t-1}, s_t] + b_z), \qquad r_t = \sigma(W_r[h_{t-1}, s_t] + b_r),$$
$$\tilde{h}_t = \tanh(W_h[r_t \circ h_{t-1}, s_t] + b_h), \qquad h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t, \quad (8)$$

where $s_t$ is the input vector, $h_t$ is the output (hidden state) vector, $z_t$ is the update gate vector, $r_t$ is the reset gate vector, $W$ and $b$ are the weight matrices and bias vectors, $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent activation functions, and $\circ$ denotes the Hadamard product.
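For concreteness, a NumPy sketch of one GRU step implementing (8); the weight shapes and concatenation order follow the equations above and are otherwise illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(s_t, h_prev, Wz, bz, Wr, br, Wh, bh):
    """One GRU update following (8); the W* matrices act on the concatenation [h_{t-1}, s_t]."""
    hs = np.concatenate([h_prev, s_t])
    z_t = sigmoid(Wz @ hs + bz)                                       # update gate
    r_t = sigmoid(Wr @ hs + br)                                       # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, s_t]) + bh)  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                       # new hidden state h_t
```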

C. DOB-Net

The DOB-Net is constructed on the classical actor-critic architecture. The observer network consists of two GRUs with two fully connected layers between them, in order to imitate and enhance the function of the conventional DOB. As shown in Fig. 2(a) and Fig. 2(b), the DOB and the GRU have similar architectures, especially the part in the red box. $y_t$ of the DOB acts as the hidden state, similar to the role of $h_t$ in the GRU, preserving hidden information for use at the next timestep. To equip the GRU with the capability of observing disturbances, we first employ a GRU to process the same inputs as the DOB, namely the current state $x_t$ and the last action $u_{t-1}$. Besides the hidden state, the DOB also outputs the estimated disturbance $\hat{d}_{t-1}$, which is a function of both the input state and the hidden state. We therefore add fully connected layers after the first GRU to provide a better embedding of the disturbance estimate.

This embedding of the estimated disturbances is then fed into another GRU, in order to encode a sequence of past disturbances. The embedding of this disturbance sequence, $h_t$, is intended to represent the disturbance dynamics. It is then combined with the current state $x_t$ to form the actual input of the controller network. One design parameter of the DOB-Net is the embedding dimension of $\hat{d}_{t-1}$; in this paper, 3 dimensions (the size of the disturbances) and 64 dimensions (the size of the RNN hidden state) are chosen and compared in simulation. This comparison shows the flexibility of neural networks gained by building the observer from GRUs.
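A hedged PyTorch sketch of one DOB-Net forward pass as described above; the layer sizes and exact wiring are our reading of Fig. 3, not the authors' released code.

```python
import torch
import torch.nn as nn

class DOBNet(nn.Module):
    def __init__(self, state_dim=6, action_dim=3, hidden_dim=64, embed_dim=64):
        super().__init__()
        # observer: GRU over (x_t, u_{t-1}) -> FC embedding of d_hat_{t-1} -> GRU encoding h_t
        self.obs_gru1 = nn.GRUCell(state_dim + action_dim, hidden_dim)
        self.embed = nn.Sequential(nn.Linear(hidden_dim, embed_dim), nn.Tanh())
        self.obs_gru2 = nn.GRUCell(embed_dim, hidden_dim)
        # controller: (h_t, x_t) -> FC layers -> action u_t and value V_t
        self.ctrl = nn.Sequential(nn.Linear(hidden_dim + state_dim, 64), nn.Tanh(),
                                  nn.Linear(64, 64), nn.Tanh())
        self.mu = nn.Linear(64, action_dim)   # mean of the policy output
        self.value = nn.Linear(64, 1)         # state value V_t for the critic

    def forward(self, x_t, u_prev, h1_prev, h2_prev):
        h1 = self.obs_gru1(torch.cat([x_t, u_prev], dim=-1), h1_prev)
        d_embed = self.embed(h1)              # embedding of the disturbance estimate
        h2 = self.obs_gru2(d_embed, h2_prev)  # encodes the disturbance dynamics
        z = self.ctrl(torch.cat([h2, x_t], dim=-1))
        return torch.tanh(self.mu(z)), self.value(z), h1, h2
```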

Fig. 3. Network architecture of DOB-Net.


Algorithm 1: DOB-Net - pseudocode for each thread
  Assume global shared parameters θ and θ_v
  Assume thread-specific parameters θ' and θ'_v
  Initialize global shared counter T = 0
  Initialize thread step counter t = 1
  repeat
    Reset gradients: dθ ← 0 and dθ_v ← 0
    Synchronize parameters: θ' = θ and θ'_v = θ_v
    t_start = t
    Get state x_t, last action u_{t-1}, last hidden state h_{t-1}
    repeat
      Sample u_t according to π(u_t | x_t, u_{t-1}, h_{t-1}; θ'), receive h_t
      Perform u_t, receive r_t and x_{t+1}
      t ← t + 1 and T ← T + 1
    until terminal x_t or t - t_start == t_max
    R = 0 for terminal x_t, otherwise R = V(x_t, u_{t-1}, h_{t-1}; θ'_v)
    for i ∈ {t-1, ..., t_start} do
      R ← r_i + γR
      Accumulate gradients w.r.t. θ': dθ ← dθ + ∇_{θ'} log π(u_i | x_i, u_{i-1}, h_{i-1}; θ') (R - V(x_i, u_{i-1}, h_{i-1}; θ'_v))
      Accumulate gradients w.r.t. θ'_v: dθ_v ← dθ_v + ∂(R - V(x_i, u_{i-1}, h_{i-1}; θ'_v))² / ∂θ'_v
    end
    Perform update of θ, θ_v using dθ, dθ_v
  until T > T_max


Training: Advantage Actor Critic (A2C) [18] is a conceptually simple and lightweight framework for deep RL that uses synchronous gradient descent to optimize deep neural network controllers. The algorithm synchronously executes multiple agents in parallel on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given timestep the parallel agents experience a variety of different states.

Our algorithm is developed in the A2C style. Pseudocode of the DOB-Net is shown in Algorithm 1. Each thread interacts with its own copy of the environment. The disturbances are also different in each thread, each being randomly sampled; in numerical simulations we found that this setting helps accelerate the convergence of learning and improves performance, compared with using the same disturbances across all threads. The algorithm operates in the forward view by explicitly computing k-step returns. In order to compute a single update for the policy and the value function, the algorithm first samples and performs actions using its exploration policy for up to t_max steps or until a terminal state is reached. The algorithm then computes gradients for k-step updates: each k-step update uses the longest possible k-step return, resulting in a one-step update for the last state, a two-step update for the second last state, and so on, for a total of up to t_max updates. The accumulated updates are applied in a single gradient step.
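A minimal sketch of the k-step return computation and the resulting A2C losses described above, assuming log-probabilities and value predictions were stored during the rollout; the parallel environments and shared-parameter bookkeeping of Algorithm 1 are omitted.

```python
import torch

def a2c_losses(rewards, values, log_probs, bootstrap_value, gamma=0.99):
    """rewards, values, log_probs: lists of tensors for one t_max-step rollout."""
    R = bootstrap_value                      # 0 for a terminal state, V(x_t) otherwise
    policy_loss, value_loss = 0.0, 0.0
    for i in reversed(range(len(rewards))):  # longest possible k-step return per timestep
        R = rewards[i] + gamma * R
        advantage = R - values[i]
        policy_loss = policy_loss - log_probs[i] * advantage.detach()
        value_loss = value_loss + advantage.pow(2)
    return policy_loss, value_loss

# typical usage: (policy_loss + 0.5 * value_loss).backward(); clip gradients; optimizer step
```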

Fig. 4. Example disturbances in X, Y and Z directions (disturbance force in N vs. time in s, with the control limit marked): (a) small simulated disturbances; (b) large simulated disturbances; (c) collected disturbances.

V. SIMULATION EXPERIMENTS

A. Simulation Setup

A position regulation task is simulated to test our approaches. The simulated AUV has a mass of $m = 60$ kg and a size of $0.8 \times 0.8 \times 0.25$ m$^3$. Only positional motion and control are considered, so the AUV has a 6-dimensional state space and a 3-dimensional action space. The control limits are $|\underline{u}| = |\overline{u}| = [120\,\mathrm{N}\ \ 120\,\mathrm{N}\ \ 120\,\mathrm{N}]^T$. Each training episode contains 200 steps with 0.05 s per step. In each episode, the robot starts at a random position with a random velocity, and it is controlled to reach a target position and to stay within a region (referred to as the converged region) thereafter.

In these experiments, the algorithms are trained using simulated disturbances and tested using both simulated and collected disturbances. The simulated disturbances are constructed as a superposition of multiple sinusoidal waves (three in our case) with different amplitudes, frequencies and phases. Two scenarios are considered: one with close or slightly excessive amplitudes (around 100-120% of the control limits), the other with larger amplitudes (130-150% of the control limits). Examples of the simulated disturbances are given in Fig. 4 (a) and (b). Our purpose is to enable the trained policy to deal with unknown time-varying disturbances, so the amplitudes, periods and phases are randomly sampled from the given distributions in each training episode. To further validate the efficacy of the proposed algorithms, we also collected current and wave disturbance data in a water tank using a wave generator, as shown in Fig. 4 (c). The data is collected through an onboard IMU of an unactuated AUV; the measured linear accelerations are mapped to forces, which can be taken as the external disturbances. Note that the amplitudes of the collected disturbances are not constrained within the ranges seen during training, leading to a more challenging scenario.
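A sketch of how such superposed sinusoidal disturbances could be sampled per episode; the sampling ranges are placeholders scaled from the 120 N control limit, not the exact distributions used here.

```python
import numpy as np

def sample_disturbance_profile(n_steps=200, dt=0.05, n_waves=3,
                               amp_range=(0.4, 0.5), limit=120.0, rng=np.random):
    """Per-axis disturbance d(t) as a sum of n_waves sinusoids with random
    amplitude, period and phase (amp_range is a fraction of the control limit)."""
    t = np.arange(n_steps) * dt
    d = np.zeros((n_steps, 3))
    for axis in range(3):
        for _ in range(n_waves):
            amp = rng.uniform(*amp_range) * limit
            period = rng.uniform(2.0, 10.0)          # seconds (placeholder range)
            phase = rng.uniform(0.0, 2 * np.pi)
            d[:, axis] += amp * np.sin(2 * np.pi * t / period + phase)
    return d
```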

Ten different approaches for disturbance rejection controlare tested and compared:

(a) Robust integral of the sign error (RISE) control [31]
(b) DOBC
(c) A2C
(d) History window A2C with state history (HWA2C-x)
(e) History window A2C with state-action history (HWA2C-xu)
(f) Recurrent A2C with state history (RA2C-x)
(g) Recurrent A2C with state-action history (RA2C-xu)
(h) DOB-Net (n = 3)
(i) DOB-Net (n = 64)
(j) Trajectory optimization


Fig. 5. Training episode rewards of A2C, HWA2C-x, HWA2C-xu, RA2C-x, RA2C-xu, DOB-Net (n=3) and DOB-Net (n=64): (a) small simulated disturbances (amplitude between 100-120% of control limits); (b) large simulated disturbances (amplitude between 130-150% of control limits).


Note that, among these methods, the trajectory optimization assumes full knowledge of the disturbances over the whole episode, while all other algorithms deal with unknown disturbances. The comparison is obviously not fair; the trajectory optimization is used only to provide the performance in the ideal case. RISE control is a conventional feedback controller; DOBC is an implementation of Sun and Guo's work [8], considering control constraints; HWA2C applies the history window approach within the A2C framework, with a window length of 10 timesteps (0.5 s in our simulation setup); and RA2C employs RNNs to process the past states and actions. The applied A2C framework uses a parallel training mode with 16 agents at the same time, and the equivalent real-world training time for each agent is 43.4 hours. The RL algorithms accumulate gradients over 20 steps, and the reward discount factor is 0.99. The optimizer for RL is RMSprop, with a learning rate of 7e-4 and a max gradient norm of 0.5. In the remainder of this section, we first evaluate the training process of the different algorithms, then test and compare their control performance using either the simulated disturbances or the collected disturbances.
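The training hyperparameters listed above could be wired up, for instance, as in the following PyTorch sketch; the stand-in module and the update routine are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

# hyperparameters reported in the text
NUM_AGENTS = 16          # parallel environment copies
ROLLOUT_STEPS = 20       # steps over which gradients are accumulated
GAMMA = 0.99             # reward discount factor

model = nn.GRUCell(9, 64)  # stand-in module; in practice the DOB-Net of Section IV-C
optimizer = torch.optim.RMSprop(model.parameters(), lr=7e-4)

def apply_update(loss):
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)  # max gradient norm
    optimizer.step()
```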

Fig. 6. Distribution of distance from target with small simulated disturbances (step 1-100 and step 101-200).

B. Training Results

Fig. 5 shows the performance improvement of RL when history information is considered. When only small disturbances occur, the different ways of using history information achieve nearly the same accumulated reward. When the disturbances become larger, there is a noticeable improvement from using the additional action input for both HWA2C and RA2C. Also, using an RNN instead of the history window approach achieves a higher accumulated reward, which is reasonable given the more efficient utilization of the history information. For the DOB-Net algorithms, we notice that the larger embedding size ($n = 64$) of the disturbance estimate produces a higher reward.

C. Test Results on Simulated Disturbances

The training reward is not sufficient to compare the performance of the different algorithms; we are also interested in the state distribution and the bounded response (i.e., the converged region) of the AUV disturbed by flows. As shown in Fig. 6 and Fig. 7, box plots are used to represent and compare the distribution of the AUV's distance from the target among the different algorithms, in the first half (step 1-100) and second half (step 101-200) of each episode. On each box, the central mark indicates the median, and the bottom and top edges indicate the 25th and 75th percentiles, respectively. The distance ranges of the second stage are smaller than those of the first stage, since the AUV first observes the disturbance dynamics and then tries to stabilize itself. The results also demonstrate that all these algorithms stabilize the AUV to a certain extent.

Again, note that the trajectory optimization provides an optimal solution for the case in which the disturbances over the entire episode are given in advance; our goal is to narrow the gap between our algorithm and this ideal solution. Focusing on the second stage of the episode, it is clear that both the history window policies and the recurrent policies perform better than the classical RL policy (A2C) and the conventional control schemes (RISE and DOBC), which means the history information does improve the disturbance rejection capability. Also, the recurrent policy considering both past states and actions performs even better, showing that RNNs can utilize the history information more efficiently than naively stacking past states and actions into the state space. Furthermore, considering past actions as additional input besides states yields better performance.

Fig. 7. Distribution of distance from target with large simulated disturbances (step 1-100 and step 101-200).


Fig. 8. 3D trajectories with large simulated disturbances ($R$ is the radius of the converged region): (a) RISE; (b) DOBC; (c) A2C; (d) HWA2C-xu; (e) RA2C-xu; (f) DOB-Net; (g) Trajectory Optimization. Note that the trajectory optimization assumes the disturbance dynamics is known in advance, and thus provides the ideal performance.

The DOB-Net achieves the best performance among all these algorithms, especially with $n = 64$. We believe enlarging the embedding size of the disturbance estimate provides a better representation of the disturbance dynamics, whereas compressing this embedding from a 64-dimensional variable to a 3-dimensional one may cause loss of information. However, even with the best RL algorithms considered so far, the control performance still shows a large gap from the trajectory optimization solution, leaving room for further improvement.

In addition, stronger disturbances obviously lead to worse performance. However, we also found that a larger amplitude range of the disturbances yields closer performance between the DOB-Net and the optimal approach (the ratio of medians between the DOB-Net and the trajectory optimization is 502.31% and 186.65% for small and large simulated disturbances, respectively). This phenomenon might arise because the optimal controls for different disturbance patterns with large amplitudes tend to be more similar than those with small amplitudes, so it is easier for RL to learn a near-optimal policy under larger disturbance amplitudes.

The 3D trajectories of these algorithms subject to large simulated disturbances are compared in Fig. 8.

Fig. 9. Distribution of distance from target with collected disturbances (step 1-100 and step 101-200).

In Fig. 8, the red ball represents the AUV's maximum distance from the target during the last 50 steps, called the converged region. According to this region, we can see that the AUV has difficulty achieving a satisfactory bounded response using either the conventional controllers (RISE and DOBC) or classical RL (A2C), while the proposed DOB-Net can significantly narrow the converged region. Using the DOB-Net, the AUV quickly navigates to the target and stabilizes itself within a distance of 0.493 m from the target thereafter. However, there is still an obvious gap between the DOB-Net and the optimal trajectory.

D. Test Results on Collected Disturbances

Besides the simulated disturbances, we also use the collected current and wave disturbances (shown in Fig. 4 (c)) for testing. Note that the collected data is only used for testing; no retraining is required. The DOB-Net still has satisfactory performance in this real-world disturbance scenario and outperforms all the other algorithms, which demonstrates the practical effectiveness of the DOB-Net. However, the converged regions of all the competitors become larger compared with the simulated case, and the performance gap between the DOB-Net and the optimal approach widens (the ratio of medians between the DOB-Net and the trajectory optimization increases to 264.88%, compared with 186.65% when using large simulated disturbances).

Fig. 10. 3D trajectories with collected disturbances ($R$ is the radius of the converged region).


The reason is that the collected disturbances are more diverse and complicated, and thus have a wider range of amplitudes compared with the simulated case. The proposed algorithm may not be capable of handling such unseen scenarios optimally. This gives rise to another research question: how to deal with disturbances with a wider range of parameters when trained only on a small range of parameters. This may require transfer learning techniques [32].

VI. CONCLUSION & FUTURE WORK

This paper proposes an observer-integrated RL approach called DOB-Net, for mobile robot control problems under unknown excessive time-varying disturbances. A disturbance dynamics observer network employing RNNs is used to imitate and enhance the function of the conventional DOB, producing an embedding of the disturbance estimation and prediction. A controller network takes the observer outputs and the current state as inputs to generate optimal controls. Multiple control and RL algorithms have been tested and compared on position regulation tasks using both simulated and collected disturbances; the results demonstrate that the proposed DOB-Net significantly improves the disturbance rejection capacity compared to existing methods.

Currently, the test disturbances are collected in a water tank using a wave generator; we plan to obtain disturbance data from open water environments with natural current and waves for further testing. We have also noticed that the performance of the DOB-Net is worse under the collected disturbances, due to their more complex and diverse dynamics. An interesting future work is to investigate the use of transfer learning for dealing with real-world current and wave disturbances when simulated data and only a small amount of collected data are available. In addition, the deployment of this method on real-world robotic systems requires future investigation, where the low sample efficiency of generic model-free RL might be a problem; some model-based approaches may be necessary to overcome the constraints of real-time sample collection in the real world.

REFERENCES

[1] G. Griffiths, Technology and Applications of Autonomous Underwater Vehicles. CRC Press, 2002, vol. 2.
[2] J. Woolfrey, D. Liu, and M. Carmichael, "Kinematic control of an autonomous underwater vehicle-manipulator system (AUVMS) using autoregressive prediction of vehicle motion and model predictive control," in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 4591-4596.
[3] L.-L. Xie and L. Guo, "How much uncertainty can be dealt with by feedback?" IEEE Transactions on Automatic Control, vol. 45, no. 12, pp. 2203-2217, 2000.
[4] S. Waslander and C. Wang, "Wind disturbance estimation and rejection for quadrotor position control," in AIAA Infotech@Aerospace Conference and AIAA Unmanned...Unlimited Conference, 2009, p. 1983.
[5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[6] W.-H. Chen, D. J. Ballance, P. J. Gawthrop, and J. O'Reilly, "A nonlinear disturbance observer for robotic manipulators," IEEE Transactions on Industrial Electronics, vol. 47, no. 4, pp. 932-938, 2000.
[7] W.-H. Chen, J. Yang, L. Guo, and S. Li, "Disturbance-observer-based control and related methods - an overview," IEEE Transactions on Industrial Electronics, vol. 63, no. 2, pp. 1083-1095, 2016.
[8] H. Sun and L. Guo, "Neural network-based DOBC for a class of nonlinear systems with unmatched disturbances," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 2, pp. 482-489, 2016.
[9] C. Edwards and S. Spurgeon, Sliding Mode Control: Theory and Applications. CRC Press, 1998.
[10] W. Lu and D. Liu, "Active task design in adaptive control of redundant robotic systems," in Australasian Conference on Robotics and Automation. ARAA, 2017.
[11] ——, "A frequency-limited adaptive controller for underwater vehicle-manipulator systems under large wave disturbances," in 2018 13th World Congress on Intelligent Control and Automation (WCICA). IEEE, 2018, pp. 246-251.
[12] H. Gao and Y. Cai, "Nonlinear disturbance observer-based model predictive control for a generic hypersonic vehicle," Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, vol. 230, no. 1, pp. 3-12, 2016.
[13] C. E. Garcia, D. M. Prett, and M. Morari, "Model predictive control: theory and practice - a survey," Automatica, vol. 25, no. 3, pp. 335-348, 1989.
[14] U. Maeder and M. Morari, "Offset-free reference tracking with model predictive control," Automatica, vol. 46, no. 9, pp. 1469-1476, 2010.
[15] S. Brahmbhatt and J. Hays, "DeepNav: Learning to navigate large cities," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3087-3096.
[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[17] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889-1897.
[18] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928-1937.
[19] S. Sæmundsson, K. Hofmann, and M. P. Deisenroth, "Meta reinforcement learning with latent variable Gaussian processes," arXiv preprint arXiv:1803.07551, 2018.
[20] L.-J. Lin and T. M. Mitchell, "Reinforcement learning with hidden states," From Animals to Animats, vol. 2, pp. 271-280, 1993.
[21] T. Wang, W. Lu, and D. Liu, "Excessive disturbance rejection control of autonomous underwater vehicle using reinforcement learning," in Australasian Conference on Robotics and Automation, 2018.
[22] M. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," in 2015 AAAI Fall Symposium Series, 2015.
[23] D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber, "Solving deep memory POMDPs with recurrent policy gradients," in International Conference on Artificial Neural Networks. Springer, 2007, pp. 697-706.
[24] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[25] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1126-1135.
[26] A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn, "Learning to adapt in dynamic, real-world environments through meta-reinforcement learning," arXiv preprint arXiv:1803.11347, 2018.
[27] K. Rakelly, A. Zhou, D. Quillen, C. Finn, and S. Levine, "Efficient off-policy meta-reinforcement learning via probabilistic context variables," arXiv preprint arXiv:1903.08254, 2019.
[28] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," in Robotics and Automation (ICRA), 2018 IEEE International Conference on. IEEE, 2018, pp. 7579-7586.
[29] S. Li, J. Yang, W.-H. Chen, and X. Chen, Disturbance Observer-Based Control: Methods and Applications. CRC Press, 2016.
[30] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[31] N. Fischer, D. Hughes, P. Walters, E. M. Schwartz, and W. E. Dixon, "Nonlinear RISE-based control of an autonomous underwater vehicle," IEEE Transactions on Robotics, vol. 30, no. 4, pp. 845-852, 2014.
[32] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2009.