
1300 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 6, NOVEMBER 2000

State-Based SHOSLIF for Indoor Visual Navigation

Shaoyun Chen, Member, IEEE, and Juyang Weng, Member, IEEE

Abstract—In this paper, we investigate vision-based navigation using the self-organizing hierarchical optimal subspace learning and inference framework (SHOSLIF) that incorporates states and a visual attention mechanism. With states to keep the history information and regarding the incoming video input as an observation vector, vision-based navigation is formulated as an observation-driven Markov model (ODMM). The ODMM can be realized through recursive partitioning regression. A stochastic recursive partition tree (SRPT), which maps a preprocessed current input raw image and the previous state into the current state and the next control signal, is used for efficient recursive partitioning regression. The SRPT learns incrementally: each learning sample is learned or rejected "on-the-fly." The proposed scheme has been successfully applied to indoor navigation.

Index Terms—Content-based retrieval, eigen-subspace method, incremental learning, nearest neighbor regression, observation-driven Markov model (ODMM), vision-based navigation.

I. INTRODUCTION

MUCH progress has been made in autonomous navigation of mobile robots, both indoors and outdoors. Among various sensory data, visual input is the most important for human drivers. Many of the early vision-based navigation systems followed roads by using either road edge detection or road region segmentation [18]. However, due to variations in environmental conditions and in the condition or nature of driveways, road edges or road regions are not always detectable, and thus these edge- or region-based approaches face robustness problems.

To handle a variety of environmental conditions, several experimental navigation systems have employed adaptation mechanisms, ranging from simple adaptive color thresholding to more complicated learning mechanisms such as artificial neural networks (ANNs) [24], [27]. Investigation around the Navlab project [31] developed different adaptation mechanisms for different outdoor driving environments. A navigation system called SCARF [31] was designed to handle various roads with adaptive color classification. Another navigation system, YARF [31], dealt with structured roads by explicitly modeling available constraints and features, since a single technique for road detection may fail on different roads. Dickmanns et al. [8] used Kalman filtering for feedback control with visually detected road features as observations.

Manuscript received February 4, 1999; revised December 4, 1999. This work was supported in part by the National Science Foundation under Grant IIS 9815191, DARPA ETO under Contract DAAN02-98-C-4025, DARPA ITO under Grant DABT63-99-1-0014, the Office of Naval Research under Grant N00014-95-06, and research gifts from Siemens Corporate Research and Zyvex.

S. Chen is with KLA-Tencor, San Jose, CA 95134 USA.

J. Weng is with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 USA.

Publisher Item Identifier S 1045-9227(00)10094-3.

In Meng and Kak's work on indoor navigation [19], the Hough transform was used to detect edges from intensity images, which were then fed into neural networks to produce a qualitative output for high-level semantically based planning.

The above systems predefine the type of features that the system will use. Another type of method does not require a human programmer to specify which features to use for navigation; such methods can adapt to different environments through a learning process. At least two ANN-based navigation systems belong to this category: ALVINN [24], [32], a multilayer perceptron (MLP) trained by a backpropagation learning algorithm, maps input images to an output steering signal for autonomous navigation; ROBIN [27], an ANN alternative, uses a radial basis function neural network (RBFN) to map a low-resolution input image into the output steering signal. These neural-network methods are computationally simple, and real-time navigation speed can be achieved. ALVINN has been successfully tested in a variety of conditions for road following.

Extensions of ALVINN include the use of virtual cameras (VCs) [14], which allow better driving performance, lane transition, and intersection traversal. Lane markings can be used to greatly enhance the adaptability of a driving system, as shown in RALPH [25].

Appearance-based subspace methods, which apply high-dimensional statistical methods directly to the image space, belong to the type of methods that do not require predefined features. They have gained popularity in recent years, starting with applications in face recognition [33]. These methods were applied to vision-based navigation independently by two groups of researchers. Hancock and Thorpe [11] used an eigen-subspace for their global linear regression method in outdoor road following. Weng and Chen's SHOSLIF [4], [36], [35] used eigen-subspaces for local adaptive regression realized by a recursive partition tree (RPT), which is generated automatically from training samples. Each time a new sample is used to query the RPT, the best-matched learning samples are retrieved, and their associated control vectors are used by a local adaptive regression method to generate the next vehicle control signal. Learning by SHOSLIF can be accomplished incrementally with a very low logarithmic time complexity, online, in real time [35]. SHOSLIF uses local nonparametric regression [10] and stable statistical tools such as principal component analysis. Consequently, it can handle high-dimensional spaces for more complex navigation controls than the global parametric regression used by MLP and RBFN. Our recent work [36] reported a performance comparison among SHOSLIF, MLP, and RBFN and showed the advantages of SHOSLIF. We call the latter type of methods appearance-based methods because they learn directly from image intensity patterns. MLP and RBFN are artificial neural networks, while SHOSLIF uses high-dimensional statistical tools.


However, appearance-based methods developed so far for autonomous navigation can only deal with monolithic views, in that the entire scene is treated as a single entity. For example, there are many cases where two global views of two scenes are very similar while the difference is salient in a local view (a part of the global view). A system without state will not be able to use such different views correctly. Consequently, a stateless appearance-based system cannot distinguish many different complex scenes and has a limited generalization power.

Our earlier journal paper [36] reported a stateless SHOSLIF navigation scheme with only batch-learning capability. In this paper, we present a systematic framework through which system states can be defined and learned online to deal with situations that a stateless system cannot handle, such as those where visual attention is required. The learning-based approach allows the teacher to define states online during learning, instead of preprogramming control rules into a static control scheme. Thus, the same learning scheme can potentially handle more complex navigation scenes, such as the challenging ones associated with indoor navigation.

The remainder of this paper is organized as follows. In Section II, we show that the framework can be formulated as an observation-driven Markov model realized by nearest neighbor regression. The nearest neighbor regression is computed efficiently by our stochastic recursive partition tree (SRPT), as shown in Section III. Issues related to vision-based indoor navigation are discussed in Section IV. Section V presents experimental results. Section VI compares the new state-based SHOSLIF with two ANN-based approaches. Finally, conclusions and future work are discussed in Section VII.

II. STATE AND ATTENTION: OBSERVATION-DRIVEN MARKOV MODEL

It is well known that landmarks are important for navigation. A landmark does not have to be an actual object; it can be a subpart of a scene. Without a capability to attend to appropriate subparts of a scene in the right situation, there will never be enough training samples to deal with complex visual scenes. Fig. 1(a1) and (b1) show two images taken around a corner along the navigation path of a robot. Judged from the entire image, these two images are similar. In the context of appearance-based methods, this implies that the distance between them (e.g., Euclidean) is small. However, they require very different actions: one straight ahead, the other turning left. But if we look at the upper-left subregions marked with white boxes in Fig. 1, we can see that their difference is more salient. To visualize how each view is useful in sensing the error in heading direction, we compute the cross-correlation between the two global views (a1) and (b1), as shown in Fig. 1(c), and that of the two local views (a3) and (b3), as shown in Fig. 1(d). The cross-correlation was computed by sliding one image over the other in both directions, assuming pixels are zero beyond the image boundaries. A white point in (c) and (d) indicates a high correlation value and a black point indicates a low correlation. Therefore, the larger the white blob at the center, the less sensitive the views are to a change in the robot heading direction and to translation in depth (the camera is fixed on the top of the robot). The height of the white blob roughly indicates the sensitivity to translation in depth, and the width roughly indicates the sensitivity to change in heading direction. Comparing Fig. 1(c) and (d), we can see that the local views exhibit a smaller blob than the corresponding global views and thus are more sensitive to changes in heading direction and translation in depth. The height of the white blob in Fig. 1(d) is especially smaller than that in Fig. 1(c); therefore, these local views are particularly sensitive indicators of the timing for making the left turn. Also notice the partial wall on the left, which indicates the end of the left wall at a left-turn corner. Of course, not all local views have such advantages over the global views. For example, a local view of a uniform wall will be much worse than the global view. Therefore, a partial view selected by an attention window, if properly chosen, can greatly facilitate navigation in complex scenes, while a partial view that is chosen improperly may confuse the system. It is extremely difficult for the system to automatically determine such attention windows without knowing the environment beforehand. In this work, it is the human trainer who selects the position of the attention window and directs the system to use it when the trainer thinks that local windows can help to reduce the ambiguity. We introduce an online method for training a system to use local views based on an appearance-based method (appearance-based methods are those that use normalized intensity pixels directly, without requiring humans to define what features to extract). Although the appearance-based methods published so far have the advantages of automatically deriving features from training samples and being able to adapt to virtually any scene, the incorporation of local views for autonomous vision-guided navigation has not yet been well understood. We propose a state-based SHOSLIF which has a stochastic finite-state machine embedded into the appearance-based methods.

Fig. 1. Why states and attention? (a1) and (b1) show two images around a corner. White boxes in (a2) and (b2) are attention windows of (a1) and (b1), respectively. The attention images of (a1) and (b1) are further shown in (a3) and (b3). (c) and (d) show the correlation image of (a1) with (b1) and of (a3) with (b3), respectively, after zeros are padded to the periphery. (c) and (d) are resized to have the same spatial resolution.
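As a concrete illustration of the correlation diagnostic above, the following sketch computes the zero-padded cross-correlation surface for a pair of views. It is our reconstruction for illustration only, not the authors' code; it assumes the views are two-dimensional floating-point arrays and uses SciPy's correlate2d, whose boundary="fill" option implements the zero padding beyond the image boundaries described in the text.

```python
# Zero-padded cross-correlation of two views, as in Fig. 1(c) and (d).
# Illustrative sketch only; names and normalization details are ours.
import numpy as np
from scipy.signal import correlate2d

def correlation_image(a, b):
    """Slide image b over image a in both directions, treating pixels
    beyond the image boundaries as zeros, and return the full
    correlation surface. A compact bright blob at the center means the
    pair is highly sensitive to small shifts (see the discussion above)."""
    a = (a - a.mean()) / (a.std() + 1e-8)  # zero mean, unit variance
    b = (b - b.mean()) / (b.std() + 1e-8)
    return correlate2d(a, b, mode="full", boundary="fill", fillvalue=0)
```

The height and width of the central blob of the returned surface correspond to the sensitivity to translation in depth and to heading change, respectively, as discussed above.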

In typical indoor navigation along hallways, global views are sufficient for making navigation decisions most of the time. Partial views are needed only for difficult sections, e.g., corners and intersections. How does the system know that it is time to use local views, and how does it act differently for global and local views? These problems can be solved by using system states, which keep history information.

In vision-based navigation, the system accepts an observation image $X(t)$ at time $t$, where $X(t) \in \mathcal{X}$ and $\mathcal{X}$ denotes the space of all possible images. $X(t)$ can be a global view of the scene or a local view specified by the current state of the system. $X(t)$ is a vector of pixels, where each component is equal to the intensity of the corresponding pixel after a normalization process which linearly transforms the intensities of all pixels so that they have a zero mean and a unit variance. The system has a set of states, a set of predefined control signals for navigation, and a set of control signals for attention selection. The outcome of the system at time $t$ consists of the current state $s(t)$, the navigation control signal $c(t)$, and the attention control signal $a(t)$.
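The normalization step can be made concrete with a short sketch (ours, for illustration; the paper gives no code):

```python
import numpy as np

def normalize_view(img):
    """Linearly transform the pixel intensities of a view so that the
    resulting vector has zero mean and unit variance, suppressing
    absolute brightness and contrast variation."""
    x = img.astype(np.float64).ravel()
    return (x - x.mean()) / (x.std() + 1e-8)  # guard against flat images
```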

Let $X(t)$ and $Y(t)$ be the observation and the outcome random covariates at time $t$, respectively. Let $L(t)$ be a vector of the present and past input images and the past outcomes, i.e., $L(t) = (X(t), X(t-1), \ldots, X(t-k), Y(t-1), \ldots, Y(t-k))$. Define the regression function $f$ as

$$E\{Y(t) \mid L(t)\} = f(L(t)) \qquad (1)$$

where $E\{Y(t) \mid L(t)\}$ is the conditional mean of the random covariate $Y(t)$ and $E$ denotes the mathematical expectation. This is an "observation-driven" Markov model (ODMM) [6], [39]. Our goal is to estimate $Y(t)$ given $L(t)$. If $k = 1$, (1) is an observation-driven first-order Markov model and can be rewritten as

$$E\{Y(t) \mid X(t), Y(t-1)\} = f(X(t), Y(t-1)). \qquad (2)$$

The training set consists of triples of the form $(X_i, Y_i', Y_i)$, $i = 1, 2, \ldots, n$, where $n$ is the total number of learning samples and $X_i$, $Y_i'$, and $Y_i$ are the current observation, the previous outcome, and the current outcome, respectively. For efficiency under a high-dimensional $X$, we use a recursive partition tree (RPT) to approximate the function $f$. Let the approximation realized by the RPT be denoted as $\hat{f}$; then $\hat{f}(X, Y') = Y_i$ if $(X_i, Y_i')$ is the best-matched sample of $(X, Y')$ in the learning set. Here $X_i$ and $Y_i'$ are the input part, while $Y_i$ is the output part, of the training sample $(X_i, Y_i', Y_i)$.

The above best-matched sample $(X_i, Y_i')$ is the nearest neighbor of $(X, Y')$ found through the RPT. The nearest neighbor estimator in Euclidean space [5] has been widely used in function approximation: the value of $f$ at the nearest training sample is used as the estimate of $f$ at the query point. It has also been used in the approximation of Markov time series. Nonparametric estimators, including the kernel estimator [38] and the nearest neighbor estimator, have been used for the prediction of Markov time series. Yakowitz [37] successfully applied the nearest neighbor estimator to the prediction of rainfall/runoff time series. The general convergence of the nearest neighbor estimation for Markov time series has not been proved, but convergence has been proved under certain assumptions (cf. [37] and the references therein). Here, due to the high dimensionality of the observation space plus the state space, as well as the large number of training samples, exhaustive nearest neighbor search is computationally too expensive. We will explain in later sections how the SRPT is used to approximate the observation-driven Markov model.
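Before turning to the SRPT, it may help to see the exhaustive version of the estimator that the tree approximates. The sketch below is ours: the sample layout and names are assumptions, and the O(n) scan over all training samples is exactly the cost that the SRPT's logarithmic search avoids.

```python
import numpy as np

def nn_predict(x_obs, s_prev, samples):
    """Brute-force nearest neighbor estimate of (2): return the stored
    outcome of the training triple whose input part is closest in
    Euclidean distance to the query (current observation, previous state).
    `samples` is a list of (X_i, Yprev_i, Y_i) triples of NumPy vectors."""
    query = np.concatenate([x_obs, s_prev])
    _, y_best = min(
        ((np.linalg.norm(query - np.concatenate([xi, ypi])), yi)
         for xi, ypi, yi in samples),
        key=lambda t: t[0])
    return y_best  # the outcome, e.g., (state, control, attention)
```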

The Markov chain and its variants, the hidden Markov model (HMM) and the partially observable Markov decision process (POMDP), have been successfully applied to indoor navigation by several researchers. Koenig and Simmons [17] applied a POMDP to integrate topological and metric information for the navigation of Xavier using sonar range sensors. Differences exist between ODMMs and the popularly used HMMs. Normal HMMs are stationary [26], i.e., the transition probability does not depend on the input, whereas the transition probability of an observation-driven Markov model explicitly depends on the observation at each time step. Therefore, an observation-driven Markov model is nonstationary. Furthermore, autonomous navigation using a vision sensor, rather than a sonar range sensor, is much more difficult. Vision sensors have much higher dimensionality. Image intensity does not give range information, and it is not directly related to the spatial range information that is convenient for navigation. Very similar images may require quite different navigation actions.

In our vision-based navigation, three outcomes are required: the state $s(t)$, the control signal $c(t)$ for the mobile robot, and the visual attention signal $a(t)$ used to choose the attention window. The observation is a preprocessed input image $X(t)$ at each time step. Thus, in our model, $Y(t) = (s(t), c(t), a(t))$ and the recurrence involves only the state. Three estimators are needed for the following regressions:

$$E\{s(t) \mid X(t), s(t-1)\} = f_s(X(t), s(t-1))$$
$$E\{c(t) \mid X(t), s(t-1)\} = f_c(X(t), s(t-1))$$
$$E\{a(t) \mid X(t), s(t-1)\} = f_a(X(t), s(t-1)). \qquad (3)$$

Among them, only the first equation is an observation-driven Markov model, since its outcome is the next state; the other two are normal mapping functions without recurrence of variables. In practice, we use a single RPT approximator to approximate these three estimators. The output vector from the RPT has three components, one for each of the three estimators.

The overall architecture for vision-based navigation is shown in Fig. 2. A single RPT is used for estimating $s$ (state), $c$ (control signal), and $a$ (attention signal). The current input image $X(t)$ and the previous state $s(t-1)$ are used to derive the next control signal $c(t)$, the next attention signal $a(t)$, and the current state $s(t)$. During real navigation, $c(t)$ and $a(t)$ are used to control the next motion of the mobile robot and to extract the subsequent attention windows from the video camera, respectively. The attention action extracts a partial view from the global view of the camera.


III. STOCHASTIC RECURSIVE PARTITION TREE FOR NEAREST NEIGHBOR REGRESSION

The main challenge here is that the input has a very high dimensionality (a few thousand). The number of samples is typically smaller than the dimensionality. Thus, traditional classification and regression trees, such as CART [2], C4.5 [23], and OC1 [20], are not applicable, since they are designed for relatively low-dimensional spaces. We describe below how SHOSLIF addresses this problem.

A. Navigation as a Content-Based Retrieval Problem

A navigator is trained to perform a complicated function which maps a high-dimensional input (image and state) into the corresponding low-dimensional output (control signal, action, the new state, etc.). In the training phase, a set of training images is used to build an RPT, in which each learned sample records a desired input–output pair.

The RPT is constructed in the following way. We first consider batch training, in which all the training samples are available for RPT construction. Although the input image has a high dimensionality, all the possible images may be contained in a linear subspace with a relatively low dimensionality. The RPT automatically builds a space partition hierarchy to recursively partition this subspace without explicitly detecting it.

Each training sample has two parts $(L_i, Y_i)$, where $L_i = (X_i, Y_i')$ is the input part and $Y_i$ is the output part. The root of the RPT takes all the training samples. All the input parts are used to compute the principal components, which are the unit eigenvectors of the sample covariance matrix of the input parts associated with the largest eigenvalues. These eigenvectors indicate orthogonal directions of the sample distribution along which the samples have the most variation. To minimize the number of projections to be computed while reaching the finest space partition, we use only the first principal component at each node, which results in a binary tree. All the samples are projected onto the first principal component $v$ to give the projections $p_i$ ($i = 1, 2, \ldots, n$). The mean $\bar{p}$ of the projections is computed, and $(v, \bar{p})$ is used as the split at the root: it defines a hyperplane with normal $v$ passing through the mean projection point. For each sample, if $p_i < \bar{p}$, the sample belongs to the left child; otherwise, it belongs to the right child. Each child partitions its space in a similar way: at each child node, all the samples assigned to it are used to compute the first principal component vector, which is used as the normal of the hyperplane, and the mean of all the projections on the normal gives the point that the hyperplane passes through. The split further partitions the region belonging to the node into two subregions. Such partitioning continues recursively until all the training samples falling into a node have the same output part. We quantized the output part so that only a few different vectors are possible for the state, attention, and control signals.

In the performance phase, each newly grabbed input, together with the current state, is used to retrieve the best-matched input from the RPT. At each node, the hyperplane is used to decide which child of the node should be explored further. Such a single-path exploration process continues recursively until a leaf node is reached, whose output part is used as the estimated output for the input.
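The construction and single-path retrieval just described can be sketched as follows. This is a minimal batch-mode illustration under our own naming, with NumPy's SVD standing in for the eigen-decomposition of the sample covariance matrix; the paper's incremental version (Section III-B) updates the same quantities sample by sample.

```python
import numpy as np

class RPTNode:
    """Binary recursive partition tree: each internal node stores a split
    hyperplane whose normal is the first principal component of the input
    parts that reached it and whose offset is the mean projection; a node
    becomes a leaf when all its samples share the same output part."""
    def __init__(self, inputs, outputs):
        inputs = np.asarray(inputs, dtype=float)
        if len({tuple(y) for y in outputs}) == 1:  # pure node -> leaf
            self.normal, self.leaf_output = None, outputs[0]
            return
        mean = inputs.mean(axis=0)
        _, _, vt = np.linalg.svd(inputs - mean, full_matrices=False)
        self.normal = vt[0]                  # first principal component
        p = inputs @ self.normal             # projections of the samples
        self.threshold = p.mean()            # hyperplane passes the mean
        go_left = p < self.threshold
        if go_left.all() or not go_left.any():  # degenerate split -> leaf
            self.normal, self.leaf_output = None, outputs[0]
            return
        self.left = RPTNode(inputs[go_left],
                            [y for y, m in zip(outputs, go_left) if m])
        self.right = RPTNode(inputs[~go_left],
                             [y for y, m in zip(outputs, go_left) if not m])

    def query(self, x):
        """Single-path retrieval: follow the hyperplanes down to one leaf
        and return its output part as the estimate."""
        if self.normal is None:
            return self.leaf_output
        child = self.left if x @ self.normal < self.threshold else self.right
        return child.query(x)
```

Here each input row would be the concatenation of the normalized image pixels and the previous-state fields, and each output the quantized (state, attention, control) vector.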

Fig. 2. The architecture of state-based SHOSLIF for vision-guided navigation. "Att" stands for "attention control."

Fig. 3. Incremental update at each inner node along the searched path.

Fig. 4. Different types of corridor structure in our indoor navigation.

As can be seen, the principal component analysis automatically guides the hyperplane split so that it only cuts across the direction in which the samples of the corresponding subcell have the most variation. In other words, the RPT automatically neglects directions along which the samples vary only a little. However, the above single-path search through the RPT does not necessarily always find the nearest neighbor in the original input space, although it mostly leads to a good match. In practice, we explore $k$ parallel paths. We measure the distance from the projection of the input along the principal component to the mean of the projections of the samples in the child node. In addition to the search path explained above, additional child nodes are also explored if their distances are among the $k$ smallest. Consequently, $k$ leaf nodes are obtained, and the nearest leaf node among them is taken as the best-matched node.

The average time complexity for each retrieval from a given input using the binary RPT is roughly $O(d \log n)$, where $d$ is the dimensionality of the input and $n$ is the number of leaf nodes in the tree, since the RPT tends to be balanced, thanks to the principal component analysis at each internal node. This logarithmic time complexity is crucial for the real-time application here. In our earlier study [36], we found that a small $k$ is sufficient to achieve good retrieval performance, and a small $k$ was used throughout the experiments in this paper.
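The $k$-path variant can be grafted onto the single-path query of the earlier sketch. The version below is our approximation: it ranks the alternative children by the distance of the query's projection to the split point, which stands in for the child-mean distance used in the paper, and returns the output of the leaf reached with the smallest accumulated penalty.

```python
import heapq

def query_k_paths(root, x, k=2):
    """Explore roughly k root-to-leaf paths of an RPTNode tree. Each
    frontier entry is (penalty, tie_breaker, node); the child on the
    query's side of the hyperplane inherits the penalty, the other child
    is charged the projection distance to the split point."""
    frontier, leaves, tie = [(0.0, 0, root)], [], 1
    while frontier:
        nxt = []
        for pen, _, node in frontier:
            if node.normal is None:          # reached a leaf
                leaves.append((pen, node))
                continue
            p = x @ node.normal
            near, far = ((node.left, node.right) if p < node.threshold
                         else (node.right, node.left))
            nxt.append((pen, tie, near)); tie += 1
            nxt.append((max(pen, abs(p - node.threshold)), tie, far)); tie += 1
        frontier = heapq.nsmallest(k, nxt)   # keep the k best paths
    return min(leaves, key=lambda t: t[0])[1].leaf_output
```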

B. Incremental Training

The above batch training is suitable only for a relatively simple task where the number of samples is relatively small. Our task at hand is very complex, and the number of samples can be too large to store. Furthermore, a batch training mode does not allow the trainer to dynamically select cases according to the current performance and mistakes, since the training has to be restarted from scratch whenever the training set is changed.

With incremental learning, the trainer does not need to store the entire training set. The RPT is updated if needed, one sample at a time. As soon as a training sample has been used, it is discarded.

An incremental way of computing the dominant eigenvector is as follows. First, the data covariance matrix is estimated from the input column vectors $x_1, x_2, \ldots, x_t$ as

$$\hat{\Gamma}_t = \frac{1}{t}\sum_{i=1}^{t} (x_i - \bar{x}_t)(x_i - \bar{x}_t)^T$$

where $\bar{x}_t$ is the average of the samples. Then the dominant eigenvector of the estimated $\hat{\Gamma}_t$ is computed by either the power method or the Rayleigh quotient conjugate gradient (RQCG) method.

However, the dominant eigenvector can be estimated without estimating the covariance matrix as a prerequisite. Several authors have addressed this problem using stochastic approximation, which has been shown to be especially useful when high accuracy is not required and only a few eigenvectors need to be estimated. Here we apply Oja and Karhunen's method [21] for the stochastic approximation of the dominant eigenvector. For stochastic matrices $A_t$ with a finite constant mean $E\{A_t\} = A$, Oja and Karhunen's iteration can be represented as follows:

$$\tilde{v}_t = v_{t-1} + \gamma_t A_t v_{t-1} \qquad (4)$$

$$v_t = \tilde{v}_t / \|\tilde{v}_t\| \qquad (5)$$

where $\gamma_t$ is the learning rate and $\|\cdot\|$ denotes the norm of a vector. In our computation of the dominant eigenvector of the covariance matrix, $A_t$ is set as the outer product of the mean-corrected sample, $A_t = (x_t - \bar{x}_t)(x_t - \bar{x}_t)^T$, according to Oja and Karhunen [21]. The mean $\bar{x}_t$ is also computed incrementally as $\bar{x}_t = \frac{t-1}{t}\bar{x}_{t-1} + \frac{1}{t}x_t$.

The convergence of the above stochastic approximation has been proved in [21].

Theorem 1 (Oja and Karhunen, 1985): Under the following assumptions:

1) each $A_t$ is almost surely bounded and symmetric, and the matrices $A_t$ are mutually statistically independent with $E\{A_t\} = A$ for all $t$;

2) the largest eigenvalue of $A$ has unit multiplicity;

3) $\gamma_t > 0$, $\sum_{t=1}^{\infty} \gamma_t = \infty$, and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$;

4) each $A_t$ has a probability density bounded away from zero uniformly in $t$ in some neighborhood of $A$;

then $v_t$ tends to the eigenvector associated with the largest eigenvalue of $A$ almost surely as $t \to \infty$.

Fig. 5. Observation-driven Markov model for indoor vision-based navigation, where x=y is used to represent a state x with associated action y. "a" or "g" is used to indicate whether two local attention views or a single global view is used. Associated with each arc is the current observed image, either a local view or a global view. Each transition corresponds to a very large number of possible images. The transition is determined by the RPT.

Fig. 6. The representation of input and output parts.

Fig. 7. A map of the test site. The test loop is indicated by thick solid lines. Six different corners or intersections are trained in our tests. They are marked with boldfaced Arabic numbers.
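A minimal sketch of one step of (4) and (5), assuming a $\gamma_t \sim 1/t$ schedule (our choice, consistent with condition 3) of Theorem 1 but not specified in the text):

```python
import numpy as np

def oja_step(v, mean, x, t):
    """One Oja-Karhunen update of the dominant eigenvector estimate v.
    The running mean is updated incrementally, A_t is the outer product
    of the mean-corrected sample, and v is renormalized as in (5).
    Start with a random unit vector v and t = 1."""
    gamma = 1.0 / t                       # learning-rate schedule (ours)
    mean = ((t - 1) * mean + x) / t       # incremental sample mean
    u = x - mean                          # mean-corrected sample
    v = v + gamma * u * (u @ v)           # v + gamma * A_t v, A_t = u u^T
    return v / np.linalg.norm(v), mean
```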

For each inner node, the incremental learning of a new sample can be conducted recursively. When learning a new sample, the mean and the dominant eigenvector are updated incrementally. For the corresponding child nodes, only the changes caused by this update need to be processed. Let old_list and new_list be the old and new sample lists going into a child node, respectively, and let the mean vector and the dominant eigenvector of the child node be maintained accordingly. Then the processing for each child node can be done recursively, as shown in the algorithm of Fig. 3.

Fig. 8. Some sample training images around corners or intersections. Six sets of four consecutive images are from the six different corners or intersections in the test loop.

Fig. 9. Sample state transitions that cover more than two passes of continuous running along the tested loop. The Arabic codes on top of each bar are the corner or intersection labels in Fig. 7. The state codes of the vertical axis are explained in Fig. 5. (a) Raw output states. (b) Output states after smoothing.

IV. A CASE STUDY FOR INDOOR NAVIGATION

Compared with outdoor road following, vision-based indoor navigation faces tremendous challenges because there are no stable features (e.g., floor edges are often occluded) and no stable contrast pattern along typical navigation paths. This type of environment may cause severe problems in function approximation due to the complexity of the function itself.

For our indoor navigation, the corridor types are shown in Fig. 4. In this figure, we use the following brief notation to indicate corridor types: "L" for a left turn; "R" for a right turn; "LZ" for a left Z junction; "RZ" for a right Z junction; "X" for a four-way intersection; and "S" for a straight corridor.

For each corner or intersection, the situations are further grouped into states: approaching, entering, and exiting a corner or intersection. To reduce the number of states, some of the states can be merged. For example, an indoor corridor section with types "L," "LZ," "R," and "RZ" can be represented by a six-state model, as shown in Fig. 5. Here, approaching corners or intersections is represented as a single state, the ambiguity state ("A"). After a transition to the "A" state, the next transition can loop back to "A" or jump to "L," "LZ," "R," or "RZ," depending on the current visual image and the previous state. The transition from each state depends on the current observation and the previous state, as estimated by the RPT. Visual attention for a local view is needed only in state "A" to disambiguate the possible next states. This is true because, when approaching any of these turns, the global views are all very similar, given the finite number of samples and the desired capability of generalization. A local view during this period can greatly disambiguate the type of turn and whether the turn should be made immediately. In our case, the raw input image is of 60 × 80 pixels: it is averaged with a 2 × 2 window and reduced to half size (30 × 40) to get the global view; the left and right local attention windows are subregions of the raw input image centered at (15, 22) and (15, 55), respectively, but with the same size (30 × 40). For this arrangement, the attention signal needs only a binary flag to specify whether the global view or a local view will be used:

$$a(t) = \begin{cases} 0 & \text{if the global view is used} \\ 1 & \text{otherwise.} \end{cases}$$
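The view geometry described above is concrete enough to sketch directly (our code; the (row, column) window-center convention is an assumption consistent with the stated sizes):

```python
import numpy as np

def views_from_raw(raw):
    """Derive the three candidate views from a 60 x 80 raw image: the
    global view averages 2 x 2 blocks down to 30 x 40; the left and
    right attention windows are 30 x 40 subregions of the raw image
    centered at (15, 22) and (15, 55), respectively."""
    assert raw.shape == (60, 80)
    global_view = raw.reshape(30, 2, 40, 2).mean(axis=(1, 3))

    def window(center_row, center_col, h=30, w=40):
        r0, c0 = center_row - h // 2, center_col - w // 2
        return raw[r0:r0 + h, c0:c0 + w]

    return global_view, window(15, 22), window(15, 55)
```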

Koenig and Simmons [17] used sonar and odometer sensory inputs as observations in their HMM models. They define states on the spatial grid of the hallway paths; therefore, a large number of states are needed. If similar HMMs were used here, the cost would be extremely high, since the dimensionality of intensity images is much higher than that of sonar readings. Our vision-based navigation task is much more challenging than sonar-based navigation, since visual images do not relate to control as directly as range data. On the other hand, with the use of visual images, the number of states is greatly reduced in our method, since visual input provides richer information, and our system does not need odometer information.

For our ODMM model shown in Fig. 5, the input and output representation is shown in Fig. 6. The binary fields $S_0$ to $S_5$ represent states zero to five, respectively; $S$ is the predicted current state; $A$ is the predicted attention selection, which is a binary flag indicating whether the global view or an attention window will be used; $C$ is the associated output control signal for the current input–state pair. The fields marked with solid boxes are concatenated to form a single long one-dimensional vector as the input to the SRPT. The fields in dashed boxes are either predicted or associative quantities.

Fig. 10. Some sample control signals and the associated states. The control signals plotted in this set of figures are corrected heading directions in degrees. (a) Around corner 2 ("L"). (b) Around corner 4 ("RZ"). (c) Around corner 6 ("L"). (d) Between corners 6 and 1 ("S"). (a) and (c) are cases of a left turn. (b) is an "RZ," which has a right turn followed by a left turn. (d) is a straight section between corners 6 and 1; the change in the corrected heading along a straight section is smaller.

Fig. 11. Sample trajectories around the fourth corner show how states can help the robot in navigation. (a) and (b) Two failure cases when states were not used. "X" in (a) or (b) indicates an upcoming collision with the wall. (c) The robot successfully made the turn when states were used.
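As a small illustration of the concatenation of the solid-box fields of Fig. 6, the following sketch builds the input vector handed to the SRPT; the exact field order and the one-hot coding of the previous state are our assumptions based on the figure caption.

```python
import numpy as np

def make_srpt_input(prev_state, view, n_states=6):
    """Concatenate the one-hot previous-state fields S0..S5 with the
    normalized pixels of the current view into one long 1-D vector."""
    s = np.zeros(n_states)
    s[prev_state] = 1.0
    x = view.astype(np.float64).ravel()
    x = (x - x.mean()) / (x.std() + 1e-8)   # zero mean, unit variance
    return np.concatenate([s, x])
```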

The training data are a collection of quadruples consisting of the current image, the previous state, and the associated outputs. The control signal of the mobile robot consists of the required heading direction and the speed.

All the state transitions in our observation-driven Markov model need to be trained.

V. EXPERIMENTAL RESULTS

Our mobile robot Rome (RObotic Mobile Experiment), built on a Labmate base made by TRC, was used to test our algorithms. In our experiments, we trained the robot to navigate along the loop shown in Fig. 7, which is on the third floor of the MSU Engineering Building. The floor plan covers an area of approximately 136 × 116 square meters. Along the corridors, there are six types of local corridor structures shown in Fig. 4: "L," "R," "LZ," "RZ," "X," and "S." Some sample training images around corners and junctions are shown in Fig. 8. As can be seen, the appearances of these scenes are quite different, and the widths of the straight corridor segments differ too. Navigation using only a single camera in this environment is very challenging for methods that use predefined features, since the presence of features is very irregular.

We interactively trained the robot along the corridors. At each location, an input image is grabbed and associated with the previous state. The desired control signal is the difference between the current heading and the heading after interactive correction. Heading directions were read from the robot's internal register. The human supplied the current state. As explained earlier, the input image was normalized to have a zero mean and a unit variance to suppress, to some degree, the effect of absolute lighting brightness and contrast variation. Then this preprocessed raw input image and the previous state information were used to query the RPT. If the retrieved output was within the error tolerance, the new sample was rejected without being learned; otherwise, the new sample was used to update the RPT. This process avoids unnecessarily updating the tree when the tree is already good enough.

Fig. 12. Six sample trajectories around corner 4 ("RZ"). The robot started from six different positions. One of the robot's starting positions is marked with a solid circle, which indicates the size of the robot.
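The learn-or-reject step can be sketched as follows; the tolerance value and the error metric are ours, and tree.learn stands for the incremental update of Section III-B (a hypothetical method name):

```python
import numpy as np

def train_step(tree, x, desired_output, tol=0.1):
    """One interactive training step: query the RPT; if the retrieved
    output is within the error tolerance, discard the sample (return
    False), otherwise update the tree with it (return True)."""
    retrieved = np.asarray(tree.query(x), dtype=float)
    if np.linalg.norm(retrieved - np.asarray(desired_output)) <= tol:
        return False                 # tree already good enough here
    tree.learn(x, desired_output)    # hypothetical incremental update
    return True
```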

Under human interactive control, the robot took a significant number of training samples around each section of the loop, with the goal of covering the basic variations of the scene. As soon as the robot rejected the most recent samples, it moved on to learn new sections. When the robot rejected all samples along all the paths, we set Rome free.

Rome roamed at a speed of 40 cm/s. It slowed down when the correction in heading direction was larger than 10° or when it entered a nonstraight state. These behaviors, of course, were trained interactively by the human trainer.

In order to remove spurious state transitions without resorting to higher-order Markov models, a voting scheme was used. We kept the state history up to five steps (image frames), and a histogram of the history states was computed. A state transition was confirmed only when the number of votes was more than two out of five; otherwise, the current state was left unchanged. Fig. 9 shows sample state transitions over more than two passes along the tested loop. As shown in Fig. 5, around each corner or intersection, a transition to state "A" always precedes the confirmed corner or intersection state. From Fig. 9 we can see that the smoothing scheme did help in achieving a stable navigation behavior.
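The voting filter is simple enough to state exactly (a sketch under our naming):

```python
from collections import Counter, deque

class StateSmoother:
    """Five-step voting filter: keep the last five raw state outputs and
    confirm a transition only when a new state has more than two of the
    five votes; otherwise the current state is left unchanged."""
    def __init__(self, history=5, min_votes=3):
        self.window = deque(maxlen=history)
        self.min_votes = min_votes
        self.state = None

    def update(self, raw_state):
        self.window.append(raw_state)
        if self.state is None:                 # first frame
            self.state = raw_state
            return self.state
        top, votes = Counter(self.window).most_common(1)[0]
        if top != self.state and votes >= self.min_votes:
            self.state = top                   # confirmed transition
        return self.state
```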

To provide a visualization of the navigation behavior, some sample control signals with the associated states are shown in Fig. 10. The robot turned left in Fig. 10(a) and (c). In Fig. 10(b), the robot turned right and then left around corner 4, which is an "RZ." From Fig. 10(d), it can be observed that along a straight section the robot turned with a much smaller magnitude, which resulted in a smooth navigation behavior.

Are states and attention necessary? We trained the system in a similar way except that the state vector was absent. We call this a stateless SHOSLIF. As we explained earlier, this means that it cannot use local views either. Without the use of states, the stateless SHOSLIF experienced difficulties in guiding the robot through the tested loop. It failed to make the fourth corner, which is an "RZ." Fig. 11 shows how it failed by displaying plots collected from sample navigation trajectories. Fig. 11(a) and (b) show two failure cases when states were not used. Fig. 11(a) shows a case where the robot retrieved left-turn images around critical turning points, since the visual appearances around these positions are similar to some other images around a left turn, e.g., the sixth corner. The robot turned left and then ran toward the wall. Fig. 11(b) shows another case where the robot underturned to the right and failed to make the fourth corner. Fig. 11(c) shows how the robot performed after the state information was incorporated: the robot successfully made the turn with the state-based SHOSLIF. The states were crucial in disambiguating critical scenarios.

In order to provide a sense of how smoothly the robot navigates and how stable the behaviors are, we measured the robot's navigation trajectories with different starting positions. Each time, the robot started from a different location away from the center of the hallway. Fig. 12 shows detailed plots of six sample drives around corner 4, which is an "RZ" turn and turned out to be the most difficult turn in the entire tested loop. These trajectories were collected by attaching a marker to the center of the rear bumper. The coordinates of the sample trajectory points were manually collected at the test site. The accuracy of the data plot is within 1 in. Fig. 12 indicates how these trajectories with different starting positions tend to display less deviation after making the turn and covering more distance.

Fig. 13 shows how the robot navigated around the fifth corner. Fig. 14 shows the mobile robot's views when it navigated around corner 5. An image was grabbed and used to update the control at every single time step.

In earlier stages of our experiments, different resolutions of input images were tested. It turned out that the lower resolution of 30 × 40 is sufficient for our tasks. When a higher-resolution input image was used, there was no significant performance improvement, but more memory was needed and the response was slower. The content-based retrieval for each input was performed at a frequency of 6 Hz on the SUN SPARC-1 workstation onboard the robot. The incremental learning was conducted using the same workstation. In a separate batch experiment, incrementally learning a set of 272 samples took about 26 s on a SUN SPARC-10 workstation. The time record is shown in Fig. 15. The maximal response time per learning sample is within 1 s. For most of the training samples, the incremental learning took less than 0.3 s per image frame. The incremental learning time is not uniform across training samples, since it depends on the number of samples which need to be redistributed at the internal nodes of the RPT along the searched path, as the algorithm of Fig. 3 indicates. We have conducted repeated test runs to observe the performance stability of the learned robot. One pass of autonomous navigation along the tested loop took about 20 min. The robot was observed several times to run continuously for longer than 5 h, until the onboard batteries, which provided the power, ran low. In the dozens of such tests conducted so far, the robot always performed well without hitting the walls or the hallway doors. During these navigation tests, passers-by were allowed to pass naturally. The mobile robot was not confused by these passers-by, largely because it uses automatically derived principal components as features, which cover the entire image. A novelty detector was also used to handle unexpected objects: the robot slows down or stops if the input image is very different from the retrieved view. To build a general-purpose navigation system, it is necessary to incorporate other obstacle avoidance modules.

Fig. 13. Rome navigated autonomously around corner 5. The first five images are from video sequences taken from behind the robot; the last three images were taken from in front of the robot.

Fig. 14. The mobile robot's views around corner 5 during a real navigation. This sequence shows how the robot approached the fifth corner, successfully made the turn, and entered the following straight section.

VI. COMPARISON WITH TWO ANN-BASED APPROACHES

SHOSLIF is used here as a general computational tool for approximating a function whose domain is a high-dimensional space. Two popular artificial neural networks that have been used for vision-based navigation are the multilayer perceptron (MLP) of ALVINN [24] and the radial basis function network (RBFN) of ROBIN [27]. We compared our SHOSLIF with these two types of neural networks. We used two separate sets of data, Set 1 and Set 2, for the comparison. Both sets of data were collected, on two separate occasions, on the third floor of our Engineering Building. Training with either set of samples resulted in a SHOSLIF RPT which gave successful performance when tested on the trained loop. Set 1 and Set 2 contain about 500 and 300 samples, respectively. Set 1 is more redundant and covers more scenes than Set 2 does.

Fig. 15. Time record of incrementally learning a set of 272 training images. The learning time was recorded on a SUN SPARC-10 workstation.

To map the current input into the current heading, we used the same output representation as ALVINN and ROBIN: the output pattern of each training sample is a Gaussian distribution peaked at the desired heading. In our simulation implementation of MLP and RBFN, the output layer has 21 to 31 nodes, with a resolution of two degrees between adjacent nodes. A sample training input–output pair is shown in Fig. 16. After the neural network is trained, the output heading is taken as the peak of a Gaussian fit to the outputs of the neurons at the output layer.
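The Gaussian output coding and its decoding can be sketched as follows; the node grid (21 nodes, two degrees apart) matches the text, while the Gaussian width and the local parabolic peak interpolation are our illustrative choices.

```python
import numpy as np

NODES = np.arange(-20.0, 21.0, 2.0)   # 21 output nodes, 2 degrees apart

def encode_heading(theta, sigma=4.0):
    """Target output pattern: a Gaussian peaked at the desired heading."""
    return np.exp(-((NODES - theta) ** 2) / (2.0 * sigma ** 2))

def decode_heading(activations):
    """Recover the heading as the peak of a local fit around the most
    active node (a parabola in the log domain, exact for a Gaussian)."""
    i = int(np.argmax(activations))
    if 0 < i < len(NODES) - 1:
        y0, y1, y2 = np.log(np.maximum(activations[i - 1:i + 2], 1e-12))
        denom = y0 - 2.0 * y1 + y2
        if denom < 0:                 # concave: a genuine peak
            return NODES[i] + (y0 - y2) / denom  # offset in degrees
    return NODES[i]
```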

A. Multilayer Perceptron

For the simulation of a two-layer feedforward neural network, we used the "trainbpx" function in the MATLAB neural-network toolbox. This function can adaptively adjust the learning rate. Similar to ALVINN, we used four to nine hidden nodes in our simulation. Our simulation showed that, for our complex indoor images, it was very unlikely for a multilayer perceptron (MLP) to converge to reasonable weights if it started with random initial weights. To study this clearly, we used three sets of data (in addition to the Set 1 and Set 2 used for the comparison).

Set A: Real images from straight hallway sections: a set of 100 (30 × 40)-pixel images with five possible heading directions: −10°, −5°, 0°, 5°, and 10°. Each of these five classes has 20 samples.

Set B: A full set of 318 (30 × 40)-pixel real images. This set of images was used in our training of SHOSLIF-N for vision-based navigation. It includes 210 images from straight hallway sections and 108 images from corner sections. Images were collected from corners 1 and 2 and the three straight sections connected to these two corners.

Set C: A synthetic data set. We generated a road map for navigation, where the road is white and the other parts of the scene are all black, as shown in Fig. 17(a). Given the road, synthetic sample images were generated with different orientations and translations along the trajectory. For the square road map in Fig. 17, a typical set consists of 100 (30 × 40)-pixel synthetic images with five possible headings: −10°, −5°, 0°, 5°, and 10°. Some of these sample images are shown in Fig. 17.

When the size of the problem is small, i.e., the number of images is small and the size of each image is also small, the training of the feedforward network went successfully. We trained a few MLPs, each starting from different random initial weights; the best MLP gives a reasonable error rate. But we found that when the input dimension is increased to, e.g., 30 × 40, MLPs starting with random initial weights would not converge to reasonable solutions. We ran our MATLAB script files 100 times, each time with a different random guess of the initial weights. The number of training epochs was set to 10 000, large enough to converge to a local minimum. Fig. 18 shows the sum of errors for these 100 MLPs using Set A. Here we used four hidden nodes, as indicated in [24]. Fig. 18 shows that the weights did sometimes converge to a good solution for the training set, although only very few trials provided solutions with small sum-of-squared errors (SSEs), as shown in Fig. 18(a). Our simulations with Sets B and C gave similar results: the learning process easily got stuck in poor local minima and resulted in large SSEs.

Fig. 16. A sample training input–output pair: the top part is a 30 × 40 image. The second row is the associated output signal, which is a Gaussian distribution peaked at the desired heading direction. The last row shows the binary state fields, which have a one in the first field and zeros in the other fields. All the values here are shown as intensity values.

Fig. 17. Some synthesized sample images generated at different locations and with different orientations along the simulated navigation path. These images were generated using a perspective projection camera model. The top part is a road map, where the dark area is nonroad and the white area is the road. The bottom part shows synthesized sample images.

To avoid these problems with random initial guesses of the weights, we propose a more sophisticated way of generating the initial weights, one that is similar in spirit to SHOSLIF. It is called principal component regression (PCR) [9]. The weights for the hidden nodes are initialized with the principal components of the input patterns, while those of the output nodes are initialized by linear regression between the desired outputs and the responses of the hidden nodes. This way of initializing the MLP using PCR led to much better convergence in our experiments. One example is shown in Fig. 19. Set A was trained with 1000 epochs. The learning process converged quickly and resulted in a very small sum-of-squared error. Comparing Fig. 18 with Fig. 19, we can see that the output of the network initialized with PCR is much smoother and gives smaller errors.
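A compact sketch of the PCR initialization (ours; the layer shapes, the tanh hidden nonlinearity, and the bias handling are assumptions):

```python
import numpy as np

def pcr_init(X, T, n_hidden):
    """Initialize a two-layer MLP with principal component regression:
    hidden weights = top principal components of the input patterns;
    output weights = least-squares regression from the hidden-layer
    responses (plus a bias column) to the desired outputs.
    X is (n_samples, d); T is (n_samples, n_outputs)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    W_hidden = vt[:n_hidden]                        # (n_hidden, d)
    H = np.tanh(X @ W_hidden.T)                     # hidden responses
    H1 = np.hstack([H, np.ones((len(H), 1))])       # append bias column
    W_out, *_ = np.linalg.lstsq(H1, T, rcond=None)  # (n_hidden+1, n_outputs)
    return W_hidden, W_out
```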

B. Radial Basis Function Network

ROBIN, presented in [27], used an RBFN for navigation. The implementation of ROBIN differs from a typical RBF network in the following aspects. 1) Normalization: the responses of the receptive fields are normalized to have a unit sum over all neurons in the first layer. This normalization improves the interpolation capability, as shown by references cited in [27]. 2) Center selection: Rosenblum and Davis [27] used a K-means clustering algorithm for center selection, but the results were not satisfactory, so they used "forced clustering" by manually assigning a subset of the input patterns as centers. Their procedure was implemented in our comparative study, except that here we used orthogonal least squares (OLS) [3] to do the automatic center selection for a better performance.
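A sketch of the normalized RBF evaluation of point 1) (our code; Gaussian basis functions and a shared width are assumptions):

```python
import numpy as np

def normalized_rbf(x, centers, width, weights):
    """Evaluate a normalized RBF network: the Gaussian receptive-field
    responses are divided by their sum (unit-sum normalization over the
    first layer), then linearly combined by the output weights.
    centers is (c, d); weights is (c, n_outputs)."""
    d2 = ((centers - x) ** 2).sum(axis=1)
    phi = np.exp(-d2 / (2.0 * width ** 2))
    phi = phi / (phi.sum() + 1e-12)   # responses form a partition of unity
    return phi @ weights
```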

Page 11: State-based SHOSLIF for indoor visual navigation

1310 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 6, NOVEMBER 2000

Fig. 18. Training the MLP using Set A: 100 trials with different random guesses of the initial weights. (a) The record of the sum of squared errors (SSE) for the 100 trials. The trial with the minimal SSE is marked with a circle. (b) The target outputs for the five classes of training samples with different headings: −10°, −5°, 0°, 5°, and 10°. (c) The error histogram of the retrieved heading directions for the training set by the best MLP recorded in (a) (marked with a small circle). (d1) to (d5) The network outputs overlapped on the target outputs for the five classes, respectively. The outputs of the best MLP and the target outputs are plotted in solid lines and dotted lines, respectively.

C. Comparison

In this section, we report the results of the comparison of SHOSLIF, MLP, and RBFN discussed above. We used Set 1 for training and Set 2 for testing. The heading error histograms for all three methods are plotted in Fig. 20. SHOSLIF always achieves perfect retrieval for the training set, so its error histogram concentrates on a single bin with zero error.

In addition to the heading error comparison shown in Fig. 20, we also compared the accuracy of state prediction for the three methods. Here, the current image and the previous state are mapped into the current state, which is a state label. This mapping is classification, rather than regression, which maps the input into a numerical value. Therefore, networks for classification, instead of regression, should be used. An MLP with a linear output layer [34] has been known to perform discriminant analysis or classification. We used a three-layer MLP, with two hidden layers, for state prediction. The number of output nodes is the same as the number of states, which are coded as binary patterns. For the network output, the node with the highest response corresponds to the predicted state. In our tests, the number of nodes in the first hidden layer ranged from 4 to 15, while the number of nodes in the second hidden layer ranged from 4 to 10.

Fig. 19. Training with Set A using PCR for the initialization of the weights. (d1) to (d5) are the outputs for classes 0–4, respectively. The network outputs are overlapped on the target output for each class: the network outputs and the target outputs are plotted in solid lines and dotted lines, respectively.

TABLE I. COMPARISON OF THREE APPROACHES IN STATE PREDICTION

TABLE II. COMPARISON OF THREE METHODS IN RESPONSE TIME PER INPUT

The RBFN can also be used for classification. Similar to the MLP, the states are coded as binary patterns, and the number of output nodes is the same as the number of states. The output node with the highest response gives the predicted state. As explained earlier, the centers are automatically selected using the orthogonal least squares technique [3].

We used one set of data for training and the other for testing, and then reversed the roles of the training set and the test set for cross-validation. The accuracy of state prediction is reported in Table I. The MLP has a tendency to overfit the training data, which usually results in poor generalization performance on the test set. Therefore, we determined the number of epochs using a simple cross-validation process: we checked the state prediction accuracy on both the training set and the test set and chose the number of epochs which led to a balanced performance on both. Since Set 1 is more redundant than Set 2, training with Set 1 leads to better performance when tested with the disjoint Set 2. From Table I it is clear that the state-based SHOSLIF performs better than both the MLP and the RBFN.

The response speeds of these three methods are reported in Table II. After training each learning mechanism with Set 2, the response time in milliseconds was recorded on a SUN SPARC-10 workstation. The recorded time is the CPU time spent in mapping the input to the output, not including the time required for grabbing images from the camera. For the ANN-based methods, the response time was that of C programs for the network computation using the learned parameters computed by MATLAB.

A qualitative comparison among the MLP, RBFN, and SHOSLIF is summarized in Table III. The major power of SHOSLIF stems from its local adaptive nonparametric regression, which is more flexible than global parametric regression. Its recursive partition tree provides a good tradeoff between speed and performance. Projection pursuit [12] is another well-known adaptive algorithm for function approximation, but it has a very high computational complexity; Friedman [10] provides an excellent discussion of this subject. Both the MLP and the RBFN use global parametric regression and try to minimize the global least-squares error. When a learned MLP is exposed to more scenarios, its performance on previously learned samples may deteriorate. This is known as the "memory loss" problem. Rosenblum and Davis [27] showed that an RBF network for navigation experiences less of the memory loss problem. With SHOSLIF, the memory loss problem is further alleviated by its use of local nonparametric regression and dynamically added tree nodes.
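The contrast between the two regression styles is easy to see in code. The sketch below is a generic local nonparametric regressor of the family SHOSLIF belongs to: the prediction is formed only from the stored samples nearest to the query, and no global parameter vector is fitted. The exhaustive search, inverse-distance weighting, and the choice k = 3 are illustrative; the actual system retrieves neighbors through the recursive partition tree in a learned subspace.

import numpy as np

def knn_regress(query, samples, controls, k=3):
    """Local nonparametric regression: inverse-distance-weighted average of
    the control signals of the k nearest stored samples."""
    d = np.linalg.norm(samples - query, axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / (d[nn] + 1e-9)                 # avoid division by zero
    return (w @ controls[nn]) / w.sum()

# Toy usage: 100 stored image vectors with associated steering angles.
rng = np.random.default_rng(2)
samples = rng.normal(size=(100, 16))
controls = rng.uniform(-10.0, 10.0, size=100)
print(knn_regress(samples[0], samples, controls))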

With the MLP or the RBF network, our experiments showed that there was little problem in training the network to deal with one corner. However, when the network was exposed to several corners and intersections, the situation deteriorated.


Fig. 20. Comparison of three approaches: (a) SHOSLIF, (b) a feedforward network (the best MLP), and (c) the best RBF network we obtained. The error histograms for the training set are shown in the left column, while the error histograms for the test set are shown in the right column.

Both the MLP and the RBFN often failed to navigate through the tested loop, especially around corners 3 and 4 in Fig. 7, although all three methods used the same set of training data. ALVINN experienced similar problems when exposed to various road conditions. Pomerleau [24] used a rule-based method to arbitrate among individual networks trained for specific roads. The integration or arbitration of several networks is a difficult task. Here we use state information to accomplish the task systematically: no explicit rule for network arbitration is involved in the state-based SHOSLIF method.

According to our experience in the experiments, SHOSLIF is advantageous in at least two aspects: 1) ease of training and 2) better performance after it is trained. Although the MLP could give a reasonable performance, many random trials of initial weights are needed if random initialization is used. The RBF network does provide better performance than the MLP, but it is not as good as SHOSLIF. When most of the training samples are used as centers, the RBF network acts like a nearest-neighbor estimator, but then the computational complexity becomes prohibitive when the number of training samples is large.


TABLE III. A QUALITATIVE COMPARISON OF THREE DIFFERENT METHODS. THE FOLLOWING NOTATIONS ARE USED: d FOR INPUT DIMENSIONS; h FOR THE NUMBER OF HIDDEN NODES IN MLP; c FOR THE NUMBER OF CENTERS IN AN RBFN; n FOR THE NUMBER OF STORED SAMPLES. THE ABBREVIATIONS ARE: "PARA." FOR PARAMETRIC REGRESSION; "COMPL." FOR COMPLEXITY; "HIERARCH." FOR HIERARCHICAL; "CONV." FOR CONVERGENCE; "INCREM." FOR INCREMENTAL; "MEM." FOR MEMORY.

On the other hand, although our SHOSLIF gives superior performance with ease of training, it has to pay an extra cost in storing prototypes of training samples. It seems that the payoff from this extra storage enables state-based SHOSLIF to reach good accuracy at real-time speed in a challenging indoor environment. The training of SHOSLIF is also significantly faster than that of the MLP and the RBF network.

D. Mapping of SHOSLIF Into ANN Architecture

The learning algorithm used in SHOSLIF can be easily mapped into an ANN architecture. The most dominant eigenvector for the binary partition of each inner node can be computed by various neural learning algorithms (e.g., Oja [22] and references therein). The tree structure of SHOSLIF can also be mapped into an ANN architecture [28]. Suppose that we have a binary tree with a set of splits and leaves. Each split can be implemented by a weighted sum with a threshold mechanism. Thus, the ANN architecture for SHOSLIF has the same network topology as the tree itself.
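As a concrete illustration, the sketch below first estimates the dominant eigenvector of a sample stream with Oja's single-unit rule [22] and then uses it as the weight vector of one inner node realized as a weighted sum with a threshold. The learning rate, epoch count, toy data, and the median as the split threshold are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

def oja_dominant_eigenvector(samples, eta=0.01, epochs=20):
    """Oja's rule: w <- w + eta * y * (x - y * w) with y = w . x.
    The weight vector converges toward the most dominant eigenvector
    of the sample covariance (up to sign)."""
    w = rng.normal(size=samples.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in samples:
            y = w @ x
            w += eta * y * (x - y * w)
    return w / np.linalg.norm(w)

# Anisotropic toy data standing in for mean-subtracted image vectors.
X = rng.normal(size=(200, 8)) * np.linspace(3.0, 0.5, 8)

w = oja_dominant_eigenvector(X)
threshold = np.median(X @ w)      # split point along the eigenvector

def split(x):
    """One inner node as a thresholded weighted sum: 0 = left child, 1 = right."""
    return int(w @ x > threshold)

print(split(X[0]), split(X[1]))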

VII. CONCLUSIONS AND FUTURE WORK

The characteristics of the proposed algorithm can be summarized as follows. 1) An observation-driven Markov model for vision-based navigation: state information and visual attention have been incorporated systematically into the proposed framework to deal with more complex scenes. 2) The ODMM is realized by a nonparametric approach, the SHOSLIF regression method: a stochastic recursive partition tree is used to efficiently compute the best matches for the high-dimensional visual input data in real time. The overall learning algorithm is a local nonparametric adaptive regression, which exhibits more flexibility and better performance than the global parametric regression used in both the MLP and the RBF network. 3) An appearance-based method: compared with other navigation approaches using sonar or odometer sensors, the use of visual intensity images provides richer information and hence the number of states is greatly reduced. The appearance-based method automatically derives the best features as the principal component features, as used in a flat space by Kirby and Sirovich [15] and Turk and Pentland [33]. Consequently, it does not require humans to define features, which is a difficult and ad hoc task.
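The observation-driven Markov step summarized in item 2) can be sketched as a single lookup that maps the previous state and the current observation to the current state and a control signal. The prototype store, the weighted one-of-N state code concatenated to the image vector, and the exhaustive nearest-neighbor search standing in for the stochastic recursive partition tree are all simplifying assumptions.

import numpy as np

rng = np.random.default_rng(4)
N_STATES, D = 6, 16

# Prototype store: each entry pairs (previous state, image vector) with
# (current state, control signal), as learned from training samples.
prev_states = rng.integers(0, N_STATES, size=100)
images = rng.normal(size=(100, D))
next_states = rng.integers(0, N_STATES, size=100)
controls = rng.uniform(-10.0, 10.0, size=100)

def augment(state, image_vec, state_weight=1.0):
    """Concatenate a weighted one-of-N state code with the image vector."""
    code = np.zeros(N_STATES)
    code[state] = state_weight
    return np.concatenate([code, image_vec])

P = np.array([augment(s, img) for s, img in zip(prev_states, images)])

def odmm_step(prev_state, image_vec):
    """Nearest-neighbor realization of the ODMM transition."""
    q = augment(prev_state, image_vec)
    i = int(np.argmin(np.linalg.norm(P - q, axis=1)))
    return next_states[i], controls[i]

s, c = odmm_step(0, rng.normal(size=D))
print(s, round(float(c), 2))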

Several directions of future research remain. 1) More extensive tests will be conducted, especially under different navigation environments. The capability of the appearance-based approach to deal with factors such as lighting is an issue common to MLP- and RBFN-based road following. Although recent studies such as the use of discriminant analysis [30] have demonstrated the power to deal with unrelated factors, further studies on this important issue are needed. 2) The use of direction commands, which would allow the robot to make different turns at the same intersection. This can be done within our current framework by allowing states to include information about the desired turning direction. The flexibility of the proposed framework can be further exploited in these future studies.

REFERENCES

[1] N. Ayache and O. D. Faugeras, "Maintaining representations of the environment of a mobile robot," IEEE Trans. Robot. Automat., vol. 5, no. 6, pp. 804–819, 1989.

[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. New York: Chapman & Hall, 1993.

[3] S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithms for radial basis function networks," IEEE Trans. Neural Networks, vol. 2, pp. 302–309, Mar. 1991.

[4] S. Chen and J. Weng, "SHOSLIF-N: SHOSLIF for autonomous navigation (Phase I)," Michigan State Univ., East Lansing, Tech. Rep. CPS-94-62.

[5] T. M. Cover, "Estimation by the nearest neighbor rule," IEEE Trans. Inform. Theory, vol. IT-14, no. 1, pp. 50–55, Jan. 1968.

[6] D. R. Cox, "Statistical analysis of time series: Some recent developments," Scand. J. Statist., vol. 8, no. 2, pp. 93–115, 1981.

[7] E. D. Dickmanns, "Machine perception exploiting high-level spatio-temporal models," in AGARD Lecture Series 185 'Machine Perception', Sept./Oct. 1992.

[8] E. D. Dickmanns, "Vehicles capable of dynamic vision," in Proc. 15th Int. Joint Conf. Artificial Intell. (IJCAI-97), Nagoya, Japan, Aug. 23–29, 1997.

[9] I. E. Frank and J. H. Friedman, "A statistical review of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–148, 1993.

[10] J. H. Friedman, "Multivariate adaptive regression splines (with discussion)," Ann. Statist., vol. 19, no. 1, pp. 1–141, 1991.

[11] J. Hancock and C. E. Thorpe, "ELVIS: Eigenvectors for land vehicle image system," Tech. Rep. CMU-RI-TR-94-43, Dec. 1994.

[12] P. J. Huber, "Projection pursuit," Ann. Statist., vol. 13, no. 2, pp. 435–475, 1985.

[13] X. D. Huang, Y. Ariki, and M. A. Jack, Hidden Markov Models for Speech Recognition. Edinburgh, U.K.: Edinburgh Univ. Press, 1990.

[14] T. M. Jochem, "Vision-based tactical driving," The Robotics Institute, Carnegie Mellon University, Tech. Rep. CMU-RI-TR-96-14, Jan. 1996.

[15] M. Kirby and L. Sirovich, "Application of the Karhunen–Loeve procedure for the characterization of human faces," IEEE Trans. Pattern Anal. Machine Intell., vol. 12, pp. 103–108, 1990.

[16] K. Kluge and C. Thorpe, "Explicit models for robot road following," in Vision and Navigation: The Carnegie Mellon Navlab, C. Thorpe, Ed. Norwell, MA: Kluwer, 1990, pp. 25–38.

[17] S. Koenig and R. G. Simmons, "A robot navigation architecture based on partially observable Markov decision process models," in Artificial Intelligence Based Mobile Robotics: Case Studies of Successful Robot Systems, D. Kortenkamp, R. P. Bonasso, and R. Murphy, Eds. Cambridge, MA: MIT Press, 1997.

[18] X. Lebesgue and J. K. Aggarwal, "Significant line segments for an indoor mobile robot," IEEE Trans. Robot. Automat., vol. 9, pp. 801–816, 1993.


[19] M. Meng and A. C. Kak, "Mobile robot navigation using neural networks and nonmetrical environment models," IEEE Contr. Syst. Mag., pp. 31–42, Aug. 1993.

[20] S. K. Murthy, "Automatic construction of decision trees from data: A multidisciplinary survey," Data Mining Knowledge Discovery, 1998, submitted for publication.

[21] E. Oja and J. Karhunen, "On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix," J. Math. Anal. Applicat., vol. 106, pp. 69–84, 1985.

[22] E. Oja, "Principal components, minor components, and linear neural networks," Neural Networks, vol. 5, pp. 927–936, 1992.

[23] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.

[24] D. A. Pomerleau, Neural Network Perception for Mobile Robot Guidance. Norwell, MA: Kluwer, 1993.

[25] D. A. Pomerleau, "RALPH: Rapidly adapting lateral position handler," in Proc. IEEE Symp. Intell. Vehicles, Detroit, MI, Sept. 25–26, 1995.

[26] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[27] M. Rosenblum and L. S. Davis, "An improved radial basis function network for visual autonomous road following," IEEE Trans. Neural Networks, vol. 7, pp. 1111–1120, Sept. 1996.

[28] I. K. Sethi, "Decision tree performance enhancement using an artificial neural network implementation," in Artificial Neural Networks and Statistical Pattern Recognition, I. K. Sethi and A. K. Jain, Eds. Amsterdam, The Netherlands: North-Holland, 1991, pp. 71–88.

[29] R. Simmons and S. Koenig, "Probabilistic robot navigation in partially observable environments," in Proc. 14th Int. Joint Conf. Artificial Intell. (IJCAI), 1995, pp. 1080–1087.

[30] D. L. Swets and J. Weng, "Using discriminant eigenfeatures for image retrieval," IEEE Trans. Pattern Anal. Machine Intell., vol. 18, pp. 831–836, Aug. 1996.

[31] C. Thorpe, Ed., Vision and Navigation: The Carnegie Mellon Navlab. Norwell, MA: Kluwer, 1990, pp. 9–23.

[32] C. Thorpe, M. Herbert, T. Kanade, and S. Shafer, "Toward autonomous driving: The CMU Navlab," IEEE Expert, pp. 31–42, Aug. 1991.

[33] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cognitive Neurosci., vol. 3, no. 1, pp. 71–86, 1991.

[34] A. R. Webb and D. Lowe, "The optimized internal representation of multilayer classifier networks performs nonlinear discriminant analysis," Neural Networks, vol. 3, pp. 367–375, 1990.

[35] J. Weng and S. Chen, "Incremental learning for vision-based navigation," in Proc. Int. Conf. Pattern Recognition, vol. IV, Vienna, Austria, Aug. 1996, pp. 45–49.

[36] J. Weng and S. Chen, "Vision-guided navigation using SHOSLIF," Neural Networks, vol. 11, pp. 1511–1529, 1998.

[37] S. Yakowitz, "Nearest neighbor methods for time series analysis," J. Time Ser. Anal., vol. 8, no. 2, pp. 235–247, 1987.

[38] S. Yakowitz, "Nonparametric density and regression estimation for Markov sequences without mixing assumptions," J. Multivariate Anal., vol. 30, pp. 124–136, 1989.

[39] S. L. Zeger and B. Qaqish, "Markov regression models for time series: A quasilikelihood approach," Biometrics, vol. 44, pp. 1019–1031, 1988.

Shaoyun Chen (M'99) received the B.S. degree from Xiamen University, China, and the M.E. degree from Tsinghua University, China, in 1988 and 1991, respectively, and the Ph.D. degree from Michigan State University, East Lansing, in 1998, all in computer science.

He is now with KLA-Tencor, working on wafer inspection. Previously he was affiliated with Sensar, Inc., doing active vision and iris recognition. His major interests include computer vision, image processing, and pattern recognition and their applications, including fingerprint recognition and iris recognition.

Juyang Weng (S'85–M'88) received the B.S. degree from Fudan University, Shanghai, China, in 1982 and the M.S. and Ph.D. degrees from the University of Illinois, Urbana-Champaign, in 1985 and 1989, respectively, all in computer science.

From January 1989 to September 1990, he was a Researcher at Centre de Recherche Informatique de Montréal, Montreal, PQ, Canada, while adjunctively with Ecole Polytechnique de Montréal. From October 1990 to August 1992, he was a Visiting Research Assistant Professor at the University of Illinois, Urbana-Champaign. In August 1992, he joined the Department of Computer Science, Michigan State University, East Lansing, MI, where he is now an Associate Professor. He is a coauthor of the book Motion and Structure from Image Sequences (New York: Springer-Verlag, 1993). His current research interests include mental development, computer vision, autonomous navigation, and human-machine interfaces using vision, speech, gesture, and actions. Recently, he has been pursuing a new research direction called developmental robots: robots that can autonomously develop their own cognitive and behavioral capabilities through online, real-time interactions with their environments, including human teachers, using their sensors and effectors.