

Toward Natural Language Semantic Sensing in Dynamic State Spaces

Nicholas Sweet and Nisar Ahmed
Aerospace Engineering Sciences, University of Colorado, Boulder, Colorado 80309

[email protected]; [email protected]

I. INTRODUCTION AND OVERVIEW

One of the great challenges of robotics is effective human-robot communication. It is quite challenging for autonomous robots to understand human natural language statements, as these must be mapped from a large, highly open space of semantic constructs to precise grounded representations that robots can interpret. Probabilistic models and inference algorithms for natural language processing (NLP) have proved to be especially useful and attractive, due to their high degree of modularity, flexibility, computational efficiency, and ability to tie naturally into probabilistic reasoning algorithms for robotic decision-making [1–8]. While state-of-the-art developments in probabilistic methods have largely focused on natural language command and control, there is also great interest in the ability of humans to quickly and efficiently communicate useful sensory information, i.e. to act as ‘human sensors’ that support probabilistic robotic perception, learning and state estimation tasks [9–15].

The problems of exploiting human natural language for either sensing/estimation or planning/control have much in common, but also lead to different technical challenges for modeling and reasoning. Specifically: whereas planning/control ultimately requires identification of (constraints on) a single most likely set of executable actions to minimize some cost function, sensing/estimation tasks generally require updating complete beliefs over all possible latent state variable hypotheses. The former case can be solved with state-of-the-art point-based optimization strategies, but the latter requires efficient pdf representation and update solutions to tackle the full Bayesian inference problem.

This work describes progress toward a probabilistic framework for translating free-form natural language observations provided by human sensors into data likelihood functions for efficient online Bayesian human-robot sensor fusion. Similar to, and in parallel with, frameworks such as DCG and HDCG for planning/control [1–3], our goal is to develop a generalizable ‘plug-and-play’ interface between existing robotic Bayesian state estimators and off-the-shelf natural language processors. A key requirement to this end is to capture uncertainties related to both semantic ambiguities and lexical parsing of natural language. In contrast to state-of-the-art planning approaches, we propose a ‘loosely-coupled’ inference approach that conditionally separates the problems of inferring lexical/translation uncertainties and inferring semantic meaning of natural language for state estimation. A key challenge to this end is to derive statistically consistent/conservative estimates of lexical parser uncertainties from existing off-the-shelf probabilistic NLP components (e.g. syntactic parsers, sense-matchers), so that these can be combined with existing multiple

Fig. 1: The initial human-robot interface in the ‘Cops and Robots’ experiment. Clockwise from top-left: video feed from cop robot; environment and belief map of target locations; question-answer module; human sensor codebook to be replaced with a chat interface.

hypothesis semantic data fusion techniques. We describe a motivating application for our work, and provide a general problem statement for mapping free-form natural language chat data provided by humans into large dictionaries of recognizable semantic soft data. We propose a generic processing pipeline for estimating required lexical parser uncertainties, and present a preliminary approach and results for estimating lexical uncertainties due to sense-matching ambiguities using pre-trained/off-the-shelf word2vec models [16].

II. PROBLEM FORMULATION AND RELATED WORK

A. Motivation: ‘Cops and Robots’ Dynamic Target Search

As a grounding application, we have developed an indoor autonomous target search testbed for the ‘Cops and Robots’ scenario, in which one ‘security cop’ robot (Deckard) searches for multiple ‘robber’ robots with the assistance of a human ‘deputy’. The indoor search environment consists of one mobile cop robot, 3 mobile robber robots (Roy, Pris, and Zhora), and semantic reference features on a known map, e.g. dividing rooms, areas, furnishings, etc. Each robot uses an iRobot Create base for mobility, a Kinect camera for onboard sensing, and an Odroid U3 on-board computer running ROS. The human deputy is located remotely and uses the interface shown in Figure 1 to communicate with the cop robot. All robots and map objects are localized by a Vicon system. The human deputy can see the cop robot’s camera feed, or select a view from one of three security cameras in the environment.

In the initial interface, the human selects from a fixed list of possible semantic observations (lower-left of Fig. 1) to send a structured message to the cop, of the form

〈TargetID〉 〈is/is not〉 〈Preposition〉 〈Landmark〉,   (1)


Fig. 2: Softmax probability surfaces capturing intrinsic uncertainties in range and bearing.

where the field elements are taken from a known dictionary. Alternatively, structured semantic questions generated by the cop can be answered by the human, where the top 5 questions formed from the fixed list elements are ranked according to maximum expected information gain (posterior entropy decrease) for a given target, e.g. ‘Is Roy in the dining room?’. The interface shows each target’s evolving location pdf as a heat map over the environment. The cop fuses onboard visual information (limited range target detection/non-detection events for search and range/bearing data for tracking) with semantic soft data provided by the human deputy. For now, as in [10, 17], the cop separates the motion planning and data fusion problems; the human has no direct control over the cop’s movement, but can influence the cop’s beliefs about where targets are (and are not) located. As such, the cop must autonomously decide how best to utilize the information it receives. In doing so, it must also cope with the fact that its probabilistic beliefs over dynamic robber locations will generally be described by non-convex/multimodal pdfs, due to complex target motion and sensing uncertainties.
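The entropy-based question ranking described above can be sketched for a single yes/no question on a discretized belief. This is a minimal illustration, not the testbed implementation; the grid, prior, and answer-likelihood values below are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete pdf, ignoring zero cells."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expected_info_gain(prior, lik_yes):
    """Expected posterior-entropy decrease for a yes/no semantic question,
    where lik_yes[x] = P(answer = yes | target in cell x)."""
    p_yes = float((lik_yes * prior).sum())
    post_yes = lik_yes * prior / p_yes
    post_no = (1.0 - lik_yes) * prior / (1.0 - p_yes)
    expected_post_entropy = p_yes * entropy(post_yes) + (1.0 - p_yes) * entropy(post_no)
    return entropy(prior) - expected_post_entropy

# Toy 4-cell map with a uniform prior; the question 'Is Roy in the dining
# room?' covers cells 0-1 with a hypothetical 0.9 answer reliability.
prior = np.array([0.25, 0.25, 0.25, 0.25])
lik_yes = np.array([0.9, 0.9, 0.1, 0.1])
gain = expected_info_gain(prior, lik_yes)
```

Ranking the top 5 questions then amounts to evaluating this expected gain for each candidate question's answer likelihood and sorting.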

B. Formal Bayesian Human-Robot Data Fusion

Previous work in [10, 18] established a rigorous, formal method to identify likelihood functions for semantic soft data provided by human sensors and (recursively) fuse these with conventional robotic sensor data. We first provide a brief summary of this approach using general dynamic state space models. For a discrete time index k ∈ Z+, let X_k ∈ R^n be the uncertain continuous target state vector (e.g., position, heading, velocity). Given a prior belief p(X_{k−1}), potentially conditioned on ‘hard’ robot sensor data ζ_k and known target dynamics p(X_k | X_{k−1}), soft data D_k can be fused to obtain the posterior target state pdf via the Bayes filter:

p(X_k) = ∫ p(X_k | X_{k−1}) p(X_{k−1}) dX_{k−1},   (2)

p(X_k | D_k) ∝ P(D_k | X_k) p(X_k).   (3)

The soft data likelihood P(D_k | X_k) represents a hybrid (i.e. continuous-to-discrete) mapping between categorical human sensor statements and intrinsically uncertain semantic classification of the true state. We model P(D_k | X_k) through generalized softmax functions [18–20],

P(D_k = l | X_k) = exp(w_l^T X_k + b_l) / Σ_{j=1}^{m} exp(w_j^T X_k + b_j),

where l selects one of m possible categorical semantic labels, each parametrized by weights w_l and biases b_l, which can be learned from experimental data. Fig. 2 shows an example softmax model with 17 spatial relation classes. As discussed in [18, 19], generalized softmax models provide a rich, convenient polytopic geometry for easily embedding a priori contextual information (e.g. object/environment geometries, dynamic constraints), leading to significant parameter sparsification and greatly mitigating expensive non-convex optimization procedures for model identification. Furthermore, the product of generalized softmax likelihoods from multiple independent observations can be quickly compressed into a single likelihood for efficient online ‘batch’ data fusion. When p(X_k) is a Gaussian mixture (GM),

p(X_k) = Σ_{p=1}^{m_p} w_p N_{X_k}(μ_p, Σ_p)

(where w_p is the weight, μ_p ∈ R^n the mean vector, and Σ_p the covariance matrix for mixand p), the posterior (3) is well approximated by another GM via combined variational inference and importance sampling techniques, enabling fast recursive state estimation with multimodal pdfs [10]. GM compression methods [21] can be used to efficiently manage mixture growth from multimodal transition pdfs p(X_k | X_{k−1}) (e.g. unobserved switching dynamics) or semantic likelihoods p(D_k | X_k) (e.g. non-convex range-only semantic data). As such, the GMs enable (parallelizable) maintenance of complex belief hypotheses over the whole state space for approximate Bayesian inference.

Fig. 3: DBN models corresponding to different translations/groundings of O_k.

Fig. 3 shows the dynamic Bayesian network (DBN) model for the soft data fusion process for an arbitrary dictionary of m semantic observations at any given time step. Each D_k^i term has its own generalized softmax likelihood to describe semantic state space uncertainty for associated category labels (e.g. D_k^1’s label set may describe bearings relative to the cop, D_k^2’s label set may describe robber velocity, etc.), and shaded nodes indicate actual observation data provided by a human.
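A generalized softmax likelihood of this form is straightforward to evaluate. The following sketch uses hypothetical weights and labels, not the learned parameters behind Fig. 2:

```python
import numpy as np

def softmax_likelihood(x, W, b):
    """Return P(D_k = l | X_k = x) for all m labels, given
    weight matrix W (m x n) and bias vector b (m,)."""
    logits = W @ x + b
    logits -= logits.max()        # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Toy 1-D range model with labels ('next to', 'nearby', 'far from');
# weights and biases here are made up for illustration.
W = np.array([[-4.0], [-1.0], [0.0]])
b = np.array([4.0, 2.0, 0.0])
probs = softmax_likelihood(np.array([0.5]), W, b)  # state: range = 0.5 m
```

Because the likelihood is a normalized exponential of linear functions of the state, the class boundaries are linear, which is what gives these models their polytopic geometry.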

C. Structured Interfaces for Reporting Semantic Soft Data

A human could report soft data by constructing structured observations l from a list of available statements, as shown lower-left in Fig. 1. This bypasses the difficult problem of parsing natural language inputs, but leads to several major limitations. Firstly, it restricts the human to a rigid pre-defined dictionary and message structure, which may not provide an intuitive or sufficiently rich set of semantics to convey desired information. Secondly, it is very inconvenient to select items one-by-one from a list for structured messaging; especially as the dictionary size m grows, this quickly becomes infeasible and time-consuming enough to render data irrelevant in dynamic settings. Furthermore, this approach does not scale well with environment/problem complexity and lexical richness for soft data reporting. In particular, it is often desirable to provide observations that activate multiple D_k^i terms (bottom of Fig. 3), e.g. ‘Roy is by the table and heading slowly to the kitchen’ simultaneously provides location, orientation and velocity information.


D. Proposed Natural Language Chat Interface

To address these limitations in the Cops and Robots application, we are developing an unstructured natural language chat interface to support highly flexible ‘free-form’ soft data reporting. The chat interface should ideally support a wide range of semantics and enable fast transmission of rich soft data reports. However, the problem of deriving meaningful and contextually relevant dynamic state information from a free-form chat observation O_k is highly non-trivial. Unlike with structured messages, it is infeasible to explicitly construct likelihood functions p(O_k | X_k) in advance to find the Bayes posterior p(X_k | O_k) directly. Furthermore, many different chat messages can convey exactly the same or slightly different varieties of information, leading to additional uncertainties in lexical meaning (i.e. possible translation errors) on top of intrinsic semantic (state space) uncertainty. For example, the phrases ‘Blue robot moving past the books’, ‘Roy next to the bookcase going to kitchen’, and ‘Robber nearby shelf heading left’ all overlap in the sense that they could all essentially refer to a structured set of atomic phrases such as ‘Roy is near the bookcase; and Roy is moving toward the kitchen’.

Since similar issues are also encountered in natural language interpretation for planning/control, modeling approaches developed for that domain could be adopted here. In approaches such as G3 [5] and more recently DCG and HDCG [1–3], MAP-based inference is performed over a large set of random variables defined by factor graphs to determine a single most likely set of latent semantic groundings and desired actions/action constraints from natural language command inputs. These models consider dense overlapping sets of hierarchical and undirected conditional dependencies that comprehensively capture the complex relationships between different possible linguistic meanings and contexts within a modular joint probability distribution. Due to the complexity of the joint pdf, it is non-trivial to represent the full posterior distribution over unknown variables, and so point-based solutions are sought instead. Inference operations over continuous random variables are also typically approximated via discretization, which is suitable for online planning scenarios with low state dimensionalities. These features impose significant operational limitations from the Bayesian human-robot sensor fusion perspective, as autonomous robots will generally want to reason over and track all possible likely state hypotheses for complex applications, especially if these are significantly multimodal and have many dynamic states, as in the Cops and Robots scenario.

Furthermore, state-of-the-art NLP-to-planning models essentially rely on ‘tightly coupled’ inference, i.e. in analogy to tightly-coupled GPS-based vehicle navigation systems which comprehensively account for and reason jointly over co-dependent uncertainties in vehicle, receiver, satellite and atmospheric propagation error parameters and states [22]. Such reasoning is arguably quite essential for high precision and ‘single-shot’ applications, where information-gathering opportunities are limited. This motivates consideration of simpler ‘loosely coupled’ inference architectures for other cases where coarser-grained and/or multi-shot reasoning is sufficient. By ‘loosely coupled’, we mean to separately account for lexical/translation uncertainties and semantic/meaning uncertainties in a statistically consistent way that avoids joint reasoning over a large set of latent variables. Ideally, such a reasoning framework could be designed and implemented with off-the-shelf NLP components (e.g. probabilistic syntactic parsers and sense-matchers) to provide (hopefully conservative) ‘lumped’ approximations of lexical/translation uncertainties, while semantic/meaning uncertainties for a given lexical parsing are handled by a state fusion engine.

To build on the Bayesian GM fusion framework for state space models, our approach is to translate a given O_k into a reasonable ‘on the fly’ estimate of the likelihood p(O_k | X_k) via a very large (possibly expandable) dictionary of latent D_k^i semantic observations, which have known generalized softmax likelihoods p(D_k^i | X_k). In particular, given the expansion (where O_k is conditionally independent of X_k given D_k^i)

p(X_k | O_k) ∝ p(X_k) Σ_{i=1}^{m} Σ_{j=1}^{m_i} p(O_k | D_k^i = j) p(D_k^i = j | X_k),

we can generally approximate the likelihood p(O_k | X_k) as the summation term on the RHS, where p(O_k | D_k^i = j) accounts for the lexical uncertainty and p(D_k^i = j | X_k) accounts for the semantic uncertainty. Since O_k may also point to multiple soft observations (i.e. data about position and heading for a target in the same sentence), the D_k^i likelihoods on the RHS could generally correspond to unique products of independent dictionary terms, e.g. as in the multiple instantiations for the bottom row of Fig. 3. The general problem, then, is to identify how D_k^i likelihoods (or sets of soft observations) should be ‘activated’ for a given O_k input by identifying the scalars p(O_k | D_k^i = j). Since p(X_k) p(D_k | X_k) can generally be approximated as a GM, it follows that the LHS p(X_k | O_k) generally corresponds to a ‘mixture of mixtures’ (that is, as long as O_k could ‘truly’ match only one set of soft observations, so that p(X_k | O_k) sums to one).
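On a discretized state space, the loosely-coupled update above reduces to weighting each dictionary term's semantic likelihood by its lexical score. A minimal sketch with hypothetical numbers (the paper's actual implementation operates on Gaussian mixtures, not grids):

```python
import numpy as np

def chat_posterior(prior, semantic_lik, lexical_lik):
    """p(X_k | O_k) on a grid: prior is p(X_k) over S cells,
    semantic_lik[j] is p(D_k = j | X_k) over the same cells, and
    lexical_lik[j] is the scalar p(O_k | D_k = j) from the NLP pipeline."""
    mixed_lik = lexical_lik @ semantic_lik   # sum_j p(O|D=j) p(D=j|X)
    post = prior * mixed_lik
    return post / post.sum()

# Toy 4-cell map, two dictionary terms, made-up likelihood values
prior = np.array([0.25, 0.25, 0.25, 0.25])
semantic_lik = np.array([[0.80, 0.60, 0.10, 0.05],    # e.g. 'near bookcase'
                         [0.05, 0.10, 0.60, 0.80]])   # e.g. 'near kitchen'
lexical_lik = np.array([0.9, 0.1])                    # chat strongly matches term 0
posterior = chat_posterior(prior, semantic_lik, lexical_lik)
```

With GM priors and softmax likelihoods, each term of the double sum contributes its own mixture of posterior components, which is what produces the ‘mixture of mixtures’ structure noted above.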

III. METHOD AND PRELIMINARY RESULTS

Fig. 4 shows a generic serial NLP pipeline for translating O_k into D_k^i, which can be wrapped around the semantic data fusion engine to provide estimates for lexical uncertainty terms p(O_k | D_k^i) given a large fixed semantic dictionary defining possible D_k^i soft data primitives. We assume that the soft data primitives can be predefined offline for a known environment with a fixed number of possible groundings and a finite number of feasible semantic relations defined in the state space X_k. However, this does not restrict the ability to augment the dictionary or modify semantic likelihoods p(D_k^i | X_k) online, e.g. to describe spatial relations for moving landmarks (e.g. the cop robot). Each component layer of the pipeline represents a key step for translating free-form chat inputs into recognizable semantic information, and thus captures a potential source of lexical parsing errors, which percolate down to the next layer.

A key point is that each component layer can, in principle, be instantiated by any number of existing off-the-shelf NLP tools. Internal components include: a tokenizer, which segments a text stream into multi-word tokens; a tagger, which tags each token with a semantic label; a templater, which fits label-token pairs into pre-defined sensor statement templates; and a sense-matcher, which maps natural language tokens to the list of pre-defined tokens that populate appropriate positions in sensor statement templates. Fig. 7 shows the general sensor statement template, which uses token labels derived from pre-defined Target Description Clauses (TDCs), which are conceptually similar to Spatial Description Clauses


Fig. 4: A proposed NLP pipeline (left) with an example (right) for loosely coupled soft data fusion.

(SDCs) [6]. Different TDC types consist of different components that form sub-trees of Fig. 7; for instance, an action TDC is composed of a certainty (how sure the human is of the observation), a target (none, one or multiple entities), a positivity (e.g. ‘is’, ‘is not’), and an action (description of the target motion).

In theory, provided that the possible outputs of each component layer can be assigned a conditional probability of being correct given the outputs of the previous component layer, then p(O_k | D_k^i) ∝ p(D_k^i | O_k) / p(D_k^i) can be obtained (up to the constant p(O_k)) by estimating p(D_k^i | O_k) as the product of the component layer conditional probabilities (i.e. for parsing events leading up to some D_k^i), and finding p(D_k^i) = ∫ p(X_k) p(D_k^i | X_k) dX_k from the corresponding fusion result.
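The computation just described is a simple product-and-normalize; the sketch below makes that explicit, with the per-layer probabilities and the dictionary-term prior chosen purely for illustration:

```python
def lexical_likelihood(layer_probs, p_dk):
    """Estimate p(O_k | D_k^i) up to the constant p(O_k): layer_probs
    are per-layer correctness probabilities (tokenizer, tagger,
    templater, sense-matcher) for the parsing events leading to D_k^i,
    and p_dk is p(D_k^i), the integral of p(X_k) p(D_k^i | X_k) over
    the state space, reported by the fusion engine."""
    p_d_given_o = 1.0
    for p in layer_probs:
        p_d_given_o *= p        # product of layer conditionals ~ p(D_k^i | O_k)
    return p_d_given_o / p_dk

# Hypothetical layer probabilities and dictionary-term prior
score = lexical_likelihood([0.95, 0.90, 0.85, 0.70], p_dk=0.2)
```

Since p(O_k) is common to all dictionary terms, these unnormalized scores can be compared directly when deciding which D_k^i terms a chat input activates.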

Due to limited space, we focus attention here on obtaining uncertainty estimates from the final sense-matching portion of the pipeline, under the assumption that perfect tokenization, tagging and templating have been performed. Hence, in the following, we consider unstructured chat inputs that are processed to the point where they should match expected parts of speech for a valid soft data statement, but have variations in the semantic labels that may be outside of the fixed dictionary for D_k^i terms. The problem is then to map these semantic label variations to the closest matching dictionary terms.

A. Sense-matching with word2vec

The key issue in mapping between unstructured language and a set of structured semantic templates is the derivation of a relationship between the estimated 13 million tokens in the English language and the comparatively minuscule number of tokens in a set of predefined template statements. Recent efforts, particularly by Mikolov et al. [16], introduced a negative sampling-based approach to efficiently learn multi-dimensional vector representations of all tokens in a vocabulary. Using this word2vec tool, we can develop a mapping between a set of tokens contained within an unstructured utterance and a set of tokens contained within a structured semantic template.

Each structured statement D_k can be decomposed into tokens d_{k,1}, d_{k,2}, . . . , d_{k,N_M}. If we are given the tokenization T_k of an unstructured statement, decomposed into tokens t_{k,1}, t_{k,2}, . . . , t_{k,N_T}, we can form a mapping between the tokens of the two statements. This relationship is captured graphically in

Fig. 5: Probabilistic graphical model of the token-matching problem between structuredstatement tokens and observed, processed unstructured statement tokens.

Fig. 6: Token-matching problem reformulated as a Hidden Markov Model.

Fig. 5. To explore this problem in a tractable manner, we assert that the Markov property holds between unobserved tokens, and arrive at the simpler graphical model shown in Fig. 6, leading to the following Hidden Markov Model (HMM):

P(D_k | T_k) ∝ P(d_1, d_2, . . . , d_{N_T}, t_1, t_2, . . . , t_{N_T})
            = P(d_1) P(t_1 | d_1) P(d_2 | d_1) P(t_2 | d_2) × . . . × P(d_{N_T} | d_{N_T−1}) P(t_{N_T} | d_{N_T})   (4)

One of the key features of this assumption is that we now impose a one-to-one pairing between tokens of structured and unstructured phrases, and thus that both phrases are tokenized to the same length. If the prior elements of the pipeline are able to provide this tokenization, we only require transition probabilities P(d_{k,i+1} | d_{k,i}) and observation probabilities P(t_{k,i} | d_{k,i}) to specify the conditional probability of each structured statement P(D_k | T_k).
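The factorization in (4) can be scored directly for any candidate template statement. A minimal sketch, with hypothetical probability tables standing in for the dictionary-derived values:

```python
def hmm_score(obs_tokens, template_tokens, p_init, p_trans, p_obs):
    """Score a candidate template via the chain in (4):
    P(d_1) P(t_1|d_1) * prod_i P(d_i|d_{i-1}) P(t_i|d_i).
    Probabilities absent from the tables are treated as zero."""
    d, t = template_tokens, obs_tokens
    score = p_init.get(d[0], 0.0) * p_obs.get((t[0], d[0]), 0.0)
    for i in range(1, len(d)):
        score *= p_trans.get((d[i], d[i - 1]), 0.0) * p_obs.get((t[i], d[i]), 0.0)
    return score

# Toy two-token statements of the form <Target> <SpatialRelation>,
# with made-up transition and observation probabilities
p_init = {'Roy': 1.0}
p_trans = {('near', 'Roy'): 0.5, ('inside', 'Roy'): 0.5}
p_obs = {('Roy', 'Roy'): 0.9, ('next to', 'near'): 0.6, ('next to', 'inside'): 0.1}
s_near = hmm_score(['Roy', 'next to'], ['Roy', 'near'], p_init, p_trans, p_obs)
s_inside = hmm_score(['Roy', 'next to'], ['Roy', 'inside'], p_init, p_trans, p_obs)
```

Ranking all template statements by this score recovers the most plausible structured reading of the observed tokens under the one-to-one pairing assumption.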

We can use known structure to determine the transition probabilities. Figure 7 illustrates a possible structure for a human sensor statement. Each node is a template, containing one or more possible tokens; for instance, the Spatial Relation: Object node may contain ‘front’, ‘left’, ‘back’, ‘right’ and ‘near’, whereas the Spatial Relation: Area node may contain ‘inside’, ‘near’ and ‘outside’. If each template node is expanded into a set of token nodes, then each path through the

Fig. 7: Example human sensor statement template structure; each node decomposes intoa set of possible tokens to generate each template structured statement.


tree from the root to a leaf node generates one statement. Thus, transition probabilities between d_{k,i} and d_{k,i+1} can be derived from parent-child relationships:

P(d_{k,i+1} | d_{k,i}) = 1/ν if PARENT(d_{k,i+1}) = d_{k,i}, and 0 otherwise,   (5)

where ν is the number of tokens captured by the template node that contains d_{k,i+1}. This assumes all possible paths are equally likely within the constraints specified by the template.
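Eq. (5) can be read directly off the expanded template tree once each token records its parent token and the size ν of its template node. The tree fragment below is a hypothetical stand-in for part of Fig. 7:

```python
def transition_prob(d_next, d_prev, parent, node_size):
    """Eq. (5): uniform over the nu tokens of d_next's template node
    when d_prev is the parent of d_next in the expanded tree, else 0."""
    if parent.get(d_next) == d_prev:
        return 1.0 / node_size[d_next]
    return 0.0

# Hypothetical fragment: positivity token 'is' expands into the
# Spatial Relation: Area node {'inside', 'near', 'outside'} (nu = 3)
parent = {'inside': 'is', 'near': 'is', 'outside': 'is'}
node_size = {'inside': 3, 'near': 3, 'outside': 3}
p = transition_prob('near', 'is', parent, node_size)       # 1/3
q = transition_prob('near', 'is not', parent, node_size)   # 0: no such edge
```

Because rows of this transition table are uniform over siblings, the template tree alone fixes the HMM's transition structure with no learned parameters.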

Calculation of the observation probabilities asks the question, ‘What is the probability of observing the token t_{k,i}, given that I have selected the template token d_{k,i}?’. This question relates to the core issue of lexical uncertainty in our problem: the probability of observing, for instance, ‘near’ given that the template word is ‘near’ is non-unity; furthermore, since Σ_{t_{k,i}} P(t_{k,i} | d_{k,i}) = 1 and t_{k,i} spans the English language, we notice little discrimination between two words from the conditional probability.

At its core, the Skip-gram based word2vec uses a softmax function (c.f. [16], eq. 2) to relate the probability of one word, w_i, represented by its input and output vectors v_i and v′_i, given another word, w_j, represented by its input and output vectors v_j and v′_j:

p(w_i | w_j) = exp(v′_i^T v_j) / Σ_{w=1}^{W} exp(v′_w^T v_j)   (6)

Since W is the number of words in the vocabulary (likely in the millions for unstructured language), p(w_i | w_j) will be both small and nearly flat over the vocabulary. A common measure of word similarity, in place of conditional probability, is to instead take the cosine similarity between output word vectors. This similarity measure need not sum to 1 over all words and is exactly 1 for identical words. A first-step approach would be to replace the token observation probability P(t_{k,i} | d_{k,i}) with a token observation similarity score s(t_{k,i}, d_{k,i}), and weight all possible statements by their joint scores:

s(D_k, T_k) = P(d_1) s(t_1, d_1) P(d_2 | d_1) s(t_2, d_2) × . . . × P(d_{N_T} | d_{N_T−1}) s(t_{N_T}, d_{N_T})   (7)

We can now use arg max_{D_k} s(D_k, T_k) to select the template sensor statements that are most similar to the input tokenization. A proof-of-concept example, seen in Fig. 8, shows the results of generating s(D_k, T_k) for some test phrases. In this example, 2682 possible template statements are considered, which covers 79.81% of the 208 input sentences collected by our pilot experiment. Four input phrases are shown, assumed to have been tokenized correctly by previous pipeline elements, to demonstrate the sense-matching portion of the pipeline independently. From left to right, the four columns demonstrate the effects of increasing dissimilarity with template statements: the first input sentence is exactly a sensor statement template; the second input sentence replaces a spatial relation template token, ‘near’, with a non-template token, ‘next to’; the third input sentence replaces the grounding and changes the positivity; and the fourth is an imprecise reformulation of the first sentence. These results are promising, as the top-scoring statements are all qualitatively similar to the original phrase.
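The cosine-similarity substitution in (7) can be sketched for a single out-of-dictionary token. The 2-D vectors below are made up for illustration and stand in for real pre-trained word2vec embeddings:

```python
import numpy as np

# Hypothetical 2-D embeddings standing in for pre-trained word2vec vectors
vec = {'near':    np.array([1.0, 0.1]),
       'next to': np.array([0.9, 0.2]),
       'inside':  np.array([0.1, 1.0])}

def cosine(u, v):
    """Cosine similarity: exactly 1 for identical directions;
    unlike a conditional probability, it need not sum to 1."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Sense-match an out-of-dictionary token to the closest template token
template_tokens = ['near', 'inside']
observed = 'next to'
best = max(template_tokens, key=lambda d: cosine(vec[observed], vec[d]))
```

Scoring a full statement multiplies these per-token similarities with the tree-derived transition probabilities, exactly as (7) multiplies s(t_i, d_i) terms in place of P(t_i | d_i).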

IV. CONCLUSION AND ONGOING WORK

This work described initial progress toward a probabilistic framework for translating free-form natural language observations provided by human sensors into recognizable data likelihood functions for efficient online Bayesian human-robot sensor fusion. We proposed a ‘loosely-coupled’ inference approach that conditionally separates the problems of inferring lexical/translation uncertainties and inferring semantic meaning of natural language for state estimation. A key challenge to this end is to derive statistically consistent/conservative estimates of lexical parser uncertainties from existing off-the-shelf probabilistic NLP components (e.g. syntactic parsers, sense-matchers), so that these can be combined with existing multiple hypothesis semantic data fusion techniques. We described a motivating application for our work and a general problem statement for mapping free-form natural language chat data provided by humans into large dictionaries of recognizable semantic soft data. We presented a generic processing pipeline for estimating required lexical parser uncertainties, and a preliminary approach for estimating lexical uncertainties due to sense-matching ambiguities using word2vec models.

Ongoing work includes further analysis and exploration of word2vec and newer related tools, e.g. sense2vec, to generate more context-sensitive lexical parsing probabilities. Although preliminary results are not provided here, we are also investigating how a number of other off-the-shelf NL parsing tools and probabilistic techniques can be used to generate the remainder of the lexical parsing uncertainties in the proposed pipeline, e.g. SpaCy for syntactic parsing (https://spacy.io/docs) and conditional random fields for label tagging [23]. Important future directions include exploring connections to factor graph models such as DCG and HDCG to derive more rigorous estimates for lexical uncertainties, e.g. using staged conditional inference approximations in combination with GM-based semantic fusion updates. Finally, we will explore the rich space of techniques developed within the sensor fusion community for fusing data under unknown information correlations, which are particularly important for coping with unknown dependencies or inconsistencies between off-the-shelf parser components [24]. These also include multi-hypothesis data association methods, which could permit delayed data fusion through sequential scoring of multiple parallel lexical parsing histories over time.

REFERENCES

[1] J. Arkin and T. M. Howard, “Towards learning efficient models for natural language understanding of quantifiable spatial relationships,” in RSS 2015 Workshop on Model Learning for Human-Robot Communication.

[2] T. M. Howard, S. Tellex, and N. Roy, “A natural language planner interface for mobile manipulators,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 6652–6659.

[3] T. M. Howard, I. Chung, O. Propp, M. R. Walter, and N. Roy, “Efficient natural language interfaces for assistive robots,” in IEEE/RSJ Intl Conf. on Intelligent Robots and Systems (IROS) Workshop on Rehabilitation and Assistive Robotics, 2014.

[4] M. R. Walter, S. Hemachandra, B. Homberg, S. Tellex, and S. Teller, “Learning semantic maps from natural


Fig. 8: Example results from the score-based approach to ranking all 2682 possible template statements given a processed unstructured input statement. The top 5 scoring statements for each input phrase are shown.

language descriptions,” in Robotics: Science and Systems, 2013.

[5] S. Tellex, T. Kollar, and S. Dickerson, “Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation,” in AAAI Conference on Artificial Intelligence, 2011, pp. 1507–1514.

[6] T. Kollar, S. Tellex, D. Roy, and N. Roy, “Toward understanding natural language directions,” in Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI ’10), 2010, p. 259.

[7] S. Tellex, R. A. Knepper, A. Li, T. M. Howard, D. Rus, and N. Roy, “Assembling Furniture by Asking for Help from a Human Partner,” in Collaborative Manipulation Workshop at Human-Robot Interaction, 2013.

[8] S. Hemachandra, F. Duvallet, T. M. Howard, N. Roy, A. Stentz, and M. R. Walter, “Learning Models for Following Natural Language Directions in Unknown Environments,” in ICRA, 2015.

[9] K. G. Lore, N. Sweet, K. Kumar, N. Ahmed, and S. Sarkar, “Deep value of information estimators for collaborative human-machine information gathering,” in Proceedings of the ACM/IEEE Seventh Int’l Conf. on Cyber-Physical Systems (ICCPS 2016). ACM, 2016, pp. 80–89.

[10] N. Ahmed, E. Sample, and M. Campbell, “Bayesian Multicategorical Soft Data Fusion for Human–Robot Collaboration,” IEEE Transactions on Robotics, vol. 29, no. 1, pp. 189–206, 2013.

[11] T. Kaupp, A. Makarenko, and H. Durrant-Whyte, “Human–robot communication for collaborative decision making: a probabilistic approach,” Robotics and Autonomous Systems, vol. 58, no. 5, pp. 444–456, 2010.

[12] A. Dani, M. McCourt, J. W. Curtis, and S. Mehta, “Information fusion in human-robot collaboration using neural network representation,” in Systems, Man and Cybernetics (SMC), 2014 IEEE International Conference on. IEEE, 2014, pp. 2114–2120.

[13] J. Frost, A. Harrison, S. Pulman, and P. Newman, “Probabilistic Approach to Modelling Spatial Language with Its Application to Sensor Models,” in Workshop on Computational Models of Spatial Language Interpretation (COSLI-2010) at Spatial Cognition 2010, J. H. R. Ross and J. Kelleher, Eds., Portland, OR, 2010.

[14] D. L. Hall and J. M. Jordan, Human-centered InformationFusion. Artech House, 2010.

[15] D. Wang, L. Kaplan, H. Le, and T. Abdelzaher, “On truth discovery in social sensing: A maximum likelihood estimation approach,” in Proc. of the 11th Int’l Conf. on Information Proc. in Sensor Networks. ACM, 2012, pp. 233–244.

[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 1–9.

[17] S. Ponda, N. Ahmed, B. Luders, E. Sample, T. Hoossainy, D. Shah, M. Campbell, and J. How, “Decentralized information-rich planning and hybrid sensor fusion for uncertainty reduction in human-robot missions,” in 2011 AIAA Guidance, Navigation and Control Conf., August 2011.

[18] N. Sweet and N. Ahmed, “Structured Synthesis and Compression of Semantic Human Sensor Models for Bayesian Estimation,” in American Control Conference (ACC), 2016.

[19] N. Ahmed and N. Sweet, “Softmax modeling of piecewise semantics in arbitrary state spaces for plug and play human-robot sensor fusion,” in RSS 2015 Workshop on Model Learning for Human-Robot Communication.

[20] N. Ahmed and M. Campbell, “Multimodal operator decision models,” in ACC 2008, Seattle, 2008.

[21] A. R. Runnalls, “Kullback-Leibler approach to Gaussian mixture reduction,” Aerospace and Electronic Systems, IEEE Transactions on, vol. 43, no. 3, pp. 989–999, 2007.

[22] I. Miller, M. Campbell, D. Huttenlocher, F.-R. Kline, A. Nathan, S. Lupashin, J. Catlin, B. Schimpf, P. Moran, N. Zych et al., “Team Cornell’s Skynet: Robust perception and planning in an urban environment,” Journal of Field Robotics, vol. 25, no. 8, pp. 493–527, 2008.

[23] J. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in ICML ’01: Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 282–289.

[24] T. Bailey, S. Julier, and G. Agamennoni, “On conservative fusion of information with unknown non-Gaussian dependence,” in 2012 Int’l Conf. on Information Fusion (FUSION 2012). IEEE, 2012, pp. 1876–1883.