Finding the next question: My past, present and future research
Nikolaos Mavridis, MIT Media Lab
A presentation prepared for the PhD proseminar ‘03
Today’s menu
Roughly in historical order:
Welcome: Early background
Appetizers: UCLA, ITI, early projects @ MIT
Main course: Ongoing projects @ MIT: Vision, Mental Models, Resolver
Dessert and farewell: Big picture!
Welcome!
My early background…
Early background: Inspirational figures ;-)
Inspirational figures (p.II)
Sneak preview: Impersonating my inspiration
More on Gyrus!
Early background p.II: Thessaloniki
Sunsets in Athens: Finding the next question!
Teens: ZX Spectrum & soldering irons
Obvious choice: ECE at AUTH / Maths at OU
Appetizers:
UCLA, ITI,
early_projects@MIT
Crossing the Atlantic: UCLA: I saw LA
Control and Signal Processing
IRIS: An automotive IR ranging system
Interlude in Greece: 3D Face Recognition
EU IST HiScore at ITI: A novel affordable SLA depth camera (Siemens)
Benefits for face recognition (also gesture):
Det & fraud / Artificially enriched training / Depth in rec
Papers, and thesis supervision
Other interesting groups at ITI: VR haptics, watermarking etc.
And the second year…
Hellenic Air Force / OU courses / other readings
MIT: Early projects I – Face Categorisation
Roz’s pattern recognition: Face categorisation
Gender, age, race, hat, moustache…
Category choice? Natural modes, highly informative features
A posteriori: W. Richards, S. Ullman
MIT: Early projects II – Grounding Motion
Some words about our group: What is grounding, work so far
Deb’s computational semantics: Grounding motion
Practical goal: Description from a video clip:
> The red ball entered. The red ball collided with the green pole. The green pole kept standing despite the collision. A brick fell on the table.
Theoretical background:
Exploration of the conceptual cluster of motion
Seeking a path of incremental complexity
Related work: Jeff Siskind, Paul Cohen
Main course:
ongoing_projects@MIT:
I. Vision
II. Mental Models
III. Resolver
My talking helper: Ripley the robot!
A 7-DOF conversational robot
Speech synthesis, speech recognition, vision, proprioception, gripping…
An ideal platform for exploring connections between natural language semantics, perception and action
Goal: perform collaborative manipulation tasks mediated by natural spoken dialogue
Trainmot.mov Beanbag.mov measweight.mov
MIT: Ongoing projects I – Ripley’s vision system
Visual analysis expansion
CogRack Vision: Distributed stereo vision
Capture / Segmentation / Face detection / Disparity
Running on 8 processors – 20-30 fps
2D-Objecter: Region permanence & evidence accumulation
Extensions: multi-object pose recognition through integration with Andre Ribeiro’s object vision system.
Main course:
ongoing_projects@MIT:
I. Vision
II. Mental Models
III. Resolver
MIT: Ongoing projects II – Mental models for intelligent agents
3 stages so far:
vRIP: VIRTUAL RIPLEY (fall ’02)
Uncoupled simulation with dynamics
vRip, a virtual version of Ripley, moves, perceives, interacts and in general lives in a virtual world with Newtonian dynamics, where objects can be instantiated and event scenarios created, while the world is viewed through vRip’s eyes and felt through its motors etc.
SIMRIP: A MENTAL MODEL FOR RIPLEY (spring – summer ’03)
Ripley-specific, ODE-dependent coupled simulation
The virtual Ripley simulator was connected to the real Ripley, thus serving as a mental model providing object permanence, visualization of the internalized world from arbitrary viewpoints etc. The model contains structures describing the state of the robot, the objects it is interacting with, and the user; it can incorporate unknown values, and provides a neat interface for language comprehension / generation.
REUSABLE MENTAL MODELS & MENTAL IMAGERY FOR CONVERSATIONAL ROBOTS, INCORPORATING UNCERTAINTY (fall ’03 – ?)
Customisable, ODE-independent, distributed mental model incorporating uncertainty
A reusable, non-Ripley-specific mental model architecture is being developed, which provides a unified description of the world in terms of expandable agent structures, and can account for graded uncertainties in derived properties as well as imaginary instantiations of objects. In the long term, multiple instances of such models will be instantiated, enabling primitive experiments in differing worldviews, “doing by seeing”, intention estimation and simplistic theory of mind. Also, customized versions might serve as the heart of other devices.
MOST APPARENT KEY GAINS SO FAR: Systematic world rep, object permanence, artificial viewpoints
Vrip.mov
Simrip.mov
Mental Models: Early motivation
How are people able to think about things that are not directly accessible to their senses at the moment? What is required for a machine to be able to talk about things that are out of sight, happened in the past, or to view the world through somebody else’s eyes (and mind)? What is the machinery required for the comprehension of a sentence like:
“Give me the green beanbag that was on my left”
Mental Models: Overview
Why mental models?
Architecture
W: The descriptive language
Reusable models
Property description structures
S: Sensory structures
F: Instantiators / Predictors / Reconciliators
Future plans…
Why mental models? (p.I)
Goal: Provide an intermediate representation, mediating between perception and language
In essence:
an internalized representation of the state of the world as best known so far, in a form convenient for “hooking up” language (shown below: the revisualisation of the representation)
and a set of methods for updating this representation given further relevant sensory data, and predicting future states in the absence of such data
Why mental models? (p.II)
But also:
A useful decomposition of a complex problem, suggesting a practical engineering methodology with reusable components, as well as a theoretical framework
A unified platform for the instantiation of hypothetical scenarios (useful for planning, instantiation of situations communicated through language etc.) (prereq: uncertainty)
A starting point for experimental simulations of:
Multi-agent systems with differing partial world knowledge or model structure
Primitive versions of theory of mind by incorporating the estimated world models of other agents (prereq: action recognition)
Learning parameters or structures of the architectures, and experimenting with learned vs. innate (predesigned) tradeoffs (for example, learning predictive dynamics, senses-to-model maps, language-to-model maps etc.)
Notation & Formalities
D = {W, S, F}
D = {W, S, F}: A dynamical mental model
W: Mental Model State
W[t]: state of the mental model at time t
W: the structure of the state (the chosen descriptive language for the world, ontology). Decompositions might be hierarchical.
W = {O1, O2, …} into Objects/Relations (creations/deletions crucial)
Oi = {P1, P2, …} into Properties (updates of contents but usually no creations/deletions)
S: Sensory Input:
S[t], S; S = {I1, I2, …} (Modalities/Sensors)
F: Update / Prediction function
W[t+1] = F( W[t], S[t] ) as a dynamical system
F is a two-argument update / prediction function
A decomposition: (…also Wh[t]: hypotheticals)
(W[t], S[t]) -> Ws[t] (sensory-driven changes in W form)
W[t] -> Wp[t] (prediction-driven changes in W form)
W[t+1] = R(Ws[t], Wp[t]) (the “reconciliation” function)
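To make the decomposition concrete, here is a minimal sketch of one step of the dynamical system W[t+1] = R(Ws[t], Wp[t]). This is an illustration only, not the actual mental_model.exe code: reducing the state to a flat map of named scalar properties, and all type and function names, are assumptions.

```cpp
#include <map>
#include <string>

// W reduced to a flat map from property name to value (illustrative only).
using WorldState = std::map<std::string, double>;
using Senses     = std::map<std::string, double>;  // observed properties only

// Sensory-driven changes: (W[t], S[t]) -> Ws[t]
WorldState sensory_update(WorldState w, const Senses& s) {
    for (const auto& kv : s) w[kv.first] = kv.second;
    return w;
}

// Prediction-driven changes: W[t] -> Wp[t] (here: trivial persistence)
WorldState predict(const WorldState& w) { return w; }

// Reconciliation R: trust a property that was sensed this step,
// otherwise fall back on the prediction.
WorldState reconcile(const WorldState& ws, const WorldState& wp,
                     const Senses& s) {
    WorldState next = wp;
    for (const auto& kv : s) next[kv.first] = ws.at(kv.first);
    return next;
}

// One full step: W[t+1] = F(W[t], S[t]) = R(Ws[t], Wp[t])
WorldState step(const WorldState& w, const Senses& s) {
    return reconcile(sensory_update(w, s), predict(w), s);
}
```

Even in this toy form, a property sensed at time t persists through later steps without new evidence: object permanence in miniature.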
Block Diagram (& sync issues!)
(Also: history/event block and wider overview of Ripley: planner etc.)
[Block diagram – “Mental Models: Ripley’s case”, preliminary block diagram, Sept ’03, Nikolaos Mavridis, MIT Media Lab. Components: Mental Model & Reconciliator (mental_model.exe: W[t] and F); Modality-specific Instantiators (visor.exe etc.: (W[t], S[t]) -> Ws[t]); Virtual Object Instantiator (imaginer.exe: (W[t], H[t]) -> Wh[t]); Dynamics Predictor (predictor.exe: Wp[t]); Visualiser (visualiser.exe); Senses (S[t]); Hypothesis Generation; Visual Feature Analysis; Language Understanding; viewpoint selection (bishop).]
W: the descriptive language
W in a conversational setting: include me, you, others
EVERYTHING POTENTIALLY ANIMATE!
Indexing: Internal & External IDs, continuity, signatures
Bottom-up:
Simple_object (e.g. a cylinder) = geom + body + appearance = dynamics + statics + drawables
Object_relation (binary) (e.g. hinge joint, contact)
Compound_object = SimpleObjectMap U ObjectRelationMap
Agent = Compound_object U Viewpoint U Gripper U Mover?
Agent_relation (e.g. inter-agent joints, visibility?)
Compound_agent = AgentMap U AgentRelationMap = World
Basic properties: in simple_object, object_relation
Absolute properties encoded / Apparent properties reconstructed!
Property description structures (fixed, with confidences, with stochastic model, observational history, categorical form) – later
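The bottom-up decomposition above can be sketched as nested structs. This is purely illustrative: these are not the actual libmyobjects types, and all member names are assumptions.

```cpp
#include <map>
#include <string>
#include <vector>

struct SimpleObject {            // = geom + body + appearance
    std::string geom;            // shape primitive, e.g. "cylinder"
    double mass;                 // body (dynamics / statics)
    std::string appearance;      // drawable, e.g. "red"
};
struct ObjectRelation {          // binary, e.g. hinge joint, contact
    std::string type;
    int a, b;                    // internal IDs of the related objects
};
struct CompoundObject {          // SimpleObjectMap U ObjectRelationMap
    std::map<int, SimpleObject> objects;
    std::vector<ObjectRelation> relations;
};
struct Agent : CompoundObject {  // + viewpoint, gripper, mover?
    std::string viewpoint;
};
struct WorldModel {              // compound agent = AgentMap (+ relations)
    std::map<int, Agent> agents;
};

// A tiny example world: one agent associated with one red cylinder.
WorldModel example_world() {
    WorldModel w;
    Agent a;
    a.objects[0] = SimpleObject{"cylinder", 0.2, "red"};
    w.agents[1] = a;
    return w;
}
```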
More on the structures…
myObjects:
Packaging: A dozen .h/.cpp files made into libmyobjects.a (+ some utils); include “myobjects.h”
Expand/rethink types of relations!
Think about joint (and relation!) recognition (Mann…)
myModels: ready-made models for specific agents (Ripley, human, environment)… packaged in libmymodels… Expand!!!
These include parameter sets for customization, creation and deletion functions (as well as sensory update functions?). OuterIDs and body parts?
myObjectsODE: ODE-supplemented version for the predictor
Property description structures
The near future:

class property_conf {
    string name;
    double value;
    double confidence;
};

How to deal with ints/doubles and vectors?
How to update conf? Decrease with time?
4-tier structure:

class property_4 {
    categorical_descr c;  // variable granularity, context-sensitive boundaries
    property_conf ml;
    stoch_descr distrib;
    relevant_sensory_history senspointers;
};
Advantages:
Confidences vital for incomplete knowledge / information-driven sensing
Homogenisation very useful for later experiments in feature selection etc.
Also: confidence vs. variance
Predictions & natural laws at different categorical granularities
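One possible answer to “how to update conf? decrease with time?”, sketched around the property_conf structure above. The exponential-decay rule and the rate parameter lambda are assumptions for illustration, not the system’s actual update policy.

```cpp
#include <cmath>
#include <string>

struct PropertyConf {       // mirrors the property_conf sketch above
    std::string name;
    double value;
    double confidence;      // in [0, 1]
};

// Confidence decays exponentially after dt seconds without new evidence.
PropertyConf decay(PropertyConf p, double dt, double lambda = 0.1) {
    p.confidence *= std::exp(-lambda * dt);
    return p;
}

// A fresh measurement replaces the value and restores confidence
// to the sensor's own confidence.
PropertyConf observe(PropertyConf p, double value, double sensor_conf) {
    p.value = value;
    p.confidence = sensor_conf;
    return p;
}
```

Stale properties thus fade gracefully instead of being deleted, which is exactly what information-driven sensing needs: the lowest-confidence property is the natural next thing to look at.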
S: the sensory structures
Vision:
Objectworld from the 2D Objecter
Extensions for 3D? Shape models? Partial view integration and the instantiator?
Proprioception:
JointAnglesPacket from Ripley’s control
Weight measurements
Direct access to force feedback?
Switching from continuous to on-demand feeding of new information (e.g. lookat() etc.) – connections to attention, planning etc.
F: update/prediction function
I. Instantiators
Modality-specific instantiators (& updators/destructors):
Send create/update/delete packets to mental_model
They SHOULD know the previous world state
Are they modality- or agent-specific? MODALITY! Should the generic agent models include specific sensory update functions?
Virtual object instantiator: (R/AR/VR gradation)
Sometimes also used for creation of sensory-updated agents (e.g. self) – boundaries?
Linguistically Augmented sensory Reality: Integration of linguistic & sensory evidence for the same or different objects, true or imaginary
What would the clients need? Let’s choose an API
F: update/prediction function
II. Predictor & Reconciliator
Prediction rules:
Collision detection (collisions as object_relations)
Dynamics (reconciliation with senses, inference of internal forces… ANIMACY DETECTION!)
Out-of-bounds deletions & object stabilisation
Reconciliation:
How to resolve conflicts between sensed, predicted and requested? (think: multiple sensors in a car)
Simplistic: When no other info, use prediction. Else, blend senses with prediction?
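The blending rule suggested above can be made concrete for a single scalar property. This is a sketch: the linear confidence-weighted blend is one simple choice (essentially a one-dimensional Kalman-style update), and the function name and weighting are assumptions.

```cpp
#include <optional>

// When a sensed value is available, mix it with the prediction in
// proportion to the sensor's confidence; with no other info, use
// the prediction alone.
double reconcile_value(double predicted,
                       std::optional<double> sensed,
                       double sense_conf /* in [0, 1] */) {
    if (!sensed) return predicted;
    return sense_conf * (*sensed) + (1.0 - sense_conf) * predicted;
}
```

At sense_conf = 1 the senses win outright; at 0 the predictor does; intermediate values resolve the multiple-sensors-in-a-car style conflict smoothly.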
Future plans I: Richer shapes!
Prerequisites:
new vision system
expansion of structures
uncertainty encoding (e.g. multi-view shape)
Remember, each object consists of: Body + geom + drawable = all have to be taken into account
Recognising shape vs. recognising object: 3D scanner etc.
Also related to: more complicated naïve physics and geometry – “the ball was in the box” etc.
Future plans II: Imaginary scenes
IR/AR/VR mix: lAsR – linguistically Augmented sensory Reality!
E.g. I see a red ball; you told me it’s heavy
Prerequisites:
uncertainty encoding
Verbal-description-to-world-state parser
VITAL for:
Converting the desired final state of the world from a verbal description to a testable condition when given a command, e.g.
“I want the coffee on the table”“Give me a cup”
Understanding narratives:
“John was in his bed, when the phone rang”
A clean internally encoded version of the above
Planning through simulation
Future plans III: Multiagent systems
Prerequisites:
Action recognition across agents (not a strict prereq)
Thus, useful to start by embedding everything in a virtual world wrapper, and cheating on action recognition
Also, mixed real/virtual agents (Ripley conversing with a non-existent friend)
Benefits:
Systematic external examination of the effects of different partial world knowledge or structure/methods of mental models (i.e. contents & form of the MM), or even different sensory organs.
For example, differing categorical boundaries and negotiated alignment (methods difference, i.e. the update/prediction function)
Prerequisite for Theory of Mind!
First preliminary examples: Ashwani’s demo for viewpoint-dependent description generation (using the generic MM)
Future plans IV: Theory of mind
Now, each agent’s MM also contains an estimated mental model of each other agent as part of their descriptions…
Prerequisites:
Uncertainty
Multi-agent models
Action recognition across agents (a strict prereq now! + gaze)
Benefits:
Start playing with intention through action recognition
Interesting coupling with inferred goals etc.
“Mind reading” is an immense area for experimentation!
Collaborative tasks
Future plans V: Innate vs. learnt
Now that we have a clean architecture to start with, how about learning parameters or structures of the architecture, and experimenting with learned vs. innate (predesigned or evolved) tradeoffs?
Examples:
Learning predictive dynamics
Where do I expect the object to be? Learning “empirical” Newtonian mechanics
Learning senses-to-model maps
Which property of which object does this sensory signal inform me about, and how do its contents alter the property?
Learning language-to-model maps (example: Deb’s thesis)
Which property of which object does this utterance inform me about, and how does it alter the property?
Learning mental model structures
Which properties should my object descriptions contain? How can I get an empirical derivation of 3D position as a crucial non-apparent property of an object?
Concatenating parts at the input-output equivalence level
Forget about all the internalised fuss. Can I get an equivalent structure without postulating and enforcing the exact architecture?
In essence: How arbitrary is everything that was hardcoded? Are some things redundant? Can they be learnt? If so, how?
FINALLY, FOR ALL PREVIOUSLY STATED FUTURE PLANS: Relation with how humans perform (cognitive modeling) - categorical level
MIT: Ongoing projects II – Ripley’s mental model
Recap:
Vital structure… also reusable
Clear definitions, formalisms and architectural design important
Crucial experimental infrastructure for many future ideas, providing neat integration as well as a common platform – the reward being simulated computational models instead of vague words
Lots of hidden details to be taken care of (but also often providing useful theoretical insights), technicalities dealing with edge effects, alignment, and quite painful coding… (at least for me!)
Nevertheless, vital!
Main course:
ongoing_projects@MIT:
I. Vision
II. Mental Models
III. Resolver
MIT: Ongoing projects III – Resolver: To ask or to sense?
Resolver: Selecting and mixing questions with sensing actions towards referent resolution
For machines to speak with humans, they must at times resolve ambiguities. Imagine having a conversational robot which is able to carry out sensing actions in order to collect more data about its world, for example through active visual attention and touch. Suppose it is also able to gain new information linguistically by asking its human partner questions. Each kind of action, sensing and speech, has associated costs and expected payoffs. Resolver is a planning algorithm that treats these actions in a common framework, enabling such a robot to integrate both kinds of action into coherent behavior, taking into account their costs and expected goal-oriented information-theoretic rewards.
Early motivation: Ripley’s primitive ambiguity resolution dialogue system
Similar information-theoretic / utilitarian frame of thought:
E. Horvitz (Microsoft), A. Gorin (AT&T)
Wider picture: Language–action parallels (speech act theory, also mirror neurons etc.)
FINDING THE NEXT QUESTION!
Resolver: Overview
The problem
The program
The algorithm
Performance evaluation
Potential as cognitive model
Extensions
Other applications:
Parallel theory refinement/experiment selection in science
The problem
Imagine the following scenario:
A human user and a robot are sitting around a table, where some objects have been placed. The human user has selected one of the objects on the table, and asks the robot to give it to him. The robot has not yet attended to the objects. What should the robot’s next moves be? Should it attend to the color of the first object, and then to the sizes of all? Should it attempt to weigh an object? Or should it ask for further information, for example if the desired object is red? Slight variation: the user provides an ambiguous partial description in his request.
IN ESSENCE: Active matching under double uncertainty: for the desired target as well as the options available
The program: Initial state
4 Modes: Virtual world standalone (self-answering) / Virtual world text I/O / Virtual world speech I/O / Full mental-model and ripley connectivity
The program: Intermediate state
After: “The heavy one” - “Is it small? No” - measuresize1-3 - “Is it medium?”
The program: Final state
After: “Is it medium? Yes” – “Is it black? No” – “Is it magenta? Yes”Note cost breakdown. Costs might be given by master planner (tired, curious…)
The algorithm: Assumptions
Assumption families:
The objects and the intended referent
  Nobj a priori known, unbiased choice
Measurements and descriptions of the objects
  Properties, senses->prop, words->prop (me & user), referent uniqueness up to linguistic description
State and gradation of uncertainty
  Contents of state (I, O, moves), initial state, full confidence in senses/answers (unchanging, unbiased by construction, cooperative user/hearing and nature/senses)
Priors on a solitary object
  “Proximal” sensory natural modes, & their linguistic reflection
Priors on the set of objects and the intended referent
  I belongs to O: interdependence; I unique in O: U interdependence
Allowable actions
  Q1: “Is it red?” / Q2: “What color is it?” / Q3: “Is it this one?”
  A1: Measure property of one O / A2: Measure property of all O
The algorithm: Stages
Stages:
State at each moment
  Encoding distributions
Effect of answers and sensory results on the state as a whole
  I-O and O-O interdependence
Evaluation of present state
  Calculating prob(I=Oi)
Choosing the next move
  Expected entropy reward of consistent answers
  Approximation and computational tractability (underlying state of world and answer)
The effect of different cost settings
  Change ordering – choose dominance / Q1-Q2, A1-A2, Q-A
Fusion of expected information gain with associated costs
  Requirements for function
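The expected entropy reward and its fusion with costs can be sketched as follows for a yes/no question. This is an illustration under simplifying assumptions stated above (full confidence in answers), with a question characterized only by the subset of candidate referents consistent with “yes”; the function names are mine, not Resolver’s.

```cpp
#include <cmath>
#include <vector>

// Shannon entropy (in bits) of a discrete distribution.
double entropy(const std::vector<double>& p) {
    double h = 0.0;
    for (double pi : p)
        if (pi > 0.0) h -= pi * std::log2(pi);
    return h;
}

// prior[i] = prob(I = Oi); yes[i] = answer "yes" is consistent with Oi.
// Returns the expected posterior entropy over referents after asking,
// weighted by the probability of each possible answer.
double expected_entropy_after(const std::vector<double>& prior,
                              const std::vector<bool>& yes) {
    double p_yes = 0.0;
    for (std::size_t i = 0; i < prior.size(); ++i)
        if (yes[i]) p_yes += prior[i];
    double h = 0.0;
    for (int ans = 0; ans < 2; ++ans) {
        double p_ans = ans ? p_yes : 1.0 - p_yes;
        if (p_ans == 0.0) continue;
        std::vector<double> post;  // renormalized over consistent objects
        for (std::size_t i = 0; i < prior.size(); ++i)
            if (yes[i] == static_cast<bool>(ans))
                post.push_back(prior[i] / p_ans);
        h += p_ans * entropy(post);
    }
    return h;
}

// Fusion: expected information gain minus the action's cost.
double move_utility(const std::vector<double>& prior,
                    const std::vector<bool>& yes, double cost) {
    return entropy(prior) - expected_entropy_after(prior, yes) - cost;
}
```

With a uniform prior over four objects, a question that splits them 2/2 yields an expected gain of one bit; move selection would then pick the question or sensing action maximizing such a cost-discounted utility.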
Performance evaluation & potential as cognitive model
Performance evaluation:
Quantitative:
  2 baselines so far (random non-repetitive, consistent)
  Metrics: σ, µ of nmoves, and Σcost
  20-25% better (parameters have an effect!)
  natural modes help even more!
Qualitative:
  Subjective evaluation of robot behavior for specific cost settings
Potential as cognitive model:
  Tunable generative model (play with costs!)
  Acquiring experimental human data
    fixations, saccades, words: relative-position-sensitive costs
  Input-output equivalence vs. inner workings
  One step ahead
  Cost-reward fusion
  Artificial setting but general applicability (non-situational context etc.)
Future plans: Extensions
Relax assumptions:
  Encoding distributions
  Pruning by clustering and a satisficing acceptability threshold
  Unknown starting parameters (nobj etc.)
  Non-cooperative user/nature
  Imperfect linguistic/sensory channel (FUSE)
  Other molecular action combinations
  Multi-view object/shape identification sensory actions
  Multi-step, non-approximate (another baseline)
Other applications
In essence:
Active matching under double uncertainty: for the desired target as well as the options available. Thus, to apply, just choose an interpretation of the structures involved!
Parallel theory refinement / experiment selection in science
A number of groups of theoreticians are constructing theories in order to explain a phenomenon. The extension of these theories to wider domains of validity is a costly process. But so is the setting up of experiments in order to verify the applicability of their predictions to various domains.
Consider the identification of:
Sensory property dimensions with application domains of the theories
Questions for a property dimension with experiments in an application domain
Answers with experimental results of the above questions
Sensory actions with theoretical work towards extending theories to a domain
And sensory data with the theoretical predictions which are the outcome of the above work
Thus: The user is now identified with nature; nature is questioned by experiments, and answers in the form of experimental results (or freely collected data in a domain). The table previously filled with objects now corresponds to a part of the platonic universe; a subset of the set of possible theories is on the table. One can either examine nature, or examine possible theories, in order to reach a (hopefully somewhat permanent) temporary best match*. And thus science marches on, hopefully with resources better targeted towards more vital experiments and theoretical groups.
MIT: Ongoing projects III – Resolver: To ask or to sense?
Recap:
We started by wanting to expand Ripley’s ambiguity resolution dialogues
Created a general algorithm for active matching under uncertainty
It performs well for the original task
Interesting theoretical points have arisen
Also attractive as a cognitive model
Many possible extensions…
Many alternative applications!
Dessert and farewell:
The Big Picture!
The ultimate goal:
Let’s make Ripley and his brothers* more fun to talk to and interact with!
And let’s learn more about us on the way…
*Anything from lux the conversational lighting system to intelligent cars to whatever!
Contributions so far: In 2.5 semesters plus summer:
Better vision (distributed, face recognition)
Mental model: Ripley’s world model
Resolver: mixing dialogue & sensing
Publications while @ MIT:
D. Roy, K. Hsiao and N. Mavridis, “Conversational Robots: Building Blocks for Grounding Word Meanings”, NAACL HLT 2003 conference
K. Hsiao, N. Mavridis and D. Roy, “Coupling Perception and Simulation: Steps Towards Conversational Robotics”, IEEE IROS 2003 conference
D. Roy, K. Hsiao and N. Mavridis, “Mental Imagery for a Conversational Robot”, IEEE Systems, Man & Cybernetics Part B (accepted)
Publications in preparation:
Resolver: To ask or to sense? (UAI conference)
Natural modes and fragments in face categorisation (?)
The future…
So many obvious next steps, so many open pathways readily accessible, so many open questions unanswered!
But also so much dirty work to do…
Wider picture:
Apart from language, I am quite interested in modeling and formalising the scientific process, understanding the continuum between children, folk science and official science, and formalising the arguments for the difference or superiority of science versus its alternatives.
In practical terms: Probably continue along the obvious lines (as heavily dictated by the group’s needs at the moment), and then narrow down and deepen on something…
In conclusion…
(Apologies for the lengthy presentation as well as my horrible sketches ;-))
One can never underestimate the importance and joy of
Finding the next question!