LSpeakIt: Immersive Interface for 3D Object Search
Diogo Gil Vieira Henriques
Instituto Superior Técnico
ABSTRACT
The number of available three-dimensional digital objects has been increasing considerably. As a result, retrieving such objects from large collections has been the subject of research. However, proposed solutions do not explore natural interactions in their interfaces. In this work, we propose a speech interface for 3D object retrieval in immersive virtual environments. For our prototype, LSpeakIt, the context of LEGO blocks was used as a toy problem. To understand how people naturally describe objects, we conducted a preliminary study and found that participants mainly resorted to verbal descriptions. Considering these descriptions, we developed our search-by-speech system. Taking advantage of a low-cost visualization device, the Oculus Rift, we implemented four modes for immersive visualization of query results. These modes were evaluated by users, being compared against each other and against a traditional approach. Users favored the immersive modes, despite being more acquainted with the traditional approach. For our final prototype, a fifth visualization mode was implemented, adapting the users' preferred modes. We compared the proposed solution with the LEGO Digital Designer commercial application. Results suggest that LSpeakIt can outperform its competitor, providing users with a simple and natural way of searching for virtual objects, while ensuring better performance and result perception than traditional approaches to 3D object retrieval.
Author Keywords
Immersive interface, Voice Search, 3D Objects, Retrieval
INTRODUCTION
Virtual environments in computer entertainment are becoming ever more immersive. Indeed, from CAVE-like setups to head-mounted displays, such as the Oculus Rift1 or the recently announced Project Morpheus by Sony, these immersive environments are becoming increasingly common. Several fields can take advantage of this, placing the user inside the virtual world in a more credible way than traditional setups and offering more natural interactions with the virtual content. One possible application for this kind of user
1Oculus Rift: http://www.oculusvr.com/rift/
Figure 1. User interacting with LSpeakIt prototype.
immersion is the creation of virtual scenarios, which allow the user to place virtual objects mimicking physical-world interactions.
However, when facing the challenge of selecting objects in a collection, traditional solutions based on lists and grids of thumbnails may not be feasible, due to changes in the interaction paradigm. Moreover, it has already been shown that immersive environments can enhance the exploration of retrieval results over traditional approaches. In this work, we focus on naturally navigating large collections of objects in immersive environments, in order to select a specific one. For this purpose, we use the LEGO building blocks scenario, which has already proven to be a good test bed for research, leading to potentially interesting entertainment applications [22, 15, 16, 5]. We developed a system in which users can search, select, explore and place 3D LEGO blocks through multimodal interactions in a fully immersive 3D environment, as depicted in Figure 1.
In the remainder of the paper, we discuss the state of the art in multimedia content searching, focusing on three-dimensional virtual objects. We then present a preliminary study we conducted to understand how people naturally describe physical LEGO blocks, along with its results. Afterwards, our prototype is described, followed by an evaluation comparing it against a commercial application. Finally, we lay out our conclusions and point out some directions for future work.
RELATED WORK
With the increase of available objects of any type, retrieving specific information presents a challenge, and three-dimensional objects, like any other type of multimedia content, are no exception. One of the traditional ways to perform retrieval consists of using textual queries. However, this
method is not trivial, considering that the objects do not usually contain sufficient intrinsic information. For instance, the file names may not even be related to the objects [23, 3]. Generically, search engines often use text associated with objects, such as captions, references to them or, for content scattered over the Internet, even links or file names. This concept has already been applied to image retrieval and 3D object retrieval. In the work of Funkhouser et al., synonyms of words taken from the texts are also used to increase the information describing the objects and resolve vocabulary mismatch.
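The text-with-synonyms idea can be sketched as a simple keyword search whose query terms are expanded before matching. This is only an illustrative sketch, not the cited system: the synonym table and the object descriptions below are invented for the example.

```python
# Illustrative text-based retrieval with synonym expansion.
# SYNONYMS and the document texts are made-up examples.
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "plane": {"airplane", "aircraft"},
}

def expand(terms):
    """Add known synonyms to a set of query terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded

def score(query_terms, doc_text):
    """Count how many (expanded) query terms the associated text contains."""
    return len(expand(query_terms) & set(doc_text.lower().split()))

def search(query, docs):
    """Rank objects by overlap between the expanded query and their text."""
    terms = set(query.lower().split())
    ranked = sorted(docs, key=lambda name: score(terms, docs[name]), reverse=True)
    return [name for name in ranked if score(terms, docs[name]) > 0]

docs = {
    "model_a.obj": "red sports automobile with spoiler",
    "model_b.obj": "small propeller aircraft",
}
print(search("car", docs))  # matches model_a via the synonym "automobile"
```

Without the expansion step, the query "car" would match nothing, which is exactly the vocabulary mismatch the synonym table is meant to resolve.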
Despite all these solutions, the information available to describe a 3D object is still insufficient, especially regarding its shape. Some of the proposed solutions use query-by-example to ease this process. The goal of search by example is to obtain objects that are similar in visual aspects such as color or shape. However, this solution requires users to already have an object identical to the one they are searching for, which is usually not the case.
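A common way to realize query-by-example, sketched below under assumed simplifications, is to reduce each object to a feature vector (for instance a color histogram or a shape descriptor) and rank candidates by cosine similarity to the example object's vector. The feature values here are invented for illustration.

```python
# Minimal query-by-example sketch: rank objects by cosine similarity
# between feature vectors. Vectors below are made-up illustrations.
import math

def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_example(query_vec, collection):
    """Return object names sorted by descending similarity to the example."""
    return sorted(collection,
                  key=lambda name: cosine(query_vec, collection[name]),
                  reverse=True)

collection = {
    "brick_2x4": [0.9, 0.1, 0.4],
    "brick_1x1": [0.1, 0.8, 0.2],
}
print(rank_by_example([0.8, 0.2, 0.5], collection))
```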
Retrieval by sketching is one way of addressing this problem, offering users the possibility of searching for objects similar to their sketches, for which they do not have a model to serve as an example. In the work of Santos et al., users can make sketches that match the dimensions of a desired LEGO block. Funkhouser et al. also proposed a method of sketching several 2D views of the model. In a different approach, Liu et al. attempted to improve search by sketches by taking into account the user profile, i.e. the drawing habits of users. This led to improved results as users performed more searches.
Holz and Wilson followed a different approach in order to apply a method often used to describe physical objects. In their work, the authors focused on recognizing descriptions of three-dimensional objects through gestures. This work consisted of capturing and interpreting gestures and exploring the spatial perception of the users. The shape and movement of the users' hands when describing the objects are used to create a three-dimensional shape by filling the voxels which the users' hands crossed. The authors concluded that participants were able to keep the correct proportions relative to the physical objects and that, in more detailed areas, users performed gestures more slowly.
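The voxel-filling step can be sketched as follows: sampled hand positions are quantized into a voxel grid, and every voxel the hand passes through is marked as part of the described shape. The grid resolution and trajectory points are arbitrary choices for illustration, not values from the cited work.

```python
# Rough sketch of filling voxels crossed by sampled hand positions.
# VOXEL_SIZE and the trajectory are assumptions for illustration.
VOXEL_SIZE = 0.05  # 5 cm voxels

def to_voxel(point):
    """Quantize a 3D position (in metres) to integer voxel coordinates."""
    return tuple(int(c // VOXEL_SIZE) for c in point)

def fill_voxels(hand_samples):
    """Mark every voxel crossed by the sampled hand trajectory."""
    return {to_voxel(p) for p in hand_samples}

# A short, made-up trajectory sweeping along the x axis.
trajectory = [(0.00, 0.0, 0.0), (0.04, 0.0, 0.0), (0.09, 0.0, 0.0)]
shape = fill_voxels(trajectory)
print(sorted(shape))  # three samples fall into two distinct voxels
```

In a full system, consecutive samples would be interpolated so that fast hand movements do not leave gaps between marked voxels.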
However, these works mainly present results relying on the traditional approach of thumbnail lists. Nakazato et al. presented 3D MARS, which demonstrates the benefits of using immersive environments for multimedia retrieval. Their work focused mainly on the presentation of query results for a content-based image retrieval system, using a CAVE-like setup. Extending the 3D MARS approach to 3D objects, Pascoal et al. showed that some challenges can be overcome when presenting 3D object retrieval results in this kind of environment. In these works, results are distributed in the virtual space according to the similarity between them. Users can then explore the results by navigating in the immersive environment, which is seen through a head-mounted display. Their system also supports a diversified set of visualization and interaction devices, which were used to test multiple interaction paradigms for 3D object retrieval.
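One simple way to sketch such a similarity-driven layout is to place more similar results closer to the viewer, at evenly spaced angles around them. The cited systems use more elaborate placement schemes; the similarity scores below are invented for illustration.

```python
# Toy similarity-based layout: more similar results are placed nearer
# the viewer. Names and similarity scores are made-up examples.
import math

def layout(results):
    """Map {name: similarity in [0, 1]} to 2D positions around the viewer."""
    positions = {}
    n = len(results)
    for i, (name, sim) in enumerate(sorted(results.items(), key=lambda kv: -kv[1])):
        angle = 2 * math.pi * i / n
        radius = 1.0 + 4.0 * (1.0 - sim)  # similar objects end up nearer
        positions[name] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions

pos = layout({"chair_a": 0.9, "chair_b": 0.5, "table_c": 0.2})
print(pos)
```

Spreading results over angles avoids the overlap that purely similarity-driven placement can produce when two results are nearly identical.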
In the area of information retrieval, we have recently witnessed the widespread adoption of search-by-voice on mobile devices. This new possibility has led to a preference for this search method over traditional search-by-text. In most cases, when searching by voice, the speech is converted to text, which is then used as a search parameter [25, 12]. More recently, Lee and Kawahara performed a semantic analysis of speech queries used to search for books, achieving a greater understanding of what the user desires to retrieve.
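The common speech-then-text pipeline can be sketched in a few lines. Here `transcribe` is a hypothetical stand-in stub; a real system would invoke a speech recognizer at that step, and the catalogue entries are invented examples.

```python
# Hedged sketch of the search-by-voice pipeline: speech is transcribed
# to text, then the text drives an ordinary keyword search.

def transcribe(audio):
    """Placeholder for a speech recognizer; returns the spoken words."""
    return audio["spoken_text"]  # hypothetical pre-transcribed fixture

def text_search(query, catalogue):
    """Return objects whose description shares a term with the query."""
    terms = set(query.lower().split())
    return [name for name, desc in catalogue.items()
            if terms & set(desc.lower().split())]

def search_by_voice(audio, catalogue):
    return text_search(transcribe(audio), catalogue)

catalogue = {
    "brick_red_2x4": "red brick two by four",
    "plate_green_8x8": "green plate eight by eight",
}
print(search_by_voice({"spoken_text": "red brick"}, catalogue))
```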
The retrieval of multimedia content, and of three-dimensional objects in particular, has been the subject of previous research. Most solutions, although they have started to explore natural methods for describing objects, such as mid-air gestures, do not yet conveniently explore the descriptive power of natural human interaction. In other areas, verbal descriptions are already being used for retrieving content, but they have not yet been applied to 3D objects or complemented with other natural descriptive methods. For viewing retrieved results, some solutions provide immersive environments. Nevertheless, some of the results may appear overlapped if they are too similar. Traditional approaches do not overlap results, but their thumbnail-based visualization lacks an adequate 3D representation of objects and is not suitable for interaction in immersive setups.
PRELIMINARY STUDY
To understand which methods are most natural for people to describe LEGO blocks, we conducted an experiment involving ten pairs of participants. In each pair, one participant had to request specific blocks from the other participant in order to build a model. Each participant performed both roles: once as builder and once as supplier. After a preliminary introduction, the builder was given step-by-step instructions to assemble a model, and the supplier was given a box of blocks containing more than those needed to complete the model. A small barrier between the two participants prevented the builder from seeing the box and the supplier from seeing the instructions, but they could see each other's faces and hand gestures.
The experiment used four different models, composed of different blocks but with similar geometric complexity. The instructions for each model included 20 steps. Figure 2 illustrates some of the steps of one of t