



Kinect based 3D Scene Reconstruction

Niclas Zeller, HS Karlsruhe, Moltkestrasse 30, 76133 Karlsruhe, Germany, niclas.zeller@hs-karlsruhe.de
Franz Quint, HS Karlsruhe, Moltkestrasse 30, 76133 Karlsruhe, Germany, franz.quint@hs-karlsruhe.de
Ling Guan, Ryerson University, 350 Victoria Street, Toronto, M5B 2K3, Canada, [email protected]

ABSTRACT
This paper presents a novel system for 3D scene reconstruction and obstacle detection for visually impaired people, which is based on Microsoft Kinect. From the depth image of Kinect a 3D point cloud is calculated. Using both the depth image and the point cloud, a gradient and RANSAC based plane segmentation algorithm is applied. After the segmentation the planes are combined into objects based on their intersecting edges. For each object a cuboid shaped bounding box is calculated. The accuracy of the presented system is evaluated experimentally. The achieved accuracy is in the range of a few centimeters and thus sufficient for obstacle detection. In addition, the paper gives an overview of existing navigation aids for visually impaired people, and the presented system is compared to a state of the art system.

Keywords
3D scene reconstruction, electronic travel aid, Microsoft Kinect, obstacle detection, RANSAC plane segmentation, visually impaired

1 INTRODUCTION

For safe traveling in known and unknown environments, blind as well as visually impaired people depend on travel assistance devices. Until now the white cane and the guide dog are the most popular ones. Nevertheless, neither the white cane nor the dog is able to detect suspended objects hanging at head height in front of the visually impaired person. Besides, the range, especially of a white cane, is limited to one to two meters, and a dog often cannot be used indoors. Alternatively, or in support of these conventional devices, electronic navigation devices can be used. In this paper we are especially interested in so-called electronic travel aids (ETAs), which are devices that do not require any additional infrastructure like GPS and man-made landmarks or prior knowledge about the environment. Other systems can be categorized as electronic orientation aids (EOAs) and position locator devices (PLDs). EOAs provide information to reach a certain destination. This can be offline information (e.g. the floor plan of a building) as well as online information (e.g. a


continuous tracking of the position within a given map). PLDs are used for positioning (e.g. GPS) and thus often are applied within an EOA.

The main goal of an ETA is to warn a visually impaired person about upcoming obstacles and impart perception of her or his surroundings. Even though there exist plenty of assistance devices, most of them are not well accepted by the community of the blind and visually impaired. Most of the existing ETAs warn the user only about obstacles right in front of her or him but do not give any further perception of the person's environment. Other camera based systems try to impart perception of the scene but tend to overwhelm the user's senses by conveying redundant information. Section 2 gives a short overview of existing systems and research projects.

The idea of this paper is to develop a system which records the blind person's environment and reduces the recording to a minimum amount of significant information. To record the environment, Microsoft Kinect is used. The advantages of Kinect compared to other camera systems are its low price as well as its extensive software development kit (SDK). Thus, Kinect makes it possible to develop a low priced prototype system without a long familiarization period, making it ideal for rapid prototyping.

Based on the depth image of Kinect, an algorithm is developed which detects obstacles in a scene and models them by a small number of cuboid objects. The


information needed to describe these cuboids can easily be modulated onto stereo audio signals, for example.

Although the algorithm is developed based on the depth image captured by Kinect, it can be applied to any other depth image. Thus, the presented system is ideal to verify the feasibility of the proposed idea, and the presented algorithm can easily be combined with other depth camera systems, e.g. a stereo camera system or a plenoptic camera integrated into glasses.

The presented algorithm first performs a plane segmentation. This segmentation is a combination of a gradient based 2D image segmentation and a random sample consensus (RANSAC) based plane segmentation [Fis81a], which is applied to the 3D point cloud. The plane segmentation is described in Section 3. The second step of the algorithm is the object modeling and thus the reconstruction of the recorded scene. This approach is based on finding intersecting edges between neighboring plane segments. The object modeling is presented in Section 4. In Section 5 the accuracy of the algorithm is analyzed and Section 6 shows the results of the system applied in test environments. Section 7 compares our system to a state of the art system [Rod12a] and Section 8 draws conclusions.

2 STATE OF THE ART
This section gives an overview of existing ETA systems. Beside some commercially available tools, there exists a variety of research activities focusing on ETAs. In [Dak10a] Dakopoulos and Bourbakis give a good overview of existing ETAs as well as research activities in this field. Manduchi and Coughlan [Man12a] present a more general overview of electronic assisting devices for blind people. Most of the commercially available systems rely on distance sensors for detecting obstacles within the surroundings of a blind person. K-Sonar [Zab06a], for example, is a tool which uses ultrasonic sensors to perceive obstacles in front of the user. The received information is conveyed to the user by stereo headphones. K-Sonar can either be used as a hand held device or be mounted on a long cane. A similar system to K-Sonar is the Laser Long Cane [Rit01a, Rit02a] from Vistac GmbH [Vis13a]. In this system laser distance sensors are attached to a long cane. The sensors monitor the area in front of the user at upper body height. The system warns the user about obstacles by vibrations in the sensor device.

Beside these commercially available products, there exist various research projects focusing on distance sensor based ETAs. Manduchi and Yuan are developing a system for detecting steps using laser distance sensors [Yua04a, Yua05a]. Dunai et al. work on a system called CASBliP [Dun12a]. This tool is based

on a time-of-flight (TOF) line scan camera. The camera scans a horizontal plane in front of the blind user and transforms it into stereo audio signals. One big advantage of this system is its long range of about 15 m.

Other ultrasonic based systems are NavBelt and GuideCane developed by Shoval et al. [Sho03a]. Both systems are based on the same ultrasonic sensors, which scan the surroundings. In the NavBelt system the sensors are attached to a belt, which is worn by the user. GuideCane is a small vehicle at the end of a long cane on which the sensors are arranged. Based on the sensor data, signal processing algorithms calculate a safe path in traveling direction. In the NavBelt system the user is informed about the safe path by audio signals, while GuideCane steers into the direction of this path. One big disadvantage of GuideCane is that the system is not able to pass or detect steps.

Other ultrasonic systems, like the project of Cardin et al. [Car05a], use tactile feedback to transmit the sensor information.

More sophisticated than systems based on distance sensors are camera based systems. These systems try to convey perception of the user's environment instead of just cautioning against upcoming obstacles. One commercially available system is vOICe [vOI13a]. Here the image of a camera, which is mounted in glasses, is transformed into spatial audio signals. These signals are presented to the user by headphones. Gonzalez-Mora et al. focus on a similar idea. In [Gon06a] a system is described which uses Head Related Transfer Functions (HRTFs) to modulate audio signals with the image information. Even though these systems show promising results, interpreting those signals requires a long training period. Besides, the often very sensitive aural sense of blind people cannot be used for other tasks. Thus, other systems try to reduce the camera data to a small amount of significant information before it is transmitted to the user. Dakopoulos, for example, describes in his PhD thesis a prototype system [Dak09a] that uses a binocular stereo camera to obtain a depth image of the scene in front of the user. The depth image is reduced to a resolution of 4 pixel × 4 pixel. This depth information is presented to the user by a 4×4 vibration motor array, which is attached to her or his stomach.

Saez Martinez and Escolano Ruiz also work on a stereo camera based system for obstacle detection [Sae08a]. Here, algorithms are used to combine sequences of depth images into a 3D map. For obstacle detection the user's motion is estimated based on the image sequences, and thus obstacles in travel direction are detected. However, the system only warns the user about upcoming obstacles and does not use the image information for scene perception.


Rodriguez et al. describe another stereo camera based system [Rod12a]. A short description of this approach is given during the comparison in Section 7.

Most of the camera based systems developed so far mainly focus on detecting obstacles with more or less high spatial resolution. Thus, they can prevent the user from being overwhelmed by a huge amount of data, but thereby they also greatly reduce the conveyed information. We, in contrast, are interested in remodeling the recorded scene. This will later give us the opportunity to classify the remodeled objects, for example by classifying a number of steps as a stairway, which otherwise would be considered an insuperable obstacle. Thus, we can provide the user with a high amount of information using a limited amount of data.

3 DEPTH IMAGE PLANE SEGMENTATION

A pixel in the depth image is defined by its 2D image coordinates $x_I$ and $y_I$ and will be denoted by the vector $X_I = (x_I, y_I)^T$ in this paper. Each pixel contains a depth value $d$, which represents the distance to the corresponding object point. In the following, the depth value for a pixel $X_I$ will be denoted by $d(X_I)$ or $d(x_I, y_I)$.

The 3D world coordinate system is defined based on the orientation and position of Kinect. A point in the world coordinate system is defined by the coordinates $x_W$, $y_W$, and $z_W$ and will be denoted by the vector $X_W = (x_W, y_W, z_W)^T$. All three coordinates ($x_W$, $y_W$, and $z_W$) are given in centimeters.

3.1 Gradient based depth image segmentation

The gradient based depth image segmentation is used to perform a rough preprocessing, which reduces false segmentation during the RANSAC algorithm. Besides, the gradient based algorithm is very sensitive in detecting small steps, which could be disregarded by the RANSAC algorithm. To perform the algorithm, the gradient vector $\vec{g}(X_I)$ is calculated from the depth image $d(X_I)$. The gradient vector $\vec{g}(X_I)$ can be calculated with any common gradient filter (e.g. Sobel or Canny operator). For the experiments presented in Sections 5 and 6 the gradient vector $\vec{g}(X_I)$ is defined as given in eq. (1). This gradient definition is very sensitive to small edges and steps since no low pass filtering is applied to the depth image. Compared to the Sobel and Canny operators the computation time of the filter in eq. (1) is very low. Besides, the calculated gradient does not have to be very accurate, since after the gradient based segmentation the RANSAC algorithm performs an accurate plane segmentation.

$$\vec{g}(X_I) = \begin{pmatrix} d(x_I+1, y_I) - d(x_I, y_I) \\ d(x_I, y_I+1) - d(x_I, y_I) \end{pmatrix} \qquad (1)$$

After calculating the gradient vector $\vec{g}(X_I)$, it is filtered by a median filter. The median filter reduces the number of outlying values caused by interpolation artifacts and quantization errors.

For segmentation each pixel is compared with each of its four neighboring pixels. Two neighboring pixels $X_I^i$ and $X_I^j$ are assigned to the same segment if the following two conditions are satisfied:

1. The norm of the difference vector $\Delta g$ between $\vec{g}(X_I^i)$ and $\vec{g}(X_I^j)$ has to be below a threshold $T_g$:

$$\Delta g = \left\| \vec{g}\left(X_I^i\right) - \vec{g}\left(X_I^j\right) \right\| \qquad (2)$$

2. The difference $\Delta d$ between the real depth value $d(X_I^j)$ and the estimated depth value $\hat{d}(X_I^j)$ at the position $X_I^j$ has to be below a threshold $T_d$:

$$\Delta d = \left| d\left(X_I^j\right) - \hat{d}\left(X_I^j\right) \right| \qquad (3)$$

$$\hat{d}\left(X_I^j\right) = d\left(X_I^i\right) + \vec{g}\left(X_I^i\right)^T \cdot \begin{pmatrix} x_I^j - x_I^i \\ y_I^j - y_I^i \end{pmatrix} \qquad (4)$$
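The following sketch illustrates how this pre-segmentation could be implemented. It is a minimal example, not the authors' implementation: the depth image is assumed to be a NumPy array, the region growing replaces whatever segment bookkeeping the original system uses, and the threshold values $T_g$ and $T_d$ are placeholders.

```python
import numpy as np
from scipy.ndimage import median_filter

def gradient_segmentation(d, T_g=2.0, T_d=1.0):
    """Rough plane pre-segmentation of a depth image d (eqs. 1-4).

    d   : 2D NumPy array of depth values.
    T_g : threshold on the gradient difference (eq. 2), assumed value.
    T_d : threshold on the depth prediction error (eq. 3), assumed value.
    Returns a label image of the same shape as d.
    """
    h, w = d.shape
    # Forward differences as in eq. (1); the last row/column stays zero.
    gx = np.zeros_like(d, dtype=float)
    gy = np.zeros_like(d, dtype=float)
    gx[:, :-1] = d[:, 1:] - d[:, :-1]   # d(x+1, y) - d(x, y)
    gy[:-1, :] = d[1:, :] - d[:-1, :]   # d(x, y+1) - d(x, y)
    # Median filtering suppresses interpolation and quantization outliers.
    gx = median_filter(gx, size=3)
    gy = median_filter(gy, size=3)

    labels = np.full((h, w), -1, dtype=int)
    next_label = 0
    for y in range(h):
        for x in range(w):
            if labels[y, x] >= 0:
                continue
            # Grow a new segment from the unlabeled seed pixel.
            labels[y, x] = next_label
            stack = [(y, x)]
            while stack:
                yi, xi = stack.pop()
                for yj, xj in ((yi-1, xi), (yi+1, xi), (yi, xi-1), (yi, xi+1)):
                    if not (0 <= yj < h and 0 <= xj < w) or labels[yj, xj] >= 0:
                        continue
                    # Condition 1 (eq. 2): similar gradient vectors.
                    dg = np.hypot(gx[yi, xi] - gx[yj, xj], gy[yi, xi] - gy[yj, xj])
                    # Condition 2 (eqs. 3, 4): neighbor depth close to the prediction.
                    d_hat = d[yi, xi] + gx[yi, xi] * (xj - xi) + gy[yi, xi] * (yj - yi)
                    if dg < T_g and abs(d[yj, xj] - d_hat) < T_d:
                        labels[yj, xj] = next_label
                        stack.append((yj, xj))
            next_label += 1
    return labels
```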

3.2 Depth image to point cloud transformation

As already mentioned above, the RANSAC algorithm will be performed on a set of 3D points in world coordinates. Thus, each pixel $X_I$ in the depth image has to be transformed into a 3D point $X_W$ in world coordinates. This is done based on the transformation matrix $A$, which is a $4 \times 4$ matrix defined in eq. (5). The transformation defined by $A$ is a combination of a rotation, a translation and the central projection performed by the camera. To describe the non-linear central projection by a system of linear equations, the homogeneous component $k$ has to be introduced as a fourth dimension. For the given definition it is assumed that the unit vector of the depth component $\vec{e}_d$ is orthogonal to the image plane but independent of the homogeneous component $k$, as given in eq. (5).

$$\begin{pmatrix} k \cdot x_I \\ k \cdot y_I \\ d \\ k \end{pmatrix} = A \cdot \begin{pmatrix} x_W \\ y_W \\ z_W \\ 1 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & 1 \end{pmatrix} \cdot \begin{pmatrix} x_W \\ y_W \\ z_W \\ 1 \end{pmatrix} \qquad (5)$$

From the system of equations given in eq. (5), the equations (6) and (7) are obtained by inserting the fourth row into the first and second and rearranging them afterwards. Eq. (8) is obtained directly as the third row of eq. (5).


Since all three equations are linear in the coefficients of $A$, and image as well as world coordinates can be measured at calibration points, all 15 coefficients of $A$ can be estimated by linear regression.

$$x_I = a_{11} x_W + a_{12} y_W + a_{13} z_W + a_{14} - a_{41} x_W x_I - a_{42} y_W x_I - a_{43} z_W x_I \qquad (6)$$

$$y_I = a_{21} x_W + a_{22} y_W + a_{23} z_W + a_{24} - a_{41} x_W y_I - a_{42} y_W y_I - a_{43} z_W y_I \qquad (7)$$

$$d = a_{31} x_W + a_{32} y_W + a_{33} z_W + a_{34} \qquad (8)$$

Analogously to the image points $X_I$ in the 2D segmentation, the 3D points $X_W$ can be grouped into segments.
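To illustrate eqs. (5)-(8), the sketch below estimates the 15 unknown coefficients of $A$ from calibration points by linear least squares and then recovers the world coordinates of a single pixel by solving the three linear equations for $x_W$, $y_W$, $z_W$. The function names, the data layout and the use of NumPy are assumptions made for this example, not part of the paper.

```python
import numpy as np

def estimate_A(img_pts, world_pts, depths):
    """Estimate the 15 unknown coefficients of A (a44 = 1) by linear regression.

    img_pts   : (N, 2) array of measured image coordinates (x_I, y_I).
    world_pts : (N, 3) array of known world coordinates (x_W, y_W, z_W).
    depths    : (N,)   array of measured depth values d.
    """
    rows, rhs = [], []
    for (xI, yI), (xW, yW, zW), d in zip(img_pts, world_pts, depths):
        # eq. (6): unknowns a11..a14 and a41..a43
        rows.append([xW, yW, zW, 1, 0, 0, 0, 0, 0, 0, 0, 0, -xW*xI, -yW*xI, -zW*xI])
        rhs.append(xI)
        # eq. (7): unknowns a21..a24 and a41..a43
        rows.append([0, 0, 0, 0, xW, yW, zW, 1, 0, 0, 0, 0, -xW*yI, -yW*yI, -zW*yI])
        rhs.append(yI)
        # eq. (8): unknowns a31..a34
        rows.append([0, 0, 0, 0, 0, 0, 0, 0, xW, yW, zW, 1, 0, 0, 0])
        rhs.append(d)
    coeffs, *_ = np.linalg.lstsq(np.asarray(rows, float), np.asarray(rhs, float), rcond=None)
    A = np.eye(4)
    A[0, :] = coeffs[0:4]
    A[1, :] = coeffs[4:8]
    A[2, :] = coeffs[8:12]
    A[3, :3] = coeffs[12:15]   # a44 is fixed to 1
    return A

def pixel_to_world(A, xI, yI, d):
    """Invert eqs. (6)-(8) for one pixel: three linear equations in (x_W, y_W, z_W)."""
    M = np.array([
        A[0, :3] - xI * A[3, :3],
        A[1, :3] - yI * A[3, :3],
        A[2, :3],
    ])
    b = np.array([xI - A[0, 3], yI - A[1, 3], d - A[2, 3]])
    return np.linalg.solve(M, b)
```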

3.3 RANSAC plane segmentation
After all pixels $X_I$ are projected into 3D points $X_W$, the RANSAC algorithm [Fis81a] is applied to the 3D points of each segment.

RANSAC is an iterative method to robustly estimate a certain model from a number of measurements. In our application RANSAC is used to fit planes to the 3D point cloud $X_W$.

In each iteration step the algorithm randomly picks three sample points ($X_W^1$, $X_W^2$, and $X_W^3$) out of the set of input points and defines a plane $\Pi$ through these three points. For each point $X_W$ in the set the distance $d_\Pi$ to the plane $\Pi$ is calculated and it is checked whether the distance $d_\Pi$ is below a threshold $T_{RAN}$ or not. Points with a distance $d_\Pi$ smaller than $T_{RAN}$ are considered to be part of the estimated plane. All other points are considered to be outliers. This procedure is repeated $N_{trials}$ times to find the best fitting plane. Best fit is defined in the sense of the lowest number of outliers.

The number $N_{trials}$ of iteration steps needed to reach convergence can be calculated as follows: If an iteration step results in $n_{IN}$ inliers out of a total number of $n_{PTS}$ points in the input set, the probability that all three independently picked samples ($X_W^1$, $X_W^2$, and $X_W^3$) are inliers can be estimated as given in eq. (9).

$$P(\text{all 3 samples are inliers}) \approx \left( \frac{n_{IN}}{n_{PTS}} \right)^3 \qquad (9)$$

From eq. (9), the probability that at least one of the three samples is an outlier follows as given in eq. (10).

$$P(\text{at least 1 sample is an outlier}) = 1 - P(\text{all 3 samples are inliers}) \qquad (10)$$

We define the constraint that, with a certain probability, which is denoted by $p$ in eq. (11), there must be at least one trial without any sample being an outlier. With this constraint the number of trials needed can be calculated as follows:

$$N_{trials} = \left\lceil \frac{\log(1-p)}{\log\left(P(\text{at least 1 sample is an outlier})\right)} \right\rceil \qquad (11)$$

For the experiments presented in Sections 5 and 6, the probability that at least one trial occurs where none of the three samples is an outlier was set to $p = 0.99$.

The number of inliers $n_{IN}$ in eq. (9) does not result from the best fitting plane $\Pi_n$ but from a randomly chosen plane $\Pi$. Thus the minimum needed number of trials $N_{trials}$ is always overestimated, except for the case that the plane $\Pi$ is already the best fitting plane.

The RANSAC algorithm results in an estimated plane $\Pi_n$, a set of inliers, which forms a new segment, and a set of outliers. The RANSAC algorithm is then applied again to the set of outliers until the number of remaining outliers falls below a defined minimum segment size.
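A compact sketch of this iterative plane extraction is given below. The inlier threshold $T_{RAN}$, the minimum segment size and the adaptive update of $N_{trials}$ after each new best hypothesis are illustrative choices; the paper does not prescribe this exact implementation.

```python
import numpy as np

def ransac_plane(points, t_ran=1.0, p=0.99, max_trials=1000):
    """Fit one plane n·x = c to a 3D point set (N x 3) with RANSAC.

    t_ran : inlier distance threshold T_RAN (assumed value, in cm).
    Returns (normal n, offset c, inlier mask).
    """
    n_pts = len(points)
    best_mask, best_model = None, None
    n_trials, trial = max_trials, 0
    while trial < n_trials:
        trial += 1
        i1, i2, i3 = np.random.choice(n_pts, 3, replace=False)
        # Plane through the three randomly picked samples.
        n = np.cross(points[i2] - points[i1], points[i3] - points[i1])
        norm = np.linalg.norm(n)
        if norm < 1e-9:            # degenerate (collinear) sample, try again
            continue
        n /= norm
        c = n @ points[i1]
        dist = np.abs(points @ n - c)
        mask = dist < t_ran
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (n, c)
            # Update the needed number of trials according to eqs. (9)-(11).
            w = mask.sum() / n_pts
            p_out = 1.0 - w**3
            if p_out > 1e-12:
                n_trials = min(max_trials, int(np.ceil(np.log(1 - p) / np.log(p_out))))
    if best_model is None:
        raise RuntimeError("no valid plane hypothesis found")
    return best_model[0], best_model[1], best_mask

def segment_planes(points, t_ran=1.0, min_segment=200):
    """Repeatedly extract planes until fewer than min_segment outliers remain."""
    segments, remaining = [], points
    while len(remaining) >= min_segment:
        n, c, mask = ransac_plane(remaining, t_ran)
        if mask.sum() < min_segment:
            break
        segments.append((n, c, remaining[mask]))
        remaining = remaining[~mask]
    return segments
```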

4 OBJECT MODELING
Based on the plane segments, geometric objects are modeled. This is done by combining planes into objects. At the current state of the algorithm only cuboid objects are modeled. Most scenarios where people move are dominated by man-made objects, which can be described more or less by cuboids. In further development steps it will be considered to include other geometric shapes as well (e.g. cylinders or spheres). However, for obstacle warning, cuboid objects are sufficient. The object modeling is divided into several steps. In the first step intersecting edges between neighboring planes are calculated. Then the floor plane is detected and extracted. In the third step the cuboid objects are modeled out of the plane segments and the intersecting edges.

4.1 Plane intersection
Since, by definition, a plane has infinite extent, there always exists an intersecting edge between any two planes which are not parallel to each other. The challenge in the intersecting edge retrieval is to consider as existent only those edges which actually exist in the recorded scene.

To solve this problem a neighborhood graph is calculated. Even though all plane segments are already defined in 3D world coordinates, the neighborhood graph is built based on image coordinates. This is because the computational effort for finding neighbors in a 2D pixel grid is much lower than in a 3D point cloud. Due to the known transformation between image and world coordinates, the graph can easily be transferred to the 3D points after the neighborhoods have been established.


In this graph each segment (plane) defines a vertex. Two vertices $i$ and $j$ are connected by an edge $e_{ij}$ if the corresponding segments are direct neighbors. From the marginal 3D points between two segments a straight line $\hat{L}_c^{ij}$ is estimated by linear regression. In addition, the real intersecting edge $L_c^{ij}$ between the planes $\Pi_i$ and $\Pi_j$ is calculated analytically. Within the range of adjacent points between both planes the maximum distance between the estimated and the analytically calculated edge is determined. If this maximum distance lies below a defined threshold, the edge is considered to exist.

All retrieved intersecting edges are defined as directed straight lines. This means the direction vector $\vec{d}_{ij}$ of $L_c^{ij}$ has to be defined with a certain direction. By definition, the segment $i$ is on the left and the segment $j$ on the right side of the projection of $L_c^{ij}$ onto the depth image plane in pointing direction.

After defining the intersecting edges there will be pixels that are assigned to the wrong segment, i.e. pixels of segment $i$ which are on the right side of $L_c^{ij}$ and pixels of segment $j$ which are on the left side of $L_c^{ij}$. These pixels are reassigned to the respective other segment and their 3D points $X_W$ are projected onto the corresponding plane.
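For illustration, the analytic intersection edge of two plane segments can be computed from their normal vectors as sketched below. The plane representation $\vec{n} \cdot \vec{x} = c$ and the choice of the returned point are assumptions made for this example.

```python
import numpy as np

def plane_intersection(n1, c1, n2, c2):
    """Intersection line of the planes n1·x = c1 and n2·x = c2.

    Returns a point on the line and its (unnormalized) direction vector,
    or None if the planes are (nearly) parallel.
    """
    d12 = np.cross(n1, n2)          # line direction, perpendicular to both normals
    if np.linalg.norm(d12) < 1e-9:
        return None                 # parallel planes: no unique intersection edge
    # One point on the line: it satisfies both plane equations and, as a third
    # constraint, lies in the plane through the origin spanned by n1 and n2.
    M = np.vstack((n1, n2, d12))
    b = np.array([c1, c2, 0.0])
    point = np.linalg.solve(M, b)
    return point, d12
```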

4.2 Floor plane extraction
Before objects can be built, the floor plane has to be extracted. This is done based on two parameters: the plane orientation, which is defined by a reference normal vector $\vec{n}_{ref}$, and the distance between the floor plane and the optical camera center $\vec{c}$, which is defined by $d_{ref}$.

To classify a plane as the floor plane, the angle $\Delta\varphi$ between the plane normal vector and the reference normal has to be below a threshold $T_\varphi$, and the deviation $\Delta d$ of its distance to the camera center from $d_{ref}$ has to be lower than a threshold $T_{dist}$.
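A minimal sketch of this floor test is shown below, assuming each plane candidate is described by its unit normal vector and its distance to the camera center; the reference values and thresholds are illustrative, not the values used in the experiments.

```python
import numpy as np

def is_floor_plane(n, dist, n_ref, d_ref, T_phi=np.deg2rad(10), T_dist=15.0):
    """Decide whether a plane (unit normal n, distance dist to the camera center)
    is the floor plane, given the reference normal n_ref and reference distance d_ref.
    T_phi (rad) and T_dist (cm) are illustrative threshold values.
    """
    # Angle between the plane normal and the reference normal (sign-insensitive).
    delta_phi = np.arccos(np.clip(abs(np.dot(n, n_ref)), 0.0, 1.0))
    # Deviation of the plane's distance from the reference distance.
    delta_d = abs(dist - d_ref)
    return delta_phi < T_phi and delta_d < T_dist
```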

All plane segments lying underneath the floor plane are erased. This of course does not fit all scenes but simplifies the processing. Planes underneath the floor plane result, for instance, from stairs going downwards. In future development these planes as well as gaps within the floor plane will be considered, because detecting downward stairs and gaps in the floor is an absolute must for a reliable system.

4.3 Object reconstruction
To build objects out of the plane segments, each intersecting edge $L_c^{ij}$ is classified as either a convex or a concave edge from Kinect's perspective.

To classify the intersecting edges, the normal vectors $\vec{n}$ of the plane segments have to point towards the optical camera center $\vec{c}$. This can be realized since $-\vec{n}$ and $\vec{n}$ define the same plane, and thus normal vectors which point away from the optical center can simply be flipped. Since the intersecting edge $L_c^{ij}$ between the planes $\Pi_i$ and $\Pi_j$ has a certain direction, such that $\Pi_i$ is left and $\Pi_j$ is right of the edge in pointing direction, this property is used to classify the edges. If the cross product of $\vec{n}_i$ and $\vec{n}_j$ points in the same direction as the direction vector $\vec{d}_{ij}$ of $L_c^{ij}$, the edge is classified as convex. Otherwise the edge is classified as concave (see Fig. 1).

Figure 1: Classification of a concave intersecting edge. (a) Two planes $\Pi_1$ and $\Pi_2$ connected by a concave intersecting edge $L_c^{12}$ (red line). (b) Cross product of the normal vectors $\vec{n}_1$ and $\vec{n}_2$.

$$L_c^{ij} \;\hat{=}\; \begin{cases} \text{convex} & \text{if } \dfrac{\vec{n}_i}{\|\vec{n}_i\|} \times \dfrac{\vec{n}_j}{\|\vec{n}_j\|} = \dfrac{\vec{d}_{ij}}{\|\vec{d}_{ij}\|} \\[2mm] \text{concave} & \text{else.} \end{cases} \qquad (12)$$
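A minimal sketch of the classification rule in eq. (12) is given below. It assumes the normal vectors have already been flipped towards the camera center; since the cross product is parallel or anti-parallel to the edge direction, the equality test is implemented here as a sign check of the dot product, which is more robust for measured data.

```python
import numpy as np

def classify_edge(n_i, n_j, d_ij):
    """Classify the intersecting edge L_c^ij as 'convex' or 'concave' (eq. 12).

    n_i, n_j : plane normal vectors, already oriented towards the camera center.
    d_ij     : direction vector of the edge (segment i on the left, j on the right).
    """
    cross = np.cross(n_i / np.linalg.norm(n_i), n_j / np.linalg.norm(n_j))
    # The cross product is parallel or anti-parallel to the edge direction, so the
    # equality test of eq. (12) reduces to the sign of the dot product.
    return "convex" if np.dot(cross, d_ij) > 0 else "concave"
```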

After classifying all edges as convex or concave, objects are built. All planes connected by a convex edge are combined into one object.

For each object a cuboid shaped bounding box is calculated. To avoid underestimation of the object dimensions, the boundaries are chosen such that all 3D points assigned to the object are included within the cuboid bounding box. This mostly results in an overestimation of the object size, but potential obstacles will never be neglected.
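As a simple illustration, an axis-aligned bounding cuboid can be obtained directly from the minima and maxima of the 3D points assigned to an object; the axis alignment is an assumption of this sketch, since the paper does not specify the orientation of the box.

```python
import numpy as np

def bounding_cuboid(points):
    """Axis-aligned cuboid enclosing all 3D points of one object.

    points : (N, 3) array of world coordinates assigned to the object.
    Returns the two opposite corner points (minimum and maximum corner).
    """
    p_min = points.min(axis=0)   # no point lies outside the box, so the object
    p_max = points.max(axis=0)   # size may be over- but never underestimated
    return p_min, p_max
```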

5 ACCURACY ANALYSIS
In this section two different experiments are shown, which evaluate the accuracy of the described algorithms applied to images gathered with Microsoft Kinect.

5.1 Plane accuracy
In the first experiment planar panels (size 19 cm × 29 cm) are placed parallel to the depth image plane of Kinect. Each panel is placed at a defined position within the world coordinate system (see Fig. 2(a)). The scene is recorded by the depth camera of Kinect and the algorithm is applied. Each panel results in a rectangular object in the reconstructed scene, defined by its four corner points. Based on the position of the reconstructed object, the accuracy of the system can be measured. In this experiment only the z-component of an object is evaluated, since the algorithm will overestimate the boundaries in the directions $x_W$ and $y_W$ as described in Section 4.3.

Figure 2: Setups for accuracy analysis.

Fig. 3 presents the root mean square error (RMSE) of the retrieved planes for different distances to Kinect (blue asterisks). For each corner of a panel the error $e_i$ of the z-component between the retrieved and the given value is calculated, resulting in four error values per panel. Four panels were recorded at each distance, resulting in a total number of $N = 16$ error values per distance. From these error values the RMSE is calculated as given in eq. (13).

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} e_i^2} \qquad (13)$$

As one can see, the RMSE rises approximately quadratically with the distance. Nevertheless, even at a distance of 2.5 m an RMSE of only 1.7 cm is obtained.

5.2 Intersecting edge accuracy
In the second experiment the panels are arranged as shown in Fig. 2(b). The intersecting edge between the two panels lies at a defined position in the world coordinate system. The scene is recorded by Kinect and the algorithm is applied. The algorithm calculates an intersecting edge between the two planes resulting from the panels. The error $e_i$ between the real and the calculated edge is defined as the distance between both edges at the two marginal points of the edge. Thus, for each recorded edge two error values are obtained. For each distance six objects were recorded, resulting in a total number of $N = 12$ error values. For this setup the RMSE is also calculated as given in eq. (13).

Fig. 3 shows the RMSE for different distances to the Kinect sensor (red circles). As one can see, only distances below 1.5 m are evaluated. The reason for this is the threshold of the RANSAC algorithm, which is increased linearly with the depth value $d$. With rising depth value $d$, the size of the quantization steps also rises and thus the accuracy decays. To avoid false segmentation resulting in many small planes, the segmentation thresholds ($T_d$, $T_g$, and $T_{RAN}$) are adjusted accordingly. Thus, the two panels are segmented as one consecutive plane at far distances. Nevertheless, for close distances up to about 1.5 m the experiments show very accurate results.

Figure 3: Accuracy analysis (RMSE [cm] over distance [cm]). Blue asterisks: RMSE of retrieved planes. Red circles: RMSE of retrieved intersecting edges.

6 APPLICATION
In this section the results of the algorithm for two different scenes are presented to give a qualitative assessment of the system.

Fig. 4(a) shows the RGB image of the first recorded scene. The scene includes several items, which are supposed to be detected and modeled as objects. The table in the back of the scene represents a large obstacle, which should be modeled as one object. Also the trash bin and the book lying on the floor are supposed to be detected. These two objects represent dangerous obstacles a person might trip over. Fig. 5(a) shows the RGB image of the second scene, in which a stairway is recorded. The main goal for this scene is to model the single steps as objects. Additionally, in both scenes the floor plane must be classified to reconstruct the scene correctly.

Fig. 4(b) shows the output of the algorithm corresponding to the scene in Fig. 4(a), and Fig. 5(b) the one corresponding to Fig. 5(a). In both figures an object is represented by its bounding box in 3D world coordinates. Besides, the floor plane as well as the boundaries of the field of view are plotted. As one can see, in both scenes the floor plane was detected correctly.

In the first scene (Fig. 4) the objects of main interest, the book, the trash bin and the table, were detected and modeled as objects by the algorithm. Nevertheless, the result of the scene reconstruction is not perfect. For example, the left edge of the table is separated into several small objects. This comes from the coarse quantization of the depth information $d$ for large values. The quantization causes planes standing at a steep angle to the image plane to exhibit a stepped gradient instead of a homogeneous depth gradient. Thus, the single steps are segmented into separate planes by the algorithm instead of one large plane. The reconstructed scene also includes wrongly modeled objects for the items standing on the table. Nevertheless, those items are not of major interest for the task of obstacle detection.

Figure 4: Result of the obstacle detection system for test scene 1. (a) Kinect RGB image. (b) 3D scene reconstruction.

In Fig. 5(b) one can see that this scene was reconstructed very well, too. The lowest five steps of the stairway are modeled as separate objects. All further steps are out of range for the depth imaging system. The banisters are not modeled very well since they have very small and reflective surfaces. The left sidewall also results in several objects since it is separated by the left banister. This problem has to be addressed in the further development process.

Both scenes show good results of the algorithm. However, there are scenarios the system cannot handle properly, especially when there are many undefined areas within the depth image. These undefined areas result when items shadow each other, but also in the presence of direct solar irradiation [Ell12a]. Thus, Kinect is inappropriate for outdoor scenarios.

7 COMPARISON TO A STATE OF THE ART SYSTEM

In this section the algorithm presented in this paper is compared to the system presented by Rodriguez et al. in 2012 [Rod12a]. This comparison shows the novel contribution of our approach to the research field of ETAs. [Rod12a] is quite comparable to our approach since it also focuses on simplifying the recorded 3D point cloud. Besides, the paper contains a very extensive experimental part with feedback from visually impaired persons who were testing the system. Based on this feedback we can foresee the demand on our system and on what we have to focus in further developments.

Figure 5: Result of the obstacle detection system for test scene 2. (a) Kinect RGB image. (b) 3D scene reconstruction.

The system presented in [Rod12a] is a stereo camera based system. From the stereo images a disparity map and thereby a 3D point cloud is calculated. By using RANSAC for plane fitting, the floor plane is extracted from the point cloud. All 3D points which are not part of the floor plane are divided into bins. This is done by projecting all points onto the floor plane and defining a polar grid, which forms the bin margins. Each bin which contains a sufficiently high number of points is considered an obstacle. The system already has audio feedback included, which is based on bone conduction technology and thus does not block the user's ears. Even though [Rod12a] is not based on Microsoft Kinect, it is quite comparable to the approach presented here. Our approach can easily be adapted to a stereo camera system, just as [Rod12a] can be adapted to Microsoft Kinect.

At the current state of development, there are two big advantages of the system presented by Rodriguez et al. compared to our approach: firstly, it already runs in real time, and secondly, it already includes audio feedback.

In terms of conveying information about the user's surroundings, our system is much more precise than the one in [Rod12a]. While Rodriguez et al. have 12 rigid


positions in front of the user where objects can occur, in our approach objects are placed independently of any grid in 3D space. The necessity of placing the obstacles independently of any grid is confirmed by the feedback of the test persons in [Rod12a]. The visually impaired people who were testing the system asked for more resolution in the depth domain, since they were not able to estimate the relevance of an obstacle.

Another advantage of our approach compared to [Rod12a] and basically all other existing systems is that our system tries to preserve the approximate shape of a recorded object. Thus, in further development stairs as well as other objects can be classified based on their characteristic shape. Nevertheless, an appropriate interface to the user still has to be developed.

8 CONCLUSIONS
The overview of state of the art systems in Section 2 shows that until now there is no operating ETA which satisfies all demands of blind people. Besides, the feedback of visually impaired people documented in [Rod12a] states a demand for highly sophisticated ETAs. This demand is very encouraging for the further development of our system.

The accuracy analysis presented in this paper shows that the developed system works accurately in the range of a few meters in front of the user. In addition, the two applications presented in Section 6 show the capability of the system. Scenes can be reconstructed in detail by primitive cuboid objects and even small objects can be detected.

In further development our system has to be miniaturized and combined with a feedback system. Besides, experiments with test persons have to be organized.

9 REFERENCES
[Car05a] Cardin S., Thalmann D. and Vexo F. Wearable obstacle detection system for visually impaired people. VR Workshop on Haptic and Tactile Perception of Deformable Objects, pp.50-55, 2005.
[Dak09a] Dakopoulos D. TYFLOS: a wearable navigation prototype for blind & visually impaired; design, modelling and experimental results. PhD Thesis, Wright State University, 2009.
[Dak10a] Dakopoulos D. and Bourbakis N.G. Wearable obstacle avoidance electronic travel aids for blind: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol.40, no.1, pp.25-35, January 2010.
[Dun12a] Dunai L., Garcia B.D., Lengua I. and Peris-Fajarnes G. 3D CMOS sensor based acoustic object detection and navigation system for blind people. Annual Conference of the IEEE Industrial Electronics Society, IECON, vol.38, pp.4208-4215, 2012.
[Ell12a] El-laithy R.A., Huang J. and Yeh M. Study on the use of Microsoft Kinect for robotics applications. Position Location and Navigation Symposium (PLANS), IEEE/ION, pp.1280-1288, April 2012.
[Fis81a] Fischler M.A. and Bolles R.C. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, vol.24, no.6, pp.381-395, July 1981.
[Gon06a] Gonzalez-Mora J.L., Rodriguez-Hernandez A.F., Burunat E., Martin F. and Castellano M.A. Seeing the world by hearing: Virtual Acoustic Space (VAS) a new space perception system for blind people. Information and Communication Technologies (ICTTA '06), vol.1, pp.837-842, 2006.
[Man12a] Manduchi R. and Coughlan J. (Computer) vision without sight. Communications of the ACM, vol.55, no.1, pp.96-104, 2012.
[Rit01a] Ritz M., Koenig L. and Woeste L. Orientation aid for the blind and the visually disabled. US Patent US 6298010 B1, October 2001.
[Rit02a] Ritz M., Koenig L. and Woeste L. Orientation aid for the blind and the visually disabled. US Patent US 6489605 B1, December 2002.
[Rod12a] Rodriguez A., Yebes J.J., Alcantarilla P.F., Bergasa L.M., Almazán J. and Cela A. Assisting the visually impaired: obstacle detection and warning system by acoustic feedback. Sensors, vol.12, no.12, pp.17476-17496, 2012.
[Sae08a] Saez Martinez J.M. and Escolano Ruiz F. Stereo-based aerial obstacle detection for the visually impaired. Workshop on Computer Vision Applications for the Visually Impaired, 2008.
[Sho03a] Shoval S., Ulrich I. and Borenstein J. NavBelt and the GuideCane - robotic-based obstacle-avoidance systems for the blind and visually impaired. IEEE Robotics & Automation Magazine, vol.10, no.1, pp.9-20, 2003.
[Vis13a] Vistac GmbH. Retrieved February 2014 from http://www.vistac.de.
[vOI13a] vOICe. Augmented Reality for the Totally Blind. 1996. Retrieved February 2014 from http://www.seeingwithsound.com/.
[Yua04a] Yuan D. and Manduchi R. A tool for range sensing and environment discovery for the blind. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshop (CVPRW '04), pp.39-47, 2004.
[Yua05a] Yuan D. and Manduchi R. Dynamic environment exploration using a virtual white cane. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp.243-249, 2005.
[Zab06a] Zabonne Ltd. K-Sonar. 2006. Retrieved February 2014 from http://zabonne.co.nz/.