
Mobile Augmented Reality using Scalable Recognition and Tracking

Jaewon Ha, Jinki Jung, ByungOk Han, Kyusung Cho, Hyun S. Yang

Computer Science Department, Korea Advanced Institute of Science and Technology

ABSTRACT

In this paper, a new mobile Augmented Reality (AR) framework that is scalable in the number of objects being augmented is proposed. Scalability is achieved by a visual-word recognition module on a remote server, while the mobile phone detects, tracks, and augments the target objects using the information received from the server. The server and the mobile phone are connected through a conventional Wi-Fi network. In the experiment, a cold start of the AR service takes 0.2 seconds on a 10,000-object database, which is acceptable for a real-world AR application.

KEYWORDS: Augmented Reality, Tracking.

INDEX TERMS: K.6.1 [Management of Computing and Information Systems]: Project and People Management—Life Cycle; K.7.m [The Computing Profession]: Miscellaneous—Ethics

1 INTRODUCTION

Recent mobile augmented reality research has matured to the point where mobile phones can deliver AR in real time without the cumbersome structured environment of fiducial markers.

State-of-the-art natural-feature methods such as SIFT and Ferns have been successfully ported to the mobile phone [7]. Furthermore, tracking methods suited to the mobile phone have been proposed and verified for AR systems in [7] and [8]. Still, these approaches do not scale with the number of objects to be serviced.

Some works address the scalability issue. The system of [6] is a similar framework that combines a scalable recognition module with a detection/tracking module; however, it is not a mobile framework, and its scalability was shown only for hundreds of objects. In [3], detection and tracking were fully outsourced through a distributed network, but scalability was not demonstrated.

Therefore, we propose a new mobile AR framework that is scalable in the number of objects being augmented while providing a refined level of tracking. Scalability is achieved by a scalable recognition module on the server side; the refined tracking is achieved by a fine-tuned tracking algorithm on the mobile phone side. The two sides are connected through a conventional Wi-Fi network. In the experiment, the cold start of an AR service initiation takes 0.2 seconds; as initiation does not happen frequently, this latency is acceptable in real-world applications. The performance of the bag of visual words used on the server side has already been verified on up to 1 million objects in [1] and [5].

2 FRAMEWORK

Our proposed framework consists of two main parts, the mobile phone and the remote recognition server, connected by Wi-Fi. The user selects a region of interest, and the selected region is sent to the server. At the server, recognition proceeds using a vocabulary tree [5]: several candidates are retrieved from the tree, and the best match among them is found by a PROSAC-based [1] geometric matching method. The server then sends the corresponding tracking information and AR contents to the mobile side. With the received tracking information, the mobile phone detects, tracks, and augments the AR contents. The overall framework is depicted in Figure 1.

Figure 1. Overview of the framework
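
To make the client-server exchange concrete, the sketch below shows one way the round trip could be framed. The endpoint, message framing, and reply layout are illustrative assumptions; the paper does not specify a wire protocol.

```python
import json
import socket
import struct

# Hypothetical recognition-server endpoint (not from the paper).
SERVER_ADDR = ("recognition.example.org", 9000)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("server closed the connection")
        buf.extend(chunk)
    return bytes(buf)

def query_server(roi_jpeg: bytes) -> dict:
    """Send a JPEG-compressed region of interest and receive the
    tracking information (keypoint locations, descriptors, AR content
    metadata) as JSON, both framed by a 4-byte big-endian length."""
    with socket.create_connection(SERVER_ADDR) as sock:
        sock.sendall(struct.pack(">I", len(roi_jpeg)) + roi_jpeg)
        (size,) = struct.unpack(">I", _recv_exact(sock, 4))
        return json.loads(_recv_exact(sock, size))
```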

3 SERVER SIDE: SCALABLE RECOGNITION

To achieve scalability, the bag-of-visual-words scheme is used; among its variations, a vocabulary tree [5] is chosen. The bag-of-visual-words process divides into two main steps: quantization and retrieval. In the quantization step, representative data points called visual words are extracted. In the retrieval step, the best matches are searched for quickly and accurately using the TF-IDF scheme [5]. A GPU implementation of SIFT [4] is used as the feature detector and descriptor.
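
As a minimal sketch of the retrieval step, the code below scores database images by TF-IDF-weighted cosine similarity over quantized visual words. A flat vocabulary stands in for the hierarchical vocabulary tree of [5], which serves to accelerate the word assignment; all names and the exact weighting are illustrative assumptions.

```python
import numpy as np

def build_index(db_word_ids, vocab_size):
    """db_word_ids: one integer array per database image, giving the
    visual-word id assigned to each of its descriptors."""
    n = len(db_word_ids)
    tf = np.zeros((n, vocab_size))
    for i, words in enumerate(db_word_ids):
        np.add.at(tf[i], words, 1.0)              # term frequency
        tf[i] /= max(len(words), 1)
    df = (tf > 0).sum(axis=0)                     # document frequency
    idf = np.log(n / np.maximum(df, 1))           # rare words weigh more
    db = tf * idf
    db /= np.linalg.norm(db, axis=1, keepdims=True) + 1e-12
    return db, idf

def retrieve(query_word_ids, db, idf):
    """Return database image indices ranked by cosine similarity."""
    q = np.zeros(db.shape[1])
    np.add.at(q, query_word_ids, 1.0)
    q = q / max(len(query_word_ids), 1) * idf
    q /= np.linalg.norm(q) + 1e-12
    return np.argsort(-(db @ q))
```

The top-ranked candidates would then be verified geometrically, as in the PROSAC-based matching of Section 2.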

4 MOBILE SIDE: DETECTION AND TRACKING

Once the keypoint locations and their descriptors are received, detection and tracking proceed on the mobile side. Detection is carried out first; if the detection result is good, tracking runs repeatedly. Tracking quality is evaluated at every frame, and if the result is poor, the system falls back to the detection phase.
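
This control flow amounts to a two-state loop; a minimal sketch follows, in which the detector/tracker interfaces and the inlier threshold are assumptions rather than details from the paper.

```python
from enum import Enum, auto

class Phase(Enum):
    DETECT = auto()
    TRACK = auto()

def ar_loop(camera, detector, tracker, min_inliers=20):
    """Detect until a good pose is found, then track; fall back to
    detection whenever the per-frame tracking quality degrades."""
    phase = Phase.DETECT
    for frame in camera:
        if phase is Phase.DETECT:
            pose, n_inliers = detector.detect(frame)
            if pose is not None and n_inliers >= min_inliers:
                tracker.initialize(frame, pose)
                phase = Phase.TRACK
                yield frame, pose              # augment this frame
        else:                                  # Phase.TRACK
            pose, ok = tracker.update(frame)   # quality check per frame
            if ok:
                yield frame, pose
            else:
                phase = Phase.DETECT           # re-detect next frame
```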

4.1 Detection

Considering the computational power and memory constraints of mobile phones, Modified-SIFT [7] is selected as the detector on the mobile phone side. Modified-Ferns [7] was also a candidate, but since learned Fern classifiers are usually megabytes in size, sending them over the network was not a viable option; Modified-SIFT is therefore chosen for the overall fast response of the framework. After feature extraction, the initial homography is estimated using PROSAC [1].
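
The detection step can be sketched with standard OpenCV building blocks. Note the substitutions: stock SIFT and RANSAC stand in for the paper's Modified-SIFT and PROSAC (OpenCV 4.5+ also exposes cv2.USAC_PROSAC, which is closer to [1]), and the reference data layout is an assumption.

```python
import cv2
import numpy as np

def detect_object(frame_gray, ref_pts, ref_desc):
    """Match the frame against the reference keypoints (ref_pts, an
    Nx2 array) and descriptors (ref_desc) received from the server,
    then estimate the initial homography."""
    kp, desc = cv2.SIFT_create().detectAndCompute(frame_gray, None)
    if desc is None:
        return None, 0

    # Lowe's ratio test on the two nearest neighbours.
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(desc, ref_desc, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]
    if len(good) < 4:
        return None, 0

    src = np.float32([kp[m.queryIdx].pt for m in good])
    dst = np.float32([ref_pts[m.trainIdx] for m in good])
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H, int(mask.sum()) if mask is not None else 0
```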


4.2 Tracking

Initiated from the detection phase, the purpose of tracking is to find the 6-DOF pose of the target object. To be both robust and fast, coarse-to-fine matching [2] is used: keyframe-based tracking in the coarse step and frame-to-frame tracking in the fine step. Keyframe-based tracking is free of drift but suffers from jitter, whereas frame-to-frame tracking is free of jitter but suffers from drift; used together, the deficiencies of each method complement one another. For technical details, the reader is referred to [2], [7], and [8].
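
The sketch below illustrates the complementary idea with one hybrid tracking step: KLT optical flow supplies the low-jitter frame-to-frame update, while a homography fitted directly against keyframe coordinates cancels accumulated drift. The state layout, thresholds, and the simple matrix blend are illustrative assumptions, not the exact method of [2].

```python
import cv2
import numpy as np

def _norm(H):
    return H / H[2, 2]

def track_step(state, cur_gray):
    """One hybrid step. state holds: prev_gray, prev_pts (float32
    points tracked so far), key_pts (their keyframe coordinates, kept
    in one-to-one correspondence), and H (keyframe-to-previous pose)."""
    # Fine step: frame-to-frame optical flow (jitter-free, drifts).
    cur_pts, st, _ = cv2.calcOpticalFlowPyrLK(
        state["prev_gray"], cur_gray, state["prev_pts"], None)
    ok = st.ravel() == 1
    if ok.sum() < 8:
        return None                            # lost: back to detection
    cur_pts, key_pts = cur_pts[ok], state["key_pts"][ok]

    # Coarse step: fit straight to the keyframe (drift-free, jitters).
    H_key, _ = cv2.findHomography(key_pts, cur_pts, cv2.RANSAC, 3.0)
    # Chain the frame-to-frame update onto the previous pose.
    H_step, _ = cv2.findHomography(state["prev_pts"][ok], cur_pts,
                                   cv2.RANSAC, 2.0)
    if H_key is None or H_step is None:
        return None
    H = 0.5 * _norm(H_key) + 0.5 * _norm(H_step @ state["H"])

    state.update(prev_gray=cur_gray, prev_pts=cur_pts,
                 key_pts=key_pts, H=H)
    return H
```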

5 EXPERIMENT

The experiment was carried out on an Android Nexus One, which has a 1 GHz processor and a 320x240 camera stream captured at 20 Hz. The network connection is IEEE 802.11 Wi-Fi at 100 Mbps. The database holds 10,000 music CD covers, and the corresponding music video scenes are augmented on the covers.

5.1 Initial start time

Figure 2. Time band of the initial start (stages: send ROI, GPU-SIFT and recognition, send tracking info, detection)

Time spent sending the region of interest (ROI) through the network was 25 ms (±10 ms). Recognition on the 10k database took up to 100 ms, including the GPU-SIFT description. The pure network overhead for sending and receiving was 50 ms (±20 ms), almost equal to the 50 ms (±15 ms) spent on detection; these components sum to roughly 225 ms, matching the 0.2-second cold start reported above. Although the experiment was carried out under clean network-traffic conditions, the data exchanged amounts to at most 117.5 KB (JPEG compression could be applied to the 80 KB gray image) and is sent only at user initialization, so external network conditions are not expected to affect the overall responsiveness of the system much.

5.2 Overall processing time

The per-frame time spent during a real-use session is depicted in Figure 3; the horizontal axis is the frame number, and the vertical axis is the time spent processing that frame in milliseconds. At every user reinitialization of the AR service, the frame-processing time rises to nearly 200 ms. Since user reinitialization does not happen frequently, this latency does not hurt the overall real-time performance of the system much. The trailing peaks near 80 ms are due to the internal garbage collection of the Android OS. Excluding reinitializations, the average processing rate over 71 frames is 23.9 Hz, faster than the camera capture speed of 20 Hz.

Figure 3. Overall performance time

6 CONCLUSION

In this paper, a new mobile AR framework that achieves both scalability and accurate tracking has been proposed and verified. The performance of the scalable recognition module and of the tracking module was already verified in previous research [5], [7]; our focus was on integrating the two techniques and on testing whether the result can serve as a real-world AR application in terms of time spent, especially at the bottleneck of initiation. The experiment showed an initiation time of about 200 ms; since initiation does not happen frequently, this is not expected to harm the overall real-time response of the system. The proposed framework therefore achieves scalability together with accurate tracking on mobile phones at an acceptable cost.

Figure 4. Augmentation of a music video on the CD cover

7 ACKNOWLEDGEMENTS

This research was supported by a grant (07High Tech A01) from the High Tech Urban Development Program funded by the Ministry of Land, Transport and Maritime Affairs of the Korean government, and by an ICC research grant funded by KAIST.

REFERENCES

[1] Chum, O. and Matas, J. 2005. Matching with PROSAC: Progressive Sample Consensus. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (San Diego, USA, June 20-26, 2005).

[2] Chum, O., Philbin, J., Sivic, J., Isard, M., and Zisserman, A. 2007. Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval. In Proc. IEEE International Conference on Computer Vision (Rio de Janeiro, Brazil, October 14-20, 2007).

[3] Klein, G. and Murray, D. 2007. Parallel Tracking and Mapping for Small AR Workspaces. In Proc. 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (Nara, Japan, November 13-16, 2007).

[4] Koch, R., Evers-Senne, J.-F., Schiller, I., Wuest, H., and Stricker, D. 2007. Architecture and Tracking Algorithms for a Distributed Mobile Industrial AR System. In Proc. International Conference on Computer Vision Systems (ICVS 2007).

[5] Lowe, D. 2004. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110.

[6] Nister, D. and Stewenius, H. 2006. Scalable Recognition with a Vocabulary Tree. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (New York, USA, June 17-22, 2006).

[7] Pilet, J. and Saito, H. 2010. Virtually Augmenting Hundreds of Real Pictures: An Approach Based on Learning, Retrieval, and Tracking. In Proc. IEEE Virtual Reality (Waltham, MA, March 2010).

[8] Wagner, D., Reitmayr, G., Mulloni, A., Drummond, T., and Schmalstieg, D. 2010. Real-Time Detection and Tracking for Augmented Reality on Mobile Phones. IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 3, pp. 355-368.

[9] Wagner, D., Schmalstieg, D., and Bischof, H. 2009. Multiple Target Detection and Tracking with Guaranteed Framerates on Mobile Phones. In Proc. IEEE International Symposium on Mixed and Augmented Reality (Florida, USA, October 19-22, 2009).
