



Spartacus: Spatially-Aware Interaction for Mobile Devices Through Energy-Efficient Audio Sensing

1Zheng Sun, 1Aveek Purohit, 2Raja Bose, and 1Pei Zhang
1Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA
2Microsoft Silicon Valley, Sunnyvale, CA
{zhengs, apurohit, peizhang}@cmu.edu, {raja.bose}@microsoft.com

ABSTRACT
Recent developments in ubiquitous computing enable applications that leverage personal mobile devices, such as smartphones, as a means to interact with other devices in their close proximity. In this paper, we propose Spartacus, a mobile system that enables spatially-aware neighboring device interactions with zero prior configuration. Using built-in microphones and speakers on commodity mobile devices, Spartacus uses a novel acoustic technique based on the Doppler effect to enable users to accurately initiate an interaction with a neighboring device through a pointing gesture. To enable truly spontaneous interactions on energy-constrained mobile devices, Spartacus uses a continuous audio-based low-power listening mechanism to trigger the gesture detection service. This eliminates the need for any manual action by the user.

Experimental results show that Spartacus achieves an average 90% device selection accuracy within 3m for most interaction scenarios. Our energy consumption evaluations show that Spartacus achieves about 4X lower energy consumption than WiFi Direct and 5.5X lower than the latest Bluetooth 4.0 protocol.

Categories and Subject Descriptors
C.3 [Special-purpose and application-based systems]: Signal processing systems

Keywords
Interaction; audio sensing; device pairing; spatial interaction; gesture; mobile system; context aware

1. INTRODUCTION
Recent developments in ubiquitous computing enable applications that leverage personal mobile devices, such as smartphones, as a means to interact with other devices in their close proximity. Numerous application scenarios are either commonly observed or can be foreseen in the

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MobiSys'13, June 25-28, 2013, Taipei, Taiwan
Copyright 2013 ACM 978-1-4503-1672-9/13/06 ...$15.00.

near future, ranging from Person-to-Person interactions (for example, conference attendees exchanging contact information, or friends playing multi-player pass-the-parcel mobile games) to Person-to-Device interactions (for example, a student printing out a document by interacting with a nearby printer, remote control of projectors, accessing product information in stores, controlling digital display screens, or changing thermostat settings).

Establishing such an interaction requires that the device initiating the interaction makes the target device aware of its intent and sets up a connection, without any prior configuration. While users themselves can intuitively know and identify the nearby target device based on its relative location, the existing methods require additional manual effort or prior configuration to translate this spatial-awareness to an identifier understandable by the device.

For example, device discovery features in Bluetooth or Wi-Fi can be used to initiate the interaction, where the user must scan for nearby devices and select the target from a list. Similarly, other methods have been proposed where users (the initiator and the target) must share information a priori [3] or perform synchronized actions [6].

Recent work, Point & Connect [15], has proposed a novel system that uses a pointing gesture to initiate interactions between a mobile device and its target. However, the system requires an initial channel of communication, such as a local Wi-Fi or Bluetooth network, to exist beforehand to enable a device to point-and-connect to the target device. In addition, due to the energy-constrained nature of mobile phones, the system service cannot run continuously in the background and requires users to manually trigger it on all nearby devices.

In this paper, we propose Spartacus, a mobile system that enables spatially-aware neighboring device interactions with zero prior configuration. First, Spartacus uses a novel acoustic technique based on the Doppler effect to enable users to accurately initiate an interaction with a particular target device in their proximity through a pointing gesture. Second, Spartacus uses a continuous audio-based low-power listening service that runs in the background and automatically triggers the relatively power-hungry gesture detection service. This enables Spartacus to run continuously in the background and removes the need for any manual user actions prior to the interaction. The system can be implemented as a software application on commodity mobile devices without any special hardware or software requirements.

As a proof of concept, we implemented the Spartacus


system on Android smartphones, and tested the system using various device models under different interaction scenarios. Our experimental results show that Spartacus supports spontaneous device interactions with an average 90% device selection accuracy within 3m for most interaction scenarios. Energy consumption evaluations show that Spartacus consumes about 150mW of power. As a reference, we also compare the energy consumption of Spartacus with state-of-the-art peer-to-peer scanning techniques, and show that Spartacus achieves about 4X lower energy consumption than WiFi Direct and 5.5X lower than the latest Bluetooth 4.0 protocol, respectively.

The key contributions of this paper are as follows:

1. We proposed a novel acoustic technique based on the Doppler effect to enable spatially-aware interactions between devices that provides high accuracy and supports numerous natural human gestures.

2. We developed a novel undersampling audio signal processing pipeline that achieves better accuracy without increasing computational complexity, allowing the method to be used on commodity mobile devices.

3. We designed and implemented a low-power listening protocol using periodic audio sensing to trigger gesture detection, which reduces energy consumption and enables the system to run continuously in the background. This enables the interaction to be truly spontaneous without any manual actions on the part of the user.

4. We analyzed and experimentally validated the design tradeoffs in achieving low latency and power consumption given hardware and software limitations of commodity mobile devices.

The rest of the paper is organized as follows. Section 2 gives a system overview of Spartacus. We describe the design of the algorithms in Section 3 and discuss implementation details in Section 4. Section 5 shows evaluation results. Related work is discussed in Section 6. Finally, Section 8 concludes the paper.

2. SYSTEM OVERVIEW
Spartacus supports spatially-aware device selection in close proximity (i.e., within 5m), using intuitive pointing gestures. As illustrated in Figure 1, when a user wants to initiate an interaction with a nearby device, she makes a pointing gesture towards the device using her mobile phone. The interaction with the target device is then automatically initiated.

To select the target device, Spartacus emits a continuous audio tone at a known frequency during the course of the pointing gesture. Other devices that are close to the user's phone can capture the tone via their microphones. Due to the motion of the gesture, a Doppler frequency shift deviating from the original frequency is observed in the received audio; this shift is a monotonically increasing function of the gesture velocity [4]. Therefore, since the gesture is made in the direction of the target device, the target device can be isolated by finding the peak frequency shift among nearby devices.


Figure 1: An interaction scenario of the Spartacus interaction system. A user selects a nearby device for interaction by quickly pointing her mobile phone towards the target device.

To eliminate any configuration by users in the entire interaction process, Spartacus needs to automatically trigger the gesture detection service on nearby devices before the gesture. Previous work adopts a continuous listening approach [15]. Given the limited energy budget on mobile devices, keeping this approach active all the time is impractical, and users are required to manually turn on the gesture detection service before the interaction. In contrast, Spartacus solves this problem by providing a zero-configuration interaction trigger mechanism, whereby mobile devices running Spartacus continuously perform periodic low-power listening using their built-in microphones. Before a user issues the gesture, an audio beacon with a short duration (i.e., a couple of seconds) is emitted. Nearby devices that successfully capture this beacon trigger their gesture detection service and begin listening for the gesture tone from the initiator. After the gesture is over, the gesture detection service is turned off to conserve energy. Our experimental results show that the low-power listening protocol is more energy-efficient than other state-of-the-art peer-to-peer scanning techniques, such as WiFi Direct and Bluetooth 4.0 (detailed in Section 5.5).
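The trigger logic reduces to a duty-cycling argument: the periodic listening is guaranteed to catch the beacon as long as the listening period is shorter than the beacon duration. The toy model below illustrates this; all timing constants here are hypothetical, since the text only states that the beacon lasts a couple of seconds.

```python
# Toy duty-cycle model of the low-power listening trigger.
def beacon_detected(beacon_start, beacon_len, listen_period, listen_window):
    """Return True if any periodic listening window overlaps the beacon."""
    t = 0.0
    while t < beacon_start + beacon_len:
        # A window [t, t + listen_window] overlaps the beacon interval?
        if t + listen_window > beacon_start and t < beacon_start + beacon_len:
            return True
        t += listen_period
    return False

# With a 2 s beacon, a 1.5 s listening period catches the beacon no
# matter when it starts; a 2 s period would miss a 0.5 s beacon.
print(beacon_detected(beacon_start=3.7, beacon_len=2.0,
                      listen_period=1.5, listen_window=0.1))
```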

We have implemented the Spartacus system on the Android mobile platform (detailed in Section 4). The entire system is implemented in software and runs as a mobile phone app, without any change to the existing mobile operating system and APIs. Since Spartacus leverages existing built-in microphones and speakers on commodity mobile devices, such as smartphones, it does not require any extra hardware.

2.1 Design Challenges of Spartacus
To support spatially-aware device selection and zero-configuration interaction triggers, the Spartacus system addresses a number of major technical challenges:


1. High Resolution Doppler-Shift Detection – Spartacus selects the target device through the analysis of peak Doppler frequency shifts in the pointing gestures. However, pointing gestures of average users are usually transient (i.e., generally shorter than 0.5s), and the gesture velocity is low, which leads to limited Doppler frequency shifts. Thus, accurately detecting the transient peak velocities requires digital signal processing techniques with a high frequency resolution, without sacrificing time resolution. Spartacus addresses this challenge by utilizing an undersampling technique, which increases the frequency-domain resolution by 5X as compared to traditional FFT-based approaches. This translates to a 2X increase in angular resolution. (Detailed in Section 3.1)

2. High-Accuracy Device Selection – To successfully select the target device, Spartacus needs to accurately estimate the peak frequency shifts observed by each nearby device, and to select the one with the maximum peak shift. To address this challenge, we design and implement a bandpass audio signal processing pipeline in Spartacus, which is robust against ambient and intermittent high frequency acoustic noises. (Detailed in Section 3.2)

3. Energy-efficient Interaction Trigger – To trigger interactions on nearby devices without user configuration, mobile devices need to be always ready to capture the pointing gestures, while keeping the energy consumption low. Spartacus addresses this challenge by designing a low-power audio listening protocol that periodically detects incoming interaction triggers. (Detailed in Section 3.3)

3. SYSTEM DESCRIPTION
Using Spartacus, a user initiates a device interaction by pointing her mobile phone towards a target device, during which an audio tone with a known frequency is emitted. Spartacus utilizes audio signal processing techniques to detect the frequency shifts in the audio tone, such that the target device can be selected by searching for the maximum peak frequency shift among multiple candidate devices. This section provides a detailed description of this process.

3.1 Detect Doppler Shift with High Resolution

To select the target device, Spartacus emits a continuous audio tone at a known frequency f0 during the course of the pointing gesture. According to the Doppler effect, the frequency of the received tone can be calculated as f = ((c + vR) / (c − vS)) · f0, where c, vR, and vS are the speed of sound, the speed of the gesture receiver, and the speed of the gesture sender, respectively [4]. In our case, assuming the receiver is stationary during the course of the gesture (i.e., vR = 0), and the speed of sound c is constant, the observed frequency becomes a monotonically increasing function of vS. Since the user makes the gesture directionally towards the target device, the target device observes the maximum Doppler shift. Thus, by comparing the peak frequency shift among nearby devices, the target device can be selected.


Figure 2: Geometry of tone transmission between mobile devices. When D0 is moved towards the target device DA during the pointing gesture, the phone displacement increases the directional difference from α to β.

3.1.1 Deriving Angular Resolution
Spartacus selects the target device DA by measuring the time-varying Doppler frequency shifts. This is accomplished by running an audio signal analysis process over a series of overlapped audio frames. Applying the FFT on each audio frame, the frequency shift in the audio frame of DA can be calculated as ∆nA = (fA − f0) · NFFT / Fs, where fA is the observed tone frequency of DA, f0 the frequency of the original tone, Fs the sampling rate, NFFT the number of FFT points, and ∆nA the calculated frequency shift expressed in FFT points. Assuming the target device is stationary during the course of the gesture (i.e., vR = 0), the Doppler equation lets ∆nA also be expressed as ∆nA = (c / (c − vS) − 1) · f0 · NFFT / Fs, where c and vS are the speed of sound and the speed of the gesture sender, respectively. As shown in Figure 2, suppose another device DB is in close proximity to the target device DA. Let the angle between the three devices be α; the velocity observed at DB is then vS · cos α. Therefore, correctly selecting DA as the target device requires that ∆nA is at least one FFT point larger than ∆nB, i.e., ∆nA − ∆nB > 1. Given that c is constant and the maximum velocity of a particular gesture is fixed, this requirement can be expressed as (1 / (c − vS) − 1 / (c − vS · cos α)) · c · f0 · NFFT / Fs > 1. Reorganizing this inequality, the minimum differentiable angle αres, i.e., the angular resolution, can be expressed as

    cos αres < ( c − 1 / ( 1/(c − vS) − Q/c ) ) / vS,    (1)

where Q = Fs / (f0 · NFFT). Using Equation 1, we can compute the achievable angular resolution of Spartacus. For example, suppose a user uses audio tones at 20KHz and points her phone with an average 3.4m/s peak velocity. If, at the receiver device, the audio signal is sampled at 44100Hz using a 2048-point FFT, then the angular resolution is 26.7◦. If the target device is 5m away, this translates to 2.3m spatial resolution, which is too large to conduct practical interactions! We will evaluate the velocity variance of the pointing gestures in Section 5.1.
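Equation 1 is straightforward to evaluate numerically; the sketch below reproduces both resolutions quoted in the text (c = 343 m/s is assumed, since the paper does not state the speed of sound explicitly):

```python
import math

def angular_resolution_deg(f0, fs, n_fft, v_s, c=343.0):
    """Minimum differentiable angle from Equation 1:
    cos(a_res) < (c - 1 / (1/(c - v_s) - Q/c)) / v_s, Q = fs / (f0 * n_fft)."""
    q = fs / (f0 * n_fft)
    cos_a = (c - 1.0 / (1.0 / (c - v_s) - q / c)) / v_s
    return math.degrees(math.acos(cos_a))

# 20 KHz tone, 3.4 m/s peak velocity, 2048-point FFT:
print(angular_resolution_deg(20000, 44100, 2048, 3.4))      # ≈ 26.7 at Fs = 44.1 KHz
print(angular_resolution_deg(20000, 44100 / 7, 2048, 3.4))  # ≈ 10.0 at F*s = 6.3 KHz
```

The second call anticipates the undersampled rate chosen in Section 3.1.3 and matches the 10◦ figure derived there.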

3.1.2 Improving Resolution using Undersampling
To improve the angular resolution, Equation 1 indicates that αres is a monotonically increasing function of Q, which implies three options to reduce αres: 1) increasing the original tone frequency f0, 2) increasing the number of FFT points NFFT, or 3) decreasing the sampling rate Fs.

The first two options both have drawbacks. For the first option, audio tones with higher frequencies experience stronger energy degradation, which consequently reduces the


supported interaction range (we will evaluate the energy degradation of sound in Section 4.2). For the second option, increasing the number of FFT points would involve a higher computational burden. Our experimental results indicate that, using 10ms audio frames and a 2048-point FFT, processing one second of audio takes 7 seconds on modern mobile devices, which is too slow for any interactions that require high responsiveness, such as mobile gaming.

To avoid these drawbacks, Spartacus exploits the last option, which is to decrease the sampling rate Fs. This option seems impractical because the Nyquist sampling theorem states that the sampling frequency has to be larger than twice the maximum signal frequency to perfectly reconstruct the original signal; otherwise the sampled signal would be aliased [14]. However, if the bandwidth of a bandpass signal is significantly smaller than its central frequency, it is still possible to sample the signal at a much lower rate than the Nyquist sampling rate without causing aliasing.

This technique is called undersampling, and it has been used in RF communication and image processing systems for analyzing bandpass signals [5, 20]. In Spartacus, since the tones are transmitted at a very high frequency (i.e., above 18KHz), whereas the frequency shifts of the tones are only a few hundred Hertz, the entire received audio tone can be seen as a bandpass signal. Therefore, using the undersampling technique can significantly reduce the required sampling rate. The effect of the undersampling technique on the spectrum is shown in Figure 4.

Denote the lowest and the highest band limits of the received frequency-shifted tone as fL and fH, respectively; the bandwidth of the signal is then B = fH − fL. According to the undersampling theorem, the condition for an acceptable new sampling rate is that shifts of the bands from fL to fH and from −fH to −fL must not overlap when shifted by all integer multiples of the new sampling rate F*s [5]. This condition can be interpreted as the following constraint:

    2 · fH / n ≤ F*s ≤ 2 · fL / (n − 1),  ∀n : 1 ≤ n ≤ ⌊fH / B⌋,    (2)

where ⌊·⌋ is the flooring operation, B the signal bandwidth, and n = Fs / F*s the undersampling factor. In our experiments, we observe that average users can generate pointing gestures with an average peak velocity of 3.4m/s, which equals a 200Hz frequency shift. Considering the edge effect of FFT transforms, we conservatively assume the bandwidth of the received tone signal is 2KHz, which is sufficient to avoid spectrum aliasing for human pointing gestures. Substituting this bandwidth into Equation 2, we get a list of possible F*s and n combinations, as shown in Table 1.

3.1.3 Determining Undersampling Parameters
These parameter combinations provide rich options for designing our system. However, when choosing undersampling parameters, there are a number of design considerations: 1) A higher n is generally favored, as it leads to a lower new sampling rate, which results in a better angular resolution. 2) Since the central frequency of the tones is fL + B/2, a higher fL increases the central frequency of the tones. During our experiments, as will be shown in Section 4.2, higher tone frequencies generally cause greater energy degradation of sound, which would significantly reduce the interaction range and the accuracy of tone detection. In practice, to support a sufficient interaction range and achieve optimal device selection accuracy, we empirically avoided using fL higher than 19KHz (i.e., the entire bandwidth of the audio tone resides between 19KHz and 21KHz). However, we note that, though this general design consideration holds, the frequency of the audio tones could also be varied according to the specific frequency response of the microphones and speakers. 3) Commodity mobile devices support a limited choice of audio sampling rates, which generally includes 8KHz, 16KHz, 32KHz, 44.1KHz, and 48KHz. This means the original audio sampling rate cannot be arbitrarily set, which consequently limits the choices of F*s and n.

After examining all these design considerations, we found that only when n = 5, 6, or 7 given Fs = 44.1KHz, or when n = 4 given Fs = 48KHz, do the parameter combinations satisfy the central frequency requirement. Furthermore, among these combinations, most cases lead to a low undersampling factor n, except the case where n = 7 given Fs = 44.1KHz. Thus, we finally chose this pair of parameters in Spartacus, which yields a new sampling rate F*s = 44.1/7 = 6.3KHz. Given this lower sampling rate, based on Equation 1, the theoretical angular resolution improves to 10◦, more than 2.5X better than the original 26.7◦ discussed in Section 3.1.1.

3.2 Select Target Device with High Accuracy

3.2.1 Bandpass Signal Processing Pipeline
Using the undersampling technique, we set the central tone frequency at 20KHz and use a bandwidth of 2KHz, which satisfies the fL requirement shown in Table 1. This design choice also avoids stronger energy degradation. Through the undersampling process, the spectrum of the original audio samples is essentially mapped from 19KHz-21KHz to a much lower spectrum from 0.58KHz-2.58KHz with a central frequency at 1.58KHz. However, since the new sampling rate is much lower than the Nyquist rate, aliasing arises in the originally sampled audio signals. This buries the audio tone under ambient noise and makes frequency shift detection impossible.

Table 1: Candidate combinations of undersampling parameters. The row marked with a star is finally chosen.

Fs (KHz)   n   F*s = Fs/n (KHz)   Supported fL (KHz)
44.1       5   8.8                (17.7, 20.0)
44.1       6   7.3                (18.4, 20.0)
44.1       7   6.3                (18.9, 20.0)   ★
44.1       8   5.5                (19.3, 20.0)
48         4   12                 (18.0, 22.0)
48         5   9.6                (19.2, 22.0)
48         6   8                  (20.0, 22.0)
48         7   6.9                (20.6, 22.0)
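The rows of Table 1 follow mechanically from Equation 2 with F*s = Fs/n: the supported fL interval is (F*s · (n − 1)/2, Fs/2 − B). A small sketch, assuming B = 2 KHz as in the text:

```python
def fl_interval(fs_khz, n, bandwidth_khz=2.0):
    """Supported fL range for undersampling factor n, from Equation 2
    with F*s = fs/n: F*s*(n-1)/2 <= fL and fH = fL + B <= n*F*s/2."""
    fs_new = fs_khz / n
    return fs_new, fs_new * (n - 1) / 2.0, fs_khz / 2.0 - bandwidth_khz

for n in (5, 6, 7, 8):  # the factors tabulated for Fs = 44.1 KHz
    fs_new, lo, hi = fl_interval(44.1, n)
    print(f"n={n}  F*s={fs_new:.1f} KHz  fL in ({lo:.1f}, {hi:.1f}) KHz")
```

For n = 7 this reproduces the chosen row, fL in (18.9, 20.0) KHz; the other rows agree with the table up to the last rounded digit.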

To solve this problem, we designed and implemented an audio signal processing pipeline, as shown in Figure 3, to recover the frequency shifts. First, each mobile device receives and samples audio data at the default 44.1KHz rate. As will be discussed in Section 5.1, our experimental results show that the peak velocities of pointing gestures last only tens of milliseconds, so we split consecutive audio samples into 10ms analysis windows with a 75% overlapping ratio, achieving a high time-domain resolution. Then, a 10th-order Butterworth


[Figure 3 depicts two flowcharts: the signal processing chain (Raw Audio Samples → Framing → Bandpass Filter → Undersampling → Fast Fourier Transform → Tone Detection (frequency domain) → Tone Positioning (time domain) → Frequency Shift Estimation → Smoothing) and the control loop (App Initialization → Low-power Audio Listening → start beacon detected? → Continuous Audio Recording → Doppler Shift Estimation → time out?).]

Figure 3: Signal processing pipeline.


Figure 4: Illustration of the effect of undersampling on the spectrum. (a) The originally received audio tone is located from fL to fH. (b) After undersampling the audio samples, the tone is shifted to lower frequencies.

bandpass filter is used to attenuate out-of-band signals below 19KHz or above 21KHz. The filtered audio samples then go through the undersampling filter, which uses the new sampling rate F*s. With n = 7, this filter essentially keeps every 7th sample and discards the others. Note that although this process reduces the number of samples being analyzed, it still preserves the necessary frequency-domain information due to undersampling. The entire bandpass-filtering and undersampling operation processes audio samples as a stream, with a total time complexity of only O(N), which is efficient for responsive interactions.
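The keep-every-7th-sample step can be sanity-checked with synthetic tones: a near-20 KHz tone aliases down into the new 0 Hz to 3.15 KHz band, and the Doppler offset between two tones survives intact. A numpy-only toy sketch (the Butterworth prefilter is omitted because the synthetic signal is already in-band; the exact alias location depends on which spectral fold the tone lands in):

```python
import numpy as np

FS, N_DECIM = 44100, 7       # original rate and undersampling factor n
FS_NEW = FS / N_DECIM        # new rate F*s = 6.3 KHz

def aliased_peak_hz(f_tone, seconds=1.0):
    """Emit a pure tone, keep every 7th sample, and locate the spectral
    peak of the undersampled stream."""
    t = np.arange(int(FS * seconds)) / FS
    x = np.sin(2 * np.pi * f_tone * t)
    x_u = x[::N_DECIM]                     # undersampling by decimation
    spec = np.abs(np.fft.rfft(x_u))
    freqs = np.fft.rfftfreq(len(x_u), d=1.0 / FS_NEW)
    return freqs[np.argmax(spec)]

p0 = aliased_peak_hz(20000.0)   # the emitted tone
p1 = aliased_peak_hz(20200.0)   # the same tone Doppler-shifted by 200 Hz
print(p1 - p0)                  # the 200 Hz offset is preserved
```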

To estimate frequency shifts, each undersampled audio frame is processed using a 2048-point FFT, and the FFTs of multiple consecutive audio frames are processed together to find the frequency shift. Specifically, an energy-based tone detector first compares the energy of the spectrum of each frame from 1.08KHz to 2.08KHz (corresponding to 19.5KHz to 20.5KHz of the original spectrum before undersampling) to that of the entire 0Hz to 3.15KHz (i.e., F*s/2) spectrum. If the former energy is M times greater than the latter, the audio frame is set to "1", otherwise "0". Second, a tone positioning operator determines the frequency bin with the highest energy for each frame that has been set to "1", and links these frequency bins through a moving-average smoothing operation. This operation eliminates intermittent high frequency spikes caused by acoustic noises, such as clangs of metal.


Figure 5: Effect of the undersampling analysis. (a) Spectrogram of the original audio. (b) Frequency shift detection using the traditional FFT approach. (c) Frequency shift detection using the proposed undersampling technique. The detected peak frequency shift is marked.

Finally, the peak frequency shift is determined by finding the maximum frequency shift among all the frequency bins. During our experiments, we found that M = 1.5 led to robust performance in various indoor environments. The effect of this process is shown in Figure 5. Note that, by using the undersampling technique, the amount of frequency shift is significantly increased as compared to the traditional FFT approach.
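The detector and smoothing steps above can be sketched as follows. This is a simplified reading, not the authors' implementation: since an in-band subset of the spectrum cannot exceed 1.5× the total spectrum energy, the M-threshold is interpreted here as comparing in-band against out-of-band energy, and `peak_shift_bin` with its frame layout is a hypothetical name:

```python
import numpy as np

M = 1.5  # energy-ratio threshold reported in the text

def peak_shift_bin(frames_mag, band, win=5):
    """frames_mag: (num_frames, num_bins) FFT magnitudes of 10 ms frames.
    band: (lo, hi) bin range where the tone is expected. Flags tone-bearing
    frames ("1"/"0"), takes the strongest in-band bin of each flagged frame,
    smooths that bin track with a moving average, and returns the peak
    smoothed bin, i.e. the peak frequency shift; None if no frame has the tone."""
    lo, hi = band
    in_band = (frames_mag[:, lo:hi] ** 2).sum(axis=1)
    out_band = (frames_mag ** 2).sum(axis=1) - in_band
    active = in_band > M * out_band              # the "1"/"0" frame flags
    if not active.any():
        return None
    bins = lo + frames_mag[:, lo:hi].argmax(axis=1)
    kernel = np.ones(win) / win                  # moving-average smoothing
    smooth = np.convolve(bins[active].astype(float), kernel, mode="same")
    return smooth.max()
```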

After each device detects the Doppler frequency shifts, all the devices that detect a positive shift value report their frequency shift to the sender device, along with their device ID information. The sender device then compares all the received Doppler shifts and determines the target device.

We would like to note that contentions may occur when Spartacus is used in a crowded indoor scenario, such as an airport, where numerous devices might be interacting with each other. However, due to the intrinsically rapid energy degradation of audio signals, as will also be shown in Section 4.2, different interaction sessions are automatically separated if they are spatially far apart (i.e., more than 5m). If many interactions take place in close proximity, a coordination mechanism could be used to create a contention window that conditionally accepts the reports of the Doppler shifts, so that different interaction sessions can be separated in the time domain. Previous research provides various contention-control schemes; however, this is outside the scope of the current work.

3.2.2 Angular Gain through Pointing Gestures

Spartacus uses the bandpass signal processing pipeline to estimate frequency shifts in audio tones. However, due to geometric constraints, whether a nearby device is selected as the target device is determined by the angular resolution αres of the system. As indicated in Section 3.1, when the number of FFT points is 2048, the smallest angular resolution is 10° when the undersampling factor n is equal to 7. However, during our tests in practice, we found that when candidate devices are close to the user (i.e., within 3m), the device selection accuracy is better than the analysis predicts. This improvement is caused by


Figure 6: Angular gains caused by the stretch of user arms during the pointing gestures. These gains improve the device selection accuracy of Spartacus.

the increased directional difference, as illustrated in Figure 2. During a pointing gesture, since the user stretches her arm toward the target device DA, the phone displacement makes the effective directional difference α increase. This angular change is significant when the candidate devices DA and DB are close to D0. As shown in Figure 6, when DA and DB are 0.75m away and α = 10°, the two devices should be barely differentiable. However, assuming the user's arm is 60cm long, the effective angular difference β increases to 55°, which makes the two devices much easier to differentiate. Clearly, this angular gain diminishes as the distance from D0 to the two devices increases, as well as when the user does not fully stretch her arm during the pointing gesture.
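This geometry can be checked numerically (a simplified planar model we assume here, with the phone displaced in a straight line toward the target; the exact values in Figure 6 depend on how the arm motion is modeled, so this sketch reproduces the qualitative behavior rather than the paper's exact numbers):

```python
import math

def effective_angle(d, alpha_deg, arm=0.6):
    """Angular separation of two candidate devices as seen from the phone
    after the user extends an arm of length `arm` (m) toward the target.

    D0 (user's torso) is at the origin; target DA lies at distance d along
    the pointing direction, distractor DB at the same distance, alpha_deg
    degrees off. Returns the effective angle beta in degrees.
    """
    a = math.radians(alpha_deg)
    da = (d, 0.0)
    db = (d * math.cos(a), d * math.sin(a))
    phone = (arm, 0.0)                        # phone after the arm stretch
    v1 = (da[0] - phone[0], da[1] - phone[1])
    v2 = (db[0] - phone[0], db[1] - phone[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.degrees(math.acos(dot / (math.hypot(*v1) * math.hypot(*v2))))
```

Under these assumptions, a 10° separation at 0.75m grows to a much larger effective angle, and the gain shrinks toward 10° as the devices move out to 3m, matching the trend in Figure 6.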

3.3 Energy-efficient Interaction Triggering

Spartacus seeks to enable spatially-aware interaction between an initiating device and the receiving devices in its neighborhood. This requires that the receiving devices are listening for gestures broadcast by the initiator. However, such continuous listening for gestures consumes a significant amount of energy and is a challenge to realize on energy-limited mobile devices. Therefore, Spartacus leverages a low-power audio listening protocol to save energy. This section describes the design details of this protocol.

3.3.1 Low-Power Audio Listening

The audio-based low-power listening protocol has the following advantages:

1. Ubiquitous Hardware Support – Microphones and speakers are ubiquitous on most devices, such as mobile phones, laptops, and smart-TV sets. No extra hardware modifications are needed to implement this protocol. In addition, the cost of commodity audio hardware is very low.

2. Limited Range – The objective of this protocol is to initiate interaction strictly with devices in close proximity to the sender, typically in line-of-sight, single-space scenarios. Audio has a limited range and attenuates considerably when passing through walls. This makes the medium better suited for detecting neighboring devices within the same space, unlike radio-based discovery such as WiFi and Bluetooth. This advantage also helps the Spartacus system automatically separate concurrent interaction sessions that are spatially apart.

3. Energy Efficient – The continuous periodic listening and beaconing protocol using audio is orders of magnitude more energy efficient than discovery schemes using WiFi or Bluetooth, partly because these communication protocols are not designed for continuous discovery. Moreover, fine-grained control or modification of WiFi and Bluetooth device discovery schemes is not possible on consumer hardware devices.

The audio-based protocol has two major modes:

1. Periodic Listening – All devices periodically wake up (every Trx) and record sound for a duration drx. The time drx is the amount of audio required to detect whether a specific beacon tone frequency is being broadcast by an initiating node.

2. Beaconing – Whenever a sender node wishes to broadcast a gesture, it first emits a beacon tone for a duration dtx. This guarantees that all the receiving devices in range will receive the beacon and switch to continuous listening mode to record the gesture.

The theory of low-power listening states that, to guarantee every beacon will be successfully detected by nearby devices, the beacon duration dtx must be at least as large as Trx [16]. This relationship indicates that the shorter the beacon duration dtx (and hence the listening period Trx), the shorter the user needs to wait before starting the gesture, and thus the more natural the gesture can be. However, a short beacon duration requires a higher listening duty cycle on the receivers, which consequently consumes more energy. We will evaluate the tradeoff between energy consumption and duty cycle in Section 5.5.
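The timing relationship can be checked with a small simulation (an illustrative sketch; we assume a beacon of length dtx = Trx + drx + jitter, i.e., the base rule plus the guard band for wakeup jitter described in Section 3.3.2, which is sufficient for a full listening window to fall inside every beacon):

```python
import random

def beacon_detected(beacon_start, d_tx, t_rx, d_rx, jitter):
    """Conservatively require one full listening window inside the beacon.

    Receivers wake every t_rx seconds (plus a random wakeup jitter in
    [0, jitter]) and listen for d_rx seconds. A beacon lasting d_tx is
    detected if some listening window lies entirely within it.
    """
    beacon_end = beacon_start + d_tx
    k = 0
    while k * t_rx < beacon_end:
        start = k * t_rx + random.uniform(0, jitter)
        if start >= beacon_start and start + d_rx <= beacon_end:
            return True
        k += 1
    return False
```

With Trx = 1s, drx = 200ms, and an 85ms jitter bound (roughly the paper's 70ms mean plus one standard deviation), every randomly timed beacon of length Trx + drx + jitter is caught.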

To identify the sender, Spartacus encodes the device ID using Reed-Solomon coding, and sends the ID in a transmission immediately succeeding the beacon transmission. In the current implementation of Spartacus, the device ID is modulated using a 16-ary frequency-shift keying (FSK) scheme with a central frequency of 19KHz. Keys are separated in the frequency domain by a 50Hz guard band. Note that, since the central frequency of the tone emitted during the pointing gesture is 20KHz, the transmission of the device ID is at least 200Hz lower than the gesture tone, and thus does not cause ambiguity.
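One plausible key-to-frequency mapping is sketched below (an assumption for illustration: the paper does not specify the exact key placement, and the Reed-Solomon step is omitted; keys are simply spaced 50Hz apart around the 19KHz center, one key per 4-bit nibble of the ID):

```python
def id_to_frequencies(device_id, n_bytes=4):
    """Map an integer device ID to a sequence of 16-FSK tone frequencies.

    Hypothetical mapping: 16 keys at 50 Hz spacing, centered on 19 kHz,
    one key per 4-bit nibble of the big-endian ID bytes.
    """
    CENTER, GUARD = 19_000.0, 50.0
    key_freq = [CENTER + (k - 7.5) * GUARD for k in range(16)]

    nibbles = []
    for byte in device_id.to_bytes(n_bytes, "big"):
        nibbles += [byte >> 4, byte & 0xF]
    return [key_freq[n] for n in nibbles]
```

With this spacing the highest key sits at 19.375KHz, comfortably more than 200Hz below the 20KHz gesture tone, consistent with the separation argument above.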

3.3.2 Dealing with Wakeup Jitter

The low-power listening protocol relies on the assumption that the duration of the beacon (dtx) transmitted by the gesture transmitter is greater than the maximum interval (Trx) between adjacent listening events on the receiving devices. However, since mobile platforms, such as Android, are not real-time operating systems, wakeup jitter can be observed between when an API starts recording sound and when the system actually begins recording. We empirically measured the latency of an API call on Galaxy Nexus Android phones. As shown in Figure 7, the average jitter in the wakeup timer is about 70ms, with a standard deviation of 15ms. This result indicates that using the original beacon duration will cause the receiving devices to miss interaction triggers, leading to a failure to capture the upcoming gesture tone. To solve this problem, as shown in Figure 8, we include an additional guard band in the beacon length, based on empirical measurements, to account for the wakeup jitter.


Figure 7: Distribution of wakeup jitters.

Figure 8: The graph shows the relationship between the wakeup period Trx and the wakeup duration drx of the gesture receivers, and the beacon duration dtx of the gesture transmitter. Note that, due to the existence of the wakeup jitter τrx, an additional guard band σtx is used in the beacons.

4. IMPLEMENTATION

To evaluate the interaction system, we implemented the Spartacus system on the Android platform and tested it on various phones, including the Galaxy Tab, Nexus 7, Galaxy Nexus, and HTC One S. This section discusses the implementation details of Spartacus.

4.1 Software Implementation

The current Spartacus has a client end and a server end. The client end runs as an Android app, which has four components: GestureSensing, LowPowerListening, AudioModem, and the GUI. The design of Spartacus strictly follows the object-oriented programming paradigm, such that each component is standalone and can easily be incorporated into any existing app as add-on interaction support. For example, if an app needs to conduct low-power audio listening, it simply calls LPL.start() to start the listening thread, and a message handler is used to receive notifications in case an interaction request is detected. Similarly, for issuing gestures, an app can simply call GestureSensing.makeGesture(), and an audio tone at a predefined frequency will be generated during the gesture; on the receiver's end, a counterpart GestureSensing.analyzeGesture() triggers a background thread that runs the signal processing pipeline described in Section 3.2 to compute the frequency shifts. The entire Spartacus project contains a total of 3500+ lines of Java code, including both the server and the client.

4.2 Hardware Limitations on Mobile Devices

To generate and receive audio tones emitted during the course of the gestures, Spartacus leverages the built-in speakers and microphones on mobile devices, and uses software-implemented digital signal processing algorithms to detect the gestures. During our implementation process, we conducted experiments to understand the limitations of modern mobile devices in emitting high-frequency audio tones.

Previous research has found that microphones on commodity mobile phones support sampling rates up to 48KHz, which enables the use of tone frequencies as high as 24KHz [4, 7, 8]. In Spartacus, we use tone frequencies higher than 20KHz, which are inaudible [8]. However, a downside of using high-frequency tones is the potentially stronger energy degradation of sound. To investigate this effect on mobile devices, we model the transmission of audio tones as three consecutive processes, as shown in Figure 9. Similar to an RF communication system, the degradation of signal energy occurs inside the speakers and the microphones, as well as during transmission through the acoustic channel. We investigate each of these processes.

First, to quantify the energy degradation of sound, we benchmarked the frequency responses of speakers and microphones on various mobile devices. A Sennheiser MKE 2P microphone and a Yamaha NX-U10 speaker were used as references. As shown in Figures 10(a) and 10(b), we observe that, due to the characteristics of the hardware, significant energy degradation exists for audio tones higher than 15KHz. This phenomenon is not surprising considering that the hardware of mobile phones is designed intentionally for human conversations and music, whose dominant energies are generally lower than 15KHz. We also found that, as tone frequencies go up, the degradation becomes even larger. To be specific, for every 1KHz increase in tone frequency, the degradation of sound energy increases by 5dB on speakers and 3.3dB on microphones, respectively. Second, we investigated the degradation of sound energy caused by transmission through the acoustic channel. As shown in Figure 10(c), we observe an average 3.2dB/m decrease in sound energy from 1m to 6m, irrespective of the tone frequency.

In summary, we found that the energy degradation of sound is jointly caused by the frequency responses of the speakers, the frequency responses of the microphones, and the transmission through the acoustic channel. For interactions in close proximity, the degradation largely comes from the frequency responses of the hardware. These results indicate that, to reduce energy degradation and increase interaction range, audio tones with lower frequencies should be leveraged.
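Putting the measured slopes together gives a first-order loss model (a sketch under the assumption that the per-KHz hardware slopes and the per-meter channel slope reported above are linear and additive, relative to a 15KHz tone at 1m):

```python
def extra_loss_db(freq_khz, dist_m, ref_khz=15.0, ref_m=1.0):
    """Extra sound-energy loss (dB) relative to a 15 kHz tone at 1 m.

    Uses the measurements of Section 4.2: ~5 dB/kHz extra loss at the
    speaker and ~3.3 dB/kHz at the microphone above 15 kHz, plus
    ~3.2 dB/m attenuation in the acoustic channel.
    """
    SPK_DB_PER_KHZ, MIC_DB_PER_KHZ, CHAN_DB_PER_M = 5.0, 3.3, 3.2
    hw = (SPK_DB_PER_KHZ + MIC_DB_PER_KHZ) * max(0.0, freq_khz - ref_khz)
    ch = CHAN_DB_PER_M * max(0.0, dist_m - ref_m)
    return hw + ch
```

For a 20KHz tone at 3m, the hardware term contributes about 41.5dB while the channel contributes only about 6.4dB, which illustrates why, in close proximity, the degradation largely comes from the hardware.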

5. EVALUATION

In this section, we evaluate the performance of Spartacus in different interaction scenarios. Spartacus leverages the pointing gestures of users to accurately select the target device, without user configuration. To understand the boundaries of system performance, we first conduct experiments to investigate the trajectory and velocity variances of the pointing gestures of average users. Then we provide a performance analysis of Spartacus, discussing the device selection accuracy and the interaction range. Finally, we discuss the


Figure 9: Audio tone transmission between mobile devices.

Figure 10: Frequency responses of various mobile devices (1/16 octave band-filtered from 15KHz). (a) Frequency responses of speakers as a function of the tone frequency. (b) Frequency responses of microphones as a function of the tone frequency. (c) Attenuation of sound energy vs. distance.

interaction latency, and present the power consumption of Spartacus as compared to other state-of-the-art techniques.

5.1 Evaluation of Pointing Gestures

To accurately select the target device, Spartacus assumes that when a user points her phone towards the target device, the target device will always observe the largest Doppler frequency shift. However, in practice, user gestures can be significantly diverse in terms of directional precision, velocity, and trajectory. Such diversity makes accurately estimating the peak frequency shift significantly challenging.

In order to investigate what factors may affect the performance of Spartacus, we conducted experiments to fully understand the characteristics of the pointing gestures of average users. Some fundamental questions that we wanted to investigate include:

1. How diversely do users point their phones, and how fast can a user point?

2. If the user points fast enough, how often does the target device observe the highest frequency shift, and thus the highest velocity, of the gesture?

3. If we want to estimate the frequency shifts, how much frequency- and time-domain resolution do we need to successfully capture the peak frequency shift within a gesture?

To answer these questions, we conducted experiments with 12 participants (6 female) and investigated their pointing gestures. To capture the gesture trajectories, we attached a video recorder to the ceiling directly above the participants and videotaped the entire experiment. The video recorded the complete 2D trajectories of the gestures. Before the experiment, we briefed the participants on the idea of Spartacus and let them freely choose any natural gesture they wanted. During the experiment, each participant performed 10 gestures towards a target device 2m away, using a Galaxy Nexus phone. A red marker was attached to the participants' hands for motion tracking. After the experiment, we detected the hand trajectories of the participants using image processing techniques. We then estimated the velocities of the gestures from the frame rate of the video. We summarize our findings below:

Finding 1: As shown in Table 2, three types of gesture trajectories were seen during the experiments. Among the three gesture types, a majority of the participants (10 out of 12) predominantly utilized a vertically downward pointing movement. To further analyze the Doppler frequency shifts, we plotted the spectrogram of each gesture and correlated the spectrogram with the recorded video. As shown in Figures 11(a) and 11(b), we found that the arm trajectory variance of the participants who performed this gesture was limited – in most cases, the velocity of users' arms increased while stretching out (i.e., Frame #30), and the peak velocities predominantly occurred close to the end of this process (i.e., Frame #31). Then the arm reached a full stretch, and the velocity diminished quickly. We found that most of the participants fully stretched out their arms in the experiments, which translates to a 55cm to 75cm phone displacement and the angular gains discussed in Section 3.2.2. We will show how this factor affects the performance of Spartacus in Section 5.3.1. Since gesture trajectories can be easily differentiated using the built-in inertial sensors of mobile devices, as a proof of concept, we focus on evaluating this vertically downward gesture trajectory in the current design of Spartacus.

Finding 2: To investigate how often the target device


Figure 11: The graphs show (a) a vertically downward pointing gesture performed by one student and (b) the corresponding Doppler frequency shifts of each frame. The yellow line indicates the direction of the target device.

Table 2: Trajectory variations of gestures

Gesture Type            Total Count (Male/Female)
Vertically downward     10 (6/4)
Vertically upward       1 (0/1)
Horizontally outward    1 (0/1)

could successfully observe the maximum velocity of the gestures, we computed the directions of the highest 70% of gesture velocities, and calculated the angular deviation from the direction towards the target device. As shown in Figure 12(b), we found that in most cases, the highest velocities were predominantly directed towards the target device, with an average ±7.5° angular bias. When interacting with a device 3 meters away, this bias translates to 0.38m. This result indicates that participants are able to precisely point their phones towards the target device, and that selecting the target device using the maximum velocity is appropriate.

Finding 3: The peak velocity of the gestures across all participants was 3.4m/s on average, as shown in Figure 12(a). This is similar to results reported in related work [4]. Using an audio tone at 20KHz and assuming the speed of sound is 343m/s, this translates to a maximum Doppler frequency shift of 200Hz. Moreover, we found that the peak velocity is transient. As shown in Figure 12(c), most of the gestures lasted less than one second, and the peak velocities appeared and diminished within 25ms. These findings indicate that, to accurately estimate the peak velocity, Spartacus needs a high time-domain resolution to position the peak frequency shifts, as well as a high frequency-domain resolution to estimate the magnitude of the shift.
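The 200Hz figure follows directly from the low-speed Doppler approximation Δf = f0 · v / c:

```python
SPEED_OF_SOUND = 343.0  # m/s, as assumed in the paper

def doppler_shift(tone_hz, velocity_ms):
    """Doppler shift at a static receiver for a source moving toward it
    at velocity_ms (low-speed approximation: shift = f0 * v / c)."""
    return tone_hz * velocity_ms / SPEED_OF_SOUND
```

A 20KHz tone moving at the measured 3.4m/s peak gesture velocity yields a shift of about 198Hz, i.e., the "maximum Doppler frequency shift of 200Hz" quoted above.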

After understanding the variances of the pointing gestures, we evaluate the performance of Spartacus under different indoor scenarios.

5.2 Experimental Setup

The first set of experiments was done in a student lounge at our university, in which multiple ambient noise conditions were tested. The other two sets of experiments were done in a student cubicle area and along a hallway. During each experiment, a target device was placed in front of a student at distance d; another device was placed α° apart from the direction of the target device, at the same distance. For each test, a student pointed a Galaxy Nexus phone 25 times towards the target device, with a peak velocity of about 3m/s. To guarantee that each individual gesture was not made too fast or too slow, after the experiments we manually inspected the duration of each gesture, and selected the 20 of the 25 gestures that achieved the closest peak velocities for analysis. To compute the frequency shift, audio tones were captured at the two candidate devices at 44.1KHz, undersampled 7 times to 6.3KHz, and then processed using the signal processing pipeline discussed in Section 3.2. A 2048-point FFT was applied to each 10ms analysis window, with a 75% overlap ratio. These parameters were used throughout the entire evaluation process.
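These parameters imply the following frequency- and velocity-domain resolutions (a back-of-the-envelope check under the assumption that one FFT bin is the smallest detectable shift step):

```python
FS = 44_100.0        # capture sample rate (Hz)
UNDERSAMPLE = 7      # undersampling factor
NFFT = 2048          # FFT size
F0 = 20_000.0        # gesture tone frequency (Hz)
C = 343.0            # speed of sound (m/s)

# Frequency resolution of one FFT bin at the effective sample rate.
bin_hz = (FS / UNDERSAMPLE) / NFFT

# Velocity change corresponding to one bin of Doppler shift: dv = df * c / f0.
vel_per_bin = bin_hz * C / F0
```

One bin is roughly 3.1Hz, or about 0.05m/s of gesture velocity, which is why the ~200Hz peak shift of a 3.4m/s gesture is easily resolvable after undersampling.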

5.3 Device Selection Accuracy

5.3.1 Performance with Distances and Angles

Spartacus selects the target device by comparing the detected peak velocity of user gestures among nearby devices. As shown in Figure 15, as the distances between devices increase, the device selection accuracy drops gradually. This is consistent with our analysis of sound energy degradation: as described in Section 3.2, since Spartacus leverages an energy comparison method for tone detection, the energy difference between tones and other frequency bands decreases as the distances increase. Moreover, due to the performance gain caused by the stretch of arms, for all tested device directions, Spartacus achieved 90% accuracy within 4 meters, and 80% within 5 meters.

To evaluate how directional changes affect the performance, we keep the devices at fixed distances and change α. As shown in Figure 14, as α decreases, the accuracy of device selection drops. For interactions within 3 meters, the device selection accuracies are above 90% when devices are at least 30° apart. When the devices are at least 45° apart, the accuracies are above 90% within 5 meters.

It is worth noting that, although the theoretical angular resolution of Spartacus is 10°, we still achieved high device selection accuracy (i.e., above 90%) when the directional difference between devices is lower than 20° and the interaction range is within 1 meter. This is because when the interaction range is short, the angular gain due to the stretch of arms is significant. These evaluation results show that the performance of Spartacus is robust against directional and distance changes for interactions in close proximity. However, as distances keep increasing (i.e., above 5 meters), the performance drops gradually due to the decrease in sound energy and the resultant difficulty in detecting the audio tones. This observation justifies the close-range interaction scenarios of Spartacus, where various ubiquitous applications take place.


Figure 12: Characteristics of pointing gestures. (a) Speed distribution. (b) Directional precision. (c) Duration variance.

Figure 13: The three different indoor scenarios where we conducted experiments.

Figure 14: Performance of device selection when device directions change.

5.3.2 Performance Under Noisy Conditions

To evaluate the performance of Spartacus against different noise backgrounds, we conducted experiments in the student lounge at different times. The first experiment was done when a couple of students were having a group discussion in the lounge, resulting in human conversations in the background. Previous research has shown that the sound of clanging metal generates high-frequency noise [8]. To evaluate this effect, we purposely played a piece of rock music (i.e., "Burn It Down" by Linkin Park) in the background, which features rich clangs of metal instruments. In both experiments, we varied the distance from 1m to 2m. The experimental results from the lounge under ambient noise are used as a comparison. As shown in Figure 17, the experimental results show that at 1m distance, the performance in all three cases, for all tested directions, is consistent, with higher than 95% accuracy. When the distance is increased to

Figure 15: Performance of device selection when device distances change.

2m, except for the 30° case under human conversations, all three cases achieved above 90% accuracy. The audio spectra in Figure 16 indicate that metal clangs can hardly reach frequencies above 18KHz, and thus have limited effect on Spartacus. These experimental results indicate that the performance of Spartacus is robust against common indoor noises.

5.3.3 Performance in Different Scenarios

To evaluate the performance under different indoor scenarios, we conducted experiments in a cubicle area and a hallway, as shown in Figure 13. Due to the limited space in these scenarios, we only tested performance up to 1.5m at α = 30°. Figure 18 shows the results. At a distance of 0.5m, all three cases had 100% device selection accuracy. As the distance increases, the performance in the cubicle area and the hallway decreases slightly. This is primarily


Figure 16: Spectra of the three noise cases (ambient noise, music, and speech) measured at 1m. Note that, in all cases, the audio tone at 20KHz (i.e., the spikes on the spectra) for the pointing gestures is easily detectable.

due to the stronger multi-path effects in the two scenarios as compared to the student lounge. However, in all three cases, Spartacus achieved higher than 85% accuracy.

5.4 Interaction Latency

Using the undersampling technique, Spartacus reduces the number of audio samples that need to be processed, without decreasing the time-domain resolution. As discussed in Section 3.1, if the undersampling technique were not used, Spartacus would have to increase the number of FFT points to achieve an equivalent angular resolution, which, as a consequence, involves longer processing time. To evaluate this, we tested the processing latency of Spartacus using different numbers of FFT points. As shown in Figure 19, Spartacus leverages 2048-point FFT processing, which takes 1.5s to process 1 second of gesture audio. To achieve the same angular resolution, a traditional FFT approach would have to use at least 8192-point FFT processing, which takes 8.7s! Such processing latency is impractical for any responsive interaction, such as mobile gaming. These results justify the use of undersampling in Spartacus.

5.5 Power Consumption

Spartacus leverages a low-power audio listening technique to automatically sense potential interaction requests from nearby devices, so as to eliminate any user involvement. In this section, we evaluate the power consumption of Spartacus during low-power listening. To compare the performance under different duty cycles, we fixed the duration of each listening session at 200ms and changed the period. To eliminate hardware variations, all the experiments were done on Galaxy Nexus mobile phones. In the experiments, the screens and the CPUs of the mobile phones were completely shut off, and the power consumption was measured using an oscilloscope. For each test, we measured the power consumption by running Spartacus's low-power listening task for 5min, and we computed the mean and the standard deviation of the collected data.

The experimental results are shown in Figure 20. As a baseline, we first measured the power consumption of the devices when idle, i.e., with the screen turned off and all application processes shut down. This yields the base-

Figure 17: Performance of device selection with various background noise conditions. (a) Distance = 1m. (b) Distance = 2m.

line power consumption at 75mW. We then evaluated the performance of Spartacus. When the duty cycle is 5% (i.e., the period equals 4s), Spartacus consumes 120mW. As the duty cycle increases, the power consumption goes up gradually. At the maximum 25% duty cycle, Spartacus consumes 250mW. As a comparison, we tested state-of-the-art peer-to-peer scanning techniques, and found that WiFi Direct consumes 460mW of power, and Bluetooth 4.0 consumes 670mW. These results indicate that, on average, Spartacus achieves about 4X lower energy consumption than WiFi Direct and 5.5X lower than the latest Bluetooth 4.0 protocol, respectively.
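The headline ratios can be sanity-checked from the measured numbers (we assume the comparison is made against the 5% duty-cycle operating point, the lowest-power Spartacus setting reported):

```python
# Measured mean power draws from Section 5.5 (mW).
P_SPARTACUS_5PCT = 120.0   # Spartacus low-power listening at 5% duty cycle
P_WIFI_DIRECT = 460.0      # WiFi Direct peer-to-peer scanning
P_BLUETOOTH_40 = 670.0     # Bluetooth 4.0 scanning

wifi_ratio = P_WIFI_DIRECT / P_SPARTACUS_5PCT   # ~3.8, i.e. "about 4X"
bt_ratio = P_BLUETOOTH_40 / P_SPARTACUS_5PCT    # ~5.6, i.e. "about 5.5X"
```

460/120 ≈ 3.8 and 670/120 ≈ 5.6, consistent with the rounded "4X" and "5.5X" figures quoted in the abstract and above.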

6. RELATED WORK

Spartacus leverages built-in microphones and speakers on commodity mobile devices for both active device interaction and passive listening for potential interaction requests. In this section, we compare Spartacus with previous work that leverages audio signals in two main categories: mobile sensing and device interaction.

6.1 Audio Processing in Mobile Sensing

Microphones have been widely used in mobile sensing applications to capture audio data. For example, Miluzzo et al. used human conversation snippets for analyzing


Figure 18: Performance of device selection in different indoor scenarios. α is fixed at 30°.

Figure 19: The processing time of a 1-second audio recording using different numbers of FFT points.

social activities [12]. SurroundSense also utilized audio data, combined with data from other sensing modalities, such as accelerometers, cameras, and magnetometers, to detect the locations of users for social context inference [1]. Similarly, Lu et al. provided a tailored signal processing pipeline for audio sensing and learning, such that unknown social events can be automatically identified and easily labeled [9]. These early projects largely focused on knowledge discovery tasks using audio data, while assuming a fixed or default audio sensing mechanism.

More recent work has addressed energy-efficient continuous audio sensing. JigSaw and Darwin Phones focused on enabling energy-efficient continuous sensing and collaborative learning techniques [10, 11]. MoVi presented an approach of using collaboratively sensed audio data from multiple participants to create integrated social event records [2]. SwordFight provides a continuous and accurate distance ranging technique using the time difference of sound arrivals [22].

All the previous work mentioned above leveraged microphones in a continuous manner. Spartacus differs from them by proposing a low-power audio listening technique for passive interaction sensing. Since interaction requests from nearby devices can happen spontaneously, Spartacus must coordinate the tradeoff between detection latency and

Figure 20: Power consumption (mW) of Spartacus under different duty cycle settings (%). For reference, the scanning power consumption of Bluetooth 4.0 and WiFi Direct is also shown, along with a baseline.

energy efficiency. Other papers have also presented systems using audio tones for indoor counting [7] or localization [8]. Differing from them, Spartacus leverages high-frequency tones to generate the Doppler effect for device interaction.
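The latency-energy tradeoff behind duty-cycled listening can be sketched numerically. The power and period figures below are illustrative assumptions, not measured values from the paper: average power falls roughly linearly with the duty cycle, while the worst-case beacon detection latency grows as the sleep portion of each period lengthens.

```python
def avg_power_mw(duty_cycle, active_mw=300.0, idle_mw=20.0):
    """Average power for a given duty cycle in [0, 1]; power figures assumed."""
    return duty_cycle * active_mw + (1.0 - duty_cycle) * idle_mw

def worst_case_latency_s(duty_cycle, period_s=2.0):
    """If the device listens for duty_cycle * period_s out of every period_s,
    a beacon can go undetected for at most the sleep portion of one period."""
    return (1.0 - duty_cycle) * period_s

for dc in (0.05, 0.10, 0.25):
    print(f"duty={dc:.0%}  power={avg_power_mw(dc):.0f} mW  "
          f"latency<={worst_case_latency_s(dc):.2f} s")
```

Raising the duty cycle buys lower latency at the cost of higher average power, which is the axis traded off in Figure 20.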

6.2 Spatially-Aware Device Interactions

Interactions through mobile devices generally require two functions: device selection and interaction detection. For device selection, previous work has shown that touching, scanning, and pointing are the most common interactions when performing user-mediated object selection and indirect remote control [17]. Point & Connect (P&C) proposed an interaction technique based on the time difference of sound arrivals [15]. Both Spartacus and P&C can be used for device interactions in close proximity. However, the two systems differ significantly in several aspects. First, P&C requires users to manually set up a dedicated broadcast WiFi channel or make their Bluetooth discoverable; enabling P&C may prevent users from using their default WiFi networks. In contrast, Spartacus runs on the mobile device as a system service, and its periodic interaction detection mechanism automatically detects potential interactions. Second, P&C focused on providing a device selection solution while assuming that surrounding devices have already launched the related service and are continuously waiting for interaction requests. In practice, this continuous audio listening would consume significant energy. Spartacus therefore uses a duty-cycled audio listening mechanism, and our evaluation results show that this mechanism is energy-efficient compared to traditional peer-to-peer communication techniques such as WiFi Direct and Bluetooth 4.0.

SoundWave leverages the Doppler effect to sense user gestures for close-range (i.e., within 1 m) single-device interactions [4]. Because the Doppler effect is produced by the user's arms moving in front of a laptop, and the laptop is both the transmitter and the receiver, the generated frequency shift is doubled. This significantly increases the detection accuracy of the frequency shifts and makes detection easier. In contrast, Spartacus selects the target device by comparing the peak frequency shifts across multiple devices, which requires higher frequency- and time-domain resolution. We solve


these problems by proposing a bandpass signal processing pipeline that increases Doppler shift estimation accuracy.
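The spectral-peak style of Doppler estimation described above can be sketched as follows. The pilot tone frequency, sample rate, FFT size, and band width here are illustrative assumptions, not parameters taken from the paper; the bandpass step is approximated by masking the spectrum to a narrow band around the tone.

```python
import numpy as np

FS = 44100        # sample rate in Hz; typical phone audio, assumed here
F0 = 21000.0      # pilot tone frequency in Hz; illustrative, near-inaudible
C = 343.0         # speed of sound in m/s

def doppler_shift_hz(received, band_hz=500.0, n_fft=8192):
    """Estimate the Doppler shift of the pilot tone by locating the spectral
    peak inside a narrow band around F0 (bandpass done by spectral masking)."""
    spectrum = np.abs(np.fft.rfft(received, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / FS)
    band = np.abs(freqs - F0) <= band_hz     # ignore everything outside the band
    peak = freqs[band][np.argmax(spectrum[band])]
    return peak - F0

# Synthetic check: a tone emitted while the sender moves toward the
# receiver at v = 2 m/s arrives shifted to f' = F0 * C / (C - v).
v = 2.0
f_recv = F0 * C / (C - v)
t = np.arange(int(0.2 * FS)) / FS
received = np.sin(2 * np.pi * f_recv * t)
print(f"estimated shift: {doppler_shift_hz(received):.1f} Hz, "
      f"theory: {f_recv - F0:.1f} Hz")
```

With an 8192-point FFT at 44.1 kHz the bin width is about 5.4 Hz, so a ~2 m/s pointing gesture (a shift of roughly 120 Hz at 21 kHz) is easily resolvable; larger FFTs sharpen the estimate at the cost of the processing time shown in Figure 19.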

To detect nearby devices and interaction requests, various techniques can be used. However, existing techniques either require extra infrastructure (e.g., laser tags and pointers [13]) or do not provide spatial accuracy sufficient to support device interactions (e.g., WiFi-based indoor localization [21]). Other work involves extra effort from users to initiate interactions (e.g., the simultaneous shaking required by the Bump app) or is not energy-efficient (e.g., WiFi or Bluetooth techniques). PANDAA provides an automatic device localization service using ambient sound generated in indoor environments, without requiring extra infrastructure or manual effort, and has achieved centimeter-level accuracy [18]. However, PANDAA only supports devices in stationary placements; if moving users are involved in interactions, other mechanisms must be used. More recent work, Polaris, provides an indoor orientation determination technique that could support spatially-aware indoor device interactions [19]. However, since Polaris deals only with the absolute directional relationships of devices (i.e., relative to the earth's magnetic field), nearby devices can hardly be differentiated if their relative orientations change, such as in interaction scenarios that involve moving users. In this paper, we describe the low-power audio listening technique used in Spartacus to detect nearby interaction requests, which periodically wakes up the CPU and microphones on mobile devices and detects audio beacons emitted by nearby devices. We also present experimental results showing that this low-power audio listening technique is much more energy-efficient than other state-of-the-art techniques, such as WiFi Direct or Bluetooth 4.0.
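The wake-and-check step could, for instance, use the Goertzel algorithm, which evaluates a single DFT bin and is far cheaper than a full FFT when only one beacon frequency must be tested. The paper does not specify its detector, so the beacon frequency, window length, and threshold below are illustrative assumptions.

```python
import math

def goertzel_power(samples, target_hz, fs):
    """Power of a single DFT bin at target_hz (Goertzel algorithm) --
    much cheaper than a full FFT for checking one beacon frequency."""
    n = len(samples)
    k = round(n * target_hz / fs)            # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev**2 + s_prev2**2 - coeff * s_prev * s_prev2

def beacon_present(samples, fs=44100, beacon_hz=20000.0, threshold=1e4):
    # threshold is an assumed calibration constant, not from the paper
    return goertzel_power(samples, beacon_hz, fs) > threshold

# During each wake window the device grabs a short recording and checks it:
fs = 44100
tone = [math.sin(2 * math.pi * 20000.0 * n / fs) for n in range(2048)]
silence = [0.0] * 2048
print(beacon_present(tone), beacon_present(silence))  # True False
```

Because the beacon frequency is known in advance, the CPU only has to run this O(N) loop once per wake window before deciding whether to spin up the full gesture detection pipeline.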

7. DISCUSSION

In its current stage, the Spartacus system focuses mainly on the signal processing techniques and on proving the novel interaction concept through the use of the Doppler effect. This section addresses a number of related system issues and discusses their possible solutions.

7.1 Energy-Efficient Interaction Triggers

To support automatic interaction triggers without user involvement, Spartacus leverages energy-efficient audio sensing to automatically detect interaction intents from nearby devices. This makes the system more appropriate for mobile devices with limited energy budgets. While the current system adopts continuous low-power audio listening, we note that, if energy is not a major concern in a particular application, the system could also be enabled on demand. Moreover, in addition to the proposed energy-efficient audio sensing technique, the system could be triggered by other traditional communication schemes, such as Bluetooth or WiFi Direct.

When choosing between different interaction triggering mechanisms, there are a few design tradeoffs. On one hand, as we have pointed out in Section 5.5, the proposed low-power audio listening technique is more energy-efficient than Bluetooth 4.0 and WiFi Direct. On the other hand, due to the intrinsically slow data rate of audio signals, the user has to wait a couple of seconds for a "warmup beacon" before performing the gesture. In comparison, this latency could be avoided by using faster wireless communication schemes, such as Bluetooth or WiFi.

7.2 Security Issues

Spartacus leverages the Doppler effect to support spatially-aware device interactions. This interaction may be vulnerable to security attacks. For example, when a user performs a pointing gesture, a malicious device standing close by could pretend to have detected higher Doppler shifts than other devices, deceiving the sender into thinking it was the intended receiver; similarly, the same trick could be used to prevent other receivers from connecting with the sender. To avoid these issues, a secured connection mechanism could be used, so that only trusted and authenticated devices are allowed to report their Doppler shifts. Alternatively, knowledge from the users could be used to reduce the possibility of malicious recipients. For example, after the user's device determines the potential receiver that has reported the maximal Doppler shift, the name and identity of the receiver's owner would be shown on the user's device. The user could then verify the correctness of the interaction recipient.

7.3 Contentions Among Interaction Sessions

When the Spartacus system is used in a crowded scenario, e.g., an airport, where many users use the system concurrently, contention could be an issue for device pairing techniques. However, since Spartacus leverages an audio sensing mechanism for close-range interactions, concurrent interactions are automatically separated if they are far away from each other (i.e., beyond 5 m), due to the fast degradation of sound energy. If interactions do occur in close proximity, a contention coordination mechanism (e.g., a contention window that collects the Doppler shift measurements) could be leveraged to solve the problem.
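The claim that sessions beyond roughly 5 m separate naturally can be grounded in free-field attenuation: under the inverse-square law, sound pressure level drops 6 dB per doubling of distance. A tiny sketch, with the reference distance chosen here for illustration:

```python
import math

def spl_drop_db(d, d_ref=0.5):
    """Free-field (inverse-square) SPL attenuation at distance d (m),
    relative to a reference distance d_ref (m); d_ref is assumed."""
    return 20.0 * math.log10(d / d_ref)

# A tone heard at 5 m is ~20 dB weaker than at 0.5 m:
print(f"{spl_drop_db(5.0):.1f} dB")
```

A 20 dB drop (a 100x reduction in power) pushes a distant session's beacon well below a nearby session's, which is why distant concurrent interactions rarely contend.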

8. CONCLUSION

This paper presents Spartacus, a spatially-aware interaction system that allows users of mobile devices to establish spontaneous interactions with high accuracy, low latency, and low energy consumption. Spartacus does not require extra hardware. Leveraging sensors ubiquitous in most commodity smartphones, Spartacus enables users to use intuitive pointing gestures to select target devices with zero prior configuration. We provide a comprehensive evaluation of the Spartacus system under various use conditions. Our experimental evaluations show that Spartacus performs significantly better than existing device interaction systems in terms of intuitiveness, accuracy, latency, and energy consumption. This new paradigm of mobile interaction will enable natural, fast, and seamless interactions in numerous emerging applications.

9. ACKNOWLEDGEMENTS

This work was partially the result of generous support from Intel Inc. and National Science Foundation grants CNS-1135874 and 1149611. In addition, the authors would like to thank our shepherd, Prof. Shyam Gollakota, for providing valuable guidance during the process of finalizing the camera-ready version of the paper. Moreover, the authors would also like to thank Prof. Ian Lane of CMU for clarifying audio-related concepts and for lending us the reference microphones and speakers for the experiments.


10. REFERENCES

[1] M. Azizyan, I. Constandache, and R. Roy Choudhury. SurroundSense: mobile phone localization via ambience fingerprinting. In Proceedings of the 15th Annual International Conference on Mobile Computing and Networking, MobiCom '09, pages 261-272, New York, NY, USA, 2009. ACM.

[2] X. Bao and R. Roy Choudhury. MoVi: mobile phone based video highlights via collaborative sensing. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, MobiSys '10, pages 357-370, New York, NY, USA, 2010. ACM.

[3] M. T. Goodrich, M. Sirivianos, J. Solis, G. Tsudik, and E. Uzun. Loud and Clear: Human-Verifiable Authentication Based on Audio. In Distributed Computing Systems, 2006. ICDCS 2006. 26th IEEE International Conference on, page 10, 2006.

[4] S. Gupta, D. Morris, S. Patel, and D. Tan. SoundWave: using the Doppler effect to sense gestures. In Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems, CHI '12, pages 1911-1914, New York, NY, USA, 2012. ACM.

[5] H. Harada and P. Ramjee. Simulation and Software Radio for Mobile Communications. Artech House, Boston, 2002.

[6] K. Hinckley. Synchronous gestures for multiple persons and computers. In Proceedings of the 16th Annual ACM Symposium on User Interface Software and Technology, UIST '03, pages 149-158, New York, NY, USA, 2003. ACM.

[7] P. G. Kannan, S. Padmanabha, M. C. Chan, A. Ananda, and L.-S. Peh. Low Cost Crowd Counting using Audio Tones. In The 10th ACM Conference on Embedded Networked Sensor Systems (SenSys 2012), pages 155-168, 2012.

[8] P. Lazik and A. Rowe. Indoor Pseudo-ranging of Mobile Devices using Ultrasonic Chirps. In The 10th ACM Conference on Embedded Networked Sensor Systems (SenSys 2012), pages 99-112, 2012.

[9] H. Lu, W. Pan, N. D. Lane, T. Choudhury, and A. T. Campbell. SoundSense: scalable sound sensing for people-centric applications on mobile phones. In Proceedings of the 7th International Conference on Mobile Systems, Applications, and Services, MobiSys '09, pages 165-178, New York, NY, USA, 2009. ACM.

[10] H. Lu, J. Yang, Z. Liu, N. D. Lane, T. Choudhury, and A. T. Campbell. The Jigsaw continuous sensing engine for mobile phone applications. In Proceedings of the 8th ACM Conference on Embedded Networked Sensor Systems, SenSys '10, pages 71-84, New York, NY, USA, 2010. ACM.

[11] E. Miluzzo, C. T. Cornelius, A. Ramaswamy, T. Choudhury, Z. Liu, and A. T. Campbell. Darwin phones: the evolution of sensing and inference on mobile phones. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, MobiSys '10, pages 5-20, New York, NY, USA, 2010. ACM.

[12] E. Miluzzo, N. D. Lane, K. Fodor, R. Peterson, H. Lu, M. Musolesi, S. B. Eisenman, X. Zheng, and A. T. Campbell. Sensing meets mobile social networks: the design, implementation and evaluation of the CenceMe application. In Proceedings of the 6th ACM Conference on Embedded Network Sensor Systems, SenSys '08, pages 337-350, New York, NY, USA, 2008. ACM.

[13] B. A. Myers, R. Bhatnagar, J. Nichols, C. H. Peck, D. Kong, R. Miller, and A. C. Long. Interacting at a distance: measuring the performance of laser pointers and other devices. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '02, pages 33-40, New York, NY, USA, 2002. ACM.

[14] A. V. Oppenheim and R. W. Schafer. Digital Signal Processing. Prentice Hall.

[15] C. Peng, G. Shen, Y. Zhang, and S. Lu. Point & Connect: intention-based device pairing for mobile phone users. In Proceedings of the 7th International Conference on Mobile Systems, Applications, and Services, MobiSys '09, pages 137-150, New York, NY, USA, 2009. ACM.

[16] A. Purohit, B. Priyantha, and J. Liu. WiFlock: Collaborative group discovery and maintenance in mobile sensor networks. In Information Processing in Sensor Networks (IPSN), 2011 10th International Conference on, pages 37-48, Apr. 2011.

[17] E. Rukzio, K. Leichtenstern, V. Callaghan, P. Holleis, A. Schmidt, and J. Chin. An experimental comparison of physical mobile interaction techniques: touching, pointing and scanning. In Proceedings of the 8th International Conference on Ubiquitous Computing, UbiComp '06, pages 87-104, Berlin, Heidelberg, 2006. Springer-Verlag.

[18] Z. Sun, A. Purohit, K. Chen, S. Pan, T. Pering, and P. Zhang. PANDAA: physical arrangement detection of networked devices through ambient-sound awareness. In Proceedings of the 13th International Conference on Ubiquitous Computing, UbiComp '11, pages 425-434, New York, NY, USA, 2011. ACM.

[19] Z. Sun, A. Purohit, S. Pan, F. Mokaya, R. Bose, and P. Zhang. Polaris: getting accurate indoor orientations for mobile devices using ubiquitous visual patterns on ceilings. In Proceedings of the Twelfth Workshop on Mobile Computing Systems & Applications, HotMobile '12, pages 14:1-14:6, New York, NY, USA, 2012. ACM.

[20] P. Vandewalle, L. Sbaiz, J. Vandewalle, and M. Vetterli. How to take advantage of aliasing in bandlimited signals. IEEE Conference on Acoustics, Speech and Signal Processing, 3:948-951.

[21] H. Wang, S. Sen, A. Elgohary, M. Farid, M. Youssef, and R. R. Choudhury. No need to war-drive: unsupervised indoor localization. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services, MobiSys '12, pages 197-210, New York, NY, USA, 2012. ACM.

[22] Z. Zhang, D. Chu, X. Chen, and T. Moscibroda. SwordFight: enabling a new class of phone-to-phone action games on commodity phones. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services, MobiSys '12, pages 1-14, New York, NY, USA, 2012. ACM.