
Tongue-n-Cheek: Non-contact Tongue Gesture Recognition

Zheng Li, University of Maryland, Baltimore County, [email protected]
Ryan Robucci, University of Maryland, Baltimore County, [email protected]
Nilanjan Banerjee, University of Maryland, Baltimore County, [email protected]
Chintan Patel, University of Maryland, Baltimore County, [email protected]

ABSTRACT
Tongue gestures are a key modality for augmentative and alternative communication in patients suffering from speech impairments and full-body paralysis. Systems for recognizing tongue gestures, however, are highly intrusive. They either rely on magnetic sensors built into dentures or artificial teeth deployed inside a patient's mouth, or they require contact with the skin using electromyography (EMG) sensors. Deploying sensors inside a patient's mouth can be uncomfortable for long-term use, and contact-based sensors like EMG electrodes can cause skin abrasion. To address this problem, we present a novel contact-less sensor, called Tongue-n-Cheek, that captures tongue gestures using an array of micro-radars. The array of micro-radars acts as a set of proximity sensors and captures muscle movements when the patient performs a tongue gesture. Tongue-n-Cheek converts these movements into gestures using a novel signal processing algorithm. We demonstrate the efficacy of Tongue-n-Cheek and show that our system can reliably infer gestures with 95% accuracy and low latency.

Categories and Subject Descriptors
B.m [Hardware]: Miscellaneous

General Terms
Hardware Implementation, Sensor Design

Keywords
Tongue Gestures, Micro-radars, Paralysis Patients

1. INTRODUCTION
Paralysis caused by spinal cord injuries is on the rise in the United States and across the globe.


Figure 1: A prototype of the Tongue-n-Cheek system worn by a user. The sensor array comprises three micro-radars bolted to thinly cut acrylic sheets attached to a helmet. We have fabricated a custom-designed PCB that houses a filtering circuit, an amplifier, and an analog-to-digital converter.

In the United States alone, there are 250,000 to 500,000 new cases of spinal cord injuries registered every year [1]. Based on the severity of the injury, a patient might suffer from partial or full-body paralysis and limited mobility. For instance, a patient suffering from injury to the High-Cervical Nerves (C1-C4) has paralysis in their arms, hands, trunk, and legs [2]. Such patients can only use head, tongue, and eye movements as inputs for gesture recognition and environmental control. Amongst these three gesture inputs, the tongue is a popular input organ for a large population of paralysis patients. This is due to two reasons. First, the tongue is a flexible organ and can be used to perform a large set of gestures. It is used for human speech and therefore is an intuitive gesture input organ. Secondly, the tongue is controlled by the cranial nerve [3], which is embedded in the skull and is not easily damaged during spinal cord injuries. So it is suitable for gesture recognition for a large population of paralysis patients.

Assistive devices that use tongue gesture recognition for input and environmental control for severely paralyzed patients, however, suffer from several drawbacks. For instance, state-of-the-art tongue gesture recognition systems use magnetic sensors that need to be placed into dentures or embedded into the tip of the tongue [4]. Such a sensor, although accurate, requires surgical deployment and is highly invasive. Similarly, researchers have proposed the use of infra-red sensors placed on the tongue, which is equally invasive and intrusive [4]. A less intrusive sensor system for tongue gesture recognition uses the concept of surface electromyography (sEMG), which involves mounting the sensors on the skin surface close to the tongue [5, 6]. The sEMG sensors sense the combined movement of the suprahyoid muscles. Based on the relationship between tongue movement and suprahyoid muscle movement, the sensors can infer specific tongue gestures. While this approach is non-invasive and less intrusive than sensors embedded inside the human mouth, EMG sensors are contact-based electrodes. These sensors use Ag/AgCl electrodes dipped into a polymer gel for better contact. The gel, however, dries out quickly and can cause skin abrasion. Since paralysis patients have low skin sensitivity, injuries caused by skin abrasion can be deleterious. Even more expensive dry EMG sensors can also cause skin abrasion. Completely non-intrusive systems such as Tongible [7], which uses a camera to determine tongue movements, rely on the user opening their mouth when they perform a gesture, an action that may be difficult for severely paralyzed patients.

To address the above drawbacks in existing systems, in this paper we present Tongue-n-Cheek, a non-contact and non-intrusive tongue gesture recognition system. Our wearable prototype system mounted on a helmet is illustrated in Figure 1. Tongue-n-Cheek uses an array of micro-radars to detect movement of the suprahyoid and cheek muscles. The system converts these movements into a set of reliable gestures using our signal processing algorithm. The design, implementation, and evaluation of Tongue-n-Cheek present the following research contributions.

• Non-contact tongue gesture recognition: Tongue-n-Cheek uses an array of micro-radars to detect suprahyoid and cheek muscle movements correlated with tongue gestures. The system is completely non-invasive and minimally intrusive. The sensor is proximity-based and addresses the skin irritation issues of contact-based tongue gesture systems that use EMG electrodes and the invasiveness of systems that use magnetic sensors embedded inside the mouth. The sensor array works collaboratively to cancel noise due to movements of the head and stray movements in the environment, and it can reliably detect movements caused by the tongue.

• Signal processing on a micro-radar array: Tongue-n-Cheek uses a signal processing algorithm that combines energy detection and velocity detection based on the phase difference between the I and Q channels of the received signal. These input signals are converted into gestures in real-time.

• Prototype implementation and evaluation: We have prototyped Tongue-n-Cheek, a fully functional embedded sensing-and-processing system, using three micro-radars that operate at a frequency of 24 GHz. We evaluate Tongue-n-Cheek using five subjects and show that the microcontroller-based system can accurately detect gestures with 95% accuracy in real-time.

2. RELATED WORK
Tongue-n-Cheek builds on previous work on tongue-gesture recognition systems, gesture recognition systems for paralysis patients, and micro-radar systems for healthcare. Here we compare and contrast Tongue-n-Cheek with the most relevant literature.

2.1 Tongue-Gesture Recognition Systems
Existing tongue-gesture recognition systems can be categorized into three broad groups. The most widely known tongue-gesture recognition systems use sensors embedded inside the mouth. For instance, the TongueDrive system [8] uses a tongue piercing to place a magnetic sensor inside the mouth and an array of magnetic sensors outside the mouth. The array can infer the position of the tongue inside the mouth and hence can detect tongue gestures. The paper reported that the system can achieve 90% accuracy in about 1 s with a set of six gestures. Similarly, researchers have used infrared proximity sensors embedded inside the mouth to detect movements of the tongue with an average accuracy of 92.2% for four gestures [9]. Unfortunately, these systems are invasive. The second category of sensing systems uses EMG electrodes to detect muscle movements when the tongue moves. The TongueSee [6] system detects movement in the suprahyoid muscles that is correlated with tongue gestures, achieving 94.2% accuracy in 0.0625 s with a set of six gestures. Other systems use the concept of surface electromyography (sEMG), which involves mounting the sensors on the skin surface close to the tongue [5, 6]. These sensors, however, are contact-based and can cause skin irritation that can lead to injuries in paralysis patients. The third category of systems is completely non-intrusive and uses cameras to detect the tongue [7]. Such a system requires the patient to open his mouth to perform the gesture, which can cause fatigue. Tongue-n-Cheek is a completely non-invasive and non-contact tongue gesture recognition system that addresses the shortcomings of the above systems by using micro-radars built into a wearable device. Moreover, our evaluation demonstrates that Tongue-n-Cheek significantly improves latency while reaching state-of-the-art recognition accuracy.

2.2 Gesture Recognition for Paralysis Patients
There is a large corpus of research on building assistive-care devices for patients with limited mobility. These include the use of EEG sensors [10, 11] to detect brain waves for machine input, capacitive sensors [12, 13] to detect hand gestures, and eye-tracking systems based on cameras [14]. These sensing modalities, however, do not focus on tongue gesture measurements. The goal with Tongue-n-Cheek is to use an array of micro-radars as proximity sensors for detecting movement of both the cheek and suprahyoid muscles.

2.3 Micro-radar Sensors
In our implementation, we use an RFBeam K-LC2 [15] sensor, a dual-channel 24 GHz micro-radar. For our application, the micro-radar sensor should have the following two characteristics, which the RFBeam sensor satisfies. First, it should have dual-channel (I/Q) output. With dual-channel output, the sensor can distinguish the direction of movement and hence recognize a larger set of gestures. The details of analyzing the output I/Q channels are discussed in §4. Secondly, the sensor system should be portable. A major component contributing to the size of the sensor is the size of the antenna. The size of a single radiating element in the antenna is proportional to the wavelength of the transmitted wave. Our sensor operates at 24 GHz; hence, the wavelength is approximately 12.5 mm, and the sensor size is 25 mm × 25 mm. While a higher-frequency micro-radar would increase energy consumption and hardware complexity, a lower-frequency radar would lead to a larger sensor. For example, the commercially available HB100 sensor manufactured by AgilSense operates at 10 GHz and is 4 times larger than our sensor [16].

Micro-radars have been used for several healthcare applications [17]. These include fall detection [18], remote heart-rate detection [19], respiration pattern detection [20], cardio-pulmonary activity assessment [21], and tumor tracking [22]. These applications demonstrate that micro-radars are safe as a non-contact sensor. The innovative solution presented in this paper uses these sensors for tongue gesture recognition.

3. GESTURE SET AND BACKGROUND
Our current gesture set comprises six tongue gestures and one neutral gesture:

1. Front: moving forward

2. Back: moving backward

3. Right-hold: moving to the right and touching the right cheek

4. Right-release: moving from the cheek back to the center

5. Left-hold: moving to the left and touching the left cheek

6. Left-release: moving from the cheek back to the center

7. Neutral: not moving the tongue.

These gestures are intuitively related to steering a power wheelchair for paralysis patients. Left and right tongue movements can be used for turning the wheelchair towards the left and right, while the hold and release gestures can encode "keep turning" and "stop turning" respectively. The front and back tongue gestures can be used for moving the wheelchair forward and in reverse, while no gesture corresponds to placing the wheelchair in neutral. The above set of gestures can also be used for computer mouse control. The left-hold and left-release as well as right-hold and right-release gestures can be used to control the left and right buttons of the mouse, while the front and back gestures can correspond to the scroll wheel. In addition to these seven gestures, Tongue-n-Cheek also supports a wakeup gesture: Puff, which corresponds to filling the mouth with air and releasing the air, characterized uniquely by a simultaneous expansion on both sides of the face. The puff gesture is a common gesture used in assistive-care sip-and-puff switches.
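As a concrete illustration of the mapping just described, the C sketch below pairs each gesture with a wheelchair command. The enum and command names are hypothetical and not part of the Tongue-n-Cheek implementation; only the gesture-to-action pairing follows the text above.

/* Illustrative mapping from recognized gestures to wheelchair commands.
 * The gesture set follows the paper; the command names are hypothetical. */
typedef enum {
    GESTURE_FRONT, GESTURE_BACK,
    GESTURE_RIGHT_HOLD, GESTURE_RIGHT_RELEASE,
    GESTURE_LEFT_HOLD, GESTURE_LEFT_RELEASE,
    GESTURE_NEUTRAL, GESTURE_PUFF
} gesture_t;

typedef enum {
    CMD_FORWARD, CMD_REVERSE,
    CMD_TURN_RIGHT_START, CMD_TURN_RIGHT_STOP,
    CMD_TURN_LEFT_START, CMD_TURN_LEFT_STOP,
    CMD_IDLE, CMD_WAKEUP
} wheelchair_cmd_t;

static wheelchair_cmd_t map_gesture(gesture_t g) {
    switch (g) {
    case GESTURE_FRONT:         return CMD_FORWARD;
    case GESTURE_BACK:          return CMD_REVERSE;
    case GESTURE_RIGHT_HOLD:    return CMD_TURN_RIGHT_START; /* "keep turning"  */
    case GESTURE_RIGHT_RELEASE: return CMD_TURN_RIGHT_STOP;  /* "stop turning"  */
    case GESTURE_LEFT_HOLD:     return CMD_TURN_LEFT_START;
    case GESTURE_LEFT_RELEASE:  return CMD_TURN_LEFT_STOP;
    case GESTURE_PUFF:          return CMD_WAKEUP;           /* wake-up gesture */
    default:                    return CMD_IDLE;             /* neutral         */
    }
}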

Figure 2: The muscles that control the tongue (stylohyoid, geniohyoid, and cheek). In Tongue-n-Cheek we capture the movement of these muscles using an array of micro-radars. © Olek Remesz

Table 1: The group of muscles that move when each of the six gestures (excluding Neutral) is performed.

Gesture        Muscle group
Front          Geniohyoid
Back           Stylohyoid
Right-hold     Cheek
Right-release  Cheek
Left-hold      Cheek
Left-release   Cheek

Figure 2 illustrates the muscles around the jaw and on the face. Table 1 shows the combination of muscles that move when the above gestures are performed. The fundamental contribution of Tongue-n-Cheek is inferring these gestures reliably using a touch-less, proximity-based sensor. Additionally, we demonstrate that Tongue-n-Cheek detects these gestures while canceling exogenous movements due to the head and other objects in the environment.

4. Tongue-n-Cheek HARDWARE SYSTEM
The physical sensors employed for gesture recognition are three micro-radars mounted on a wearable device. The radars sense movements using the Doppler effect. The Doppler effect (or Doppler shift) is the difference between the observed frequency and the emitted frequency of a reflected wave when an object moves relative to the source generating the waves. Assuming the source and receiver are co-located on the radar, the received frequency is higher than the emitted frequency when the object moves towards the radar. Conversely, the received frequency is lower than the emitted frequency when the object moves away from the source. In the presented application, the movement measured is skin movement relative to the micro-radars. The skin movements detected are due to a combination of muscles moving when the tongue performs a gesture. The group of muscles that move when a tongue gesture is performed is described in Table 1. If fr is the frequency of the received wave, ft is the frequency of the transmitted wave, v is the velocity of the moving target object (a collection of muscles in our case), and c is the speed of light, then the following equation quantifies the relationship between fr and ft:

fr = ft · (c − v) / (c + v)    (1)

The shift in frequency is defined as fd = fr − ft, whose magnitude is approximately (2v/c) · ft when v ≪ c, and it can be used to determine the gestures.

We empirically determined that the velocity of movement of the muscles that correspond to our gestures lies between 0.02 m/s and 0.05 m/s. Since our radar works at a frequency of 24 GHz, the frequency shift we want to measure lies in the interval [2.7, 8.0] Hz. As our system detects tongue gestures in real-time, it needs to continuously determine the frequency shift (fd) of the received wave from the transmitted wave. One straightforward solution to recover fd is to use the Short-Time Fourier Transform (STFT) to compute a spectrogram. However, the time resolution of the recovered fd will be limited by the time-frequency uncertainty of the STFT. Also, the computational complexity will increase the cost in terms of energy, memory, and latency. Another method is to demodulate the signal in analog and obtain fd by calculating the instantaneous phase of the shifted signal. In the presented prototype, the received Doppler-shifted wave is demodulated into the I(t) (in-phase, real component of the wave) and Q(t) (quadrature, or 90° shift of the real component) channels. The Appendix shows the derivation of I(t) and Q(t) after applying low-pass filters.
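As a sanity check on the numbers above, the following short C program evaluates the 2·v·ft/c approximation for the stated velocity range. It is an illustrative calculation only (the constants are rounded), and the result lands close to the [2.7, 8.0] Hz interval quoted in the text.

#include <stdio.h>

/* Back-of-the-envelope check of the Doppler band quoted above.  ft and the
 * velocity range come from the text; 2*v*ft/c follows from Equation (1)
 * for v << c. */
int main(void) {
    const double c  = 3.0e8;   /* speed of light, m/s         */
    const double ft = 24.0e9;  /* radar carrier frequency, Hz */
    const double v_min = 0.02, v_max = 0.05; /* muscle speeds, m/s */

    double fd_min = 2.0 * v_min * ft / c;    /* roughly 3 Hz */
    double fd_max = 2.0 * v_max * ft / c;    /* roughly 8 Hz */

    printf("Doppler shift band: %.1f Hz to %.1f Hz\n", fd_min, fd_max);
    return 0;
}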

An indicator of the phase difference between the transmitted and received signal can be found by taking the arctan of I(t)/Q(t):

Θ = arctan(I(t)/Q(t)) = 2π fd t + φ − π/2 = 4π v t / λ + φ − π/2    (2)

Note that arctan() takes values in the interval (−π/2, π/2), meaning an additive ambiguity of an integer multiple of π is added to the result. Section 5 contains a discussion on removing this ambiguity so that dΘ/dt provides the velocity of the movement, which can be used as a distinguishing feature for inferring the gesture. This feature is used to design a robust signal processing algorithm described in §5.

To calculate the differential of Θ, however, the outputs I(t) and Q(t) need to be amplified and filtered. This is because even at close distances between the radar and the moving muscle (about 1 cm), the output of the I and Q channels still has a small power of -60 dBm. To this end, a custom printed circuit board was fabricated (Figure 1) with an analog bandpass filter (1 Hz to 60 Hz), an amplifier, and a 12-bit analog-to-digital converter (ADC) on the board. An ADC was added to the custom board instead of using the micro-controller's ADC for two reasons. First, simultaneous capture of the data from many I and Q channels is not supported by the micro-controller. Secondly, the Tongue-n-Cheek micro-radar system has sensors placed at a distance from the micro-controller board, and for prototyping purposes it was not desirable to transfer analog signals over long wires, since that requires careful shielding to minimize loss and noise.

Another issue with the system is the DC component of the signal. Even after filtering in both the demodulation and amplification stages, a DC component is likely to remain in the final analog output (the input to the ADC) due to DC offsets. To address this problem, the offset was characterized experimentally and removed during the digital processing stage.

For our prototype implementation, RFBeam K-LC2 dual-channel radars served as the sensors. Each sensor operated at 5 V and 32 mA in the experiments. As a control unit, an AVR Butterfly was used, which has an ATmega169P micro-controller with an 8 MHz clock and 1 KByte of RAM.

To demonstrate that the prototype can capture subtle tongue gestures, Figure 3 (a) and (b) are presented. For this figure, a subject performed the left-hold and left-release gestures. Figure 3 (a) shows the filtered and amplified I and Q channel waves and the differential of Θ when the left-hold gesture is performed, and Figure 3 (b) graphs the I and Q channels and the differential of Θ when the left-release gesture is performed. Figures 3 (a) and (b) clearly show that the I channel lags and leads the Q channel, respectively, for the two gestures. The figure also demonstrates that the differential of Θ captures this instantaneous phase difference.

4.1 Micro-Radar Array
Micro-radars are directional and hence their cone of view is narrow (30° in our prototype). Capturing the movement of a specific group of muscles therefore requires directing the sensor towards that group of muscles. Our gesture set requires capturing movement of muscles on the left and right cheeks and the hyoid muscles below the jaw. To capture these movements simultaneously, we use an array of micro-radars. The array also helps address a critical challenge in our system: canceling noise due to movements of the head and stray movements in the environment. To illustrate the problem, Figure 4 compares the signature of the received waves when the user only performs head movements and when he performs head movements together with tongue gestures, using an array of two sensors. Sensor 1 in the figure is placed near the forehead and Sensor 2 is placed close to the cheek. The figure shows that taking the differential of Θ can help distinguish head movements from tongue gestures.

5. SIGNAL PROCESSING ALGORITHM
Figure 5 illustrates the overall system architecture of Tongue-n-Cheek. Data from the array of micro-radars, demodulated in the I and Q channels, is filtered using an analog bandpass filter and then amplified. The amplified signals are read by a micro-controller and converted into gestures. The key novelty of the signal processing algorithm, co-designed with the physical system and described below, is in the optimizations that make it run in real-time on minimal embedded platforms. Since an embedded platform is computationally weak, an important goal is therefore to minimize computation-heavy operations such as multiplications and divisions.

There are two key steps in the Tongue-n-Cheek signal processing algorithm. First, the algorithm determines which micro-radar is detecting a gesture. It then segments the incoming I and Q data to determine the chunk of data where the gesture is being performed. Secondly, the algorithm extracts a single feature from the data chunk: the derivative of Θ defined in Equation 2.


Figure 3: (a) The filtered and amplified output of the I channel and the Q channel, and the differential of Θ, when a user performs a left-hold gesture. Note that the I channel lags the Q channel, and the differential of Θ captures the movement. (b) The filtered and amplified output of the I and Q channels, and the differential of Θ, when the user performs a left-release gesture. Note that the I channel leads the Q channel, and the differential of Θ captures the movement.


Figure 4: (a) I and Q channel outputs for two micro-radars, one placed near the forehead (sensor 1) and one placed near the cheek (sensor 2), when the head of the subject moves without any tongue gestures. The third subfigure shows the differentials of Θ for both sensors. (b) I and Q channel outputs for the two micro-radars when the user moves his head and performs the left-hold gesture. The third subfigure shows the differentials of Θ. The figure demonstrates that dΘ/dt can be used to distinguish head movements from the tongue gesture.

If the derivative is positive, then the muscle corresponding to the tongue gesture is likely moving towards the sensor (e.g., the front, right-hold, left-hold, and puff gestures); otherwise, the muscle is likely moving away from the sensor (e.g., the back, right-release, and left-release gestures). It should be noted that, in general, multipath propagation should be considered, as multiple reflection paths for the radar are likely. The summation of these paths results in a summation of several I and Q components dependent on the strength of the signal from each particular path. However, if it is assumed that there is a primary movement surface generating a dominating return component, then it can be assumed that dΘ/dt is due to that surface of movement. This assumption constrains the physical design of the system to minimize what the radar "sees" other than the desired skin area. In the algorithm, the direction of this primary movement (movement of the muscles towards or away from the micro-radar sensor) is found by determining the direction of change in Θ = arctan(I(t)/Q(t)) after applying a temporal phase unwrapping process. A description of the signal detection and segmentation, the feature extraction scheme, and parameter training follows.

5.1 Signal Detection and Segmentation
The signal processing algorithm in Tongue-n-Cheek receives the I and Q channel signals from the array of micro-radars. The first task that Tongue-n-Cheek performs is to analyze these signals and determine which radar is detecting a gesture. The radars are strategically placed such that a single radar points to a specific muscle group that moves due to a set of gestures.


Figure 5: End-to-end system architecture for Tongue-n-Cheek. An array of micro-radars collects data on muscle movements due to tongue gestures, and the received signal is demodulated into the I and Q channels. The signals from the channels are filtered and amplified and then read from a 12-bit analog-to-digital converter by the micro-controller. The micro-controller runs our gesture recognition algorithm, which simultaneously analyzes the signals from the radar array and converts them into tongue gestures. The gestures can be used for wheelchair control, environmental control, or mouse control.

For instance, a plausible sensor array configuration would have a radar pointing at the right cheek (radar 1), a radar pointing at the left cheek (radar 2), and a radar pointing at the lower jaw (radar 3). If a left-hold or left-release gesture is performed, then radar 2 will detect the gesture. For this detection, for every radar, Tongue-n-Cheek calculates the energy in the signal E = Σ_{n=n0}^{n0+L} (I[n]² + Q[n]²), where L is the length of a gesture defined in terms of the number of samples and is an input parameter for Tongue-n-Cheek. Note that the energy E in the signal is calculated over a moving window of length L. Once the energy E crosses a threshold δe (determined during a training phase), the segmentation module looks for the maximum peak in energy until the energy E falls below δe. If the center of this maximum peak occurs at time t0, the time period for the gesture is taken as (t0 − L, t0]. Figure 6 illustrates how the segmentation module works. This method of detecting when the gesture occurs has the following advantage: the smaller peaks, as shown in the figure, caused by stray movements in the vicinity of the primary muscle corresponding to the tongue gesture, are filtered out by the segmentation algorithm. Another issue is how to set the energy threshold δe. As the captured energy of tongue gestures varies with radar position, helmet position, and user, a calibration phase is used to learn the threshold specific to the user and the helmet mounting. Details of this training phase are described in §5.3.
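A minimal C sketch of this detection and segmentation step is given below. Only the moving-window energy, the threshold δe, and the center-of-maximum-peak rule come from the description above; the streaming interface, buffer sizes, and fixed-point types are assumptions made for illustration (12-bit, DC-removed samples keep the 32-bit sums from overflowing).

#include <stdint.h>
#include <stdbool.h>

#define L_SAMPLES 64            /* gesture length L, in samples (trained)   */

typedef struct {                /* state assumed zero-initialized           */
    uint32_t window[L_SAMPLES]; /* per-sample I^2 + Q^2 values              */
    uint32_t energy;            /* running sum over the window              */
    uint16_t head;              /* circular-buffer index                    */
    bool     above;             /* currently above the threshold?           */
    uint32_t peak_energy;       /* maximum windowed energy seen while above */
    uint32_t peak_time;         /* sample index of that maximum (t0)        */
} segmenter_t;

/* Feed one I/Q sample; returns true (and writes *t0) when a gesture segment
 * ends, i.e. the energy has fallen back below delta_e.  The gesture window
 * is then taken as (t0 - L, t0]. */
static bool segment_step(segmenter_t *s, int16_t i_s, int16_t q_s,
                         uint32_t now, uint32_t delta_e, uint32_t *t0) {
    uint32_t p = (uint32_t)((int32_t)i_s * i_s + (int32_t)q_s * q_s);
    s->energy += p - s->window[s->head];        /* slide the window          */
    s->window[s->head] = p;
    s->head = (uint16_t)((s->head + 1) % L_SAMPLES);

    if (s->energy > delta_e) {
        if (!s->above || s->energy > s->peak_energy) {
            s->peak_energy = s->energy;
            s->peak_time   = now;               /* candidate gesture center  */
        }
        s->above = true;
        return false;
    }
    if (s->above) {                             /* falling edge: emit t0     */
        s->above = false;
        *t0 = s->peak_time;
        return true;
    }
    return false;
}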

5.2 Feature Extraction


Figure 6: Signal segmentation. Tongue-n-Cheek calculates the sum of the energy in the I and Q channels using a sliding window of length L. The segmentation module looks for time instances when the energy crosses above a threshold. It then looks for energy peaks until the energy value falls below the threshold. The center of the maximum peak encountered is taken as the center of the gesture.

Although Equation 2 demonstrates that the velocity of the muscle moving with respect to the radar can be determined using a first-order derivative of Θ = arctan(I(t)/Q(t)), in practice this velocity cannot be determined directly from arctan since it is not a one-to-one function. As arctan is bounded by ±π/2, its temporal signal must undergo phase unwrapping. Therefore, d(t) is calculated as follows:

d(t) = phase_unwrap(arctan(I(t)/Q(t)))    (3)

The unwrapping is valid as long as we sample at least as fast as the Nyquist rate corresponding to the maximum Doppler shift frequency. The unwrap function is illustrated in Figure 7. However, calculating d(t) requires evaluating a trigonometric function, which can be prohibitively slow on an embedded system. Note that this function has to be continuously recalculated for real-time gesture recognition. Hence, it is important to optimize the calculation of this function to reduce the number of arithmetic operations.
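For reference, a straightforward floating-point version of Equation (3) is sketched below. It uses the standard-library atan2 (a full-quadrant variant of arctan(I/Q)), which reduces unwrapping to correcting 2π jumps; the embedded implementation discussed next avoids trigonometric calls entirely. The state layout is an assumption for illustration.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static double theta_prev = 0.0;   /* last unwrapped phase value        */
static double wrap_accum = 0.0;   /* accumulated multiples of 2*pi     */

/* Returns the unwrapped phase d(t) for one I/Q sample pair. */
double unwrapped_theta(double i_s, double q_s) {
    double theta = atan2(i_s, q_s);            /* arctan(I/Q) with quadrant */
    double delta = theta + wrap_accum - theta_prev;
    /* Under the Nyquist condition above, the true phase cannot jump by
     * more than pi per sample, so larger jumps are wrap-arounds. */
    while (delta >  M_PI) { wrap_accum -= 2.0 * M_PI; delta -= 2.0 * M_PI; }
    while (delta < -M_PI) { wrap_accum += 2.0 * M_PI; delta += 2.0 * M_PI; }
    theta_prev = theta + wrap_accum;
    return theta_prev;             /* differentiate this to obtain velocity */
}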

There are a few methods available in the literature that allow approximation of the arctan function. For instance, arctan is commonly approximated by the following equation [23]: arctan(Q/I) ≈ IQ / (I² + 0.28125 Q²). The scale multiplication 0.28125 × X can be implemented as (X/4 + X/32), i.e., with two right shifts and an addition. In total, 3 multiplications, 2 additions, and a division are required to approximate the arctan reasonably over a ±45° range. More expensive alternatives are Taylor approximations, while computationally cheaper alternatives include interpolation on top of precomputed lookup tables.
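A C sketch of the cited approximation, using the shift-and-add form of the 0.28125 scale, is shown below. It follows the formula as written (for arctan(Q/I)); swapping the roles of I and Q gives the arctan(I/Q) used in Θ. The Q12 fixed-point output scale is an assumption for illustration.

#include <stdint.h>

/* Approximates arctan(q/i) over roughly +/-45 degrees as
 * i*q / (i*i + 0.28125*q*q), with the 0.28125 scale realized as
 * (x >> 2) + (x >> 5).  The return value is in radians scaled by 2^12. */
int32_t arctan_approx_q12(int16_t i_s, int16_t q_s) {
    int32_t ii    = (int32_t)i_s * i_s;
    int32_t qq    = (int32_t)q_s * q_s;
    int32_t denom = ii + (qq >> 2) + (qq >> 5);     /* i^2 + 0.28125*q^2 */
    if (denom == 0) return 0;                       /* degenerate sample */
    int32_t num   = (int32_t)i_s * q_s;
    return (int32_t)(((int64_t)num * 4096) / denom); /* Q12 radians      */
}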

In Tongue-n-Cheek the primary feature is the sign of the derivative of the phase-unwrapped arctan. This sign is calculated using two multiplications and one comparison. The algorithm is based on the following intuition: the increase or decrease of the phase-unwrapped arctan can be predicted by the increase or decrease of the ratio of I and Q.

A computationally inexpensive method for determining whether I/Q increases or decreases in a given timestep is presented here. The test I_{n+1}/Q_{n+1} > I_n/Q_n can be rewritten by multiplying each side of the inequality by the multiplicand (Q_{n+1} · Q_n). To account for the effect of the sign of this multiplicand on the inequality test, we write the following, ignoring the case where either Q_n or Q_{n+1} is zero:

[ I_{n+1}/Q_{n+1} > I_n/Q_n ]  ≡  { I_{n+1} Q_n > I_n Q_{n+1}   if sign(Q_n) = sign(Q_{n+1})
                                    I_{n+1} Q_n < I_n Q_{n+1}   if sign(Q_n) ≠ sign(Q_{n+1})    (4)

By assuming that

• the sampling rate is at least as fast as the Nyquist rate, so that one can perform phase unwrapping by ensuring that the phase does not change by more than π radians per sample,

• the amplitude is non-zero (I² + Q² > 0), so that the case of consecutive zero values in the denominator is avoided, and

• dΘ/dt ≠ 0,

the two zero cases are sufficiently covered by

[ I_{n+1}/Q_{n+1} > I_n/Q_n ]  ≡  { I_n < 0       if Q_n = 0
                                    I_{n+1} > 0   if Q_{n+1} = 0    (5)

The system presented ensures that the above three conditions are satisfied during classification. The above equation outputs an array of +1 and -1 values for the time window of the gesture determined by the segmentation module. To classify the entire movement as positive or negative (the surface moving towards or away from the radar), we employ a heuristic: a median filter applied to the array. The filter rejects sporadic high-frequency impulse noise [24]. From this we decide whether the movement is positive or negative depending on the positive or negative result of the median. As there might be other muscle movement or head movement while the tongue gesture is performed, we need to identify the sensor that captures the majority of the movement. Specifically, the left and right sensors are considered higher-priority sensors than the bottom sensor, as the suprahyoid muscle moves only slightly when a subject performs the left/right hold/release gestures. Between the left and right sensors, the movement with the larger normalized gesture energy is considered, assuming the intended gesture will generate a higher-energy signal in the corresponding sensor. Then, based on the combination of which radar detected the majority of the movement and the sign of the movement, the gesture can be uniquely determined. For instance, if the radar pointing at the left cheek detects the movement and the movement is positive, the gesture is left-hold.
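The sketch below illustrates the multiplication-only test of Equations (4) and (5) and the subsequent decision over a segment. A simple sign-majority vote stands in for the median filter (for a ±1 sequence the two agree up to tie handling); function and variable names are assumptions.

#include <stdint.h>

/* Evaluates the test of Equations (4)-(5): returns +1 if
 * I[n+1]/Q[n+1] > I[n]/Q[n], and -1 otherwise, using only two
 * multiplications and comparisons. */
static int8_t ratio_increased(int16_t i0, int16_t q0, int16_t i1, int16_t q1) {
    if (q0 == 0) return (i0 < 0) ? +1 : -1;          /* zero-denominator cases */
    if (q1 == 0) return (i1 > 0) ? +1 : -1;
    int32_t lhs = (int32_t)i1 * q0;                  /* I[n+1] * Q[n]          */
    int32_t rhs = (int32_t)i0 * q1;                  /* I[n]   * Q[n+1]        */
    int same_sign = ((q0 > 0) == (q1 > 0));
    int greater   = same_sign ? (lhs > rhs) : (lhs < rhs);
    return greater ? +1 : -1;
}

/* Classify a segment of len samples as positive (+1, phase increasing,
 * surface likely moving towards the radar) or negative (-1) by a
 * majority vote over the per-step signs. */
static int8_t movement_direction(const int16_t *I, const int16_t *Q, int len) {
    int16_t vote = 0;
    for (int n = 0; n + 1 < len; n++)
        vote += ratio_increased(I[n], Q[n], I[n + 1], Q[n + 1]);
    return (vote >= 0) ? +1 : -1;
}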

5.3 Parameter Estimation and Training
The algorithm uses two parameters: L, the gesture length, and δe, the energy threshold. In Tongue-n-Cheek these two parameters are determined during a training period. During the training period, a user performs each gesture k times. The system then performs peak detection on the energy calculated from the I and Q channels of the received data. δe is then calculated as the average of the minimum value of the negative peaks and the maximum value of the positive peaks. The optimal value of L is experimentally determined. In §6, the sensitivity of Tongue-n-Cheek with respect to L is evaluated.
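One literal reading of the threshold rule above is sketched in C below: run a simple peak/valley detection over the training energy trace and set δe to the average of the smallest valley ("negative peak") and the largest peak. The three-point peak detector and the data layout are assumptions; the deployed calibration may differ.

#include <stdint.h>

/* Assumes the trace has at least one local maximum and one local minimum. */
uint32_t train_delta_e(const uint32_t *energy, int n) {
    uint32_t min_valley = UINT32_MAX;
    uint32_t max_peak   = 0;
    for (int i = 1; i + 1 < n; i++) {
        int is_peak   = energy[i] > energy[i - 1] && energy[i] > energy[i + 1];
        int is_valley = energy[i] < energy[i - 1] && energy[i] < energy[i + 1];
        if (is_peak   && energy[i] > max_peak)   max_peak   = energy[i];
        if (is_valley && energy[i] < min_valley) min_valley = energy[i];
    }
    return min_valley / 2 + max_peak / 2;   /* energy threshold delta_e */
}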


Figure 7: From top to bottom: I(t)/Q(t), arctan(I(t)/Q(t)), and phase_unwrap(arctan(I(t)/Q(t))).

6. SYSTEM EVALUATION
Our evaluation of Tongue-n-Cheek addresses the following key questions.

• How accurate is Tongue-n-Cheek in recognizing the seven gestures? What are the key causes of misclassification of gestures?

• What are the trade-offs between system accuracy and the amount of training required?

• How sensitive is the accuracy of Tongue-n-Cheek to the choice of L, the input gesture length?

In addition to these key questions, we also present results on the accuracy of the wake-up gesture, the real-time performance of the signal processing algorithm, and micro-benchmarks on the micro-radar sensor.

6.1 Experimental Setup
Tongue-n-Cheek was evaluated on five test subjects. The subjects wore our wearable helmet with the sensors built into it (prototype illustrated in Figure 1). Since the helmet size did not immediately fit all subjects, it was adjusted for every subject. Each subject was then asked to perform each gesture a small number of times (3 in all cases). The system calculates the energy threshold δe for each micro-radar using this training set. The subject was then asked to perform a set of gestures following instructions from a simple user interface that printed out the instructions. Each subject performed each gesture twelve times (eighty-four gestures in total). Though the experiments are stimulus-based, subjects may still have had unintentional facial muscle movement and head movement during the experiment. These movements might cause false classifications. The subjects also performed the wakeup gesture multiple times. Results on system accuracy are presented next.

6.2 System Accuracy
Figure 8 presents a confusion matrix that illustrates the classification accuracy of Tongue-n-Cheek.


Figure 10: The figure explains the reason behind the misclassifications. (a) Front gestures are misclassified as neutral since there are some instances when the user did not move his tongue forward enough for the energy threshold to be crossed. The figure shows that the I and Q channel amplitudes do not change much when the gesture is performed. (b) Left-hold gestures are misclassified as right-release for a single user. This is because when the user performs a left-hold, the right cheek muscle is also stretched inwards, causing the confusion. The figure shows that both the right-cheek and left-cheek sensors capture the two motions.

Actual \ Predicted   Left-hold  Left-release  Right-hold  Right-release  Back  Front  Neutral
Left-hold                   93             0           0              5     2      0        0
Left-release                 0           100           0              0     0      0        0
Right-hold                   0             0          97              0     3      0        0
Right-release                0             0           0             95     0      3        2
Back                         0             2           0              2    92      2        3
Front                        0             0           0              2     0     93        5
Neutral                      0             0           0              0     2      3       95

Figure 8: The figure shows a confusion matrix illustrating the classification accuracy of Tongue-n-Cheek. The confusion matrix is calculated by combining all the gestures across all the users (a total of 420 gestures). The average accuracy of Tongue-n-Cheek is 95%. Two of the highest misclassifications occur when the left-hold gesture is misclassified as right-release, and when the front gesture is misclassified as neutral.

The confusion matrix is built using all the gestures performed across all the users, a total of 420 gestures. The average accuracy of Tongue-n-Cheek is 95%, which is higher than the accuracy of most contact-based tongue gesture recognition systems. For the result in Figure 8, the system was trained on a dataset where the users performed each gesture only three times. Hence, the training involved is minimal, making the system highly usable. Figure 8 shows that the left-release and right-hold gestures are recognized with an accuracy of 100% and 97%, respectively. Most of the misclassifications occur in two cases: (1) the left-hold gesture is recognized as right-release, and (2) the front gesture is misclassified as the neutral gesture.


Figure 9: The figure shows the accuracy of Tongue-n-Cheek per user. Tongue-n-Cheek can recognize gestures for Users 1 and 2 with 100% accuracy. User 3 has the lowest median accuracy of 83%. This is due to the misclassifications illustrated in Figure 10 (b).

We further evaluate the underlying cause of these misclassifications in Figure 10 (a) and (b). Figure 10 (a) shows that when the front gestures are performed by one of the users, he does not move his tongue forward enough to cross our energy threshold. In fact, the amplitudes of the I and Q channel waves barely change, causing the misclassification. We believe that with proper training this problem can be addressed. Similarly, Figure 10 (b) demonstrates the case where the left-hold gestures are misclassified as right-release. This happens for a specific user because when he performs the left-hold gesture, the right cheek muscles are pulled in as well, causing the sensor on the right cheek to infer that the right-release gesture is being performed.


Figure 11: The figure shows how the accuracy of the system varies as the number of training sets is increased. One training set comprises each gesture performed once. For this result, we calculate the average accuracy across all our subjects. The graph shows that the accuracy flattens after two training sets.

The number of misclassifications, however, is very small, only 5% of the total number of gestures on average, making Tongue-n-Cheek a very accurate tongue gesture recognition system. We further evaluate the accuracy of recognizing gestures per user (Figure 9). Tongue-n-Cheek can recognize the gestures from Users 1 and 2 with 100% accuracy. The median accuracy for User 3 is the lowest at 83%. This is because of the misclassifications discussed in Figure 10 (b).

6.3 System Accuracy vs. Training
An important trade-off in our system is how the accuracy of gesture recognition changes with an increase in the training set size. The training size influences the calculation of the energy threshold δe. Figure 11 graphs the change in accuracy as the number of training sets used is increased from 1 to 6. Each training set comprises each gesture performed once by a subject. We find that the median accuracy of recognizing gestures is more than 95% when only 2 training sets are used. This illustrates that the amount of training required by Tongue-n-Cheek is very low, making it highly usable.

6.4 Sensitivity to the Input Parameter L
We next evaluate how the accuracy of Tongue-n-Cheek changes with different values of L. Figure 12 graphs this trade-off. The figure shows that as long as the gesture length is lower than the time between gestures, our system accurately detects the gestures. When the gesture length L is more than the time between gestures (around 2 seconds in our experiments), our system captures more than one gesture in a single time window. Note that our signal processing algorithm uses the maximum peak to determine a single gesture in a period corresponding to one gesture length. When the system is deployed in practice, however, the frequency of performing gestures will be low, so even a rough guess of L will produce high accuracy as long as L is below the inter-gesture time, making it an easy parameter to set.


Figure 12: The figure shows how the accuracy of the system varies as L, the gesture length, is varied. L is an input to our signal processing algorithm. The figure shows that as long as L is below the inter-gesture time, Tongue-n-Cheek's accuracy is above 80%.


Figure 13: The figure shows the accuracy of recognizing the left-hold gesture as the distance between the radar and the cheek changes.

6.5 Micro-benchmarks
In this section, micro-benchmarks for the Tongue-n-Cheek system are quantified. Specifically, we quantify the accuracy of inferring the wake-up gesture, the latency associated with executing the signal processing algorithm on the micro-controller, and the range of the micro-radars. From our experiments we found that the accuracy of inferring the wake-up gesture is 98.2%, and the system has few false positives in detecting the wake-up gesture. We also profiled the running time and memory usage of our optimized signal processing algorithm. For our prototype, we used an AVR Butterfly operating at 8 MHz. Our algorithm utilizes less than 800 bytes of RAM and takes less than 2.1 µs to analyze the I and Q data from the three micro-radar sensors and convert them into one of the seven gestures, demonstrating that the system can run efficiently in real-time. We have also evaluated the trade-off between the range and accuracy of the micro-radar sensor. Figure 13 graphs the accuracy of recognizing the left-hold gesture as the distance between the cheek and the radar is varied. The figure shows that Tongue-n-Cheek can accurately detect the gesture even at a distance of 80 cm. Hence, the sensor array can be placed farther away and should still be able to infer the gestures accurately.

7. FUTURE WORK
There are several avenues of future work that we are pursuing as part of this project. We detail some of them below.

Explore other sensors
We are in the process of evaluating other micro-radars that operate at lower frequencies than 24 GHz. At lower frequencies, the skin penetration is deeper; however, the Doppler shift will also be smaller. We also envision using ultrasound sensor arrays for tongue gesture recognition in the future.

Larger sensor array
We are experimenting with a larger array of micro-radar sensors. With sensors strategically placed, the system can cancel stray and unexpected movements caused by wheelchair vibration or movements in the surroundings.

Larger gesture set
We are working on increasing the number of gestures in our gesture set. Moreover, we are working on designing a set of gestures customized to a user. Such personalized gestures are important, especially for paralysis patients, where the acceptable movements might vary based on the specific injury type.

More user-friendly prototype
In terms of system usability, we are planning to build our prototype as an add-on to the wheelchair to get rid of the helmet. We are also considering evaluating our system on paralysis patients, as they might exhibit additional muscle movements, such as twisting, while performing the designed tongue gestures.

8. CONCLUSIONS
In this paper, we present Tongue-n-Cheek, a non-contact, minimally intrusive tongue gesture recognition system. This is unlike state-of-the-art tongue gesture recognition systems that either use contact-based EMG sensors or invasive magnetic sensors. Tongue-n-Cheek uses an array of micro-radars to detect muscle movements caused by tongue gestures. The signals received by the micro-radar array are converted into gestures using our signal processing algorithm, which is optimized to run in real-time on a computationally weak micro-controller platform. We evaluate Tongue-n-Cheek on five subjects and show that it has an average accuracy of 95% in detecting gestures.

9. ACKNOWLEDGEMENTS
The authors would like to thank our shepherd, Dr. Polly Huang, and the anonymous reviewers for their insightful comments. This material is based upon work supported by the National Science Foundation under awards CNS-1305099, IIS-1406626, CNS-1308723, and CNS-1314024, and by the Microsoft SEIF Awards. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or Microsoft.

Appendix
Let the transmitted signal be T(t) = AT cos(2π ft t) and the received signal be R(t) = AR(t) cos(2π (ft + fd) t + φ), where φ is a phase related to the initial distances between the objects in the scene (including the target surface) and the sensor, and AT and AR(t) are the amplitudes of the transmitted and received signals. The I-channel wave equation is then the following:

I′(t) = R(t) · T(t)    (6)
      = AT AR(t) cos(2π ft t) cos(2π (ft + fd) t + φ)    (7)
      = (AT AR(t) / 2) [cos(2π fd t + φ) + cos(2π (2ft + fd) t + φ)]    (8)

The higher-frequency component at 2ft + fd Hz can be eliminated using an appropriate low-pass filter to recover I(t):

I(t) = (AT AR(t) / 2) cos(2π fd t + φ)    (9)

Similarly, the received wave demodulated in the Q channel, obtained by mixing R(t) with a 90°-shifted copy of the carrier, is the following:

Q′(t) = R(t) · AT sin(2π ft t)    (10)
      = (AT AR(t) / 2) [−sin(2π fd t + φ) + sin(2π (2ft + fd) t + φ)]    (11)

After applying a low-pass filter, Q(t) is recovered:

Q(t) = −(AT AR(t) / 2) sin(2π fd t + φ)    (12)
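The derivation above can be checked numerically at baseband: generating I(t) and Q(t) from Equations (9) and (12) for positive and negative fd and accumulating the unwrapped change in arctan(I/Q) reproduces the lead/lag behaviour seen in Figure 3. The amplitude, phase offset, and sample rate in the C sketch below are illustrative choices, not values from the paper.

#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

int main(void) {
    const double A = 1.0, phi = 0.3, fs = 200.0;      /* amplitude, offset, Hz */
    for (int sign = +1; sign >= -1; sign -= 2) {
        double fd = sign * 5.0;                       /* Doppler shift, Hz     */
        double prev = 0.0, slope = 0.0;
        for (int n = 0; n < 40; n++) {
            double t = n / fs;
            double I = 0.5 * A * cos(2.0 * M_PI * fd * t + phi);   /* Eq. (9)  */
            double Q = -0.5 * A * sin(2.0 * M_PI * fd * t + phi);  /* Eq. (12) */
            double theta = atan2(I, Q);      /* full-quadrant arctan(I/Q)      */
            if (n > 0) {
                double d = theta - prev;
                if (d > M_PI) d -= 2.0 * M_PI;        /* per-step phase unwrap */
                else if (d < -M_PI) d += 2.0 * M_PI;
                slope += d;                           /* accumulated d(Theta)  */
            }
            prev = theta;
        }
        printf("fd = %+4.1f Hz  ->  unwrapped phase %s\n", fd,
               slope > 0.0 ? "increasing" : "decreasing");
    }
    return 0;
}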

10. REFERENCES
[1] International Perspectives on Spinal Cord Injury. WHO Press, World Health Organization, 20 Avenue Appia, 1211 Geneva 27, Switzerland, 2013.
[2] http://www.spinalinjury101.org/details/levels-of-injury.
[3] H. Kenneth Walker and W. Dallas Hall. Clinical Methods: The History, Physical, and Laboratory Examinations, 3rd edition. Butterworth-Heinemann, Boston, 1990.
[4] Xueliang Huo, Jia Wang, and Maysam Ghovanloo. A magneto-inductive sensor based wireless tongue-computer interface. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(5):497–504, August 2008.
[5] M. Sasaki, K. Onishi, T. Arakawa, A. Nakayama, D. Stefanov, and M. Yamaguchi. Real-time estimation of tongue movement based on suprahyoid muscle activity. In Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE, pages 4605–4608. IEEE, July 2013.
[6] Qiao Zhang, Shyamnath Gollakota, Ben Taskar, and Rajesh P. N. Rao. Non-intrusive tongue machine interface. In CHI '14: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2555–2558. ACM, April 2014.
[7] Li Liu, Shuo Niu, Jingjing Ren, and Jingyuan Zhang. Tongible: a non-contact tongue-based interaction technique. In ASSETS '12: Proceedings of the 14th International ACM SIGACCESS Conference on Computers and Accessibility, pages 233–234. ACM, October 2012.
[8] Xueliang Huo, Jia Wang, and Maysam Ghovanloo. Introduction and preliminary evaluation of the Tongue Drive System: wireless tongue-operated assistive technology for people with little or no upper-limb function. Journal of Rehabilitation Research and Development, 45(6):921–930, 2007.
[9] T. Scott Saponas, Daniel Kelly, Babak A. Parviz, and Desney S. Tan. Optically sensing tongue gestures for computer input. In Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology, pages 177–180. ACM, 2009.
[10] Jonathan R. Wolpaw and Dennis J. McFarland. Control of a two-dimensional movement signal by a noninvasive brain-computer interface in humans. Proceedings of the National Academy of Sciences of the United States of America, 101(51):17849–17854, December 2004.
[11] Kevin R. Wheeler and Charles C. Jorgensen. Gestures as input: Neuroelectric joysticks and keyboards. IEEE Pervasive Computing, 2(2):56–61, 2003.
[12] A. Nelson, J. Schmandt, P. Shyamkumar, W. Wilkins, D. Lachut, N. Banerjee, S. Rollins, J. Parkerson, and V. Varadan. Wearable multi-sensor gesture recognition for paralysis patients. In Sensors, 2013 IEEE, pages 1–4. IEEE, 2013.
[13] Alexander Nelson, Jackson Schmandt, William Wilkins, James P. Parkerson, and Nilanjan Banerjee. System support for micro-harvester powered mobile sensing. In Real-Time Systems Symposium (RTSS), 2013 IEEE 34th, pages 258–267. IEEE, 2013.
[14] David Beymer and Myron Flickner. Eye gaze tracking using an active stereo head. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II–451. IEEE, 2003.
[15] http://www.rfbeam.ch/products/k-lc2-transceiver/.
[16] https://www.openimpulse.com/blog/products-page/product-category/hb100-microwave-sensor-module/.
[17] Changzhi Li, V. M. Lubecke, O. Boric-Lubecke, and Jenshan Lin. A review on recent advances in Doppler radar sensors for noncontact healthcare monitoring. IEEE Transactions on Microwave Theory and Techniques, 61(5):2046–2060, May 2013.
[18] Liang Liu, Mihail Popescu, Marjorie Skubic, Marilyn Rantz, Tarik Yardibi, and Paul Cuddihy. Automatic fall detection based on Doppler radar motion signature. In Pervasive Computing Technologies for Healthcare (PervasiveHealth), 2011 5th International Conference on, pages 222–225. IEEE, 2011.
[19] Amy Droitcour, Victor Lubecke, Jenshan Lin, and Olga Boric-Lubecke. A microwave radio for Doppler radar sensing of vital signs. In Microwave Symposium Digest, 2001 IEEE MTT-S International, volume 1, pages 175–178. IEEE, 2001.
[20] O. Boric-Lubecke, P.-W. Ong, and V. M. Lubecke. 10 GHz Doppler radar sensing of respiration and heart movement. In Bioengineering Conference, 2002. Proceedings of the IEEE 28th Annual Northeast, pages 55–56. IEEE, 2002.
[21] Amy D. Droitcour, Todd B. Seto, Byung-Kwon Park, Shuhei Yamada, Alex Vergara, Charles El Hourani, Tommy Shing, Andrea Yuen, Victor M. Lubecke, and O. Boric-Lubecke. Non-contact respiratory rate measurement validation for hospitalized patients. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEE, pages 4812–4815. IEEE, 2009.
[22] Changzhan Gu, Ruijiang Li, Steve B. Jiang, and Changzhi Li. A multi-radar wireless system for respiratory gating and accurate tumor tracking in lung cancer radiotherapy. In Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE, pages 417–420. IEEE, 2011.
[23] Richard G. Lyons. Understanding Digital Signal Processing. Pearson Education, 2010.
[24] D. R. K. Brownrigg. The weighted median filter. Communications of the ACM, 27(8):807–818, 1984.