Inaudible Voice Commands:
The Long-Range Attack and Defense
University of Illinois at Urbana-Champaign
Nirupam Roy Haitham Hassanieh Romit Roy ChoudhurySheng Shen
50 million voice assistants are sold in US
Inaudible Acoustics
Normal Sound(< 24 kHz)
Ultrasound
(> 25 kHz)
“Inaudible Acoustics”
(> 25 kHz)
“Alexa, open the garage door!”Ok
Inaudible Acoustics
Talk Outline
0. [BackDoor], [DolphinAttack], [Princeton Video]
MobiSys’17(Best Paper)
CCS’17 arXiv
Talk Outline
Today’s Talk:
0. [BackDoor], [DolphinAttack], [Princeton Video]
2. How to defend against these attacks?
1. How to launch long-range (realistic) attacks?
Talk Outline
Today’s Talk:
0. [BackDoor], [DolphinAttack], [Princeton Video]
1. How to launch long-range (realistic) attacks?
2. How to defend against these attacks?
10k 20k 30k 40k 50k 60k 70k 80k 90k 100k
Microphone frequency spectrum
Diaphragm Amplifier Filter ADC
Am
plit
ud
e
Microphonefilter
10k 20k 30k 40k 50k 60k 70k 80k 90k 100k
Microphone frequency spectrum
Diaphragm Amplifier Filter ADC
Am
plit
ud
e
Amplifier Filter ADC
Microphonefilter
Am
plit
ud
e
10k 20k 30k 40k 50k 60k 70k 80k 90k 100k
Microphone frequency spectrum
Diaphragm
Air Vibration Electric Voltage
Input
Output
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
AmplifierDiaphragm
Vout = a1Vin
Vout = a1Vin
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Vout = a1Vin+ a2Vin2 +...
AmplifierDiaphragm
Input
Output
Nonlinear
Input
Output
Vout = a1Vin
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Vout = a1Vin+ a2Vin2 +...
AmplifierDiaphragm
Input
Output
Nonlinear
Input
Output
a2Vin2
10k
Frequency20k 30k 40k 50k 60k 70k 80k 90k 100k
Am
plit
ud
e
InaudibleAudible
Microphonefilter
F1F2
10k
Frequency20k 30k 40k 50k 60k 70k 80k 90k 100k
Am
plit
ud
e
InaudibleAudible
Microphonefilter
Vout = a1Vin+ a2Vin2
( sin F1 + sin F2 )2
F1F2
10k
Frequency20k 30k 40k 50k 60k 70k 80k 90k 100k
= cos 2F1
cos 2F2
cos (F1+F2)
cos (F1- F2)
-
-
-
+
Am
plit
ud
e
InaudibleAudible
Microphonefilter
Vout = a1Vin+ a2Vin2
F1F2
( sin F1 + sin F2 )2 = cos 2F1
cos 2F2
cos (F1+F2)
cos (F1- F2)
2F2 (F1+F2)2F1
10k
Frequency20k 30k 40k 50k 60k 70k 80k 90k 100k
-
-
-
+
Am
plit
ud
e
InaudibleAudible
Microphonefilter
2F2 (F1+F2)2F1
Vout = a1Vin+ a2Vin2
F1F2
( sin F1 + sin F2 )2 = cos 2F1
cos 2F2
cos (F1+F2)
cos (F1- F2)
10k
Frequency20k 30k 40k 50k 60k 70k 80k 90k 100k
-
-
-
+
Am
plit
ud
e
InaudibleAudible
Microphonefilter
2F2 (F1+F2)2F1
Vout = a1Vin+ a2Vin2
F1F2
( sin F1 + sin F2 )2 = cos 2F1
cos 2F2
cos (F1+F2)
cos (F1- F2)
(F1-F2)
10k
Frequency20k 30k 40k 50k 60k 70k 80k 90k 100k
-
-
-
+
Am
plit
ud
e
InaudibleAudible
Microphonefilter
F1F2
10k
Frequency20k 30k 40k 50k 60k 70k 80k 100k90k
(F1-F2)
Am
plit
ud
e
InaudibleAudible
Microphonefilter
F2
10k
Frequency20k 30k 40k 50k 60k 70k 80k 90k 100k
Am
plit
ud
e
InaudibleAudible
Microphonefilter
V(t) = “Alexa, open the garage door!”
F2
10k
Frequency20k 30k 40k 50k 60k 70k 80k 90k 100k
V(t) = “Alexa, open the garage door!”
Ok
Am
plit
ud
e
InaudibleAudible
Microphonefilter
3-5 ft
Can someone attack from a longer range?
Can someone attack from a longer range?
Can someone attack from a longer range?
High power makes ultrasonic speakers audible
Speakers have nonlinearity too!
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k
Voice Command:
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k
Speaker Nonlinearity
=
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k
Speaker Nonlinearity
Am
plit
ud
e
10k 20k 30k 40k 50k 60k 70k 80k 90k
=
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k
Speaker Nonlinearity
Am
plit
ud
e
10k 20k 30k 40k 50k 60k 70k 80k 90k
+
=
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k
Speaker Nonlinearity
Am
plit
ud
e
10k 20k 30k 40k 50k 60k 70k 80k 90k
Audible Inaudible
=
General to all speakers!
Our Solution: “Leakage Optimization”
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker Nonlinearity Audible Leakage
Speaker input
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker output
Bandwidth: B
Bandwidth: B
𝑉 −𝑓 ∗ 𝑉(𝑓)
Am
plit
ud
eA
mp
litu
de
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker input
Speaker output
Speaker Nonlinearity Audible Leakage
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker input
Speaker output
Speaker Nonlinearity Audible Leakage
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker input
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker output
Speaker Nonlinearity Audible Leakage
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker input
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker output
Speaker Nonlinearity Audible Leakage
Chopping compresses the leakage band
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker input
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker output
Leakage from Speaker
Sound P
ressure
Level in
dB
Frequency in Hz
15.6 31.2 62.5 125 250 500 1000 2000 4000 8000 16000
80
60
40
20
0
-20
Threshold of Hearing Curve
𝐿 𝑓
𝑇 𝑓
Human Hearing Threshold
min𝑓[𝑇 𝑓 − 𝐿(𝑓)]𝑀𝑎𝑥𝑖𝑚𝑖𝑧𝑒
𝑠𝑢𝑏𝑗𝑒𝑐𝑡 𝑡𝑜 𝑓0 ≤ 𝑓1 ≤ 𝑓2 ≤ … ≤ 𝑓𝑁
Minimum Gap
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker inputA
mp
litu
de
Speaker output
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker inputA
mp
litu
de
Speaker output
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Inaudible
F2
F2
F2
Am
plit
ud
e
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Speaker inputA
mp
litu
de
Speaker output
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Microphonefilter
Microphone output
10k
Frequency
20k 30k 40k 50k 60k 70k 80k 90k 100k
Am
plit
ud
e
Inaudible
Evaluation
Inaudible voice command: Long rangeInaudible voice commands: Long range
25 feet
Speaker array running leakage optimization
Evaluation
Wake-word hit rate
Evaluation
Wake-word hit rate Command detection accuracy
Evaluation
Maximum activation distance for different input power
Talk Outline
Today’s Talk:
0. [BackDoor], [DolphinAttack], [Princeton Video]
1. How to launch long-range (realistic) attacks?
2. How to defend against these attacks?
Talk Outline
Today’s Talk:
0. [BackDoor], [DolphinAttack], [Princeton Video]
2. How to defend against these attacks?
1. How to launch long-range (realistic) attacks?
Am
plit
ud
e
10kFrequency
20k 30k 40k 50k 60k 70k
Core Question:
Is this a “non-linear signal” or normally recorded signal
?Voice signal
Inaudible Voice Attack
𝑣(𝑡) ∙ sin 𝜔1𝑡 + 𝑐 ∙ sin 𝜔1𝑡2
= 𝑣 𝑡 + 𝑐′𝑣2 𝑡 + ⋯
𝑣 𝑡
Am
plit
ud
e
10kFrequency
20k 30k 40k 50k 60k 70k
Core Question:
Is this a “non-linear signal” or normally recorded signal
Nonlineardistortion
Voice signal
𝑣(𝑡) ∙ sin 𝜔1𝑡 + 𝑐 ∙ sin 𝜔1𝑡2 = 𝑣 𝑡 + 𝑐′𝑣2 𝑡 + ⋯
Difficult to decouple “voice signal” and “non-linear signal”
Am
plit
ud
e
10kFrequency
20k 30k 40k 50k 60k 70k
Human voice signals present opportunities …
Signal Forensics
Am
plit
ud
eFrequency
f
2f3f
4f
Opportunity #1: Voice > 50 Hz
Human Voice
Am
plit
ud
eFrequency
f
2f3f
4f
Energy atsub-50Hz band
Opportunity #1: Voice > 50 Hz
Human Voice
Am
pl.
Time
Energy variation in v(t)
Am
pl.
Time
Energy variation in v2(t)
Correlation
Opportunity #2: Correlation
Am
plit
ud
e
10kFrequency
20k 30k 40k 50k 60k 70k
Opportunity #3: Amplitude Skewness
=Amplitude skew
+
1000-10010
-1
8
0
2
4
6
Attack voiceLegitimate voice
Sub-50Hz power Correlation
Am
plit
ude
ske
w
Real voice
Attack voice
5000 Test Cases
Overall Detection Accuracy
To summarize…
Inaudible Acoustics (> 25 kHz): “Alexa, open the garage door!”Ok
Ok
Ok