Inaudible Voice Commands: The Long-Range Attack and Defense · 2019. 5. 13. · Inaudible Voice...

Inaudible Voice Commands: The Long-Range Attack and Defense

University of Illinois at Urbana-Champaign

Nirupam Roy, Haitham Hassanieh, Romit Roy Choudhury, Sheng Shen

50 million voice assistants are sold in the US

Inaudible Acoustics

Normal sound (< 24 kHz)

Ultrasound (> 25 kHz): “Inaudible Acoustics”

“Alexa, open the garage door!” “Ok”

Talk Outline

0. [BackDoor] (MobiSys’17, Best Paper), [DolphinAttack] (CCS’17), [Princeton Video] (arXiv)

Today’s Talk:

1. How to launch long-range (realistic) attacks?

2. How to defend against these attacks?

[Figure: Microphone frequency spectrum (amplitude vs. frequency, 10 kHz to 100 kHz), with the microphone filter marked.]

Microphone signal chain: Diaphragm → Amplifier → Filter → ADC

Diaphragm: air vibration (input) → electric voltage (output)

Diaphragm and amplifier: input vs. output

Linear: $V_{out} = a_1 V_{in}$

Nonlinear: $V_{out} = a_1 V_{in} + a_2 V_{in}^2 + \dots$

[Figure: Input and output spectra (frequency, 10 kHz to 100 kHz) for the linear and nonlinear cases; the $a_2 V_{in}^2$ term is highlighted.]

[Figure: Amplitude vs. frequency (10 kHz to 100 kHz), split into audible and inaudible regions, with the microphone filter marked. Two tones F1 and F2 lie in the inaudible region.]

$V_{out} = a_1 V_{in} + a_2 V_{in}^2$

$(\sin F_1 + \sin F_2)^2 = 1 - \tfrac{1}{2}\cos 2F_1 - \tfrac{1}{2}\cos 2F_2 - \cos(F_1 + F_2) + \cos(F_1 - F_2)$

[Figure: The new components 2F1, 2F2, and (F1+F2) appear in the inaudible region; (F1−F2) appears in the audible region. After the microphone filter, (F1−F2) is the component left in the audible band.]
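The two-tone argument above can be checked numerically. The following is a minimal sketch, not the authors' code: the tone frequencies (40 kHz and 38 kHz), the coefficients `a1` and `a2`, and the 24 kHz audible limit are all assumptions chosen for illustration.

```python
import numpy as np

fs = 192_000                      # sample rate high enough to represent the ultrasound tones
t = np.arange(fs) / fs            # one second of samples, so FFT bins are 1 Hz apart
f1, f2 = 40_000, 38_000           # hypothetical inaudible tones; their difference is 2 kHz

v_in = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)
a1, a2 = 1.0, 0.1                 # assumed linear and quadratic coefficients
v_out = a1 * v_in + a2 * v_in**2  # second-order nonlinearity

spectrum = np.abs(np.fft.rfft(v_out)) / len(t)

def peak_frequencies(mag, threshold=1e-3):
    """Return the frequencies (Hz) of non-DC bins whose magnitude exceeds the threshold."""
    bins = np.nonzero(mag > threshold)[0]
    return [int(b) for b in bins if b > 0]

peaks = peak_frequencies(spectrum)
audible = [f for f in peaks if f < 24_000]   # what a filter with a 24 kHz cutoff would keep
print("all components:", peaks)
print("audible components:", audible)
```

With these assumed values, the only component below 24 kHz is the difference tone F1 − F2; the originals, the doubles, and the sum all stay in the ultrasound range.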

[Figure: Amplitude vs. frequency (10 kHz to 100 kHz), audible and inaudible regions, with the microphone filter marked. The voice command V(t) = “Alexa, open the garage door!” is sent together with the tone F2 in the inaudible region; the device answers “Ok”.]

Attack range: 3-5 ft

Can someone attack from a longer range?

High power makes ultrasonic speakers audible

Speakers have nonlinearity too!

[Figure: Speaker nonlinearity applied to the voice command (amplitude vs. frequency, 10 kHz to 90 kHz). The speaker output is the sum of an inaudible component and an audible component.]

General to all speakers!

Our Solution: “Leakage Optimization”

[Figure: Speaker nonlinearity and audible leakage. Panels: speaker input and speaker output (amplitude vs. frequency, 10 kHz to 100 kHz). Labels: “Bandwidth: B” (twice) and the leakage spectrum $V(-f) * V(f)$.]

Chopping compresses the leakage band
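A small numerical sketch of why chopping helps, under stated assumptions: the speaker's nonlinearity is modeled as a pure squaring term, the ultrasound input is a hypothetical comb of tones spanning 40 to 44 kHz, and the band is split into four slices, each squared separately as if played by its own speaker. None of these values come from the talk.

```python
import numpy as np

fs = 192_000
t = np.arange(fs) / fs                        # one second, so FFT bins are 1 Hz apart
rng = np.random.default_rng(0)

# Hypothetical ultrasound input: tones every 100 Hz across a 4 kHz band (40-44 kHz)
tone_freqs = np.arange(40_000, 44_000, 100)
phases = rng.uniform(0, 2 * np.pi, tone_freqs.size)

def synth(freqs, phs):
    """Sum of unit-amplitude tones at the given frequencies and phases."""
    return sum(np.sin(2 * np.pi * f * t + p) for f, p in zip(freqs, phs))

def leakage_bandwidth(x, threshold=1e-6):
    """Highest frequency below 20 kHz that carries energy in the squared signal."""
    mag = np.abs(np.fft.rfft(x**2)) / len(x)
    low = mag[1:20_000]                       # skip DC, keep the audible range
    hits = np.nonzero(low > threshold)[0]
    return int(hits[-1]) + 1 if hits.size else 0

# One speaker plays the whole band
full = leakage_bandwidth(synth(tone_freqs, phases))

# Four speakers each play one quarter of the band and square only their own slice
slices = np.array_split(np.arange(tone_freqs.size), 4)
chopped = max(leakage_bandwidth(synth(tone_freqs[i], phases[i])) for i in slices)

print("leakage extends to", full, "Hz without chopping")
print("leakage extends to", chopped, "Hz with 4 slices")
```

In this toy model the leakage of each slice only reaches as high as that slice is wide, so narrower slices push the leakage toward lower frequencies.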

[Figure: Speaker input and speaker output (amplitude vs. frequency, 10 kHz to 100 kHz).]

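Leakage from Speaker vs. Human Hearing Threshold

[Figure: Threshold of hearing curve. Sound pressure level in dB (−20 to 80) vs. frequency in Hz (15.6 to 16000), showing the leakage $L(f)$ and the hearing threshold $T(f)$.]

$$\text{maximize}\ \min_f\,[T(f) - L(f)] \quad \text{subject to}\ f_0 \le f_1 \le f_2 \le \dots \le f_N$$

Here $\min_f\,[T(f) - L(f)]$ is the minimum gap between the hearing threshold and the leakage.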
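The max-min objective can be illustrated with a small sketch. Everything concrete here is an assumption: the hearing threshold uses Terhardt's well-known closed-form approximation rather than the curve on the slide, the leakage is a toy flat-level model, and the search is over a handful of hypothetical leakage bandwidths instead of the slice boundaries $f_0 \dots f_N$.

```python
import numpy as np

def hearing_threshold_db(f_hz):
    """Terhardt's approximation of the threshold of hearing in quiet (dB SPL)."""
    k = np.asarray(f_hz, dtype=float) / 1000.0
    return 3.64 * k**-0.8 - 6.5 * np.exp(-0.6 * (k - 3.3) ** 2) + 1e-3 * k**4

def min_gap_db(freqs, leakage_db):
    """min_f [T(f) - L(f)]: the smallest margin between threshold and leakage."""
    return float(np.min(hearing_threshold_db(freqs) - leakage_db))

freqs = np.linspace(20, 16_000, 2000)

def flat_leakage(bandwidth_hz, level_db=30.0):
    """Toy leakage model (an assumption): constant level up to the bandwidth, nothing above."""
    return np.where(freqs <= bandwidth_hz, level_db, -np.inf)

candidates = [50, 200, 1000, 4000]            # hypothetical leakage bandwidths in Hz
gaps = {b: min_gap_db(freqs, flat_leakage(b)) for b in candidates}
best = max(gaps, key=gaps.get)                # candidate with the largest minimum gap

for b in candidates:
    print(f"bandwidth {b:>5} Hz -> minimum gap {gaps[b]:6.1f} dB")
print("best candidate:", best, "Hz")
```

A positive minimum gap means the leakage stays below the hearing threshold at every frequency. In this toy model, leakage confined to very low frequencies wins because the threshold is high there.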

[Figure: Speaker input, speaker output, and microphone output (amplitude vs. frequency, 10 kHz to 100 kHz). The speaker outputs around F2 are labeled “Inaudible”; the microphone filter is marked on the microphone output.]

Evaluation

Inaudible voice commands: Long range

25 feet

Speaker array running leakage optimization

Wake-word hit rate

Command detection accuracy

Maximum activation distance for different input power

Talk Outline

Today’s Talk:

0. [BackDoor], [DolphinAttack], [Princeton Video]

1. How to launch long-range (realistic) attacks?

2. How to defend against these attacks?

Core Question: Is this a “non-linear signal” or a normally recorded signal?

[Figure: Spectrum of the recorded voice signal (amplitude vs. frequency, 10 kHz to 70 kHz), with the nonlinear distortion marked.]

Inaudible Voice Attack:

$$\left(v(t)\,\sin\omega_1 t + c\,\sin\omega_1 t\right)^2 = v(t) + c'\,v^2(t) + \cdots$$

(keeping the low-frequency terms, up to scaling), where $v(t)$ is the voice signal and $c'\,v^2(t)$ is the nonlinear distortion.

Difficult to decouple “voice signal” and “non-linear signal”
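The relation between the transmitted signal and what the microphone records can be verified with a short simulation. This is a sketch under assumptions: the voice is replaced by two low-frequency tones, the carrier is 40 kHz, `c = 1`, the microphone contributes a pure squaring term, and the microphone filter is an ideal brick-wall low-pass at 8 kHz.

```python
import numpy as np

fs = 192_000
t = np.arange(fs) / fs                              # one second, 1 Hz FFT bins

# Stand-in for the voice signal v(t) (hypothetical tones, not real speech)
v = 0.5 * np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 700 * t)
c = 1.0                                             # assumed carrier offset
w1 = 2 * np.pi * 40_000                             # assumed ultrasonic carrier

tx = v * np.sin(w1 * t) + c * np.sin(w1 * t)        # transmitted ultrasound
squared = tx**2                                     # second-order term of the microphone

def lowpass(x, cutoff_hz):
    """Ideal low-pass: zero every FFT bin above the cutoff (bins are 1 Hz apart here)."""
    X = np.fft.rfft(x)
    X[cutoff_hz + 1:] = 0
    return np.fft.irfft(X, n=len(x))

recorded = lowpass(squared, 8_000)

# Expected low-frequency part: c^2/2 + c*v(t) + v(t)^2/2
expected = c**2 / 2 + c * v + v**2 / 2
print("max deviation:", np.max(np.abs(recorded - expected)))
```

Dividing by `c` and dropping the constant gives $v(t) + c'\,v^2(t)$ with $c' = 1/(2c)$ in this model, so the recording carries the voice plus a squared copy of it.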

Human voice signals present opportunities …

Signal Forensics

Opportunity #1: Voice > 50 Hz

[Figure: Spectrum of human voice (amplitude vs. frequency) with harmonics at f, 2f, 3f, and 4f. A second panel marks “Energy at sub-50 Hz band”.]
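A sketch of the sub-50 Hz check, assuming a synthetic voice-like signal: harmonics of a 150 Hz fundamental with a slow 4 Hz loudness envelope. The attack recording is modeled as `v + 0.5 * v**2`; the 0.5 weight and all signal parameters are illustrative assumptions, not values from the talk.

```python
import numpy as np

fs = 8_000
t = np.arange(2 * fs) / fs                          # two seconds of samples

# Voice-like stand-in: harmonics f, 2f, 3f, 4f of 150 Hz with a slow 4 Hz envelope
envelope = 1 + 0.5 * np.sin(2 * np.pi * 4 * t)
v = envelope * sum(np.sin(2 * np.pi * k * 150 * t) / k for k in (1, 2, 3, 4))

def sub50_power_fraction(x):
    """Fraction of the (mean-removed) signal power that lies below 50 Hz."""
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    return float(power[freqs < 50].sum() / power.sum())

legit = sub50_power_fraction(v)                     # normally recorded voice
attack = sub50_power_fraction(v + 0.5 * v**2)       # voice plus squared distortion

print(f"legitimate: {legit:.2e}, attack: {attack:.2e}")
```

Squaring the signal turns the slow loudness envelope into actual low-frequency content, which is why the attack recording shows energy below 50 Hz while the clean one does not.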

Opportunity #2: Correlation

[Figure: Two amplitude vs. time traces, “Energy variation in $v(t)$” and “Energy variation in $v^2(t)$”, compared by correlation.]
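One plausible way to turn this into a test is sketched below; it is an interpretation, not the detector from the talk. It correlates the sub-50 Hz part of the recording with the sub-50 Hz part of the recording's squared value, which tracks the energy variation. The synthetic voice, the noise level, the 0.5 distortion weight, and the helper names `band_below` and `trace_correlation` are all assumptions.

```python
import numpy as np

fs = 8_000
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(0)

# Voice-like stand-in: harmonics of 150 Hz with a slow 4 Hz loudness envelope
envelope = 1 + 0.5 * np.sin(2 * np.pi * 4 * t)
v = envelope * sum(np.sin(2 * np.pi * k * 150 * t) / k for k in (1, 2, 3, 4))
noise = 0.01 * rng.standard_normal(t.size)          # small background noise

def band_below(x, cutoff_hz=50):
    """Keep only the content below the cutoff (ideal low-pass, mean removed)."""
    X = np.fft.rfft(x - x.mean())
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    X[freqs >= cutoff_hz] = 0
    return np.fft.irfft(X, n=len(x))

def trace_correlation(recording):
    """Pearson correlation between the sub-50 Hz trace and the low-passed energy trace."""
    low = band_below(recording)
    energy = band_below(recording**2)
    return float(np.corrcoef(low, energy)[0, 1])

legit_corr = trace_correlation(v + noise)
attack_corr = trace_correlation(v + 0.5 * v**2 + noise)

print(f"legitimate: {legit_corr:+.2f}, attack: {attack_corr:+.2f}")
```

In the attack case the low band is itself a copy of the energy variation, so the two traces line up; in the legitimate case the low band holds only noise.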

Opportunity #3: Amplitude Skewness

[Figure: Spectrum (amplitude vs. frequency, 10 kHz to 70 kHz) of a signal whose added component gives an amplitude skew.]
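A sketch of the skewness idea under assumptions: the voice stand-in is a symmetric sum of harmonics, the attack recording is modeled as `v + 0.3 * v**2`, and skewness is the standard third standardized moment. Because the squared term is never negative, it pushes the amplitude distribution to one side.

```python
import numpy as np

fs = 8_000
t = np.arange(fs) / fs

# Symmetric voice-like stand-in: harmonics of 200 Hz (hypothetical values)
v = sum(np.sin(2 * np.pi * k * 200 * t) / k for k in (1, 2, 3))

def skewness(x):
    """Third standardized moment of the sample amplitudes."""
    d = x - x.mean()
    return float(np.mean(d**3) / np.mean(d**2) ** 1.5)

legit_skew = skewness(v)                    # normally recorded signal
attack_skew = skewness(v + 0.3 * v**2)      # signal with squared distortion added

print(f"legitimate: {legit_skew:+.3f}, attack: {attack_skew:+.3f}")
```

Real speech is not perfectly symmetric, so in practice the comparison would be against the typical skew of legitimate recordings rather than against zero.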

[Figure: Scatter plot of attack voice vs. legitimate voice over three features: sub-50 Hz power, correlation, and amplitude skew.]

Real voice vs. attack voice: 5000 test cases

Overall Detection Accuracy

To summarize…

Inaudible Acoustics (> 25 kHz): “Alexa, open the garage door!” “Ok”
