26
Inferring Speech Activity from Encrypted Skype Traffic YuChun Chang , KuanTa Chen, ChenChi Wu, and ChinLaung Lei Oct. 27, 2008 2008/10/27 1

Inferring Speech Activity from Encrypted Skype Traffic

Embed Size (px)

DESCRIPTION

Normally, voice activity detection (VAD) refers to speech processing algorithms for detecting the presence or absence of human speech in segments of audio signals. In this paper, however, we focus on speech detection algorithms that take VoIP traffic instead of audio signals as input. We call this category of algorithms network-level VAD. Traditional VAD usually plays a fundamental role in speech processing systems because of its ability to delimit speech segments. Network-level VAD, on the other hand, can be quite helpful in network management, which is the motivation for our study. We propose the first real-time network-level VAD algorithm that can extract voice activity from encrypted and non-silence-suppressed Skype traffic. We evaluate the speech detection accuracy of the proposed algorithm with extensive reallife traces. The results show that our scheme achieve reasonably good performance even high degree of randomness has been injected into the network traffic.

Citation preview

Page 1: Inferring Speech Activity from Encrypted Skype Traffic

Inferring Speech Activity from Encrypted Skype Traffic

Yu‐Chun Chang, Kuan‐Ta Chen, Chen‐Chi Wu, and 

Chin‐Laung Lei

Oct. 27, 2008

2008/10/27 1

Page 2: Inferring Speech Activity from Encrypted Skype Traffic

Outline

• Introduction

• Data description

• Proposed scheme

• Performance evaluation

• Conclusion

2008/10/27 2

Page 3: Inferring Speech Activity from Encrypted Skype Traffic

Introduction

• VAD (Voice Activity Detection)– The algorithm to extract the presence or absence of human speech in speech processing.

• Source‐level VAD– Audio signal 

– Silence suppression

• Network‐level VAD– Network traffic

– Flow identification, QoS measurement

2008/10/27 3

Page 4: Inferring Speech Activity from Encrypted Skype Traffic

• The differences between source‐level and network‐level VAD

2008/10/27 4

source‐level network‐level

input audio signal network traffic

location speaker’s host network node

purpose silence suppressionecho cancellation

traffic managementQoS measurement

Page 5: Inferring Speech Activity from Encrypted Skype Traffic

Introduction (contd.)

• Challenges– Payload encryption

– Skype do not support silence suppression

• Contribution– We propose a network‐level VAD that can infers speech activity from encrypted and non‐silence‐suppressed VoIP traffic.

2008/10/27 5

Page 6: Inferring Speech Activity from Encrypted Skype Traffic

Data Description

• Experiment setup

2008/10/27 6

(Chosen by Skype)

Network traffic Audio signal

Page 7: Inferring Speech Activity from Encrypted Skype Traffic

Data Description (contd.)

• Trace summary

2008/10/27 7

Total # of traces # TCP # UDP

1839 1427 412

# Relay node Mean packet size Mean time period

1677 109.6 bytes 612.5 sec

Page 8: Inferring Speech Activity from Encrypted Skype Traffic

Proposed Scheme

• The indicator of voice activity – packet size

• Smoothing

• Adaptive thresholding

2008/10/27 8

Page 9: Inferring Speech Activity from Encrypted Skype Traffic

The indicator of voice activity – Packet size

2008/10/27 9

Page 10: Inferring Speech Activity from Encrypted Skype Traffic

Smoothing

• EWMA (Exponentially Weighted Moving Average)

2008/10/27 10

1)1( −−+= iii PYP λλEWMA :

Y : Observed packet sizeP : Smoothed packet size

)2.0( =λ

Page 11: Inferring Speech Activity from Encrypted Skype Traffic

Adaptive thresholding

2008/10/27 11

Packet Size (bytes)

Page 12: Inferring Speech Activity from Encrypted Skype Traffic

Adaptive thresholding (contd.)

2008/10/27 12

P : 140 bytes

T1 : 74 bytesT2 : 80 bytes

(P + T1)/2 = 107 bytes(P + T2)/2 = 110 bytes

Page 13: Inferring Speech Activity from Encrypted Skype Traffic

Adaptive thresholding (contd.)

2008/10/27 13estimated ON periods

Packet Size (bytes)

Page 14: Inferring Speech Activity from Encrypted Skype Traffic

Performance Evaluation

• Number of ON periods

2008/10/27 14

periodsONtrueofNumberperiodsONestimatedofNumber

________

Page 15: Inferring Speech Activity from Encrypted Skype Traffic

Performance Evaluation (contd.)

• Average length of ON periods

2008/10/27 15

periodsONtrueoflengthMeanperiodsONestimatedoflengthMean

__________

Page 16: Inferring Speech Activity from Encrypted Skype Traffic

Performance Evaluation (contd.)

2008/10/27 16

• State correctness

NorMNandM

____

True speech activity (M) : 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0

Estimated speech activity (N): 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 1 1 1 0 1 1 1

M and N: 0 0 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0

M  or  N: 0 1 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 1

ON  period ‐> 1OFF period ‐> 0

Page 17: Inferring Speech Activity from Encrypted Skype Traffic

Performance Evaluation (contd.)

• State correctness

2008/10/27 17

Page 18: Inferring Speech Activity from Encrypted Skype Traffic

Conclusion

• We propose the network‐level VAD which infers speech activity from network traffic instead of audio signal.

• We propose a VAD algorithm that can extract voice activity from encrypted and non‐silence‐suppressed VoIP network traffic.

2008/10/27 18

Page 19: Inferring Speech Activity from Encrypted Skype Traffic

• Thanks

2008/10/27 19

Page 20: Inferring Speech Activity from Encrypted Skype Traffic

Backup slides

2008/10/27 20

Page 21: Inferring Speech Activity from Encrypted Skype Traffic

VAD on audio signaling

2008/10/27 21

)log(*10 2∑=i

iSvolume

J.‐S. R. Jang, “Audio signal processing and recognition,”http://www.cs.nthu.edu.tw/jang

Static threshold : 183 db

Page 22: Inferring Speech Activity from Encrypted Skype Traffic

2008/10/27 22

Page 23: Inferring Speech Activity from Encrypted Skype Traffic

2008/10/27 23

I am a student of National Taiwan University.

Page 24: Inferring Speech Activity from Encrypted Skype Traffic

Performance Evaluation (contd.)

• Number of ON periods

2008/10/27 24

Page 25: Inferring Speech Activity from Encrypted Skype Traffic

Performance Evaluation (contd.)

• Average length of ON periods

2008/10/27 25

Page 26: Inferring Speech Activity from Encrypted Skype Traffic

Performance Evaluation (contd.)

2008/10/27 26

• State correctness

NMNM

∪∩

True speech activity (M) : 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0

Estimated speech activity (N): 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 1 1 1 0 1 1 1