DESCRIPTION
Normally, voice activity detection (VAD) refers to speech processing algorithms that detect the presence or absence of human speech in segments of audio signals. In this paper, however, we focus on speech detection algorithms that take VoIP traffic, rather than audio signals, as input. We call this category of algorithms network-level VAD. Traditional VAD plays a fundamental role in speech processing systems because it can delimit speech segments. Network-level VAD, on the other hand, can be quite helpful in network management, which is the motivation for our study. We propose the first real-time network-level VAD algorithm that can extract voice activity from encrypted and non-silence-suppressed Skype traffic. We evaluate the speech detection accuracy of the proposed algorithm with extensive real-life traces. The results show that our scheme achieves reasonably good performance even when a high degree of randomness has been injected into the network traffic.
Inferring Speech Activity from Encrypted Skype Traffic
Yu‐Chun Chang, Kuan‐Ta Chen, Chen‐Chi Wu, and
Chin‐Laung Lei
Oct. 27, 2008
2008/10/27 1
Outline
• Introduction
• Data description
• Proposed scheme
• Performance evaluation
• Conclusion
Introduction
• VAD (Voice Activity Detection)
– An algorithm that detects the presence or absence of human speech in an audio signal.
• Source‐level VAD
– Audio signal
– Silence suppression
• Network‐level VAD
– Network traffic
– Flow identification, QoS measurement
• The differences between source‐level and network‐level VAD
            source‐level                              network‐level
input       audio signal                              network traffic
location    speaker’s host                            network node
purpose     silence suppression, echo cancellation    traffic management, QoS measurement
Introduction (contd.)
• Challenges
– Payload encryption
– Skype does not support silence suppression
• Contribution
– We propose a network‐level VAD that can infer speech activity from encrypted and non‐silence‐suppressed VoIP traffic.
Data Description
• Experiment setup
[Figure: experiment setup. Labels: "(Chosen by Skype)", "Network traffic", "Audio signal"]
Data Description (contd.)
• Trace summary
Total # of traces    # TCP    # UDP    # Relay node    Mean packet size    Mean time period
1839                 1427     412      1677            109.6 bytes         612.5 sec
Proposed Scheme
• The indicator of voice activity – packet size
• Smoothing
• Adaptive thresholding
The indicator of voice activity – Packet size
Smoothing
• EWMA (Exponentially Weighted Moving Average)
EWMA: $P_i = \lambda Y_i + (1 - \lambda) P_{i-1}$, with $\lambda = 0.2$
Y : Observed packet size
P : Smoothed packet size
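The smoothing step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `ewma` is hypothetical, and seeding the filter with the first observed packet size is an assumption, since the slide does not say how P is initialized.

```python
def ewma(sizes, lam=0.2):
    """Smooth packet sizes with P_i = lam * Y_i + (1 - lam) * P_{i-1}."""
    smoothed = []
    prev = None
    for y in sizes:
        # Assumption: the first smoothed value equals the first observation.
        prev = y if prev is None else lam * y + (1 - lam) * prev
        smoothed.append(prev)
    return smoothed


if __name__ == "__main__":
    # Hypothetical packet sizes in bytes, for illustration only.
    print(ewma([100, 150, 150, 60]))
```

With λ = 0.2, each new packet contributes only a fifth of its size to the smoothed value, so short bursts are damped while sustained changes still show through.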
Adaptive thresholding
[Figure: packet size (bytes)]
Adaptive thresholding (contd.)
P : 140 bytes
T1 : 74 bytes
T2 : 80 bytes
(P + T1)/2 = 107 bytes
(P + T2)/2 = 110 bytes
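The arithmetic above can be reproduced with a short sketch. The slide does not define the symbols, so the code assumes that P is a peak and that T1 and T2 are troughs of the smoothed packet size, and that each threshold is the midpoint between the peak and a trough. The names `midpoint_threshold` and `is_on`, and the rule that a packet above the threshold counts as ON, are likewise assumptions made for illustration.

```python
def midpoint_threshold(peak, trough):
    """Threshold halfway between a peak and a trough (sizes in bytes)."""
    return (peak + trough) / 2


def is_on(smoothed_size, threshold):
    # Assumption: larger smoothed packet sizes indicate speech (ON).
    return smoothed_size > threshold


if __name__ == "__main__":
    P, T1, T2 = 140, 74, 80  # values from the slide
    print(midpoint_threshold(P, T1))
    print(midpoint_threshold(P, T2))
```

Because the threshold is recomputed from nearby peaks and troughs, it follows the traffic rather than relying on one fixed packet size.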
Adaptive thresholding (contd.)
[Figure: packet size (bytes), with estimated ON periods marked]
Performance Evaluation
• Number of ON periods
$\dfrac{\text{Number of estimated ON periods}}{\text{Number of true ON periods}}$
Performance Evaluation (contd.)
• Average length of ON periods
$\dfrac{\text{Mean length of estimated ON periods}}{\text{Mean length of true ON periods}}$
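The two ratios above can be computed from binary activity sequences, as in the sketch below. It reuses the example sequences M (true) and N (estimated) from the state correctness slide, and assumes that an ON period is a maximal run of consecutive 1s; the helper name `on_period_lengths` is hypothetical.

```python
from itertools import groupby


def on_period_lengths(states):
    """Return the length of each run of consecutive 1s (one per ON period)."""
    return [len(list(run)) for value, run in groupby(states) if value == 1]


# Example sequences taken from the state correctness slide.
M = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
N = [0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1]

true_runs = on_period_lengths(M)
est_runs = on_period_lengths(N)

# Ratio of the number of ON periods (estimated over true).
count_ratio = len(est_runs) / len(true_runs)

# Ratio of the mean ON-period length (estimated over true).
length_ratio = (sum(est_runs) / len(est_runs)) / (sum(true_runs) / len(true_runs))

if __name__ == "__main__":
    print(count_ratio, length_ratio)
```

For both ratios a value of 1 means the estimate matches the ground truth; here the estimate finds more, and shorter, ON periods than the true sequence contains.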
Performance Evaluation (contd.)
• State correctness
$\dfrac{M \text{ and } N}{M \text{ or } N}$
True speech activity (M) : 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0
Estimated speech activity (N): 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 1 1 1 0 1 1 1
M and N: 0 0 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0
M or N: 0 1 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 1
ON period ‐> 1
OFF period ‐> 0
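The state correctness example can be checked with a few lines of Python. The sketch assumes the metric is the number of 1s in "M and N" divided by the number of 1s in "M or N", which is how the example rows read; `state_correctness` is a hypothetical name.

```python
def state_correctness(true_states, est_states):
    """Count of slots ON in both sequences over slots ON in either one."""
    pairs = list(zip(true_states, est_states))
    both = sum(m & n for m, n in pairs)
    either = sum(m | n for m, n in pairs)
    # Assumption: return 1.0 when neither sequence has any ON slot.
    return both / either if either else 1.0


M = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
N = [0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1]

if __name__ == "__main__":
    print(state_correctness(M, N))
```

For the sequences on the slide, "M and N" contains 10 ones and "M or N" contains 15, so the score is 10/15, or about 0.67.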
Performance Evaluation (contd.)
• State correctness
Conclusion
• We propose network‐level VAD, which infers speech activity from network traffic instead of the audio signal.
• We propose a VAD algorithm that can extract voice activity from encrypted and non‐silence‐suppressed VoIP network traffic.
• Thanks
Backup slides
VAD on audio signal
$\text{volume} = 10 \cdot \log\left(\sum_i S_i^2\right)$
J.‐S. R. Jang, “Audio signal processing and recognition,” http://www.cs.nthu.edu.tw/jang
Static threshold : 183 dB
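A source-level check based on the volume formula might look like the sketch below. It is illustrative only: the slide does not state the base of the logarithm or the frame length, so the base is left as a parameter (defaulting to 10 as an assumption), and `frame_volume` and `is_speech` are hypothetical names.

```python
import math


def frame_volume(samples, log_base=10):
    """Volume of one frame: 10 * log(sum of squared samples)."""
    energy = sum(s * s for s in samples)
    if energy == 0:
        # An all-zero frame has no defined logarithm; treat it as silence.
        return float("-inf")
    return 10 * math.log(energy, log_base)


def is_speech(samples, threshold, log_base=10):
    """Static thresholding: a frame is speech if its volume exceeds threshold."""
    return frame_volume(samples, log_base) > threshold


if __name__ == "__main__":
    frame = [1000] * 256  # hypothetical frame of audio samples
    print(frame_volume(frame), is_speech(frame, threshold=183))
```

Unlike the adaptive threshold used on packet sizes, the threshold here is fixed, so the result depends on the recording level and on the logarithm base chosen.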
I am a student of National Taiwan University.
Performance Evaluation (contd.)
• Number of ON periods
Performance Evaluation (contd.)
• Average length of ON periods
Performance Evaluation (contd.)
• State correctness
$\dfrac{M \cap N}{M \cup N}$
True speech activity (M) : 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0
Estimated speech activity (N): 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 1 1 1 0 1 1 1