7/31/2019 Audio Intro
Audio
Theory and Characteristics
EE1432 Multimedia Signal Processing
Endang Widjiati [email protected]
Multimedia Telecommunications Study Program
Department of Electrical Engineering
Faculty of Industrial Technology
Institut Teknologi Sepuluh Nopember
Introduction
Sound within the human hearing range is called audio, and waves in this
frequency range are called acoustic signals. Speech is an acoustic
signal produced by humans
Typical audio signal classes: telephone speech, wideband speech, and
wideband audio. They differ in bandwidth, dynamic range,
and in the listener's expectation of offered quality
Some important concepts:
- sampling the analog signal in the time dimension
- quantizing the analog signal in the amplitude dimension
- the Nyquist theorem: to reconstruct a signal exactly, the sampling rate must be at least twice the highest frequency in the signal
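These concepts can be illustrated with a short sketch (all function and variable names here are illustrative, not from the slides):

```python
import numpy as np

def sample_and_quantize(freq_hz, fs_hz, n_bits, duration_s=0.01):
    """Sample a sine tone at rate fs_hz, then uniformly quantize to n_bits."""
    t = np.arange(0, duration_s, 1.0 / fs_hz)   # time dimension: sampling
    x = np.sin(2 * np.pi * freq_hz * t)         # stand-in for the analog signal
    q = 2.0 / (2 ** n_bits)                     # stepsize Q for the range [-1, 1]
    x_q = np.round(x / q) * q                   # amplitude dimension: quantization
    return x, x_q, q

# Nyquist: a 1 kHz tone needs a sampling rate above 2 kHz; 8 kHz is ample.
x, x_q, q = sample_and_quantize(freq_hz=1000, fs_hz=8000, n_bits=8)
assert np.max(np.abs(x - x_q)) <= q / 2         # quantization error never exceeds Q/2
```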
Introduction
The frequency range is divided into:
  Infrasound               0 to 20 Hz
  Human hearing range      20 Hz to 20 kHz
  Ultrasound               20 kHz to 1 GHz
  Hypersound               1 GHz to 10 THz
Multimedia systems typically make use of sound only within the human
hearing range; sampling rates usually run from 8 kHz to 48 kHz. The
amplitude of a sound wave is the property heard as loudness
Introduction
SNR: the ratio of the power of the correct signal to the noise; a measure of
the quality of the signal, usually expressed in decibels (dB).
The levels of sound we hear are described in dB, as a ratio to the quietest
sound we are able to hear.
Magnitudes of common sounds, in decibels:
  Threshold of hearing       0
  Rustle of leaves          10
  Very quiet room           20
  Average room              40
  Conversation              60
  Busy street               70
  Loud radio                80
  Train through station     90
  Riveter                  100
  Threshold of discomfort  120
  Threshold of pain        140
  Damage eardrum           160
Other concepts: SQNR (signal-to-quantization-noise ratio) and segmental SNR
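A minimal sketch of computing an SNR in dB from signal and noise samples (the helper name is illustrative):

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB: 10*log10(signal power / noise power)."""
    p_s = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_n = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_s / p_n)

# A signal with 100x the noise power is 20 dB above it.
assert abs(snr_db([10.0] * 4, [1.0] * 4) - 20.0) < 1e-9
```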
Introduction
Audio coding gets its compression without making assumptions
about the nature of the audio source. The coder exploits
the perceptual limitations of the human auditory system.
Much of the compression results from removing perceptually
irrelevant parts of the audio signal. Removing such parts yields
inaudible distortion; thus the audio coder can compress any signal meant to
be heard by the human ear
Introduction
Audio format
Audio Quality vs Data Rate
Popular audio file formats: .au (Unix workstations), .aiff (Mac), .wav
(PC, DEC workstations)
Quality      Sample rate  Bits per   Mono/          Data rate        Frequency
             [kHz]        sample     Stereo         [KBytes/s]       band [Hz]
Telephone    8            8          Mono           8                200-3,400
AM Radio     11.025       8          Mono           11.0             100-5,500
FM Radio     22.05        16         Stereo         88.2             20-11,000
CD           44.1         16         Stereo         176.4            20-20,000
DAT          48           16         Stereo         192.0            20-20,000
DVD Audio    192 (max)    24 (max)   up to 6 ch     1,200.0 (max)    0-96,000 (max)
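The uncompressed data rates in the table follow directly from sample rate, sample size, and channel count; a quick check (the helper name is illustrative):

```python
def data_rate_kbytes_per_s(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM data rate in KBytes/s (1 KByte = 1000 bytes here)."""
    return sample_rate_hz * bits_per_sample * channels / 8 / 1000

# CD: 44.1 kHz, 16 bits, stereo -> 176.4 KBytes/s, as in the table.
assert data_rate_kbytes_per_s(44100, 16, 2) == 176.4
```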
MIDI
Control panel: controls functions that are not directly concerned
with notes and durations, e.g. sets the volume
Auxiliary controllers: control the notes played on the keyboard.
Two common variables are pitch bend and modulation
Memory: stores patches for the sound generators and settings on the
control panel
MIDI messages
Transmit information between MIDI devices and determine the types of
musical events that can be passed from device to device
A MIDI message consists of a status byte (the first byte of
any message, which describes the kind of message) followed by data bytes
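The status-byte/data-byte layout can be sketched as follows (the Note On/Off semantics come from the MIDI 1.0 specification; the helper name is illustrative):

```python
def parse_channel_message(msg):
    """Split a MIDI channel message into status info and data bytes.

    The status byte has its high bit set; its upper nibble is the message
    type, its lower nibble the channel (encoded 0-15 for channels 1-16).
    Data bytes have the high bit clear.
    """
    status, *data = msg
    assert status & 0x80, "first byte must be a status byte"
    msg_type = status >> 4       # e.g. 0x9 = Note On, 0x8 = Note Off
    channel = status & 0x0F
    return msg_type, channel, data

# Note On, channel 1 (status 0x90), middle C (60), velocity 64
msg_type, channel, data = parse_channel_message([0x90, 60, 64])
assert (msg_type, channel, data) == (0x9, 0, [60, 64])
```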
MIDI
Classification of MIDI messages
Channel messages: messages that are transmitted on individual
channels rather than globally to all devices in the MIDI network
Channel voice messages: instruct the receiving instrument to
assign particular sounds to its voices; turn notes on and off; alter the
sound of the currently active note or notes. E.g. note on, note off,
control change, etc.
Channel mode messages: determine the way that a receiving MIDI
device responds to channel voice messages. They set the MIDI
channel receiving modes for different MIDI devices, stop spurious
notes from playing and affect local control of a device. e.g. local
control, all notes off, omni mode off, etc.
MIDI
System messages: carry information that is not channel specific, such
as timing signals for synchronization, positioning information in pre-
recorded MIDI sequences, and detailed setup information for the
destination device.
System real-time messages: messages related to synchronization.
E.g. system reset, timing clock (MIDI clock), etc.
System common messages: commands that prepare sequencers and
synthesizers to play a song. E.g. song select, tune request, etc.
System exclusive messages: messages related to things that cannot
be standardized, plus additions to the original MIDI specification. A
system exclusive message is a stream of bytes that starts with a
system-exclusive status byte, which identifies the manufacturer, and
ends with an end-of-exclusive message.
MIDI
General MIDI
Requirements for general MIDI compatibility:
- Support all 16 channels
- Each channel can play a different instrument/program (multitimbral)
- Each channel can play many voices (polyphony)
- Minimum of 24 fully dynamically allocated voices
MIDI + Instrument Patch Map + Percussion Key Map: a piece of MIDI music sounds the same anywhere it is played
- Instrument patch map is a standard program list consisting of 128
patch types
- Percussion map specifies 47 percussion sounds
- Key-based percussion is always transmitted on MIDI channel 10.
Psychoacoustics model
Threshold in quiet
Put a person in a quiet room. Raise the level of a 1 kHz tone until it is
just barely audible. Vary the frequency and plot
The threshold levels are frequency dependent. The human ear is
most sensitive at 2-4 kHz.
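The slides describe the threshold-in-quiet curve experimentally; a widely used analytic fit (Terhardt's approximation, not given on the slides) reproduces its shape, including the dip at 2-4 kHz:

```python
import math

def threshold_in_quiet_db(f_hz):
    """Terhardt's approximation of the threshold in quiet, in dB SPL.

    A commonly cited fit used in perceptual audio coding; f in Hz,
    valid roughly over the 20 Hz - 20 kHz hearing range.
    """
    f = f_hz / 1000.0  # convert to kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The ear is most sensitive around 2-4 kHz: the threshold dips there.
assert threshold_in_quiet_db(3300) < threshold_in_quiet_db(100)
assert threshold_in_quiet_db(3300) < threshold_in_quiet_db(15000)
```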
Psychoacoustics model
Frequency masking
Play 1 KHz tone (masking tone) at fixed level (60dB). Play test tone
at different level (e.g. 1.1 kHz), and raise level until just
distinguishable. Vary the frequency of the test tone and plot the
threshold when it becomes audible
Psychoacoustics model
The threshold for the test tone is much higher than the threshold in
quiet near the masking frequency
Repeating the experiment for various masking-tone frequencies yields:
Critical bands: the widths of the masking bands differ for different
masking tones, increasing with the frequency of the
masking tone. About 100 Hz for masking frequencies below 500 Hz, growing
larger and larger above 500 Hz.
Psychoacoustics model
Temporal masking
If we hear a loud sound that then stops, it takes a little while until we
can hear a soft tone nearby
Play a 1 kHz masking tone at 60 dB, plus a test tone at 1.1 kHz at
40 dB. The test tone can't be heard (it's masked). Stop the masking tone,
then stop the test tone after a short delay. Adjust the delay to the
shortest time at which the test tone can be heard (e.g., 5 ms). Repeat with
different levels of the test tone and plot:
Psychoacoustics model
Temporal masking
Try other frequencies for the test tone (keeping the masking-tone duration constant).
The total effect of temporal masking:
Psychoacoustics model
Perceptual audio coding
Quantization:
The maximum quantization error of a uniform quantizer with
stepsize Q is Q/2
Reducing each sample by 1 bit (i.e., doubling the stepsize)
increases the quantization noise by about 6 dB
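The 6 dB/bit rule can be verified numerically with a uniform quantizer (a minimal sketch; names are illustrative):

```python
import numpy as np

def sqnr_db(n_bits, x):
    """SQNR of uniformly quantizing x (values in [-1, 1]) with n_bits."""
    q = 2.0 / (2 ** n_bits)                 # stepsize Q
    x_q = np.round(x / q) * q
    noise = x - x_q
    return 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
# Dropping one bit doubles the stepsize and costs about 6 dB of SQNR.
drop = sqnr_db(8, x) - sqnr_db(7, x)
assert 5.5 < drop < 6.5
```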
Subband coding:
Decompose a signal into separate frequency bands by using a
filter bank
Quantize samples in different bands with accuracy proportional
to perceptual sensitivity
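As a toy illustration of subband coding: split the spectrum into bands and quantize each band with its own accuracy. (Real coders use polyphase/QMF filter banks; the FFT split and all names here are illustrative only.)

```python
import numpy as np

def subband_quantize(x, n_bands, bits_per_band):
    """Toy subband coder: FFT band split, per-band uniform quantization."""
    X = np.fft.rfft(x)
    edges = np.linspace(0, len(X), n_bands + 1, dtype=int)
    for (lo, hi), bits in zip(zip(edges[:-1], edges[1:]), bits_per_band):
        band = X[lo:hi]
        scale = np.max(np.abs(band)) or 1.0
        q = 2 * scale / (2 ** bits)          # coarser stepsize = fewer bits
        X[lo:hi] = (np.round(band.real / q)
                    + 1j * np.round(band.imag / q)) * q
    return np.fft.irfft(X, n=len(x))

t = np.arange(1024) / 1024
x = np.sin(2 * np.pi * 40 * t)
# Spend more bits on the low band (where the tone lives) than on the rest.
y = subband_quantize(x, n_bands=4, bits_per_band=[10, 4, 2, 2])
assert np.mean((x - y) ** 2) < 1e-3
```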
Psychoacoustics model
Perceptual audio coding
The quantization step size for each frequency band is set so that the
quantization noise is just below the masking level, which is
determined by taking into account all three masking effects
MPEG
MPEG (Moving Picture Experts Group): an ISO standard for the high-
fidelity compression of digital audio.
The MPEG/audio coder gets its compression without making assumptions
about the nature of the audio source. It exploits the perceptual
limitations of the human auditory system
MPEG-1 standard: defines coding standards for both audio and video,
and how to packetize the coded audio and video bits to provide time
synchronization
Total rate: 1.5 Mbps for audio and video
Video (352*240 pels/frame, 30 frames/s): 30 Mbps raw, compressed to about 1.2 Mbps
Audio (2 channels, 48 ksamples/s, 16 bits/sample): 2*768 kbps raw
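The raw-rate figures can be checked with a little arithmetic (the 12 bits/pixel figure assumes 4:2:0 chroma subsampling, which is not stated on the slide):

```python
def mbps(bits_per_second):
    return bits_per_second / 1e6

# Raw video: 352x240 pixels, 30 frames/s, ~12 bits/pixel (4:2:0 YCbCr)
raw_video = 352 * 240 * 30 * 12
# Raw audio: 2 channels, 48 ksamples/s, 16 bits/sample
raw_audio = 2 * 48000 * 16

assert round(mbps(raw_video), 1) == 30.4     # the slide's "30 Mbps"
assert raw_audio == 2 * 768_000              # the slide's "2*768 kbps"
```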
MPEG
MPEG-2: for better quality audio and video (e.g. 720*480 pels/frame)
Supports one or two audio channels in one of the four modes:
Monophonic mode for a single audio channel
Dual-monophonic mode for two independent audio channels
(similar to stereo)
Stereo mode for stereo channels, with bits shared between the
channels but no joint-stereo coding
Joint stereo mode either takes advantage of correlations between
stereo channels or irrelevancy of the phase difference between
channels, or both
MPEG
MPEG-1 Audio coding block diagram:
MPEG
MPEG layers
MPEG defines 3 layers for audio. The basic model is the same, but codec
complexity increases with each layer
The input sequence is separated into 32 frequency bands. Each subband
filter produces 1 output sample for every 32 input samples
Layer 1 processes 12 samples at a time in each subband
Layers 2 and 3 process 36 samples at a time
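The per-layer frame sizes follow directly from these numbers:

```python
SUBBANDS = 32
# Each subband filter outputs 1 sample per 32 input samples.
layer1_frame = SUBBANDS * 12    # Layer 1: 12 samples per subband
layer23_frame = SUBBANDS * 36   # Layers 2 and 3: 36 samples per subband

assert layer1_frame == 384
assert layer23_frame == 1152
```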
MPEG
Subband filtering and framing:
MPEG
Basic steps in algorithm:
Use convolution filters to divide the audio signal into frequency
subbands that approximate the 32 critical bands (subband filtering)
Determine the amount of masking for each band based on its frequency
(threshold in quiet) and the energy of its neighboring bands (frequency
masking); this is called the psychoacoustic model
If the energy in a band is below the masking threshold, don't encode it
Otherwise, determine the number of bits needed to represent the
coefficients in this band such that the noise introduced by quantization
stays below the masking threshold (recall that each bit of quantization
changes the noise by about 6 dB)
MPEG
Basic steps in algorithm:
Format the bitstream: insert proper headers, code the side information
(e.g. quantization scale factors for the different bands), and finally
code the quantized coefficient indices, generally using variable-length
encoding, e.g. Huffman coding
MPEG
Example:
Assume that the levels of 16 of the 32 bands are:
Assume that the level of the 8th band is 60 dB; it then masks
12 dB in the 7th band and 15 dB in the 9th.
The level in the 7th band is 10 dB (< 12 dB), so it is masked: don't encode it
The level in the 9th band is 35 dB (> 15 dB), so send it. Since each dropped
bit adds about 6 dB of quantization noise, we can tolerate up to 2 bits
(= 12 dB) of quantization error: if the original sample is represented
with 8 bits, we can reduce it to 6 bits.
Band 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Level (dB) 0 8 12 10 6 2 10 60 35 20 15 2 3 5 3 1
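This bit-allocation example can be checked mechanically (the interpretation follows the textbook version of this example; all names are illustrative):

```python
# Band levels from the slide's table (bands 1..16).
levels_db = [0, 8, 12, 10, 6, 2, 10, 60, 35, 20, 15, 2, 3, 5, 3, 1]

mask_band7 = 12   # masking the 60 dB tone in band 8 casts into band 7
mask_band9 = 15   # ... and into band 9

# Band 7 (10 dB) is below its 12 dB masking threshold: don't encode it.
assert levels_db[6] < mask_band7

# Band 9 (35 dB) is above its 15 dB threshold: encode it. Each dropped bit
# adds ~6 dB of noise, so 2 bits (12 dB <= 15 dB) can be dropped:
# an 8-bit sample can be reduced to 6 bits.
assert levels_db[8] > mask_band9
bits_droppable = mask_band9 // 6
assert bits_droppable == 2 and 8 - bits_droppable == 6
```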
MPEG
MPEG-1 audio layers: Performance Comparison
MPEG defines 3 layers for audio. The basic model is the same (as described
thus far), but coding efficiency increases with each layer, at the expense
of codec complexity.
Subjective quality scale: 5 = perfect, 4 = just noticeable, ..., 1 = very annoying
Raw data rate per audio channel: 48 ksamples/s * 16 bits/sample = 768 kbps

Layer     Target bit rate   Ratio   Quality @ 64 kbps   Quality @ 128 kbps
Layer 1   192 kbps          4:1     --                  --
Layer 2   128 kbps          6:1     2.1 to 2.6          4+
Layer 3    64 kbps          12:1    3.6 to 3.8          4+
MPEG
At the time of MPEG-1 audio development (finalized 1992), Layer 3
was considered too complex to be practically useful. Today, however, Layer 3
is the most widely deployed audio coding method (known as MP3),
because it provides good quality at an acceptable bit rate, and also
because the code for Layer 3 was distributed freely
MPEG
Technical differences between the audio layers:
The input sequence is separated into 32 frequency bands. Each subband is
divided into frames; a Layer 1 frame contains 384 samples, 12 samples from
each subband
Layer 1: DCT-type filter with one frame and equal frequency spread
per band. The psychoacoustic model uses only frequency masking
Layer 2: uses three frames in the filter (previous, current, and next, a
total of 1152 samples). This models a little of the temporal masking
Layer 3 (MP3): a better critical-band filter is used (non-equal
frequency bands), the psychoacoustic model includes temporal masking effects,
stereo redundancy is taken into account, and a Huffman coder is used
MPEG
MPEG-4
A new standard, which became international in early 1999, that takes
into account that a growing part of information is read, seen, and heard
in interactive ways
It supports new forms of communication, in particular Internet,
multimedia, and mobile communications
MPEG-4 represents an audiovisual scene as a composition of (potentially
meaningful) objects and supports the evolving ways in which audiovisual
material is produced, delivered, and consumed.
E.g. computer-generated content becomes part of the production of an
audiovisual scene; in addition, interaction with objects within the scene
is possible.
The future: MPEG-7 & MPEG-21
The future: MPEG-7 & MPEG-21
References
Z.N. Li and M.S. Drew, Fundamentals of Multimedia, Pearson Prentice Hall, 2004
S. Furui, Digital Speech Processing, Synthesis, and Recognition, Marcel Dekker, Inc., 1989
R. Steinmetz and K. Nahrstedt, Multimedia: Computing, Communications & Applications, Prentice Hall PTR, 1995
B. Gold and N. Morgan, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, Inc., 2000
D. Pan, "A Tutorial on MPEG/Audio Compression," IEEE Multimedia, pp. 60-74, Summer 1995
P. Noll, "Digital Audio for Multimedia," Proc. Signal Processing for Multimedia, NATO Advanced Audio Institute, 1999
References
T. Painter and A. Spanias, "Perceptual Coding of Digital Audio," Proc. of the IEEE, vol. 88, no. 4, April 2000
Audio Compression, http://www.cs.sfu.ca/undergrad/CourseMaterials/CMPT479/material/notes/Chap4/Chap4.3/Chap4.3.html
Multimedia Data Representation, http://www.cs.sfu.ca/CourseCentral/365/li/material/notes/Chap3/Chap3.1/Chap3.1.html
ISO, Overview of the MPEG-4 Standard, http://www.chiariglione.org/mpeg/standards/mpeg-4/mpeg-4.html