Session 08 9


  • 8/8/2019 Session 08 9

    1/52

    Speech Signal Analysis and Coding

    Dr. Arun Kumar

    Centre for Applied Research in Electronics

    (CARE), IIT Delhi

    [email protected]

  • 8/8/2019 Session 08 9

    2/52

    Contents

    Speech Processing Applications

    Speech Signal Understanding

    Speech Production

    Speech Signal Characteristics and Analysis

    Speech Coding

    Coding Standards

    Coder Attributes including Quality Evaluation

    Coding Methodologies

  • 8/8/2019 Session 08 9

    3/52

    Speech Transmission

    Trunk-line telephony

    Wireless telephony

    Speech Storage

    Voice Mail, Voice Memo, Answering machines

    Speech Synthesis

    Text-to-speech synthesis

    Automatic information services

    Speech Processing Applications

  • 8/8/2019 Session 08 9

    4/52

    Speaker Verification and Identification

    Phone banking

    Secure entry

    Aids for the Handicapped

    Variable rate playback

    Hearing aids

    Reading machine for visually impaired

    Visual display of speech information for hearing impaired

    Speech Processing Applications

  • 8/8/2019 Session 08 9

    5/52

    Speech Enhancement

    Echo and noise cancellation

    Speech Recognition

    Automatic language translation

    Voice Personality Transformation

    Voice conversion from source to target

    Speech Processing Applications

  • 8/8/2019 Session 08 9

    6/52

    The speech signal is the variation of pressure from atmospheric pressure, as a function of time, caused by traveling waves from the speaker's mouth (and also from the nostrils, cheeks and throat).

    The Speech Signal

  • 8/8/2019 Session 08 9

    7/52

    Units:

    SPL (Sound Pressure Level) in dB

    relative to a reference level.

    Reference: $10^{-16}$ W/cm$^2$

    - Corresponds to just barely audible

    The Intensity Level of Speech
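The level definition above can be sketched as a small calculation. This is a minimal illustration, assuming the intensity is given in W/cm^2 and the reference is the 10^-16 W/cm^2 value from the slide; the function name is hypothetical.

```python
import math

# Reference intensity from the slide (just barely audible), in W/cm^2.
I_REF_W_PER_CM2 = 1e-16

def intensity_level_db(intensity_w_per_cm2: float) -> float:
    """Level in dB of an intensity relative to the 1e-16 W/cm^2 reference."""
    return 10.0 * math.log10(intensity_w_per_cm2 / I_REF_W_PER_CM2)

# The reference itself sits at 0 dB; an intensity one million times
# larger sits about 60 dB above it.
print(intensity_level_db(1e-16))
print(intensity_level_db(1e-10))
```

Each factor of 10 in intensity adds 10 dB, which is why the scale compresses a very wide range of intensities into roughly 0 to 120 dB.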

  • 8/8/2019 Session 08 9

    8/52

    [Figure: intensity scale in dB with ticks at 0, 20, 55, 60, 70, 80, 100 and 120, marking these reference points: just barely audible, whisper, variations in normal voice level (1 meter distance from mouth), heavy traffic, airplane, rock concert]

    The Intensity Level of Speech

  • 8/8/2019 Session 08 9

    9/52

    Energy of speech during 1 s: $2 \times 10^{-5}$ Joules

    (It takes 100 Joules to light a 100 W bulb for 1 s)

    Strongest vowel: /a/ as in talk

    Weakest vowel: /i/ as in see

    Strongest consonant: /r/ as in run

    Weakest consonant: /θ/ as in thin

    The Intensity Level of Speech

  • 8/8/2019 Session 08 9

    10/52

    Audio Signal Category    Bandwidth (Hz)   Sampling Rate (kHz)   Source Rate (kbps)
    Telephone Band Speech    300-3400         8.0                   128
    Wideband Speech          50-7000          16.0                  256
    Wideband Audio           20-20,000        44.1/48.0             705/768

    Speech & Audio Signal Specs.
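The source rates in the table follow from the sampling rate times the sample resolution. The sketch below assumes 16 bits per sample and a single channel, which is the assumption that reproduces the listed figures; the function name is illustrative.

```python
def source_rate_kbps(sampling_rate_khz: float, bits_per_sample: int = 16) -> float:
    """Uncompressed bit rate in kbps for one channel of linear PCM."""
    return sampling_rate_khz * bits_per_sample

# Sampling rates taken from the table above.
for name, fs_khz in [
    ("Telephone band speech", 8.0),
    ("Wideband speech", 16.0),
    ("Wideband audio (44.1 kHz)", 44.1),
    ("Wideband audio (48 kHz)", 48.0),
]:
    print(f"{name}: {source_rate_kbps(fs_khz):.1f} kbps")
```

At 44.1 kHz the product is 705.6 kbps, which the table lists as 705.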

  • 8/8/2019 Session 08 9

    11/52

    Speech Articulation by the Vocal System

    Reproduced from: D. O'Shaughnessy, Human and machine speech communication, IEEE Press, 2000

  • 8/8/2019 Session 08 9

    12/52

    Speech Classes by Articulation

    Voiced speech

    Unvoiced speech

    Transient (stop) sounds

  • 8/8/2019 Session 08 9

    13/52

    The relationship between speech sounds (phonemes) and their acoustic realizations

    Waveform

    Spectrum

    Spectrogram

    Acoustic Analysis of Speech

  • 8/8/2019 Session 08 9

    14/52

    Time Waveform of a Speech Sentence

    [Figure: amplitude (-1 to 0.8) versus time (0 to 1.4 s) for the sentence "THIS IS GOOD", with the segments (TH), (i), s(s), (i), s(s), (G), U(O), d(D) marked along the waveform]

  • 8/8/2019 Session 08 9

    15/52

    Vowels: high energy, periodic, steady-state utterance

    Unvoiced fricatives: low energy, noise-like, steady-state utterance

    Voiced fricatives: low energy, element of periodicity, steady-state utterance

    Stops: transient release, medium to low energy

    Nasals: low-to-medium energy, periodic, steady-state utterance

    Waveform Analysis of a Speech

  • 8/8/2019 Session 08 9

    16/52

    Fundamental frequency F0 / Pitch period

    F0             Male     Female
    Average (Hz)   132      223
    Range (Hz)     50-250   120-500

    Acoustic Analysis of Vowels
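The pitch period is the reciprocal of the fundamental frequency. The sketch below converts the average F0 values from the table into a period in milliseconds and in samples; the 8 kHz sampling rate is an assumption taken from the telephone-band case, and the function names are illustrative.

```python
def pitch_period_ms(f0_hz: float) -> float:
    """Pitch period in milliseconds for a fundamental frequency in Hz."""
    return 1000.0 / f0_hz

def pitch_period_samples(f0_hz: float, fs_hz: float = 8000.0) -> int:
    """Pitch period rounded to the nearest whole sample at rate fs_hz."""
    return round(fs_hz / f0_hz)

# Average F0 values from the table above.
for label, f0 in [("Male average", 132.0), ("Female average", 223.0)]:
    print(label, f"{pitch_period_ms(f0):.2f} ms", pitch_period_samples(f0), "samples")
```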

  • 8/8/2019 Session 08 9

    17/52

    Stop Consonants

    Momentary blockage of the vocal tract (50-100 ms): Closure phase

    Release burst (shortest acoustic event)

    Voice onset time (VOT)

    Fricatives

    Narrow constriction somewhere in vocal tract

    Turbulent airflow through the constriction

    Acoustic Analysis of Consonants

  • 8/8/2019 Session 08 9

    18/52

    The International Phonetic Alphabet (IPA)

  • 8/8/2019 Session 08 9

    19/52

    Universal Speech Production Model

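    [Block diagram: an impulse train generator feeds a glottal pulse model and a voiced gain; a white noise generator feeds an unvoiced gain; a voiced or unvoiced switch selects one of the two excitations, which then passes through the vocal tract filter and the radiation model to give the output speech]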

  • 8/8/2019 Session 08 9

    20/52

    Vocal Tract Model

    Time-varying all-pole linear filter excited by a source signal.

    H(z) models the vocal tract system.

    H(z)=1/A(z)

    [Block diagram: excitation e[n] enters H(z), output is s[n]]

    $$H(z) = \frac{1}{1 - \sum_{i=1}^{P} a_i z^{-i}} = \frac{1}{A(z)}$$
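A minimal sketch of the all-pole model, assuming the sign convention A(z) = 1 - sum of a_i z^-i, so that s[n] = e[n] + a_1 s[n-1] + ... + a_P s[n-P]. The coefficient values and the 61-sample pitch period are hypothetical, chosen only to give a stable single resonance excited by an impulse train as in a voiced sound.

```python
def all_pole_filter(excitation, a):
    """Filter excitation e[n] through H(z) = 1/A(z).

    Implements s[n] = e[n] + sum_{i=1..P} a[i-1] * s[n-i],
    with zero initial conditions.
    """
    s = []
    for n, e_n in enumerate(excitation):
        acc = e_n
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc += a_i * s[n - i]
        s.append(acc)
    return s

def impulse_train(num_samples, period):
    """Unit impulses every `period` samples (voiced excitation)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]

# Hypothetical 2nd-order coefficients: complex poles of radius ~0.89,
# so the filter is stable. Not taken from real speech.
a = [1.2, -0.8]
e = impulse_train(160, 61)
s = all_pole_filter(e, a)
print(s[:5])
```

In a real coder the coefficients change from frame to frame, which is what makes the filter time-varying.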

  • 8/8/2019 Session 08 9

    21/52

    [Figure: magnitude (dB, -100 to 80) versus frequency (0 to 4000 Hz)]

    Voiced Speech Spectrum

  • 8/8/2019 Session 08 9

    22/52

    [Figure: magnitude (dB, -100 to 80) versus frequency (0 to 4000 Hz)]

    Superimposed 2nd-order LP Envelope

  • 8/8/2019 Session 08 9

    23/52

    [Figure: magnitude (dB, -100 to 80) versus frequency (0 to 4000 Hz)]

    Superimposed 2nd, 6th order LP Envelopes

  • 8/8/2019 Session 08 9

    24/52

    [Figure: magnitude (dB, -100 to 80) versus frequency (0 to 4000 Hz)]

    Superimposed 2nd, 6th, & 10th order LP Envelopes

  • 8/8/2019 Session 08 9

    25/52

    [Figure: magnitude (dB, -100 to 80) versus frequency (0 to 4000 Hz)]

    Superimposed 2nd, 6th, 10th & 16th order LP Envelopes
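The LP envelopes of increasing order can be obtained by solving the autocorrelation normal equations for each order. The sketch below uses the Levinson-Durbin recursion, which is one common way to do this; the slides do not say which method produced the plots. It assumes the convention A(z) = 1 - sum of a_i z^-i, and the autocorrelation values in the usage lines are hypothetical.

```python
import cmath
import math

def autocorrelation(x, max_lag):
    """Autocorrelation sums r[0..max_lag] of one analysis frame."""
    n = len(x)
    return [sum(x[k] * x[k + lag] for k in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a, err): predictor coefficients a_1..a_P and the
    final prediction error energy, from autocorrelations r[0..P]."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err                # reflection coefficient of stage m
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

def lp_envelope_db(a, gain, freq_hz, fs_hz=8000.0):
    """Magnitude in dB of gain / A(e^{jw}) at one frequency."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    a_of_w = 1.0 - sum(a_i * cmath.exp(-1j * w * i)
                       for i, a_i in enumerate(a, start=1))
    return 20.0 * math.log10(gain / abs(a_of_w))

# Hypothetical autocorrelation of a first-order process.
r = [1.0, 0.5, 0.25]
for order in (1, 2):
    a, err = levinson_durbin(r, order)
    print(order, a, err)
```

A higher order gives the envelope more poles to place, so it can follow more spectral peaks, at the cost of more coefficients to transmit. The sketch does not guard against a zero error energy.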

  • 8/8/2019 Session 08 9

    26/52

    Unvoiced Speech and 10th order LP Residual

    [Figure: two panels of amplitude versus time (0 to 40 ms), showing the unvoiced speech segment and its 10th-order LP residual]

  • 8/8/2019 Session 08 9

    27/52

    Voiced Speech and 10th-order LP Residual

    [Figure: two panels of amplitude versus time (0 to 40 ms), showing the voiced speech segment and its 10th-order LP residual]

    Short-term correlation

    Long-term correlation
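The LP residual is what remains after the short-term predictor is subtracted from the signal, that is, the output of the inverse filter A(z). The sketch below assumes A(z) = 1 - sum of a_i z^-i and uses a synthetic first-order example (hypothetical coefficient 0.9, impulse excitation every 40 samples) rather than real speech.

```python
def lp_residual(signal, a):
    """Inverse-filter the signal with A(z):
    e[n] = s[n] - sum_{i=1..P} a[i-1] * s[n-i]."""
    residual = []
    for n, s_n in enumerate(signal):
        prediction = sum(a_i * signal[n - i]
                         for i, a_i in enumerate(a, start=1) if n - i >= 0)
        residual.append(s_n - prediction)
    return residual

# Synthetic "voiced" signal: impulses every 40 samples through 1/(1 - 0.9 z^-1).
excitation = [1.0 if n % 40 == 0 else 0.0 for n in range(120)]
speech = []
for n, e_n in enumerate(excitation):
    previous = speech[n - 1] if n > 0 else 0.0
    speech.append(e_n + 0.9 * previous)

residual = lp_residual(speech, [0.9])
print([n for n, v in enumerate(residual) if abs(v) > 0.5])
```

Inverse filtering removes the short-term correlation, but the pitch pulses survive in the residual at regular intervals. That remaining periodicity is the long-term correlation, which a separate long-term (pitch) predictor can exploit.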

  • 8/8/2019 Session 08 9

    28/52

    Speech Coding

  • 8/8/2019 Session 08 9

    29/52

    For telephone band (or narrowband) speech:

    Signal Bandwidth: 300-3400 Hz

    Sampling Rate: 8000 Hz

    Resolution: 16 bits/sample linear PCM

    Uncompressed bit rate: 16 bits/sample x 8000 samples/s = 128 kbit/s

    What is the minimum coding rate for transmitting the message information?

    Coding Rates

  • 8/8/2019 Session 08 9

    30/52

    Coder Classes according to Bit-Rate

    B > 16 Kbps High bit rate coders

    4 < B

  • 8/8/2019 Session 08 9

    31/52

    ITU-T: International Telecommunication Union (UN)

    MPEG: Moving Picture Experts Group (ISO/IEC)

    INMARSAT: Intl. Maritime Satellite Corporation, for geo-synchronous satellites

    US Government: DoD, NATO

    TIA: Telecommunications Industry Association, for North American telecom standards

    ETSI: European Telecommunications Standards Institute

    Standards Organizations

  • 8/8/2019 Session 08 9

    32/52

    Name                      Coding Type       Bit-rate (kbps)   Organization   Year
    G.711/G.712               PCM μ-law/A-law   64                ITU-T          1972
    G.721/G.723/G.726/G.727   ADPCM             32/24/40/16       ITU-T          1984/86/88/90
    G.728                     LD-CELP           16                ITU-T          1992
    G.729                     CS-ACELP          8.0               ITU-T          1995
    G.723.1                   ACELP             6.3/5.3           ITU-T          1995
    G.722 (Wideband)          SB-ADPCM          48/56/64          ITU-T          1985

    Speech Coding Standards

  • 8/8/2019 Session 08 9

    33/52

    Name                 Coding Type   Bit-rate (kbps)   Organization   Year
    G.722.1 (Wideband)   Transform     24/32             ITU-T          1999
    Inmarsat             IMBE          4.15              INMARSAT       1990
    IS-54 (old)          VSELP         7.95              TIA            1992
    GSM-FR               RPE-LTP       13                GSM            1991
    GSM-HR               CELP          5-6               GSM            1994
    GSM-EFR              CELP          12.2              GSM            1997

    Speech Coding Standards

  • 8/8/2019 Session 08 9

    34/52

    Name           Coding Type   Bit-rate (kbps)   Organization   Year
    IS-641 (new)   ACELP         7.4               TIA            1997
    Iridium        AMBE          2.4               Iridium        1996
    MPEG-4         HVXC          2-4               MPEG/ISO       1999
    MPEG-4         CELP          4-24              MPEG/ISO       1999
    FS-1015        LPC-10        2.4               US-DoD/NATO    1984
    FS-1016        CELP          4.8               US-DoD/NATO    1989
    MELP           MELP          2.4               US-DoD/NATO    1996

    Speech Coding Standards

  • 8/8/2019 Session 08 9

    35/52

    Coding Methodologies

    Waveform coding

    Vocoding or parametric coding

    Hybrid coding

    Coding Methodologies

  • 8/8/2019 Session 08 9

    36/52

    Classes according to Coding Type

    [Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (1, 2, 4, 8, 16, 32, 64 kbps), with one curve each for parametric coders, hybrid coders and waveform-approximating coders]

  • 8/8/2019 Session 08 9

    37/52

    Coding Standards

    [Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (1, 2, 4, 8, 16, 32, 64 kbps) for parametric coders, hybrid coders and waveform-approximating coders, with these standards marked on the chart: Linear PCM, G.711, G.726, G.728, G.729, G.723.1, GSM FR, GSM EFR, GSM/2, IS96, FS1015, MELP]

  • 8/8/2019 Session 08 9

    38/52

    PCM Coding

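    [Block diagram: the input x[n] passes through the quantizer Q[.], which outputs the quantized sample $\hat{x}[n]$ and its index i[n]]

    Instantaneous, non-uniform quantization

    For signals with time-varying energy, e.g. speech, uniform quantization is inefficient.

    With a uniform quantizer, SQNR falls by 6 dB each time the signal amplitude is halved.

    With a log quantizer, SQNR is independent of signal level.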
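A minimal sketch of logarithmic (mu-law) companding, assuming mu = 255, input samples normalized to [-1, 1], and an 8-bit uniform quantizer applied in the compressed domain. It uses the continuous mu-law formula rather than the segmented table of G.711, and all function names are illustrative.

```python
import math

MU = 255.0  # assumed companding constant

def mu_law_compress(x):
    """Map x in [-1, 1] to the compressed domain [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize_uniform(y, bits=8):
    """Mid-rise uniform quantizer on [-1, 1]; returns the reconstruction level."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(levels - 1, int((y + 1.0) / step))
    return -1.0 + (index + 0.5) * step

def mu_law_codec(x, bits=8):
    """Compress, quantize uniformly, then expand."""
    return mu_law_expand(quantize_uniform(mu_law_compress(x), bits))

# Relative error for a loud and a quiet sample, log versus uniform.
for x in (0.5, 0.01):
    log_err = abs(mu_law_codec(x) - x) / x
    uni_err = abs(quantize_uniform(x) - x) / x
    print(x, f"log: {log_err:.3f}", f"uniform: {uni_err:.3f}")
```

The relative error of the log quantizer stays at a similar size for loud and quiet samples, which is the sense in which its SQNR does not depend on signal level. The uniform quantizer keeps a fixed absolute step, so its relative error grows as the signal gets quieter.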

  • 8/8/2019 Session 08 9

    39/52

    ADPCM Coding

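    [Block diagram, encoder: the prediction $\tilde{x}[n]$ from the predictor P is subtracted from the input x[n] to give the difference d[n]; the quantizer Q[.] turns d[n] into the code c[n] and the quantized difference $\hat{d}[n]$; adding $\hat{d}[n]$ to $\tilde{x}[n]$ gives the reconstruction $\hat{x}[n]$, which feeds P]

    [Block diagram, decoder: the code c[n] is decoded to $\hat{d}[n]$ and added to the prediction $\tilde{x}[n]$ from an identical predictor P to give the output $\hat{x}[n]$]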

  • 8/8/2019 Session 08 9

    40/52

    Prediction in the context of Coding

    [Figure: two panels of amplitude versus time (0 to 20 ms)]

    Signal and first-difference signal
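The first-difference signal has a much smaller range than the signal itself when neighbouring samples are strongly correlated, which is why coding the difference needs fewer bits. The sketch below measures that variance reduction on a synthetic 200 Hz tone sampled at 8 kHz; the tone is a stand-in for voiced speech and all values are hypothetical.

```python
import math

def variance(x):
    """Population variance of a sequence."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

def first_difference(x):
    """d[n] = x[n] - x[n-1], i.e. prediction of x[n] by the previous sample."""
    return [x[n] - x[n - 1] for n in range(1, len(x))]

fs_hz = 8000
signal = [0.6 * math.sin(2.0 * math.pi * 200.0 * n / fs_hz) for n in range(160)]
difference = first_difference(signal)

# Ratio of signal variance to difference variance, in dB.
prediction_gain_db = 10.0 * math.log10(variance(signal) / variance(difference))
print(f"prediction gain: {prediction_gain_db:.1f} dB")
```

The gain shrinks as the signal contains more high-frequency energy, so real speech gives smaller gains than this slowly varying tone.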

  • 8/8/2019 Session 08 9

    41/52

    DPCM with fixed predictor can give 4-11 dB improvement over PCM.

    PCM with adaptive quantization can give ~5 dB improvement over μ-law non-adaptive PCM.

    DPCM with adaptive prediction can give 10-12 dB improvement over fixed predictor.

    ADPCM Coding
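A minimal DPCM sketch with a fixed first-order predictor and a fixed uniform step size. ADPCM adapts the step size, the predictor, or both, which this sketch deliberately leaves out. The predictor coefficient 0.9 and step 0.05 are hypothetical values.

```python
import math

def dpcm_encode(x, step=0.05, coeff=0.9):
    """Quantize the prediction error; predict from the reconstruction
    so that encoder and decoder stay in step."""
    codes = []
    recon_prev = 0.0
    for sample in x:
        prediction = coeff * recon_prev
        code = round((sample - prediction) / step)   # integer code c[n]
        codes.append(code)
        recon_prev = prediction + code * step        # local reconstruction
    return codes

def dpcm_decode(codes, step=0.05, coeff=0.9):
    """Rebuild the signal from the codes with the same predictor."""
    out = []
    recon_prev = 0.0
    for code in codes:
        recon_prev = coeff * recon_prev + code * step
        out.append(recon_prev)
    return out

signal = [0.6 * math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(160)]
codes = dpcm_encode(signal)
decoded = dpcm_decode(codes)
print(max(abs(c) for c in codes), max(abs(s - d) for s, d in zip(signal, decoded)))
```

Because the encoder predicts from its own reconstruction rather than from the clean input, quantization errors do not accumulate, and the reconstruction error stays within half a step.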


  • 8/8/2019 Session 08 9

    42/52

    Code Excited Linear Prediction (CELP) Coding

    Most coders in the 4.8-16 kbps range are based on Linear Prediction Analysis-by-Synthesis (LPAS) coding.

    CELP belongs to the LPAS paradigm of speech coding.


  • 8/8/2019 Session 08 9

    43/52

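    Generic Linear Prediction Analysis-by-Synthesis (LPAS) Coder

    [Block diagram: LP analysis of the input speech sets the synthesis filter; the excitation generator drives the synthesis filter; the synthesized speech is subtracted from the input speech; the error minimization block uses the difference to select the excitation]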
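The analysis-by-synthesis loop can be sketched as an exhaustive codebook search: each candidate excitation is passed through the synthesis filter, scaled by its best gain, and compared with the target. This is a simplified illustration, assuming a zero-state filter per frame, a plain squared-error criterion, and a tiny hypothetical codebook. Practical CELP coders add perceptual weighting and a long-term (pitch) predictor, which are omitted here.

```python
def synthesize(excitation, a):
    """Synthesis filter 1/A(z) with A(z) = 1 - sum a_i z^-i, zero initial state."""
    out = []
    for n, e_n in enumerate(excitation):
        acc = e_n + sum(a_i * out[n - i]
                        for i, a_i in enumerate(a, start=1) if n - i >= 0)
        out.append(acc)
    return out

def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the codevector whose synthesized,
    gain-scaled output is closest to the target in squared error."""
    best = None
    for index, codevector in enumerate(codebook):
        synth = synthesize(codevector, a)
        energy = sum(v * v for v in synth)
        if energy == 0.0:
            continue                       # an all-zero vector cannot match anything
        gain = sum(t * v for t, v in zip(target, synth)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

# Hypothetical 4-sample codebook and first-order LP coefficient.
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]]
a = [0.5]
target = [2.0 * v for v in synthesize(codebook[2], a)]
print(search_codebook(target, codebook, a))
```

Only the winning index and gain need to be transmitted, together with the LP parameters, because the decoder holds the same codebook. The cost of synthesizing every candidate is the main reason such coders are computationally heavy.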

  • 8/8/2019 Session 08 9

    44/52

    CELP Decoder

    [Block diagram: the excitation parameters drive the excitation generator; its output passes through the filter G/A(z), set by the LP and gain parameters, to give the synthesized speech]

  • 8/8/2019 Session 08 9

    45/52

    Speech Quality

    Objective measures

    Segmental SNR

    Itakura-Saito distance measure

    Spectral distortion (SD)

    ITU-T P.862 Recommendation

    Subjective measures

    Mean opinion score (MOS)

    Diagnostic Rhyme Test (DRT)

    Diagnostic Acceptability Measure (DAM)

    Speech Quality Measurement
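Of the objective measures listed, segmental SNR is the simplest to compute: the SNR is evaluated frame by frame in dB and then averaged. The sketch below assumes non-overlapping frames of 160 samples (20 ms at 8 kHz) and skips frames with zero signal or zero noise energy. Practical implementations often also clamp each frame's SNR to a fixed range, which is not done here.

```python
import math

def segmental_snr_db(reference, decoded, frame_len=160):
    """Average of per-frame SNR values in dB over non-overlapping frames."""
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        dec = decoded[start:start + frame_len]
        signal_energy = sum(r * r for r in ref)
        noise_energy = sum((r - d) ** 2 for r, d in zip(ref, dec))
        if signal_energy > 0.0 and noise_energy > 0.0:
            frame_snrs.append(10.0 * math.log10(signal_energy / noise_energy))
    if not frame_snrs:
        return float("inf")     # no frame with measurable noise
    return sum(frame_snrs) / len(frame_snrs)

# Hypothetical signals: a constant reference and a decoded copy 10% lower.
reference = [1.0] * 320
decoded = [0.9] * 320
print(f"{segmental_snr_db(reference, decoded):.1f} dB")
```

Averaging in the dB domain gives quiet frames the same weight as loud ones, which tracks perceived quality better than a single SNR over the whole utterance.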


  • 8/8/2019 Session 08 9

    46/52

    Listening quality scale

    Excellent 5

    Good 4

    Fair 3

    Poor 2

    Bad 1

    Absolute Category Rating Tests (MOS)
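The mean opinion score is the average of the listeners' category ratings on the five-point scale above. A minimal sketch with hypothetical ratings:

```python
# Listening quality scale from the slide.
RATING_SCALE = {"Excellent": 5, "Good": 4, "Fair": 3, "Poor": 2, "Bad": 1}

def mean_opinion_score(ratings):
    """Average the numeric values of a list of category ratings."""
    return sum(RATING_SCALE[r] for r in ratings) / len(ratings)

# Hypothetical ratings from four listeners.
print(mean_opinion_score(["Good", "Fair", "Excellent", "Good"]))
```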


  • 8/8/2019 Session 08 9

    47/52

    Measures speech intelligibility

    Listeners are presented with one of two words which differ only in leading consonant

    Examples:

    Meet - Beat

    Than - Dan

    Met - Net

    Jest - Guest

    Diagnostic Rhyme Test


  • 8/8/2019 Session 08 9

    48/52

    Total possible pairs = 96

    Intelligibility score, S, is given by:

    $$S = 100 \times \frac{N(\text{correct}) - N(\text{incorrect})}{N(\text{test pairs})}$$

    Coder Rate (kbps) DRT MOS

    FS1016 4.8 91.7 3.3

    G.728 16 93.0 3.9

    Diagnostic Rhyme Test
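The intelligibility score subtracts incorrect responses from correct ones, so that pure guessing between the two words scores about zero. The sketch below assumes the formula S = 100 x (N(correct) - N(incorrect)) / N(test pairs) with 96 test pairs; the response counts are hypothetical.

```python
def drt_score(n_correct, n_incorrect, n_test_pairs=96):
    """Diagnostic Rhyme Test intelligibility score, corrected for guessing."""
    return 100.0 * (n_correct - n_incorrect) / n_test_pairs

print(drt_score(96, 0))    # every pair identified correctly
print(drt_score(48, 48))   # chance-level guessing
print(round(drt_score(92, 4), 1))
```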


  • 8/8/2019 Session 08 9

    49/52

    Part of ITU-T P.862 standard

    Objective is to mimic sound perception by persons in real life

    PESQ simulates experiments in which subjects judge speech quality

    Physical signals are mapped to psychophysical representations that match internal representations in the head

    Perceptual evaluation of speech quality (PESQ)


  • 8/8/2019 Session 08 9

    50/52

    Complexity

    Computational complexity

    Simplex/half-duplex/full-duplex real-time performance on a single DSP

    Fixed point vs. floating point

    CELP coders are computationally complex

    Memory requirement

    Storage of look-up tables, codebooks etc.

    Speech Coder Complexity Issues

  • 8/8/2019 Session 08 9

    51/52

  • 8/8/2019 Session 08 9

    52/52

    Thank You!