tts_overview.ppt

  • 7/31/2019 tts_overview.ppt

    1/20

    Text-To-Speech Synthesis

    An Overview

    2/20

    What Is a TTS System?

    Goal

    A system that can read any text

    Automatic production of new sentences

    Not just audio playback, as in simple voice response systems

    Definition

    The production of speech by machines, by way of the automatic phonetization of the sentences to utter

    3/20

    Text-To-Speech

    Text Processing

    Text Normalization

    Pronunciation

    Timing and Intonation

    Speech Generation

    Segmental Concatenation

    Waveform Synthesis

    4/20

    Functional Diagram

    [Figure: TTS synthesizer. Text -> Natural Language Processing (morphosyntactic analysis, letter-to-sound, prosody generation) -> narrow phonetic transcription (phones, prosody) -> Digital Signal Processing (mathematical models, algorithms, computations) -> Speech]

    5/20

    The Natural Language Processing Module

    [Figure: NLP module. Text -> morphosyntactic analyzer (preprocessor, morphological analyzer, contextual analyzer, syntactic-prosodic parser) -> letter-to-sound module and natural prosody generator -> phone names and prosody]

    6/20

    Text Preprocessing

    Challenges

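    Text segmentation (tokenization): "I know 1,000 words" -> (I) ( ) (know) ( ) (1) (,) (000) ( ) (words)

    Sentence-end detection: "Jones lives at the end of St. James St." (the first "St." is an abbreviation inside the sentence; the second both abbreviates "Street" and ends the sentence)

    Normalization: abbreviations

    Acronyms

    Numbers: 1.023,32 (decimal comma), 12/1/2002 (day/month order?), 13:23, 12.15 (time or decimal?)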
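    The sentence-end detection problem above can be illustrated with a minimal abbreviation-aware splitter. This is a sketch under stated assumptions: the abbreviation set is a hypothetical toy list, and the rule (a period after a known abbreviation ends the sentence only at the end of the text) is a simplification that still fails on inputs such as "I live on Main St. It is quiet."

```python
# Hypothetical toy abbreviation list; a real system would use a large lexicon.
ABBREVIATIONS = {"st.", "dr.", "mr.", "mrs.", "jan.", "feb."}


def split_sentences(text: str) -> list[str]:
    """Split text into sentences with a simple abbreviation-aware heuristic."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        is_last = i == len(tokens) - 1
        if tok.endswith(("!", "?")):
            ends = True
        elif tok.endswith("."):
            # A period after a known abbreviation is treated as part of the
            # abbreviation unless it is the final token of the text.
            ends = is_last or tok.lower() not in ABBREVIATIONS
        else:
            ends = False
        if ends or is_last:
            sentences.append(" ".join(current))
            current = []
    return sentences
```

    For example, "Jones lives at the end of St. James St." stays one sentence, while "Dr. Smith left. He was late." splits into two.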

    7/20

    Text Preprocessing

    Dealing with Non-Standard Words

    Tokenizer: breaks up single tokens that need splitting

    12:35AM -> 12 : 35 AM

    Classifier: determines the most likely class for a given token

    "January 1956" (a year) vs. "1956 potatoes" (a cardinal number)

    Expansion module: methods for expanding numbers and other classes that can be handled algorithmically
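    The three stages above (tokenizer, classifier, expansion module) can be sketched as a toy pipeline. Assumptions: the classifier is a single hypothetical rule (a 4-digit number after a month name is a year, anything else is a cardinal), and the expander only covers numbers below 10,000 (larger ones are read digit by digit). Real systems use trained classifiers and far richer expansion grammars.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Hypothetical context cue for the classifier: a preceding month name marks a year.
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}


def tokenize(raw: str) -> list[str]:
    """Tokenizer: split a whitespace token at digit/letter/punctuation boundaries."""
    return re.findall(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]", raw)


def classify(tokens: list[str], i: int) -> str:
    """Classifier: toy rule choosing YEAR vs CARDINAL for a digit token."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    if len(tokens[i]) == 4 and prev in MONTHS:
        return "YEAR"
    return "CARDINAL"


def two_digits(n: int) -> str:
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")


def cardinal(n: int) -> str:
    if n >= 10000:  # outside this sketch's range: read digit by digit
        return " ".join(ONES[int(d)] for d in str(n))
    parts = []
    if n >= 1000:
        parts.append(ONES[n // 1000] + " thousand")
        n %= 1000
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n or not parts:
        parts.append(two_digits(n))
    return " ".join(parts)


def year(n: int) -> str:
    """Read a 4-digit year as two pairs, e.g. 1956 -> nineteen fifty-six."""
    hi, lo = divmod(n, 100)
    if 2000 <= n < 2010:
        return cardinal(n)
    if lo == 0:
        return two_digits(hi) + " hundred"
    if lo < 10:
        return two_digits(hi) + " oh " + ONES[lo]
    return two_digits(hi) + " " + two_digits(lo)


def normalize(text: str) -> str:
    """Expansion module: tokenize, classify, and expand digit tokens."""
    tokens = [t for raw in text.split() for t in tokenize(raw)]
    out = []
    for i, tok in enumerate(tokens):
        if tok.isdigit():
            n = int(tok)
            out.append(year(n) if classify(tokens, i) == "YEAR" else cardinal(n))
        else:
            out.append(tok)
    return " ".join(out)
```

    With these rules, "January 1956" becomes "January nineteen fifty-six" and "1956 potatoes" becomes "one thousand nine hundred fifty-six potatoes"; the tokenizer splits "12:35AM" into "12", ":", "35", "AM".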

    8/20

    Text Preprocessing

    Dealing with Non-Standard Words

    Not all tokens can be handled with a deterministic set of rules

    Methods for designing domain-dependent expansion and tagging modules

    Supervised: work on a tagged text corpus

    Unsupervised: work on raw text

    $$p(t \mid o) = \frac{p(t)\, p(o \mid t)}{p(o)}$$

    Determines the probability of a tag t given the observed string o

    p(o): the probability of the observed text

    p(t): the prior probability of observing the tag t in the text

    p(o|t): a trigram letter language model for predicting observations of a particular tag t
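    The Bayes rule above can be sketched as a small tagger: one letter-trigram model per tag estimates p(o|t), the tag frequencies give p(t), and p(o) is dropped because it is the same for every tag, so the tag maximizing p(t) p(o|t) is chosen. Assumptions: add-one smoothing, "#" and "$" as hypothetical padding symbols, and the tag names and training strings in the usage note are illustrative only.

```python
import math
from collections import Counter


class LetterTrigramTagger:
    """Choose argmax_t log p(t) + log p(o|t); p(o) is constant across tags."""

    def __init__(self, training: dict[str, list[str]]):
        # training maps each tag to example token strings observed with that tag
        total = sum(len(v) for v in training.values())
        self.log_prior = {t: math.log(len(v) / total) for t, v in training.items()}
        self.trigrams, self.contexts, self.alphabet = {}, {}, set()
        for tag, examples in training.items():
            tri, ctx = Counter(), Counter()
            for ex in examples:
                padded = "##" + ex + "$"  # two start symbols, one end symbol
                self.alphabet.update(padded)
                for i in range(2, len(padded)):
                    tri[padded[i - 2:i + 1]] += 1
                    ctx[padded[i - 2:i]] += 1
            self.trigrams[tag], self.contexts[tag] = tri, ctx

    def log_likelihood(self, obs: str, tag: str) -> float:
        """log p(o|t): sum of add-one smoothed log p(c | two previous chars)."""
        padded = "##" + obs + "$"
        v = len(self.alphabet) + 1  # +1 reserves mass for unseen characters
        tri, ctx = self.trigrams[tag], self.contexts[tag]
        return sum(
            math.log((tri[padded[i - 2:i + 1]] + 1) / (ctx[padded[i - 2:i]] + v))
            for i in range(2, len(padded))
        )

    def classify(self, obs: str) -> str:
        return max(self.log_prior,
                   key=lambda t: self.log_prior[t] + self.log_likelihood(obs, t))
```

    Usage, with hypothetical tags for acronyms read as words vs. spelled out letter by letter: `LetterTrigramTagger({"word": ["nasa", "nato", "unesco", "radar", "laser"], "letters": ["fbi", "cnn", "bbc", "usa", "ibm", "ntt"]})` tags the unseen string "nbc" as "letters". With such a tiny corpus the margins are small; the point is the structure of the computation, not accuracy.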

    9/20

    Morphological Analysis

    Function words: determiners, pronouns, prepositions, conjunctions

    Form the skeleton of the sentence

    Stored in the lexicon, along with their pronunciation

    Content words: inflection + compounding

    Used for pronunciation and stress assignment
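    The split between function and content words above can be sketched as follows: function words are looked up directly (pronunciation included), while content words are decomposed by stripping an inflectional suffix and then trying to split the stem into two known stems (compounding). The lexicon, ARPAbet pronunciations, stem list, and suffix list are a hypothetical toy inventory, not a real morphological analyzer.

```python
# Hypothetical toy lexicon: function words stored with pronunciations (ARPAbet).
FUNCTION_WORDS = {"the": "DH AH0", "a": "AH0", "of": "AH1 V",
                  "in": "IH0 N", "and": "AH0 N D", "he": "HH IY1"}
# Hypothetical stems and inflectional suffixes (longest suffix tried first).
STEMS = {"black", "board", "foot", "ball", "walk", "cat"}
SUFFIXES = ("ing", "ed", "es", "s")


def split_compound(stem: str) -> list[str]:
    """Split a stem into two known stems if possible (compounding)."""
    for i in range(1, len(stem)):
        if stem[:i] in STEMS and stem[i:] in STEMS:
            return [stem[:i], stem[i:]]
    return [stem]


def analyze(word: str) -> dict:
    w = word.lower()
    if w in FUNCTION_WORDS:
        # Function words: direct lexicon lookup, pronunciation included.
        return {"class": "function", "pron": FUNCTION_WORDS[w]}
    for suffix in SUFFIXES:
        # Require a stem of at least 3 letters so "bus" is not read as "bu" + "s".
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return {"class": "content",
                    "parts": split_compound(w[:-len(suffix)]), "suffix": suffix}
    return {"class": "content", "parts": split_compound(w), "suffix": ""}
```

    For instance, "blackboards" is analyzed as the compound "black" + "board" with the inflection "s", which a later stage could use to place the stress on the first element.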

    10/20

    Synthesis

    Input

    Sequence of phonemes

    Prosodic Information

    Output

    Digital Speech

    11/20

    Synthesis Strategies

    Synthesis by rule: a cognitive approach that models the phonation mechanism

    Speech is produced by mathematical rules that formally describe the influence of phonemes on one another

    Synthesis by concatenation: requires only limited knowledge of the data to be handled

    Elementary speech units are stored in a database, then concatenated and processed to produce the speech signal

    12/20

    Synthesis by Rule

    Functional Diagram

    [Figure: Speech science (offline, signal processing): speech corpus -> speech analysis -> parametric speech corpus -> rule finding -> rule database. DSP module (online): phone names + prosody -> rule matching -> signal synthesis -> speech]

    13/20

    Synthesis by Rule

    Analysis and Synthesis

    Preparation: words are read by a professional speaker

    Data parameterization through a speech analyzer

    Rule extraction (manual)

    Trial-and-error optimization

    Synthesis: rules are matched to the phonetic input

    Production of a parametric signal

    Synthesis of the speech signal by re-implementing the analysis model

    14/20

    Synthesis by Rule

    Segmental Quality

    Rule efficiency

    Corpus quality: choice of utterances and recording quality

    Intrinsic errors: accuracy of the model in describing high-quality speech

    Even simple analysis-resynthesis may produce problems!

    Extrinsic errors: the parameter extraction algorithm

    Reduced during trial-and-error tuning

    15/20

    Synthesis by Rule

    Formant Synthesizers

    + Speech is a dynamic evolution of up to 60 parameters

    Formant and antiformant frequencies and bandwidths

    Glottal waveforms

    + Almost free of modeling errors

    - Parameters are difficult to estimate

    - Time consuming

    - Intensive trial-and-error testing is needed to cope with extrinsic errors

    - Signal buzziness, low signal quality

    High-quality synthesis rules are yet to be discovered
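    To make the formant parameters concrete, here is a minimal cascade formant synthesizer: an excitation signal is passed through one second-order resonator per formant, each set by a center frequency and a bandwidth. Assumptions: the resonator uses the Klatt-style coefficients (a = 1 - b - c, giving unity gain at 0 Hz), a plain impulse train stands in for a real glottal waveform, antiformants are omitted, and the formant and bandwidth values in the usage note are rough illustrative values for an /a/-like vowel.

```python
import math


def resonator(x: list[float], freq: float, bw: float, fs: int) -> list[float]:
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs
    c = -math.exp(-2 * math.pi * bw * t)
    b = 2 * math.exp(-math.pi * bw * t) * math.cos(2 * math.pi * freq * t)
    a = 1 - b - c  # normalizes the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0: float, dur: float, fs: int) -> list[float]:
    """Crude stand-in for the glottal source: one unit pulse per pitch period."""
    n = int(round(dur * fs))
    period = int(round(fs / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synth_vowel(formants: list[tuple[float, float]], f0: float = 120.0,
                dur: float = 0.3, fs: int = 16000) -> list[float]:
    """Cascade synthesis: filter the source through each (frequency, bandwidth)."""
    signal = impulse_train(f0, dur, fs)
    for freq, bw in formants:
        signal = resonator(signal, freq, bw, fs)
    return signal
```

    For example, `synth_vowel([(730, 90), (1090, 110), (2440, 170)])` yields 0.3 s of a buzzy /a/-like vowel at 16 kHz. The buzziness noted above is easy to hear with such a simple source, and every one of these numbers would have to be set by rule over time for running speech.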

    16/20

    Synthesis by Concatenation

    Functional Diagram

    [Figure: Speech science (offline): speech corpus -> selective segmentation -> speech segment DB -> speech analysis -> parametric segment DB -> equalization -> speech coding -> synthesis segment DB. DSP module (online): phone names + prosody -> segment list generation (segment info) -> speech decoding -> prosody matching (signal processing) -> concatenation -> signal synthesis -> speech]

    17/20

    Synthesis by Concatenation

    Analysis: Database Preparation

    Choose the appropriate speech units

    Diphones, half-syllables, and triphones

    Compile and record utterances

    Segment the signal and extract speech units

    Store segment waveforms (along with context) and extended information in the database

    Extract parameters and create a parametric segment database

    Useful for data compaction; easier prosody matching and modification

    Perform amplitude equalization to prevent mismatches
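    Two of the preparation steps above can be sketched directly: cutting diphones out of a labeled recording and equalizing their amplitude. A diphone runs from the middle of one phone to the middle of the next, so the hard-to-model transition sits inside the unit and joins fall in the steadier phone centers. Assumptions: phone boundaries are already known (hypothetical `Phone` records), each diphone name keeps only its last occurrence, and equalization is simplified to scaling every unit to one common RMS level.

```python
import math
from dataclasses import dataclass


@dataclass
class Phone:
    label: str
    start: int  # first sample of the phone
    end: int    # one past the last sample


def extract_diphones(signal: list[float], phones: list[Phone]) -> dict[str, list[float]]:
    """Cut each diphone from the middle of one phone to the middle of the next."""
    units = {}
    for left, right in zip(phones, phones[1:]):
        a = (left.start + left.end) // 2
        b = (right.start + right.end) // 2
        # A later occurrence of the same diphone overwrites an earlier one here.
        units[f"{left.label}-{right.label}"] = signal[a:b]
    return units


def rms(x: list[float]) -> float:
    return math.sqrt(sum(v * v for v in x) / len(x)) if x else 0.0


def equalize(units: dict[str, list[float]], target_rms: float) -> dict[str, list[float]]:
    """Scale every unit to a common RMS level (a simplified amplitude equalization)."""
    out = {}
    for name, seg in units.items():
        level = rms(seg)
        out[name] = [v * target_rms / level for v in seg] if level > 0 else list(seg)
    return out
```

    A production system would store several instances per diphone together with their context, and would typically equalize energy per phoneme near the unit boundaries rather than over whole units.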

    18/20

    Synthesis by Concatenation

    Unit Database Issues

    Very large combinatorial space of phoneme combinations and prosodic contexts

    In English: 43 phones give 43^3 = 79,507 possible triphones, of which only 70,000 are used

    Which of them should we keep?

    Unit selection vs. concatenative synthesis

    We record a large speech corpus

    In unit selection, the corpus is segmented into phonetic units, indexed, and used as is; unit selection is made online

    In concatenative synthesis, the selection is made offline and manually!
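    Online unit selection is commonly formulated as a shortest-path search: each target position has candidate units from the corpus, a target cost measures how well a candidate matches the desired phone and prosody, a join cost measures how smoothly two candidates connect, and dynamic programming (Viterbi) finds the cheapest sequence. This is a sketch under assumptions: the `Unit` fields, the pitch-only cost functions, and their weights are hypothetical, and every target phone is assumed to have at least one candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str
    f0: float   # mean pitch of the recorded unit (Hz)
    index: int  # position in the recorded corpus


def target_cost(unit: Unit, target_f0: float) -> float:
    return abs(unit.f0 - target_f0) / 10.0


def join_cost(prev: Unit, cur: Unit) -> float:
    # Units that were adjacent in the recording join for free.
    if cur.index == prev.index + 1:
        return 0.0
    return 1.0 + abs(cur.f0 - prev.f0) / 10.0


def select_units(targets: list[tuple[str, float]], corpus: list[Unit]) -> list[Unit]:
    """Viterbi search over candidates; targets are (phone, desired f0) pairs."""
    candidates = [[u for u in corpus if u.phone == p] for p, _ in targets]
    # best[i][j] = (cheapest cost of a path ending at candidates[i][j], backpointer)
    best = [[(target_cost(u, targets[0][1]), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min((best[i - 1][k][0] + join_cost(prev, u), k)
                             for k, prev in enumerate(candidates[i - 1]))
            row.append((cost + target_cost(u, targets[i][1]), back))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]
```

    With a corpus containing "h" at 100 Hz (index 0), "e" at 110 Hz (index 1), "h" at 130 Hz (index 2), and "e" at 200 Hz (index 3), the targets ("h", 120) and ("e", 120) select indices 0 and 1: the free join between contiguous units outweighs the better pitch match of the second "h".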

    19/20

    Concatenating Segments

    The PSOLA Method

    Pitch Synchronous Overlap and Add

    A window (two pitch periods long) is multiplied with the signal

    The signal is broken into a set of localized signals (non-zero only within the window intervals)

    Pitch modification: relative shifting of the localized signals

    Their spacing reflects the pitch period

    Good results for modification factors in [0.6, 1.5]

    Duration modification: localized signals are added to or deleted from the output
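    The PSOLA steps above can be sketched in the time domain (TD-PSOLA): cut two-period Hann-windowed grains centered on pitch marks, then overlap-add them at a new spacing. Pitch changes come from the new spacing; duration changes come from repeating or skipping grains. Assumptions: the input is voiced with a constant, known period (pitch marks at multiples of `period`); real systems need a pitch-marking step and treat unvoiced regions separately.

```python
import math


def hann(n: int) -> list[float]:
    """Symmetric Hann window of length n."""
    if n < 2:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def psola(signal: list[float], period: int, pitch_factor: float = 1.0,
          duration_factor: float = 1.0) -> list[float]:
    """Constant-pitch TD-PSOLA sketch; period in samples."""
    out_len = int(len(signal) * duration_factor)
    out = [0.0] * out_len
    marks = list(range(period, len(signal) - period, period))
    if not marks:
        return out
    win = hann(2 * period + 1)
    # Analysis: two-period windowed localized signals centered on each mark.
    grains = [[signal[m - period + i] * win[i] for i in range(2 * period + 1)]
              for m in marks]
    new_period = max(1, round(period / pitch_factor))
    t = period  # first synthesis mark
    while t < out_len - period:
        # Duration: take the grain nearest the time-scaled position, so grains
        # get repeated (longer output) or skipped (shorter output).
        k = min(len(grains) - 1, max(0, round((t / duration_factor - period) / period)))
        # Pitch: overlap-add the grain at the new mark spacing.
        for i, v in enumerate(grains[k]):
            out[t - period + i] += v
        t += new_period
    return out
```

    With both factors at 1.0, the Hann windows spaced one period apart sum to one, so the interior of the signal is reconstructed exactly; a `pitch_factor` of 1.25 shortens the spacing to 80 samples for a 100-sample period and raises the pitch, in line with the [0.6, 1.5] range quoted above.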

    20/20

    Concatenative and Rule-Based Synthesis Comparison

    Concatenative synthesis is the state of the art

    Storage is of little concern now

    Storing the segment database is no longer an issue

    Advances in ensuring smoothness at the concatenation points

    Rule-based synthesis output used to be smoother

    Certain sounds are too hard to produce by rule

    Vowels are easy to create by rule

    Bursts and voiceless stops are too difficult; we do not fully understand their production mechanisms