tts_overview.ppt

7/31/2019 tts_overview.ppt

1/20

Text-To-Speech Synthesis

An Overview


2/20

What is a TTS System

Goal

A system that can read any text

Automatic production of new sentences

Not just audio playback, as in simple voice response systems

Definition

The production of speech by machines, by way of the automatic phonetization of the sentences to utter


3/20

Text-To-Speech

Text Processing

Text Normalization

Pronunciation

Timing and Intonation

Speech Generation

Segmental Concatenation

Waveform Synthesis


4/20

Functional Diagram

[Diagram: TTS Synthesizer]

Text -> Natural Language Processing (Morphosyntactic Analysis, Letter-to-Sound, Prosody Generation) -> Narrow Phonetic Transcription (Phones, Prosody) -> Digital Signal Processing (Mathematical Models, Algorithms, Computations) -> Speech


5/20

The Natural Language Processing Module

[Diagram: NLP Module]

Text -> Morphosyntactic Analyzer (Preprocessor, Morphological Analyzer, Contextual Analyzer, Syntactic and Prosodic Parser) -> Letter-to-Sound Module -> Natural Prosody Generator -> Phone Names, Prosody


6/20

Text Preprocessing

Challenges

Text Segmentation

Tokenization: (i) () (know) ( ) (1) (,) (000) ( ) (words)

Sentence End Detection: Jones lives at the end of St. James St.

Normalization

Abbreviations

Acronyms

Numbers: 1.023,32 12/1/2002 13:23 12.15
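The sentence end detection problem above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ABBREVIATIONS` list and the `split_sentences` helper are hypothetical, and the heuristic (a period closes a sentence unless the token is a known abbreviation) is far simpler than what a real front end would use.

```python
# Hypothetical, deliberately tiny abbreviation list (not from the slides).
ABBREVIATIONS = {"st.", "dr.", "mr.", "mrs."}


def split_sentences(text):
    """Split on ., ! or ? unless the token is a known abbreviation."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        is_last = i == len(tokens) - 1
        ends_with_stop = token.endswith((".", "!", "?"))
        is_abbreviation = token.lower() in ABBREVIATIONS
        # The final token always closes the last sentence.
        if is_last or (ends_with_stop and not is_abbreviation):
            sentences.append(" ".join(current))
            current = []
    return sentences


result = split_sentences("It rains. Jones lives at the end of St. James St.")
# result == ["It rains.", "Jones lives at the end of St. James St."]
```

The heuristic still fails when an abbreviation really does end a sentence in the middle of a text, which is why the slide lists this as a challenge.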


7/20

Text Preprocessing

Dealing with Non-Standard Words

Tokenizer

Breaks up single tokens that need splitting

12:35AM -> 12 : 35 AM

Classifier

Determines the most likely class for a given token

January 1956 (a year) vs. 1956 potatoes (a quantity)

Expansion Module

Methods for expanding numbers and classes that can be handled algorithmically
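The three stages above (tokenizer, classifier, expansion module) can be sketched as follows. All names (`split_token`, `classify_number`, `expand`) are hypothetical, the classifier is a single hand-written rule rather than a trained model, and the expansion only covers years such as 1956 and cardinals below 100,000.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}


def split_token(token):
    """Tokenizer: break a token into digit runs, letter runs, and symbols."""
    return re.findall(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]", token)


def classify_number(tokens, i):
    """Classifier (toy rule): a 4-digit number after a month name is a year."""
    previous = tokens[i - 1].lower() if i > 0 else ""
    if len(tokens[i]) == 4 and previous in MONTHS:
        return "year"
    return "cardinal"


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def expand(number, tag):
    """Expansion module: spell out a number according to its class."""
    n = int(number)
    if tag == "year":
        # Read as two pairs; years like 2005 would need extra rules.
        high, low = divmod(n, 100)
        return two_digits(high) + " " + (two_digits(low) if low else "hundred")
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(two_digits(thousands) + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


split_token("12:35AM")       # ['12', ':', '35', 'AM']
expand("1956", "year")       # 'nineteen fifty-six'
expand("1956", "cardinal")   # 'one thousand nine hundred fifty-six'
```

Keeping classification separate from expansion means the same digit string can be read in two ways depending on its context.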


8/20

Text Preprocessing

Dealing with Non-Standard Words

Not all tokens can be handled with a deterministic set of rules

Methods for designing domain-dependent expansion and tagging modules

Supervised: work on tagged text corpus

Unsupervised: work on raw text

$$ p(t|o) = \frac{p(o|t)\, p(t)}{p(o)} $$

Determines the probability of a tag t given the observed string o

p(o): the probability of the observed text

p(t): the prior probability of observing the tag t in the text

p(o|t): a trigram letter language model for predicting observations of a particular tag t
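A minimal sketch of this Bayesian tagger is shown below. It assumes add-one smoothing, start and end padding symbols, and a toy training set, none of which are specified in the slides; the class name `TagClassifier` is hypothetical. Since p(o) is the same for every tag, the sketch drops it and compares log p(t) + log p(o|t).

```python
import math
from collections import Counter, defaultdict


class TagClassifier:
    """Pick argmax over t of p(t) * p(o|t), with a letter trigram model per tag."""

    def __init__(self):
        self.trigrams = defaultdict(Counter)  # tag -> trigram counts
        self.contexts = defaultdict(Counter)  # tag -> two-letter context counts
        self.tag_counts = Counter()
        self.alphabet = set()

    @staticmethod
    def _pad(text):
        # '^' and '$' are assumed start and end markers.
        return "^^" + text + "$"

    def train(self, examples):
        for text, tag in examples:
            self.tag_counts[tag] += 1
            padded = self._pad(text)
            self.alphabet.update(padded)
            for i in range(2, len(padded)):
                self.trigrams[tag][padded[i - 2:i + 1]] += 1
                self.contexts[tag][padded[i - 2:i]] += 1

    def log_likelihood(self, text, tag):
        """Approximate log p(o|t) with add-one smoothing."""
        padded = self._pad(text)
        vocabulary = len(self.alphabet)
        total = 0.0
        for i in range(2, len(padded)):
            seen = self.trigrams[tag][padded[i - 2:i + 1]] + 1
            context = self.contexts[tag][padded[i - 2:i]] + vocabulary
            total += math.log(seen / context)
        return total

    def classify(self, text):
        n = sum(self.tag_counts.values())
        scores = {
            tag: math.log(count / n) + self.log_likelihood(text, tag)
            for tag, count in self.tag_counts.items()
        }
        return max(scores, key=scores.get)


classifier = TagClassifier()
classifier.train([
    ("1956", "number"), ("2002", "number"), ("1023", "number"),
    ("NATO", "acronym"), ("NASA", "acronym"), ("UNESCO", "acronym"),
])
label = classifier.classify("1999")  # 'number' on this toy data
```

This corresponds to the supervised setting on the slide: the per-tag counts come from a tagged corpus.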


9/20

Morphological Analysis

Function Words

Determiners, Pronouns, Prepositions, Conjunctions

Skeleton of sentence

Stored in lexicon, along with pronunciation

Content Words

Inflection + Compounding

Used for pronunciation and stressing
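The split between function words and content words can be sketched as a lexicon lookup followed by suffix stripping. Everything here is an assumption made for illustration: the mini-lexicon, its phone strings, the stem list, and the `analyze` helper are hypothetical, and only inflection is handled (compounding is left out).

```python
# Hypothetical mini-lexicon: function word -> (category, pronunciation).
FUNCTION_WORDS = {
    "the": ("determiner", "dh ax"),
    "of": ("preposition", "ah v"),
    "and": ("conjunction", "ae n d"),
    "she": ("pronoun", "sh iy"),
}

# Hypothetical known stems and inflectional suffixes for content words.
STEMS = {"walk", "talk", "dog", "play"}
SUFFIXES = [("ing", "present participle"), ("ed", "past"), ("s", "plural or 3rd person")]


def analyze(word):
    """Look up function words directly; decompose content words into stem + suffix."""
    w = word.lower()
    if w in FUNCTION_WORDS:
        category, pronunciation = FUNCTION_WORDS[w]
        return {"type": "function", "category": category, "pronunciation": pronunciation}
    for suffix, label in SUFFIXES:
        stem = w[:-len(suffix)]
        if w.endswith(suffix) and stem in STEMS:
            return {"type": "content", "stem": stem, "inflection": label}
    # Unknown or uninflected content word: treat the whole word as the stem.
    return {"type": "content", "stem": w, "inflection": None}


analyze("walked")  # {'type': 'content', 'stem': 'walk', 'inflection': 'past'}
```

Function words form a small closed class, so storing them whole is cheap; content words are open-ended, which is why they are decomposed instead.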


10/20

Synthesis

Input

Sequence of phonemes

Prosodic Information

Output

Digital Speech


11/20

Synthesis Strategies

Synthesis by Rule

Cognitive approach of the phonation mechanism

Speech is produced by mathematical rules that formally describe the influence of phonemes on one another

Synthesis by Concatenation

Limited knowledge of the data to be handled

Elementary speech units are stored in a database and then concatenated and processed to produce the speech signal


12/20

Synthesis by Rule

Functional Diagram

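[Diagram: rule-based synthesizer, with blocks labeled DSP Module, Speech Science, and Signal Processing]

Preparation: Speech Corpus -> Speech Analysis -> Parametric Speech Corpus -> Rule Finding -> Rule Database

Synthesis: Phone Names, Prosody -> Rule Matching (using the Rule Database) -> Signal Synthesis -> Speech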


13/20

Synthesis by Rule

Analysis and Synthesis

Preparation

Words are read by a professional speaker

Data parameterization through speech analyzer

Rule extraction (manual)

Trial and error optimization

Synthesis

Rules are matched to phonetic input

Production of parametric signal

Synthesis of speech signal by re-implementing analysis model


14/20

Synthesis by Rule

Segmental Quality

Rule Efficiency

Corpus Quality

Choice of utterances and recording quality

Intrinsic Errors: Accuracy of model describing high-quality speech

Even simple analysis-resynthesis may produce problems!

Extrinsic Errors: Parameter extraction algorithm

Improvements during Trial-Error tuning


15/20

Synthesis by Rule

Formant Synthesizers

+ Speech is a dynamic evolution of up to 60 parameters

Formant, antiformant frequencies and bandwidths

Glottal waveforms

+ Almost free of modeling errors

Difficult to estimate

Time consuming

Intensive trial-error testing to cope with extrinsic errors

Signal buzziness, low signal quality

High-quality synthesis rules are yet to be discovered
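To make the idea of formant parameters concrete, here is a minimal sketch of a cascade formant synthesizer in pure Python. It assumes a plain impulse train in place of a glottal waveform, three fixed formants with rough illustrative frequencies and bandwidths, and Klatt-style second-order resonators; the function names are hypothetical. A real rule-based system would vary dozens of such parameters over time.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of a second-order digital resonator with unity gain at DC."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a signal through one formant: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate)
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voiced source: one unit impulse per pitch period."""
    period = int(round(sample_rate / f0_hz))
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0_hz=100.0, duration_s=0.2, sample_rate=8000):
    """Pass the source through the formant resonators in cascade."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Rough, illustrative (frequency, bandwidth) pairs in Hz for an open vowel.
samples = synthesize_vowel([(700, 130), (1220, 70), (2600, 160)])
```

The impulse-train source is one reason such output sounds buzzy, which matches the drawback listed above.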


16/20

Synthesis by Concatenation

Functional Diagram

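[Diagram: concatenative synthesizer, with blocks labeled DSP Module, Speech Science, and Signal Processing]

Database preparation: Speech Corpus -> Selective Segmentation -> Speech Segment DB -> Speech Analysis -> Parametric Segment DB -> Equalization -> Speech Coding -> Synthesis Segment DB

Synthesis: Phone Names, Prosody -> Segment List Generation (Segment Info) -> Prosody Matching -> Concatenation -> Signal Synthesis -> Speech, with Speech Decoding reading from the Synthesis Segment DB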


17/20


Analysis Database Preparation

Choose the appropriate speech units

Diphones, Half-Syllables and Triphones

Compile and record utterances

Segment signal and extract speech units

Store segment waveforms (along with context) and extended information in database

Extract parameters and create parametric segment database

Useful for data compaction

Easier prosody matching and modification

Perform amplitude equalization to prevent mismatches
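The equalization step can be illustrated with a very simple sketch that scales every segment to the same RMS level. This is an assumption about one possible form of equalization, not the method used in the slides; practical systems typically equalize more selectively, for example around segment boundaries. The helper names are hypothetical.

```python
import math


def rms(segment):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in segment) / len(segment))


def equalize(segments):
    """Scale each segment so that all share the mean RMS level of the set."""
    levels = [rms(segment) for segment in segments]
    target = sum(levels) / len(levels)
    equalized = []
    for segment, level in zip(segments, levels):
        gain = target / level if level > 0 else 1.0  # leave silent segments alone
        equalized.append([x * gain for x in segment])
    return equalized


loud = [0.5, -0.5, 0.5, -0.5]
quiet = [0.1, -0.1, 0.1, -0.1]
balanced = equalize([loud, quiet])  # both segments now have an RMS of about 0.3
```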


18/20


Unit Database Issues

Very large combinatorial space of combinations of phonemes and prosodic contexts

In English: 43 phones, 79,507 possible triphones, only 70,000 used

Which of them should we keep?
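The triphone count follows directly from the phone inventory: 43 × 43 × 43 = 79,507. The sketch below reproduces that arithmetic and shows how a phone sequence maps onto diphone or triphone units. The helper names and the `_` boundary symbol are hypothetical, and the phone string is only an example.

```python
def inventory_size(num_phones, unit_length):
    """Number of possible units made of `unit_length` consecutive phones."""
    return num_phones ** unit_length


def to_units(phones, unit_length, boundary="_"):
    """Slide a window over the phone sequence, padded with a boundary symbol."""
    padded = [boundary] + list(phones) + [boundary]
    return [tuple(padded[i:i + unit_length])
            for i in range(len(padded) - unit_length + 1)]


inventory_size(43, 2)  # 1849 possible diphones
inventory_size(43, 3)  # 79507 possible triphones
to_units(["h", "ax", "l", "ow"], 2)
# [('_', 'h'), ('h', 'ax'), ('ax', 'l'), ('l', 'ow'), ('ow', '_')]
```

Multiplying these counts by the number of prosodic contexts is what makes the space too large to record exhaustively.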

Unit Selection vs Concatenative Synthesis

We record a large speech corpus

In unit selection, the corpus is segmented into phonetic units, indexed, and used as-is

Unit selection is made on-line

In concatenative synthesis, the selection is made off-line and manually!
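On-line selection is commonly framed as a search for the unit sequence with the lowest total cost. The sketch below assumes a target cost (how well a unit matches the requested pitch) and a join cost (how well two consecutive units fit together) and minimizes their sum by dynamic programming. The slides do not specify these costs, so the cost functions, the unit representation `(phone, pitch_hz, corpus_position)`, and the name `select_units` are all hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return the cheapest unit sequence and its total cost (Viterbi-style search)."""
    # best[i][j] = (cost of best path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], unit), None) for unit in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(previous, unit), k)
                for k, previous in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], unit), back))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


def pitch_target_cost(target, unit):
    # Penalize the distance between requested and recorded pitch.
    return abs(unit[1] - target[1]) / 10.0


def contiguity_join_cost(previous, unit):
    # Units that were adjacent in the corpus join for free.
    return 0.0 if unit[2] == previous[2] + 1 else 2.0


targets = [("a", 100), ("b", 100)]
candidates = [
    [("a", 100, 0), ("a", 110, 5)],
    [("b", 130, 1), ("b", 100, 9)],
]
path, total = select_units(targets, candidates, pitch_target_cost, contiguity_join_cost)
# path == [("a", 100, 0), ("b", 100, 9)], total == 2.0
```

In this toy case the contiguous pair is rejected because its pitch mismatch costs more than the join penalty.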


19/20

Concatenating Segments

The PSOLA Method

Pitch Synchronous Overlap and Add

A window (2 pitch periods long) is multiplied with the signal

The signal is broken into a set of localized signals (non-zero only at the window intervals)

Pitch Modification

Relative shifting of localized signals

Spacing reflects pitch duration

Good results for modification factors in [0.6, 1.5]

Duration Modification

Localized signals are added or deleted from output
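The steps above can be sketched as a simplified time-domain PSOLA routine. It assumes a constant, known pitch period (real systems track pitch marks from the recording), uses a periodic Hann window two periods long, and keeps the duration fixed by reusing or skipping localized signals as needed. The function name `psola_pitch_shift` is hypothetical.

```python
import math


def psola_pitch_shift(signal, period, factor):
    """Scale pitch by `factor`, assuming a constant pitch period in samples."""
    # Periodic Hann window, two pitch periods long; shifted copies sum to 1.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (2 * period))
              for i in range(2 * period)]
    # Analysis pitch marks: centers of the localized signals.
    marks = list(range(period, len(signal) - period + 1, period))
    new_period = int(round(period / factor))

    out = [0.0] * len(signal)
    t = period  # first synthesis pitch mark
    while t + period <= len(signal):
        # Reuse the localized signal whose analysis mark is closest in time.
        m = min(marks, key=lambda mark: abs(mark - t))
        for i in range(2 * period):
            out[t - period + i] += window[i] * signal[m - period + i]
        t += new_period  # the new spacing sets the new pitch period
    return out


# A 100 Hz sine at 8 kHz has a period of 80 samples.
tone = [math.sin(2.0 * math.pi * n / 80) for n in range(800)]
same = psola_pitch_shift(tone, 80, 1.0)     # factor 1 reproduces the interior
higher = psola_pitch_shift(tone, 80, 1.25)  # output repeats every 64 samples
```

Because each synthesis mark picks the nearest analysis segment, raising the pitch repeats some localized signals and lowering it drops some, which is the same add-or-delete mechanism used for duration changes.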


20/20

Concatenative and Rule-Based Synthesis: Comparison

Concatenative Synthesis is the state-of-the-art

Storage is of little concern now

Storing the segment database is no longer an issue

Advances in ensuring smoothness in concatenations

Rule-based synthesis output used to be smoother

Certain sounds are too hard to be produced by rule

Vowels are easy to create by rule

Bursts and voiceless stops are too difficult; we do not fully understand their production mechanisms
