parameshwari-ramdass
7/31/2019 tts_overview.ppt
Text-To-Speech Synthesis
An Overview
What is a TTS System
Goal
A system that can read any text
Automatic production of new sentences
Not just audio playback, as in simple voice response systems
Definition
The production of speech by machines, by way of the automatic phonetization of the sentences to utter
Text-To-Speech
Text Processing
Text Normalization
Pronunciation
Timing and Intonation
Speech Generation
Segmental Concatenation
Waveform Synthesis
Functional Diagram
[Figure: TTS synthesizer. Text enters the Natural Language Processing block (morphosyntactic analysis, letter-to-sound, prosody generation), which passes a narrow phonetic transcription (phones and prosody) to the Digital Signal Processing block (mathematical models, algorithms, computations), which outputs speech.]
The Natural Language Processing Module
[Figure: NLP module. Blocks shown: Morphosyntactic Analyzer, Preprocessor, Morphological Analyzer, Contextual Analyzer, Syntactic and Prosodic Parser, Letter-to-Sound Module, Natural Prosody Generator. Input: text. Output: phone names and prosody.]
Text Preprocessing
Challenges
Text Segmentation
  Tokenization: (i) () (know) ( ) (1) (,) (000) ( ) (words)
  Sentence End Detection: Jones lives at the end of St. James St.
Normalization
  Abbreviations
  Acronyms
  Numbers: 1.023,32 12/1/2002 13:23 12.15
Text Preprocessing
Dealing with Non-Standard Words
Tokenizer
  Breaks up single tokens that need splitting
  12:35AM -> 12 : 35 AM
Classifier
  Determines the most likely class for a given token
  "January 1956" (a year) vs. "1956 potatoes" (a quantity)
Expansion Module
  Methods for expanding numbers and classes that can be handled algorithmically
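The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by any particular TTS system: the function names (`split_token`, `classify_number`, `expand_year`, `expand_cardinal`, `normalize`) are hypothetical, the classifier is a single hand-written rule (a four-digit number after a month name is a year), and the expander only covers integers below 10,000 and years from 1100 to 1999.

```python
import re

MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def split_token(token):
    """Tokenizer: split a token into digit runs, letter runs and punctuation."""
    return re.findall(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]", token)


def classify_number(tokens, i):
    """Classifier (toy rule): a 4-digit number after a month name is a year."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    if len(tokens[i]) == 4 and prev in MONTHS:
        return "year"
    return "cardinal"


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else "-" + ONES[ones])


def expand_cardinal(n):
    """Expansion: spell out an integer in the range 0..9999 as a quantity."""
    parts = []
    if n >= 1000:
        parts.append(ONES[n // 1000] + " thousand")
        n %= 1000
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n or not parts:
        parts.append(two_digits(n))
    return " ".join(parts)


def expand_year(n):
    """Expansion: read a year (assumed 1100..1999) as two pairs of digits."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def normalize(text):
    """Expand short digit tokens in whitespace-separated text."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"[0-9]{1,4}", tok):
            if classify_number(tokens, i) == "year":
                out.append(expand_year(int(tok)))
            else:
                out.append(expand_cardinal(int(tok)))
        else:
            out.append(tok)
    return " ".join(out)
```

With these definitions, `split_token("12:35AM")` yields the four tokens from the slide, and `normalize` reads the same digits differently in "January 1956" and "1956 potatoes".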
Text Preprocessing
Dealing with Non-Standard Words
Not all tokens can be handled with a deterministic set of rules
Methods for designing domain-dependent expansion and tagging modules
Supervised: work on tagged text corpus
Unsupervised: work on raw text
$$p(t \mid o) = \frac{p(o \mid t)\, p(t)}{p(o)}$$
Determines the probability of a tag t given the observed string o
p(o): the probability of the observed text
p(t): the prior probability of observing the tag t in the text
p(o|t): a trigram letter language model for predicting observations of a particular tag t
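As a rough sketch of how these three quantities combine, the Python below trains one letter-trigram model per tag and applies Bayes' rule to a new token. Everything specific here is an assumption made for illustration: the class and function names, the tiny hand-made training set, the tag labels, the add-one smoothing, and the alphabet size of 128.

```python
import math
from collections import Counter, defaultdict


class LetterTrigramModel:
    """Letter trigram model p(o|t) with add-one smoothing (an assumption)."""

    def __init__(self, examples, alphabet_size=128):
        self.trigrams = Counter()
        self.contexts = Counter()
        self.alphabet_size = alphabet_size
        for s in examples:
            padded = "^^" + s + "$"  # start padding and end marker
            for i in range(2, len(padded)):
                self.trigrams[padded[i - 2:i + 1]] += 1
                self.contexts[padded[i - 2:i]] += 1

    def log_prob(self, s):
        """Return log p(o|t) for the string s."""
        padded = "^^" + s + "$"
        total = 0.0
        for i in range(2, len(padded)):
            num = self.trigrams[padded[i - 2:i + 1]] + 1
            den = self.contexts[padded[i - 2:i]] + self.alphabet_size
            total += math.log(num / den)
        return total


def train(tagged_tokens):
    """Estimate the priors p(t) and one trigram model per tag."""
    by_tag = defaultdict(list)
    for token, tag in tagged_tokens:
        by_tag[tag].append(token)
    priors = {t: len(v) / len(tagged_tokens) for t, v in by_tag.items()}
    models = {t: LetterTrigramModel(v) for t, v in by_tag.items()}
    return priors, models


def posterior(token, priors, models):
    """Return p(t|o) for every tag via Bayes' rule, computed in log space."""
    joint = {t: math.log(priors[t]) + models[t].log_prob(token) for t in priors}
    peak = max(joint.values())
    # log p(o) = log sum_t p(o|t) p(t), computed stably
    log_p_o = peak + math.log(sum(math.exp(v - peak) for v in joint.values()))
    return {t: math.exp(v - log_p_o) for t, v in joint.items()}


# Hypothetical tagged corpus, far too small for real use.
TRAINING = [("1956", "YEAR"), ("1984", "YEAR"), ("2002", "YEAR"),
            ("12:35", "TIME"), ("13:23", "TIME"), ("08:15", "TIME"),
            ("St.", "ABBR"), ("Dr.", "ABBR"), ("Mr.", "ABBR")]

priors, models = train(TRAINING)
scores = posterior("13:45", priors, models)
best_tag = max(scores, key=scores.get)
```

This corresponds to the supervised setting on the slide, since it relies on a tagged corpus; the unsupervised case would need a different way of obtaining the per-tag statistics.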
Morphological Analysis
Function Words
  Determiners, Pronouns, Prepositions, Conjunctions
  Skeleton of sentence
  Stored in lexicon, along with pronunciation
Content Words
  Inflection + Compounding
  Used for pronunciation and stressing
Synthesis
Input
Sequence of phonemes
Prosodic Information
Output
Digital Speech
Synthesis Strategies
Synthesis by Rule
  Cognitive approach of the phonation mechanism
  Speech is produced by mathematical rules that formally describe the influence of phonemes on one another
Synthesis by Concatenation
  Limited knowledge of the data to be handled
  Elementary speech units are stored in a database and then concatenated and processed to produce the speech signal
Synthesis by Rule
Functional Diagram
[Figure: DSP module for synthesis by rule. Speech science side: Speech Corpus, Speech Analysis, Parametric Speech Corpus, Rule Finding, Rule Database. Signal processing side: Rule Matching, Signal Synthesis. Input: phone names and prosody. Output: speech.]
Synthesis by Rule
Analysis and Synthesis
Preparation
  Words are read by a professional speaker
  Data parameterization through a speech analyzer
  Rule extraction (manual)
  Trial-and-error optimization
Synthesis
  Rules are matched to phonetic input
  Production of parametric signal
  Synthesis of speech signal by re-implementing the analysis model
Synthesis by Rule
Segmental Quality
Rule Efficiency
Corpus Quality
Choice of utterances and recording quality
Intrinsic Errors: Accuracy of model describing high-quality speech
Even simple analysis-resynthesis may produce problems!
Extrinsic Errors: Parameter extraction algorithm
Improvements during Trial-Error tuning
Synthesis by Rule
Formant Synthesizers
+ Speech is a dynamic evolution of up to 60 parameters
    Formant and antiformant frequencies and bandwidths
    Glottal waveforms
+ Almost free of modeling errors
- Parameters are difficult to estimate
    Time consuming
    Intensive trial-and-error testing to cope with extrinsic errors
- Signal buzziness, low signal quality
- High-quality synthesis rules are yet to be discovered
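To make the idea of formant frequencies and bandwidths as control parameters concrete, here is a minimal sketch of a cascade formant synthesizer in Python. It uses the standard two-pole digital resonator known from Klatt-style synthesizers; the impulse-train source is a crude stand-in for a glottal waveform, and the function names, the sample rate, and the formant values in the usage line are illustrative assumptions rather than measured data.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] for one formant."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate):
    """Filter a list of samples through a single formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(num_samples, period):
    """Crude voiced source: one unit impulse per pitch period."""
    return [1.0 if i % period == 0 else 0.0 for i in range(num_samples)]


def synthesize_vowel(formants, f0_hz=100.0, duration_s=0.5, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    num_samples = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    signal = impulse_train(num_samples, period)
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Illustrative (assumed) formant values for an open vowel.
vowel = synthesize_vowel([(700.0, 60.0), (1200.0, 90.0), (2600.0, 150.0)])
```

A static parameter set like this produces a steady, buzzy vowel. A rule-based synthesizer would instead vary the parameters over time according to its rules, which is where the estimation effort described on the slide comes in.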
Synthesis by Concatenation
Functional Diagram
[Figure: DSP module for synthesis by concatenation. Speech science side: Speech Corpus, Selective Segmentation, Speech Segment DB, Speech Analysis, Parametric Segment DB, Equalization, Speech Coding, Synthesis Segment DB. Signal processing side: Segment List Generation (using Segment Info), Speech Decoding, Prosody Matching, Concatenation, Signal Synthesis. Input: phone names and prosody. Output: speech.]
Synthesis by Concatenation
Analysis Database Preparation
Choose the appropriate speech units
  Diphones, Half-Syllables and Triphones
Compile and record utterances
Segment signal and extract speech units
Store segment waveforms (along with context) and extended information in database
Extract parameters and create parametric segment database
  Useful for data compaction
  Easier prosody matching and modification
Perform amplitude equalization to prevent mismatches
Synthesis by Concatenation
Unit Database Issues
Very large combinatorial space of combinations of phonemes and prosodic contexts
In English: 43 phones, 79,507 possible triphones, only 70,000 used
Which of them should we keep?
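The triphone figure follows directly from the phone count, since a triphone is an ordered triple of phones. A small Python check, with a hypothetical helper name, makes the arithmetic explicit:

```python
def unit_inventory_size(num_phones, phones_per_unit):
    """Number of distinct ordered phone sequences of the given length."""
    return num_phones ** phones_per_unit


possible_triphones = unit_inventory_size(43, 3)   # 43 * 43 * 43
used_fraction = 70000 / possible_triphones        # share of triphones in use
```

Here `possible_triphones` is 79,507, so the 70,000 triphones in use cover roughly 88% of the theoretical inventory, before prosodic context is taken into account.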
Unit Selection vs Concatenative Synthesis
We record a large speech corpus
In unit selection, the corpus is segmented into phonetic units, indexed, and used as-is
  Unit selection is made on-line
In concatenative synthesis, the selection is made off-line and manually!
Concatenating Segments
The PSOLA Method
Pitch Synchronous Overlap and Add
A window (two pitch periods long) is multiplied with the signal
The signal is broken into a set of localized signals (non-zero only at the window intervals)
Pitch Modification
  Relative shifting of localized signals
  Spacing reflects the pitch period
  Good results for modification factors between 0.6 and 1.5
Duration Modification
  Localized signals are added or deleted from output
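The pitch-modification step can be sketched as follows. This is a simplified illustration, not a full PSOLA implementation: it assumes a constant, known pitch period (real systems need pitch marks estimated from the signal), uses a Hann window, keeps the duration unchanged by reusing the nearest localized signal, and all function names are hypothetical.

```python
import math


def hann(length):
    """Periodic Hann window; copies shifted by length/2 sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def synthesis_marks(num_samples, period, factor):
    """Centers of the output grains, spaced period/factor samples apart."""
    marks = []
    t = float(period)
    while t <= num_samples - period:
        marks.append(int(round(t)))
        t += period / factor
    return marks


def psola_pitch_shift(signal, period, factor):
    """Scale the pitch of `signal` by `factor`, assuming a constant period."""
    window = hann(2 * period)
    # Analysis: one localized signal (two periods long) per pitch mark.
    analysis = list(range(period, len(signal) - period + 1, period))
    grains = [[signal[m - period + n] * window[n] for n in range(2 * period)]
              for m in analysis]
    # Synthesis: re-space the localized signals and overlap-add them.
    out = [0.0] * len(signal)
    for center in synthesis_marks(len(signal), period, factor):
        # Reuse the grain closest in time, so the duration stays the same.
        k = min(range(len(analysis)), key=lambda i: abs(analysis[i] - center))
        for n in range(2 * period):
            out[center - period + n] += grains[k][n]
    return out


# Toy input: a sine wave with a period of 20 samples.
tone = [math.sin(2.0 * math.pi * n / 20) for n in range(200)]
raised = psola_pitch_shift(tone, 20, 1.25)   # grains now 16 samples apart
same = psola_pitch_shift(tone, 20, 1.0)      # factor 1 reproduces the input
```

With a factor of 1 the overlapping windows sum to one, so the interior of the signal is reconstructed unchanged. With other factors some grains are reused or skipped, which is the adding and deleting of localized signals mentioned on the slide, applied here to hold the duration fixed.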
Concatenative and Rule-Based Synthesis: Comparison
Concatenative Synthesis is the state-of-the-art
Storage is of little concern now
Storing the segment database is no longer an issue
Advances in ensuring smoothness in concatenations
Rule-based synthesis output used to be smoother
Certain sounds are too hard to be produced by rule
Vowels are easy to create by rule
Bursts and voiceless stops are too difficult, because we do not fully understand their production mechanisms