VoiceUNL : a proposal to represent speech control mechanisms
within the Universal Networking Digital Language
Mutsuko Tomokiyo (GETA-CLIPS-IMAG) & Gérard Chollet (ENST)
Reviewed by Christian Boitet (GETA-CLIPS-IMAG)
Content
• Background of this work
• Proposal of extension of UNL - Speech to speech MT
• Emotion representation
Background
• Normalangue - normalization of linguistic resources (2002)
• TECHNOVOC and RNIL (2002) - normalization of technologies applied in the engineering of written and spoken language: SIEMENS, TELISMA, IDYLIC, DIALOCA, ELAN Speech, ST Microelectronics, LORIA, ENST Paris
• Lingtour (2002) - multilingual-multimedia MT: TsingHua University (China), Paris 8 University (France), INT (France), ENST-Paris and Bretagne (France) and CLIPS (France)
Extension of UNL (1)
Example:
- May I smoke?
- No! You may not, Victor.

[S:01]
{org:e1} - May I smoke? {/org}
{unl}
agt(smoke(icl>do).@entry.@present.@may.@interrogative, I)
{/unl}
[/S]
[1]
Extension of UNL (2)
[S:02]
{org:e2} - No! you may not, Victor [arte] {/org}
{unl}
agt(smoke(icl>do).@entry.@present.@may.@not, you)
mod(smoke(icl>do).@entry.@present.@may.@not, no)
mod(no, !(icl>symbol).@interjection)
mod(smoke(icl>do).@entry.@present.@may.@not, Victor(icl>name).@vocative)
{/unl}
[/S]
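The binary relations in the {unl} blocks above follow the pattern label(source, target), where a node is a headword with an optional restriction such as (icl>do) and a chain of .@attributes. A minimal Python sketch of a reader for that surface pattern follows. The names Node, Relation, parse_node and parse_relation are illustrative assumptions, not part of any UNL tool, and the sketch only handles the relation shapes that appear in these slides.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    headword: str                      # e.g. "smoke"
    restriction: str = ""              # e.g. "icl>do"
    attributes: list = field(default_factory=list)  # e.g. ["entry", "may"]


@dataclass
class Relation:
    label: str                         # e.g. "agt", "mod"
    source: Node
    target: Node


# headword, optional "(restriction)", then zero or more ".@attribute"
NODE_RE = re.compile(
    r"^(?P<head>[^(.]+)(?:\((?P<restr>[^)]*)\))?(?P<attrs>(?:\.@[\w-]+)*)$"
)


def split_top_level(text, sep=","):
    """Split on sep, ignoring separators nested inside parentheses."""
    depth, start, parts = 0, 0, []
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts


def parse_node(text):
    match = NODE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognizable node: {text!r}")
    attrs = [a for a in match.group("attrs").split(".@") if a]
    return Node(match.group("head"), match.group("restr") or "", attrs)


def parse_relation(text):
    label, _, rest = text.strip().partition("(")
    if not rest.endswith(")"):
        raise ValueError(f"not a recognizable relation: {text!r}")
    args = split_top_level(rest[:-1])
    if len(args) != 2:
        raise ValueError(f"expected two arguments in: {text!r}")
    return Relation(label.strip(), parse_node(args[0]), parse_node(args[1]))


rel = parse_relation("agt(smoke(icl>do).@entry.@present.@may.@interrogative, I)")
```

Here rel.source.attributes holds the four attributes of the entry node, which is where a speech-oriented extension could look for markers such as @interrogative or @vocative.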
Speech to Speech Machine Translation (SSMT)
1. Speaker recognition
2. Gestures, facial movement and speech recognition
3. Transcription and text transfer (UNL)
4. Target language generation (Ariane-G5)
5. Voice, speech, gestures synthesis
[Furui, 03; Blanchon, 02]
Emotion representation (1)
Classification of emotions :
(1) happiness, (2) sadness,
(3) disgust, (4) surprise,
(5) fear, (6) anger,
(7) irritation, (8) hesitation,
(9) uncertainty, (10) neutral
[Morita, 89; Ekman, 79, 03; OOC, 90; ESPIRE, 00]
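The ten classes above can be kept as a numbered enumeration, so that a class number and its name stay in step. This is only a sketch: the Emotion type and the emotion_label helper are hypothetical names introduced here, not part of the proposal.

```python
from enum import IntEnum


class Emotion(IntEnum):
    """The ten emotion classes, numbered as on the slide."""
    HAPPINESS = 1
    SADNESS = 2
    DISGUST = 3
    SURPRISE = 4
    FEAR = 5
    ANGER = 6
    IRRITATION = 7
    HESITATION = 8
    UNCERTAINTY = 9
    NEUTRAL = 10


def emotion_label(number):
    """Return the lowercase label for a class number, e.g. 4 -> 'surprise'."""
    return Emotion(number).name.lower()
```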
Emotion representation (2)
Emotion eliciting factors and task facets in SSMT:
• lexicon (sad, happy, etc.)
• phatics (ah, hein, etc.)
• prosodies (fast, slow, strong, etc.)
• voice (noisy, soft, young, etc.)
• gestures (movements of hands, mouth, eyes, etc.)
Emotion representation (3)
1 2 3 4 5 6 7 8 9 10
lexicon * * * * * * * * * *
phatics * * * * * * * * * *
prosodies * * * * * * * * *
voice * * * * * * * *
hands * *
mouth *
eyes * * * * * *
eyebrows * * * *
head *
shoulders * * *
Emotion representation (4)
Speaker recognition and voice synthesis :
• gender,
• age,
• variant (natural, artificial, etc.),
• voice name (high-pitched, husky, etc.)
[BMC, 02; W3C rec., 02]
Emotion representation (5)
Prosody:
• Pitch: x-high, high, medium, low, x-low, default
• Range
• Rate: x-fast, fast, medium, slow, x-slow, default
• Duration
• Volume: silent, x-soft, soft, medium, loud, x-loud, default
• Emphasis
• Break
[BMC, 02; W3C rec., 02]
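The label sets listed for pitch, rate and volume are closed vocabularies, so they can be checked before markup is generated. The sketch below validates the labels and emits a prosody element with Python's standard xml.etree.ElementTree. PROSODY_LABELS and prosody_element are hypothetical names, and carrying the labels as attributes of a prosody element is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Label sets copied from the slide; range, duration, emphasis and break
# are left out because the slide gives no value list for them.
PROSODY_LABELS = {
    "pitch": {"x-high", "high", "medium", "low", "x-low", "default"},
    "rate": {"x-fast", "fast", "medium", "slow", "x-slow", "default"},
    "volume": {"silent", "x-soft", "soft", "medium", "loud", "x-loud", "default"},
}


def prosody_element(text, **settings):
    """Build a <prosody> element after checking every label."""
    for name, value in settings.items():
        allowed = PROSODY_LABELS.get(name)
        if allowed is None:
            raise ValueError(f"unsupported prosody attribute: {name}")
        if value not in allowed:
            raise ValueError(f"invalid {name} label: {value}")
    element = ET.Element("prosody", settings)
    element.text = text
    return element


xml_text = ET.tostring(
    prosody_element("No!", pitch="high", volume="loud"), encoding="unicode"
)
```

Rejecting an unknown label early keeps a typo such as "xloud" from reaching the synthesis stage unnoticed.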
Emotion representation (6)
Lexicon and speech acts:
Inform, Offer, Offer-follow-up, Promise, Yn-question, Action-request, Confirmation-question, Do-you-understand-question, Permission-request, Wh-question, Yes, No, Acknowledge, Thanks, Thanks-response, Farewell, Good-wishes, Good-wishes-response, Greet, Apology, Apology-response, Alert, Instruct, Confirmation-question-to-self, Invite, Vocative, Topic, Expressive
[Tomokiyo, 00]
Emotion representation (7)
Facial movements: left, right, up, down
• mouth
• eyes
• eyebrows
Body movements: left, right, up, down
• hands
• shoulders
• head
[ACE, 02; BMC, 02; MPEG-4, 00]
Emotion representation (1)
<?xml version="1.0" encoding="iso-8859-1" ?>
<!--<?xml-stylesheet type="text/xsl" href="newshow2.xsl"?> -->
<!-- XML for TV -->
<!DOCTYPE D (View Source for full doctype...)>
<D dn=" TV " on="TV. 1.2" dt="2003">
<Paragraph number="1">
<Sentence snumber="1">
<org lang="el"> May I smoke?</org>
<unlsem>
agt(smoke(icl>do).@entry.@present.@may.@interrogative, I)
</unlsem>
<speech-act type="Yn-question">may I smoke ?</speech-act>
<prosody>may I <emphasis>smoke</emphasis> ?</prosody>
</Sentence>
</Paragraph>
</D>
Emotion representation (2)
<?xml version="1.0" encoding="iso-8859-1" ?>
<!--<?xml-stylesheet type="text/xsl" href="newshow2.xsl"?> -->
<!-- XML for TV -->
<!DOCTYPE D (View Source for full doctype...)>
<D dn=" TV " on="TV. 1.2" dt="2003">
<Paragraph number="1">
<Sentence snumber="2">
<org lang="el"> No! You may not, Victor.</org>
<unlsem>
aoj(smoke(icl>do).@entry.@present.@may.@not, you)
mod(smoke(icl>do).@entry.@present.@may.@not, no)
mod(no, !(icl>symbol))
mod(smoke(icl>do).@entry.@present.@may.@not, Victor(icl>name).@vocative)
</unlsem>
<speech-act type="Expressive">No!</speech-act>
<speech-act type="Inform">you may not,</speech-act>
<speech-act type="Vocative">Victor</speech-act>
<prosody><emphasis>No!</emphasis> you may <emphasis>not</emphasis> Victor</prosody>
<emotion type="surprise" lexicon="No!" eyebrows="left and right raised">No! you may not</emotion>
</Sentence>
</Paragraph>
</D>
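A Sentence element like the one above can also be assembled programmatically, which avoids quoting and nesting slips in hand-written markup. The following is a minimal sketch using Python's standard xml.etree.ElementTree. The build_sentence function and its parameter layout are assumptions made for illustration, it places each speech-act type and the emotion features in attributes, and it leaves out the prosody element with its nested emphasis.

```python
import xml.etree.ElementTree as ET


def build_sentence(number, lang, original, unl_relations, speech_acts, emotion=None):
    """Assemble one <Sentence> with org, unlsem, speech-act and emotion children.

    speech_acts is a list of (type, text segment) pairs; emotion is an
    optional dict with "attributes" (a dict) and "text".
    """
    sentence = ET.Element("Sentence", {"snumber": str(number)})
    org = ET.SubElement(sentence, "org", {"lang": lang})
    org.text = original
    unlsem = ET.SubElement(sentence, "unlsem")
    # ElementTree escapes ">" in "icl>do" as "&gt;" when serializing.
    unlsem.text = "\n".join(unl_relations)
    for act_type, segment in speech_acts:
        act = ET.SubElement(sentence, "speech-act", {"type": act_type})
        act.text = segment
    if emotion is not None:
        emo = ET.SubElement(sentence, "emotion", dict(emotion["attributes"]))
        emo.text = emotion["text"]
    return sentence


sentence = build_sentence(
    2,
    "el",
    "No! You may not, Victor.",
    [
        "aoj(smoke(icl>do).@entry.@present.@may.@not, you)",
        "mod(smoke(icl>do).@entry.@present.@may.@not, no)",
    ],
    [("Expressive", "No!"), ("Inform", "you may not,"), ("Vocative", "Victor")],
    emotion={
        "attributes": {"type": "surprise", "lexicon": "No!"},
        "text": "No! you may not",
    },
)
serialized = ET.tostring(sentence, encoding="unicode")
```

Parsing serialized back with ET.fromstring returns the same relations, so the escaping of ">" is transparent to any consumer that uses an XML parser.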
Reflections and Next step
• Extension of UNL - from written text processing to multimodal and multilingual SSMT, focussing on emotion representation
• Visual corpus development
• Development of a prototype with speech and image interface