AN EXTENSIBLE TRANSCODER FOR HTML TO VOICEXML CONVERSION by Narayanan Annamalai B.E

Master’s Thesis

Advisors: Dr. Gopal Gupta and Dr. B. Prabhakaran

THE UNIVERSITY OF TEXAS AT DALLAS

May 2002

The Scenario

By 2003: one billion people will use wireless devices.

By 2005: half of them will have Internet connectivity.

Growth far surpasses that of wire-bound Internet users.

New technology is needed to let the masses of customers use handheld devices for Internet access.

The technology should be easy to use and efficient.

The right choice is speech recognition.

The talk is organized in the following manner:

Motivation to solve the problem

Introduction to basic concepts in VoiceXML

Objectives

System Model and Assumptions

Translation Logic

Extensible Feature

Conclusion and Future Work

Motivation

Drawback of Existing Web Infrastructure – content

Users of WAP are not satisfied.

It is not feasible to maintain multiple versions of the same content.

[Figure: A client requests content in format B. The web server holds the content in format A, and a format translator converts A to B before the content reaches the client.]

Related Work

The visually impaired have used screen readers.

Frankie James proposed the Auditory HTML Access System (AHA), which used distinct tones.

Neither of the above two systems has an interactive feature.

Stuart Goose et al. proposed an HTML to VoXML converter. VoXML is the ancestor of VoiceXML.

Gopal Gupta et al. proposed a denotational-semantics-based approach. It dealt with a subset of tags.

Present Scenario

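[Figure: A mobile user reaches the voice server over the PSTN and exchanges audio with it. The voice server sends a request over the Internet to the transcoder, which issues an HTTP request to the web server, receives HTML, and returns VoiceXML to the voice server.]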

Objectives

Provide a means for the visually impaired to access the Web.

Strive to express the structure of HTML pages in voice form.

The application can be customized according to the user's wishes.

Make the transcoder extensible, to accommodate new HTML tags in the future.

What is VoiceXML?

VoiceXML: a standard developed by the VoiceXML Forum (AT&T, Motorola, IBM, Lucent).

A markup language used for creating human-computer interfaces over the telephone.

The user can interact with a VoiceXML page through spoken or DTMF (telephone key press) inputs.

Plays audio files and speech synthesized by TTS (text-to-speech) converters.

HTML vs VoiceXML

HTML

1. Single unit, presented with full efficiency.

2. Displays several inputs at the same time.

3. Input does not need any grammar for validation.

VoiceXML

1. Consists of forms and blocks alone.

2. Inputs are collected sequentially.

3. Every input needs a grammar for validation.

Assumptions

Input HTML file needs to comply with the following rules:

Every open tag should have a corresponding close tag.

The input file should be error free.

The file should use only the tags that are specified in the HTML standard. Some browsers insert special characters during editing.

System Model

The application is realized in two phases

I. Parsing Phase

II. Translation Phase

Parsing Phase: The input HTML file is parsed and the HTML node tree is obtained as output. The parser used for this purpose is the Web-Wise Systems HTML parser.

Translation Phase: Each HTML node is converted into the corresponding VoiceXML node.

System Architecture

[Figure: The input provider supplies the HTML file to the parser. The transcoder, which consults an internal data sheet and an external data sheet, produces the output VoiceXML file.]

Parsing Phase

An HTML file cannot be converted on a tag-by-tag or sentence-by-sentence basis.

The structure of the HTML file should be carried over to the VoiceXML file.

The HTML file is parsed and the root node of the input file is obtained. The root node of any HTML file is the <html> node.

[Figure: parse tree with root <html> and children <head> and <body>]

Parsing Example

Input HTML file:

<html>
<head><title>Example 1</title></head>
<body>
<h1> Hello World </h1>
</body>
</html>

Output parse tree:

(htmlRoot = new RootNode())
  .addNode(new PageNode()
    .addNode(new HeadNode()
      .addNode(new TitleNode()
        .addNode(new StringNode().setHtmlData("Example1"))
      ) // end TitleNode
    ) // end HeadNode
    .addNode(new BodyNode()
      .addNode(new H1Node().setAlign("center")
        .addNode(new StringNode().setHtmlData("Hello World"))
      ) // end H1Node
    ) // end BodyNode
  ) // end PageNode
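The parsing step can be illustrated with a minimal sketch. It uses Python's standard html.parser module rather than the Web-Wise Systems parser named in the talk, and the Node and TreeBuilder classes are hypothetical names introduced only for this illustration. The sketch relies on the stated assumption that every open tag has a matching close tag, so tags such as <br> or <input> that are never closed are not handled here.

```python
from html.parser import HTMLParser


class Node:
    """One node of the parse tree: a tag, or a text string ("#text")."""

    def __init__(self, tag, attrs=None, text=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree, assuming every open tag has a close tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Only pop when the close tag matches the innermost open tag.
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", text=data.strip()))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, depth first, left to right."""
    label = repr(node.text) if node.tag == "#text" else node.tag
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


tree = parse("<html><head><title>Example 1</title></head>"
             "<body><h1> Hello World </h1></body></html>")
print("\n".join(outline(tree)))
```

The printed outline mirrors the nesting of the parse tree: html contains head and body, and each string sits under the tag that encloses it.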

Translating Phase: Issues

Translating phase: The node tree is traversed recursively (depth first, from left to right).

Each HTML node is converted to the appropriate VoiceXML node.

Issues:

Inputs must be verified with the user before submission, unlike in HTML.

VoiceXML is highly structured and follows strict conventions. For example, <prompt> It is a beautiful city </prompt> is syntactically right, but it may appear only as a child of certain elements such as <field> or <block>.

A one-to-one conversion is not always possible.
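The depth-first, left-to-right traversal can be sketched as follows. This is only an illustration under simplifying assumptions: nodes are plain tuples, every piece of text is spoken, and all text goes into a single <block> inside one <form>. The function names and the version attribute value are choices made for this sketch, not the transcoder's actual output.

```python
from xml.sax.saxutils import escape

# A node is (tag, children); a text node is ("#text", "the string").


def spoken(node):
    """Collect text in depth-first, left-to-right order."""
    tag, payload = node
    if tag == "#text":
        return [payload]
    phrases = []
    for child in payload:
        phrases.extend(spoken(child))
    return phrases


def to_vxml(root):
    """Wrap the collected text in one form containing one block."""
    lines = ['<?xml version="1.0"?>', '<vxml version="2.0">',
             "  <form>", "    <block>"]
    lines += [f"      {escape(phrase)}" for phrase in spoken(root)]
    lines += ["    </block>", "  </form>", "</vxml>"]
    return "\n".join(lines)


tree = ("html", [
    ("head", [("title", [("#text", "Example 1")])]),
    ("body", [("h1", [("#text", "Hello World")])]),
])
print(to_vxml(tree))
```

A real translation would dispatch on the tag name at each node instead of treating every tag alike, which is where the issues listed above arise.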

Translation Logic

The entire VXML page should contain only blocks and forms.

HTML forms and VoiceXML forms differ mainly in the submission method and the form declaration.

Names must be generated automatically for VXML forms.

Forms are used for collecting inputs from the user. Input can be obtained through more than one type of control.
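Automatic name generation can be as simple as a counter with a prefix. The sketch below is an assumption about how such names might be produced. The "f" prefix is borrowed from the form id="f2" that appears in the text box example, and the NameGenerator class is hypothetical.

```python
from itertools import count


class NameGenerator:
    """Hands out f1, f2, f3, ... so that each generated form has a unique id."""

    def __init__(self, prefix="f"):
        self._prefix = prefix
        self._numbers = count(1)

    def next_name(self):
        return f"{self._prefix}{next(self._numbers)}"


names = NameGenerator()
first, second = names.next_name(), names.next_name()
print(first, second)
```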

Forms: radio tag

Radio tags provide choices; the user selects one choice.

In HTML, radio tags do not have a closing tag.

When one choice is selected, the others become inactive.

The challenge is to identify the last 'radio' button of the same type.

Example: input HTML section

<form>
<INPUT type="radio" name="sex" value="male"> Male <br>
<INPUT type="radio" name="sex" value="female"> Female <br>
<h1> End of Radio </h1>
</form>

Forms: radio tag (contd.)

Output VoiceXML section

……
<field name="sex">
<prompt> Please select an Entrée, what sex <enumerate/></prompt>
<option dtmf="1" value="Male"> Male </option>
<option dtmf="2" value="Female"> Female </option>
</field>
…….

[Figure: parse tree for the form. The Form node has three children: Radio (male, sex), Radio (female, sex), and h1, whose child is the String 'end of radio'.]
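One way to find the last radio button of a group is to scan the children of the form in order and close the group as soon as the next element is not a radio button with the same name. The sketch below assumes each radio button is available as a dictionary with its name, value, and label. That representation, the function names, and the prompt wording are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def render_field(group):
    """Turn one group of same-named radio buttons into a VoiceXML field."""
    name = group[0]["name"]
    lines = [f"<field name={quoteattr(name)}>",
             f"  <prompt> Please select {escape(name)} <enumerate/></prompt>"]
    for key, radio in enumerate(group, start=1):
        lines.append(f'  <option dtmf="{key}" value={quoteattr(radio["value"])}> '
                     f'{escape(radio["label"])} </option>')
    lines.append("</field>")
    return "\n".join(lines)


def radio_fields(children):
    """Group consecutive radio buttons that share a name."""
    fields, group = [], []
    for item in children + [None]:  # None is a sentinel that flushes the last group
        is_radio = isinstance(item, dict) and item.get("type") == "radio"
        if group and not (is_radio and item["name"] == group[0]["name"]):
            fields.append(render_field(group))  # previous item was the last radio
            group = []
        if is_radio:
            group.append(item)
    return fields


form_children = [
    {"type": "radio", "name": "sex", "value": "male", "label": "Male"},
    {"type": "radio", "name": "sex", "value": "female", "label": "Female"},
    "h1: End of Radio",
]
fields = radio_fields(form_children)
print(fields[0])
```

Here the h1 element is what signals that the group named sex is complete, so both buttons end up as options of one field.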

Form: Text Box

Text boxes and text areas are used to obtain string inputs from the user.

A string has no fixed sample space: e.g., the name of a person.

VoiceXML inputs always need a grammar. <record> elements are used to solve this problem.

The user can specify the record time and attributes.

<submit> needs the list of fields and the URL for submission.

The inputs should be verified with the user before submission.

Form: text box (contd.)

Sample HTML extract:

…….
<form action=ww method=XX>
<LABEL for="firstname"> Firstname </LABEL>
<INPUT type="text" id="firstname">
<INPUT type="submit" value="send">
</form>
……..

Corresponding VoiceXML extract:

……..
<form id="f2">
<record name="firstname" beep="true" maxtime="10s" finalsilence="4000ms" dtmfterm="true">
<prompt> At tone, speak First name: </prompt>
<noinput> I did not hear anything, please try again </noinput>
<filled> <prompt> Your input is <audio expr="firstname"/></prompt>
</filled>
…….
<submit next=WW method=XX namelist= …..> </form>
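The record element for a text box can be generated from the field name, its label, and the user-supplied durations. The following is a sketch under assumptions: the function name and parameters are hypothetical, the default durations are copied from the extract above, and the name is assumed to be a valid variable name so that it can be used in the expr attribute.

```python
from xml.sax.saxutils import escape, quoteattr


def record_for_textbox(name, label, maxtime_s=10, finalsilence_ms=4000):
    """Build a <record> element that captures a spoken string input."""
    return "\n".join([
        f'<record name={quoteattr(name)} beep="true" maxtime="{maxtime_s}s" '
        f'finalsilence="{finalsilence_ms}ms" dtmfterm="true">',
        f"  <prompt> At tone, speak {escape(label)}: </prompt>",
        "  <noinput> I did not hear anything, please try again </noinput>",
        f"  <filled> <prompt> Your input is <audio expr={quoteattr(name)}/></prompt> </filled>",
        "</record>",
    ])


element = record_for_textbox("firstname", "First name")
print(element)
```

Playing the recording back inside <filled> is what lets the user verify the input before the form is submitted.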

Links

In HTML, links are given by the <a href..> tag in two ways:

• To a different part of the same document.

• To a different document altogether.

In VXML, links are provided by the <goto next ..> element.

To internal targets: sub-dialogs are created. A sub-dialog is like a function call. <goto next= sub-dialog name>

To external documents: <goto next=URL>. The target HTML URL is converted to a VoiceXML page, so a VoiceXML URL is provided.
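The two link cases can be sketched as below. The transcoder address http://transcoder.example/convert and its url parameter are hypothetical, as are the function names. Treating an href that starts with "#" as the internal case is an assumption of this sketch.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

TRANSCODER = "http://transcoder.example/convert"  # hypothetical service address


def voicexml_url(html_url):
    """Address at which the transcoded version of html_url would be served."""
    return TRANSCODER + "?" + urlencode({"url": html_url})


def goto_for_link(href):
    """Map an HTML href to a VoiceXML <goto> element."""
    if href.startswith("#"):
        # Link within the same document: jump to the dialog of that name.
        return f"<goto next={quoteattr(href)}/>"
    # Link to another document: point at its VoiceXML translation.
    return f"<goto next={quoteattr(voicexml_url(href))}/>"


print(goto_for_link("#section2"))
print(goto_for_link("http://www.example.com/a.html"))
```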

List and Image Tags

HTML has ordered and unordered lists.

A list contains text, so it can be read out easily.

The system recognizes an ordered list and speaks out the numbered items.

Image tags: the description of the image is read out.

HTML extract:

<ol>
<li> First </li>
<li> Second </li>
</ol>

Audio output:

"Beginning of an ordered list.
Item 1 First
Item 2 Second
Ending of an ordered list"
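The audio output for an ordered list can be reproduced with a few lines. The wording for ordered lists follows the example above. The handling of unordered lists (no item numbers) is an assumption, and the function name is hypothetical.

```python
def speak_list(items, ordered=True):
    """Return the lines spoken for a list of item strings."""
    kind = "an ordered" if ordered else "an unordered"
    lines = [f"Beginning of {kind} list."]
    for number, text in enumerate(items, start=1):
        lines.append(f"Item {number} {text}" if ordered else text)
    lines.append(f"Ending of {kind} list")
    return lines


print("\n".join(speak_list(["First", "Second"])))
```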

Table

In HTML, tables are used to present information in tabular form.

A table contains rows and columns; rows may contain tables, so nested tables are possible.

The information is text, so it can be read out.

Our system maintains table numbers, row numbers, and column numbers, and differentiates row and column headings.
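Keeping track of table, row, and column numbers while reading a table might look like the sketch below. The spoken wording, the cell representation as (text, is_heading) pairs, and the function name are all assumptions. Nested tables and the distinction between row and column headings are left out to keep the example short.

```python
def speak_table(rows, table_number=1):
    """rows: list of rows; each cell is a (text, is_heading) pair."""
    lines = [f"Beginning of table {table_number}."]
    for row_number, row in enumerate(rows, start=1):
        for column_number, (text, is_heading) in enumerate(row, start=1):
            kind = "heading" if is_heading else "entry"
            lines.append(f"Row {row_number}, column {column_number}, {kind}: {text}")
    lines.append(f"Ending of table {table_number}.")
    return lines


table = [
    [("City", True), ("State", True)],
    [("Dallas", False), ("Texas", False)],
]
print("\n".join(speak_table(table)))
```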

Frames

Frames are an integral part of HTML pages.

The source HTML contains only links to other HTML pages (each link is a separate frame).

Limitation of the oral medium: all frames cannot be spoken simultaneously.

Transition to frames is provided using the <goto next=URL> element.

HTML URLs are converted into VoiceXML pages.

All frame URLs are stored in a separate array and transcoded to VoiceXML recursively.
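Collecting frame URLs and visiting them recursively can be sketched as follows. The fetch argument stands in for whatever retrieves a page and is supplied here by a small in-memory dictionary. The class and function names are hypothetical, and relative URL resolution is omitted.

```python
from html.parser import HTMLParser


class FrameCollector(HTMLParser):
    """Records the src attribute of every <frame> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def frame_urls(html):
    collector = FrameCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


def transcode_order(url, fetch, seen=None):
    """Return the URLs in the order they would be transcoded."""
    seen = [] if seen is None else seen
    if url in seen:  # guard against frames that reference each other
        return seen
    seen.append(url)
    for src in frame_urls(fetch(url)):
        transcode_order(src, fetch, seen)
    return seen


PAGES = {
    "index.html": '<frameset cols="30%,70%"><frame src="menu.html">'
                  '<frame src="main.html"></frameset>',
    "menu.html": "<html><body>Menu</body></html>",
    "main.html": "<html><body>Main</body></html>",
}
order = transcode_order("index.html", PAGES.__getitem__)
print(order)
```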

Text Display Tags

Tags used only for display do not make much sense in VoiceXML.

The function of some display tags can be spoken.

<block>…….</block> and <prompt>…….</prompt> are the tags used to speak the text enclosed between them.

The content to be spoken can be tailored using the interface sheet.

The interface sheet is also used to add new HTML tags, making the system extensible.

Extensible Feature of Transcoder

A. Input Attributes

Input duration in seconds for Text-box :
Input duration in seconds for Text-Area :
………….

B. HTML Tags : Corresponding Text spoken

<blockquote> : Starting of text quoted from elsewhere
</blockquote> : Ignore
………… : …………..

Row A: Input attributes can be supplied by the user.

Row B: Treatment of HTML tags can be altered or ignored. New tags can be added in this section.
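The interface sheet can be thought of as a lookup table that the transcoder consults for each tag. The sketch below is an illustration only: the dictionary layout, the duration values, the function name, and the added <cite> entry are assumptions, while the two blockquote entries come from the sheet shown above.

```python
INTERFACE_SHEET = {
    # Section A: input attributes (durations in seconds are assumed values).
    "input": {"text-box": 10, "text-area": 30},
    # Section B: text spoken for each HTML tag; "Ignore" means say nothing.
    "tags": {
        "<blockquote>": "Starting of text quoted from elsewhere",
        "</blockquote>": "Ignore",
    },
}


def spoken_text(tag, sheet=INTERFACE_SHEET):
    """Return the text to speak for a tag, or None if it is ignored or unknown."""
    text = sheet["tags"].get(tag.lower())
    if text is None or text == "Ignore":
        return None
    return text


# Extending the transcoder: a new tag is just a new entry in section B.
INTERFACE_SHEET["tags"]["<cite>"] = "Citation"

print(spoken_text("<BLOCKQUOTE>"))
print(spoken_text("</blockquote>"))
print(spoken_text("<cite>"))
```

Because the treatment of a tag lives in data rather than in code, changing or ignoring a tag needs no change to the transcoder itself.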

Conclusion

Our transcoder is capable of converting any HTML file (version 4.0 or lower) to a corresponding VoiceXML file.

Prominent features of the transcoder: extensibility and user interactivity.

HTML to VoiceXML conversion paves the way for anytime, anywhere Internet access for mobile clients.

Future Work

Remove the restriction that all open tags in the input file must have close tags.

Try to process applets and scripts that may be present in the input HTML page.

Analyze the feasibility of implementing our transcoder in proxy servers.
