AT&T 2012 DevLab Speech API Deep Dive

09.25.2012

© 2012 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.

AT&T SPEECH API DEEP DIVE
Michael Owens (@mko on Twitter, mowens on GitHub)
Jay Lieske ([email protected], jayatyp on GitHub)

AT&T Developer Program

September 25, 2012

WHAT IS THE AT&T SPEECH API?


How the AT&T Speech API Works


Powered by AT&T WATSON℠

• Developed over 20+ years
• Optimized for different usage scenarios:
  • Web Search
  • Business Search
  • Question & Answer
  • Voicemail-to-Text
  • Short Message (SMS)
  • TV Search/Remote (U-Verse)
  • Generic Speech-to-Text


Simple Speech-to-Text

• One REST endpoint
• Accepts audio in WAV or AMR
• Structured JSON response
  • Text spoken by user
  • Metrics to evaluate recognition quality
• AT&T Native SDKs for Android and iOS handle audio capture and streaming


Apps in the Wild

Speak4it
AT&T Translator
U-Verse Easy Remote


GETTING STARTED WITH THE AT&T SPEECH API


Sign Up for API Access

• j.mp/ATTDevSignUp
• Free API Access for DevLab Attendees
• Detailed Instructions in your Attendee Packet
• Sign up with code “APILAB12”
• AT&T Staff is on hand to answer questions and help get you set up


Before You Code

• Get your API Keys from the Developer Portal:
  • Client ID (“API Key” on the AT&T Developer Portal)
  • Client Secret (“Secret Key” on the AT&T Developer Portal)
• OAuth 2.0 client_credentials grant type
• OAuth 2.0 access_token
• Audio File Types:
  • AMR: narrowband, 12.2 kbit/s, 8 kHz sampling
  • WAV: 16-bit PCM, single channel, 8 kHz sampling
• Audio File Length:
  • Voicemail: 4 minutes or less
  • Other: 1 minute or less
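The audio constraints above can be checked before uploading. The following is a minimal sketch, not part of the deck's example code: `checkWav` is a hypothetical helper, and it assumes the canonical 44-byte RIFF/WAVE header where the "data" chunk immediately follows the "fmt " chunk. Files with extra header chunks would need a proper chunk walker.

```javascript
// Hypothetical helper: check a WAV buffer against the stated constraints
// (16-bit PCM, single channel, 8 kHz, at most maxSeconds long).
// Assumes the canonical 44-byte header layout.
function checkWav(buf, maxSeconds) {
  if (buf.length < 44 ||
      buf.toString('ascii', 0, 4) !== 'RIFF' ||
      buf.toString('ascii', 8, 12) !== 'WAVE') {
    return { ok: false, problems: ['not a RIFF/WAVE file'], seconds: 0 };
  }

  var problems = [];
  var format = buf.readUInt16LE(20);      // 1 means uncompressed PCM
  var channels = buf.readUInt16LE(22);
  var sampleRate = buf.readUInt32LE(24);
  var byteRate = buf.readUInt32LE(28);
  var bitsPerSample = buf.readUInt16LE(34);

  if (format !== 1) { problems.push('not PCM'); }
  if (channels !== 1) { problems.push('not single channel'); }
  if (sampleRate !== 8000) { problems.push('not 8 kHz'); }
  if (bitsPerSample !== 16) { problems.push('not 16 bit'); }

  // Duration can only be estimated when "data" sits at the canonical offset.
  var seconds = 0;
  if (buf.toString('ascii', 36, 40) === 'data' && byteRate > 0) {
    seconds = buf.readUInt32LE(40) / byteRate;
    if (seconds > maxSeconds) { problems.push('audio too long'); }
  }

  return { ok: problems.length === 0, problems: problems, seconds: seconds };
}
```

A caller would pass 60 for most contexts and 240 for voicemail, matching the length limits listed above.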


Step 1: Connect via OAuth

Request Method: POST

Request URL: https://api.att.com/oauth/token

Request Headers:

Content-Type: application/x-www-form-urlencoded

Request Body:

client_id=ATT_API_CLIENT_ID&client_secret=ATT_API_CLIENT_SECRET&grant_type=client_credentials&scope=SPEECH

Response Body: { "access_token": "xxyz123" }
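The token request on this slide could be assembled in Node.js roughly as follows. This is a sketch only: `buildTokenRequest` and `parseTokenResponse` are hypothetical helper names, and the code builds the request without sending it. The URL, form fields, and response shape are taken from the slide.

```javascript
var querystring = require('querystring');

// Hypothetical helper: build the options and form body for the OAuth
// client_credentials request shown on the slide.
function buildTokenRequest(clientId, clientSecret) {
  var body = querystring.stringify({
    client_id: clientId,
    client_secret: clientSecret,
    grant_type: 'client_credentials',
    scope: 'SPEECH'
  });
  return {
    options: {
      method: 'POST',
      hostname: 'api.att.com',
      path: '/oauth/token',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Content-Length': Buffer.byteLength(body)
      }
    },
    body: body
  };
}

// Hypothetical helper: pull the access token out of the JSON response text.
function parseTokenResponse(text) {
  var token = JSON.parse(text).access_token;
  if (!token) { throw new Error('no access_token in response'); }
  return token;
}
```

To send it, pass `options` to `https.request`, write `body` to the request, and feed the collected response text to `parseTokenResponse`.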


Step 2: POST Audio to AT&T (Non-Streaming HTTP Request)

Request Method: POST

Request URL: https://api.att.com/rest/1/SpeechToText

Request Headers:

Accept: application/json

Authorization: Bearer xxyz123

Content-Type: audio/wav

Content-Length: 1534

X-SpeechContext: BusinessSearch

Request Body:

AUDIO_BINARY_DATA

Note: The audio binary data goes directly in the POST body, not a MIME attachment.
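As a sketch of the non-streaming request, the headers above could be built from an in-memory audio buffer like this. `buildSpeechRequest` is a hypothetical name; the header names, URL, and context value come from the slide, and the function does not send anything.

```javascript
// Hypothetical helper: build https.request options for a non-streaming
// SpeechToText call. The audio Buffer itself is written as the raw body.
function buildSpeechRequest(accessToken, audio, context) {
  return {
    method: 'POST',
    hostname: 'api.att.com',
    path: '/rest/1/SpeechToText',
    headers: {
      'Accept': 'application/json',
      'Authorization': 'Bearer ' + accessToken,
      'Content-Type': 'audio/wav',
      'Content-Length': audio.length, // byte length of the raw audio
      'X-SpeechContext': context
    }
  };
}
```

With these options, `req.end(audio)` on the request returned by `https.request` sends the bytes directly as the body, with no multipart wrapper.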


Step 2: POST Audio to AT&T (Streaming HTTP Request)

Request Method: POST

Request URL: https://api.att.com/rest/1/SpeechToText

Request Headers:

Accept: application/json

Authorization: Bearer xxyz123

Content-Type: audio/amr

Transfer-Encoding: chunked

X-SpeechContext: QuestionAndAnswer

Request Body:

200
AUDIO_BINARY_DATA_CHUNK
200
AUDIO_BINARY_DATA_CHUNK
0

Note: Numbers are the recommended chunk size in hexadecimal format.
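To make the chunk framing concrete, here is a small sketch that frames a buffer the way HTTP chunked transfer encoding does, using the 0x200 (512-byte) size from the slide. `frameChunks` is a hypothetical helper for illustration. In practice Node's `http`/`https` clients apply this framing automatically when you call `req.write()` without setting Content-Length, so you would rarely build it by hand.

```javascript
// Hypothetical helper: frame audio bytes as HTTP chunked transfer encoding.
// Each chunk is "<size in hex>\r\n<data>\r\n"; a zero-size chunk ends the body.
function frameChunks(audio, chunkSize) {
  chunkSize = chunkSize || 0x200; // 512 bytes, the size shown on the slide
  var parts = [];
  for (var i = 0; i < audio.length; i += chunkSize) {
    var chunk = audio.slice(i, i + chunkSize);
    parts.push(Buffer.from(chunk.length.toString(16) + '\r\n', 'ascii'));
    parts.push(chunk);
    parts.push(Buffer.from('\r\n', 'ascii'));
  }
  parts.push(Buffer.from('0\r\n\r\n', 'ascii')); // terminating chunk
  return Buffer.concat(parts);
}
```

The final chunk of a real recording is usually shorter than 512 bytes, so its size line is smaller (for example, 88 remaining bytes are announced as `58`).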


AT&T SPEECH API EXAMPLE APPLICATION

Download the Source: https://github.com/attdevsupport/2012DevLabExamples


Transcription in Three Steps

1. Capture Audio Input

Capturing audio input differs from platform to platform.

In our Basic Example, we use a small Adobe Flex app to access the mic via Flash, capture the audio in one of the two accepted formats, then save the newly created audio file to disk on the server.

In our Speech Labs, we will look at ways to capture and stream audio directly to the Speech API.

2. POST Audio to AT&T

Once the audio input has been captured, we send the compatible audio file from our server to the Speech API using a simple POST.

In our Basic Example, we use a small Node.js module called “Watson.js” (NPM: “watson-js”) to authenticate with the Speech API via OAuth and then POST the audio file.

In our Speech Labs, we will do this on iOS, Android, and Web.

3. Use AT&T API Response

The AT&T API sends back an easy-to-parse JSON object with the interpreted text.

In our Basic Example, we output this to the user’s screen, pretty-printed and syntax-highlighted, but you could do much more.

In our Speech Labs, we will look at other ways to use this data, like searching for businesses on Foursquare.


Watson.js
Node.js API Wrapper for the AT&T Speech API

GitHub: http://github.com/mowens/watson-js/
NPM: https://npmjs.org/package/watson-js


Using Watson.js

1. Require API Wrapper

var WatsonClient = require('watson-js');

2. Set API Client Options

var options = {
  client_id: ATT_API_CLIENT_ID,
  client_secret: ATT_API_CLIENT_SECRET,
  access_token: ACCESS_TOKEN,
  scope: "SPEECH",
  context: "Generic",
  access_token_url: "https://api.att.com/oauth/token",
  api_domain: "api.att.com"
};

3. Instantiate New API Client

var Watson = new WatsonClient.Watson(options);


The Methods of Watson.js

Watson.getAccessToken(callback)

Requests a new OAuth access token using the Client Credentials grant type and passes the returned access token to the callback function.

Watson.speechToText(speechFile, accessToken, callback)

Pipes a speech file (given as an absolute file path) to the AT&T Speech API using the given access token. The API response is passed to the callback function as parsed JSON.
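The two methods are typically chained: fetch a token, then transcribe. The sketch below shows one way to do that. `transcribe` is a hypothetical helper, and it assumes `getAccessToken` invokes its callback Node-style as `(err, accessToken)`; the slides only confirm the `(err, reply)` form for `speechToText`, so check the Watson.js source before relying on this.

```javascript
// Hypothetical helper: get a token, then transcribe a file.
// `client` is any object exposing getAccessToken(callback) and
// speechToText(speechFile, accessToken, callback), such as a Watson.js client.
// ASSUMPTION: getAccessToken calls back with (err, accessToken).
function transcribe(client, speechFile, callback) {
  client.getAccessToken(function (err, accessToken) {
    if (err) { return callback(err); }
    client.speechToText(speechFile, accessToken, callback);
  });
}
```

Usage would look like `transcribe(Watson, __dirname + '/public/audio/audio.wav', function (err, reply) { ... })`, where `reply` is the parsed JSON response.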


AT&T SPEECH API EXAMPLE APP CODE WALKTHROUGH

Using the AT&T Speech API to convert generic audio to text in a web browser.

example-basic in the examples repo


Frameworks & Requirements:

Server-side:

• Node.js: JavaScript platform for building fast, scalable network apps
• FS: Node.js File System module
• Express: Minimal web application framework for Node.js
• Optimist: Lightweight option parsing module for Node.js
• HBS: Express View Engine wrapper for Handlebars
• Watson.js: Simple API Wrapper for AT&T Speech API

Client-side:

• jQuery: The gold standard of client-side JavaScript libraries
• swfobject: JavaScript to make embedding Flash objects easier
• Bootstrap: Twitter’s CSS framework for quickly developing web apps


Capture Audio Input

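recorder.swf: Adobe Flex app that accesses the user’s microphone and emits events to JS

recorder.js: JavaScript interface to receive events, update UI, and POST file to Node.js

Node.js upload script:

// Copy the uploaded temp file to its destination
function cp(source, destination, callback) {
  fs.readFile(source, function(err, buf) {
    // Stop early if the uploaded file could not be read
    if (err) { callback(err); return; }
    fs.writeFile(destination, buf, callback);
  });
}

// Receive the recorded audio file from recorder.js
app.post('/upload', function(req, res) {
  cp(req.files.upload_file.filename.path, __dirname + req.files.upload_file.filename.name, function(err) {
    // Report copy failures to the client instead of claiming success
    if (err) { res.send({ error: err }); return; }
    res.send({ saved: 'saved' });
    return;
  });
});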


POST Audio to AT&T

AJAX Request via POST from client side to Node.js

// Receive an AJAX POST from client-side JavaScript
app.post('/speechToText', function(req, res) {
  // Pass the audio file and access token to AT&T Speech API
  Watson.speechToText(__dirname + '/public/audio/audio.wav', this.access_token, function(err, reply) {
    // Pass any errors associated with API call to client-side JS
    if (err) { res.send({ error: err }); return; }
    // Return the parsed JSON to client-side JavaScript
    res.send(reply);
    return;
  });
});


Use Speech API Response

Example API Response, returned from a call using an Accept header of 'application/json':

{
  "Recognition": {
    "ResponseId": "74a964bf2fe",
    "NBest": [
      {
        "WordScores": [1, 0.75, 1, 0.75],
        "Confidence": 0.75,
        "Grade": "accept",
        "ResultText": "This is a test.",
        "Words": ["This", "is", "a", "test."],
        "LanguageId": "en-us",
        "Hypothesis": "This is a test."
      }
    ]
  }
}

Response Parameter: What the Response Parameter Means

Recognition: Body object for the AT&T Speech API response.

ResponseId: Unique identifier for a specific API call.

NBest: Array of hypothesis objects (possible transcriptions of audio data).

ResultText: Plain-text, cleaned-up representation of the Hypothesis. This should be used when displaying the text to users.

Confidence: Confidence score for the overall Hypothesis. Scored on a scale from 0 (not confident) to 1.0 (very confident).

Grade: Recommended action to take with the current Hypothesis: accept, reject, or confirm.

Words: Array of the individual words. Confidence scores for each word are available in the WordScores array.

WordScores: Array of individual confidence scores for each word in the ResultText parameter. Corresponds to the Words array.

LanguageId: Representation of the response language. Supports English & Spanish in Generic; English-only in other contexts.

Hypothesis: The raw transcription of the audio that was interpreted.
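A sketch of how an app might consume these fields: pick the highest-confidence hypothesis, follow its Grade, and flag shaky words. `interpret` and the 0.8 word threshold are illustrative assumptions, not part of the API; the field names come from the example response above.

```javascript
// Hypothetical helper: reduce a Speech API response to what a UI needs.
// Picks the hypothesis with the highest Confidence rather than assuming
// NBest is sorted, since the slides do not state an ordering.
function interpret(response, wordThreshold) {
  var nbest = response && response.Recognition && response.Recognition.NBest;
  if (!nbest || nbest.length === 0) {
    return { action: 'reject', text: '', lowConfidenceWords: [] };
  }

  var best = nbest[0];
  for (var i = 1; i < nbest.length; i++) {
    if (nbest[i].Confidence > best.Confidence) { best = nbest[i]; }
  }

  // WordScores corresponds index-by-index to Words.
  var words = best.Words || [];
  var scores = best.WordScores || [];
  var low = words.filter(function (word, idx) {
    return scores[idx] < wordThreshold;
  });

  return {
    action: best.Grade,      // accept, reject, or confirm
    text: best.ResultText,   // the cleaned-up text meant for display
    lowConfidenceWords: low
  };
}
```

With the example response and a threshold of 0.8, this yields the action `accept`, the text `This is a test.`, and flags `is` and `test.` as lower-confidence words that a UI might highlight for confirmation.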


Up Next:

Michael Fitzpatrick


Up Next:

Jason Goecke
Adam Kalsey


ADVANCED EXAMPLES
What can you do with Speech-to-text?

You could…
• Make your mobile or web application accessible with voice commands
• Post tweets using voice commands in a simple Twitter app
• Add on-the-fly transcripts while recording in a podcasting app
• Add captioning to videos hosted on your website automatically
• Create real-time closed captions of a conference speaker’s presentation
• Search for nearby places to check in at on Foursquare


Speech Labs

We’re now going to break out into three clusters, each focusing on a different technology stack. Work independently or with a partner!

Web (Flex + Node.js)

In the Web Speech Lab, Michael will be on hand to help get your Node.js app working with the AT&T Speech API. Code up your own Speech API app from scratch, or start from a boilerplate app that uses Foursquare to search for locations and lets you check in from your web browser!

iOS (Objective-C)

In the iOS Speech Lab, Brant will help you try out the AT&T Speech API on iOS and go into more depth about the AT&T Speech SDK for iOS.

The mobile SDK allows you to quickly capture and stream audio from your iPhone or iPad app to the AT&T Speech API.

Android (Java)

In the Android Speech Lab, Jay will help you try out the AT&T Speech API on Android and go into more depth about the AT&T Speech SDK for Android.

The mobile SDK allows you to quickly capture and stream audio from your Android phone or tablet app to the AT&T Speech API.

THANKS! ANY QUESTIONS?
Michael Owens (@mko on Twitter, mowens on GitHub)
Jay Lieske ([email protected], jayatyp on GitHub)

AT&T Developer Program

September 25, 2012
