09.25.2012

AT&T 2012 DevLab Speech API Deep Dive


DESCRIPTION

Speech given at the 2012 DevLab ( http://2012devlab.com ) about AT&T's Speech API.


Page 1: AT&T 2012 DevLab Speech API Deep Dive

09.25.2012

Page 2: AT&T 2012 DevLab Speech API Deep Dive

© 2012 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.

AT&T SPEECH API DEEP DIVE

Michael Owens (@mko on Twitter, mowens on Github)

Jay Lieske ( [email protected], jayatyp on Github)

AT&T Developer Program

September 25, 2012

Page 3: AT&T 2012 DevLab Speech API Deep Dive

WHAT IS THE AT&T SPEECH API?

Page 4: AT&T 2012 DevLab Speech API Deep Dive

How the AT&T Speech API Works

Page 5: AT&T 2012 DevLab Speech API Deep Dive

Powered by AT&T WATSON℠

• Developed over 20+ years
• Optimized for different usage scenarios:
  • Web Search
  • Business Search
  • Question & Answer
  • Voicemail-to-Text
  • Short Message (SMS)
  • TV Search/Remote (U-Verse)
  • Generic Speech-to-Text

Page 6: AT&T 2012 DevLab Speech API Deep Dive

Simple Speech-to-Text

• One REST endpoint
• Accepts audio in WAV or AMR
• Structured JSON response
  • Text spoken by user
  • Metrics to evaluate recognition quality
• AT&T Native SDKs for Android and iOS handle audio capture and streaming

Page 7: AT&T 2012 DevLab Speech API Deep Dive

Apps in the Wild

• Speak4it
• AT&T Translator
• U-Verse Easy Remote

Page 8: AT&T 2012 DevLab Speech API Deep Dive

GETTING STARTED WITH THE AT&T SPEECH API

Page 9: AT&T 2012 DevLab Speech API Deep Dive

Sign Up for API Access

• j.mp/ATTDevSignUp
• Free API Access for DevLab Attendees
• Detailed Instructions in your Attendee Packet
• Sign up with code “APILAB12”
• AT&T Staff is on hand to answer questions and help get you set up

Page 10: AT&T 2012 DevLab Speech API Deep Dive

Before You Code

• Get your API Keys from the Developer Portal:
  • Client ID (“API Key” on the AT&T Developer Portal)
  • Client Secret (“Secret Key” on the AT&T Developer Portal)
• OAuth 2.0 client_credentials grant type
• OAuth 2.0 access_token
• Audio File Types:
  • AMR: narrowband, 12.2 kbit/s, 8 kHz sampling
  • WAV: 16-bit PCM WAV, single channel, 8 kHz sampling
• Audio File Length:
  • Voicemail: 4 minutes or less
  • Other: 1 minute or less
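The WAV requirements above can be checked before uploading. The following is a minimal sketch, not part of the original deck: the function names `describeWavHeader` and `isSpeechApiWav` are hypothetical, and the code assumes the canonical 44-byte RIFF/WAVE header with the "fmt " chunk first, which not every WAV file follows.

```javascript
// Sketch: compare a WAV header with the slide's requirements
// (16-bit PCM, single channel, 8 kHz sampling).
// Assumption: canonical RIFF/WAVE layout with the "fmt " chunk at byte 12.
function describeWavHeader(buf) {
  if (buf.length < 44) {
    throw new Error('Buffer too short to hold a canonical WAV header');
  }
  if (buf.toString('ascii', 0, 4) !== 'RIFF' ||
      buf.toString('ascii', 8, 12) !== 'WAVE' ||
      buf.toString('ascii', 12, 16) !== 'fmt ') {
    throw new Error('Not a canonical RIFF/WAVE header');
  }
  return {
    audioFormat: buf.readUInt16LE(20),   // 1 means uncompressed PCM
    channels: buf.readUInt16LE(22),
    sampleRate: buf.readUInt32LE(24),
    bitsPerSample: buf.readUInt16LE(34)
  };
}

// Returns true only when every field matches the format listed above.
function isSpeechApiWav(buf) {
  var header = describeWavHeader(buf);
  return header.audioFormat === 1 &&
         header.channels === 1 &&
         header.sampleRate === 8000 &&
         header.bitsPerSample === 16;
}
```

A caller would typically read the first 44 bytes of the file with `fs.readFile` and pass the resulting Buffer to `isSpeechApiWav` before starting the upload.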

Page 11: AT&T 2012 DevLab Speech API Deep Dive

Step 1: Connect via OAuth

Request Method: POST

Request URL: https://api.att.com/oauth/token

Request Headers:

Content-Type: application/x-www-form-urlencoded

Request Body:

client_id=ATT_API_CLIENT_ID&client_secret=ATT_API_CLIENT_SECRET&grant_type=client_credentials&scope=SPEECH

Response Body:

{ "access_token": "xxyz123" }
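The token request on this slide could be issued from Node.js roughly as follows. This is a hedged sketch using only Node's built-in `https` module and `URLSearchParams`; the helper names `buildTokenRequest` and `requestAccessToken` are hypothetical, and the code assumes the response body is JSON with an `access_token` field, as the slide shows.

```javascript
var https = require('https');

// Build the form-encoded body and request options shown on the slide.
function buildTokenRequest(clientId, clientSecret) {
  var body = new URLSearchParams({
    client_id: clientId,
    client_secret: clientSecret,
    grant_type: 'client_credentials',
    scope: 'SPEECH'
  }).toString();
  return {
    options: {
      method: 'POST',
      hostname: 'api.att.com',
      path: '/oauth/token',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Content-Length': Buffer.byteLength(body)
      }
    },
    body: body
  };
}

// Send the request and pass the token to callback(err, accessToken).
function requestAccessToken(clientId, clientSecret, callback) {
  var request = buildTokenRequest(clientId, clientSecret);
  var req = https.request(request.options, function (res) {
    var chunks = [];
    res.on('data', function (chunk) { chunks.push(chunk); });
    res.on('end', function () {
      var token;
      try {
        token = JSON.parse(Buffer.concat(chunks).toString('utf8')).access_token;
      } catch (err) {
        callback(err);
        return;
      }
      callback(null, token);
    });
  });
  req.on('error', callback);
  req.end(request.body);
}
```

Keeping the request construction in its own function makes it possible to inspect the body and headers without touching the network.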

Page 12: AT&T 2012 DevLab Speech API Deep Dive

Step 2: POST Audio to AT&T (Non-Streaming HTTP Request)

Request Method: POST

Request URL: https://api.att.com/rest/1/SpeechToText

Request Headers:

Accept: application/json

Authorization: Bearer xxyz123

Content-Type: audio/wav

Content-Length: 1534

X-SpeechContext: BusinessSearch

Request Body:

AUDIO_BINARY_DATA

Note: The audio binary data goes directly in the POST body, not as a MIME attachment.
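As an illustration of the non-streaming request, the sketch below builds the headers from the slide and writes the raw audio bytes as the request body. The helper names `buildSpeechRequest` and `postAudio` are hypothetical, and the code assumes the audio is already loaded into a Buffer and that the response body is JSON.

```javascript
var https = require('https');

// Build request options for a non-streaming SpeechToText call.
function buildSpeechRequest(accessToken, audio, contentType, speechContext) {
  return {
    method: 'POST',
    hostname: 'api.att.com',
    path: '/rest/1/SpeechToText',
    headers: {
      'Accept': 'application/json',
      'Authorization': 'Bearer ' + accessToken,
      'Content-Type': contentType,          // 'audio/wav' or 'audio/amr'
      'Content-Length': audio.length,       // byte length of the Buffer
      'X-SpeechContext': speechContext      // e.g. 'BusinessSearch'
    }
  };
}

// POST the audio and pass the parsed JSON to callback(err, reply).
function postAudio(accessToken, audio, contentType, speechContext, callback) {
  var options = buildSpeechRequest(accessToken, audio, contentType, speechContext);
  var req = https.request(options, function (res) {
    var chunks = [];
    res.on('data', function (chunk) { chunks.push(chunk); });
    res.on('end', function () {
      var reply;
      try {
        reply = JSON.parse(Buffer.concat(chunks).toString('utf8'));
      } catch (err) {
        callback(err);
        return;
      }
      callback(null, reply);
    });
  });
  req.on('error', callback);
  // The audio bytes are the whole body: no multipart wrapper.
  req.end(audio);
}
```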

Page 13: AT&T 2012 DevLab Speech API Deep Dive

Step 2: POST Audio to AT&T (Streaming HTTP Request)

Request Method: POST

Request URL: https://api.att.com/rest/1/SpeechToText

Request Headers:

Accept: application/json

Authorization: Bearer xxyz123

Content-Type: audio/amr

Transfer-Encoding: chunked

X-SpeechContext: QuestionAndAnswer

Request Body:

200
AUDIO_BINARY_DATA_CHUNK
200
AUDIO_BINARY_DATA_CHUNK
0

Note: Numbers are the recommended chunk size in hexadecimal format.
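The request body above follows standard HTTP/1.1 chunked transfer coding: each chunk is its size in hexadecimal, a CRLF, the data, and another CRLF, and a zero-size chunk ends the body. The sketch below shows that framing for the recommended chunk size of 0x200 (512 bytes). The function names are hypothetical. Node's `http` and `https` modules normally apply this framing themselves when a request is written without a Content-Length, so this is only meant to make the wire format concrete.

```javascript
// Frame one chunk: size in hex, CRLF, data, CRLF.
function frameChunk(data) {
  return Buffer.concat([
    Buffer.from(data.length.toString(16) + '\r\n', 'ascii'),
    data,
    Buffer.from('\r\n', 'ascii')
  ]);
}

// Split audio into chunkSize-byte pieces, frame each one,
// and append the zero-length chunk that terminates the body.
function frameChunkedBody(audio, chunkSize) {
  var parts = [];
  for (var offset = 0; offset < audio.length; offset += chunkSize) {
    parts.push(frameChunk(audio.subarray(offset, offset + chunkSize)));
  }
  parts.push(Buffer.from('0\r\n\r\n', 'ascii'));
  return Buffer.concat(parts);
}
```

For 1024 bytes of audio and a chunk size of 0x200, the result contains two chunks that each start with `200`, followed by the terminating `0` chunk, matching the slide.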

Page 14: AT&T 2012 DevLab Speech API Deep Dive

AT&T SPEECH API EXAMPLE APPLICATION

Download the Source: https://github.com/attdevsupport/2012DevLabExamples

Page 15: AT&T 2012 DevLab Speech API Deep Dive

Transcription in Three Steps

1. Capture Audio Input

Capturing audio input differs from platform to platform.

In our Basic Example, we use a small Adobe Flex app to access the mic via Flash, capture the audio in one of the two accepted formats, then save the newly created audio file to disk on the server.

In our Speech Labs, we will look at ways to capture and stream audio directly to the Speech API.

2. POST Audio to AT&T

Once the audio input has been captured, we send the compatible audio file from our server to the Speech API using a simple POST.

In our Basic Example, we use a small Node.js module called “Watson.js” (NPM: “watson-js”) to authenticate with the Speech API via OAuth and then POST the audio file.

In our Speech Labs, we will do this on iOS, Android, and Web.

3. Use AT&T API Response

The AT&T API sends back an easy-to-parse JSON object with the interpreted text.

In our Basic Example, we print this to the user’s screen, pretty-printed and syntax-highlighted, but you could do much more.

In our Speech Labs, we will look at other ways to use this data, like searching for businesses on Foursquare.

Page 16: AT&T 2012 DevLab Speech API Deep Dive

Watson.js

Node.js API Wrapper for the AT&T Speech API

GitHub: http://github.com/mowens/watson-js/

NPM: https://npmjs.org/package/watson-js

Page 17: AT&T 2012 DevLab Speech API Deep Dive

Using Watson.js

1. Require API Wrapper

var WatsonClient = require('watson-js');

2. Set API Client Options

var options = {
  client_id: ATT_API_CLIENT_ID,
  client_secret: ATT_API_CLIENT_SECRET,
  access_token: ACCESS_TOKEN,
  scope: "SPEECH",
  context: "Generic",
  access_token_url: "https://api.att.com/oauth/token",
  api_domain: "api.att.com"
};

3. Instantiate New API Client

var Watson = new WatsonClient.Watson(options);

Page 18: AT&T 2012 DevLab Speech API Deep Dive

The Methods of Watson.js

Watson.getAccessToken(callback)

Requests a new OAuth Access Token using the Client Credentials grant type and passes the returned Access Token to the callback function.

Watson.speechToText(speechFile, accessToken, callback)

Pipes a speech file (given as an absolute file location) to the AT&T Speech API using the supplied access token. The API response is passed to the callback function as parsed JSON.
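The two methods above can be chained so that a fresh token is requested before each transcription. The sketch below is illustrative only: `transcribe` is a hypothetical helper, `client` stands for any object exposing the two methods listed above, and the `(err, accessToken)` callback shape for `getAccessToken` is an assumption based on the usual Node.js convention rather than something this deck states.

```javascript
// Request a token, then transcribe speechFile with it.
// Assumption: getAccessToken calls back with (err, accessToken).
// callback receives (err, reply), where reply is the parsed API response.
function transcribe(client, speechFile, callback) {
  client.getAccessToken(function (err, accessToken) {
    if (err) {
      callback(err);
      return;
    }
    client.speechToText(speechFile, accessToken, callback);
  });
}
```

With the client created on the previous slide, a call might look like `transcribe(Watson, __dirname + '/public/audio/audio.wav', function (err, reply) { ... })`.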

Page 19: AT&T 2012 DevLab Speech API Deep Dive

AT&T SPEECH API EXAMPLE APP CODE WALKTHROUGH

Using the AT&T Speech API to convert generic audio to text in a web browser.

example-basic in the examples repo

Page 20: AT&T 2012 DevLab Speech API Deep Dive

Frameworks & Requirements:

Server-side:

• Node.js: JavaScript platform for building fast, scalable network apps
• FS: Node.js File System module
• Express: Minimal web application framework for Node.js
• Optimist: Lightweight option parsing module for Node.js
• HBS: Express View Engine wrapper for Handlebars
• Watson.js: Simple API Wrapper for AT&T Speech API

Client-side:

• jQuery: The gold standard of client-side JavaScript libraries
• swfobject: JavaScript to make embedding Flash objects easier
• Bootstrap: Twitter’s CSS framework for quickly developing web apps

Page 21: AT&T 2012 DevLab Speech API Deep Dive

Capture Audio Input

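recorder.swf: Adobe Flex app that accesses the user’s microphone and emits events to JS

recorder.js: JavaScript interface to receive events, update UI, and POST file to Node.js

Node.js upload script:

// Copy a file by reading it fully into memory and writing it back out
function cp(source, destination, callback) {
  fs.readFile(source, function(err, buf) {
    // Stop here if the uploaded temp file could not be read
    if (err) { callback(err); return; }
    fs.writeFile(destination, buf, callback);
  });
}

// Receive the recorded audio from recorder.js and save it on the server
app.post('/upload', function(req, res) {
  cp(req.files.upload_file.filename.path, __dirname + req.files.upload_file.filename.name, function(err) {
    // Report a failed copy instead of claiming the file was saved
    if (err) { res.send({ error: err }); return; }
    res.send({ saved: 'saved' });
  });
});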
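The upload script above builds the destination by concatenating `__dirname` with the client-supplied file name. One possible hardening, sketched here as an assumption rather than as part of the original example, is to build the path with Node's `path` module so that a separator is always inserted and directory components in the client's name are dropped. The helper name `uploadDestination` is hypothetical.

```javascript
var path = require('path');

// Build the destination for an uploaded file inside uploadDir.
// path.basename keeps only the final path component of the client's name.
function uploadDestination(uploadDir, clientFileName) {
  var name = path.basename(clientFileName);
  // basename can still return '', '.' or '..', none of which is a usable file name.
  if (name === '' || name === '.' || name === '..') {
    throw new Error('Invalid upload file name');
  }
  return path.join(uploadDir, name);
}
```

In the upload handler, the second argument to `cp` could then be written as `uploadDestination(__dirname, req.files.upload_file.filename.name)`.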

Page 22: AT&T 2012 DevLab Speech API Deep Dive

POST Audio to AT&T

AJAX Request via POST from client side to Node.js

// Receive an AJAX POST from client-side JavaScript
app.post('/speechToText', function(req, res) {
  // Pass the audio file and access token to AT&T Speech API
  Watson.speechToText(__dirname + '/public/audio/audio.wav', this.access_token, function(err, reply) {
    // Pass any errors associated with API call to client-side JS
    if (err) { res.send({ error: err }); return; }
    // Return the parsed JSON to client-side JavaScript
    res.send(reply);
    return;
  });
});

Page 23: AT&T 2012 DevLab Speech API Deep Dive

Use Speech API Response

Example API Response, returned from a call that sent an Accept header of 'application/json':

{
  "Recognition": {
    "ResponseId": "74a964bf2fe",
    "NBest": [
      {
        "WordScores": [1, 0.75, 1, 0.75],
        "Confidence": 0.75,
        "Grade": "accept",
        "ResultText": "This is a test.",
        "Words": ["This", "is", "a", "test."],
        "LanguageId": "en-us",
        "Hypothesis": "This is a test."
      }
    ]
  }
}

Response Parameter | What the Response Parameter Means
Recognition | Body object for the AT&T Speech API Response
ResponseId | Unique Identifier for a specific API call
NBest | Array of hypothesis objects (possible transcriptions of audio data).
ResultText | Plain-text, cleaned up representation of the Hypothesis. This should be used when displaying the text to users.
Confidence | Confidence score for the overall Hypothesis. Scored on a scale from 0 (not confident) to 1.0 (very confident)
Grade | Recommended action to take with the current Hypothesis: accept, reject, or confirm
Words | Array of the individual words. Confidence scores for each word are available in the WordScores array.
WordScores | Array of individual confidence scores for each word in the ResultText parameter. Corresponds to Words array.
LanguageId | Representation of the response language. Supports English & Spanish in Generic; English-only in other contexts.
Hypothesis | The raw transcription of the audio that was interpreted.
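The response fields described above can drive simple application logic. The sketch below is one possible approach, not the example app's code: the function names are hypothetical, and it assumes the first entry in `NBest` is the hypothesis to act on, which the slide does not state explicitly.

```javascript
// Turn an API response into an action plus display text.
// Assumption: NBest[0] is the hypothesis to use.
function interpretRecognition(response) {
  var nbest = response && response.Recognition && response.Recognition.NBest;
  if (!nbest || nbest.length === 0) {
    // No hypothesis came back, so treat it like a rejected result.
    return { action: 'reject', text: null, confidence: 0 };
  }
  var best = nbest[0];
  return {
    action: best.Grade,            // 'accept', 'reject', or 'confirm'
    text: best.ResultText,         // cleaned-up text meant for display
    confidence: best.Confidence
  };
}

// List the words whose individual score falls below threshold,
// relying on Words and WordScores sharing the same indexes.
function lowConfidenceWords(hypothesis, threshold) {
  return hypothesis.Words.filter(function (word, i) {
    return hypothesis.WordScores[i] < threshold;
  });
}
```

An app might show `text` directly when `action` is `accept`, ask the user to confirm when it is `confirm`, and prompt for a new recording when it is `reject`.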

Page 24: AT&T 2012 DevLab Speech API Deep Dive

Up Next:

Michael Fitzpatrick

Page 25: AT&T 2012 DevLab Speech API Deep Dive

Up Next:

Jason Goecke

Adam Kalsey

Page 26: AT&T 2012 DevLab Speech API Deep Dive

ADVANCED EXAMPLES

What can you do with Speech-to-text?

You could…

• Make your mobile or web application accessible with voice commands
• Post tweets using voice commands in a simple Twitter app
• Add on-the-fly transcripts while recording in a podcasting app
• Add captioning to videos hosted on your website automatically
• Create real-time closed captions of a conference speaker’s presentation
• Search for nearby places to check in at on Foursquare

Page 27: AT&T 2012 DevLab Speech API Deep Dive

Speech Labs

We’re now going to break out into three clusters, each focusing on a different technology stack. Work independently or with a partner!

Web (Flex + Node.js)

In the Web Speech Lab, Michael will be on hand to help get your Node.js app working with the AT&T Speech API. Code up your own Speech API app from scratch, or start from a boilerplate app that uses Foursquare to search for locations and lets you check in from your web browser!

iOS (Objective-C)

In the iOS Speech Lab, Brant will help you try out the AT&T Speech API on iOS and go into more depth about the AT&T Speech SDK for iOS.

The mobile SDK allows you to quickly capture and stream audio from your iPhone or iPad app to the AT&T Speech API.

Android (Java)

In the Android Speech Lab, Jay will help you try out the AT&T Speech API on Android and go into more depth about the AT&T Speech SDK for Android.

The mobile SDK allows you to quickly capture and stream audio from your Android phone or tablet app to the AT&T Speech API.

Page 28: AT&T 2012 DevLab Speech API Deep Dive


THANKS! ANY QUESTIONS?

Michael Owens (@mko on Twitter, mowens on Github)

Jay Lieske ( [email protected], jayatyp on Github)

AT&T Developer Program

September 25, 2012