Looking into the Future: Using Google's Prediction API

  • Published on
    21-Apr-2017

Transcript

  • Looking into the Future: Using Google's Prediction API

    Justin Grammens Recursive Awesome & IoT Weekly

  • What is Prediction?

    Defined by Wikipedia as: "A statement about an uncertain event."

    Continues on to read: "It is often, but not always, based upon experience or knowledge."

    In statistics, prediction is a part of Statistical Inference.

  • Statistical Inference

    Statistical inference is the process of deducing properties of an underlying distribution by analysis of data.

    Two major paradigms used for statistical inference

    Frequentist Inference

    Bayesian Inference

  • Frequentist Inference

    Data is a repeatable random sample with a specific probability

    Parameters and probabilities remain constant during the test

    Results are independent of results from prior tests

    Q: Will the sun rise tomorrow? What's the probability of a sun dying, based on all the suns in the universe?

  • Bayesian Inference

    Takes into account prior results and subjective beliefs

    Updates probabilities of occurrence based on new data

    Tests are NOT run in isolation and affect one another

    Q: Will the sun rise tomorrow? Depends on how many times we have seen it rise in the past
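
    The two answers to the sunrise question can be contrasted in a few lines. This is a minimal sketch, assuming Laplace's rule of succession (a uniform prior) as the Bayesian update; the function names are made up for this illustration.

```python
from fractions import Fraction


def frequentist_estimate(successes: int, trials: int) -> Fraction:
    """Relative frequency observed in one fixed sample."""
    if trials == 0:
        raise ValueError("need at least one trial")
    return Fraction(successes, trials)


def bayesian_estimate(successes: int, trials: int) -> Fraction:
    """Laplace's rule of succession: posterior mean under a uniform prior."""
    return Fraction(successes + 1, trials + 2)


# Belief that the sun rises tomorrow, revised as more sunrises are seen.
for days_seen in (0, 1, 10, 1000):
    print(days_seen, bayesian_estimate(days_seen, days_seen))
```

    With no observations the Bayesian estimate starts at 1/2 and moves toward 1 as sunrises accumulate, while the frequentist estimate is undefined until there is a sample.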

  • Predictions by Machines

    Could therefore define prediction as an informed guess or opinion.

    Software systems have to be trained before they can be effective.

    source: reading.pppst.com

    http://reading.pppst.com/prediction.html

  • What is Prediction API?

    Announced at Google I/O in 2011

    Provides pattern-matching and machine learning capabilities.

    Handles both numeric and text input

    Handles both classification and regression output

    Access from App Engine, client libs and command line

    Able to retrain the model on the fly - Bayesian?

  • What Are Some Usages?

  • What Do You Need?

    Google Account

    Google Platform Console project

    Google Prediction API Activated

    Google Cloud Storage API Activated

  • Steps Involved

    Define what you are trying to accomplish

    Find the training data and format to support your goal (hardest part)

    Upload training data to Google Cloud Storage

    Train the system against the data you provide

    Send queries to your model

    Upload additional data with new information gained.

  • Hosted Model

    The Prediction API hosts a gallery of user-submitted models

    Owners can charge for the use of the model

    Hosted models are versioned so they can be updated easily

    Models are submitted in PMML format

    XML-based language to define statistical & data models

    Appears to currently be a waitlist

  • How To Train

    3 ways to create and train the correct type of model

    CSV File - Lives on Google Cloud Storage

    Training data embedded in request

    Limited to the size of an HTTP Request < 2MB

    Empty model created and trained with update calls
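
    The three options differ mainly in the request body sent when the model is created. The sketch below only builds those bodies as Python dictionaries and sends nothing over the network. The field names (`id`, `storageDataLocation`, `trainingInstances`, `csvInstance`, `output`) are recalled from the v1.6 REST reference and should be treated as assumptions; the helper names and any model id or bucket path passed in are hypothetical.

```python
import json

MAX_REQUEST_BYTES = 2 * 1024 * 1024  # the "< 2MB" HTTP request limit above


def body_from_csv(model_id, bucket_path):
    """Option 1: point the service at a CSV file in Google Cloud Storage."""
    return {"id": model_id, "storageDataLocation": bucket_path}


def body_with_embedded_data(model_id, examples):
    """Option 2: embed (answer, features) pairs directly in the request."""
    body = {
        "id": model_id,
        "trainingInstances": [
            {"output": answer, "csvInstance": list(features)}
            for answer, features in examples
        ],
    }
    size = len(json.dumps(body).encode("utf-8"))
    if size >= MAX_REQUEST_BYTES:
        raise ValueError(f"request body is {size} bytes; use a CSV file instead")
    return body


def empty_model_body(model_id):
    """Option 3: create an empty model, then train it with update calls."""
    return {"id": model_id}


def update_body(answer, features):
    """One additional example, as it might be sent in an update call."""
    return {"output": answer, "csvInstance": list(features)}
```

    Checking the serialized size before sending makes the 2 MB ceiling an explicit error instead of a rejected request.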

  • CSV File Rules

    Maximum file size 2.5 GB

    No header row. Yes, to the system it's irrelevant

    One example per line

    The first column indicates to the system the type of model.

    Ideally remove punctuation (other than apostrophes) from your data.

  • CSV File Rules

    Text Strings

    Double quotes around all text strings

    Text matching is case-sensitive

    Numeric Values

    Integers and decimals are supported

    Numbers: "1", "23", "999"

    Strings: "6 12", "colt 45"
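
    These rules map closely onto Python's standard `csv` module. The following is a minimal sketch, assuming each example is an (answer, features) pair; `to_training_csv` is a name made up for this illustration.

```python
import csv
import io


def to_training_csv(examples):
    """Serialize (answer, features) pairs as header-less CSV text.

    The answer goes in the first column. csv.QUOTE_NONNUMERIC wraps every
    str in double quotes and leaves int/float values bare.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for answer, features in examples:
        writer.writerow([answer, *features])  # one example per line
    return buffer.getvalue()


print(to_training_csv([(62, ["Seattle", 288, "sunny"])]), end="")
# 62,"Seattle",288,"sunny"
```

    Letting the writer decide the quoting keeps strings and numbers distinct without hand-editing the file.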

  • Structuring Data

    Example Value

    The Answer

    Features

    No limit on the number of features

    The more features & examples, the better

    Training on 16 MB takes ~1 hour

  • What's The Answer?

  • Regression Model

    Example Data

    Define your data to support numbers and strings

    Query of "Seattle", 288, "sunny" might get back a value of 62

    Don't need to match any values in the dataset

    Fill the model with all columns, then query with the first column missing

  • Classification Model

    Example Data

    Query of "Lose weight now!" would get a result of "spam"

    Returns the category from the dataset
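
    For both model types, a query is a training row with the answer column left off. The sketch below shows that relationship; the `{"input": {"csvInstance": [...]}}` wrapper is recalled from the v1.6 predict reference and is an assumption, and `predict_body` is a hypothetical helper that builds the body without sending it.

```python
def predict_body(features):
    """Wrap the feature columns (everything except the answer) for a query."""
    return {"input": {"csvInstance": list(features)}}


training_row = [62, "Seattle", 288, "sunny"]  # answer first, then features
query = predict_body(training_row[1:])        # same columns, answer removed
print(query)
# {'input': {'csvInstance': ['Seattle', 288, 'sunny']}}
```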

  • Authorization

    You must use OAuth 2.0 to authorize requests

    Can share your model with others

    View: User can call Analyze, Get, List and Predict on the project and/or any model owned by the project.

    Edit: User has all the permissions of Can view, but can also Delete, Insert, and Update any models owned by the project.

    Is Owner: User has all the permissions of Can edit, but can also grant permissions to other users to access the project.

  • Tips & Tricks

    The more examples & features, the better the results

    However - adding more features doesn't always give better predictions

    is_comedy is_drama is_action is_horror

    Y N N N

    VS

    genre

    Comedy

  • Tips & Tricks

    Need to add a numeric aspect to the genre?

    Add additional genre columns and weight it based on count

    genre genre genre genre genre

    Drama Drama Drama Comedy Comedy
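
    The weighting trick can be generated from per-genre counts. This is a small sketch; `genre_columns` is a hypothetical helper, and it assumes the counts for every row add up to the same number of genre columns so that no feature is left empty.

```python
def genre_columns(genre_counts):
    """Repeat each genre once per count, most frequent genre first."""
    columns = []
    for genre, count in sorted(genre_counts.items(), key=lambda kv: -kv[1]):
        columns.extend([genre] * count)
    return columns


print(genre_columns({"Comedy": 2, "Drama": 3}))
# ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy']
```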

  • Tips & Tricks

    Always put something into each feature

    Include all the features that you know about

    For Regression:

    Make sure you will have the time to ensure the values are correct

    Conversely, if you have exact numbers use them

    Try to have at least a few hundred examples for each category

  • Tips & Tricks

    Can only compare against known relationships

    Can't feed an untrained title and user to get a rating

    Solution is to break the title into genre, director, actors

    Rating user_name movie_title

    9.5 Justin Star Wars

    2.2 Justin Disaster Movie

    5.0 Justin Billy Madison
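
    Breaking the title into shared features might look like the sketch below. The genre lookup is hand-entered for illustration only, and `to_feature_row` is a hypothetical helper; director and actor columns would be added the same way.

```python
# Hand-entered lookup for illustration; a real project would load this
# (plus director and actors) from a movie database.
GENRE_BY_TITLE = {
    "Star Wars": "Sci-Fi",
    "Disaster Movie": "Comedy",
    "Billy Madison": "Comedy",
}


def to_feature_row(rating, user_name, movie_title):
    """Swap the title for a feature the model can generalize from."""
    return [rating, user_name, GENRE_BY_TITLE[movie_title]]


ratings = [
    (9.5, "Justin", "Star Wars"),
    (2.2, "Justin", "Disaster Movie"),
    (5.0, "Justin", "Billy Madison"),
]
rows = [to_feature_row(*r) for r in ratings]
print(rows[0])
# [9.5, 'Justin', 'Sci-Fi']
```

    A title the model has never seen can then still be queried, as long as its genre appears somewhere in the training data.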

  • Let's Talk Data!

    Nice Ride

    Based on the starting station, predict the ending station

    New York Cab Rides

    Given a starting GPS coordinate, predict where the cab ride will end

    Sentiment Analysis

    Based on a State of the Union speech, determine the sentiment

  • Based on the starting station, can we predict the ending station?

  • Nice Ride Location Rides

    https://www.niceridemn.org/data/

    Offers a live XML stream to update along the way


  • Nice Ride Location Rides

    Started with this:

    Next: Ended with this:

  • Nice Ride Insert Data

    ID & Location

  • Nice Ride Running Prediction

    Status

  • Lessons Learned

    I forgot to put the values in quotes, so it was treated as numerical regression.

    Verify how it's interpreting your data with a get call.

    Type
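
    A local check before uploading can catch the missing-quotes mistake early. The sketch below is a heuristic that mirrors the lesson above and is not the service's actual parsing logic; `guess_model_type` is a hypothetical helper and the station IDs in the example are made up.

```python
def guess_model_type(csv_text):
    """Guess the model type from how the first column is written.

    Heuristic only: a first column that starts with a double quote on
    every line is read as a class label, anything else as a number.
    """
    lines = [line for line in csv_text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no training examples")
    quoted = [line.lstrip().startswith('"') for line in lines]
    if all(quoted):
        return "classification"
    if not any(quoted):
        return "regression"
    return "mixed"


print(guess_model_type('30005,30010\n30007,30002\n'))          # regression
print(guess_model_type('"30005","30010"\n"30007","30002"\n'))  # classification
```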

  • Nice Ride Location Rides

    Show Scripts, API & Results

  • Can we predict the movement of NYC cabs?

  • NYC Cab Ride Data

    Data Dictionary

    http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

    Data Website

    http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

  • Sample Data

    Contains pickup & drop off latitude and longitude

  • There's A Problem

    Asking for 2 inputs and 2 outputs!

    Not possible with Prediction API as it only supports one dependent variable. :(

    Change of plan

  • Let's predict the cost of a NYC cab ride instead!

  • Prediction Demo

    Features are distances (B)

    Examples are prices (A)

    Is this accurate?

    Different fares based on areas of the city

  • Ok, not really

    Let's use location-based data instead

  • Prediction Demo

    Latitude / Longitude are the features (B, C, D, E)

    Price Is The Example (A)

    Examples
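
    The column layout for this demo can be sketched as a function that turns one trip record into a training row. The field names are assumed to match the yellow-cab trip records of that period, `fare_amount` is an assumed choice for the price column, and the coordinates in the usage example are made up.

```python
def cab_training_row(trip):
    """Price first (column A), then pickup and drop-off coordinates (B-E)."""
    return [
        float(trip["fare_amount"]),
        float(trip["pickup_latitude"]),
        float(trip["pickup_longitude"]),
        float(trip["dropoff_latitude"]),
        float(trip["dropoff_longitude"]),
    ]


# Made-up record, shaped like a row read with csv.DictReader.
trip = {
    "fare_amount": "12.5",
    "pickup_latitude": "40.75",
    "pickup_longitude": "-73.99",
    "dropoff_latitude": "40.78",
    "dropoff_longitude": "-73.95",
}
print(cab_training_row(trip))
# [12.5, 40.75, -73.99, 40.78, -73.95]
```

    Keeping the price as a bare number in the first column is what makes this a regression model with a single dependent variable.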

  • NYC Cab Ride Location

    Show Scripts, API & Results

  • Sentiment Analysis of a Speech

  • Speech Sentiment

    Always Check Your Data!

    Website incorrectly claimed positive(4), negative(0) and neutral(2) sentiment.

    Data had groups of sentiment values.

    Source

    http://help.sentiment140.com/for-students/
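
    One quick way to check the data is to count the values that actually appear in the answer column before training. This is a minimal sketch; `label_counts` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io
from collections import Counter


def label_counts(csv_text):
    """Count how often each value appears in the first (answer) column."""
    reader = csv.reader(io.StringIO(csv_text))
    return Counter(row[0] for row in reader if row)


sample = '"4","great speech"\n"0","terrible"\n"4","loved it"\n'
print(label_counts(sample))
# Counter({'4': 2, '0': 1})
```

    A label that the documentation promises but that never shows up in the counts is a sign the description and the data disagree.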

  • Speech Sentiment

    Feature

    Example Value

    Training Examples

  • Sentiment Training

  • Sentiment Example

    Show Scripts, API & Results

    Obama State of the Union Speech - 1/16

    Donald Trump Speech Des Moines, IA - 1/24

    https://medium.com/@WhiteHouse/president-obama-s-2016-state-of-the-union-address-7c06300f9726#.ardf6wqm6

    http://www.p2016.org/photos15/summit/trump012415spt.html

  • Smart Spreadsheets

    Install Smart Autofill Add-on

  • Smart Spreadsheets

    Prediction API used to fill in missing values

  • Smart Spreadsheets

    Select columns to use for data training

  • Smart Spreadsheets

    Example Values are populated

  • Final Thoughts - Overfitting

    Overfitting the model generally takes the form of making an overly complex model to explain idiosyncrasies in the data under study.

    Therefore, a model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.

    Exact query should not return EXACT examples

    https://groups.google.com/forum/#!topic/prediction-api-discuss/n64eHnv5iug
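
    A common guard against overfitting, offered here as a general technique and not something shown in the talk, is to hold back some examples and query the trained model only with those. The sketch below splits a list of examples; `holdout_split` and its defaults are assumptions for illustration.

```python
import random


def holdout_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and set aside a fraction the model never sees."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


train, test = holdout_split(range(10))
print(len(train), len(test))
# 8 2
```

    Good results on the training rows combined with poor results on the held-out rows suggest the model is fitting idiosyncrasies in the data.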

  • Thank You

    Justin Grammens

    justin@recursiveawesome.com http://recursiveawesome.com

    Check out my IoT Weekly Newsletter http://iotweeklynews.com
