Introduction to Data Science, Understanding Machine Learning and Embracing it within IoT Solution

Machine learning workshop @DYP Pune


Page 1: Machine learning workshop @DYP Pune

Introduction to Data Science, Understanding Machine Learning and Embracing it within IoT Solution

Page 2: Machine learning workshop @DYP Pune

Meet Ganesh Raskar | @geekwhocodes

• Intern at RapidCircle India
• Microsoft Student Partner
• Microsoft Certified Professional
• Microsoft Specialist (HTML5, CSS3 & JavaScript, Azure Web Services)
• Periodic Blogger (http://geekwhocodes.me)
• Lifelong learner

Email [email protected]

Twitter https://twitter.com/geekwhocodes

About https://about.me/geekwhocodes

LinkedIn https://in.linkedin.com/in/geekwhocodes

Page 3: Machine learning workshop @DYP Pune

 

Modules

Module 01 | Introduction to Data Science
Module 02 | Understanding Machine Learning
Module 03 | Machine Learning Workflow
• Module 03-01 | Regression
• Module 03-02 | Classification
• Module 03-03 | Clustering
• Module 03-04 | Recommenders
Module 04 | Demo – Classification
Module 05 | Demo – Embracing ML in IoT solution

Page 4: Machine learning workshop @DYP Pune

Module 01 | Introduction to Data Science

It has its own jargon

Page 5: Machine learning workshop @DYP Pune

What is Data Science?

• Evolving subject, no single definition
• Requires a range of skills

Data science is the exploration and quantitative analysis of all available structured and unstructured data to develop understanding, extract knowledge, and formulate actionable results.

Page 6: Machine learning workshop @DYP Pune

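Data -> Decision -> Actions

[Diagram: the path from data to decision to action, with an axis labelled "Value". Questions shown, in order:]
• What happened?
• Why did it happen?
• What will happen?
• What should I do?

[The step from decision to action is shown as: manual process, decision support, decision automation.]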

Page 7: Machine learning workshop @DYP Pune

Types of Analytics

• Retrospective analytics
• Real-time analytics
• Predictive analytics
• Intelligent SaaS apps

Predictive vs Prescriptive Analytics

• Predictive analytics, calibrated on past data, tells us what to expect
• Prescriptive analytics tells us what actions to take

Page 8: Machine learning workshop @DYP Pune

Module 02 | Understanding Machine Learning

What is Machine Learning?

Page 9: Machine learning workshop @DYP Pune

What Does Machine Learning Do?

• Finds patterns in data
• Uses those patterns to predict the future

Examples:
• Detecting credit card fraud
• Determining whether a customer is likely to switch to a competitor
• Detecting machine failure
• Lots more

Page 10: Machine learning workshop @DYP Pune

What Does It Mean to Learn?

How did you learn to read?

Learning requires:
• Identifying patterns
• Recognizing those patterns when you see them again
• Theory -> Simulation -> Try to understand things

This is what machine learning does.

Page 11: Machine learning workshop @DYP Pune

Finding Patterns

A simple example

Name     Amount     Fraudulent
Omkar    ₹ 10,000   No
Amit     ₹ 17,000   Yes
Ankit    ₹ 20,000   Yes
Ganesh   ₹ 19,000   No

A slightly more complex example

Name      Amount     Issued   Used        Age   Fraudulent
Omkar     ₹ 10,000   India    India       27    No
Ganesh    ₹ 23,000   India    India       21    No
Ankit     ₹ 12,000   India    USA         25    Yes
Amit      ₹ 2,000    USA      India       27    Yes
Avani     ₹ 14,000   India    Amsterdam   26    No
Vinit     ₹ 69,000   India    Holland     25    Yes
Aditi     ₹ 70,000   USA      USA         26    No
Swapnil   ₹ 9,000    India    India       21    No
Gayatri   ₹ 30,000   India    London      20    Yes

What’s the pattern for fraudulent transactions?
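One way to look for a pattern is to compare fraud rates between groups of rows. The sketch below uses the rows of the larger table and tests one candidate pattern; the choice of pattern (card used somewhere other than where it was issued) and the helper name `fraud_rate` are assumptions made for illustration, not something the slide states.

```python
# Rows copied from the larger table on this slide.
# Columns: name, amount (INR), issued, used, age, fraudulent?
transactions = [
    ("Omkar",   10000, "India", "India",     27, False),
    ("Ganesh",  23000, "India", "India",     21, False),
    ("Ankit",   12000, "India", "USA",       25, True),
    ("Amit",     2000, "USA",   "India",     27, True),
    ("Avani",   14000, "India", "Amsterdam", 26, False),
    ("Vinit",   69000, "India", "Holland",   25, True),
    ("Aditi",   70000, "USA",   "USA",       26, False),
    ("Swapnil",  9000, "India", "India",     21, False),
    ("Gayatri", 30000, "India", "London",    20, True),
]


def fraud_rate(rows):
    """Fraction of the given rows that are labelled fraudulent."""
    return sum(1 for r in rows if r[5]) / len(rows)


# Candidate pattern (an assumption): the card was used somewhere
# other than where it was issued.
same_place = [r for r in transactions if r[2] == r[3]]
other_place = [r for r in transactions if r[2] != r[3]]

print("used where issued:", fraud_rate(same_place))
print("used elsewhere:   ", fraud_rate(other_place))
```

On these nine rows the candidate pattern separates the groups well but not perfectly, which is typical: a learning algorithm searches many such candidate patterns instead of one picked by hand.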

Page 12: Machine learning workshop @DYP Pune

Machine Learning in a Nutshell

[Diagram: Data -> Machine learning algorithm -> Model <- Application]

• Data: contains patterns
• Machine learning algorithm: finds patterns
• Model: recognizes patterns
• Application: supplies new data to see if it matches known patterns

Page 13: Machine learning workshop @DYP Pune

Why Is Machine Learning So Hot Right Now?

Doing machine learning well requires:
• Lots of data
• Lots of compute power
• Effective machine learning algorithms

All of those things are now more available than ever

Page 14: Machine learning workshop @DYP Pune

Who’s Interested in Machine Learning?

• Business Leaders: want solutions to business problems
• Data Scientists: want powerful, easy-to-use tools
• Software Developers: want to create better applications

Page 15: Machine learning workshop @DYP Pune

Who Is a Data Scientist?

Someone who knows about:
• Statistics
• Machine learning software
• Some problem domain (ideally)

Key facts about data scientists:
• Good ones are scarce
• Good ones are expensive

Page 16: Machine learning workshop @DYP Pune

The Role of R

R is an open source programming language
• Supports machine learning, statistical computing, and more
• Has many available packages
• R is very popular
• Many commercial machine learning offerings support R

But it’s not the only choice:
• Python is also increasingly popular

Page 17: Machine learning workshop @DYP Pune

 

Summary

• Machine learning lets us find patterns in existing data, then create and use a model that recognizes those patterns in new data
• Machine learning has gone mainstream
• Big vendors think there’s big money in this market
• Machine learning can probably help your organization

Page 18: Machine learning workshop @DYP Pune

Module 03 | Machine Learning Process

Page 19: Machine learning workshop @DYP Pune

The ML process is:

Iterative
• In both big and small ways

Challenging
• It’s rarely easy

Often rewarding
• But not always

Page 20: Machine learning workshop @DYP Pune

First Step: Asking the Right Question

Choosing what question to ask is the most important part of the ML process

Ask yourself: Do you have the right data to answer the question?

Ask yourself: Do you know how you’ll evaluate the result?

Page 21: Machine learning workshop @DYP Pune

Machine Learning Flow

[Diagram, read as a sequence:]
1. Choose data from the raw data sources.
2. Apply pre-processing to the data (data processing modules) to get prepared data. Iterate until the data is ready.
3. Apply a learning algorithm to the data (ML algorithms) to get a candidate model. Iterate to find the best model.
4. Deploy the chosen model to applications.

Page 22: Machine learning workshop @DYP Pune

Repeating the Process

[Diagram: Raw Data -> Apply pre-processing to data -> Prepared Data -> Apply learning algorithm to data -> Candidate Model -> Chosen Model -> Deploy chosen model]

Re-create the model regularly

Page 23: Machine learning workshop @DYP Pune

Scenario: Predicting Customer Churn

[Diagram components: Customers; Call Center Staff; Call Center Application; CRM Data; Detailed Call Data; Aggregation Application (Hadoop, Spark, etc.); Aggregated Call Data; ML Prep Application; Data for ML; Machine Learning; Model]

Page 24: Machine learning workshop @DYP Pune

 

Summary

• Choose the right question
• Data transformation
• Iterate until you have a model that makes good predictions
• Periodically rebuild the model
• Deploy the solution

Page 25: Machine learning workshop @DYP Pune

A Closer Look at Machine Learning

ML has its own jargon

Page 26: Machine learning workshop @DYP Pune

Terminology

Training Data
• The prepared data used to create a model
• Creating a model is called training a model

Supervised Learning (most common)
• The value you want to predict is in the training data
• The data is labeled

Unsupervised Learning
• The value you want to predict is not in the training data
• The data is unlabeled

* We’ll focus on Supervised ML

Page 27: Machine learning workshop @DYP Pune

Data Processing for Supervised Machine Learning

[Diagram: Data Source 1, Data Source 2, ..., Data Source N -> available data preprocessing modules -> training (prepared) data]

1) Read raw data
2) Create training data

The training (prepared) data consists of features and a target value.

Page 28: Machine learning workshop @DYP Pune

Categorizing Machine Learning Problems

Regression: for predicting real-valued outcomes
• How many customers will visit our site next week?
• How many TVs will sell next year?
• Can we predict someone’s income from their click-through information?
• "How many?" means it’s a regression problem

Classification: for predicting truth-valued (true/false) outcomes
• Will I pass next semester?
• Is this transaction fraudulent?
• Is this a spam e-mail?

Clustering: for solving unsupervised learning problems
• Identifying chairs in a bunch of different objects
• Handwriting recognition
• Is this Ganesh's voice?

Recommenders: for recommending products based on history
• Building recommender engines

Page 29: Machine learning workshop @DYP Pune

 

Summary

• Machine learning has come of age
• Machine learning isn’t hard to understand
• Although it can be hard to do well
• Machine learning can probably help your organization

Page 30: Machine learning workshop @DYP Pune

Module 03-01 | Regression

How many?

Page 31: Machine learning workshop @DYP Pune

Regression

• Introduction to Regression
• Simple Linear Regression (1 Feature)
• Ridge Regression
• SVM Regression
• Cross-Validation

Page 32: Machine learning workshop @DYP Pune

Introduction to Regression

• Each observation is represented by a set of numbers.

A person is represented as a single feature, called x, and a label, called y:

Name     Clicks (x)   Income (y)
Ganesh   [10]         53
Ankit    [7]          -15
[…]      […]          …

Need a function that estimates y for a new x.

Page 33: Machine learning workshop @DYP Pune

Simple Linear Regression

• Formally, given a training set $(x_i, y_i)$ for $i = 1, \dots, n$, we want to create a regression model $f$ that can predict the label $y$ for a new $x$.

f(x) = function(Number of Businessweek clicks)

For example: f(x) = 100K + 5K * Number of Businessweek clicks (x)

[Plot: income (0 to 1,000K) against number of Businessweek clicks (0 to 2000), with the line f(x).]

• We want the model to be as close to the data as possible. We want these to be small: $y_i - f(x_i)$. Equivalently, we want these to be small: $(y_i - f(x_i))^2$.

$$\mathrm{SSE}(f) = \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2$$

• You do not need to solve the minimization problem yourself; the machine learning algorithm will do it for you.
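To make the minimization concrete, here is a minimal sketch of what the algorithm does for one feature: it picks the intercept and slope that minimize the sum of squared errors, using the standard closed-form least-squares solution. The data values are made up for illustration, and the function names `fit_line` and `sse` are not from the slides.

```python
def fit_line(xs, ys):
    """Return (b0, b1) minimising the sum of squared errors of f(x) = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx                 # slope
    b0 = mean_y - b1 * mean_x      # intercept
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared errors of the line on the given data."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data: clicks (x) and income in thousands (y),
# generated to lie exactly on y = 100 + 5 * x.
clicks = [0, 2, 4, 6, 8]
income = [100, 110, 120, 130, 140]

b0, b1 = fit_line(clicks, income)
print("intercept:", b0, "slope:", b1, "SSE:", sse(clicks, income, b0, b1))
```

With real, noisy data the SSE would not reach zero; the fitted line is simply the one with the smallest SSE among all lines.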

Page 34: Machine learning workshop @DYP Pune

Ridge Regression

• Extension to Simple Linear Regression
• Formally, given a training set $(x_i, y_i)$ for $i = 1, \dots, n$, we want to create a regression model $f$ that can predict the label $y$ for a new $x$.

Estimated income:

f(x) = function(feature1, feature2, feature3, feature4, feature5, … etc.)

For instance,
f(x) = 3 * Number of visits
     + 10 * Number of Businessweek clicks
     + 100 * Number of people emailed per day
     + 2 * Number of purchases of over 5K within the last month
     + 10 * Number of visits to airlines

But f(x) could be much more complicated

Page 35: Machine learning workshop @DYP Pune

Ridge Regression

Over-fitting model:
• Multiple features
• Wrong ML algorithm
• It just remembers the data
• Worst

We could choose b0, b1, b2, etc., to minimize:

total error on the training set + regularization term (which keeps the model simple)

• The constant C, which weights the regularization term, is chosen using cross-validation

• This is called “Ridge Regression”
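A minimal one-feature sketch of the idea, assuming the regularization term is C times the squared slope and that the intercept is not penalized (the slide does not show the exact formula, so both are assumptions). Under those assumptions the minimizer has a closed form. The data and the name `fit_ridge_line` are illustrative.

```python
def fit_ridge_line(xs, ys, c):
    """Minimise sum((y - b0 - b1*x)^2) + c * b1^2 for a single feature.

    Assumption: only the slope b1 is penalised, not the intercept b0.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / (sxx + c)           # c = 0 gives ordinary least squares
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

for c in (0.0, 1.0, 10.0):
    b0, b1 = fit_ridge_line(xs, ys, c)
    print("C =", c, "intercept =", round(b0, 3), "slope =", round(b1, 3))
```

Larger values of C pull the slope toward zero, which is the sense in which the regularization term keeps the model simple.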

Page 36: Machine learning workshop @DYP Pune

Support Vector Machine Regression

• The difference between Ridge and SVM regression is how they measure the difference between the prediction and the truth

• SVM regression uses an epsilon tolerance: as long as f(x) is within epsilon of y on either side, the error [ y - f(x) ] is counted as 0

• You don’t need to do this yourself; the ML algorithm takes care of it
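The two ways of measuring error can be compared side by side. This sketch uses the standard epsilon-insensitive loss for SVM regression and the squared error used by ridge regression; the numbers are made up.

```python
def squared_loss(y, prediction):
    """Error measure used by (ridge) least-squares regression."""
    return (y - prediction) ** 2


def epsilon_insensitive_loss(y, prediction, epsilon):
    """Error measure used by SVM regression: zero inside the epsilon band."""
    return max(0.0, abs(y - prediction) - epsilon)


truth = 10.0
for prediction in (10.0, 10.4, 12.0):
    print(
        prediction,
        squared_loss(truth, prediction),
        epsilon_insensitive_loss(truth, prediction, epsilon=0.5),
    )
```

A prediction of 10.4 is inside the band of width 0.5 around the truth, so the SVM loss ignores it, while the squared loss still counts a small error.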

Page 37: Machine learning workshop @DYP Pune

Cross Validation

• Cross Validation (CV) is the most popular way to evaluate a machine learning algorithm on a dataset.

• You will need a dataset, an algorithm, and an evaluation measure for the quality of the result. The evaluation measure might be the squared error between the predictions and the truth.

• Divide the data into 10 “folds” of approximately equal size.
• Train the algorithm on 9 folds, and compute the evaluation measure on the last fold.
• Repeat this 10 times, using each fold in turn as the test fold.
• Report the mean and standard deviation of the evaluation measure over the 10 folds.

[Figure: the data split into train and test folds.]
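The procedure can be sketched in a few lines. This is an illustrative version, not a library implementation: folds are formed by interleaving indices without shuffling, the evaluation measure is the mean squared error, and the "algorithm" is a deliberately trivial one (always predict the mean of the training labels). All names and data are hypothetical.

```python
import statistics


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds of approximately equal size."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validate(xs, ys, train, predict, k=10):
    """Return (mean, standard deviation) of the per-fold mean squared error."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = train(train_x, train_y)          # train on the other k-1 folds
        errors = [(ys[i] - predict(model, xs[i])) ** 2 for i in test_idx]
        scores.append(sum(errors) / len(errors))  # evaluate on the held-out fold
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical "algorithm": always predict the mean of the training labels.
def train_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]   # made-up data
mean_mse, std_mse = cross_validate(xs, ys, train_mean, predict_mean, k=10)
print("mean MSE:", mean_mse, "std of MSE:", std_mse)
```

In practice the data is usually shuffled before it is split into folds; that step is left out here to keep the sketch deterministic.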

Page 38: Machine learning workshop @DYP Pune

Module 03-02 | Classification

True or False?
Class 1 or Class 2 or Class N?

Page 39: Machine learning workshop @DYP Pune

Classification

• What is classification?
• Loss functions for classification
• Logistic regression
• SVM
• AdaBoost
• Decision trees
• Multiclass classification
• Imbalanced learning
• ROC curves and the AUC

Page 40: Machine learning workshop @DYP Pune

Introduction to Classification

• Formally, given a training set $(x_i, y_i)$ for $i = 1, \dots, n$, we want to create a classification model $f$ that can predict the label $y$ for a new $x$.

A person is represented as features, called x, and a label, called y.

Example features:
• Did s/he pass last semester?
• Number of backlogs last year
• Is s/he active on a social network?
• Does s/he have a smartphone?

Example feature values for one person: 1 2 0 1

Other feature vectors shown: [5] [10] [7] […]; [12] [14] [47] […]; [51] [15] [8] […]; [25] [30] [9] […]

Labels (y): 1, -1, 1, …

Need a function that estimates y for a new x.

Page 41: Machine learning workshop @DYP Pune

Introduction to Classification

[Plot: last year's backlog (0 to 3) against study hours per day (up to 8). The line f(x) = 0 separates the region f(x) > 0 from the region f(x) < 0; the two classes are labelled Pass and Fail.]

f(x) = function(Last Year Back log, Study No. of hr/day)

The machine learning algorithm will create the function f for you. It might be very complicated, but the way to use it is not complicated:

The predicted value of y for a new x is the sign of f(x).
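As a small illustration of "predict the sign of f(x)", the sketch below uses a linear f with hand-picked weights. The weights are hypothetical and chosen only to make the example readable; a learning algorithm would fit them from data, and the mapping of positive values to "Pass" is also an assumption.

```python
def f(backlogs, study_hours):
    """A hypothetical linear scoring function; the weights are made up."""
    return 1.0 * study_hours - 2.0 * backlogs - 1.0


def predict(backlogs, study_hours):
    """Predicted class is the sign of f(x): positive -> Pass, otherwise Fail."""
    return "Pass" if f(backlogs, study_hours) > 0 else "Fail"


print(predict(backlogs=0, study_hours=6))
print(predict(backlogs=3, study_hours=2))
```

However complicated the learned f is, using it stays this simple: compute f(x) for the new student and look at the sign.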

Page 42: Machine learning workshop @DYP Pune

Module 03-03 | Clustering

Page 43: Machine learning workshop @DYP Pune

Clustering

• Clustering is a key unsupervised problem.
• “Unsupervised” means that the training data has no ground truth labels to learn from.
• This means unsupervised methods are much harder to evaluate.

Supervised:

[Figure: example objects labelled (chair) or (not a chair), and a new object to classify: chair?]

Page 44: Machine learning workshop @DYP Pune

Clustering

Unsupervised:

• “Unsupervised” means that the training data has no ground truth labels to learn from.

Page 45: Machine learning workshop @DYP Pune

Clustering

Applications include:
• Automatically grouping documents/webpages into topics
  – For instance, grouping news stories from today into categories
• Clustering a large number of products
  – E.g. online shopping sites (search)
• Clustering customers into those with similar purchase behavior
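To illustrate the customer-grouping application, here is a minimal sketch using k-means (Lloyd's algorithm) on one-dimensional data. The slides do not name a clustering algorithm, so k-means is a choice made for this example; the spend figures and starting centers are made up.

```python
def k_means_1d(values, centers, iterations=10):
    """Lloyd's algorithm for one-dimensional data, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center)
        centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
    return centers, clusters


# Made-up monthly spend for six hypothetical customers
spend = [200, 250, 300, 5000, 5200, 4800]
centers, clusters = k_means_1d(spend, centers=[200, 5000])
print("centers:", centers)
print("clusters:", clusters)
```

No labels are used anywhere: the two groups emerge from the data alone, which is why judging whether the grouping is "right" is harder than in supervised learning.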

Page 46: Machine learning workshop @DYP Pune

Module 03-04 | Recommenders

Page 47: Machine learning workshop @DYP Pune

Introduction to Recommenders

• Self-explanatory
• Market Basket Analysis
• Customer purchasing behaviour
• Increase sales and maintain inventory

Collaborative filtering: K-NN & Pearson, matrix factorisation. Used in: Facebook, LinkedIn

Content-based: Bayesian classifiers, cluster analysis, decision trees, artificial neural networks. Used in: Netflix

Page 48: Machine learning workshop @DYP Pune

Recommenders Terminology:

Items: [1, 2, 3, 4, 5, 6, 7, 8, 9]
Itemset: any subset, e.g. {3,5}, {5,8}, {1,3}, etc.
Transactions: {2,3}, {4,9}, {7,2}, {9,3}
Rule: e.g. {7 -> 2} (if a user buys 7, what are the chances they buy 2 as well?)
Support of an itemset: the proportion of transactions containing the itemset

Approaches:
• Collaborative Filtering
• Content-Based Filtering – works on the metadata of the item
• Hybrid Approach
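The terminology can be checked on the four transactions listed on this slide. Support is computed as defined above; the chance that a buyer of 7 also buys 2 is usually called the confidence of the rule, a standard market-basket term that the slide does not name. Function names are illustrative.

```python
# The four transactions listed on the slide
transactions = [{2, 3}, {4, 9}, {7, 2}, {9, 3}]


def support(itemset, transactions):
    """Proportion of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """For a rule A -> B: of the transactions containing A, the fraction that also contain B."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print("support of {3}:      ", support({3}, transactions))
print("support of {7, 2}:   ", support({7, 2}, transactions))
print("confidence of 7 -> 2:", confidence({7}, {2}, transactions))
```

With only four transactions the numbers are not meaningful, but the calculation is the same one a market basket analysis runs over millions of baskets.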

Page 49: Machine learning workshop @DYP Pune

Recommenders

• User–Movie matrix, with one row per user (User 1, User 2, User 3, ..., User n)
• The goal is to predict a user’s rating for a movie that they haven’t watched yet
• The intuition behind using matrix factorization to solve this problem is that there should be some latent features that determine how a user rates an item.
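A minimal sketch of matrix factorization on a small user-movie matrix, trained with stochastic gradient descent on the observed ratings only. The ratings, the number of latent features, the learning rate, and the regularization strength are all made-up values for illustration; `None` marks a movie the user has not rated.

```python
import random


def factorize(ratings, k=2, steps=5000, lr=0.01, reg=0.02, seed=0):
    """Learn user factors p and movie factors q so that p[u] . q[i] ~ ratings[u][i]."""
    rng = random.Random(seed)
    n_users, n_items = len(ratings), len(ratings[0])
    p = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    observed = [
        (u, i)
        for u in range(n_users)
        for i in range(n_items)
        if ratings[u][i] is not None
    ]
    for _ in range(steps):
        for u, i in observed:
            err = ratings[u][i] - sum(p[u][f] * q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on the squared error plus an L2 penalty
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q


def predict(p, q, u, i):
    """Predicted rating of user u for movie i."""
    return sum(a * b for a, b in zip(p[u], q[i]))


# Made-up ratings: rows are users, columns are movies
ratings = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, None, 4],
    [None, 1, 5, 4],
]

p, q = factorize(ratings)
print("predicted rating, user 2 on movie 2:", round(predict(p, q, 1, 1), 2))
```

The k numbers learned per user and per movie play the role of the latent features mentioned above; a missing rating is predicted from the dot product of the two.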

Page 50: Machine learning workshop @DYP Pune

Module 04 | Demo

Will I pass next semester?

Page 51: Machine learning workshop @DYP Pune

Module 05 | Demo

How can we use ML in IoT?

Page 52: Machine learning workshop @DYP Pune

Information

Intel Edison
• Dual-core, dual-threaded Intel® Atom™ CPU at 500 MHz
• 32-bit Intel® Quark™ microcontroller at 100 MHz
• 1 GB LPDDR3 memory
• 20 digital input/output pins, including 4 pins as PWM (pulse width modulation) outputs
• 6 analog inputs
• 1 I2C
• 1 ICSP (In-Circuit Serial Programming)
• Micro USB device connector
• SD Card connector
• BLE 4.0
• Yocto Linux 1.6*

Water flow sensor (1-30 L/min), specific to my experiment

Page 53: Machine learning workshop @DYP Pune

Thank you