32
1

빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

1

Page 2: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

2© 2017 The MathWorks, Inc.

빅데이터처리및머신러닝기법

Application Engineer

엄준상 과장

Page 3: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

3

Data Analytics

Turn large volumes of complex data into actionable information

source: Gartner

Page 4: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

4

Data Analytics Workflow

Integrate Analytics with

Systems

Desktop Apps

Enterprise Scale

Systems

Embedded Devices

and Hardware

Files

Databases

Sensors

Access and Explore

Data

Develop Predictive

Models

Model Creation e.g.

Machine Learning

Model

Validation

Parameter

Optimization

Preprocess Data

Working with

Messy Data

Data Reduction/

Transformation

Feature

Extraction

Page 5: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

5

Example: Working with Big Data in MATLAB

Objective: Create a model to predict the cost of a taxi ride in New York City

Inputs:

– Monthly taxi ride log files

– The local data set is small (~20 MB)

– The full data set is big (~25 GB)

Approach:

– Acecss Data

– Preprocess and explore data

– Develop and validate predictive model (linear fit)

Work with subset of data for prototyping

Scale to full data set on a cluster

Page 6: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

6

Example: Working with Big Data in MATLAB

Page 7: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

7

Data Access and Pre-processing – Challenges

Data aggregation

– Different sources (files, web, etc.)

– Different types (images, text, audio, etc.)

Data clean up

– Poorly formatted files

– Irregularly sampled data

– Redundant data, outliers, missing data etc.

Data specific processing

– Signals: Smoothing, resampling, denoising,

Wavelet transforms, etc.

– Images: Image registration, morphological

filtering, deblurring, etc.

Dealing with out of memory data (big data)

Challenges

Data preparation accounts for about 80% of the work of data

scientists - Forbes

Page 8: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

9

Data Analytics Workflow: Data Access

C Java Fortran Python

Hardware

Software

Servers and Databases

Repositories – SQL, NoSQL, etc.

File I/O – Text, Spreadsheet, etc.

Web Sources – RESTful, JSON, etc.

Business and Transactional Data

Engineering, Scientific and Field

Data Real-Time Sources – Sensors,

GPS, etc.

File I/O – Image, Audio, etc.

Communication Protocols – OPC

(OLE for Process Control), CAN

(Controller Area Network), etc.

Page 9: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

10

Data Analytics Workflow: Big Data Access and Pre-processing

Page 10: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

11

Big Data in Recent Releases

datastore

– Tabular text files

– Images

– Excel spreadsheets

– (SQL) Databases

– HDFS (Hadoop)

– S3 (Amazon Web Services)

MATLAB MapReduce

– Scales from Desktop to Hadoopairdata = datastore('*.csv');

airdata.SelectedVariables = {'Distance', 'ArrDelay‘};

data = read(airdata);

Page 11: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

12

Data Analytics Workflow: Big Data Access and Pre-processing

Page 12: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

14

tall arrays in

New data type designed for data that doesn’t fit into memory

Lots of observations (hence “tall”)

Looks like a normal MATLAB array

– Supports numeric types, tables, datetimes, strings, etc…

– Supports several hundred functions for basic math, stats, indexing, etc.

– Statistics and Machine Learning Toolbox support

(clustering, classification, etc.)

Page 13: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

15

tall arraySingle

Machine

Memory

tall arrays

Automatically breaks data up into s

mall “chunks” that fit in memory

Tall arrays scan through the datase

t one “chunk” at a time

Processing code for tall arrays is th

e same as ordinary arrays

Single

Machine

MemoryProcess

Page 14: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

16

tall array

Cluster of

Machines

Memory

Single

Machine

Memory

tall arrays

With Parallel Computing Toolbox, pr

ocess several “chunks” at once

Can scale up to clusters with MATL

AB Distributed Computing Server

Single

Machine

MemoryProcess

Single

Machine

MemoryProcess

Single

Machine

MemoryProcess

Single

Machine

MemoryProcess

Single

Machine

MemoryProcess

Single

Machine

MemoryProcess

Page 15: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

17

Demo: Working with Tall Arrays

Page 16: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

18

Data Access and pre-processing – challenges and solution

Data aggregation

– Different sources (files, web, etc.)

– Different types (images, text, audio, etc.)

Data clean up

– Poorly formatted files

– Irregularly sampled data

– Redundant data, outliers, missing data etc.

Data specific processing

– Signals: Smoothing, resampling, denoising,

Wavelet transforms, etc.

– Images: Image registration, morphological

filtering, deblurring, etc.

Dealing with out of memory data (big data)

Challenges

Point and click tools to access

variety of data sources

High-performance environment

for big data

Files

Signals

Databases

Images

Built-in algorithms for data

preprocessing including sensor,

image, audio, video and other

real-time data

Page 17: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

21

Consider Machine/Deep Learning When

update as more data becomes available

learn complex non-linear relationships

learn efficiently from very large data sets

Problem is too complex for hand written rules or equations

Speech Recognition Object Recognition Engine Health Monitoring

Program needs to adapt with changing data

Weather Forecasting Energy Load Forecasting Stock Market Prediction

Program needs to scale

IoT Analytics Taxi Availability Airline Flight Delays

Because algorithms can

Page 18: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

22

Different Types of Learning

Machine

Learning

Supervised

Learning

Classification

Regression

Unsupervised

LearningClustering

Discover an internal representation from

input data only

Develop predictivemodel based on bothinput and output data

Type of Learning Categories of Algorithms

• No output - find natural groups and

patterns from input data only

• Output is a real number

(temperature, stock prices)

• Output is a choice between classes

(True, False) (Red, Blue, Green)

Page 19: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

24

Machine Learning with Big Data

• Descriptive statistics (skewness, tabulat

e, crosstab, cov, grpstats, …)

• K-means clustering (kmeans)

• Visualization (ksdensity, binScatterPlot;

histogram, histogram2)

• Dimensionality reduction (pca, pcacov, f

actoran)

• Linear and generalized linear regression

(fitlm, fitglm)

• Discriminant analysis (fitcdiscr)

• Linear classification methods for SVM

and logistic regression (fitclinear)

• Random forest ensembles of

classification trees (TreeBagger)

• Naïve Bayes classification (fitcnb)

• Regularized regression (lasso)

• Prediction applied to tall arrays

Page 20: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

25

Regression Learner

Page 21: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

26

Demo: Training a Machine Learning Model

Page 22: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

27

Demo: Training a Machine Learning Model

Page 23: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

28

Regression LearnerApp to apply advanced regression methods to your data

Added to Statistics and Machine Learning

Toolbox in R2017a

Point and click interface – no coding requi

red

Quickly evaluate, compare and select regr

ession models

Export and share MATLAB code or traine

d models

Page 24: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

29

Classification LearnerApp to apply advanced classification methods to your data

Added to Statistics and Machine Learning

Toolbox in R2014a

Point and click interface – no coding requi

red

Quickly evaluate, compare and select clas

sification models

Export and share MATLAB code or traine

d models

Page 25: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

30

Tuning Machine Learning ModelsGet more accurate models in less time

Automatically select best

machine leaning “features”

NCA: Neighborhood Component Analysis

Select best “features”

to keep in model from

over 400 candidates

Automatically fine-tune

machine learning parameters

Hyperparameter Tuning

Page 26: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

31

Machine Learning Hyperparameters

Hyperparameters

Tune a typical set of

hyperparameters for this model

Tune all

hyperparameters for this model

Page 27: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

32

Bayesian Optimization in Action

Page 28: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

34

MATLAB Production Server

Server software

– Manages packaged MATLAB progr

ams and worker pool

MATLAB Runtime libraries

– Single server can use runtimes fro

m different releases

RESTful JSON interface

Lightweight client libraries

– C/C++, .NET, Python, and Java

MATLAB Production Server

MATLAB

Runtime

Request Broker

&

Program

ManagerApplications/

Database

Servers RESTful

JSON

Enterprise

Application

MPS Client

Library

Page 29: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

35

Integrate analytics with systems

MATLAB

Runtime

C, C++ HDL PLC

Embedded Hardware

C/C++ ++ExcelAdd-in Java

Hadoop/

Spark.NET

MATLABProduction

Server

StandaloneApplication

Enterprise Systems

Python

Page 30: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

37

Key Takeaways

Utilize all of your data.

Apply advanced analytics techniques.

Operationalize analytics to enterprise syste

ms and embedded devices.

MATLAB Analytics work

with business and

engineering data

1

MATLAB enables domain experts to do

Data Science

2

3MATLAB Analytics run anywhere

Page 31: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

38

Resources to learn and get started mathworks.com/machine-learning

eBook

mathworks.com/big-data

Page 32: 빅데이터처리및머신러닝기법 - MathWorks · Objective: Create a model to predict the cost of a taxi ride in New York City Inputs: –Monthly taxi ride log files –The

39© 2017 The MathWorks, Inc.