Building ML Pipelines qplum_team qplum.co [email protected]

Building Machine Learning Pipelines

Page 1: Building Machine Learning Pipelines

Building ML Pipelines

qplum_team qplum.co [email protected]

Page 2: Building Machine Learning Pipelines

What do ML Pipelines Look Like?

Page 3: Building Machine Learning Pipelines

TRAINING DATA -> AWESOME ML TECHNIQUE -> MODEL
TESTING DATA -> MODEL -> PREDICTIONS

Page 4: Building Machine Learning Pipelines

Let’s build one now!

Page 5: Building Machine Learning Pipelines

UserID  Pet   Children  Salary
1       cat   4         90
2       dog   6         24
3       dog   3         44
4       fish  3         27
5       cat   2         32
6       dog   3         59
7       cat   5         36
8       fish  4         27

Predict a person's salary from the kind of pet they own and the number of children they have

Page 6: Building Machine Learning Pipelines

You may need to:
1. Binarize/normalize data
2. Remove noise
3. Reduce dimensionality of data
4. Make features from raw data
…before you get to train your model!

Page 7: Building Machine Learning Pipelines

Training Set (X = C, D, F, N; Y = S)

C  D  F  N      S
1  0  0   0.21  90
0  1  0   1.88  24
0  1  0  -0.63  44
0  0  1  -0.63  27
1  0  0  -1.46  32
0  1  0  -0.63  59
1  0  0   1.04  36
0  0  1   0.21  27

Training Set (X, Y) -> Neural Net -> Model
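The preprocessing behind this training set can be sketched in plain Python. This is a minimal illustration, not the presenter's code: the helper names `one_hot` and `standardize` are made up for this example, and it assumes the N column is a z-score computed with the population standard deviation. Under that assumption a person with 2 children falls below the mean of 3.75, so their normalized value is negative.

```python
from statistics import mean, pstdev

# Raw rows from the example table: (pet, children, salary)
rows = [
    ("cat", 4, 90), ("dog", 6, 24), ("dog", 3, 44), ("fish", 3, 27),
    ("cat", 2, 32), ("dog", 3, 59), ("cat", 5, 36), ("fish", 4, 27),
]

def one_hot(value, categories):
    """Binarize a categorical value into one 0/1 indicator per category."""
    return [1 if value == c else 0 for c in categories]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

categories = ["cat", "dog", "fish"]  # column order C, D, F
children_z = standardize([children for _, children, _ in rows])

# X holds the features (C, D, F, N); y holds the target (S)
X = [one_hot(pet, categories) + [round(z, 2)]
     for (pet, _, _), z in zip(rows, children_z)]
y = [salary for _, _, salary in rows]

for features, target in zip(X, y):
    print(features, target)
```

In scikit-learn the same two steps are usually done with `OneHotEncoder` and `StandardScaler`.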

Page 8: Building Machine Learning Pipelines

But is this enough?

Page 9: Building Machine Learning Pipelines

No ML Pipeline is complete without Cross-validation and Hyper-parameter optimization

Page 10: Building Machine Learning Pipelines

So how does our ML Pipeline look now?

Page 11: Building Machine Learning Pipelines

RAW DATA -> PRE-PROCESSED DATA -> EXTRACT FEATURES -> TRAINING DATA
TRAINING DATA -> AWESOME ML TECHNIQUE with PARAMETERS 1, ..., K, ..., N -> BEST MODEL
TESTING DATA -> BEST MODEL -> PREDICTIONS

Page 12: Building Machine Learning Pipelines

What does ‘best’ model mean?
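One common reading of "best" is the parameter setting with the lowest cross-validated error. The sketch below illustrates that idea only. It swaps in a tiny k-nearest-neighbours regressor for the neural net on the slides, and the function names, the choice of 4 folds, and the candidate values of `k` are all assumptions made for this example.

```python
from statistics import mean

# Features (C, D, F, N) and salaries for the eight example users
X = [[1, 0, 0, 0.21], [0, 1, 0, 1.88], [0, 1, 0, -0.63], [0, 0, 1, -0.63],
     [1, 0, 0, -1.46], [0, 1, 0, -0.63], [1, 0, 0, 1.04], [0, 0, 1, 0.21]]
y = [90, 24, 44, 27, 32, 59, 36, 27]

def knn_predict(train_X, train_y, x, k):
    """Predict the mean target of the k training rows closest to x."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    order = sorted(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return mean(train_y[i] for i in order[:k])

def k_fold_indices(n, n_folds):
    """Yield (train_idx, test_idx); fold f tests every row with i % n_folds == f."""
    for f in range(n_folds):
        test = [i for i in range(n) if i % n_folds == f]
        train = [i for i in range(n) if i % n_folds != f]
        yield train, test

def cv_error(k, n_folds=4):
    """Mean absolute error of k-NN, averaged over the folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(X), n_folds):
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        errors = [abs(knn_predict(train_X, train_y, X[i], k) - y[i]) for i in test]
        fold_errors.append(mean(errors))
    return mean(fold_errors)

# Grid search: score every candidate and keep the one with the lowest CV error
grid = [1, 2, 3]
scores = {k: cv_error(k) for k in grid}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

With only eight rows the scores are noisy, so the point here is the mechanics (split, fit, score, compare) rather than the particular winner.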

Page 13: Building Machine Learning Pipelines

ML Pipeline in Code

Page 14: Building Machine Learning Pipelines

Series of transformations
Transformations might involve making models
Models can be used to transform or predict
Grid-search on Parameters
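To make the "series of transformations" idea concrete, here is a minimal hand-rolled pipeline. It is a sketch under stated assumptions: the classes `Standardize`, `LinearFit`, and `Pipeline` are hypothetical and only mimic the fit/transform/predict convention popularised by scikit-learn. They are not that library's implementation.

```python
from statistics import mean, pstdev

class Standardize:
    """Transformer: learns mean and std in fit, rescales in transform."""
    def fit(self, X, y=None):
        self.mu, self.sigma = mean(X), pstdev(X)
        return self

    def transform(self, X):
        return [(x - self.mu) / self.sigma for x in X]

class LinearFit:
    """Model: one-feature least-squares line."""
    def fit(self, X, y):
        mx, my = mean(X), mean(y)
        sxx = sum((x - mx) ** 2 for x in X)
        self.slope = sum((x - mx) * (t - my) for x, t in zip(X, y)) / sxx
        self.intercept = my - self.slope * mx
        return self

    def predict(self, X):
        return [self.intercept + self.slope * x for x in X]

class Pipeline:
    """Run every transformer in order, then fit or query the final model."""
    def __init__(self, steps):
        self.transformers, self.model = steps[:-1], steps[-1]

    def fit(self, X, y):
        for step in self.transformers:
            X = step.fit(X, y).transform(X)
        self.model.fit(X, y)
        return self

    def predict(self, X):
        # Reuse the statistics learned during fit; never refit on new data
        for step in self.transformers:
            X = step.transform(X)
        return self.model.predict(X)

children = [4, 6, 3, 3, 2, 3, 5, 4]
salary = [90, 24, 44, 27, 32, 59, 36, 27]
pipe = Pipeline([Standardize(), LinearFit()]).fit(children, salary)
print(pipe.predict([1, 7]))
```

Because the scaler's mean and standard deviation are stored at fit time, the same pipeline object can be applied to testing data without leaking information from it.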

Page 15: Building Machine Learning Pipelines

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.decomposition import PCA
>>> from sklearn.svm import SVC
>>> # Name each step; the name becomes the prefix of its parameters
>>> clf = Pipeline([('reduce_dim', PCA()), ('svm', SVC())])
>>> # <step>__<parameter> addresses a parameter of one step
>>> clf.set_params(svm__C=10)
Pipeline(steps=[('reduce_dim', PCA()), ('svm', SVC(C=10))])
>>> # sklearn.grid_search was removed in scikit-learn 0.20; use model_selection
>>> from sklearn.model_selection import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
...               svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)

Page 16: Building Machine Learning Pipelines

Extra features:
Configurable data sources
Customized scoring metrics (average, median of results, etc.)
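A customized scoring metric can be as simple as choosing how per-fold results are combined. The helper below is a hypothetical sketch: `aggregate_scores` and the sample numbers are invented for illustration and do not come from the talk.

```python
from statistics import mean, median

def aggregate_scores(fold_scores, how="mean"):
    """Combine per-fold scores into one number using the chosen summary."""
    aggregators = {"mean": mean, "median": median}
    if how not in aggregators:
        raise ValueError(f"unknown aggregator: {how!r}")
    return aggregators[how](fold_scores)

# Made-up per-fold errors; the last fold is an outlier
fold_scores = [10.0, 12.0, 11.0, 40.0]
print(aggregate_scores(fold_scores, "mean"))    # 18.25
print(aggregate_scores(fold_scores, "median"))  # 11.5
```

The median is less sensitive to a single bad fold, which is one reason to make the aggregation configurable.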

Page 17: Building Machine Learning Pipelines

Customize cross-validation based on nature of data

How do you cross-validate on time-series data?
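One widely used answer is an expanding-window split: every test block comes strictly after the data used for training, so the model never sees the future. The function below is a small illustration written for this document. Its name and the equal-sized test blocks are assumptions. scikit-learn offers a comparable splitter as `TimeSeriesSplit`.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) where each test block follows its training data."""
    test_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        # The training window grows by one test block per split
        train_end = n_samples - (n_splits - i + 1) * test_size
        train_idx = list(range(train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx

for train_idx, test_idx in expanding_window_splits(n_samples=8, n_splits=3):
    print(train_idx, test_idx)
# [0, 1] [2, 3]
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
```

Unlike shuffled k-fold, the rows must already be sorted by time before splitting.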

Page 18: Building Machine Learning Pipelines

Why use ML pipelines?

Page 19: Building Machine Learning Pipelines

DRY (Don't Repeat Yourself)

Page 20: Building Machine Learning Pipelines

Libraries with ML Pipelines:
scikit-learn, pandas and Scikit-Mapper
Spark MLlib
Write your own!

Page 21: Building Machine Learning Pipelines

Thanks for coming!

Debidatta Dwibedi, @debidatta

Page 22: Building Machine Learning Pipelines

EXTRA

Page 23: Building Machine Learning Pipelines

UserID  Pet   Children  Salary
1       cat   4         90
2       dog   6         24
3       dog   3         44
4       fish  3         27
5       cat   2         32
6       dog   3         59
7       cat   5         36
8       fish  4         27

Pet: need to binarize this column
Children: might also want to normalize this column

Page 24: Building Machine Learning Pipelines

Is Pet a Cat?  Is Pet a Dog?  Is Pet a Fish?  Normalized number of children
1              0              0                0.21
0              1              0                1.88
0              1              0               -0.63
0              0              1               -0.63
1              0              0               -1.46
0              1              0               -0.63
1              0              0                1.04
0              0              1                0.21