Machine Learning Part 1. Introduction

Page 1: Machine Learning - Introduction

Machine Learning
Part 1. Introduction

Page 2: Machine Learning - Introduction

Classifying images

Page 3: Machine Learning - Introduction

Boolean approach: 2000 LOC + a dictionary per language

Probabilistic approach (machine learning): 20 LOC + lots of data, works for all languages

Page 4: Machine Learning - Introduction

Types of ML

Page 5: Machine Learning - Introduction

Supervised learning

Page 6: Machine Learning - Introduction

Supervised learning

[Figure: scatter plot of labeled points from two classes, x and o]

Page 7: Machine Learning - Introduction

Prepare

Install Anaconda: https://conda.io/docs/install/quick.html

Update: conda update conda

Create env: conda create --name <envname> python=3

Switch to env: source activate <envname>

Install libraries (inside the activated env, no sudo needed): pip install numpy scipy matplotlib ipython scikit-learn pandas pillow

Page 8: Machine Learning - Introduction

# load breast cancer data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

# split data into train & test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# use the k-neighbors algorithm to perform classification
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

Page 9: Machine Learning - Introduction

# predict cancer on test data
clf.predict(X_test)

# check accuracy (fraction of correctly classified test samples)
clf.score(X_test, y_test)
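The slides fix n_neighbors=3. As a rough sketch of how that choice could be checked, the loop below repeats the same fit and score steps for several values of k and reports train and test accuracy. The range 1 to 10 and the `scores` dictionary are assumptions made for illustration and are not part of the original slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# same data and split as in the slides
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# hypothetical sweep: the range 1..10 is an assumption, not from the slides
scores = {}
for k in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    # store (train accuracy, test accuracy) for this k
    scores[k] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for k, (train_acc, test_acc) in scores.items():
    print("k={:2d}  train accuracy={:.3f}  test accuracy={:.3f}".format(
        k, train_acc, test_acc))
```

A large gap between train and test accuracy at small k is a common sign of overfitting, which is one reason to look at both numbers.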

Page 10: Machine Learning - Introduction
Page 11: Machine Learning - Introduction

Unsupervised learning

Page 12: Machine Learning - Introduction

Unsupervised learning

[Figure: scatter plot of unlabeled points (o) forming two groups, marked X1 and X2]

Page 13: Machine Learning - Introduction

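# k-means algorithm
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2, random_state=0)

# cluster the training data; the labels in y_train are not used for fitting
labels_km = km.fit_predict(X_train)

# compare the cluster assignments with the true labels
print(labels_km)
print(y_train)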
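Comparing the two printed arrays by eye is hard, and cluster ids are arbitrary (cluster 0 may correspond to class 1). A minimal sketch of a numeric comparison follows, assuming the breast cancer data from the earlier slides. The `agreement` helper value and the use of `adjusted_rand_score` are additions for illustration, not part of the slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import adjusted_rand_score

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# n_init is set explicitly so the behaviour does not depend on the library default
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels_km = km.fit_predict(X)  # y is never shown to the algorithm

# cluster ids are arbitrary, so check both possible id-to-class mappings
agreement = max(np.mean(labels_km == y), np.mean(labels_km != y))

# the adjusted Rand index is invariant to how clusters are numbered
ari = adjusted_rand_score(y, labels_km)

print("best-case agreement with true labels: {:.3f}".format(agreement))
print("adjusted Rand index: {:.3f}".format(ari))
```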

Page 14: Machine Learning - Introduction

Type of learning?

Page 15: Machine Learning - Introduction

Algorithm cheat sheet

Page 16: Machine Learning - Introduction

Data is key
How to prepare it for ML?

Page 17: Machine Learning - Introduction

Typical tasks

Categorical data -> one-hot encoding (dummy variables)

Multidimensional data -> scaling

Too many features -> Principal Component Analysis (PCA)

Text -> bag-of-words

Page 18: Machine Learning - Introduction

One-hot-encoding

# of flights | account          | # of days since join | features
150          | google, facebook | 300                  | gmail_parsed_success
200          | icloud           | 600                  | gmail_parsed_success
1            | live             | 0                    |
3            | google           | 1                    |

Page 19: Machine Learning - Introduction

One-hot-encoding

account          | has_google | has_facebook | has_icloud | has_live
google, facebook | 1          | 1            | 0          | 0
icloud           | 0          | 0            | 1          | 0
live             | 0          | 0            | 0          | 1
google           | 1          | 0            | 0          | 0

Page 20: Machine Learning - Introduction

One-hot-encoding

# use pandas; data is assumed to be a pandas DataFrame with categorical columns
import pandas as pd
data_dummies = pd.get_dummies(data)
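One detail worth noting: pd.get_dummies treats each distinct string as one category, so a cell such as "google, facebook" would become a single column of its own. The sketch below shows one way to get the has_google / has_facebook layout from the table, using Series.str.get_dummies to split on the separator. The DataFrame contents mirror the slide's table, while the column name `flights` and the variable names are assumptions for illustration.

```python
import pandas as pd

# small frame mirroring the slide's table (column names are illustrative)
data = pd.DataFrame({
    "flights": [150, 200, 1, 3],
    "account": ["google, facebook", "icloud", "live", "google"],
})

# split multi-valued cells on ", " and create one indicator column per value
account_dummies = (
    data["account"].str.get_dummies(sep=", ").add_prefix("has_").astype(int)
)

# replace the original text column with the indicator columns
encoded = pd.concat([data.drop(columns="account"), account_dummies], axis=1)
print(encoded)
```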

Page 21: Machine Learning - Introduction

Scaling

# min-max scaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)

# scale data
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled)
print(X_train)
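To make the transformation concrete, here is a small sketch with made-up numbers (the arrays are hypothetical, not from the slides). It checks that MinMaxScaler computes (x - min) / (max - min) per column, and shows that a test set scaled with the training statistics can fall outside the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical data: two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.5, 200.0], [4.0, 600.0]])

# learn per-column min and max from the training data only
scaler = MinMaxScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training min/max

# the same transformation written out by hand
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)

print(X_train_scaled)
print(X_test_scaled)
```

Fitting the scaler on the training set and only transforming the test set keeps both sets on the same scale, which is why the second test row ends up above 1.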

Page 22: Machine Learning - Introduction

PCA (eigenfaces)

Page 23: Machine Learning - Introduction

# load the Labeled Faces in the Wild dataset
from sklearn.datasets import fetch_lfw_people
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

# display 10 faces
image_shape = people.images[0].shape
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])
plt.show()

# if the plot isn't displayed, call plt.ion() or create a matplotlibrc file in ~/.matplotlib/ containing 'backend: TkAgg'

Page 24: Machine Learning - Introduction

# apply k-neighbors & estimate score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(
    people.data, people.target, stratify=people.target, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

without PCA

Page 25: Machine Learning - Introduction

# apply PCA and then KNN
from sklearn.decomposition import PCA
# fit PCA on the training data only, then project both sets onto 100 components
pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_pca, y_train)
knn.score(X_test_pca, y_test)

with PCA
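The face dataset has to be downloaded, so as a quick way to try the same with/without PCA comparison offline, the sketch below swaps in scikit-learn's small bundled digits dataset. The dataset, the choice of 20 components, and the variable names are assumptions for illustration. The resulting scores will differ from the eigenfaces example and PCA is not guaranteed to improve them.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# bundled 8x8 digit images (64 features), used here instead of the faces
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, stratify=digits.target, random_state=0)

# baseline: 1-nearest-neighbor on the raw pixels
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
score_raw = knn.score(X_test, y_test)

# fit PCA on the training data only, then project both sets
pca = PCA(n_components=20, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# same classifier on the reduced representation
knn_pca = KNeighborsClassifier(n_neighbors=1)
knn_pca.fit(X_train_pca, y_train)
score_pca = knn_pca.score(X_test_pca, y_test)

print("without PCA: {:.3f}".format(score_raw))
print("with PCA:    {:.3f}".format(score_pca))
print("variance kept: {:.3f}".format(pca.explained_variance_ratio_.sum()))
```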

Page 26: Machine Learning - Introduction

# display eigenfaces (the first 15 principal components, reshaped to images)
fig, axes = plt.subplots(3, 5, figsize=(15, 12), subplot_kw={'xticks': (), 'yticks': ()})
for i, (component, ax) in enumerate(zip(pca.components_, axes.ravel())):
    ax.imshow(component.reshape(image_shape), cmap='viridis')
    ax.set_title("{}. component".format(i + 1))
plt.show()

Eigenfaces

Page 27: Machine Learning - Introduction

Eigenfaces

Page 28: Machine Learning - Introduction

Bag-of-words

Page 29: Machine Learning - Introduction

Bag-of-words

# vectorize
sentence = ["Hello world"]
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(sentence)

# print vocabulary (maps each word to its column index)
vect.vocabulary_

# apply bag-of-words to sentence
bag_of_words = vect.transform(sentence)
bag_of_words.toarray()
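With a single two-word sentence the resulting matrix is not very telling. The sketch below uses two made-up sentences (an assumption for illustration) to show how CountVectorizer lowercases the text, builds one column per distinct word, and counts occurrences per sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical corpus: two short "documents"
sentences = ["Hello world", "hello hello machine learning world"]

vect = CountVectorizer()
vect.fit(sentences)

# words ordered by their column index in the count matrix
vocabulary = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# one row per sentence, one column per vocabulary word
counts = vect.transform(sentences).toarray()

print(vocabulary)  # ['hello', 'learning', 'machine', 'world']
print(counts)      # rows: [1 0 0 1] and [2 1 1 1]
```

Word order is discarded: only the counts per word survive, which is where the name "bag of words" comes from.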

Page 30: Machine Learning - Introduction

What's next?

Page 31: Machine Learning - Introduction

Exercises

Predict user purchase (User, UserInfo, UserSessionAction)

Find clusters of users (User, UserInfo, UserSessionAction)

Determine if there is free wifi at the airport? (Tip)

Predicting CBP wait times at the airport (regression)

Others?

Page 32: Machine Learning - Introduction

Useful

CSV read: pandas.read_csv

Working with images as numpy arrays: scikit-image

Scikit-learn.org
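As a starting point for the exercises, a minimal sketch of going from CSV text to arrays that scikit-learn can consume is shown below. The column names and values are entirely hypothetical and the CSV is read from an in-memory string so that the example is self-contained. With a real file, a path would be passed to pandas.read_csv instead.

```python
import io

import pandas as pd

# hypothetical CSV content; a real exercise would read from a file path
csv_text = """user_id,sessions,purchased
1,12,1
2,3,0
3,7,1
"""

users = pd.read_csv(io.StringIO(csv_text))

# feature matrix (2-D) and target vector (1-D) in the shape scikit-learn expects
X = users[["sessions"]].values
y = users["purchased"].values

print(users)
print(X.shape, y.shape)
```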