Machine Learning - Introduction

Machine Learning, Part 1. Introduction

Classifying images

Boolean: 2000 LOC + dictionary per language

Probabilistic: 20 LOC + lots of data, all languages

machine learning

Types of ML

Supervised learning

[Figure: scatter plot of points from two labeled classes, x and o]

Prepare

Install Anaconda: https://conda.io/docs/install/quick.html

Update: conda update conda

Create env: conda create --name <envname> python=3

Switch to env: source activate <envname>

Install libraries: pip install

numpy

scipy

matplotlib

ipython

scikit-learn

pandas

pillow
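A quick way to confirm the environment is ready is to check that each library can be found by Python. The sketch below is only an illustration: the helper name check_packages is made up, and the mapping reflects the fact that some pip names differ from the import names (scikit-learn is imported as sklearn, pillow as PIL, ipython as IPython).

```python
import importlib.util

# pip package name -> name used in "import ..." (these differ for some packages)
PACKAGES = {
    "numpy": "numpy",
    "scipy": "scipy",
    "matplotlib": "matplotlib",
    "ipython": "IPython",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "pillow": "PIL",
}


def check_packages(packages=PACKAGES):
    """Return {pip_name: True/False} depending on whether the module can be found."""
    return {
        pip_name: importlib.util.find_spec(module_name) is not None
        for pip_name, module_name in packages.items()
    }


if __name__ == "__main__":
    for pip_name, found in check_packages().items():
        print(f"{pip_name:15s} {'ok' if found else 'MISSING'}")
```

Run it inside the activated environment; anything reported as MISSING still needs a pip install.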


# load breast cancer data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

# split data into train & test sets, keeping class proportions (stratify)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# use the k-neighbors algorithm to perform classification
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# predict the class (malignant or benign) for the test data
clf.predict(X_test)

# check accuracy on the test data
clf.score(X_test, y_test)
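To see what the k-neighbors classifier does conceptually, here is a minimal pure-Python sketch: find the k closest training points and take a majority vote. This is not scikit-learn's implementation (which uses faster search structures and its own tie-breaking); the function name knn_predict and the toy data are made up for illustration.

```python
from collections import Counter
import math


def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point, paired with that point's label
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(X_train, y_train)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (made up for illustration): two features, classes 0 and 1
X_toy = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, (1.2, 1.4)))  # query point near the first group
print(knn_predict(X_toy, y_toy, (6.8, 7.0)))  # query point near the second group
```

A small k follows the training data closely; a larger k smooths the decision boundary.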

Unsupervised learning

[Figure: scatter plot of unlabeled points forming two groups; axes X1 and X2]

# k-means algorithm
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2, random_state=0)

# cluster the training data (the labels y_train are not used)
labels_km = km.fit_predict(X_train)

# compare cluster assignments with the true labels
print(labels_km)
print(y_train)
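Comparing the two printed arrays by eye is misleading because cluster ids are arbitrary: cluster 0 may well correspond to class 1. The sketch below, assuming exactly two clusters and 0/1 labels, tries both namings and reports the better agreement. The helper name cluster_agreement is made up; scikit-learn also offers adjusted_rand_score in sklearn.metrics for the general case.

```python
def cluster_agreement(cluster_labels, true_labels):
    """Fraction of points on which two binary labelings agree, ignoring label naming.

    Cluster ids are arbitrary: cluster 0 may correspond to class 1 and vice versa,
    so both assignments are tried and the better one is returned.
    """
    n = len(true_labels)
    same = sum(int(c == t) for c, t in zip(cluster_labels, true_labels))
    return max(same, n - same) / n


# Toy labelings (made up): the clusters are mostly right but named the other way round
clusters = [1, 1, 0, 0, 1]
truth = [0, 0, 1, 1, 1]
print(cluster_agreement(clusters, truth))
```

With the variables from the slide this would be called as cluster_agreement(labels_km, y_train).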

Type of learning?

Algorithm cheat sheet

Data is key

How to prepare it for ML?

Typical tasks

Categorical data -> one-hot-encoding (dummy variable)

Multidimensional data -> scaling

Too many features -> Principal Component Analysis (PCA)

Text -> bag-of-words

One-hot-encoding

# of flights | account          | # of days since join | features
150          | google, facebook | 300                  | gmail_parsed_success
200          | icloud           | 600                  | gmail_parsed_success
1            | live             | 0                    |
3            | google           | 1                    |

One-hot-encoding

account          | has_google | has_facebook | has_icloud | has_live
google, facebook | 1          | 1            | 0          | 0
icloud           | 0          | 0            | 1          | 0
live             | 0          | 0            | 0          | 1
google           | 1          | 0            | 0          | 0

# use pandas; data is assumed to be a DataFrame loaded earlier
import pandas as pd
data_dummies = pd.get_dummies(data)
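Note that get_dummies treats every distinct string as one category, so a value like "google, facebook" would get its own column instead of setting both has_google and has_facebook. A minimal pure-Python sketch of the multi-valued encoding from the table is shown below; the function name one_hot_accounts is made up for illustration. In pandas, Series.str.get_dummies(sep=", ") should give a similar result.

```python
def one_hot_accounts(rows, sep=","):
    """Turn strings like "google, facebook" into has_<name> indicator columns."""
    # split each row into a set of trimmed account names
    parsed = [{name.strip() for name in row.split(sep) if name.strip()} for row in rows]
    # one column per distinct account name, in first-seen order
    columns = []
    for row in rows:
        for name in (n.strip() for n in row.split(sep)):
            if name and name not in columns:
                columns.append(name)
    header = [f"has_{name}" for name in columns]
    # 1 if the account name is present in the row, else 0
    table = [[int(name in names) for name in columns] for names in parsed]
    return header, table


accounts = ["google, facebook", "icloud", "live", "google"]
header, table = one_hot_accounts(accounts)
print(header)
for account, row in zip(accounts, table):
    print(account, row)
```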


Scaling

# min-max scaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)

# scale data
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled)
print(X_train)
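The arithmetic behind min-max scaling is simple: each feature is mapped with (x - min) / (max - min), where min and max come from the training data only. The sketch below spells this out in pure Python; the function names and toy values are made up, and mapping constant features to 0 is a choice made here for the sketch.

```python
def fit_min_max(rows):
    """Learn the per-feature minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Rescale each feature with (x - min) / (max - min); constant features map to 0."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Toy values (made up): two features on very different scales
train = [[150, 300], [200, 600], [1, 0], [3, 1]]
test = [[100, 450], [250, 30]]

mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)
print(train_scaled)
print(test_scaled)
```

Training values land in the range 0 to 1. Test values can fall outside it (here 250 flights exceeds the training maximum), which is expected because the scaler must not be refit on test data.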

PCA (eigenfaces)

# load the Labeled Faces in the Wild dataset
from sklearn.datasets import fetch_lfw_people
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

# display 10 faces
image_shape = people.images[0].shape
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])
plt.show()

# use plt.ion() if the plot isn't displayed, or create matplotlibrc in ~/.matplotlib/ with the line 'backend: TkAgg'

# apply k-neighbors & estimate score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(
    people.data, people.target, stratify=people.target, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

without PCA

# apply PCA and then KNN
from sklearn.decomposition import PCA
pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_pca, y_train)
knn.score(X_test_pca, y_test)

with PCA

# display eigenfaces
fig, axes = plt.subplots(3, 5, figsize=(15, 12), subplot_kw={'xticks': (), 'yticks': ()})
for i, (component, ax) in enumerate(zip(pca.components_, axes.ravel())):
    ax.imshow(component.reshape(image_shape), cmap='viridis')
    ax.set_title("{}. component".format(i + 1))
plt.show()
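Each row of pca.components_ is a direction in feature space along which the data varies most. To make that idea concrete, the sketch below finds the first such direction for a few 2-D points using power iteration on the covariance matrix. This is an illustrative stand-in, not what scikit-learn does internally (its PCA is based on SVD); the function name and toy points are made up.

```python
import math


def first_principal_component(rows, iterations=100):
    """Approximate the direction of largest variance with power iteration."""
    n, d = len(rows), len(rows[0])
    # center the data: subtract the per-feature mean
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # sample covariance matrix (d x d)
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Repeatedly multiply a vector by the covariance matrix and renormalize;
    # this converges to the eigenvector with the largest eigenvalue, provided
    # the start vector is not exactly orthogonal to it.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # the data has no variance at all
            break
        v = [x / norm for x in w]
    return v


# Toy points (made up) that lie close to the line y = x
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
direction = first_principal_component(points)
print(direction)
```

For these points the direction comes out close to the diagonal, so a single number (the position along that line) captures almost all of the variation. Eigenfaces are the same idea with one dimension per pixel.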

Eigenfaces


Bag-of-words

# vectorize
sentence = ["Hello world"]
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(sentence)

# print vocabulary
vect.vocabulary_

# apply bag-of-words to sentence
bag_of_words = vect.transform(sentence)
bag_of_words.toarray()
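What the vectorizer does can be sketched in a few lines of pure Python: lowercase the text, keep words of two or more characters, assign each distinct word a column in alphabetical order, then count. This mirrors CountVectorizer's default behavior as far as I know, but it is only a sketch; the function names and the second toy sentence are made up.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


def fit_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Count how often each vocabulary word occurs in each document."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # the vocabulary dict was built in index order, so iterating it yields columns in order
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


docs = ["Hello world", "Hello hello machine learning world"]
vocabulary = fit_vocabulary(docs)
print(vocabulary)
print(transform(docs, vocabulary))
```

Word order is discarded entirely; only the counts survive, which is where the name "bag" comes from.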

What's next?

Exercises

Predict user purchase (User, UserInfo, UserSessionAction)

Find clusters of users (User, UserInfo, UserSessionAction)

Determine whether there is free Wi-Fi at the airport (Tip)

Predicting CBP wait times at the airport (regression)

Others?

Useful

CSV read: pandas.read_csv

Working with images as numpy arrays: scikit-image

Scikit-learn.org
