Machine Learning
Part 1. Introduction
Classifying images
Boolean: 2000 LOC + dictionary per language
Probabilistic (machine learning): 20 LOC + lots of data, all languages
Types of ML
Supervised learning
[Figure: scatter plot of points from two labeled classes, x and o]
Prepare
Install Anaconda: https://conda.io/docs/install/quick.html
Update: conda update conda
Create env: conda create --name <envname> python=3
Switch to env: source activate <envname>
Install libraries (inside the env, no sudo needed): pip install numpy scipy matplotlib ipython scikit-learn pandas pillow
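After installation, a quick check along these lines can confirm that the environment sees every library. This is a minimal sketch, not part of the original slides; the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as installed with pip (taken from the list above).
PACKAGES = ["numpy", "scipy", "matplotlib", "ipython",
            "scikit-learn", "pandas", "pillow"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name:15s} {ver or 'MISSING'}")
```

Any line that prints MISSING points to a library that still needs to be installed in the active env.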
# load breast cancer data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
# split data into train & test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=66)
# use the k-neighbors algorithm to perform classification
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
# predict cancer on test data
clf.predict(X_test)
# check accuracy on test data
clf.score(X_test, y_test)
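The slide fixes n_neighbors at 3. One way to see how that choice matters is to train the same classifier for several values of k and compare accuracy on the training and test sets. The sketch below assumes the same dataset and split as above; the range 1 to 10 is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# Train one classifier per value of k and record (train, test) accuracy.
results = {}
for k in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for k, (train_acc, test_acc) in results.items():
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A large gap between training and test accuracy at small k suggests overfitting; the two numbers usually move closer together as k grows.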
Unsupervised learning
[Figure: unlabeled points (o) plotted against axes X1 and X2]
# KMeans algorithm
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2, random_state=0)
# cluster the training data (the labels y_train are not used)
labels_km = km.fit_predict(X_train)
# compare cluster assignments with the true labels
print(labels_km)
print(y_train)
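Comparing the two printed arrays by eye is hard, partly because cluster ids are arbitrary: cluster 0 does not have to correspond to class 0. A permutation-invariant score such as the adjusted Rand index is one option. The sketch below is an addition to the slides; it clusters the full breast cancer dataset, and n_init is set explicitly only because its default differs between scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import adjusted_rand_score

cancer = load_breast_cancer()

# Cluster without looking at the labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels_km = km.fit_predict(cancer.data)

# 1.0 means the clusters match the classes exactly (up to renaming);
# values near 0 mean the grouping is no better than random.
ari = adjusted_rand_score(cancer.target, labels_km)
print(f"adjusted Rand index: {ari:.2f}")
```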
Type of learning?
Algorithm cheat sheet
Data is key
How to prepare it for ML?
Typical tasks
Categorical data -> one-hot-encoding (dummy variable)
Multidimensional data -> scaling
Too many features -> Principal Component Analysis (PCA)
Text -> bag-of-words
One-hot-encoding
# of flights   account            # of days since join   features
150            google, facebook   300                    gmail_parsed_success
200            icloud             600                    gmail_parsed_success
1              live               0
3              google             1
account            has_google   has_facebook   has_icloud   has_live
google, facebook   1            1              0            0
icloud             0            0              1            0
live               0            0              0            1
google             1            0              0            0
# use pandas
import pandas as pd
data_dummies = pd.get_dummies(data)
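Note that pd.get_dummies treats each distinct string as one category, so a cell such as "google, facebook" would become a single column of its own. To get the has_google / has_facebook layout from the account example, one option is Series.str.get_dummies, which splits each cell on a separator first. The sketch below is an illustration, not the original author's code; the DataFrame and the column name flights are assumptions based on the example data.

```python
import pandas as pd

# Toy data modeled on the account example above.
data = pd.DataFrame({
    "flights": [150, 200, 1, 3],
    "account": ["google, facebook", "icloud", "live", "google"],
})

# Split each cell on ", " and emit one 0/1 column per distinct value.
indicators = data["account"].str.get_dummies(sep=", ").add_prefix("has_")

# Replace the text column with the indicator columns.
encoded = pd.concat([data.drop(columns="account"), indicators], axis=1)
print(encoded)
```

The indicator columns come out in alphabetical order (has_facebook, has_google, has_icloud, has_live).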
Scaling
# min-max scaler: rescales each feature to the range [0, 1]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
# scale data
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled)
print(X_train)
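What the scaler computes is x' = (x - min) / (max - min) per feature, with min and max taken from the training data only. The sketch below writes that formula out by hand on a tiny made-up array and checks it against MinMaxScaler; the numbers are illustrative, not from the slides.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features on very different scales.
X_train = np.array([[150.0, 300.0], [200.0, 600.0], [1.0, 0.0], [3.0, 1.0]])
X_test = np.array([[100.0, 900.0]])

scaler = MinMaxScaler().fit(X_train)

# The same transform by hand, using training statistics only.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual_train = (X_train - col_min) / (col_max - col_min)
manual_test = (X_test - col_min) / (col_max - col_min)

print(np.allclose(manual_train, scaler.transform(X_train)))
print(scaler.transform(X_test))  # test values may fall outside [0, 1]
```

The test set is transformed with the scaler fitted on the training set, so a test value larger than the training maximum maps to a number above 1.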
PCA (eigenfaces)
# load the Labeled Faces in the Wild dataset
from sklearn.datasets import fetch_lfw_people
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
# display 10 faces
image_shape = people.images[0].shape
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])
plt.show()
# use plt.ion() if the plot isn't displayed, or create .matplotlibrc in ./.matplotlib/ with the text 'backend: TkAgg'
# apply k-neighbors & estimate score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(people.data, people.target, stratify=people.target, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)  # accuracy without PCA
# apply PCA and then KNN
from sklearn.decomposition import PCA
pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_pca, y_train)
knn.score(X_test_pca, y_test)  # accuracy with PCA
# display eigenfaces (the first 15 principal components, reshaped to images)
fig, axes = plt.subplots(3, 5, figsize=(15, 12), subplot_kw={'xticks': (), 'yticks': ()})
for i, (component, ax) in enumerate(zip(pca.components_, axes.ravel())):
    ax.imshow(component.reshape(image_shape), cmap='viridis')
    ax.set_title("{}. component".format(i + 1))
plt.show()
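The PCA step above uses n_components=100 without saying how that number was picked. One common way to choose is to look at the cumulative explained variance. The sketch below is an addition to the slides and swaps in the small digits dataset bundled with scikit-learn (instead of Labeled Faces in the Wild) so that it runs without a download; the 95% threshold is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()  # 8x8 pixel images, i.e. 64 features per sample

# Fit PCA with all components, then accumulate the explained variance.
pca_full = PCA(random_state=0).fit(digits.data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components that keeps at least 95% of the variance.
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"{n_95} of {digits.data.shape[1]} components keep 95% of the variance")
```

The same check can be run on the faces data by fitting PCA on X_train from the face example and inspecting explained_variance_ratio_ in the same way.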
Eigenfaces
Bag-of-words
# vectorize
sentence = ["Hello world"]
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(sentence)
# print vocabulary (maps each word to its column index)
vect.vocabulary_
# apply bag-of-words to sentence
bag_of_words = vect.transform(sentence)
bag_of_words.toarray()
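With a single two-word sentence the result is just a row of ones. A second, made-up sentence with a repeated word shows that bag-of-words stores word counts and ignores word order. This is a small sketch that extends the example above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The second sentence is invented for illustration.
sentences = ["Hello world", "hello hello machine learning"]

vect = CountVectorizer()  # lowercases the text by default
counts = vect.fit_transform(sentences).toarray()

# Order the words by their column index in the count matrix.
vocabulary = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocabulary)  # ['hello', 'learning', 'machine', 'world']
print(counts)      # [[1 0 0 1]
                   #  [2 1 1 0]]
```

Each row is one sentence and each column is one vocabulary word, so "hello" counts 2 in the second row.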
What's next?
Exercises
Predict user purchase (User, UserInfo, UserSessionAction)
Find clusters of users (User, UserInfo, UserSessionAction)
Determine whether there is free Wi-Fi at the airport (Tip)
Predicting CBP wait times at the airport (regression)
Others?
Useful
CSV read: pandas.read_csv
Working with images as numpy arrays: scikit-image
Scikit-learn.org