
Machine Learning Part 1

KMK Online

Analytic Team | Fajri Koto, Data Scientist

[email protected]

Data Science Weekend

Machine Learning Part 1

Outline

1. Machine Learning at a Glance
2. Vector Representation
3. Evaluation Measurement in Machine Learning
4. Clustering vs Classification
5. K-Means Algorithm
6. Linear Regression

Machine Learning Part 1

Outline

1. Machine Learning at a Glance ✔
2. Vector Representation
3. Evaluation Measurement in Machine Learning
4. Clustering vs Classification
5. K-Means Algorithm
6. Linear Regression

1. Machine Learning at a Glance

DATA
1. CHOOSE FEATURES
2. DATA PREPROCESSING (data distribution, outliers, manual labelling :(, data cleansing, data splitting, etc.)
3. VECTOR REPRESENTATION (TF, IDF, TF-IDF, …)
4. OPTIMIZATION (feature selection, matrix factorization, PCA, etc.)

METHODS / ALGORITHM

1. CHOOSE TYPE (classification, clustering, regression, ranking, association, …)

2. CHOOSE ALGORITHM (linear regression, SVM, decision tree, etc.)

3. EVALUATION (accuracy, precision, recall, F1 score, CTR, etc.)

Data and the Methods

1. Machine Learning at a Glance

Overview of Training stage

1. Machine Learning at a Glance

Overview of Testing stage

Machine Learning Part 1

Outline

1. Machine Learning at a Glance ✔
2. Vector Representation ✔
3. Evaluation Measurement in Machine Learning
4. Clustering vs Classification
5. K-Means Algorithm
6. Linear Regression

2. Vector Representation

TF Vector Space Model

→ The main point is to represent every problem as a set of vectors. Example:

2. Vector Representation

TF in Sklearn Python

→ In Python, the TF vector space model is implemented in scikit-learn as CountVectorizer.

DEMO → count_vectorizer.py
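A minimal sketch of what such a demo could contain (the corpus below is made up for illustration; the actual count_vectorizer.py is not reproduced here):

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up); each document becomes one TF (term-frequency) vector.
docs = [
    "machine learning is fun",
    "learning python is fun",
    "machine learning with python",
]

vectorizer = CountVectorizer()
tf_matrix = vectorizer.fit_transform(docs)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())     # vocabulary (get_feature_names() on older scikit-learn)
print(tf_matrix.toarray())                    # raw term counts per document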

2. Vector Representation

TF-IDF Vector Space Model

→ Another example of a feature extraction result:
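For comparison, a minimal TF-IDF sketch with scikit-learn's TfidfVectorizer (same made-up corpus as above; not part of the original demo files):

from sklearn.feature_extraction.text import TfidfVectorizer

# Same toy corpus; counts are now re-weighted by inverse document frequency.
docs = [
    "machine learning is fun",
    "learning python is fun",
    "machine learning with python",
]

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)

# Terms that occur in many documents get lower weights than rare terms.
print(tfidf_matrix.toarray().round(2))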

Machine Learning Part 1

Outline

1. Machine Learning at a Glance ✔
2. Vector Representation ✔
3. Evaluation Measurement in Machine Learning ✔
4. Clustering vs Classification
5. K-Means Algorithm
6. Linear Regression

3. Evaluation measurement

Machine Learning Overview

3. Evaluation measurement

Evaluation in General

3. Evaluation measurement

K-Fold cross validation

3. Evaluation measurement

K-Fold cross validation

3. Evaluation measurement

K-Fold cross validation (in Rapid Miner)

3. Evaluation measurement

K-Fold cross validation - Sampling

1. Linear Sampling

Linear sampling simply divides the ExampleSet into partitions without changing the order of the examples, i.e., subsets with consecutive examples are created.

2. Shuffled Sampling

Shuffled sampling builds random subsets of the ExampleSet; examples are chosen at random to form the subsets.

3. Stratified Sampling ✔

Stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole ExampleSet. A scikit-learn equivalent of these three options is sketched below.
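A minimal scikit-learn sketch of the three sampling options above (X and y are made up; RapidMiner's stratified default corresponds to StratifiedKFold):

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data (made up): 8 examples, balanced binary labels.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

linear     = KFold(n_splits=4, shuffle=False)                           # consecutive examples per fold
shuffled   = KFold(n_splits=4, shuffle=True, random_state=0)            # random folds
stratified = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)  # random folds, class ratio preserved

for name, splitter in [("linear", linear), ("shuffled", shuffled), ("stratified", stratified)]:
    print(name)
    for train_idx, test_idx in splitter.split(X, y):
        print("  test fold:", test_idx, "labels:", y[test_idx])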

3. Evaluation measurement

K-Fold cross validation

3. Evaluation measurement

Confusion Matrix

              Predicted
              0    1
Actual   0    1    2
         1    1    4

Pred: 1 0 1 0 1 1 1 1
Act : 1 1 1 0 0 1 0 1

3. Evaluation measurement

Precision, Recall, Accuracy

3. Evaluation measurement

Precision, Recall, Accuracy → Practice

Given the predicted and actual labels of a binary classification task as lists,

please measure:

1. Confusion Matrix
2. Accuracy
3. Precision
4. Recall

Please use the scikit-learn Python library.

FILES:
- data1.csv
- evaluate.py
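A minimal sketch of what evaluate.py could look like, assuming the labels are the ones from the confusion-matrix slide above (the real data1.csv is not reproduced here, so the lists are hard-coded):

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Labels copied from the confusion-matrix slide; in the exercise they
# would be read from data1.csv instead.
y_pred = [1, 0, 1, 0, 1, 1, 1, 1]
y_true = [1, 1, 1, 0, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))               # [[1 2]
                                                      #  [1 4]]
print("accuracy :", accuracy_score(y_true, y_pred))   # 5/8 = 0.625
print("precision:", precision_score(y_true, y_pred))  # 4/6 ≈ 0.667
print("recall   :", recall_score(y_true, y_pred))     # 4/5 = 0.8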

Machine Learning Part 1

Outline

1. Machine Learning at glance ✔2. Vector Representation ✔3. Evaluation Measurement in Machine Learning ✔4. Clustering vs Classification ✔5. K-Means Algorithm6. Linear Regression

Classification:
→ Grouping data based on the given classes/labels

Clustering:
→ Grouping data based on the data itself, without considering any class/label

Clustering algorithms divide a data set into natural groups (clusters). Instances in the same cluster are similar to each other; they share certain properties.

4. Clustering vs Classification

4. Clustering vs Classification

Example of Clustering

4. Clustering vs Classification

4. Clustering vs Classification

Supervised vs Unsupervised Classification

Supervised learning:

Classification requires supervised learning, i.e., the training data has to specify what we are trying to learn (the classes).

Unsupervised learning:

Clustering is an unsupervised task, i.e., the training data does not specify what we are trying to learn (the clusters).

4. Clustering vs Classification

Supervised vs Unsupervised Learning

Unsupervised / Clustering:
1. K-Means (✔)
2. HCS Clustering
3. Canopy
4. DBSCAN
5. Fuzzy Clustering
6. K-SVD
7. Pitman

More than 100 approaches

Supervised / Classification:
1. Naive Bayes
2. Linear Regression (✔)
3. Decision Tree
4. Random Forest
5. Support Vector Machine
6. Neural Network
7. Deep Neural Network

4. Clustering vs Classification

Machine Learning Part 1

Outline

1. Machine Learning at a Glance ✔
2. Vector Representation ✔
3. Evaluation Measurement in Machine Learning ✔
4. Clustering vs Classification ✔
5. K-Means Algorithm ✔
6. Linear Regression

Why Clustering

5. K-Means Algorithm

When Clustering

5. K-Means Algorithm

The K-Means Algorithm
Centroid-based clustering

5. K-Means Algorithm

The K-Means Algorithm

Complexity?

5. K-Means Algorithm

The K-Means Algorithm

5. K-Means Algorithm

The K-Means Algorithm

5. K-Means Algorithm

The K-Means Algorithm: Example

DATA
Step 1: Pick the centroids randomly
Step 2: Iteration
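A minimal scikit-learn sketch of these two steps (the 2-D points are made up; KMeans repeats the assign/update iteration internally until the centroids stop moving):

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points (made up): two clearly separated blobs.
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

# n_init = number of random centroid initialisations to try (Step 1).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)          # Step 2 happens inside fit()

print(labels)                           # cluster index of each point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)          # final centroids after the iterations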

5. K-Means Algorithm

Machine Learning Part 1

Outline

1. Machine Learning at a Glance ✔
2. Vector Representation ✔
3. Evaluation Measurement in Machine Learning ✔
4. Clustering vs Classification ✔
5. K-Means Algorithm ✔
6. Linear Regression ✔

Regression VS Classification

- Regression is used to predict continuous values.

- Classification is used to predict which class a data point belongs to (a discrete value).

6. Linear Regression

Regression

6. Linear Regression

Model Representation

6. Linear Regression

How to Determine β₀ and β₁?

Simple Matrix Multiplication

β = X⁻¹ · y
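A minimal NumPy sketch of this matrix form, assuming a square, invertible X whose column of ones absorbs the intercept β₀ (the numbers are made up):

import numpy as np

# Two points are enough to determine β0 and β1 exactly.
# Each row of X is [1, x], so β = X⁻¹ · y.
X = np.array([[1.0, 1.0],
              [1.0, 3.0]])
y = np.array([5.0, 9.0])

beta = np.linalg.inv(X) @ y    # [β0, β1]
print(beta)                    # [3. 2.]  →  y = 3 + 2x

# With more samples than unknowns, the least-squares solution is used instead:
# beta, *_ = np.linalg.lstsq(X, y, rcond=None)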

6. Linear Regression

How to Determine β0 and β1 ?

Stochastic Gradient Descent

βₜ = βₜ₋₁ − α · e

(where α is the learning rate and e is the gradient of the error with respect to β at the previous step)
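A minimal NumPy sketch of this update rule for simple linear regression with a squared-error loss (the data and learning rate are made up; this is not the exact code from the talk):

import numpy as np

# Made-up 1-D data roughly following y = 3 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.0, 6.9, 9.2, 11.0])

beta0, beta1 = 0.0, 0.0   # initial parameters
alpha = 0.01              # learning rate

for epoch in range(200):
    for xi, yi in zip(x, y):
        error = (beta0 + beta1 * xi) - yi   # prediction error for one sample
        beta0 -= alpha * error              # update β0 along the gradient of the squared error
        beta1 -= alpha * error * xi         # update β1 along its gradient

print(beta0, beta1)   # approaches ≈ 3 and ≈ 2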

6. Linear Regression

Linear Regression On Sklearn

6. Linear Regression

Linear Regression On Sklearn

>>> data = [[1,2,3],[3,2,1],[2,3,1]]
>>> label = [50, 65, 70]
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> lr.fit(data, label)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> lr.coef_
array([ 0.83333333,  5.83333333, -6.66666667])
>>> lr.intercept_
57.499999999999993

6. Linear Regression

Evaluating Linear Regression

R² (Coefficient of Determination)

6. Linear Regression

Evaluating Linear Regression

R² (Coefficient of Determination)
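As a reminder, sklearn's score method for regression returns the standard coefficient of determination (this formula is not from the original slides):

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

R² equals 1 for a perfect fit, 0 for a model no better than always predicting the mean ȳ, and becomes negative when the model fits worse than that constant baseline, which is what happens in the example below.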

>>> data_test = [[1,2,1],[1,1,3],[2,2,1]]
>>> label_test = [45, 60, 68]
>>> lr.predict(data_test)
array([ 63.33,  44.16,  64.17])
>>> lr.score(data_test, label_test)
-1.2059902200489008

6. Linear Regression

Thank You

Questions?

KMK Online

Analytic Team | Fajri Koto, Data Scientist

[email protected]

Machine Learning Part 1