haliem
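Schedule
Day 1, Day 2, Day 3, Day 4
Morning
• Introduction: big data background and concepts, big data platforms
• Data analysis concepts and process 1: CRISP-DM, analysis strategy (goals, hypotheses, indicator system), analysis tools
• Basic statistical theory: descriptive and inferential statistics
• Data collection overview: Excel, SQL/NoSQL
• Analysis process 2: modeling overview, bias-variance trade-off, resampling
• Statistical modeling 3: nonlinear models, linear algebra and multivariate analysis
• Data cleaning and EDA: theory and practice
• Machine learning 3: neural networks, clustering, association analysis
• Model development 3: model evaluation, performance tuning
Afternoon
• Lab: environment setup (R, RStudio), R basics
• R data structures, writing functions
• Statistical analysis with R: modeling 1
• Statistical modeling 2: regression analysis, model selection and regularization, time series analysis
• Machine learning 1: KNN, decision trees
• Machine learning 2: SVM, Naïve Bayes
• Visualization
• Big data platforms: Hadoop, Spark
• Wrap-up: cloud, DL
Tracks: big data concepts and analysis platforms; data analysis concepts and modeling; statistical analysis; machine learning; the R language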
Day 3
Linear Algebra Basics
Matrix
– Square matrix
• has the same number of rows as columns
– Transpose
• created by converting the rows of a matrix into columns
• Matrix product
• Identity matrix
– $AI = A$
• Orthogonal matrix
– A matrix $A$ is orthogonal if $AA^T = A^TA = I$.
Vectors
• Concept
– a vector is a point; the number of components is its dimension
• Length
• Vector operations
– Addition
– Scalar multiplication
• if $v = [3, 6, 8, 4]$, then $1.5v = 1.5 \cdot [3, 6, 8, 4] = [4.5, 9, 12, 6]$
– Inner product
• = dot product = scalar product
• Orthogonality
– two vectors are orthogonal (perpendicular) when their inner product is 0
• Normal vector
• Orthonormal vectors
– vectors of unit length that are orthogonal to each other
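The vector operations above can be tried directly in base R. The following is a minimal sketch; the vectors `a` and `b` are hypothetical values chosen only because they are perpendicular, and are not from the slides.

```r
# Scalar multiplication, using the vector from the slide
v <- c(3, 6, 8, 4)
scaled <- 1.5 * v                 # 4.5 9 12 6
len_v <- sqrt(sum(v^2))           # Euclidean length of v

# Hypothetical pair of perpendicular vectors (assumption, not from the slides)
a <- c(1, 2)
b <- c(-2, 1)
inner <- sum(a * b)               # inner (dot) product; 0 means orthogonal

# Scale both to unit length to get an orthonormal pair
a_unit <- a / sqrt(sum(a^2))
b_unit <- b / sqrt(sum(b^2))

# With orthonormal columns, t(Q) %*% Q is the identity matrix
Q <- cbind(a_unit, b_unit)
print(round(t(Q) %*% Q, 10))
```

Because `Q` is square with orthonormal columns, it is also an orthogonal matrix in the sense defined earlier.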
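Eigenvectors and Eigenvalues
• Eigenvector
– a nonzero vector $\vec{v}$ that satisfies $A\vec{v} = \lambda\vec{v}$, where $A$ is a square matrix, $\vec{v}$ is an eigenvector, and $\lambda$ is the corresponding eigenvalue
• Finding eigenvectors and eigenvalues
• Eigendecomposition
– diagonalization by eigenvalue decomposition (possible only for square matrices)
– matrix product with a diagonal matrix
• SVD (singular value decomposition)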
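As a small illustration of the eigenvector definition and of eigendecomposition, here is a sketch using base R's `eigen()`. The matrix `A` is a hypothetical example, not one taken from the slides.

```r
# Hypothetical symmetric 2 x 2 matrix (assumption for illustration)
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

e <- eigen(A)
lambda <- e$values       # eigenvalues, largest first
V <- e$vectors           # eigenvectors as columns, each of unit length

# Check the defining relation A v = lambda v for the first pair
lhs <- as.vector(A %*% V[, 1])
rhs <- lambda[1] * V[, 1]

# Eigendecomposition: A = V diag(lambda) V^{-1}
A_rebuilt <- V %*% diag(lambda) %*% solve(V)
print(lambda)
print(round(A_rebuilt, 10))
```

`eigen()` returns the eigenvalues in decreasing order, which matches the ordering convention used later for SVD.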
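Multivariate Statistical Analysis
Statistics and Vector Concepts
• Raw-score, deviation-score, and standard-score vectors
– "centered" = the raw score $X$ minus the mean $\bar{X}$ (deviation score)
– "centered & scaled" = centered score / standard deviation ($s$) = standard score ($z$)
– The standard deviation as a vector concept
• standard deviation of a variable $X$: $s = \sqrt{\frac{\sum (X-\bar{X})^2}{N-1}} = \frac{\sqrt{\sum (X-\bar{X})^2}}{\sqrt{N-1}}$
– the numerator is the length of the deviation-score vector, so it reflects the variability of the variable
• relation between the length of the deviation-score vector and the standard deviation: $\lVert X - \bar{X} \rVert = \sqrt{N-1}\, s$
• that is, z-standardization sets the length of every variable vector to $\sqrt{N-1}$
Subject   Raw score (X)   Deviation score (X − X̄)   Standard score (z)
1         15              0                          0
2         12              −3                         −1
3         18              3                          1
Mean      15              0                          0
s         3               3                          1
Source: Park Kwang-bae, 『다변량분석』 (Multivariate Analysis), Hakjisa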
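The relation between vector length and standard deviation can be checked numerically with the three scores from the table. This is a minimal sketch in base R; the helper name `vec_len` is introduced here for illustration only.

```r
X <- c(15, 12, 18)                 # raw scores from the table
N <- length(X)

centered <- X - mean(X)            # deviation scores: 0 -3 3
s <- sd(X)                         # sample standard deviation (divides by N - 1)
z <- centered / s                  # standard scores: 0 -1 1

# Hypothetical helper: Euclidean length of a vector
vec_len <- function(v) sqrt(sum(v^2))

print(vec_len(centered))           # length of the deviation-score vector
print(sqrt(N - 1) * s)             # same value, by the relation above
print(vec_len(z))                  # sqrt(N - 1) after z-standardization
```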
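– The correlation coefficient as a vector concept
Subject   A    B
1         15   21
2         12   16
3         18   19
Subject   z_A   z_B
1         0     0.92
2         −1    −1.06
3         1     0.13
– $\lVert z_A \rVert = \sqrt{(0)^2+(-1)^2+(1)^2} = 1.414$
– $\lVert z_B \rVert = \sqrt{(0.92)^2+(-1.06)^2+(0.13)^2} \approx 1.414$
» the two standardized vectors have the same length, and the correlation coefficient of the two variables is the cosine of the angle between them: $r = \cos\theta$
• Linear combinations and data analysis
– projection point
– variable C: the angle of the linear-combination axis C is determined by the composite weights
Subject   Variable A   Variable B   Variable C = 2A + 0.8B
1         1            3            4.4
2         2            2            5.6
• Standardization of the weights
• $\frac{2}{\sqrt{2^2+0.8^2}} = 0.9285$ and $\frac{0.8}{\sqrt{2^2+0.8^2}} = 0.3714$; the sum of their squares is 1
• also, $\cos\theta_A = 0.9285$ and $\cos\theta_B = 0.3714$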
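The claim that the correlation coefficient equals the cosine of the angle between the standardized vectors can be verified with the A and B scores from the table. The sketch below uses only base R and also reproduces the standardized composite weights.

```r
A <- c(15, 12, 18)
B <- c(21, 16, 19)

# Standard scores (scale() centers and divides by the sample sd)
zA <- as.vector(scale(A))
zB <- as.vector(scale(B))

# Cosine of the angle between the two standardized vectors
cos_theta <- sum(zA * zB) / (sqrt(sum(zA^2)) * sqrt(sum(zB^2)))
r <- cor(A, B)                     # Pearson correlation
print(c(cos_theta = cos_theta, r = r))

# Standardizing the composite weights of C = 2A + 0.8B to unit length
w <- c(2, 0.8)
w_std <- w / sqrt(sum(w^2))
print(round(w_std, 4))             # 0.9285 0.3714
print(sum(w_std^2))                # squares sum to 1
```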
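• SSCP matrix
– Sum-of-Squares and Cross-Products
• $SSCP = A'A$ (written $X'X$ for a data matrix $X$)
• $A'A = \begin{pmatrix} 1 & 2 & 3 \\ -4 & -6 & -2 \\ 3 & 9 & 6 \end{pmatrix} \begin{pmatrix} 1 & -4 & 3 \\ 2 & -6 & 9 \\ 3 & -2 & 6 \end{pmatrix} = \begin{pmatrix} 14 & -22 & 39 \\ -22 & 56 & -78 \\ 39 & -78 & 126 \end{pmatrix}$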
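• Variance-covariance matrix
– Variance
– Covariance
– In the example
• subtract each column's mean from the elements of A, then compute the SSCP of the centered matrix
• SSCP / (number of rows of A) = variance-covariance matrix
• $\frac{1}{3} \begin{pmatrix} -1 & 0 & 1 \\ 0 & -2 & 2 \\ -3 & 3 & 0 \end{pmatrix} \begin{pmatrix} -1 & 0 & -3 \\ 0 & -2 & 3 \\ 1 & 2 & 0 \end{pmatrix} = \begin{pmatrix} 0.667 & 0.667 & 1 \\ 0.667 & 2.667 & -2 \\ 1 & -2 & 6 \end{pmatrix}$
• Correlation matrix
– standardize the elements of A column by column, then compute the SSCP
– (every element of that SSCP matrix) / (number of rows of A) = correlation matrix
– $\frac{1}{3} \begin{pmatrix} -1.225 & 0 & 1.225 \\ 0 & -1.225 & 1.225 \\ -1.225 & 1.225 & 0 \end{pmatrix} \begin{pmatrix} -1.225 & 0 & -1.225 \\ 0 & -1.225 & 1.225 \\ 1.225 & 1.225 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & -0.5 \\ 0.5 & -0.5 & 1 \end{pmatrix}$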
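The three matrices above (SSCP, variance-covariance, correlation) can be reproduced step by step in base R. This sketch follows the slides in dividing by the number of rows n; note that R's built-in `cov()` divides by n − 1 instead, so only the correlation matrix is compared with a built-in.

```r
A <- matrix(c(1, -4, 3,
              2, -6, 9,
              3, -2, 6), nrow = 3, byrow = TRUE)
n <- nrow(A)

# SSCP of the raw data: t(A) %*% A
sscp <- crossprod(A)

# Center each column, then SSCP / n gives the variance-covariance matrix
centered <- sweep(A, 2, colMeans(A))
cov_pop <- crossprod(centered) / n

# Standardize each column (using the same divide-by-n sd), then SSCP / n
sd_pop <- sqrt(diag(cov_pop))
Z <- sweep(centered, 2, sd_pop, "/")
cor_mat <- crossprod(Z) / n

print(sscp)
print(round(cov_pop, 3))
print(round(cor_mat, 3))
```

The correlation matrix does not depend on whether n or n − 1 is used, so `cor_mat` agrees with `cor(A)`.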
Extending the Concept of Distance
• Distance of a point from the mean in univariate space = $|x_i - \bar{x}|$
• Euclidean distance
• Measuring distance in multivariate data
– $\sqrt{(x_i - \bar{x})^2+(y_i - \bar{y})^2+\cdots+(n_i - \bar{n})^2}$
• Euclidean distance
• Definition
• Limitation
– the variables usually have some degree of covariance, which Euclidean distance does not take into account
The Concept of Distance
• Distance between numeric data points
– Minkowski distance (with parameter p)
– Euclidean distance: when p = 2
– Manhattan distance: when p = 1
– Mahalanobis distance
• Others
– Distance between categorical data points
• Hamming distance, Jaccard
– Distance between sequences (strings, time series)
• Other related concepts
– z-transform, Pearson
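Most of the distances listed above are available in base R. The sketch below uses two hypothetical points and a hypothetical covariance matrix `S`; none of these values come from the slides.

```r
# Two hypothetical points (rows) in 2-D
pts <- rbind(p = c(0, 0), q = c(3, 4))

d_euclid    <- dist(pts, method = "euclidean")        # Minkowski with p = 2
d_manhattan <- dist(pts, method = "manhattan")        # Minkowski with p = 1
d_minkowski <- dist(pts, method = "minkowski", p = 3)

# Mahalanobis distance takes the covariance of the variables into account.
# S is a hypothetical covariance matrix; mahalanobis() returns the SQUARED distance.
S <- matrix(c(4, 2,
              2, 3), nrow = 2, byrow = TRUE)
d2_mahal <- mahalanobis(c(3, 4), center = c(0, 0), cov = S)

# Hamming distance between two categorical vectors: number of differing positions
hamming <- sum(c("a", "b", "c") != c("a", "x", "c"))

print(c(euclid = as.numeric(d_euclid), manhattan = as.numeric(d_manhattan)))
print(sqrt(d2_mahal))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which is one way to see it as a covariance-aware generalization.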
Variance-Covariance Matrix
Covariance and Distance
• It would be easier to calculate distance if we could rescale the coordinates so that they had no covariance
Distances as Vectors
• Distances in coordinate space can be described as vectors
MVA Techniques
Technique                            Interdependence/Dependence   Exploratory/Confirmatory
Factor Analysis                      Interdependence              Exploratory, Confirmatory
MDS                                  Interdependence              Exploratory
Cluster Analysis                     Interdependence              Exploratory
Canonical Correlation                Dependence                   Confirmatory
SEM (Structural Equation Modeling)   Dependence                   Confirmatory
ANOVA                                Dependence                   Confirmatory
Discriminant Analysis                Dependence                   Confirmatory
Logit Choice Model                   Dependence                   Confirmatory
Source: J.M. Lattin et al., Analyzing Multivariate Data
PCA, SVD, Discriminant Analysis
Singular Value Decomposition (SVD)
• Three perspectives
– a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items
– a method for identifying and ordering the dimensions along which the data points exhibit the most variation
– a method for data reduction
• Significance of SVD
– dimensionality reduction: it exposes the substructure of the original data more clearly and orders it from the most variation to the least
• Typical application: NLP
– identify the main relationships latent in the documents while ignoring variation below a chosen threshold, in exchange for a drastic reduction in dimensionality
• Method
– a rectangular matrix $A$ can be broken into the product of three matrices, $A = USV^T$: an orthogonal matrix $U$, a diagonal matrix $S$, and the transpose of an orthogonal matrix $V$
• where $U^TU = I$ and $V^TV = I$
• the columns of $U$ are orthonormal eigenvectors of $AA^T$
• the columns of $V$ are orthonormal eigenvectors of $A^TA$
• $S$ is a diagonal matrix containing the square roots of the nonzero eigenvalues of $AA^T$ (which are also those of $A^TA$), in descending order
– To find $U$, start with $AA^T$
– Find the eigenvalues and the corresponding eigenvectors of $AA^T$
• The eigenvectors become the column vectors of a matrix, ordered by the size of the corresponding eigenvalue: the eigenvector for λ = 12 is column 1, and the eigenvector for λ = 10 is column 2.
• Convert this matrix into an orthogonal matrix by applying the Gram-Schmidt orthonormalization process to the column vectors
– Normalize $v_1$
– To find $V$, repeat the procedure with $A^TA$
• For λ = 12:
– …
– $v_1 = [1, 2, 1]$
• For λ = 10:
– …
– $v_2 = [2, -1, 0]$
• For λ = 0:
– …
– $v_3 = [1, 2, -5]$
• Gram-Schmidt orthonormalization process
• For $S$
– take the square roots of the nonzero eigenvalues and populate the diagonal with them
– put the largest in $s_{11}$, the next largest in $s_{22}$, and so on until the smallest value ends up in $s_{mm}$
– add a zero column vector to $S$ so that it can be multiplied between $U$ and $V$
– The diagonal entries of $S$ are the singular values of $A$, the columns of $U$ are the left singular vectors, and the columns of $V$ are the right singular vectors.
• Now we have all the pieces of the puzzle
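The worked example can be checked with base R's `svd()`. The example matrix itself did not survive in the text above, so the matrix `A` below is an assumption: it was chosen because it reproduces the eigenvalues 12 and 10 and the eigenvectors [1, 2, 1], [2, −1, 0], and [1, 2, −5] quoted in the steps.

```r
# Assumed example matrix (not shown in the text; consistent with the quoted values)
A <- matrix(c( 3, 1, 1,
              -1, 3, 1), nrow = 2, byrow = TRUE)

AAt <- A %*% t(A)
AtA <- t(A) %*% A
print(eigen(AAt)$values)             # eigenvalues used to build U
print(round(eigen(AtA)$values, 10))  # eigenvalues used to build V (one is zero)

# Check the quoted (unnormalized) eigenvectors of t(A) %*% A
v1 <- c(1, 2, 1); v2 <- c(2, -1, 0); v3 <- c(1, 2, -5)
print(cbind(AtA %*% v1, AtA %*% v2, AtA %*% v3))

# svd() returns the thin form: d holds the singular values, largest first
s <- svd(A)
A_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
print(s$d)
print(round(A_rebuilt, 10))
```

Because `svd()` returns the thin decomposition, `diag(s$d)` is 2 × 2 and no zero column needs to be added before multiplying the three factors back together.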
LDA (Linear Discriminant Analysis)
[Figure: classification methods grouped by approach]
• Frequency table: ZeroR, OneR, Naive Bayesian, Decision tree
• Covariance matrix: LDA, Logistic regression
• Similarity functions: KNN
• Others: ANN, SVM
LDA
• LDA: pick a new dimension that gives:
– maximum separation between the means of the projected classes
– minimum variance within each projected class
• Solution: eigenvectors based on the between-class and within-class covariance matrices
PCA vs. LDA
• LDA is not guaranteed to be better for classification
– it assumes the classes are unimodal Gaussians
– it fails when the discriminatory information is not in the mean but in the variance of the data
• Example where PCA gives the better projection:
• Models the difference between groups for the purpose of separating two or more classes, objects, or categories
– much like logit and probit models
• LDA seeks to reduce dimensionality while preserving as much of the (two-)class discriminatory information as possible
• (ex)
– Assume D-dimensional samples, $N_1$ of which belong to class $w_1$ and $N_2$ to class $w_2$
– Obtain a scalar $y$ by projecting the samples $x$ onto a line: $y = w^Tx$
• Of all possible lines, select the one that maximizes the separability of the scalars
• Examples
– Discriminating among high-school students who
• will go to college
• will go to trade school
• will discontinue education
– Some pattern must be there, so we collect
• family background
• academic information
– Discriminating whether a person is male or female based on height
Theory
• Discrimination by comparing the means of the variables
• Can have several variables
• Assumes multivariate normality: the independent variables need to be continuous
• Homogeneous variance
• Create an equation that minimizes the possibility of misclassifying cases into their respective groups or categories
– $D = a_1X_1 + a_2X_2 + \dots + a_iX_i + b$
• where:
• D = discriminant function
• X = response score for that variable
• a = discriminant coefficient (analogous to a regression coefficient)
• b = constant
• i = number of discriminant variables
• Based on the concept of searching for a linear combination of variables (predictors) that best separates two classes (targets)
• To capture the notion of separability, Fisher defined the following score function:
• Given the score function, the problem is to estimate the linear coefficients that maximize the score, which can be solved by the following equations:
– $\beta = C^{-1}(\mu_1 - \mu_2)$ (model coefficients)
– $C = \frac{1}{n_1+n_2}(n_1C_1 + n_2C_2)$ (pooled covariance matrix)
» $\beta$: linear model coefficients
» $C_1, C_2$: covariance matrices
» $\mu_1, \mu_2$: mean vectors
• Mahalanobis distance between two groups
– $\Delta^2 = \beta^T(\mu_1 - \mu_2)$
– $\Delta$: Mahalanobis distance between the two groups
– A distance greater than 3 means that the two averages differ by more than 3 standard deviations
– This means that the overlap (probability of misclassification) is quite small
• Finally, a new point is classified by projecting it onto the maximally separating direction and classifying it as $c_1$ if:
– $\beta^T\left(x - \frac{\mu_1+\mu_2}{2}\right) > \log\frac{P(c_1)}{P(c_2)}$
• $\beta$: coefficient vector
• $x$: data vector
• $\mu_1, \mu_2$: mean vectors
• $P(c_1), P(c_2)$: class probabilities
LDA Example
• Discriminating default risk for a bank's clients (small and medium-sized businesses)
– clients who defaulted (red squares) and those who did not (blue circles), separated by delinquent days (DAYSDELQ) and number of months in business (BUSAGE)
– a discriminant model (default vs. non-default) built with LDA
– Data (number of observations = 100)
BUSAGE   DAYSDELQ   DEFAULT   Z   Z-Z0   Prediction
87       13         N
89       27         Y
...
• We use LDA to find an optimal linear model that best separates the two classes (default and non-default)
– $Z_0 = 0.3985302$, $\log(P(N)/P(Y)) = 0.4771212547$
• The first step is to calculate the mean (average) vectors, covariance matrices, and class probabilities
• Then we calculate the pooled covariance matrix and finally the coefficients of the linear model
• Assume we have a point with BUSAGE = 111 and DAYSDELQ = 24
– $x = [111, 24]$
– $\beta^T\left(x - \frac{\mu_1+\mu_2}{2}\right) > \log\frac{p(c_1)}{p(c_2)}$
• $\beta^T$: coefficient vector
• $x$: data vector
• $\frac{\mu_1+\mu_2}{2}$: mean vector
• $p(c_1), p(c_2)$: class probabilities
– with the example values: $[-0.0095, -0.1408]\left([111, 24] - \frac{[116.23, 16.89]+[115.04, 55.32]}{2}\right) \overset{?}{>} \log\frac{0.75}{0.25}$
• A Mahalanobis distance of 2.32 shows a small overlap between the two groups, which means a good separation between the classes by the linear model
– $\Delta^2 = \beta^T(\mu_1 - \mu_2) = 5.40$
– $\Delta = 2.32$
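The numbers quoted in the example can be reproduced from the coefficient vector and the two mean vectors. This is a sketch that only redoes the arithmetic with the rounded values shown above; it does not refit the model, since the 100 observations are not available here.

```r
# Rounded values as quoted in the example
beta <- c(-0.0095, -0.1408)        # coefficients for BUSAGE, DAYSDELQ
mu1  <- c(116.23, 16.89)           # mean vector of class 1
mu2  <- c(115.04, 55.32)           # mean vector of class 2
x    <- c(111, 24)                 # new point to classify

# Left-hand side of the classification rule: projection of x relative to the midpoint
score <- sum(beta * (x - (mu1 + mu2) / 2))

# Squared Mahalanobis distance between the groups and its square root
delta2 <- sum(beta * (mu1 - mu2))
delta  <- sqrt(delta2)

print(round(c(score = score, delta2 = delta2, delta = delta), 2))
```

For fitting a discriminant model on actual data, `lda()` in the MASS package is one commonly used option; its scaling and intercept conventions may differ from the hand calculation above.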
PCA
1. Concept
• Meaning
– PCA: orthogonal projection of highly correlated variables onto principal components
• the linear transformation is defined in such a way that the first principal component has the largest possible variance
• PC: a set of values of linearly uncorrelated variables
• Uses
– high-dimensional data
– image processing, text processing, stock information, …
– describe them in a simpler way
Covariance
• For 2-dimensional data: cov(x,y)
• For 3-dimensional data: cov(x,y), cov(x,z), cov(y,z)
• For an n-dimensional data set, there are $\frac{n!}{(n-2)! \cdot 2}$ different covariance values
• So the covariance matrix for a set of data with $n$ dimensions is the $n \times n$ matrix with entries $C_{ij} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j)$
• Eigenvector?
– a nonzero vector that, after being multiplied by the matrix, remains parallel to the original vector
• To keep eigenvectors standard, we usually scale each one to a length of 1, so that all eigenvectors have the same length
[Figure: non-eigenvector vs. eigenvector]
• Procedure
– Step 1: obtain and prepare the data
– Steps 2-3: subtract the mean and compute the covariance matrix
– Step 4: compute the eigenvectors and eigenvalues of the covariance matrix
– Step 5: choose components and form the feature vector
• the eigenvector with the highest eigenvalue is the principal component of the data set
• the remaining components can be dropped
– Step 6: derive the new data set
• RowFeatureVector = the matrix with the eigenvectors in the columns, transposed, so that the most significant eigenvector is at the top
• RowDataAdjust = the mean-adjusted data, transposed, i.e. the data items are in the columns and each row holds a separate dimension
• the original data is transformed in terms of the vectors we chose
• the patterns are the lines that most closely describe the relationships between the data
• Getting the old data back
• Biplot
– shows the proportions of each variable along the 2 PCs
• Scree plot
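The procedure above can be followed by hand and compared with base R's `prcomp()`. The data set below is simulated and purely hypothetical, and the sketch assumes the usual product for Step 6, new data = RowFeatureVector × RowDataAdjust.

```r
set.seed(1)
x <- rnorm(50)
dat <- cbind(x = x, y = 0.8 * x + rnorm(50, sd = 0.3))   # Step 1: hypothetical correlated data

adj <- sweep(dat, 2, colMeans(dat))      # Step 2: subtract the mean of each column
C <- cov(adj)                            # Step 3: covariance matrix
e <- eigen(C)                            # Step 4: eigenvalues (descending) and unit eigenvectors

feature_vector <- e$vectors[, 1, drop = FALSE]   # Step 5: keep only the first component

# Step 6: RowFeatureVector %*% RowDataAdjust (eigenvectors in rows, data items in columns)
final_data <- t(feature_vector) %*% t(adj)

# Getting the old data back (approximate, because one component was dropped)
restored <- sweep(t(feature_vector %*% final_data), 2, colMeans(dat), "+")

# Comparison with the built-in implementation
p <- prcomp(dat)
print(e$values)
print(p$sdev^2)
```

The signs of principal components are arbitrary, so the hand-computed scores may be the negative of `p$x[, 1]`. `biplot(p)` and `screeplot(p)` draw the two plots mentioned above.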
EDA (Exploratory Data Analysis)
• Main uses
– 1. Gain insight into the data set
– 2. Identify the factors that affect the data (the critical impact variables) and understand their relationships
– 3. Check whether outliers exist
– 4. Test the underlying assumptions of the data set
• Data analysis: exploratory vs. confirmatory
– Confirmatory data analysis
• tests a hypothesis
• settles questions
• (inferential statistics)
– Exploratory data analysis
• finds a good description
• raises new questions
• (descriptive statistics)
• Exploratory Data Analysis (EDA)
– an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
• maximize insight into a data set;
• uncover underlying structure;
• extract important variables;
• detect outliers and anomalies;
• test underlying assumptions;
• develop parsimonious models; and
• determine optimal factor settings.
Three Approaches
• For classical analysis, the sequence is
– Problem => Data => Model => Analysis => Conclusions
• For EDA, the sequence is
– Problem => Data => Analysis => Model => Conclusions
• For Bayesian analysis, the sequence is
– Problem => Data => Model => Prior Distribution => Analysis => Conclusions
• Data analysis process
– visualization and EDA
Figure source: Wickham and Grolemund
• Data Munging – Transforming data
– Raw data to usable data
– Data must be cleaned first
• 주요 Tasks – Renaming variables
– Data type conversion
– Encoding, decoding or recoding data
– Merging data sets
– Transforming data
– Handling missing data (imputing)
– Handling anomalous values
1. Technical Aspects
• (1) Reading data
– read.table and its relatives: read.delim, read.delim2, read.csv, read.csv2, read.fwf
– A freshly read data.frame should always be inspected with functions like head, str, and summary
• (2) Type conversion
– coercion: as.numeric, as.logical, as.integer, as.factor, as.character, as.ordered
– factor conversion: factor()
– date conversion: library(lubridate)
• (3) Strings and encoding
– Sys.getlocale("LC_CTYPE")
– f <- file("myUTF16file.txt", encoding = "UTF-16")
2. Consistent Data
• (1) Missing values
– na.rm = TRUE
– (persons_complete <- na.omit(person))
• (2) Special values
– (example)
    # TRUE for NA, NaN, Inf, -Inf in numeric vectors; TRUE for NA otherwise
    is.special <- function(x) {
      if (is.numeric(x)) !is.finite(x) else is.na(x)
    }
• (3) Outliers
3. Correction
• Imputation
– mean imputation with impute() from the Hmisc package
    library(Hmisc)
    x <- 1:5          # create a vector...
    x[2] <- NA        # ...with an empty value
    x <- impute(x, mean)
    x
    ##     1     2     3     4     5
    ##  1.00 3.25*  3.00  4.00  5.00
    is.imputed(x)
– ratio imputation, where y is an auxiliary variable observed for every record
    I <- is.na(x)
    R <- sum(x[!I]) / sum(y[!I])
    x[I] <- R * y[I]
– model-based imputation with a linear regression
    data(iris)
    iris$Sepal.Length[1:10] <- NA
    model <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
    I <- is.na(iris$Sepal.Length)
    iris$Sepal.Length[I] <- predict(model, newdata = iris[I, ])
dplyr Basics
• The six main functions
– Pick observations by their values (filter()).
– Reorder the rows (arrange()).
– Pick variables by their names (select()).
– Create new variables with functions of existing variables (mutate()).
– Collapse many values down to a single summary (summarise()).
– plus group_by()
• changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
• Usage
– The first argument is a data frame.
– The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
– The result is a new data frame.
• Logical operations in dplyr's filter()
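The six verbs can be chained in a single pipeline. The sketch below assumes the dplyr package is installed and uses the built-in `mtcars` data set, which is not part of the slides; the column name `wt_kg` is introduced only for illustration.

```r
library(dplyr)

result <- mtcars %>%
  filter(cyl %in% c(4, 6), mpg > 20) %>%     # pick rows; comma-separated conditions are ANDed
  select(mpg, cyl, wt) %>%                   # pick columns by name
  mutate(wt_kg = wt * 453.592) %>%           # new variable (wt is in units of 1000 lb)
  group_by(cyl) %>%                          # following verbs now work per group
  summarise(mean_mpg = mean(mpg), n = n()) %>%
  arrange(desc(mean_mpg))                    # reorder rows

print(result)
```

Inside `filter()`, conditions can also be combined explicitly with `&`, `|`, and `!`; `%in%` is a compact alternative to several `==` tests joined by `|`.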
Tidy data set
• Rules for a tidy dataset:
– Each variable must have its own column.
– Each observation must have its own row.
– Each value must have its own cell.
    > table4a
    # A tibble: 3 × 3
      country     `1999` `2000`
    * <chr>        <int>  <int>
    1 Afghanistan    745   2666
    2 Brazil       37737  80488
    3 China       212258 213766
    > table4a %>%
    +   gather(`1999`, `2000`, key = "year", value = "cases")
    # A tibble: 6 × 3
      country     year   cases
      <chr>       <chr>  <int>
    1 Afghanistan 1999     745
    2 Brazil      1999   37737
    3 China       1999  212258
    4 Afghanistan 2000    2666
    5 Brazil      2000   80488
    6 China       2000  213766
Relational data
• A primary key – uniquely identifies an observation in its own table. – (ex) planes$tailnum is a primary key
• A foreign key – uniquely identifies an observation in another table. – (ex) flights$tailnum is a foreign key
Visualization
(v.0.9)
Overview
• Base graphics
– plot()
– hist() and boxplot()
– points(), lines(), text(), mtext(), axis(), rug(), identify()
• Specialized packages:
– http://www.computerworld.com/article/2921176/business-intelligence/great-r-packages-for-data-import-wrangling-visualization.html
Lattice Graphics
• Lattice = a flavour of trellis graphics
– For lattice, graphics formulae are mandatory.
– grid = a low-level graphics system; it was used to build lattice.
• Lattice vs. base graphics
– xyplot() vs. plot()
– plot() gives a graph as a side effect of the command.
– xyplot() generates a graphics object.
• When this object is output to the command line it is "printed", i.e., a graph appears.
graph_type     description                  formula example
barchart       bar chart                    x~A or A~x
bwplot         boxplot                      x~A or A~x
cloud          3D scatterplot               z~x*y|A
contourplot    3D contour plot              z~x*y
densityplot    kernel density plot          ~x|A*B
dotplot        dotplot                      ~x|A
histogram      histogram                    ~x
levelplot      3D level plot                z~y*x
parallel       parallel coordinates plot    data frame
splom          scatterplot matrix           data frame
stripplot      strip plots                  A~x or x~A
xyplot         scatterplot                  y~x|A
wireframe      3D wireframe graph           z~y*x
ggplot2
• Characteristics of ggplot2
– (strength) a consistent underlying grammar of graphics
– (limitation) things you cannot do:
• 3-dimensional graphics
• graph-theory type graphs (nodes/edges layout)
• Anatomy of a plot:
– data
– aesthetic mapping
– geometric object
– statistical transformations
– scales
– coordinate system
– position adjustments
– faceting
• ggplot2 vs. base graphics: ggplot2
– is more verbose for simple / canned graphics
– is less verbose for complex / custom graphics
– does not have methods (data should always be in a data.frame)
– uses a different system for adding plot elements
• Geometric objects and aesthetics
– Aesthetic mapping
• in ggplot2 an aesthetic is "something you can see"
– position (i.e., on the x and y axes)
– color ("outside" color)
– fill ("inside" color)
– shape (of points)
– linetype
– size
• set with aes()
– Geometric objects
• the actual marks we put on a plot
– points (geom_point, for scatter plots, dot plots, etc.)
– lines (geom_line, for time series, trend lines, etc.)
– boxplot (geom_boxplot, for, well, boxplots!)
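The pieces of the plot anatomy listed above map directly onto ggplot2 code. The sketch below assumes the ggplot2 package is installed and uses the built-in `mtcars` data set, which is not part of the slides.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +  # data + aesthetic mapping
  geom_point(size = 2) +                      # geometric object: points
  geom_smooth(method = "lm", se = FALSE) +    # statistical transformation: fitted line per color group
  facet_wrap(~ am) +                          # faceting: one panel per transmission type
  labs(color = "cyl")                         # label for the color scale

# As with lattice, the plot is an object; printing it draws the graph
print(p)
```

Layers are added with `+`, which is the "different system for adding plot elements" compared with base graphics calls such as `points()` and `lines()`.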
Machine Learning Modeling
Machine Learning?
• Concept
– a subfield of Artificial Intelligence (AI)
– "construction and study of systems that can learn from data"
• Types
– http://en.wikipedia.org/wiki/Machine_learning
• Methodology
– "A computer program learns from experience (E) with some class of tasks (T) and a performance measure (P) if its performance at tasks in T as measured by P improves with E"
• Terminology
– Features
• distinct traits that can be used to describe each item in a quantitative manner
– Samples
• an item to process (e.g. classify)
• a document, a picture, a sound, a video, a row in a database or CSV file, …
– Feature vector
• an n-dimensional vector of numerical features that represent some object
– Feature extraction
• preparation of the feature vector
• transforms the data in the high-dimensional space to a space of fewer dimensions
– Training/Evolution set
• set of data to discover potentially predictive relationships
Learning vs. Training
Process
Types
• Supervised Learning
• Unsupervised Learning
• Semi-Supervised Learning
• Reinforcement Learning
– allows the machine or software agent to learn its behavior based on feedback from the environment
– this behavior can be learned once and for all, or keep adapting as time goes by
Types of ML Algorithms
• Predictive model
– discovers or models the relationship between the target variable and the other features
– = supervised learning: the learner gets clear instruction on what to learn and how (the supervisory role is played by the target values, not by a person, to find …)
• Descriptive model
– there is no target to learn, and no single feature is more important than any other
– (ex) pattern discovery (market basket analysis, clustering)
Table columns: Supervised L, Unsupervised L, Other, Applications, Remarks
• NN
• (Naïve) Bayes
• Decision Tree (Classif'n Rule L)
• Linear Regression
• Model Tree
• Neural Net
• SVM
• AR
• K-means
• ..
• Marketing Anal.
• …
KNN
• classifies unlabeled examples by assigning them to the class of the most similar labeled examples
• [Example] Classifying a tomato through blind testing
• Distance calculation
– Euclidean distance
– Manhattan distance
• Nearest neighbors
– 1NN: the nearest neighbor is an orange, so the tomato is classified as a fruit
– 3NN: vote among the 3 nearest neighbors
• Choosing an appropriate K
            Large K             Small K
Effect      reduces variance    reduces bias
Risk        underfitting        overfitting (a single outlier can decide the result)
In practice: depends on the complexity of the concept to be learned and on the number of training examples
Procedure
• Data preparation
– transform the features to a standard range
• shrinking/rescaling: min-max normalization
• z-score standardization
– dummy coding for nominal features
• for ordinal data, assign numbers and then normalize
• Characteristics
– lazy learning: no abstraction, no generalization
– instead, instance-based learning
– non-parametric learning
Application
• Voronoi diagram
– the decision surface induced by the training examples
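The preparation and classification steps above can be sketched with `knn()` from the class package, which ships with standard R distributions. The built-in `iris` data, the 100/50 split, and `k = 3` are assumptions chosen for illustration, not values from the slides.

```r
library(class)

# Min-max normalization so every feature lies in [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
feats <- as.data.frame(lapply(iris[, 1:4], normalize))

# Hypothetical split: 100 training rows, 50 test rows
set.seed(42)
idx <- sample(nrow(iris), 100)

# knn() is a lazy learner: it stores the training rows and votes among the k nearest
pred <- knn(train = feats[idx, ],
            test  = feats[-idx, ],
            cl    = iris$Species[idx],
            k     = 3)

accuracy <- mean(pred == iris$Species[-idx])
print(table(predicted = pred, actual = iris$Species[-idx]))
print(accuracy)
```

Rerunning with different values of `k` is a simple way to see the trade-off described in the table: very small values follow individual neighbors closely, while very large values smooth over local structure.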
Bayesian Methods and Naïve Bayes
Basic Concepts
• Background:
– the estimated likelihood of an event should be based on the evidence at hand
• Probability
• Conditional probability and Bayes' theorem
– Prior probability
• the most reasonable guess would be the probability that any prior message was spam (about 20% in the example)
– Likelihood
• probability that the word Viagra was used in previous spam messages
– Marginal likelihood
• probability that Viagra appeared in any message at all
• Posterior probability
– use Bayes' theorem to compute the posterior probability that the message is spam
– if it is greater than 50%, the message is more likely to be spam than ham and should be filtered
• Applying Bayes' theorem
• Frequency table
• $P(\text{spam} \cap \text{Viagra}) = P(\text{Viagra} \mid \text{spam}) \cdot P(\text{spam}) = (4/20)(20/100) = 0.04$
• $P(\text{spam} \mid \text{Viagra}) = P(\text{Viagra} \mid \text{spam}) \cdot P(\text{spam}) / P(\text{Viagra}) = (4/20)(20/100)/(5/100) = 0.80$
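The two calculations above are short enough to check by hand, but writing them out makes the role of each term explicit. This sketch uses only the counts quoted in the example.

```r
p_spam              <- 20 / 100   # prior: share of messages that are spam
p_viagra            <- 5 / 100    # marginal likelihood: share of messages containing the word
p_viagra_given_spam <- 4 / 20     # likelihood: share of spam messages containing the word

joint     <- p_viagra_given_spam * p_spam   # P(spam and Viagra)
posterior <- joint / p_viagra               # P(spam | Viagra), by Bayes' theorem

print(c(joint = joint, posterior = posterior))
```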
Naïve Bayes Classification
• Extending the spam mail example
– Train by constructing a likelihood table for the appearance of 4 words:
– Probability calculation (ex): Viagra = Y, Money = N, Groceries = N, Unsubscribe = Y
– Class-conditional independence is used to simplify the calculation
– likelihood of spam:
• (4/20) * (10/20) * (20/20) * (12/20) * (20/100) = 0.012
• probability of spam: 0.012 / (0.012 + 0.002) = 0.857
– likelihood of ham:
• (1/80) * (66/80) * (71/80) * (23/80) * (80/100) = 0.002
• probability of ham: 0.002 / (0.012 + 0.002) = 0.143
– => we expect the message to be spam with probability 85.7% and ham with probability 14.3%
• i.e., "this message is 6 times more likely to be spam than ham."
– the probability of level L for class C, given the evidence provided by features F1 ~ Fn, is:
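The likelihoods above are products of per-word conditional probabilities and the class prior, which is what class-conditional independence allows. The sketch below recomputes them without intermediate rounding, so the final probabilities differ slightly from the rounded figures quoted above.

```r
# Message: Viagra = Y, Money = N, Groceries = N, Unsubscribe = Y
spam_lik <- (4/20) * (10/20) * (20/20) * (12/20) * (20/100)
ham_lik  <- (1/80) * (66/80) * (71/80) * (23/80) * (80/100)

# Normalize the two likelihoods so that they sum to 1
p_spam <- spam_lik / (spam_lik + ham_lik)
p_ham  <- ham_lik  / (spam_lik + ham_lik)

print(round(c(spam_lik = spam_lik, ham_lik = ham_lik), 4))
print(round(c(p_spam = p_spam, p_ham = p_ham), 3))
```

Rounding the ham likelihood to 0.002 before dividing gives the 0.857 and 0.143 shown in the text; the unrounded result is close to 0.85 and 0.15.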
• Laplace estimator
– (IF) the message contains: Viagra, Groceries, Money, Unsubscribe
• the likelihood of spam under the naive Bayes algorithm:
– (4/20) * (10/20) * (0/20) * (12/20) * (20/100) = 0
• and the likelihood of ham is:
– (1/80) * (14/80) * (8/80) * (23/80) * (80/100) = 0.00005
• so the probability of spam is 0 / (0 + 0.00005) = 0, and the probability of ham is 0.00005 / (0 + 0.00005) = 1
– (Solution) Laplace estimator
• add a small number (1) to each count in the frequency table, which ensures that each feature has a nonzero probability of occurring with each class
• Using numeric features with Naïve Bayes
– by discretizing/binning
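The effect of the Laplace estimator can be seen by redoing the calculation with adjusted counts. The sketch below assumes one common convention: add 1 to each word count and add the number of features (4) to each class total. Other conventions for the denominator exist, so the exact figures depend on that assumption.

```r
# Word counts for Viagra, Money, Groceries, Unsubscribe
spam_counts <- c(4, 10, 0, 12); n_spam <- 20
ham_counts  <- c(1, 14, 8, 23); n_ham  <- 80
prior_spam <- 20 / 100
prior_ham  <- 80 / 100

# Without smoothing, the zero count for Groceries wipes out the spam likelihood
spam_raw <- prod(spam_counts / n_spam) * prior_spam
ham_raw  <- prod(ham_counts / n_ham) * prior_ham

# Laplace estimator (assumed convention: +1 per count, +number of features per class total)
laplace <- 1
k <- length(spam_counts)
spam_lik <- prod((spam_counts + laplace) / (n_spam + k * laplace)) * prior_spam
ham_lik  <- prod((ham_counts + laplace) / (n_ham + k * laplace)) * prior_ham

p_spam <- spam_lik / (spam_lik + ham_lik)
print(c(spam_raw = spam_raw, ham_raw = ham_raw))
print(round(p_spam, 3))
```

After smoothing, the single missing word no longer vetoes the spam class, and the other three words can pull the estimate back toward spam.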
Case Study: Mobile Phone Spam Filtering
• A spam filter for SMS
• Data: sms_spam.csv
• Package: tm
– Corpus() creates a corpus (= a collection of text documents)
– VectorSource() tells Corpus() to use the messages in the vector sms_train$text
• ** tm package **