
Deep Learning and TensorFlow

2017.4~5

윤형기 (hky@openwith.net)

Schedule (tentative)

Day | Topic | Key content
Day 1 | Environment setup | VMM/Linux, Python
Day 1 | Python basics | Python, numpy, pandas, …
Day 1 | Basic statistics, linear algebra, modeling | Overview of statistics, machine learning, linear algebra, modeling
Day 2 | ANN and DL overview | Introduction to ANN and DL (+ TensorFlow)
Day 2 | TensorFlow basics | TensorFlow, TensorFlow ML
Day 3 | Advanced DL | Link functions, backpropagation, tuning
Day 3 | TensorFlow DL | CNN, RNN, RBM, autoencoder
Day 4 | TF applications | Chatbot, …

Day 1

Opening Remarks

• Purpose – …

• Schedule discussion – 4 days x 8 hours

• Environment setup – virtual machine + Linux + Python

• Definitions – Deep Learning and Machine Learning

– TensorFlow

Definitions


Concepts

• Machine Learning and Deep Learning – … (see the following slides) …

• TensorFlow? – S/W library for numerical computation using data flow graphs.

• Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.

• The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

• Tensor? – Tensors in physics

• "A quantity that follows a specific transformation law under coordinate transformations." • In general, a rank-m tensor in n dimensions has n^m components.

– A rank-0 tensor is a scalar, a rank-1 tensor is a vector. – Rank-2 tensors are used in mechanics, electromagnetism, …; tensors of rank 3 and higher are used in Riemannian geometry, particle physics, and so on.

– Tensors in mathematics • A generalization of the concept of dual spaces

Concepts and Scope of Data Analysis

• Data Mining/ Predictive Analysis

• Data Science

• BI/OLAP

• Analytics

• Modeling

• Machine Learning

• Mathematical/statistical analysis

• KDD (Knowledge Discovery)

• Decision Support System


• Evolution – Data Science – Traditional analysis

• Analysis centered on BI/OLAP/DB queries and spreadsheets

• Statistical analysis

– + Text analytics (SNA/sentiment analysis, mining, search)

– + Machine Learning/Deep Learning

Data Science

• Data Science

• Statistics and machine learning

Statistics | Machine learning
Population & sample | All data
Estimation | Learning
Hypothesis | Classifier
Example/Instance | Data point
Regression | Supervised learning
Covariate | Features
Response | Label
… | …

Environment Setup

Installation

• MS Windows

• + Virtual Machine (VirtualBox)

• + Ubuntu

• + Python (+ scipy + … : anaconda)

• + TensorFlow

• + Scikit-Learn

• + … (git, Spark, …)

PYTHON BASICS

Python

Stage | Key content
Python basics | Python overview and installation
 | Variables, statements, conditionals and loops
 | Functions, modules and programs, example programs
Python programming (1) | Strings
 | List / Dictionary / Set
 | Modules and packages
Python programming (2) | File & I/O
 | OOP
 | Exception handling
Python programming (3) | Regular expressions, using databases
 | Standard library and other useful features
Data analysis with Python (1) | Data analysis with Python (1)
 | Data analysis with Python (2)
Data analysis with Python (2) | Data analysis with Python (3)
 | Python and big data

Wrap-up

• Python cheat sheet

numpy와 pandas

• numpy – ndarray

• Contiguous allocation of memory

• Vectorized operation

– np.array(), .size(), .zeros(), .arange(), .reshape(), .linspace(), …

– slicing – flatten(), a1[3:8], .concatenate(), .vstack(), .hsplit(),

– .min()/.max(), .cumsum()/.cumprod(), …
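A minimal NumPy sketch of the functions listed above (array creation, reshaping, slicing, stacking, and reductions); the values are illustrative only:

import numpy as np

a = np.arange(12).reshape(3, 4)           # 3x4 array containing 0..11
print(a.size, a.shape)                    # 12 (3, 4)
print(a[1, 2:4])                          # slicing: [6 7]
print(a.flatten()[3:8])                   # [3 4 5 6 7]
print(a.min(), a.max(), a.cumsum()[-1])   # 0 11 66

b = np.linspace(0.0, 1.0, 5)              # [0. 0.25 0.5 0.75 1.]
c = np.concatenate([b, np.zeros(2)])      # length-7 vector
print(c)
print(np.vstack([a, np.ones((1, 4))]).shape)   # (4, 4)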

• pandas – numpy array + labeled index (default: dtype=Int64Index, … Object)

– Index label • Series

• DataFrame

– Zero-based position

– Logical expressions; selecting entire rows with .loc[] and .iloc[]
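A small pandas sketch of the points above (labeled index, label- vs. position-based row selection, boolean filtering); the column and index names are made up for illustration:

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])      # numpy array + labeled index
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]},
                  index=['r1', 'r2', 'r3'])

print(df.loc['r2'])       # entire row by index label
print(df.iloc[0])         # entire row by zero-based position
print(df[df['x'] > 1])    # logical expression (boolean filtering)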

NumPy: Array와 Vectorized Computation

• NumPy ndarray: A Multidimensional Array Object – ndarrays 생성 – Data Types for ndarrays – Operations between Arrays and Scalars – Basic Indexing and Slicing – Boolean Indexing – Fancy Indexing – Transposing Arrays and Swapping Axes

• Universal Functions: Fast Element-wise Array Functions • Loop-free programming with arrays

– Expressing Conditional Logic as Array Operations – Mathematical and Statistical Methods – Methods for Boolean Arrays – Sorting – Unique and Other Set Logic

• File 입출력과 Arrays – Storing Arrays on Disk in Binary Format – Saving and Loading Text Files

• Linear Algebra • Pseudorandom Number Generation

Function Description

array Convert input data to an ndarray either by inferring a dtype or explicitly specifying a dtype. Copies the input data by default.

asarray Convert input to ndarray, but do not copy if the input is already an ndarray

arange Like the built-in range but returns an ndarray instead of a list.

ones, ones_like Produce an array of all 1's with the given shape and dtype. ones_like takes another array and produces a ones array of the same shape and dtype.

zeros, zeros_like Like ones and ones_like but producing arrays of 0’s instead

empty, empty_like Create new arrays by allocating new memory, but do not populate with any values like ones and zeros

full, full_like Produce an array of the given shape and dtype with all values set to the indicated "fill value". full_like takes another array and produces a filled array of the same shape and dtype.

eye, identity Create N x N identity matrix (1’s on diagonal and 0’s elsewhere)

Table. Array creation functions

Type Type Code Description

int8, uint8 i1, u1 Signed and unsigned 8-bit (1 byte) integer types

int16, uint16 i2, u2 Signed and unsigned 16-bit integer types

int32, uint32 i4, u4 Signed and unsigned 32-bit integer types

int64, uint64 i8, u8 Signed and unsigned 64-bit integer types

float16 f2 or e Half-precision floating point

float32 f4 or f Standard single-precision floating point. Compatible with C float

float64 f8 or d Standard double-precision floating point. Compatible with C double and Python float object

float128 f16 or g Extended-precision floating point

Table. NumPy data types

Type Type Code Description

complex64, complex128, complex256

c8, c16, c32 Complex numbers represented by two 32-, 64-, or 128-bit floats, respectively

bool ? Boolean type storing True and False values

object O Python object type, a value can be any Python object

string_ S Fixed-length ASCII string type (1 byte per character). For example, to create a string dtype with length 10, use 'S10'.

unicode_ U Fixed-length unicode type (number of bytes platform specific). Same specification semantics as string_ (e.g.'U10').

• ndarray • ndarray is essentially defined by:

– a number of dimensions

– a shape

– strides

– a data type, or dtype

– the actual data

• ndarray 에 대한 Vector 연산 – vector (or vectorized) operations.

• an elementary mathematical operation is performed on an element-wise basis on two arrays.

• ndarray 의 메모리 저장형태 • 내부적으로, ndarray = metadata + actual binary data.

– data is stored in a contiguous block of memory.

• Default order in NumPy is the C-order, although this can be configured differently.

• Strides describe how elements of a multidimensional array are

organized within data buffer. NumPy implements a strided indexing scheme.

– strides describe, in any axis, how many bytes to jump over in the data buffer to go from one item to the next along that axis.
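A short sketch of how strides change with memory order and transposition, assuming 4-byte int32 elements:

import numpy as np

a = np.arange(12, dtype=np.int32).reshape(3, 4)   # C-order (row-major) by default
print(a.strides)      # (16, 4): 16 bytes to the next row, 4 bytes to the next column

f = np.asfortranarray(a)                          # Fortran (column-major) order
print(f.strides)      # (4, 12)

print(a.T.strides)    # (4, 16): transposing just swaps the strides, no data copy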

pandas

• 개요 – Rich data manipulation tool built on top of Numpy

– Fast, intuitive data structures

– Filling the gap between Python and DSL like R

– Like R's data.frame

– Easy-to-use, highly consistent API

– Data munging/preparation/cleaning/integration

• Core features – Reindexing

– Dropping entries from an axis

– Indexing, selection, and filtering

– Sorting and ranking

– Axis indexes with duplicate values

• Core features (continued) – Summaries and statistics (descriptive statistics)

• Correlation and covariance

• Unique values, value counts, and membership

– Handling missing data

• Filtering out missing data and filling in missing data

– Hierarchical Indexing

• Reordering and Sorting Levels

• Summary Statistics by Level

• Indexing with a DataFrame’s columns

• Integer Indexes

• pandas data structures – Series

• Subclass of numpy.ndarray

• Data: any dtype

• Index labels need not be ordered

• Duplicates are possible (but result in reduced functionality)

– DataFrame

• ndarray-like, but not ndarray

• Each column can have a different dtype

• Row and column index

• Size mutable: insert and delete columns

Basic Statistics

Contents

• Unit I: Overview – 1. Overview and descriptive statistics

– 2. Probability theory and Bayesian methods

• Unit II: Data analysis by number of variables – 3. Univariate / bivariate / multivariate

• Unit III: Distributions and sampling – 4. Discrete and continuous distributions

– 5. Sampling and sampling distributions

• Unit IV: Parameter estimation – 6. Estimation (one / two populations)

– 7. Hypothesis testing

– 8. Analysis of variance and experimental design

UNIT I: OVERVIEW

1. Basic concepts and descriptive statistics  2. Probability theory and Bayesian methods

1. Basic Concepts and Descriptive Statistics

• 1.1 Statistical concepts

• 1.2 기술통계 (Descriptive Statistics) – (1) 중심경향성: Ungrouped Data

• Mode, Mean, Median • Percentile, Quantile/Quartile

– (2) 변동성: Ungrouped Data • Range & IQR (Interquartile Range) • MAD (Mean Absolute Deviation) • Variance, Standard Deviation

• Empirical Rule와 Chebychev’s Theorem

• Population vs. Sample Variance and Standard Deviation – Unbiased estimator

• Z-score

• Coefficient of Variation (CV)


– (3) Measures of Shape

• Skewness

– Coefficient of Skewness

• Kurtosis

• Box-and-Whisker Plots


– (4) Measures of association

• Correlation

– Pearson product-moment correlation coefficient

– Spearman Correlation Coefficient

– Kendall Tau(τ) Correlation Coefficient

» Ordinal association between the two variables

2. Probability Theory and Bayesian Methods

• 2.1 Basic concepts

– Experiment, (elementary) events, sample space, independent events, unions, intersections

– MECE (Mutually Exclusive Collectively Exhaustive)

– Marginal, Union, Joint

– Counting Possibilities

• mn Counting Rule: m x n

• Sampling from a population with replacement: N^n possibilities

• Combinations: sampling from a population without replacement: C(N, n) = N! / (n! (N − n)!)

– Mutually exclusive events: P(X ∩ Y) = 0

UNIT II: DATA ANALYSIS BY NUMBER OF VARIABLES

3. Univariate / Bivariate / Multivariate

3. Analysis Tools by Number of Variables

• 3.1 단변량 – Categorical Data

• Table, Barplots, Pie Chart, Dot Chart

– Numeric Data

• Stem-and-leaf plots, Strip chart

• Center: mean, median & mode

• Range, variance, …

– 분포의 모양

• Mode, Symmetry and Skew

• Boxplot, Histogram


• 3.2 이변량 (Bivariate) 데이터 – Pairs of categorical variables

• 2-way Table - 주변분포 (Marginal Distribution), 조건부 분포, contingency table

– 독립표본의 비교 • Side-by-side Boxplots, Density plot, Strip Chart, Q-Q plots

– Numeric Data에서의 관계(Relationship) • Scatter plot을 이용한 관계성 분석 - 상관관계

– 단순회귀분석

• 3.3 다변량 (Multivariate) 데이터 – 다변량데이터의 요약

• 범주형 다변량데이터 요약

• 독립표본의 비교와 관계성 비교

– 다변량 데이터 모델링 • Boxplot과 다변량 모델

• Contingency Table – xtabs()

• split()과 stack()

– Lattice 그래픽 활용


UNIT III: DISTRIBUTIONS AND SAMPLING

4. Discrete and Continuous Distributions

5. Sampling and Sampling Distributions

4. Discrete and Continuous Distributions

• 4.1 개요 – 확률변수 (Random variable)

• = a variable that contains the outcomes of a chance experiment

• 4.2 이산분포의 모양 – 평균 or 기대값

• = long-run average of occurrences

– Variance와 Standard Deviation

• 4.2 이항분포 – Binomial formula

– 이항분포의 평균과 표준편차

• 4.3 Poisson 분포 – Law of improbable events


λ = long-run average

• 4.5 초기하 (Hypergeometric) 분포 – 개요

• = 유한 모집단으로부터 비복원추출 시 나타나는 확률분포

– 다음 경우에 이항분포 대신 사용:

• (i) Sampling is done without replacement.

• (ii) n ≥ 5% N


(연속 분포 )

• 4.6 일양분포 (一樣分布 Uniform Distribution)

• 4.7 정규분포 – 개요

• Gaussian 분포

• 정규분포의 확률밀도함수

– Standardized Normal Distribution • z score = 평균을 중심으로 한 표준편차의 개수

• z distribution

• 4.8 이항분포 대신 정규분포의 적용 (Approximate) – 경험법칙;

• 대략 normal curve value의 99.7%가 3 s.d. 이내

• n • p > 5 and n • q > 5

– Correcting for Continuity • ; Converting discrete distribution into a continuous distribution.


• 4.7 지수분포 – = Random occurrences 사이 시간의 확률분포

– 지수분포의 확률

• random arrivals 사이의 Inter-arrival times는 지수분포

– cf. Poisson 분포 = random occurrences over some interval


5. Sampling and Sampling Distributions

• 5.1 Sampling methods

• 5.2 Sampling distribution of x̄

– Central limit theorem

• μ_x̄ = μ,  σ_x̄ = σ / √n

– z formula for sample means

– Sampling from a finite population

– Central limit theorem

• 5.3 Sampling distribution of p̂

UNIT IV: PARAMETER ESTIMATION

6. Estimation

7. Hypothesis Testing

8. Analysis of Variance and Experimental Design

6. Estimation

• 신뢰구간 추정 (단일 모집단) – z 통계량 이용한 신뢰구간 추정 (단일 모집단) (σ Known)

• 점추정 (point estimation)

• 100(1-α)% Confidence Interval to Estimate μ: σ known]

• 유한조정계수

• Sample Size가 작은 경우 – 여태까지 주로 n ≥ 30

– n < 30 이어도 중심극한정리에 의해 z formula 적용 :

– sample size가 클 때 또는 작아도 모집단이 정규분포 (σ known)


– t 통계량 이용한 신뢰구간추정 (단일모집단) (σ Unknown)

• 모집단이 정규분포인데 모집단 s.d 를 모르는 경우 t 분포 적용.

– 표본크기에 따라 분포가 다르다.

– t statistic 의 assumption: 모집단이 정규분포

» If population is not normal dist. or is unknown, nonparametric techniques

– t Distribution의 특징: Robust

• t 통계량을 이용한 모집단 평균 추정에서의 신뢰구간

– 모비율 추정


– 모분산 추정

• (…)

– Sample Variance

– 모분산과 표본분산의 관계: χ2 분포

– 표본크기의 산정

• μ 추정 시의 표본크기

– μ 추정 시: 표본크기는 z formula를 이용

• p 추정 시의 표본크기


7. Hypothesis Testing (Single Population)

• 7.1 개요 – Hypotheses의 종류

– Statistical Hypotheses

• H0 Ha

– 가설검정의 절차

– Rejection and Nonrejection Regions

– Type I 및 Type II Errors


• 7.2 z 통계량 이용한 모평균의 가설검정 (σ Known) – 단일평균에 대한 z Test

– 유한모집단의 평균에 대한 검정

– p-Value를 이용한 가설검정

• p-value = 관측된 유의수준 (level of significance)

– defines the smallest value of 𝛼 for which the H0 can be rejected.

• “α 가 p보다 커야만 H0를 reject 가능”

– Critical Value Method를 이용한 가설검정

• Rejecting H0 using p-values


• 7.3 t 통계량 이용한 모평균 가설검정 (σ Unknown) – (…)

• z Test of a Population Proportion

– Critical Value Method를 이용한 가설검정 • Rejecting H0 using p-values

• 7.4 비율에 관한 가설검정 – […]

• Using p-value

• Using the critical value method


• 7.5 Testing a variance

• Table χ² vs. observed χ²

• H0 can also be tested by the critical value method.

• Substituting the critical χ² value for α (instead of the observed χ² value) and solving for s² yields the critical sample variance (s_c²).

• 7.6 Type II Errors


(Estimation – two populations)

• 7.7 Estimation/hypothesis testing for the difference of two means using the z statistic (σ known)

– (…) – CLT: "the difference of two sample means, x̄₁ − x̄₂, is approximately normally distributed for large samples (both n₁ and n₂ ≥ 30) regardless of the shape of the populations"

– z formula for the difference in two sample means

– Hypothesis Testing – H0: μ1 – μ2 =δ

– Ha: μ1 – μ2 ≠δ

– Confidence Intervals


• 7.8 두 평균 차에 대한 추정/가설검정: 독립표본이고 σ Known – 가설검정

– t Test를 이용한 두 모평균 차에 대한 CI 수립 및 가설검정 – Confidence Intervals

• 7.9 서로 관련된 모집단에 대한 추정

– 종류 • Before-and-after study • Matched-pair with built-in relatedness, as an experimental control mechanism

(ex) twins, siblings

– 가설검정

– 신뢰구간


• 7.10 두 개 모비율에 대한 추정(p1 - p2)

– (…)

– 가설검정

– 신뢰구간

• 7.11 두 개 모분산에 대한 추정


Linear Algebra

Matrix

– Square Matrix

• ; has the same number of rows as columns

– Transpose

• ; created by converting its rows into columns


• Matrix multiplication

• Identity matrix – AI = A

• Orthogonal Matrix – A matrix A is orthogonal if AAT = ATA = I.


Vectors

• Concept – = points; components → dimension

• Length

• Vector operations – Addition

– Scalar multiplication • If v = [3, 6, 8, 4], then 1.5 * v = [4.5, 9, 12, 6]

– Inner product • = dot product = scalar product

Orthogonal Vector

• Orthogonality – = perpendicular inner product = 0

• Normal Vector

• Orthonormal Vector – = Vectors of unit length that are orthogonal to each other


Eigenvectors and Eigenvalues

• Eigenvector – = a nonzero vector v⃗ that satisfies A v⃗ = λ v⃗

where A = a square matrix, v⃗ = eigenvector, λ = eigenvalue

• Computing eigenvectors and eigenvalues

• Eigendecomposition – Diagonalization via eigenvalue decomposition (possible only for square matrices)

– Matrix product with a diagonal matrix

• SVD (singular value decomposition)

PCA, SVD, and Discriminant Analysis

• Three perspectives

– a method for transforming correlated variables into a set of uncorrelated ones that better exposes the various relationships among the original data items.

– a method for identifying and ordering the dimensions along which data points exhibit the most variation.

– a method for data reduction.

• Significance of SVD

– Dimensionality reduction exposes the substructure of the original data more clearly and orders it from most variation to least.

• Representative application: NLP

– Drastically reduce dimensionality by ignoring variation below a certain threshold while still capturing the main relationships latent in the documents

• Method – a rectangular matrix A can be factored into the product of three matrices: an orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V (A = U S Vᵀ)

• where UᵀU = I, VᵀV = I;

• the columns of U are orthonormal eigenvectors of AAᵀ,

• the columns of V are orthonormal eigenvectors of AᵀA,

• S is a diagonal matrix containing the square roots of the eigenvalues (of AAᵀ or AᵀA) in descending order.
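A hedged NumPy sketch of this factorization using np.linalg.svd, checking the stated properties on a small made-up matrix A:

import numpy as np

A = np.array([[3., 1., 1.],
              [-1., 3., 1.]])               # a 2x3 rectangular matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)                              # singular values, descending

print(np.allclose(A, U @ S @ Vt))           # A = U S V^T
print(np.allclose(U.T @ U, np.eye(2)))      # columns of U are orthonormal
# squared singular values = eigenvalues of A A^T (sorted descending)
print(np.allclose(s**2, np.linalg.eigvalsh(A @ A.T)[::-1]))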


ML & DL


DL – ML – AI

(Figure: nested circles — Deep Learning ⊂ Machine Learning ⊂ AI.)

Modeling

• Modeling

• Underfitting & Overfitting

• Procedure

Estimating f?

• Prediction

• The accuracy of Ŷ as a prediction for Y depends on two quantities, which we will call the reducible error and the irreducible error.

• Inference • (Rather than treating f as a black box) answer various questions about the relationship (e.g., the Advertising data discussed earlier)

• prediction and inference

• Supervised vs. Unsupervised Learning

• Model Flexibility vs. Interpretability


Supervised L. Unsupervised L.

• Bias-Variance Trade-Off – Regression

Resampling & Cross-Validation

• Hold-out for cross-validation – Holding back a validation dataset

– LOOCV

– K-fold CV

• Others – Boosting

– Jackknife

– …


(Figure: hold-out split into training data and testing data.)

5.1 Cross-Validation (An Introduction to Statistical Learning, p. 183)

FIGURE 5.5. A schematic display of 5-fold CV. A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.

… The magic formula (5.2) does not hold in general, in which case the model has to be refit n times.

5.1.3 k-Fold Cross-Validation

An alternative to LOOCV is k-fold CV. This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE1, MSE2, …, MSEk. The k-fold CV estimate is computed by averaging these values,

CV(k) = (1/k) Σ_{i=1}^{k} MSE_i.   (5.3)

Figure 5.5 illustrates the k-fold CV approach. It is not hard to see that LOOCV is a special case of k-fold CV in which k is set to equal n. In practice, one typically performs k-fold CV using k = 5 or k = 10. What is the advantage of using k = 5 or k = 10 rather than k = n? The most obvious advantage is computational. LOOCV requires fitting the statistical learning method n times. This has the potential to be computationally expensive (except for linear models fit by least squares, in which case formula (5.2) can be used). But cross-validation is a very general approach that can be applied to almost any statistical learning method. Some statistical learning methods have computationally intensive fitting procedures, and so performing LOOCV may pose computational …
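A minimal sketch of the k-fold CV estimate above, assuming scikit-learn (listed earlier in the environment setup) and synthetic data:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
mse = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

cv_k = np.mean(mse)        # CV(k) = (1/k) * sum_i MSE_i, formula (5.3)
print(cv_k)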

Model Selection & Regularization

• Concept – When there are many independent variables, reduce their number to simplify the model

• Subset Selection – Best Subset Selection

– Stepwise Selection

– Choosing the optimal model

• Shrinkage Methods – Ridge Regression

– The Lasso

– …

• Dimension Reduction – PCA


Overview of Neural Networks and DL

Biological Neural Nets

• Overview – An attempt to model the relationship between input and output signals after the human brain

– = a network of interconnected cells (= neurons) forming a massively parallel processor

• 85 billion neurons (human) (cf. mouse (75 million), fly (100k))

• The ultimate comparison is the Turing test

– Started with simple logical comparisons → rapid advances since

Artificial Neural Networks (ANN)

• Basic idea – A structure built by repeatedly stacking linear fitting and nonlinear transformations (activations)

• Repeatedly draw lines that separate the data well, then warp and combine those spaces (= optimization)

– History • Backpropagation in the 1980s

• Kernel methods in the early 2000s

• Parameter training by unsupervised learning in 2006

• CNN – Data → Features → Knowledge

• RNN – LSTM (Long Short-Term Memory)

• GPGPU

• Incoming signals arrive at the dendrites – = a biochemical process that allows the impulse to be weighted according to its relative importance or frequency

– The cell body accumulates the input signals; when a threshold is reached the cell fires, the output signal travels down the axon, and a chemical signal crosses the synapse (= a tiny gap) to the neighboring neuron

– X variables ≈ dendrites,

– Σ = directed network diagram, f = activation function

• NN의 특징을 결정하는 요소: – Activation function

• transforms a neuron's net input signal into a single output signal to be broadcasted further in the network

– Network topology (or architecture)

• ; describes the number of neurons in the model as well as the number of layers and manner in which they are connected

– Training algorithm

• specifies how connection weights are set in order to inhibit or excite neurons in proportion to the input signal

Activation Functions

• Threshold activation function – modeled after nature

– Unit step activation function

– Sigmoid activation function

• ; not binary, differentiable


– Other types of activation functions


• Differences among the various activation functions: – the range of the output signal – typically one of (0, 1), (−1, +1), or (−inf, +inf) – allows the construction of specialized neural networks.

• The squashing problem – For most activation functions, the range of input values that actually affects the output is relatively narrow. (e.g., for the sigmoid, the output signal is essentially 0 for inputs below −5 and essentially 1 for inputs above +5.)

– Signal compression: the signal saturates at the high and low ends for highly dynamic inputs, just as turning a guitar amplifier up too high results in a distorted sound due to clipping of the sound-wave peaks.

– squeezes input values into a smaller range of outputs

• Remedy – Transform the inputs so that feature values are close to 0 (standardization) – By limiting the input values, the activation function will act across its entire range, preventing large-valued features such as household income from dominating small-valued features such as the number of children in the household.


Network topology

• Concept = the pattern and structure of interconnected neurons • Number of layers

– Single-layer network – Multilayer network; has hidden layers that are typically fully connected

• Direction of information travel – Feedforward networks → deep learning – Feedback networks → recurrent networks (with a delay (= short-term memory))

• Number of nodes per layer – I/O nodes + hidden nodes – Use the fewest nodes that give adequate performance on a validation data set. – Universal function approximator

• = a NN with at least one hidden layer of sufficiently many neurons

Training Neural Networks with Backpropagation

• The network's connection weights come to reflect the patterns observed over time → backpropagation

• The backpropagation algorithm iterates through many cycles of two processes; each iteration is known as an epoch. – Because the network contains no a priori (existing) knowledge, the weights are typically set randomly before training begins. The algorithm then cycles through the processes until a stopping criterion is reached. Each cycle includes:

• A forward phase • A backward phase

Strengths:

• Can be adapted to classification or numeric prediction problems

• Among the most accurate modeling approaches

• Makes few assumptions about the data's underlying relationships

Weaknesses:

• Reputation for being computationally intensive and slow to train, particularly if the network topology is complex

• Easy to overfit or underfit the training data

• Results in a complex black-box model that is difficult, if not impossible, to interpret


• Gradient descent – determines how much (or whether) a weight should be changed, using information sent backward to reduce the total error

– Backpropagation algorithm uses the derivative of each neuron's activation function to identify the gradient in the direction of each of the incoming weights – hence the importance of having a differentiable activation function.

• Learning rate – The gradient suggests how steeply the error will be reduced or increased by a change in the weight.

– The algorithm changes the weights in the direction of greatest error reduction, by an amount controlled by the learning rate.


– = 1st order optimization algorithm

• = finding the "best" value of a function, i.e., its minimum value.

– Gradient is the slope of a function.

• The number of "turning points" of a function depends on the order of the function.

(Figure: a curve f(x) vs. x, with the global maximum, an inflection point, a local minimum, and the global minimum marked.)

– Not all turning points are minima.

• The least of all the minimum points is called the “global” minimum.

• Every minimum is a “local” minimum.


Day 2

NUMPY BASICS

numpy basics

• ndarray

• Inner product (matrix multiplication)

DEEP LEARNING WITHOUT TENSORFLOW


PERCEPTRON


Concept

• Perceptron – = artificial neuron

– = takes multiple signals as input and produces a single output signal

– Implementing a perceptron

y = 0 if w₁x₁ + w₂x₂ ≤ θ,  y = 1 if w₁x₁ + w₂x₂ > θ

With bias b = −θ:  y = 0 if b + w₁x₁ + w₂x₂ ≤ 0,  y = 1 if b + w₁x₁ + w₂x₂ > 0

Perceptrons and Logic Gates

• Truth table – AND

– NAND

– OR – (b, w1, w2) = (-0.5, 1.0, 1.0)

– Implementation (a sketch follows this list)

• MLP (multi-layer perceptron)

– XOR

– Implementation

– non-linear functions

• 2-layer perceptron with non-linear functions

– Sigmoid function as an activation function
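A minimal sketch of these gates as single perceptrons, and XOR as a 2-layer combination. Only the OR parameters (b, w1, w2) = (-0.5, 1.0, 1.0) come from the slide; the AND/NAND parameters are one common choice:

def perceptron(x1, x2, w1, w2, b):
    # y = 0 if b + w1*x1 + w2*x2 <= 0, else y = 1
    return int(b + w1 * x1 + w2 * x2 > 0)

def AND(x1, x2):  return perceptron(x1, x2, 0.5, 0.5, -0.7)
def NAND(x1, x2): return perceptron(x1, x2, -0.5, -0.5, 0.7)
def OR(x1, x2):   return perceptron(x1, x2, 1.0, 1.0, -0.5)

def XOR(x1, x2):
    # 2-layer perceptron: XOR = AND(NAND(x1, x2), OR(x1, x2))
    return AND(NAND(x1, x2), OR(x1, x2))

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, AND(*x), NAND(*x), OR(*x), XOR(*x))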

ANN (1) CONCEPTS

• Neural networks

– Bias

– Activation functions

• Sigmoid function

– Implementation

h(x) = 1 / (1 + exp(−x))

• Step function

• Comparing the sigmoid and step functions

• Non-linear functions

• ReLU function (see the NumPy sketch below)
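A NumPy sketch of the activation functions listed above (step, sigmoid, ReLU):

import numpy as np

def step(x):
    return (x > 0).astype(np.int32)      # step function: outputs 0 or 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # h(x) = 1 / (1 + exp(-x))

def relu(x):
    return np.maximum(0, x)              # max(0, x), element-wise

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(step(x), sigmoid(x), relu(x))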

– Inner product in a NN

– Inner_Product_NN.py

X · W = Y  (e.g., X: 1×2, W: 2×3, Y: 1×3)

– Neural network notation

w⁽¹⁾₁₂ : the layer-1 weight from the 2nd neuron of the previous layer to the 1st neuron of the next layer

– Signal propagation between layers (3-layer example)

a⁽¹⁾₁ = w⁽¹⁾₁₁ x₁ + w⁽¹⁾₁₂ x₂ + b⁽¹⁾₁

Input layer → Layer 1,  Layer 1 → Layer 2,  Layer 2 → Output layer
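A hedged NumPy sketch of this layer-to-layer signal propagation; the layer sizes and random weights are made up for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([1.0, 0.5])                       # input layer: 2 features
W1, b1 = np.random.randn(2, 3), np.zeros(3)    # input -> layer 1
W2, b2 = np.random.randn(3, 2), np.zeros(2)    # layer 1 -> layer 2
W3, b3 = np.random.randn(2, 2), np.zeros(2)    # layer 2 -> output

a1 = x @ W1 + b1        # a1[0] = w11*x1 + w12*x2 + b1, etc.
z1 = sigmoid(a1)
a2 = z1 @ W2 + b2
z2 = sigmoid(a2)
y = z2 @ W3 + b3        # identity function at the output layer
print(y)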

– Output layer

• Identity function

• Softmax

– Properties

» The outputs sum to 1

» y = exp(x) is monotonic, so the relative order of the elements is preserved

» Note: overflow must be handled (see the sketch below)!

– Batch processing

• Bundle many inputs together into one predict() call (e.g., 100 images)

– Mini-batch

y_k = exp(a_k) / Σ_{i=1}^{n} exp(a_i)
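A NumPy sketch of the softmax above, including the usual max-subtraction trick that addresses the overflow issue noted in the list:

import numpy as np

def softmax(a):
    c = np.max(a)                 # subtract the max to avoid overflow in exp()
    exp_a = np.exp(a - c)
    return exp_a / np.sum(exp_a)

a = np.array([1010, 1000, 990])   # a naive exp() would overflow here
y = softmax(a)
print(y, y.sum())                 # the outputs sum to 1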

ANN (2) TRAINING A NN

Cost

• Cross Entropy Loss

E = − Σ_k t_k log y_k

• Mean Square Error

• Misclassification Rate

• L1 loss – = LAD (least absolute deviations)

– cf. L2 loss = least squares

E = (1/2) Σ_k (y_k − t_k)²
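A NumPy sketch of the cross-entropy and mean-square-error losses above, for a single one-hot-labeled example:

import numpy as np

def cross_entropy(y, t, eps=1e-7):
    # E = -sum_k t_k * log(y_k); eps avoids log(0)
    return -np.sum(t * np.log(y + eps))

def mean_squared_error(y, t):
    # E = 1/2 * sum_k (y_k - t_k)^2
    return 0.5 * np.sum((y - t) ** 2)

t = np.array([0, 0, 1, 0])               # one-hot label
y = np.array([0.1, 0.05, 0.8, 0.05])     # predicted probabilities
print(cross_entropy(y, t), mean_squared_error(y, t))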

Key Concepts

• Initialization

• SGD and mini-batches – Stochastic gradient descent

– Mini-batch

Differentiation and Partial Derivatives

• Differentiation – Numerical differentiation

– Partial derivatives

The gradient of f(x₀, x₁) = x₀² + x₁²

– Gradient

• Compute the partial derivatives with respect to x₀ and x₁ simultaneously (e.g., at x₀ = 3, x₁ = 4): (∂f/∂x₀, ∂f/∂x₁)

• The vector of the partial derivatives with respect to all variables

– Gradient descent method

– Gradient ascent method (a numerical-gradient sketch follows)
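A NumPy sketch of the numerical (central-difference) gradient and gradient descent for f(x₀, x₁) = x₀² + x₁²:

import numpy as np

def numerical_gradient(f, x, h=1e-4):
    # central difference for each variable: (f(x+h) - f(x-h)) / (2h)
    grad = np.zeros_like(x)
    for i in range(x.size):
        tmp = x[i]
        x[i] = tmp + h; fxh1 = f(x)
        x[i] = tmp - h; fxh2 = f(x)
        grad[i] = (fxh1 - fxh2) / (2 * h)
        x[i] = tmp
    return grad

def f(x):                                  # f(x0, x1) = x0^2 + x1^2
    return x[0] ** 2 + x[1] ** 2

x = np.array([3.0, 4.0])
print(numerical_gradient(f, x))            # approximately [6. 8.]

for _ in range(100):                       # gradient descent: x <- x - lr * grad
    x -= 0.1 * numerical_gradient(f, x)
print(x)                                   # approaches [0. 0.]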

Training Algorithm

• SGD – mini-batch

• Compute gradients

• Update parameters

Source: http://neuralnetworksanddeeplearning.com/

Repeat

ANN (3) BACKPROPAGATION

Concept

• The calculation

• Computation Graph

(Figure: computation graph — 5,000 × 5 items and 20,000 × 2 items are multiplied at two nodes, the results are added to give 65,000, and multiplying by 1.1 gives 71,500.)

Chain Rule

• Composite Functions

• Chain rule

• If y = f[g(x)], then y′ = f′[g(x)] · g′(x)

• If y = f(u) and u = g(x), then dy/dx = (dy/du) · (du/dx)

Backpropagation

• Concept – the gradient computation required for gradient descent, made simple by using the chain rule

– Phase 1: Propagation

• Forward propagation:

• Back propagation:

– The error computed at the output neurons is propagated back through the edge weights to compute how much each neuron in the preceding layer contributed to the error.

– Phase 2: Weight update

• Compute the gradients of the parameters using the chain rule

• Example:

• BP for: – Multiplication

– Addition

• BP for Activation functions – ReLU

– Sigmoid

• Implementing the Affine/Softmax layers – Affine transformation

• The inner product in a feedforward NN is an affine transformation

– Affine transformation for a mini-batch

– The Softmax-with-Loss layer

• Backpropagation – Backpropagation through an addition node

– Backpropagation through a multiplication node (see the sketch below)
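A minimal sketch of backpropagation through multiplication and addition nodes, applied to the computation-graph example above; the layer-class names are made up for illustration:

class MulLayer:
    def forward(self, x, y):
        self.x, self.y = x, y
        return x * y
    def backward(self, dout):
        # multiplication node: swap the inputs on the way back
        return dout * self.y, dout * self.x

class AddLayer:
    def forward(self, x, y):
        return x + y
    def backward(self, dout):
        # addition node: pass the upstream gradient through unchanged
        return dout, dout

# forward pass of the graph above: (5000*5 + 20000*2) * 1.1 = 71500
a_layer, b_layer = MulLayer(), MulLayer()
add_layer, tax_layer = AddLayer(), MulLayer()

a = a_layer.forward(5000, 5)
b = b_layer.forward(20000, 2)
total = add_layer.forward(a, b)
price = tax_layer.forward(total, 1.1)
print(price)                                  # about 71500

# backward pass (gradients with respect to each input)
dprice = 1
dtotal, dtax = tax_layer.backward(dprice)
da, db = add_layer.backward(dtotal)
da_price, da_count = a_layer.backward(da)
print(dtotal, dtax, da_price, da_count)       # 1.1 65000 5.5 5500.0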

Optimizers

• SGD with Momentum

• RMS propagation

• Adagrad

• Adadelta

• Adam

Hands-on Practice

TENSORFLOW


• 1. TensorFlow overview – Overview

• 2. Installation and execution – Installing TensorFlow

– Running it • Hello World

• …

• MNIST

• 3. TensorFlow basics – Computational graphs

– Graphs, Sessions and Fetches

– Flowing Tensors

– Variables, placeholders and simple optimization

TENSORFLOW 개요

• TensorFlow 의 활용 현황 – <<PRE-TRAINED MODELS: 컴퓨터 비전>>

• publicly releasing pre-trained models — deep neural nets that are already trained, and only require users to download them and apply to their data

– <<자연어 처리 – IMAGE 설명>>

• Image captioning

– <<텍스트 요약>>

TensorFlow?

• Background – Google's deep learning framework (successor to the DistBelief project)

– Open-sourced in 2015 under the Apache 2.0 license

• Deep neural networks – Networks of neurons, each learning to do its own as part of a larger

picture.

– Data enters this network as input, and flows through the network as it adapts itself at training time or predicts outputs in a deployed system.

• Tensors – = multidimensional arrays, an extension of 2-dimensional tables (matrices)

to data with higher dimension.

– TensorFlow에서 computation은 dataflow graph로 진행 • nodes represent computational operations (such as addition, multiplication)

• edges represent data (tensors) flowing around the system.

. A dataflow computation graph.

• TensorFlow – = a S/W framework for numerical computations based on dataflow graphs.

– is designed as an interface for machine learning algorithms, chief among them deep neural networks.

• Features – Portability

– Flexibility • Many optimization algorithms

• TensorBoard

– Core: C++ + front-end: C++, Python

– +TF-Slim

– <<TENSORFLOW ABSTRACTIONS>> • Distributed training, Cloud (AWS, GCP)

2. Installation and Execution

• Installation – (native pip install)

• $ pip install tensorflow

• VirtualBox, Ubuntu, Java, Anaconda, …

• TensorFlow.org (Python 3.6)

– <<Add an alias to ~/.bashrc>>

• alias tensorflow="source ~/envs/tensorflow/bin/activate"

• Run the activate command to enter the virtual environment.

• Run deactivate when finished.

Hello World

– <<IDE CONFIGURATION>>

• In PyCharm IDE :

• Run->Edit Configurations... , then changing “Python Interpreter” to point to ~/envs/tensorflow/bin/python, assuming you used ~/envs/tensorflow as the virtualenv directory.

import tensorflow as tf

h = tf.constant("Hello")
w = tf.constant(" World!")
hw = h + w

with tf.Session() as sess:
    ans = sess.run(hw)
print(ans)

MNIST

• Softmax Regression

• For instance, the weight connecting the 38th pixel to the "zero" output will be

– a large positive number if a high intensity in the 38th pixel points strongly to the digit being a zero,

– a strong negative number if high intensity values in this position occur mostly in other digits,

– zero if the intensity value of the 38th pixel tells us nothing about whether or not this digit is a zero.

• Test

– http://yann.lecun.com/exdb/mnist

– <<MODEL EVALUATION AND MEMORY ERRORS>>

mnist (tensorflow.org)

• MNIST data – 55,000 (mnist.train)+10,000(mnist.test)+5,000(mnist.validation).

– 2 parts: image (x) + label (y). (mnist.train.images, mnist.train.labels)

– 28 x 28 pixels, flattened → mnist.train.images is a tensor (n-D array) with a shape of [55000, 784].

• our labels as "one-hot vectors".

– nth digit represented as a vector which is 1 in the nth dimension.

• Softmax Regressions for multinomial classification – 2 steps: (i) add up evidence of input being in certain classes,

(ii) convert that evidence into probabilities.

• a weighted sum of pixel intensities – weight is (-) if that pixel having a high intensity is evidence against the image being in that class, and (+) if it is evidence in favor.

– + Add some extra evidence called a bias

• Here softmax is serving as an "activation" or "link" function, shaping the output of our linear function into the form we want -- in this case, a probability distribution over 10 cases.

• Train and Evaluate

# headers
import argparse
import sys

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

FLAGS = None

def main(_):
    mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)

    # Create the model
    x = tf.placeholder(tf.float32, [None, 784])
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.matmul(x, W) + b

    # Define loss and optimizer
    y_ = tf.placeholder(tf.float32, [None, 10])
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

    sess = tf.InteractiveSession()
    tf.global_variables_initializer().run()

mnist for ML beginners (tensorflow.org)

    # Train
    for _ in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

    # Test trained model
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                        y_: mnist.test.labels}))

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', type=str,
                        default='/tmp/tensorflow/mnist/input_data',
                        help='Directory for storing input data')
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

Graph representation of the model. (Rectangles = variables, circles = placeholders.) Top left: the label-prediction part; bottom right: the evaluation part.

3. TENSORFLOW BASICS

Computational graphs

Graphs, Sessions, and Fetches

Flowing Tensors

Variables, placeholders, and simple optimization

Variables

Placeholders

Optimization

Computational graphs

• What is a computational graph? – ; represents the functional architecture

• Graph dependencies

Node e is directly dependent on node c, indirectly dependent on node a and independent of node d.

Graphs, Sessions 및 Fetches

• Example

– 2 phases:

• Build the graph

• Run the graph.

– tf.<node does what>

– default graph.

(1st_compgraph)

import tensorflow as tf

a = tf.constant(5)
b = tf.constant(2)
c = tf.constant(3)

d = tf.multiply(a, b)
e = tf.add(c, b)
f = tf.subtract(d, e)

sess = tf.Session()
outs = sess.run(f)
sess.close()
print("outs = {}".format(outs))

Output:
outs = 5

• Creating and running a Session

• Constructing and managing our (own) graph

g = tf.Graph()
print(g)

Output:
<tensorflow.python.framework.ops.Graph object at 0x7f0917a799d0>

g = tf.Graph()
print(g)

a = tf.constant(5)
print(a.graph)
print(tf.get_default_graph())
print(a.graph is g)
print(a.graph is tf.get_default_graph())

Output:
<tensorflow.python.framework.ops.Graph object at 0x7fc250a91710>
<tensorflow.python.framework.ops.Graph object at 0x7fc250a91750>
<tensorflow.python.framework.ops.Graph object at 0x7fc250a91750>
False
True

– Using the "with" context manager

g1 = tf.get_default_graph()
g2 = tf.Graph()

print(g1 is tf.get_default_graph())
with g2.as_default():
    print(g1 is tf.get_default_graph())
print(g1 is tf.get_default_graph())

Output:
True
False
True

• Fetches – ; contains elements of the graph we wish to compute.

• Earlier, we passed the specific node we wanted as an argument to sess.run(); we can also request the outputs of other nodes at the same time.

• The argument used for this request is called the 'fetches' argument.

– Since TF computation is organized as a graph, every sub-computation residing in a node can be fetched.

• TF computes efficiently by evaluating only the nodes that the requested subset of nodes depends on.

sess = tf.Session()
fetches = [a, b, c, d, e, f]
outs = sess.run(fetches)
sess.close()
print("outs = {}".format(outs))

Output:
outs = [5, 2, 3, 10, 5, 5]

Flowing Tensors

• Nodes = operations, edges = Tensor objects – When we build a node in the graph (e.g., tf.add()), we actually create an operation instance.

• we should think about nodes in TensorFlow as operations.

• These operations do not produce actual values until the graph is executed – rather they reference the result as a handle that can be passed on — "flow" — to another node (TF calls these handles Tensor objects),

• In TF, first a skeleton graph is created with all of its components; at this point no actual data flows in it and no computations take place.

• When a Session runs, data enters the graph and computations occur; this is efficient because the whole graph structure is taken into account.

c = tf.constant(4.0, dtype='float64')
print(c)
print(c.dtype)

Output:
Tensor("Const_10:0", shape=(), dtype=float64)
<dtype: 'float64'>

• Data types – 명시적 선택 or default type

c = tf.constant(4.0, dtype='float64') print c print c.dtype Output: Tensor("Const_10:0", shape=(), dtype=float64)

Data type Python type

DT_FLOAT tf.float32

DT_DOUBLE tf.float64

DT_INT8 tf.int8

DT_INT16 tf.int16

DT_INT32 tf.int32

DT_INT64 tf.int64

DT_UINT8 tf.uint8

DT_UINT16 tf.uint16

DT_STRING tf.string

DT_BOOL tf.bool

DT_COMPLEX64 tf.complex64

DT_COMPLEX128 tf.complex128

DT_QINT8 tf.qint8

DT_QINT32 tf.qint32

DT_QUINT8 tf.quint8

• Tensor arrays 및 Shapes – 2가지의 “Tensor”

• Tensor: name of an object used in the Python API as a handle for the result of an operation in the graph.

• tensor: a fancy name for n-dimensional arrays.

– 1x1 tensor : scalar, 1xn : vector, nxn: a matrix, nxnxn: a 3-D array, …

– TF regards all data units that flow in the graph as tensors, whether they are multidimensional arrays, vectors, matrices or scalars.

– Like with dtype, unless stated explicitly, TensorFlow infers the shape of the data.

import numpy as np

c = tf.constant([[1,2,3],[4,5,6]])
print("With Python list: {}".format(c.get_shape()))

c = tf.constant(np.array([[1,2,3],[4,5,6]]))
print("With NumPy array: {}".format(c.get_shape()))

c = tf.constant([[[1,2,3], [4,5,6]], [[1,1,1], [2,2,2]]])
print("3d list: {}".format(c.get_shape()))

Output:
With Python list: (2, 3)
With NumPy array: (2, 3)
3d list: (2, 2, 3)

– <<NUMPY>>

• TensorFlow and NumPy are tightly coupled - for example, the output returned by sess.run() is a NumPy array.

c = tf.fill((3,2,2), 13)
print(c.get_shape())

sess = tf.Session()
print(sess.run(c))
sess.close()

Output:
(3, 2, 2)
[[[13 13]
  [13 13]]
 [[13 13]
  [13 13]]
 [[13 13]
  [13 13]]]

– 기타의 constant generator

TensorFlow operation Description

tf.constant(value) Creates a tensor populated with values of as specified by arguments “value”

tf.fill(shape,value) Creates a tensor of shape “shape" and fills it with “value"

tf.zeros(shape) Returns a tensor of shape “shape” and all elements set to zero

tf.zeros_like(tensor) Returns a tensor of the same type and shape as “tensor" with all elements set to zero

tf.ones(shape) Returns a tensor of shape “shape" and all elements set to 1

tf.ones_like(tensor) Returns a tensor of the same type and shape as “tensor" with all elements set to 1

TensorFlow operation Description

tf.random_normal(shape, mean, stddev) Outputs random values from a normal distribution

tf.truncated_normal(shape, mean, stddev)

Outputs random values from a truncated normal distribution (values whose magnitude is more than 2 standard deviations from the mean are dropped and re-picked).

tf.random_uniform(shape, minval, maxval) Generated values from a uniform distribution in the range [minval, maxval)

tf.random_shuffle(tensor) Randomly shuffles a tensor along its first dimension

tf.random_crop(tensor, shape) Slices a shape “shape” portion out of “tensor" at a uniformly chosen offset

tf.multinomial(logits, n_samples) Draws samples from a multinomial distribution

tf.random_gamma(shape,alpha,beta) Draws “shape" samples from each of the given Gamma distribution(s)

c = tf.random_normal((3,3), mean=0.0, stddev=1.0)
print(c.get_shape())

sess = tf.Session()
print(sess.run(c))
sess.close()

Output:
(3, 3)
[[-0.27925715 -0.48847842 -0.05559563]
 [ 0.390374   -1.03677559 -3.6154325 ]
 [ 1.44002807 -1.56482267 -0.40263358]]

c = tf.linspace(0.0, 4.0, 5)
print(c.get_shape())

sess = tf.Session()
print(sess.run(c))
sess.close()

Output:
(5,)
[ 0.  1.  2.  3.  4.]

import tensorflow as tf

with tf.Graph().as_default():
    x = tf.random_normal((5,10), mean=0.0, stddev=1.0)
    w = tf.random_normal((10,1), mean=0.0, stddev=1.0)
    b = tf.fill((5,1), -1.)
    xw = tf.matmul(x, w)
    xwb = xw + b
    s = tf.sigmoid(xwb)

    sess = tf.Session()
    outs = sess.run(s)
    sess.close()

print("outs = {}".format(outs))

outs = [[ 0.88813359]
        [ 0.82033235]
        [ 0.35137126]
        [ 0.49279356]
        [ 0.01243223]]

– <<OPERATOR OVERLOADING>>

TensorFlow operator Shortcut

tf.add() a + b

tf.mul() a * b

tf.sub() a - b

tf.div() a / b

tf.pow() a ** b

tf.mod() a % b

tf.logical_and() a & b

tf.greater() a > b

tf.greater_equal() a >= b

tf.less_equal() a <= b

tf.less() a < b

tf.neg() -a

tf.logical_not() ~a

tf.abs() abs(a)

tf.logical_or() a | b

• Names – Each Tensor object has an identifying name.

– This name is an intrinsic string name, not to be confused with the name of the variable. As with dtype, we can use the .name attribute to see the name of the object.

with tf.Graph().as_default():
    c1 = tf.constant(4, dtype='float64', name='c')
    c2 = tf.constant(4, dtype='int32', name='c')
print(c1.name)
print(c2.name)

Output:
c:0
c_1:0

– <<DUPLICATE NAMES>>

• The same name cannot be reused within one graph; an underscore and index (_1, _2, …) are appended automatically.

g1 = tf.Graph()
with g1.as_default():
    c1 = tf.constant(4, dtype='float64', name='c')
    print(c1.name)

g2 = tf.Graph()
with g2.as_default():
    c2 = tf.constant(4, dtype='int32', name='c')
    print(c2.name)

Output:
c:0
c:0

– <<NAME SCOPEs>>

• Use tf.name_scope(“prefix”) together with “with” clause:

with tf.Graph().as_default():
    c1 = tf.constant(4, dtype='float64', name='c')
    with tf.name_scope("2"):
        c2 = tf.constant(4, dtype='int32', name='c')
        c3 = tf.constant(4, dtype='float64', name='c')
print(c1.name)
print(c2.name)
print(c3.name)

Output:
c:0
2/c:0
2/c_1:0

Variables, Placeholders, and Simple Optimization

• Variables – During the optimization process, tuning the weights of the model by iterative updates requires that their current state be maintained. For that purpose, TensorFlow uses special objects called "Variables".

– Variables, unlike other Tensor objects that are "refilled" across calls to run(), can maintain a fixed state in the graph. Like other Tensors, Variables can be used as input for other operations in the graph.

– Two stages when using a Variable: • (1) call tf.Variable() to create the Variable and define the value it will be initialized with • (2) explicitly perform the initialization by running the session with an initializer op (tf.global_variables_initializer(), formerly tf.initialize_all_variables()), which allocates the memory for the variable and sets the initial value defined in (1).

– Like with other Tensor objects, Variables are only computed when the model runs:

c = tf.constant(15, name='c')
x = tf.Variable(c * 5, name='x')
init = tf.global_variables_initializer()
print(x)

with tf.Session() as sess:
    sess.run(init)
    print(sess.run(x))

Output:
<tensorflow.python.ops.variables.Variable object at 0x7f0c600c8810>
75

• Placeholders – built-in structures for feeding input values – a kind of empty variable that will be filled with data later on. – We use them by first constructing the graph, and feeding them with input data only when it is executed. – 3 arguments:

• dtype for the type of data that will be inserted to them • 2 optional arguments — shape and name.

– If a shape is not given, the placeholder can be fed with data of any size (the same effect as specifying 'None').

– We can also use 'None' for specific dimensions whose length we are not sure about — e.g., only for the rows dimension of a matrix (the number of samples), while keeping the number of columns (features) fixed. We feed the input values when running the session, right after stating which outputs we want to evaluate.

– Once a placeholder is defined, we must feed it with input values or an exception will be thrown.

– The input data is given as a dictionary, where each key corresponds to a placeholder variable name, and the matching values are the data values given in the form of a list or a numpy array.
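A minimal TF 1.x sketch of placeholders and the feed_dict mechanism described above, with made-up input shapes:

import numpy as np
import tensorflow as tf

x_data = np.random.randn(5, 10)
w_data = np.random.randn(10, 1)

with tf.Graph().as_default():
    x = tf.placeholder(tf.float32, shape=(None, 10))   # None: any number of rows (samples)
    w = tf.placeholder(tf.float32, shape=(10, 1))
    b = tf.fill((5, 1), -1.)
    xwb = tf.matmul(x, w) + b
    s = tf.reduce_max(tf.sigmoid(xwb))

    with tf.Session() as sess:
        # keys are the placeholder handles, values are numpy arrays
        out = sess.run(s, feed_dict={x: x_data, w: w_data})

print("out = {}".format(out))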

Day 3

CNN

Overview

• Basic concepts – Local receptive fields

– Shared weights

– Pooling (or down-sampling)

• Exploits the spatial structure of the data

• Feed-forward feature extraction:

– 1. Convolve input with learned filters

– 2. Non-linearity

– 3. Spatial pooling

– 4. Normalization

• Supervised training of convolutional filters by back-propagating classification error

• LOCAL CONNECTIVITY – Example with an MNIST image

• 28x28 image → local connectivity (5x5 receptive field)

Source: Neural Networks and Deep Learning. Michael Nielsen.

• SHARED WEIGHTS – Within a local receptive field, CNNs use the same shared weights for each of the 24x24 hidden neurons → parameter reduction (e.g., for a 5x5 receptive field, we need only 25 shared weights).

• 20 feature maps with 5x5 kernels — 20 * 26 = 520 parameters (25 weights + 1 bias per map)

• A fully connected first layer, with 784 = 28*28 input neurons and a relatively modest 30 hidden neurons, would require 784 * 30 = 23,520 weights, more than 40 times as many parameters as the convolutional layer.

• CONVOLUTIONAL LAYER

– The shared weights and bias are called a kernel or filter.

– translation invariance.

• Since these filters work on every part of the image, they are "searching" for the same feature everywhere in the image.

아키텍처

– Input layer – 28x28 pixels

– Convolutional layer — 3 feature maps (5x5 kernel)

Pooling Layer

• Pooling layers usually follow a convolutional layer; they down-sample the convolution output.

• In the example above, a 2x2 region is used as the input to the pooling.

Pooling Layer

• Various pooling methods – typically max pooling and average pooling: – Pooling layers downsample the volume spatially, making the representation less sensitive to small translations of the features. They also reduce the number of parameters.

Source: CS231n Convolutional Neural Networks for Visual Recognition.

Classification

• We then add a dense fully-connected layer (usually using softmax) at the end of NN in order to get predictions for the problem we’re working on (10 classes, 10 digits).

Going Deeper

• We rarely see a shallow convnet like that.

• Replicating convolutional + pooling layers produces better results the deeper you go. Winners of the ImageNet challenge have more than 15 layers (VGGNet has 19 layers).

– Sander Dieleman et al.'s Galaxy Zoo network, the best-performing network (winner of the challenge).

Source: Rotation-invariant convolutional neural networks for galaxy morphology prediction.

The Dropout Technique

• A countermeasure against overfitting – especially on dense layers.

• Dropping occurs only at training time, not at test time.

Source: "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". Nitish Srivastava et al.

Day 4

RNN

Why RNN?

• Recurrent Neural Networks

Recurrent neuron

• Unrolling a recurrent network into a feed-forward network

• RNN의 특징

• Machine Translation

Training an RNN

• Vanishing/Exploding gradients

• 1. Exploding gradients – Truncated BPTT

– Clip gradients at threshold

– RMSprop to adjust learning rate

• 2. Vanishing gradients – Harder to detect

– Weight initialization

– ReLU activation functions

– RMSprop

– LSTM, GRUs

LSTM

• Long Short-Term Memory (LSTM)

Output gate

Input gate

LSTM

LSTM

LSTM

GRU – simplified LSTM

Advanced Topics & Applications

Projects

Recommended