Applied Math for Machine Learning
Prof. Kuan-Ting Lai
2021/3/11
Applied Math for Machine Learning
• Linear Algebra
• Probability
• Calculus
• Optimization
Linear Algebra
• Scalar – a real number
• Vector (1D) – has a magnitude & a direction
• Matrix (2D) – an array of numbers arranged in rows & columns
• Tensor (>=3D) – a multi-dimensional array of numbers
Real-world examples of Data Tensors
• Timeseries data – 3D (samples, timesteps, features)
• Images – 4D (samples, height, width, channels)
• Video – 5D (samples, frames, height, width, channels)
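A minimal NumPy sketch of these shapes (the sizes are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical sizes, for illustration only
timeseries = np.zeros((128, 60, 8))           # (samples, timesteps, features)       -> 3D
images     = np.zeros((128, 224, 224, 3))     # (samples, height, width, channels)   -> 4D
video      = np.zeros((16, 30, 224, 224, 3))  # (samples, frames, h, w, channels)    -> 5D

print(timeseries.ndim, images.ndim, video.ndim)  # 3 4 5
```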
Vector Dimension vs. Tensor Dimension
• The number of entries in a vector is also called its "dimension"
• In deep learning, the dimension of a tensor is also called its "rank"
• Matrix = 2D array = 2D tensor = rank-2 tensor
https://deeplizard.com/learn/video/AiyK0idr4uM
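A quick sketch of the distinction (example values chosen here, not from the slide): a vector with three entries is 3-dimensional as a vector, yet a rank-1 tensor:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])    # a 3-dimensional vector...
print(v.ndim)                    # ...but a rank-1 (1D) tensor: prints 1

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # two 3-dimensional row vectors
print(M.ndim)                    # a rank-2 (2D) tensor: prints 2
```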
The Matrix
Matrix
• Define a matrix A with m rows and n columns: A ∈ ℝ^(m×n)
Santanu Pattanayak, "Pro Deep Learning with TensorFlow," Apress, 2017
Matrix Operations
• Addition and subtraction are element-wise: (A ± B)_ij = a_ij ± b_ij
Matrix Multiplication
• Two matrices A ∈ ℝ^(m×n) and B ∈ ℝ^(p×q)
• The number of columns of A must equal the number of rows of B, i.e. n == p
• A * B = C, where C ∈ ℝ^(m×q) and c_ij = Σ_k a_ik b_kj
Example of Matrix Multiplication
https://www.mathsisfun.com/algebra/matrix-multiplying.html
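As a check, here is the same kind of 2×3 by 3×2 product in NumPy (a minimal sketch; the numbers follow the linked walkthrough):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])    # 2x3 (m=2, n=3)
B = np.array([[ 7,  8],
              [ 9, 10],
              [11, 12]])     # 3x2 (p=3): n == p, so the product is defined

C = A @ B                    # 2x2 result, c_ij = sum_k a_ik * b_kj
print(C)                     # [[ 58  64]
                             #  [139 154]]
```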
Dot Product
• The dot product of two vectors is a scalar
• The inner product is a generalization of the dot product
• Notation: v₁ ⋅ v₂ or v₁ᵀv₂
Dot Product in a Matrix
Outer Product
https://en.wikipedia.org/wiki/Outer_product
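A short NumPy illustration of both products (example vectors chosen here): the dot product collapses two vectors to a scalar, while the outer product expands them into a matrix:

```python
import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

print(np.dot(v1, v2))    # 32, a scalar: 1*4 + 2*5 + 3*6
print(np.outer(v1, v2))  # 3x3 matrix with entries v1[i] * v2[j]
```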
Linear Independence
• A vector is linearly dependent on other vectors if it can be expressed as a linear combination of them
• A set of vectors v₁, v₂, ⋯, vₙ is linearly independent if a₁v₁ + a₂v₂ + ⋯ + aₙvₙ = 0 implies aᵢ = 0 for all i ∈ {1, 2, ⋯, n}
Span the Vector Space
• n linearly independent vectors span an n-dimensional space
Rank of a Matrix
• Rank is:
  – The number of linearly independent row or column vectors
  – The dimension of the vector space generated by its columns
• Row rank = column rank
• Example: reduce the matrix to row-echelon form; the rank is the number of nonzero rows
https://en.wikipedia.org/wiki/Rank_(linear_algebra)
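NumPy can compute the rank directly; a small sketch (example matrix chosen here) with one linearly dependent row:

```python
import numpy as np

A = np.array([[1, 2, 1],
              [2, 4, 2],   # = 2 * row 1, so linearly dependent
              [3, 1, 0]])

print(np.linalg.matrix_rank(A))  # 2
```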
Identity Matrix I
• Any vector or matrix multiplied by I remains unchanged
• For a matrix A: AI = IA = A
Inverse of a Matrix
• The product of a square matrix A and its inverse matrix A⁻¹ is the identity matrix I
• AA⁻¹ = A⁻¹A = I
• An inverse matrix is always square, but not all square matrices have inverses
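A minimal check in NumPy (example matrix chosen here; note that np.linalg.inv raises LinAlgError for a singular matrix):

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
A_inv = np.linalg.inv(A)

print(A @ A_inv)  # ~ [[1, 0], [0, 1]], up to floating-point error
```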
Pseudo Inverse
• A non-square matrix may instead have a left-inverse or right-inverse
• Example: solve Ax = b, where A ∈ ℝ^(m×n), b ∈ ℝ^m
  – Create a square matrix AᵀA
  – Multiply both sides by its inverse (AᵀA)⁻¹:
    AᵀAx = Aᵀb
    x = (AᵀA)⁻¹Aᵀb
  – (AᵀA)⁻¹Aᵀ is the (left) pseudo-inverse of A
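A sketch of the same least-squares solution in NumPy (example data chosen here), comparing the explicit formula with NumPy's built-in pseudo-inverse:

```python
import numpy as np

# Over-determined system Ax = b: 3 equations, 2 unknowns
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

x = np.linalg.inv(A.T @ A) @ A.T @ b  # x = (A^T A)^-1 A^T b
print(x)
print(np.linalg.pinv(A) @ b)          # same result via the pseudo-inverse
```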
Norm
• A norm measures a vector's magnitude
• l₂ norm: ‖x‖₂ = (Σᵢ xᵢ²)^(1/2)
• l₁ norm: ‖x‖₁ = Σᵢ |xᵢ|
• lₚ norm: ‖x‖ₚ = (Σᵢ |xᵢ|^p)^(1/p)
• l∞ norm: ‖x‖∞ = maxᵢ |xᵢ|
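All four norms are available through np.linalg.norm (a minimal sketch; example vector chosen here):

```python
import numpy as np

x = np.array([3.0, -4.0])

print(np.linalg.norm(x, 2))       # l2: sqrt(9 + 16) = 5.0
print(np.linalg.norm(x, 1))       # l1: |3| + |-4| = 7.0
print(np.linalg.norm(x, 4))       # lp with p = 4
print(np.linalg.norm(x, np.inf))  # l-inf: max(|3|, |-4|) = 4.0
```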
Eigen Vectors
• An eigenvector is a non-zero vector that is changed only by a scalar factor λ when the linear transformation A is applied to it:
  Ax = λx, A ∈ ℝ^(n×n), x ∈ ℝ^n
• x are the eigenvectors and λ are the eigenvalues
• One of the most important concepts in machine learning, e.g.:
  – Principal Component Analysis (PCA)
  – Eigenvector centrality
  – PageRank
  – …
Example: Shear Mapping
• The horizontal axis is the eigenvector
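A quick NumPy check of the shear example (the standard horizontal shear matrix is assumed here):

```python
import numpy as np

A = np.array([[1.0, 1.0],    # horizontal shear
              [0.0, 1.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                             # [1. 1.]
print(eigvecs[:, 0])                       # [1. 0.]: the horizontal axis

x = eigvecs[:, 0]
print(np.allclose(A @ x, eigvals[0] * x))  # True: Ax = lambda * x
```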
Principal Component Analysis (PCA)
• Eigenvectors of the covariance matrix
https://en.wikipedia.org/wiki/Principal_component_analysis
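A minimal PCA sketch along these lines (synthetic data chosen here; eigh is used because a covariance matrix is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0],
                                          [1.0, 0.5]])  # correlated 2D data

Xc = X - X.mean(axis=0)               # center the data
C = np.cov(Xc, rowvar=False)          # covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
components = eigvecs[:, ::-1]         # principal components, largest first
projected = Xc @ components           # data in the principal-component basis
print(eigvals[::-1])                  # variance explained by each component
```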
NumPy for Linear Algebra
• NumPy is the fundamental package for scientific computing with Python. It contains among other things:
  – a powerful N-dimensional array object
  – sophisticated (broadcasting) functions
  – tools for integrating C/C++ and Fortran code
  – useful linear algebra, Fourier transform, and random number capabilities
Create Tensors
• Scalars (0D tensors), vectors (1D tensors), matrices (2D tensors)
Create 3D Tensor
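The original code screenshots did not survive extraction; a minimal reconstruction covering 0D through 3D tensors:

```python
import numpy as np

x0 = np.array(12)                  # scalar: 0D tensor
x1 = np.array([1, 2, 3])           # vector: 1D tensor
x2 = np.array([[1, 2],
               [3, 4]])            # matrix: 2D tensor
x3 = np.array([[[1, 2], [3, 4]],
               [[5, 6], [7, 8]]])  # 3D tensor: a stack of matrices

print(x0.ndim, x1.ndim, x2.ndim, x3.ndim)  # 0 1 2 3
```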
Attributes of a NumPy Tensor
• Number of axes (dimensions, rank) – x.ndim
• Shape – a tuple of integers showing how many entries the tensor has along each axis
• Data type – uint8, float32, or float64
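For example (array and sizes chosen here for illustration):

```python
import numpy as np

x = np.zeros((64, 28, 28), dtype=np.float32)  # e.g., a batch of 28x28 images

print(x.ndim)   # 3, the number of axes (rank)
print(x.shape)  # (64, 28, 28)
print(x.dtype)  # float32
```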
NumPy Multiplication
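The slide's code screenshot is lost; a sketch of the usual contrast between NumPy's element-wise and matrix multiplication:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A * B)  # element-wise (Hadamard) product
print(A @ B)  # matrix product, equivalent to np.dot(A, B)
```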
Unfolding the Manifold
• Tensor operations are complex geometric transformations in high-dimensional space
  – Dimension reduction
Basics of Probability
Three Axioms of Probability
• Given events Eᵢ in a sample space S, with S = ⋃_{i=1}^n Eᵢ
• First axiom
  – P(E) ∈ ℝ, 0 ≤ P(E) ≤ 1
• Second axiom
  – P(S) = 1
• Third axiom
  – Additivity: for any countable sequence of mutually exclusive events Eᵢ,
    P(⋃_{i=1}^n Eᵢ) = P(E₁) + P(E₂) + ⋯ + P(Eₙ) = Σ_{i=1}^n P(Eᵢ)
Union, Intersection, and Conditional Probability
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• P(A ∩ B) is abbreviated as P(AB)
• Conditional probability P(A|B): the probability of event A given that B has occurred
  – P(A|B) = P(AB) / P(B)
  – P(AB) = P(A|B) P(B) = P(B|A) P(A)
Chain Rule of Probability
• The joint probability can be expanded with the chain rule:
  P(A₁A₂⋯Aₙ) = P(A₁) P(A₂|A₁) P(A₃|A₁A₂) ⋯ P(Aₙ|A₁A₂⋯Aₙ₋₁)
Mutually Exclusive
• P(AB) = 0
• P(A ∪ B) = P(A) + P(B)
Independence of Events
• Two events A and B are said to be independent if the probability of their intersection is equal to the product of their individual probabilities
  – P(AB) = P(A) P(B)
  – P(A|B) = P(A)
Bayes Rule
• P(A|B) = P(B|A) P(A) / P(B)
• Proof:
  – Recall that P(A|B) = P(AB) / P(B)
  – So P(AB) = P(A|B) P(B) = P(B|A) P(A)
  – Then Bayes: P(A|B) = P(B|A) P(A) / P(B)
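A quick worked example (illustrative numbers, not from the slide): if a condition has prevalence P(D) = 0.01 and a test has P(+|D) = 0.99 and P(+|¬D) = 0.05, then P(D|+) = (0.99 × 0.01) / (0.99 × 0.01 + 0.05 × 0.99) ≈ 0.17, so even a positive result leaves the condition unlikely.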
Naïve Bayes Classifier
• Naïve = assume all features are independent
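Under this assumption, Bayes rule gives the classifier its standard factorized form: P(y | x₁, ⋯, xₙ) ∝ P(y) P(x₁|y) P(x₂|y) ⋯ P(xₙ|y).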
Normal (Gaussian) Distribution
• One of the most important distributions
• Central limit theorem
  – Averages of samples of observations of random variables independently drawn from independent distributions converge to the normal distribution
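A small simulation of the central limit theorem (averages of uniform draws; example parameters chosen here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Average 30 uniform draws, 10,000 times; the averages are ~ Gaussian
means = rng.uniform(0, 1, size=(10000, 30)).mean(axis=1)
print(means.mean())  # ~ 0.5
print(means.std())   # ~ sqrt(1/12) / sqrt(30) ~ 0.053
```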
Differentiation
• The derivative of y = f(x) is written f′(x) or dy/dx
Derivatives of Basic Functions (dy/dx)
Gradient of a Function
• The gradient is the multi-variable generalization of the derivative
• Apply partial derivatives: ∇f = (∂f/∂x₁, …, ∂f/∂xₙ)
• Example: see the numerical sketch below
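A numerical sketch (example function f(x, y) = x² + y² chosen here), approximating the gradient with central finite differences:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + y**2

def numerical_gradient(f, v, h=1e-6):
    """Central-difference approximation of the gradient."""
    grad = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        grad[i] = (f(v + e) - f(v - e)) / (2 * h)
    return grad

print(numerical_gradient(f, np.array([1.0, 2.0])))  # ~ [2. 4.], analytic: (2x, 2y)
```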
Chain Rule
• For z = f(y) and y = g(x): dz/dx = (dz/dy)(dy/dx)
Maxima and Minima for a Univariate Function
• If df(x)/dx = 0, the point is a candidate minimum or maximum; we then study the second derivative:
  – If d²f(x)/dx² < 0 => maximum
  – If d²f(x)/dx² > 0 => minimum
  – If d²f(x)/dx² = 0 => point of inflection
Gradient Descent
Gradient Descent along a 2D Surface
Avoid Local Minimum using Momentum
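The slides' plots are lost; a minimal gradient-descent-with-momentum sketch (toy objective f(x) = x² and hyperparameters chosen here):

```python
import numpy as np

def grad(x):              # gradient of f(x) = x^2
    return 2 * x

x, v = 5.0, 0.0           # starting point and velocity
lr, beta = 0.1, 0.9       # learning rate and momentum coefficient

for _ in range(200):
    v = beta * v - lr * grad(x)  # velocity accumulates past gradients,
    x = x + v                    # helping roll through shallow local minima
print(x)                  # ~ 0.0, the minimum
```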
Optimization
https://en.wikipedia.org/wiki/Optimization_problem
Principal Component Analysis (PCA)
• Assumptions
  – Linearity
  – Mean and variance are sufficient statistics
  – The principal components are orthogonal
Principal Component Analysis (PCA)
• Objective: maximize the variance of the projected data under an orthonormality constraint:
  max PᵀCP  s.t.  PᵀP = I   (C: covariance matrix of the data)
References
• Francois Chollet, "Deep Learning with Python," Chapter 2, "Mathematical Building Blocks of Neural Networks"
• Santanu Pattanayak, "Pro Deep Learning with TensorFlow," Apress, 2017
• Machine Learning Cheat Sheet
• https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
• https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when
• Wikipedia