1
Machine Learning – Lecture 1
Marios Polycarpou
Director, KIOS Research and Innovation Center of Excellence
Professor, Electrical and Computer Engineering
University of Cyprus
MSc on Intelligent Critical Infrastructure Systems
2
Fourth Industrial Revolution
What is a technological revolution?
• 1st Industrial Revolution (1760)
• 2nd Industrial Revolution (1900)
• 3rd Industrial Revolution (1960)
• 4th Industrial Revolution (2000) – term coined in 2015 by Klaus Schwab, founder and executive chairman of the World Economic Forum.
The Fourth Industrial Revolution combines hardware, software, and biology (cyber-physical systems), and emphasizes advances in communication and connectivity. It is expected to be marked by breakthroughs in fields such as machine learning, robotics, nanotechnology, biotechnology, the Internet of Things (IoT), wireless technologies (5G), 3D printing, and fully autonomous vehicles.
3
Motivation for Machine Learning
Sensor technology
• wealth of sensors
• new generation sensors
Information & Communication Technology (ICT)
• store, process and transmit data collected by sensors
4
Motivation for Machine Learning
Big Data
• sensor technology and ICT enabled the collection of extremely large data sets
• contribute to the better perception of complex systems
Internet of Things (IoT)
• sensor-enabled devices connected to the internet and able to communicate with each other
5
Motivation for Machine Learning
• Image processing
• Speech recognition
• Prediction in financial services
• Finding patterns in marketing and sales
• Improving healthcare
• Automation – no human intervention needed
• Continuous improvement
• Wide applications
6
How is Machine Learning related to Intelligent Critical Infrastructure Systems?
Critical Infrastructure Systems (CIS)
Heterogeneous – Interdependent – Interconnected
Risks, Faults, Attacks
Intelligent Systems and Networks
Sensors, Actuators
Big Data, Internet of Things
Monitoring, Control
Management, Security
Information Communication Technologies (ICT)
Big Data → Smart Decisions
7
How is Machine Learning related to Intelligent Critical Infrastructure Systems?
Smart Buildings
Smart Grids
8
How is Machine Learning related to Intelligent Critical Infrastructure Systems?
Water Systems
• Revenue & Water Losses
• Water Quality
• Energy Consumption
• Safety & Security
9
How is Machine Learning related to Intelligent Critical Infrastructure Systems?
Multi-Agent Formation
Intelligent Transportation
10
How is Machine Learning related to Intelligent Critical Infrastructure Systems?
Smart Cities
Physical Interconnections
Cyber Interconnections
Interdependencies
11
Defining Machine Learning
A machine learning algorithm is an algorithm that is able to learn from data. Other related terms: data-driven methods, computational intelligence, artificial intelligence (AI).
In general, Artificial Intelligence and Computational Intelligence are broader concepts aimed at creating intelligent machines that can simulate human thinking and behaviour. Machine Learning is a subset of AI that allows machines to learn from data without being explicitly programmed.
12
Defining Machine Learning
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E (Mitchell, 1997).
T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
13
The Task T
• Classification
• Regression
• Monitoring and anomaly detection
• Transcription
• Speech recognition
• Machine translation
• Navigation and Control
14
The Task T -- Classification
Classifier: a function $f : \mathbb{R}^n \to \{1, 2, \ldots, K\}$
Input (features): $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$
Output (classes): $y \in \{1, 2, \ldots, K\}$
15
The Task T -- Regression
Predictor: a function $f : \mathbb{R}^n \to \mathbb{R}$, mapping input $x \in \mathbb{R}^n$ to output $y = f(x) \in \mathbb{R}$
16
The Performance Measure P
• Error rate; error properties
• Training set; Test set
• Different performance measures may be used
• What are norms?

Examples of norms ($x \in \mathbb{R}^n$):
$$\|x\|_1 := \sum_{i=1}^{n} |x_i| \quad \text{(1-norm)}$$
$$\|x\|_2 := \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2} \quad \text{(2-norm, Euclidean norm)}$$
$$\|x\|_\infty := \max_{1 \le i \le n} |x_i| \quad \text{(∞-norm)}$$
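As a quick illustration (not part of the original slides; the vector below is an arbitrary example), the three norms can be computed with NumPy:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])            # example vector in R^3

norm_1   = np.sum(np.abs(x))              # 1-norm: sum of absolute values
norm_2   = np.sqrt(np.sum(x**2))          # 2-norm (Euclidean norm)
norm_inf = np.max(np.abs(x))              # infinity-norm: largest absolute entry

# np.linalg.norm gives the same results
assert np.isclose(norm_1,   np.linalg.norm(x, 1))
assert np.isclose(norm_2,   np.linalg.norm(x, 2))
assert np.isclose(norm_inf, np.linalg.norm(x, np.inf))
```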
17
The Experience E
What type of dataset is available?
Types of Learning
• Supervised Learning (there exist labelled examples)
• Unsupervised Learning (clustering, dimensionality reduction)
• Semi-Supervised Learning
• Reinforcement Learning
18
Machine Learning – Simple Example
Dataset: samples $x_1, x_2, \ldots, x_N$ with targets $y = f(x)$, where
$$f(x) = \sin(2\pi x), \qquad N = 10$$
19
Machine Learning – Simple Example

Adaptive Approximation Model: polynomial of degree $M$
$$\hat{f}_M(x; w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

Learning: minimizing a cost function (or error function)
$$J(w) = \frac{1}{2} \sum_{n=1}^{N} \left( \hat{f}_M(x_n; w) - y_n \right)^2$$
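A minimal sketch of this example (my own illustration, not code from the slides; the noise level and degree M are arbitrary choices): generate N = 10 noisy samples of sin(2πx) and fit a degree-M polynomial by minimizing the sum-of-squares cost J(w). np.polyfit solves exactly this least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset: y_n = sin(2*pi*x_n) + noise, N = 10 samples
N = 10
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

# Adaptive approximation model: polynomial of degree M
M = 3
w = np.polyfit(x, y, deg=M)               # least-squares fit of the coefficients

# Cost function J(w) = 1/2 * sum_n ( f_hat(x_n; w) - y_n )^2
y_hat = np.polyval(w, x)
J = 0.5 * np.sum((y_hat - y) ** 2)
print("fitted coefficients:", w, " cost J(w):", J)
```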
20
Machine Learning – Simple Example
[Figure: polynomial fits of different degrees – underfitting, appropriate capacity, and overfitting]
21
Underfitting and overfitting depend on both M and N.
Machine Learning – Simple Example
22
Capacity, Overfitting and Underfitting
• The ability to perform well on previously unobserved inputs is called generalization. This is a key challenge in machine learning.
• Training set vs. test set. Typically, the generalization error (test error) is measured by the performance on a test set.
• The expected test error is greater than or equal to the expected training error
The performance of a machine learning algorithm is based on both:
1. Making the training error small
2. Making the gap between the training and test error small
23
Capacity, Overfitting and Underfitting
• Occam's Razor – a principle stating that among competing hypotheses that explain known observations equally well, we should choose the simplest one. Named after William of Ockham (1287-1347).
• Statistical Learning Theory provides various means of quantifying model capacity. The Vapnik-Chervonenkis dimension (VC dimension) is the most well known. However, it is not very practical with advanced machine learning algorithms.
24
Regularization

One way of controlling model capacity is to add a regularization term to the cost function:
$$J_R(w) = \frac{1}{2} \sum_{n=1}^{N} \left( \hat{f}_M(x_n; w) - y_n \right)^2 + \frac{\lambda}{2} \|w\|^2$$
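For illustration only (an assumed setup, not from the slides): with a polynomial design matrix Φ, the regularized cost J_R(w) has a closed-form minimizer, the ridge-regression solution (ΦᵀΦ + λI)w = Φᵀy, sketched below.

```python
import numpy as np

def ridge_poly_fit(x, y, M, lam):
    """Minimize J_R(w) = 1/2*||Phi w - y||^2 + lam/2*||w||^2 (illustrative sketch)."""
    Phi = np.vander(x, M + 1, increasing=True)    # design matrix: columns 1, x, ..., x^M
    A = Phi.T @ Phi + lam * np.eye(M + 1)         # regularized normal equations
    return np.linalg.solve(A, Phi.T @ y)

# usage with noisy sin(2*pi*x) data, as in the earlier example
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x)
w = ridge_poly_fit(x, y, M=9, lam=1e-3)           # large M, but lambda limits overfitting
```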
25
Hyperparameters and Validation Sets
• Hyperparameters – parameters that affect the performance of the algorithm but are not adapted by the learning algorithm itself; e.g., the degree of the polynomial is a capacity hyperparameter; the value of λ in regularization is another hyperparameter.
• Validation Set – used to select the hyperparameters. Split the training set into two disjoint subsets: one subset of data is used to learn the parameters, while the other subset (the validation set) is used to estimate the generalization error in order to update the hyperparameters; e.g., 80% of the data is used for training and 20% for validation.
• Three types of data: Training set, Validation set, Test set.
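A hedged sketch of the 80/20 split described above (the data, candidate degrees, and names are made up for illustration): the validation error guides the choice of the hyperparameter M.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

# 80% of the data for training, 20% for validation
idx = rng.permutation(len(x))
n_train = int(0.8 * len(x))
x_tr, y_tr = x[idx[:n_train]], y[idx[:n_train]]
x_va, y_va = x[idx[n_train:]], y[idx[n_train:]]

best = None
for M in range(10):                               # candidate hyperparameter values
    w = np.polyfit(x_tr, y_tr, deg=M)             # learn parameters on the training subset
    val_err = np.mean((np.polyval(w, x_va) - y_va) ** 2)   # estimated generalization error
    if best is None or val_err < best[1]:
        best = (M, val_err)
print("selected degree M =", best[0])
```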
26
Batch Learning vs. Online Learning
• Batch Learning – machine learning methods that learn from the entire training data set at once.
• Online Learning – machine learning methods in which data becomes available in sequential order and is used to update the parameters at each step. Data does not need to be stored after it is used to update the adjustable parameters.
• Something in between batch learning and online learning is to use a window of data (not the entire data set) to update the parameters.
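As a sketch (my own illustration; the linear model and step size are arbitrary), the difference can be seen for a simple least-squares problem: the batch solution uses the whole dataset at once, while the online (stochastic gradient) update processes one sample at a time and then discards it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(100)

# Batch learning: least-squares solution from the entire training set
w_batch, *_ = np.linalg.lstsq(X, y, rcond=None)

# Online learning: update the parameters one sample at a time, then discard the sample
w_online = np.zeros(3)
alpha = 0.05
for x_n, y_n in zip(X, y):                        # data arrives sequentially
    error = x_n @ w_online - y_n
    w_online -= alpha * error * x_n               # gradient of 1/2*(x_n^T w - y_n)^2
print(w_batch, w_online)
```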
27
Mathematical Foundations for Machine Learning
Linear Algebra Optimization Approximation Theory Probability Theory Random Signals and Systems
28
Optimization
• Optimization is fundamental in multiple science and engineering fields
• The main idea is to take the derivative and set it to zero (but it soon becomes more complicated than that…)
29
Optimization

Optimization problem:
$$\min_{x \in \mathbb{R}^n} F(x)$$
$$\text{subject to } c_i(x) = 0, \quad i = 1, 2, \ldots, m$$
$$\qquad\qquad\; c_{i'}(x) \ge 0, \quad i' = m+1, \ldots, m'$$
$F$ : objective function; $\{c_i\}$ : constraint functions

Some notation: $F : \mathbb{R}^n \to \mathbb{R}$

Gradient of $F$:
$$\nabla F = \left( \frac{\partial F}{\partial x_1}, \frac{\partial F}{\partial x_2}, \ldots, \frac{\partial F}{\partial x_n} \right)$$

Hessian matrix ($n \times n$ symmetric matrix):
$$\nabla^2 F = \begin{bmatrix} \dfrac{\partial^2 F}{\partial x_1^2} & \cdots & \dfrac{\partial^2 F}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 F}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 F}{\partial x_n^2} \end{bmatrix}, \qquad \left[\nabla^2 F\right]_{ij} = \frac{\partial^2 F}{\partial x_i \partial x_j}$$
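Purely as an illustration of the definitions above (not from the slides; the test function is arbitrary), the gradient and Hessian can be approximated numerically by finite differences:

```python
import numpy as np

def num_gradient(F, x, h=1e-6):
    """Central finite-difference approximation of the gradient of F at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (F(x + e) - F(x - e)) / (2 * h)
    return g

def num_hessian(F, x, h=1e-4):
    """Finite-difference approximation of the (symmetric) Hessian of F at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = h
        H[:, i] = (num_gradient(F, x + e) - num_gradient(F, x - e)) / (2 * h)
    return 0.5 * (H + H.T)                        # symmetrize

F = lambda x: x[0] ** 2 + 3 * x[0] * x[1] + 2 * x[1] ** 2
print(num_gradient(F, np.array([1.0, 1.0])))      # ~ [5, 7]
print(num_hessian(F, np.array([1.0, 1.0])))       # ~ [[2, 3], [3, 4]]
```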
30
Optimization
Necessary and sufficient conditions for unconstrained optimization
[Figure: $f(x)$ vs. $x$, showing a global minimum, a (strong) local minimum, and (weak) local minima]

Univariate case:
$$\min_{x \in \mathbb{R}} f(x)$$
Necessary conditions for $x^*$ to be a minimum:
$$f'(x^*) = 0, \qquad f''(x^*) \ge 0$$
Sufficient conditions for $x^*$ to be a minimum:
$$f'(x^*) = 0, \qquad f''(x^*) > 0$$
31
Optimization
Multivariate case:
$$\min_{x \in \mathbb{R}^n} F(x), \qquad x = (x_1, x_2, \ldots, x_n)$$
Necessary conditions for $x^*$ to be a minimum:
$$\nabla F(x^*) = 0, \qquad \nabla^2 F(x^*) \text{ is positive semi-definite}$$
Sufficient conditions for $x^*$ to be a minimum:
$$\nabla F(x^*) = 0, \qquad \nabla^2 F(x^*) \text{ is positive definite}$$
32
Optimization
When is a matrix A positive (semi)‐definite?
$A$ is positive-definite ($A > 0$) if $x^\top A x > 0$ for any non-zero $x$.
$A$ is positive-semidefinite ($A \ge 0$) if $x^\top A x \ge 0$ for all $x$.
Lemma: $A > 0 \iff \lambda_i(A) > 0$ for all $i = 1, \ldots, n$ (all the eigenvalues are positive).
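A small sketch (illustration only; the matrices are arbitrary examples): checking positive definiteness via the eigenvalues of a symmetric matrix, as in the lemma above.

```python
import numpy as np

def is_positive_definite(A, tol=1e-10):
    """A symmetric matrix is positive definite iff all its eigenvalues are positive."""
    eigvals = np.linalg.eigvalsh(A)               # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals > tol))

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])                       # eigenvalues 1 and 3 -> positive definite
B = np.array([[1.0, 2.0],
              [2.0, 1.0]])                        # eigenvalues -1 and 3 -> not positive definite
print(is_positive_definite(A), is_positive_definite(B))   # True False
```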
33
Optimization

Convexity: The set $S$ is convex if
$$\alpha x + (1 - \alpha) y \in S \quad \forall x, y \in S, \ \alpha \in [0, 1].$$
[Figure: a convex set vs. a set that is not convex]
A function $F : S \to \mathbb{R}$ is convex over $S$ if
$$F(\alpha x + (1 - \alpha) y) \le \alpha F(x) + (1 - \alpha) F(y) \quad \forall x, y \in S, \ \alpha \in [0, 1].$$
[Figure: the chord between $F(x)$ and $F(y)$ lies above the graph of a convex function]
Lemma: If $F \in C^1$ and $F$ is convex over $\mathbb{R}^n$, then $x^*$ is a global minimum of $F$ iff $\nabla F(x^*) = 0$.
34
Optimization

Example: Quadratic function
$$\min_{x \in \mathbb{R}^n} F(x) = c^\top x + \tfrac{1}{2} x^\top G x, \qquad c \in \mathbb{R}^n, \ G \in \mathbb{R}^{n \times n} \text{ symmetric}$$
$$\nabla F(x) = c + G x, \qquad \nabla F(x^*) = 0 \ \Rightarrow \ G x^* = -c$$
$$\nabla^2 F(x) = G, \qquad x^* \text{ is a minimum of } F \text{ if } G > 0$$
Note: $F(x)$ is convex $\iff G \ge 0$
Warning: not all functions that we want to minimize are smooth
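An illustrative sketch (not from the slides; the particular G and c are arbitrary): for the quadratic function the minimizer solves Gx* = -c, which can be computed directly when G is positive definite.

```python
import numpy as np

G = np.array([[4.0, 1.0],
              [1.0, 3.0]])                        # symmetric, positive definite
c = np.array([1.0, -2.0])

F = lambda x: c @ x + 0.5 * x @ G @ x             # F(x) = c^T x + 1/2 x^T G x

x_star = np.linalg.solve(G, -c)                   # gradient c + G x = 0  =>  G x* = -c
print("minimizer:", x_star, " F(x*):", F(x_star))
```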
35
Optimization

Contour plots: curves along which $F(x)$ is constant, for different values of the constant.
• $F(x_1, x_2) = x_1^2 + x_2^2$
• $F(x_1, x_2) = e^{x_1}\left(4x_1^2 + 2x_2^2 + 4x_1x_2 + 2x_2 + 1\right)$
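A sketch of how such contour plots could be generated (the matplotlib usage and plot ranges are my own illustration, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

x1, x2 = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))

F1 = x1**2 + x2**2
F2 = np.exp(x1) * (4*x1**2 + 2*x2**2 + 4*x1*x2 + 2*x2 + 1)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].contour(x1, x2, F1, levels=10)            # circular contours centred at the origin
axes[1].contour(x1, x2, F2, levels=20)            # curved, elongated contours
plt.show()
```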
36
Optimization

Linear Programming: (Simplex method)
Standard form:
$$\min_{x \in \mathbb{R}^n} c^\top x \quad \text{subject to } Ax = b, \ x \ge 0, \qquad A, b, c \text{ are given.}$$

Unconstrained Minimization: (Steepest descent; Newton's method)
$$\min_{x \in \mathbb{R}^n} F(x)$$

Constrained Minimization: (Gradient projection; Penalty methods; Lagrange methods)
$$\min_{x \in \mathbb{R}^n} F(x) \quad \text{subject to } c(x) = 0, \ d(x) \ge 0$$
37
Optimization
Optimization Algorithms
What are the performance criteria?
• Convergence
• Rate of convergence
• Complexity
38
Optimization
Algorithms for Unconstrained Minimization
(A) Steepest Descent Method (Gradient Method)
$$\min_{x \in \mathbb{R}^n} F(x)$$
Start from an initial guess $x^0$ and generate a sequence of vectors $x^1, x^2, \ldots, x^k, \ldots$ according to
$$x^{k+1} = x^k - \alpha_k \nabla F(x^k), \qquad \alpha_k > 0$$
$\alpha_k$ is determined by a line search so that $x^{k+1}$ minimizes $F(x)$ in the direction $d_k = -\nabla F(x^k)$ from $x^k$.
39
Optimization
Simplest gradient descent:
$$x^{k+1} = x^k - \alpha \nabla F(x^k), \qquad \alpha > 0 \text{ constant}$$
Trade-offs in choosing $\alpha$:
• speed
• convergence
• complexity
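A minimal sketch of the constant-step-size version (my own illustration; the quadratic test function, the step size α, and the stopping rule are arbitrary choices):

```python
import numpy as np

def gradient_descent(grad_F, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """x_{k+1} = x_k - alpha * grad F(x_k) with a constant step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_F(x)
        if np.linalg.norm(g) < tol:               # stop when the gradient is (almost) zero
            break
        x = x - alpha * g
    return x

# example: F(x) = c^T x + 1/2 x^T G x, so grad F(x) = c + G x
G = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([-1.0, 1.0])
x_min = gradient_descent(lambda x: c + G @ x, x0=[0.0, 0.0])
print(x_min, np.linalg.solve(G, -c))              # both should agree
```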
40
Optimization
Intuition for Steepest Descent
[Figure: 1-dimensional case ($x \in \mathbb{R}$): iterates $x^0, x^1, \ldots$ moving downhill along $F(x)$; 2-dimensional case ($x \in \mathbb{R}^2$): iterates crossing the contour lines of $F(x_1, x_2)$]
41
Optimization
Example: $F(x) = \tfrac{1}{2} x^2$, $x \in \mathbb{R}$
$$\nabla F(x) = x, \qquad x^{k+1} = x^k - \alpha_k x^k = (1 - \alpha_k)\, x^k \ \to\ 0$$

Example: $F(x_1, x_2) = e^{x_1}\left(4x_1^2 + 2x_2^2 + 4x_1x_2 + 2x_2 + 1\right)$
$$\nabla F(x_1, x_2) = \begin{bmatrix} e^{x_1}\left(4x_1^2 + 2x_2^2 + 4x_1x_2 + 8x_1 + 6x_2 + 1\right) \\ e^{x_1}\left(4x_1 + 4x_2 + 2\right) \end{bmatrix}$$
$$x^{k+1} = x^k - \alpha_k \nabla F(x^k)$$
42
Optimization

In continuous-time:
$$\frac{x^{k+1} - x^k}{\alpha_k} = -\nabla F(x^k) \quad \longrightarrow \quad \dot{x} = -\nabla F(x), \quad x(0) = x^0, \ \text{ as } \alpha_k \to 0$$
After rescaling:
$$\dot{x} = -\Gamma\, \nabla F(x), \qquad \Gamma \text{ is symmetric and positive definite}$$
Example: $\min_{x \in \mathbb{R}^n} F(x) = \tfrac{1}{2} x^\top G x$
$$\nabla F(x) = G x, \qquad \dot{x} = -G x, \ x(0) = x^0 \quad \Rightarrow \quad x(t) = e^{-Gt} x^0$$
43
Optimization

(B) Newton's Method: $\min_{x \in \mathbb{R}^n} F(x)$ ($F$ is convex)
$$\nabla F(x^*) = 0 \quad \text{(necessary condition)}$$
First-order expansion of the gradient around $x^k$:
$$\nabla F(x) \approx \nabla F(x^k) + \nabla^2 F(x^k)\,(x - x^k)$$
Let $H(x) := \nabla^2 F(x)$ (Hessian). We want
$$\nabla F(x^{k+1}) \approx \nabla F(x^k) + H(x^k)\,(x^{k+1} - x^k) = 0$$
$$\Rightarrow \quad x^{k+1} = x^k - H(x^k)^{-1}\, \nabla F(x^k) \qquad \text{(assuming } H(x^k)^{-1} \text{ exists)}$$
44
Optimization
In continuous-time:
$$\dot{x} = -H(x)^{-1}\, \nabla F(x), \qquad x(0) = x^0$$
Example: $F(x) = \tfrac{1}{2} x^2$
$$\nabla F(x) = x, \qquad H(x) = 1, \qquad H(x)^{-1} = 1$$
$$x^{k+1} = x^k - H(x^k)^{-1}\, \nabla F(x^k) = x^k - x^k = 0$$
Starting from any $x^0$, we get $x^1 = 0 = x^*$.
Convergence in one step!
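A tiny sketch of this one-step behaviour (illustration only; the starting value is arbitrary): for F(x) = x²/2 the Newton update reaches the minimizer from any starting point.

```python
def newton_step_1d(x, grad, hess):
    """One Newton update in the scalar case: x - F'(x) / F''(x)."""
    return x - grad(x) / hess(x)

# F(x) = 0.5 * x^2  =>  F'(x) = x, F''(x) = 1
x0 = 7.3
x1 = newton_step_1d(x0, grad=lambda x: x, hess=lambda x: 1.0)
print(x1)   # 0.0 -- the minimizer, reached in a single step
```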
45
Optimization
Example: $F(x_1, x_2) = e^{x_1}\left(4x_1^2 + 2x_2^2 + 4x_1x_2 + 2x_2 + 1\right)$
$$\nabla F(x_1, x_2) = \begin{bmatrix} e^{x_1}\left(4x_1^2 + 2x_2^2 + 4x_1x_2 + 8x_1 + 6x_2 + 1\right) \\ e^{x_1}\left(4x_1 + 4x_2 + 2\right) \end{bmatrix}$$
$$H(x) = \nabla^2 F(x) = e^{x_1} \begin{bmatrix} 4x_1^2 + 2x_2^2 + 4x_1x_2 + 16x_1 + 10x_2 + 9 & 4x_1 + 4x_2 + 6 \\ 4x_1 + 4x_2 + 6 & 4 \end{bmatrix}$$
$$H(x)^{-1} = \frac{e^{-x_1}}{8\left(2x_1 - x_2 - 2x_1x_2 - x_2^2\right)} \begin{bmatrix} 4 & -(4x_1 + 4x_2 + 6) \\ -(4x_1 + 4x_2 + 6) & 4x_1^2 + 2x_2^2 + 4x_1x_2 + 16x_1 + 10x_2 + 9 \end{bmatrix}$$
$$x^{k+1} = x^k - H(x^k)^{-1}\, \nabla F(x^k)$$
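A sketch implementing these formulas (my own illustration; the starting point and stopping rule are arbitrary choices): Newton's method applied to F(x1, x2) = e^{x1}(4x1² + 2x2² + 4x1x2 + 2x2 + 1), using the gradient and Hessian derived above.

```python
import numpy as np

def F(x):
    x1, x2 = x
    return np.exp(x1) * (4*x1**2 + 2*x2**2 + 4*x1*x2 + 2*x2 + 1)

def grad_F(x):
    x1, x2 = x
    return np.exp(x1) * np.array([
        4*x1**2 + 2*x2**2 + 4*x1*x2 + 8*x1 + 6*x2 + 1,
        4*x1 + 4*x2 + 2,
    ])

def hess_F(x):
    x1, x2 = x
    return np.exp(x1) * np.array([
        [4*x1**2 + 2*x2**2 + 4*x1*x2 + 16*x1 + 10*x2 + 9, 4*x1 + 4*x2 + 6],
        [4*x1 + 4*x2 + 6,                                  4.0],
    ])

# starting point chosen reasonably close to the minimizer,
# since the pure Newton iteration is only locally convergent
x = np.array([1.0, -1.0])
for _ in range(20):
    step = np.linalg.solve(hess_F(x), grad_F(x))  # H(x_k)^{-1} grad F(x_k)
    x = x - step
    if np.linalg.norm(step) < 1e-10:
        break
print("x* =", x, " F(x*) =", F(x))
```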