1
Machine Learning – Lecture 1
Marios Polycarpou
Director, KIOS Research and Innovation Center of Excellence
Professor, Electrical and Computer Engineering
University of Cyprus
MSc on Intelligent Critical Infrastructure Systems
2
Fourth Industrial Revolution
What is a technological revolution?
• 1st Industrial Revolution (1760)
• 2nd Industrial Revolution (1900)
• 3rd Industrial Revolution (1960)
• 4th Industrial Revolution (2000) – term coined in 2015 by Klaus Schwab, founder and executive chairman of the World Economic Forum.
The Fourth Industrial Revolution combines hardware, software, and biology (cyber-physical systems), and emphasizes advances in communication and connectivity. It is expected to be marked by breakthroughs in fields such as machine learning, robotics, nanotechnology, biotechnology, the Internet of Things (IoT), wireless technologies (5G), 3D printing, and fully autonomous vehicles.
3
Motivation for Machine Learning
Sensor technology
• wealth of sensors
• new generation sensors
Information & Communication Technology (ICT)
• store, process and transmit data collected by sensors
4
Motivation for Machine Learning
Big Data
• sensor technology and ICT enabled the collection of extremely large data sets
• contribute to the better perception of complex systems
Internet of Things (IoT)
• sensor-enabled devices connected to the internet and able to communicate with each other
5
Motivation for Machine Learning
• Image processing
• Speech recognition
• Prediction in financial services
• Finding patterns in marketing and sales
• Improving healthcare
• Automation – no human intervention needed
• Continuous improvement
• Wide applications
6
How is Machine Learning related to Intelligent Critical Infrastructure Systems?
Critical Infrastructure Systems (CIS)
Heterogeneous – Interdependent – Interconnected
Risks, Faults, Attacks
Intelligent Systems and Networks
Sensors, Actuators
Big Data, Internet of Things
Monitoring, Control
Management, Security
Information Communication Technologies (ICT)
Big Data → Smart Decisions
7
How is Machine Learning related to Intelligent Critical Infrastructure Systems?
Smart Buildings
Smart Grids
8
How is Machine Learning related to Intelligent Critical Infrastructure Systems?
Water Systems
• Revenue & Water Losses
• Water Quality
• Energy Consumption
• Safety & Security
9
How is Machine Learning related to Intelligent Critical Infrastructure Systems?
Multi-Agent Formation
Intelligent Transportation
10
How is Machine Learning related to Intelligent Critical Infrastructure Systems?
Smart Cities
Physical Interconnections
Cyber Interconnections
Interdependencies
11
Defining Machine Learning
A machine learning algorithm is an algorithm that is able to learn from data. Other related terms: data-driven methods, computational intelligence, artificial intelligence (AI).
In general, Artificial Intelligence and Computational Intelligence are broader concepts aimed at creating intelligent machines that can simulate human thinking and behaviour. Machine Learning is a subset of AI that allows machines to learn from data without being explicitly programmed.
12
Defining Machine Learning
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E (Mitchell, 1997).
T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
13
The Task T
• Classification
• Regression
• Monitoring and anomaly detection
• Transcription
• Speech recognition
• Machine translation
• Navigation and Control
14
The Task T -- Classification
Classifier: a function $f : \mathbb{R}^n \to \{1, 2, \ldots, K\}$
Input (features): $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$
Output (classes): $y \in \{1, 2, \ldots, K\}$
15
The Task T -- Regression
Predictor: a function $f : \mathbb{R}^n \to \mathbb{R}$, mapping input $x \in \mathbb{R}^n$ to output $y = f(x) \in \mathbb{R}$
16
The Performance Measure P
• Error rate; error properties
• Training set; Test set
• Different performance measures may be used
• What are norms?

Examples of norms ($x \in \mathbb{R}^n$):
$$\|x\|_1 := \sum_{i=1}^{n} |x_i| \quad \text{(1-norm)}$$
$$\|x\|_2 := \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2} \quad \text{(2-norm, Euclidean norm)}$$
$$\|x\|_\infty := \max_{1 \le i \le n} |x_i| \quad \text{(∞-norm)}$$
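As a quick illustration (not part of the original slides; the vector below is an arbitrary example), the three norms can be computed with NumPy:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])            # example vector in R^3

norm_1   = np.sum(np.abs(x))              # 1-norm: sum of absolute values
norm_2   = np.sqrt(np.sum(x**2))          # 2-norm (Euclidean norm)
norm_inf = np.max(np.abs(x))              # infinity-norm: largest absolute entry

# np.linalg.norm gives the same results
assert np.isclose(norm_1,   np.linalg.norm(x, 1))
assert np.isclose(norm_2,   np.linalg.norm(x, 2))
assert np.isclose(norm_inf, np.linalg.norm(x, np.inf))
```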
17
The Experience E
What type of dataset is available?
Types of Learning
• Supervised Learning (there exist labelled examples)
• Unsupervised Learning (clustering, dimensionality reduction)
• Semi-Supervised Learning
• Reinforcement Learning
18
Machine Learning – Simple Example
Dataset: samples $x_1, x_2, \ldots, x_N$ with targets $y = f(x)$, where
$$f(x) = \sin(2\pi x), \qquad N = 10$$
19
Machine Learning – Simple Example

Adaptive Approximation Model: polynomial of degree $M$
$$\hat{f}_M(x; w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

Learning: minimizing a cost function (or error function)
$$J(w) = \frac{1}{2} \sum_{n=1}^{N} \left( \hat{f}_M(x_n; w) - y_n \right)^2$$
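A minimal sketch of this example (my own illustration, not code from the slides; the noise level and degree M are arbitrary choices): generate N = 10 noisy samples of sin(2πx) and fit a degree-M polynomial by minimizing the sum-of-squares cost J(w). np.polyfit solves exactly this least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset: y_n = sin(2*pi*x_n) + noise, N = 10 samples
N = 10
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

# Adaptive approximation model: polynomial of degree M
M = 3
w = np.polyfit(x, y, deg=M)               # least-squares fit of the coefficients

# Cost function J(w) = 1/2 * sum_n ( f_hat(x_n; w) - y_n )^2
y_hat = np.polyval(w, x)
J = 0.5 * np.sum((y_hat - y) ** 2)
print("fitted coefficients:", w, " cost J(w):", J)
```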
20
Machine Learning – Simple Example
[Figure: polynomial fits of different degrees – underfitting, appropriate capacity, and overfitting]
21
Underfitting and overfitting depend on both M and N.
Machine Learning – Simple Example
22
Capacity, Overfitting and Underfitting
• The ability to perform well on previously unobserved inputs is called generalization. This is a key challenge in machine learning.
• Training set vs. test set. Typically, the generalization error (test error) is measured by the performance on a test set.
• The expected test error is greater than or equal to the expected training error
The performance of a machine learning algorithm is based on both:
1. Making the training error small
2. Making the gap between the training and test error small
23
Capacity, Overfitting and Underfitting
• Occam's Razor – a principle stating that among competing hypotheses that explain known observations equally well, we should choose the simplest one. Named after William of Ockham (1287-1347).
• Statistical Learning Theory provides various means of quantifying model capacity. The Vapnik-Chervonenkis dimension (VC dimension) is the most well known. However, it is not very practical with advanced machine learning algorithms.
24
Regularization

One way of controlling model capacity is to add a regularization term to the cost function:
$$J_R(w) = \frac{1}{2} \sum_{n=1}^{N} \left( \hat{f}_M(x_n; w) - y_n \right)^2 + \frac{\lambda}{2} \|w\|^2$$
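For illustration only (an assumed setup, not from the slides): with a polynomial design matrix Φ, the regularized cost J_R(w) has a closed-form minimizer, the ridge-regression solution (ΦᵀΦ + λI)w = Φᵀy, sketched below.

```python
import numpy as np

def ridge_poly_fit(x, y, M, lam):
    """Minimize J_R(w) = 1/2*||Phi w - y||^2 + lam/2*||w||^2 (illustrative sketch)."""
    Phi = np.vander(x, M + 1, increasing=True)    # design matrix: columns 1, x, ..., x^M
    A = Phi.T @ Phi + lam * np.eye(M + 1)         # regularized normal equations
    return np.linalg.solve(A, Phi.T @ y)

# usage with noisy sin(2*pi*x) data, as in the earlier example
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x)
w = ridge_poly_fit(x, y, M=9, lam=1e-3)           # large M, but lambda limits overfitting
```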
25
Hyperparameters and Validation Sets
• Hyperparameters – parameters that affect the performance of the algorithm but are not adapted by the learning algorithm itself; e.g., the degree of the polynomial is a capacity hyperparameter; the value of λ in regularization is another hyperparameter.
• Validation Set – used to select the hyperparameters. Split the training set into two disjoint subsets: one subset of data is used to learn the parameters, while the other subset (the validation set) is used to estimate the generalization error in order to update the hyperparameters; e.g., 80% of the data is used for training and 20% for validation.
• Three types of data: Training set, Validation set, Test set.
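A hedged sketch of the 80/20 split described above (the data, candidate degrees, and names are made up for illustration): the validation error guides the choice of the hyperparameter M.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

# 80% of the data for training, 20% for validation
idx = rng.permutation(len(x))
n_train = int(0.8 * len(x))
x_tr, y_tr = x[idx[:n_train]], y[idx[:n_train]]
x_va, y_va = x[idx[n_train:]], y[idx[n_train:]]

best = None
for M in range(10):                               # candidate hyperparameter values
    w = np.polyfit(x_tr, y_tr, deg=M)             # learn parameters on the training subset
    val_err = np.mean((np.polyval(w, x_va) - y_va) ** 2)   # estimated generalization error
    if best is None or val_err < best[1]:
        best = (M, val_err)
print("selected degree M =", best[0])
```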
26
Batch Learning vs. Online Learning
• Batch Learning – machine learning methods that learn from the entire training data set at once.
• Online Learning – machine learning methods in which data becomes available in sequential order and is used to update the parameters at each step. Data does not need to be stored after it is used to update the adjustable parameters.
• Something in between batch learning and online learning is to use a window of data (not the entire data set) to update the parameters.
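As a sketch (my own illustration; the linear model and step size are arbitrary), the difference can be seen for a simple least-squares problem: the batch solution uses the whole dataset at once, while the online (stochastic gradient) update processes one sample at a time and then discards it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(100)

# Batch learning: least-squares solution from the entire training set
w_batch, *_ = np.linalg.lstsq(X, y, rcond=None)

# Online learning: update the parameters one sample at a time, then discard the sample
w_online = np.zeros(3)
alpha = 0.05
for x_n, y_n in zip(X, y):                        # data arrives sequentially
    error = x_n @ w_online - y_n
    w_online -= alpha * error * x_n               # gradient of 1/2*(x_n^T w - y_n)^2
print(w_batch, w_online)
```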
27
Mathematical Foundations for Machine Learning
Linear Algebra Optimization Approximation Theory Probability Theory Random Signals and Systems
28
Optimization
• Optimization is fundamental in multiple science and engineering fields
• The main idea is to take the derivative and set it to zero (but it soon becomes more complicated than that…)
29
Optimization

Optimization problem:
$$\min_{x \in \mathbb{R}^n} F(x)$$
$$\text{subject to } c_i(x) = 0, \quad i = 1, 2, \ldots, m$$
$$\qquad\qquad\; c_{i'}(x) \ge 0, \quad i' = m+1, \ldots, m'$$
$F$ : objective function; $\{c_i\}$ : constraint functions

Some notation: $F : \mathbb{R}^n \to \mathbb{R}$

Gradient of $F$:
$$\nabla F = \left( \frac{\partial F}{\partial x_1}, \frac{\partial F}{\partial x_2}, \ldots, \frac{\partial F}{\partial x_n} \right)$$

Hessian matrix ($n \times n$ symmetric matrix):
$$\nabla^2 F = \begin{bmatrix} \dfrac{\partial^2 F}{\partial x_1^2} & \cdots & \dfrac{\partial^2 F}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 F}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 F}{\partial x_n^2} \end{bmatrix}, \qquad \left[\nabla^2 F\right]_{ij} = \frac{\partial^2 F}{\partial x_i \partial x_j}$$
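Purely as an illustration of the definitions above (not from the slides; the test function is arbitrary), the gradient and Hessian can be approximated numerically by finite differences:

```python
import numpy as np

def num_gradient(F, x, h=1e-6):
    """Central finite-difference approximation of the gradient of F at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (F(x + e) - F(x - e)) / (2 * h)
    return g

def num_hessian(F, x, h=1e-4):
    """Finite-difference approximation of the (symmetric) Hessian of F at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = h
        H[:, i] = (num_gradient(F, x + e) - num_gradient(F, x - e)) / (2 * h)
    return 0.5 * (H + H.T)                        # symmetrize

F = lambda x: x[0] ** 2 + 3 * x[0] * x[1] + 2 * x[1] ** 2
print(num_gradient(F, np.array([1.0, 1.0])))      # ~ [5, 7]
print(num_hessian(F, np.array([1.0, 1.0])))       # ~ [[2, 3], [3, 4]]
```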
30
Optimization
Necessary and sufficient conditions for unconstrained optimization
[Figure: $f(x)$ vs. $x$, showing a global minimum, a (strong) local minimum, and (weak) local minima]

Univariate case:
$$\min_{x \in \mathbb{R}} f(x)$$
Necessary conditions for $x^*$ to be a minimum:
$$f'(x^*) = 0, \qquad f''(x^*) \ge 0$$
Sufficient conditions for $x^*$ to be a minimum:
$$f'(x^*) = 0, \qquad f''(x^*) > 0$$
31
Optimization
Multivariate case:
$$\min_{x \in \mathbb{R}^n} F(x), \qquad x = (x_1, x_2, \ldots, x_n)$$
Necessary conditions for $x^*$ to be a minimum:
$$\nabla F(x^*) = 0, \qquad \nabla^2 F(x^*) \text{ is positive semi-definite}$$
Sufficient conditions for $x^*$ to be a minimum:
$$\nabla F(x^*) = 0, \qquad \nabla^2 F(x^*) \text{ is positive definite}$$
32
Optimization
When is a matrix A positive (semi)‐definite?
$A$ is positive-definite ($A > 0$) if $x^\top A x > 0$ for any non-zero $x$.
$A$ is positive-semidefinite ($A \ge 0$) if $x^\top A x \ge 0$ for all $x$.
Lemma: $A > 0 \iff \lambda_i(A) > 0$ for all $i = 1, \ldots, n$ (all the eigenvalues are positive).
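A small sketch (illustration only; the matrices are arbitrary examples): checking positive definiteness via the eigenvalues of a symmetric matrix, as in the lemma above.

```python
import numpy as np

def is_positive_definite(A, tol=1e-10):
    """A symmetric matrix is positive definite iff all its eigenvalues are positive."""
    eigvals = np.linalg.eigvalsh(A)               # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals > tol))

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])                       # eigenvalues 1 and 3 -> positive definite
B = np.array([[1.0, 2.0],
              [2.0, 1.0]])                        # eigenvalues -1 and 3 -> not positive definite
print(is_positive_definite(A), is_positive_definite(B))   # True False
```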
33
Optimization

Convexity: The set $S$ is convex if
$$\alpha x + (1 - \alpha) y \in S \quad \forall x, y \in S, \ \alpha \in [0, 1].$$
[Figure: a convex set vs. a set that is not convex]
A function $F : S \to \mathbb{R}$ is convex over $S$ if
$$F(\alpha x + (1 - \alpha) y) \le \alpha F(x) + (1 - \alpha) F(y) \quad \forall x, y \in S, \ \alpha \in [0, 1].$$
[Figure: the chord between $F(x)$ and $F(y)$ lies above the graph of a convex function]
Lemma: If $F \in C^1$ and $F$ is convex over $\mathbb{R}^n$, then $x^*$ is a global minimum of $F$ iff $\nabla F(x^*) = 0$.
34
Optimization

Example: Quadratic function
$$\min_{x \in \mathbb{R}^n} F(x) = c^\top x + \tfrac{1}{2} x^\top G x, \qquad c \in \mathbb{R}^n, \ G \in \mathbb{R}^{n \times n} \text{ symmetric}$$
$$\nabla F(x) = c + G x, \qquad \nabla F(x^*) = 0 \ \Rightarrow \ G x^* = -c$$
$$\nabla^2 F(x) = G, \qquad x^* \text{ is a minimum of } F \text{ if } G > 0$$
Note: $F(x)$ is convex $\iff G \ge 0$
Warning: not all functions that we want to minimize are smooth
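An illustrative sketch (not from the slides; the particular G and c are arbitrary): for the quadratic function the minimizer solves Gx* = -c, which can be computed directly when G is positive definite.

```python
import numpy as np

G = np.array([[4.0, 1.0],
              [1.0, 3.0]])                        # symmetric, positive definite
c = np.array([1.0, -2.0])

F = lambda x: c @ x + 0.5 * x @ G @ x             # F(x) = c^T x + 1/2 x^T G x

x_star = np.linalg.solve(G, -c)                   # gradient c + G x = 0  =>  G x* = -c
print("minimizer:", x_star, " F(x*):", F(x_star))
```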
35
Optimization

Contour plots: curves along which $F(x)$ is constant, for different values of the constant.
• $F(x_1, x_2) = x_1^2 + x_2^2$
• $F(x_1, x_2) = e^{x_1}\left(4x_1^2 + 2x_2^2 + 4x_1x_2 + 2x_2 + 1\right)$
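A sketch of how such contour plots could be generated (the matplotlib usage and plot ranges are my own illustration, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

x1, x2 = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))

F1 = x1**2 + x2**2
F2 = np.exp(x1) * (4*x1**2 + 2*x2**2 + 4*x1*x2 + 2*x2 + 1)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].contour(x1, x2, F1, levels=10)            # circular contours centred at the origin
axes[1].contour(x1, x2, F2, levels=20)            # curved, elongated contours
plt.show()
```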
36
Optimization

Linear Programming: (Simplex method)
Standard form:
$$\min_{x \in \mathbb{R}^n} c^\top x \quad \text{subject to } Ax = b, \ x \ge 0, \qquad A, b, c \text{ are given.}$$

Unconstrained Minimization: (Steepest descent; Newton's method)
$$\min_{x \in \mathbb{R}^n} F(x)$$

Constrained Minimization: (Gradient projection; Penalty methods; Lagrange methods)
$$\min_{x \in \mathbb{R}^n} F(x) \quad \text{subject to } c(x) = 0, \ d(x) \ge 0$$
37
Optimization
Optimization Algorithms
What are the performance criteria?
• Convergence
• Rate of convergence
• Complexity
38
Optimization
Algorithms for Unconstrained Minimization
(A) Steepest Descent Method (Gradient Method)
$$\min_{x \in \mathbb{R}^n} F(x)$$
Start from an initial guess $x^0$ and generate a sequence of vectors $x^1, x^2, \ldots, x^k, \ldots$ according to
$$x^{k+1} = x^k - \alpha_k \nabla F(x^k), \qquad \alpha_k > 0$$
$\alpha_k$ is determined by a line search so that $x^{k+1}$ minimizes $F(x)$ in the direction $d_k = -\nabla F(x^k)$ from $x^k$.
39
Optimization
Simplest gradient descent:
$$x^{k+1} = x^k - \alpha \nabla F(x^k), \qquad \alpha > 0 \text{ constant}$$
Trade-offs in choosing $\alpha$:
• speed
• convergence
• complexity
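A minimal sketch of the constant-step-size version (my own illustration; the quadratic test function, the step size α, and the stopping rule are arbitrary choices):

```python
import numpy as np

def gradient_descent(grad_F, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """x_{k+1} = x_k - alpha * grad F(x_k) with a constant step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_F(x)
        if np.linalg.norm(g) < tol:               # stop when the gradient is (almost) zero
            break
        x = x - alpha * g
    return x

# example: F(x) = c^T x + 1/2 x^T G x, so grad F(x) = c + G x
G = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([-1.0, 1.0])
x_min = gradient_descent(lambda x: c + G @ x, x0=[0.0, 0.0])
print(x_min, np.linalg.solve(G, -c))              # both should agree
```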
40
Optimization
Intuition for Steepest Descent
[Figure: 1-dimensional case ($x \in \mathbb{R}$): iterates $x^0, x^1, \ldots$ moving downhill along $F(x)$; 2-dimensional case ($x \in \mathbb{R}^2$): iterates crossing the contour lines of $F(x_1, x_2)$]
41
Optimization
Example: $F(x) = \tfrac{1}{2} x^2$, $x \in \mathbb{R}$
$$\nabla F(x) = x, \qquad x^{k+1} = x^k - \alpha_k x^k = (1 - \alpha_k)\, x^k \ \to\ 0$$

Example: $F(x_1, x_2) = e^{x_1}\left(4x_1^2 + 2x_2^2 + 4x_1x_2 + 2x_2 + 1\right)$
$$\nabla F(x_1, x_2) = \begin{bmatrix} e^{x_1}\left(4x_1^2 + 2x_2^2 + 4x_1x_2 + 8x_1 + 6x_2 + 1\right) \\ e^{x_1}\left(4x_1 + 4x_2 + 2\right) \end{bmatrix}$$
$$x^{k+1} = x^k - \alpha_k \nabla F(x^k)$$
42
Optimization

In continuous-time:
$$\frac{x^{k+1} - x^k}{\alpha_k} = -\nabla F(x^k) \quad \longrightarrow \quad \dot{x} = -\nabla F(x), \quad x(0) = x^0, \ \text{ as } \alpha_k \to 0$$
After rescaling:
$$\dot{x} = -\Gamma\, \nabla F(x), \qquad \Gamma \text{ is symmetric and positive definite}$$
Example: $\min_{x \in \mathbb{R}^n} F(x) = \tfrac{1}{2} x^\top G x$
$$\nabla F(x) = G x, \qquad \dot{x} = -G x, \ x(0) = x^0 \quad \Rightarrow \quad x(t) = e^{-Gt} x^0$$
43
Optimization

(B) Newton's Method: $\min_{x \in \mathbb{R}^n} F(x)$ ($F$ is convex)
$$\nabla F(x^*) = 0 \quad \text{(necessary condition)}$$
First-order expansion of the gradient around $x^k$:
$$\nabla F(x) \approx \nabla F(x^k) + \nabla^2 F(x^k)\,(x - x^k)$$
Let $H(x) := \nabla^2 F(x)$ (Hessian). We want
$$\nabla F(x^{k+1}) \approx \nabla F(x^k) + H(x^k)\,(x^{k+1} - x^k) = 0$$
$$\Rightarrow \quad x^{k+1} = x^k - H(x^k)^{-1}\, \nabla F(x^k) \qquad \text{(assuming } H(x^k)^{-1} \text{ exists)}$$
44
Optimization
In continuous-time:
$$\dot{x} = -H(x)^{-1}\, \nabla F(x), \qquad x(0) = x^0$$
Example: $F(x) = \tfrac{1}{2} x^2$
$$\nabla F(x) = x, \qquad H(x) = 1, \qquad H(x)^{-1} = 1$$
$$x^{k+1} = x^k - H(x^k)^{-1}\, \nabla F(x^k) = x^k - x^k = 0$$
Starting from any $x^0$, we get $x^1 = 0 = x^*$.
Convergence in one step!
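A tiny sketch of this one-step behaviour (illustration only; the starting value is arbitrary): for F(x) = x²/2 the Newton update reaches the minimizer from any starting point.

```python
def newton_step_1d(x, grad, hess):
    """One Newton update in the scalar case: x - F'(x) / F''(x)."""
    return x - grad(x) / hess(x)

# F(x) = 0.5 * x^2  =>  F'(x) = x, F''(x) = 1
x0 = 7.3
x1 = newton_step_1d(x0, grad=lambda x: x, hess=lambda x: 1.0)
print(x1)   # 0.0 -- the minimizer, reached in a single step
```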
45
Optimization
Example: $F(x_1, x_2) = e^{x_1}\left(4x_1^2 + 2x_2^2 + 4x_1x_2 + 2x_2 + 1\right)$
$$\nabla F(x_1, x_2) = \begin{bmatrix} e^{x_1}\left(4x_1^2 + 2x_2^2 + 4x_1x_2 + 8x_1 + 6x_2 + 1\right) \\ e^{x_1}\left(4x_1 + 4x_2 + 2\right) \end{bmatrix}$$
$$H(x) = \nabla^2 F(x) = e^{x_1} \begin{bmatrix} 4x_1^2 + 2x_2^2 + 4x_1x_2 + 16x_1 + 10x_2 + 9 & 4x_1 + 4x_2 + 6 \\ 4x_1 + 4x_2 + 6 & 4 \end{bmatrix}$$
$$H(x)^{-1} = \frac{e^{-x_1}}{8\left(2x_1 - x_2 - 2x_1x_2 - x_2^2\right)} \begin{bmatrix} 4 & -(4x_1 + 4x_2 + 6) \\ -(4x_1 + 4x_2 + 6) & 4x_1^2 + 2x_2^2 + 4x_1x_2 + 16x_1 + 10x_2 + 9 \end{bmatrix}$$
$$x^{k+1} = x^k - H(x^k)^{-1}\, \nabla F(x^k)$$
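A sketch implementing these formulas (my own illustration; the starting point and stopping rule are arbitrary choices): Newton's method applied to F(x1, x2) = e^{x1}(4x1² + 2x2² + 4x1x2 + 2x2 + 1), using the gradient and Hessian derived above.

```python
import numpy as np

def F(x):
    x1, x2 = x
    return np.exp(x1) * (4*x1**2 + 2*x2**2 + 4*x1*x2 + 2*x2 + 1)

def grad_F(x):
    x1, x2 = x
    return np.exp(x1) * np.array([
        4*x1**2 + 2*x2**2 + 4*x1*x2 + 8*x1 + 6*x2 + 1,
        4*x1 + 4*x2 + 2,
    ])

def hess_F(x):
    x1, x2 = x
    return np.exp(x1) * np.array([
        [4*x1**2 + 2*x2**2 + 4*x1*x2 + 16*x1 + 10*x2 + 9, 4*x1 + 4*x2 + 6],
        [4*x1 + 4*x2 + 6,                                  4.0],
    ])

# starting point chosen reasonably close to the minimizer,
# since the pure Newton iteration is only locally convergent
x = np.array([1.0, -1.0])
for _ in range(20):
    step = np.linalg.solve(hess_F(x), grad_F(x))  # H(x_k)^{-1} grad F(x_k)
    x = x - step
    if np.linalg.norm(step) < 1e-10:
        break
print("x* =", x, " F(x*) =", F(x))
```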