
Some Take-Home Messages (THM) about ML... Data Science Meetup

Gianluca Bontempi
Interuniversity Institute of Bioinformatics in Brussels, (IB)2
Machine Learning Group, Computer Science Department, ULB
mlg.ulb.ac.be, ibsquare.be

May 20, 2016

Introducing myself

1992: Computer science engineer (Politecnico di Milano, Italy),

1994: Researcher in robotics in IRST, Trento, Italy,

1995: Researcher in IRIDIA, ULB, Brussels,

1996-97: Researcher in IDSIA, Lugano, Switzerland,

1998-2000: Marie Curie fellowship in IRIDIA, ULB,

2000-2001: Scientist in Philips Research, Eindhoven, The Netherlands,

2001-2002: Scientist in IMEC, Microelectronics Institute, Leuven, Belgium,

since 2002: professor in Machine Learning, Modeling and Simulation, and Bioinformatics in the ULB Computer Science Dept.,

since 2004: head of the ULB Machine Learning Group (MLG).

since 2013: director of the Interuniversity Institute of Bioinformatics in Brussels (IB)2, ibsquare.be.

What is machine learning?

Machine learning is that domain of computational intelligence which is concerned with the question of how to construct computer programs that automatically improve with experience. (Mitchell, 1997)

Reductionist attitude: ML is just a buzzword which equates to statistics plus marketing.

Positive attitude: ML paved the way to the treatment of real problems related to data analysis, sometimes overlooked by statisticians (nonlinearity, classification, pattern recognition, missing variables, adaptivity, optimization, massive datasets, data management, causality, representation of knowledge, parallelisation).

Interdisciplinary attitude: ML should have its roots in statistics and complement it by focusing on algorithmic issues, computational efficiency, and data engineering.

Prediction is pervasive ...

Predict

whether you will like a book/movie (collaborative filtering)

credit applicants as low, medium, or high risk.

which home telephone lines are used for Internet access.

which customers are likely to stop being customers (churn).

the value of a piece of real estate

which telephone subscribers will order a 4G service

which CARREFOUR clients will be more interested in a discount on Italian products.

the probability that a company is employing undeclared ("black") workers (anti-fraud detection)

the survival risk of a patient on the basis of a genetic signature

the probability of a crime in an urban area.

the key of a cryptographic algorithm on the basis of power consumption

Supervised learning

First assumption: learning is essentially about prediction!

Second assumption: reality is stochastic; dependency and uncertainty are well described by conditional probability.

[Diagram: supervised learning. A model is fitted on a training dataset of inputs and outputs, and its prediction is compared with the target to compute the error.]

measurable features (inputs)

measurable target variables (outputs) and accuracy criteria

data (in God we trust; all others must bring data)

THM1: formalizing a problem as a prediction problem is often the most important contribution of a data scientist!
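As a toy illustration of this formalization, here is a minimal sketch (assuming Python with NumPy and scikit-learn; the data and the churn-like target are invented for the example, not taken from the slides):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical churn-like problem: each row describes a customer (inputs),
    # the target says whether the customer leaves (output).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))            # measurable features (inputs)
    y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

    # Part of the data is used to fit the model, the rest to assess its accuracy.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    print("held-out accuracy:", model.score(X_te, y_te))

Once a business question is cast in this input/output form, the whole supervised learning toolbox becomes applicable.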

It is all about ...

1 Probabilistic modeling: it formalizes uncertainty and dependency (the regression function); notions of entropy and information; relevant and irrelevant features (e.g. the Markov blanket notion); Bayesian networks and causal reasoning.

2 Estimation: bias/variance notions; generalization issues (underfitting vs. overfitting); Bayesian, frequentist and decision-theoretic views; validation; combination/averaging of estimators (bagging, boosting).

3 Optimization: maximum likelihood, least squares, backpropagation; dual problems (SVM); L1 and L2 norms (lasso); see the sketch after this list.

4 Computer science: implementation and algorithms; parallelism and scalability; data management.
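To make the optimization item concrete, a small sketch (assuming Python with NumPy; the synthetic data and the penalty value are arbitrary choices) of least squares and its L2-penalized variant, ridge regression, the L2 counterpart of the lasso:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    beta_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
    y = X @ beta_true + rng.normal(scale=0.3, size=100)

    # Ordinary least squares: minimize ||y - X b||^2 (the maximum-likelihood
    # solution under Gaussian noise).
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Ridge adds an L2 penalty lam * ||b||^2; the L1 (lasso) penalty has no
    # closed form and needs an iterative solver instead.
    lam = 1.0
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

    print("OLS:  ", np.round(beta_ols, 2))
    print("ridge:", np.round(beta_ridge, 2))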

So ... how to teach machine learning?

Focus on ...

Formalism ?

Algorithms ?

Coding ?

Applications ?

Of course all of this is important, but what is the essence, what is common to the exploding number of algorithms, techniques, and fancy applications?

Estimation

[Diagram: a stochastic phenomenon generates several distinct datasets; the same learner applied to each dataset returns a different model and a different prediction.]

THM2: a predictor is an estimator, i.e. an algorithm (a black box) which takes data and returns a prediction.

THM3: reality is stochastic, so data are stochastic and predictions are stochastic.
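A small simulation illustrating THM2 and THM3 (a Python/NumPy sketch with an invented data-generating process): the same learner applied to different samples of the same stochastic phenomenon returns different predictions.

    import numpy as np

    rng = np.random.default_rng(2)
    x_query = 0.8                                  # point where a prediction is wanted

    preds = []
    for _ in range(5):                             # five independent training sets
        x = rng.uniform(-1, 1, size=30)
        y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=30)   # stochastic "reality"
        coef = np.polyfit(x, y, deg=3)             # the learner: cubic least squares
        preds.append(np.polyval(coef, x_query))    # the returned prediction

    # The five predictions differ: the predictor is itself a random quantity.
    print(np.round(preds, 3))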

Assessing in an uncertain world (Baggio, 1998)

Don't be afraid of missing a penalty kick; it is not from details like these that a player is judged (De Gregori, 1982).

Assessing a learner

The goal of learning is to find a model which is able to generalize, i.e. able to return good predictions in contexts with the same distribution but independent of the training set.

How to estimate the quality of a model?

It is always possible to find models with such a complicated structure that they have zero training error. Are these models good?

Typically NOT, since doing very well on the training set could mean doing badly on new data.

This is the phenomenon of overfitting.

THM4: learning is challenging since data have to be used 1) for creating prediction models and 2) for assessing them.
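A minimal overfitting sketch (assuming Python/NumPy; the sinusoidal reality, the sample sizes and the polynomial degrees are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    x_tr = rng.uniform(-1, 1, 20)
    y_tr = np.sin(np.pi * x_tr) + rng.normal(scale=0.2, size=20)
    x_te = rng.uniform(-1, 1, 200)
    y_te = np.sin(np.pi * x_te) + rng.normal(scale=0.2, size=200)

    for deg in (1, 3, 10):
        coef = np.polyfit(x_tr, y_tr, deg)
        tr_err = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
        te_err = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
        print("degree", deg, "training MSE", round(tr_err, 3), "test MSE", round(te_err, 3))

    # The most flexible fit has (almost) no training error but a larger test error:
    # it has learned the noise of the training set, which is exactly overfitting.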

Bias and variance of a model

Estimation theory: the mean squared error (a measure of the generalization quality) can be written as

MSE = σ_w² + squared bias + variance

where

the noise σ_w² concerns the reality alone,

the bias reflects the relation between reality and the learning algorithm,

the variance concerns the learning algorithm alone.

This is purely theoretical, since these quantities cannot be measured on real data...

... but it is useful to understand why and in which circumstances learners work.
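They can, however, be estimated by Monte Carlo simulation when the data-generating process is known, as in the following sketch (Python/NumPy; the sinusoidal reality and the quadratic learner are assumptions made for the example):

    import numpy as np

    rng = np.random.default_rng(4)
    f = lambda x: np.sin(np.pi * x)          # the "reality", known only in simulation
    sigma_w = 0.2                            # noise standard deviation
    x0 = 0.8                                 # query point

    preds = []
    for _ in range(2000):                    # repeat learning on independent datasets
        x = rng.uniform(-1, 1, 30)
        y = f(x) + rng.normal(scale=sigma_w, size=30)
        coef = np.polyfit(x, y, deg=2)       # the learner: quadratic least squares
        preds.append(np.polyval(coef, x0))

    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print("noise:", sigma_w ** 2, "bias^2:", round(bias2, 4), "variance:", round(variance, 4))
    print("MSE at x0 (approx):", round(sigma_w ** 2 + bias2 + variance, 4))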

The bias/variance dilemma

Noise is all that cannot be learned from data

Bias measures the lack of representational power of the class of hypotheses.

Too simple model ⇒ large bias ⇒ underfitting

Variance warns us against an excessive complexity of the approximator.

Too complex model ⇒ large variance ⇒ overfitting

A neural network is less biased than a linear model but inevitably has more variance.

Averaging (e.g. bagging, boosting, random forests) is a good cure for variance.
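A sketch of the "averaging cures variance" claim (assuming Python with scikit-learn; the dataset and the number of trees are illustrative choices):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import BaggingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(5)
    X = rng.uniform(-1, 1, size=(300, 1))
    y = np.sin(np.pi * X[:, 0]) + rng.normal(scale=0.3, size=300)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    # A fully grown tree has low bias but high variance ...
    tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
    # ... averaging many trees fitted on bootstrap samples reduces that variance.
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                           random_state=0).fit(X_tr, y_tr)

    print("single tree test MSE :", round(float(np.mean((y_te - tree.predict(X_te)) ** 2)), 3))
    print("bagged trees test MSE:", round(float(np.mean((y_te - bag.predict(X_te)) ** 2)), 3))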

Bias/variance trade-off

[Plot: generalization error as a function of model complexity; the bias term decreases and the variance term increases with complexity, with underfitting on the left and overfitting on the right.]

THM5: think in terms of the bias/variance trade-off. Think of your preferred learning algorithm and discover how bias and variance are managed.

Ockham’s Razor (1825)

THM6: "Pluralitas non est ponenda sine necessitate", i.e. one should not increase, beyond what is necessary, the number of entities required to explain anything.

This is the medieval rule of parsimony, or principle of economy, known as Ockham’s razor.

In other terms, the principle states that one should not make more assumptions than the minimum needed.

It underlies all scientific modeling and theory building. It admonishes us to choose from a set of otherwise equivalent models the simplest one.

Be simple: "shave off" those concepts, variables or constructs that are not really needed to explain the phenomenon.

Does the best exist?

Given a finite number of samples, are there any reasons to prefer one learning algorithm over another?

If we make no assumption about the nature of the learning task, can we expect any learning method to be superior or inferior overall?

Can we even find an algorithm that is overall superior to (or inferior to) random guessing?

The No Free Lunch Theorem answers NO to these questions.

No Free Lunch theorem

If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one learning method over another.

If one algorithm seems to outperform another in a particular situation, it is a consequence of its fit to the particular pattern recognition problem, not the general superiority of the algorithm.

The theorem also justifies skepticism about studies that demonstrate the overall superiority of a particular learning or recognition algorithm.

If a learning method performs well over some set of problems, then it must perform worse than average elsewhere. No method can perform well throughout the full set of functions.

THM7: Every learning algorithm makes assumptions (most of the time in an implicit manner), and these assumptions make the difference.

Conclusion

Popper claimed that, if a theory is falsifiable (i.e. it can be contradicted by an observation or the outcome of a physical experiment), then it is scientific. Since prediction is the most falsifiable aspect of science, it is also the most scientific one.

Effective machine learning is an extension of statistics, in no way an alternative.

Simplest (i.e. linear) model first.

Modelling is more an art than an automatic process... therefore experienced data analysts are more valuable than expensive tools.

Expert knowledge matters... and so does data.

Understanding what is predictable is as important as trying to predict it.

All models are wrong, some of them are useful.

All that we did not discuss...

Dimensionality reduction and feature selection

Causal inference

Unsupervised learning

Active learning

Spatio-temporal prediction

Nonstationary problems

Scalable machine learning

Control and robotics

Libraries and platforms (R, Python, Weka)

Resources

A biased list... :-)

Scoop.it on machine learning: www.scoop.it/t/machine-learning-by-gianluca-bontempi

Scoop.it on probabilistic reasoning, causal inference and statistics: www.scoop.it/t/probabilistic-reasoning-and-statistics

MLG mlg.ulb.ac.be

MA course INFO-F-422 Statistical foundations of machine learning

Handbook available at https://www.otexts.org