Current Trends in Machine Learning and Data Mining

Page 1: Current Trends in Machine Learning and Data Mining

Current Trends in Machine Learning and Data Mining

Dr Marcus Gallagher
School of Information Technology and Electrical Engineering
University of Queensland 4072 Australia
[email protected]

Page 2: Current Trends in Machine Learning and Data Mining

Australian Virtual Observatory Workshop 2003

Talk Overview

• Machine Learning
• Data mining
  • Where machine learning fits into data mining
• Scientific Data Mining
  • Machine learning and the scientific process
  • Current topics & future directions
• Summary

Page 3: Current Trends in Machine Learning and Data Mining

Machine Learning

• “…the study of computer algorithms capable of learning to improve their performance on a task on the basis of their own experience.”
  • Often this is “learning from data”.
• A sub-discipline of artificial intelligence, with large overlaps into statistics, pattern recognition, visualization, robotics, control, …

Page 4: Current Trends in Machine Learning and Data Mining

Data Mining

• “…the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”

Page 5: Current Trends in Machine Learning and Data Mining

Data Mining

• Modern science is driven by data analysis like never before. We have an ability to collect and process data that is increasing exponentially!

Page 6: Current Trends in Machine Learning and Data Mining

Data mining: increase in CPU performance (1/execution time)

Page 7: Current Trends in Machine Learning and Data Mining

Data mining: trends in memory capacity and cost

Page 8: Current Trends in Machine Learning and Data Mining

Data mining

• General trends:
  • Increase in the size of datasets (in terms of observations and dimensionality).
    • What to do when #dimensions > #observations?
  • Interest in data mining for non-numerical data.
• “We are drowning in information and starving for knowledge.” (Rutherford D. Rogers)

Page 9: Current Trends in Machine Learning and Data Mining

Data Mining

Data collection

Define problem

Data preparation

Data modelling

Interpretation/Evaluation

Implement/Deploy model

Page 10: Current Trends in Machine Learning and Data Mining

Data Mining

Data collection

Define problem

Data preparation

Data modelling

Interpretation/Evaluation

Implement/Deploy model

Machine Learning

Page 11: Current Trends in Machine Learning and Data Mining

Data modelling and the scientific process

Data modelling plays an important role at several stages in the scientific process:

1. Observe and explore interesting phenomena.
2. Generate hypotheses.
3. Formulate model to explain phenomena.
4. Test predictions made by the theory.
5. Modify theory and repeat (at 2 or 3).

The explosion of data suggests that we need to (partially) automate numerous aspects of the scientific process.

Page 12: Current Trends in Machine Learning and Data Mining

1. Observe and explore interesting phenomena

• Problems here are typically referred to as unsupervised learning in the ML community:
  • Given a set of $d$-dimensional data vectors $(X_1, \dots, X_n)$, $X_i = (x_1, \dots, x_d)$, build a model of the data to infer properties of the underlying distribution (process) that generated the data.
• Key problems:
  • Dimensionality reduction: developing algorithms that can reduce a dataset of hundreds or thousands of dimensions to just a few for visualization, while retaining as much of the “information” as possible in the original dataset.
  • Clustering of data, outlier detection: identifying trends and/or anomalies in datasets.
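
As a rough illustration of dimensionality reduction, the sketch below projects 3-dimensional vectors onto a single direction. It uses principal component analysis computed by power iteration as one simple stand-in; the slide does not name a specific method at this point. The synthetic data, the function names and the iteration count are all assumptions made for the example.

```python
import math
import random

def first_principal_component(data, iterations=200):
    """Return (mean, unit direction) of the leading principal component."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - mean[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(data, mean, direction):
    """Reduce each d-dimensional vector to a single coordinate."""
    return [sum((row[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for row in data]

rng = random.Random(0)
# Hypothetical 3-D data that mostly varies along the direction (1, 2, 0).
ts = [rng.gauss(0.0, 3.0) for _ in range(200)]
data = [[t + rng.gauss(0, 0.1), 2 * t + rng.gauss(0, 0.1), rng.gauss(0, 0.1)]
        for t in ts]
mean, pc1 = first_principal_component(data)
coords = project(data, mean, pc1)  # one number per observation
```

Here `coords` holds a 1-dimensional summary of each observation that retains most of the variance in this toy dataset.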

Page 13: Current Trends in Machine Learning and Data Mining

1. Observe and explore interesting phenomena

• Current ML approaches:
  • Independent Component Analysis: decompose multivariate data with the aim of producing components that are as “statistically independent” as possible.
    • Related to PCA and factor analysis.
  • Gaussian mixture models for clustering: uses a semi-parametric probability density estimator that is trained iteratively on data.
    • Implements a “soft” version of k-means.
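
To make the "soft k-means" remark concrete, here is a minimal sketch of the expectation-maximisation (EM) loop for a one-dimensional Gaussian mixture. The two-cluster synthetic data, the starting means and the fixed iteration count are assumptions for illustration, not details from the talk.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, initial_means, iterations=50):
    """EM for a 1-D Gaussian mixture, starting from the given means."""
    k = len(initial_means)
    mus = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: soft assignment (responsibility) of each point to each component.
        resp = []
        for x in xs:
            p = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate each component from its softly assigned points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsing component
    return weights, mus, variances

rng = random.Random(1)
# Hypothetical data: two clusters centred at 0 and 5.
xs = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(5.0, 1.0) for _ in range(150)]
weights, mus, variances = fit_gmm_1d(xs, initial_means=[-1.0, 6.0])
```

Where k-means assigns each point to exactly one centre, the E-step above gives every point a fractional membership in every component, which is the sense in which the mixture model is "soft".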

Page 14: Current Trends in Machine Learning and Data Mining

1. Observe and explore interesting phenomena

• Self-organizing maps and topographic mapping: similar to clustering but where the cluster-centres are constrained to lie in a low-dimensional manifold (and so have a spatial relationship).
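
The following is a small sketch of a self-organizing map in which the cluster centres form a one-dimensional chain laid over two-dimensional data. The learning-rate and neighbourhood schedules, the node count and the uniform synthetic data are assumptions chosen to keep the example short.

```python
import math
import random

def train_som(points, n_nodes=8, epochs=20, seed=0):
    """Fit a 1-D chain of nodes to the points; returns the node positions."""
    rng = random.Random(seed)
    dim = len(points[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    total = epochs * len(points)
    step = 0
    for _ in range(epochs):
        for p in rng.sample(points, len(points)):  # shuffled pass over the data
            frac = step / total
            lr = 0.5 * (1.0 - frac)                           # decaying learning rate
            radius = max(n_nodes / 2 * (1.0 - frac), 0.5)     # shrinking neighbourhood
            # Best-matching unit: the node closest to the data point.
            best = min(range(n_nodes),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(nodes[i], p)))
            for i in range(n_nodes):
                # Nodes near the winner on the chain are pulled along with it.
                h = math.exp(-((i - best) ** 2) / (2 * radius ** 2))
                nodes[i] = [w + lr * h * (x - w) for w, x in zip(nodes[i], p)]
            step += 1
    return nodes

rng = random.Random(1)
points = [[rng.random(), rng.random()] for _ in range(200)]  # hypothetical 2-D data
nodes = train_som(points)
```

Because each update moves neighbouring nodes together, nodes that are adjacent on the chain tend to end up close together in data space, which gives the centres the spatial relationship mentioned above.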

Page 15: Current Trends in Machine Learning and Data Mining

3. Formulate model to explain phenomena

• Problems here are typically referred to as supervised learning in the ML community:
  • Given a training set of pairs of input and output data vectors $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$, where the input values are thought to have some influence on the corresponding output values, build a model of the data that can predict the outputs of unseen (test) inputs.
• Key problems:
  • Regression, classification, forecasting.
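
A minimal sketch of the supervised setting, assuming a one-dimensional regression problem: a straight line is fitted to training pairs by least squares and then evaluated on held-out test inputs. The synthetic data (true slope 2, intercept 1) and the 80/20 split are assumptions for the example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)
# Hypothetical data generated as y = 2x + 1 plus Gaussian noise.
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

# Fit on the training pairs only, then predict the unseen test inputs.
train_x, test_x = xs[:80], xs[80:]
train_y, test_y = ys[:80], ys[80:]
slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
test_mse = sum((p - y) ** 2 for p, y in zip(predictions, test_y)) / len(test_y)
```

The error on the held-out pairs (`test_mse`) is what estimates how well the model predicts outputs for inputs it has not seen.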

Page 16: Current Trends in Machine Learning and Data Mining

3. Formulate model to explain phenomena

• Established ML approaches:
  • Neural networks
  • Decision trees and rule-based classifiers
• Example applications in astronomy:
  • Star/galaxy classification on the basis of optical data.
  • Photometric redshift evaluation.
  • Noise identification and removal in gravitational wave detectors.

Page 17: Current Trends in Machine Learning and Data Mining

Current/New Machine Learning Models

• Current ML approaches:
  • Support vector machines: uses insights from computational learning theory and geometry to produce predictors with powerful discrimination and good generalization.
  • Ensemble methods: improving the accuracy of predictions by using multiple models and bootstrap sampling.
    • Example: boosting incrementally constructs an ensemble of “weak” models, where each model is forced to concentrate on the mistakes made by previous models.
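
The boosting idea can be sketched with AdaBoost over one-dimensional decision stumps. AdaBoost is used here as one well-known instance of boosting; the slide does not say which variant was meant. The ten-point toy dataset and the function names are assumptions for illustration.

```python
import math

def stump_predict(x, threshold, polarity):
    """A 'weak' model: predicts polarity at or above the threshold, else the opposite."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points gain weight, so the next stump concentrates on them.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical labels: no single threshold separates the two classes.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
fitted = [predict(ensemble, x) for x in xs]
```

No single stump can classify this dataset correctly, but the weighted vote of three stumps reproduces every training label.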

Page 18: Current Trends in Machine Learning and Data Mining

Current/New Machine Learning Models

• Gaussian Processes: use the machinery of Bayesian inference to model data using stochastic processes.
  • All information is represented as a probability distribution.
  • Incorporates uncertainty associated with prior information and predictions made.
• Probabilistic graphical models: the model is a probability distribution where dependencies are explicitly encoded.
  • Generative models.
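
As a sketch of how a Gaussian process attaches uncertainty to its predictions, the code below computes the posterior mean and variance of a zero-mean GP with a squared-exponential kernel. The kernel choice, length scale, noise level and the tiny sine-wave dataset are assumptions; the linear solver is a plain Gaussian elimination written out only to keep the example free of dependencies.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2 * length_scale ** 2))

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def gp_predict(train_x, train_y, x_new, noise=1e-6):
    """Posterior mean and variance at x_new under a zero-mean GP prior."""
    k = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(train_x)]
         for i, a in enumerate(train_x)]
    k_star = [rbf(a, x_new) for a in train_x]
    alpha = solve(k, train_y)   # K^{-1} y
    v = solve(k, k_star)        # K^{-1} k_*
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, var

train_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
train_y = [math.sin(x) for x in train_x]   # hypothetical observations
mean_near, var_near = gp_predict(train_x, train_y, 1.0)    # at an observed input
mean_far, var_far = gp_predict(train_x, train_y, 10.0)     # far from all data
```

At an observed input the predictive variance is close to zero, while far from the data it returns to the prior variance of 1, so the model reports how much it knows as well as what it predicts.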

Page 19: Current Trends in Machine Learning and Data Mining

Current/New Problems in Machine Learning

• Active learning:
  • Assume that data can be generated (measured or labelled) on demand; build a learning algorithm that learns on the basis of self-selected data points (queries).
  • Aims to reduce the amount of data required, training time of the model, or amount of data that must be manually labelled.
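
A toy sketch of active learning, assuming the task is to locate a decision threshold on a line: the learner always queries the pool point it is least certain about, which here is the midpoint of the interval still consistent with the labels seen so far. The `oracle` function, its threshold of 0.37 and the pool of 1001 candidate points are hypothetical stand-ins for an expensive measurement or a human labeller.

```python
def oracle(x, true_threshold=0.37):
    """Hypothetical labelling function standing in for an expensive measurement."""
    return 1 if x >= true_threshold else 0

def active_learn_threshold(pool, label, budget):
    """Learn a 1-D threshold by querying the most uncertain pool point each time.

    Assumes the smallest pool point has label 0 and the largest has label 1.
    """
    pool = sorted(pool)
    lo, hi = 0, len(pool) - 1   # indices bracketing the unknown boundary
    queries = 0
    while hi - lo > 1 and queries < budget:
        mid = (lo + hi) // 2     # point whose label is least predictable
        if label(pool[mid]) == 1:
            hi = mid
        else:
            lo = mid
        queries += 1
    return (pool[lo] + pool[hi]) / 2, queries

pool = [i / 1000 for i in range(1001)]
estimate, used = active_learn_threshold(pool, oracle, budget=20)
```

Labelling every pool point would take 1001 queries; choosing the queries adaptively locates the boundary to within the pool spacing in about ten.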

Page 20: Current Trends in Machine Learning and Data Mining

Current/New Problems in Machine Learning

• Semi-supervised learning:
  • Learning from both labelled (expensive, scarce) and unlabelled (abundant, cheap) data.
  • Aims are similar to active learning.
• Transductive learning:
  • All unlabelled data points belong to the test set.
  • The algorithm is able to take advantage of the spatial distribution of the data that the real world generates (and that it will be tested on).
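
One simple way to use unlabelled data is self-training, sketched below with a nearest-centroid classifier: the model labels the unlabelled points it is most confident about, adds them to the training set, and repeats. Self-training is only one of several semi-supervised strategies and is chosen here for brevity; the two seed labels and the synthetic clusters are assumptions.

```python
import random

def centroids_of(labelled):
    """Mean of the x values in each class (classes 0 and 1)."""
    result = {}
    for cls in (0, 1):
        members = [x for x, y in labelled if y == cls]
        result[cls] = sum(members) / len(members)
    return result

def self_train(labelled, unlabelled, rounds=5):
    """Grow the labelled set by pseudo-labelling the most confident pool points."""
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        c = centroids_of(labelled)
        # Most confident first: largest gap between the two centroid distances.
        pool.sort(key=lambda x: abs(abs(x - c[0]) - abs(x - c[1])), reverse=True)
        take = max(1, len(pool) // 2)
        confident, pool = pool[:take], pool[take:]
        for x in confident:
            label = 0 if abs(x - c[0]) <= abs(x - c[1]) else 1
            labelled.append((x, label))
    return centroids_of(labelled)

rng = random.Random(0)
# Hypothetical unlabelled data from two clusters centred at 0 and 5.
unlabelled = ([rng.gauss(0.0, 1.0) for _ in range(100)]
              + [rng.gauss(5.0, 1.0) for _ in range(100)])
seed_labels = [(-0.5, 0), (4.2, 1)]   # only two labelled examples
final = self_train(seed_labels, unlabelled)
```

With only two labelled points the initial centroids are poor estimates; after absorbing the unlabelled points they move close to the true cluster centres.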

Page 21: Current Trends in Machine Learning and Data Mining

Current/New Problems in Machine Learning

• Non-numerical (unreal!) data
  • Examples:
    • Strings: text, DNA, computer programs.
    • Trees: parse trees, XML, phylogenetic trees.
    • Graphs: organic molecules, Go board positions.
  • Embedding unreal data in continuous space and applying standard ML techniques has limitations.
  • Need to define appropriate kernels and distance metrics over these data spaces.
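
For strings, two common choices are sketched below: a k-spectrum kernel, which counts shared substrings of length k, and the Levenshtein edit distance. Both are offered as illustrative examples of a kernel and a distance metric over a non-numerical space; the talk does not single out either one.

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    """Inner product of the length-k substring count vectors of s and t."""
    counts_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    counts_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(counts_s[sub] * counts_t[sub] for sub in counts_s)

def edit_distance(s, t):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]

same = spectrum_kernel("GATTACA", "GATTACA")    # 6 distinct 2-mers, each once
shared = spectrum_kernel("GATTACA", "TACAGAT")  # 5 of those 2-mers are shared
distance = edit_distance("kitten", "sitting")   # 3
```

Once such a kernel or metric is defined, kernel-based and distance-based learners can operate on the strings directly, without first forcing them into a fixed-length numerical vector.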

Page 22: Current Trends in Machine Learning and Data Mining

Summary

• Automated data mining plays an increasingly important role in the scientific process.
• Machine Learning techniques are driven by the problems in data mining and provide some effective solutions.
• Machine Learning is a highly active area of research, with new and evolving algorithms.
• The scope of learning problems is widening to deal with important real-world data.

Page 23: Current Trends in Machine Learning and Data Mining

References

1. E. Mjolsness and D. DeCoste. Machine Learning for Science: State of the Art and Future Prospects. Science 293:2051-2055, 2001.
2. D. Hand et al. Principles of Data Mining. MIT Press, 2001.
3. T. Hastie et al. The Elements of Statistical Learning. Springer, 2001.
4. R. Tagliaferri et al. Neural networks in astronomy. Neural Networks 16:297-319, 2003.
5. S. Harmeling et al. Kernel-based nonlinear blind source separation. Neural Computation 15(5):1089-1124, 2003.
6. T. Graepel. Getting Real with Unreal Data: Lessons Learned and the Way Ahead. NIPS Workshop on Unreal Data, 2002.
7. S. J. Hong and S. Weiss. Advances in predictive models for data mining. Pattern Recognition Letters 22(1):55-61, 2001.