The Discipline and Future of Machine Learning
Tom M. Mitchell, E. Fredkin Professor and Department Head
March 2007
The Discipline of Machine Learning
The defining question: How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?
A process learns with respect to <T, P, E> if it improves its performance P at task T through experience E.
Machine Learning - Practice
Application areas: speech recognition, extracting facts from text, control learning
Methods: reinforcement learning, supervised learning, Bayesian networks, hidden Markov models, unsupervised clustering, explanation-based learning, ...
Machine Learning - Theory
PAC learning theory (for supervised concept learning) relates:
number of examples m
representational complexity |H|
error rate ε
failure probability δ
Other theories for: reinforcement skill learning, semi-supervised learning, active student querying
Also relating: number of mistakes during learning, convergence rate, asymptotic performance, bias, variance, VC dimension
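The quantities on this slide fit the standard bound for finite hypothesis classes, m >= (1/ε)(ln |H| + ln(1/δ)). A minimal sketch (function name and example numbers are ours, not the slide's):

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice so that,
    with probability >= 1 - delta, any h in H consistent with the sample
    has true error below epsilon (finite H, realizable case)."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# e.g. |H| = 2**10 hypotheses, 5% error rate, 1% failure probability
m = pac_sample_bound(2**10, epsilon=0.05, delta=0.01)   # 231 examples
```

Note the logarithmic dependence on |H| and 1/δ but linear dependence on 1/ε: tightening the error target is far more expensive than enlarging the hypothesis class.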
The Discipline of Machine Learning
Machine Learning: How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?
Computer Science: How can we build machines that solve problems, and which problems are inherently tractable/intractable?
Statistics:What can be learned from data with a set of modeling assumptions, while taking into account the data-collection process?
[Diagram: machine learning at the overlap of statistics and computer science, with links to animal learning (cognitive science, psychology, neuroscience), adaptive control theory and robotics, evolution, and economics]
ML and CS
Machine learning is already the preferred approach to:
speech recognition, natural language processing
computer vision
medical outcomes analysis
many robot control problems
The ML niche will grow. Why?
[Diagram: 'ML software' as a growing subset of 'all software']
ML and Empirical Sciences
Empirical science is a learning process, subject to automation and to study:
improve performance P (accuracy)
at task T (predict which gene knockouts will impact the aromatic AA pathway, and how)
with experience E (active experimentation)
"Functional genomic hypothesis generation and experimentation by a robot scientist," King et al., Nature, 427(6971), 247-252
Which protein ORFs influence which enzymes in the AAA pathway?
Our current state:
The problem of tabula-rasa function approximation is solved (in an 80-20 sense):
Given: a class of hypotheses H = {h : X → Y} and labeled examples {<x, f(x)>}
Determine: the h from H that best approximates f
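In miniature, "determine the h from H that best approximates f" is empirical risk minimization over H. A hedged sketch, using a hypothetical class of threshold classifiers as H:

```python
# Empirical risk minimization over a finite hypothesis class: return the
# h in H with the fewest mistakes on the labeled examples. The threshold
# classifiers below are hypothetical stand-ins for H = {h : X -> Y}.

def best_hypothesis(H, examples):
    def empirical_error(h):
        return sum(1 for x, y in examples if h(x) != y)
    return min(H, key=empirical_error)

# H: threshold classifiers "x >= t" for a few candidate thresholds t
H = [lambda x, t=t: x >= t for t in (0.0, 1.0, 2.0, 3.0)]
data = [(0.5, False), (1.5, True), (2.5, True)]
h_star = best_hypothesis(H, data)   # picks the t = 1.0 threshold
```

Real hypothesis classes are searched (gradient descent, tree growing) rather than enumerated, but the objective is the same.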
It's time to move on:
enrich the function-approximation problem definition
use function approximation as a building block
work on new problems
Some Current Research Questions
When/how can unlabeled data be useful in function approximation?
How can assumed sparsity of relevant features be exploited in high dimensional nonparametric learning?
How can information learned from one task be transferred to simplify learning another?
What algorithms can learn control strategies from delayed rewards and other inputs?
What are the best active learning strategies for different learning problems?
To what degree can one preserve data privacy while obtaining the benefits of data mining?
The Future of Machine Learning
A Quick Look Back (1960-2000)
Samuel's checker learner; perceptrons; theories of perceptron capacity and learnability; Winston's symbolic concept learner; version spaces; rule learning; decision tree learning; neural networks; explanation-based learning; PAC learning theory; architectures for learning and problem solving; theories of grammar induction; reinforcement learning; statistical perspective on learning; HMMs; SVMs; Bayes nets; dimensionality reduction; semi-supervised learning; non-parametric methods; large-scale data mining; speech applications; robot control; privacy-preserving data mining; transfer learning
Evolutionary and revolutionary changes. What might lead to the next revolution?
1. Use Machine Learning to help understand Human Learning (and vice versa)
Models of Learning Processes
Shared vocabulary across machine and human learning: number of examples, error rate, reinforcement learning, explanations, learning from examples, complexity of the learner's representation, probability of success, prior probabilities, loss functions
Machine Learning vs. Human Learning
Distinctively human: supervision via lectures, questions, and homeworks; attention and motivation; skills vs. principles; implicit vs. explicit learning; memory, retention, and forgetting; Hebbian learning and consolidation
Reinforcement Learning [Sutton and Barto 1981; Samuel 1957]
Observed: immediate reward. Learned: sum of future rewards.
Reinforcement Learning in ML
[Figure: four-state chain S0 → S1 → S2 → S3 with reward r = 100 at the goal, learned values V = 72, 81, 90, 100, discount γ = 0.9]
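The values on this slide follow from the one-step backup V(s) = γ · V(s') along the chain toward the r = 100 goal (the slide rounds 72.9 to 72). A quick check:

```python
# The slide's values follow from the one-step backup V(s) = gamma * V(s')
# along the chain toward the goal (reward 100, no intermediate rewards).
gamma = 0.9
values = [100.0]                  # value at the goal state S3
for _ in range(3):
    values.append(gamma * values[-1])
values.reverse()                  # V(S0)..V(S3): 72.9, 81.0, 90.0, 100.0
```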
Reinforcement Learning in ML
Variants of RL have been used for a variety of practical control learning problems:
temporal difference learning
Q learning
learning MDPs, POMDPs
Theoretical results too:
assured convergence to optimal V(s) under certain conditions
assured convergence for Q(s,a) under certain conditions
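As an illustration of the Q(s,a) convergence claim, a minimal tabular Q-learning sketch on a hypothetical four-state corridor (the reward layout and the parameters alpha, gamma, and episode count are our choices, not the slide's):

```python
import random

# Minimal tabular Q-learning sketch on a hypothetical four-state corridor:
# states 0..3, actions -1 (left) and +1 (right), reward 100 on reaching
# state 3, deterministic moves clamped to the corridor.

def q_learning(n_states=4, alpha=0.5, gamma=0.9, episodes=500, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in (-1, +1)}
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.choice((-1, +1))              # explore uniformly
            s2 = min(max(s + a, 0), goal)         # deterministic move
            r = 100.0 if s2 == goal else 0.0
            best_next = 0.0 if s2 == goal else max(Q[(s2, b)] for b in (-1, +1))
            # One-step Q-learning update toward r + gamma * max_a' Q(s', a')
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
```

Under these settings the greedy policy ends up pointing right in every state, and Q at the pre-goal state approaches the reward of 100, matching the discounted values above.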
Dopamine As Reward Signal [Schultz et al., Science, 1997]
RL Models for Human Learning [Seymour et al., Nature 2004]
Human and Machine Learning
Additional overlaps:
Learning of perceptual representations
Dimensionality reduction methods; low-level percepts
Lewicki et al.: optimal sparse codes of natural scenes yield the Gabor filters found in primate visual cortex; similar result for auditory cortex
Learning with redundant sensory input
Co-training methods; sensory redundancy hypothesis in development
de Sa & Ballard; Coen: co-clustering voice/video yields phonemes
Mitchell & Perfetti: co-training in second-language learning
Learning and explanations
Explanation-based learning; teaching concepts & skills; chunking
VanLehn et al.: explanation-based learning accounts for some human learning behaviors
Chi: students learn best when forced to explain
Newell; Anderson: chunking/knowledge-compilation models
2. Never-ending learning
Never-Ending Learning
Current machine learning systems:
learn one function
are shut down after they learn it
start from scratch when programmed to learn the next function
Let's study and construct learning processes that:
learn many different things
formulate their own next learning task
use what they have already learned to help learn the next thing
Example: never-ending learning robot
Imagine a robot with three goals: (1) avoid collisions, (2) recharge when battery low, and (3) find and collect trash.
What is stopping us from giving it some trash examples, then letting it learn for a year?
What must it start with to formulate and solve relevant learning subtasks?
Learn to recognize trash in a scene
Learn where to search for trash, and when
Learn how close to get to find out whether trash is there
Learn to manipulate trash
Transfer what it learned about paper trash to help with bottle trash
Discover relevant subcategories of trash (e.g., plastic versus glass bottles), and of other objects in the environment
Core Questions for a Never-Ending Learning Agent
What function or fact to learn next?
Self-reflection on performance, credit assignment
What representation for this target function or fact?
Choice of input-output representation for the target function
E.g., classify whether it's trash
How to obtain (which type of) training experience?
Primarily self-supervised, but occasional teacher input
E.g., classify whether it's trash
Guided by what prior knowledge?
Transfer learning, but transfer between what?
Does X → PaperTrash help learn X → PlasticTrash?
Does State(t) × Action(t) → State(t+1) help learn X → PlasticTrash?
Example: never-ending language learner
Read the Web project: create a 24x7 web agent that each day:
extracts more facts from the web into a structured database
learns to extract facts better than yesterday
Starting point:
an ontology of hundreds of categories and relations, and 6-10 training examples of each
a never-ending learning architecture
state-of-the-art language processing primitives
learning mechanisms
Top-level task: populate a database of these categories and relations by reading the web, and improve continually
[Carlson, Cohen, Fahlman, Hong, Nyberg, Wang, ...]
Q: how can it obtain useful training experience (i.e., self-supervise)?A: redundancy
Bootstrapping: learning to extract named entities
Example sentence: "I arrived in Pittsburgh on Saturday." Is "Pittsburgh" a location? Two views of the example:
x1 (context): "I arrived in _________ on Saturday."
x2 (candidate entity): "Pittsburgh"
Bootstrap learning to extract named entities [Riloff and Jones, 1999], [Collins and Singer, 1999], ...
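A toy sketch of the bootstrapping idea: alternate between using known entities to collect extraction contexts and using those contexts to harvest new entities. The corpus and seed below are tiny hypothetical stand-ins, not the cited papers' data:

```python
# Bootstrapping sketch: seed entities -> contexts -> new entities -> ...
# Each corpus item is a (context, entity) pair; all values are made up.
corpus = [
    ("I arrived in", "Pittsburgh"),
    ("I arrived in", "Boston"),
    ("She flew to", "Boston"),
    ("She flew to", "Tokyo"),
]
entities = {"Pittsburgh"}   # seed set of known locations
contexts = set()

for _ in range(3):          # a few bootstrapping rounds
    # (a) contexts that surround any known entity
    contexts |= {c for c, e in corpus if e in entities}
    # (b) entities that appear in any known context
    entities |= {e for c, e in corpus if c in contexts}
```

Starting from "Pittsburgh" alone, "Boston" is reached through the shared context "I arrived in", and "Tokyo" through "She flew to". Real systems score candidate contexts and entities each round to limit semantic drift; this sketch accepts everything.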
Co-Training
Each example has two views, e.g. in "I flew to New York today.": Classifier1 sees the context "I flew to ____ today"; Classifier2 sees the phrase "New York".
Idea: train Classifier1 and Classifier2 to:
1. correctly classify labeled examples
2. agree on the classification of unlabeled examples
Co-Training Theory [Blum & Mitchell 98; Dasgupta 04, ...]
Final accuracy depends on: number of labeled examples, number of unlabeled examples, number of redundant inputs, conditional dependence among the inputs
We want inputs that are less dependent and more numerous (redundant); then disagreement over unlabeled examples can bound the true error.
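A minimal co-training loop in the spirit of this theory: two classifiers, one per view, each labeling unlabeled examples it is sure about and feeding them to a shared pool. The memorizing "classifiers" and tiny dataset are hypothetical stand-ins, not Blum & Mitchell's actual learners:

```python
# Each example: (context view, entity view[, label]). All data is made up.
labeled = [("arrived in", "Pittsburgh", True), ("met with", "Smith", False)]
unlabeled = [("arrived in", "Boston"), ("near", "Pittsburgh"), ("met with", "Jones")]

def train(view, pool):
    pos = {ex[view] for ex in pool if ex[2]}
    neg = {ex[view] for ex in pool if not ex[2]}
    # Classify by exact match on this view; abstain (None) on unseen values
    return lambda v: True if v in pos else (False if v in neg else None)

pool = list(labeled)
while unlabeled:
    c1, c2 = train(0, pool), train(1, pool)   # context view, entity view
    progress = False
    for ex in list(unlabeled):
        vote = c1(ex[0])
        if vote is None:
            vote = c2(ex[1])                  # fall back to the other view
        if vote is not None:
            pool.append((ex[0], ex[1], vote))
            unlabeled.remove(ex)
            progress = True
    if not progress:
        break
```

Note how the views cover for each other: "Boston" is labeled through its known context, while the unknown context "near" is labeled through the known entity "Pittsburgh", which in turn teaches the context classifier a new pattern.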
Example Bootstrap learning algorithms:
Classifying web pages [Blum & Mitchell 98; Slattery 99]
Classifying email [Kiritchenko & Matwin 01; Chan et al. 04]
Named entity extraction [Collins & Singer 99; Jones 05]
Wrapper induction [Muslea et al. 01; Mohapatra et al. 04]
Word sense disambiguation [Yarowsky 96]
Discovering new word senses [Pant