Comp 540 Statistical Machine Learning:
Principles and Applications
Devika Subramanian
Computer Science, Rice University
(c) Devika Subramanian, 2008 2
Class information
4 credits
Tuesday/Thursday 1:00 to 2:20 at Duncan 1075
Instructor: Devika Subramanian
(c) Devika Subramanian, 2008 3
Goals of course
Introduce several state-of-the-art algorithms in statistical machine learning.
Show how each of these algorithms applies to real-world problems in science and engineering.
Provide practice in applying algorithms by solving real problems.
(c) Devika Subramanian, 2008 4
Auxiliary goals of course
To give you experience in independent research.
To give you practice in oral presentation of technical material.
To train you to write high-quality technical papers.
(c) Devika Subramanian, 2008 5
Course work
A term project (groups of at most two):
Project proposal
Interim progress reports (5)
Project presentation
Project report (a technical paper)
An oral technical presentation
(c) Devika Subramanian, 2008 6
What is machine learning?
Use data to build models useful for decision making.
[Diagram: a learning algorithm takes training data and prior knowledge and produces a predictive model, which reprograms the decision rules of a system; the system produces responses based on inputs from the environment, according to pre-determined rules.]
(c) Devika Subramanian, 2008 7
An example from finance
[Diagram: a learning algorithm takes economic indicators and the price of GOOG, plus prior knowledge, and produces a prediction of the GOOG price, which drives an investment policy applied to the stock market.]
Validation of the learned model: are you making money?
(c) Devika Subramanian, 2008 8
Issues
What data should be gathered to make predictions? (feature selection)
What kind of model should be learned (e.g., a deterministic function of observed data, a probabilistic prediction on action choice)? (model selection)
How can we be sure we have a good predictive model? (model assessment or validation)
What algorithms should we use to learn these models from data? How can we scale them to work on large data sets in real time?
(c) Devika Subramanian, 2008 9
An example from biology
[Diagram: a learning algorithm takes flow cytometry measurements of cancer cells and prior knowledge, and produces a signaling network; therapeutic interventions (drugs) based on the network are applied to the cells, whose responses feed back as new data.]
(c) Devika Subramanian, 2008 10
T-cell signaling network
[Figure: learned signaling network over phospho-proteins (PKC, PKA, Raf, Mek, P44/42, Akt, Jnk, P38, Plcγ) and phospho-lipids (PIP2, PIP3, perturbed in the data), compared against the expected pathway: 15/17 classic edges and 17/17 reported edges recovered, 3 missed, some reversed.]
Sachs et al., Science 2005
(c) Devika Subramanian, 2008 11
Bcr-Abl signal transduction pathways in CML
Sawyers CL. NEJM. 1999; 340(17):1331
(c) Devika Subramanian, 2008 12
Inhibiting Bcr-Abl kinase (Gleevec)
(c) Devika Subramanian, 2008 13
Spam filtering
[Diagram: a learning algorithm takes training data and prior knowledge and produces the probability that a message is spam; a spam labeling policy labels each message in the mail stream as spam/ham, and user feedback flows back as new training data.]
(c) Devika Subramanian, 2008 14
Standard system building methodology
Analyze problem: interview human experts, gather requirements, understand how decisions are made.
Design a solution: handcraft system models and devise algorithms for decision making.
Implement and test the system.
[Diagram: a system mapping inputs to outputs.]
(c) Devika Subramanian, 2008 15
When do we need machine learning?
When we don’t know how to calculate outputs from inputs (cancer cells).
When requirements change rapidly (spam filtering).
When the environments in which systems operate change rapidly (the stock market).
When there is tremendous individual variability, and therefore a need for customization (spam filtering, cancer cells).
(c) Devika Subramanian, 2008 16
Machine learning
Principles, methods, and algorithms for prediction and modeling on the basis of past experience.
(c) Devika Subramanian, 2008 17
Statistical machine learning
Machine learning is already at the heart of speech recognition and handwriting recognition.
Statistical learning methods are transforming information retrieval (Google) and retail (Amazon, Walmart).
Statistical learning methods are creating opportunities in databases, computer graphics, robotics, computer vision, networking, operating systems, and computer security.
(c) Devika Subramanian, 2008 18
Role of ML in CS
Data is a new source of power for computer science.
Every computer science student should learn the fundamentals of machine learning and statistical thinking.
By combining engineered frameworks with models learned from data, we can develop the high-performance systems of the future.
(c) Devika Subramanian, 2008 19
Learning in context
[Diagram: machine learning sits at the intersection of artificial intelligence (uncertainty, multi-agent systems), control theory and operations research (MDPs), statistics, applied mathematics, algorithms, systems/software engineering, and data mining.]
(c) Devika Subramanian, 2008 20
Outline of rest of lecture
Learning, by example
The structure of the course
One more example application of learning
(c) Devika Subramanian, 2008 21
A problem
Sort incoming mail into bins based on zip code.
Great variability in handwriting makes it hard to write a fixed set of rules for recognizing digits.
(c) Devika Subramanian, 2008 22
Learn from examples
The machine aligns the letter so that a camera can take an image of it and extract the zip code (segmentation and pre-processing for digit extraction).
(c) Devika Subramanian, 2008 23
How supervised learning works
Steps:
Entertain a set of possibilities
Adjust predictions based on feedback
Rethink the possibilities
[Figure: example digit images with labels 3, 2, 2]
(adapted from T. Jaakkola) (c) Devika Subramanian, 2008 24
Key questions
Data and assumptions
What data is available for the learning task?
What can we assume about the problem?
Representation
How should we represent the examples? (feature selection)
Method and estimation
What are the possible hypotheses?
How do we adjust our predictions based on feedback?
(slides 23-30 adapted from T. Jaakkola)
(c) Devika Subramanian, 2008 25
Key questions (contd.)
Evaluation
How well are we doing?
Model selection
Can we do even better by selecting a richer class of hypotheses?
(c) Devika Subramanian, 2008 26
Data and assumptions
How are the digits generated, and how reliable are the labels?
(c) Devika Subramanian, 2008 27
Data representation
The representation of the input can be:
A bitmap
Extracted features of the bitmap (number of curves, loops, etc.)
The representation can make the learning problem easy or difficult.
(c) Devika Subramanian, 2008 28
Method and estimation
Bitmap representation (8 by 8) of the input, stored as a 64-bit vector x.
Hypothesis: y = sign(w · x), where w is a parameter vector we learn from the data.
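A minimal sketch of learning such a w with the classic perceptron update (Python/NumPy; the usage names at the bottom are illustrative, not from the slides):

```python
import numpy as np

def perceptron(X, y, epochs=50):
    """Learn w for the hypothesis y = sign(w . x).

    X: (n, d) array of inputs (e.g., flattened 8x8 bitmaps, d = 64)
    y: (n,) array of labels in {-1, +1}
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:   # mistake-driven update
                w += yi * xi
    return w

# Hypothetical usage, distinguishing two digit classes:
# w = perceptron(X_train, y_train)
# y_pred = np.sign(X_test @ w)
```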
(c) Devika Subramanian, 2008 29
Model assessment
Look at average classification error as a function of the number of examples.
[Figure: average error vs. number of examples]
(c) Devika Subramanian, 2008 30
Model selection
Our classifier is limited; can we make it more flexible?
Is there an entirely different type of classifier that will be more suitable?
(c) Devika Subramanian, 2008 31
Outline of course
Learning techniques
Supervised learning
Regression
Linear, locally weighted, polynomial, additive
Nearest neighbor, prototype methods
Classification
Discriminative: logistic regression, perceptrons, neural nets, SVMs
Generative: LDA, QDA, naïve Bayes, Bayesian nets
(c) Devika Subramanian, 2008 32
Outline (contd.)
Unsupervised learning
Clustering and k-means
Expectation maximization and Gaussian mixtures
Factor analysis: PCA, ICA, Isomap
Learning from sequential data
HMMs and CRFs
Reinforcement learning
Learning theory
Bias/variance tradeoff, overfitting and regularization
Ensemble learning: boosting and bagging
Model assessment and selection
(c) Devika Subramanian, 2008 33
Outline (contd.)
Applications
Elevator control and backgammon playing
Learning from forests of sensors
Face and handwriting recognition
Text mining and information extraction
Learning regulatory networks from biological data
Gene finding
And others that interest you!
(c) Devika Subramanian, 2008 34
The basic methods
Supervised learning
Unsupervised learning
Reinforcement learning
(c) Devika Subramanian, 2008 35
Supervised learning
[Diagram: labeled training examples (digit images 8, 3, 6, 0, 1) feed a learning algorithm, which produces a classifier; the classifier then labels new examples, e.g., a new image of an 8.]
(c) Devika Subramanian, 2008 36
Supervised learning
The desired model is a classifier function y = f(x).
Training examples are pairs of the form (x1, y1), …, (xn, yn), where xi is a vector denoting an input and yi is its corresponding classification.
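As a concrete, if simplistic, instance of building y = f(x) from labeled pairs, here is a one-nearest-neighbor sketch (NumPy; the function and argument names are illustrative):

```python
import numpy as np

def nearest_neighbor_classify(X_train, y_train, x):
    """Predict the label of x as the label of its closest training
    example -- a simplest-possible way to realize y = f(x) from
    pairs (x1, y1), ..., (xn, yn)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]
```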
(c) Devika Subramanian, 2008 37
Graphics: image analogies
A : A′ :: B : ?
Hertzmann, Jacobs, Oliver, Curless, Salesin, SIGGRAPH 2001 (c) Devika Subramanian, 2008 38
Learning texture maps
[Figure: example image pairs]
(c) Devika Subramanian, 2008 39
Unsupervised learning
The model is a probability density function p on the space of inputs X (the joint probability distribution on X).
Training data are samples x1, …, xn from X.
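For intuition, a minimal density-estimation sketch: fitting a single Gaussian p(x) to unlabeled samples by maximum likelihood (NumPy; a mixture model, covered later, would generalize this):

```python
import numpy as np

def fit_gaussian(X):
    """Fit p(x) = N(mu, Sigma) to samples x1..xn by maximum likelihood."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)   # sample covariance estimate
    return mu, Sigma

def log_density(x, mu, Sigma):
    """Evaluate log p(x) under the fitted Gaussian."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))
```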
(c) Devika Subramanian, 2008 40
Clustering
Finding structure in the data
Mixture models
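A basic k-means sketch, the simplest of the clustering methods in the course outline (NumPy; the iteration cap and seed are arbitrary choices):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Alternate assigning points to the nearest center and
    recomputing each center as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every center: (n, k)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):          # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```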
(c) Devika Subramanian, 2008 41
Clustering of microarray data
(c) Devika Subramanian, 2008 42
Reinforcement learning
[Diagram: an agent observes state s and reward r from the environment and emits action a.]
Agent’s goal: choose actions to maximize total reward.
The action selection rule is called a “policy”: a = π(s).
(c) Devika Subramanian, 2008 43
Methods for reinforcement learning
Direct
Start with an initial policy π
Experiment with the environment to decide how to improve π
Repeat
Model based
Experiment with the environment to learn how it behaves (dynamics + rewards)
Compute the optimal policy π
(c) Devika Subramanian, 2008 44
Reinforcement learning
The desired output is an action selection policy π.
Training examples are <s, a, r, s’> tuples collected by the agent interacting with the environment.
(c) Devika Subramanian, 2008 45
Temporal Difference Learning
TD-Gammon [Tesauro]
Neural network (input: raw board information)
A more intelligent weight update rule
Self-play (300,000 games)
Human expert level
[Diagram: two copies of the program play against each other, alternating actions.]
(c) Devika Subramanian, 2008 46
Fundamental questions in machine learning
Incorporating prior knowledge
We shouldn’t force systems to learn everything from scratch; they should bootstrap off of what is already known.
Incorporating learned structures into larger systems
How to embed learning in a larger system (good examples: speech recognizers and handwriting recognizers)
(c) Devika Subramanian, 2008 47
Fundamental questions (contd.)
Making learning algorithms (particularly reinforcement learning) practical.
Unsupervised and supervised learning have close ties to statistics; many practically fielded algorithms come from there.
Trading off accuracy, sample size, and hypothesis complexity.
We need both theoretical results and experimental methodologies for making these tradeoffs.
(c) Devika Subramanian, 2008 48
An example from cognitive science
[Diagram: a learning algorithm takes training data from a human learning the NRL task, plus prior knowledge, and produces a predictive model that suggests interventions to aid learning.]
(c) Devika Subramanian, 2008 49
Submarine School 101: The NRL Navigation Task
50% of the class is weeded out by this game!
• Pilot a submarine to a goal through a minefield in a limited time period
• Distance to mines revealed via seven discrete sonars
• Time remaining, as-the-crow-flies distance to goal, and bearing to goal are given
• Actions communicated via a joystick interface
(c) Devika Subramanian, 2008 50
The NRL Navigation Task
The mine configuration changes with every game.
The game has a strategic and a visual-motor component!
(c) Devika Subramanian, 2008 51
Learning curves
[Figure: success % vs. episode (1-736) for subjects S1-S5.]
Successful learners look similar: plateaus between improvements.
Unsuccessful learners are DOA!
The Navy takes 5 days to tell whether a person succeeds or fails. (c) Devika Subramanian, 2008 52
Task Questions
Is the game hard? What is the source of complexity?
Why does human performance plateau out at 80%? Is that a feature of the human learning system or the game? Can machine learners achieve higher levels of competence?
Can we understand why humans learn or fail to learn the task? Can we detect inability to learn early enough to intervene?
How can we actively shape human learning on this task?
(c) Devika Subramanian, 2008 53
Mathematical characteristics of the NRL task
A partially observable Markov decision process which can be made fully observable by augmenting the state with the previous action.
State space of size 10^14; at each step, a choice of 153 actions (17 turns and 9 speeds).
Feedback at the end of up to 200 steps.
Challenging for both humans and machines.
(c) Devika Subramanian, 2008 54
Reinforcement learning
“a way of programming agents by reward and punishment without needing to specify how the task is to be achieved”
[Kaelbling, Littman, & Moore, 96]
(c) Devika Subramanian, 2008 55
Reinforcement learning
[Diagram: the learner observes the task state, takes an action, and receives feedback.]
Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
1. Observe state, s_t
2. Decide on an action, a_t
3. Perform action
4. Observe new state, s_{t+1}
5. Observe reward, r_{t+1}
6. Learn from experience
7. Repeat
The learned policy is a mapping π : S → A.
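A minimal tabular Q-learning sketch of the seven-step loop above; the env interface (reset/step/actions) is a hypothetical stand-in, not something specified on these slides:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning following the observe/act/learn loop above.

    Assumes env exposes reset() -> state, step(action) ->
    (next_state, reward, done), and a list env.actions.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()                       # 1. observe state s_t
        done = False
        while not done:
            if random.random() < epsilon:     # 2. decide on action a_t
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)         # 3-5. act, observe s_{t+1}, r_{t+1}
            best_next = max(Q[(s2, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # 6. learn
            s = s2                            # 7. repeat
    return Q
```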
(c) Devika Subramanian, 2008 56
Reinforcement learning/NRL task
Representational hurdles
State and action spaces have to be manageably small.
Good intermediate feedback, in the form of a non-deceptive progress function, is needed.
Algorithmic hurdles
An appropriate credit assignment policy is needed to handle the two types of failures (timeouts and explosions are different).
Learning is too slow to converge (because there are up to 200 steps in a single training episode).
(c) Devika Subramanian, 2008 57
State space design
Binary distinction on each sonar: is it > 50?
Six equivalence classes on bearing: {12}, {1,2}, {3,4}, {5,6,7}, {8,9}, {10,11}
State space size = 2^7 * 6 = 768.
Discretization of actions:
speed: 0, 20 and 40
turn: -32, -16, -8, 0, 8, 16, 32
Automated discovery of abstract state spaces for reinforcement learning, Griffin and Subramanian, 2001.
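A sketch of this discretization; the thresholds and class boundaries follow the slide, while the function and argument names are mine:

```python
def abstract_state(sonars, bearing):
    """Map raw NRL sensors to an abstract state.

    sonars:  seven range readings; each becomes one bit (> 50 or not),
             giving 2**7 = 128 sonar patterns.
    bearing: clock position 1..12, mapped to one of six classes,
             for 128 * 6 = 768 abstract states.
    """
    sonar_bits = tuple(int(s > 50) for s in sonars)
    bearing_classes = [(12,), (1, 2), (3, 4), (5, 6, 7), (8, 9), (10, 11)]
    bearing_class = next(i for i, c in enumerate(bearing_classes)
                         if bearing in c)
    return sonar_bits, bearing_class
```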
(c) Devika Subramanian, 2008 58
The dense reward function
r(s, a, s’) = 0 if s’ is a state where the player hits a mine
            = 1 if s’ is a goal state
            = 0.5 if s’ is a timeout state
            = 0.75 if s is an all-blocked state and s’ is a not-all-blocked state
            = 0.5 + (diff in sum of sonars)/1000 if s’ is an all-blocked state
            = 0.5 + (diff in range)/1000 + abs(bearing - 6)/40 otherwise
The first three cases provide feedback at the end; the last three provide useful feedback during play.
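The same cases written as code, for concreteness; the state field names (hit_mine, at_goal, etc.) are hypothetical, and the sign conventions on the two "diff" terms are assumptions not pinned down by the slide:

```python
def reward(s, a, s2):
    """Sketch of the dense reward above for a transition s --a--> s2."""
    if s2.hit_mine:
        return 0.0
    if s2.at_goal:
        return 1.0
    if s2.timed_out:
        return 0.5
    if s.all_blocked and not s2.all_blocked:
        return 0.75
    if s2.all_blocked:
        # "diff in sum of sonars": direction of the difference is assumed
        return 0.5 + (sum(s2.sonars) - sum(s.sonars)) / 1000.0
    # "diff in range" plus the bearing term, signs assumed as written
    return (0.5 + (s.range_to_goal - s2.range_to_goal) / 1000.0
            + abs(s2.bearing - 6) / 40.0)
```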
(c) Devika Subramanian, 2008 59
Credit assignment policy
Penalize only the last action in a sequence that ends in an explosion.
Penalize all actions in a sequence that ends in a timeout.
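A sketch of this policy applied to a finished episode (the penalty value and data layout are assumptions):

```python
def assign_credit(trajectory, outcome, penalty=-1.0):
    """Return (state, action, penalty) updates for one episode.

    trajectory: list of (state, action) pairs in time order.
    outcome: "explosion", "timeout", or anything else (no penalty).
    """
    if outcome == "explosion":
        # blame only the final action
        return [trajectory[-1] + (penalty,)]
    if outcome == "timeout":
        # blame every action in the episode
        return [sa + (penalty,) for sa in trajectory]
    return []
```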
(c) Devika Subramanian, 2008 60
Simplification of value estimation
Estimate the average local reward for each action in each state.
[Figure: a trajectory from s through s’ to a terminal state t, collecting rewards r1, r2, r3 along the way.]
Q(s,a) is the sum of rewards from s to the terminal state, learned by the update
Q'(s,a) = α [r + max_a' Q(s',a')] + (1 − α) Q(s,a)
Instead of learning Q, we maintain the approximation
Q'(s,a) = (running average of rewards at a for s) × (pct of wins from s)
Open question: when does this approximation work?
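A sketch of maintaining that approximation with running counts; the class layout and the convention that updates arrive once per visited (s, a) with the episode outcome are assumptions:

```python
from collections import defaultdict

class AvgRewardQ:
    """Maintain Q'(s,a) = (running avg reward at (s,a)) * (pct of wins from s)."""

    def __init__(self):
        self.sum_r = defaultdict(float)   # total reward seen at (s, a)
        self.count = defaultdict(int)     # visits to (s, a)
        self.wins = defaultdict(int)      # winning episodes through s
        self.visits = defaultdict(int)    # episodes through s

    def update(self, s, a, r, won):
        self.sum_r[(s, a)] += r
        self.count[(s, a)] += 1
        self.visits[s] += 1
        self.wins[s] += int(won)

    def q(self, s, a):
        if not self.count[(s, a)] or not self.visits[s]:
            return 0.0
        avg_r = self.sum_r[(s, a)] / self.count[(s, a)]
        return avg_r * (self.wins[s] / self.visits[s])
```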
(c) Devika Subramanian, 2008 61
Results of learning complete policy
Blue: learn turns only
Red: learn turn and speed
Humans make more effective use of training examples, but the Q-learner gets to near 100% success.
Griffin and Subramanian, 2000 (c) Devika Subramanian, 2008 62
Full Q learner/1500 episodes
(c) Devika Subramanian, 2008 63
Full Q learner/10000 episodes
(c) Devika Subramanian, 2008 64
Full Q learner/failure after 10K
(c) Devika Subramanian, 2008 65
Why learning takes so long
[Figure: states where 3 or fewer of the 153 action choices are correct!]
Griffin and Subramanian, 2000
(c) Devika Subramanian, 2008 66
Lessons from machine learning
Task level
The task is hard because states in which the action choice is critical occur less than 5% of the time.
Staged learning makes the task significantly easier.
A locally non-deceptive reward function speeds up learning.
Reinforcement learning
Long sequences of moves make credit assignment hard; a new, cheap approximation to the global value function makes learning possible for such problems.
An algorithm for automatic discretization of large, irregular state spaces.
Griffin and Subramanian, 2000, 2001
(c) Devika Subramanian, 2008 67
Task Questions
Is the game hard? Is it hard for machines? What is the source of complexity?
Why does human performance plateau out at 80%? Is that a feature of the human learning system or the game? Can machine learners achieve higher levels of competence?
Can we understand why humans learn or fail to learn the task? Can we detect inability to learn early enough to intervene?
How can we actively shape human learning on this task?
(c) Devika Subramanian, 2008 68
Tracking human learning
[Diagram: a learning algorithm takes (sensor panel, joystick action) time-course data and prior knowledge, and produces a model: a strategy mapping sensor panels to joystick actions, which is used to select interventions to aid learning.]
Extract the strategy and study its evolution over time.
(c) Devika Subramanian, 2008 69
Challenges
High dimensionality of the visual data (11 dimensions spanning a space of size 10^14)
Large volumes of data
Noise in the data
Non-stationarity: policies change over time
(c) Devika Subramanian, 2008 70
Embedded learner design
Representation
Use the raw visual-motor data stream to induce policies/strategies.
Learning
Direct models: a lookup table mapping the sensors at time t and the action at t-1 to a distribution of actions at time t (a 1st-order Markov model).
Decision-making
Compute the “derivative” of the policies over time, and use it (1) to classify the learner and select interventions, and (2) to build behaviorally equivalent models of subjects.
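A sketch of building such a direct model as a lookup table (assumes sensor panels arrive as hashable values, e.g., tuples; names are illustrative):

```python
from collections import defaultdict, Counter

def build_direct_model(stream):
    """Lookup table mapping (sensors_t, action_{t-1}) to a distribution
    over actions at time t -- the 1st-order Markov model above.

    stream: iterable of (sensor_panel, action) pairs in time order.
    """
    table = defaultdict(Counter)
    prev_action = None
    for sensors, action in stream:
        table[(sensors, prev_action)][action] += 1
        prev_action = action
    # normalize counts into action distributions
    return {key: {a: n / sum(c.values()) for a, n in c.items()}
            for key, c in table.items()}
```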
(c) Devika Subramanian, 2008 71
Strategy: mapping from sensors to action distributions
[Figure: strategies are estimated over a sliding window of w games.]
(c) Devika Subramanian, 2008 72
Surely, this can’t work!
There are 10^14 sensor configurations possible in the NRL Navigation task.
However, only about 10^3 to 10^4 of those configurations are actually observed by humans in a training run of 600 episodes.
Exploit sparsity in the sensor configuration space to build a direct model of the subject.
(c) Devika Subramanian, 2008 73
How do strategies evolve over time?
Distance function between strategies: KL divergence
Δ(Π(i, i+w), Π(i+w−s, i+2w−s))
[Figure: two sliding windows of width w over the game sequence, with overlap s.]
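A sketch of comparing two windowed strategies with an averaged KL divergence; the smoothing constant and the restriction to states observed in both windows are assumptions, not details from the slides:

```python
import math

def strategy_divergence(p1, p2, eps=1e-6):
    """Average KL divergence between two strategies.

    Each strategy is a dict mapping a sensor state to a dict of
    action probabilities (as built by the direct model above).
    """
    shared = set(p1) & set(p2)          # states observed in both windows
    total = 0.0
    for s in shared:
        actions = set(p1[s]) | set(p2[s])
        total += sum(p1[s].get(a, eps) *
                     math.log(p1[s].get(a, eps) / p2[s].get(a, eps))
                     for a in actions)
    return total / max(len(shared), 1)
```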
(c) Devika Subramanian, 2008 74
Results: model derivative
Siruguri and Subramanian, 2002
(c) Devika Subramanian, 2008 75
Before shift (episode 300)
(c) Devika Subramanian, 2008 76
After shift (episode 320)
(c) Devika Subramanian, 2008 77
Model derivative for Hei
Siruguri and Subramanian, 2002
(c) Devika Subramanian, 2008 78
How humans learn
Subjects have relatively static periods of action policy choice, punctuated by radical shifts.
Successful learners have conceptual shifts during the first part of training; unsuccessful ones keep trying till the end of the protocol!
(c) Devika Subramanian, 2008 79
Behaviorally equivalent models
[Diagram: the learned model interacts with the NRL task in place of the subject.]
(c) Devika Subramanian, 2008 80
Generating behaviorally equivalent models
To compute the action a associated with the current sensor configuration s in a given segment:
Take the 100 nearest neighbors of s in the lookup table.
Perform locally weighted regression (LWR) on these 100 (s, a) pairs.
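A sketch of this neighbor-based LWR prediction; the slides specify only the 100 neighbors and LWR, so the Gaussian kernel and bandwidth tau are assumptions:

```python
import numpy as np

def lwr_predict(S, A, s_query, k=100, tau=1.0):
    """Predict the action for sensor configuration s_query from its
    k nearest (s, a) pairs via locally weighted regression.

    S: (n, d) array of stored sensor configurations
    A: (n,) array of the corresponding actions
    """
    dist = np.linalg.norm(S - s_query, axis=1)
    idx = np.argsort(dist)[:k]                       # the 100 nearest neighbors
    # sqrt of Gaussian kernel weights, so least squares applies weight w
    w = np.sqrt(np.exp(-dist[idx] ** 2 / (2 * tau ** 2)))
    Sk = np.hstack([np.ones((len(idx), 1)), S[idx]])  # intercept column
    theta, *_ = np.linalg.lstsq(w[:, None] * Sk, w * A[idx], rcond=None)
    return np.concatenate(([1.0], s_query)) @ theta
```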
(c) Devika Subramanian, 2008 81
Subject Cea: Day 5, games 1 through 9
[Figures: for each of the nine games, the subject’s trajectory and the model’s trajectory are shown side by side.]
(c) Devika Subramanian, 2008 90
Comparison with global methods
Siruguri and Subramanian, 2002
(c) Devika Subramanian, 2008 91
Summary
We can model subjects on the NRL task in real time, achieving excellent fits to their learning curves, using the available visual-motor data stream.
This is one of the first efforts in cognitive science to directly use objective visual-motor performance data to derive the evolution of strategy on a complex task.
(c) Devika Subramanian, 2008 92
Where’s the science?
(c) Devika Subramanian, 2008 93
Lessons
Learn simple models from objective, low-level data!
Non-stationarity is commonplace; we need to design algorithms that are robust to it.
Fast new algorithms for detecting change-points and building predictive stochastic models for massive, noisy, non-stationary vector time series data.
(c) Devika Subramanian, 2008 94
Neural correlates
Are there neural correlates to strategy shifts observed in the visual-motor data?
(c) Devika Subramanian, 2008 95
Task Questions
Can we adapt training protocols in the NRL task by identifying whether subjects are struggling with strategy formulation, visual-motor control, or both?
Can we use analysis of EEG data gathered during learning, as well as visual-motor performance data, to correlate ‘brain events’ with ‘visual-motor performance events’? Can this correlation separate subjects with different learning difficulties?
(c) Devika Subramanian, 2008 96
The (new) NRL Navigation Task
(c) Devika Subramanian, 2008 97
Gathering performance data
(c) Devika Subramanian, 2008 98
Fusing EEG and visual-motor data
[Pipeline: EEG data → artifact removal → coherence computation → visualization mechanism, fused with the visual-motor performance data.]
(c) Devika Subramanian, 2008 99
Measuring functional connectivity in the brain
Coherence provides a means to measure synchronous activity between two brain areas.
It is the normalized cross-power spectrum, a measure of the similarity of two signals in the frequency domain:
C_xy(f) = |S_xy(f)|^2 / [S_xx(f) S_yy(f)]
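A sketch of computing band-averaged coherence between two channels with SciPy's Welch-based estimator; the segment length, sampling rate, and band edges in the usage line are illustrative values, not from the slides:

```python
import numpy as np
from scipy.signal import coherence

def band_coherence(x, y, fs, f_lo, f_hi):
    """Mean magnitude-squared coherence
    C_xy(f) = |S_xy(f)|^2 / (S_xx(f) S_yy(f))
    between two channels, averaged over the band [f_lo, f_hi] Hz."""
    f, Cxy = coherence(x, y, fs=fs, nperseg=256)
    band = (f >= f_lo) & (f <= f_hi)
    return Cxy[band].mean()

# Hypothetical usage: gamma-band coherence between two electrodes
# lrgs = band_coherence(ch_front, ch_back, fs=256, f_lo=40, f_hi=52)
```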
(c) Devika Subramanian, 2008 100
Topological coherence map
[Figure: coherence map over the scalp, front to back.]
(c) Devika Subramanian, 2008 101
Frequency bands
Coherence map of connections in each band:
Δ (0-5 Hz), θ (5-9 Hz), α (9-14 Hz), β (14-30 Hz), γ (40-52 Hz)
(c) Devika Subramanian, 2008 102
Subject moh progression chart
(c) Devika Subramanian, 2008 103
Results (subject moh)
(c) Devika Subramanian, 2008 104
Results
Subject bil progression chart
(c) Devika Subramanian, 2008 105
Results (subject bil)
Baluch, Zouridakis, Stevenson and Subramanian, 2005, 2006
(c) Devika Subramanian, 2008 106
Subject G
Subject is in the skill refinement phase.
Subject is a near-expert performer.
(c) Devika Subramanian, 2008 107
Subject V
Subject never learned a good strategy.
It wasn’t for lack of trying...
(c) Devika Subramanian, 2008 108
Results
There are distinct EEG coherence map signatures associated with different learning difficulties: lack of strategy, and shifting between too many strategies.
Subjects in our study who moved from a low level of performance to a high level of performance show front-to-back synchrony in the gamma range, or long-range gamma synchrony (LRGS). [Baluch, Zouridakis, Stevenson, Subramanian 2007]
We are conducting experiments on more subjects to confirm these findings (14 subjects so far, with more being collected right now).
(c) Devika Subramanian, 2008 109
What else is this good for?
Using EEG readouts to analyze the effectiveness of video games for relieving pre-operative stress in children (A. Patel, UMDNJ).
Using EEG to read the emotional state of players in immersive video games (M. Zyda, USC).
Analyzing human performance on any visual-motor task with a significant strategic component.
(c) Devika Subramanian, 2008 110
Questions?