DESCRIPTION
Lashon Booker looks back at the history of LCS research and how it connects to current and future efforts.
© 2006 The MITRE Corporation. All rights reserved.
A Retrospective Look at Classifier System Research
Lashon B. Booker
The MITRE Corporation
Early Motivations for Learning Classifier System (LCS) Research
Design symbolic problem solvers that avoid brittleness in realistic (uncertain and continually varying) domains involving:
– On-line, adaptive control of behaviors
  Representations and procedures must adjust without unnecessarily disrupting existing capabilities
– Discovering relevant categories in a complex and unlabeled stream of input
  Inputs must be incrementally grouped together into plausible classes
This is especially difficult when behavior requires more knowledge representation and processing capability than is available with simple empirical associations between inputs and outputs
Requirements for Non-Brittle Rule-Based Behavior
Need to identify and take advantage of the exploitable regularities in the environment
Generalizations must be selective, pragmatic, and subject to exceptions
Learning must be incremental and closely coupled with performance and with unfolding reality
Rules must be treated as tentative hypotheses (not logical assertions) subject to testing and confirmation
– Hypothesis “strength” is derived from experience-based predictions of performance
– Strength is used to determine rule fitness and infer plausibility
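The strength bookkeeping can be made concrete with a small sketch. The update rule, learning-rate value, and class shape below are assumptions chosen for illustration; they follow the common Widrow-Hoff style of experience-based payoff prediction rather than the details of any particular system discussed here.

```python
# Hedged sketch: maintaining an experience-based "strength" for a classifier
# rule as a running estimate of the payoff it earns. The update rule and the
# learning rate BETA are assumptions made for illustration.

BETA = 0.2  # learning rate (assumed value)

class Classifier:
    def __init__(self, condition, action, initial_strength=10.0):
        self.condition = condition      # e.g., a ternary string such as "1#0#"
        self.action = action
        self.strength = initial_strength

    def reinforce(self, payoff):
        """Move strength toward the payoff actually received."""
        self.strength += BETA * (payoff - self.strength)

# Strength then serves both as a tentative fitness for the GA and as a
# plausibility estimate during conflict resolution among matching rules.
```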
Observations about early research
The Holland and Reitman collaboration placed a strong emphasis on cognition and characterized the problems of interest
Viewed classifier systems as symbolic problem solvers that avoid brittle behavior (an alternative to expert systems)
– Treat the rule set as a model and rules as parts in a context
– Evaluation of parts is context dependent (i.e., aspects are non-stationary)
Learning emphasized policy search and value estimation
– Rules are policy elements along with performance estimators
– Adjust policy via natural selection among rule types
– The Pitt approach preserved this idea, using the GA for direct policy search
Included provisions for motivation, affect, and introspection
These ideas provided the foundation for a comprehensive theory of induction (rule clusters, distributed representations, associations, spreading activation, etc.)
Influence of reinforcement learning
Reinforcement learning problems are faced by agents that must learn action sequences from trial and error
– The framework provides attractive formalisms based on estimating value functions (with key contributions from Sutton and Barto)
– Algorithms provide useful benchmarks for comparisons
Emphasis on value functions has had a strong influence on LCS research
– The primary niche is learning compact value function representations for off-policy temporal difference methods
– But the RL community has good alternatives
It is not clear if we are learning the best generalizations, or giving sufficient emphasis to policy improvement
Solution strategies:
• Search the space of possible behaviors
• Estimate utility of taking actions in world states (sketched below)
[Diagram: a learning agent interacts with the environment, sending actions and receiving state input and scalar feedback]
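To make the second solution strategy concrete (estimating the utility of taking actions in world states), here is a minimal sketch of a tabular temporal-difference agent loop. The Env interface (reset/step), the action set, and the parameter values are assumptions made for this illustration, not a description of any system in the talk.

```python
# Hedged sketch of value-based reinforcement learning: a tabular agent that
# estimates action utilities Q(s, a) from trial-and-error interaction.
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1  # assumed parameter values

def run_episode(env, Q, actions):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy exploration over the current value estimates
        if random.random() < EPSILON:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)   # scalar feedback
        # Temporal-difference update toward the one-step target
        best_next = max(Q[(next_state, a)] for a in actions)
        target = reward + (0.0 if done else GAMMA * best_next)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state
    return Q

# Usage: Q = defaultdict(float); run_episode(my_env, Q, my_actions)
```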
Value-based generalizations aren’t often intuitive
Grefenstette’s 9x32 abstract state space
There are many obvious intuitive solution strategies:
– E.g., move left or right to the column with the highest reward, then go straight
Classifier systems tend to learn piecemeal strategies rather than coherent ones
– Many narrowly-focused general rules are needed to get the overall solution
– Generalizations correspond to symmetries in the reward distribution (e.g., (Row = 111) AND (Column = #011#) → RIGHT), not the key attribute-based concepts (see the matching sketch after the figure)
– This distinction has been irrelevant in most classifier system test problems (e.g., multiplexor and Woods problems)
[Figure: Grefenstette’s 9x32 abstract state space, with a Start row and a reward row whose values (0, 50, 75, 125, 250, 500, 1000) rise and fall in a symmetric pattern repeated across the 32 columns]
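To see why a rule such as (Row = 111) AND (Column = #011#) → RIGHT captures a symmetry in the reward layout rather than an intuitive "head for the best column" concept, the sketch below shows ternary condition matching with # as a don't-care. The 3-bit row / 5-bit column encoding is an assumption chosen to fit the example rule, not the actual encoding Grefenstette used.

```python
# Hedged sketch of how a classifier's ternary condition generalizes over states.

def matches(condition, message):
    """A '#' in the condition matches either bit of the message."""
    return all(c == '#' or c == m for c, m in zip(condition, message))

def encode(row, col):
    # Assumed encoding: 3 bits of row, 5 bits of column (9 rows x 32 columns).
    return format(row, "03b") + format(col, "05b")

# Example rule from the slide: (Row = 111) AND (Column = #011#) -> RIGHT
rule_condition, rule_action = "111" + "#011#", "RIGHT"

# The condition covers columns 6, 7, 22, and 23: a symmetry in the reward
# distribution, not "the column with the highest reward".
covered = [col for col in range(32) if matches(rule_condition, encode(7, col))]
print(covered)   # -> [6, 7, 22, 23]
```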
Off-policy Methods Learn Different Behaviors
Since Q-learning is an off-policy method (i.e., the behavior policy may differ from the estimation policy), it does not suffer negative consequences for exploration
Sarsa (i.e., the bucket brigade) is an on-policy method, so its solution accounts for the consequences of exploration
In real problems where on-line errors are costly, this distinction is important
This also has architectural implications (e.g., how to approximate the value function)
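A minimal sketch of the distinction described above: the Q-learning (off-policy) target backs up the greedy next action regardless of what the behavior policy actually does, while the Sarsa (on-policy) target backs up the action actually taken, so exploratory mistakes show up in the learned values. Parameter names and values are assumptions for illustration.

```python
# Hedged sketch contrasting off-policy and on-policy one-step targets.
ALPHA, GAMMA = 0.1, 0.95  # assumed parameters

def q_learning_update(Q, s, a, reward, s_next, actions):
    """Off-policy: the target uses the greedy action in s_next, so the cost
    of exploratory behavior does not appear in the learned values."""
    target = reward + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(Q, s, a, reward, s_next, a_next):
    """On-policy (the bucket-brigade analogue): the target uses the action
    the behavior policy actually takes next, so states that invite costly
    exploration are valued accordingly."""
    target = reward + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```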
Bottom line: we need to identify and build on the strengths of the LCS approach. The key may be in specifying a set of organizing principles that go beyond implementation diagrams
Soar Architecture of Intelligent Rule-based Behavior
[Diagram: the layered Soar architecture, with Reaction, Deliberation, and Reflection layers ranging from faster / lower intelligence to slower / higher intelligence, connected to I/O and Learning]
Derived by Newell and his students (~1980), also as a response to the expert system phenomenon
Based on a theory of problem solving (i.e., problem spaces), along with a companion view of learning (i.e., chunking)
The theory was operationalized as an architecture that has served that community well
What kind of architecture makes sense for classifier systems?
The key role of policy improvement suggests that an actor-critic structure may be a good start
The idea is to intermix value iteration and policy improvement continually (state by state, action by action, sample by sample); a sketch follows the diagram
Is there an organizing principle that extends this concept to cover many forms of induction at different scales (including perception, reasoning, and action)?
[Diagram: actor-critic loop. The critic performs policy evaluation (value learning of V, Q) and the actor performs policy improvement (greedification), converging to the optimal policy π* and values V*, Q*]
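As a rough illustration of the actor-critic structure suggested on this slide, the sketch below intermixes policy evaluation (the critic's TD value estimate) and policy improvement (the actor's preference update) on every sample. The tabular representation, softmax actor, and parameter values are assumptions for illustration, not a proposal from the talk.

```python
# Hedged sketch of one tabular actor-critic step.
import math
import random
from collections import defaultdict

ALPHA_V, ALPHA_PI, GAMMA = 0.1, 0.1, 0.95  # assumed parameters
V = defaultdict(float)        # critic: state-value estimates
prefs = defaultdict(float)    # actor: action preferences H(s, a)

def policy(state, actions):
    """Actor's action selection: softmax over action preferences."""
    exps = [math.exp(prefs[(state, a)]) for a in actions]
    r, acc = random.random() * sum(exps), 0.0
    for a, e in zip(actions, exps):
        acc += e
        if r <= acc:
            return a
    return actions[-1]

def actor_critic_step(state, action, reward, next_state, done):
    # Critic: one-step TD error (policy evaluation)
    target = reward + (0.0 if done else GAMMA * V[next_state])
    td_error = target - V[state]
    V[state] += ALPHA_V * td_error
    # Actor: nudge the preference for the taken action (policy improvement)
    prefs[(state, action)] += ALPHA_PI * td_error
```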
DARPA/IPTO Focus on Cognitive Systems
DARPA views a cognitive system as one that
– can reason, using substantial amounts of appropriately represented knowledge
– can learn from its experience so that it performs better tomorrow than it did today
– can explain itself and be told what to do
– can be aware of its own capabilities and reflect on its own behavior
– can respond robustly to surprise
Learning is ubiquitous. Different forms operate at different times and places
What niche is the LCS community best suited to fill?
Some Open Problems for Reinforcement Learning (Sutton) - and Classifier Systems
– Incomplete state information
– Exploration
– Structured states and actions
– Incorporating prior knowledge
– Using teachers
– Theory of RL with function approximators
– Modular and hierarchical architectures
– Integration with other problem-solving and planning methods