1
While waiting for the talk to start, try to find 4 mistakes in this student essay.
Question: Suppose you are running in a straight line at constant speed. You throw a pumpkin straight up. Where will it land? Explain why.
Student: Once the pumpkin leaves my hand, the horizontal force that I am exerting on it no longer exists, only a vertical force (caused by my throwing it). As it reaches it’s maximum height, gravity (exerted vertically downward) will cause the pumpkin to fall. Since no horizontal force acted on the pumpkin from the time it left my hand, it will fall at the same place where it left my hands.
2
Can (physics) tutoring systems be more effective than human tutors?
Kurt VanLehn Pittsburgh Science of Learning Center
LRDC & The Computer Science Department
University of Pittsburgh
3
Thanks!
Current team: Pamela Jordan (lead), Patricia Albacete, Min Chi, John Connelly, Roxana Gheorghui, Sung-Young Jung, Brian Moses Hall, Uma Pappuswamy, Mike Ringenberg
Team alumni: Dumisizwe Bhembe, Michael Boettner, Andy Gaydos, Maxim Makatchev, Antonio Roque, Carolyn Rosé, Stephanie Siler, Ramesh Srivistava, Roy Wilson
+ Art Graesser’s group at the University of Memphis
4
The Learning Science research question: Increasing tutoring systems’ effectiveness?
Computer aided instruction (CAI) > classroom by d=0.4 sigma – Kulik, 1994
Intelligent tutoring systems (ITS) > classroom by d=1.0 sigma – Koedinger et al. 1997; VanLehn et al. 2006; …
Human tutors (HT) > classroom by d=2.0 sigma – Bloom, 1984
How can we build tutoring systems that are as effective as human tutors?
where effect size (Cohen’s d) = [gain(experimental) – gain(controls)] / standard_deviation(pooled)
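As a worked illustration of the effect-size formula above, here is a minimal Python sketch. The function names and the gain scores are invented for illustration; they are not data from any of the cited studies.

```python
from math import sqrt
from statistics import mean, variance

def pooled_sd(a, b):
    """Pooled standard deviation of two samples (uses sample variances)."""
    na, nb = len(a), len(b)
    return sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))

def effect_size(gains_experimental, gains_control):
    """Cohen's d = [gain(experimental) - gain(controls)] / standard_deviation(pooled)."""
    difference = mean(gains_experimental) - mean(gains_control)
    return difference / pooled_sd(gains_experimental, gains_control)

# Hypothetical gain scores (posttest minus pretest), made up for this example.
tutored_gains = [3, 4, 5]
classroom_gains = [1, 2, 3]
d = effect_size(tutored_gains, classroom_gains)
```

With these made-up numbers both groups have a sample variance of 1, so the pooled standard deviation is 1 and d equals the difference in mean gains.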
5
The Cognitive Science research question: The more interactivity, the more gain?
[Chart: effect size (axis 0 to 2.5) by type of tutoring / interactivity: CAI, ITS, NLT, Human; a “?” marks an unknown value]
6
The Computer Science research question: Deep linguistic techniques vs shallow?
Task | Shallow linguistic | Deep linguistic
Natural language understanding (NLU) | LSA, other bag-of-words | Syntactic grammars, lexicons, semantics…
Dialog management | Finite state networks | Reactive planning
Natural language generation (NLG) | Text templates | Plan-based
Non-routine language | Ignored | Anaphora, negation, …
Because the techniques are compared in the context of a tutoring system, we can evaluate them for pedagogical effectiveness as well as the usual measures of speed, accuracy, generality, etc.
7
Outline
Introduction
– Focus on multi-step problem solving
– What is human tutoring?
– Research questions
Why2-Atlas
Evaluations
– Of individual techniques
– Of the whole system
Next
8
A multi-step quantitative problem
[Figure: seven “Step” labels mark the parts of the solution]
9
A multi-step qualitative problem
Q: Suppose a man is running in a straight line at constant speed. He throws a pumpkin straight up. Where will it land?
Initially, the man and the pumpkin have the same horizontal velocity. His throw exerts a net force vertically on the pumpkin, thus causing a vertical acceleration, which leaves the horizontal velocity unaffected…
[Figure: four “Step” labels mark the parts of the explanation]
10
A multi-step problem where order of steps doesn’t matter
Q: Why do most computers have a disk drive? Why can’t they have only RAM?
Student:
1. RAM’s content disappears when power quits, but disk content persists.
2. RAM usually holds less information than disk.
3. RAM takes battery power, so larger RAM takes more power.
4. Certain information, e.g., operating system and user files, must be stored permanently.
[Figure: each of the four numbered points is labeled “Step”]
11
Outline
Introduction
– Focus on multi-step problem solving
– What is human tutoring?
– Research questions
Why2-Atlas
Evaluations
– Of individual techniques
– Of the whole system
Next
12
Human tutorial dialogue is a sequence of episodes, one per step
Q: Why does a computer need disk as well as RAM?
S: RAM is too small. Only the disk is big enough.
T: That’s usually true. But suppose you bought a lot of RAM? Why wouldn’t that work?
S: The battery would run out too fast.
T: Excellent. What else?
S: That’s it.
T: What if the battery dies?
S: Oh. The RAM dies.
T: Anything wrong with that?
S: You lose your files.
T: Besides the user’s files, what else would be lost?
S: Beats me.
T: The operating system!
[Figure: three “Step” labels mark the episodes of the dialogue]
13
Schematic of tutorial dialogue
Problem statement → Step → Step → Step → Step → Answer → Reflection (optional)
14
Schematic of dialogue about a single step
[Flow diagram. Nodes: Step start; T: Tell; T: Elicit; S: Correct; S: Incorrect; Remediation (T: Hint, or prompt, or explain, or analogy, or …); Step end]
15
Comparisons of expert to novice human tutors
[Flow diagram. Nodes: Step start; T: Tell; T: Elicit; S: Correct; S: Incorrect; T: Hint, or prompt, or explain, or analogy, or …; Step end. Parts of the diagram are labeled “Novices” and “Experts”]
Experts may have a wider variety
16
Outline
Introduction
– Focus on multi-step problem solving
– What is human tutoring?
– Research questions
Why2-Atlas
Evaluations
– Of individual techniques
– Of the whole system
Next
17
The Learning Science research question: Increasing tutoring system effectiveness
CAI
– Remediation on answer only
ITS (e.g., Andes)
– Remediation on each step
– Hint sequence, with final “bottom out” hint
Human tutors
– Remediation on each step
– Natural language dialogues
– Many tutorial tactics
A tutoring system with Natural Language for its remediation?
18
The Cognitive Science research question: The more interactivity, the more gain?
[Chart: effect size (axis 0 to 2.5) by type of tutoring / interactivity: CAI, ITS, NLT, Human; a “?” marks an unknown value]
19
The Computer Science research question: Deep linguistic techniques vs shallow?
Task | Shallow linguistic | Deep linguistic
Natural language understanding (NLU) | LSA, other bag-of-words | Syntactic grammars, lexicons, semantics…
Dialog management | Finite state networks | Reactive planning
Natural language generation (NLG) | Text templates | Plan-based
Non-routine language | Ignored | Anaphora, negation, …
Evaluate for pedagogical effectiveness as well as the usual measures of speed, accuracy, generality, etc.
20
A task domain where deep understanding may add value
Qualitative physics
– “A massive truck and a light car have a head-on collision. Which suffers the greater impact force? Why?”
Linguistic relationships matter
– {car, truck, exerts, more, force}
Detecting deep misconceptions
– E.g., Bigger things exert more force.
Unfortunately, these misconceptions are notoriously resistant to instruction
– Try giving 10 hours of instruction
21
Outline
Introduction
– Focus on multi-step problem solving
– What is human tutoring?
– What is an ITS? CAI?
– Research questions
Why2-Atlas
Evaluations
– Of individual techniques
– Of the whole system
Next
22
Student’s screen for Why2-Atlas
[Screenshot; labeled regions: Problem; Dialogue history; Student’s essay; Student’s turn in the dialogue]
23
Schematic of Why2-Atlas tutorial dialogue
T: <displays problem>
S: <Enters essay>
T: <analyzes essay to identify missing & incorrect steps; picks one; starts a script for remediation of the step>
– T: When the pumpkin is in the air, what forces act on it?
– <many turns>
– T: Please change your essay
– S: <Edits essay>
T: <analyzes essay to identify missing & incorrect steps; picks one; starts a script>
– <many turns>
– T: Please change your essay
– S: <Edits essay>
T: <Analyzes essay; decides all steps are correct>
– T: Excellent! Here’s my version: <presents “ideal” essay>
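The control flow of this schematic can be sketched in Python as below. This is only an illustration under strong assumptions: essay analysis is replaced by a simple set comparison that handles missing steps only (not incorrect ones), and the names `missing_steps`, `tutor_session` and `remediate` are hypothetical, not part of Why2-Atlas.

```python
def missing_steps(essay_steps, required_steps):
    """Stand-in for essay analysis: required steps the essay does not cover yet."""
    return [step for step in required_steps if step not in essay_steps]

def tutor_session(essay_steps, required_steps, remediate, max_rounds=10):
    """Analyze the essay, pick one missing step, remediate it, then re-analyze."""
    essay = set(essay_steps)
    transcript = []
    for _ in range(max_rounds):
        todo = missing_steps(essay, required_steps)
        if not todo:
            # All steps are present: congratulate and present the ideal essay.
            transcript.append("Excellent! Here's my version: <ideal essay>")
            break
        step = todo[0]  # pick one flawed step
        transcript.append(f"remediate: {step}")
        # The dialogue ends with "Please change your essay"; the student's
        # edit is modeled here as the set of steps the revision adds.
        essay |= remediate(step)
    return transcript

# Hypothetical usage: a student who always adds the step just discussed.
demo = tutor_session([], ["forces", "horizontal velocity"], lambda step: {step})
```

The `max_rounds` cap is a design choice of this sketch so that a student who never fixes the essay cannot loop forever.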
24
Real example (slide 1 of 3)
Question: Suppose a man is running in a straight line at constant speed. He throws a pumpkin straight up. Where will it land? Explain.
Student’s Essay: The pumpkin would land in front of me because of the horizontal speed that is exerting on the pumpkin.
25
Example (slide 2 of 3)
Tutor: I see some opportunities for growth. Let's chat. So let's discuss a point that seems to be missing in your essay. Can you tell me what forces are acting on the pumpkin after it is released?
[Annotation: Picks a missing step (forces) & starts a KCD script]
Student: Gravity force
[Annotation: Correct, so goes to next state of the KCD]
26
Example (slide 3 of 3)
Tutor: Good. Since we know that gravity is the only force acting on the pumpkin, let me ask you this. As the gravitational force pulls the pumpkin down, does it also pull the pumpkin horizontally a little bit?
Student: Yes.
[Annotation: Wrong, so subdialogue]
– Tutor: Let's look at this more closely. In which direction does the earth's gravitational force act?
– Student: Vertical
[Annotation: Incomplete, so subdialogue]
» Tutor: Let's consider this some more. Is that vertical upwards or vertical downwards?
» Student: Vertical downwards
27
Schematic of a single step
[Flow diagram. Nodes: Step start; T: Tell; T: Elicit; S: Correct; S: Incorrect; KCD (T: … S: … T: … S: … T: …); Step end]
28
Outline
Introduction
– Focus on multi-step problem solving
– What is human tutoring?
– What is an ITS? CAI?
– Research questions
Why2-Atlas
Evaluations
– Of individual techniques
– Of the whole system
Next
29
Why2-Atlas main modules
[Diagram: Why2-Atlas modules and the data passed between them]
Nodes: Student (enters/edits the essay); Sentence understander; Essay understander; Discourse manager (decides what to do w.r.t. history); KCD script interpreter; RealPro NLG
Arrow labels: Words; FOPL propositions; Missing / bad steps; Script for remedying missing/bad step; Ideal essay; Clarification; Done
30
Modules evaluated (in yellow)
[Module diagram repeated from slide 29, with the evaluated modules highlighted in yellow]
31
Evaluate for accuracy (w.r.t. human judges) and speed
[Module diagram repeated from slide 29]
• Deep NLU: Carmel
  • LCFlex parser
  • Comlex lexicon
  • Semantic authoring tool
• Shallow NLU: Naïve Bayes; LSA
• Hybrids: CarmelTC; Rappel
• Result:
  • Similar accuracy
  • Complementary errors
  • Best to use all 3
32
Evaluate for utility as tool
[Module diagram repeated from slide 29]
• Re-implemented as TuTalk
• GUI authoring system
• XML authoring system
• Handy features (e.g., +/- feedback) for ITS
• Currently being used by 4 projects
33
Evaluate for accuracy & speed
[Module diagram repeated from slide 29]
Next few slides
34
Essay analysis: You probably found all 4 incorrect steps. Can the essay analyzer?
Question: Suppose you are running in a straight line at constant speed. You throw a pumpkin straight up. Where will it land? Explain why.
Student: Once the pumpkin leaves my hand, the horizontal force that I am exerting on it no longer exists, only a vertical force (caused by my throwing it). As it reaches it’s maximum height, gravity (exerted vertically downward) will cause the pumpkin to fall. Since no horizontal force acted on the pumpkin from the time it left my hand, it will fall at the same place where it left my hands.
35
Research problem, more precisely, is…
Given:
– Student’s sentences: {s1&s2, s3&s4&s5, …}
– Set of correct steps: {c1&c2, c3&c4&c5, …}
– Set of incorrect steps: {i1&i2, i3, i4&i5, …}
Determine: Which correct and incorrect steps match the student’s sentences:
– Directly (graph matching)
– Indirectly, using domain knowledge
36
Why do we need indirect matching?
The student said (incorrectly):
– “The pumpkin slows down, so it lands behind me.”
Correct steps
– Yada
– Yada
– Yada
Incorrect steps
– Yada
– When there is no force to propel an object along, it slows down
– Air friction matters
– Yada
Essay analyzer should output both derivations, with estimates of their probabilities
37
First method: Abduction using Tacitus-Lite+
Backchaining theorem prover (like Prolog)
– Student’s utterance = goal to be proved
– Problem statement = givens
– Proofs of earlier student utterances = more givens
Accepts goals without proof (at a cost)
– Because not everything can be anticipated
– Searches for lowest cost proof
Checks consistency as it goes
– Don’t try to prove ~p when the proof already has p.
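To make the idea of cost-based abduction concrete, here is a minimal propositional sketch in Python. It is not Tacitus-Lite+: it has no variables and no consistency checking, it may double-count an assumption shared by two subgoals, and every proposition name, rule and cost below is invented for illustration.

```python
import math

def cheapest_proof(goal, givens, rules, assume_cost, depth=5):
    """Return (cost, assumptions) for the cheapest backchaining proof of goal.

    rules is a list of (head, body, rule_cost); assume_cost maps a
    proposition to the cost of accepting it without proof.
    """
    if goal in givens:
        return 0.0, frozenset()
    # Option 1: accept the goal without proof, at a cost.
    best = (assume_cost.get(goal, math.inf), frozenset([goal]))
    if depth == 0:
        return best
    # Option 2: backchain through every rule that concludes the goal.
    for head, body, rule_cost in rules:
        if head != goal:
            continue
        total, assumed = rule_cost, frozenset()
        for subgoal in body:
            cost, extra = cheapest_proof(subgoal, givens, rules, assume_cost, depth - 1)
            total += cost
            assumed |= extra
        if total < best[0]:
            best = (total, assumed)
    return best

# Hypothetical knowledge base loosely modeled on the pumpkin example.
givens = {"man_force_x_zero"}
rules = [
    ("net_force_x_negative", ["man_force_x_zero", "air_friction_x_negative"], 0.0),
    ("net_force_x_zero", ["man_force_x_zero", "air_friction_x_zero"], 0.0),
    ("velocity_x_decreasing", ["net_force_x_negative"], 0.0),
    ("velocity_x_decreasing", ["net_force_x_zero"], 2.0),  # buggy rule, penalized
]
assume_cost = {
    "air_friction_x_negative": 3.0,
    "air_friction_x_zero": 0.5,
    "velocity_x_decreasing": 10.0,
}
cost, assumed = cheapest_proof("velocity_x_decreasing", givens, rules, assume_cost)
```

With these made-up costs the prover prefers the explanation that goes through the buggy rule; changing the costs changes which derivation wins, which is why the choice of costs matters so much in this style of reasoning.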
38
Derivation 1 (of 2) for “The pumpkin slows down”
[Derivation diagram, listed here from the givens to the student’s statement]
1. The horizontal component of the man’s force on the pumpkin is zero (given)
2. The horizontal component of the air friction force on the pumpkin is zero (given)
3. The horizontal component of the net force on the pumpkin is zero (by “Net force is sum of forces”, a correct inference rule)
4. The horizontal component of the velocity of the pumpkin is decreasing (by “(The net force causes the velocity, so) zero net force implies velocity decreases”, an incorrect inference rule)
5. The velocity of the pumpkin is decreasing (by imprecision; student said this)
Callout on the slide: “An inference rule”
39
Derivation 2 (of 2) for “The pumpkin slows down”
[Derivation diagram, listed here from the givens to the student’s statement]
1. The horizontal component of the man’s force on the pumpkin is zero (given)
2. The horizontal component of the air friction force on the pumpkin is negative (false assumption)
3. The horizontal component of the net force on the pumpkin is negative (by “Net force is sum of forces”)
4. The horizontal component of the acceleration of the pumpkin is negative (by Newton’s second law)
5. The horizontal component of the velocity of the pumpkin is decreasing (by kinematics)
6. The velocity of the pumpkin is decreasing (by imprecision; student said this)
40
Results of using Tacitus-Lite+
Acceptable accuracy, but far too slow
Cost may not be a good substitute for probability when there are multiple competing explanations
41
Second method: Precompute the time-consuming reasoning
Precomputation
– The deductive closure of the problem statement givens
– Save as directed graph
– Label subsets of nodes that represent correct and incorrect steps
– Convert to Bayesian network & train
To analyze a student’s utterance
– Clamp directly matched nodes as “evidence”
– Run Bayesian network
– Read out most probable steps
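The first two precomputation items (deductive closure, saved as a directed graph) can be sketched with simple forward chaining. This is an assumption-laden illustration: propositions are plain strings with no variables, and the rules and names are placeholders rather than the system's physics rules.

```python
def deductive_closure(givens, rules):
    """Forward-chain until no rule application adds anything new.

    rules is a list of (body, head) pairs. Returns (facts, edges), where
    edges holds one (premise, conclusion) pair per derivation link.
    """
    facts = set(givens)
    edges = set()
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if all(premise in facts for premise in body):
                new_edges = {(premise, head) for premise in body}
                if head not in facts or not new_edges <= edges:
                    facts.add(head)
                    edges |= new_edges
                    changed = True
    return facts, edges

# Placeholder givens and rules; the third rule never fires because "e" is absent.
demo_rules = [(("a", "b"), "c"), (("c",), "d"), (("e",), "f")]
demo_facts, demo_edges = deductive_closure({"a", "b"}, demo_rules)
```

The loop terminates because each pass either adds a fact or an edge from a finite set, or changes nothing. Labeling node subsets as correct or incorrect steps and converting the graph to a Bayesian network are separate stages not shown here.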
42
Results: Fast enough. Better accuracy, but not by much.
43
Summary
[Module diagram repeated from slide 29]
• Methods
  • Abductive theorem prover
  • Bayesian deductive closure
• Results
  • Similar accuracy
  • Bayesian deductive closure faster than abductive theorem prover
44
Outline
Introduction
– Focus on multi-step problem solving
– What is human tutoring?
– What is an ITS? CAI?
– Research questions
Why2-Atlas
Evaluations
– Of individual techniques
– Of the whole system
Next
45
Evaluation framework
Pretest (1 hr)
Training (5 to 10 hrs): For each question, do:
1. Student enters initial essay
2. Tutor analyses it for missing & incorrect steps, picks one, and discusses it with student
3. Student enters revised essay
4. Tutor either congratulates student & presents ideal essay, or goes to step 2
Posttest (1 hr)
Only step remediation varies with the condition
46
Conditions
Expert Human tutors
– Text-based communication
– Spoken communication
Computer tutors
– Why2-Atlas (VanLehn et al.)
– ITSPOKE (Litman et al.)
– Why2-AutoTutor (Graesser et al.)
Control conditions
– Canned text remediation
– Textbook
47
[Screenshot; labeled regions: Talking head; Gestures; Synthesized speech; Problem; Dialog history; Student types here]
48
Human tutors
[Flow diagram. Nodes: Step start; T: Tell; T: Elicit; S: Correct; S: Incorrect; T: Hint, or prompt, or explain, or analogy, or …; Step end]
49
Why2-Atlas
[Flow diagram. Nodes: Step start; T: Tell; T: Elicit; S: Correct; S: Incorrect; Knowledge construction dialogue; Step end]
50
Why2-AutoTutor
[Flow diagram. Nodes: Step start; T: Tell; T: Elicit; S: Correct; S: Incorrect; Hint, or prompt, or assert; Step end]
51
Canned-text remediation
[Flow diagram. Nodes: Step start; T: Tell; T: Elicit; S: Correct; S: Incorrect; <text>; Step end]
52
Results from 7 experiments
Why2-Atlas = Why2-AutoTutor
– Trend for >, but not significant
– Why2-Atlas may need more development
Why2 > Textbook
– In Textbook condition, students do not write essays
Why2 = Human tutoring !!!
Human tutoring = Canned text remediation
– Exception: If pre-physics students get instruction designed for post-physics students, then Human tutoring > Canned text remediation
53
Impact & significance of the results
Why2-Atlas = Why2-AutoTutor
– Common in AI that complex techniques are only slightly better than simple ones, at least initially.
Why2 > Textbook
– Common in Learning Sciences that active > passive
Why2 = Human tutoring = Canned text remediation
– Highly counter-intuitive to Learning and Cognitive scientists (including us)
54
Hypothesis 1: Exactly how tutors remedy a step doesn’t matter much
[Flow diagram. Nodes: Step start; T: Tell; T: Elicit; S: Correct; S: Incorrect; Step end. The remediation box after “S: Incorrect” is labeled “What’s in here doesn’t matter much”]
55
Other studies where type of step remediation had little impact
Human tutors
1. Human tutoring = human tutoring with only content-free prompting for step remediation (Chi et al., 2001)
2. Human tutoring = solving a problem in pairs with a video solution available (Chi et al., 2007)
3. Human tutoring = canned text during post-practice remediation (Katz et al., 2003)
4. Human tutoring = an ITS (Reif & Scott, 1999)
5. Micro-analyses of human tutoring (VanLehn et al., 2003)
6. Socratic human tutoring = didactic human tutoring (Rosé et al., 2001a; Johnson & Johnson, 1992)
Natural language tutoring systems
1. Circsim (canned text) = Circsim Tutor (Evens & Michael, 2007)
2. Andes-Atlas = Andes with canned text (Rosé et al., 2001b)
3. Cognitive geometry tutor (Aleven et al., 2004)
56
Hypothesis 2: Cannot eliminate the step remediation loop
[Flow diagram. Nodes: Step start; T: Tell; T: Elicit; S: Correct; S: Incorrect; Text; Step end. One part of the diagram is labeled “Must avoid this”]
57
Studies consistent with harmfulness of just telling & explaining
Human tutoring
– Human tutoring > textbook alone (Azevedo; Evens; VanLehn)
– Human tutoring > lecture/demo (Wood et al. 1978; Swanton,
Natural language tutoring systems
– NLT > textbook alone (Graesser; Evens; Lane; VanLehn)
– NLT > lecture/demo (Craig)
58
Conclusions
Learning Science: Can computer tutors be as effective as human tutors?
– Yes, as long as students attempt steps with feedback & hints on each
Computer Science: When is deep linguistic technology more effective than shallow?
– Several positive results at module level
– At whole system level, still tied, but encouraging
Cognitive Science: The higher the interactivity, the higher the learning gains?
– No. See next slide
59
The interactivity plateau
[Chart: effect size (axis 0 to 2.5) by type of tutoring / interactivity: CAI, ITS, NLT, Human]
Claim: Perhaps Bloom’s 2 experiments were confounded
60
How can we achieve super-human results?
[Chart: effect size (axis 0 to 2.5) by type of tutoring / interactivity: CAI, ITS, NLT, Human, New; a “?” marks an unknown value]
61
Future work (slide 1 of 3): Increasing engagement
NeuroCog engagement meter (DARPA)
– Can we reliably measure engagement with fMRI?
– Can we train students to maintain engagement with it?
Interesting problems (PSLC)
– Ill-defined & design problems
– Recommender system
ITS as a member of a social network (DFK, PSLC)
– Pairs > solos for engagement, but correctness?
– Can we add an ITS without destroying engagement?
62
Future work (slide 2 of 3): Faster learning; Faster authoring
Author & student interface ≈ PowerPoint
– Fast to learn & use
» e.g., type “Let V1, V2 be the initial, final velocities”
– Freedom; domain independence
As students master a step:
– Tutor does it, or
– It gets folded into a larger step
TruthBench
– Knowledge acquisition for truth checking vs.
– Knowledge acquisition for solving an (ill-defined) problem
– Examples instead of hints
63
Future work (slide 3 of 3): Teach what an AI learner needs
Explicit teaching of backwards chaining (PSLC)
– Accelerates learning; transfers (Chi & VanLehn, 2007)
Explicit teaching of confluences
– KE = ½ m v²: If mass and kinetic energy are constant, then velocity must be constant
Explicit teaching of abstraction planning
– KE = ½ m v²: If need a velocity, then find a kinetic energy
Dream system: A model human learner
– For testing curriculum designs
– Getting the step sizes right
65
When to use deep vs. shallow?
Task | Shallow linguistic | Deep linguistic | Recommendation
Sentence understanding | LSA, Rainbow, Rappel | Carmel: parser, semantics… | Use both
Essay/Discourse understanding | LSA | Abduction, Bnets | Use deep
Dialog management | Finite state networks | Reactive planning | Use locally smart FSA
Natural language generation | Text | Plan-based | Use equivalent texts
66
“It ain’t so much the things we don’t know that get us into trouble. It’s the things we know that just ain’t so.”
-- Josh Billings (Henry Wheeler Shaw)
67
A deep sentence understander: Carmel
LCFlex parser, a robust parser that uses skipping, insertion & flexible unification (Rosé & Lavie 2001)
Comlex, with 40,000 lexemes (Grishman et al., 1994)
A broad-coverage, domain-independent, English syntactic grammar
CarmelTools for semi-automatically creating semantic functions (Rosé 2000, Rosé et al., AIED 2003).
68
Shallow sentence understanders
Words only
– Rainbow: A Naïve Bayes text classifier
» Given new bag of words, calculates most probable domain propositions using Bayes rule
– LSA, several others
» Worse, not currently used
Words + syntactic features
– Rappel (Jordan)
» Minipar produces dependency relations between words
» Ripper builds one classifier per predicate type & per argument of the predicate type
– CarmelTC (Rosé et al., HLT/NAACL 2003)
» Worse, not currently used
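For readers unfamiliar with bag-of-words classification, the following is a minimal multinomial Naïve Bayes sketch in plain Python. It is not Rainbow: the training sentences, the proposition labels and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (sentence, label). Returns the counts the classifier needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sentence, label in examples:
        words = sentence.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(sentence, model):
    """Return the label with the highest log posterior under Bayes rule."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in sentence.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data; the labels stand in for domain propositions.
model = train([
    ("gravity pulls the pumpkin down", "force_gravity"),
    ("gravity is the only force", "force_gravity"),
    ("the pumpkin keeps its horizontal velocity", "velocity_constant"),
    ("horizontal velocity stays the same", "velocity_constant"),
])
label = classify("gravity acts down", model)
```

Because word order is discarded, a classifier of this kind cannot distinguish "the car exerts more force on the truck" from "the truck exerts more force on the car", which is the limitation the deep techniques aim to address.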
69
Results
Speed: All were fast enough
Accuracy: All were too low
– Deep ≈ Words ≈ Words + syntactic features
Best (Jordan et al., ITS04)
– Run all 3
– Use heuristics to choose 1 of 3 outputs
» E.g., If “velocity” or “speed” appear in the sentence, then “velocity” should appear in the output propositions somewhere.
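The example heuristic above can be sketched as a filter over the competing outputs. This is a guess at the general shape only: the function name, the string format of the propositions and the fallback to the first candidate are assumptions, not the published method.

```python
def choose_output(sentence, candidates):
    """Pick one of several understanders' outputs using the keyword heuristic.

    candidates is a list of proposition lists in preference order. If the
    sentence mentions "velocity" or "speed", prefer the first candidate whose
    propositions mention "velocity"; otherwise take the first candidate.
    """
    words = set(sentence.lower().split())
    needs_velocity = bool(words & {"velocity", "speed"})
    for propositions in candidates:
        has_velocity = any("velocity" in p for p in propositions)
        if not needs_velocity or has_velocity:
            return propositions
    return candidates[0]  # no candidate satisfies the heuristic

# Hypothetical outputs from three understanders for one sentence.
outputs = [["force(gravity, pumpkin)"], ["velocity(pumpkin, constant)"], []]
chosen = choose_output("the speed of the pumpkin is constant", outputs)
```

The sketch splits on whitespace only, so a word with attached punctuation such as "speed." would not trigger the heuristic; a real tokenizer would be needed in practice.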
70
Direct matching via largest common subgraph (Shearer et al., 2001)
Expectation: “The speed of the pumpkin is the same as the speed of man.”
– Graph: speed(x1, pumpkin, ...), speed(x2, man, ...), compare(x1, x2, same)
Student said: “…allow us to compare it to the speed of the pumpkin.”
– Graph: speed(x5, pumpkin, ...), compare(x5, x6, ...)
71
Generation of the ATMS for a problem starts with givens (node = atomic prop.; red = incorrect)
[Figure: graph of given nodes; legend: Correct givens, Buggy given]
72
Apply rules forward (RA = rule application)
[Figure: rule applications RA1 and RA2 added to the graph]
73
Stop when no more rule applications
[Figure: rule applications RA1 to RA4]
74
Propositions are not always ground! Variables (colored links) shared across nodes
75
Specific subsets correspond to expectations / misconceptions
[Figure labels: Expectation (a key step in the explanation); Expectation; Misconception]
76
At runtime, find all node subsets that unify with the student input
[Figure: Input node added to the graph]
77
In this case, two subsets (happy faces) unify with student’s input
78
Output the expectations/misconceptions that are directly matched
[Figure labels: Direct match; Close, but not a direct match]
79
Bnet nodes represent sets of nodes for expectations/misconceptions
80
Clamp student’s input nodes (happy faces) and update net
[Figure labels: Moderate posterior probability; High posterior probability; Low posterior probability]
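The clamp-and-update step can be illustrated on a toy network by brute-force enumeration. Everything here is assumed for the example: the slides do not say what form the conditional probabilities take, so a noisy-OR evidence node with made-up priors and strengths stands in for the trained network.

```python
from itertools import product

def posterior(priors, leak, strengths, evidence_observed=True):
    """P(cause = True | evidence) for independent binary causes that feed a
    single noisy-OR evidence node, computed by exhaustive enumeration."""
    names = list(priors)
    joint_total = 0.0
    marginals = {name: 0.0 for name in names}
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        p = 1.0
        for name in names:
            p *= priors[name] if assignment[name] else 1.0 - priors[name]
        # Noisy-OR: the evidence stays off only if the leak and every
        # active cause all fail to turn it on.
        p_off = 1.0 - leak
        for name in names:
            if assignment[name]:
                p_off *= 1.0 - strengths[name]
        p *= (1.0 - p_off) if evidence_observed else p_off
        joint_total += p
        for name in names:
            if assignment[name]:
                marginals[name] += p
    return {name: marginals[name] / joint_total for name in names}

# Hypothetical numbers: the observed student input is far more typical of the
# misconception than of the expectation.
beliefs = posterior(
    priors={"expectation": 0.5, "misconception": 0.1},
    leak=0.01,
    strengths={"expectation": 0.2, "misconception": 0.9},
)
```

Enumeration is exponential in the number of nodes, so it only suits toy cases; a network built from a full deductive closure would need a proper inference algorithm.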
81
Training and evaluation
Topology of network is given by deductive closure, etc.
Only learn the conditional probabilities
293 sentences coded by human
Expectation/Maximization
10-fold cross validation