
1

While waiting for the talk to start, try to find 4 mistakes in this student essay.

Question: Suppose you are running in a straight line at constant speed. You throw a pumpkin straight up. Where will it land? Explain why.

Student: Once the pumpkin leaves my hand, the horizontal force that I am exerting on it no longer exists, only a vertical force (caused by my throwing it). As it reaches it’s maximum height, gravity (exerted vertically downward) will cause the pumpkin to fall. Since no horizontal force acted on the pumpkin from the time it left my hand, it will fall at the same place where it left my hands.

2

Can (physics) tutoring systems be more effective than human tutors?

Kurt VanLehn Pittsburgh Science of Learning Center

LRDC & The Computer Science Department

University of Pittsburgh

3

Thanks!

Current team: Pamela Jordan (lead), Patricia Albacete, Min Chi, John Connelly, Roxana Gheorghui, Sung-Young Jung, Brian Moses Hall, Uma Pappuswamy, Mike Ringenberg

Team alumni: Dumisizwe Bhembe, Michael Boettner, Andy Gaydos, Maxim Makatchev, Antonio Roque, Carolyn Rosé, Stephanie Siler, Ramesh Srivistava, Roy Wilson

+ Art Graesser’s group at the University of Memphis

4

The Learning Science research question: Increasing tutoring systems’ effectiveness?

Computer aided instruction (CAI) > classroom by d=0.4 sigma – Kulik, 1994

Intelligent tutoring systems (ITS) > classroom by d=1.0 sigma – Koedinger et al. 1997; VanLehn et al. 2006; …

Human tutors (HT) > classroom by d=2.0 sigma – Bloom, 1984

How can we build tutoring systems that are as effective as human tutors?

where effect size (Cohen’s d) = [gain(experimental) – gain(controls)] / standard_deviation(pooled)
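As a rough illustration of the effect-size formula above, the following Python sketch computes Cohen's d from two lists of gain scores. The function name, the sample numbers, and the choice of a pooled sample standard deviation (with n-1 denominators) are assumptions for illustration; the slides do not say how the pooled deviation was computed.

```python
import math
from statistics import mean, variance

def cohens_d(experimental_gains, control_gains):
    """Effect size: difference in mean gains over the pooled standard deviation."""
    n_e, n_c = len(experimental_gains), len(control_gains)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_e - 1) * variance(experimental_gains)
                  + (n_c - 1) * variance(control_gains)) / (n_e + n_c - 2)
    return (mean(experimental_gains) - mean(control_gains)) / math.sqrt(pooled_var)

# Hypothetical gain scores (posttest minus pretest), not data from the talk.
tutored = [3.0, 4.0, 5.0, 6.0, 7.0]
classroom = [1.0, 2.0, 3.0, 4.0, 5.0]
d = cohens_d(tutored, classroom)
```

With these made-up scores the groups differ by 2 points and share a standard deviation of about 1.58, so d comes out a little above 1.2 sigma.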

5

The Cognitive Science research question: The more interactivity, the more gain?

[Figure: chart of effect size (axis from 0 to 2.5) against type of tutoring / interactivity (CAI, ITS, NLT, Human); a “?” marks an unknown value.]

6

The Computer Science research question: Deep linguistic techniques vs shallow?

Component | Shallow linguistic | Deep linguistic
Natural language understanding (NLU) | LSA, other bag-of-words | Syntactic grammars, lexicons, semantics…
Dialog management | Finite state networks | Reactive planning
Natural language generation (NLG) | Text templates | Plan-based
Non-routine language | Ignored | Anaphora, negation, …

Because the techniques are compared in the context of a tutoring system, we can evaluate them for pedagogical effectiveness as well as the usual measures of speed, accuracy, generality, etc.

7

Outline

Introduction
– Focus on multi-step problem solving
– What is human tutoring?
– Research questions

Why2-Atlas

Evaluations
– Of individual techniques
– Of the whole system

Next

8

A multi-step quantitative problem

[Figure: a quantitative problem solution annotated with seven “Step” labels.]

9

A multi-step qualitative problem

Q: Suppose a man is running in a straight line at constant speed. He throws a pumpkin straight up. Where will it land?

Initially, the man and the pumpkin have the same horizontal velocity. His throw exerts a net force vertically on the pumpkin, thus causing a vertical acceleration, which leaves the horizontal velocity unaffected…

[Figure: four “Step” labels annotate the explanation above.]

10

A multi-step problem where order of steps doesn’t matter

Q: Why do most computers have a disk drive? Why can’t they have only RAM?

Student:
1. RAM’s content disappears when power quits, but disk content persists.
2. RAM usually holds less information than disk.
3. RAM takes battery power, so larger RAM takes more power.
4. Certain information, e.g., operating system and user files, must be stored permanently.

[Figure: four “Step” labels, one per numbered point in the student’s answer.]


12

Human tutorial dialogue is a sequence of episodes, one per step

Q: Why does a computer need disk as well as RAM?

S: RAM is too small. Only the disk is big enough.
T: That’s usually true. But suppose you bought a lot of RAM? Why wouldn’t that work?
S: The battery would run out too fast.
T: Excellent. What else?
S: That’s it.
T: What if the battery dies?
S: Oh. The RAM dies.
T: Anything wrong with that?
S: You lose your files.
T: Besides the user’s files, what else would be lost?
S: Beats me.
T: The operating system!

[Figure: three “Step” labels mark the episodes of this dialogue.]

13

Schematic of tutorial dialogue

Problem statement → Step → Step → Step → Step → Answer → Reflection (optional)

14

Schematic of dialogue about a single step

Step start

T: Tell

T: Elicit

S: Correct

S: Incorrect → Remediation: T: Hint, or prompt, or explain, or analogy, or …

Step end
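The single-step schematic can be read as a small loop. The sketch below is one possible reading, not the author's model: it assumes that after each remediation the student tries the step again, and that the tutor falls back to explaining when the listed remediations run out. The function name and arguments are hypothetical.

```python
def run_step(tell, student_attempts, remediations):
    """Trace one step episode: the tutor tells, or elicits until the student is correct."""
    trace = ["step start"]
    if tell:
        trace.append("T: tell")
    else:
        trace.append("T: elicit")
        remedies = iter(remediations)
        for correct in student_attempts:
            if correct:
                trace.append("S: correct")
                break
            trace.append("S: incorrect")
            # Remediation could be a hint, prompt, explanation, analogy, ...
            trace.append("T: " + next(remedies, "explain"))
    trace.append("step end")
    return trace

# One wrong attempt, a hint, then a correct attempt.
episode = run_step(False, [False, True], ["hint"])
```

Here `student_attempts` is just a list of booleans standing in for whether each attempt was judged correct.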

15

Comparisons of expert to novice human tutors

[Figure: the single-step schematic (Step start, T: Tell, Step end), with regions labeled “Novices” and “Experts”.]

Experts may have a wider variety


17

The Learning Science research question: Increasing tutoring system effectiveness

CAI
– Remediation on answer only

ITS (e.g., Andes)
– Remediation on each step
– Hint sequence, with final “bottom out” hint

Human tutors
– Remediation on each step
– Natural language dialogues
– Many tutorial tactics

A tutoring system with Natural Language for its remediation?

18

The Cognitive Science research question: The more interactivity, the more gain?

[Figure: chart of effect size (axis from 0 to 2.5) for CAI, ITS, NLT, Human; a “?” marks an unknown value.]

19

The Computer Science research question: Deep linguistic techniques vs shallow?


Component | Shallow linguistic | Deep linguistic
Natural language understanding (NLU) | LSA, other bag-of-words | Syntactic grammars, lexicons, semantics…
Dialog management | Finite state networks | Reactive planning
Natural language generation (NLG) | Text templates | Plan-based
Non-routine language | Ignored | Anaphora, negation, …

Evaluate for pedagogical effectiveness as well as the usual measures of speed, accuracy, generality, etc.

20

A task domain where deep understanding may add value

Qualitative physics
– “A massive truck and a light car have a head-on collision. Which suffers the greater impact force? Why?”

Linguistic relationships matter
– {car, truck, exerts, more, force}

Detecting deep misconceptions
– E.g., Bigger things exert more force.

Unfortunately, these misconceptions are notoriously resistant to instruction
– Try giving 10 hours of instruction

21

Outline

Introduction
– Focus on multi-step problem solving
– What is human tutoring?
– What is an ITS? CAI?
– Research questions



Next

22

Student’s screen for Why2-Atlas

[Screenshot with labeled regions: Problem; Dialogue history; Student’s essay; Student’s turn in the dialogue]

23

Schematic of Why2-Atlas tutorial dialogue

T: <displays problem>
S: <enters essay>
T: <analyzes essay to identify missing & incorrect steps; picks one; starts a script for remediation of the step>
– T: When the pumpkin is in the air, what forces act on it?
– <many turns>
– T: Please change your essay
– S: <edits essay>
T: <analyzes essay to identify missing & incorrect steps; picks one; starts a script>
– <many turns>
– T: Please change your essay
– S: <edits essay>
T: <analyzes essay; decides all steps are correct>
– T: Excellent! Here’s my version: <presents “ideal” essay>
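The outer loop of this schematic can be sketched in a few lines of Python. This is a toy stand-in under loud assumptions: each essay draft is represented as a set of step names already recognized in it (the real essay analysis is the hard part and is not shown), incorrect steps are ignored, and the tutor simply picks the first missing step. All names are hypothetical.

```python
def tutor_session(essay_drafts, expected_steps):
    """Loop over successive drafts until no expected step is missing."""
    log = []
    for draft in essay_drafts:
        missing = [step for step in expected_steps if step not in draft]
        if not missing:
            log.append("T: Excellent! Here's my version: <ideal essay>")
            break
        target = missing[0]  # assumption: simply pick the first missing step
        log.append(f"T: <runs remediation script for '{target}'>")
        log.append("T: Please change your essay")
    return log

expected = ["same initial horizontal velocity", "no horizontal force after release"]
drafts = [
    {"same initial horizontal velocity"},
    {"same initial horizontal velocity", "no horizontal force after release"},
]
session_log = tutor_session(drafts, expected)
```

The first draft triggers one remediation script and a request to revise; the second draft contains every expected step, so the session ends with the ideal essay.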

24

Real example (slide 1 of 3)

Question: Suppose a man is running in a straight line at constant speed. He throws a pumpkin straight up. Where will it land? Explain.

Student’s Essay: The pumpkin would land in front of me because of the horizontal speed that is exerting on the pumpkin.

25

Example (slide 2 of 3)

Tutor: I see some opportunities for growth. Let's chat. So let's discuss a point that seems to be missing in your essay. Can you tell me what forces are acting on the pumpkin after it is released?

Student: Gravity force

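[Annotation on the tutor’s turn: picks a missing step (forces) & starts a KCD (knowledge construction dialogue) script]

[Annotation on the student’s turn: correct, so goes to next state of the KCD]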

26

Example (slide 3 of 3)

Tutor: Good. Since we know that gravity is the only force acting on the pumpkin, let me ask you this. As the gravitational force pulls the pumpkin down, does it also pull the pumpkin horizontally a little bit?

Student: Yes. [Wrong, so subdialogue]

– Tutor: Let's look at this more closely. In which direction does the earth's gravitational force act?

– Student: Vertical [Incomplete, so subdialogue]

» Tutor: Let's consider this some more. Is that vertical upwards or vertical downwards?

» Student: Vertical downwards

27

Schematic of a single step

Step start

T: Tell

S: Incorrect → KCD (T: … S: … T: … S: … T: …)

Step end


29

Why2-Atlas main modules

Modules:
– Student enters/edits the essay
– Sentence understander
– Essay understander
– Discourse manager decides what to do w.r.t. history
– KCD script interpreter
– RealPro NLG

Other diagram labels: Student; Words; FOPL propositions; Missing / bad steps; Script for remedying missing/bad step; Ideal essay; Clarification; Done

30

Modules evaluated (in yellow)



[Figure: the Why2-Atlas module diagram repeated, with the evaluated modules highlighted in yellow.]

31

Evaluate for accuracy (w.r.t. human judges) and speed



[Figure: the Why2-Atlas module diagram repeated.]

Deep NLU: Carmel
– LCFlex parser
– Comlex lexicon
– Semantic authoring tool

Shallow NLU: Naïve Bayes; LSA

Hybrids: CarmelTC; Rappel

Result:
– Similar accuracy
– Complementary errors
– Best to use all 3

32

Evaluate for utility as tool



[Figure: the Why2-Atlas module diagram repeated.]

– Re-implemented as TuTalk
– GUI authoring system
– XML authoring system
– Handy features (e.g., +/- feedback) for ITS
– Currently being used by 4 projects

33

Evaluate for accuracy & speed



[Figure: the Why2-Atlas module diagram repeated.]

Next few slides

34

Essay analysis: You probably found all 4 incorrect steps. Can the essay analyzer?

Question: Suppose you are running in a straight line at constant speed. You throw a pumpkin straight up. Where will it land? Explain why.

Student: Once the pumpkin leaves my hand, the horizontal force that I am exerting on it no longer exists, only a vertical force (caused by my throwing it). As it reaches it’s maximum height, gravity (exerted vertically downward) will cause the pumpkin to fall. Since no horizontal force acted on the pumpkin from the time it left my hand, it will fall at the same place where it left my hands.

35

Research problem, more precisely, is…

Given:
– Student’s sentences: {s1&s2, s3&s4&s5, …}
– Set of correct steps: {c1&c2, c3&c4&c5, …}
– Set of incorrect steps: {i1&i2, i3, i4&i5, …}

Determine: Which correct and incorrect steps match the student’s sentences:
– Directly (graph matching)
– Indirectly, using domain knowledge
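To make the "direct" half of the problem concrete, here is a deliberately simplified Python sketch. It assumes propositions are ground strings and treats a step as matched when all of its propositions appear in the student's input; the slides describe graph matching over propositions with variables, which this subset test does not attempt. The proposition strings and step names are invented for illustration.

```python
def directly_matched_steps(student_props, steps):
    """Return the names of steps whose propositions all appear in the student's input."""
    return [name for name, props in steps.items() if props <= student_props]

# Hypothetical ground propositions standing in for parsed sentences.
student = {"velocity(pumpkin, decreasing)", "lands(pumpkin, behind)"}
steps = {
    "correct: pumpkin lands in hand": {"lands(pumpkin, hand)"},
    "incorrect: pumpkin slows down": {"velocity(pumpkin, decreasing)"},
}
matched = directly_matched_steps(student, steps)
```

Indirect matching, where a step is only implied by what the student wrote, needs domain inference and is outside this sketch.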

36

Why do we need indirect matching?

The student said (incorrectly):
– “The pumpkin slows down, so it lands behind me.”

Correct steps
– Yada
– Yada
– Yada

Incorrect steps
– Yada
– When there is no force to propel an object along, it slows down
– Air friction matters
– Yada

Essay analyzer should output both derivations, with estimates of their probabilities

37

First method: Abduction using Tacitus-Lite+

Backchaining theorem prover (like Prolog)
– Student’s utterance: the goal to be proved
– Problem statement: givens
– Proofs of earlier student utterances: more givens

Accepts goals without proof (at a cost)
– Because not everything can be anticipated
– Searches for lowest cost proof

Checks consistency as it goes
– Don’t try to prove ~p when the proof already has p.
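The idea of a cost-based abductive proof can be sketched with a tiny propositional backchainer. This is not Tacitus-Lite+ and does not reproduce its rule language: propositions are plain strings, the rules and all costs are invented, and the consistency check from the last bullet is omitted. A goal can be taken from the givens for free, assumed at a price, or derived through a rule whose premises are proved recursively.

```python
def prove(goal, rules, givens, assume_cost, depth=5):
    """Return the lowest cost of proving goal; unproved goals may be assumed at a cost."""
    if goal in givens:
        return 0.0
    best = assume_cost.get(goal, float("inf"))  # accept the goal without proof
    if depth > 0:
        for head, body, rule_cost in rules:
            if head == goal:
                cost = rule_cost + sum(
                    prove(premise, rules, givens, assume_cost, depth - 1)
                    for premise in body
                )
                best = min(best, cost)
    return best

# Hypothetical rules: (conclusion, premises, cost of using the rule).
RULES = [
    ("pumpkin slows down", ["net horizontal force is negative"], 1.0),
    ("net horizontal force is negative",
     ["air friction is negative", "man's horizontal force is zero"], 1.0),
    # Incorrect rule (zero net force implies slowing), given a higher invented cost.
    ("pumpkin slows down", ["net horizontal force is zero"], 4.0),
    ("net horizontal force is zero",
     ["air friction is zero", "man's horizontal force is zero"], 1.0),
]
GIVENS = {"man's horizontal force is zero"}
ASSUME_COST = {"air friction is negative": 2.0, "air friction is zero": 1.0}

cost = prove("pumpkin slows down", RULES, GIVENS, ASSUME_COST)
```

With these made-up numbers the air-friction derivation costs 4.0 and the incorrect-rule derivation costs 6.0, so the prover reports 4.0. A fuller version would return the competing derivations themselves rather than only the minimum cost.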

38

Derivation 1 (of 2) for “The pumpkin slows down”

[Derivation diagram, from the student’s statement down to the givens]

The velocity of the pumpkin is decreasing (student said this)
← Imprecision (an inference rule)
The horizontal component of the velocity of the pumpkin is decreasing
← (The net force causes the velocity, so) zero net force implies velocity decreases (incorrect inference rule)
The horizontal component of the net force on the pumpkin is zero
← Net force is sum of forces (a correct inference rule)
The horizontal component of the air friction force on the pumpkin is zero (given)
The horizontal component of the man’s force on the pumpkin is zero (given)

39

Derivation 2 (of 2) for “The pumpkin slows down”

[Derivation diagram, from the student’s statement down to the givens]

The velocity of the pumpkin is decreasing (student said this)
← Imprecision
The horizontal component of the velocity of the pumpkin is decreasing
← Kinematics
The horizontal component of the acceleration of the pumpkin is negative
← Newton’s second law
The horizontal component of the net force on the pumpkin is negative
← Net force is sum of forces
The horizontal component of the air friction force on the pumpkin is negative (false assumption)
The horizontal component of the man’s force on the pumpkin is zero (given)

40

Results of using Tacitus-Lite+

Acceptable accuracy, but far too slow

Cost may not be a good substitute for probability when there are multiple competing explanations

41

Second method: Precompute the time-consuming reasoning

Precomputations
– The deductive closure of the problem statement givens
– Save as directed graph
– Label subsets of nodes that represent correct and incorrect steps
– Convert to Bayesian network & train

To analyze a student’s utterance
– Clamp directly matched nodes as “evidence”
– Run Bayesian network
– Read out most probable steps
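The "clamp, run, read out" sequence can be illustrated with a very small network solved by brute-force enumeration. Everything specific here is an assumption made for the sketch: the two candidate steps, their priors, the noisy-OR link to a single evidence node, and the link strengths. The network described in the slides takes its topology from the deductive closure and its probabilities from training, neither of which is reproduced.

```python
from itertools import product

def posteriors(priors, link_strength, leak=0.01):
    """Posterior P(step | evidence node is true) by enumeration over a noisy-OR network."""
    names = list(priors)
    joint_true = {name: 0.0 for name in names}
    total = 0.0
    for states in product([False, True], repeat=len(names)):
        p = 1.0
        p_evidence_off = 1.0 - leak
        for name, on in zip(names, states):
            p *= priors[name] if on else 1.0 - priors[name]
            if on:
                p_evidence_off *= 1.0 - link_strength[name]
        p *= 1.0 - p_evidence_off  # probability that the clamped evidence node is true
        total += p
        for name, on in zip(names, states):
            if on:
                joint_true[name] += p
    return {name: joint_true[name] / total for name in names}

# Hypothetical steps that could explain "the pumpkin slows down".
priors = {"air friction matters": 0.5, "no force implies slowing": 0.5}
strength = {"air friction matters": 0.9, "no force implies slowing": 0.1}
post = posteriors(priors, strength)
most_probable = max(post, key=post.get)
```

Enumeration is exponential in the number of step nodes, so it only serves to show what clamping and reading out posteriors mean on a toy example.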

42

Results: Fast enough. Better accuracy, but not by much.

43

Summary



[Figure: the Why2-Atlas module diagram repeated.]

Methods
– Abductive theorem prover
– Bayesian deductive closure

Results
– Similar accuracy
– Bayesian deductive closure faster than abductive theorem prover

44

Outline



– Of individual techniques
– Of the whole system

Next

45

Evaluation framework

Pretest (1 hr)

Training (5 to 10 hrs): For each question, do:
1. Student enters initial essay
2. Tutor analyses it for missing & incorrect steps, picks one, and discusses it with student
3. Student enters revised essay
4. Tutor either congratulates student & presents ideal essay, or goes to step 2

Posttest (1 hr)

Only step remediation varies with the condition

46

Conditions

Expert human tutors
– Text-based communication
– Spoken communication

Computer tutors
– Why2-Atlas (VanLehn et al.)
– ITSPOKE (Litman et al.)
– Why2-AutoTutor (Graesser et al.)

Control conditions
– Canned text remediation
– Textbook

47

[Screenshot with labeled regions: Talking head (gestures, synthesized speech); Problem; Dialog history; Student types here]

48

Human tutors

Step start

T: Tell

Step end



49

Why2-Atlas

Step start

T: Tell

S: Incorrect → Knowledge construction dialogue

Step end

50

Why2-AutoTutor

Step start

T: Tell

S: Incorrect → Hint, or prompt, or assert

Step end

51

Canned-text remediation

Step start

T: Tell

S: Incorrect → <text>

Step end

52

Results from 7 experiments

Why2-Atlas = Why2-AutoTutor
– Trend for >, but not significant
– Why2-Atlas may need more development

Why2 > Textbook
– In Textbook condition, students do not write essays

Why2 = Human tutoring !!!

Human tutoring = Canned text remediation
– Exception: If pre-physics students get instruction designed for post-physics students, then Human tutoring > Canned text remediation

53

Impact & significance of the results

Why2-Atlas = Why2-AutoTutor
– Common in AI that complex techniques are only slightly better than simple ones, at least initially.

Why2 > Textbook
– Common in Learning Sciences that active > passive

Why2 = Human tutoring = Canned text remediation
– Highly counter-intuitive to Learning and Cognitive scientists (including us)

54

Hypothesis 1: Exactly how tutors remedy a step doesn’t matter much

Step start

T: Tell

S: Incorrect → [remediation box: What’s in here doesn’t matter much]

Step end

55

Other studies where type of step remediation had little impact

Human tutors

1. Human tutoring = human tutoring with only content-free prompting for step remediation (Chi et al., 2001)

2. Human tutoring = solving a problem in pairs with a video solution available (Chi et al., 2007)

3. Human tutoring = canned text during post-practice remediation (Katz et al., 2003)

4. Human tutoring = an ITS (Reif & Scott, 1999)

5. Micro-analyses of human tutoring (VanLehn et al., 2003)

6. Socratic human tutoring = didactic human tutoring (Rosé et al., 2001a; Johnson & Johnson, 1992)

Natural language tutoring systems

1. Circsim (canned text) = Circsim Tutor (Evens & Michael, 2007)

2. Andes-Atlas = Andes with canned text (Rosé et al, 2001b)

3. Cognitive geometry tutor (Aleven et al., 2004)

56

Hypothesis 2: Cannot eliminate the step remediation loop

Step start

T: Tell

S: Incorrect → Text

Step end

[Callout: Must avoid this]

57

Studies consistent with harmfulness of just telling & explaining

Human tutoring
– Human tutoring > textbook alone (Azevedo; Evens; VanLehn)
– Human tutoring > lecture/demo (Wood et al. 1978; Swanton,

Natural language tutoring systems
– NLT > textbook alone (Graesser; Evens; Lane; VanLehn)
– NLT > lecture/demo (Craig)

58

Conclusions

Learning Science: Can computer tutors be as effective as human tutors?
– Yes, as long as students attempt steps with feedback & hints on each

Computer Science: When is deep linguistic technology more effective than shallow?
– Several positive results at module level
– At whole system level, still tied, but encouraging

Cognitive Science: The higher the interactivity, the higher the learning gains?
– No. See next slide

59

The interactivity plateau

[Figure: chart of effect size (axis from 0 to 2.5) for CAI, ITS, NLT, Human.]

Claim: Perhaps Bloom’s 2 experiments were confounded

60

How can we achieve super-human results?

[Figure: chart of effect size (axis from 0 to 2.5) for CAI, ITS, NLT, Human, New; a “?” marks an unknown value.]

61

Future work (slide 1 of 3): Increasing engagement

NeuroCog engagement meter (DARPA)
– Can we reliably measure engagement with fMRI?
– Can we train students to maintain engagement with it?

Interesting problems (PSLC)
– Ill-defined & design problems
– Recommender system

ITS as a member of a social network (DFK, PSLC)
– Pairs > solos for engagement, but correctness?
– Can we add an ITS without destroying engagement?

62

Future work (slide 2 of 3): Faster learning; Faster authoring

Author & student interface ≈ PowerPoint
– Fast to learn & use
» e.g., type “Let V1, V2 be the initial, final velocities”
– Freedom; domain independence

As students master a step:
– Tutor does it, or
– It gets folded into a larger step

TruthBench
– Knowledge acquisition for truth checking vs.
– Knowledge acquisition for solving an (ill-defined) problem
– Examples instead of hints

63

Future work (slide 3 of 3): Teach what an AI learner needs

Explicit teaching of backwards chaining (PSLC)
– Accelerates learning; transfers (Chi & VanLehn, 2007)

Explicit teaching of confluences
– KE = ½ m v²: If mass and kinetic energy are constant, then velocity must be constant

Explicit teaching of abstraction planning
– KE = ½ m v²: If need a velocity, then find a kinetic energy

Dream system: A model human learner
– For testing curriculum designs
– Getting the step sizes right

64

Thanks!

See http://www.pitt.edu/~vanlehn for publications

65

When to use deep vs. shallow?


Task | Shallow | Deep | Recommendation
Sentence understanding | LSA, Rainbow, Rappel | Carmel: parser, semantics… | Use both
Essay/Discourse understanding | LSA | Abduction, Bnets | Use deep
Dialog management | Finite state networks | Reactive planning | Use locally smart FSA
Natural language generation | Text | Plan-based | Use equivalent texts

66

“It ain’t so much the things we don’t know that get us into trouble. It’s the things we know that just ain’t so.”

-- Josh Billings (Henry Wheeler Shaw)

67

A deep sentence understander: Carmel

LCFlex parser, a robust parser that uses skipping, insertion & flexible unification (Rosé & Lavie 2001)

Comlex, with 40,000 lexemes (Grishman et al., 1994)

A broad-coverage, domain-independent, English syntactic grammar

CarmelTools for semi-automatically creating semantic functions (Rosé 2000, Rosé et al., AIED 2003).

68

Shallow sentence understanders

Words only
– Rainbow: A Naïve Bayes text classifier
» Given new bag of words, calculates most probable domain propositions using Bayes rule
– LSA, several others
» Worse, not currently used

Words + syntactic features
– Rappel (Jordan)
» Minipar produces dependency relations between words
» Ripper builds one classifier per predicate type & per argument of the predicate type
– CarmelTC (Rosé et al., HLT/NAACL 2003)
» Worse, not currently used
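The bag-of-words approach in the first bullet can be sketched as a small Naïve Bayes classifier. This is a generic textbook version with add-one smoothing, not the Rainbow toolkit and not the project's code; the training sentences and proposition labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (sentence, proposition_label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for sentence, label in examples:
        words = sentence.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(sentence, model):
    """Return the label with the highest posterior under the bag-of-words assumption."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior plus summed log likelihoods, with add-one smoothing.
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in sentence.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-labeled sentences.
model = train([
    ("gravity pulls the pumpkin down", "force(gravity, down)"),
    ("the only force is gravity", "force(gravity, down)"),
    ("the pumpkin keeps its horizontal velocity", "velocity(horizontal, constant)"),
    ("horizontal velocity stays the same", "velocity(horizontal, constant)"),
])
label = classify("gravity is the force", model)
```

Because word order is discarded, a classifier like this cannot tell "the truck exerts more force on the car" from the reverse, which is the kind of case where the syntactic approaches listed above are meant to help.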

69

Results

Speed: All were fast enough

Accuracy: All were too low
– Deep ≈ Words ≈ Words + syntactic features

Best (Jordan et al, ITS04)
– Run all 3
– Use heuristics to choose 1 of 3 outputs
» E.g., If “velocity” or “speed” appear in the sentence, then “velocity” should appear in the output propositions somewhere.
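The example heuristic can be written down directly. The sketch below assumes each understander's output is a list of proposition strings and that the fallback, when the heuristic does not apply, is simply the first understander's output; both choices and the function name are hypothetical.

```python
def choose_output(sentence, candidate_outputs):
    """Pick one of several understanders' outputs using a keyword consistency check."""
    words = sentence.lower().split()
    if "velocity" in words or "speed" in words:
        for output in candidate_outputs:
            if any("velocity" in prop for prop in output):
                return output
    return candidate_outputs[0]  # assumption: fall back to the first understander

# Hypothetical outputs from three understanders for the same sentence.
outputs = [["force(gravity)"], ["velocity(pumpkin, constant)"], []]
chosen = choose_output("the speed of the pumpkin is constant", outputs)
```

Splitting on whitespace ignores punctuation, so a real version would need proper tokenization before the keyword test.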

70

Direct matching via largest common subgraph (Shearer et al., 2001)

Expectation: “The speed of the pumpkin is the same as the speed of man.”
Graph: speed(x1, pumpkin, ...), speed(x2, man, ...), compare(x1, x2, same)

Student said: “…allow us to compare it to the speed of the pumpkin.”
Graph: speed(x5, pumpkin, ...), compare(x5, x6, ...)

71

Generation of the ATMS for a problem starts with givens (node = atomic prop.; red = incorrect)

[Figure labels: Correct givens; Buggy given]

72

Apply rules forward (RA = rule application)

[Figure labels: RA1, RA2; Correct givens; Buggy given]

73

Stop when no more rule applications

[Figure labels: RA1, RA2, RA3, RA4]
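The forward-chaining procedure on these slides (start from the givens, apply rules, stop when nothing new is produced) can be sketched as follows. The sketch assumes ground propositions represented as strings and records only the first rule application that derives each conclusion; an ATMS keeps every justification, and the slides note that real propositions may contain variables. The rule format is hypothetical.

```python
def deductive_closure(givens, rules):
    """Apply rules forward until no rule application adds a new proposition."""
    known = set(givens)
    applications = []  # (premises, conclusion) pairs, i.e. edges of the graph
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                applications.append((premises, conclusion))
                changed = True
    return known, applications

# Toy rules: each is (premises, conclusion).
rules = [(("a", "b"), "c"), (("c",), "d"), (("x",), "y")]
closure, rule_applications = deductive_closure({"a", "b"}, rules)
```

Including buggy givens and incorrect rules in the same closure is what lets misconceptions appear as node subsets alongside the correct steps.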


Propositions are not always ground! Variables (colored links) shared across nodes

Specific subsets correspond to expectations / misconceptions

[Figure labels: Expectation (a key step in the explanation); Expectation; Misconception]

At runtime, find all node subsets that unify with the student input

[Figure label: Input]

In this case, two subsets (happy faces) unify with student’s input

[Figure label: Input]

78

Output the expectations/misconceptions that are directly matched

[Figure labels: RA1, RA2, RA3, RA4; Input; Direct match; Close, but not a direct match]

79

Bnet nodes represent sets of nodes for expectations/misconceptions

[Figure labels: RA1, RA2, RA3, RA4; Input]

80

Clamp student’s input nodes (happy faces) and update net

[Figure labels: RA1, RA2, RA3, RA4; Moderate posterior probability; High posterior probability; Low posterior probability]

81

Training and evaluation

Topology of network is given by deductive closure, etc.

Only learn the conditional probabilities

293 sentences coded by human

Expectation/Maximization

10-fold cross validation
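For readers unfamiliar with the evaluation protocol, the sketch below shows only the fold-splitting part of 10-fold cross validation for 293 items. How the sentences were actually assigned to folds is not stated in the slides; the interleaved assignment and the function name are assumptions, and the Expectation/Maximization training step is not shown.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]  # every k-th item, starting at this fold's offset
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

folds = list(k_fold_indices(293))
fold_sizes = sorted(len(test) for _, test in folds)
```

Each sentence is held out exactly once, so the conditional probabilities are always evaluated on sentences they were not trained on.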
