The Essay Scoring Tool (TEST) for Hindi

The Essay Scoring Tool - TEST

B.E. Project presentation

Submitted by:
Abhinav Gupta 201/CO/03
Danish Contractor 233/CO/03
Gaurav Singh 238/CO/03
Himanshu Mehrotra 241/CO/03

Under the guidance of:
Dr. Shampa Chakraverty
COE Dept., NSIT

Date of presentation: 1st June 2007
NSIT, Delhi

PRIOR WORK

Overview of the Software

[Block diagram of TEST. INPUTS: Student Essay, Training Essays, Spelling & Grammatical Checks, Corpus Facts. OUTPUTS: Feedback to student, Score.]

Scoring Parameters

[Diagram: the Scoring Engine draws on four parameters: Quality of Content, Global Coherence, Factual Accuracy and Local Coherence.]

SINGULAR VALUES (K) RETAINED

[Figure: Variation of correlation of TEST scores with human rater, according to k. X-axis: k (7, 10, 28, 42, 57, 74, 93, 114, 137, 162, 190, 220). Y-axis: correlation between human rater's and TEST's scores (0 to 0.9).]

Study Undertaken

Set of essays given to Human Graders
Essays rated as:
• Good Essays
• Bad Essays

LOCAL COHERENCE – Good Essays

[Figure: average inter-sentence similarity (Y-axis, 0 to 0.242) by Essay No. (X-axis: 5, 7, 9, 10, 14, 15).]

Average variance from gold standard - 0.0219

LOCAL COHERENCE – Other Essays

[Figure: average inter-sentence similarity (Y-axis, 0 to 0.363) by Essay No. (X-axis: 3, 4, 6, 11, 12, 17).]

Average variance from gold standard - 0.212

LOCAL COHERENCE - Combined Essays

[Figure: average inter-sentence similarity (Y-axis, 0 to 0.484) against No. of observations (X-axis: 1 to 6). Legend entries: Series1, Series4.]

Series 1 : Good essays
Series 2 : Other Essays

LOCAL COHERENCE - MARKING SCHEME

[Figure: average inter-sentence similarity (Y-axis, 0 to 0.363) by Essay No. (X-axis: 1 to 17).]

LOCAL COHERENCE - MARKS

[Figure: Marks (comparative) (Y-axis, 0 to 100) by No. of essay (X-axis: 1 to 17).]

CONTENTS-ESSAYS TO BE MARKED

[Figure: average similarity with gold std. (Y-axis, 0 to 0.8) by Essay nos. (X-axis: 18, 19, 33, 21, 26, 24, 20, 27, 28, 32, 22, 25, 29, 31, 34, 30, 23).]

CONTENT – Good Essays

[Figure: average similarity with gold standard (Y-axis, 0 to 0.8) by Essay no. (X-axis: 20, 25, 29, 31, 34, 30, 23).]

CONTENT – Other Essays

[Figure: average similarity with gold standard (Y-axis, 0 to 0.6) by Essay no. (X-axis: 18, 19, 33, 21, 26, 24, 28).]

CONTENT - COMBINED

[Figure: average similarity with gold standard (Y-axis, 0 to 0.8) against No. of observations (X-axis: 1 to 7). Legend entries: Series1, Series5.]

SERIES 1 : GOOD ESSAYS
SERIES 5 : OTHER ESSAYS

CONTENT-NORMALIZED MARKS

[Figure: Marks (Y-axis, 0 to 100) by Essay No. (X-axis: 18, 19, 33, 21, 26, 24, 20, 27, 28, 32, 22, 25, 29, 31, 34, 30, 23).]

GLOBAL COHERENCE

Essays are classified as having a:
• Good Structure
• Average Structure
• Bad Structure

[Figure: GOOD STRUCTURED ESSAY. Correlation coefficient (Y-axis, 0 to 0.8) by Theme No. (X-axis: 1 to 5). Data labels as extracted: 0.502, 0.523, 0.512, 0.729, 0.432.]

[Figure: AVERAGELY STRUCTURED ESSAY. Correlation coefficient (Y-axis, 0 to 0.6) by Theme No. (X-axis: 1 to 5). Data labels as extracted: 0.214, 0.523, 0.398, 0.305, 0.412.]

[Figure: BADLY STRUCTURED ESSAY. Correlation coefficient (Y-axis, 0 to 0.6) by Theme No. (X-axis: 1 to 5). Data labels as extracted: 0.231, 0.342, 0.109, 0.285, 0.198.]

GLOBAL COHERENCE MARKS

[Figure: Marks (Y-axis, 0 to 90) by Type of Essay (X-axis: Bad structure, Average Structure, Good structure).]

Fact Evaluation Module

[Block diagram of the TEST Fact Evaluation Module. Inputs: Topic Specific Keywords, List of Essays, Correct Facts List, Incorrect Facts List. Outputs: Individual Essay Reports & Scores, N X 1 Score Matrix (for internal use by TEST).]

Fact Evaluation

No. of facts matched: 4
No. of Incorrect Facts matched: 1
SCORE: 0.8

Breakup of Essay Scores

[Figure: Marks (Y-axis, 0 to 100) by Essay No. (X-axis: 2, 11, 15). Legend: Global Coherence, Content, Local Coherence, Factual Accuracy, Overall Score.]

Human scores v/s TEST scores

[Figure: Marks (Y-axis, 0 to 10) by Essay No. (X-axis: 0 to 18). Legend: TEST tool, Human Rater.]

Performance of TEST

[Figure: correlation coefficient (Y-axis, 0.6 to 0.9) for two pairings: human-human and human-TEST.]

• Adjacent agreement with human graders around 77%
• Agreement among human graders around 73%
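The two measures reported here, correlation with a human rater and adjacent agreement, can be illustrated with a minimal sketch. The marks below are hypothetical and are not the project's data, and the tolerance of one mark for "adjacent" agreement is an assumption (it is the common convention in essay-scoring work, but the slides do not define it).

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two lists of marks."""
    return float(np.corrcoef(a, b)[0, 1])

def adjacent_agreement(a, b, tolerance=1):
    """Fraction of essays whose two marks differ by at most `tolerance`.

    The default tolerance of 1 mark is an assumption, not a value from the slides.
    """
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(np.abs(a - b) <= tolerance))

# Hypothetical marks out of 10 for six essays.
human = [7, 5, 8, 4, 6, 9]
tool = [6, 5, 9, 6, 6, 8]

r = pearson(human, tool)
agreement = adjacent_agreement(human, tool)  # 5 of the 6 pairs are within one mark
```

The same two functions can be applied to a pair of human raters to obtain the human-human baseline that the human-TEST figures are compared against.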

TIME COMPLEXITY

PRE-PROCESSING FOR GLOBAL COHERENCE
• O(N^3), where N = no. of sentences in corpus.
• O(t*n^2), where t = no. of themes, n = no. of sentences in the essay under evaluation.

FACT MODULE – O(k^4)
• k = no. of keywords

COMPARISON OF TEST WITH OTHER AES TOOLS

| | PEG | IEA | E-Rater | TEST |
|---|---|---|---|---|
| Evaluation parameters | Essay length, complexity of sentence and word length | Similarity with gold standard | Lexical complexity, vocabulary, essay organization and many more | Similarity with gold standard, essay organization, fact accuracy |
| Feedback | No | Yes | Yes | Yes |
| Essay content checking | No | Yes | Yes | Yes |
| Fact checking | No | No | Yes | Yes |
| Training phase | Time consuming & inexpensive | Time consuming & inexpensive | Time consuming & expensive | Time consuming & inexpensive |
| Language of essays | English | English | English | Hindi |
| Performance | Correlation of 0.87 with human raters | Correlation of 0.85 with human raters | Correlation of 0.87 with human raters | Correlation of 0.7652 with human raters |

FUTURE WORK

Include OCR (Optical Character Recognition).

Increasing the size and variety of the corpus.

Incorporating modules for spelling and grammar evaluation.

The use of Random Indexing (RI) techniques can reduce the size of the matrix which is input for the SVD procedure and thus can reduce time-complexity.

LIMITATIONS

• Absence of grammatical checking
• Absence of a spell-check
• The tool is unable to check individualistic styles of writing
• Domain-specific knowledge required before checking an essay

CONTRIBUTION

• First AES tool for Hindi
• Local Coherence at granularity of sentences
• Good correlation with human raters
• SVD done only once for Local and Global Coherence

References

1. An Introduction to Latent Semantic Analysis by Thomas K. Landauer, University of Colorado at Boulder, Peter W. Foltz, Department of Psychology, New Mexico State University, Darrell Laham, Department of Psychology, University of Colorado at Boulder, Discourse Processes, 1998

2. The Measurement of Textual Coherence with Latent Semantic Analysis by Peter W. Foltz, New Mexico State University, Walter Kintsch and Thomas K. Landauer, University of Colorado, Discourse Processes, 1998

3. Indexing by Latent Semantic Analysis by Scott Deerwester, Graduate Library School, University of Chicago, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Bell Communications Research, Richard Harshman, University of Western Ontario, Journal of the American Society for Information Science, 1990

4. On the notions of theme and topic in psychological process models of text comprehension by Walter Kintsch, Department of Psychology, University of Colorado, Interdisciplinary Studies, 2002

5. How Well Can Passage Meaning be Derived without Using Word Order? A Comparison of Latent Semantic Analysis and Humans by Thomas K. Landauer, Darrell Laham, Bob Rehder, and M. E. Schreiner, Department of Psychology & Institute of Cognitive Science, University of Colorado, Boulder, corpus, 1996

6. A Critiquing System to Support English Composition through the Use of Latent Semantic Analysis by Kelvin C. Wong, Anders I. Mørch, William K. Cheung, Mason H. Lam and Janti P. Tang, Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Kowloon, Hong Kong, 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service (EEE'05), pp. 576-581

7. Finding the WRITE stuff: Automatic identification of discourse structure in student essays by Jill Burstein, Daniel Marcu, and Kevin Knight. 2003b. IEEE Transactions on Intelligent Systems: Special Issue on Advances in Natural Language Processing, 181:32–39.

WE WOULD LIKE TO THANK

• Dr. Shampa Chakraverty, without whose constant guidance and support we would have given up long ago.
• Dr. Niladri Chatterjee, Dept. of Mathematics, IIT Delhi, for sharing his experience in the NLP field.
• Ms. Yasmin Contractor, Principal, Summerfields School, Gurgaon, for providing us with the student essays.
• Faculty of COE Dept. and fellow students.

Q & A?

Automatic Essay Evaluation Software

B.E. Final Year Project : Final Evaluation

Project Guide: Dr. Shampa Chakraverty

Team:
Abhinav Gupta 201/CO/03
Danish Contractor 233/CO/03
Gaurav Singh 238/CO/03
Himanshu Mehrotra 241/CO/03

Aim of the software

• To score students’ essays on a specific topic.
• Give feedback to the student on deficiencies in his/her essay.

Need for this software

• Teachers these days are overburdened with the evaluation of answer scripts.
• Teachers are unable to give personalized attention to the students’ needs.
• Students feel the need to practice writing essays in a non-test environment.
• Many factors influence the scoring of essays and introduce error.

Overview of the Software

Parameters used for evaluation

• Similarity with the gold standard
• Local coherence of essay
• Global and theme coherence checker and feedback generator
• Fact checking

Latent Semantic Analysis (LSA)

Latent semantic analysis is a statistical technique in natural language processing for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.

LSA derives a high-dimensional semantic space. Words and passages are represented as vectors in the space.

The LSA-measured similarities have been shown to closely mimic human judgments of meaning similarity.

LSA: Steps involved

1. Training corpus of gold standard essay and other articles and essays on the same topic, plus the essay under evaluation
2. Term-document matrix (M)
3. Singular-value decomposition of M gives three matrices: T, S and D (T = term matrix, S = singular-values matrix, D = document matrix)
4. Dimensionality reduction, preserving only the 2 largest dimensions in S, gives S-improved
5. Multiplying T, S-improved and D gives the new term-by-document matrix
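The steps above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `lsa_reduce` and the small count matrix are hypothetical, and `numpy.linalg.svd` returns the document matrix already transposed, so it is named `Dt` here.

```python
import numpy as np

def lsa_reduce(M, k=2):
    """Return the rank-k reconstruction of a term-document matrix M."""
    # M = T @ diag(s) @ Dt, with singular values in s sorted largest first.
    T, s, Dt = np.linalg.svd(M, full_matrices=False)
    s_improved = np.zeros_like(s)
    s_improved[:k] = s[:k]  # keep only the k largest singular values
    return T @ np.diag(s_improved) @ Dt

# Hypothetical 4-term x 3-document count matrix.
M = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

M_new = lsa_reduce(M, k=2)  # new term-by-document matrix
```

The value of k is the tuning parameter the result slides vary (for example k = 10, 42, 114); k = 2 is used here only because the slide's illustration keeps the 2 largest dimensions.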

LSA Example

Titles of Some Technical Memos

• c1: Human machine interface for ABC computer applications

• c2: A survey of user opinion of computer system response time

• c3: The EPS user interface management system

• c4: System and human system engineering testing of EPS

• c5: Relation of user perceived response time to error measurement

• m1: The generation of random, binary, ordered trees

• m2: The intersection graph of paths in trees

• m3: Graph minors IV: Widths of trees and well-quasi-ordering

• m4: Graph minors : A survey

LSA Example : Term by document matrix

LSA Example: After SVD

LSA Example: Results

Similarity between documents:

C1 and C2 = 0.91 (high)
C1 and C3 = 1.00 (very high)
C1 and C5 = 0.85 (high)
C2 and C3 = 0.91 (high)
C1 and M1 = -0.85 (low)
M1 and M2 = 1.00 (very high)
M2 and M3 = 1.00 (very high)
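The memo example can be reproduced in outline as follows. This sketch assumes the usual setup for this example: the index terms are the twelve content words that occur in at least two titles, two dimensions are kept, and document similarity is taken as the Pearson correlation between columns of the reconstructed matrix. The slides do not state which similarity measure was used, so exact values may differ from those listed; the qualitative pattern (high within the c group and within the m group, low across groups) is what the sketch is meant to show.

```python
import re
import numpy as np

titles = {
    "c1": "Human machine interface for ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user perceived response time to error measurement",
    "m1": "The generation of random, binary, ordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}
# Assumed index terms: words appearing in at least two titles.
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]
doc_ids = list(titles)

# Term-by-document count matrix (rows: terms, columns: documents).
M = np.array([[re.findall(r"[a-z]+", titles[d].lower()).count(t)
               for d in doc_ids] for t in terms], dtype=float)

# SVD, then keep only the 2 largest dimensions.
T, s, Dt = np.linalg.svd(M, full_matrices=False)
M2 = (T[:, :2] * s[:2]) @ Dt[:2, :]

# Document-by-document correlations of the reconstructed columns.
corr = np.corrcoef(M2.T)

def similarity(a, b):
    return float(corr[doc_ids.index(a), doc_ids.index(b)])
```

For instance, `similarity("m1", "m2")` can be compared with `similarity("c1", "m1")` to see the within-group versus across-group contrast.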

Local Coherence Estimation

What is Coherence?

Each sentence in an essay is connected to previous sentences. The degree of this connection measures the coherence of the sentence pairs.

Coherence estimation using LSA:

By comparing vectors for two adjoining segments of text in a semantic space, LSA measures degree of semantic relatedness between the segments.
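A minimal sketch of this estimate, assuming each sentence has already been mapped to a vector in the LSA space and that cosine is the similarity measure (the slides report "average inter-sentence similarity" without naming the measure). The function names and the three vectors are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def local_coherence(sentence_vectors):
    """Average similarity between each sentence and the one before it.

    Expects at least two sentence vectors, none of them all-zero.
    """
    pairs = zip(sentence_vectors[:-1], sentence_vectors[1:])
    sims = [cosine(u, v) for u, v in pairs]
    return sum(sims) / len(sims)

# Hypothetical 2-dimensional LSA vectors for three consecutive sentences.
vectors = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [0.0, 1.0]])

score = local_coherence(vectors)  # both adjacent pairs are 45 degrees apart
```

An essay whose adjacent sentences drift sharply in topic yields a lower average, which is the signal the local coherence module turns into marks and feedback.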

Global and theme coherence checker and feedback generator

The global structure of the essay is as follows:

Introduction

Ideas in individual paragraphs

Conclusion

Ideas in an essay are presented in the following way:

1. Main idea

2. Supporting idea

3. Explanation of 1 and 2

Global and theme coherence checker and feedback generator

A set of possible introductions, conclusions and ideas are extracted from gold standard and other training essays.

The similarity of student essay introduction is measured against the set of introductions using LSA. The same is done for the ideas and conclusions.

Using the similarity measures, the presence or absence of ideas, introductions and conclusions can be determined.
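One way to read this procedure is sketched below. It assumes that each section (introduction, idea or conclusion) is represented by an LSA vector, that the best match within the reference set is what counts, and that a fixed threshold decides presence. The function names, the vectors and the threshold of 0.5 are all hypothetical; the slides do not give these details.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_match(section_vec, reference_vecs):
    """Highest similarity between a student section and any reference section."""
    return max(cosine(section_vec, r) for r in reference_vecs)

def is_present(section_vec, reference_vecs, threshold=0.5):
    """Judge a section present if its best match reaches the threshold.

    The threshold is an illustrative cut-off, not a value from the project.
    """
    return best_match(section_vec, reference_vecs) >= threshold

# Hypothetical LSA vectors for introductions taken from the training essays.
reference_intros = [np.array([1.0, 0.2]), np.array([0.8, 0.6])]
student_intro = np.array([0.9, 0.4])
off_topic_intro = np.array([-0.2, 1.0])
```

A section judged absent is a natural trigger for feedback to the student, for example that the essay lacks a recognizable introduction.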

Fact Evaluation

To facilitate this we will have 2 sets of facts, correct facts and incorrect facts, per essay topic.

The following guidelines would be used to evaluate facts:

• A set of "keywords" to be checked at the sentential level in the text.
• Detection of two or more keywords invokes the checking module.
• 2 databases of facts (Correct and Incorrect) contain sets of keywords that form a "fact".
• Each sentence would be assumed to have a maximum of one fact.
• Connectives in sentences to be treated as "end-of-sentence" markers for fact evaluation purposes.
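The guidelines above can be sketched as follows. Everything named here is hypothetical: the keyword list, the two fact sets, the connectives and the English sample text are stand-ins (the tool itself works on Hindi, where tokenization would differ), and exact set matching is a simplification. The score, correct matches divided by all matches, is an assumption that is consistent with the sample report elsewhere in the slides (4 correct, 1 incorrect, score 0.8).

```python
import re

# Hypothetical topic data; a fact is a set of keywords.
KEYWORDS = {"delhi", "mumbai", "capital", "india"}
CORRECT_FACTS = {frozenset({"delhi", "capital", "india"})}
INCORRECT_FACTS = {frozenset({"mumbai", "capital", "india"})}
CONNECTIVES = {"and", "but", "because"}

def segments(text):
    """Yield word lists, splitting at sentence punctuation and at connectives."""
    for sentence in re.split(r"[.!?]", text.lower()):
        current = []
        for word in re.findall(r"[a-z]+", sentence):
            if word in CONNECTIVES:  # connective acts as an end-of-sentence marker
                if current:
                    yield current
                current = []
            else:
                current.append(word)
        if current:
            yield current

def evaluate_facts(text):
    """Return (correct matches, incorrect matches, score or None)."""
    correct = incorrect = 0
    for words in segments(text):
        found = frozenset(w for w in words if w in KEYWORDS)
        if len(found) < 2:
            continue  # fewer than two keywords: checking module not invoked
        if found in CORRECT_FACTS:
            correct += 1
        elif found in INCORRECT_FACTS:
            incorrect += 1
    total = correct + incorrect
    return correct, incorrect, (correct / total if total else None)

sample = "Delhi is the capital of India. Mumbai is the capital of India but Delhi is large."
result = evaluate_facts(sample)
```

Here the first sentence matches a correct fact, the clause before "but" matches an incorrect one, and the clause after it contains a single keyword, so the checking step is skipped for it.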

Fact Evaluation

The keywords detected are paired and matched to form sets of "facts" and then checked in the database. Three cases may arise:

• It returns a positive match in both databases.
• It returns a positive match in the correct facts database.
• It returns a positive match in the incorrect facts database.

The time complexity of factual evaluation is around O(m * (log p)^2), where
p = no. of keywords
m = average sentence length

This could be a huge overhead while evaluating essays, as fact evaluation is a very small aspect of the entire process.

The use of SQL (for reading facts) and other database optimizations should reduce the time required during computation.


Local Coherence Module

[Block diagram of the Local Coherence Module. Inputs: the reduced term-document matrix after LSA, and the evaluation essay's column number in the term-document matrix. Outputs: Score on Local Coherence, Feedback to Student.]

Local Coherence Results

[Figure: Variation of Marks according to Local Coherence with different values of 'k' - scoring scheme 1. X-axis: Essay No. (0 to 18). Y-axis: Marks (0 to 100). Legend: k = 114, k = 42, k = 10.]

Content Evaluation Module

[Block diagram of the Essay Content Evaluation Module. Inputs: Set of Domain Specific Golden Standard Essays, Set of Essays to be evaluated. Output: Normalized scores on basis of Content.]

Content Evaluation Results

[Figure: Variation of Marks according to content with different values of 'k' - scoring scheme 1. X-axis: Essay No. (0 to 18). Y-axis: Marks (0 to 100). Legend: k = 114, k = 42, k = 10.]

Content Evaluation Normalized Results

[Figure: Variation of Marks according to content with different values of 'k' - scoring scheme 2. X-axis: Essay No. (0 to 18). Y-axis: Marks (0 to 100). Legend: k = 114, k = 42, k = 10.]

Global Coherence Module

[Block diagram of the Global Coherence Evaluation Module. Inputs: Golden Standard Essays, Evaluation Essay(s). Outputs: Feedback, Score.]

Global Coherence Evaluation: Effect of K

[Figure: Variation of Marks according to Global Coherence with different values of k. X-axis: Essay No. (0 to 18). Y-axis: Marks (0 to 120). Legend: k = 114, k = 42, k = 10.]