Detecting Erroneous Sentences using Automatically Mined Sequential Patterns


Advisor: Hsin-Hsi Chen
Reporter: Chi-Hsin Yu
Date: 2007.12.04

Outline

• Introduction
• Related Work
• Proposed Technique
• Experimental Evaluation
• Conclusions and Future Work

Introduction

• Summary
  • Problem: identifying erroneous/correct sentences
  • Algorithm: classification (SVM, NB)
  • Approach: sequential patterns (data mining)
• Applications
  • Providing feedback for writers of English as a Second Language (ESL)
  • Controlling the quality of parallel bilingual sentences mined from the Web
  • Evaluating MT results

Introduction (cont.)

• Common mistakes made by ESL learners (Yukio et al., 2001; Gui and Yang, 2003)
  • Spelling, verb formation
  • Lexical collocation, tense, agreement, wrong Part-Of-Speech (POS), article usage
  • Sentence structure (grammar structure)
• Example
  • “If Maggie will go to supermarket, she will buy a bag for you.”
  • The pattern: “if...will...will” (would)
  • N-grams consider only continuous sequences of words and become very expensive if N > 3

Related Work

• Category 1: the use of hand-crafted rules
  • Heidorn, 2000; Michaud et al., 2000; Bender et al., 2004
• Difficulties
  • Rules are expensive to write manually
  • It is difficult to produce and maintain a large number of non-conflicting rules that cover a wide range of grammatical errors
  • Learners with different first-language backgrounds and skill levels make different errors
  • Rules are hard to write for some grammatical errors

Related Work (cont.)

• Category 2: statistical approaches
  • Chodorow and Leacock, 2000; Izumi et al., 2003; Brockett et al., 2006; Nagata et al., 2006
• Problems
  • They focus on some pre-defined errors
  • The reported results are not attractive
  • Errors need to be specified and tagged in the training sentences
  • Parallel tagged data is needed

Proposed Technique

• Classification model
  • Using SVM (SVM-light)
• Features
  • Labeled Sequential Patterns (LSP): 1 feature
  • Complementary features
    • Lexical Collocation (LC): 3 features
    • Perplexity from Language Model (PLM): 2 features
    • Syntactic Score (SC): 1 feature
    • Function Word Density (FWD): 5 features

Proposed Technique: LSP (1)

• A labeled sequential pattern (LSP), p, has the form <LHS, c>
  • LHS is a sequence <a_1, ..., a_m>; each a_i is called an “item”
  • c is a class label (correct/incorrect here)
• Sequence database D
  • The collection of labeled sequences (tuples)

Proposed Technique: LSP (2)

• “Contain” relation (subsequence)
  • A sequence s1 = <a_1, ..., a_m> is contained in a sequence s2 = <b_1, ..., b_n> if there exist integers i_1, ..., i_m such that 1 <= i_1 < i_2 < ... < i_m <= n and a_j = b_{i_j} for all j in {1, ..., m}.
  • A = <abcdefgh> has a subsequence B = <bdeg>, so A contains B.
• An LSP p1 is contained by p2 if the sequence p1.LHS is contained by p2.LHS and p1.c = p2.c.
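The “contain” relation can be sketched in a few lines of Python. This is only an illustration of the definition on this slide; the function names `is_subsequence` and `lsp_contained` are made up for the example, and an LSP is represented as a plain `(LHS, label)` pair.

```python
def is_subsequence(s1, s2):
    """Return True if s1 is contained in s2 (same order, gaps allowed)."""
    it = iter(s2)
    # `item in it` consumes the iterator up to the first match, so each
    # item of s1 must be found strictly after the previous match in s2.
    return all(item in it for item in s1)


def lsp_contained(p1, p2):
    """p1 is contained by p2 if p1.LHS is contained by p2.LHS and the labels agree."""
    lhs1, c1 = p1
    lhs2, c2 = p2
    return c1 == c2 and is_subsequence(lhs1, lhs2)


# The slide's example: A = <abcdefgh> contains B = <bdeg>.
print(is_subsequence("bdeg", "abcdefgh"))  # True

# Unlike an n-gram, the pattern may skip over intervening words.
sentence = "if Maggie will go to supermarket , she will buy a bag for you".split()
print(is_subsequence(["if", "will", "will"], sentence))  # True
```

The second call shows why subsequences suit the “if...will...will” example from the introduction: the three items are far apart in the sentence, so no short n-gram would capture them.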

Proposed Technique: LSP (3)

• An LSP p has two associated measures, support and confidence.
• The support of p (the generality of the pattern p)
  • Denoted by sup(p)
  • The percentage of tuples in database D that contain the LSP p
• The confidence of p (the predictive ability of p)
  • Denoted by conf(p)
  • Computed as conf(p) = sup(p) / sup(p.LHS), where sup(p.LHS) is the percentage of tuples in D whose sequence contains p.LHS, regardless of class label

Proposed Technique: LSP (4)

• Example database
  • t1 = (<a, d, e, f>, E)
  • t2 = (<a, f, e, f>, E)
  • t3 = (<d, a, f>, C)
• One example LSP p1 = (<a, e, f>, E)
  • Is contained in t1 and t2
  • sup(p1) = 2/3 = 66.7%, conf(p1) = (2/3)/(2/3) = 100%
• LSP p2 = (<a, f>, E)
  • Is contained in t1 and t2; t3 also contains <a, f> but is labeled C
  • sup(p2) = 2/3 = 66.7%, conf(p2) = (2/3)/(3/3) = 66.7%
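The worked example can be checked with a short Python sketch. It assumes that a tuple contains an LSP only when the class labels also match, and that confidence is the support of the LSP divided by the support of its LHS alone; both readings are inferred from the numbers on this slide. The helper names are hypothetical.

```python
from fractions import Fraction


def is_subsequence(s1, s2):
    """Return True if s1 is contained in s2 (same order, gaps allowed)."""
    it = iter(s2)
    return all(item in it for item in s1)


def support(lhs, label, database):
    """Fraction of tuples whose sequence contains lhs AND whose label matches."""
    hits = sum(1 for seq, c in database if c == label and is_subsequence(lhs, seq))
    return Fraction(hits, len(database))


def lhs_support(lhs, database):
    """Fraction of tuples whose sequence contains lhs, ignoring the label."""
    hits = sum(1 for seq, _ in database if is_subsequence(lhs, seq))
    return Fraction(hits, len(database))


def confidence(lhs, label, database):
    """Assumed definition: sup(p) / sup(p.LHS); 0 if the LHS never occurs."""
    denom = lhs_support(lhs, database)
    return support(lhs, label, database) / denom if denom else Fraction(0)


# The three tuples from the slide.
D = [(list("adef"), "E"), (list("afef"), "E"), (list("daf"), "C")]

print(support(list("aef"), "E", D), confidence(list("aef"), "E", D))  # 2/3 1
print(support(list("af"), "E", D), confidence(list("af"), "E", D))    # 2/3 2/3
```

Exact fractions are used instead of floats so the results can be compared directly with the 2/3 and 3/3 terms in the example.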

Proposed Technique: LSP (5)

• Generating the sequence database
  • A Part-Of-Speech (POS) tagger is applied to tag each training sentence
    • MXPOST (Maximum Entropy Part of Speech Tagger Toolkit) is used for POS tags
  • Function words and time words are kept; other words are replaced by their POS tags
  • Each sentence together with its label becomes a database tuple
    • “In the past, John was kind to his sister” becomes “In the past, NNP was JJ to his NN”
• LSP examples
  • (<a, NNS>, Error), where NNS is a plural noun
  • (<yesterday, is>, Error)
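A minimal sketch of this conversion step, assuming the sentence has already been POS-tagged (no real tagger is called here). The `KEEP_WORDS` set is a small hypothetical stand-in for the function-word and time-word lists, which the slides do not give.

```python
# Hypothetical list of function words and time words to keep as-is.
KEEP_WORDS = {
    "in", "the", "a", "this", "to", "his", "of", "if",
    "was", "is", "will", "would", "past", "yesterday",
}


def to_items(tagged_tokens, keep_words=KEEP_WORDS):
    """Keep function/time words; replace every other word by its POS tag."""
    items = []
    for word, tag in tagged_tokens:
        items.append(word if word.lower() in keep_words else tag)
    return items


# Hand-tagged version of the slide's sentence (Penn Treebank style tags).
tagged = [
    ("In", "IN"), ("the", "DT"), ("past", "NN"), (",", ","),
    ("John", "NNP"), ("was", "VBD"), ("kind", "JJ"),
    ("to", "TO"), ("his", "PRP$"), ("sister", "NN"),
]

# One database tuple: the item sequence plus the sentence's class label.
database_tuple = (to_items(tagged), "Correct")
print(" ".join(database_tuple[0]))  # In the past , NNP was JJ to his NN
```

Punctuation passes through unchanged because its tag equals the token itself.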

Proposed Technique: LSP (6)

• Mining LSPs
  • Adapting the frequent sequence mining algorithm in (Pei et al., 2001)
  • Setting minimum support at 0.1% and minimum confidence at 75%
• Converting LSPs to features
  • The corresponding feature is set to 1 if a sentence includes an LSP
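The two steps on this slide can be illustrated with a brute-force sketch. It is not the algorithm of (Pei et al., 2001): it simply enumerates every subsequence up to a small length, which is only feasible for toy data. The function names and the `max_len` parameter are assumptions made for the example.

```python
from itertools import combinations


def is_subsequence(s1, s2):
    """Return True if s1 is contained in s2 (same order, gaps allowed)."""
    it = iter(s2)
    return all(item in it for item in s1)


def mine_lsps(database, min_sup, min_conf, max_len=3):
    """Naively enumerate candidate LHSs and keep those passing both thresholds."""
    candidates = set()
    for seq, _ in database:
        for k in range(1, max_len + 1):
            # Increasing index tuples yield exactly the length-k subsequences.
            for idx in combinations(range(len(seq)), k):
                candidates.add(tuple(seq[i] for i in idx))
    labels = sorted({c for _, c in database})
    patterns = []
    for lhs in sorted(candidates):
        # Labels of all tuples whose sequence contains this LHS (never empty).
        covering = [c for seq, c in database if is_subsequence(lhs, seq)]
        for label in labels:
            sup = covering.count(label) / len(database)
            conf = covering.count(label) / len(covering)
            if sup >= min_sup and conf >= min_conf:
                patterns.append((lhs, label))
    return patterns


def to_features(sequence, patterns):
    """One binary feature per LSP: 1 if the sequence contains its LHS."""
    return [1 if is_subsequence(lhs, sequence) else 0 for lhs, _ in patterns]


D = [(list("adef"), "E"), (list("afef"), "E"), (list("daf"), "C")]
lsps = mine_lsps(D, min_sup=0.5, min_conf=0.75)
features = to_features(list("adef"), lsps)
```

On the toy database, `(<a, e, f>, E)` survives because its confidence is 100%, while `(<a, f>, E)` is dropped because its confidence of 66.7% is below the 75% threshold.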

Proposed Technique: LSP (7)

• LSPs for erroneous sentences
  • “<this, NNS>” (“this books is stolen.”)
  • “<past, is>” (“in the past, John is kind to his sister.”)
  • “<one, of, NN>” (“it is one of important working language”)
  • “<although, but>” (“although he likes it, but he can’t buy it.”)
  • “<only, if, I, am>” (“only if my teacher has given permission, I am allowed to enter this room.”)
• LSPs for correct sentences
  • “<would, VB>” (“he would buy it.”)
  • “<VBD, yesterday>” (“I bought this book yesterday.”)

Proposed Technique: Other Linguistic Features (1)

• Lexical Collocation (LC)
  • Example: “strong tea” (濃茶), not “powerful tea”
  • Five types of collocations are collected from a general English corpus: verb-object, adjective-noun, verb-adverb, subject-verb, and preposition-object
• Correct LCs
  • Extracted as collocations of high frequency
• Erroneous LC candidates
  • Generated by replacing a word in a correct collocation with its confusion words, obtained from WordNet
  • Checked by experts to see whether a candidate is a true erroneous collocation

Proposed Technique: Other Linguistic Features (2)

• Three LC features are computed for each sentence
  • (1) A function of m, n, and the probabilities p(co_i) (formula not preserved in the slide text), where
    • m is the number of CLs
    • n is the number of collocations in the sentence
    • the probability p(co_i) of each CL co_i is calculated using the method of (Lu and Zhou, 2004)
  • (2) The ratio of the number of unknown collocations (neither correct LCs nor erroneous LCs) to the number of collocations in the sentence
  • (3) The ratio of the number of erroneous LCs to the number of collocations in the sentence
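Features (2) and (3) are plain ratios and can be sketched as follows. The collocation extractor itself is out of scope, so the sketch assumes the sentence's collocations are already available as word pairs; the function name and the tiny lexicons are made up for illustration.

```python
def lc_ratio_features(collocations, correct_lcs, erroneous_lcs):
    """Return (unknown ratio, erroneous ratio) for one sentence's collocations."""
    n = len(collocations)
    if n == 0:
        # Assumption: a sentence without collocations gets 0 for both ratios.
        return 0.0, 0.0
    erroneous = sum(1 for c in collocations if c in erroneous_lcs)
    unknown = sum(
        1 for c in collocations
        if c not in correct_lcs and c not in erroneous_lcs
    )
    return unknown / n, erroneous / n


# Toy lexicons built around the slide's "strong tea" example.
correct = {("strong", "tea")}
erroneous = {("powerful", "tea")}
sentence_collocations = [
    ("powerful", "tea"), ("drink", "tea"), ("strong", "tea"), ("buy", "book"),
]

print(lc_ratio_features(sentence_collocations, correct, erroneous))  # (0.5, 0.25)
```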

Proposed Technique: Other Linguistic Features (3)

• Perplexity from Language Model (PLM)
  • Extracted from a trigram language model
  • Using the SRILM (SRI Language Modeling) Toolkit (Stolcke, 2002)
• Two values are calculated for each sentence
  • Lexicalized trigram perplexity
  • POS trigram perplexity
• Erroneous sentences are expected to have higher perplexity
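To make the perplexity feature concrete, here is a toy trigram model with add-one smoothing. It is a stand-in written for illustration only and does not reproduce SRILM or its smoothing methods; the function names and the two-sentence training set are assumptions. The same code gives a POS trigram perplexity if it is fed tag sequences instead of words.

```python
import math
from collections import Counter


def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts over padded sentences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            ctx[tuple(padded[i - 2:i])] += 1
    return tri, ctx, vocab


def perplexity(tokens, tri, ctx, vocab):
    """Per-token perplexity under an add-one smoothed trigram model."""
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    v = len(vocab) + 1  # one extra slot for unseen words
    log_prob = 0.0
    for i in range(2, len(padded)):
        trigram = tuple(padded[i - 2:i + 1])
        context = tuple(padded[i - 2:i])
        log_prob += math.log((tri[trigram] + 1) / (ctx[context] + v))
    return math.exp(-log_prob / (len(padded) - 2))


training = [["he", "would", "buy", "it"], ["she", "would", "buy", "it"]]
tri, ctx, vocab = train_trigram_counts(training)

fluent = perplexity(["he", "would", "buy", "it"], tri, ctx, vocab)
scrambled = perplexity(["it", "buy", "would", "he"], tri, ctx, vocab)
print(fluent < scrambled)  # True
```

The scrambled word order receives the higher perplexity, which is the behaviour the slide expects for erroneous sentences.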

Proposed Technique: Other Linguistic Features (4)

• Syntactic Score (SC)
  • Using a statistical parser toolkit (Collins, 1997)
  • Each sentence is assigned the parser’s score: the related log probability of parsing
  • Assumption: erroneous sentences with undesirable sentence structures are more likely to receive lower scores

Proposed Technique: Other Linguistic Features (5)

• Function Word Density (FWD)
  • The ratio of function words to content words
  • Inspired by the work of (Corston-Oliver et al., 2001)
    • Effective for distinguishing between human references and machine outputs
  • Seven kinds of function words
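The basic ratio can be sketched as below. The slides do not list the seven kinds of function words or say how the five FWD features are derived, so this sketch computes a single overall ratio with a small hypothetical word list; one ratio per word class could be computed the same way.

```python
# Hypothetical function-word list, for illustration only.
FUNCTION_WORDS = {
    "a", "an", "the", "in", "on", "of", "to", "for",
    "is", "was", "he", "she", "it", "his", "her",
    "and", "but", "if",
}


def function_word_density(tokens, function_words=FUNCTION_WORDS):
    """Ratio of function words to content words in one sentence."""
    n_function = sum(1 for t in tokens if t.lower() in function_words)
    n_content = len(tokens) - n_function
    # Assumption: avoid division by zero when there are no content words.
    return n_function / max(n_content, 1)


tokens = "in the past John was kind to his sister".split()
print(function_word_density(tokens))  # 1.25 (5 function words / 4 content words)
```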

Experimental Evaluation (1): Experimental setup

• Classification model: SVM
• For a non-binary feature X, its value x is normalized by z-score.
• Two data sets: Japanese Corpus (JC) and Chinese Corpus (CC)
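The z-score normalization of a non-binary feature mentioned in the setup can be sketched as follows. The slides do not say whether the population or the sample standard deviation is used; this sketch assumes the population form, and the function name is made up.

```python
import statistics


def z_scores(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # Assumption: a constant feature is mapped to all zeros.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


# Hypothetical perplexity values for three sentences.
normalized = z_scores([2.0, 4.0, 6.0])
```

In a classification experiment the mean and standard deviation would normally be estimated on the training sentences and then reused for the test sentences.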


Experimental Evaluation (2)

Experimental Evaluation (3)

• ALEK (Chodorow and Leacock, 2000) from Educational Testing Service (ETS)
• Different cultures (Japanese/Chinese as first language)
• 694 parallel sentences
• 1671 non-parallel sentences

Experimental Evaluation (4)

• Two LDC data sets: low-ranked and high-ranked data
  • 14,604 low-ranked (score 1-3) MTs
  • 808 high-ranked (score 3-5) MTs
  • Both with corresponding human reference translations
  • Human references (correct), MT (erroneous)

Conclusions and Future Work

• Conclusions
  • This paper proposed mining LSPs as the input of classification models.
  • LSPs were shown to be much more effective than the other linguistic features.
  • Other features were also beneficial.
• Future work
  • To use LSPs to provide detailed feedback for ESL learners
  • To integrate the features effectively
  • To further investigate the application for MT evaluation

