
Lecture Notes in Artificial Intelligence

Subseries of Lecture Notes in Computer Science

Edited by J. G. Carbonell and J. Siekmann

Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen

1083

Karen Sparck Jones Julia R. Galliers

Evaluating Natural Language Processing Systems

An Analysis and Review

Springer

Series Editors

Jaime G. Carbonell School of Computer Science, Carnegie Mellon University Pittsburgh, PA 15213-3891, USA

Jörg Siekmann University of Saarland German Research Center for Artificial Intelligence (DFKI) Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany

Authors

Karen Sparck Jones Julia Rose Galliers Computer Laboratory, University of Cambridge New Museums Site, Pembroke Street Cambridge CB2 3QG, UK

Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme

Sparck Jones, Karen: Evaluating natural language processing systems : an analysis and review / K. Sparck Jones ; J. R. Galliers. - Berlin ; Heidelberg ; New York ; Barcelona ; Budapest ; Hong Kong ; London ; Milan ; Paris ; Santa Clara ; Singapore ; Tokyo : Springer, 1996

(Lecture notes in computer science ; 1083) ISBN 3-540-61309-9

NE: Galliers, Julia R.:; GT

CR Subject Classification (1991): I.2.7, H.3, I.2

ISBN 3-540-61309-9 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1996 Printed in Germany

Typesetting: Camera ready by author SPIN 10513063 06/3142 - 5 4 3 2 1 0 Printed on acid-free paper

Preface

This book presents a detailed analysis and review of NLP evaluation, in principle and in practice. Chapter 1 examines evaluation concepts and establishes a framework for NLP system evaluation. This makes use of experience in the related area of information retrieval, and the analysis also refers to evaluation in speech processing. Chapter 2 surveys significant evaluation work done so far, in relation both to particular tasks, for instance machine translation, and to NLP evaluation methodology as such, for example as this bears on generic system evaluation. The conclusion is that evaluation strategies and techniques for NLP need much more development, in particular to take proper account of the influence of system tasks and settings. Chapter 3 develops a general approach to NLP evaluation, aimed at methodologically sound strategies for test and evaluation motivated by comprehensive performance factor identification. The analysis throughout the book is supported by extensive illustrative examples.

The book is a development and update of an earlier technical report (Galliers and Sparck Jones, 1993). The work for the report was carried out in the Computer Laboratory, University of Cambridge, under the UK Science and Engineering Research Council's Grant GR/F 35227, 'A Contextual Reasoning and Cooperative Response Framework for the Core Language Engine (CLARE)'. This grant was for a component of a much larger project, developing CLARE itself, which was conducted at SRI International's Cambridge Centre, and we are grateful to Dr S.G. Pulman, the Centre's Director, for providing us with the opportunity and push for our enterprise. We are also very grateful to many people who supplied us with reports, papers, and information about their evaluation activities, both for the original report and this successor book, and to our colleagues, notably Louis des Tombe, for thoughtful comments.

Computer Laboratory University of Cambridge

Karen Sparck Jones Julia Galliers

January 1996


Table of Contents

List of Figures . . . ix

Glossary of terms . . . xi

Introduction 1

1 The Framework: Scope and Concepts 3

1 Systems . . . 3
1.1 Language processing . . . 3
1.2 Systems and subsystems . . . 6
1.3 System settings . . . 11
1.4 NLP system example . . . 15
2 Evaluation . . . 19
2.1 Evaluation levels . . . 19
2.2 Information retrieval experience . . . 20
2.3 Applicability to NLP . . . 28
2.4 NLP evaluation example . . . 30
2.5 Speech processing illustrations . . . 46
2.6 Evaluating generic systems . . . 49
2.7 Evaluation tools . . . 55
2.8 Evaluation from the social science point of view . . . 59

2 NLP Evaluation: Work and Strategies 65

1 Evaluation so far . . . 70
1.1 Machine translation (MT) . . . 70
1.2 Message understanding (MU) . . . 87
1.3 Database query (DBQ) . . . 97
1.4 Speech understanding (SU) . . . 109
1.5 Miscellaneous NLP task evaluations . . . 115
1.6 Text retrieval . . . 118
1.7 Cross-task issues . . . 120
2 General developments . . . 125
2.1 Evaluation workshops . . . 125
2.2 Evaluation tutorials . . . 155
2.3 EAGLES . . . 157
2.4 Particular methodologies . . . 163
2.5 Corpora, test suites, test collections, and toolkits . . . 167
2.6 Generic NLP system evaluation . . . 180
2.7 Mega-evaluation . . . 185
2.8 Speech evaluation . . . 188
3 Conclusions on evaluation to date . . . 189

3 Strategies for Evaluation 193

1 General recommendations . . . 193
1.1 Question framework . . . 194
2 Evaluation illustration . . . 196
2.1 Illustrative examples: scene setting . . . 197
2.2 Examples: evaluations . . . 201
3 Conclusion . . . 218

References 219

Index 226


List of Figures

Chapter 1

1 Illustrations of language, material and processing terms . . . 5
2 Illustrations of system types and elements . . . 7
3 Illustrations of setups and systems . . . 12
4 NLP system and setup example: 'Motorbikes' . . . 16
5 Illustration of IR system variables and parameters . . . 24
6 NLP evaluation example: 'Motorbikes' . . . 31-34
7 Diagram of setup and system relations for NLP evaluation example . . . 35
8 Speech evaluation illustrations . . . 47

Chapter 2

1 Summary of MT criteria, measures and methods . . . 72
2 Falkedal summary of MT criteria, measures and methods . . . 82
3 EAGLES Translation Memory (TM) feature checklist example . . . 88
4 Summary of MUC-3 criteria, measures and methods . . . 90
5 Summary of database query criteria, measures and methods . . . 99
6 Summary of ARPA speech understanding criteria, measures and methods . . . 113
7 Summary of FRUMP criteria, measures and methods . . . 116
8 Summary of EAGLES evaluation methodology (EAGLES, 1994) . . . 158
9 Summary of quality characteristics for software, ISO Standard 9126 (ISO, 1991) . . . 159

Chapter 3

1 Framework questions for evaluation scenario determining test and evaluation programme on subject . . . 195
2 Summary of evaluation scenario for Example L . . . 211


Glossary of terms

(assessment = evaluation)
acceptability: class of performance criterion
activity: of user in setup
adequacy evaluation: ends-oriented evaluation
AI: artificial intelligence
aims: of user
angle: viewpoint in evaluation linked to subject's ends
annotated corpus: corpus with labels
answer: output system should supply
antecedent variable: usually environment variable
apparatus: equipment other than system in setup
application: (system for) task in specific domain
architecture: infrastructure specification for NLP system
argot: very restricted sublanguage
attribute: pertinent to quality characteristic
baseline: performance floor from simple system
behaviour: of user
benchmark: established performance norm
black box: input/output-only evaluation
bound: area of evaluation - wide or narrow
broad scope: of setup
catalogue: fact list on evaluation subject
category: of user e.g. casual
checking collection: controlled test collection
checklist: evaluators' aid for featurisation
class: of performance criterion - efficiency etc
complexity: of system
component: part of system
composite: evaluation subsuming several measures
constitution: of evaluation subject
consumer: of evaluation findings
context: of evaluation subject
corpus: of test material
coverage corpus: corpus with all phenomena
criterion: for evaluation
customer: interested/consuming party for evaluation
data sort: kind of test/evaluation material e.g. corpus
data source: of setup working or system operation
decomposition: for/of evaluation
design: system specification
design goal: for objective
development data: working data for whole community
(diagnosis = evaluation)
diagnostic evaluation: analytical evaluation


dialogue: user-system interaction involving NL
distribution corpus: corpus giving phenomena distribution
division: between l-system and n-system
domain: area or field of task
dry run: of evaluation procedure to check out
eccentric: idiosyncratic system performance
effect: of system, including output
effectiveness: class of performance criterion
efficiency: class of performance criterion
ends: system objectives/functions or setup purposes/functions
environment: setup from system's point of view
environment factor: variable as factor affecting performance
evaluation: of system or setup performance
evaluation data: data used for evaluation
evaluation methodology: methodology for evaluation
evaluation procedure: for carrying out test
evaluation standard: requirements for evaluation criteria etc
exemplar: baseline or benchmark performance
exigent processing: thorough NLP
experiment: to explicate system/setup performance
extrinsic criterion: for evaluating wrt embedding setup
factor: see performance factor
feature: attribute, attribute value
featurisation: feature choice for evaluation
field evaluation: evaluation in real-life situation
form: of evaluation yardstick as attainable/ideal etc
full processing: complete NLP
function: role of system in setup (or one setup in another)
general-purpose system: system for any application
generic system: system independent of application
glass box: internal operation evaluation
goal: of evaluation
granularity: of parameters under evaluation
grid: design style for test/evaluation
guidelines: for evaluation
hybrid system: with both l- and n-subsystems
indicator: of performance, variable or parameter
informativeness: of evaluation about system or setup
interactive system: with user-system dialogue
interface: system for user interaction involving NLP
interest: category of evaluation requester e.g. developer


intervening variable: usually system parameter
instance: of input in test data
intrinsic criterion: for internal system/setup evaluation
investigation: to determine system performance
IR: information retrieval
kind: of evaluation as experiment/investigation
l-system: see language system
language: natural language
language system: (sub)system doing NLP
legitimacy: of data for test/evaluation use
linkage: of variables, parameters, objectives, effects, and measures
measure: of performance, instantiating criterion
mega-evaluation: large-scale, long-term, multi-task evaluation
method: of applying measure
methodology: for test/evaluation
metric: = measure
mode: of evaluation as qualitative/quantitative/hybrid
motivation: stimulating reason for evaluation
n-system: (sub)system not doing NLP
narrow bound: of area of evaluation
narrow scope: of setup
NL: natural language
NLP: natural language processing
non-interactive: system without user dialogue
norm: performance requirement (benchmark or target)
objective: what system itself is for
observation: of system/setup, preceding evaluation
operation: of system
orientation: of evaluation as extrinsic/intrinsic
p-setup: setup for individual user
parameter: of system
partial processing: incomplete NLP
performance: of system/setup wrt objective/purpose
performance exemplar: from baseline or benchmark
performance factor: any system/setup element affecting performance
perspective: aspect under which evaluation subject seen, e.g. financial
pretest: test of evaluation measures, methods
programme: set of related tests/evaluations
progress evaluation: improvement, development evaluation
pseudo-language: not actually natural language
purpose: what setup is for
qualitative: holistic, non-numeric performance measure


quality characteristic: (general) desired property of evaluation subject
quantitative: numeric performance measure
quasi-language: sublanguage with own life
range: of NLP system, especially generic
rationale: for performance comparison
reality: of data for test/evaluation use
reasonable: fair performance, given environment
references: data including answers supporting evaluation
reliability: consistency of performance measure
remit: of evaluation
reportable attribute: for evaluation customer
representativeness: of data for test/evaluation use
richness: of language
role: of user in setup e.g. data input
run: of system giving performance measure
scenario: test and evaluation plan
scope: of setup - broad or narrow
separation: of system from user
serious material: complex natural language material
setting: of parameter
setup: system plus operational context
simple processing: rudimentary NLP
sort: of test/evaluation data e.g. test suite
source: corpus for data
standards: requirements for test/evaluation conduct
status: of test/evaluation data, e.g. representativeness
strategy: for conducting evaluation
style: of evaluation as exhaustive/indicative etc
subject: of evaluation, i.e. component/(sub)system/setup
sublanguage: of natural language
substitute: honed answer in reference data
subsystem: l- or n- part of system
system: software+hardware entity
system factor: parameter as factor affecting performance
target: for performance
task: what system does
terminal: I/O manifestation of interface
test: investigation or experiment
test bed: application for exploring system design
test collection: data, with references, especially for experiment
test data: data for tests


test methodology: methodology for tests
test program: for doing runs, scoring performance, etc
test set: data subset used for system testing
test suite: designed test material
tool: for evaluation, i.e. data or program
toolkit: processing tools e.g. software for test/evaluation
training set: data subset used for system development
transcription: test collection of transcribed speech
transportability: of system
trivial material: simple natural language material
tuning: of system to application
tweaking: of system to evaluation conditions
type: of evaluation as black box/glass box
user: of system in setup
utility: operational setup-oriented criterion
validity: propriety of performance measure
value: of variable
variable: property of environment affecting system
wide bound: large area of evaluation
working: of setup
working data: system input material for testing
yardstick: nature of performance for comparison
