Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS) CeRI (Cornell eRulemaking Initiative) Cornell University An eRulemaking Corpus: Identifying Substantive Issues in Public Comments

Claire Cardie (CS+IS), Cynthia Farina (Law), Matt Rawding (IS), Adil Aijaz (CS)

CeRI (Cornell eRulemaking Initiative), Cornell University

An eRulemaking Corpus: Identifying Substantive Issues in Public Comments

Plan for the Talk

• Background
  – E-rulemaking
• CeRI FTA Grant Circulars Corpus
• Text Categorization Experiments

Rulemaking and E-Rulemaking

• Rulemaking: one of the principal methods of making regulatory policy in the US
  – ~4000+ per year
• “Notice and comment” rulemaking: formal public participation phase
  – 10 to 500,000 comments per rule
  – comment length: 1 sentence to tens of pages
  – agency legally bound to respond to all substantive issues
• E-rulemaking = e-notice and e-comment

Current Agency Practice

Goals of Our Current Work

• Determine the degree to which automatic issue categorization can facilitate analysis of comments by identifying and categorizing “relevant issues”.
• Framed as a text categorization task: given a comment set, the automated system should determine, for each sentence in each comment, which of a group of pre-defined issue categories it raises, if any.
• Builds on the work of Kwon & Hovy (2007) and Kwon et al. (2006)
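
The task framing above can be sketched as a sentence-level, multi-label interface. This is only an illustration under stated assumptions: the issue labels are borrowed from the results chart later in the talk, while `split_sentences`, `categorize_comment`, and the keyword rules are hypothetical stand-ins for a real sentence splitter and a trained classifier.

```python
import re
from typing import Callable, Dict, List, Set

# A sentence classifier maps one sentence to the set of issues it raises.
# An empty set corresponds to the "NONE" category.
SentenceClassifier = Callable[[str], Set[str]]

def split_sentences(comment: str) -> List[str]:
    """Naive splitter on ., ! and ? (an assumption; real comments need more care)."""
    parts = re.split(r"(?<=[.!?])\s+", comment.strip())
    return [p for p in parts if p]

def categorize_comment(comment: str, classify: SentenceClassifier) -> List[Set[str]]:
    """Return one (possibly empty) issue set per sentence of the comment."""
    return [classify(s) for s in split_sentences(comment)]

# Hypothetical keyword rules standing in for a trained model.
KEYWORDS: Dict[str, str] = {
    "mobility management": "MobilMgt",
    "training": "TechAsstTrain",
}

def keyword_classifier(sentence: str) -> Set[str]:
    lowered = sentence.lower()
    return {issue for phrase, issue in KEYWORDS.items() if phrase in lowered}

labels = categorize_comment(
    "We support the rule. Mobility management needs more training funds.",
    keyword_classifier,
)
```

Returning a set per sentence leaves room for both a sentence with no issue and a sentence that raises several issues at once.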

Plan for the Talk

• Background
• CeRI FTA Grant Circulars Corpus
  – Difficulties
  – Interannotator agreement results
• Text Categorization Experiments

FTA Grant Circulars Rule

• Topic: guidance to public and private transportation providers applying for federal aid for elderly, disabled and low-income persons
• 267 comments
  – shortest: 1 sentence
  – longest: 1420 sentences
• 11,094 sentences total

FTA Grant Circulars Issue Set

17 top-level issues

39 fine-grained issues

Kwon & Hovy (2007)

vs.

Difficulties for Text Categorization

Large, hierarchical issue set

FTA Grant Circulars Issue Set

17 top-level issues

39 fine-grained issues

Difficulties for Text Categorization

• Large, hierarchical issue set
• “NONE” category
• Skewed distribution across issues
  – 87% of the sentences are from 6 categories
  – 13% of the sentences are from 33 categories
• Potentially multiple issues per sentence
• Even long sentences contain few words
• Variation in comment quality, scope, vocabulary and form
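
To give a feel for the skew, the percentages above can be turned into rough per-category averages. This is a back-of-the-envelope sketch that assumes the 87% / 13% split applies to the full 11,094-sentence corpus and that sentences are spread evenly within each group; neither assumption is stated in the talk.

```python
total_sentences = 11_094   # corpus size reported for the FTA Grant Circulars rule

# Shares and category counts as given on the slide.
frequent_share, frequent_cats = 0.87, 6
rare_share, rare_cats = 0.13, 33

# Average number of sentences per category in each group (assumes an even spread).
frequent_avg = total_sentences * frequent_share / frequent_cats
rare_avg = total_sentences * rare_share / rare_cats

print(f"frequent categories: about {frequent_avg:.0f} sentences each")
print(f"rare categories:     about {rare_avg:.0f} sentences each")
```

Under these assumptions a rare category has on the order of a few dozen example sentences, which is little training data for a per-category classifier.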

The Annotators

Interannotator Agreement

• 146 comments used for the study
• 6 annotators
• 2.66 annotators per comment
• 41.5 sentences per comment
• Overlap agreement measure
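
The slide names an overlap agreement measure without defining it. One plausible reading, offered purely as an assumption, is that two annotators agree on a sentence when their label sets share at least one issue, or when both assign no issue. The function name `overlap_agreement` and the toy annotations are hypothetical.

```python
from typing import List, Set

def overlap_agreement(a: List[Set[str]], b: List[Set[str]]) -> float:
    """Fraction of sentences on which two annotators' label sets overlap.

    Assumed definition: the annotators agree on a sentence if they share at
    least one issue, or if both assigned no issue at all.
    """
    if len(a) != len(b):
        raise ValueError("annotators must label the same sentences")
    if not a:
        return 0.0
    agreed = sum(1 for x, y in zip(a, b) if (x & y) or (not x and not y))
    return agreed / len(a)

# Toy annotations for four sentences (not taken from the corpus).
annotator_1 = [{"MobilMgt"}, set(), {"DesRecip", "CompSelect"}, {"TechAsstTrain"}]
annotator_2 = [{"MobilMgt"}, set(), {"CompSelect"}, {"DesRecip"}]
score = overlap_agreement(annotator_1, annotator_2)
```

A measure of this kind is more lenient than exact match, which matters when a sentence can carry several issues.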

Plan for the Talk

• Background
  – E-rulemaking
  – Public comment analysis
• CeRI FTA Grant Circulars Corpus
  – Difficulties
  – Interannotator agreement results
• Text Categorization Experiments

Standard Text Categorization Algorithms

Fine-grained issues (39), coarse-grained issues (17)

Standard (flat) text categorization methods
• SVMs (0/1 loss)
• Maxent
• Naïve Bayes

Hierarchical text categorization methods
• cascaded classification, Dumais & Chen (2000)

Cascaded Categorization
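
The general idea of a cascade can be sketched as two stages: a top-level classifier picks a coarse issue, then a classifier specific to that coarse issue picks a fine-grained issue beneath it. This is a minimal illustration, not the configuration used in the talk; the `cascade` function, the keyword classifiers, and the two-level hierarchy are all hypothetical.

```python
from typing import Callable, Dict, Optional

Classifier = Callable[[str], Optional[str]]  # returns a label, or None for "NONE"

def cascade(sentence: str, top: Classifier, children: Dict[str, Classifier]) -> Optional[str]:
    """Two-stage cascade: pick a coarse issue, then a fine-grained issue under it."""
    coarse = top(sentence)
    if coarse is None:
        return None                      # no issue raised; stop at the top level
    fine_clf = children.get(coarse)
    if fine_clf is None:
        return coarse                    # coarse issue has no sub-issues
    fine = fine_clf(sentence)
    return f"{coarse}/{fine}" if fine is not None else coarse

# Toy keyword classifiers; the hierarchy below is hypothetical, not the FTA issue set.
def top_level(s: str) -> Optional[str]:
    s = s.lower()
    if "eligib" in s:
        return "Eligibility"
    if "training" in s:
        return "Training"
    return None

def eligibility_level(s: str) -> Optional[str]:
    return "Activities" if "activit" in s.lower() else "Recipients"

children = {"Eligibility": eligibility_level}
results = [
    cascade("Which activities are eligible?", top_level, children),
    cascade("More training is needed.", top_level, children),
    cascade("Thank you.", top_level, children),
]
```

Each second-stage classifier only has to separate the sub-issues of one parent, but a mistake at the top level cannot be corrected further down.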


Gold Standard Data Set

• Simulate agency comment analysis process
  – One analyst / rule
• Six data sets
  – One data set / annotator

SVM Results with tf.idf Features
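
For readers unfamiliar with tf.idf features, the sketch below computes one common variant (raw term frequency times log of inverse document frequency), treating each sentence as a document. The talk does not say which weighting variant or tokenizer was used, so the `tfidf` function and the two toy sentences are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf(sentences: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by term frequency times log(N / document frequency)."""
    n = len(sentences)
    # Document frequency: number of sentences in which each term appears.
    df = Counter(term for tokens in sentences for term in set(tokens))
    vectors = []
    for tokens in sentences:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Toy, pre-tokenized sentences (not taken from the corpus).
docs = [
    "the grant supports mobility management".split(),
    "the grant requires training".split(),
]
vectors = tfidf(docs)
```

Terms that occur in every sentence, such as "the" here, receive weight zero, while terms confined to a few sentences are weighted up. The resulting sparse vectors are the kind of input a classifier such as an SVM would consume.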

Best-Performing Fine-Grained Issues (Annotator 1)

[Figure: bar chart comparing categorization accuracy, interannotator agreement, and data coverage for each rule-specific issue. Horizontal axis, Rule-specific Issue: OtherNFreeElig, JARC_EligActiv, CompSelect, TechAsstTrain, MobilMgt, DesRecip, none. Vertical axis, 0 to 0.9: % agreement, correct, sentences covered (depending on bar color).]

Progress and Plans

• Promising initial results for rule-specific issue categorization of public comments
  – Annotate comments for more rules
  – Expert (rulewriter) vs. law student annotation
  – Integrate automatic text categorization into annotation interface
    • Active learning (Purpura, Cardie & Simons, dg.o 2008)
    • Collaboration with HCI colleagues in InfoSci

The End

For more on
  – the hierarchical text categorization method
    • Cardie et al. (dg.o 2008)
  – a new structural learning approach for hierarchical classification
    • Purpura et al. (in preparation)
  – active learning methods for hierarchical text categorization
    • Purpura, Cardie & Simons (dg.o 2008)

Minimizing the Costliest Errors**

**Underinclusive errors are the most costly
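
One way to reflect this in evaluation, offered as an assumption rather than the measure used in the talk, is a recall-weighted score such as F-beta with beta greater than 1. An underinclusive error is read here as a missed issue (a false negative). The counts below are hypothetical.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall (missed issues) more than precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Two hypothetical systems, each making 30 errors on an issue with 90 true sentences.
overinclusive = f_beta(tp=90, fp=30, fn=0)    # flags too much, misses nothing
underinclusive = f_beta(tp=60, fp=0, fn=30)   # flags too little, misses 30 sentences
```

With beta = 2 the overinclusive system scores higher than the underinclusive one even though both make the same number of errors, which matches the stated priority of not missing substantive issues.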