26
1 Automatic stylistic processing for classification and transformation of natural language text Foaad Khosmood Fall 2008

Automatic stylistic processing for classification and transformation of natural language text

  • Upload
    zora

  • View
    40

  • Download
    0

Embed Size (px)

DESCRIPTION

Automatic stylistic processing for classification and transformation of natural language text. Foaad Khosmood Fall 2008. Overview – 3 big sections. Background Intro Applications Related work System Design Examples Tools Current Status and Future Plans Current status Evaluation Plan - PowerPoint PPT Presentation

Citation preview

Page 1: Automatic stylistic processing for classification and transformation of natural language text

1

Automatic stylistic processing for

classification and transformation of

natural language text

Foaad Khosmood

Fall 2008

Page 2: Automatic stylistic processing for classification and transformation of natural language text

2

Overview – 3 big sections

1. Background• Intro• Applications• Related work

2. System• Design• Examples• Tools

3. Current Status and Future Plans• Current status• Evaluation Plan• Work Plan

Page 3: Automatic stylistic processing for classification and transformation of natural language text

3

Introduction/definitions Style

Way of doing something (“option” – Wallpole) Chess Opening

In Linguistics Dialect Register

Field (subject matter) Tenor (formality) Mode (medium, like “spoken”, “written”)

In other fields Text type (in classification)

Helsinki corpus Authorial style (in authorship attribution)

Stylometrics

Page 4: Automatic stylistic processing for classification and transformation of natural language text

4

Detecting Style in Text Using style markers

Features in text Any recognizable component Strictly defined, or rule-based / derived

Feature sets/combinations can model the lay notion of style

This project Recognize, detect and measure markers Add/remove markers through alterations to text

without changing meaning.

Page 5: Automatic stylistic processing for classification and transformation of natural language text

5

Applications Digital Forensics Style obfuscation and detection Style-based search Educational and productivity tools

Better writing assistants Authorial tools, digital entertainment Machine Translation (MT/SMT) Interactive agents

For help, digital assistants

Page 6: Automatic stylistic processing for classification and transformation of natural language text

6

Related work by others

Stylistics in Humanities Literary studies (Bradford) Controversy over meaning creation and

usefulness (Simpson, Fish, Carter) Style markers, computational stylistics

Wallpole (style as option, exercises) Luyckx, Argemennon, Juola (stylometrics,

authorship attribution) Brill, Heidorn (writing assistants)

Page 7: Automatic stylistic processing for classification and transformation of natural language text

7

Related work by us

Authorship attribution Using neural nets (IJCNN 2005) Unified methods and processes (ICMLC 06)

Transformation NSF proposal (’06) BCS paper (’08) on classification/transformation WCECS (’08): System Engineering DGfS (’09)

Page 8: Automatic stylistic processing for classification and transformation of natural language text

8

System Objectives

Classification Match text with corpus Accept source, multiple targets

Transformation Accept source, one target Utilize library of markers and transfoms Provide metric-based performance

Page 9: Automatic stylistic processing for classification and transformation of natural language text

9

A trivial scenario in depth

Overview corpus, T, which can be described as modern, formal, factual

writing with short sentences. Marker detector functions:

M1() returns inverse of the average words per sentence (i.e. sentence/word).

M2() returns a ratio of overall instances of single digit numbers written using Arabic numeral symbols (i.e. “1”, “2”, etc.), to the total number of single digit numbers written in the text.

We are also given 2 transform functions: X1() uses some symbolic linguistic techniques to split large

sentences into two or more smaller ones. X2() replaces instances of a number written in English to its

equivalent in Arabic numeral symbols.

Page 10: Automatic stylistic processing for classification and transformation of natural language text

10

Example - continued

Finally, we are given the following text from a recent new story as source text (S):

Canada's opposition parties have reached a tentative deal to form a coalition that would replace Prime Minister Stephen Harper's minority Conservative government less than two months after its reelection, a senior negotiator said on Monday.

This text contains 35 words and one of them is a number (“two”).

Page 11: Automatic stylistic processing for classification and transformation of natural language text

11

Example -continued

Analyze the target corpus, T (for this example we specify target corpus characteristics) M1(T) = 0.2 M2(T) = 0.8

Analyze the source text, S M1(S) = (1/35) = 0.0286 M2(S) = (0/1) = 0.0

For comparison and evaluation, a simple root mean squared error (RMSE) is calculated between the S and T vectors. RMSE(S,T) = √(( 0.2 - 0.0286)2 + (0.8 - 0)2) = 0.8182 The system determines the RMSE distance is too high to

constitute a match

Page 12: Automatic stylistic processing for classification and transformation of natural language text

12

Example -continued Operator X1 is applied to source text, S

Operator X1(S) converts the text to version S2 (inserted words and punctuation are underlined)

Canada's opposition parties have reached a tentative deal. The deal is to form a coalition. The coalition would replace Prime Minister Stephen Harper's minority Conservative government. This is less than two months after its reelection. A senior negotiator said it on Monday.

S2 is analyzed M1(S2) = ((1/8)+(1/7)+(1/11)+(1/9)+(1/7))/5 = 0.1225 M2(S2) = (0/1) = 0.0

New RMSE is calculated RMSE(S2,T) = √(( 0.2 - 0.1225)2 + (0.8 - 0)2) = 0.8037 System determines RMSE is still too high so we continue.

Page 13: Automatic stylistic processing for classification and transformation of natural language text

13

Example - continued Operator X2 is applies to source text, S

Operator X2(S2) converts the single instance of “two” to “2” and produces S3.

Canada's opposition parties have reached a tentative deal. The deal is to form a coalition. The coalition would replace Prime Minister Stephen Harper's minority Conservative government. This is less than 2 months after its reelection. A senior negotiator said it on Monday.

S3 is analyzed M1(S3) = ((1/8)+(1/7)+(1/11)+(1/9)+(1/7))/5 = 0.1225 M2(S3) = (1/1) = 1.0

New RMSE is calculated RMSE(S3,T) = √(( 0.2 - 0.1225)2 + (0.8 - 1.0)2) = 0.2144 At this point the system can either determine that this RMSE number is

within tolerance levels of the target corpus, or it can equally decide that since the transform functions have been exhausted S3 represents a best effort transformation for this problem.

Page 14: Automatic stylistic processing for classification and transformation of natural language text

14

Example - conclusion

Weight distributions among markers Operators applied successively Optimization is NP complete

Page 15: Automatic stylistic processing for classification and transformation of natural language text

15

System design

Page 16: Automatic stylistic processing for classification and transformation of natural language text

16

Components

Analyze source Analyze target Apply transforms Classify Compare Evaluator Operator library Target corpus Transformed text

Page 17: Automatic stylistic processing for classification and transformation of natural language text

17

Operators

Two example types Surface replacement

May require part-of-speech tagging May require dictionary selection Acronyms

Semantically aware Syntactic parsing May require scanning of whole document Yoda

Page 18: Automatic stylistic processing for classification and transformation of natural language text

18

Resources and tools

AAAC / JGAAP Patrick Juola’s Duquesne contest (AAAC)

LinkGrammar Parser Paraphrasing / summarization tools WordNet / Ontology / Word lists

Synonyms and related words GNU diction, style, flex, bison Readability measures Lancaster University course on Stylistics

Page 19: Automatic stylistic processing for classification and transformation of natural language text

19

Current Status

A series of scripts getMarkers(Text) analyze(Text1, Text2) transform(Text)

Missing capabilities intelligent selection of markers and transforms stager or planner of operations modular, web accessible contribution optimization for speed

Page 20: Automatic stylistic processing for classification and transformation of natural language text

20

Another example with current tools

Page 21: Automatic stylistic processing for classification and transformation of natural language text

21

Target – Elizabeth I speech in 1588

Elizabeth I of England - 1588

Page 22: Automatic stylistic processing for classification and transformation of natural language text

22

Results

Original source: I am $AI4, of the $CIVADJ7. We give you a choice: peace

and friendship - or blood, sweat, toil and tears. Please accept the following exchange as a sign of your friendly intentions toward our people.

Transformed source: I am $AI4, of the $CIVADJ7; we, under God, give you a

choice: peace and friendship , or blood, sweat, toil and tears. Take this exchange as a sign of your friendly intentions toward our loving people.

Page 23: Automatic stylistic processing for classification and transformation of natural language text

23

Classification-Transformation Loop

Method that can serve as a process directive

Classification and Transformation are intimately related

Classification as way to detect Transformation

Inadequate transformation => richer markers

Richer style markers => more precise classification

CLAS

TRAN

Page 24: Automatic stylistic processing for classification and transformation of natural language text

24

Evaluation Classification

Done by classic supervised machine learning techniques

Multi-fold, labeled corpus Transformation

Machine verification Auto classifier of same system External classifier

Human verification Periodic (perhaps 2 per phase) human assessment of

classification

Page 25: Automatic stylistic processing for classification and transformation of natural language text

25

Timeline Phase I: Vertical prototype (3-5 months)

Build the skeleton of the prototype system using the new programming tools and web environment. The goal is to get it to the level of the scripting based system we currently have. Very few markers and very few transformers will be implemented other than some token ones to demonstrate the feasibility and modularity.

Install and facilitate database communication with the web based system. Gather appropriate corpora for testing.

Phase II: Build up collection of markers and transforms (3-5 months) Include more markers and transformers using researched resources and techniques. Consult more

linguistic experts and tools in order to develop more robust and more generic natural language manipulation tools. Markers should be empirically tested in strict classification exercise to determine their relative values.

Work on planning algorithm for transformer selection, as well as statistically based markers (i.e marker functions that are based on specific text features and will be different for different corpora).

Phase III: Evaluator, comparator and planner (3-5 months) Work on automatic normalization and weight adjustment for the marker vector based on

reinforcement learning. Work on alternative and interchangeable distance metrics between texts (as opposed to strict style

matrix RMSE). Work on abstracting the planning function and allowing user control over its operations.

Phase IV: Transform granularity and constraint satisfaction (3-5 months) Develop a granularity operator for transforms or degrees of freedom (i.e. even though applicable to

many sections of a text, the transform functions should manipulate very few sections of the text at one time) and a measure of granularity should be specified and utilized by the system.

Improve transformation performance by adding more constraints for basic meaning transference (i.e. grammatical correctness and measurable meaning orthodoxy).

Test transformation with more corpora and validate by using best classification algorithms. Develop workable prototypes for one or more real-world applications.

Page 26: Automatic stylistic processing for classification and transformation of natural language text

26

Questions, more demos