STYLISTIC VARIATION AS A BASIS FOR GENRE-BASED TEXT CLASSIFICATION S. Sameen Fatima Dept. of Computer Science & Engineering Osmania University Hyderabad


([email protected])

BACKGROUND

Q. What is classification (of text)?

A. Classification is an important IR task in which one or more category labels are assigned to a document.

Approaches to Classification (of text)

Earlier approaches to text classification assigned labels to documents based on CONTENTS

1. Word-based techniques

- statistical (tf, idf)

- term/keyword searches

Advantage: Simple and can be automated

Disadvantage: Phrases cannot be extracted
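
As an illustration of the statistical word-based approach, the following is a minimal sketch of tf-idf weighting. It assumes one common variant (relative term frequency multiplied by the log of the inverse document frequency); the slides do not say which variant is meant, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (each doc is a list of tokens)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, for illustration only.
docs = [
    "the market rose".split(),
    "the market fell".split(),
    "the editor wrote".split(),
]
weights = tf_idf(docs)
```

A word that occurs in every document ("the") gets weight zero, while a word unique to one document ("rose") gets the highest weight. Each word is weighted on its own, which is why phrases are not captured.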

2. Phrase-based techniques

a) In-depth NLP: Here we aspire to represent all the information in a text using context.

- syntax

- semantics

- statistics

Advantage: General task-independent representation

Disadvantage: Costly; not possible in polynomial time

b) Information Extraction: Here we delimit in advance, as part of the specification of a task, the semantic range of the output, the relations we will represent and other allowable fillers in each slot.

Advantage: It works well for a specific corpus

Disadvantage: A new IE system has to be designed for each new corpus.

Limitations of the Earlier Approaches to Text Classification

Texts have, besides content, STYLE which has not been accounted for.

The focus of this talk is to present STYLE as a new basis for text classification.

COMPUTATIONAL STYLISTICS

The study of style, in other words the detection of patterns common to a body of writing, is known as STYLISTICS.

If stylistic analysis uses computer-aided methods and statistical methods for analysis of texts, the field of study is called COMPUTATIONAL STYLISTICS.

Related Work in Computational Stylistics

1. Pre-WWW Era:

- Author Attribution Studies:

The well-known study by Mosteller and Wallace of anonymous essays published in THE FEDERALIST, carried out to identify the authors (Hamilton and Madison).

Stylistic parameters: sentence length, content words (nouns, adjectives, verbs), function words (prepositions, conjunctions), use of by, from, and to, ...

The study produced the interesting result that content words were too subject-dependent to be good discriminators, while function words were good discriminators.

- Automatic Abstracting:

Borko and Chatman advanced the view that it seems possible to make stylistic distinctions between informative abstracts (which discuss the research) and indicative abstracts (which discuss the article that describes the research), based on the form, voice, tense and focus of the abstract.

- Teaching writing styles for different types of documents.

Writer’s WorkBench program on AT&T Unix.

Related Work in Computational Stylistics(contd)

2. WWW-Era: (on-going)

- Stylistic variation between the different genres found in the Wall Street Journal. (Jussi Karlgren, Troy Straszheim)

Example: Articles, Business News with tables, Business News, Lists of briefs, Editorials, letters, Briefs, “What’s New”, Tables.

Use simple stylistic parameters: characters/word, digits/keywords, words/sentence.

- Establishing a genre palette for internet material.

(Jussi Karlgren, John Dewe, Ivan Bretan)

Definition of a Genre/Functional Style

A set of documents with a perceived consistent tendency to make the same stylistic choices; specifically, if it has an established communication function, it is a functional style.

Genres can have differing usefulness

Genres in my work (Corpus)

Editorials from Hindu

Editorials from Hindustan Times

Editorials from Times of India

Hypothesis

Editorials from each newspaper show a systematic and consistent difference in the choice of a presentation style, specifically to establish some intended communication function (aggressive, conservative, liberal).

Aim of the Experiment

To find a descriptive and predictive algorithm for classifying editorials from different newspapers based on stylistic features.

Mathematical Model

Two models were explored to find which was applicable.

1. Vector Space Model - Used by Salton in the SMART system (IRS)

2. Euclidean Space Model.

Euclidean Space Model

An n-dimensional Euclidean space, $E^n$, is defined as the set of all n-tuples of real numbers $(x_1, x_2, \ldots, x_n)$, where the Euclidean distance in $E^n$ between two points $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ is defined by

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$
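
The distance definition above can be sketched directly in code. This is a minimal illustration using only the standard library; the function name and the sample points are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical points in a 2-dimensional space (a 3-4-5 right triangle).
d = euclidean_distance((0.0, 3.0), (4.0, 0.0))
```

Here `d` evaluates to 5.0.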

In our project Euclidean Space represents a Stylistic Space

Euclidean Space            Stylistic Space
Co-ordinate or axis x_i    Significant stylistic feature
Origin                     Neutral
Point x                    FSP of an editorial
Region                     Set of FSPs
Space S                    Set of FSPs
Distance, d(x,y)           Stylistic similarity of editorial x and editorial y

In the Vector Space Model, the closeness of two points x and y is measured by the angle (x,y) formed by the lines from each of the points to the origin, whose cosine is given by

$$\cos(x, y) = \frac{x \cdot y}{(x \cdot x)^{0.5} \, (y \cdot y)^{0.5}}$$

This failed in stylistic analysis
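
The slides do not say why the cosine measure failed. The following sketch only illustrates how the two measures differ on the same pair of points; the function names and the sample vectors are hypothetical.

```python
import math

def cosine(x, y):
    """Cosine of the angle between x and y (undefined for a zero vector)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

def euclidean(x, y):
    """Euclidean distance between x and y."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical feature vectors: y has twice the value of x on every feature.
x = (1.0, 2.0)
y = (2.0, 4.0)
cos_xy = cosine(x, y)      # close to 1: the vectors point in the same direction
dist_xy = euclidean(x, y)  # sqrt(5): the points are still some distance apart
```

The cosine depends only on direction, so two profiles whose feature values differ by a constant factor look identical to it, while the Euclidean distance still separates them.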

Stylistic Profiling

A method of identifying the stylistic features in the writing style of an individual or a group of people and presenting them in a systematic way.

1. Lexical Features

• Percentage of interrogative pronouns

• Percentage of emphatic pronouns

• Percentage of prepositions

• Percentage of conjunctions

• Percentage of articles

• Percentage of action words

• Percentage of unique words

2. Structural Features

• Average words/sentence

• Maximum sentence length

• Total no. of sentences

• Total no. of words

• Total no. of characters

3. Affective Features

• Percentage of passive sentences

• Flesch Reading Ease

• Coleman-Liau Grade Level

• Bormuth Grade Level
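
As a rough illustration of how part of such a profile might be computed, the sketch below extracts the structural features and the percentage of unique words from raw text. It assumes a deliberately naive segmentation (sentences end at `.`, `!` or `?`; words are runs of letters and apostrophes), which is not necessarily how the original work tokenized its editorials. The remaining lexical and affective features need part-of-speech information or readability formulas and are not shown. The function name and sample text are hypothetical.

```python
import re

WORD = r"[A-Za-z']+"

def structural_features(text):
    """Compute simple structural features of a text with naive segmentation."""
    # Naive sentence split: any run of '.', '!' or '?' ends a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(WORD, text)
    if not sentences or not words:
        raise ValueError("text contains no sentences or words")
    words_per_sentence = [len(re.findall(WORD, s)) for s in sentences]
    return {
        "total_sentences": len(sentences),
        "total_words": len(words),
        "total_characters": len(text),
        "avg_words_per_sentence": len(words) / len(sentences),
        "max_sentence_length": max(words_per_sentence),
        # Case-insensitive count of distinct words, as a percentage of all words.
        "pct_unique_words": 100.0 * len({w.lower() for w in words}) / len(words),
    }

# Hypothetical sample text, for illustration only.
sample = "The budget was passed. Was it fair? The budget was not fair at all."
features = structural_features(sample)
```

For the sample text this gives 3 sentences, 14 words and a maximum sentence length of 7 words.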

Classification Algorithm

1. Training Phase

- Training set consisting of 30 editorials each from H, HT, TI

- Feature Extraction (Lexical, Structural, Affective): 90 SPs

- Conduct ANOVA test & extract the SIGNIFICANT FEATURES: 90 FSPs

- Compute the mean for each of the significant features for each newspaper

- 3 Prototypes: P-H, P-HT, P-TI
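
The prototype step of the training phase can be sketched as follows: each newspaper's prototype is the per-feature mean of its training profiles. The sketch assumes the significant features have already been selected, and it uses made-up toy values with two features and two editorials per newspaper rather than the 30 editorials used in the experiment. The function name is hypothetical.

```python
from statistics import mean

def build_prototypes(training_set):
    """Map each newspaper label to the per-feature mean of its profiles."""
    return {
        # zip(*profiles) groups the values feature by feature.
        label: [mean(column) for column in zip(*profiles)]
        for label, profiles in training_set.items()
    }

# Hypothetical toy profiles (two significant features each); not the talk's data.
training_set = {
    "H":  [[20.0, 4.0], [22.0, 6.0]],
    "HT": [[15.0, 8.0], [17.0, 10.0]],
    "TI": [[10.0, 2.0], [12.0, 4.0]],
}
prototypes = build_prototypes(training_set)
```

With these toy values the prototype for "H" is `[21.0, 5.0]`.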

2. Classification Phase

- New instance of editorial

- Significant Feature Extraction, giving the FSP, I

- Compute the distance between I and each of the prototypes from the training phase

- If d(I,P-H) is least, classify as Hindu

- If d(I,P-HT) is least, classify as HT

- If d(I,P-TI) is least, classify as TI
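
The classification phase amounts to a nearest-prototype rule, which might be sketched as below. The prototype values are hypothetical, and `math.dist` (Python 3.8 or later) is used for the Euclidean distance.

```python
import math

def classify(instance, prototypes):
    """Return the label of the prototype with the least distance to the instance."""
    return min(prototypes, key=lambda label: math.dist(instance, prototypes[label]))

# Hypothetical prototypes over two significant features; not the talk's data.
prototypes = {"H": [21.0, 5.0], "HT": [16.0, 9.0], "TI": [11.0, 3.0]}

label = classify([20.0, 6.0], prototypes)
```

The instance `[20.0, 6.0]` lies closest to the "H" prototype, so `label` is `"H"`. Ties are resolved in favour of whichever label comes first in the dictionary, a detail the slides do not address.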

Results

1. Data Collection (SP)

2. Results of identifying significant features in the training phase (FSP):

One-tailed ANOVA test was carried out

Null hypothesis: No difference between the means

Alternate hypothesis: Means are different

The ratio of the variance estimates is calculated: $F = S_b^2 / S_w^2$

$S_b^2 = S_w^2$ (check for null hypothesis)

$S_b^2 > S_w^2$ (check for alternate hypothesis)

If $F > F_{crit}$ for a particular significance level, then we say that the means of the feature are significantly different.
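
A minimal sketch of the F ratio for a single feature follows, assuming the standard one-way ANOVA estimates: the between-group variance is the between-group sum of squares divided by k - 1, and the within-group variance is the within-group sum of squares divided by N - k, for k groups and N observations in total. The function names and data are hypothetical, and the critical value is passed in by the caller rather than computed.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F ratio: between-group over within-group variance estimate."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Between-group sum of squares, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares around each group's own mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    s_b2 = ss_between / (k - 1)
    s_w2 = ss_within / (n_total - k)
    return s_b2 / s_w2

def is_significant(groups, f_crit):
    """True if the feature's group means differ at the level implied by f_crit."""
    return anova_f(groups) > f_crit

# Hypothetical values of one feature for three newspapers; not the talk's data.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = anova_f(groups)
# 5.14 is the tabulated critical value for (2, 6) degrees of freedom at the 0.05 level.
keep_feature = is_significant(groups, f_crit=5.14)
```

For these toy values the F ratio is 13, which exceeds the critical value, so the feature would be kept as significant.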

3. Results of the classification phase

Performance Evaluation

The following measures were computed:

Precision = Number-classified-correctly/Number-total-classified

Recall = Number-classified-correctly/Number-relevant-for-classification
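
The two measures might be computed per newspaper as in the sketch below. It assumes that "Number-total-classified" means the editorials assigned to a given newspaper and "Number-relevant-for-classification" means the editorials that actually come from it; the slides do not spell this out. The function name and labels are hypothetical.

```python
def precision_recall(actual, predicted, label):
    """Precision and recall for one class, from parallel lists of labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p == label)
    total_classified = sum(1 for p in predicted if p == label)
    total_relevant = sum(1 for a in actual if a == label)
    # Report 0.0 when a class was never predicted or never present.
    precision = correct / total_classified if total_classified else 0.0
    recall = correct / total_relevant if total_relevant else 0.0
    return precision, recall

# Hypothetical outcomes for six editorials; not the talk's results.
actual    = ["H", "H", "H",  "HT", "HT", "TI"]
predicted = ["H", "H", "HT", "HT", "H",  "TI"]
p_ht, r_ht = precision_recall(actual, predicted, "HT")
```

In this toy example one of the two editorials labelled HT is correct and one of the two actual HT editorials is found, so both values are 0.5.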

Conclusion

The results of the experiment were positive. It was possible to classify editorials with a good degree of recall and precision.

Scope for further work

Currently, it is not clear whether topic and style are two independent dimensions of variation in text, or whether they go hand in hand. This can be further explored by subclassifying editorials based on topic and then studying each of them for stylistic variations.

Applications

- For classifying documents on the Internet based on GENRE

- Relating FSPs of editorials to the reader profiles for each newspaper so as to establish any interesting relationship.
