STYLISTIC VARIATION AS A BASIS FOR GENRE-BASED
TEXT CLASSIFICATION
S. Sameen Fatima
Dept. of Computer Science & Engineering
Osmania University
Hyderabad
BACKGROUND
Q. What is classification (of text)?
A. Classification is an important IR task in which one or more category labels are assigned to a document.
Approaches to Classification (of text)
Earlier approaches to text classification assigned labels to documents based on CONTENTS
1. Word-based techniques
- statistical (tf, idf)
- term/keyword searches
Advantage: Simple and can be automated
Disadvantage: Phrases cannot be extracted
2. Phrase-based techniques
a) In-depth NLP: Here we aspire to represent all the information in a text using context.
-syntax
-semantics
-statistics
Advantage: General task-independent representation
Disadvantage: Costly, Not possible in polynomial time
b) Information Extraction: Here we delimit in advance, as part of the specification of a task, the semantic range of the output, the relations we will represent and other allowable fillers in each slot.
Advantage: It works well for a specific corpus
Disadvantage: A new IE system must be designed for each new corpus.
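As a concrete instance of the word-based statistical technique (tf, idf) listed under item 1 above, here is a minimal sketch. The slides do not say which tf-idf variant was meant, so the weighting used here (relative term frequency times log inverse document frequency), the whitespace tokenizer, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents, invented for illustration only.
docs = ["the budget was passed", "the match was won", "the budget deficit grew"]
weights = tf_idf(docs)
```

A term that occurs in every document (here "the") gets weight zero, which is why such weights capture content rather than style.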
Limitations of the Earlier Approaches to
Text Classification
Texts have, besides content, STYLE which has not been accounted for.
It is the focus of this talk to present STYLE as a new basis for text classification
COMPUTATIONAL STYLISTICS
The study of style, in other words the detection of patterns common to a body of writing, is known as STYLISTICS.
When stylistic analysis uses computer-aided and statistical methods to analyse texts, the field of study is called COMPUTATIONAL STYLISTICS.
Related Work in Computational Stylistics
1. Pre-WWW Era:
- Author Attribution Studies:
Mosteller and Wallace’s well-known study of the anonymous essays published in THE FEDERALIST, which set out to identify their authors (Hamilton and Madison).
Stylistic parameters: sentence length, content words (nouns, adjectives, verbs), function words (prepositions, conjunctions), use of by, from, and to, ...
Result: content words were too subject-dependent to be good discriminators, while function words were good discriminators.
- Automatic Abstracting:
Borko and Chatman advanced the view that it seems possible to make stylistic distinctions between informative abstracts (which discuss the research) and indicative abstracts (which discuss the article that describes the research), based on the form, voice, tense, and focus of the abstract.
- Teaching writing styles for different types of documents.
Writer’s WorkBench program on AT&T Unix.
Related Work in Computational Stylistics(contd)
2. WWW-Era: (on-going)
- Stylistic variation between the different genres found in the Wall Street Journal. (Jussi Karlgren, Troy Straszheim)
Example: Articles, Business News with tables, Business News, Lists of briefs, Editorials, letters, Briefs, “What’s New”, Tables.
Use simple stylistic parameters: characters/word, digits/keywords, words/sentence.
- Establishing a genre palette for internet material.
(Jussi Karlgren, John Dewe, Ivan Bretan)
Definition of a Genre/Functional Style
A set of documents with a perceived consistent tendency to make the same stylistic choices; specifically, if the set has an established communicative function, it is a functional style.
Genres can have differing usefulness
Genres in my work (Corpus)
Editorials from Hindu
Editorials from Hindustan Times
Editorials from Times of India
Hypothesis
Editorials from each newspaper show a systematic and consistent difference in the choice of a presentation style, specifically to establish some intended communication function (aggressive, conservative, liberal)
Aim of the Experiment
To find a descriptive and predictive algorithm for classifying editorials from different newspapers based on stylistic features.
Mathematical Model
Two models were explored to find which was applicable.
1. Vector Space Model - Used by Salton in the SMART system (IRS)
2. Euclidean Space Model.
Euclidean Space Model
An n-dimensional Euclidean space, $E^n$, is defined as the set of all n-tuples of real numbers $(x_1, x_2, \ldots, x_n)$, where the Euclidean distance in $E^n$ between two points $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ is defined by
$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$
In our project, Euclidean Space represents a Stylistic Space:

Euclidean Space            Stylistic Space
Co-ordinate or axis x_i    Significant stylistic feature
Origin                     Neutral
Point x                    FSP of an editorial
Region                     Set of FSPs
Space S                    Set of FSPs
Distance, d(x,y)           Stylistic similarity of editorial x and editorial y
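The distance d(x, y) defined above can be sketched in a few lines. The two FSP vectors below are invented for illustration; the slides do not list actual feature values.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two points of the same dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical FSPs of two editorials (one coordinate per significant feature).
fsp_a = [12.0, 3.5, 21.0]
fsp_b = [9.0, 7.5, 21.0]

d = euclidean_distance(fsp_a, fsp_b)  # sqrt(3^2 + 4^2 + 0^2)
print(d)
```

A smaller d means the two editorials are stylistically closer in this space.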
In the Vector Space Model, the distance between two points x and y is measured by the angle (x,y) formed by the lines from each of the points to the origin, which is given by
$$\cos(x, y) = \frac{x \cdot y}{(x \cdot x)^{0.5}\,(y \cdot y)^{0.5}}$$
This model failed in stylistic analysis.
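For comparison, a minimal sketch of the cosine measure is given here. The example vectors are invented. The slides do not say why the Vector Space Model failed, so the property illustrated (the cosine depends only on direction, not on magnitude) is offered as an observation about the measure, not as the stated reason.

```python
import math


def cosine(x, y):
    """Cosine of the angle between vectors x and y, measured at the origin."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for the zero vector")
    return dot / (norm_x * norm_y)


# Hypothetical profiles: y is x with every feature value doubled.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

similarity = cosine(x, y)   # approximately 1: same direction
separation = math.dist(x, y)  # Euclidean distance is clearly non-zero
print(similarity, separation)
```

Two profiles whose feature values differ by a constant factor are indistinguishable under the cosine measure, while the Euclidean distance still separates them.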
Stylistic Profiling
A method of identifying the stylistic features in the writing style of an individual or a group of people and presenting them in a systematic way.
1. Lexical Features
• Percentage of interrogative pronouns
• Percentage of emphatic pronouns
• Percentage of prepositions
• Percentage of conjunctions
• Percentage of articles
• Percentage of action words
• Percentage of unique words
2. Structural Features
• Average words/sentence
• Maximum sentence length
• Total no. of sentences
• Total no. of words
• Total no. of characters
3. Affective Features
• Percentage of passive sentences
• Flesch Reading Ease
• Coleman Liau Grade level
• Bormuth Grade Level
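A few of the features listed above can be computed with very little machinery. The sketch below is illustrative only: the regular-expression tokenizer, the sentence splitter, and the small word lists `PREPOSITIONS` and `ARTICLES` are assumptions, not the tools used in the original experiment. The Coleman-Liau grade uses the standard formula 0.0588 L - 0.296 S - 15.8, where L is letters per 100 words and S is sentences per 100 words.

```python
import re

# Small illustrative word lists (assumptions; a real system would use fuller lists).
PREPOSITIONS = {"in", "on", "at", "by", "from", "to", "of", "with", "for"}
ARTICLES = {"a", "an", "the"}

WORD_PATTERN = r"[A-Za-z']+"


def stylistic_profile(text):
    """Compute a partial stylistic profile (SP) of a text as a dict."""
    # Naive sentence split on ., ! and ? (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(WORD_PATTERN, text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    n_words = len(words)
    sentence_lengths = [len(re.findall(WORD_PATTERN, s)) for s in sentences]
    n_letters = sum(ch.isalpha() for ch in text)
    letters_per_100 = 100.0 * n_letters / n_words
    sentences_per_100 = 100.0 * len(sentences) / n_words

    return {
        # Lexical features
        "pct_prepositions": 100.0 * sum(w in PREPOSITIONS for w in words) / n_words,
        "pct_articles": 100.0 * sum(w in ARTICLES for w in words) / n_words,
        "pct_unique_words": 100.0 * len(set(words)) / n_words,
        # Structural features
        "avg_words_per_sentence": n_words / len(sentences),
        "max_sentence_length": max(sentence_lengths),
        "total_sentences": len(sentences),
        "total_words": n_words,
        "total_characters": len(text),  # counts spaces and punctuation too
        # Affective feature
        "coleman_liau_grade": 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8,
    }


profile = stylistic_profile("The budget was passed by the house. Will it help?")
```

Features such as the percentage of passive sentences or of action words would need a part-of-speech tagger or parser and are left out of this sketch.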
Classification Algorithm
1. Training Phase
Training set consisting of 30 editorials each from H, HT, TI
-> Feature Extraction (Lexical, Structural, Affective): 90 SPs
-> Conduct ANOVA test & extract the SIGNIFICANT FEATURES: 90 FSPs
-> Compute the mean for each of the significant features for each newspaper
-> 3 Prototypes: P-H, P-HT, P-TI
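The last step of the training phase, computing one prototype per newspaper as the feature-wise mean of its FSPs, might look like the sketch below. The function name `build_prototypes` and the tiny two-feature training set are hypothetical; the real experiment used 30 editorials per newspaper.

```python
from statistics import mean


def build_prototypes(training_fsps):
    """Map each newspaper label to the feature-wise mean of its FSPs.

    training_fsps: {label: [fsp, ...]}, where each fsp is a list holding the
    values of the significant features in a fixed order.
    """
    prototypes = {}
    for label, fsps in training_fsps.items():
        # zip(*fsps) groups the values of one feature across all editorials.
        prototypes[label] = [mean(column) for column in zip(*fsps)]
    return prototypes


# Invented values: two editorials per newspaper, two significant features each.
training = {
    "H": [[20.0, 5.0], [22.0, 7.0]],
    "HT": [[15.0, 9.0], [17.0, 11.0]],
    "TI": [[10.0, 2.0], [12.0, 4.0]],
}

prototypes = build_prototypes(training)
print(prototypes)
```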
2. Classification Phase
New instance of editorial
-> Significant Feature Extraction: FSP, I
-> Compute the distance between I and each of the prototypes from the training phase
-> Least d(I,P-H): classify as Hindu
   Least d(I,P-HT): classify as HT
   Least d(I,P-TI): classify as TI
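The classification phase amounts to a nearest-prototype rule under Euclidean distance. A minimal sketch follows; the prototype values and the new instance are invented, and `classify` is a hypothetical helper name.

```python
import math


def classify(instance, prototypes):
    """Return (label of the nearest prototype, distances to all prototypes)."""
    distances = {
        label: math.dist(instance, prototype)  # Euclidean distance
        for label, prototype in prototypes.items()
    }
    nearest = min(distances, key=distances.get)
    return nearest, distances


# Hypothetical prototypes from a training phase (two significant features).
prototypes = {
    "H": [21.0, 6.0],
    "HT": [16.0, 10.0],
    "TI": [11.0, 3.0],
}

# FSP of a new editorial, I.
instance = [17.0, 9.0]

label, distances = classify(instance, prototypes)
print(label, distances)
```

Ties between prototypes are resolved here by dictionary order, which is an arbitrary choice; the slides do not address ties.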
Results
1. Data Collection (SP)
2. Results of identifying significant features in the training phase (FSP):
One-tailed ANOVA test was carried out
Null hypothesis: No difference between the means
Alternate hypothesis: Means are different
The ratio of the variance estimates is calculated: $F = S_b^2 / S_w^2$
$S_b^2 = S_w^2$ (Check for null hypothesis)
$S_b^2 > S_w^2$ (Check for alternate hypothesis)
If $F > F_{crit}$ for a particular significance level, we say that the means of the feature are significantly different.
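The F ratio for a single feature can be computed by hand as sketched below, reading the between-group and within-group variance estimates as the usual one-way ANOVA mean squares. The sample values are invented, and the critical value is assumed to come from an F table for the chosen significance level.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F ratio: between-group over within-group variance estimate."""
    k = len(groups)                        # number of groups (newspapers)
    n_total = sum(len(g) for g in groups)  # total number of observations
    grand_mean = mean([v for g in groups for v in g])

    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    s_b2 = ss_between / (k - 1)        # between-group variance estimate
    s_w2 = ss_within / (n_total - k)   # within-group variance estimate
    return s_b2 / s_w2                 # undefined if all groups have zero spread


# Invented values of one feature for three editorials from each newspaper.
groups = [
    [1.0, 2.0, 3.0],   # H
    [2.0, 3.0, 4.0],   # HT
    [5.0, 6.0, 7.0],   # TI
]

f_value = anova_f(groups)

# Tabulated critical value for (2, 6) degrees of freedom at the 0.05 level.
f_crit = 5.14
is_significant = f_value > f_crit
print(f_value, is_significant)
```

A feature is kept as significant only when its F value exceeds the critical value; repeating the test over every feature in the SP yields the FSP.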
3. Results of the classification phase
Performance Evaluation
The following measures were computed:
Precision = Number-classified-correctly/Number-total-classified
Recall = Number-classified-correctly/Number-relevant-for-classification
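The two measures can be computed per newspaper as sketched below. Reading "Number-total-classified" as the number of editorials assigned to a newspaper and "Number-relevant-for-classification" as the number that truly belong to it is an assumption; the labels in the example are invented and are not results from the experiment.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one class (newspaper) `target`."""
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(t == target and p == target for t, p in pairs)
    total_classified = sum(p == target for _, p in pairs)  # assigned to target
    total_relevant = sum(t == target for t, _ in pairs)    # truly from target

    precision = correct / total_classified if total_classified else 0.0
    recall = correct / total_relevant if total_relevant else 0.0
    return precision, recall


# Invented test outcome for six editorials.
true_labels = ["H", "H", "HT", "HT", "TI", "TI"]
predicted = ["H", "HT", "HT", "HT", "TI", "H"]

p_h, r_h = precision_recall(true_labels, predicted, "H")
p_ht, r_ht = precision_recall(true_labels, predicted, "HT")
print(p_h, r_h, p_ht, r_ht)
```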
Conclusion
The results of the experiment were positive. It was possible to classify editorials with a good degree of recall and precision.
Scope for further work
Currently, it is not clear whether topic and style are two independent dimensions of variation in text, or whether they go hand in hand. This can be further explored by subclassifying editorials based on topic and then studying each of them for stylistic variations.
Applications
- For classifying documents on the Internet based on GENRE
- Relating FSPs of editorials to the reader profiles for each newspaper so as to establish any interesting relationship.