Assessing approaches to genre classification


    Assessing approaches to

    genre classification

    Philipp Petrenz

Master of Science
School of Informatics

    University of Edinburgh

    2009


    Abstract

Four formerly suggested approaches to automated genre classification are assessed and compared on a unified data basis. Evaluation is done in terms of prediction accuracy as well as recall and precision values. The focus is on how well the algorithms cope when tested on texts with different writing styles and topics. Two U.S.-based newspaper corpora are used for training and testing. The results suggest that different approaches are suitable for different tasks and none can be seen as a generally superior genre classifier.

    Acknowledgements

    First and foremost, I would like to thank my supervisor, Bonnie Webber, for her outstanding support

    throughout all stages of this project. I am also grateful to Victor Lavrenko, who provided input in

    preceding discussions. Furthermore, helpful information was received from Geoffrey Nunberg, Brett

Kessler and Hinrich Schütze and was much appreciated.


    Table of Contents

1. Introduction
1.1. What are genres?
1.2. How is genre different from style and topic?
1.3. Genre classification
1.4. Report Structure
2. Previous work
2.1. Text classification
2.2. Genre classification
3. Project description
3.1. Motivation
3.2. Aims
3.3. Methodology
3.4. Software and tools
4. Material and Methods
4.1. The New York Times corpus
4.2. The Penn Treebank Wall Street Journal corpus
4.3. Data analysis and visualization
4.3.1. Meta-data
4.3.2. Baseline genres
4.3.3. Genres and topics: Experiment one
4.3.4. Genres and topics: Experiment two
4.4. Pre-processing of data
4.4.1. Transforming contents
4.4.2. Creating data sets
5. Implementation and Classification
5.1. Karlgren & Cutting (1994)
5.2. Kessler, Nunberg & Schütze (1997)
5.3. Freund, Clarke & Toms (2006)
5.4. Ferizis & Bailey (2006)
6. Evaluation
6.1. Baseline experiment
6.2. The impact of style
6.3. The impact of topic
6.3.1. First experiment
6.3.2. Second experiment
7. Discussion
7.1. Conclusion of findings
7.2. Further work
Appendix A: Text samples
Appendix B: Confusion matrices
References


    1. Introduction

    Automated text classification has become a major subfield of data mining, which is being targeted

by many researchers and discussed eagerly in scientific literature. While the aim might be similar, the nature of the data differs greatly from that of many other classification tasks. There is a variety of

    characteristics and challenges very specific to the domain of text. This is for many reasons,

    including its heterogeneity and the size of the feature space (typically, even small corpora consist

    of tens or hundreds of thousands of words [1]).

    Unsurprisingly, the focus of researchers had initially been on distinguishing documents by their

    topics (e.g. [2][3]). However, text can also be categorized in other ways and often topical

    classification alone is not sufficient to match the requirements of users. In information retrieval for

example, a search query for a fairly unambiguous topical term like Box Jellyfish will return a whole

    range of different types of documents. They might include encyclopaedia articles, news reports and

    blog posts. Even if every single one of them is about the correct topic, only a subset will be

    relevant to the user. While it is surely possible to restrict this range by adding additional search

terms (e.g. Box Jellyfish Wikipedia), a much more elegant way would be to provide the user with a

    choice of document genres to filter results. This is where genre classification comes into play.

    The aim of the project described in this report is to assess and compare different approaches to

    genre classification. To set the scene, the definitions of genre, topic and writing style are discussed

    in the following sections. Furthermore, the characteristics and issues of classifying genres are

examined.

    1.1. What are genres?

    Like topics, genres provide a way to describe the nature of a text, which allows for assigning

    documents to groups. However, it is not trivial to even define what genres are. In scientific

literature, there is a wide range of descriptions and explanations. According to Biber [4], genres are

    characterized by external criteria. Part of this is the targeted audience, as laid out in an example

    taken from [5]: Press reports and even more so fiction are directed toward a more general audience

    than academic prose, while professional letters rely on shared interpersonal backgrounds between

    participants. Similar notions are suggested by Swales [6], who describes genres as classes of

    communicative events, which share communicative purposes.


The two key ideas, communicative purpose and shared targeted audience, appear often in the literature

    on genre. This definition implies that texts within a genre class may span over a wide range of

linguistic variation. The degree of variation, however, depends on how well constrained a genre is and how much freedom of personal expression it allows [4]. It also differs for genres which can

    contain a variety of topics (e.g. news articles about sport, politics, business etc.) and others that are

    more topic specific (e.g. obituaries).

A functional definition can also be found in the work of Kessler, Nunberg & Schütze [7]. In

    addition however, they suggest that genres should be defined narrowly enough so that texts within

    a genre class possess common structural or linguistic properties. Similarly, Karlgren suggests a

    definition based on both linguistic and functional characteristics. A combination of the targeted

    audience and stylistic consistency is used to describe genres [8]. These are fundamentally different

    views on genres, as they characterize them by internal criteria. Biber, too, examines linguistic

    properties in [4]. However, he distinguishes between genres (external criteria) and text types

    (internal criteria) and argues that they should be seen as independent categories.

    The combined external and internal view on genres is what was used for this project. They were

defined in the way Kessler, Nunberg & Schütze put it: "We will use the term genre here to refer to

    any widely recognized class of text defined by some common communicative purpose or other

    functional traits, provided the function is connected to some formal cues or commonalities and that

the class is extensible." [7]

    1.2. How is genre different from style and topic?

    Texts can be characterized in many different ways. Topic, writing style and genre are just three of

    them. Others include register, brow and language (e.g. French). Some of them might of course

depend on each other and correlate. As definitions differ, the borders between these characterizations are fairly fuzzy. The focus of this project is on genres, topics and writing styles.

    Register and brow are considered part of this. For example, a play by Shakespeare will probably

    be considered high-brow in comparison to an advertising text for a chocolate bar. Differences in

    language are also not examined in this project, as all the texts used are written in English.

The topic of a text is what it is about. Examples are countries, sports, financial matters or crocodiles,

    regardless of whether it is a song or an FAQ section of a website. These would be considered

    instances of genres. In theory, the concepts of topic and genre are orthogonal, i.e. a text of any

    given genre can be about any given topic. However, as mentioned by Karlgren & Cutting [9], co-


    variances exist as certain genre-topic combinations are far more likely than others. A text about

    dragons will usually be fiction rather than a news article. A poem is more likely to be about flowers

    than washing machines. While the difference between the terms topic and genre is fairly obvious,

in practice one can be used to draw inferences about the other. This fact has been accounted for only in a

    fraction of previous work on genre classification.

    Style and genre are not always seen as distinct concepts and the terms are often used

interchangeably. Freund, Clarke & Toms acknowledge that specific document templates [...] exist

    within different repositories, which may have an impact on genre classification [10]. However,

    this is only part of why one would want to consider the sources of texts used for this kind of task.

    The field of automated authorship attribution provides a strong motivation. Several studies have

    been conducted on classifying documents by their author (e.g. [11][12][13]). They all confirm that

    texts from different authors differ in terms of formal properties and can therefore be classified. This

    variety due to the origin of documents is what is referred to as different styles in this report.

    Strong evidence for distinguishing between genre and style comes from the work on genre and

    author classification by Stamatatos, Fakotakis & Kokkinakis [14]. They compare the two areas of

    research and use a set of 22 features to predict both authorship and genre. The authors also present

    the absolute t values of the linear regression functions for both tasks and for each feature. Each of

    them represents the usefulness of a given attribute for predicting either genres or authors. It is

    shown that some features help to predict style much more than genre and vice versa. This motivates

    regarding them as two distinct concepts with different effects on formal properties in a text.

    However, just like topic, style is orthogonal to genre only in theory. A text written by Jamie Oliver

    is more likely to be a recipe than a scientific article. Likewise, a poem is more likely to be written

    by Sir Walter Scott than Gordon Brown. Again, these correlations have not featured much in

    literature on genre classification.

    For the purpose of this project, the terms topic, genre and style are strictly distinguished. By topic,

    the subject of a text is meant. Genres are defined by a shared communicative purpose. Style is

    defined by authorship. This is not necessarily restricted to single persons, but can be extended to

    newspapers, companies or other institutions with binding style guides. Geographic regions and

    periods in time can also have their own styles. Both genres and styles are defined by shared formal

    properties as well. An example would be a letter about a trip to Inverness written by Robert Burns.

The trip would be considered the topic of the document. The letter is the genre and the text is written in the personal style of Robert Burns.


    1.3. Genre classification

    Generally speaking, automated genre classification is concerned with predicting the genre of an

unknown text correctly, independent of its topic, style or any other characteristic. This is a supervised learning task. It is done on the basis of annotated example texts, selected features of

    which are used to build and train a classification model. Like in other text classification tasks, the

    two main issues are how to represent the text as a set of features and what classification algorithm

    to choose.
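These two design decisions can be made concrete in a few lines. The sketch below is purely illustrative (scikit-learn was not one of the tools used in this project, and the toy texts are invented); it only shows how a feature representation and a classification algorithm plug together in a supervised genre classifier.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# The two design decisions map onto the two pipeline stages:
# (1) how to represent a text as features, (2) which algorithm to train.
model = Pipeline([
    ("features", CountVectorizer()),   # choice 1: feature representation
    ("classifier", LinearSVC()),       # choice 2: learning algorithm
])

train_texts = ["The minister announced new legislation today.",
               "To the editor: I strongly disagree with your article."]
train_genres = ["news", "letter"]

model.fit(train_texts, train_genres)   # train on annotated examples
print(model.predict(["The senate passed the bill yesterday."]))
```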

    Automatically classifying texts by genre can be useful in many different areas. Information

    retrieval might be the most obvious field of application. Genre filters help users to find relevant

documents faster. They are particularly interesting for professional users, who might be interested

    in a very specific type of document (e.g. scientific papers for researchers or legal texts for lawyers).

    Spam filters for e-mails can also benefit from genre classification. Users might choose not to

    receive certain categories of mails, like advertisements or automatically generated status messages.

    Similar filters could be applied to RSS feeds. Another application might be the automated

    annotation of text corpora. Trained classifiers would be able to assign genre tags to documents. As

    manually annotating large amounts of texts is very expensive, this would be highly interesting for

    anyone dealing with major document collections.

    This list is not exhaustive and many other areas that could benefit from genre classification have

been suggested. For example, Kessler, Nunberg & Schütze propose its application to support

    parsing, part of speech tagging and word-sense disambiguation [7]. The wealth of possible

    applications makes genre classification worth looking into.

    1.4. Report Structure

    In section 2, previous work on text classification and genre classification is discussed. Section 3

    covers the motivation for this project, as well as its aims and methodology. The data used in the

    process is described in section 4, along with a discussion of analysis and pre-processing steps.

    Section 5 deals with the algorithms which were re-implemented for the project. Evaluation results

    are presented in section 6. In section 7, conclusions of the findings as well as suggestions for

    further research are given.


    2. Previous work

This section gives an overview of the research that has been carried out on project-related topics. It

    is meant to introduce concepts and techniques, rather than giving lengthy descriptions of

    methodologies and research results. Studies which are particularly relevant to this project will be

    discussed in more detail later in this report. The previous work section is divided into two parts:

    Text classification as such and the more specific area of genre classification.

    2.1. Text classification

    As already stated, text classification is traditionally concerned with predicting topics, rather than

    genres. There is a broad range of literature in this field, which is why an overview of the main ideas

    will be given by introducing a subset of examples. Proper feature representation is crucial in any

    data mining task. This is especially true for text classification, as the choice might be less obvious

    than for other types of data. In scientific literature, it is commonly accepted that simple vectors

    with word counts yield very good results [15][16]. This type of feature set is also known as bag-of-

words representation. However, when it comes to classification algorithms, there is less consensus

    among researchers.
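To make the bag-of-words idea concrete, the sketch below (illustrative only, using scikit-learn rather than any tool from this project) shows how two toy documents become count vectors over a shared vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # one dimension per vocabulary word
print(X.toarray())                         # word counts for each document
```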

Traditionally, Naïve Bayes (NB) has been a very popular technique to classify text-based data. In a

    comparison of different classifiers and combinations of algorithms by Li & Jain [17], it is found

    that, in spite of the obvious incorrectness of the conditional independence assumption, NB

    performs reasonably well on text. The high dimensionality and the danger of overfitting are

    reported to be handled well by the classifier. Similar findings are reported in [18] and [19].
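The conditional independence assumption shows up directly in the NB decision rule: a class score is log P(c) plus the sum of log P(w|c) over the words, as if each word were drawn independently given the class. A self-contained toy version (invented data, add-one smoothing) is sketched below.

```python
import math
from collections import Counter

train = [("news",   "the minister announced new legislation"),
         ("news",   "the government passed the bill"),
         ("letter", "to the editor i strongly disagree"),
         ("letter", "i was outraged by your article")]

prior = Counter(label for label, _ in train)
word_counts = {label: Counter() for label in prior}
for label, text in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_score(text, label):
    # log P(c) + sum of log P(w|c); words treated as independent given c
    total = sum(word_counts[label].values())
    score = math.log(prior[label] / len(train))
    for w in text.split():
        # add-one smoothing so unseen words do not zero the product
        score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score

doc = "the editor disagreed with the minister"
print(max(prior, key=lambda c: log_score(doc, c)))
```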

    In the last 20 years, many researchers have proposed methods, which use Support Vector Machines

(SVM) for text classification. In [3], Joachims presents evidence that SVMs cope well with high-dimensional feature spaces and the sparseness of the data. The author argues that these classifiers

    do not require parameter tuning or feature selection to achieve high accuracy. In [20], several

    algorithms are compared on a text classification task by Yang and Liu. The findings include that

SVMs and k-Nearest-Neighbour (kNN) techniques outperform other methods significantly. Naïve Bayes was found to perform particularly poorly. Similar conclusions were drawn in [21] and [22].

    A number of other approaches are discussed as well. They include decision trees, neural networks

    and example based classifiers like kNN [2]. While all of them have interesting features and can

    produce good results, the majority of articles suggest the use of either NB or SVM methods.


    2.2. Genre classification

    Genre classification has been discussed for several decades. However, comparatively little work

has been devoted to this subfield of text classification. Kessler, Nunberg & Schütze [7] explain this with the fact that language corpora are often homogeneous with respect to the genres they contain.

    Similarly, Webber [23] found that previous research had ignored the variety of genres contained in

    the well-known Penn Treebank Wall Street Journal corpus, due to an apparent lack of meta-data

    indicating the type of each article.

    When it comes to genre classification, it can be said that the focus has been on an appropriate

    choice of features rather than classification algorithms. While the latter have been discussed, it

    seems that, at least for the time being, optimizing feature selection is crucial. The history of this

field in scientific literature is mainly divided into two types of approaches: linguistic analysis and term-frequency-based techniques [24]. Some examples of such research are presented in this

    section. Being by no means exhaustive, this list is meant to give a brief summary of the different

    methods that have been studied before.

    The 1994 study by Karlgren & Cutting [9] discusses a small and simple set of features for genre

    classification. It includes function word counts, word and character level statistics as well as part-

    of-speech (POS) frequencies. The authors use the Brown corpus and run three experiments, using

    different sets of genre classes. They range from two very broad genres (informative and

    imaginative texts) to 15 narrowly defined classes (e.g. press reviews and science fiction). Karlgren

    & Cutting use discriminant analysis to predict genres and suggest that their technique can be used

    in information retrieval applications. The impact of topical domain transfers is not examined,

    neither is the performance on a test set from a different source. In fact, tests are carried out on the

training data, which impairs the significance of their results.
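To give a flavour of how small and shallow this feature set is, the sketch below computes a handful of features of the same general kind. It is only illustrative: the exact inventory is specified in [9], the function words listed here are a sample, and the sentence splitting is deliberately naive.

```python
def karlgren_cutting_style_features(text):
    """Illustrative shallow features: word/character statistics plus a few
    function word frequencies. Not the exact feature set of [9]."""
    words = text.split()
    sentences = [s for s in text.split(".") if s.strip()]  # naive splitting
    features = {
        "chars_per_word": sum(len(w) for w in words) / len(words),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "long_word_ratio": sum(len(w) > 6 for w in words) / len(words),
    }
    for fw in ("it", "that", "which", "i"):       # sample function words only
        features["freq_" + fw] = sum(w.lower().strip(",.") == fw
                                     for w in words) / len(words)
    return features

print(karlgren_cutting_style_features(
    "It was a dark and stormy night. I read the report, which was long."))
```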

    The work of Wolters & Kirsten [25] examines the use of function word frequencies and POS tags

    to predict genres. The authors consider three different classifiers to distinguish between four

    defined genres. They also identify several topical domains. However, these are solely used for topic

classification rather than domain transfer experiments. Both training and testing were performed on

    documents taken from LIMAS, a German newspaper corpus. While texts in the LIMAS collection

    are gathered from 500 different sources, no effort is made to separate documents by their sources in

the training and test sets. The study therefore offers no insight into how the approach

    copes with stylistic differences.


Kessler, Nunberg & Schütze [7] suggest that genre classification can be useful in many different

    areas including computational linguistics and information retrieval. Four different types of features

    are discussed to predict genres in text. These are structural (e.g. POS frequencies, passives,

    nominalizations etc.), lexical (word frequencies), character level (punctuation and delimiter

    frequencies) and derivative (ratios and variation measures) cues gathered from the texts. As the

    first group requires parsed or tagged documents, it is not used for the experiments, which are

    conducted on the basis of the Brown corpus. Logistic regression and artificial neural networks are

    used to classify six distinct genres. The impact of different writing styles or topics is not

    considered.
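The character-level and derivative cue types can be illustrated with a short sketch; this is again illustrative only, as the actual cue inventory is given in [7].

```python
import statistics

def surface_cues(text):
    """Illustrative surface cues in the spirit of [7]: character-level
    (punctuation frequencies) and derivative (ratios, variation) measures."""
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    sentence_lengths = [len(s.split()) for s in sentences]
    return {
        # character-level cues
        "question_marks_per_word": text.count("?") / len(words),
        "commas_per_word": text.count(",") / len(words),
        # derivative cues
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "sentence_length_sd": statistics.pstdev(sentence_lengths),
    }

print(surface_cues("Why me? I do not know. Nobody knows, I suspect."))
```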

In [10], Freund, Clarke & Toms propose task-based genre classification to implement a search result

    filter in a workplace environment. The authors use a bag-of-words document representation of a

    data set comprised of 16 defined genres. The data used for the experiments was collected from the

    internet. The genres are classified using support vector machines. While the authors chose data

    from various sources (i.e. server domains) to avoid stylistic biases, no evaluation was carried out on

    their potential effect. Topical domains are not considered. In [26], a software package is presented,

    which, among other ideas, implements this algorithm.

    A similar approach is suggested by Stamatatos, Fakotakis & Kokkinakis [27]. However, unlike

Freund, Clarke & Toms, they do not use all of the words in the document collection. Instead, a fixed number of the most common words in the English language is used as the feature set. This number is

    varied to find the optimum in terms of classification accuracy. In addition to that, the authors also

    examine the impact of eight different punctuation mark frequencies to complement their feature set.

    They use discriminant analysis to predict the four genres previously identified in the Wall Street

    Journal corpus. Again, the impact of styles or topics is not considered.

In their 2001 study, Dewdney, VanEss-Dykema & MacMillan [28] compare a bag-of-words

    approach with a more elaborate set of features. They are interested in genre classification in the

    contexts of web search and spam filtering. An information gain algorithm is employed to reduce

    the number of features in the bag-of-words representation. The second feature set is a mixture of

    verb tense frequencies (past, present, future and transitions), content word frequencies, punctuation

    frequencies and statistics on character, word and sentence level. The utilized text corpus is

    comprised of seven genres. No distinction is made between sources (i.e. styles) or topics in the data

set. For classification, Naïve Bayes, Support Vector Machines and C4.5 decision trees are used

    and compared.
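The feature reduction step can be sketched as follows. Scikit-learn's mutual information scorer is used here as a stand-in for the information gain algorithm of [28] (information gain between a feature and the class label is the same quantity as their mutual information), and the toy data is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

texts = ["the minister announced new legislation",
         "the government passed the bill",
         "i disagree with your recent article",
         "i was outraged by the editorial"]
labels = ["news", "news", "letter", "letter"]

X = CountVectorizer().fit_transform(texts)

# keep only the k word features that are most informative about the class
selector = SelectKBest(mutual_info_classif, k=5)
X_reduced = selector.fit_transform(X, labels)

print(X.shape, "->", X_reduced.shape)
```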


    To combine the strengths of linguistic analysis and term frequency approaches, Ferizis & Bailey

    [24] suggest a method based on the approximation of crucial POS features. The approach is based

    on the work of Karlgren & Cutting [9]. The authors show that their algorithm achieves high

    accuracies on four selected genres while being computationally inexpensive compared to standard

    linguistic analysis methods. However, the data used was explicitly chosen from one source only

    and no attempts were made to evaluate the algorithm on test sets with different writing styles.

    Topical domain transfers are not examined either.
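The idea of approximating POS features cheaply can be illustrated as below; note that the suffix heuristics shown are hypothetical stand-ins, not the approximations actually specified in [24].

```python
def approximate_pos_frequencies(text):
    """Hypothetical illustration: estimate POS frequencies from cheap
    surface cues instead of running a full tagger."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    n = len(words)
    return {
        # crude stand-ins for tagger output; the real heuristics are in [24]
        "approx_adverb_freq": sum(w.endswith("ly") for w in words) / n,
        "approx_present_participle_freq":
            sum(w.endswith("ing") for w in words) / n,
    }

print(approximate_pos_frequencies(
    "He quickly left the room, loudly slamming the creaking door."))
```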


    3. Project description

    This section aims to outline the project carried out to assess different approaches to genre

    classification. It discusses the reasons for the work to be done, the aims and the expected outcomes.

An explanation of the project's methodology and chronology is provided. Furthermore, tools and

    techniques used in the process are discussed.

    3.1. Motivation

    As mentioned in section 2.2, several different algorithms have been proposed to classify genres in

texts. However, most of these methods are discussed in a very specific context (e.g. patent retrieval or workplace search engines for software engineers). This leads to very heterogeneous research foci.

    Some authors look for high recall values, others might favor precision. Likewise, sometimes only

    one of many genre classes is crucial to predict, while sometimes an overall good accuracy is aimed

    for.

    Furthermore, the classified data differs considerably from publication to publication. This is true in

terms of both content and format. As a result, the number and nature of the identified genres vary widely. Class distribution may or may not be skewed and genres are often defined in varying

    degrees of broadness. The reported classification results are therefore impossible to compare.

    Moreover, most articles do not take the impact of stylistic differences into consideration. Even

    where style is an acknowledged factor (e.g. [10]), no assessment is provided for its influence.

    Algorithms are evaluated on test sets that are either from the same source or from a different source

    than the training set, but never on both. This is why it is hard to see how well different methods

    cope with stylistic differences.

    The same is true for the impact of topicality. While it has been noted that genre dependent variation

    is not orthogonal to topical variation [9], classifiers are typically tested on documents from the

same topical domains they were trained on. It is therefore unknown how well these methods perform when tested on data sets with different topic distributions. Although Finn & Kushmerick have investigated this problem [29], they only took a very basic set of features into consideration. Thus, the question of how previously proposed algorithms compare in this respect is

    yet to be answered.


    3.2. Aims

    This project was meant to shed light on these very questions. Its aim was to construct a unified data

framework in order to assess and compare different approaches to genre classification. The desired evaluation was done in terms of classification accuracy, but also in terms of how well each method

    performs for different genres.

    In addition to this, the project aimed to answer the question of how well each approach can cope

    with a formerly unseen writing style, when genre was kept fixed. It was considered highly

    interesting to know whether a classifier that had been trained on documents from source A was able

to predict genres reliably in documents from source B. A related question was whether some genres were more affected by stylistic changes than others and, if so, how the different approaches coped with that. Finding out was another goal of this research.

    The third aim was to determine how well formerly proposed genre classification methods deal with

    topical domain transfers. Could a classifier predict genre 1 in a text about topic A, when it had only

    seen samples of genre 1 about topic B and samples of genre 2 about topic A? How did different

    approaches compare in such a situation?

    No new way to tackle the problems of predicting genres reliably was developed in this project.

    Therefore, it was not carried out to prove that any algorithm is particularly suited for genre

    classification. It aimed to be an unbiased and fair empirical comparison between approaches.

    3.3. Methodology

The answers to these questions could only be found by investigating the performance of genre

    classification methods. As source code was not provided for any of the proposed algorithms, re-

    implementation according to the specifications in the respective publications was necessary. The

    results could then be compared on the basis of a unified data framework.

    The first task was selecting a subset of algorithms to evaluate. The choice was partly motivated by

    the 2006 study of Finn & Kushmerick [29]. It discusses the usefulness of different ways of

    encoding document texts in genre classification problems. The authors focus on three types of

    simple feature sets. They include the bag-of-words method and part of speech tag frequencies. The

third set is referred to as text statistics and is made up of document-level attributes (e.g. number of words) as well as frequencies of function words (e.g. furthermore, probably) and punctuation


    symbols (e.g. question marks). More sophisticated attributes (e.g. vocabulary richness, sentences

    without verbs, standard deviations) are not examined.

    The study was seen as a good starting point. The approaches to be assessed were chosen so that all

of the mentioned feature sets were represented. However, genre classification methods typically do not rely on only one type of feature. For example, a classifier might make use of POS frequencies

    and function word statistics combined. Furthermore, text representations that go beyond the

    features in the Finn & Kushmerick study have been suggested. These two facts were embraced and

    seen as an extension to their work.

    Four approaches were selected for the purpose of this project:

- The groundbreaking work of Karlgren & Cutting [9]. This early approach uses a small set of features and discriminant analysis to predict genres. Most of the features would fall in either the POS frequencies or the text statistics categories proposed by Finn & Kushmerick.

- The method of Ferizis & Bailey [24], which is based on the Karlgren & Cutting algorithm. However, POS frequencies are approximated using heuristics.

- The approach suggested by Kessler, Nunberg & Schütze [7]. Using three different classification algorithms, genres are predicted based on surface cues. This partly corresponds to the text statistics mentioned by Finn & Kushmerick, but more sophisticated text characteristics are included as well.

- The bag-of-words-based approach by Freund, Clarke & Toms [10], which makes use of support vector machines to predict genre classes.

    Details about the approaches and their implementation can be found in section 5.

    The second task was to decide on a suitable experimental framework to test, assess and compare

    the algorithms on. It had to be constructed so that the evaluation could provide answers to the

    questions raised in section 3.2. To this end, a sensible selection of genre classes was required, as

was an appropriate split into training and test sets.

    The data available for the project was taken from the New York Times corpus and the Penn

    Treebank annotated Wall Street Journal corpus, both of which are described in more detail in


    section 4. The latter had been analyzed with respect to genres before by Webber [23] and four

    genres were identified, some of which comprised several lower level genres. They contained news

articles, letters and essays¹.

This set of classes was regarded as a sensible starting point for two reasons. Firstly, similar classes

    had been used in other experiments on genre classification before (e.g. [27][9][29]). Secondly,

    these genres occur in the New York Times corpus as well and Webber proposed a way to

    discriminate between them using meta-data. While this was to be further refined in the data

    analysis process, it provided a practical basis to start from.

    However, it had to be determined how appropriate news articles, letters and reviews were as a basis

for assessing approaches to genre classification. It was seen as important that the classes complied

with the definition of genres given in section 1: a shared communicative purpose and common

    formal properties. The external criteria were surely fulfilled merely by the fact that news articles,

    letters and reviews denote different sections of a newspaper. News articles are generally

    informative, neutral and formal. Reviews are formal as well, but they carry personal opinions and

often include recommendations. Letters are often addressed specifically to one person or a certain

    group and can be informal. Finding out whether they could be distinguished by formal properties

    when extracted from the New York Times corpus was another focus of the data analysis.

    To examine the impact of stylistic changes, texts with different writing styles were needed.

    Newspaper corpora are perfectly qualified for this task, as journalists are typically required to abide

    by rules laid down in newspaper specific style manuals. This is commonly referred to as house

    style. For both the New York Times and the Wall Street Journal such manuals exist (see sections

    4.1 and 4.2). House styles can and will of course change over time, which is reflected by different

    editions of style manuals. This had to be considered in the experimental design.

It was decided to run three experiments:

Firstly, the different classifiers were to be trained and tested on documents from the New York Times corpus. No different house style was desired for the two sets. Therefore, the

    texts had to be taken from time periods with the same style manual edition in place. This

    was seen as a baseline test.

¹ The Essay class will be referred to as Review for the purpose of this report, as this term is used in the New York Times corpus metadata (see section 4.3.2).


Secondly, the same training set as before was to be used, but the approaches were to be tested on New York Times texts from a different period. It was to be ensured that a

    different edition of the style manual was valid for documents in the test set. The difference

    in style was expected to be rather small, yet noticeable in evaluation results.

Thirdly, the algorithms were to be tested on documents taken from the Wall Street Journal corpus, while the training set remained unchanged again. This was done so that the

    classifiers were evaluated on texts with a formerly unseen house style. It was anticipated

    that the stylistic difference between training and test set was more substantial than in the

    second experiment.

These experiments were expected to answer the question of how the different approaches cope with a

    new style. Moreover, the setup provided further justification for the choice of genre classes. While

    news articles and reviews are written by journalists, letters are not. Therefore, the authors do not

    have to stick to stylistic guidelines. It was anticipated that this fact would have a strong effect and

    could be observed in the analysis of the classification results.

    Another question to be answered was how algorithms compare when faced with domain transfers.

    To this end, topics had to be identified in the texts used for classification. Details can be found in

    the section on data analysis (4.3). Again, it was considered preferable to carry out more than one

    experiment. This is why two different approaches were chosen.

    The first experiment was to be similar to the one described in the work of Finn & Kushmerick [29].

    Therefore, only two genre classes were to be predicted and two fairly distinct topics were required.

    However, unlike in the Finn & Kushmerick experiments, the genres were not just simply to be

tested on a different topical domain. Instead, the experiment was to be split into two parts: blended and paired sets of genres and topics.

    The former were meant to be used as a baseline to compare results of the latter to. Both the training

set and the test set were designed to comprise a mix of both genres and both topics in all four

    combinations. For the paired sets, the topicality of the texts belonging to the two genres was

    designed to be opposite in the training set. In the test set, this selection was to be inverted, so that

    documents in each genre class were about different topics than before. Both setups are illustrated in

    Table 1.


Blended training set:
            Genre 1   Genre 2
  Topic A      X         X
  Topic B      X         X

Blended test set:
            Genre 1   Genre 2
  Topic A      X         X
  Topic B      X         X

Paired training set:
            Genre 1   Genre 2
  Topic A      X
  Topic B                X

Paired test set (pairing inverted):
            Genre 1   Genre 2
  Topic A                X
  Topic B      X

Table 1: Documents in training and test sets for the first experiment to examine the impact of topics. An X marks documents that are included in the set.

    In contrast to the work of Finn & Kushmerick, topics were to be used to actively confuse the

    classifier. A classification algorithm based on topic was expected to fail completely on the reverse

    paired test set (with a near-zero accuracy), whereas the performance of a good genre classifier

    would not drop substantially between the blended and the paired test sets. This framework was

    found suitable to find out which approaches actually predict genres and which of them make use of

    genre-topic correlations.
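A sketch of how such paired sets can be assembled, assuming each document carries a genre and a topic label (the record format and names here are illustrative, not taken from the project code):

```python
def paired_split(docs):
    """docs: list of (text, genre, topic) tuples with genres "g1"/"g2"
    and topics "A"/"B". Training pairs g1 with A and g2 with B; the test
    set inverts the pairing, so topic cues actively mislead a classifier
    that has learned topic rather than genre."""
    train = [d for d in docs if (d[1], d[2]) in {("g1", "A"), ("g2", "B")}]
    test  = [d for d in docs if (d[1], d[2]) in {("g1", "B"), ("g2", "A")}]
    return train, test

docs = [("some text ...", "g1", "A"), ("some text ...", "g1", "B"),
        ("some text ...", "g2", "A"), ("some text ...", "g2", "B")]
train, test = paired_split(docs)
print(len(train), len(test))  # 2 2
```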

A similar technique was developed for the second experiment. However, it was designed as a three-class problem, using all the genres from the baseline experiment. Two things were to be different

    from before. The topics were designed to be much broader than those in the first experiment. It was

    expected that this would have an impact on classification accuracies. Also, for one of the genre

    classes no topical selection was to be done, i.e. it was to be represented no differently in the

    blended and paired data sets. Table 2 shows this graphically.

Blended training set:
            Genre 1   Genre 2   Genre 3
  Topic A      X         X         X
  Topic B      X         X         X

Blended test set:
            Genre 1   Genre 2   Genre 3
  Topic A      X         X         X
  Topic B      X         X         X

Paired training set:
            Genre 1   Genre 2   Genre 3
  Topic A      X                   X
  Topic B                X         X

Paired test set (pairing inverted for Genres 1 and 2; Genre 3 unchanged):
            Genre 1   Genre 2   Genre 3
  Topic A                X         X
  Topic B      X                   X

Table 2: Documents in training and test sets for the second experiment to examine the impact of topics. An X marks documents that are included in the set.

    It was considered interesting to find out whether the domain transfer in the first two genre classes

    has a negative (or positive) impact on the unchanged genre in terms of correct predictions. That

    was the reason for including a third class without changing the topics of its documents.


All evaluation for this project was to be done in terms of classification accuracy. However, as precision and recall values for single classes can hold valuable information, they had to be computed and compared as well. Also, confusion matrices were seen as a tool to clarify certain questions, e.g. whether or not letters are affected by a change in house style.

[Figure 1: Timeframe of the project.]

The project started in mid-May 2009 and took three months to complete. The separate tasks and the timeframe are illustrated in Figure 1. The initial design phase included preparatory work, the selection of algorithms to assess and the general outline of the project. It was covered in this section. The data analysis, implementation and evaluation phases are self-explanatory and are discussed in the following three sections.

3.4. Software and tools

The extraction of features for all assessed approaches was implemented in Java [30] using the open source development environment Eclipse [31]. Several other tools were used for a variety of tasks. They comprised:

- CRF Tagger [32] for part-of-speech tagging,
- Sentence Detector [33] for breaking texts into sentences,
- StatistiXL [34] in combination with Microsoft Excel for discriminant analysis,
- SVMmulticlass [35] for support vector machine classification,
- Weka [36] for classification as well as computation of information gain,
- MATLAB [37] for general calculations and computations of confidence intervals.
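For readers without access to the tools above, the same two preprocessing steps (sentence detection and POS tagging) can be reproduced with NLTK. This is merely an illustrative stand-in, not the tooling used in the project.

```python
import nltk

# one-off resource downloads for the tokenizer and tagger models
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Genres are hard to pin down. Topics are somewhat easier."

sentences = nltk.sent_tokenize(text)                       # sentence detection
pos_tags = nltk.pos_tag(nltk.word_tokenize(sentences[0]))  # POS tagging

print(sentences)
print(pos_tags)  # e.g. [('Genres', 'NNS'), ('are', 'VBP'), ...]
```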


    4. Material and Methods

    This section deals with the data involved in the project. Two newspaper corpora were used as a

    basis to assess genre classification algorithms: The New York Times (NYT) corpus and the Penn

    Treebank Wall Street Journal (WSJ) corpus. They are described in detail in sections 4.1 and 4.2

    respectively. Section 4.3 covers the data analysis and visualization, while pre-processing steps and

    data set generation are discussed in section 4.4.

    4.1. The New York Times corpus

    The NYT corpus was recently published and contains over 1.8 million documents comprising

    roughly 1.1 billion words [38] and covering a time period ranging from 01/01/1987 to 19/06/2007.

These documents are provided in XML format and conform to the News Industry Text Format

specification (see [39] for details). The directory structure is divided into years, months and days of

    publication and every document has a unique number as file name, ranging from 0000000.xml to

    1855670.xml. In addition to the textual content, they contain various tags and meta-data like dates,

    locations, authors and topical descriptors [40]. There are up to 48 data fields assigned to each

    document, many of which can take multiple values. The text contents of NYT corpus documents

    are not annotated with linguistic meta-data (e.g. part-of-speech tags).

    The articles written by NYT journalists conform to the stylistic guidelines laid down by the New

    York Times Manual of Style and Usage by Siegal & Connolly [41]. However, this manual has been

    revised several times and there are three different editions in existence for the relevant period

    between 1987 and 2007. The current edition was introduced in 1999 and last updated in 2002.

    Before 1999, [42] by Jordan was the NYT style manual. Therefore, only documents created

    between 01/01/1987 and 31/12/1998 (referred to as NYT 87-98 from now on), as well as

documents published after 31/12/2002 (NYT 03-07) were considered for the purpose of this project. This corresponds to the style dictated by [42] and [41] respectively.

    The NYT corpus includes Java software interfaces. They were used in this project to access the

    contents of the files.


    4.2. The Penn Treebank Wall Street Journal corpus

    The annotated Penn Treebank [43][44] WSJ corpus was released in 1995 and comprises 2,499 text

documents with a million words in total. The documents are grouped into 25 directories, containing 99 or 100 files each. Their text contents are available in raw (text only), parsed and POS-tagged

    versions. Apart from these linguistic analyses, no meta-data is provided in the corpus.

    The style guide for WSJ journalists is [45] by Martin. Articles have been written according to

stylistic rules laid out by the same author since 1981, even though the guide has been published for the public only recently. As all the documents in the WSJ corpus were created in 1989, it was assumed

    that the same edition had been valid for all of them. Therefore, there was no need to split up the

    data set in order to reflect stylistic differences.

    4.3. Data analysis and visualization

    In order to get an overview of the documents and the assigned meta-data, an extensive analysis of

    the NYT corpus was carried out. This included both manual inspections and automatic readouts of

    features. The aims were to find out about genres and topics within the corpus and to identify ways

    to separate them. It was decided that NYT 87-98 documents were the only ones to be used for

    classifier training (cf. section 3.3). Therefore, all analyses were performed on this collection. NYT

    03-07 and WSJ documents were treated as unknown test data and not examined further.

    4.3.1. Meta-data

    While there is no explicit meta-data tag for the genre of an article, an array of fields was found to

    be particularly useful for the purpose of this project. An example is the tag Taxonomic Classifier,

    which places a document into a hierarchy of articles [40]. This is a structured mixture of genres and

    topics. A document can be classified in several such hierarchies. Throughout the corpus, 99.5% of

    documents contain this field, with an average of 4.5 taxonomic classifiers assigned to each article.

    Examples include:

    Top/Features/Travel/Guides/Destinations/Europe/Turkey

    Top/News/Business/Markets

    Top/News/Sports/Hockey/National Hockey League/Florida Panthers

    Top/Opinion/Opinion/Letters

    Another valuable field is the Types of Material tag. It specifies the editorial category of the article,

    which in some cases corresponds to the definition of genre used for this project. In total, 41.5% of


    the documents in the corpus have a Type of Material tag assigned to them [40]. The values are

    typically exclusive, even though a negligible amount of documents with more than one tag exists.

    There is no fixed set of values or hierarchy as there is for the taxonomic classifiers. Also, the Type

    of Material fields often contain errors, misspellings or very specific information about an article.

    Examples include:

    Obituary

    Letter

    Letterletter

    Editorial photo of homeless person

    For the purpose of topic detection, the field General Online Descriptors was found to hold accurate

    and unified information. The topicality of an article is described in different degrees of broadness

(e.g. Religion and Churches would be a higher-level category than Christians and Christianity). In

    the corpus, 79.7% of documents contain an average of 3.3 General Online Descriptors [40].

    Examples include:

    Elections

    Children and Youth

    Politics and Government

    Attacks on Police

    Various other fields were examined but found to be less useful for the purpose of distinguishing

    documents by their genre or topic.

    4.3.2. Baseline genres

As explained in section 3.3, documents belonging to the categories News, Letter and Review were

    to be separated for both the baseline experiments and the investigation into the impact of style. As

    there is no News tag in the Types of Material field, the Taxonomic Classifier field was used to

    identify these categories as follows:

    News

    Taxonomic classifier begins with Top/News excluding

    Top/News/Obituaries

Top/News/Correction
Top/News/Editors Notes


    Review

    Taxonomic classifier is one of the following

    Top/Opinion/Opinion/Editorials

Top/Opinion/Opinion/Op-Ed
Top/Opinion/Opinion/Op-Ed/***
Top/Features/***/Columns
Top/Features/***/Columns/***
Top/Features/***/Reviews
Top/Features/***/Reviews/***

    where *** can be anything, including several sub-hierarchies.

    Letter

    Taxonomic classifier is Top/Opinion/Opinion/Letters

    This conforms to the categorization of documents made by Webber in [23] and therefore

    corresponding classes have been identified in the WSJ corpus. They were separated and made

    available for this project by the author. As most documents are assigned to several taxonomic

    classifiers, it is possible that an article falls into two or all three groups. Such documents were

    ignored, i.e. not used for classification.

    In order to refine the identified classes further, the distribution of Types of Material tags was

    computed for each of the three categories. Note that the percentages do not necessarily add up to

    100 %, as there are documents which contain more than one Types of Material tag.

News:
  No Types of Material tag    76.6 %
  Biography                    5.5 %
  Summary                      3.4 %
  Correction                   2.9 %
  Obituary                     2.9 %
  Letter                       2.4 %
  Others                       6.7 %

Review:
  Review                      66.7 %
  Editorial                   14.2 %
  Op-Ed                       13.7 %
  No Types of Material tag     5.1 %
  Question                     0.7 %
  Biography                    0.1 %
  Chronology                   0.1 %

Letter:
  Letter                     100.0 %

This indicates that, for the News class in particular, a selection by taxonomic classifiers alone is not

    sufficient. It was decided to use the Types of Material field as an additional filter. Documents were

    only classified as news articles if they fulfilled both the criteria mentioned above and contained no

    Types of Material tag. For the Review class, only documents which were tagged Review,

    Editorial or Op-Ed were taken into consideration. No additional constraints were required for

the Letter class. The appropriateness of the remaining documents with respect to the requirements

    mentioned in section 3.3 was verified manually by taking samples.
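The combined selection rules can be summarized in code. The sketch below follows the criteria above, but the record format is assumed and the Review prefix check is simplified (the full rule restricts Top/Features/... to the Columns and Reviews sub-hierarchies).

```python
NEWS_EXCLUDED = ("Top/News/Obituaries", "Top/News/Correction",
                 "Top/News/Editors Notes")

def is_news(taxonomic, material):
    # Top/News prefix, none of the excluded branches, no Types of Material tag
    return (any(t.startswith("Top/News") and not t.startswith(NEWS_EXCLUDED)
                for t in taxonomic)
            and not material)

def is_review(taxonomic, material):
    # simplified: the full rule only admits .../Columns and .../Reviews
    prefixes = ("Top/Opinion/Opinion/Editorials", "Top/Opinion/Opinion/Op-Ed",
                "Top/Features/")
    return (any(t.startswith(prefixes) for t in taxonomic)
            and any(m in ("Review", "Editorial", "Op-Ed") for m in material))

def is_letter(taxonomic, material):
    return "Top/Opinion/Opinion/Letters" in taxonomic

def assign_genre(taxonomic, material):
    hits = [genre for genre, match in
            [("News", is_news(taxonomic, material)),
             ("Review", is_review(taxonomic, material)),
             ("Letter", is_letter(taxonomic, material))] if match]
    return hits[0] if len(hits) == 1 else None   # ambiguous documents ignored

print(assign_genre(["Top/Opinion/Opinion/Letters"], ["Letter"]))  # Letter
```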


News (1049130.xml):
"The Iranian Foreign Minister publicly divorced his Government today from the death threat imposed on the British author Salman Rushdie in 1989 by Ayatollah Ruhollah Khomeini, and Britain responded by restoring full diplomatic relations."

Letter (0724951.xml):
"Outraged, I yelled at the hunter that he had nearly hit me, but he denied that he was close."

Review (0722765.xml):
"Mr. Lautenberg was mistaken in voting against the deficit-reduction plan last year, but his overall record is sound."

Table 3: Example sentences taken from the NYT 87-98 data set.

    Table 3 illustrates the type of texts contained in each of the three categories. Further examples of

    the textual content for each class can be found in Appendix A of this report.

    To gain insight into the internal distinctiveness of the three genres, a collection of news articles,

    reviews and letters was extracted using the identification criteria mentioned above. It contained

    1,000 documents of each class. Some simple properties were computed from the texts and

    averaged. They were chosen so that they include both structural and linguistic features.

Property                        News                 Review               Letter
Mean word count                 635                  626                  216
Mean question mark frequency    0.07 per 100 words   0.24 per 100 words   0.20 per 100 words
Mean adverb frequency           3.4 per 100 words    4.5 per 100 words    3.9 per 100 words

Table 4: Averaged properties for texts belonging to the classes News, Review and Letter.

    The results are illustrated in Table 4. The numbers indicate that texts from each of the genre classes

    indeed share formal properties distinct from other genres. Therefore, the genre framework was

    accepted as suitable for the task.
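Properties of this kind are straightforward to recompute. The sketch below derives the three Table 4 measures for a single text, with NLTK's tagger standing in for the project's CRF Tagger (the adverb tags RB/RBR/RBS follow the Penn Treebank tag set).

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def table4_properties(text):
    words = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(words)]
    per_100_words = 100.0 / len(words)
    return {
        "word_count": len(words),
        "question_marks_per_100": text.count("?") * per_100_words,
        "adverbs_per_100": sum(t.startswith("RB") for t in tags) * per_100_words,
    }

print(table4_properties("He spoke slowly. Why was he there? Nobody really knew."))
```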

    4.3.3. Genres and topics: Experiment one

To examine the impact of topic, two independent experiments were carried out. For the first one, only two classes were required. It was decided to use news articles and letters, as intuition suggested they were more distinct from each other than either News/Review or Review/Letter. To find appropriate topics, the distribution of the General Online Descriptor field was computed. All documents in the NYT 87-98 set that had been classified as either news article or letter were used


for this task. The aim was to find topics that were fairly specific and distinct from each other, yet broad enough to provide a sufficient number of documents for classification.

In news articles, the 10 most common General Online Descriptor values were:

    1. Finances

    2. Politics and Government

    3. United States International Relations

    4. United States Politics and Government

    5. Baseball

    6. Medicine and Health

    7. Armament, Defense and Military Forces

    8. International Relations

    9. Stocks and Bonds

    10. Mergers, Acquisitions and Divestitures

In letters, the 10 most common General Online Descriptor values were:

    1. Politics and Government

    2. Medicine and Health

    3. United States International Relations

    4. United States Politics and Government

    5. Finances

    6. Travel and Vacations

    7. Education and Schools

    8. International Relations

    9. Law and Legislation

    10. Armament, Defense and Military Forces

A choice was made to use the tags Medicine and Health (referred to as Health from now on) and Armament, Defense and Military Forces (referred to as Defense from now on), as they fulfill both requirements stated above. Of all the news articles in the NYT 87-98 data set that are about health or defense, only 0.6 % are about both. In the letters class, this is true for 0.4 % of documents. While no documents with overlapping topics were used for classification, these numbers indicate that health and defense are very distinct topics. This is important, as it makes classification results more meaningful and provides a strong contrast to the experimental setup described in section 4.3.4.

The identification of news articles and letters was the same as explained in section 4.3.2. However, in addition to this selection by Taxonomic classifier and Type of Material values, topics were identified using the General Online Descriptor field.


Health News (0428321.xml): "Ethicists and experts on the issue said that Diane's case starkly contrasted with those of two other well-publicized and controversial doctor-assisted suicides."

Defense News (0617951.xml): "General Powell said the attack had used 23 Tomahawk guided cruise missiles fired from two ships, one in the Persian Gulf and the other in the Red Sea."

Health Letter (0076130.xml): "She checked with a lung specialist who told me that I would be subject to pulmonary edema, and that the best treatment is to go to a lower altitude."

Defense Letter (0942205.xml): "Why this disparity between responsible fiscal concern by State and Treasury and an opportunistic hawking of wares by the Defense Department and its industry pals?"

Table 5: Example sentences taken from the NYT 87-98 data set.

Again, examples were manually surveyed to confirm that the texts met expectations with respect to their topics and genres. Table 5 shows sentences taken from each of the four topic-genre combinations. Complete document texts are presented in Appendix A of this report.

    4.3.4. Genres and topics: Experiment two

    For the second experiment to investigate the impact of topic, three genre classes were needed. Two

    of them were to be divided into two topical groups. The idea was to use very broad genres and

    topics to simulate a very hard classification problem. The third genre class was not to be divided

    into topical groups. It was included to examine how precision and recall values would differ for a

    class with constant topic distribution.

Based on the findings of the meta-data survey described in section 4.3.1, it was decided to use the Taxonomic classifier field for both genre and topic separation. The genres were the same as those explained in section 4.3.2. However, reviews were now required to be classified as travel guides (see below). This was done because topical categories can be separated neatly using the Taxonomic classifier field. The other genre that was divided into topics was the News class. For both news articles and reviews, documents which were either about the U.S. or about the rest of the world (excluding the U.S.) could be identified using the scheme below. Letters were not divided into topics.


    U.S. News

    Taxonomic classifier begins with one of the following

    Top/News/World/Countries and Territories/United States/

    Top/News/U.S.

    No Type of Material tag is assigned.

    Non-U.S. News

    Taxonomic classifier begins with one of the following

    Top/News/World/Africa

    Top/News/World/Asia Pacific

    Top/News/World/Europe

    Top/News/World/Middle East

    Top/News/World/Countries and Territories/

    excluding

    Top/News/World/Countries and Territories/United States/

    No Type of Material tag is assigned.

    U.S. Review

    Taxonomic classifier begins with

    Top/Features/Travel/Guides/Destinations/North America/United States

    Type of Material is Review, Editorial or Op-Ed.

    Non-U.S. Review

    Taxonomic classifier begins with

    Top/Features/Travel/Guides/Destinations/

    excluding

    Top/Features/Travel/Guides/Destinations/North America/United States

    Type of Material is Review, Editorial or Op-Ed.

    Letter

    Taxonomic classifier is Top/Opinion/Opinion/Letters

As in all other experiments, documents that could be assigned to more than one genre class were ignored. The same applied to news articles and reviews that were about both U.S. and non-U.S. topics (e.g. a report on the relations between the USA and France).
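The following minimal sketch illustrates the prefix matching for the News class, including the exclusion rule; the prefixes follow the scheme above, while the class and method names are illustrative:

    /** Sketch of the prefix-based topic split for the News class. Each method
     *  tests a single Taxonomic classifier string; a document was assigned to
     *  a topic based on all of its classifiers, and ambiguous documents were
     *  dropped. */
    public class TopicSplit {

        static final String US_COUNTRY_PREFIX =
                "Top/News/World/Countries and Territories/United States/";

        static boolean isUsNews(String classifier) {
            return classifier.startsWith(US_COUNTRY_PREFIX)
                || classifier.startsWith("Top/News/U.S.");
        }

        static boolean isNonUsNews(String classifier) {
            return (classifier.startsWith("Top/News/World/Africa")
                 || classifier.startsWith("Top/News/World/Asia Pacific")
                 || classifier.startsWith("Top/News/World/Europe")
                 || classifier.startsWith("Top/News/World/Middle East")
                 || classifier.startsWith("Top/News/World/Countries and Territories/"))
                && !classifier.startsWith(US_COUNTRY_PREFIX);
        }
    }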


U.S. News (0434996.xml): "The draft of a proposal to prevent patients from being infected with the virus is less restrictive than earlier recommendations from the American Medical Association, the American Dental Association, and the American Academy of Orthopedic Surgery."

Non-U.S. News (0157534.xml): "The perils of Jimmy Connors and Ivan Lendl have dominated Wimbledon thus far, relegating the most recent champions, Boris Becker and Pat Cash, to unaccustomed supporting roles."

U.S. Review (1065268.xml): "With its bustle and clatter, its shared tables and its chefs behind steaming cauldrons of soup, New York Noodletown is as close as you can get to Hong Kong in Manhattan."

Non-U.S. Review (0630627.xml): "Had the Islamic movement been allowed to assume parliamentary power, would it have been any less repressive, or more competent, than the army?"

Letter (0356340.xml): "Because of a gravitational pull toward badness, mistakenly known as mediocrity, that begins with peer pressure and culminates in the kind of bureaucratic obstacles that can stop brilliant students in their tracks for good."

Table 6: Example sentences taken from the NYT 87-98 data set.

    The distinction between U.S. and non-U.S. documents fulfilled the requirement of very broad

    topical categories. Table 6 contains examples taken from each of the five different categories.

    Further examples of the textual content for each class can be found in Appendix A of this report.

    4.4. Pre-processing of data

    In order to properly assess classification algorithms, the data had to be adapted to set the scene for

    further processing. Utilizing the insights gained through the data analysis described in section 4.3,

    documents from the NYT and WSJ corpora were pre-processed. This included extracting and

    manipulating textual contents as well as splitting up the data into training and test sets. Both of

    these processes are described in this section.

    4.4.1. Transforming contents

As already mentioned, NYT documents are provided in XML format. To extract the actual texts from the files, the Java interface included in the NYT corpus was used. It provides a simple way to read individual fields from a document. Looking at the results, it was found that in many cases the


    lead paragraph had been automatically added to the text content. This led to redundant sentences,

    as illustrated below (sample taken from document 0000702.xml).

    LEAD: New York City won its three-year fare freeze in Albany last week,
    though from downstate the ice looked a little mushy.

    New York City won its three-year fare freeze in Albany last week, though
    from downstate the ice looked a little mushy.

    The Legislature voted [...]

Therefore, any initial paragraph starting with "LEAD:" was removed before further processing. Another observation was that 99.7 % of the extracted letters started with the paragraph "To the Editor:", which would have made automatic recognition of this class a trivial task. Furthermore, this particular salutation is not necessarily included in letter texts of other corpora. Consequently, it was stripped off as well.
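A minimal sketch of these two clean-up steps is given below; it assumes paragraphs are separated by blank lines, which simplifies the actual XML handling:

    import java.util.Arrays;

    /** Sketch of the two clean-up steps. Paragraph splitting is simplified
     *  to blank-line separation, which differs from the real XML handling. */
    public class LeadStripper {

        static String clean(String text) {
            String[] paragraphs = text.split("\n\n");
            int start = 0;
            if (paragraphs[0].startsWith("LEAD:")) {
                start = 1;                      // drop the redundant lead paragraph
            }
            if (start < paragraphs.length
                    && paragraphs[start].startsWith("To the Editor:")) {
                start++;                        // drop the give-away salutation
            }
            return String.join("\n\n",
                    Arrays.copyOfRange(paragraphs, start, paragraphs.length));
        }
    }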

Texts in the NYT corpus have delimiters between paragraphs (<p> and </p> tags). However, sentences within a paragraph are not delimited. As some of the algorithms use sentence-based

    sentences within a paragraph are not delimited. As some of the algorithms use sentence-based

    features, it was necessary to break the texts into sentences. The Sentence Detector tool developed

    by the National Centre for Text Mining was found to be very accurate. The Java API is available

    from [33].

    As already mentioned in section 4.1, there are no part-of-speech (POS) tags assigned to words in

    the NYT corpus. However, some of the algorithms that were to be assessed make use of such

information. Therefore, each of the extracted texts had to be POS tagged. For this task, a Java-based open-source tool called CRF Tagger [32] was used. It makes use of a conditional random field toolkit (hence the name). The model used for POS tagging had been trained and tested on the WSJ data set by the authors and achieved an accuracy of 97.0 % [32].

In order for CRF Tagger to work properly, the texts had to be cleaned beforehand. It was found that the software had problems assigning correct tags to special characters that were not common punctuation. Therefore, such characters were removed.

For each document in the NYT corpus, four versions were kept:

- The original XML file containing the raw text and all meta-data
- The extracted text with each sentence in a separate line
- The version of the text without special characters
- The text annotated with POS tags


Less effort was required for pre-processing the WSJ documents. They were already provided as raw text, with one sentence per line. Furthermore, versions with assigned POS tags existed. However, as found by Webber in [23], some of the letter documents actually contained several concatenated letters. As this might have had an effect on classification results, these documents were shortened manually: only the first letter was kept. It was also found that some documents start with a line containing only .START. Like the LEAD paragraph in NYT texts, this line was removed.

    4.4.2. Creating data sets

After the texts had been cleaned and prepared for feature extraction, they were separated into balanced training and test sets, i.e. each class was represented by the same number of documents. For the blended training and test sets of experiments C and D, the distribution of topics was balanced as well. Other than that, all assignments were done pseudo-randomly; a sketch of this assignment is given at the end of this section. The final sets consisted of the following documents:

Experimental setup A (Baseline)

    Training NYT 87-98: 6,000 files (2,000 news, 2,000 letters, 2,000 reviews)
    Test NYT 87-98: 3,000 files (1,000 news, 1,000 letters, 1,000 reviews)

Experimental setup B (Style)

    Training NYT 87-98: Same files as above
    Test NYT 03-07: 3,000 files (1,000 news, 1,000 letters, 1,000 reviews)
    Test WSJ: 162 files (54 news, 54 letters, 54 reviews)

    As all the articles from the WSJ corpus were published in 1989, they fall into the time range used

    in the training set. Only 54 letters could be identified in the WSJ corpus. Therefore, 54 news and 54

    reviews were chosen pseudo-randomly.

Experimental setup C (Topic)

    Training Blended: 2,000 files (500 health news, 500 defense news, 500 health letters, 500 defense letters)
    Test Blended: 2,000 files (500 health news, 500 defense news, 500 health letters, 500 defense letters)
    Training Paired: 2,000 files (1,000 defense news, 1,000 health letters)
    Test Paired: 2,000 files (1,000 health news, 1,000 defense letters)

Experimental setup D (Topic)

    Training Blended: 3,000 files (500 U.S. news, 500 non-U.S. news, 500 U.S. reviews, 500 non-U.S. reviews, 1,000 letters)
    Test Blended: 3,000 files (500 U.S. news, 500 non-U.S. news, 500 U.S. reviews, 500 non-U.S. reviews, 1,000 letters)
    Training Paired: 3,000 files (1,000 U.S. news, 1,000 non-U.S. reviews, 1,000 letters)
    Test Paired: 3,000 files (1,000 non-U.S. news, 1,000 U.S. reviews, 1,000 letters)


In terms of project aims, setup A was compiled to find out how the approaches compare in general. Setup B was meant to detect how well classifiers cope with formerly unseen styles. Setups C and D were created to examine and compare their domain transfer abilities.

    No separate validation sets were required for this project. This is because the aim was not to

    optimize feature compositions, choice of algorithms or parameter settings but rather to re-

    implement and assess specified methods. They were trained and tested as suggested in the

    respective publications.
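The following is a minimal sketch of the balanced pseudo-random assignment; the fixed seed and all names are assumptions rather than details taken from the project code:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    /** Sketch of the balanced pseudo-random split. For each class, the list of
     *  document identifiers is shuffled with a fixed seed; the first nTrain
     *  items form the training portion, the next nTest items the test portion. */
    public class BalancedSplit {

        static void split(List<String> docsOfOneClass, int nTrain, int nTest,
                          List<String> train, List<String> test) {
            List<String> copy = new ArrayList<>(docsOfOneClass);
            Collections.shuffle(copy, new Random(42));   // fixed seed, reproducible
            train.addAll(copy.subList(0, nTrain));
            test.addAll(copy.subList(nTrain, nTrain + nTest));
        }
    }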


    5. Implementation and Classification

This section covers the creation of the document representations on the basis of the data sets described in section 4.4. It also discusses the various classification methods used. This was done according to the ideas proposed in four different publications on genre classification, with publication dates ranging from 1994 to 2006. The aim was to stick to the authors' specifications as closely as possible; deviations are explained where they were necessary.

    5.1. Karlgren & Cutting (1994)

    The algorithm proposed in [9] is one of the earliest approaches to automatic genre classification

    and has been widely referenced in scientific literature on this topic (e.g. [7][25][24][29][28]).

    Methods and results have often been compared to the ones presented by Karlgren & Cutting.

    Therefore, including the algorithm in the test framework of this project seemed reasonable.

    The authors identify 20 features, which include counts of POS tags (e.g. adverbs) and certain

    function words (e.g. therefore) as well as ratios of word- and character-level features (e.g.

    type/token ratio). They employ discriminant analysis to predict genre classes on the basis of this

    feature set. The standard software SPSS is used for classification. However, no distinction is made

    between training and testing data. Thus, results obtained from tests on the training set are reported.

For the purpose of this project, all 20 features were extracted from the documents. Karlgren and Cutting base their experiments on data taken from the Brown Corpus of Present-Day American English [46]. All texts in the Brown corpus are approximately 2,000 words long. As the texts used in this project vary in length, all counts were normalized by the number of words in a document and multiplied by a factor of 2,000.
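This normalization amounts to rescaling each raw count to a virtual 2,000-word document, as the following minimal sketch illustrates:

    public class LengthNormalizer {

        /** Rescales a raw feature count to the 2,000-word Brown sample length. */
        static double normalize(int rawCount, int wordCount) {
            return rawCount / (double) wordCount * 2000.0;
        }

        public static void main(String[] args) {
            // e.g. 12 adverbs in a 400-word letter correspond to 60 per 2,000 words
            System.out.println(normalize(12, 400));
        }
    }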

As SPSS is not freely available, the classification experiments were conducted with statistiXL, a statistics toolkit add-in for Microsoft Excel. It can be obtained from [34] and includes discriminant analysis functionality. Any data format supported by Excel would have been suitable, so it was decided to convert the extracted features into CSV (comma-separated value) files. Unlike in the experiments of Karlgren and Cutting, independent training and test sets were used (see section 4.4).

    As far as the test data is concerned, the output of statistiXL consists merely of class predictions and

    does not include accuracies. Therefore, a script was developed to compare actual classes in the test


    set with those predicted by the algorithm. It computed all values required for assessing the

    approach, including recall and precision values as well as confusion matrices.
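A minimal sketch of such an evaluation script is given below; it assumes gold and predicted labels encoded as integers 0 to k-1 and at least one document per class:

    /** Sketch of the evaluation script: builds a confusion matrix from gold
     *  and predicted labels (0..k-1) and derives per-class recall and
     *  precision. */
    public class Evaluation {

        static void report(int[] gold, int[] predicted, int k) {
            int[][] confusion = new int[k][k];     // rows: gold, columns: predicted
            for (int i = 0; i < gold.length; i++) {
                confusion[gold[i]][predicted[i]]++;
            }
            for (int c = 0; c < k; c++) {
                int truePos = confusion[c][c];
                int goldTotal = 0, predTotal = 0;
                for (int j = 0; j < k; j++) {
                    goldTotal += confusion[c][j];
                    predTotal += confusion[j][c];
                }
                System.out.printf("class %d: recall %.3f, precision %.3f%n",
                        c, (double) truePos / goldTotal, (double) truePos / predTotal);
            }
        }
    }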

    To examine the influence of POS-based features, a second feature set was extracted from the data.

    It contained all of the original features used in [9], except the ones that rely on POS tags. Other

    than that, the procedure was not altered. Discriminant analysis was applied in the same way as

    before. The feature sets and results of the approach with and without POS-based attributes were

    handled independently.

5.2. Kessler, Nunberg & Schütze (1997)

Another benchmark in the field of genre classification is the work by Kessler, Nunberg & Schütze [7]. They suggest the use of simply computable features (referred to as cues), which do not require POS-tagged texts. These are divided into three categories: lexical, character-level and derivative cues. A fourth group comprises structural cues, which do make use of POS tags and are consequently ignored.

As the actual features used for classification are not reported in the publication, the authors were contacted and asked to provide additional information. Unfortunately, the exact list of cues could not be obtained, both because the work had been published more than a decade earlier and for copyright reasons. However, notes from the time could be recovered and were made available to the project by Nunberg [47]. While these were only rough ideas and unlikely to be identical to the features used in [7], they were as close an approximation as possible. The lexical, character-level and derivative cues mentioned in the notes were therefore extracted from the texts and used as the feature set.

As stated in [7], ratios are not explicitly used as features by Kessler, Nunberg & Schütze. Instead, counts are transformed into natural logarithms, so as to represent ratios implicitly. The same was done for this project. While the authors, like Karlgren & Cutting, work with the Brown corpus, they do not use fixed-length samples but rather individual texts with varying word counts. Therefore, the counts did not have to be normalized. An example is provided below.

For example, if the count of question marks in a document is c, the corresponding attribute value is ln(c); the same transformation is applied to the number of occurrences of "it".
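A minimal sketch of this transformation is given below; since the original example values could not be recovered, the +1 offset used to avoid ln(0) for absent cues is an assumption rather than the documented procedure:

    /** Sketch of the log transformation of counts. The +1 offset to avoid
     *  ln(0) for absent cues is an assumption, not the documented choice. */
    public class LogCue {

        static double attributeValue(int count) {
            return Math.log(count + 1);   // natural logarithm
        }

        public static void main(String[] args) {
            System.out.println(attributeValue(4));  // e.g. 4 question marks -> ~1.61
        }
    }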


In spite of this implicit representation, some ratios are mentioned explicitly in [47]. These include the type/token ratio and the average sentence length. They were computed and added to the feature set. Table 7 lists the features used for classification.

Feature set:
    Word count
    Sentence count
    Character count
    Types count
    Sentences starting with "And"
    Sentences starting with "But"
    Sentences starting with "So"
    Contraction count
    Relative day words ("yesterday", "today", "tomorrow")
    Occurrences of "(last / this / next) week"
    Occurrences of ", where"
    Occurrences of ", but"
    "of course" count
    "it" count
    "shall" count
    "will" count
    "a bit" count
    "hardly" count
    "not" count
    Wh-question count
    Question mark count
    Colons per word
    Colons per sentence
    Semicolons per sentence
    Parentheses per sentence
    Dashes per sentence
    Commas per word
    Commas per sentence
    Quotation mark count
    Average sentence length
    Standard deviation of sentence length
    Average word length
    Standard deviation of word length
    Type/token ratio
    Count of numerals
    Count of dates
    Count of numbers in brackets
    Count of terms of address

Table 7: Feature set for the Kessler, Nunberg & Schütze approach

For classification, three different algorithms are discussed by Kessler, Nunberg & Schütze. They use logistic regression as well as two variants of artificial neural networks: one has all input nodes connected directly to all output nodes (a 2-layer perceptron), while the other makes use of a hidden layer (a 3-layer perceptron). All three of these classifiers were used for this project as well. The open source data mining application Weka [36] was used for this purpose. Among other techniques, it features logistic regression and multilayer perceptrons.


    Input data is required to be in the Attribute-Relation File Format (ARFF). An example is shown

    below. The last value in each data line denotes the class. The full specification can be found in the

    book by Witten & Frank [36].

    @RELATION Training_KNS

    @ATTRIBUTE ABitCount NUMERIC
    [...]
    @ATTRIBUTE XCommaWhere NUMERIC
    @ATTRIBUTE class {1,2,3}

    @DATA
    0,1.39,0,[...],7.41,0.69,3
    0.69,2.08,0,[...],9.11,0,1

The number of neurons in the hidden layer was set to 6 for the first topic experiment and 9 for all other runs. This corresponds to the 3 neurons per genre class suggested by Kessler, Nunberg & Schütze. Weka outputs prediction accuracies as well as confusion matrices and precision, recall and F-measures for all classes. Therefore, no further processing was required.

Kessler, Nunberg & Schütze also discuss the usefulness of structural cues. However, for their experiments, they do not add any POS-based features to their own set. Instead, they compare their results to those achieved when utilizing the features suggested by Karlgren & Cutting [9]. Nevertheless, the list of notes [47] does include various structural cues, partly distinct from those used in [9]. These comprise both POS tag frequencies and more elaborate features, such as fragments, i.e. sentences containing no verb.

Additional structural cues:
    Present participle count
    Past participle count
    Adverb count
    Noun count
    Proper noun count
    Adjective count
    Existential "there" count
    Attributive adjective count
    Personal pronoun count
    Prepositions + wh-word
    Imperatives
    Sentences starting with present participles
    Sentences starting with past participles
    Sentences starting with an adverb + comma
    Fragment count (sentences with no verb)
    Sentences ending with prepositions

Table 8: Additional structural cues for the Kessler, Nunberg & Schütze approach


It was decided to run all the experiments based on this approach twice: once as suggested in the publication (i.e. with no POS-tagged texts required) and once with the structural cues included. The aim was to find out whether or not the algorithm could benefit from these features. Table 8 shows the features that were added to the document representation.

    5.3. Freund, Clarke & Toms (2006)

    The 2006 study presented in [10] discusses the merits of genre analysis in a software engineering

workplace domain. The focus is on identifying genres from a number of workplace-related sources

    and analyzing characteristics like purpose, form, style, subject matter and related genres. However,

    Freund, Clarke and Toms also carry out automatic classification on the set of identified genres.

As they are faced with heterogeneous sources and file formats, a simple bag-of-words approach is chosen over more sophisticated feature extraction. Bag-of-words means that the feature set consists of all the words found in a document collection, although it is often reduced by techniques like word stemming, stop lists or feature selection. The values are either binary or represent the frequencies of words in a specific document. The order of word appearances is not maintained. The bag-of-words representation is commonly used for text classification tasks, and several experiments have suggested that it performs as well as or better than more complicated methods in terms of classification accuracy (e.g. [15][16]). For the experiments of Freund, Clarke & Toms, no word stemming, stop lists or feature selection techniques are used.

The authors use SVMlight to classify the data. The software package is implemented in C and is free for non-commercial use. It can be obtained from [35]. SVMlight makes use of support vector machines, which are a popular choice in the field of text classification (e.g. [16][20][21][48]).

The re-implementation of the feature extraction was relatively straightforward. For each pair of training and test sets, all words occurring in the training set were used. The same features were extracted from the test set, i.e. formerly unseen words were ignored. Capitalization was disregarded, i.e. all words were transformed to lower case. As suggested in [10], no attempts were made to reduce the number of attributes. This way, over 142,000 independent features (i.e. distinct words) were extracted from the baseline training set of 6,000 documents. In contrast to the other assessed approaches, the classifying algorithm had to deal with an extremely large and sparse feature set.
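A minimal sketch of this extraction is given below; whitespace tokenization is a simplification of the real tokenization, and all names are illustrative:

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch of the bag-of-words extraction: the vocabulary is fixed on the
     *  training set, unseen test-set words are ignored, and everything is
     *  lower-cased. Whitespace tokenization is a simplification. */
    public class BagOfWords {

        final Map<String, Integer> vocabulary = new HashMap<>();  // word -> feature id

        void buildVocabulary(String trainingText) {
            for (String word : trainingText.toLowerCase().split("\\s+")) {
                vocabulary.putIfAbsent(word, vocabulary.size() + 1); // ids start at 1
            }
        }

        Map<Integer, Integer> extract(String text) {
            Map<Integer, Integer> counts = new HashMap<>();  // feature id -> frequency
            for (String word : text.toLowerCase().split("\\s+")) {
                Integer id = vocabulary.get(word);            // null for unseen words
                if (id != null) counts.merge(id, 1, Integer::sum);
            }
            return counts;
        }
    }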


SVMlight was developed as a binary classifier and cannot handle more than two classes. This was not a problem for the experiments of Freund, Clarke & Toms, as they were interested mainly in the recall and precision values for each of the genres. However, such results cannot be compared to results from multiple genre classification. Therefore, SVMlight could not be used to assess the approach in this project. Fortunately, the same author provides an extension called SVMmulticlass, which does exactly what is required. As input, text files in a certain format are required. They were created from the feature sets according to the specification. The following are two example documents converted to lines in an input file. The first number indicates the class affiliation. All other entries consist of the feature number and the frequency of the respective word in the document text. By convention, features with value zero (i.e. words which do not occur) are omitted.

    1 4:1 11:2 12:1 23:1 26:1 27:1 [...] 35488:2 # Document 0000291.xml
    2 4:2 8:1 11:1 12:1 [...] 40478:1 70307:1 132636:1 # Document 0874961.xml

While SVMmulticlass does output the error rate on the test set, no confusion matrix or class-specific recall and precision values are provided. However, an output file including target predictions is created after processing. This was used to calculate result statistics, using a variation of the script mentioned in section 5.1.

    5.4. Ferizis & Bailey (2006)

    The work on genre classification by Ferizis & Bailey [24] examines the approximation of POS-

    based features. Their experiments are based on the method proposed by Karlgren & Cutting [9], as

    discussed in section 5.1. The authors argue that comparable accuracies can be achieved by

    estimating the frequency of certain POS tags. The advantage of this method is that no tagged texts

are required for classification, which speeds up processing significantly. In fact, 97.2 % of the time it takes to classify a document the way Karlgren & Cutting suggested is spent assigning POS tags to its words [24]. This is a strong argument against tagging, especially in areas like information retrieval, where speed is crucial. On the other hand, it has been suggested that POS tags can help to achieve better classification accuracies [9][25][29].

    It seemed reasonable to assess an approach to POS frequency approximation in comparison to the

    already mentioned methods of Karlgren & Cutting with and without POS frequencies. Using the

    exact same non-POS features as before, approximations of the present participle and adverb

    frequencies were added. In accordance with the approach of Ferizis & Bailey, noun frequencies

    were ignored.


All words longer than 5 characters and ending with the suffix -ing were counted as present participles. Words longer than 4 characters and ending with -ly were counted as adverbs. In addition, an independent training set containing 5,000 randomly sampled NYT documents was created and POS tagged to find the 50 most common adverbs. The obtained words are shown in Table 9. They, too, were used to identify adverbs. Note that the POS tagging in this case was not part of the classification algorithm, but rather preliminary work which only had to be done once.
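A minimal sketch of the resulting heuristic is given below; the common-adverb list is abbreviated to its first few entries, with the full list given in Table 9:

    import java.util.Set;

    /** Sketch of the POS approximation. The common-adverb list is abbreviated
     *  here; the full 50-word list appears in Table 9. */
    public class PosApproximation {

        static final Set<String> COMMON_ADVERBS =
                Set.of("not", "n't", "also", "now", "so");   // ... and 45 more

        static boolean looksLikePresentParticiple(String word) {
            return word.length() > 5 && word.endsWith("ing");
        }

        static boolean looksLikeAdverb(String word) {
            return (word.length() > 4 && word.endsWith("ly"))
                || COMMON_ADVERBS.contains(word.toLowerCase());
        }
    }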

    Rank 1-10    Rank 11-20    Rank 21-30    Rank 31-40    Rank 41-50
    not          then          ago           later         recently
    n't          most          far           always        instead
    also         still         there         long          up
    now          well          however       really        probably
    so           too           often         rather        nearly
    more         here          already       ever          less
    only         never         yet           away          enough
    even         very          again         down          first
    just         back          once          perhaps       toget
    as           much          almost        about