


Grammatical Dependency-Based Relations for Term Weighting in Text Classification

Dat Huynh, Dat Tran, Wanli Ma, and Dharmendra Sharma

Faculty of Information Sciences and Engineering, University of Canberra, ACT 2601, Australia
{dat.huynh,dat.tran,wanli.ma,dharmendra.sharma}@canberra.edu.au

Abstract. Term frequency and term co-occurrence are currently used to estimate term weightings in a document. However, these methods do not employ relations based on grammatical dependency among terms to measure dependency between word features. In this paper, we propose a new approach that employs grammatical relations to estimate the weightings of terms in a text document, and we present how to apply the term weighting scheme to text classification. A graph model is used to encode the extracted relations. A graph centrality algorithm is then applied to calculate scores that represent the significance values of the terms in the document context. Experiments performed on several corpora with an SVM classifier show that the proposed term weighting approach outperforms those based on term frequency and term co-occurrence.

Keywords: Text representation, relation extraction, grammatical dependency, graph weighting model, text classification.

1 Introduction

In text classification, a single category label or multiple category labels are automatically assigned to a new text document based on category models created by learning a set of labelled training text documents. Current text classification methods convert a text document into a relational tuple using the popular vector-space model to obtain a list of terms with corresponding frequencies.

Term frequency (TF) has been used to measure the importance levels of terms in a document. Firstly, TF is considered a key component for evaluating term significance in a specific context [11]. The more often a term is encountered in a certain context, the more it contributes to the meaning of that context. Secondly, some approaches have combined TF and Inverse Document Frequency (IDF) as a term weighting measure. These approaches have achieved considerable results when applied to text classification tasks [4,16,17]. However, with an abstract and complex corpus such as Ohsumed1, the TF-based methods fail to leverage

1 ftp://medir.ohsu.edu/pub/ohsumed

J.Z. Huang, L. Cao, and J. Srivastava (Eds.): PAKDD 2011, Part I, LNAI 6634, pp. 476–487, 2011.
© Springer-Verlag Berlin Heidelberg 2011


the classification results [4,17]. According to Hassan and Banea [2], TF-based approaches can be effective for capturing the relevance of a term in a local context, but they fail to account for the global effects that terms have on the entire document.

To overcome this shortcoming, the relationships among terms have been investigated to derive the representation of a document. Recent studies have introduced pre-defined relations among terms. These relations can be extracted from a predefined knowledge source such as the Wikipedia encyclopedia [1,3,12,14], in which the relations (Wikipedia links) are regarded as the main components representing the document context. However, when methods are needed that can extract representations of arbitrary text documents, these pre-tagging-based methods are limited in their ability to handle the wide variety of document types.

Term co-occurrence (TCO) is popularly used to model the relations between terms [2,15]. It also overcomes the shortcoming of TF-based methods and can deal with arbitrary kinds of text documents. The idea of taking term co-occurrence as a relation is not only to capture the dependency of terms in local contexts but also to take into account the global effects of terms in the entire document. In order to estimate the importance levels of terms, a graph model is used to connect all these relations, and a centrality algorithm is used to calculate term weighting values.

Although TCO-based methods give considerable results in comparison to TF-based methods when they are used to estimate important terms of a given document for text classification tasks [2], certain concerns need to be considered. Firstly, when working within a fixed window size as the local context, these methods accept every single pair of terms within the window as a relation. Thus, the number of relations can be very small or very large depending on the choice of the window size. In those cases, expected relations may be eliminated, or redundant relations may be retained. Secondly, although the idea of these approaches is to extract important terms in a document context, the way of forming pairs of relations within a window size does not guarantee the contribution of the relations to weighting important terms. For these reasons, we argue that neither the TF-based model nor the TCO-based model may be the best technique to capture those important terms.
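For concreteness, the window-based pairing can be sketched as follows. This is a toy illustration of the idea only, not the implementation of [2,15]; the token list and window sizes are arbitrary choices, and it shows how the relation count changes with the window.

```python
def cooccurrence_pairs(tokens, window):
    """Every ordered pair of distinct terms within `window` tokens of each
    other becomes a co-occurrence relation; the set grows or shrinks with
    the chosen window size."""
    pairs = set()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + 1 + window]:
            if v != w:
                pairs.add((w, v))
    return pairs

tokens = ["antibiotics", "kill", "bacteria"]
print(len(cooccurrence_pairs(tokens, 1)))  # 2 relations: adjacent pairs only
print(len(cooccurrence_pairs(tokens, 2)))  # 3 relations: all pairs
```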

In this paper, we propose an alternative method to extract and weight important terms in a given document. The method firstly exploits the advantages of relations among terms to address the shortcoming of TF-based methods. Secondly, in light of the success of TextRank [9,2], instead of using term co-occurrence as a dependency, we concentrate on relations based on grammatical dependency among terms. This choice of relations not only avoids the issues of window sizes but also discloses more hidden relations found by walking along the paths connecting terms.

In our method, we first extract relations from the given text document. Secondly, a graph model is used to encode the contexts of the document; it is constructed by connecting all these relations.


Then, a graph centrality algorithm is applied to calculate scores that represent the significance values of terms in the document context. Finally, the top list of highly weighted terms is used as a representation of the given document.

The remainder of this paper is organised as follows. In Section 2, the framework of the term weighting approach is presented. Section 3 explains the methodology of extracting grammatical relationships among words. Section 4 shows how to use the graph model to estimate the importance levels of terms. How to apply the document representation to text categorisation tasks is presented in Section 5. Section 6 describes experimental setups, results and discussions. Finally, conclusions and future work are discussed in Section 7.

2 The Proposed Term Weighting Framework

The term weighting framework includes the following phases (see Fig. 1):

1. Relation Extraction: Given an input document, the relation extraction phase extracts a set of tuples (relations) representing the document.

2. Term Weighting: A weighted graph is constructed by encoding all the extracted relations. A graph ranking model is used to estimate term weightings for document representation.

Fig. 1. The proposed term weighting framework

The proposed framework is capable of extracting rich relationships among terms within a document; the relations include not only grammatical relations but also hidden relations, which are explored by walking along the paths connecting the ideas of each sentence. A global context of the document is captured using a graph model and a centrality ranking algorithm. The graph model inter-connects the major ideas of the document and provides the structure on which the centrality algorithm estimates the importance levels of vertices. The term weighting framework allows the important terms of a document to be “voted” for by other terms in the same document.


3 Relation Extraction

Relation extraction is an important research area in text mining, which aims to extract relationships between entities from text documents. In the framework, a relation is considered as a tuple t = (ei, rij, ej), where ei and ej are strings denoting words (terms), and rij is a string denoting the relationship between them.

The relations can be extracted based on linguistic analysis; in particular, grammatical relations among terms are the key components for extracting information.

For instance, from the following sentence “Antibiotics kill bacteria and are helpful in treating infections caused by these organisms”, the list of relations extracted includes (Antibiotic, kill, bacteria), (Antibiotic, treat, infection), (Antibiotic, treat infection cause, organism), and (infection, cause, organism).

In order to extract relations, the sentences are first identified from the input text document using a sentence detection technique; a linguistic parser such as the Stanford parser2 is then used to analyse each sentence and output a graph of linkages (Fig. 2); and finally a heuristic algorithm is designed to walk along paths of the linkage graph and extract the expected relations.

Fig. 2. Graph of linkages extracted from the sentence: “Elephant garlic has a role in the prevention of cardiovascular disease”. [Figure: a dependency graph whose curved edges carry labels such as det, amod, nsubj, dobj, prep and pobj.] Each label assigned to a curve plays the role of a grammatical relation connecting two words. An expected relation can be extracted based on the grammatical relation alone or based on the shortest paths connecting terminal words.

The heuristic algorithm first scans the parsed sentence and identifies pairs of base terms3 (ei, ej) with i < j. For each pair of base terms, if there is a shortest path connecting ei to ej, the algorithm walks along the path to identify the sequence of words between ei and ej. This ordered sequence of words is considered as a potential connection rij forming the raw tuple t = (ei, rij, ej). If ei and ej are connected directly, the connection rij is regarded as the name of the linkage (label). Finally, if a raw tuple passes all the given constraints, it is retained for the next processing step.

2 The Stanford parser http://nlp.stanford.edu/software/lex-parser.shtml
3 In this case, a base term is a single word in accordance with the POS filter.

Constraints to test the raw tuples include:


– ei and ej have to be base terms with the POS filter
– rij has to be in a shortest path connecting ei to ej
– rij has to contain a verb or a preposition, or it is a grammar connection
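The shortest-path walk can be sketched as follows. This is a hypothetical minimal implementation: the dependency edges are assumed to come from a parser such as the Stanford parser, the base terms are assumed to have already passed the POS filter, and the verb/preposition constraint checks are omitted for brevity.

```python
from collections import deque

def shortest_path(adj, src, dst):
    """BFS shortest path in an undirected labelled graph; None if none exists."""
    prev, seen, queue = {}, {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        for v in adj.get(u, {}):
            if v not in seen:
                seen.add(v)
                prev[v] = u
                queue.append(v)
    return None

def extract_relations(dep_edges, base_terms):
    """dep_edges: (head, dependent, label) triples from a dependency parser.
    base_terms: words that passed the POS filter, in sentence order."""
    adj = {}
    for head, dep, label in dep_edges:
        adj.setdefault(head, {})[dep] = label
        adj.setdefault(dep, {})[head] = label
    tuples = []
    for i, ei in enumerate(base_terms):
        for ej in base_terms[i + 1:]:
            path = shortest_path(adj, ei, ej)
            if path is None:
                continue
            if len(path) == 2:           # directly connected: rij is the
                rij = adj[ei][ej]        # name of the linkage (label)
            else:                        # otherwise: the words on the path
                rij = " ".join(path[1:-1])
            tuples.append((ei, rij, ej))
    return tuples

# Simplified dependency edges for "Elephant garlic has a role in the
# prevention of cardiovascular disease" (determiners omitted).
dep_edges = [("has", "garlic", "nsubj"), ("garlic", "elephant", "amod"),
             ("has", "role", "dobj"), ("role", "in", "prep"),
             ("in", "prevention", "pobj"), ("prevention", "of", "prep"),
             ("of", "disease", "pobj"), ("disease", "cardiovascular", "amod")]
print(extract_relations(dep_edges, ["garlic", "role", "disease"]))
```

On this sentence the sketch yields (garlic, has, role), (garlic, has role in prevention of, disease) and (role, in prevention of, disease), mirroring the kind of tuples described above.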

After extracting the set of raw tuples from each document, the components of each tuple are optimised as follows. All non-essential words, such as adverbs, relative clause markers (who, whom, which, that, etc.), and stop-words, are eliminated from all components of a tuple. A morphology technique is then used to convert all words to their simple forms, such as converting plural words into singular form and any kind of verb form into its “root” word [8]. For instance, the noun phrase “developing countries” is converted to “develop country”, and the verb phrase “have been working” is converted to “have be work”. Once all malformed tuples are eliminated, the remaining tuples are considered as the set of relations representing the document and are ready for building the graph used to select term representatives.
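The normalisation step can be sketched as below. This is a toy stand-in: a real implementation would use a morphological analyser such as that of [8]; here STOPWORDS and LEMMAS are tiny hand-made tables invented so the example runs on its own.

```python
STOPWORDS = {"who", "whom", "which", "that"}       # relative clause markers
LEMMAS = {"developing": "develop", "countries": "country",
          "been": "be", "working": "work"}         # toy morphology table

def normalise(phrase):
    """Drop non-essential words, then map each word to its 'root' form."""
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    return " ".join(LEMMAS.get(w, w) for w in words)

print(normalise("developing countries"))   # -> develop country
print(normalise("have been working"))      # -> have be work
```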

4 Graph Construction: Constructing, Weighting and Ranking Graph

A graph model is an alternative way to model information that shows relationships between vertices. It groups related information in such a way that centrality algorithms can take best advantage of it.

4.1 Constructing Graph

A graph model is built to connect all extracted relations. Given a relation t = (ei, rij, ej), ei and ej are considered as vertices in the graph and rij is considered as an edge connecting ei and ej. The weighting of the edge rij is calculated based on the redundancy of the tuple t and the relatedness between ei and ej.

4.2 Weighting Graph

The weighting of an edge, w(rij), is calculated based on two factors. Firstly, it depends on the frequency of the relation t in the document d. The higher the redundancy of relation t, the more important it is in the document d. Secondly, the weighting of the edge w(rij) is also based on the redundancy of relation t in the corpus. The redundancy of a tuple determines how valuable that information is with respect to its document [6].

Let t = (ei, rij, ej) be a relation of d, and e = (ei, w(rij), ej) be an edge of the graph. The weighting w(rij) is calculated as:

w(rij) = freq(t, C) ∗ rf(t, d)    (1)

rf(t, d) = freq(t, d) / Σ(i=1..|{t : t ∈ d}|) freq(ti, d)    (2)


where freq(t, C) is the frequency of tuple t in the corpus C, freq(t, d) is the frequency of tuple t in the document d, and rf(t, d) is the relation frequency value of the relation t in the document d.
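Equations (1)-(2) can be computed directly from tuple counts. A minimal sketch, assuming the document and corpus are given as lists of (ei, rij, ej) tuples:

```python
from collections import Counter

def edge_weights(doc_tuples, corpus_tuples):
    """w(rij) = freq(t, C) * rf(t, d), where rf(t, d) is the tuple's share
    of all tuple occurrences in the document (Eqs. 1-2)."""
    freq_d = Counter(doc_tuples)                  # freq(t, d)
    freq_c = Counter(corpus_tuples)               # freq(t, C)
    total = sum(freq_d.values())                  # denominator of rf(t, d)
    return {t: freq_c[t] * freq_d[t] / total for t in freq_d}

doc = [("antibiotic", "kill", "bacteria"),
       ("antibiotic", "kill", "bacteria"),
       ("infection", "cause", "organism")]
corpus = doc * 3                                  # toy 3-document corpus
print(edge_weights(doc, corpus))
```

Here the (antibiotic, kill, bacteria) edge gets weight 6 ∗ 2/3 = 4.0 and the (infection, cause, organism) edge gets 3 ∗ 1/3 = 1.0.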

4.3 Ranking Graph

Once a document is represented as a weighted graph, a graph ranking algorithm is used to estimate the scores of its vertices. According to Sinha and Mihalcea [10], the centrality algorithm PageRank [5] shows an outstanding ability for weighting graphs, and hence it is adopted in our approach. The basic idea of the PageRank algorithm is that a web page will have a high rank if many web pages, or high-ranking web pages, point to it. Therefore, we treat each node in the term graph as a web page; every undirected edge e = (ei, w(rij), ej) is converted into two directed edges e→ = (ei, w(rij), ej) and e← = (ej, w(rij), ei). The directed graph is then passed to PageRank as input, and the output is a set of vertices with their ranking scores. Every vertex (term) ei in the graph (document) d has a ranking score pr(ei, d), which is considered as the degree of significance of the vertex (term) in the graph (document).
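The edge expansion and a weighted PageRank can be written in a few lines. This is a hypothetical minimal power-iteration sketch with the conventional damping factor 0.85, not the exact setup of [5]; the example edge weights are invented.

```python
def pagerank(edges, d=0.85, iters=50):
    """edges: (ei, weight, ej) undirected triples; returns {vertex: score}.
    Each undirected edge is expanded into two directed edges, so every
    vertex 'votes' for its neighbours in proportion to edge weight."""
    out = {}                                   # vertex -> {neighbour: w}
    for ei, w, ej in edges:
        out.setdefault(ei, {})[ej] = w
        out.setdefault(ej, {})[ei] = w
    nodes = list(out)
    pr = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):                     # power iteration
        pr = {v: (1 - d) / len(nodes)
                 + d * sum(pr[u] * out[u][v] / sum(out[u].values())
                           for u in nodes if v in out[u])
              for v in nodes}
    return pr

scores = pagerank([("antibiotic", 2.0, "bacteria"),
                   ("antibiotic", 1.0, "infection"),
                   ("infection", 1.0, "organism")])
print(scores)  # "antibiotic", the most connected term, ranks highest
```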

5 Applying Graph-Based Document Representation to Text Classification

Given a text document d from a corpus C, the list of n term representatives of d is defined as

d = {(w1, pr(w1, d)), (w2, pr(w2, d)), . . . , (wn, pr(wn, d))}    (3)

where wi is the text value of term i in the document d and pr(wi, d) is the ranking value of term wi in the document d. The list of categories of the corpus C is C = {c1, c2, . . . , cm}.

In the following section, we propose a measure that takes into account dependencies between terms and classes, which can adjust term weighting values to suit text classification tasks.

5.1 Proposed Term Class Dependence (TCD)

The idea of calculating the class dependence value of terms is that the terms representing a document are normally dependent on its categories (classes). Therefore, we propose a measurement, tcd, which takes into account the information of categories and denotes the degree of dependence of a term on a particular category. If a word frequently occurs in many documents of one class and infrequently occurs in other classes, it should be considered as representative of the class if its ranking value from the document is also comparable. We suggest the following measure of the degree of dependence of the term wi on the category cj:


tcd(wi, cj) = [tf(wi, cj) ∗ df(wi, cj)] / Σ(k=1..m) [tf(wi, ck) ∗ df(wi, ck)]    (4)

where ck is a category of the corpus C.

With the purpose of classifying text, we propose a combination of the term ranking on documents (pr) and the category-based calculation (tcd), which establishes a formula to weight a term wi from a document dj that belongs to a category ck as follows:

pr.tcd(wi, dj, ck) = pr(wi, dj) ∗ tcd(wi, ck)    (5)

The pr.tcd value of each term is then added to the feature vector for the classification task.
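Equations (4)-(5) can be sketched as follows, under the assumption that tf(wi, ck) is the frequency of wi over the documents of class ck and df(wi, ck) is the number of such documents containing wi (the exact definitions are left implicit above); the example counts are invented.

```python
def tcd(word, cls, tf, df, classes):
    """Eq. (4): degree of dependence of `word` on class `cls`."""
    denom = sum(tf.get((word, c), 0) * df.get((word, c), 0) for c in classes)
    return tf.get((word, cls), 0) * df.get((word, cls), 0) / denom if denom else 0.0

def pr_tcd(pr_score, word, cls, tf, df, classes):
    """Eq. (5): pr.tcd(wi, dj, ck) = pr(wi, dj) * tcd(wi, ck)."""
    return pr_score * tcd(word, cls, tf, df, classes)

classes = ["disease", "sport"]
tf = {("bacteria", "disease"): 10, ("bacteria", "sport"): 1}
df = {("bacteria", "disease"): 5, ("bacteria", "sport"): 1}
print(tcd("bacteria", "disease", tf, df, classes))   # 50/51, close to 1
print(pr_tcd(0.2, "bacteria", "disease", tf, df, classes))
```

A word concentrated in one class, like “bacteria” here, gets a tcd close to 1 for that class and close to 0 elsewhere, so its ranking score is largely preserved only in its dominant class.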

5.2 Proposed Hybrid Term Weighting Methods Based on TCD

With the purpose of evaluating the effectiveness of term weighting methods based on term frequency, term co-occurrence relations, and grammatical relations, we suggest the following combinations to form hybrid term weighting methods for text classification:

– tf.tcd: a combination of term frequency and term class dependency. The purpose of tf.tcd is to evaluate the effectiveness of tf against pr by comparing tf.tcd with pr.tcd. The tf.tcd value of a term wi in a document dj that belongs to a category ck is

tf.tcd(wi, dj) = tf(wi, dj) ∗ tcd(wi, ck)    (6)

where

tf(wi, dj) = freq(wi, dj) / Σ(k=1..n) freq(wk, dj)    (7)

and n is the number of terms in the document dj.

– rw.tcd: a combination of the term weighting method based on TextRank (rw)4 and term class dependency (tcd). The purpose of rw.tcd is to evaluate the effectiveness of rw against pr by comparing rw.tcd with pr.tcd. The rw.tcd value of a term wi in a document dj that belongs to a category ck is

rw.tcd(wi, dj) = rw(wi, dj) ∗ tcd(wi, ck)    (8)

where rw(wi, dj) is the random-walk weighting value of a term wi in a document dj.

4 We implemented the TextRank framework, which outputs the weightings of terms based on term co-occurrence and a random-walk algorithm on the graph [9,2], for comparison purposes.


– pr.idf: a combination of our term ranking method pr and inverse document frequency. The purpose of pr.idf is to evaluate the effectiveness of tf against pr by comparing tf.idf with pr.idf. The pr.idf value of a term wi in a document dj is

pr.idf(wi, dj) = pr(wi, dj) ∗ idf(wi, dj)    (9)

6 Experiments

6.1 Classifier and Data Sets

The Support Vector Machine [13] is a state-of-the-art classifier. In our experiments, we used the linear kernel, since it has been shown to be as powerful as other kernels when tested on data sets for text classification [16].

Our term weighting approach is mainly based on the grammatical relations among terms in English sentences. In order to test its effectiveness in comparison to other methods, we chose corpora written in highly standard English grammar: Wikipedia XML, Ohsumed, and NSFAwards.

Wikipedia Corpus: We used the Wikipedia XML Corpus compiled by Denoyer & Gallinari [7]. We randomly selected a subset of the English Single-Label Categorization Collection, which assigns a single category to each document. After the pre-processing step, we obtained a total of 8502 documents assigned to 53 categories. These documents were divided randomly and equally to form a training data set and a test data set, each including 4251 documents and 53 categories.

Ohsumed: This corpus contains 50216 abstract documents5; we selected the first 10000 for training and the second 10000 for testing. The classification task is to assign the documents to one or multiple categories of the 23 MeSH “diseases” categories. After the pre-processing step, there are 7643 documents in the test set and 6286 documents in the training set.

NSFAwards: This data set consists of 129000 relatively short abstracts in English, describing awards granted for basic research by the US National Science Foundation during the period 1990-2003. For each abstract, there is a considerable amount of meta-data available, including the abbreviation code of the NSF division that processed and granted the award in question. We used this NSF division code as the class of each document. The title and the content of the abstract were used as the main content of the document for classification tasks. We used part of the corpus for our experiment by selecting 100 different documents per category for each of the test set and the training set. After the pre-processing step, we obtained 19018 documents for the training set, 19072 documents for the test set, and 199 categories.

5 ftp://medir.ohsu.edu/pub/ohsumed


6.2 Performance and Discussion

To evaluate the classification system, we used the traditional accuracy measure, defined as the number of correct predictions divided by the number of evaluated examples. The six weighting models to be tested are tf.idf, tf.tcd, rw.idf, rw.tcd, pr.idf and pr.tcd. We used the accuracy results from tf.idf and rw.idf as the baseline measures.
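The accuracy measure above is a one-liner (a trivial sketch with invented labels):

```python
def accuracy(predicted, gold):
    """Number of correct predictions divided by the number of examples."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

print(accuracy(["a", "b", "a", "c"], ["a", "b", "b", "c"]))  # -> 0.75
```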

GR-based Method versus Baseline Methods. The GR-based method pr.tcd provided outstanding results in comparison to tf.idf and rw.idf. Table 1 shows the classification results using the SVM classifier. The Ohsumed corpus is one of the most challenging text classification datasets: tf.idf and rw.idf achieved under 40% accuracy. However, pr.tcd achieved a considerable result, with about 16% higher accuracy than the two previous methods. Wikipedia and NSFAwards show similar trends, although the gap between pr.tcd and the two baseline methods reduces to about 4% on the NSFAwards corpus and about 10% on the Wikipedia corpus.

Table 1. SVM results for the pr.tcd, tf.idf and rw.idf weighting methods

            tf.idf   rw.idf   pr.tcd
Wikipedia   73.27%   74.99%   84.92%
Ohsumed     39.75%   39.02%   56.9%
NSFAwards   67%      64.98%   70.9%

Moreover, from the classification results of the six weighting methods in Table 2, it can be seen that the GR-based method (pr.tcd) also compares favourably with the other, similar methods.

Table 2. SVM results from the six weighting schemas

            tf.idf   tf.tcd   rw.idf   rw.tcd   pr.idf   pr.tcd
Wikipedia   73.27%   77.4%    74.99%   82.66%   77.72%   84.92%
Ohsumed     39.75%   39.73%   39.02%   56.03%   39.24%   56.9%
NSFAwards   67%      71.2%    64.98%   70.95%   65.2%    70.9%

Grammatical Relation versus Term Frequency: The effectiveness of grammatical relations and term frequency can be measured by comparing the accuracy results of the two pairs (tf.idf, pr.idf) and (tf.tcd, pr.tcd). The chart in Fig. 3 shows that pr always gives better performance than tf when it is combined with tcd in the term weighting methods. However, this trend is not stable when pr is combined with idf. In particular, pr.idf presents outstanding performance on the Wikipedia corpus, but tf.idf shows its strength on the other two corpora, Ohsumed and NSFAwards.


Fig. 3. The chart shows the accuracy comparison between term weighting schemas based on term frequency and grammatical relation

Fig. 4. The chart shows the accuracy comparison between term weighting schemas based on term co-occurrence and grammatical relation

Grammatical Relation versus Term Co-occurrence Relation. The ideas behind the term weighting schemas based on these methods have some similarity. However, each of them has its strengths and weaknesses. The effectiveness of these approaches can be measured by comparing the two pairs of weighting schemas (rw.idf, pr.idf) and (rw.tcd, pr.tcd). The chart in Fig. 4 shows that in the majority of cases the term weighting methods based on grammatical relations outperform those based on term co-occurrence relations. In particular, the biggest gap in classification accuracy between the two methods is 2.7%, whereas in only 1 out of 6 cases do the TCO-based methods show results comparable to the GR-based methods. In the case of the NSFAwards corpus, the TCO-based methods achieve 0.1% higher accuracy in comparison to the GR-based methods.

Term Class Dependency versus Inverse Document Frequency. The information in the chart of Fig. 5 presents another view: a comparison between the contributions of the inverse document frequency and term class dependency measures in the term weighting schemas. In most cases, combining a term importance evaluation (tf, rw, pr) with tcd yields


Fig. 5. The chart shows the accuracy comparison between term weighting schemas based on inverse document frequency and term class dependency

outstanding results in comparison with idf. There is just one case, from the Ohsumed corpus, in which idf shows better results than tcd.

In summary, we have presented the experimental results and made comparisons regarding the strengths and weaknesses of the proposed methods. Although some aspects still need to be considered, the proposed term weighting approach for text classification using grammatical relations outperforms the other traditional term weighting approaches based on term frequency and term co-occurrence, and the term class dependency measure can be used as an alternative information evaluation in place of inverse document frequency.

7 Conclusion and Future Work

This paper has presented a term weighting method for text classification based on grammatical relations. On the same datasets, the approach improved the accuracy of text classification in comparison to traditional term weighting methods. The approach overcomes the lack of frequency information by creating frequencies based on the grammar structure of the text content. This approach also motivates our further investigation of the benefits of relations for text classification as well as text mining.

Although our approach uses the concept of relations, we have not yet closely considered their semantic aspect; we have just used the relations as connections between terms, as a first attempt at obtaining more statistical information. For further investigation, we will focus on using the semantic information of the tuples and their connections in the graph to form representations


of given documents. Moreover, the grammatical relations are extracted based on the grammar structure of the text body, and this procedure consumes much computational processing. Therefore, the need for quick and reliable extraction from input text should be considered in further investigations.

References

1. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In: Proc. of the 20th IJCAI, pp. 1606–1611 (2007)

2. Hassan, S., Banea, C.: Random-walk term weighting for improved text classification. In: Proc. of TextGraphs, pp. 53–60 (2006)

3. Hu, J., Fang, L., Cao, Y., Zeng, H.J., Li, H., Yang, Q., Chen, Z.: Enhancing text clustering by leveraging wikipedia semantics. In: Proc. of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 179–186 (2008)

4. Joachims, T.: Text categorisation with support vector machines: Learning with many relevant features. In: Nedellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)

5. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. In: Stanford Digital Library Technologies Project (1998)

6. Ludovic, D., Patrick, G.: A probabilistic model of redundancy in information extraction. In: Proc. of the 19th IJCAI, pp. 1034–1041 (2005)

7. Denoyer, L., Gallinari, P.: The Wikipedia XML corpus. In: ACM SIGIR Forum, pp. 64–69 (2006)

8. Minnen, G., Carroll, J., Pearce, D.: Morphological processing of English. In: Natural Language Engineering, pp. 207–223 (2001)

9. Mihalcea, R., Tarau, P.: TextRank: Bringing order into texts. In: Proc. of EMNLP (2004)

10. Sinha, R., Mihalcea, R.: Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In: Proc. of the ICSC, pp. 363–369 (2007)

11. Robertson, S., Jones, K.S.: Simple, proven approaches to text retrieval. Tech. rep., University of Cambridge (1997)

12. Strube, M., Ponzetto, S.P.: WikiRelate! Computing semantic relatedness using wikipedia. In: Proc. of the 21st AAAI, pp. 1419–1424 (2006)

13. Vapnik, V.N.: The nature of statistical learning theory. Springer, Heidelberg (1995)
14. Wang, P., Hu, J., Zeng, H.J., Chen, L., Chen, Z.: Improving text classification by using encyclopaedia knowledge. In: The Seventh IEEE ICDM, pp. 332–341 (2007)
15. Wang, W., Do, D.B., Lin, X.: Term graph model for text classification. In: Li, X., Wang, S., Dong, Z.Y. (eds.) ADMA 2005. LNCS (LNAI), vol. 3584, pp. 19–30. Springer, Heidelberg (2005)

16. Yang, Y., Liu, X.: A re-examination of text categorisation methods. In: Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42–49 (1999)

17. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorisation. In: Proc. of the 14th ICML, pp. 412–420 (1997)