
Department of Electronic and Computer Engineering

Automatic Subject Classification of Textual Documents Using Limited or No Training Data

Arash Joorabchi

Supervised by Dr. Abdulhussain E. Mahdi

Submitted for the degree of Doctor of Philosophy

10/11/2010

2

Introduction to ATC

Motivation, Aim, and Objectives

Bootstrapping ML-based ATC systems (Ch.3)

Bibliography-Based ATC method (BB-ATC) (Ch.4)

Enhanced BB-ATC for Automatic Classification of Scientific Literature in Digital Libraries (Ch.5)

Citation Based Keyphrase Extraction (CKE) (Ch.6)

Conclusion & Future Work

Outline

3

• Automatic Text Classification/Categorization (ATC)

– Automatic assignment of natural language text documents to one or more predefined classes/categories according to their contents.

• Applications include:

– Spam filtering

– Web information retrieval, e.g., filtering, focused crawling, web directories, subject browsing

– Organising digital libraries

• Common Methods:

– Rule-based Knowledge Engineering (until late 1980s)

– Machine Learning (since 1990s)

Introduction

4

• Common ML algorithms used for ATC

– Naïve Bayes (based on Bayes' theorem)

– k-Nearest Neighbors (k-NN)

– Support Vector Machines (SVM) [Vapnik, V 1995]

• SVM is reported to yield the best prediction accuracy [Joachims, T 1998]. However, the accuracy of ML-based ATC systems depends on many parameters, such as:

– Quantity and quality of training documents

– Document representation models, e.g., bag-of-words vs. bag-of-phrases

– Term weighting mechanisms, e.g., binary vs. multinomial (burstiness phenomenon)

– Feature reduction and selection methods, e.g., document frequency vs. information gain.

• Therefore, the choice of the best classification algorithm depends highly on the characteristics of the ATC task at hand [Hand, D. J. 2006].
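Purely for illustration (not part of the thesis work), the sketch below shows a minimal ML-based ATC pipeline in the spirit of the bullets above: a bag-of-words representation with TF-IDF term weighting feeding a linear SVM. The library (scikit-learn), the tiny corpus, and the subject-field labels are all hypothetical choices, not the system described in this work.

# A minimal sketch of an ML-based ATC pipeline: bag-of-words features with
# TF-IDF term weighting and a linear SVM. The tiny corpus and labels are
# hypothetical placeholders; scikit-learn is an illustrative library choice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = [
    "relational databases and SQL query optimisation",   # hypothetical training text
    "machine learning with neural networks",              # hypothetical training text
]
train_labels = ["482B", "481C"]   # hypothetical subject-field codes

classifier = make_pipeline(
    TfidfVectorizer(stop_words="english"),   # document representation + term weighting
    LinearSVC(),                             # linear SVM, as in [Joachims, T 1998]
)
classifier.fit(train_texts, train_labels)

# Should predict the database-related code for a database-flavoured text.
print(classifier.predict(["a module on relational databases"]))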

ML Approach to ATC

5

• What if there is limited or no training data? (e.g., 100 classes & 200 samples per class)

• Our aim was to alleviate this problem by pursuing two lines of research:

i. Investigating bootstrapping methods to automate the process of building labelled corpora for training ML-based ATC systems.

ii. Investigating a new unsupervised ATC algorithm which does not require any training data.

• In order to realise this aim, we have focused on utilising two sources of data whose application in ATC had not been fully explored before:

a) Conventional library organisation resources such as library classification schemes, controlled vocabularies, and catalogues (OPACs).

b) Linkage among documents in the form of citation networks.

Motivation, Aim, and Objectives

6

[System architecture diagram: syllabus documents arrive via a Hot-Folder Application and FTP server (zip packages); Pre-processing (OpenOffice, Xpdf, PDFTK) converts them to plain text; the Information Extractor (GATE: Programme Document Segmenter, Module Syllabus Segmenter, Named Entity Extractor) produces segment headings and entity names; the Meta-data Generator's Classifier draws on the Classification Scheme, Thesaurus, and a Web Search API; Post-Processing stores the results in the Repository Database.]

An Overview of the Developed Syllabus Repository System

Development of a National Syllabus Repository for Higher Education in Ireland

• Goal: Collecting unstructured electronic syllabus documents from participating higher education institutes into a metadata-rich central repository.

• Extended the ISCED scheme

• 482B - Science, Mathematics and Computing/Computing/Information Systems/Databases

• Naïve Bayes Classification algorithm [Tom Mitchell 1997]

• A New Web-based bootstrapping method

7

1. A list of subject fields (leaf nodes) in the classification scheme is compiled.

2. For each subject field in the list, a web search query is created, consisting of the caption of the subject field plus the keyword “syllabus”, and submitted to the Yahoo search engine using the Yahoo search SDK.

3. The first hundred URLs in the returned results for each query are passed to the GATE toolkit [Cunningham et al. 2002], which downloads all corresponding files (in HTML, TXT, PDF, or MS Word formats) and extracts and tokenizes their textual contents.

4. The tokenised texts are converted to feature/word vectors, which are then used to train the classifier for classifying syllabus documents at the subject-field level.

5. The subject-field word vectors are also used in a bottom-up fashion to construct word vectors for the fields belonging to the higher levels of the hierarchy (p.52).
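The sketch below illustrates steps 1-4 above. The functions web_search and fetch_and_extract_text are hypothetical stand-ins for the Yahoo search SDK and the GATE toolkit used in the actual system; the tokenisation shown is deliberately crude.

# A minimal sketch of the web-based bootstrapping idea (steps 1-4 above).
# web_search and fetch_and_extract_text are hypothetical stand-ins for the
# Yahoo search SDK and the GATE toolkit used in the actual system.
from collections import Counter

def web_search(query, max_results=100):
    """Hypothetical: return a list of result URLs for the given query."""
    raise NotImplementedError

def fetch_and_extract_text(url):
    """Hypothetical: download the file at url and return its plain text."""
    raise NotImplementedError

def build_training_vectors(subject_field_captions):
    """Build one word-frequency vector per subject field from web search results."""
    vectors = {}
    for caption in subject_field_captions:
        query = f'{caption} syllabus'                     # step 2: caption + "syllabus"
        word_counts = Counter()
        for url in web_search(query, max_results=100):    # step 3: top 100 URLs
            text = fetch_and_extract_text(url)
            word_counts.update(text.lower().split())      # step 4: crude tokenisation
        vectors[caption] = word_counts
    return vectors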

Web-based Bootstrapping Process

8

• Test dataset contains 100 undergraduate syllabus documents and 100 postgraduate syllabus documents from 5 participating HE institutes in Ireland

• The micro-average precision achieved by the classifier for undergraduate syllabi is 0.75, compared to 0.60 for postgraduate syllabi.

                   Micro-avg. Precision   Micro-avg. Recall   Micro-avg. F1
Named Entities     0.94                   0.74                0.82
Topical Segments   0.84                   0.72                0.77

• Results published in:

– The proceedings of the 12th European Conference on Research and Advanced Technology for Digital Libraries, ECDL 2008; and

– The Electronic Library, 27, 4 (2009).

Evaluation and Experimental Results

9

[System architecture diagram: the Bootstrapping module's Training Corpus Builder queries the LOC OPAC (via a Z39.50 API) and the web (via an HttpClient API) to build the NB and SVM training corpora; GATE, general and scheme-specific stop-word lists, and the Classification Scheme feed the Training Dataset Builder; the resulting corpora train the TWCNB and LIBLINEAR classifiers, which label the unlabeled texts.]

Overview of the Developed ATC System

Bootstrapping ML-based ATC Systems Utilizing Public Library Resources

• A dynamic ML-based ATC system that can be adapted to a wide range of ATC tasks with minimal customization required.

• Dewey Decimal Classification (DDC) scheme.

• Small parts of books, such as the back cover and editorial reviews, used for training.

• Transformed Weight-normalized Complement Naive Bayes (TWCNB) [Rennie et al., 2003].

• A linear SVM classifier called LIBLINEAR [Lin, C.-J et al., 2008]

10

http://amazon.com/gp/product/ISBN-VALUE

Retrieve a list of books from LOC’s catalogue which are classified into this category.

Extract a list of ISBNs and use them to retrieve the books’ descriptions from Amazon.

Data Mining Process

11

• Product Description

– Editorial Reviews:

The Deitels' groundbreaking How to Program series offers unparalleled breadth and depth of object-oriented programming concepts and intermediate-level topics for further study. The Seventh Edition has been extensively fine-tuned and is completely up-to-date with Sun Microsystems, Inc.’s latest Java release — Java Standard Edition 6 (“Mustang”) and several Java Enterprise Edition 5 topics. Contains an extensive OOD/UML 2 case study on developing an automated teller machine. Takes a new tools-based approach to Web application development that uses Netbeans 5.5 and Java Studio Creator 2 to create and consume Web Services. Features new AJAX-enabled Web applications built with JavaServer Faces (JSF), Java Studio Creator 2 and the Java Blueprints AJAX Components. Includes new topics throughout, such as JDBC 4, SwingWorker for multithreaded GUIs, GroupLayout, Java Desktop Integration Components (JDIC), and much more. A valuable reference for programmers and anyone interested in learning the Java programming language.

http://amazon.com/gp/product/0132222205

Parsed Book Description Text

12

• 20-Newsgroup-18828 dataset - a collection of 18,828 newsgroup articles, partitioned across 20 different newsgroups.

• Eight classes in 20-Newsgroups were mapped to their corresponding classes in the Dewey Decimal Classification scheme (the remaining were inapplicable, e.g., misc.forsale).

Newsgroup                Dewey Number   Dewey Caption                   No. of training texts collected
sci.space                520            Astronomy and allied sciences   810
rec.sport.baseball       796.357        Baseball                        997
rec.autos                796.7          Driving motor vehicles          587
rec.motorcycles          796.7          Driving motor vehicles          587
soc.religion.christian   230            Christian theology              1043
sci.electronics          537            Electricity and electronics     713
rec.sport.hockey         796.962        Ice hockey                      270
sci.med                  610            Medicine and health             1653

Evaluation and Experimental Results

13

Newsgroup                Bootstrapped TWCNB Precision %   Standard TWCNB Precision %
sci.space                69.19                            94.94
rec.sport.baseball       96.78                            93.96
rec.autos                74.74                            91.91
rec.motorcycles          71.02                            94.97
soc.religion.christian   89.36                            96.0
sci.electronics          69.92                            78.17
rec.sport.hockey         75.77                            98.5
sci.med                  76.23                            96.96
Avg.                     77.87                            93.17

• The precision of the bootstrapped TWCNB is about 15 percentage points lower than that of the standard TWCNB.

• The LIBLINEAR classifier, which achieved an average precision of 68%, turned out to be considerably less accurate than TWCNB in this task.

• Results published in:

– The proceedings of the 19th Irish Conference on Artificial Intelligence and Cognitive Science (AICS08).

Evaluation and Experimental Results (Cont.)

14

Leveraging the Legacy of Conventional Libraries for Organizing Digital Libraries

Can we utilize the classification metadata of books referenced in a syllabus document to classify it? They are already classified by expert library cataloguers according to the DDC and LCC classification schemes.

Tapping into:

The intellectual work that has been put into developing and maintaining library classification systems over the last century.

The intellectual effort of expert cataloguers who have manually classified millions of books and other resources in libraries.

15

Bibliography-based ATC method

BB-ATC is based on automating the following processes:

1. Identifying and extracting references in a given document.

2. Searching catalogues of physical libraries for the extracted references in order to retrieve their classification metadata.

3. Allocating one or more classes to the document, based on the retrieved classification categories of the references, with the help of a weighting mechanism.

Similar in spirit to the k-Nearest Neighbour (k-NN) algorithm.
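As an illustration of step 3 above, the sketch below turns the DDC classes retrieved for a document's references into a weighted list of candidate classes. The simple frequency-based weighting shown is an illustrative stand-in, not the exact weighting mechanism described in the thesis, and the sample references are hypothetical.

# A minimal sketch of step 3: turning the references' retrieved DDC classes
# into a weighted list of candidate classes. The frequency-based weighting
# here is an illustrative stand-in, not the exact thesis weighting mechanism.
from collections import Counter

def weight_candidate_classes(reference_ddc_numbers):
    """Map each candidate DDC class to a weight 0 < w <= 1."""
    counts = Counter(ddc for ddc in reference_ddc_numbers if ddc is not None)
    top = counts.most_common(1)[0][1]
    return {ddc: count / top for ddc, count in counts.items()}

# DDC numbers retrieved for the references of a hypothetical syllabus;
# None = reference not catalogued in LOC / British Library.
refs = ["005.74", "005.74", "005.133", None, "005.74"]
print(weight_candidate_classes(refs))   # weights: 005.74 -> 1.0, 005.133 -> ~0.33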

16

Bibliography-based ATC

Advantages Over ML-based ATC systems:

Library classification schemes are regularly updated and contain thousands of classes in every field of knowledge.

No training data is needed.

Performance is not adversely affected by the large number of classes in DDC and LCC.

New books are catalogued every day, hence there is no concept drift.

17

[System architecture diagram: input documents (*.PDF, *.HTM, *.DOC) are pre-processed (OpenOffice, Xpdf) into plain text; the Information Extractor (GATE) extracts reference identifiers (ISBNs/ISSNs); the Catalogue-Search component queries the LOC and BL catalogues for the references’ DDC class numbers and LCSHs; the Classifier produces a weighted list of chosen DDC class(es) and LCSHs, stored in the Syllabi DB.]

BB-ATC Implementation

The DDC classification scheme was adopted because of its worldwide usage and hierarchical structure.

The JZkit Java API is used to communicate with the libraries’ OPAC catalogues through the Z39.50 protocol.

JRegex/JAPE for extracting ISBNs/ISSNs.

Multi-label classification by assigning weights (0<w≤1) to candidate DDC classes and LCSHs.
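The reference-identifier extraction mentioned two bullets above is performed with JRegex/JAPE in the actual system; the sketch below only illustrates the idea with simplified regular expressions in Python. The patterns are illustrative approximations, not the JAPE grammar used in the thesis.

# A minimal sketch of reference-identifier extraction (done with JRegex/JAPE
# in the actual system). The regular expressions below are simplified,
# illustrative patterns, not the exact JAPE grammar.
import re

ISBN_PATTERN = re.compile(r'\bISBN[-: ]*((?:97[89][- ]?)?\d{1,5}[- ]?\d+[- ]?\d+[- ]?[\dXx])\b')
ISSN_PATTERN = re.compile(r'\bISSN[-: ]*(\d{4}-\d{3}[\dXx])\b')

def extract_identifiers(text):
    """Return (ISBNs with separators stripped, ISSNs) found in the text."""
    isbns = [re.sub(r'[- ]', '', m) for m in ISBN_PATTERN.findall(text)]
    issns = ISSN_PATTERN.findall(text)
    return isbns, issns

sample = "Deitel & Deitel, Java How to Program, ISBN 0-13-222220-5; ACM TOIS, ISSN 1046-8188."
print(extract_identifiers(sample))   # -> (['0132222205'], ['1046-8188'])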

18

BB-ATC Evaluation

Test dataset: 100 computer science related syllabus documents.

Full results available online at: www.csn.ul.ie/~arash/PDFs/1.pdf

Micro-averaged performance measures

TP    FP   FN   Precision   Recall   F1
210   19   26   0.917       0.889    0.902

[Excerpt from the linked full-results document (“Leveraging the Legacy of Conventional Libraries for Organizing Digital Libraries”, A. Joorabchi and A.E. Mahdi, University of Limerick): the system classified 100 syllabus documents, mainly from the field of computer science; the validity of each assigned DDC class label was examined manually by an expert cataloguer, with class descriptions taken from WebDewey (http://connexion.oclc.org, DDC22). Legend: TP = true positive, FP = false positive, FN = false negative, NC = not catalogued in either the Library of Congress or British Library catalogues, CE = cataloguer’s error (item classified into the wrong class or labelled with an invalid class number).]

Author                Method   Data Set                                                       Classification Scheme                               F1
Pong et al. (2007)    k-NN     505 training & 254 testing documents (web pages)               67 classes from LCC                                 0.80
Pong et al. (2007)    NB       505 training & 254 testing documents (web pages)               67 classes from LCC                                 0.54
Chung et al. (2003)   k-NN     1889 training & 623 test documents (economics-related pages)   575 subclasses of the DDC main class of economics   0.92
This work             BB-ATC   100 computer science related syllabi                           Full DDC scheme                                     0.90

Results published in:

• The proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2009). (Granted the best student paper award)

19

Enhanced BB-ATC method for Automatic Classification of Scientific Literature in Digital Libraries

[System architecture diagram: the chosen document’s metadata records and list of references are taken from the CiteSeer OAI & BibTeX records and pre-processed; the data-mining component queries Google Book Search and the WorldCat catalogue to build a pool of DDC numbers potentially related to the document (stored in eXist-DB); the inferring component then probabilistically chooses a DDC number for the document.]

1,800 publications a day in biomedical science!

The CiteSeer digital library is used as the experimental platform (~1 million records).

The CiteSeer infrastructure is fully open source and supports OAI-PMH.

Using the Google Book Search database for mining citation networks.

Using OCLC’s WorldCat - a union catalogue of 70,000 libraries around the world.

20

Data mining process

[Data-mining flow diagram: for each reference R1…Rn (identified by title) in the document’s metadata (title, authors, abstract, …), Google Book Search returns the list of publications P1…Pn citing that reference; the ISBN of each citing publication is then looked up in the WorldCat catalogue to obtain its DDC number.]
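A minimal sketch of the mining loop above follows. The functions find_citing_books and lookup_ddc are hypothetical stand-ins for the Google Book Search and WorldCat queries performed by the actual system; they are left unimplemented on purpose.

# A minimal sketch of the data-mining flow above. find_citing_books and
# lookup_ddc are hypothetical stand-ins for the Google Book Search and
# WorldCat catalogue queries performed by the actual system.
def find_citing_books(reference_title):
    """Hypothetical: return ISBNs of books (found via GBS) that cite reference_title."""
    raise NotImplementedError

def lookup_ddc(isbn):
    """Hypothetical: return the DDC number for isbn from WorldCat, or None."""
    raise NotImplementedError

def mine_ddc_pool(reference_titles):
    """Collect the pool of DDC numbers potentially related to the document."""
    pool = []
    for title in reference_titles:               # R1 .. Rn
        for isbn in find_citing_books(title):    # publications citing this reference
            ddc = lookup_ddc(isbn)
            if ddc is not None:                  # skip uncatalogued items ("Null")
                pool.append(ddc)
    return pool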

21

Sample Data mining Results (Cont.)

Document’s Title: Statistical Learning, Localization, and Identification of Objects. (has only one reference)

This work describes a statistical approach to deal with learning and recognition problems in the field of computer vision

Citing publications:
No.  ISBN         DDC No.     No.   ISBN         DDC No.
1.   0123797721   006.3/7     9.    3540250468   629.8932
2.   0123797772   006.3/7     10.   3540629092   006.4/2
3.   0769501648   Null        11.   3540634606   006.4/2
4.   0780350987   006.3/7     12.   3540639314   621.36/7
5.   0780399781   Null        13.   3540646132   006.3/7
6.   0792378504   621.36/7    14.   3540650806   006.3
7.   0818681845   621.367     15.   389838019X   005.1/18
8.   1558605835   Null

[DDC hierarchy excerpt, levels 1-5:
Level 1: 0 Computer science, information & general works
Level 2: 00 Computer science, knowledge & systems
Level 3: 006 Special computer methods
Level 4: 006.3 Artificial intelligence; 006.4 Computer pattern recognition
Level 5: 006.31 Machine learning; 006.32 Neural nets (neural networks); 006.33 Knowledge-based systems; 006.35 Natural language processing (NLP); 006.37 Computer vision; 006.42 Optical pattern recognition; 006.45 Acoustical pattern recognition]

DDC No.   Freq
0         17
6         7
00        17
006       15
005       2
0063      11
0064      3
00637     8
621367    4

Reference’s Title: Learning Object Recognition Models from Images

Citing publications:

No. ISBN DDC No. No. ISBN DDC No.

1. 0120147734 537.5/6 8. 3540433996 629.8/92

2. 0195095227 006.3/7 9. 3540617507 006.3/7

3. 0780399773 Null 10. 3540634606 006.4/2

4. 0818638702 621.39/9 11. 3540636366 006.7

5. 1586032577 006.3 12. 3540667229 006.3/7

6. 1848002785 621.367 13. 389838019X 005.1/18

7. 3540282262 006.3 14. 3540404988 006.3/7

22

The same concept as TF-IDF weighting.

Inference & Visualization

[Class-weighting formulas (see thesis Ch. 5 for the exact definitions): the class weight CW(DDCi) of each candidate class is computed from three frequency factors derived from the mined citation data, a global factor GF(DDCi), a normalized local frequency NLF(DDCi), and an un-normalized local frequency ULF(DDCi), scaled according to the depth of DDCi in the DDC hierarchy; a second formula, CWS, propagates weights between parent and child nodes for inference and visualization.]
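The sketch below only illustrates the underlying idea visible in the frequency table on the previous slide: frequencies of mined DDC numbers are aggregated bottom-up over prefixes of the class numbers, and deeper (more specific) classes are favoured. The depth scaling used here is a simplified stand-in, not the CW formula from the thesis, and the sample data is hypothetical.

# Illustrative sketch only: aggregate the frequencies of mined DDC numbers
# over all prefixes of the class hierarchy (as in the DDC No./Freq table on
# the previous slide) and favour deeper, more specific classes. The depth
# scaling used here is a simplified stand-in, NOT the exact CW formula.
from collections import Counter

def aggregate_prefix_frequencies(ddc_numbers):
    """Count how often each hierarchy node (class-number prefix) occurs."""
    counts = Counter()
    for number in ddc_numbers:
        digits = number.replace(".", "")
        for depth in range(1, len(digits) + 1):
            counts[digits[:depth]] += 1
    return counts

def pick_class(ddc_numbers):
    counts = aggregate_prefix_frequencies(ddc_numbers)
    # Simplified score: frequency scaled by depth, so that a frequent specific
    # class can outweigh its even more frequent but shallower ancestors.
    return max(counts, key=lambda node: counts[node] * len(node))

# Hypothetical pool of mined DDC numbers for one document:
mined = ["006.37", "006.37", "006.37", "006.42", "621.367"]
print(pick_class(mined))   # -> '00637', i.e. 006.37 Computer vision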

23

Evaluation Results

Test dataset contains 1000 research documents divided into 5 groups according to their number of references.

No. of references   Micro-avg. Pr   Micro-avg. Re   Micro-avg. F1
0                   0.718           0.523           0.605
4                   0.842           0.820           0.831
8                   0.843           0.829           0.836
16                  0.880           0.860           0.870
32                  0.891           0.880           0.886

[Chart: micro-averaged precision, recall, and F1 plotted against the number of references (0, 4, 8, 16, 32).]

Overall:
Micro-avg. Pr   Micro-avg. Re   Micro-avg. F1
0.84            0.78            0.81

24

Evaluation Results (cont.)

Level   No. of Docs   % of Docs   Micro-avg. Pr   Micro-avg. Re   Micro-avg. F1
1       1000          100%        0.94            0.89            0.91
2       1000          100%        0.92            0.87            0.89
3       1000          100%        0.84            0.80            0.82
4       1000          100%        0.81            0.77            0.79
5       950           95%         0.75            0.66            0.70
6       394           39.4%       0.68            0.63            0.65
7       50            5%          0.59            0.57            0.58
8       20            2%          0.62            0.55            0.58
9       4             0.4%        0.59            0.83            0.69

[Chart: number of documents (%) and micro-averaged precision, recall, and F1 per DDC hierarchy level (1-9).]

Number of documents classified at each level of the DDC hierarchy and the corresponding micro-averaged performance measures.

http://www.skynet.ie/~arash/BB-ATC1/HTML/

Article under review in the journal Information Processing & Management - Elsevier

25

BB-ATC Approach Applied to the Problem of Keyphrase Extraction from Scientific Literature

Keyphrases (multi-word units) describe the content of research documents and are usually assigned by the authors.

The task of automatically assigning keyphrases to a document is called keyphrase indexing.

Considered a form of ATC (ML-based, multi-label) and approached as such.

Free indexing vs. indexing with a controlled vocabulary (e.g., LCSH, MeSH)

Extraction indexing vs. assignment indexing

26

Citation Based Keyphrase Extraction (CKE)

1. Reference extraction using ParsCit [Councill, I. G., et al. 2008] (CRF, F1 = 0.93)

2. Mining the Google Book Search (GBS) database (>10 million archived items) for candidate terms (i.e., Google word clouds)

3. Term weighting & selection

[Data-mining flow diagram: (1) the document’s metadata (title t, authors, abstract, …) and its references R1…Rn (identified by title) are extracted; (2) Google Book Search returns the lists of publications citing the document’s title t and each reference Rn; (3) the ISBNs of the citing publications are used to retrieve their key terms (Google word clouds); (4) candidate terms are weighted and selected.]

27

Term Weighting and Selection

Google Word Cloud (GWC): Google uses TF-IDF plus some heuristic rules to emphasize proper nouns (names, locations, etc.)

GWC for a book titled “Data mining: practical machine learning tools and techniques” (word-cloud image omitted):

Normalization including: stopword removal, punctuation removal, abbreviation expansion, case-folding, and stemming (Porter2 [Porter 2002])
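A minimal sketch of these normalization steps follows: punctuation removal, case-folding, stopword removal, and Porter2 stemming (NLTK's Snowball English stemmer implements Porter2). Abbreviation expansion is omitted, and the stopword list shown is only an illustrative subset, not the one used in the actual system.

# A minimal sketch of the candidate-term normalisation steps listed above:
# punctuation removal, case-folding, stopword removal, and Porter2 stemming
# (NLTK's Snowball English stemmer implements Porter2). Abbreviation
# expansion is omitted; the stopword list is a tiny illustrative subset.
import re
from nltk.stem import SnowballStemmer

STOPWORDS = {"the", "a", "an", "of", "and", "for", "to", "in"}   # illustrative subset
stemmer = SnowballStemmer("english")

def normalise_term(term):
    term = re.sub(r"[^\w\s-]", " ", term.lower())   # punctuation removal + case-folding
    words = [w for w in term.split() if w not in STOPWORDS]
    return " ".join(stemmer.stem(w) for w in words)

print(normalise_term("Practical Machine Learning Tools and Techniques"))
# e.g. -> "practic machin learn tool techniqu"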

Keyphraseness score of each candidate term t, K(t), combines log2-scaled factors derived from the mined data, GF(t), LF(t), RF(t), FO(t), NW(t), NC(t), and ADI(t) (the exact formula is given in Ch. 6 of the thesis).

28

Evaluation & Experimental Results

wiki-20 Test dataset [Medelyan et al., 2009]

20 computer science research papers, each manually indexed by 15 different human teams (teams of 2).

Rolling’s inter-indexer consistency formula was adopted, which is equivalent to the F1 measure:

Inter-indexer consistency = 2C / (A + B), where C is the number of keyphrases the two indexers have in common, and A and B are the numbers of keyphrases assigned by each indexer.
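A minimal sketch of this consistency measure, using hypothetical keyphrase sets:

# Rolling's inter-indexer consistency for two keyphrase sets:
# consistency = 2*C / (A + B), where C is the number of shared keyphrases.
def inter_indexer_consistency(keyphrases_a, keyphrases_b):
    a, b = set(keyphrases_a), set(keyphrases_b)
    common = len(a & b)
    return 2 * common / (len(a) + len(b))

indexer_1 = {"machine learning", "data mining", "decision trees", "weka"}   # hypothetical
indexer_2 = {"machine learning", "data mining", "classification"}           # hypothetical
print(round(inter_indexer_consistency(indexer_1, indexer_2), 3))   # -> 0.571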

29

Evaluation & Experimental Results (cont.)

Performance of the CKE algorithm compared to human indexers and competitive methods.

Group          Method                                          No. of keyphrases per document    Inter-consistency (%): Min. / Avg. / Max.
Manual         Human indexing (gold standard)                  Varied                            21.4 / 30.5 / 37.1
Supervised     KEA (Naïve Bayes)                               Static - 5                        15.5 / 22.6 / 27.3
Supervised     Maui (Naïve Bayes & all features)               Static - 5                        22.6 / 29.1 / 33.8
Supervised     Maui (Bagged Decision Trees & all features)     Static - 5                        25.4 / 30.1 / 38.0
Supervised     Maui (Bagged Decision Trees & best features)    Static - 5                        23.6 / 31.6 / 37.9
Unsupervised   Grineva et al.                                  Static - 5                        18.2 / 27.3 / 33.0
Unsupervised   CKE (condition A)                               Static - 5                        22.7 / 30.6 / 38.3
Unsupervised   CKE (condition B)                               Static - 6                        26.0 / 31.1 / 39.3
Unsupervised   CKE (condition C)                               Varied - same as human indexers   22.0 / 30.5 / 38.7

To appear in Journal of Information Science 36, 6 (December 2010). Published online before print, November 5, 2010.

30

Conclusion & Future Work

The main contribution of this work is the design, development, and evaluation of an alternative approach to ATC by utilizing two new knowledge/data sources:

i. Conventional library classification schemes.

ii. Citation networks among documents.

The proposed approach addresses two major issues:

a) Lack of a standard and comprehensive classification scheme for ATC

b) Lack of training data

Future work includes:

BB-ATC: mining the citing documents as well as the cited ones; multi-label classification.

CKE: utilizing LCSH and user-assigned keyphrases of cited and citing documents.

Applying the underlying theory of BB-ATC to the ACM DL and ACM’s Computing Classification System (ACM CCS).

BB-ATC & CKE as an automatic metadata-generator plug-in for scientific DLs such as Rian (Ireland’s National Research Portal) and NDLTD (Networked Digital Library of Theses and Dissertations).

Visualization

31

• Goal: Collecting unstructured electronic syllabus documents from participating higher education institutes into a metadata-rich central repository.

• Challenges:

– Information Extraction:

• Syllabus documents have arbitrary sizes, formats, and layouts;

• contain multiple module descriptions (e.g., programme documents);

• contain complex layout features (e.g., hidden/nested tables).

– Automatic Classification:

• Lack of a suitable standard education classification scheme for higher education in Ireland.

• Lack of training data

Development of a National Syllabus Repository for Higher Education in Ireland

32

• Classification scheme:

– an enhanced version of International Standard Classification of Education (ISCED).

– 3 levels of classification: broad field (9), narrow field (25), and detailed field (80), each represented by a digit in a hierarchical fashion.

– We have extended this by adding a fourth level of classification, subject field, represented by a letter, following the classification coding system of the Australian Standard Classification of Education (ASCED).

– “482B” - Science, Mathematics and Computing/Computing/Information Systems/Databases

• Naïve Bayes Classification algorithm [Tom Mitchell 1997]

• A Web-based bootstrapping method

Classifier

33

[Example: a definitive programme document (e.g., “MSc in Business Management”, containing an introduction, programme structure, and module descriptions such as “Module 1: BM3222 Leadership Management”) is segmented into individual module syllabi.]

Programme Document Segmenter (PDS)

34

Extracting topical segments of each individual syllabus.

[Example: a module syllabus is split into topical segments such as the header segment, the aims & objectives segment, and the learning outcomes segment.]

Module Syllabus Segmenter (MSS)

35

• Extracts a set of common named entities/attributes, such as module code, module name, module level, number of credits, pre-requisites and co-requisites, from the header segment of syllabi.

CODE: CE 4701 Module: Computer Software 1

GRADING TYPE Normal CREDITS 3

TYPE Core PRE_REQUISITES:None

AIMS/OBJECTIVES To familiarise the student with the use of a computer and typical applications software. To introduce a high-level language, typically Pascal, as a concrete formalism for the representation of algorithms in a machine-readable form.

Named Entity Extractor (NEE)
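The actual NEE is built with GATE; the sketch below only illustrates the idea with simplified regular expressions matched against the sample header shown above. The patterns and field names are illustrative stand-ins.

# An illustrative sketch of header-attribute extraction. The actual NEE uses
# GATE; the regular expressions and field names below are simplified
# stand-ins matched against the sample header shown above.
import re

PATTERNS = {
    "module_code":   re.compile(r"CODE:\s*([A-Z]{2}\s?\d{4})"),
    "module_name":   re.compile(r"Module:\s*(.+)"),
    "credits":       re.compile(r"CREDITS\s+(\d+)"),
    "prerequisites": re.compile(r"PRE_REQUISITES:\s*(\S+)"),
}

def extract_header_attributes(header_text):
    attributes = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(header_text)
        if match:
            attributes[name] = match.group(1).strip()
    return attributes

header = "CODE: CE 4701 Module: Computer Software 1\nGRADING TYPE Normal CREDITS 3\nTYPE Core PRE_REQUISITES: None"
print(extract_header_attributes(header))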

36

• Developing a dynamic ML-based ATC system that can be adapted to a wide range of ATC tasks with minimal effort required from users.

• Users will select a set of categories from a comprehensive standard classification scheme, and a bootstrapping method is used to automatically build a training dataset accordingly.

• Three main components:

– Universal Classification Scheme

– Training Corpus Builder (bootstrapper)

– ML-based Classification Algorithm

Bootstrapping ML-based ATC Systems Utilizing Public Library Resources

37

• Universal Classification Scheme

– Acts as a pool of categories/classes that can be selectively adopted by the users to create their own classification scheme.

– Dewey Decimal Classification (DDC) with thousands of classes has been used in conventional libraries for over a century to categorize library materials.

– DDC is used in about 80% of libraries around the world and has a fully hierarchical structure (vs. LCC)

• Training Corpus Builder

– Textual items classified according to the DDC are not available in electronic format and/or are copyrighted.

– Alternatively, we use small parts of books, such as the topics covered, the back cover, and editorial reviews, which are publicly available on booksellers’ websites such as Amazon.

– Short texts (~500 words) containing semantically rich terms that summarize the book.

• Classification algorithms

– We implemented an optimized version of NB called Transformed Weight-normalized Complement Naive Bayes (TWCNB) [Rennie et al., 2003].

– A linear SVM classifier called LIBLINEAR [Lin, C.-J et al., 2008] which is an optimised implementation of SVM suitable for large linear classification tasks with thousands of features, such as ATC.
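Purely as an off-the-shelf approximation (not the implementation used in this work), scikit-learn's ComplementNB with weight normalisation, fed with sublinear TF-IDF features, comes close to the TWCNB classifier of [Rennie et al., 2003]. The training snippets and DDC labels below are hypothetical.

# Illustrative sketch: scikit-learn's ComplementNB(norm=True) on sublinear
# TF-IDF features approximates TWCNB [Rennie et al., 2003]; shown here only
# as an off-the-shelf stand-in. The blurbs and DDC labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

book_blurbs = [
    "object-oriented programming in Java with case studies",   # hypothetical blurb
    "principles of astronomy and planetary science",            # hypothetical blurb
]
ddc_labels = ["005.133", "520"]

twcnb_like = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),   # TF transform + IDF + length normalisation
    ComplementNB(norm=True),              # complement NB with weight normalisation
)
twcnb_like.fit(book_blurbs, ddc_labels)
print(twcnb_like.predict(["an introduction to java programming"]))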

ATC System Components

38

BB-ATC Performance Compared to Similar Reported Experiments

Author                Method   Data Set                                                       Classification Scheme                               F1
Pong et al. (2007)    k-NN     505 training & 254 testing documents (web pages)               67 classes from LCC                                 0.80
Pong et al. (2007)    NB       505 training & 254 testing documents (web pages)               67 classes from LCC                                 0.54
Chung et al. (2003)   k-NN     1889 training & 623 test documents (economics-related pages)   575 subclasses of the DDC main class of economics   0.92
This work             BB-ATC   100 computer science related syllabi                           Full DDC scheme                                     0.90

• Results published in:

– The proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2009). (Granted the best student paper award)

39

Evaluation & Experimental Results

wiki-20 Test dataset [Medelyan et al., 2009]

20 computer science research papers, each manually indexed by 15 different human teams (teams of 2).

Rolling’s inter-indexer consistency formula was adopted, which is equivalent to the F1 measure:

Inter-indexer consistency = 2C / (A + B), where C is the number of keyphrases the two indexers have in common, and A and B are the numbers of keyphrases assigned by each indexer.

The number of extracted references per document ranges between 10 and 79, with an average of 25.9 references per document.

The number of retrieved GWCs per document ranges between 62 and 766, with an average of 271 GWCs per document.

In total, the data mining unit retrieved the metadata records of 5,576 publications from GBS, which either cite one of the documents in the wiki-20 collection or one of their references; almost all of these records (5,421, i.e., 97.14%) contain a word cloud.

40

F-score

The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score, is:

F = 2 · P · R / (P + R)

This is also known as the F1 measure, because recall and precision are evenly weighted. The general formula for non-negative real β is:

Fβ = (1 + β²) · P · R / (β² · P + R)

Per class ci:

Pr(ci) = TPi / (TPi + FPi) = (number of correctly assigned class labels) / (total assigned)

Re(ci) = TPi / (TPi + FNi) = (number of correctly assigned class labels) / (total possible correct)

F1(ci) = 2 · Pr(ci) · Re(ci) / (Pr(ci) + Re(ci))

[Figure: an example comparing an assigned class (006.3) against the correct class (006.4), illustrating how TP, FP, and FN are counted.]
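A minimal sketch of the micro-averaged measures used throughout the evaluation slides (per-class TP/FP/FN counts are summed before computing precision, recall, and F1), applied to the overall BB-ATC counts reported earlier:

# Micro-averaged precision, recall, and F1 from per-class TP/FP/FN counts.
def micro_averages(per_class_counts):
    """per_class_counts: list of (TP, FP, FN) tuples, one per class."""
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Overall BB-ATC counts reported earlier: TP=210, FP=19, FN=26.
print([round(x, 3) for x in micro_averages([(210, 19, 26)])])
# -> [0.917, 0.89, 0.903] (cf. 0.917 / 0.889 / 0.902 on the results slide)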

41

[Joachims, 1997]

42

Introduction to Automatic Text Classification (ATC)

Motivation, Aim, and Objectives

Bootstrapping ML-based ATC systems

Leveraging Conventional Library resources for Organizing Digital Libraries (BB-ATC)

Enhanced BB-ATC for Automatic Classification of Scientific Literature in Digital Libraries

BB-ATC Approach Applied to the Problem of Keyphrase Extraction from Scientific Literature

Conclusion & Future Work

Outline