Department of Electronic and Computer Engineering
Automatic Subject Classification of Textual Documents Using Limited or No Training Data
Arash Joorabchi
Supervised by Dr. Abdulhussain E. Mahdi
Submitted for the degree of Doctor of Philosophy
10/11/2010
2
Introduction to ATC
Motivation, Aim, and Objectives
Bootstrapping ML-based ATC systems (Ch.3)
Bibliography-Based ATC method (BB-ATC) (Ch.4)
Enhanced BB-ATC for Automatic Classification of Scientific Literature in Digital Libraries (Ch.5)
Citation Based Keyphrase Extraction (CKE) (Ch.6)
Conclusion & Future Work
Outline
3
• Automatic Text Classification/Categorization (ATC)
– Automatic assignment of natural language text documents to one or more predefined classes/categories according to their contents.
• Applications include:
– Spam filtering
– Web information retrieval, e.g., filtering, focused crawling, web directories, subject browsing
– Organising digital libraries
• Common Methods:
– Rule-based Knowledge Engineering (until late 1980s)
– Machine Learning (since 1990s)
Introduction
4
• Common ML algorithms used for ATC
– Naïve Bayes (based on Bayes' theorem)
– k-Nearest Neighbors (k-NN)
– Support Vector Machines (SVM) [Vapnik, V 1995]
• SVM is reported to yield the best prediction accuracy [Joachims, T 1998]. However, the accuracy of ML-based ATC systems depends on many parameters, such as:
– Quantity and quality of training documents
– Document representation models, e.g., bag-of-words vs. bag-of-phrases
– Term weighting mechanisms, e.g., binary vs. multinomial (burstiness phenomenon)
– Feature reduction and selection methods, e.g., document frequency vs. information gain.
• Therefore, the choice of the best classification algorithm depends heavily on the characteristics of the ATC task at hand [Hand, D. J. 2006].
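To make these representation and weighting choices concrete, here is a minimal plain-Python sketch (illustrative corpus and vocabulary, not from the thesis) contrasting binary and term-frequency bag-of-words vectors, with document frequency as a simple feature-selection signal:

```python
from collections import Counter

def bag_of_words(doc, vocab, binary=False):
    """Represent a document as a vector over a fixed vocabulary,
    using binary or raw term-frequency weights."""
    counts = Counter(doc.lower().split())
    return [min(counts[w], 1) if binary else counts[w] for w in vocab]

def document_frequency(corpus):
    """Number of documents each term occurs in: a simple feature-selection signal."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc.lower().split()))
    return df
```

Terms with very low document frequency would be pruned from the vocabulary before training, which is the cheapest of the feature-selection methods listed above.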
ML Approach to ATC
5
• What if there is limited or no training data? (e.g., 100 classes & 200 samples per class)
• Our aim was to alleviate this problem by pursuing two lines of research:
i. Investigating bootstrapping methods to automate the process of building labelled corpora for training ML-based ATC systems.
ii. Investigating a new unsupervised ATC algorithm which does not require any training data.
• In order to realise this aim, we have focused on utilising two sources of data whose application in ATC had not been fully explored before:
a) Conventional library organisation resources such as library classification schemes, controlled vocabularies, and catalogues (OPACs).
b) Linkage among documents in the form of citation networks.
Motivation, Aim, and Objectives
6
[System architecture diagram, showing the components of the syllabus repository system: Hot-Folder Application, FTP server (zip packages), Pre-processing (OpenOffice, Xpdf, PDFTK), Information Extractor (GATE: Program Document Segmenter, Module Syllabus Segmenter, Named Entity Extractor; outputs segment headings and entity names), Meta-data generator module, Classifier, Classification Scheme, Thesaurus, Web Search API, Post-Processing, Repository Database, Web.]
An Overview of the Developed Syllabus Repository System
Development of a National Syllabus Repository for Higher Education in Ireland
• Goal: Collecting unstructured electronic syllabus documents from participating higher education institutes into a metadata-rich central repository.
• Extended the ISCED scheme
• 482B - Science, Mathematics and Computing/Computing/Information Systems/Databases
• Naïve Bayes Classification algorithm [Tom Mitchell 1997]
• A new Web-based bootstrapping method
7
1. A list of subject fields (leaf nodes) in the classification scheme is compiled.
2. For each subject field in the list, a web search query is created comprising the caption of the subject field and the keyword “syllabus”, and submitted to the Yahoo search engine using the Yahoo search SDK.
3. The first hundred URLs in the returned results for each query are passed to the GATE toolkit [Cunningham et al. 2002], which downloads all corresponding files (in HTML, TXT, PDF, or MS Word formats) and extracts and tokenizes their textual contents.
4. The tokenised texts are converted to feature/word vectors, which are then used to train the classifier for classifying syllabus documents at the subject-field level.
5. The subject-field word vectors are also used in a bottom-up fashion to construct word vectors for the fields which belong to the higher levels of the hierarchy (p.52).
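Step 2 above can be sketched as follows (the subject-field codes and captions are hypothetical, and the actual Yahoo SDK call is omitted):

```python
def build_queries(subject_fields, keyword="syllabus"):
    """One web-search query per leaf subject field:
    the field's caption plus the domain keyword."""
    return {code: f'"{caption}" {keyword}' for code, caption in subject_fields.items()}

# hypothetical leaf subject fields (code -> caption)
fields = {"482B": "Databases", "481A": "Programming"}
queries = build_queries(fields)
```

Each query's top results then feed the download-and-tokenize pipeline of steps 3-4.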
Web-based Bootstrapping Process
8
• Test dataset contains 100 undergraduate syllabus documents and 100 postgraduate syllabus documents from 5 participating HE institutes in Ireland
• The micro-average precision achieved by the classifier for undergraduate syllabi is 0.75, compared to 0.60 for postgraduate syllabi.
                  Micro-avg. Precision  Micro-avg. Recall  Micro-avg. F1
Named Entities    0.94                  0.74               0.82
Topical Segments  0.84                  0.72               0.77
• Results published in:
– The proceedings of the 12th European Conference on Research and Advanced Technology for Digital Libraries, ECDL 2008; and
– The Electronic Library, 27, 4 (2009).
Evaluation and Experimental Results
9
[System architecture diagram, showing the components of the developed ATC system: a Training Corpus Builder with a Bootstrapping module (NB Corpus Builder and SVM Corpus Builder, using a Z39.50 API for the LOC OPAC and an HttpClient API for the Internet), a Training Dataset Builder (GATE; general and domain-specific stop words), the Classification Scheme, unlabeled texts, and the TWCNB and LIBLINEAR classifiers.]
Overview of the Developed ATC System
Bootstrapping ML-based ATC Systems Utilizing Public Library Resources
• A dynamic ML-based ATC system that can be adapted for a wide range of ATC tasks with minimal customization required.
• Dewey Decimal Classification (DDC) scheme.
• Small parts of books, such as the back cover and editorial reviews, used for training
• Transformed Weight-normalized Complement Naive Bayes (TWCNB) [Rennie et al., 2003].
• A linear SVM classifier called LIBLINEAR [Lin, C.-J et al., 2008]
10
http://amazon.com/gp/product/ISBN-VALUE
Retrieve a list of books from the LOC's catalogue which are classified into this category.
Extract a list of ISBNs and use them to retrieve the books' descriptions from Amazon.
Data Mining Process
11
• Product Description
– Editorial Reviews:
The Deitels' groundbreaking How to Program series offers unparalleled breadth and depth of object-oriented programming concepts and intermediate-level topics for further study. The Seventh Edition has been extensively fine-tuned and is completely up-to-date with Sun Microsystems, Inc.'s latest Java release, Java Standard Edition 6 (“Mustang”), and several Java Enterprise Edition 5 topics. Contains an extensive OOD/UML 2 case study on developing an automated teller machine. Takes a new tools-based approach to Web application development that uses Netbeans 5.5 and Java Studio Creator 2 to create and consume Web Services. Features new AJAX-enabled Web applications built with JavaServer Faces (JSF), Java Studio Creator 2 and the Java Blueprints AJAX Components. Includes new topics throughout, such as JDBC 4, SwingWorker for multithreaded GUIs, GroupLayout, Java Desktop Integration Components (JDIC), and much more. A valuable reference for programmers and anyone interested in learning the Java programming language.
http://amazon.com/gp/product/0132222205
Parsed Book Description Text
12
• 20-Newsgroup-18828 dataset - a collection of 18,828 newsgroup articles, partitioned across 20 different newsgroups.
• Eight classes in 20-Newsgroup were mapped to their corresponding classes in the Dewey Decimal Classification scheme (the remaining classes were inapplicable, e.g., misc.forsale).
Newsgroup               Dewey Number  Dewey Caption                  No. of training texts collected
sci.space               520           Astronomy and allied sciences  810
rec.sport.baseball      796.357       Baseball                       997
rec.autos               796.7         Driving motor vehicles         587
rec.motorcycles         796.7         Driving motor vehicles         587
soc.religion.christian  230           Christian theology             1043
sci.electronics         537           Electricity and electronics    713
rec.sport.hockey        796.962       Ice hockey                     270
sci.med                 610           Medicine and health            1653
Evaluation and Experimental Results
13
Newsgroup               Bootstrapped TWCNB Precision %  Standard TWCNB Precision %
sci.space               69.19                           94.94
rec.sport.baseball      96.78                           93.96
rec.autos               74.74                           91.91
rec.motorcycles         71.02                           94.97
soc.religion.christian  89.36                           96.0
sci.electronics         69.92                           78.17
rec.sport.hockey        75.77                           98.5
sci.med                 76.23                           96.96
Avg.                    77.87                           93.17
• The precision of the bootstrapped TWCNB is about 15 percentage points lower than that of the standard TWCNB.
• The LIBLINEAR classifier, with an achieved average precision of 68%, turned out to be considerably less accurate than TWCNB in this task.
• Results published in:
– The proceedings of the 19th Irish Conference on Artificial Intelligence and Cognitive Science (AICS08).
Evaluation and Experimental Results (Cont.)
14
Leveraging the Legacy of Conventional Libraries for Organizing Digital Libraries
Can we utilize the classification metadata of books referenced in a syllabus document to classify it? They are already classified by expert library cataloguers according to the DDC and LCC classification schemes.
Tapping into:
The intellectual work that has been put into developing and maintaining library classification systems over the last century.
The intellectual effort of expert cataloguers who have manually classified millions of books and other resources in libraries.
15
Bibliography-based ATC method
BB-ATC is based on Automating the following processes:
1. Identifying and extracting references in a given document.
2. Searching catalogues of physical libraries for the extracted references in order to retrieve their classification metadata.
3. Allocating a class(es) to the document based on retrieved classification category(ies) of the references with the help of a weighting mechanism.
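The three steps can be sketched as a simple voting scheme (the dictionary stands in for a live OPAC lookup; the weighting used here, the fraction of resolvable references per class, is only a stand-in for the thesis's actual mechanism):

```python
from collections import Counter

def classify_by_references(ref_isbns, catalogue):
    """Resolve each extracted reference to its catalogued class and weight
    each candidate class by the fraction of resolvable references it covers
    (0 < w <= 1). `catalogue` stands in for a live OPAC lookup."""
    classes = [catalogue[isbn] for isbn in ref_isbns if isbn in catalogue]
    if not classes:
        return []
    counts = Counter(classes)
    return sorted(((cls, n / len(classes)) for cls, n in counts.items()),
                  key=lambda pair: -pair[1])
```

References not found in any catalogue (the “NC” case in the results legend) simply drop out of the vote.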
Similarities to the k-Nearest Neighbour (k-NN) algorithm
16
Bibliography-based ATC
Advantages Over ML-based ATC systems:
Library classification schemes are regularly updated and contain thousands of classes in every field of knowledge.
No training data is needed.
Performance is not adversely affected by the large number of classes in DDC and LCC.
New books are catalogued every day, and hence there is no concept drift.
17
[System architecture diagram: Pre-processing (OpenOffice, Xpdf) converts input documents (*.PDF, *.HTM, *.DOC) to plain text; an Information Extractor (GATE) extracts reference identifiers (ISBNs/ISSNs); a Catalogue-Search component retrieves the references' DDC class numbers and LCSHs from the LOC and BL catalogues; the Classifier stores a weighted list of chosen DDC class(es) and LCSHs in the Syllabi DB.]
BB-ATC Implementation
The DDC classification scheme was adopted because of its worldwide usage and hierarchical structure.
The JZKit Java API is used to communicate with the libraries' OPAC catalogues through the Z39.50 protocol.
JRegex/JAPE is used for extracting ISBNs/ISSNs.
Multi-label classification is performed by assigning weights (0 < w ≤ 1) to candidate DDC classes and LCSHs.
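Reference-identifier extraction can be sketched as below (the thesis uses JRegex/JAPE and also handles ISSNs; this sketch covers only plain 10-digit ISBNs with checksum validation):

```python
import re

ISBN10_RE = re.compile(r"\b(\d{9}[\dXx])\b")

def isbn10_valid(isbn):
    """ISBN-10 checksum: the sum of digit i times weight (10 - i) must be
    divisible by 11, with a trailing 'X' standing for 10."""
    total = sum((10 if ch in "Xx" else int(ch)) * (10 - i) for i, ch in enumerate(isbn))
    return total % 11 == 0

def extract_isbns(text):
    """Find 10-character ISBN candidates in free text, keep checksum-valid ones."""
    return [cand for cand in ISBN10_RE.findall(text) if isbn10_valid(cand)]
```

Checksum validation matters here because a 10-digit number in a bibliography is not necessarily an ISBN.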
18
BB-ATC Evaluation
Test dataset: 100 computer science related syllabus documents.
Full results available online at: www.csn.ul.ie/~arash/PDFs/1.pdf
Micro-averaged performance measures:

TP   FP  FN  Precision  Recall  F1
210  19  26  0.917      0.889   0.902
Leveraging the Legacy of Conventional Libraries for Organizing Digital Libraries
Arash Joorabchi, Abdulhussain E. Mahdi
Department of Electronic and Computer Engineering, University of Limerick, Republic of Ireland.
This document contains the full experimental results of our BB-ATC system. The proposed ATC system was used to automatically classify 100 syllabus documents which mainly belong to the field of computer science. The validity and correctness of each assigned DDC class label was examined manually by an expert cataloguer. Where necessary, additional notes are provided to help clarify the results. Each time a new class appears in the results, if the caption of the class is not self-explanatory, some additional information about that class is provided in the form of footnotes. The source for these class descriptions is the WebDewey website (http://connexion.oclc.org), which provides access to the latest version of the DDC scheme (DDC22 at the time of creating this document).
Classification results summary

True Positive  False Positive  False Negative  Precision  Recall  F1
210            19              26              0.917      0.889   0.902
LEGEND
TP True Positive
FP False Positive
FN False Negative
NC Not Catalogued: the referenced item is not catalogued in either Library of Congress or British Library catalogues.
CE Cataloguer’s Error: the cataloguers in either the Library of Congress or the British Library have classified the item into the wrong class (manual classification error) or have labelled the item with an invalid class number (data entry error).
Corresponding author. Tel.: (+)353-61-213492; Fax:(+)353-61-338176. E-mail addresses: [email protected] (A. Joorabchi), [email protected] (A.E. Mahdi).
Author               Method  Data Set                                                           Classification Scheme                              F1
Pong et al. (2007)   k-NN    505 training & 254 testing documents (web pages)                   67 classes from LCC                                0.80
Pong et al. (2007)   NB      505 training & 254 testing documents (web pages)                   67 classes from LCC                                0.54
Chung et al. (2003)  k-NN    1889 training & 623 test documents (economics-related web pages)   575 subclasses of the DDC main class of economics  0.92
BB-ATC (proposed)    BB-ATC  100 computer science related syllabi                               Full DDC scheme                                    0.90
• Results published in:
– The proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2009). (Granted the best student paper award)
19
Enhanced BB-ATC method for Automatic Classification of Scientific Literature in Digital Libraries
[System architecture diagram: Pre-processing of CiteSeer OAI & BibTeX records; a Data mining component queries Google Book Search and the WorldCat catalogue using the chosen document's metadata records and list of references, producing a pool of DDC numbers potentially related to the document; an Inferring component (backed by eXist-DB) outputs the probabilistically chosen DDC number for the document.]
1,800 publications a day in biomedical science alone!
The CiteSeer digital library is used as the experimental platform (~1 million records).
The CiteSeer infrastructure is fully open source and supports OAI-PMH.
The Google Book Search database is used for mining citation networks.
OCLC's WorldCat, a union catalogue of 70,000 libraries around the world, is used.
20
Data mining process
[Data mining process diagram: for each reference R1…Rn (title) extracted from the document's metadata (title, authors, abstract), Google Book Search returns the list of publications P1…Pn citing that reference; each citing publication's ISBN is then resolved to a DDC number via the WorldCat catalogue.]
21
Sample Data mining Results (Cont.)
Document’s Title: Statistical Learning, Localization, and Identification of Objects (has only one reference)
This work describes a statistical approach to deal with learning and recognition problems in the field of computer vision.
Citing publications:

No.  ISBN        DDC No.
1    0123797721  006.3/7
2    0123797772  006.3/7
3    0769501648  Null
4    0780350987  006.3/7
5    0780399781  Null
6    0792378504  621.36/7
7    0818681845  621.367
8    1558605835  Null
9    3540250468  629.8932
10   3540629092  006.4/2
11   3540634606  006.4/2
12   3540639314  621.36/7
13   3540646132  006.3/7
14   3540650806  006.3
15   389838019X  005.1/18
DDC hierarchy (levels 1-5) around the candidate classes:

0 Computer science, information & general works (Level 1)
  00 Computer science, knowledge & systems (Level 2)
    006 Special computer methods (Level 3)
      006.3 Artificial intelligence (Level 4)
        006.31 Machine learning (Level 5)
        006.32 Neural nets (neural networks)
        006.33 Knowledge-based systems
        006.35 Natural language processing (NLP)
        006.37 Computer vision
      006.4 Computer pattern recognition (Level 4)
        006.42 Optical pattern recognition (Level 5)
        006.45 Acoustical pattern recognition
DDC No.  Freq
0        17
6        7
00       17
006      15
005      2
0063     11
0064     3
00637    8
621367   4
Reference’s Title: Learning Object Recognition Models from Images
Citing publications:

No.  ISBN        DDC No.
1    0120147734  537.5/6
2    0195095227  006.3/7
3    0780399773  Null
4    0818638702  621.39/9
5    1586032577  006.3
6    1848002785  621.367
7    3540282262  006.3
8    3540433996  629.8/92
9    3540617507  006.3/7
10   3540634606  006.4/2
11   3540636366  006.7
12   3540667229  006.3/7
13   389838019X  005.1/18
14   3540404988  006.3/7
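The prefix-frequency table above can be reproduced by pooling the DDC numbers of the citing publications from both tables and counting, for every class prefix, how many publications fall under it (normalising segmented numbers such as 006.3/7 to 00637 is an assumption about the thesis's procedure):

```python
from collections import Counter

def normalise(ddc):
    """'006.3/7' -> '00637' (drop the decimal point and segmentation slash)."""
    return ddc.replace(".", "").replace("/", "")

def prefix_frequencies(ddc_numbers):
    """For every class prefix, count how many citing publications fall under it."""
    freq = Counter()
    for number in ddc_numbers:
        digits = normalise(number)
        for i in range(1, len(digits) + 1):
            freq[digits[:i]] += 1
    return freq

# DDC numbers of the citing publications from the two tables (Null entries omitted)
ddcs = [
    "006.3/7", "006.3/7", "006.3/7", "621.36/7", "621.367", "629.8932",
    "006.4/2", "006.4/2", "621.36/7", "006.3/7", "006.3", "005.1/18",
    "537.5/6", "006.3/7", "621.39/9", "006.3", "621.367", "006.3",
    "629.8/92", "006.3/7", "006.4/2", "006.7", "006.3/7", "005.1/18", "006.3/7",
]
```

Running `prefix_frequencies(ddcs)` reproduces the counts in the frequency table, e.g., 17 publications under prefix 0 and 8 under 00637.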
22
Inference & Visualization

The class weighting uses the same concept as TF-IDF weighting. For each candidate class DDC_i, three frequency measures are accumulated over the document's m references, where Freq(DDC_{i,j}) is the number of publications citing reference R_j that are classified under DDC_i:

ULF(DDC_i) = Σ_{j=1..m} Freq(DDC_{i,j})  (un-normalised local frequency)

NLF(DDC_i) = Σ_{j=1..m} Freq(DDC_{i,j}) / |R_j|  (local frequency normalised by the number of publications citing R_j)

GF(DDC_i) = Σ_{j=1..m} [DDC_i occurs among the publications citing R_j]  (global frequency)

These measures, together with the depth of DDC_i in the DDC hierarchy, are combined into a class weight CW(DDC_i); a second pass then propagates weights between parent (pn) and child (cn) nodes to select the final class.
23
Evaluation Results
Test dataset contains 1,000 research documents divided into 5 groups according to their number of references.

No. of references  Micro-avg. Pr  Micro-avg. Re  Micro-avg. F1
0                  0.718          0.523          0.605
4                  0.842          0.820          0.831
8                  0.843          0.829          0.836
16                 0.880          0.860          0.870
32                 0.891          0.880          0.886
[Chart: micro-averaged precision, recall, and F1 plotted against the number of references (0, 4, 8, 16, 32).]

Overall: Micro-avg. Pr 0.84, Micro-avg. Re 0.78, Micro-avg. F1 0.81
24
Evaluation Results (cont.)
Level  No. of Docs  % of Docs  Micro-avg. Pr  Micro-avg. Re  Micro-avg. F1
1      1000         100%       0.94           0.89           0.91
2      1000         100%       0.92           0.87           0.89
3      1000         100%       0.84           0.80           0.82
4      1000         100%       0.81           0.77           0.79
5      950          95%        0.75           0.66           0.70
6      394          39.4%      0.68           0.63           0.65
7      50           5%         0.59           0.57           0.58
8      20           2%         0.62           0.55           0.58
9      4            0.4%       0.59           0.83           0.69
[Chart: number of documents classified at each DDC hierarchy level (1-9) with the corresponding micro-averaged precision, recall, and F1.]

Number of documents classified in each level of the DDC hierarchy and corresponding micro-averaged performance measures.
http://www.skynet.ie/~arash/BB-ATC1/HTML/
Article under review in the journal Information Processing & Management - Elsevier
25
BB-ATC Approach Applied to the Problem of Keyphrase Extraction from Scientific Literature
Keyphrases (multi-word units) describe the content of research documents and are usually assigned by the authors.
The task of automatically assigning keyphrases to a document is called keyphrase indexing.
It is considered a form of ATC (ML-based, multi-label) and approached as such.
Free indexing vs. indexing with a controlled vocabulary (e.g., LCSH, MeSH)
Extraction indexing vs. assignment indexing
26
Citation Based Keyphrase Extraction (CKE)
1. Reference extraction using ParsCit [Councill, I. G., et al. 2008] (CRF, F1 = 0.93)
2. Mining the Google Book Search (GBS) database (>10 million archived items) for candidate terms (i.e., Google word clouds)
3. Term weighting & selection
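Steps 1-3 can be sketched end-to-end as follows (the `citing_clouds` mapping is a hypothetical stand-in for the Google Book Search lookup, and real word clouds are far larger):

```python
from collections import Counter

def candidate_terms(reference_titles, citing_clouds):
    """Pool the word-cloud terms of every publication citing each extracted
    reference (step 2); term frequency across the pool is the simplest
    weighting signal for step 3."""
    pool = Counter()
    for title in reference_titles:
        for cloud in citing_clouds.get(title, []):  # one cloud per citing publication
            pool.update(cloud)
    return pool

# hypothetical GBS results: reference title -> word clouds of its citing books
clouds = {
    "Learning Object Recognition Models from Images": [
        ["computer vision", "machine learning"],
        ["object recognition", "machine learning"],
    ],
}
```

Terms that recur across the citing publications of several references rise to the top of the pool, which is the intuition behind the weighting scheme detailed next.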
[Data mining diagram: (1) the document's metadata (title t, authors, abstract) and its references R1…Rn are extracted; (2) Google Book Search returns the lists of publications citing t and citing each reference; (3) each citing publication's ISBN is resolved to its key terms; (4) the pooled terms are weighted and selected.]
27
Term Weighting and Selection
Google Word Cloud (GWC): Google uses TF-IDF plus some heuristic rules to emphasise proper nouns (names, locations, etc.)
GWC for a book titled: “Data mining: practical machine learning tools and techniques”:
Normalization including: stopword removal, punctuation removal, abbreviation expansion, case-folding, and stemming (Porter2 [Porter 2002])
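The normalisation chain can be sketched as follows (the suffix-stripper below is a toy stand-in for the Porter2 stemmer, and the stopword list is illustrative):

```python
import re
import string

STOPWORDS = {"the", "of", "and", "a", "to", "in"}  # tiny illustrative list

def crude_stem(word):
    """Toy suffix-stripper standing in for the Porter2 stemmer used in the thesis."""
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalise_term(term):
    """Case-fold, strip punctuation, drop stopwords, and stem a candidate term."""
    cleaned = re.sub(f"[{re.escape(string.punctuation)}]", " ", term.lower())
    return [crude_stem(w) for w in cleaned.split() if w not in STOPWORDS]
```

Normalisation lets surface variants of the same candidate term (e.g., differing in case, hyphenation, or inflection) be counted together.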
The keyphraseness score K(t) of each candidate term t is measured as a combination of log₂-scaled factors of t, including GF(t), LF(t), RF(t), FO(t), NW(t), NC(t), and ADI(t).
tRFtLFtGFtK
28
Evaluation & Experimental Results
wiki-20 test dataset [Medelyan et al., 2009]: 20 computer science research papers, each manually indexed by 15 different human teams (teams of 2).

Rolling's inter-indexer consistency formula is adopted, which is equivalent to the F1 measure:

Inter-indexer consistency = 2C / (A + B)

where C is the number of index terms two indexers have in common, and A and B are the numbers of terms assigned by each.
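Rolling's measure is straightforward to compute; a sketch:

```python
def inter_indexer_consistency(terms_a, terms_b):
    """Rolling's measure 2C / (A + B): C = index terms in common,
    A and B = numbers of terms assigned by each indexer (equivalent to F1)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    common = len(a & b)
    return 2 * common / (len(a) + len(b))
```

Averaging this value over all indexer pairs gives the per-method figures reported in the comparison table.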
29
Evaluation & Experimental Results (cont.)
Performance of the CKE algorithm compared to human indexers and competitive methods:

Group         Method                                        Keyphrases per document          Inter-consistency (%) Min / Avg / Max
Manual        Human indexing (gold standard)                Varied                           21.4 / 30.5 / 37.1
Supervised    KEA (Naïve Bayes)                             Static: 5                        15.5 / 22.6 / 27.3
Supervised    Maui (Naïve Bayes & all features)             Static: 5                        22.6 / 29.1 / 33.8
Supervised    Maui (Bagged Decision Trees & all features)   Static: 5                        25.4 / 30.1 / 38.0
Supervised    Maui (Bagged Decision Trees & best features)  Static: 5                        23.6 / 31.6 / 37.9
Unsupervised  Grineva et al.                                Static: 5                        18.2 / 27.3 / 33.0
Unsupervised  CKE (condition A)                             Static: 5                        22.7 / 30.6 / 38.3
Unsupervised  CKE (condition B)                             Static: 6                        26.0 / 31.1 / 39.3
Unsupervised  CKE (condition C)                             Varied: same as human indexers   22.0 / 30.5 / 38.7
To appear in the Journal of Information Science 36, 6 (December 2010). Published online before print November 5, 2010.
30
Conclusion & Future Work

The main contribution of this work is the design, development, and evaluation of an alternative approach to ATC by utilizing two new knowledge/data sources:
i. Conventional library classification schemes.
ii. Citation networks among documents.
The proposed approach addresses two major issues
a) Lack of a standard and comprehensive classification scheme for ATC
b) Lack of training data
Future work includes:
BB-ATC: mining the citing documents as well as the cited ones; multi-label classification.
CKE: utilizing LCSH and user-assigned keyphrases of cited and citing documents.
Applying the underlying theory of BB-ATC to the ACM DL and ACM's Computing Classification System (ACM CCS).
BB-ATC & CKE as an automatic metadata-generator plug-in for scientific DLs such as RIAN, Ireland's National Research Portal, and NDLTD (Networked Digital Library of Theses and Dissertations).
Visualization
31
• Goal: Collecting unstructured electronic syllabus documents from participating higher education institutes into a metadata-rich central repository.
• Challenges:
– Information Extraction:
• Syllabus documents have arbitrary sizes, formats, and layouts;
• contain multiple module descriptions (e.g., programme documents);
• contain complex layout features (e.g., hidden/nested tables).
– Automatic Classification:
• Lack of a suitable standard education classification scheme for higher education in Ireland.
• Lack of training data
Development of a National Syllabus Repository for Higher Education in Ireland
32
• Classification scheme:
– an enhanced version of International Standard Classification of Education (ISCED).
– 3 levels of classification: broad field (9), narrow field (25), and detailed field (80), each represented by a digit in a hierarchical fashion.
– We have extended this by adding a fourth level of classification, subject field, represented by a letter in the classification coding system from the Australian Standard Classification of Education (ASCED).
– “482B”: Science, Mathematics and Computing / Computing / Information Systems / Databases
• Naïve Bayes Classification algorithm [Tom Mitchell 1997]
• A Web-based bootstrapping method
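The four-level coding described above can be unpacked mechanically (a sketch; the code format is assumed to be three digits plus one letter, as in “482B”):

```python
def parse_isced_code(code):
    """Split an extended ISCED code such as '482B' into its four levels:
    broad field (4), narrow field (48), detailed field (482), subject field (482B)."""
    if not (len(code) == 4 and code[:3].isdigit() and code[3].isalpha()):
        raise ValueError(f"expected three digits plus a letter, got {code!r}")
    return {"broad": code[:1], "narrow": code[:2], "detailed": code[:3], "subject": code}
```

The prefix structure is what lets subject-field word vectors be aggregated bottom-up into vectors for the higher levels.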
Classifier
33
Definitive Programme Document
MSc in Business Management
Introduction...
Programme Structure...
Module 1: BM3222
Leadership Management...
Module Syllabus
Programme Document Segmenter (PDS)
34
Extracting topical segments of each individual syllabus.
[Diagram: a module syllabus split into its topical segments, e.g., header segment, aims & objectives segment, learning outcomes segment.]
Module Syllabus Segmenter (MSS)
35
• Extracts a set of common named entities/attributes, such as module code, module name, module level, number of credits, pre-requisites, and co-requisites, from the header segment of syllabi.
CODE: CE 4701 Module: Computer Software 1
GRADING TYPE Normal CREDITS 3
TYPE Core PRE_REQUISITES: None
AIMS/OBJECTIVES To familiarise the student with the use of a computer and typical applications software. To introduce a high-level language, typically Pascal, as a concrete formalism for the representation of algorithms in a machine-readable form.
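A minimal regex sketch of what the NEE does for a header like the one above (the real system uses GATE/JAPE grammars; these patterns are illustrative only):

```python
import re

HEADER = """CODE: CE 4701 Module: Computer Software 1
GRADING TYPE Normal CREDITS 3
TYPE Core PRE_REQUISITES: None"""

PATTERNS = {  # illustrative patterns, one capture group per entity
    "module_code": r"CODE:\s*([A-Z]{2}\s*\d{4})",
    "module_name": r"Module:\s*(.+)",
    "credits": r"CREDITS\s+(\d+)",
    "prerequisites": r"PRE_REQUISITES:\s*(\w+)",
}

def extract_entities(text):
    """Pull each named entity out of a syllabus header segment."""
    return {name: m.group(1) for name, pat in PATTERNS.items()
            if (m := re.search(pat, text))}
```

In practice grammar rules are preferable to bare regexes here, because header layouts vary widely across institutes.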
Named Entity Extractor (NEE)
36
• Developing a dynamic ML-based ATC system that can be adapted for a wide range of ATC tasks with minimal effort required from users.
• Users will select a set of categories from a comprehensive standard classification scheme, and a bootstrapping method is used to automatically build a training dataset accordingly.
• Three main components:
– Universal Classification Scheme
– Training Corpus Builder (bootstrapper)
– ML-based Classification Algorithm
Bootstrapping ML-based ATC Systems Utilizing Public Library Resources
37
• Universal Classification Scheme
– Acts as a pool of categories/classes that can be selectively adopted by the users to create their own classification scheme.
– Dewey Decimal Classification (DDC) with thousands of classes has been used in conventional libraries for over a century to categorize library materials.
– DDC is used in about 80% of libraries around the world and has a fully hierarchical structure (vs. LCC)
• Training Corpus Builder
– Textual items classified according to DDC are not available in an electronic format and/or are copyrighted.
– Alternatively, we use small parts of books, such as the topics covered, the back cover, and editorial reviews, publicly available on booksellers' websites such as Amazon.
– Short text (~500 words) containing semantically-rich terms used to summarize the book.
• Classification algorithms
– We implemented an optimized version of NB called Transformed Weight-normalized Complement Naive Bayes (TWCNB) [Rennie et al., 2003].
– A linear SVM classifier called LIBLINEAR [Lin, C.-J et al., 2008] which is an optimised implementation of SVM suitable for large linear classification tasks with thousands of features, such as ATC.
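The complement-counting idea behind TWCNB can be sketched as follows (a simplified sketch: the TF log-transform, IDF, and length normalisation of the full TWCNB are omitted, keeping only complement estimation and weight normalisation):

```python
import math
from collections import Counter, defaultdict

def train_cnb(docs, alpha=1.0):
    """Complement NB: estimate each class's word weights from the counts of all
    *other* classes, then normalise the weight vector."""
    vocab = {w for _, words in docs for w in words}
    total = Counter(w for _, words in docs for w in words)
    per_class = defaultdict(Counter)
    for label, words in docs:
        per_class[label].update(words)
    weights = {}
    for label, counts in per_class.items():
        comp = total - counts                   # word counts of the complement classes
        denom = sum(comp.values()) + alpha * len(vocab)
        w = {t: math.log((comp[t] + alpha) / denom) for t in vocab}
        norm = sum(abs(v) for v in w.values())  # weight-normalisation step
        weights[label] = {t: v / norm for t, v in w.items()}
    return weights

def predict(weights, words):
    """Pick the class whose complement model matches the document least."""
    def complement_score(label):
        return sum(weights[label].get(t, 0.0) for t in words)
    return min(weights, key=complement_score)
```

Estimating from the complement counts counteracts the skewed-training-data bias of plain NB, which matters for bootstrapped corpora of uneven size.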
ATC System Components
38
BB-ATC Performance Compared to Similar Reported Experiments
Author               Method  Data Set                                                           Classification Scheme                              F1
Pong et al. (2007)   k-NN    505 training & 254 testing documents (web pages)                   67 classes from LCC                                0.80
Pong et al. (2007)   NB      505 training & 254 testing documents (web pages)                   67 classes from LCC                                0.54
Chung et al. (2003)  k-NN    1889 training & 623 test documents (economics-related web pages)   575 subclasses of the DDC main class of economics  0.92
BB-ATC (proposed)    BB-ATC  100 computer science related syllabi                               Full DDC scheme                                    0.90
• Results published in:
– The proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, (ECDL 2009). (Granted the best student paper award)
39
Evaluation & Experimental Results
wiki-20 test dataset [Medelyan et al., 2009]: 20 computer science research papers, each manually indexed by 15 different human teams (teams of 2).

The number of extracted references per document ranges between 10 and 79, with an average of 25.9 references per document.

The number of retrieved GWCs per document ranges between 62 and 766, with an average of 271 GWCs per document.

In total, the data mining unit retrieved the metadata records of 5,576 publications from GBS which either cite one of the documents in the wiki-20 collection or one of their references, and almost all of these records (5,421, or 97.14%) contain a word cloud.

Rolling's inter-indexer consistency formula is adopted, which is equivalent to the F1 measure:

Inter-indexer consistency = 2C / (A + B)

where C is the number of index terms two indexers have in common, and A and B are the numbers of terms assigned by each.
40
F-score

The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score, is also known as the F1 measure, because recall and precision are evenly weighted. For each class c_i:

Re(c_i) = TP_i / (TP_i + FN_i)  (number of correctly assigned class labels / total possible correct)

Pr(c_i) = TP_i / (TP_i + FP_i)  (number of correctly assigned class labels / total assigned)

F1(c_i) = 2 · Pr(c_i) · Re(c_i) / (Pr(c_i) + Re(c_i))

The general formula for non-negative real β is:

Fβ = (1 + β²) · Pr · Re / (β² · Pr + Re)

Example: assigned class 006.3 vs. correct class 006.4. The shared digits 0, 0, 6 count as true positives (TP = 3); the incorrect final digit 3 is a false positive (FP = 1) and the missed digit 4 is a false negative (FN = 1).
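The digit-level example above can be checked in code (a sketch of how TP/FP/FN are counted when an assigned DDC number is compared with the correct one):

```python
def digit_level_counts(assigned, correct):
    """TP/FP/FN at the DDC-digit level: digits on the shared prefix are true
    positives; remaining assigned digits are false positives, remaining
    correct digits are false negatives."""
    a, c = assigned.replace(".", ""), correct.replace(".", "")
    tp = 0
    for x, y in zip(a, c):
        if x != y:
            break
        tp += 1
    return tp, len(a) - tp, len(c) - tp

def precision_recall_f1(tp, fp, fn):
    """Standard Pr/Re/F1 from raw counts."""
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    return pr, re, 2 * pr * re / (pr + re)
```

Counting at the digit level rewards a classifier that lands in the right branch of the hierarchy even when the deepest digit is wrong.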
42
Introduction to Automatic Text Classification (ATC)
Motivation, Aim, and Objectives
Bootstrapping ML-based ATC systems
Leveraging Conventional Library resources for Organizing Digital Libraries (BB-ATC)
Enhanced BB-ATC for Automatic Classification of Scientific Literature in Digital Libraries
BB-ATC Approach Applied to the Problem of Keyphrase Extraction from Scientific Literature
Conclusion & Future Work
Outline