© Lingfeng Mo Classifying Programming Newsgroup Discussions using Text Categorization Algorithms 1/19/2011 1
A Study of Text Categorization
Classifying Programming Newsgroup Discussions using Text Categorization
Algorithms
by
Lingfeng Mo
What is text categorization?
Definition
– Classification of documents into a fixed number of predefined categories.
– Sometimes also referred to as text data mining.
What is Data Mining?
Many definitions
– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Data Mining Tasks
Prediction Methods
– Use some variables to predict unknown or future values of other variables.
Description Methods
– Find human-interpretable patterns that describe the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Classification: Definition
Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.
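The split described on this slide can be sketched in plain Java; the 90/10 portion and the fixed random seed below are illustrative choices, not taken from the study.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TrainTestSplit {

    // Shuffle the records, then cut the list at the given training portion.
    // Everything before the cut is the training set; the rest is the test set.
    static <T> List<List<T>> split(List<T> records, double trainPortion, long seed) {
        List<T> shuffled = new ArrayList<>(records);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainPortion);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));               // training set
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size()))); // test set
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) records.add(i);
        List<List<Integer>> parts = split(records, 0.9, 42L);
        System.out.println("train size = " + parts.get(0).size()
                + ", test size = " + parts.get(1).size());
    }
}
```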
Classification Example
Training Set -> Learn -> Model (Classifier) -> applied to Test Set
Randomly choose a certain portion of the data:
– 12 German documents
– 12 English documents
Classification Example Result
Why our study?
Programmers often seek and exchange information online about problems with a certain library, framework, or API.
Titles often do not correspond to the content of a newsgroup discussion.
Novices may not know how to phrase a question precisely.
By categorizing an ongoing discussion, such techniques could directly point developers who ask questions to previous discussions of similar problems.
What’s this study for?
Ideal goal: automatically classify discussions into meaningful semantic categories.
– Approach:
Collect and save raw data
Import and optimize data
Select a certain portion of the data to train a classifier model
Classify data
Evaluate results
Tool we use
MALLET (MAchine Learning for LanguagE Toolkit) is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
See: http://mallet.cs.umass.edu/
Data Collecting
Download discussions from a Java programming forum.
Save each discussion as a text document (.txt) – Article(text, name)
Manually put similar discussions into the same folder (labels).
Import Data
Input: Labels included with their articles
– How does it work?
Output: MALLET instances
Pipeline: CharSequence -> TokenSequence -> FeatureVector
Each instance has three parts:
– Data: the feature vector built from the article text
– Name/Source: drive name + folder name + article name
– Target: the label (folder name)
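The stages above can be imitated in plain Java. This is only a conceptual sketch of the CharSequence -> TokenSequence -> FeatureVector flow, not MALLET's actual pipe API; the tokenization rule (lower-case, split on non-letters) is our simplification.

```java
import java.util.HashMap;
import java.util.Map;

public class ImportPipeline {

    // CharSequence -> TokenSequence: lower-case the text and split on non-letters.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z]+");
    }

    // TokenSequence -> FeatureVector: count the occurrences of each token.
    static Map<String, Integer> toFeatureVector(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            if (!t.isEmpty()) counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> fv = toFeatureVector(tokenize("JButton button, button!"));
        System.out.println(fv);  // counts for "jbutton" and "button"
    }
}
```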
Example of Import Data
MALLET commands: import-file, import-dir
Training Classifier
Input: the data produced by the importing process
Set the training portion for the training and test sets
K-fold cross-validation – usually 10 trials
Set the trainer – NaiveBayesTrainer
Basic Bayes theorem
A probabilistic framework for solving classification problems
Conditional probability:
P(C|A) = P(A,C) / P(A)
P(A|C) = P(A,C) / P(C)
Bayes theorem:
P(C|A) = P(A|C) P(C) / P(A)
Example of Bayes Theorem
Given:
– A doctor knows that a cold causes a cough 50% of the time.
– The prior probability of any patient having a cold is 1/50,000.
– The prior probability of any patient having a cough is 1/20.
If a patient has a cough, what is the probability that he/she has a cold?
P(cold|cough) = P(cough|cold) P(cold) / P(cough) = 0.5 × (1/50,000) / (1/20) = 0.0002
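The arithmetic can be checked directly in code:

```java
public class BayesExample {

    // Bayes theorem: P(M|S) = P(S|M) * P(M) / P(S)
    static double posterior(double pSGivenM, double pM, double pS) {
        return pSGivenM * pM / pS;
    }

    public static void main(String[] args) {
        // P(cough|cold) = 0.5, P(cold) = 1/50,000, P(cough) = 1/20
        double pColdGivenCough = posterior(0.5, 1.0 / 50000, 1.0 / 20);
        System.out.println(pColdGivenCough);  // approximately 0.0002
    }
}
```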
Example of Train Classifier
Output of Classification
Confusion matrix
Test data accuracy for every trial
Train data accuracy mean
– Standard Deviation
– Standard Error
Test data accuracy mean
– Standard Deviation
– Standard Error
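The mean, standard deviation, and standard error reported for each run can be computed as follows. Using the sample (n - 1) formula for the standard deviation is our assumption; the slides do not say which formula MALLET uses.

```java
public class TrialStats {

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation (n - 1 in the denominator).
    static double stdDev(double[] xs) {
        double m = mean(xs), sum = 0;
        for (double x : xs) sum += (x - m) * (x - m);
        return Math.sqrt(sum / (xs.length - 1));
    }

    // Standard error of the mean.
    static double stdErr(double[] xs) {
        return stdDev(xs) / Math.sqrt(xs.length);
    }

    public static void main(String[] args) {
        // Per-trial test accuracies from the raw MALLET experiment reported later in the deck.
        double[] acc = {0.26, 0.19, 0.28, 0.24, 0.24, 0.28, 0.26, 0.28, 0.32, 0.24};
        System.out.printf("mean=%.4f sd=%.4f se=%.4f%n",
                mean(acc), stdDev(acc), stdErr(acc));
    }
}
```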
Example picture of classification (1 of 2)
Example picture of classification (2 of 2)
How to improve the accuracy?
Increase the recognition rate – Words Splitting
Unify words' tense – Words Stemming
Get rid of noisy data – Remove Stop Words
Handle overlapping categories – Top N Method
Words Stemming
Change a verb's tense back to its original form
– Ex. performed -> perform
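A minimal suffix-stripping sketch of the idea. A real system would use a proper stemmer such as Porter's; the suffix list here is our simplification, not the rule set actually used in the study.

```java
public class NaiveStemmer {

    // Strip a few common English suffixes; real systems use e.g. the Porter stemmer.
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("ed") && w.length() > 4) return w.substring(0, w.length() - 2);
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("Performed"));  // perform
    }
}
```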
Words Splitting (1 of 3)
In which cases can we split a word?
– Punctuation
– Blank
– Underscore
Words Splitting (2 of 3)
Examples:
– Ex. Set_Value -> Set Value
– ImageIcon("myIcon.gif")); -> ImageIcon myIcon gif
– actionPerformed(ActionEvent e) -> actionPerformed Action Event e
See any problems?
– In some cases people write several words, or words and numbers, together.
– Ex. JButton, actionListener, Button1, etc.
Words Splitting (3 of 3)
What are the special cases?
– 1. Begins with a number of capital letters combined with one or more words.
Ex. JFrame -> J Frame; JJJJJJJButton -> JJJJJJJ Button; JButtonApple -> J Button Apple
– 2. A lower-case letter or word combined with a word beginning with a capital letter.
Ex. cButton -> c Button; ccccccccButton -> cccccccc Button; setValue -> set Value; addActionListener -> add Action Listener
– 3. Several words, all beginning with a capital letter, combined together.
Ex. MyFrame -> My Frame; SetActionCommand -> Set Action Command
– 4. A word combined with numbers.
Ex. Button1 -> Button 1; 1Button -> 1 Button; Button123 -> Button 123; 123Button -> 123 Button
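The four cases can be handled with zero-width regular-expression boundaries. This splitter is a sketch of the idea, not the exact implementation used in the study.

```java
public class CamelCaseSplitter {

    // Zero-width boundaries: each alternative marks a split point without consuming text.
    private static final String BOUNDARY =
        "(?<=[a-z])(?=[A-Z])"              // setValue -> set Value
      + "|(?<=[A-Z])(?=[A-Z][a-z])"        // JFrame -> J Frame, JJJJJJJButton -> JJJJJJJ Button
      + "|(?<=[A-Za-z])(?=[0-9])"          // Button1 -> Button 1
      + "|(?<=[0-9])(?=[A-Za-z])";         // 1Button -> 1 Button

    static String split(String identifier) {
        return String.join(" ", identifier.split(BOUNDARY));
    }

    public static void main(String[] args) {
        System.out.println(split("JButtonApple"));  // J Button Apple
    }
}
```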
Remove Stopwords (1 Of 2)
What are stop words?
– The most common, short function words, such as the, is, at, which, and, on.
Any special cases?
Remove Stopwords (2 Of 2)
Extra Stop Words
– Programming keywords.
Ex. public, private, class, new, etc.
A word-frequency counter helps identify them.
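A sketch of the filtering step. The stop-word list below simply combines the examples from these slides; the study's full list is not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {

    // English stop words plus extra programming keywords, as on the slides.
    private static final Set<String> STOP = new HashSet<>(Arrays.asList(
        "the", "is", "at", "which", "and", "on",
        "public", "private", "class", "new"));

    static List<String> filter(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !STOP.contains(t.toLowerCase()))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList("public", "class", "my", "Button")));
    }
}
```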
Overlapped Categories
Each category is treated as an independent label by default.
How do we handle realistic problems, where categories overlap?
– Top N Method
Top N Method
(Comparison figures: regular way vs. Top N method.)
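One common way to realize a Top N method is to count a prediction as correct when the true label appears among the classifier's N highest-scoring labels. This interpretation, and the label scores in the example, are our sketch, not output from the study.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopN {

    // The n labels with the highest classifier scores, best first.
    static List<String> topLabels(Map<String, Double> scores, int n) {
        List<String> labels = new ArrayList<>(scores.keySet());
        labels.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return labels.subList(0, Math.min(n, labels.size()));
    }

    // Count a prediction as correct when the true label is among the top n.
    static boolean correctTopN(Map<String, Double> scores, String trueLabel, int n) {
        return topLabels(scores, n).contains(trueLabel);
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("swing", 0.5);
        scores.put("io", 0.3);
        scores.put("threads", 0.2);
        System.out.println(topLabels(scores, 2));  // [swing, io]
    }
}
```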
Some of our test results
Tests are based on 10 different labels and 45 instances in total.
The following charts visualize the results.
Classify data with Original Mallet
Test accuracy mean per trial (Raw MALLET):
Trial:    1    2    3    4    5    6    7    8    9    10
Accuracy: 0.26 0.19 0.28 0.24 0.24 0.28 0.26 0.28 0.32 0.24
Lowest: 19%
Highest: 32%
Average: 25.9%
After Stemming & Words Splitting
Test accuracy mean per trial (Stemming & Words Splitting):
Trial:    1    2    3    4    5    6    7    8    9    10
Accuracy: 0.36 0.40 0.40 0.26 0.44 0.28 0.30 0.36 0.34 0.36
Lowest: 26%
Highest: 44%
Average: 35%
After Removing Stop Words
Test accuracy mean per trial (Stop Words Removed):
Trial:    1    2    3    4    5    6    7    8    9    10
Accuracy: 0.62 0.42 0.44 0.50 0.44 0.40 0.40 0.50 0.42 0.36
Lowest: 36%
Highest: 62%
Average: 45%
After Applying the Top N Method
Test accuracy mean per trial (Top N Method):
Trial:    1    2    3    4    5    6    7    8    9    10
Accuracy: 0.64 0.72 0.58 0.70 0.58 0.64 0.72 0.68 0.54 0.56
Lowest: 54%
Highest: 72%
Average: 63.6%
Any way to improve the accuracy further?
Highlight the key features.
Use only code data as training data.
Only code Data
Delete all text other than code.
What is considered code?
Code includes not only snippets longer than one line, but also class names, such as JButton and JActionListener, and method calls, such as addActionListener(aListener).
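A simple heuristic for spotting such code tokens might look as follows; the regular expressions are our illustration, not the detection rule actually used in the study.

```java
import java.util.regex.Pattern;

public class CodeTokenHeuristic {

    // Either a method call like addActionListener(aListener), or an identifier
    // with an interior capital letter like JButton or actionListener.
    private static final Pattern CODE = Pattern.compile(
        "[A-Za-z_$][A-Za-z0-9_$]*\\(.*\\)"            // method call
      + "|[A-Za-z_$][a-z0-9_$]*[A-Z][A-Za-z0-9_$]*"); // camel-case identifier

    static boolean looksLikeCode(String token) {
        return CODE.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCode("JButton"));  // true
        System.out.println(looksLikeCode("hello"));    // false
    }
}
```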
Test Result of Code only Data
Test accuracy mean per trial (Code Only Version):
Trial:    1    2    3    4    5    6    7    8    9    10
Accuracy: 0.24 0.34 0.40 0.32 0.36 0.30 0.34 0.42 0.40 0.28
Lowest: 24%
Highest: 42%
Average: 34%
(Previous result, Top N Method – Lowest: 54%, Highest: 72%, Average: 63.6%)
Why did this happen?
Code-only data is not enough.
We cannot remove too much data, especially data that actually contributes to feature selection.
Is our data set too small?
Increase the Data Scale
What we did:
- Increased the total instances from 45 to 158
- Increased the number of labels from 10 to 17
- Analyzed the data and improved its quality, since categories may overlap
After Data Scale Increased
Test accuracy mean per trial (Data Scale Increased):
Trial:    1    2    3    4    5    6    7    8    9    10
Accuracy: 0.62 0.56 0.63 0.68 0.52 0.54 0.63 0.64 0.60 0.55
Lowest: 51.88%
Highest: 67.5%
Average: 59.75%
(Previous result, small data set – Lowest: 36%, Highest: 62%, Average: 45%)
After Data Scale Increased with Top N
Test accuracy mean per trial (Data Scale Increased, Top N):
Trial:    1    2    3    4    5    6    7    8    9    10
Accuracy: 0.75 0.69 0.69 0.66 0.79 0.76 0.70 0.74 0.72 0.71
Lowest: 66.25%
Highest: 79.38%
Average: 72.05%
(Previous result, small data set with Top N – Lowest: 54%, Highest: 72%, Average: 63.6%)
Why the accuracy increased?
The Naïve Bayes classifier uses a Gaussian distribution to represent the class-conditional probability of continuous attributes, so we wondered whether the frequency distribution of each word in the articles resembles a normal distribution.
We counted the frequencies of each word in the articles to create a histogram and checked whether the histogram looks like a normal distribution.
Histogram for Word A
Histogram for Word B
Only Code Data again
Test accuracy mean per trial (New Data, Code Only):
Trial:    1     2     3     4     5     6     7     8     9     10
Accuracy: 0.615 0.515 0.569 0.623 0.577 0.554 0.600 0.592 0.646 0.531
Lowest: 51.54%
Highest: 64.62%
Average: 58.23%
(Previous result, full data with Top N – Lowest: 66.25%, Highest: 79.38%, Average: 72.05%)
Data Without Code
Test accuracy mean per trial (New Data, Without Code):
Trial:    1     2     3     4     5     6     7     8     9     10
Accuracy: 0.675 0.719 0.763 0.694 0.706 0.700 0.682 0.744 0.738 0.725
Lowest: 67.5%
Highest: 76.25%
Average: 71.44%
(Previous result, full data with Top N – Lowest: 66.25%, Highest: 79.38%, Average: 72.05%)
Results Analysis
Unlike for human readers, code is not the decisive factor for the classifier.
In our prepared data, code makes up only a small part of each instance.
Compare to Maximum Entropy
Test accuracy mean per trial (MaxEnt):
Trial:    1      2      3     4    5      6      7      8    9     10
Accuracy: 0.6375 0.6625 0.575 0.70 0.6688 0.6375 0.6063 0.65 0.575 0.5938
Lowest: 57.5%
Highest: 70%
Average: 63.06%
(Previous result, Naive Bayes – Lowest: 51.88%, Highest: 67.5%, Average: 59.75%)
Maximum Entropy with Top N
Test accuracy mean per trial (MaxEnt with Top N Method):
Trial:    1      2       3      4     5      6       7       8       9    10
Accuracy: 0.8125 0.80625 0.8375 0.775 0.7875 0.76875 0.69375 0.75625 0.80 0.825
Lowest: 69.38%
Highest: 83.76%
Average: 78.63%
(Previous result, Naive Bayes with Top N – Lowest: 66.25%, Highest: 79.38%, Average: 72.05%)
Generative and discriminative models
Generative (joint) model -> P(c,d)
– Places a probability over both the observed data and the hidden structure.
– Ex. Naive Bayes
Discriminative (conditional) model -> P(c|d)
– Takes the data as given and places a probability over the hidden structure given the data.
– Ex. Maximum Entropy, SVMs
Let's look at a diagram.
Generative and discriminative models
Bayes net diagrams draw circles for random variables and lines for direct dependencies.
Some variables are observed; some are hidden.
Each node (conditional model) is a little classifier based on its incoming arcs.
(Diagram: joint model – class c points to d1, d2, d3; conditional model – d1, d2, d3 point to c.)
SVM-light
SVM-light is an implementation of Support Vector Machines (SVMs) in C.
Download at http://svmlight.joachims.org/
Solves classification and regression problems.
Solves ranking problems.
Efficiently computes Leave-One-Out estimates of the error rate, the precision, and the recall.
Supports standard kernel functions and lets you define your own.
Format of input file
The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example, in the following format:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
where:
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer>
<value> .=. <float>
<info> .=. <string>
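A sketch of producing one such line from Java. Feature ids must appear in increasing order, hence the TreeMap; the target encoding below covers only the +1/-1/0 classification case, not float targets.

```java
import java.util.Map;
import java.util.TreeMap;

public class SvmLightWriter {

    // Build one SVM-light example line: <target> <feature>:<value> ... # <info>
    static String exampleLine(int target, Map<Integer, Double> features, String info) {
        StringBuilder sb = new StringBuilder();
        sb.append(target > 0 ? "+1" : String.valueOf(target));
        // TreeMap iterates keys in ascending order, as SVM-light requires.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(features).entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        if (info != null) sb.append(" # ").append(info);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Double> features = new TreeMap<>();
        features.put(3, 0.5);
        features.put(1, 1.0);
        System.out.println(exampleLine(1, features, "doc1"));  // +1 1:1.0 3:0.5 # doc1
    }
}
```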
How it Works – Linear Mapping
picture from http://www1.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
Polynomial mapping
picture from http://www1.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
Future Work
On-going experiments and result analysis based on SVM.
Continually increase the data scale.
Further increase the identification rate with SVM.
Compare the accuracy of generative (Naive Bayes) and discriminative (SVM) models based on our results, since we think the main factor determining accuracy is not the tool itself but choosing the tool that best matches the data model underlying a given text categorization problem.
Reference
McCallum, Andrew Kachites. "MALLET: A Machine Learning for Language Toolkit." http://mallet.cs.umass.edu. 2002.
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining, 2005
Lecture 7 of the Natural Language Processing course at Stanford. Instructor: Christopher D. Manning.
K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999.
T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola (eds.), MIT Press, 1999.
T. Joachims. Text categorization with support vector machines: Learning with many relevant features. Universität Dortmund, LS VIII, 1997.