What is text classification?
• We have a set of documents
• We have a set of categories
• We want an algorithm that, given an input document, outputs the category of the document
• Examples:
• emails -> [spam, not spam]
• emails -> [primary, social, promotion]
• tweets -> [angry, sad, happy, neutral]
Keyword search
• "Royal Wedding" - probably talks about weddings
• "Royal Cheeseburger" - probably talks about food
• "The Red Wedding" - actually about the "Game of Thrones" TV show, not a wedding
• Soon it becomes hard to figure out all the keyword combinations
• Is it possible to generate the rules automatically?
The magic of statistics
• P(A) - the probability that event A will happen
• Example: P(heads) = 1/2 for a fair coin; P(1) = 1/6 for a fair die
• P(A|B) - the probability that A happens, given that B has already happened
• Example: if 15% of all males have long hair and 75% of all females have long hair, then:
• P(L|W) = 0.75
• P(L|M) = 0.15
Example
• We know that Bob talked to a person with long hair on the train. What is the probability that the person was female?
• P(M) = 0.5
• P(W) = 0.5
• P(L|M) = 0.15
• P(L|W) = 0.75
• P(W|L) = ?
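The answer follows from Bayes' rule combined with the law of total probability. A minimal sketch using the numbers from the slide:

```python
# Bayes' rule: P(W|L) = P(L|W) * P(W) / P(L)
# where P(L) = P(L|W) * P(W) + P(L|M) * P(M)  (law of total probability)

p_w, p_m = 0.5, 0.5      # prior probabilities: woman / man
p_l_given_w = 0.75       # P(long hair | woman)
p_l_given_m = 0.15       # P(long hair | man)

p_l = p_l_given_w * p_w + p_l_given_m * p_m   # P(long hair) = 0.45
p_w_given_l = p_l_given_w * p_w / p_l

print(round(p_w_given_l, 4))  # → 0.8333
```

So even though men and women are equally likely a priori, observing long hair raises the probability of "woman" to about 83%.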
How can we use this to classify spam?
• P(S|d) - the probability that document d is spam
• P(H|d) - the probability that document d is ham
• P(S) - the prior probability of spam
• P(H) - the prior probability of ham
• P(d) - the probability of seeing document d
• Bayes' rule connects them: P(S|d) = P(d|S) * P(S) / P(d)
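Since P(d) appears in the denominator of both P(S|d) and P(H|d), it cancels out when comparing the two, and we can classify by comparing the numerators alone. A minimal sketch (the likelihood values are hypothetical):

```python
# Classify by comparing the two Bayes numerators; P(d) is the same in
# both posteriors, so it can be ignored:
#   P(S|d) ∝ P(d|S) * P(S)   and   P(H|d) ∝ P(d|H) * P(H)

def classify(p_d_given_s, p_d_given_h, p_s=0.5, p_h=0.5):
    """Return 'spam' or 'ham' given the class-conditional likelihoods."""
    return "spam" if p_d_given_s * p_s > p_d_given_h * p_h else "ham"

# Hypothetical likelihoods for an incoming document:
print(classify(p_d_given_s=0.002, p_d_given_h=0.0001))  # → spam
```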
But how do we calculate this?
• P(d|S) - this number is hard to calculate directly
• d is a sequence of words, and the order matters!
• Let's simplify the task
• Let's convert d into a set of unordered words - the bag-of-words model
• With the "naive" independence assumption, P(d|S) becomes just the product of the per-word probabilities P(w|S)
Bag-of-words model
• John runs faster than Mary
• Mary is taller than John
• John is faster than Mary is taller

Vocabulary: ["John", "runs", "faster", "than", "Mary", "is", "taller"]
Bags of words:
[1, 1, 1, 1, 1, 0, 0]
[1, 0, 0, 1, 1, 1, 1]
[1, 0, 1, 1, 1, 2, 1]
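The slide's vectors can be reproduced in a few lines of Python: each position counts how often the corresponding vocabulary word occurs in the sentence.

```python
from collections import Counter

sentences = [
    "John runs faster than Mary",
    "Mary is taller than John",
    "John is faster than Mary is taller",
]

# Fixed vocabulary, in the order used on the slide
vocabulary = ["John", "runs", "faster", "than", "Mary", "is", "taller"]

def bag_of_words(sentence, vocab):
    """Count how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

for s in sentences:
    print(bag_of_words(s, vocabulary))
# → [1, 1, 1, 1, 1, 0, 0]
#   [1, 0, 0, 1, 1, 1, 1]
#   [1, 0, 1, 1, 1, 2, 1]
```

Note that the third vector has a 2 in the "is" position: the bag keeps counts but throws away word order.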
Building a model
• Collect a set of already classified documents
• The more, the better
• Calculate the frequency of each word in each class
• On a new input document, use the built "knowledge" to classify it
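The steps above can be sketched with scikit-learn (one of the libraries listed in the resources at the end); the tiny training set here is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: a set of already classified documents (hypothetical examples)
train_docs = [
    "win a free prize now",
    "cheap pills click here",
    "meeting agenda for tomorrow",
    "lunch with the team",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Step 2: word frequencies per document (bag-of-words vectors)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Step 3: Naive Bayes learns the per-class word frequencies
model = MultinomialNB()
model.fit(X_train, train_labels)

# Step 4: classify a new document with the built "knowledge"
X_new = vectorizer.transform(["free prize click now"])
print(model.predict(X_new))  # → ['spam']
```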
Improvement 1
• Precision - what percentage of the documents classified as spam is really spam
• Recall - what percentage of all real spam we catch
• If everything is classified as spam, the recall is 100%, but the precision is very low
• If everything is classified as ham, the precision is (vacuously) 100%, but the recall is 0%
• We can trade off precision for recall by changing the spam threshold
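A minimal sketch of the two metrics (the labels below are hypothetical):

```python
def precision_recall(predicted, actual, positive="spam"):
    """Precision and recall for one positive class."""
    tp = sum(p == positive and a == positive for p, a in zip(predicted, actual))
    fp = sum(p == positive and a != positive for p, a in zip(predicted, actual))
    fn = sum(p != positive and a == positive for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 1.0  # vacuously perfect
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

actual    = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "spam"]
print(precision_recall(predicted, actual))  # → (0.666..., 0.666...)
```

Marking everything as spam drives recall to 1.0 while precision collapses toward the base rate of spam, which is the trade-off the slide describes.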
Improvement 2
• Improve the text preprocessing
• Bag-of-words improvements:
• normalize on the length of the documents
• TF-IDF - normalize on how frequent each word is across all documents
• remove very rare or very frequent words
How to train a model?
• Finding data - external datasets or manual classification
• Garbage in, garbage out
• Divide the data into training and test sets
• Cross-validation - training/testing many times with different splits of the data
• The goal is to avoid overfitting the model
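Both ideas can be sketched with scikit-learn (the six-document dataset is synthetic):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

docs = ["win free money", "cheap pills now", "free prize inside",
        "team lunch today", "project meeting notes", "weekly status report"]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

X = CountVectorizer().fit_transform(docs)

# Hold out part of the data so the model is scored on unseen documents
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=0)

model = MultinomialNB().fit(X_train, y_train)
print(model.score(X_test, y_test))

# Cross-validation: repeat the split several times and average the scores
scores = cross_val_score(MultinomialNB(), X, labels, cv=3)
print(scores.mean())
```

Scoring on held-out data (rather than the training set) is what exposes overfitting: a model that merely memorized the training documents will do poorly on the test split.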
Resources
• http://scipy.org/ - Python library for scientific and numerical computing
• http://scikit-learn.org/ - Python library with many machine learning algorithms
• http://radimrehurek.com/gensim/ - Python library for extracting topics from text
• http://www.cs.waikato.ac.nz/ml/weka/ - Java library and UI for testing machine learning models
• https://mahout.apache.org/ - Apache Java project for machine learning
• http://informationretrieval.org - a great book on information retrieval