What is text classification?
• We have a set of documents
• We have a set of categories
• We want an algorithm that, given an input document, outputs the category of the document
• Examples:
• emails -> [spam, not spam]
• emails -> [primary, social, promotion]
• tweets -> [angry, sad, happy, neutral]
Keyword search
• "Royal Wedding" - probably talks about weddings
• "Royal Cheeseburger" - probably talks about food
• "The Red Wedding" - actually about the "Game of Thrones" TV show, not a wedding
• Soon it becomes hard to figure out all the keyword combinations
• Is it possible to generate the rules automatically?
The magic of statistics
• P(A) - the probability that event A will happen
• Example: P(heads) = 1/2 for a fair coin; P(1) = 1/6 for a fair die
• P(A|B) - the probability that A happens, given that B has already happened
• Example: if 15% of all males have long hair and 75% of all females have long hair, then:
• P(L|W) = 0.75
• P(L|M) = 0.15
Example
• We know that Bob talked to a person with long hair on the train. What is the probability that the person was female?
• P(M) = 0.5
• P(W) = 0.5
• P(L|M) = 0.15
• P(L|W) = 0.75
• P(W|L) = ?
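The answer follows from Bayes' rule combined with the law of total probability. A minimal sketch using the numbers from the slide:

```python
# Bayes' rule: P(W|L) = P(L|W) * P(W) / P(L)
# where P(L) = P(L|W) * P(W) + P(L|M) * P(M)  (law of total probability)

p_w, p_m = 0.5, 0.5      # prior probabilities: woman / man
p_l_given_w = 0.75       # P(long hair | woman)
p_l_given_m = 0.15       # P(long hair | man)

p_l = p_l_given_w * p_w + p_l_given_m * p_m   # P(long hair) = 0.45
p_w_given_l = p_l_given_w * p_w / p_l

print(round(p_w_given_l, 4))  # → 0.8333
```

So even though men and women are equally likely a priori, observing long hair raises the probability of "woman" to about 83%.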
How can we use this to classify spam?
• P(S|d) - the probability that document d is spam
• P(H|d) - the probability that document d is ham
• P(S) - the prior probability of spam
• P(H) - the prior probability of ham
• P(d) - the probability of seeing document d
• Bayes' rule connects them: P(S|d) = P(d|S) * P(S) / P(d)
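Since P(d) appears in the denominator of both P(S|d) and P(H|d), it cancels out when comparing the two, and we can classify by comparing the numerators alone. A minimal sketch (the likelihood values are hypothetical):

```python
# Classify by comparing the two Bayes numerators; P(d) is the same in
# both posteriors, so it can be ignored:
#   P(S|d) ∝ P(d|S) * P(S)   and   P(H|d) ∝ P(d|H) * P(H)

def classify(p_d_given_s, p_d_given_h, p_s=0.5, p_h=0.5):
    """Return 'spam' or 'ham' given the class-conditional likelihoods."""
    return "spam" if p_d_given_s * p_s > p_d_given_h * p_h else "ham"

# Hypothetical likelihoods for an incoming document:
print(classify(p_d_given_s=0.002, p_d_given_h=0.0001))  # → spam
```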
But how do we calculate this?
• P(d|S) - this number is hard to calculate directly
• d is a sequence of words, and the order matters!
• Let's simplify the task
• Let's convert d into a set of unordered words - the bag-of-words model
• With the "naive" independence assumption, P(d|S) becomes just the product of the per-word probabilities P(w|S)
Bag-of-words model
• John runs faster than Mary
• Mary is taller than John
• John is faster than Mary is taller

Vocabulary: ["John", "runs", "faster", "than", "Mary", "is", "taller"]
Bags of words:
[1, 1, 1, 1, 1, 0, 0]
[1, 0, 0, 1, 1, 1, 1]
[1, 0, 1, 1, 1, 2, 1]
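The slide's vectors can be reproduced in a few lines of Python: each position counts how often the corresponding vocabulary word occurs in the sentence.

```python
from collections import Counter

sentences = [
    "John runs faster than Mary",
    "Mary is taller than John",
    "John is faster than Mary is taller",
]

# Fixed vocabulary, in the order used on the slide
vocabulary = ["John", "runs", "faster", "than", "Mary", "is", "taller"]

def bag_of_words(sentence, vocab):
    """Count how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

for s in sentences:
    print(bag_of_words(s, vocabulary))
# → [1, 1, 1, 1, 1, 0, 0]
#   [1, 0, 0, 1, 1, 1, 1]
#   [1, 0, 1, 1, 1, 2, 1]
```

Note that the third vector has a 2 in the "is" position: the bag keeps counts but throws away word order.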
Building a model
• Collect a set of already classified documents
• The more, the better
• Calculate the frequency of each word in each class
• On a new input document, use the built "knowledge" to classify it
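The steps above can be sketched with scikit-learn (one of the libraries listed in the resources at the end); the tiny training set here is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: a set of already classified documents (hypothetical examples)
train_docs = [
    "win a free prize now",
    "cheap pills click here",
    "meeting agenda for tomorrow",
    "lunch with the team",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Step 2: word frequencies per document (bag-of-words vectors)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Step 3: Naive Bayes learns the per-class word frequencies
model = MultinomialNB()
model.fit(X_train, train_labels)

# Step 4: classify a new document with the built "knowledge"
X_new = vectorizer.transform(["free prize click now"])
print(model.predict(X_new))  # → ['spam']
```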
Improvement 1
• Precision - what percentage of the documents classified as spam is really spam
• Recall - what percentage of all real spam we catch
• If everything is classified as spam, the recall is 100%, but the precision is very low
• If everything is classified as ham, the precision is (vacuously) 100%, but the recall is 0%
• We can trade off precision for recall by changing the spam threshold
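A minimal sketch of the two metrics (the labels below are hypothetical):

```python
def precision_recall(predicted, actual, positive="spam"):
    """Precision and recall for one positive class."""
    tp = sum(p == positive and a == positive for p, a in zip(predicted, actual))
    fp = sum(p == positive and a != positive for p, a in zip(predicted, actual))
    fn = sum(p != positive and a == positive for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 1.0  # vacuously perfect
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

actual    = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "spam"]
print(precision_recall(predicted, actual))  # → (0.666..., 0.666...)
```

Marking everything as spam drives recall to 1.0 while precision collapses toward the base rate of spam, which is the trade-off the slide describes.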
Improvement 2
• Improve the text preprocessing
• Bag-of-words improvements:
• normalize on the length of the documents
• TF-IDF - normalize on how frequent each word is across all documents
• remove very rare or very frequent words
How to train a model?
• Finding data - external datasets or manual classification
• Garbage in, garbage out
• Divide the data into training and test sets
• Cross-validation - training/testing many times with different splits of the data
• The goal is to avoid overfitting the model
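Both ideas can be sketched with scikit-learn (the six-document dataset is synthetic):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

docs = ["win free money", "cheap pills now", "free prize inside",
        "team lunch today", "project meeting notes", "weekly status report"]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

X = CountVectorizer().fit_transform(docs)

# Hold out part of the data so the model is scored on unseen documents
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=0)

model = MultinomialNB().fit(X_train, y_train)
print(model.score(X_test, y_test))

# Cross-validation: repeat the split several times and average the scores
scores = cross_val_score(MultinomialNB(), X, labels, cv=3)
print(scores.mean())
```

Scoring on held-out data (rather than the training set) is what exposes overfitting: a model that merely memorized the training documents will do poorly on the test split.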
Resources
• http://scipy.org/ - Python library for scientific and numerical computing
• http://scikit-learn.org/ - Python library with many machine learning algorithms
• http://radimrehurek.com/gensim/ - Python library for extracting topics from text
• http://www.cs.waikato.ac.nz/ml/weka/ - Java library and UI for testing machine learning models
• https://mahout.apache.org/ - Apache Java project for machine learning
• http://informationretrieval.org - a great book on information retrieval