M. Sulaiman Khan
Dept. of Computer Science
University of Liverpool
2009
COMP527: Data Mining
Text Mining: Text-as-Data March 25, 2009 Slide 1
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
Word Frequencies
Relevance Scoring
Latent Semantic Indexing
Markov Models
Today's Topics
Unlike other 'regular' attributes, each word can appear multiple times in a single document. A word occurring frequently within a single document is a good indication that it is important to that text.

It is also interesting to see the overall distribution of words within the full data set, and within the vocabulary/attribute space/vector space.

The distribution of parts of speech is interesting, and even individual letter frequencies are potentially interesting between different texts or sets of texts.
Word Frequencies
The distribution of letters in a text can potentially show, with very low dimensionality, some minimal features of the text. E.g.:

Alice W'land:   E T A O I H N S R D L U W G C Y M F P B
Holy Bible:     E T A H O N I S R D L F M U W C Y G B P
Reuters:        E T A I N S O R L D C H U M P F G B Y W
Tale 2 Cities:  E T A O N I H S R D L U M W C F G Y P B

E.g. 'C' and 'S' are a lot more common in the Reuters news articles, and 'H' is very uncommon there; 'F' is more common in the Bible.
Letter Frequencies
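A ranking like the ones above takes only a few lines to compute; a minimal sketch (the texts themselves would need to be loaded separately):

```python
from collections import Counter

def letter_ranking(text):
    """Rank the letters in a text by frequency, most common first."""
    counts = Counter(c for c in text.upper() if c.isalpha())
    return [letter for letter, _ in counts.most_common()]

# A real ranking needs a full text (e.g. a Project Gutenberg file);
# this short string only illustrates the output shape.
print(letter_ranking("The quick brown fox jumps over the lazy dog"))
```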
The distribution of letters in a text can potentially also help with language determination. 'E' is a lot more common in French (0.17) than in Spanish (0.133) or English (0.125). 'V' and 'U' are also more common in French, and the top three letters in Spanish are all vowels: 'E', then 'A', then 'O'.

It is quite possible that the distribution of letters in the texts to be classified is also interesting, if they come from different styles, languages, or subjects. Don't rule out the very easy :)
Letter Frequencies
The distribution of words is also interesting. For vector construction, it would be nice to know approximately how many unique words there are likely to be.

Heaps's Law: v = K * n^b

Where:
  n = the number of words in the text
  K = a constant, typically between 10 and 100
  b = a constant between 0 and 1, normally between 0.4 and 0.6
  v = the size of the vocabulary

While this seems very fuzzy, it often works in practice: it predicts a particular growth curve for the vocabulary, which seems to hold up in experiments.
Word Frequencies
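Heaps's Law is easy to play with numerically; a small sketch, using illustrative mid-range values for K and b rather than fitted constants:

```python
def heaps_vocab(n, K=30, b=0.5):
    """Predicted vocabulary size under Heaps's Law: v = K * n^b.

    K and b are illustrative mid-range values here, not fitted constants.
    """
    return K * n ** b

# The vocabulary grows sublinearly with the length of the text:
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, round(heaps_vocab(n)))
```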
A second 'law': Zipf's Law.
Idea: we use a few words very often and most words very rarely, because it is more effort to use a rare word.

Zipf's Law: the product of the frequency of a word and its rank is [reasonably] constant.

Also fuzzy, but also empirically demonstrable, and it holds up over different languages.

This is a 'Power Law Distribution': few events occur often, and many events occur infrequently.
Word Frequencies
Zipf's Law Example:
Word Frequencies
Word           Rank    Freq.    Rank*Freq
the               1    120021      120021
of                2     72225      144450
and               4     53462      213848
for               8     25578      204624
is               16     16739      267824
company          32      9340      298880
Co.              64      4005      256320
quarter         100      2677      267700
unit            200      1489      297800
investors       400       828      331200
head            800       421      336800
warrant        1600       184      294400
Tehran         3200        73      233600
guarantee      6400        25      160000
Pittston      10000        11      110000
thinly        20000         3       60000
Morgenthaler  40000         1       40000
tabulating    47075         1       47075
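The rank * frequency products in the table can be reproduced for any corpus by ranking a word-count table; a minimal sketch:

```python
from collections import Counter

def zipf_products(text):
    """Rank words by frequency and return (word, rank, freq, rank*freq) tuples.

    Under Zipf's Law the rank*freq products should be roughly constant.
    """
    counts = Counter(text.lower().split())
    ranked = counts.most_common()
    return [(w, r, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

# On a real corpus the products cluster around a constant;
# this toy string just shows the shape of the output.
for row in zipf_products("the cat and the dog and the bird"):
    print(row)
```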
The frequencies of words can be used in relation to each class, in comparison to the entire document set. E.g. words that are more frequent in a class than in the document set as a whole are discriminating for that class.

We can use this idea to generate weights for terms against each class, and then merge the weights for a general prediction of class. The same approach is also commonly used by search engines to predict relevance to a user's query.

There are several different ways to create these weights...
Word Frequencies
Term Frequency, Inverse Document Frequency:

w(i,j) = tf(i,j) * log(N / df(i))

The weight of term i in document j is the frequency of term i in document j, times the log of the total number of documents N divided by the number of documents that contain term i.

I.e. the more often the term occurs in the document, and the rarer the term is across the collection, the more likely that document is to be relevant.
TF-IDF
w(i,j) = tf(i,j) * log(N / df(i))

In 1000 documents, 20 contain the word 'lego'. It appears between 1 and 6 times in those 20 documents.

For the document with frequency 6 (using the natural log):

w('lego', doc) = 6 * log(1000 / 20)
               = 6 * log(50)
               = 23.47
TF-IDF Example
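The example can be checked directly; a small sketch using the natural log, matching the arithmetic above:

```python
import math

def tfidf(tf, N, df):
    """TF-IDF weight: tf * log(N / df), natural log as in the worked example."""
    return tf * math.log(N / df)

# The 'lego' example: 1000 docs, 20 containing the term, term frequency 6:
print(round(tfidf(6, 1000, 20), 2))  # 23.47
```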
Then for multiple terms, merge the weightings between each TF-IDF value according to some function (sum, mean, etc). This sum can be generated for each class.

Pros:
  Easy to implement
  Easy to understand

Cons:
  Document size not taken into account
  Low document frequency overpowers term frequency
TF-IDF
Jamie Callan of CMU proposed this algorithm:

I = log((N + 0.5) / tf(i)) / log(N + 1.0)
T = df(i) / (df(i) + 50 + 150 * size(j) / avgSize(N))
w(i,j) = 0.4 + (0.6 * T * I)

It takes into account the document size and the average size of all documents. Otherwise a document with 6 matches in 100 words is treated the same as a document with 6 matches in 100,000 words.

A vast improvement over simple TF-IDF, while still remaining easy to implement and to understand.
CORI
I = log((N + 0.5) / tf(i)) / log(N + 1.0)
T = df(i) / (df(i) + 50 + 150 * size(j) / avgSize(N))
w(i,j) = 0.4 + (0.6 * T * I)

Given the same 20 matched docs, frequency 6 in the document, 1000 documents, 350 words in the document, and an average of 500 words per doc across the 1000:

I = log(1000.5 / 6) / log(1001) = 0.74
T = 20 / (20 + 50 + (150 * 350 / 500)) = 0.11
w('lego', doc) = 0.4 + (0.6 * T * I) = 0.449

For more explanation see his papers:
http://www.cs.cmu.edu/~callan/Papers/
CORI Example
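The worked example can be reproduced in code. A sketch following the slide's formulation exactly (note that with full precision the weight comes out at about 0.451; the slide's 0.449 results from rounding I and T to 0.74 and 0.11 before multiplying):

```python
import math

def cori(tf, df, N, doc_size, avg_size):
    """CORI weight as given on the slide, using the natural log."""
    I = math.log((N + 0.5) / tf) / math.log(N + 1.0)
    T = df / (df + 50 + 150 * doc_size / avg_size)
    return 0.4 + 0.6 * T * I

# The 'lego' example: tf 6, df 20, 1000 docs, 350 words vs a 500-word average
print(round(cori(6, 20, 1000, 350, 500), 3))  # 0.451 at full precision
```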
Finds the relationships between words in the document/term matrix: the clusters of words that frequently co-occur in documents, and hence the 'latent semantic structure' of the document set.

It doesn't depend on individual words, but instead on the clusters of words. E.g. it might use 'car' + 'automobile' + 'truck' + 'vehicle' instead of just 'car'.

This addresses twin problems:
  Synonymy: different words with the same meaning (car, auto)
  Polysemy: the same spelling with different meanings (to ram, a ram)

(We'll come back to word sense disambiguation too.)
Latent Semantic Indexing
Based on Singular Value Decomposition of the matrix (which is something best left to maths toolkits).

Basically: it transforms the dimensions of the vectors such that documents with similar sets of terms are closer together. These groupings can then be used as clusters of documents.

You end up with fractions of words being present in documents (e.g. 'automobile' is somehow present in a document containing 'car').

These transformed vectors are then used for analysis, rather than straight frequency vectors. As each dimension covers multiple words, you end up with smaller vectors too.
Latent Semantic Indexing
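The effect can be seen on a toy document/term matrix; a sketch assuming NumPy, with an invented 4x3 matrix for illustration:

```python
import numpy as np

# Toy document/term matrix: rows = documents, columns = terms
# (terms: car, automobile, lego). Docs 0 and 1 share no literal terms,
# but doc 2 makes 'car' and 'automobile' co-occur.
A = np.array([
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Truncated SVD: keep the k strongest singular directions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = U[:, :k] * s[:k]  # documents in the reduced 'latent' space

# Documents 0 and 1 end up in the same place in the reduced space,
# even though they share no terms, because of doc 2's co-occurrences.
print(docs_k.round(2))
```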
Patterns of letters in language don't happen at random: compare this sentence with 'kqonw ldpwuf jghfmb edkfiu lpqwxz', which is obviously not language.

Markov models try to learn the probabilities of one item following another; in this case, letters.

E.g. take all of the words we have and build a graph of which letters follow which other letters, including the 'start' and 'end' of words. Each arc between nodes then has a weight for the probability.

Using a letter-based Markov model we might end up with generated words like: annofed, mamigo, quarn, etc.
Markov Models
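A letter-level model like this can be built directly from the arcs described above; a minimal sketch with an invented five-word training set:

```python
import random
from collections import defaultdict

def train_letter_model(words):
    """Build the arc lists: which letter follows which, with '^' start and '$' end."""
    follows = defaultdict(list)
    for w in words:
        chain = '^' + w + '$'
        for a, b in zip(chain, chain[1:]):
            follows[a].append(b)
    return follows

def generate_word(follows):
    """Random walk from start to end, sampling arcs in proportion to frequency."""
    letter, out = '^', []
    while True:
        letter = random.choice(follows[letter])
        if letter == '$':
            return ''.join(out)
        out.append(letter)

model = train_letter_model(['mango', 'amigo', 'manner', 'quart', 'annoyed'])
print(generate_word(model))  # a plausible-looking generated 'word'
```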
Sequences of words are also not random (in English). We can use a much larger Markov model to show the probabilities of a word following another word, or the start/end of a sentence.

Equally, words clump together in short phrases, and we could use multi-word tokens as our graph nodes.

Here we could see, for example, how likely 'states' is to follow 'united'.
Markov Models
Sequences of parts of speech for words are also not random (in English). But here we don't just care about the probabilities; we want to use the observations as a way to determine the actual part of speech of each word.

This is a Hidden Markov Model (HMM), as it uses the observable patterns to predict some variable which is hidden.

It uses a trellis of states against the observation sequence, e.g.:
Hidden Markov Models
(Trellis diagram: states 1 and 2 unrolled against observations O1, O2, O3, O4.)
The calculations towards the probabilities are stored on the trellis arcs. Various clever algorithms make this computationally feasible:

Compute the probability of a particular output sequence:
  the Forward-Backward algorithm.

Find the most likely sequence of hidden states to generate an output sequence:
  the Viterbi algorithm.

Given an output sequence, find the most likely set of transition and output probabilities (i.e. train the parameters of the model given a training set):
  the Baum-Welch algorithm.
Hidden Markov Models
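As an illustration of the second of these, a minimal Viterbi sketch with hypothetical two-tag transition and emission probabilities (invented for this example, not taken from the slides):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observation sequence.

    All probabilities are plain dicts; a minimal sketch, not an
    efficient log-space implementation.
    """
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            best[t][s], back[t][s] = prob, prev
    # Trace back from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]

# Hypothetical two-tag example: 'NOUN' vs 'VERB' emitting two word forms
states = ['NOUN', 'VERB']
start_p = {'NOUN': 0.7, 'VERB': 0.3}
trans_p = {'NOUN': {'NOUN': 0.3, 'VERB': 0.7},
           'VERB': {'NOUN': 0.8, 'VERB': 0.2}}
emit_p = {'NOUN': {'ram': 0.6, 'rams': 0.4},
          'VERB': {'ram': 0.5, 'rams': 0.5}}
print(viterbi(['rams', 'ram'], states, start_p, trans_p, emit_p))  # ['NOUN', 'VERB']
```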
Konchady
Chou & Juang, Pattern Recognition in Speech and Language Processing, Chapters 8, 9
Weiss
Berry, Survey, Chapter 4
Han, 8.4.2
Dunham, 9.2
Further Reading