1

Introduction to Text Mining

ChengXiang (“Cheng”) Zhai

Department of Computer Science

Graduate School of Library & Information Science

Statistics, and Institute for Genomic Biology

University of Illinois, Urbana-Champaign

http://www.cs.uiuc.edu/

Outline

- Overview of Text Mining

- IR-Style Text Mining Techniques

- NLP-Style Text Mining Techniques

- ML-Style Text Mining Techniques

2


3

Two Definitions of “Mining”

• Goal-oriented (effectiveness driven, NLP, AI)

– Any process that generates useful results that are non-obvious is called “mining”.

– Keywords: “useful” + “non-obvious”

– Data isn’t necessarily massive

• Method-oriented (efficiency driven, DB, IR)

– Any process that involves extracting information from massive data is called “mining”

– Keywords: “massive” + “pattern”

– Patterns aren’t necessarily useful


4

What is Text Mining?

• Data Mining View: Explore patterns in textual data

– Find latent topics

– Find topical trends

– Find outliers and other hidden patterns

• Natural Language Processing View: Make inferences based on partial understanding of natural language text

– Information extraction

– Question answering


5

Applications of Text Mining

• Direct applications

– Discovery-driven (Bioinformatics, Business Intelligence, etc.): We have specific questions; how can we exploit data mining to answer them?

– Data-driven (WWW, literature, email, customer reviews, etc.): We have a lot of data; what can we do with it?

• Indirect applications

– Assist information access (e.g., discover latent topics to better summarize search results)

– Assist information organization (e.g., discover hidden structures)


6

Text Mining Methods

• Data Mining Style: View text as high dimensional data

– Frequent pattern finding

– Association analysis

– Outlier detection

• Information Retrieval Style: Fine granularity topical analysis

– Topic extraction

– Exploit term weighting and text similarity measures

– Question answering

• Natural Language Processing Style: Information Extraction

– Entity extraction

– Relation extraction

– Sentiment analysis

• Machine Learning Style: Unsupervised or semi-supervised learning

– Generative models

– Dimension reduction

– Classification & prediction


7

IR-Style Techniques for Text Mining


8

Some “Basic” IR Techniques

• Stemming

• Stop words

• Weighting of terms (e.g., TF-IDF)

• Vector/Unigram representation of text

• Text similarity (e.g., cosine, KL-div)

• Relevance/pseudo feedback (e.g., Rocchio)
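A minimal sketch of three of the techniques listed above (stop-word removal, TF-IDF weighting, cosine similarity), assuming raw term frequency times log(N/df) as the weighting. The stop list, function names, and sample documents are illustrative assumptions, not part of the original slides; stemming is left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real systems use longer lists).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document.

    Weight = raw term frequency * log(N / document frequency).
    """
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "The cat sat on the mat",
    "A cat and a dog are in the house",
    "Stock markets fell on interest rate fears",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # positive: both documents mention "cat"
print(cosine(vecs[0], vecs[2]))  # zero: no shared non-stop terms
```

With this weighting, a term that occurs in every document gets weight zero, which is why the zero-length guard in `cosine` is needed.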


9

Generality of Basic Techniques

[Figure: processing pipeline. Raw text → stemming & stop words → tokenized text → term weighting → an m × n weight matrix with documents d1 … dm as rows, terms t1 … tn as columns, and entries w11 … wmn. Term similarity, doc similarity, vector centroids, and sentence selection computed on this matrix support CLUSTERING, CATEGORIZATION, META-DATA/ANNOTATION, and SUMMARIZATION.]


10

Sample Applications

• Information Filtering

• Text Categorization

• Document/Term Clustering

• Text Summarization


11

Information Filtering

• Stable & long term interest, dynamic info source

• System must make a delivery decision immediately as a document “arrives”

• Two Methods: Content-based vs. Collaborative
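A minimal sketch of the content-based method, assuming the user's stable interest is represented as a term-frequency profile and that a document is delivered when its cosine similarity to the profile reaches a fixed threshold. The class name, threshold value, and sample stream are hypothetical; collaborative filtering is not shown.

```python
import math
import re
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class ContentFilter:
    """Hypothetical content-based filter with a fixed delivery threshold."""

    def __init__(self, interest_text, threshold=0.2):
        self.profile = tf_vector(interest_text)  # stable, long-term interest
        self.threshold = threshold               # assumed value, tuned in practice

    def deliver(self, document):
        """Decide immediately, using only the profile and this one document."""
        return cosine(self.profile, tf_vector(document)) >= self.threshold

news_filter = ContentFilter("text mining information retrieval search")
stream = [
    "New advances in text mining and information retrieval",
    "Local team wins the championship game",
]
decisions = [news_filter.deliver(doc) for doc in stream]
print(decisions)  # [True, False]
```

Each decision looks at one arriving document at a time, which matches the constraint that the system cannot wait to rank the whole stream.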

[Figure: a Filtering System matches each arriving document against "my interest".]


12

Examples of Information Filtering

• News filtering

• Email filtering

• Recommender systems

• Literature alert

• And many others


13

Sample Applications


Text Categorization




14

Text Categorization

• Pre-given categories and labeled document examples (Categories may form hierarchy)

• Classify new documents

• A standard supervised learning problem
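One simple way to treat this as supervised learning is a nearest-centroid (Rocchio-style) classifier: build one vector per category from the labeled examples, then assign a new document to the most similar category. The sketch below is an illustration under that assumption; the training sentences and function names are made up, and stop words are not removed.

```python
import math
import re
from collections import Counter, defaultdict

def tf_vector(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_centroids(labeled_docs):
    """Sum the term vectors of each category's training documents.

    The sum points in the same direction as the mean, and cosine
    similarity ignores vector length, so no averaging is needed.
    """
    centroids = defaultdict(Counter)
    for text, label in labeled_docs:
        centroids[label].update(tf_vector(text))
    return centroids

def classify(text, centroids):
    """Return the category whose centroid is most similar to the document."""
    vec = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

training = [
    ("The team won the football match", "Sports"),
    ("The striker scored two goals in the match", "Sports"),
    ("Shares rose as the company reported strong profit", "Business"),
    ("The bank raised interest rates and markets fell", "Business"),
]
centroids = train_centroids(training)
print(classify("The goalkeeper saved the match for the team", centroids))
print(classify("The company reported profit and shares rose", centroids))
```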

[Figure: a Categorization System assigns incoming documents to categories such as Sports, Business, Education, Science, …]


15

Examples of Text Categorization

• News article classification

• Meta-data annotation

• Automatic Email sorting

• Web page classification


16

Sample Applications



Document/Term Clustering



17

The Clustering Problem

• Discover “natural structure”

• Group similar objects together

• Objects can be documents, terms, or passages

• Example
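As a small illustration of grouping similar objects, the sketch below links any two documents whose cosine similarity reaches a threshold and reports the connected groups (single-link clustering). The threshold of 0.3, the function names, and the four sample texts are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

def tf_vector(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster(texts, threshold=0.3):
    """Group texts into connected components of the 'similar enough' graph."""
    vectors = [tf_vector(t) for t in texts]
    parent = list(range(len(texts)))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

texts = [
    "text mining finds patterns in text",
    "mining patterns from text collections",
    "football match ends in a draw",
    "the football match was exciting",
]
print(cluster(texts))  # [[0, 1], [2, 3]]
```

The same function works for terms or passages as long as each object can be turned into a vector.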


18

Similarity-induced Structure


19

Examples of Doc/Term Clustering

• Clustering of retrieval results

• Clustering of documents in the whole collection

• Term clustering to define “concept” or “theme”

• Automatic construction of hyperlinks

• In general, very useful for text mining


20

Sample Applications




Text Summarization


21

“Retrieval-based” Summarization

• Observation: term vector summary?

• Basic approach

– Rank “sentences”, and select top N as a summary

• Methods for ranking sentences

– Based on term weights

– Based on position of sentences

– Based on the similarity of sentence and document vector
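A minimal sketch of the third ranking method, assuming raw term-frequency vectors and a naive sentence splitter on ., !, and ?. It scores each sentence by cosine similarity to the whole-document vector and keeps the top N in their original order. All names and the sample text are illustrative.

```python
import math
import re
from collections import Counter

def tf_vector(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def summarize(text, n=1):
    """Return the n sentences most similar to the document, in document order."""
    # Naive splitter: break after sentence-final punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    doc_vec = tf_vector(text)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(tf_vector(sentences[i]), doc_vec),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:n])]

text = (
    "Text mining extracts patterns from text. "
    "The weather was nice. "
    "Text mining uses text similarity."
)
print(summarize(text, n=2))
```

Here the off-topic weather sentence scores lowest and is dropped; position-based or term-weight-based ranking would only change the `key` function.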


22

Examples of Summarization

• News summary

• Summarize retrieval results

– Single doc summary

– Multi-doc summary

• Summarize a cluster of documents (automatic label creation for clusters)


23

NLP-Style Text Mining Techniques

Most of the following slides are from William Cohen’s IE tutorial


24

What is “Information Extraction”?

As a family of techniques:

Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments: Microsoft Corporation; CEO; Bill Gates; Microsoft; Gates; Microsoft; Bill Veghte; Microsoft; VP; Richard Stallman; founder; Free Software Foundation

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..


25

Landscape of IE Tasks: Complexity

E.g. word patterns:

• Closed set: U.S. states

– He was born in Alabama…

– The big Wyoming sky…

• Regular set: U.S. phone numbers

– Phone: (413) 545-1323

– The CALD main office can be reached at 412-268-1299

• Complex pattern: U.S. postal addresses

– University of Arkansas, P.O. Box 140, Hope, AR 71802

– Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210

• Ambiguous patterns, needing context and many sources of evidence: Person names

– …was among the six houses sold by Hope Feldman that year.

– Pawel Opalinski, Software Engineer at WhizBang Labs.
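The two simplest cases on this slide can be sketched directly: a closed set handled by lexicon lookup and a regular set handled by a regular expression. The lexicon below is deliberately truncated to four states, and the phone pattern only covers the two formats in the slide's examples; both are assumptions for illustration. Postal addresses and person names need richer models and are not attempted here.

```python
import re

# Truncated lexicon for illustration; a real one lists all U.S. states.
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

# Covers "(413) 545-1323" and "412-268-1299" only (an assumed, narrow pattern).
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}(?!\d)")

def find_states(text):
    """Closed set: report tokens that are members of the lexicon."""
    return [tok for tok in re.findall(r"[A-Za-z]+", text) if tok in US_STATES]

def find_phones(text):
    """Regular set: report substrings that match the phone pattern."""
    return PHONE_RE.findall(text)

print(find_states("He was born in Alabama… The big Wyoming sky…"))
print(find_phones("Phone: (413) 545-1323"))
print(find_phones("The CALD main office can be reached at 412-268-1299"))
```

Single-token lookup would miss multi-word state names, and neither function uses context, which is exactly what the ambiguous cases (for example "Hope" as a town versus a first name) require.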


26

Landscape of IE Techniques

Any of these models can be used to capture words, formatting or both.

Example sentence: Abraham Lincoln was born in Kentucky.

• Lexicons: is a candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?

• Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to

• Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes

• Boundary Models: a classifier marks BEGIN and END boundaries

• Context Free Grammars: most likely parse? (NNP, V, P, NP, PP, VP, S)

• Finite State Machines: most likely state sequence?
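For the finite state machine case, "most likely state sequence" is typically computed with the Viterbi algorithm. The sketch below runs it on a toy two-state hidden Markov model whose only observation is whether a token is capitalized. Every probability here is hand-set for illustration and not learned from data; the state names and the `observe` feature are assumptions.

```python
import math

STATES = ("ENTITY", "OTHER")
# Hand-set toy parameters (assumptions, not estimated from any corpus).
START = {"ENTITY": 0.5, "OTHER": 0.5}
TRANS = {
    "ENTITY": {"ENTITY": 0.5, "OTHER": 0.5},
    "OTHER": {"ENTITY": 0.2, "OTHER": 0.8},
}
EMIT = {
    "ENTITY": {"cap": 0.9, "low": 0.1},
    "OTHER": {"cap": 0.1, "low": 0.9},
}

def observe(token):
    """Reduce a token to a single feature: capitalized or not."""
    return "cap" if token[0].isupper() else "low"

def viterbi(tokens):
    """Return the most likely state sequence for a non-empty token list."""
    obs = [observe(t) for t in tokens]
    # score[s] = log-probability of the best path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    backpointers = []
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Follow back-pointers from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))

tokens = "Abraham Lincoln was born in Kentucky".split()
print(list(zip(tokens, viterbi(tokens))))
```

With these parameters the capitalized tokens are labeled ENTITY and the rest OTHER; a realistic extractor would use richer features and trained probabilities.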


27

Statistical Learning Style Techniques for Text Mining


Many Techniques are Available

• Supervised learning

– Classification

– Regression

• Unsupervised learning

– Topic models

– Dimension reduction

• Most relevant methods

– Generative models

– Matrix decomposition
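As one concrete instance of a generative model used for supervised classification, the sketch below implements a multinomial Naive Bayes classifier with add-one smoothing. The class and method names and the four training reviews are hypothetical; topic models and matrix decomposition are not shown.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, labeled_docs):
        self.class_counts = Counter()            # documents per class
        self.term_counts = defaultdict(Counter)  # term occurrences per class
        for text, label in labeled_docs:
            self.class_counts[label] += 1
            self.term_counts[label].update(tokenize(text))
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def log_posterior(self, text, label):
        """Unnormalized log P(label | text); unseen terms are skipped."""
        total_docs = sum(self.class_counts.values())
        total_terms = sum(self.term_counts[label].values())
        score = math.log(self.class_counts[label] / total_docs)
        for term in tokenize(text):
            if term in self.vocab:
                count = self.term_counts[label][term]
                score += math.log((count + 1) / (total_terms + len(self.vocab)))
        return score

    def predict(self, text):
        return max(self.class_counts, key=lambda label: self.log_posterior(text, label))

reviews = [
    ("great phone with excellent battery", "positive"),
    ("excellent screen and great camera", "positive"),
    ("terrible battery and poor screen", "negative"),
    ("poor camera and terrible support", "negative"),
]
model = NaiveBayes().fit(reviews)
print(model.predict("great camera and excellent battery"))
print(model.predict("poor support and terrible screen"))
```

The model is generative in the sense that each class has its own word distribution, and classification picks the class most likely to have generated the document.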

28


Topics for Discussion

• Social Science research questions:

– Mining bias: selection bias, framing bias

• Text Mining techniques

– Sentiment analysis

– Topic discovery and evolution graph

– Joint text-image analysis

29