Text Based Information Retrieval - Document Mining

8/3/2019 Text Based Information Retrieval - Document Mining


Text Based Information Retrieval - Text Mining

PKB - Antonie



Background

Humans have difficulty processing huge amounts of information

Computers can do better, using mathematics

Why not also use computers to process huge amounts of information?

Searching a large text to find:
  Terrorist attacks in 1995?
  The relation between terrorist movements and bombs?

This relates to Information Retrieval, Data Mining, and Text Mining



Terminology

Data Mining

A step in the knowledge discovery process consisting of particular algorithms (methods) that produce a particular enumeration of patterns (models) over the data.

Data Mining is a process of discovering advantageous patterns in data.

Knowledge Discovery Process

The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.



What kind of data in Data Mining?

Relational Databases
Data Warehouses
Transactional Databases
Advanced Database Systems
  Object-Relational
  Multimedia
  Text
  Heterogeneous and Distributed
  WWW

Data Mining Applications:

Market analysis
Risk analysis and management
Fraud detection and detection of unusual patterns (outliers)
Text mining (news groups, email, documents) and Web mining
Stream data mining



Knowledge Discovery



What Is Text Mining?

The objective of Text Mining is to exploit information contained in textual documents in various ways, including discovery of patterns and trends in data, associations among entities, predictive rules, etc. (Grobelnik et al., 2001)

Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known. (Hearst, 1999)

The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.

An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.



Text Mining (2)

What is "previously unknown" information?

Strict definition
  Information that not even the writer knows.

Lenient (soft) definition
  Rediscover the information that the author encoded in the text
  e.g., automatically extracting a product's name from a web page.



Text Mining Methods

Information Retrieval
  Indexing and retrieval of textual documents
Information Extraction
  Extraction of partial knowledge in the text
Web Mining
  Indexing and retrieval of textual documents and extraction of partial knowledge using the web
Clustering
  Generating collections of similar text documents



Text Mining Application

Email: Spam filtering
News Feeds: Discover what is interesting
Medical: Identify relationships and link information from different medical fields
Marketing: Discover distinct groups of potential buyers and make suggestions for other products
Industry: Identify groups of competitors' web pages
Job Seeking: Identify parameters in searching for jobs



Information Retrieval (1)

Given:
  A source of textual documents
  A well-defined, limited query (text based)

Find:
  Sentences with relevant information
  Extract the relevant information and ignore non-relevant information (important!)
  Link related information and output in a predetermined format

Examples: news stories, e-mails, web pages, photographs, music, statistical data, biomedical data, etc.

Information items can be in the form of text, image, video, audio, numbers, etc.



Information Retrieval (2)

Two basic information retrieval (IR) processes:
  Browsing or navigation system
    The user skims the document collection by jumping from one document to another via hypertext or hypermedia links until a relevant document is found
  Classical IR system: question answering system
    Query: question in natural language
    Answer: directly extracted from the text of the document collection

Text Based Information Retrieval:
  Information item (document): text format (written/spoken) or has a textual description
  Information need (query): usually in text format
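Matching a text query against text documents can be sketched with a tiny inverted index. This is only an illustration: the three sample documents, the function names `build_index` and `search`, and the choice to rank by the number of matching query words are assumptions, not something taken from the slides.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return document ids ranked by how many query words they contain."""
    hits = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    # Sort by descending match count, then by id for a stable order.
    return sorted(hits, key=lambda d: (-hits[d], d))

# Hypothetical sample collection.
docs = {
    "doc1": "football fans in London",
    "doc2": "car prices in California",
    "doc3": "football league championship in London",
}
index = build_index(docs)
print(search(index, "football London"))
```

A real IR system would add the preprocessing and weighting steps discussed later in the talk; here every matching word simply counts once.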



Classical IR System Process



Intelligent Information Retrieval

Meaning of words
  Synonyms: buy / purchase
  Ambiguity: bat (baseball vs. mammal)

Order of words in the query
  hot dog stand in the amusement park
  hot amusement stand in the dog park
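Why word order matters can be seen with a small check: a system that only counts words cannot tell the two queries apart. The helper name `bag_of_words` is an assumption made for this sketch.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring word order."""
    return Counter(text.lower().split())

q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

same_bag = bag_of_words(q1) == bag_of_words(q2)
same_order = q1.split() == q2.split()
print(same_bag, same_order)
```

Both queries contain exactly the same words, so only a representation that keeps word order (or phrases) can separate them.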



Why Mine the Web?

Enormous wealth of textual information on the Web
  Book/CD/Video stores (e.g., Amazon)
  Restaurant information (e.g., Zagats)
  Car prices (e.g., Carpoint)

Lots of data on user access patterns
  Web logs contain the sequence of URLs accessed by users

Possible to retrieve previously unknown information
  People who ski also frequently break their legs.
  Restaurants that serve seafood in California are likely to be outside San Francisco.



Mining the Web

[Diagram labels: Web Spider, Documents source, Query, IR / IE System, Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]



What is Web Clustering?

Given:
  A source of textual documents
  A similarity measure, e.g., how many words are common in these documents

[Diagram labels: Documents source, Similarity measure, Clustering System, groups of Docs]

Find:
  Several clusters of documents that are relevant to each other
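The "how many words are common" measure can be sketched in a few lines. The function names are assumptions, and the Jaccard ratio is added here only as one common way to normalize the count; the slides do not name it.

```python
def words(text):
    """Set of distinct lowercase words in a document."""
    return set(text.lower().split())

def common_words(a, b):
    """Number of distinct words shared by two documents."""
    return len(words(a) & words(b))

def jaccard(a, b):
    """Shared words divided by all distinct words in either document."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

d1 = "Italian football fans were cheering"
d2 = "An English football fan"
d3 = "An average USA salesman earns 75K"

print(common_words(d1, d2), common_words(d1, d3), jaccard(d1, d2))
```

Note that d2 and d3 share only the word "An", which hints at why very common words are usually filtered out before measuring similarity.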



Text characteristics

Large textual database
  Efficiency consideration
    over 2,000,000,000 web pages
    almost all publications are also in electronic form

High dimensionality (sparse input)
  Consider each word/phrase as a dimension

Dependency
  Relevant information is a complex conjunction of words/phrases
  e.g., document categorization, pronoun disambiguation



Text characteristics

Ambiguity
  Word ambiguity
    Pronouns (he, she, ...)
    buy, purchase
  Semantic ambiguity
    The king saw the rabbit with his glasses. (? meanings)

Noisy data
  Example: spelling mistakes

Not well-structured text
  Chat rooms
    "r u available ?"
    "Hey whazzzzzz up"
  Speech



Text mining process

Text preprocessing
  Syntactic/semantic text analysis
Features generation
  Bag of words
Features selection
  Simple counting
  Statistics
Text/Data mining
  Classification (supervised learning)
  Clustering (unsupervised learning)
Analyzing results



Syntactic / Semantic text analysis

Part-of-speech (POS) tagging
  Find the corresponding POS for each word
  e.g., John (noun) gave (verb) the (det) ball (noun)

Word sense disambiguation
  Context based or proximity based
  Very accurate

Parsing
  Generates a parse tree (graph) for each sentence
  Each sentence is a stand-alone graph



Feature Generation: Bag of words

A text document is represented by the words it contains (and their occurrences)
  e.g., "Lord of the rings" → {the, Lord, rings, of}
  Highly efficient
  Makes learning far simpler and easier
  Order of words is not that important for certain applications

Stemming: identifies a word by its root
  Reduces dimensionality
  e.g., flying, flew → fly
  Use the Porter algorithm

Stop words: the most common words are unlikely to help text mining
  e.g., the, a, an, you
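The three ideas on this slide can be combined in a short sketch. The stop-word list is the one from the slide. The `crude_stem` helper is a deliberately rough stand-in that only strips "ing" and "s"; it is an assumption made for illustration and is not the Porter algorithm.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "you"}  # the examples given on the slide

def crude_stem(word):
    """Very rough suffix stripping; a stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text, stop_words=STOP_WORDS):
    """Lowercase, drop stop words, stem, and count what remains."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return Counter(crude_stem(t) for t in tokens if t and t not in stop_words)

print(bag_of_words("Lord of the rings"))
print(bag_of_words("The fans were flying"))
```

A suffix stripper like this cannot map an irregular form such as "flew" to "fly"; that would need a dictionary-based step.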



Feature selection

Reduce dimensionality
  Learners have difficulty addressing tasks with high dimensionality

Irrelevant features
  Not all features help!
  e.g., the existence of a noun in a news article is unlikely to help classify it as "politics" or "sport"

Use weighting
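The slide does not say which weighting scheme is meant. As one common possibility, the sketch below uses TF-IDF (term frequency times inverse document frequency), which is an assumption here; the function name `tf_idf` and the sample documents are also made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    weight = (count / document length) * log(N / document frequency)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

docs = [
    "the game in london",
    "the football league",
    "the salesman earns",
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A word such as "the" that occurs in every document gets weight 0, so it stops contributing as a feature, while rarer words keep a positive weight.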



Text Mining: Classification definition

Given: a collection of labeled records (training set)
  Each record contains a set of features (attributes) and the true class (label)

Find: a model for the class as a function of the values of the features

Goal: previously unseen records should be assigned a class as accurately as possible
  A test set is used to determine the accuracy of the model.
  Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
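Dividing a labeled data set into training and test sets might look like the sketch below. The function name `train_test_split`, the 25% test fraction, and the fixed seed are assumptions chosen for the illustration; the records are the labeled sentences from the hooligan example in this talk.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# (text, label) pairs from the hooligan example.
records = [
    ("An English football fan", "Yes"),
    ("During a game in Italy", "Yes"),
    ("England has been beating France", "Yes"),
    ("Italian football fans were cheering", "No"),
    ("An average USA salesman earns 75K", "No"),
    ("The game in London was horrific", "Yes"),
    ("Manchester city is likely to win the championship", "Yes"),
    ("Rome is taking the lead in the football league", "Yes"),
]
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is built only from `train`; `test` is held back so that accuracy is measured on records the model has not seen.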



Text Mining: Clustering definition

Given: a set of documents and a similarity measure among documents

Find: clusters such that:
  Documents in one cluster are more similar to one another
  Documents in separate clusters are less similar to one another

Goal: finding a correct set of documents

Similarity measures:
  Euclidean distance if attributes are continuous
  Other problem-specific measures
    e.g., how many words are common in these documents
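Euclidean distance between documents can be sketched by first turning each document into a vector of word counts. The sample documents and the helper names `count_vector` and `euclidean` are assumptions for this illustration.

```python
import math
from collections import Counter

def count_vector(text, vocabulary):
    """Word counts of a document, in the fixed order of the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

docs = [
    "football fans cheering",
    "football fans fighting",
    "salesman earns money",
]
vocabulary = sorted({w for d in docs for w in d.lower().split()})
vectors = [count_vector(d, vocabulary) for d in docs]

d01 = euclidean(vectors[0], vectors[1])
d02 = euclidean(vectors[0], vectors[2])
print(d01, d02)
```

The two football documents are closer to each other than either is to the salesman document, so a clustering method based on this distance would tend to group them together.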



Supervised vs. Unsupervised Learning

Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set

Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data



Evaluation: What Is Good Classification?

Correct classification: the known label of a test sample is identical to the class result from the classification model

Accuracy ratio: the percentage of test set samples that are correctly classified by the model

A distance measure between classes can be used
  e.g., classifying a "football" document as a "basketball" document is not as bad as classifying it as "crime".
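Both evaluation ideas can be sketched together: plain accuracy, and an error score that uses a distance between classes. The distance values in `CLASS_DISTANCE` and the two sets of predictions are hypothetical numbers chosen only to show the effect.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class equals the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical distances between classes: larger means a worse mistake.
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 1,
    frozenset(["football", "crime"]): 3,
    frozenset(["basketball", "crime"]): 3,
}

def mean_class_distance(true_labels, predicted_labels):
    """Average distance between the true and the predicted class."""
    total = sum(
        0 if t == p else CLASS_DISTANCE[frozenset([t, p])]
        for t, p in zip(true_labels, predicted_labels)
    )
    return total / len(true_labels)

truth = ["football", "football", "crime", "basketball"]
model_a = ["basketball", "football", "crime", "basketball"]
model_b = ["crime", "football", "crime", "basketball"]

print(accuracy(truth, model_a), accuracy(truth, model_b))
print(mean_class_distance(truth, model_a), mean_class_distance(truth, model_b))
```

Both hypothetical models make one mistake, so their accuracy is the same, but the distance-based score penalizes the football-as-crime error more than the football-as-basketball error.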



Evaluation: What Is Good Clustering?

A good clustering method produces high-quality clusters with:
  high intra-class similarity
  low inter-class similarity

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
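Intra-class and inter-class similarity can be measured for a given clustering as in the sketch below. The similarity used here (number of shared words), the function names, and the two sample clusters are assumptions for the illustration.

```python
from itertools import combinations

def shared_words(a, b):
    """Number of distinct words two documents have in common."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def mean(values):
    """Arithmetic mean, or 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def intra_similarity(clusters):
    """Average similarity over pairs of documents in the same cluster."""
    return mean(
        shared_words(a, b) for c in clusters for a, b in combinations(c, 2)
    )

def inter_similarity(clusters):
    """Average similarity over pairs of documents in different clusters."""
    return mean(
        shared_words(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )

clusters = [
    ["football fans cheering", "football fans fighting"],
    ["salesman earns money", "salesman earns bonus"],
]
print(intra_similarity(clusters), inter_similarity(clusters))
```

For this hand-made clustering, documents inside a cluster share two words on average while documents from different clusters share none, which is the pattern a good clustering should show.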



Text Classification: An Example

Training Set

Ex#  Text                                               Hooligan
1    An English football fan                            Yes
2    During a game in Italy                             Yes
3    England has been beating France                    Yes
4    Italian football fans were cheering                No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester city is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

Test Set

Text                                             Hooligan
A Danish football fan                            ?
Turkey is playing vs. France. The Turkish fans   ?

[Diagram labels: Training Set, Learn Classifier, Model, Test Set]



Decision Tree: A Text Example

Training set: the same eight labeled examples (Ex# 1-8) as on the previous slide.

[Diagram: decision tree with splitting attributes. Surviving labels: English (Yes / No), MarSt (Married / Single, Divorced), Income (> 80K / < 80K), leaves YES / NO]

The splitting attribute at a node is determined based on a specific attribute selection algorithm



Classification by DT Induction

Decision tree
  A flow-chart-like tree structure
  Internal node denotes a test on an attribute
  Branch represents an outcome of the test
  Leaf nodes represent class labels or class distribution

Decision tree generation consists of two phases:
  Tree construction
  Tree pruning
    Identify and remove branches that reflect noise or outliers

Use of decision tree: classifying an unknown sample
  Test the attributes of the sample against the decision tree
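Classifying an unknown sample with a decision tree can be sketched as a walk from the root to a leaf. The tree below is hand-built and hypothetical: it was written to agree with the eight hooligan training examples from this talk and was not produced by any tree construction or pruning algorithm. The dictionary layout and the name `classify` are assumptions.

```python
# Internal nodes test whether a word occurs in the document;
# leaves hold a class label. Hand-built for illustration, not learned.
TREE = {
    "word": "english",
    "yes": {"label": "Yes"},
    "no": {
        "word": "italian",
        "yes": {"label": "No"},
        "no": {
            "word": "salesman",
            "yes": {"label": "No"},
            "no": {"label": "Yes"},
        },
    },
}

def classify(node, text):
    """Walk from the root to a leaf, following the outcome of each test."""
    words = set(text.lower().split())
    while "label" not in node:
        node = node["yes"] if node["word"] in words else node["no"]
    return node["label"]

print(classify(TREE, "An English football fan"))
print(classify(TREE, "An average USA salesman earns 75K"))
print(classify(TREE, "A Danish football fan"))
```

Each internal node is a test on one attribute (here, the presence of a word), each branch is an outcome of that test, and each leaf is a class label, matching the structure described on the slide.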



Summary

Text is tricky to process, but OK results are easily achieved

There exist several text mining systems
  e.g., D2K - Data to Knowledge
  http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/

Additional intelligence can be integrated with text mining
  One may play with any phase of the text mining process



Summary

There are many other scientific and statistical text mining methods developed but not covered in this talk.
  http://www.cs.utexas.edu/users/pebronia/text-mining/
  http://filebox.vt.edu/users/wfan/text_mining.html

Also, it is important to study the theoretical foundations of data mining.
  Data Mining Concepts and Techniques / J.Han & M.Kamber
  Machine Learning / T.Mitchell