rusli-taher
8/3/2019 Text Based Information Retrieval - Document Mining
Text Based Information
Retrieval - Text Mining
PKB - Antonie
Background
Humans have difficulty processing huge amounts of information
Computers do better at mathematics
Why not also use computers to process huge amounts of information?
Finding answers in a large body of text:
Terrorist attacks in 1995?
The relation between terrorist movements and bombs?
This relates to Information Retrieval, Data Mining, and Text Mining
Terminology
Data Mining
A step in the knowledge discovery process consisting of particular algorithms (methods) that produces a particular enumeration of patterns (models) over the data.
Data Mining is a process of discovering advantageous patterns in data.
Knowledge Discovery Process
The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
What kind of data in Data Mining?
Relational Databases
Data Warehouses
Transactional Databases
Advanced Database Systems
  Object-Relational
  Multimedia
  Text
  Heterogeneous and Distributed
  WWW
Data Mining Applications:
Market analysis
Risk analysis and management
Fraud detection and detection of unusual patterns (outliers)
Text mining (news group, email, documents) and Web mining
Stream data mining
Knowledge Discovery
What Is Text Mining?
The objective of Text Mining is to exploit information contained in textual documents in various ways, including discovery of patterns and trends in data, associations among entities, predictive rules, etc. (Grobelnik et al., 2001)
Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known. (Hearst, 1999)
The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.
An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.
Text Mining (2)
What is previously unknown information?
Strict definition
Information that not even the writer knows.
Lenient (soft) definition
Rediscover the information that the author encoded in the text
e.g., automatically extracting a product's name from a web page.
Text Mining Methods
Information Retrieval
Indexing and retrieval of textual documents
Information Extraction
Extraction of partial knowledge in the text
Web Mining
Indexing and retrieval of textual documents and extraction of partial knowledge using the web
Clustering
Generating collections of similar text documents
Text Mining Application
Email: Spam filtering
News Feeds: Discover what is interesting
Medical: Identify relationships and link information from different medical fields
Marketing: Discover distinct groups of potential buyers and make suggestions for other products
Industry: Identifying groups of competitors' web pages
Job Seeking: Identify parameters in searching for jobs
Information Retrieval (1)
Given:
A source of textual documents
A well-defined, limited query (text based)
Find:
Sentences with relevant information
Extract the relevant information and ignore non-relevant information (important!)
Link related information and output in a predetermined format
Example: news stories, e-mails, web pages, photographs, music, statistical data, biomedical data, etc.
Information items can be in the form of text, image, video, audio, numbers, etc.
Information Retrieval (2)
Two basic information retrieval (IR) processes:
Browsing or navigation system
User skims the document collection by jumping from one document to another via hypertext or hypermedia links until a relevant document is found
Classical IR system: question answering system
Query: question in natural language
Answer: directly extracted from the text of the document collection
Text Based Information Retrieval:
Information item (document): text format (written/spoken) or has a textual description
Information need (query): usually in text format
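A minimal sketch of text-based retrieval, assuming documents and queries are plain strings and that relevance is scored simply by how many distinct query words a document contains. The function name `rank_documents`, the scoring rule, and the sample documents are illustrative assumptions, not something the slides specify.

```python
def rank_documents(query, documents):
    """Return (score, doc_id) pairs, best match first.

    The score is the number of distinct query words found in the document.
    Documents that share no word with the query are left out.
    """
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        doc_words = set(text.lower().split())
        score = len(query_words & doc_words)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by doc_id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical document collection, made up for this example.
documents = {
    "doc1": "terrorist attack reported in 1995",
    "doc2": "football fans were cheering",
    "doc3": "bomb attack linked to terrorist movement",
}
ranking = rank_documents("terrorist attack 1995", documents)
print(ranking)
```

A real IR system would index the collection ahead of time and weight terms rather than count raw overlap; this only shows the query-in, ranked-documents-out shape.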
Classical IR System Process
Intelligent Information Retrieval
Meaning of words
Synonyms: buy / purchase
Ambiguity: bat (baseball vs. mammal)
Order of words in the query
hot dog stand in the amusement park
hot amusement stand in the dog park
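A small sketch of why word order matters, using the two queries above. If a system only counts words, the two queries look identical even though they mean different things. The helper name `bag_of_words` is introduced here for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

# Same words with the same counts: the two bags are equal.
same_bag = bag_of_words(q1) == bag_of_words(q2)
# The word sequences themselves differ.
same_order = q1.split() == q2.split()
print(same_bag, same_order)
```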
Why Mine the Web?
Enormous wealth of textual information on the Web.
Book/CD/Video stores (e.g., Amazon)
Restaurant information (e.g., Zagats)
Car prices (e.g., Carpoint)
Lots of data on user access patterns
Web logs contain sequences of URLs accessed by users
Possible to retrieve previously unknown information
People who ski also frequently break their leg.
Restaurants that serve seafood in California are likely to be outside San Francisco
Mining the Web
[Figure: Web Spider, Documents source, Query, IR / IE System, Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
What is Web Clustering ?
Given:
A source of textual documents
Similarity measure, e.g., how many words are common in these documents
Find:
Several clusters of documents that are relevant to each other
[Figure: Documents source, Similarity measure, Clustering System, clusters of Docs]
Text characteristics
Large textual database
Efficiency consideration
over 2,000,000,000 web pages
almost all publications are also in electronic form
High dimensionality (sparse input)
Consider each word/phrase as a dimension
Dependency
relevant information is a complex conjunction of words/phrases
e.g., document categorization, pronoun disambiguation
Text characteristics
Ambiguity
Word ambiguity
Pronouns (he, she)
buy, purchase
Semantic ambiguity
The king saw the rabbit with his glasses. (? meanings)
Noisy data
Example: spelling mistakes
Not well-structured text
Chat rooms
r u available ?
Hey whazzzzzz up
Speech
Text mining process
Text preprocessing
Syntactic/Semantic text analysis
Features Generation
Bag of words
Features Selection
Simple counting
Statistics
Text/Data Mining
Classification (supervised learning)
Clustering (unsupervised learning)
Analyzing results
Syntactic / Semantic text analysis
Part Of Speech (POS) tagging
Find the corresponding POS for each word
e.g., John (noun) gave (verb) the (det) ball (noun)
Word sense disambiguation
Context based or proximity based
Very accurate
Parsing
Generates a parse tree (graph) for each sentence
Each sentence is a stand-alone graph
Feature Generation: Bag of words
Text document is represented by the words it contains (and their occurrences)
e.g., Lord of the rings → {the, Lord, rings, of}
Highly efficient
Makes learning far simpler and easier
Order of words is not that important for certain applications
Stemming: identifies a word by its root
Reduce dimensionality
e.g., flying, flew → fly
Use Porter Algorithm
Stop words: The most common words are unlikely to help text mining
e.g., the, a, an, you
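A rough sketch of bag-of-words feature generation with stop-word removal and stemming. The stop-word list is the one given above. The `naive_stem` function is a crude suffix stripper written for this example and is an assumption: it is not the Porter algorithm, and it does not handle irregular forms such as "flew".

```python
from collections import Counter

# Stop words taken from the example above.
STOP_WORDS = {"the", "a", "an", "you"}


def naive_stem(word):
    """Strip a few common suffixes.

    A crude stand-in for a real stemmer, NOT the Porter algorithm.
    A suffix is removed only if at least three letters remain.
    """
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Count stemmed words, skipping stop words. Word order is discarded."""
    tokens = text.lower().split()
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)


bag = bag_of_words("The Lord of the Rings")
print(dict(bag))
print(naive_stem("flying"))
```

Dropping stop words and merging word forms both shrink the vocabulary, which is the dimensionality reduction the slide refers to.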
Feature selection
Reduce dimensionality
Learners have difficulty addressing tasks with
high dimensionality
Irrelevant features
Not all features help!
e.g., the existence of a noun in a news article is
unlikely to help classify it as politics or sport
Use weighting
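The slide does not say which weighting scheme is meant. As an assumption, the sketch below uses TF-IDF, one common choice: a word's weight in a document is its count there times log(N / df), where N is the number of documents and df is the number of documents containing the word. Words that occur in every document get weight zero, which matches the idea that not all features help. The function name and sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


docs = ["the football game", "the election result", "the football league"]
w = tf_idf(docs)
print(w[0])
```

Here "the" appears in all three documents and is weighted 0, while "game" (one document) outweighs "football" (two documents).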
Text Mining: Classification definition
Given: a collection of labeled records (training set)
Each record contains a set of features (attributes), and the true class (label)
Find: a model for the class as a function of the values of the features
Goal: previously unseen records should be assigned a class as accurately as possible
A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
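A toy sketch of the classification setup described above: labeled records are divided into a training set and a test set, a model is built from the training set, and it is then applied to the held-out records. The word-count "model", the function names, and the sample records are all assumptions chosen to keep the example short; they are not a method named in the slides.

```python
from collections import Counter, defaultdict


def train(records):
    """Build the model: for each label, count the words seen under it."""
    word_counts = defaultdict(Counter)
    for text, label in records:
        word_counts[label].update(text.lower().split())
    return word_counts


def predict(model, text):
    """Assign the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    # Iterating over sorted labels makes ties resolve deterministically.
    return max(sorted(scores), key=scores.get)


# Hypothetical labeled records (features = words, label = topic).
labeled = [
    ("football match tonight", "sport"),
    ("election vote tonight", "politics"),
    ("football league final", "sport"),
    ("parliament vote result", "politics"),
    ("football cup result", "sport"),
    ("election campaign speech", "politics"),
]
# Hold out the last two records as the test set; the rest is the training set.
training_set, test_set = labeled[:4], labeled[4:]
model = train(training_set)
predictions = [predict(model, text) for text, _ in test_set]
print(predictions)
```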
Text Mining: Clustering definition
Given: a set of documents and a similarity measure among documents
Find: clusters such that:
Documents in one cluster are more similar to one another
Documents in separate clusters are less similar to one another
Goal: finding a correct set of documents
Similarity Measures:
Euclidean Distance if attributes are continuous
Other problem-specific measures
e.g., how many words are common in these documents
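A short sketch of the two similarity measures mentioned above, applied to plain-text documents. Treating word counts as the continuous attributes for the Euclidean distance is an assumption made for this example, as are the function names and sample documents.

```python
import math
from collections import Counter


def shared_words(doc_a, doc_b):
    """Problem-specific measure: how many distinct words the documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))


def euclidean_distance(doc_a, doc_b):
    """Euclidean distance between the word-count vectors of two documents."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    vocabulary = set(a) | set(b)
    # A Counter returns 0 for words it has not seen.
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in vocabulary))


d1 = "football fans cheering"
d2 = "football fans shouting"
d3 = "salesman earns money"

print(shared_words(d1, d2), shared_words(d1, d3))
print(euclidean_distance(d1, d2), euclidean_distance(d1, d3))
```

Note the opposite directions: more shared words means more similar, while a larger distance means less similar.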
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Evaluation: What Is Good Classification?
Correct classification: The known label of a test sample is identical to the class result from the classification model
Accuracy ratio: the percentage of test set samples that are correctly classified by the model
A distance measure between classes can be used
e.g., classifying a football document as a basketball document is not as bad as classifying it as crime.
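A sketch of the two evaluation ideas above: the accuracy ratio, and an error measure that uses a distance between classes. The distance values in `CLASS_DISTANCE` are invented for illustration (football is placed closer to basketball than to crime), and the two sets of predictions are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class equals the known label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Hypothetical distances between classes; the numbers are assumptions.
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 1,
    frozenset(["football", "crime"]): 3,
    frozenset(["basketball", "crime"]): 3,
}


def mean_error_distance(true_labels, predicted_labels):
    """Average class distance of the errors; correct predictions cost 0."""
    total = 0
    for t, p in zip(true_labels, predicted_labels):
        if t != p:
            total += CLASS_DISTANCE[frozenset([t, p])]
    return total / len(true_labels)


truth = ["football", "football", "crime", "basketball"]
model_a = ["basketball", "football", "crime", "basketball"]  # one near miss
model_b = ["crime", "football", "crime", "basketball"]       # one far miss

print(accuracy(truth, model_a), accuracy(truth, model_b))
print(mean_error_distance(truth, model_a), mean_error_distance(truth, model_b))
```

Both hypothetical models have the same accuracy, but the distance-based measure separates them.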
Evaluation: What Is Good Clustering?
Good clustering method: produces high quality clusters with . . .
high intra-class similarity
low inter-class similarity
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
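A minimal sketch of checking intra-class versus inter-class similarity for a given clustering. Jaccard similarity over word sets is used here as an assumed similarity measure, and the two clusters are made-up examples.

```python
from itertools import combinations


def jaccard(doc_a, doc_b):
    """Shared words divided by all distinct words in the two documents."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)


def average(values):
    values = list(values)
    return sum(values) / len(values)


def intra_cluster_similarity(cluster):
    """Average similarity over all pairs of documents inside one cluster."""
    return average(jaccard(a, b) for a, b in combinations(cluster, 2))


def inter_cluster_similarity(cluster_a, cluster_b):
    """Average similarity over all pairs taken from two different clusters."""
    return average(jaccard(a, b) for a in cluster_a for b in cluster_b)


# Hypothetical clusters.
sports = ["football fans cheering", "football fans shouting"]
business = ["salesman earns money", "salesman earns bonus"]

print(intra_cluster_similarity(sports), intra_cluster_similarity(business))
print(inter_cluster_similarity(sports, business))
```

A good clustering shows the pattern seen here: the within-cluster averages are high relative to the between-cluster average.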
Text Classification: An Example
Training Set:
Ex#  Text                                                Hooligan
1    An English football fan                             Yes
2    During a game in Italy                              Yes
3    England has been beating France                     Yes
4    Italian football fans were cheering                 No
5    An average USA salesman earns 75K                   No
6    The game in London was horrific                     Yes
7    Manchester city is likely to win the championship   Yes
8    Rome is taking the lead in the football league      Yes
Test Set:
Text                                             Hooligan
A Danish football fan                            ?
Turkey is playing vs. France. The Turkish fans   ?
[Figure: Training Set, Learn Classifier, Model, Test Set]
Decision Tree: A Text Example
Ex#  Text                                                Hooligan
1    An English football fan                             Yes
2    During a game in Italy                              Yes
3    England has been beating France                     Yes
4    Italian football fans were cheering                 No
5    An average USA salesman earns 75K                   No
6    The game in London was horrific                     Yes
7    Manchester city is likely to win the championship   Yes
8    Rome is taking the lead in the football league      Yes
[Figure: decision tree with splitting attributes English (Yes / No), MarSt (Married / Single, Divorced), and Income (> 80K / < 80K); leaf labels YES / NO]
The splitting attribute at a node is determined based on a specific attribute selection algorithm
Classification by DT Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases:
Tree construction
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: classifying an unknown sample
Test the attributes of the sample against the decision tree
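A sketch of how an unknown sample is classified with an already built decision tree: start at the root, apply each internal node's test, follow the matching branch, and stop at a leaf. The tree below is hand-written for illustration and is an assumption. It was not produced by a tree construction or pruning algorithm, and using "does this word occur in the text" as the attribute test is also an assumption.

```python
# Hypothetical tree. Internal nodes test whether a word occurs in the text;
# leaves carry a class label.
tree = {
    "test_word": "english",
    "yes": {"label": "Yes"},
    "no": {
        "test_word": "game",
        "yes": {"label": "Yes"},
        "no": {"label": "No"},
    },
}


def classify(node, text):
    """Walk from the root to a leaf, following the outcome of each test."""
    words = set(text.lower().split())
    while "label" not in node:
        branch = "yes" if node["test_word"] in words else "no"
        node = node[branch]
    return node["label"]


print(classify(tree, "An English football fan"))
print(classify(tree, "The game in London was horrific"))
print(classify(tree, "An average USA salesman earns 75K"))
```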
Summary
Text is tricky to process, but acceptable results are easily achieved
There exist several text mining systems
e.g., D2K - Data to Knowledge
http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/
Additional intelligence can be integrated with text mining
One may play with any phase of the text mining process
Summary
There are many other scientific and statistical text mining methods developed but not covered in this talk.
http://www.cs.utexas.edu/users/pebronia/text-mining/
http://filebox.vt.edu/users/wfan/text_mining.html
Also, it is important to study the theoretical foundations of data mining.
Data Mining: Concepts and Techniques / J. Han & M. Kamber
Machine Learning / T. Mitchell