Text Based Information Retrieval - Document Mining


  • 8/3/2019 Text Based Information Retrieval - Document Mining


    Text Based Information Retrieval - Text Mining

    PKB - Antonie


    Background

    Humans have difficulty processing huge amounts of information.

    Computers do better at large-scale mathematics.

    Why not also use computers to process huge amounts of information?

    Given a large body of text, we may want to find:
        Terrorist attacks in 1995?
        The relation between terrorist movements and bombs?

    This relates to Information Retrieval, Data Mining, and Text Mining.


    Terminology

    Data Mining

        A step in the knowledge discovery process consisting of particular algorithms (methods) that produce a particular enumeration of patterns (models) over the data.

        Data Mining is a process of discovering advantageous patterns in data.

    Knowledge Discovery Process

        The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.


    What kind of data in Data Mining?

    Relational Databases
    Data Warehouses
    Transactional Databases
    Advanced Database Systems
        Object-Relational
        Multimedia
        Text
        Heterogeneous and Distributed
        WWW

    Data Mining Applications:

    Market analysis
    Risk analysis and management
    Fraud detection and detection of unusual patterns (outliers)
    Text mining (newsgroups, email, documents) and Web mining
    Stream data mining


    Knowledge Discovery


    What Is Text Mining?

    The objective of Text Mining is to exploit information contained in textual documents in various ways, including discovery of patterns and trends in data, associations among entities, predictive rules, etc. (Grobelnik et al., 2001)

    Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known. (Hearst, 1999)

    The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.

    An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.


    Text Mining (2)

    What is previously unknown information?

    Strict definition
        Information that not even the writer knows.

    Lenient definition
        Rediscover the information that the author encoded in the text,
        e.g., automatically extracting a product's name from a web page.


    Text Mining Methods

    Information Retrieval
        Indexing and retrieval of textual documents
    Information Extraction
        Extraction of partial knowledge in the text
    Web Mining
        Indexing and retrieval of textual documents and extraction of partial knowledge using the web
    Clustering
        Generating collections of similar text documents


    Text Mining Applications

    Email: Spam filtering
    News Feeds: Discover what is interesting
    Medical: Identify relationships and link information from different medical fields
    Marketing: Discover distinct groups of potential buyers and make suggestions for other products
    Industry: Identify groups of competitors' web pages
    Job Seeking: Identify parameters in searching for jobs


    Information Retrieval (1)

    Given:
        A source of textual documents
        A well-defined, limited query (text based)

    Find:
        Sentences with relevant information
        Extract the relevant information and ignore non-relevant information (important!)
        Link related information and output in a predetermined format

    Examples: news stories, e-mails, web pages, photographs, music, statistical data, biomedical data, etc.

    Information items can be in the form of text, image, video, audio, numbers, etc.


    Information Retrieval (2)

    Two basic information retrieval (IR) processes:
        Browsing or navigation system
            The user skims the document collection by jumping from one document to another via hypertext or hypermedia links until a relevant document is found
        Classical IR system: question answering system
            Query: question in natural language
            Answer: directly extracted from the text of the document collection

    Text Based Information Retrieval:
        Information item (document): text format (written/spoken) or has a textual description
        Information need (query): usually in text format
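
    A minimal sketch of text-based retrieval, assuming an inverted index and a ranking by the number of matching query terms. The slides do not prescribe an indexing or ranking scheme; the function names and the sample documents below are illustrative assumptions.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into word tokens on whitespace."""
    return text.lower().split()


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical mini-collection, loosely based on the questions in the Background slide.
documents = {
    "doc1": "terrorist attack reported in 1995",
    "doc2": "football fans were cheering",
    "doc3": "bomb attack linked to terrorist movement",
}
index = build_index(documents)
ranking = search(index, "terrorist bomb")
```

    Here both the information items and the information need are plain text, so the same tokenizer is applied to documents and to the query.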


    Classical IR System Process


    Intelligent Information Retrieval

    Meaning of words
        Synonyms: buy / purchase
        Ambiguity: bat (baseball vs. mammal)

    Order of words in the query
        hot dog stand in the amusement park
        hot amusement stand in the dog park
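
    The word-order point can be checked with a short sketch: the two queries contain exactly the same words, so a pure word-count representation cannot tell them apart, while adjacent word pairs (bigrams) can. The helper names are illustrative assumptions.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring word order."""
    return Counter(text.lower().split())


def bigrams(text):
    """Return the set of adjacent word pairs, which preserves local order."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

same_bag = bag_of_words(q1) == bag_of_words(q2)   # identical word counts
shared_bigrams = bigrams(q1) & bigrams(q2)        # only two of six pairs match
```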


    Why Mine the Web?

    Enormous wealth of textual information on the Web
        Book/CD/Video stores (e.g., Amazon)
        Restaurant information (e.g., Zagats)
        Car prices (e.g., Carpoint)

    Lots of data on user access patterns
        Web logs contain sequences of URLs accessed by users

    Possible to retrieve previously unknown information
        People who ski also frequently break their leg.
        Restaurants that serve seafood in California are likely to be outside San Francisco.


    Mining the Web

    [Diagram: Web Spider, Documents source, Query, IR / IE System, Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]


    What is Web Clustering?

    Given:
        A source of textual documents
        A similarity measure, e.g., how many words are common in these documents

    Find:
        Several clusters of documents that are relevant to each other

    [Diagram: Documents source and Similarity measure feed the Clustering System, which outputs groups of Docs]
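
    A minimal sketch of the similarity measure mentioned above (how many words two documents have in common). The Jaccard variant is an added assumption, not something the slides specify: it normalises the shared-word count by the combined vocabulary so that long documents are not favoured.

```python
def words(text):
    """Return the set of distinct lowercase words in a text."""
    return set(text.lower().split())


def common_words(a, b):
    """Number of distinct words the two documents share."""
    return len(words(a) & words(b))


def jaccard(a, b):
    """Shared words divided by all distinct words in either document."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


d1 = "Italian football fans were cheering"
d2 = "An English football fan"
d3 = "Car prices rise"

shared_12 = common_words(d1, d2)   # only "football" matches exactly
shared_13 = common_words(d1, d3)
```

    Note that "fans" and "fan" do not match here; stemming, covered later in the deck, is one way to address that.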


    Text characteristics

    Large textual database
        Efficiency considerations
        Over 2,000,000,000 web pages
        Almost all publications are also in electronic form

    High dimensionality (sparse input)
        Consider each word/phrase as a dimension

    Dependency
        Relevant information is a complex conjunction of words/phrases
        e.g., document categorization, pronoun disambiguation


    Text characteristics

    Ambiguity
        Word ambiguity
            Pronouns (he, she, ...)
            buy, purchase
        Semantic ambiguity
            The king saw the rabbit with his glasses. (? meanings)

    Noisy data
        Example: spelling mistakes

    Not well-structured text
        Chat rooms
            "r u available ?"
            "Hey whazzzzzz up"
        Speech


    Text mining process

    Text preprocessing
        Syntactic/Semantic text analysis

    Features Generation
        Bag of words

    Features Selection
        Simple counting
        Statistics

    Text/Data Mining
        Classification (supervised learning)
        Clustering (unsupervised learning)

    Analyzing results


    Syntactic / Semantic text analysis

    Part of Speech (POS) tagging
        Find the corresponding POS for each word
        e.g., John (noun) gave (verb) the (det) ball (noun)

    Word sense disambiguation
        Context based or proximity based
        Very accurate

    Parsing
        Generates a parse tree (graph) for each sentence
        Each sentence is a stand-alone graph


    Feature Generation: Bag of words

    A text document is represented by the words it contains (and their occurrences)
        e.g., "Lord of the rings" -> {"the", "Lord", "rings", "of"}
        Highly efficient
        Makes learning far simpler and easier
        Order of words is not that important for certain applications

    Stemming: identifies a word by its root
        Reduces dimensionality
        e.g., flying, flew -> fly
        Use the Porter algorithm

    Stop words: the most common words are unlikely to help text mining
        e.g., "the", "a", "an", "you"
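
    A minimal sketch of bag-of-words feature generation with optional stop-word removal and stemming. The `crude_stem` function is a hypothetical suffix stripper used only as a stand-in; it is not the Porter algorithm, and it cannot map irregular forms such as "flew" to "fly".

```python
from collections import Counter

# Stop-word list taken from the examples on the slide.
STOP_WORDS = frozenset({"the", "a", "an", "you"})


def crude_stem(word):
    """Strip one common suffix if at least three characters remain.

    Illustrative stand-in only; this is NOT the Porter algorithm.
    """
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, stop_words=frozenset(), stem=None):
    """Count the words of a text, ignoring order.

    Tokens found in `stop_words` are dropped; if `stem` is given,
    it is applied to every remaining token before counting.
    """
    tokens = [t for t in text.lower().split() if t not in stop_words]
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return Counter(tokens)


plain = bag_of_words("Lord of the rings")
reduced = bag_of_words("Lord of the rings", STOP_WORDS, crude_stem)
```

    Each distinct key in the resulting counter is one dimension of the feature space, which is why removing stop words and merging word forms reduces dimensionality.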


    Feature selection

    Reduce dimensionality
        Learners have difficulty addressing tasks with high dimensionality

    Irrelevant features
        Not all features help!
        e.g., the existence of a noun in a news article is unlikely to help classify it as "politics" or "sport"

    Use weighting
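
    The slide does not say which weighting scheme is meant. As one common choice, the sketch below assumes TF-IDF: a term's frequency within a document multiplied by the log of the inverse fraction of documents containing it, so that terms occurring everywhere get weight zero. The function name and sample documents are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length) * log(N / document frequency)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the game in london",
    "the game in italy",
    "the salesman earns 75k",
]
weights = tf_idf(docs)
```

    In this example "the" appears in every document and receives weight zero, while "london", which appears in only one document, receives the highest weight in the first document.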


    Text Mining: Classification definition

    Given: a collection of labeled records (training set)
        Each record contains a set of features (attributes), and the true class (label)

    Find: a model for the class as a function of the values of the features

    Goal: previously unseen records should be assigned a class as accurately as possible
        A test set is used to determine the accuracy of the model.
        Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
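
    A minimal sketch of dividing a labeled data set into training and test sets. The 25% test fraction and the fixed random seed are arbitrary assumptions for illustration; the records reuse the labeled sentences from the classification example in this deck.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records and split them into (training set, test set)."""
    shuffled = list(records)                 # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Each record is (features, label); here the "features" are just the raw text.
records = [
    ("An English football fan", "Yes"),
    ("During a game in Italy", "Yes"),
    ("England has been beating France", "Yes"),
    ("Italian football fans were cheering", "No"),
    ("An average USA salesman earns 75K", "No"),
    ("The game in London was horrific", "Yes"),
    ("Manchester city is likely to win the championship", "Yes"),
    ("Rome is taking the lead in the football league", "Yes"),
]
train_set, test_set = train_test_split(records)
```

    The model is built from `train_set` only; `test_set` is held back so that accuracy is measured on records the model has not seen.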


    Text Mining: Clustering definition

    Given: a set of documents and a similarity measure among documents

    Find: clusters such that:
        Documents in one cluster are more similar to one another
        Documents in separate clusters are less similar to one another

    Goal: finding a correct set of documents

    Similarity measures:
        Euclidean distance if attributes are continuous
        Other problem-specific measures
            e.g., how many words are common in these documents
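
    The definition above does not fix a clustering algorithm. As a sketch under that caveat, the code below uses a simple greedy scheme (often called leader clustering) with the shared-word count as the similarity measure. The threshold `min_shared` and the sample documents are illustrative assumptions.

```python
def shared_words(a, b):
    """Similarity measure: number of distinct words two documents share."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def leader_clustering(documents, min_shared=2):
    """Greedy one-pass clustering.

    Each document joins the first cluster whose leader (first member)
    shares at least `min_shared` words with it; otherwise it starts
    a new cluster and becomes that cluster's leader.
    """
    clusters = []
    for doc in documents:
        for cluster in clusters:
            if shared_words(doc, cluster[0]) >= min_shared:
                cluster.append(doc)
                break
        else:
            # No existing leader was similar enough.
            clusters.append([doc])
    return clusters


docs = [
    "english football fans",
    "italian football fans",
    "car prices in california",
    "used car prices",
]
clusters = leader_clustering(docs)
```

    The result depends on document order and on the threshold, which is one reason clustering quality needs to be evaluated separately.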


    Supervised vs. Unsupervised Learning

    Supervised learning (classification)
        Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
        New data is classified based on the training set

    Unsupervised learning (clustering)
        The class labels of the training data are unknown
        Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


    Evaluation: What Is Good Classification?

    Correct classification: the known label of a test sample is identical to the class result from the classification model

    Accuracy ratio: the percentage of test set samples that are correctly classified by the model

    A distance measure between classes can be used
        e.g., classifying a "football" document as a "basketball" document is not as bad as classifying it as "crime".
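
    A minimal sketch of both evaluation ideas: the plain accuracy ratio, and an error score that uses a distance between classes. The distance values in `CLASS_DISTANCE` are invented for illustration; the slides give only the qualitative ordering (football vs. basketball is a smaller mistake than football vs. crime).

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class equals the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical distances between classes; a larger value is a worse mistake.
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 1,
    frozenset(["football", "crime"]): 3,
    frozenset(["basketball", "crime"]): 3,
}


def mean_error_distance(true_labels, predicted_labels, distance=CLASS_DISTANCE):
    """Average class distance between the known and the predicted label."""
    total = sum(
        0 if t == p else distance[frozenset([t, p])]
        for t, p in zip(true_labels, predicted_labels)
    )
    return total / len(true_labels)


truth = ["football", "football", "crime", "basketball"]
model_a = ["basketball", "football", "crime", "basketball"]  # mild mistake
model_b = ["crime", "football", "crime", "basketball"]       # severe mistake
```

    Both hypothetical models have the same accuracy (one error out of four), but the distance-based score separates them.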


    Evaluation: What Is Good Clustering?

    A good clustering method produces high-quality clusters with...
        high intra-class similarity
        low inter-class similarity

    The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
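
    A minimal sketch of comparing intra-class and inter-class similarity for a given clustering. Averaging pairwise Jaccard similarity is an assumption made for illustration; the slides do not name a specific quality score.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Shared words divided by all distinct words in either document."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def mean(values):
    """Arithmetic mean, or 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def intra_cluster_similarity(clusters, sim=jaccard):
    """Average similarity over all document pairs inside the same cluster."""
    return mean(sim(a, b) for c in clusters for a, b in combinations(c, 2))


def inter_cluster_similarity(clusters, sim=jaccard):
    """Average similarity over all document pairs from different clusters."""
    return mean(
        sim(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a, b in product(c1, c2)
    )


clusters = [
    ["english football fans", "italian football fans"],
    ["used car prices", "car prices in california"],
]
intra = intra_cluster_similarity(clusters)
inter = inter_cluster_similarity(clusters)
```

    A good clustering in the sense of the slide shows `intra` clearly above `inter`.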


    Text Classification: An Example

    Training Set:

    Ex#  Text                                               Hooligan
    1    An English football fan                            Yes
    2    During a game in Italy                             Yes
    3    England has been beating France                    Yes
    4    Italian football fans were cheering                No
    5    An average USA salesman earns 75K                  No
    6    The game in London was horrific                    Yes
    7    Manchester city is likely to win the championship  Yes
    8    Rome is taking the lead in the football league     Yes

    Test Set:

    Text                                                    Hooligan
    A Danish football fan                                   ?
    Turkey is playing vs. France. The Turkish fans          ?

    [Diagram: Training Set, Learn Classifier, Model, Test Set]


    Decision Tree: A Text Example

    Training set:

    Ex#  Text                                               Hooligan
    1    An English football fan                            Yes
    2    During a game in Italy                             Yes
    3    England has been beating France                    Yes
    4    Italian football fans were cheering                No
    5    An average USA salesman earns 75K                  No
    6    The game in London was horrific                    Yes
    7    Manchester city is likely to win the championship  Yes
    8    Rome is taking the lead in the football league     Yes

    [Diagram: decision tree with splitting attributes English (Yes / No), MarSt (Married / Single, Divorced) and Income (> 80K / < 80K), and leaves YES / NO]

    The splitting attribute at a node is determined based on a specific attribute selection algorithm


    Classification by DT Induction

    Decision tree
        A flow-chart-like tree structure
        Internal node denotes a test on an attribute
        Branch represents an outcome of the test
        Leaf nodes represent class labels or class distribution

    Decision tree generation consists of two phases:
        Tree construction
        Tree pruning
            Identify and remove branches that reflect noise or outliers

    Use of decision tree: classifying an unknown sample
        Test the attributes of the sample against the decision tree
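
    A minimal sketch of using a decision tree to classify an unknown sample. The tree below is hand-written for illustration, not induced by any attribute selection algorithm: each internal node tests whether a word occurs in the text, each branch is an outcome of that test, and each leaf is a class label. The tested words ("english", "game") are assumptions, and this toy tree does not fit every training example in the deck.

```python
# Hypothetical hand-built tree. Internal nodes have "word", "yes", "no";
# leaves have only "label".
TREE = {
    "word": "english",
    "yes": {"label": "Yes"},
    "no": {
        "word": "game",
        "yes": {"label": "Yes"},
        "no": {"label": "No"},
    },
}


def classify(tree, text):
    """Walk from the root to a leaf, following the outcome of each word test."""
    words = set(text.lower().split())
    node = tree
    while "label" not in node:
        node = node["yes"] if node["word"] in words else node["no"]
    return node["label"]


prediction = classify(TREE, "An English football fan")
```

    Tree construction would choose the tested words from the training data, and pruning would then remove branches that only reflect noise or outliers.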


    Summary

    Text is tricky to process, but OK results are easily achieved

    There exist several text mining systems
        e.g., D2K - Data to Knowledge
        http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/

    Additional intelligence can be integrated with text mining
        One may play with any phase of the text mining process


    Summary

    There are many other scientific and statistical text mining methods developed but not covered in this talk.
        http://www.cs.utexas.edu/users/pebronia/text-mining/
        http://filebox.vt.edu/users/wfan/text_mining.html

    Also, it is important to study the theoretical foundations of data mining.
        Data Mining: Concepts and Techniques / J. Han & M. Kamber
        Machine Learning / T. Mitchell