Going Digital Workshop:

Text Mining in Digital Collections

Introduction (Part 1)

Maha Althobaiti, Udo Kruschwitz, Deirdre Lungley, Massimo Poesio

School of Computer Science and Electronic Engineering
University of Essex
[email protected]

6 February 2013


Brief Outline

Part 1: Introduction to natural language engineering (text mining)
  Motivation
  What is it?
  State of the art

Part 2: Some fundamentals

Part 3: Glimpse into sample applications:
  Web search
  Question-answering


Example - Machine Translation (Original Page)


Example - Machine Translation

http://babelfish.yahoo.com/


Example II - Information Extraction


Example III - Searching a Web Site


Early NLP Systems

ELIZA (e.g. http://psych.fullerton.edu/mbirnbaum/psych101/Eliza.htm)
  Weizenbaum 1966
  Pattern matching

PARRY
  Colby et al. 1971
  Pattern matching and planning

SHRDLU
  Winograd 1972
  Natural language understanding
  Comprehensive grammar of English


What is NLE?

Natural Language ...
  Natural Language Processing
  Computational Linguistics
  Information Retrieval
  Information Extraction

... Engineering
  Combination of techniques
  Real world applications
  Using existing tools and resources


Why is it Interesting?

Web search engines
Information systems
Wireless communication
Text classification
Text summarization
Filtering
Human-computer dialogues
Computer assisted language learning

... and more ...

Generation (e.g. AI in games)
Machine translation
Speech processing


Why is it Difficult?

Ambiguity
  bank, book, ruler ...
  I saw the man with the telescope on the hill.
  Five monkeys ate three bananas.
  Time flies like an arrow.
  Lassen Sie uns noch einen Termin ausmachen.

Language is changing
  I want to buy a mobile.
  Google it!

Ill-formed input
  accomodation office

Multilinguality
  Claudia Schiffer


Example: Ambiguity + Multilinguality


Example: Ambiguity + Multilinguality II


Various Perspectives

Linguist: How can natural language phenomena be described?

Cognitive scientist: How does the human mind work?

Computer scientist: How can a theory be efficiently implemented?

... we look at it from a computer scientist’s perspective


Two Paradigms

Symbolic (deep) processing
  hand coded rules
  knowledge-rich

Statistical (shallow) processing
  build up statistical model of language
  data-rich

... most powerful approaches nowadays are statistical!


NLE: Where we are

Mature, everyday technology
  E.g.: tokenization, normalization, regular expression search

Solid technology that can still be improved
  E.g.: lemmatization; spelling correctors; IR / Web search; speech synthesis

Used in real applications, but needs improvement
  E.g.: POS tagging; term extraction; summarization; speech recognition; text classification (e.g., for spam detection)
  Spoken dialogue systems for simple information seeking (railways, phone)

"Almost there" technology
  E.g.: information extraction, generation systems, simple speech translation systems

Pie in the sky
  E.g.: full machine translation, more advanced dialogue


Part I - Mature Technologies

Research in NLE has been going on for many years and in many forms - e.g., as part of compiler technology, information retrieval, etc.

The results of this work are a number of well-established technologies that are hardly considered "research" anymore

But: these techniques are by no means trivial, as you will find out when you write your own little NLE tools ...


Regular Expressions for Search, Validation and Parsing

Basics (e.g. search in Google)
  cat OR dog
  "Regular * in Java"

More advanced (e.g., regular expressions in Perl, Java)
  /[Ww]ordnet/
  /colou?r/
  [a-zA-Z]*
  [^A-Z]

Note also: substitution
  s/colour/color/

ELIZA
  s/.* I am (sad|depressed).*/I AM SORRY TO HEAR YOU ARE \1/
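The ELIZA-style substitution above can be tried out directly. The following is a minimal sketch in Python, assuming a single rule taken from the slide; the function name `respond` and the fallback reply are illustrative assumptions, not part of the original ELIZA.

```python
import re

# The one rule shown on the slide, written as a Python regular expression.
# \b and re.IGNORECASE are small additions so that "i am sad" also matches.
RULE = re.compile(r".*\bI am (sad|depressed).*", re.IGNORECASE)
TEMPLATE = r"I AM SORRY TO HEAR YOU ARE \1"

def respond(utterance):
    """Return an ELIZA-style reply (hypothetical helper, one rule only)."""
    match = RULE.fullmatch(utterance)
    if match:
        # \1 in the template is filled with the captured word (sad/depressed).
        return match.expand(TEMPLATE).upper()
    return "PLEASE GO ON"  # assumed fallback when no rule fires

print(respond("Well, I am depressed today"))
```

The captured group is what makes the reply look "understanding": the program only copies part of the input back into a fixed template.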


Part II: Solid Technology

Technology that could still use improvements

Over the last ten to twenty years new applications have appeared which are by now fairly well established, but whose results are still not 100% accurate (nor is it clear they ever will be!)


Stemming, Lemmatization and Morphological analysis

Stemming:
  FOXES → FOX

Lemmatization:
  'passing' → 'pass' +ING
  'were' → 'be' +PAST
  Used for: Information Retrieval (e.g. Google)

More general morphological analysis:
  Gebäudereinigungsfirmenangestellter → Gebäude + Reinigung + Firma + Angestellter (building + cleaning + company + employee)
  Helsingin sanomat → Helsinki +GEN + sanoma +PLURAL


Part of Speech Tagging

Assign a part of speech to each word:
  'dog' → NOUN
  'eat' → VERB

  Book  that  flight
  VB    DT    NN

Applications: all over the place!
  Information retrieval
  Information extraction
  Question answering
  Translation


Text Classification

News agencies: classify incoming news stories

Search engines: classify queries, e.g. search Google for Obama or LOG313

Identifying spam emails

Routing emails or documents to appropriate people


Terminology Extraction

It is becoming increasingly difficult for dictionary writers to keep track of all the new terminology introduced in scientific areas, e.g.:

  the C4 carbons of the nicotinamide rings
  X-ray therapy vs. X-ray therapies
  PAHO vs. Pan-American Health Organization

TERM EXTRACTION systems are systems that can process large amounts of scientific papers and publications and identify terminology, possibly comparing it with a previous list


Information Retrieval / Web Search / Question-Answering

Google, Yahoo!, Bing, Yandex, Baidu, ...

Wolfram Alpha question-answering (QA) system:

http://www.wolframalpha.com

evi (formerly TrueKnowledge):

http://www.evi.com

START QA system:

http://www.ai.mit.edu/projects/infolab/

This region is a hub of activity in search! E.g. have a look at Enterprise Search Meetup London and Enterprise Search Meetup Cambridge


Part III: More Advanced NL Technology

This goes from technologies which have been used for years but still have problems, to technologies for which only prototypes have been developed, e.g.:

  Generation
  Machine translation
  Speech recognition
  Spoken dialogue interfaces


Text Mining in Digital Collections - Introduction (Part 2)

Linguistic Essentials; Common Processing Pipelines


Levels of Linguistic Analysis

Word level:
  Parts of speech: DOG, EAT, RED

Sub-word level:
  Phonetics, Phonology
  Morphology

Phrase level (syntax):
  THE RED DOG, CUTTING CORNERS, BY

Semantics:
  E.g., lexical semantics
  COURSE / MODULE, DOG / ANIMAL

Discourse:
  E.g., anaphora
  MOST STUDENTS SETTING OFF ON GAP YEAR TRIPS WILL NEED THEIR MONEY AND POSSESSIONS TO FIT SECURELY AND COMPACTLY INTO A RUCKSACK AND MONEYBELT


Words: Parts of speech

Words belong to different classes
  The sad / sheep / *run / *always / *most dog barked

Basic PARTS OF SPEECH:
  NOUN: dog, man, car, law
  ADJECTIVE: red, fat, brave
  VERBS: run, barked

Best known set of part of speech TAGS: Brown TAGS
  NN for nouns
  VB for verb base forms
  JJ for adjectives in positive form

Notice: many words belong to more than one class

Open and closed classes


Levels of linguistic processing: the basic pipeline of an NLP system (e.g., GATE)

PREPROCESSING → LEXICAL PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → DISCOURSE PROCESSING


Example: processing a query to a web search engine

PREPROCESSING → LEXICAL PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → WEB ACCESS

Query: List the estate agents in Stratford, London.

Processing steps:
  TOKENIZATION
  POS TAGGING
  TERM IDENTIFICATION
  STOP WORDS
  SYNONYMS


More advanced examples: Information Extraction Systems (e.g., LASIE)

PREPROCESSING → LEXICAL PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → DISCOURSE PROCESSING


In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share

Late last night the company announced a growth of 20%.

Preprocessing, I: tokenizing


<P><W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W> <W C='W'>CEG</W> <W C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W> <W C='W'>$</W><W C='CD'>102</W> <W C='W'>million</W><W C='CM'>,</W> <W C='W'>or</W> <W C='CD'>34</W> <W C='W'>cents</W> <W C='W'>a</W> <W C='W'>share.</W></P>

<P><W C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W> <W C='W'>the</W> <W C='W'>company</W> <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W> <W C='W'>of</W> <W C='CD'>20</W><W C='W'>%</W><W C='.'>.</W></P>

PARAGRAPH MARKUP; TOKENIZER
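A tokenizer producing this kind of markup can be sketched in a few lines. This is a minimal illustration in Python, not the tokenizer used for the slide: the abbreviation list and the helper names `tokenize`, `token_class` and `markup` are assumptions, and the classes (W, CD, CM, '.') simply mirror those visible in the example.

```python
import re

# Assumed, illustrative list of abbreviations whose final period
# belongs to the token rather than ending a sentence.
ABBREVIATIONS = {"Corp.", "Inc.", "Ltd."}

def tokenize(text):
    """Split on whitespace, then separate digits, letters and symbols."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            # digit runs | letter runs | any other single symbol ($ , % .)
            tokens.extend(re.findall(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]", chunk))
    return tokens

def token_class(token):
    """Map a token to the class attribute used in the markup above."""
    if token.isdigit():
        return "CD"
    if token == ",":
        return "CM"
    if token == ".":
        return "."
    return "W"

def markup(paragraph):
    """Wrap each token of one paragraph in <W> elements inside <P>."""
    words = " ".join(f"<W C='{token_class(t)}'>{t}</W>" for t in tokenize(paragraph))
    return f"<P>{words}</P>"

print(markup("Late last night the company announced a growth of 20%."))
```

Even this small sketch needs a special case for "Corp."; that is the kind of detail that makes a "mature" technology non-trivial to write yourself.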


Preprocessing, II: sentence splitting

<P>

<S> <W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W> <W C='W'>CEG</W> <W C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W> <W C='W'>$</W><W C='CD'>102</W> <W C='W'>million</W><W C='CM'>,</W> <W C='W'>or</W> <W C='CD'>34</W> <W C='W'>cents</W> <W C='W'>a</W> <W C='W'>share</W> <W C='.'>.</W></S>

</P>

<P>

<S> <W C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W> <W C='W'>the</W> <W C='W'>company</W> <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W> <W C='W'>of</W> <W C='CD'>20</W><W C='W'>%</W> <W C='.'>.</W> </S> </P>


Lexical Processing, I: POS tagging

<W C='CD'>102</W>

<W C='CM'>,</W>


Lexical Processing, II: lemmatizing / stemming

<W C='CD'>102</W>

<W C='CM'>,</W>


An example of practical (partial) Parsing: Identifying numerical expressions

<NUMEX>

<W C='CD'>102</W>

</NUMEX> <W C='CM'>,</W>


An example of practical semantic processing: identifying semantic type

<W C='CD'>102</W>

</NUMEX> <W C='CM'>,</W>


An example of discourse processing: resolving anaphoric references

In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share

Late last night the company announced a growth of 20%.


Basic Pre-processing; Regular Expressions


The basic tasks in text processing

TOKENIZATION: identify tokens in text
WORD COUNTING: count words and their frequencies
SEARCHING FOR WORDS
NORMALIZATION:
  UDO KRUSCHWITZ, udo kruschwitz, udo krushwitz
  Udo Kruschwitz
  Feb 6, 6th of February, 06/02/2013
STEMMING
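Tokenization, normalization and word counting can be combined in a few lines. The sketch below assumes the simplest possible choices (letter-only tokens, case folding as the only normalization); the function names are illustrative.

```python
import re
from collections import Counter

def normalize(token):
    """Case folding only; spelling variants are NOT merged by this step."""
    return token.lower()

def word_frequencies(text):
    """Tokenize into letter runs, normalize, and count each word."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(normalize(t) for t in tokens)

freqs = word_frequencies("UDO KRUSCHWITZ met Udo Kruschwitz; udo krushwitz agreed.")
print(freqs.most_common(3))
```

Case folding merges "UDO" and "Udo", but the misspelling "krushwitz" is still counted as a separate word, which is why normalization usually needs more than lower-casing.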


Regular Expressions: a formalism for expressing search patterns

Because matching is a very common problem, over the years computer scientists have identified a set of patterns that
  1. Are very common
  2. Can be searched for efficiently

The language of REGULAR EXPRESSIONS has been developed to characterize these patterns

Web search tools / software systems (awk, sed, emacs) allow users to use regular expressions to specify what they are searching for
These REs are then compiled into efficient code
You do not need to write the code yourselves!


Types of regular expressions

Disjunction:
  /centre|center/
  /accomodation|accommodation/
  Also: /[Cc]entre/
        /acco(m|mm)odation/

Repetitions:
  +: Any number greater than 0
    /YES+!/ matches YES!, YESS!, YESSS!
    E.g., any binary number: [01]+
  *: 0 or more
    /ab*/ matches a, ab, abb, abbb
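These pattern types can be checked with Python's `re` module. The helper `matches` is an illustrative assumption; it tests whether a pattern matches a whole string. The sketch also shows why a character class such as `[m|mm]` is not a disjunction: inside square brackets, `|` is an ordinary character and the class matches exactly one character.

```python
import re

def matches(pattern, text):
    """True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Disjunction with | and with a character class
print(matches(r"centre|center", "center"))
print(matches(r"[Cc]entre", "Centre"))

# Grouped disjunction accepts both spellings ...
print(matches(r"acco(m|mm)odation", "accommodation"))
# ... whereas [m|mm] matches a single character only
print(matches(r"acco[m|mm]odation", "accommodation"))

# Repetition: + needs at least one occurrence, * allows none
print(matches(r"YES+!", "YESSS!"))
print(matches(r"ab*", "a"))
```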


Applications of more complex REs

Web pages about Centres and Centers:
  /[Cc]entre|[Cc]enter/

Regular expression to validate phone numbers:
  (+44)(0)20-12341234, 02012341234, +44 (0) 1234-1234
  But not: (44+)020-12341234, 12341234(+020)
  ^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$

Validating email addresses:
  [email protected], [email protected], [email protected]
  But not: asmith, @mactech.com, a@a
  ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
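Both validation patterns can be run against the examples above. This is a sketch in Python: the email pattern is written with the `{1,3}` and `{2,4}` quantifiers, which is an assumption about the original slide, and the address `asmith@example.com` is a made-up positive example.

```python
import re

PHONE = re.compile(r"^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$")
EMAIL = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@"
    r"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))"
    r"([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$"
)

def is_phone(text):
    return PHONE.match(text) is not None

def is_email(text):
    return EMAIL.match(text) is not None

for number in ["(+44)(0)20-12341234", "02012341234", "(44+)020-12341234"]:
    print(number, is_phone(number))

for address in ["asmith@example.com", "asmith", "@mactech.com", "a@a"]:
    print(address, is_email(address))
```

Note that the phone pattern is permissive: every part of it is optional, so it also accepts the empty string. A real application would probably add a minimum length check.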


REs in action (linked to our own research)

Use of regular expressions as part of a processing pipeline to process query logs of a digital library (in this case: The European Library)

Why? We want to automatically learn what query modification suggestions to propose to (future) users of the library search engine, e.g.:


1889115 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("mozart") xxxx search_url xxxx xxxx 0 xxxx - xxxx xxxx xxxx 2008-06-24 22:02:52
1889118 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("mozart") xxxx view_full xxxx xxxx 1 xxxx xxxx xxxx xxxx 2008-06-24 22:03:03
1889120 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx klavierkonzerte xxxx search_res_rec_all xxxx xxxx 0 xxxx - xxxx xxxx xxxx 2008-06-24 22:03:55
1889121 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("klavierkonzerte") xxxx view_full xxxx xxxx 1 xxxx xxxx xxxx xxxx 2008-06-24 22:04:10


Here comes (part of) the processing pipeline:

more $CleanedFile |
gawk 'BEGIN { FS = " xxxx " }
      # keep English search actions (not advanced search), drop test queries
      { if (($5 == "en") && ($4 != "null") &&
            ($6 !~ /^[ "(]*(test|toto|a)[ ")]*$/) &&
            ($7 ~ /^search_/) && ($7 !~ /^search_adv/))
          print $4 " xxxx " $1 " xxxx " $6 " xxxx " $13 }' |
sort -k 1,1 -k 3,3n |
gawk 'BEGIN { FS = " xxxx " }
      # print the queries of sessions in which the query was modified
      { if ((OLD_ID == $1) && (OLD_QUERY != $3)) { print OLD_LINE; PRINT = "on" }
        if ((OLD_ID != $1) && (PRINT == "on")) { print OLD_LINE; PRINT = "off" }
        OLD_ID = $1; split($0, a, " xxxx "); OLD_QUERY = a[3]; OLD_LINE = $0 }' > $SessionQueries


This is what I get:

8eb3bdv3odg9jncd71u0s2aff6 xxxx 1889115 xxxx ("mozart") xxxx 2008-06-24 22:02:52
8eb3bdv3odg9jncd71u0s2aff6 xxxx 1889120 xxxx klavierkonzerte xxxx 2008-06-24 22:03:55


Basic Ideas of Statistical Approaches


Statistical NLE - Motivation

So far: purely symbolic approaches

But instead of hand-coding rules we could derive knowledge from text corpora automatically.

Sample applications:
  Spell checking
  Handwriting recognition

Example: How can Google predict the correct spelling of words?


Statistical Methods in NLE

Two characteristics of NL make it desirable to endow programs with the ability to LEARN from examples of past use:

  VARIETY (no programmer can really take into account all possibilities)
  AMBIGUITY (need to have ways of choosing between alternatives)

In a number of NLE applications, statistical methods are very common

The simplest application: WORD PREDICTION


We are good at word prediction

Stocks plunged this morning, despite a cut in interest

Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall


Statistics and word prediction

The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of spelling error

I.e., to compute

$$P(w \mid W_1^{N-1})$$

for all words w, where $W_1^{N-1}$ stands for the preceding words $W_1 \ldots W_{N-1}$, and predict as next word the one for which this (conditional) probability is highest.


Using corpora to estimate probabilities

But where do we get these probabilities?

Idea: estimate them by RELATIVE FREQUENCY.

The simplest method: Maximum Likelihood Estimate (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered in the corpus.

$$P(W_1 \ldots W_n) = \frac{C(W_1 \ldots W_n)}{N}$$
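Counting is all that is needed to turn this into a working word predictor. The sketch below is an illustration under a simplifying assumption not stated on the slide: only the single previous word is used as context (a bigram model), so the conditional probability is estimated as C(prev w) / C(prev). The toy corpus and function names are made up.

```python
from collections import Counter

def train_bigrams(tokens):
    """Count contexts and word pairs in a tokenized corpus."""
    contexts = Counter(tokens[:-1])             # words that have a successor
    pairs = Counter(zip(tokens, tokens[1:]))    # adjacent word pairs
    return contexts, pairs

def next_word_probs(prev, contexts, pairs):
    """MLE estimate of P(w | prev) = C(prev w) / C(prev) for all seen w."""
    return {w2: count / contexts[prev]
            for (w1, w2), count in pairs.items() if w1 == prev}

def predict_next(prev, contexts, pairs):
    """Return the most probable next word, or None for unseen contexts."""
    probs = next_word_probs(prev, contexts, pairs)
    return max(probs, key=probs.get) if probs else None

corpus = ("a cut in interest rates . interest rates fell . "
          "interest in stocks rose").split()
contexts, pairs = train_bigrams(corpus)
print(predict_next("interest", contexts, pairs))
```

Unseen contexts get no prediction at all here; that gap is what smoothing is meant to address.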


Summary: Statistical Approaches

Statistical methods typically need an annotated corpus (Maha)
Methods can simply rely on counting (and a bit of smoothing)
Same principle for any of your applications
Very powerful!
Use existing tools! (Massimo, Maha)


Text Mining in Digital Collections - Introduction (Part 3)


Information Retrieval

Motivation:
  Large natural language text collections
  Find most relevant documents for a user query

Why has IR become so popular?
  World Wide Web
  Computer hardware


Information Retrieval: Example 1

Web search queries to http://www.dogpile.com/ (29/11/09):

(1) 0117904900

(2) flowers

(3) www.amazon.com

(4) policies and procedures

(5) site:http://www.procracksworld.ws norman virus comtrol

(6) panthers

(7) why do we yawn

(8) high capacity 6 cell 7.2 volt nicad battery

(9) google

(10) lumberjack song


Information Retrieval: Example 2


Document = Bag of Words

Need to locate "important" information in a document
Deep natural language understanding too expensive
Common assumption: document is a list of words
Documents represented by a set of index terms


IR - Retrieval Process

1. Define text database, i.e. data sources, operations on text and text structure
2. Process document collection into index database for fast access
3. Parse user query using same text processing operations
4. Apply query operations to processed query
5. Generate system query to retrieve matching documents
6. Rank and display results
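The retrieval process can be sketched end to end with an inverted index. The code below is a toy illustration in Python: the three documents, the function names, and the ranking by summed term frequency are all assumptions chosen for brevity, not the ranking used by any particular search engine.

```python
import re
from collections import Counter, defaultdict

def process(text):
    """Text operations shared by documents and queries (steps 2 and 3)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Step 2: map each term to the documents containing it, with counts."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in process(text):
            index[term][doc_id] += 1
    return index

def search(query, index):
    """Steps 3-6: process the query, look up terms, rank by summed counts."""
    scores = Counter()
    for term in process(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked]

# Step 1: a hypothetical three-document collection
DOCS = {
    "d1": "Estate agents in Stratford, London",
    "d2": "London estate agents and London letting agents",
    "d3": "Flowers delivered in Essex",
}
INDEX = build_index(DOCS)
print(search("estate agents London", INDEX))
```

Because documents and queries pass through the same `process` function, "London" in a query finds "london" in the index; using different operations on the two sides is a common source of missed matches.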


IR - Indexing Steps

Motivation
  Same words in different realisations
  Content words vs. stopwords
  Morphology
  Size of index tables

Typical Indexing Steps
  1. Lexical analysis
  2. Deletion of stopwords
  3. Stemming
  4. Selection of index terms


IR Indexing - Lexical Analysis

Removal of punctuation, e.g.:
  F. Crestani et al. "Is this document relevant? ...probably". ACM Computing Surveys, 30(4):528-552, 1998.
  But: "i++"

Case folding, e.g.:
  London → london
  LONDON → london

Removal of digits, but:
  ECIR 2013
  3G (or first query on page 330!)


IR Indexing - Stopwords

Very frequent words are not good discriminators

Zipf’s Law: few frequent words make up large part of a document

Closed class vs. open class words

INQUERY’s stopword list (entries starting with letter a):

a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, along, already, also, although, always, among, amongst, am, an, and, another, any, anybody, anyhow, anyone, anything, anyway, ...

Domain-specific stopwords, e.g. in a Web search application: search, webmaster, www, copyright, ...

But: To be or not to be
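The effect of stopword removal, including the "To be or not to be" problem, can be illustrated as follows. The stopword set here is a hypothetical mini list chosen for the example; a real list such as INQUERY's is far longer.

```python
from collections import Counter

# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "be", "in", "is", "not", "of", "or", "the", "to"}

def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = "the cat and the dog in the house of the king".split()

# A single function word accounts for a large share of the tokens.
print(Counter(tokens).most_common(1))  # [('the', 4)]

print(remove_stopwords(tokens))        # ['cat', 'dog', 'house', 'king']

# A query made up entirely of stopwords vanishes completely.
print(remove_stopwords("to be or not to be".split()))  # []
```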


IR Indexing - Stemming

Reduce words to word stems

Usually a gain in recall, but at the cost of precision

Simplest: suffix stripping

Porter stemmer (see book for details):

develop -> develop

developing -> develop

development -> develop

developments -> develop

But (Porter stemmer):

photographer -> photograph

photography -> photographi

Possible problems with language or stemming algorithm: peter, hospitality, hardly
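The simplest approach, suffix stripping, can be sketched in a few lines. Note that this is deliberately naive and is not the Porter stemmer: the suffix list, the `min_stem` threshold and the function name are assumptions made for illustration.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ments", "ment", "ing", "er", "ly", "s")

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["develop", "developing", "development", "developments"]:
    print(w, "->", strip_suffix(w))  # every form maps to 'develop'

# Over-stemming: a proper name is conflated with an unrelated word.
print(strip_suffix("peter"))  # pet
```

Mapping all four forms of "develop" to one stem is what improves recall; mapping "peter" to "pet" is the kind of error that costs precision.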


IR Indexing - Selection of Index Terms

Usually nouns

Noun phrase detection, e.g.:

United States

Intelligent Document Retrieval

list of papers

School of Computer Science and Electronic Engineering


IR - Classic Models

Boolean model

Probabilistic model

Vector space model


IR - Vector Space Model

Most popular of all models (example: SMART)

Documents ranked according to similarity with query

Non-binary weights assigned to index terms (the higher the value, the more important the term)

Query and document are represented as vectors of index terms

Vector algebra allows elegant calculation of similarity: cosine of the angle between the vectors is a measure of similarity

Advantages: simple mechanism, term weighting, partial matching, ranking of documents as similarity function

Disadvantage: independence assumption, i.e. index terms are assumed to be independent


IR - Vector Space Model: Similarity Measure

Query and document represented as vectors of n-dimensional space with features representing terms that occur in them

Value of each feature (the weight) is determined by presence or absence of term in the document or a query

Similarity between two vectors (query and document vector) gives measure of how closely query matches document


IR - Vector Space Model: Similarity Measure (contd.)

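Vector algebra: θ is the angle between vectors X and Y:

$$\cos\theta = \frac{X \cdot Y}{|X|\,|Y|}$$

Similarity between a query $q_k$ and a document $d_j$:

$$\mathrm{sim}(q_k, d_j) = \frac{\sum_{i=1}^{N} w_{i,k}\, w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,k}^{2}}\;\sqrt{\sum_{i=1}^{N} w_{i,j}^{2}}}$$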

Result of this measure is a value between 0 and 1
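As a small worked example, the cosine measure can be computed directly from two weight vectors. The four-term vocabulary, the binary weights and the document names below are hypothetical, and returning 0.0 for an all-zero vector is a convention assumed here.

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumed convention for vectors with no terms
    return dot / (norm_q * norm_d)

# Hypothetical vocabulary order: text, mining, web, search
query = [1, 1, 0, 0]
doc_a = [1, 1, 0, 0]  # same terms as the query
doc_b = [1, 0, 1, 1]  # shares one term with the query
doc_c = [0, 0, 1, 1]  # shares no terms with the query

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b), ("doc_c", doc_c)]:
    print(name, round(cosine_similarity(query, doc), 3))
# doc_a 1.0
# doc_b 0.408
# doc_c 0.0
```

Sorting documents by this value gives the ranking, and `doc_b` shows partial matching: it is retrieved even though it contains only one of the two query terms.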


IR - Vector Space Model: Term Weighting

Infrequent terms should get higher weight than frequent ones

Weight should reflect number of occurrences within a document

The weight of each term in a document is calculated with the tf.idf formula (term frequency - inverse document frequency):

$$\mathrm{tfidf}_{i,k} = f_{i,k} \cdot \log\left(\frac{N}{df_i}\right)$$

Term $i$ in document $k$ gets value $\mathrm{tfidf}_{i,k}$ in a collection of $N$ documents, with $f_{i,k}$ the frequency of term $i$ in document $k$ and $df_i$ the number of documents with term $i$
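The formula can be tried out on a toy collection. The sketch below assumes raw counts for the term frequency and the natural logarithm (the slide does not state the base); the documents and the function name `tfidf` are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return {doc_id: {term: weight}} with weight = f_ik * log(N / df_i)."""
    n = len(docs)
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())  # each document counts once per term
    return {
        doc_id: {t: f * math.log(n / df[t]) for t, f in term_counts.items()}
        for doc_id, term_counts in counts.items()
    }

# Hypothetical toy collection of already tokenised documents.
docs = {
    "d1": "text mining text".split(),
    "d2": "text search".split(),
    "d3": "web search".split(),
}
weights = tfidf(docs)
for term, w in sorted(weights["d1"].items()):
    print(term, round(w, 3))
# mining 1.099
# text 0.811
```

In `d1`, "mining" occurs once and "text" twice, yet "mining" gets the higher weight because it appears in only one of the three documents. A term that occurs in every document would get weight 0.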


IR - Evaluation

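[Figure: diagram of the Target and Retrieved document sets, with regions labelled tn, fn, tp and fp]

tn = true negatives

fn = false negatives

tp = true positives

fp = false positives

$$\text{Precision} = \frac{tp}{tp + fp}$$

$$\text{Recall} = \frac{tp}{tp + fn}$$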
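Both measures follow directly from set operations on the retrieved and target (relevant) documents. In this sketch the document ids are hypothetical, and returning 0.0 when a set is empty is an assumed convention.

```python
def precision_recall(retrieved, target):
    """Compute (precision, recall) from retrieved and target document ids."""
    retrieved, target = set(retrieved), set(target)
    tp = len(retrieved & target)  # retrieved and relevant
    fp = len(retrieved - target)  # retrieved but not relevant
    fn = len(target - retrieved)  # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if target else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]  # hypothetical system output
target = ["d1", "d3", "d5"]           # hypothetical relevant documents
precision, recall = precision_recall(retrieved, target)
print(precision, recall)
```

Here tp = 2, fp = 2 and fn = 1, so precision is 2/4 and recall is 2/3. True negatives do not appear in either formula.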


Example IR Systems

Apache Solr (based on Lucene): http://lucene.apache.org/solr/

Nutch open source (Lucene-based) search engine, very simple! http://nutch.apache.org/


Information Retrieval vs. Information Extraction

IE systems are normally:
very knowledge-intensive
difficult to build
domain dependent
computationally more expensive than IR

IR systems are normally:
easy to build
very robust
domain independent
very quick
not trying deep language understanding


Question Answering

Idea is to find answers, not documents

Typical sequence of steps (see Radev et al. 2005):

Question type recognition (using the Brill POS tagger ...)

Document retrieval (using a standard search engine API)

Passage/sentence retrieval (using n-grams as well as vector space model)

Answer extraction (using chunking)

Phrase/answer ranking (using probabilities derived from POS signatures that identify with a particular question/answer type)

Example: IBM’s Watson project,


References

D. Jurafsky and J. Martin: “Speech and Language Processing”, Pearson, 2008. http://www.cs.colorado.edu/~martin/slp.html

Good repository of information about NLE/HLT (“Human Language Technologies”): http://www.lt-world.org/

Starting point to explore tools: http://nlp.stanford.edu/software/index.shtml

http://en.wikipedia.org/wiki/Outline_of_natural_language_processing

Some of the slides were prepared by colleagues elsewhere
