Going Digital Workshop: Text Mining in Digital Collections Introduction (Part … · 2013-02-28 · Going Digital Workshop: Text Mining in Digital Collections Introduction (Part 1)

Going Digital Workshop:

Text Mining in Digital Collections

Introduction (Part 1)

Maha Althobaiti,Udo Kruschwitz, Deirdre Lungley,Massimo Poesio

School of Computer Science and Electronic EngineeringUniversity of [email protected]

6 February 2013

Text Mining (Introduction) - February 2013 1

Brief Outline

Part 1: Introduction to natural language engineering(text mining)

Motivation What is it? State of the art

Part 2: Some fundamentals

Part 3: Glimpse into sample applications:

Web search Question-answering


http://www.youtube.com/watch?v=FC3IryWr4c8&feature=relmfu

Example - Machine Translation (Original Page)


Example - Machine Translation

http://babelfish.yahoo.com/


http://babelfish.yahoo.com/

Example II - Information Extraction


http://entitycube.research.microsoft.com/

Example III - Searching a Web Site


http://www2.essex.ac.uk/search/

Early NLP Systems

ELIZA (e.g. http://psych.fullerton.edu/mbirnbaum/psych101/Eliza.htm)

Weizenbaum 1966 Pattern matching

PARRY

Colby et al. 1971 Pattern matching and planning

SHRDLU

Winograd 1972 Natural language understanding Comprehensive grammar of English


http://psych.fullerton.edu/mbirnbaum/psych101/Eliza.htm

What is NLE?

Natural Language ... Natural Language Processing Computational Linguistics Information Retrieval Information Extraction

... Engineering Combination of techniques Real world applications Using existing tools and resources


Why is it Interesting?

Web search engines Information systems Wireless communication Text classification Text summarization Filtering Human-computer dialogues Computer assisted language learning

... and more ...

Generation (e.g. AI in games) Machine translation Speech processing


Why is it Difficult?

Ambiguity

bank, book, ruler ...I saw the man with the telescope on the hill.Five monkeys ate three bananas.Time flies like an arrow.Lassen Sie uns noch einen Termin ausmachen.

Language is changing

I want to buy a mobile.Google it!

Ill-formed input

accomodation office

Multilinguality

Claudia Schiffer


Example: Ambiguity + Multilinguality


http://www.spiegel.de/international/zeitgeist/0,1518,686305,00.html

Example: Ambiguity + Multilinguality II


Various Perspectives

Linguist: How can natural language phenomena be described?

Cognitive scientist: How does the human mind work?

Computer scientist: How can a theory be efficiently implemented?

... we look at it from a computer scientist’s perspective


Two Paradigms

Symbolic (deep) processing

hand coded rules knowledge-rich

Statistical (shallow) processing build up statistical model of language data-rich

... most powerful approaches nowadays are statistical!


NLE: Where we are

Mature, everyday technology E.g.: tokenization, normalization, regular expression search

Solid technology that can still be improved E.g.: lemmatization; spelling correctors; IR / Web search;

speech synthesis Used in real applications, but needs improvement

E.g.: POS tagging; term extraction; summarization; speechrecognition; text classification (e.g., for spam detection)

Spoken dialogue systems for simple information seeking(railways, phone)

“Almost there” technology E.g.: information extraction, generation systems, simple

speech translation systems Pie in the sky

E.g.: full machine translation, more advanced dialogue


Part I - Mature Technologies

Research in NLE has been going on for many years and inmany forms - e.g., as part of compiler technology, informationretrieval, etc.

The results of this work are a number of well-establishedtechnologies that are hardly considered “research” anymore

But: these techniques are by no means trivial as you will findout when you write your own little NLE tools ...


Regular Expressions for Search, Validation and Parsing

Basics (e.g. search in Google) cat OR dog

"Regular * in Java"

More advanced (e.g., regular expressions in PERL, Java) /[Ww]ordnet/

/colou?r/

[a-z|A-Z]*

[^A-Z]

Note also: substitution s/colour/color

ELIZAs/.* I am (sad | depressed).*/I AM SORRY TO HEAR YOU ARE \1/


Part II: Solid Technology

Technology that could still use improvements

Over the last ten to twenty years new applications haveappeared which are by now fairly well established, but whoseresults are still not 100% accurate (nor is it clear they everwill!)


Stemming, Lemmatization and Morphological analysis

Stemming: FOXES → FOX

Lemmatization: ’passing’ → ’pass’ +ING ’were’ → ’be’ +PAST Used for: Information Retrieval (e.g. Google)

More general morphological analysis: Gebaudereinigungsfirmenangestellter →

Gebaude + Reinigung + Firma + Angestellter(building + cleaning + company + employee)

Helsingin sanomat →Helsinki +GEN + sanoma +PLURAL


Part of Speech Tagging

Assign a part of speech to each word: ’dog’ → NOUN ’eat’ → VERB Book that flight

VB DT NN

Applications: all over the place! Information retrieval Information extraction Question answering Translation


Text Classification

News agencies: classify incoming news stories

Search engines: classify queries, e.g. search Google forObama or LOG313

Identifying spam emails

Routing emails or documents to appropriate people


http://www.google.com/search?hl=en&source=hp&q=Obama&aq=f&oq=&aqi=

http://www.google.com/search?hl=en&source=hp&q=LOG313&aq=f&oq=&aqi=

Terminology Extraction

It is becoming increasingly difficult for dictionary writers tokeep track of all the new terminology introduced in scientificareas, e.g.:

the C4 carbons of the nicotinamide rings X-ray therapy vs. X-ray therapies PAHO vs. Pan-American Health Organization

TERM EXTRACTION systems are systems that can processlarge amounts of scientific papers and publications andidentify terminology, possibily comparing it with a previous list


Information Retrieval / Web Search / Question-Answering

Google, Yahoo!, Bing, Yandex, Baidu, ...

Wolfram Alpha question-answering (QA) system:

http://www.wolframalpha.com

evi (formerly TrueKnowledge):

http://www.evi.com

START QA system:

http://www.ai.mit.edu/projects/infolab/

This region is a hub of activity in search! E.g. have a look atEnterprise Search Meetup London and Enterprise Search MeetupCambridge


http://www.wolframalpha.com

http://www.evi.com

http://www.ai.mit.edu/projects/infolab/

http://www.meetup.com/es-london/

http://www.meetup.com/Enterprise-Search-Cambridge-UK/

http://www.meetup.com/Enterprise-Search-Cambridge-UK/

Part III: More Advanced NL Technology

This goes from technologies which have been used for yearsbut still have problems, to technologies for which onlyprototypes have been developed, e.g.:

Generation Machine translation Speech recognition Spoken dialogue interfaces


Text Mining (Introduction II) 1

Text Mining in Digital Collections - Introduction (Part 2)

Linguistic Essentials;; Common Processing Pipelines


Levels of Linguistic Analysis

Word level: Parts of speech: DOG, EAT, RED Sub-word level:

Phonetics, Phonology Morphology

Phrase level (syntax): THE RED DOG, CUTTING CORNERS, BY

Semantics: E.g., lexical semantics COURSE / MODULE, DOG / ANIMAL

Discourse: E.g., anaphora MOST STUDENTS SETTING OFF ON GAP YEAR TRIPS WILL NEED THEIR MONEY AND POSSESSIONS TO FIT SECURELY AND COMPACTLY INTO A RUCKSACK AND MONEYBELT

http://www.phrasedetectives.org/

http://www.phrasedetectives.org/


Words: Parts of speech

Words belong to different classes The sad / sheep / *run / *always / *most dog barked

Basic PARTS OF SPEECH: NOUN: dog, man, car, law ADJECTIVE: red, fat, brave VERBS: run, barked Best known set of part of speech TAGS: Brown TAGS

NN for nouns VB for verb base forms JJ for adjectives in positive form

Notice: many words belong to more than one class Open and closed classes


Levels of linguistic processing: the basic pipeline of an NLP system (e.g., GATE)

PREPROCESSING LEXICAL

PROCESSING SYNTACTIC

PROCESSING

SEMANTIC PROCESSING

DISCOURSE PROCESSING


Example: processing a query to a web search engine



PROCESSING

SEMANTIC PROCESSING WEB ACCESS

List the estate agents in Stratford, London.

TOKENIZATION

POS TAGGING TERM IDENTIFICATION

STOP WORDS

SYNONYMS


More advanced examples: Information Extraction Systems (e.g., LASIE)



PROCESSING

SEMANTIC PROCESSING

DISCOURSE PROCESSING


In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share

Late last night the company announced a growth of 20%.

Preprocessing, I: tokenizing

 <W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W> <W C='W'>CEG</W> <W C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W> <W C='W'>$</W><W C='CD'>102</W> <W C='W'>million</W> <W C='CM'>,</W> <W C='W'>or</W> <W C='CD'>34</W> <W C='W'>cents</W> <W C='W'>a</W> <W C='W'>share</W> 

PARAGRAPH MARKUP;; TOKENIZER




Preprocessing, I: tokenizing

<W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W> <W C='W'>CEG</W> <W C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W> <W C='W'>$</W><W C='CD'>102</W> <W C='W'>million</W><W C='CM'>,</W> <W C='W'>or</W> <W C='CD'>34</W> <W C='W'>cents</W> <W C='W'>a</W> <W C='W'>share.</W>

<W C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W> <W C='W'>the</W> <W C='W'>company</W> <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W> <W C='W'>of</W> <W C='CD'>20</W><W C='W'>%</W><W C='.'>.</W>

PARAGRAPH MARKUP;; TOKENIZER


Preprocessing,II: sentence splitting



<S> <W C='W'>In</W> <W C='W'>July</W> <W C='CD'>1995</W> <W C='W'>CEG</W> <W C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W> <W C='W'>$</W><W C='CD'>102</W> <W C='W'>million</W><W C='CM'>,</W> <W C='W'>or</W> <W C='CD'>34</W> <W C='W'>cents</W> <W C='W'>a</W> <W C='W'>share></W> <W C='.'>.</W></S>





<S> <W C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W> <W C='W'>the</W> <W C='W'>company</W> <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W> <W C='W'>of</W> <W C='CD'>20</W><W C='W'>%</W> <W C='.'>.</W> </S> 


Lexical Processing, I: POS tagging

<W C='CD'>102</W>

<W C='CM'>,</W>


Lexical Processing, II: lemmatizing / stemming

<W C='CD'>102</W>

<W C='CM'>,</W>


An example of practical (partial) Parsing: Identifying numerical expressions

<NUMEX>

<W C='CD'>102</W>

</NUMEX> <W C='CM'>,</W>


An example of practical semantic processing: identifying semantic type

<W C='CD'>102</W>

</NUMEX> <W C='CM'>,</W>


An example of discourse processing: resolving anaphoric references




Basic Pre-processing;; Regular Expressions


The basic tasks in text processing

TOKENIZATION: identify tokens in text WORD COUNTING: count words and their frequencies SEARCHING FOR WORDS NORMALIZATION:

UDO KRUSCHWITZ, udo kruschwitz, udo krushwitz Udo Kruschwitz Feb 6, 6th of February, 06/02/2013

STEMMING


Regular Expressions : a formalism for expressing search patterns

Because matching is a very common problem, over the years computer scientists have identified a set of patterns that 1. Are very common 2. Can be searched for efficiently

The language of REGULAR EXPRESSIONS has been developed to characterize these patterns

web search tools / software systems (awk, sed, emacs) allow users to use regular expressions to specify what they are searching these REs are then compiled into efficient code You do not need to write the code yourselves!


Types of regular expressions

Disjunction: /centre|center/ /accomodation|accommodation/ Also:

/[Cc]entre/ /acco[m|mm]odation/

Repetitions: +: Any number greater than 0

/YES+!/ Matches YES!, YESS!, YESSS! E.g., any binary number: [01]+

*: 0 or more /ab*/ Matches a, ab, abb, abbb


Applications of more complex REs

Web pages about Centres and Centers: /[Cc]entre|[Cc]enter/

Regular expression to validate phone numbers: (+44)(0)20-12341234, 02012341234, +44 (0) 1234-1234 But not: (44+)020-12341234, 12341234(+020) ^($?\+?[0-9]*$?)?[0-9_\- ]*$

Validating email addresses: [email protected], [email protected], [email protected] But not: asmith, @mactech.com, a@a ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]1,3\.[0-9]1,3\.[0-9]1,3\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]2,4|[0-9]1,3)(\]?)$

http://www.regxlib.com/REDetails.aspx?regexp_id=73


















mailto:[email protected]

mailto:a@a














































REs in action (linked to our own research)

Use of regular expressions as part of a processing pipeline to process query logs of a digital library (in this case: The European Library) Why? We want to automatically learn what query modification suggestions to propose to (future) users of the library search engine, e.g.:


1889115 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("mozart") xxxx search_url xxxx xxxx 0 xxxx - xxxx xxxx xxxx 2008-06-24 22:02:52 1889118 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("mozart") xxxx view_full xxxx xxxx 1 xxxx xxxx xxxx xxxx 2008-06-24 22:03:03 1889120 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx klavierkonzerte xxxx search_res_rec_all xxxx xxxx 0 xxxx - xxxx xxxx xxxx 2008-06-24 22:03:55 1889121 xxxx guest xxxx 71.249.xxx.xxx xxxx 8eb3bdv3odg9jncd71u0s2aff6 xxxx en xxxx ("klavierkonzerte") xxxx view_full xxxx xxxx 1 xxxx xxxx xxxx xxxx 2008-06-24 22:04:10


Here comes (part of the) the processing pipeline:

more $CleanedFile | gawk 'BEGIN FS = " xxxx " if (($5=="en") && ($4

!= "null") && ($6 !~ /^[ "(]*(test|toto|a)[ ")]*$/) && ($7~/^search_/) && ($7 !~ /^search_adv/)) print $4 " xxxx " $1 " xxxx " $6 " xxxx " $13' | sort -k 1,1 -k 3,3n | gawk 'BEGIN FS = " xxxx " if ((OLD_ID == $1 && (OLD_QUERY !=$3))) print OLD_LINE;; PRINT = "on";; if ((OLD_ID != $1) && (PRINT == "on")) print OLD_LINE;; PRINT = "off";; OLD_ID=$1;; split($_, a, " xxxx ");; OLD_QUERY = a[3];; OLD_LINE = $_' > $SessionQueries


This is what I get:

8eb3bdv3odg9jncd71u0s2aff6 xxxx 1889115 xxxx ("mozart") xxxx 2008-06-24

22:02:52 8eb3bdv3odg9jncd71u0s2aff6 xxxx 1889120 xxxx klavierkonzerte xxxx 2008-

06-24 22:03:55


Basic Ideas of Statistical Approaches


Statistical NLE - Motivation

So far: purely symbolic approaches But instead of hand-coding rules we could derive knowledge from text corpora automatically. Sample applications:

Spell checking Handwriting recognition

Example: How can Google predict the correct spelling of words?


Statistical Methods in NLE

Two characteristics of NL make it desirable to endow programs with the ability to LEARN from examples of past use:

VARIETY (no programmer can really take into account all possibilities) AMBIGUITY (need to have ways of choosing between alternatives)

In a number of NLE applications, statistical methods are very common The simplest application: WORD PREDICTION


We are good at word prediction

Stocks plunged this morning, despite a cut in interest Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall

Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall


Statistics and word prediction

The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of spelling error I.e., to compute For all words w, and predict as next word the one for which this (conditional) probability is highest.

P(w | W1 N-1)


Using corpora to estimate probabilities

But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY. The simplest method: Maximum Likelihood Estimate (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered.

in the corpus

NWWCWWP n

n)..()..( 1

1


Summary: Statistical Approaches

Statistical methods typically need an annotated corpus (Maha Methods can simply rely on counting (and a bit of smoothing) Same principle for any of your applications Very powerful! Use existing tools! (Massimo, Maha

Text Mining in Digital CollectionsIntroduction (Part 3)

Text Mining (Introduction III) - February 2013 1

Information Retrieval

Motivation: Large natural language text collections Find most relevant documents for a user query

Why has IR become so popular? World Wide Web Computer hardware


Information Retrieval: Example 1

Web search queries to http://www.dogpile.com/ (29/11/09):

(1) 0117904900

(2) flowers

(3) www.amazon.com

(4) policies and procedures

(5) site:http://www.procracksworld.ws norman virus

comtrol

(6) panthers

(7) why do we yawn

(8) high capacity 6 cell 7.2 volt nicad battery

(9) google

(10) lumberjack song


Information Retrieval: Example 2


Document = Bag of Words

Need to locate “important” information in a document Deep natural language understanding too expensive Common assumption: document is a list of words Documents represented by a set of index terms


IR - Retrieval Process

1. Define text database, i.e. data sources, operations on textand text structure

2. Process document collection into index database for fastaccess

3. Parse user query using same text processing operations4. Apply query operations to processed query5. Generate system query to retrieve matching documents6. Rank and display results


IR - Indexing StepsMotivation

Same words in different realisations Content words vs. stopwords Morphology Size of index tables

Typical Indexing Steps

1. Lexical analysis2. Deletion of stopwords3. Stemming4. Selection of index terms


IR Indexing - Lexical Analysis Removal of punctuation, e.g.:

F. Crestani et al. “Is this document relevant?

..probably”. ACM Computing Surveys,

30(4):528-552, 1998.

But: “i++” Case folding, e.g.:

London –> london

LONDON –> london

Removal of digits, but:ECIR 2013

3G (or first query on page 330!)


IR Indexing - Stopwords Very frequent words are not good discriminators Zipf’s Law: few frequent words make up large part of a

document Closed class vs. open class words INQUERY’s stopword list (entries starting with letter a):

a, about, above, according, across, after,

afterwards, again, against, albeit, all, almost,

alone, along, already, also, although, always,

among, amongst, am, an, and, another, any,

anybody, anyhow, anyone, anything, anyway, ...

Domain-specific stopwords, e.g. in a Web searchapplication: search, webmaster, www, copyright, ...

But: To be or not to be


IR Indexing - Stemming Reduce words to word stems Usually gain in recall, but degrading precision Simplest: suffix stripping Porter stemmer (see book for details):

develop –> develop

developing –> develop

development –> develop

developments –> develop

But (Porter stemmer):photographer –> photograph

photography –> photographi

Possible problems with language or stemming algorithm:peter, hospitality, hardly


IR Indexing - Selection of Index Terms

Usually nouns Noun phrase detection, e.g.:

United States

Intelligent Document Retrieval

list of papers

School of Computer Science and Electronic

Engineering


IR - Classic Models

Boolean model Probabilistic model Vector space model


IR - Vector Space Model Most popular of all models (example: SMART) Documents ranked according to similarity with query Non-binary weights assigned to index terms (the higher the

value, the more important the term) Query and document are represented as vectors of index

terms Vector algebra allows elegant calculation of similarity:

cosine of the angle between the vectors is a measure ofsimilarity

Advantages: simple mechanism, term weighting, partialmatching, ranking of documents as similarity function

Disadvantage: independence assumption, i.e. index termsare assumed to be independent


IR - Vector Space Model: Similarity Measure

Query and document represented as vectors ofn-dimensional space with features representing terms thatoccur in them

Value of each feature (the weight) is determined bypresence or absence of term in the document or a query

Similarity between two vectors (query and documentvector) gives measure of how closely query matchesdocument


IR - Vector Space Model: Similarity Measure (contd.)

Vector algebra: θ is angle between vectors X and Y:

cosθ =X ∗ Y

|X ||Y | Similarity between a query qk and a document dj :

sim( qk , dj) =

N

i=1 wi,kwi,jN

i=1 w2i,k

N

i=1 w2i,j

Result of this measure is a value between 0 and 1


IR - Vector Space Model: Term Weighting Infrequent terms should get higher weight than frequent

ones Weight should reflect number of occurrences within a

document The weight of each term in a document is calculated with

the tf.idf formula (term frequency - inverse documentfrequency):

tfidfi,k = fi,k ∗ log(N

dfi)

Term i in document k gets value tfidfi,k in a collection of N

documents, with fi,k the frequency of term i in document k

and dfi the number of documents with term i


IR - Evaluation

Target

Retrieved

tn fntp

fp

tn = true negativesfn = false negativestp = true positivesfp = false positives

Precision =tp

tp + fp

Recall =tp

tp + fn


Example IR Systems

Apache Solr (based on Lucene)http://lucene.apache.org/solr/

Nutch open source (Lucene-based) search engine,very simple!http://nutch.apache.org/


http://lucene.apache.org/java/docs/index.html

Information Retrieval vs. Information Extraction

IE systems are normally very knowledge-intensive difficult to build domain dependent computationally more expensive than IR

IR systems are normally easy to build very robust domain independent very quick not trying deep language understanding


Question Answering

Idea is to find answers, not documents

Typical sequence of steps (see Radev et al. 2005): Question type recognition (using the Brill POS tagger ...) Document retrieval (using a standard search engine API) Passage/sentence retrieval (using n-grams as well as

vector space model) Answer extraction (using chunking) Phrase/answer ranking (using probabilities derived from

POS signatures that identify with a particularquestion/answer type)

Example: IBM’s Watson project,


http://onlinelibrary.wiley.com/doi/10.1002/asi.20146/abstract

http://www.youtube.com/watch?v=DywO4zksfXw

http://www.youtube.com/watch?v=A-JkZnA5f8M

References

D. Jurafsky and J. Martin: “Speech and LanguageProcessing”, Pearson, 2008.http://www.cs.colorado.edu/˜martin/slp.html

Good repository of information about NLE/HLT (“HumanLanguage Technologies”):http://www.lt-world.org/

Starting point to explore tools:http://nlp.stanford.edu/software/index.shtml

http://en.wikipedia.org/wiki/

Outline_of_natural_language_processing

Some of the slides were prepared by colleagues elsewhere